Big Link Man - Content Automation & Syndication Platform

AI-powered content generation and multi-tier link building system with cloud deployment.

Quick Start

# Install dependencies
uv pip install -r requirements.txt

# Setup environment
cp env.example .env
# Edit .env with your credentials

# Initialize database
uv run python scripts/init_db.py

# Create first admin user
uv run python scripts/create_first_admin.py

# Run CLI
uv run python main.py --help

Environment Configuration

Required environment variables in .env:

DATABASE_URL=sqlite:///./content_automation.db
OPENROUTER_API_KEY=your_key_here
BUNNY_ACCOUNT_API_KEY=your_bunny_key_here

See env.example for full configuration options.

Database Management

Initialize Database

uv run python scripts/init_db.py

Reset Database (drops all data)

uv run python scripts/init_db.py reset

Create First Admin

uv run python scripts/create_first_admin.py

Database Migrations

# Story 3.1 - Site deployments
uv run python scripts/migrate_story_3.1_sqlite.py

# Story 3.2 - Anchor text
uv run python scripts/migrate_add_anchor_text.py

# Story 3.3 - Template fields
uv run python scripts/migrate_add_template_fields.py

# Story 3.4 - Site pages
uv run python scripts/migrate_add_site_pages.py

# Story 4.1 - Deployment fields
uv run python scripts/migrate_add_deployment_fields.py

# Backfill site pages after migration
uv run python scripts/backfill_site_pages.py

User Management

Add User

uv run python main.py add-user \
  --username newuser \
  --password password123 \
  --role Admin \
  --admin-user admin \
  --admin-password adminpass

List Users

uv run python main.py list-users \
  --admin-user admin \
  --admin-password adminpass

Delete User

uv run python main.py delete-user \
  --username olduser \
  --admin-user admin \
  --admin-password adminpass

Site Management

Provision New Site

uv run python main.py provision-site \
  --name "My Site" \
  --domain www.example.com \
  --storage-name my-storage-zone \
  --region DE \
  --admin-user admin \
  --admin-password adminpass

Regions: DE, NY, LA, SG, SYD

Attach Domain to Existing Storage

uv run python main.py attach-domain \
  --name "Another Site" \
  --domain www.another.com \
  --storage-name my-storage-zone \
  --admin-user admin \
  --admin-password adminpass

Sync Existing Bunny.net Sites

# Dry run
uv run python main.py sync-sites \
  --admin-user admin \
  --dry-run

# Actually import
uv run python main.py sync-sites \
  --admin-user admin

List Sites

uv run python main.py list-sites \
  --admin-user admin \
  --admin-password adminpass

Get Site Details

uv run python main.py get-site \
  --domain www.example.com \
  --admin-user admin \
  --admin-password adminpass

Remove Site

uv run python main.py remove-site \
  --domain www.example.com \
  --admin-user admin \
  --admin-password adminpass

S3 Bucket Management

The platform supports AWS S3 buckets as storage providers alongside bunny.net. S3 buckets can be discovered, registered, and managed through the system.

Prerequisites

Set AWS credentials in .env:

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1  # Optional, defaults to us-east-1

Discover and Register S3 Buckets

Interactive Mode (select buckets manually):

uv run python main.py discover-s3-buckets

Or run the script directly:

uv run python scripts/discover_s3_buckets.py

Auto-Import Mode (import all unregistered buckets automatically):

uv run python scripts/discover_s3_buckets.py --auto-import-all

Auto-import mode will:

  • Discover all S3 buckets in your AWS account
  • Skip buckets already registered in the database
  • Skip buckets in the exclusion list
  • Register remaining buckets as bucket-only sites (no custom domain)
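
The selection rules listed above can be sketched roughly as follows. This is an illustration only, not the code in scripts/discover_s3_buckets.py; the function name `select_buckets_to_import` and its arguments are hypothetical.

```python
def select_buckets_to_import(discovered, registered, excluded):
    """Return bucket names that are neither registered nor excluded.

    All three arguments are iterables of bucket names (hypothetical
    inputs). The order of `discovered` is preserved.
    """
    skip = set(registered) | set(excluded)
    return [name for name in discovered if name not in skip]


if __name__ == "__main__":
    to_import = select_buckets_to_import(
        discovered=["site-a", "site-b", "theteacher.best"],
        registered=["site-a"],
        excluded=["theteacher.best"],
    )
    print(to_import)
```

Each name returned would then be registered as a bucket-only site.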

Bucket Exclusion List

To prevent certain buckets from being auto-imported (e.g., buckets manually added with FQDNs), add them to s3_bucket_exclusions.txt:

# S3 Bucket Exclusion List
# One bucket name per line (comments start with #)

learningeducationtech.com
theteacher.best
airconditionerfixer.com

The discovery script automatically loads and respects this exclusion list. Excluded buckets are marked as [EXCLUDED] in the display and are skipped during both interactive and auto-import operations.
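
As a minimal sketch of how a file in this format could be parsed (assuming one bucket name per line and `#` comment lines, as shown above), the following helper reads the list into a set. The name `load_exclusions` is hypothetical and the real script may parse the file differently.

```python
from pathlib import Path


def load_exclusions(path="s3_bucket_exclusions.txt"):
    """Read bucket names, ignoring blank lines and # comment lines."""
    file = Path(path)
    if not file.exists():
        # Assumption: a missing file simply means nothing is excluded.
        return set()
    names = set()
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names
```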

List S3 Sites with FQDNs

To see which S3 buckets have custom domains (and should be excluded):

uv run python scripts/list_s3_fqdn_sites.py

This script lists all S3 sites with s3_custom_domain set and outputs bucket names that should be added to the exclusion list.

S3 Site Types

S3 sites can be registered in two ways:

  1. Bucket-only sites: No custom domain, accessed via the bucket's S3 endpoint

    • Created via auto-import or interactive discovery
    • Uses bucket name as site identifier
    • URL format: https://bucket-name.s3.region.amazonaws.com/
  2. FQDN sites: Manually added with custom domains

    • Created manually with s3_custom_domain set
    • Should be added to exclusion list to prevent re-import
    • URL format: https://custom-domain.com/
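
The two URL formats above can be expressed as a small helper. This is a sketch under the assumption that a site record exposes a bucket name, a region, and an optional custom domain; `site_url` is a hypothetical name, not a function from the codebase.

```python
def site_url(bucket_name, region, custom_domain=None):
    """Build the public base URL for an S3 site.

    FQDN sites use their custom domain; bucket-only sites fall back to
    the bucket URL format documented above.
    """
    if custom_domain:
        return f"https://{custom_domain}/"
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/"
```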

S3 Storage Features

  • Multi-region support: Automatically detects bucket region
  • Public read access: Buckets configured for public read-only access
  • Bucket policy: Applied automatically for public read access
  • Region mapping: AWS regions mapped to short codes (US, EU, SG, etc.)
  • Duplicate prevention: Checks existing registrations before importing

Helper Scripts

List S3 FQDN sites:

uv run python scripts/list_s3_fqdn_sites.py

Delete sites by ID:

# Edit scripts/delete_sites.py to set site_ids, then:
uv run python scripts/delete_sites.py

Check sites around specific IDs:

# Edit scripts/list_sites_by_id.py to set target_ids, then:
uv run python scripts/list_sites_by_id.py

Project Management

Ingest CORA Report

uv run python main.py ingest-cora \
  --file shaft_machining.xlsx \
  --name "Shaft Machining Project" \
  --custom-anchors "shaft repair,engine parts" \
  --username admin \
  --password adminpass

List Projects

uv run python main.py list-projects \
  --username admin \
  --password adminpass

Content Generation

Create Job Configuration

# Tier 1 only
uv run python create_job_config.py 1 tier1 15

# Multi-tier
uv run python create_job_config.py 1 multi 15 50 100

Generate Content Batch

uv run python main.py generate-batch \
  --job-file jobs/project_1_tier1_15articles.json \
  --username admin \
  --password adminpass

With options:

uv run python main.py generate-batch \
  --job-file jobs/my_job.json \
  --username admin \
  --password adminpass \
  --debug \
  --continue-on-error \
  --model gpt-4o-mini

Available models: gpt-4o-mini, claude-sonnet-4.5, or any other model available on OpenRouter.

Note: If your job file contains a models config, it overrides the --model flag, and the title, outline, and content generation stages each use the model configured there.

Deployment

Deploy Batch

# Automatic deployment (runs after generation)
uv run python main.py generate-batch \
  --job-file jobs/my_job.json \
  --username admin \
  --password adminpass

# Manual deployment
uv run python main.py deploy-batch \
  --batch-id 123 \
  --admin-user admin \
  --admin-password adminpass

Dry Run Deployment

uv run python main.py deploy-batch \
  --batch-id 123 \
  --dry-run

Verify Deployment

# Check all URLs
uv run python main.py verify-deployment --batch-id 123

# Check random sample
uv run python main.py verify-deployment \
  --batch-id 123 \
  --sample 10 \
  --timeout 10
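
For readers curious what a sampled check like this involves, here is a rough sketch. It is not the implementation behind verify-deployment; the function names, the use of HEAD requests, and the "2xx means live" rule are all assumptions made for illustration.

```python
import random
import urllib.request


def url_is_live(url, timeout=10):
    """Return True if the URL answers a HEAD request with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return False


def verify_urls(urls, sample=None, timeout=10, check=url_is_live):
    """Check all URLs, or a random sample of them, and return the failures."""
    urls = list(urls)
    if sample is not None and sample < len(urls):
        urls = random.sample(urls, sample)
    return [url for url in urls if not check(url, timeout)]
```

Passing `check` as a parameter keeps the sampling logic testable without network access.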

Export Article URLs

# Tier 1 only
uv run python main.py get-links \
  --project-id 123 \
  --tier 1

# Tier 2 and above
uv run python main.py get-links \
  --project-id 123 \
  --tier 2+

# With anchor text and destinations
uv run python main.py get-links \
  --project-id 123 \
  --tier 2+ \
  --with-anchor-text \
  --with-destination-url

Output is CSV format to stdout. Redirect to save:

uv run python main.py get-links \
  --project-id 123 \
  --tier 1 > tier1_urls.csv
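
The --tier values used above (1 for a single tier, 2+ for that tier and above) could be interpreted along these lines. This is a hypothetical sketch, not the CLI's actual parser; `parse_tier_filter` is an invented name.

```python
def parse_tier_filter(spec):
    """Turn a spec such as '1' or '2+' into a predicate over tier numbers."""
    spec = spec.strip()
    if spec.endswith("+"):
        minimum = int(spec[:-1])
        return lambda tier: tier >= minimum
    exact = int(spec)
    return lambda tier: tier == exact


if __name__ == "__main__":
    matches = parse_tier_filter("2+")
    print([tier for tier in (1, 2, 3) if matches(tier)])
```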

Utility Scripts

Add robots.txt to All Buckets

Add a standardized robots.txt file to all storage buckets (both S3 and Bunny) that blocks SEO tools and bad bots while allowing legitimate search engines and AI crawlers:

# Preview what would be done (recommended first)
uv run python scripts/add_robots_txt_to_buckets.py --dry-run

# Upload to all buckets
uv run python scripts/add_robots_txt_to_buckets.py

# Only process S3 buckets
uv run python scripts/add_robots_txt_to_buckets.py --provider s3

# Only process Bunny storage zones
uv run python scripts/add_robots_txt_to_buckets.py --provider bunny

robots.txt behavior:

  • Allows: Google, Bing, Yahoo, DuckDuckGo, Baidu, Yandex
  • Allows: GPTBot, Claude, Common Crawl, Perplexity, ByteDance AI
  • Blocks: Ahrefs, Semrush, Moz, and other SEO tools
  • Blocks: HTTrack, Wget, and other scrapers/bad bots

The script is idempotent (safe to run multiple times) and will overwrite existing robots.txt files. It continues processing remaining buckets if one fails and reports all failures at the end.
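
To make the allow/block behavior concrete, the sketch below builds a robots.txt body from two lists of user agents. The agent tokens shown are a small assumed subset, and `build_robots_txt` is a hypothetical helper; the file the script actually uploads may be structured differently.

```python
# Assumed example tokens only; the real script covers more crawlers.
ALLOWED_AGENTS = ["Googlebot", "bingbot", "GPTBot"]
BLOCKED_AGENTS = ["AhrefsBot", "SemrushBot", "HTTrack", "Wget"]


def build_robots_txt(allowed, blocked):
    """Emit one group per agent: 'Allow: /' or 'Disallow: /'."""
    lines = []
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_robots_txt(ALLOWED_AGENTS, BLOCKED_AGENTS))
```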

Update Index Pages and Sitemaps

Automatically generate or update index.html and sitemap.xml files for all storage buckets (both S3 and Bunny). The script:

  • Lists all HTML files in each bucket's root directory
  • Extracts titles from <title> tags (or formats filenames as fallback)
  • Generates article listings sorted by most recent modification date
  • Creates or updates index.html with article links in <div id="article_listing">
  • Generates sitemap.xml with industry-standard settings (priority, changefreq, lastmod)
  • Tracks last run timestamps to avoid unnecessary updates
  • Excludes boilerplate pages: index.html, about.html, privacy.html, contact.html
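
The title extraction with filename fallback might look something like the following. This is a simplified sketch using a regular expression; the function names are hypothetical and the actual script may use a proper HTML parser.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_filename(filename):
    """Format 'shaft-repair_guide.html' as 'Shaft Repair Guide'."""
    stem = filename.rsplit("/", 1)[-1]
    stem = re.sub(r"\.html?$", "", stem, flags=re.IGNORECASE)
    return re.sub(r"[-_]+", " ", stem).strip().title()


def extract_title(html, filename):
    """Prefer the <title> tag; fall back to a formatted filename."""
    match = TITLE_RE.search(html)
    if match:
        # Decode entities and collapse runs of whitespace.
        title = " ".join(unescape(match.group(1)).split())
        if title:
            return title
    return title_from_filename(filename)
```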

Usage:

# Preview what would be updated (recommended first)
uv run python scripts/update_index_pages.py --dry-run

# Update all buckets
uv run python scripts/update_index_pages.py

# Only process S3 buckets
uv run python scripts/update_index_pages.py --provider s3

# Only process Bunny storage zones
uv run python scripts/update_index_pages.py --provider bunny

# Force update even if no changes detected
uv run python scripts/update_index_pages.py --force

# Test on specific site
uv run python scripts/update_index_pages.py --hostname example.com

# Limit number of sites (useful for testing)
uv run python scripts/update_index_pages.py --limit 10

How it works:

  1. Queries database for all site deployments
  2. Lists HTML files in root directory (excludes subdirectories and boilerplate pages)
  3. Checks if content has changed since last run (unless --force is used)
  4. Downloads and parses HTML files to extract titles
  5. Generates article listing HTML (sorted by most recent first)
  6. Creates new index.html or updates existing one (inserts into <div id="article_listing">)
  7. Generates sitemap.xml with all HTML files and proper metadata
  8. Uploads both files to the bucket
  9. Saves state to scripts/.update_index_state.json for tracking

Sitemap standards:

  • Priority: 1.0 for homepage (index.html), 0.8 for other pages
  • Change frequency: weekly for all pages
  • Last modified dates from file metadata
  • Includes all HTML files in root directory
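
A sitemap following these settings could be generated roughly as shown below. This is a sketch, not the script's own code: `build_sitemap` is a hypothetical name, and it assumes each page is given as a (filename, lastmod) pair with lastmod already formatted as YYYY-MM-DD.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, pages):
    """Build sitemap.xml text from (filename, lastmod) pairs."""
    base = base_url.rstrip("/")
    entries = []
    for filename, lastmod in pages:
        # Homepage gets priority 1.0, every other page 0.8.
        priority = "1.0" if filename == "index.html" else "0.8"
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(base + '/' + filename)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            "    <changefreq>weekly</changefreq>\n"
            f"    <priority>{priority}</priority>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )
```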

Customizing article listing HTML:

The article listing format can be easily customized by editing the generate_article_listing_html() function in scripts/update_index_pages.py. The function includes detailed documentation and examples for common variations (cards, dates, descriptions, etc.).

State tracking:

The script maintains state in scripts/.update_index_state.json to track when each site was last updated. This prevents unnecessary regeneration when content hasn't changed. Use --force to bypass this check.
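
The state check could work along these lines. The layout of the state file is not documented here, so this sketch assumes a simple JSON object mapping each hostname to the ISO-8601 timestamp of its last run; all function names are hypothetical.

```python
import json
from pathlib import Path

STATE_FILE = Path("scripts/.update_index_state.json")


def load_state(path=STATE_FILE):
    """Return the saved state, or an empty dict on first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def needs_update(state, hostname, latest_modified, force=False):
    """Compare the newest file timestamp with the last recorded run.

    Assumes both timestamps are ISO-8601 strings in the same timezone,
    so plain string comparison orders them correctly.
    """
    if force:
        return True
    last_run = state.get(hostname)
    return last_run is None or latest_modified > last_run


def save_state(state, hostname, run_time, path=STATE_FILE):
    """Record the run time for one site and write the file back."""
    state[hostname] = run_time
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```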

Check Last Generated Content

uv run python check_last_gen.py

List All Users (Direct DB Access)

uv run python scripts/list_users.py

Add Admin (Direct DB Access)

uv run python scripts/add_admin_direct.py

Check Migration Status

uv run python scripts/check_migration.py

Add Tier to Projects

uv run python scripts/add_tier_to_projects.py

Testing

Run All Tests

uv run pytest

Run Unit Tests

uv run pytest tests/unit/ -v

Run Integration Tests

uv run pytest tests/integration/ -v

Run Specific Test File

uv run pytest tests/unit/test_url_generator.py -v

Run Story 3.1 Tests

uv run pytest tests/unit/test_url_generator.py \
               tests/unit/test_site_provisioning.py \
               tests/unit/test_site_assignment.py \
               tests/unit/test_job_config_extensions.py \
               tests/integration/test_story_3_1_integration.py \
               -v

Run with Coverage

uv run pytest --cov=src --cov-report=html

System Information

Show Configuration

uv run python main.py config

Health Check

uv run python main.py health

List Available Models

uv run python main.py models

Directory Structure

Big-Link-Man/
├── main.py                 # CLI entry point
├── src/                    # Source code
│   ├── api/               # FastAPI endpoints
│   ├── auth/              # Authentication
│   ├── cli/               # CLI commands
│   ├── core/              # Configuration
│   ├── database/          # Models, repositories
│   ├── deployment/        # Cloud deployment
│   ├── generation/        # Content generation
│   ├── ingestion/         # CORA parsing
│   ├── interlinking/      # Link injection
│   └── templating/        # HTML templates
├── scripts/               # Database & utility scripts
├── tests/                 # Test suite
│   ├── unit/
│   └── integration/
├── jobs/                  # Job configuration files
├── docs/                  # Documentation
└── deployment_logs/       # Deployed URL logs

Job Configuration Format

Example job config (jobs/example.json):

{
  "job_name": "Multi-Tier Launch",
  "project_id": 1,
  "description": "Site build with 65 articles",
  "models": {
    "title": "openai/gpt-4o-mini",
    "outline": "anthropic/claude-3.5-sonnet",
    "content": "anthropic/claude-3.5-sonnet"
  },
  "tiers": [
    {
      "tier": 1,
      "article_count": 15,
      "validation_attempts": 3
    },
    {
      "tier": 2,
      "article_count": 50,
      "validation_attempts": 2
    }
  ],
  "failure_config": {
    "max_consecutive_failures": 10,
    "skip_on_failure": true
  },
  "interlinking": {
    "links_per_article_min": 2,
    "links_per_article_max": 4,
    "include_home_link": true
  },
  "deployment_targets": ["www.primary.com"],
  "tier1_preferred_sites": ["www.premium.com"],
  "auto_create_sites": true
}
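
Before running a batch, it can help to sanity-check a job file. The sketch below loads a config like the one above and totals the article counts. Which fields are truly required is an assumption here (docs/job-schema.md is the authoritative reference), and the function names are hypothetical.

```python
import json


def check_job(job):
    """Raise ValueError if fields used in the example above are missing."""
    if "project_id" not in job:
        raise ValueError("job is missing project_id")
    tiers = job.get("tiers") or []
    if not tiers:
        raise ValueError("job must define at least one tier")
    for entry in tiers:
        if entry.get("article_count", 0) <= 0:
            raise ValueError(f"tier {entry.get('tier')} needs a positive article_count")


def total_articles(job):
    """Sum article_count across all tiers."""
    return sum(entry["article_count"] for entry in job["tiers"])


def load_job(path):
    """Load a job file from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        job = json.load(handle)
    check_job(job)
    return job
```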

Per-Stage Model Configuration

You can specify different AI models for each generation stage (title, outline, content):

{
  "models": {
    "title": "openai/gpt-4o-mini",
    "outline": "anthropic/claude-3.5-sonnet",
    "content": "openai/gpt-4o"
  }
}

Available models:

  • openai/gpt-4o-mini - Fast and cost-effective
  • openai/gpt-4o - Higher quality, more expensive
  • anthropic/claude-3.5-sonnet - Excellent for long-form content

If models is not specified in the job file, all stages use the model from the --model CLI flag (default: gpt-4o-mini).
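
The precedence between the job file and the --model flag can be summarized in a few lines. This is a sketch of the documented behavior, not the project's code; `resolve_models` is a hypothetical name, and falling back to the CLI model for a stage missing from a partial models block is an assumption.

```python
STAGES = ("title", "outline", "content")


def resolve_models(job_config, cli_model="gpt-4o-mini"):
    """Pick the model for each stage: job file first, CLI flag otherwise."""
    configured = job_config.get("models") or {}
    # Assumption: a stage missing from `models` falls back to the CLI model.
    return {stage: configured.get(stage, cli_model) for stage in STAGES}


if __name__ == "__main__":
    job = {"models": {"title": "openai/gpt-4o-mini", "content": "openai/gpt-4o"}}
    for stage, model in resolve_models(job).items():
        print(stage, model)
```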

Common Workflows

Initial Setup

uv pip install -r requirements.txt
cp env.example .env
# Edit .env
uv run python scripts/init_db.py
uv run python scripts/create_first_admin.py
uv run python main.py sync-sites --admin-user admin

New Project Workflow

# 1. Ingest CORA report
uv run python main.py ingest-cora \
  --file project.xlsx \
  --name "My Project" \
  --username admin \
  --password adminpass

# 2. Create job config
uv run python create_job_config.py 1 multi 15 50 100

# 3. Generate content (auto-deploys)
uv run python main.py generate-batch \
  --job-file jobs/project_1_multi_3tiers_165articles.json \
  --username admin \
  --password adminpass

# 4. Verify deployment
uv run python main.py verify-deployment --batch-id 1

# 5. Export URLs for link building
uv run python main.py get-links \
  --project-id 1 \
  --tier 1 > tier1_urls.csv

Re-deploy After Changes

uv run python main.py deploy-batch \
  --batch-id 123 \
  --admin-user admin \
  --admin-password adminpass

Troubleshooting

Database locked

# Stop any running processes first. If the lock persists, reset the database (WARNING: drops all data):
uv run python scripts/init_db.py reset

Missing dependencies

uv pip install -r requirements.txt --force-reinstall

AI API errors

Check OPENROUTER_API_KEY in .env

Bunny.net authentication failed

Check BUNNY_ACCOUNT_API_KEY in .env

Storage upload failed

Verify storage_zone_password in database (set during site provisioning)

Documentation

  • CLI Command Reference: docs/CLI_COMMAND_REFERENCE.md - Comprehensive documentation for all CLI commands
  • Job Configuration Schema: docs/job-schema.md - Complete reference for job configuration files
  • Product Requirements: docs/prd.md - Product requirements and epics
  • Architecture: docs/architecture/ - System architecture documentation
  • Story Specifications: docs/stories/ - Current story specifications
  • Technical Debt: docs/technical-debt.md - Known technical debt items
  • Historical Documentation: docs/archive/ - Archived implementation summaries, QA reports, and analysis documents

Regenerating CLI Documentation

To regenerate the CLI command reference after adding or modifying commands:

uv run python scripts/generate_cli_docs.py

This will update docs/CLI_COMMAND_REFERENCE.md with all current commands and their options.

License

All rights reserved.