Big Link Man - Content Automation & Syndication Platform
AI-powered content generation and multi-tier link building system with cloud deployment.
Quick Start
# Install dependencies
uv pip install -r requirements.txt
# Setup environment
cp env.example .env
# Edit .env with your credentials
# Initialize database
uv run python scripts/init_db.py
# Create first admin user
uv run python scripts/create_first_admin.py
# Run CLI
uv run python main.py --help
Environment Configuration
Required environment variables in .env:
DATABASE_URL=sqlite:///./content_automation.db
OPENROUTER_API_KEY=your_key_here
BUNNY_ACCOUNT_API_KEY=your_bunny_key_here
See env.example for full configuration options.
Database Management
Initialize Database
uv run python scripts/init_db.py
Reset Database (drops all data)
uv run python scripts/init_db.py reset
Create First Admin
uv run python scripts/create_first_admin.py
Database Migrations
# Story 3.1 - Site deployments
uv run python scripts/migrate_story_3.1_sqlite.py
# Story 3.2 - Anchor text
uv run python scripts/migrate_add_anchor_text.py
# Story 3.3 - Template fields
uv run python scripts/migrate_add_template_fields.py
# Story 3.4 - Site pages
uv run python scripts/migrate_add_site_pages.py
# Story 4.1 - Deployment fields
uv run python scripts/migrate_add_deployment_fields.py
# Backfill site pages after migration
uv run python scripts/backfill_site_pages.py
User Management
Add User
uv run python main.py add-user \
--username newuser \
--password password123 \
--role Admin \
--admin-user admin \
--admin-password adminpass
List Users
uv run python main.py list-users \
--admin-user admin \
--admin-password adminpass
Delete User
uv run python main.py delete-user \
--username olduser \
--admin-user admin \
--admin-password adminpass
Site Management
Provision New Site
uv run python main.py provision-site \
--name "My Site" \
--domain www.example.com \
--storage-name my-storage-zone \
--region DE \
--admin-user admin \
--admin-password adminpass
Regions: DE, NY, LA, SG, SYD
Attach Domain to Existing Storage
uv run python main.py attach-domain \
--name "Another Site" \
--domain www.another.com \
--storage-name my-storage-zone \
--admin-user admin \
--admin-password adminpass
Sync Existing Bunny.net Sites
# Dry run
uv run python main.py sync-sites \
--admin-user admin \
--dry-run
# Actually import
uv run python main.py sync-sites \
--admin-user admin
List Sites
uv run python main.py list-sites \
--admin-user admin \
--admin-password adminpass
Get Site Details
uv run python main.py get-site \
--domain www.example.com \
--admin-user admin \
--admin-password adminpass
Remove Site
uv run python main.py remove-site \
--domain www.example.com \
--admin-user admin \
--admin-password adminpass
S3 Bucket Management
The platform supports AWS S3 buckets as storage providers alongside Bunny.net. S3 buckets can be discovered, registered, and managed through the system.
Prerequisites
Set AWS credentials in .env:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1 # Optional, defaults to us-east-1
Discover and Register S3 Buckets
Interactive Mode (select buckets manually):
uv run python main.py discover-s3-buckets
Or run the script directly:
uv run python scripts/discover_s3_buckets.py
Auto-Import Mode (import all unregistered buckets automatically):
uv run python scripts/discover_s3_buckets.py --auto-import-all
Auto-import mode will:
- Discover all S3 buckets in your AWS account
- Skip buckets already registered in the database
- Skip buckets in the exclusion list
- Register remaining buckets as bucket-only sites (no custom domain)
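The skip rules above amount to a simple set difference. A minimal sketch of that selection logic, assuming list-of-names inputs (the function name and data shapes here are illustrative, not the script's actual API):

```python
def select_buckets_to_import(discovered, registered, excluded):
    """Return bucket names eligible for auto-import.

    Skips buckets already registered in the database and buckets on the
    exclusion list, mirroring the auto-import rules described above.
    """
    skip = set(registered) | set(excluded)
    return [name for name in discovered if name not in skip]


# Example: only "site-b" survives the two skip rules
eligible = select_buckets_to_import(
    discovered=["site-a", "site-b", "theteacher.best"],
    registered=["site-a"],
    excluded=["theteacher.best"],
)
```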
Bucket Exclusion List
To prevent certain buckets from being auto-imported (e.g., buckets manually added with FQDNs), add them to s3_bucket_exclusions.txt:
# S3 Bucket Exclusion List
# One bucket name per line (comments start with #)
learningeducationtech.com
theteacher.best
airconditionerfixer.com
The discovery script automatically loads and respects this exclusion list. Excluded buckets are marked as [EXCLUDED] in the display and are skipped during both interactive and auto-import operations.
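The file format is simple enough that loading it could be sketched as follows. This is an assumption about how the script reads the file (including tolerating a missing file and inline comments), not its exact implementation:

```python
def parse_exclusions(lines):
    """Parse exclusion entries: one bucket name per line, # starts a comment."""
    names = set()
    for line in lines:
        # Drop anything after a # and surrounding whitespace
        entry = line.split("#", 1)[0].strip()
        if entry:
            names.add(entry)
    return names


def load_exclusions(path="s3_bucket_exclusions.txt"):
    """Load the exclusion file; a missing file means nothing is excluded."""
    try:
        with open(path, encoding="utf-8") as f:
            return parse_exclusions(f)
    except FileNotFoundError:
        return set()
```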
List S3 Sites with FQDNs
To see which S3 buckets have custom domains (and should be excluded):
uv run python scripts/list_s3_fqdn_sites.py
This script lists all S3 sites with s3_custom_domain set and outputs bucket names that should be added to the exclusion list.
S3 Site Types
S3 sites can be registered in two ways:
- Bucket-only sites: No custom domain, accessed via the S3 website endpoint
  - Created via auto-import or interactive discovery
  - Uses bucket name as site identifier
  - URL format: https://bucket-name.s3.region.amazonaws.com/
- FQDN sites: Manually added with custom domains
  - Created manually with s3_custom_domain set
  - Should be added to exclusion list to prevent re-import
  - URL format: https://custom-domain.com/
S3 Storage Features
- Multi-region support: Automatically detects bucket region
- Public read access: Buckets configured for public read-only access
- Bucket policy: Applied automatically for public read access
- Region mapping: AWS regions mapped to short codes (US, EU, SG, etc.)
- Duplicate prevention: Checks existing registrations before importing
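The region mapping mentioned above might look something like this sketch. The mapping table is illustrative only; the actual codes and table live in the project's provisioning code:

```python
# Hypothetical mapping from AWS region names to the platform's short codes.
AWS_REGION_SHORT_CODES = {
    "us-east-1": "US",
    "us-west-2": "US",
    "eu-central-1": "EU",
    "eu-west-1": "EU",
    "ap-southeast-1": "SG",
    "ap-southeast-2": "SYD",
}


def region_short_code(aws_region, default="US"):
    """Map an AWS region name to a short storage-region code."""
    return AWS_REGION_SHORT_CODES.get(aws_region, default)
```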
Helper Scripts
List S3 FQDN sites:
uv run python scripts/list_s3_fqdn_sites.py
Delete sites by ID:
# Edit scripts/delete_sites.py to set site_ids, then:
uv run python scripts/delete_sites.py
Check sites around specific IDs:
# Edit scripts/list_sites_by_id.py to set target_ids, then:
uv run python scripts/list_sites_by_id.py
Project Management
Ingest CORA Report
uv run python main.py ingest-cora \
--file shaft_machining.xlsx \
--name "Shaft Machining Project" \
--custom-anchors "shaft repair,engine parts" \
--username admin \
--password adminpass
List Projects
uv run python main.py list-projects \
--username admin \
--password adminpass
Content Generation
Create Job Configuration
# Tier 1 only
uv run python create_job_config.py 1 tier1 15
# Multi-tier
uv run python create_job_config.py 1 multi 15 50 100
Generate Content Batch
uv run python main.py generate-batch \
--job-file jobs/project_1_tier1_15articles.json \
--username admin \
--password adminpass
With options:
uv run python main.py generate-batch \
--job-file jobs/my_job.json \
--username admin \
--password adminpass \
--debug \
--continue-on-error \
--model gpt-4o-mini
Available models: gpt-4o-mini, claude-sonnet-4.5, or any other model available on OpenRouter.
Note: If your job file contains a models config, it will override the --model flag and use different models for title, outline, and content generation stages.
Deployment
Deploy Batch
# Automatic deployment (runs after generation)
uv run python main.py generate-batch \
--job-file jobs/my_job.json \
--username admin \
--password adminpass
# Manual deployment
uv run python main.py deploy-batch \
--batch-id 123 \
--admin-user admin \
--admin-password adminpass
Dry Run Deployment
uv run python main.py deploy-batch \
--batch-id 123 \
--dry-run
Verify Deployment
# Check all URLs
uv run python main.py verify-deployment --batch-id 123
# Check random sample
uv run python main.py verify-deployment \
--batch-id 123 \
--sample 10 \
--timeout 10
Link Export
Export Article URLs
# Tier 1 only
uv run python main.py get-links \
--project-id 123 \
--tier 1
# Tier 2 and above
uv run python main.py get-links \
--project-id 123 \
--tier 2+
# With anchor text and destinations
uv run python main.py get-links \
--project-id 123 \
--tier 2+ \
--with-anchor-text \
--with-destination-url
Output is CSV format to stdout. Redirect to save:
uv run python main.py get-links \
--project-id 123 \
--tier 1 > tier1_urls.csv
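The exported CSV can then be consumed programmatically. A minimal sketch using the stdlib csv module; the column names shown (url, anchor_text) are assumptions, so inspect the header row of your actual export:

```python
import csv
import io


def parse_links_csv(text):
    """Parse a get-links CSV export into dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical two-column export
rows = parse_links_csv("url,anchor_text\nhttps://a.example/p.html,shaft repair\n")
```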
Utility Scripts
Add robots.txt to All Buckets
Add a standardized robots.txt file to all storage buckets (both S3 and Bunny) that blocks SEO tools and bad bots while allowing legitimate search engines and AI crawlers:
# Preview what would be done (recommended first)
uv run python scripts/add_robots_txt_to_buckets.py --dry-run
# Upload to all buckets
uv run python scripts/add_robots_txt_to_buckets.py
# Only process S3 buckets
uv run python scripts/add_robots_txt_to_buckets.py --provider s3
# Only process Bunny storage zones
uv run python scripts/add_robots_txt_to_buckets.py --provider bunny
robots.txt behavior:
- Allows: Google, Bing, Yahoo, DuckDuckGo, Baidu, Yandex
- Allows: GPTBot, Claude, Common Crawl, Perplexity, ByteDance AI
- Blocks: Ahrefs, Semrush, Moz, and other SEO tools
- Blocks: HTTrack, Wget, and other scrapers/bad bots
The script is idempotent (safe to run multiple times) and will overwrite existing robots.txt files. It continues processing remaining buckets if one fails and reports all failures at the end.
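The generated file follows the standard robots.txt group format: per-bot User-agent lines with Allow or Disallow rules. A simplified sketch of how such a file could be built; the bot names below are representative, and the authoritative lists live in scripts/add_robots_txt_to_buckets.py:

```python
# Representative examples only; the real script's lists are longer.
ALLOWED_BOTS = ["Googlebot", "Bingbot", "DuckDuckBot", "GPTBot", "CCBot"]
BLOCKED_BOTS = ["AhrefsBot", "SemrushBot", "DotBot", "HTTrack", "Wget"]


def build_robots_txt():
    """Build a robots.txt that allows search/AI crawlers and blocks SEO tools."""
    lines = []
    for bot in ALLOWED_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in BLOCKED_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Default rule for everything not listed above
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines)
```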
Update Index Pages and Sitemaps
Automatically generate or update index.html and sitemap.xml files for all storage buckets (both S3 and Bunny). The script:
- Lists all HTML files in each bucket's root directory
- Extracts titles from <title> tags (or formats filenames as fallback)
- Generates article listings sorted by most recent modification date
- Creates or updates index.html with article links in <div id="article_listing">
- Generates sitemap.xml with industry-standard settings (priority, changefreq, lastmod)
- Tracks last run timestamps to avoid unnecessary updates
- Excludes boilerplate pages: index.html, about.html, privacy.html, contact.html
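The title-extraction step with filename fallback could be sketched like this. It is a regex-based simplification under the assumption of well-formed pages; the actual script may parse HTML differently:

```python
import re


def extract_title(html, filename):
    """Return the <title> text, falling back to a prettified filename."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if m and m.group(1).strip():
        return m.group(1).strip()
    # Fallback: "shaft-repair-guide.html" -> "Shaft Repair Guide"
    stem = filename.rsplit(".", 1)[0]
    return stem.replace("-", " ").replace("_", " ").title()
```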
Usage:
# Preview what would be updated (recommended first)
uv run python scripts/update_index_pages.py --dry-run
# Update all buckets
uv run python scripts/update_index_pages.py
# Only process S3 buckets
uv run python scripts/update_index_pages.py --provider s3
# Only process Bunny storage zones
uv run python scripts/update_index_pages.py --provider bunny
# Force update even if no changes detected
uv run python scripts/update_index_pages.py --force
# Test on specific site
uv run python scripts/update_index_pages.py --hostname example.com
# Limit number of sites (useful for testing)
uv run python scripts/update_index_pages.py --limit 10
How it works:
- Queries database for all site deployments
- Lists HTML files in root directory (excludes subdirectories and boilerplate pages)
- Checks if content has changed since last run (unless --force is used)
- Downloads and parses HTML files to extract titles
- Generates article listing HTML (sorted by most recent first)
- Creates new index.html or updates existing one (inserts into <div id="article_listing">)
- Generates sitemap.xml with all HTML files and proper metadata
- Uploads both files to the bucket
- Saves state to .update_index_state.json for tracking
Sitemap standards:
- Priority: 1.0 for homepage (index.html), 0.8 for other pages
- Change frequency: weekly for all pages
- Last modified dates from file metadata
- Includes all HTML files in root directory
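Applying those standards, a sitemap entry builder could be sketched as follows. The priority/changefreq values match the rules above; everything else (function name, input shape) is illustrative:

```python
def build_sitemap(base_url, pages):
    """Build a sitemap.xml string from (filename, lastmod_date) pairs."""
    items = []
    for filename, lastmod in pages:
        # Homepage gets priority 1.0, everything else 0.8
        priority = "1.0" if filename == "index.html" else "0.8"
        items.append(
            "  <url>\n"
            f"    <loc>{base_url}/{filename}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            "    <changefreq>weekly</changefreq>\n"
            f"    <priority>{priority}</priority>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(items)
        + "\n</urlset>"
    )
```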
Customizing article listing HTML:
The article listing format can be easily customized by editing the generate_article_listing_html() function in scripts/update_index_pages.py. The function includes detailed documentation and examples for common variations (cards, dates, descriptions, etc.).
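For orientation, a simplified stand-in for that function might look like the sketch below, assuming (title, url) pairs sorted newest first. Refer to the real generate_article_listing_html() in scripts/update_index_pages.py for the actual markup and options:

```python
def generate_article_listing_html(articles):
    """Render the list inserted into <div id="article_listing">.

    `articles` is a list of (title, url) pairs, newest first. This is a
    simplified illustration, not the project's actual implementation.
    """
    items = "\n".join(
        f'  <li><a href="{url}">{title}</a></li>' for title, url in articles
    )
    return f"<ul>\n{items}\n</ul>"
```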
State tracking:
The script maintains state in scripts/.update_index_state.json to track when each site was last updated. This prevents unnecessary regeneration when content hasn't changed. Use --force to bypass this check.
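The change check reduces to comparing a stored fingerprint per site. A plausible sketch of that state handling; the actual JSON schema inside .update_index_state.json is internal to the script and may differ:

```python
import json
import time

STATE_FILE = "scripts/.update_index_state.json"


def load_state(path=STATE_FILE):
    """Load previous run state; missing or corrupt files mean a fresh start."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def needs_update(state, hostname, content_fingerprint):
    """True if this site's content changed since the recorded run."""
    return state.get(hostname, {}).get("fingerprint") != content_fingerprint


def record_run(state, hostname, content_fingerprint):
    """Record a completed update for a site."""
    state[hostname] = {"fingerprint": content_fingerprint,
                       "last_run": time.time()}
    return state
```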
Check Last Generated Content
uv run python check_last_gen.py
List All Users (Direct DB Access)
uv run python scripts/list_users.py
Add Admin (Direct DB Access)
uv run python scripts/add_admin_direct.py
Check Migration Status
uv run python scripts/check_migration.py
Add Tier to Projects
uv run python scripts/add_tier_to_projects.py
Testing
Run All Tests
uv run pytest
Run Unit Tests
uv run pytest tests/unit/ -v
Run Integration Tests
uv run pytest tests/integration/ -v
Run Specific Test File
uv run pytest tests/unit/test_url_generator.py -v
Run Story 3.1 Tests
uv run pytest tests/unit/test_url_generator.py \
tests/unit/test_site_provisioning.py \
tests/unit/test_site_assignment.py \
tests/unit/test_job_config_extensions.py \
tests/integration/test_story_3_1_integration.py \
-v
Run with Coverage
uv run pytest --cov=src --cov-report=html
System Information
Show Configuration
uv run python main.py config
Health Check
uv run python main.py health
List Available Models
uv run python main.py models
Directory Structure
Big-Link-Man/
├── main.py # CLI entry point
├── src/ # Source code
│ ├── api/ # FastAPI endpoints
│ ├── auth/ # Authentication
│ ├── cli/ # CLI commands
│ ├── core/ # Configuration
│ ├── database/ # Models, repositories
│ ├── deployment/ # Cloud deployment
│ ├── generation/ # Content generation
│ ├── ingestion/ # CORA parsing
│ ├── interlinking/ # Link injection
│ └── templating/ # HTML templates
├── scripts/ # Database & utility scripts
├── tests/ # Test suite
│ ├── unit/
│ └── integration/
├── jobs/ # Job configuration files
├── docs/ # Documentation
└── deployment_logs/ # Deployed URL logs
Job Configuration Format
Example job config (jobs/example.json):
{
"job_name": "Multi-Tier Launch",
"project_id": 1,
"description": "Site build with 165 articles",
"models": {
"title": "openai/gpt-4o-mini",
"outline": "anthropic/claude-3.5-sonnet",
"content": "anthropic/claude-3.5-sonnet"
},
"tiers": [
{
"tier": 1,
"article_count": 15,
"validation_attempts": 3
},
{
"tier": 2,
"article_count": 50,
"validation_attempts": 2
}
],
"failure_config": {
"max_consecutive_failures": 10,
"skip_on_failure": true
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4,
"include_home_link": true
},
"deployment_targets": ["www.primary.com"],
"tier1_preferred_sites": ["www.premium.com"],
"auto_create_sites": true
}
Per-Stage Model Configuration
You can specify different AI models for each generation stage (title, outline, content):
{
"models": {
"title": "openai/gpt-4o-mini",
"outline": "anthropic/claude-3.5-sonnet",
"content": "openai/gpt-4o"
}
}
Available models:
- openai/gpt-4o-mini - Fast and cost-effective
- openai/gpt-4o - Higher quality, more expensive
- anthropic/claude-3.5-sonnet - Excellent for long-form content
If models is not specified in the job file, all stages use the model from the --model CLI flag (default: gpt-4o-mini).
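That precedence (job-file models over the --model flag, with gpt-4o-mini as the final default) can be sketched as a small resolver. The function name and config access pattern are illustrative; the real logic lives in the generation code:

```python
DEFAULT_MODEL = "gpt-4o-mini"
STAGES = ("title", "outline", "content")


def resolve_models(job_config, cli_model=None):
    """Pick a model per stage: job-file models override the --model flag."""
    fallback = cli_model or DEFAULT_MODEL
    per_stage = job_config.get("models") or {}
    return {stage: per_stage.get(stage, fallback) for stage in STAGES}
```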
Common Workflows
Initial Setup
uv pip install -r requirements.txt
cp env.example .env
# Edit .env
uv run python scripts/init_db.py
uv run python scripts/create_first_admin.py
uv run python main.py sync-sites --admin-user admin
New Project Workflow
# 1. Ingest CORA report
uv run python main.py ingest-cora \
--file project.xlsx \
--name "My Project" \
--username admin \
--password adminpass
# 2. Create job config
uv run python create_job_config.py 1 multi 15 50 100
# 3. Generate content (auto-deploys)
uv run python main.py generate-batch \
--job-file jobs/project_1_multi_3tiers_165articles.json \
--username admin \
--password adminpass
# 4. Verify deployment
uv run python main.py verify-deployment --batch-id 1
# 5. Export URLs for link building
uv run python main.py get-links \
--project-id 1 \
--tier 1 > tier1_urls.csv
Re-deploy After Changes
uv run python main.py deploy-batch \
--batch-id 123 \
--admin-user admin \
--admin-password adminpass
Troubleshooting
Database locked
# Stop any running processes, then:
uv run python scripts/init_db.py reset
Missing dependencies
uv pip install -r requirements.txt --force-reinstall
AI API errors
Check OPENROUTER_API_KEY in .env
Bunny.net authentication failed
Check BUNNY_ACCOUNT_API_KEY in .env
Storage upload failed
Verify storage_zone_password in database (set during site provisioning)
Documentation
- CLI Command Reference: docs/CLI_COMMAND_REFERENCE.md - Comprehensive documentation for all CLI commands
- Job Configuration Schema: docs/job-schema.md - Complete reference for job configuration files
- Product Requirements: docs/prd.md - Product requirements and epics
- Architecture: docs/architecture/ - System architecture documentation
- Story Specifications: docs/stories/ - Current story specifications
- Technical Debt: docs/technical-debt.md - Known technical debt items
- Historical Documentation: docs/archive/ - Archived implementation summaries, QA reports, and analysis documents
Regenerating CLI Documentation
To regenerate the CLI command reference after adding or modifying commands:
uv run python scripts/generate_cli_docs.py
This will update docs/CLI_COMMAND_REFERENCE.md with all current commands and their options.
License
All rights reserved.