# Big Link Man - Content Automation & Syndication Platform

AI-powered content generation and multi-tier link building system with cloud deployment.

## Quick Start

```bash
# Install dependencies
uv pip install -r requirements.txt

# Setup environment
cp env.example .env
# Edit .env with your credentials

# Initialize database
uv run python scripts/init_db.py

# Create first admin user
uv run python scripts/create_first_admin.py

# Run CLI
uv run python main.py --help
```

## Environment Configuration

Required environment variables in `.env`:

```bash
DATABASE_URL=sqlite:///./content_automation.db
OPENROUTER_API_KEY=your_key_here
BUNNY_ACCOUNT_API_KEY=your_bunny_key_here
```

See `env.example` for full configuration options.

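A quick pre-flight check can catch a missing variable before any command runs. The sketch below is not part of the project: `missing_env_vars` is a hypothetical helper, and it assumes the values from `.env` have already been loaded into the process environment.

```python
import os

# Names taken from the required-variables list above.
REQUIRED_VARS = ("DATABASE_URL", "OPENROUTER_API_KEY", "BUNNY_ACCOUNT_API_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing required variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```
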
## Database Management

### Initialize Database
```bash
uv run python scripts/init_db.py
```

### Reset Database (drops all data)
```bash
uv run python scripts/init_db.py reset
```

### Create First Admin
```bash
uv run python scripts/create_first_admin.py
```

### Database Migrations
```bash
# Story 3.1 - Site deployments
uv run python scripts/migrate_story_3.1_sqlite.py

# Story 3.2 - Anchor text
uv run python scripts/migrate_add_anchor_text.py

# Story 3.3 - Template fields
uv run python scripts/migrate_add_template_fields.py

# Story 3.4 - Site pages
uv run python scripts/migrate_add_site_pages.py

# Story 4.1 - Deployment fields
uv run python scripts/migrate_add_deployment_fields.py

# Backfill site pages after migration
uv run python scripts/backfill_site_pages.py
```

## User Management

### Add User
```bash
uv run python main.py add-user \
  --username newuser \
  --password password123 \
  --role Admin \
  --admin-user admin \
  --admin-password adminpass
```

### List Users
```bash
uv run python main.py list-users \
  --admin-user admin \
  --admin-password adminpass
```

### Delete User
```bash
uv run python main.py delete-user \
  --username olduser \
  --admin-user admin \
  --admin-password adminpass
```

## Site Management

### Provision New Site
```bash
uv run python main.py provision-site \
  --name "My Site" \
  --domain www.example.com \
  --storage-name my-storage-zone \
  --region DE \
  --admin-user admin \
  --admin-password adminpass
```

Regions: `DE`, `NY`, `LA`, `SG`, `SYD`

### Attach Domain to Existing Storage
```bash
uv run python main.py attach-domain \
  --name "Another Site" \
  --domain www.another.com \
  --storage-name my-storage-zone \
  --admin-user admin \
  --admin-password adminpass
```

### Sync Existing Bunny.net Sites
```bash
# Dry run
uv run python main.py sync-sites \
  --admin-user admin \
  --dry-run

# Actually import
uv run python main.py sync-sites \
  --admin-user admin
```

### List Sites
```bash
uv run python main.py list-sites \
  --admin-user admin \
  --admin-password adminpass
```

### Get Site Details
```bash
uv run python main.py get-site \
  --domain www.example.com \
  --admin-user admin \
  --admin-password adminpass
```

### Remove Site
```bash
uv run python main.py remove-site \
  --domain www.example.com \
  --admin-user admin \
  --admin-password adminpass
```

## S3 Bucket Management

The platform supports AWS S3 buckets as storage providers alongside bunny.net. S3 buckets can be discovered, registered, and managed through the system.

### Prerequisites

Set AWS credentials in `.env`:
```bash
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1 # Optional, defaults to us-east-1
```

### Discover and Register S3 Buckets

**Interactive Mode** (select buckets manually):
```bash
uv run python main.py discover-s3-buckets
```

Or run the script directly:
```bash
uv run python scripts/discover_s3_buckets.py
```

**Auto-Import Mode** (import all unregistered buckets automatically):
```bash
uv run python scripts/discover_s3_buckets.py --auto-import-all
```

Auto-import mode will:
- Discover all S3 buckets in your AWS account
- Skip buckets already registered in the database
- Skip buckets in the exclusion list
- Register remaining buckets as bucket-only sites (no custom domain)

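The skip rules above amount to a set difference. The following is a minimal sketch of that selection step, not the discovery script's actual code; `select_buckets_to_import` and its arguments are hypothetical names.

```python
def select_buckets_to_import(discovered, registered, excluded):
    """Return discovered bucket names that are neither registered nor excluded.

    The order in which buckets were discovered is preserved.
    """
    skip = set(registered) | set(excluded)
    return [name for name in discovered if name not in skip]


if __name__ == "__main__":
    to_import = select_buckets_to_import(
        discovered=["bucket-a", "bucket-b", "theteacher.best"],
        registered=["bucket-a"],
        excluded=["theteacher.best"],
    )
    print(to_import)
```
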
### Bucket Exclusion List

To prevent certain buckets from being auto-imported (e.g., buckets manually added with FQDNs), add them to `s3_bucket_exclusions.txt`:

```
# S3 Bucket Exclusion List
# One bucket name per line (comments start with #)

learningeducationtech.com
theteacher.best
airconditionerfixer.com
```

The discovery script automatically loads and respects this exclusion list. Excluded buckets are marked as `[EXCLUDED]` in the display and are skipped during both interactive and auto-import operations.

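The file format (one bucket name per line, `#` for comments) is simple enough to parse in a few lines. This is a sketch under that assumption only; `parse_exclusions` is a hypothetical helper, and the discovery script may handle edge cases differently.

```python
def parse_exclusions(text):
    """Parse exclusion-list text into a set of bucket names.

    Blank lines and lines starting with '#' are ignored.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line)
    return names


if __name__ == "__main__":
    sample = """# S3 Bucket Exclusion List
# One bucket name per line (comments start with #)

learningeducationtech.com
theteacher.best
"""
    print(sorted(parse_exclusions(sample)))
```
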
### List S3 Sites with FQDNs

To see which S3 buckets have custom domains (and should be excluded):
```bash
uv run python scripts/list_s3_fqdn_sites.py
```

This script lists all S3 sites with `s3_custom_domain` set and outputs bucket names that should be added to the exclusion list.

### S3 Site Types

S3 sites can be registered in two ways:

1. **Bucket-only sites**: No custom domain, accessed via S3 website endpoint
   - Created via auto-import or interactive discovery
   - Uses bucket name as site identifier
   - URL format: `https://bucket-name.s3.region.amazonaws.com/`

2. **FQDN sites**: Manually added with custom domains
   - Created manually with `s3_custom_domain` set
   - Should be added to exclusion list to prevent re-import
   - URL format: `https://custom-domain.com/`

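The two URL formats can be expressed as one small function. This sketch only restates the formats listed above; `site_base_url` and its parameter names are hypothetical and do not come from the codebase.

```python
def site_base_url(bucket_name, region, custom_domain=None):
    """Return the base URL for an S3 site.

    FQDN sites use their custom domain; bucket-only sites fall back to
    the bucket/region URL format described above.
    """
    if custom_domain:
        return f"https://{custom_domain}/"
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/"


if __name__ == "__main__":
    print(site_base_url("my-bucket", "us-east-1"))
    print(site_base_url("my-bucket", "us-east-1", custom_domain="www.example.com"))
```
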
### S3 Storage Features

- **Multi-region support**: Automatically detects bucket region
- **Public read access**: Buckets configured for public read-only access
- **Bucket policy**: Applied automatically for public read access
- **Region mapping**: AWS regions mapped to short codes (US, EU, SG, etc.)
- **Duplicate prevention**: Checks existing registrations before importing

### Helper Scripts

**List S3 FQDN sites**:
```bash
uv run python scripts/list_s3_fqdn_sites.py
```

**Delete sites by ID**:
```bash
# Edit scripts/delete_sites.py to set site_ids, then:
uv run python scripts/delete_sites.py
```

**Check sites around specific IDs**:
```bash
# Edit scripts/list_sites_by_id.py to set target_ids, then:
uv run python scripts/list_sites_by_id.py
```

## Project Management

### Ingest CORA Report
```bash
uv run python main.py ingest-cora \
  --file shaft_machining.xlsx \
  --name "Shaft Machining Project" \
  --custom-anchors "shaft repair,engine parts" \
  --username admin \
  --password adminpass
```

### List Projects
```bash
uv run python main.py list-projects \
  --username admin \
  --password adminpass
```

## Content Generation

### Create Job Configuration
```bash
# Tier 1 only
uv run python create_job_config.py 1 tier1 15

# Multi-tier
uv run python create_job_config.py 1 multi 15 50 100
```

### Generate Content Batch
```bash
uv run python main.py generate-batch \
  --job-file jobs/project_1_tier1_15articles.json \
  --username admin \
  --password adminpass
```

With options:
```bash
uv run python main.py generate-batch \
  --job-file jobs/my_job.json \
  --username admin \
  --password adminpass \
  --debug \
  --continue-on-error \
  --model gpt-4o-mini
```

Available models: `gpt-4o-mini`, `claude-sonnet-4.5`, or any other model available on OpenRouter.

**Note:** If your job file contains a `models` config, it overrides the `--model` flag, and the title, outline, and content generation stages each use the model configured for them.

## Deployment

### Deploy Batch
```bash
# Automatic deployment (runs after generation)
uv run python main.py generate-batch \
  --job-file jobs/my_job.json \
  --username admin \
  --password adminpass

# Manual deployment
uv run python main.py deploy-batch \
  --batch-id 123 \
  --admin-user admin \
  --admin-password adminpass
```

### Dry Run Deployment
```bash
uv run python main.py deploy-batch \
  --batch-id 123 \
  --dry-run
```

### Verify Deployment
```bash
# Check all URLs
uv run python main.py verify-deployment --batch-id 123

# Check random sample
uv run python main.py verify-deployment \
  --batch-id 123 \
  --sample 10 \
  --timeout 10
```

## Link Export

### Export Article URLs
```bash
# Tier 1 only
uv run python main.py get-links \
  --project-id 123 \
  --tier 1

# Tier 2 and above
uv run python main.py get-links \
  --project-id 123 \
  --tier 2+

# With anchor text and destinations
uv run python main.py get-links \
  --project-id 123 \
  --tier 2+ \
  --with-anchor-text \
  --with-destination-url
```

Output is CSV format to stdout. Redirect to save:
```bash
uv run python main.py get-links \
  --project-id 123 \
  --tier 1 > tier1_urls.csv
```

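Once saved, the export can be read back with Python's standard `csv` module. The sketch below makes no claim about the actual column layout of `get-links` output: it returns each row as a list of strings, and `read_link_rows` is a hypothetical helper.

```python
import csv
import io


def read_link_rows(csv_text):
    """Parse CSV text into a list of rows, skipping completely empty lines."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if row]


if __name__ == "__main__":
    # Illustrative data only; the real export may include a header row
    # or different columns.
    sample = "https://www.example.com/article-one.html,shaft repair\n"
    for row in read_link_rows(sample):
        print(row)
```

To process a saved export, pass the file contents, for example `read_link_rows(open("tier1_urls.csv", encoding="utf-8").read())`.
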
## Utility Scripts

### Add robots.txt to All Buckets

Add a standardized robots.txt file to all storage buckets (both S3 and Bunny) that blocks SEO tools and bad bots while allowing legitimate search engines and AI crawlers:

```bash
# Preview what would be done (recommended first)
uv run python scripts/add_robots_txt_to_buckets.py --dry-run

# Upload to all buckets
uv run python scripts/add_robots_txt_to_buckets.py

# Only process S3 buckets
uv run python scripts/add_robots_txt_to_buckets.py --provider s3

# Only process Bunny storage zones
uv run python scripts/add_robots_txt_to_buckets.py --provider bunny
```

**robots.txt behavior:**
- Allows: Google, Bing, Yahoo, DuckDuckGo, Baidu, Yandex
- Allows: GPTBot, Claude, Common Crawl, Perplexity, ByteDance AI
- Blocks: Ahrefs, Semrush, Moz, and other SEO tools
- Blocks: HTTrack, Wget, and other scrapers/bad bots

The script is idempotent (safe to run multiple times) and will overwrite existing robots.txt files. It continues processing remaining buckets if one fails and reports all failures at the end.

### Update Index Pages and Sitemaps

Automatically generate or update `index.html` and `sitemap.xml` files for all storage buckets (both S3 and Bunny). The script:

- Lists all HTML files in each bucket's root directory
- Extracts titles from `<title>` tags (or formats filenames as fallback)
- Generates article listings sorted by most recent modification date
- Creates or updates `index.html` with article links in `<div id="article_listing">`
- Generates `sitemap.xml` with industry-standard settings (priority, changefreq, lastmod)
- Tracks last run timestamps to avoid unnecessary updates
- Excludes boilerplate pages: `index.html`, `about.html`, `privacy.html`, `contact.html`

**Usage:**

```bash
# Preview what would be updated (recommended first)
uv run python scripts/update_index_pages.py --dry-run

# Update all buckets
uv run python scripts/update_index_pages.py

# Only process S3 buckets
uv run python scripts/update_index_pages.py --provider s3

# Only process Bunny storage zones
uv run python scripts/update_index_pages.py --provider bunny

# Force update even if no changes detected
uv run python scripts/update_index_pages.py --force

# Test on specific site
uv run python scripts/update_index_pages.py --hostname example.com

# Limit number of sites (useful for testing)
uv run python scripts/update_index_pages.py --limit 10
```

**How it works:**

1. Queries database for all site deployments
2. Lists HTML files in root directory (excludes subdirectories and boilerplate pages)
3. Checks if content has changed since last run (unless `--force` is used)
4. Downloads and parses HTML files to extract titles
5. Generates article listing HTML (sorted by most recent first)
6. Creates new `index.html` or updates existing one (inserts into `<div id="article_listing">`)
7. Generates `sitemap.xml` with all HTML files and proper metadata
8. Uploads both files to the bucket
9. Saves state to `scripts/.update_index_state.json` for tracking

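Step 4, extracting a title with a filename fallback, can be sketched with the standard library's `html.parser`. This is an illustration of the idea rather than the script's implementation; `extract_title` and `title_from_filename` are hypothetical names, and the exact filename formatting rules are an assumption.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_from_filename(filename):
    """Fallback: turn 'shaft-repair_guide.html' into 'Shaft Repair Guide'."""
    stem = filename.rsplit(".", 1)[0]
    return stem.replace("-", " ").replace("_", " ").title()


def extract_title(html, filename):
    """Return the <title> text, or a title derived from the filename."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() or title_from_filename(filename)


if __name__ == "__main__":
    print(extract_title("<html><head><title>My Article</title></head></html>", "a.html"))
    print(extract_title("<html><body>No title here</body></html>", "shaft-repair_guide.html"))
```
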
**Sitemap standards:**
- Priority: `1.0` for homepage (`index.html`), `0.8` for other pages
- Change frequency: `weekly` for all pages
- Last modified dates from file metadata
- Includes all HTML files in root directory

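These settings translate directly into sitemap entries. The sketch below builds such a document with `xml.etree.ElementTree`; it assumes `lastmod` values arrive as `YYYY-MM-DD` strings and that each page URL is the base URL plus the filename. `build_sitemap` is a hypothetical function, not the script's own.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, pages):
    """Build sitemap XML from (filename, lastmod) pairs.

    index.html gets priority 1.0, every other page 0.8, and all pages
    use a weekly change frequency.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for filename, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + filename
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = "weekly"
        ET.SubElement(url, "priority").text = "1.0" if filename == "index.html" else "0.8"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(
        "https://www.example.com",
        [("index.html", "2024-01-01"), ("shaft-repair.html", "2024-01-02")],
    ))
```
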
**Customizing article listing HTML:**

The article listing format can be easily customized by editing the `generate_article_listing_html()` function in `scripts/update_index_pages.py`. The function includes detailed documentation and examples for common variations (cards, dates, descriptions, etc.).

**State tracking:**

The script maintains state in `scripts/.update_index_state.json` to track when each site was last updated. This prevents unnecessary regeneration when content hasn't changed. Use `--force` to bypass this check.

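The skip-unless-changed check can be pictured as a timestamp comparison per site. The layout of the state file is not documented here, so the sketch below assumes a flat JSON object mapping hostname to an ISO 8601 UTC timestamp; `load_state` and `needs_update` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_state(path):
    """Load the state file, returning an empty dict if it does not exist."""
    state_file = Path(path)
    if not state_file.exists():
        return {}
    return json.loads(state_file.read_text(encoding="utf-8"))


def needs_update(state, hostname, latest_modified, force=False):
    """Decide whether a site's index should be regenerated.

    Timestamps are assumed to be ISO 8601 strings in the same format and
    timezone, so plain string comparison orders them correctly.
    """
    if force:
        return True
    last_run = state.get(hostname)
    return last_run is None or latest_modified > last_run


if __name__ == "__main__":
    state = {"example.com": "2024-05-01T00:00:00Z"}
    print(needs_update(state, "example.com", "2024-05-02T12:00:00Z"))
```
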
### Check Last Generated Content
```bash
uv run python check_last_gen.py
```

### List All Users (Direct DB Access)
```bash
uv run python scripts/list_users.py
```

### Add Admin (Direct DB Access)
```bash
uv run python scripts/add_admin_direct.py
```

### Check Migration Status
```bash
uv run python scripts/check_migration.py
```

### Add Tier to Projects
```bash
uv run python scripts/add_tier_to_projects.py
```

## Testing

### Run All Tests
```bash
uv run pytest
```

### Run Unit Tests
```bash
uv run pytest tests/unit/ -v
```

### Run Integration Tests
```bash
uv run pytest tests/integration/ -v
```

### Run Specific Test File
```bash
uv run pytest tests/unit/test_url_generator.py -v
```

### Run Story 3.1 Tests
```bash
uv run pytest tests/unit/test_url_generator.py \
  tests/unit/test_site_provisioning.py \
  tests/unit/test_site_assignment.py \
  tests/unit/test_job_config_extensions.py \
  tests/integration/test_story_3_1_integration.py \
  -v
```

### Run with Coverage
```bash
uv run pytest --cov=src --cov-report=html
```

## System Information

### Show Configuration
```bash
uv run python main.py config
```

### Health Check
```bash
uv run python main.py health
```

### List Available Models
```bash
uv run python main.py models
```

## Directory Structure

```
Big-Link-Man/
├── main.py                 # CLI entry point
├── src/                    # Source code
│   ├── api/                # FastAPI endpoints
│   ├── auth/               # Authentication
│   ├── cli/                # CLI commands
│   ├── core/               # Configuration
│   ├── database/           # Models, repositories
│   ├── deployment/         # Cloud deployment
│   ├── generation/         # Content generation
│   ├── ingestion/          # CORA parsing
│   ├── interlinking/       # Link injection
│   └── templating/         # HTML templates
├── scripts/                # Database & utility scripts
├── tests/                  # Test suite
│   ├── unit/
│   └── integration/
├── jobs/                   # Job configuration files
├── docs/                   # Documentation
└── deployment_logs/        # Deployed URL logs
```

## Job Configuration Format

Example job config (`jobs/example.json`):

```json
{
  "job_name": "Multi-Tier Launch",
  "project_id": 1,
  "description": "Site build with 165 articles",
  "models": {
    "title": "openai/gpt-4o-mini",
    "outline": "anthropic/claude-3.5-sonnet",
    "content": "anthropic/claude-3.5-sonnet"
  },
  "tiers": [
    {
      "tier": 1,
      "article_count": 15,
      "validation_attempts": 3
    },
    {
      "tier": 2,
      "article_count": 50,
      "validation_attempts": 2
    }
  ],
  "failure_config": {
    "max_consecutive_failures": 10,
    "skip_on_failure": true
  },
  "interlinking": {
    "links_per_article_min": 2,
    "links_per_article_max": 4,
    "include_home_link": true
  },
  "deployment_targets": ["www.primary.com"],
  "tier1_preferred_sites": ["www.premium.com"],
  "auto_create_sites": true
}
```

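Before launching a long batch, it can be useful to sanity-check a job file and see how many articles it will produce. The sketch below reads only the `tiers` fields shown in the example; `summarize_tiers` is a hypothetical helper, and the validation rules it applies are assumptions rather than the project's schema (see `docs/job-schema.md` for the authoritative reference).

```python
def summarize_tiers(job):
    """Return {tier_number: article_count} for a job config dict.

    Raises ValueError if the tiers list is missing or a count is invalid.
    """
    tiers = job.get("tiers")
    if not isinstance(tiers, list) or not tiers:
        raise ValueError("job config needs a non-empty 'tiers' list")
    counts = {}
    for entry in tiers:
        tier = entry["tier"]
        count = entry["article_count"]
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"tier {tier}: article_count must be a positive integer")
        counts[tier] = count
    return counts


if __name__ == "__main__":
    job = {
        "job_name": "Multi-Tier Launch",
        "project_id": 1,
        "tiers": [
            {"tier": 1, "article_count": 15, "validation_attempts": 3},
            {"tier": 2, "article_count": 50, "validation_attempts": 2},
        ],
    }
    counts = summarize_tiers(job)
    print(counts, "total:", sum(counts.values()))
```

For a file on disk, load it first with `json.load` and pass the resulting dict.
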
### Per-Stage Model Configuration

You can specify different AI models for each generation stage (title, outline, content):

```json
{
  "models": {
    "title": "openai/gpt-4o-mini",
    "outline": "anthropic/claude-3.5-sonnet",
    "content": "openai/gpt-4o"
  }
}
```

**Available models:**
- `openai/gpt-4o-mini` - Fast and cost-effective
- `openai/gpt-4o` - Higher quality, more expensive
- `anthropic/claude-3.5-sonnet` - Excellent for long-form content

If `models` is not specified in the job file, all stages use the model from the `--model` CLI flag (default: `gpt-4o-mini`).

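The precedence between the job file and the CLI flag can be sketched as a per-stage lookup with a fallback. This illustrates the documented rule only; `resolve_models` is a hypothetical function, and falling back to the CLI model when a single stage is missing from `models` is an assumption not confirmed by the source.

```python
STAGES = ("title", "outline", "content")


def resolve_models(job, cli_model="gpt-4o-mini"):
    """Pick the model for each generation stage.

    A stage listed under the job's 'models' key wins; otherwise the
    model from the CLI flag is used.
    """
    models = job.get("models") or {}
    return {stage: models.get(stage, cli_model) for stage in STAGES}


if __name__ == "__main__":
    job = {"models": {"title": "openai/gpt-4o-mini", "content": "openai/gpt-4o"}}
    print(resolve_models(job, cli_model="gpt-4o-mini"))
    print(resolve_models({}, cli_model="claude-sonnet-4.5"))
```
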
## Common Workflows

### Initial Setup
```bash
uv pip install -r requirements.txt
cp env.example .env
# Edit .env
uv run python scripts/init_db.py
uv run python scripts/create_first_admin.py
uv run python main.py sync-sites --admin-user admin
```

### New Project Workflow
```bash
# 1. Ingest CORA report
uv run python main.py ingest-cora \
  --file project.xlsx \
  --name "My Project" \
  --username admin \
  --password adminpass

# 2. Create job config
uv run python create_job_config.py 1 multi 15 50 100

# 3. Generate content (auto-deploys)
uv run python main.py generate-batch \
  --job-file jobs/project_1_multi_3tiers_165articles.json \
  --username admin \
  --password adminpass

# 4. Verify deployment
uv run python main.py verify-deployment --batch-id 1

# 5. Export URLs for link building
uv run python main.py get-links \
  --project-id 1 \
  --tier 1 > tier1_urls.csv
```

### Re-deploy After Changes
```bash
uv run python main.py deploy-batch \
  --batch-id 123 \
  --admin-user admin \
  --admin-password adminpass
```

## Troubleshooting

### Database locked
```bash
# Stop any running processes, then (warning: reset drops all data):
uv run python scripts/init_db.py reset
```

### Missing dependencies
```bash
uv pip install -r requirements.txt --force-reinstall
```

### AI API errors
Check `OPENROUTER_API_KEY` in `.env`.

### Bunny.net authentication failed
Check `BUNNY_ACCOUNT_API_KEY` in `.env`.

### Storage upload failed
Verify `storage_zone_password` in the database (set during site provisioning).

## Documentation

- **CLI Command Reference**: `docs/CLI_COMMAND_REFERENCE.md` - Comprehensive documentation for all CLI commands
- **Job Configuration Schema**: `docs/job-schema.md` - Complete reference for job configuration files
- **Product Requirements**: `docs/prd.md` - Product requirements and epics
- **Architecture**: `docs/architecture/` - System architecture documentation
- **Story Specifications**: `docs/stories/` - Current story specifications
- **Technical Debt**: `docs/technical-debt.md` - Known technical debt items
- **Historical Documentation**: `docs/archive/` - Archived implementation summaries, QA reports, and analysis documents

### Regenerating CLI Documentation

To regenerate the CLI command reference after adding or modifying commands:

```bash
uv run python scripts/generate_cli_docs.py
```

This will update `docs/CLI_COMMAND_REFERENCE.md` with all current commands and their options.

## License

All rights reserved.