# Big Link Man - Content Automation & Syndication Platform
AI-powered content generation and multi-tier link building system with cloud deployment.
## Quick Start
```bash
# Install dependencies
uv pip install -r requirements.txt
# Setup environment
cp env.example .env
# Edit .env with your credentials
# Initialize database
uv run python scripts/init_db.py
# Create first admin user
uv run python scripts/create_first_admin.py
# Run CLI
uv run python main.py --help
```
## Environment Configuration
Required environment variables in `.env`:
```bash
DATABASE_URL=sqlite:///./content_automation.db
OPENROUTER_API_KEY=your_key_here
BUNNY_ACCOUNT_API_KEY=your_bunny_key_here
```
See `env.example` for full configuration options.
## Database Management
### Initialize Database
```bash
uv run python scripts/init_db.py
```
### Reset Database (drops all data)
```bash
uv run python scripts/init_db.py reset
```
### Create First Admin
```bash
uv run python scripts/create_first_admin.py
```
### Database Migrations
```bash
# Story 3.1 - Site deployments
uv run python scripts/migrate_story_3.1_sqlite.py
# Story 3.2 - Anchor text
uv run python scripts/migrate_add_anchor_text.py
# Story 3.3 - Template fields
uv run python scripts/migrate_add_template_fields.py
# Story 3.4 - Site pages
uv run python scripts/migrate_add_site_pages.py
# Story 4.1 - Deployment fields
uv run python scripts/migrate_add_deployment_fields.py
# Backfill site pages after migration
uv run python scripts/backfill_site_pages.py
```
## User Management
### Add User
```bash
uv run python main.py add-user \
--username newuser \
--password password123 \
--role Admin \
--admin-user admin \
--admin-password adminpass
```
### List Users
```bash
uv run python main.py list-users \
--admin-user admin \
--admin-password adminpass
```
### Delete User
```bash
uv run python main.py delete-user \
--username olduser \
--admin-user admin \
--admin-password adminpass
```
## Site Management
### Provision New Site
```bash
uv run python main.py provision-site \
--name "My Site" \
--domain www.example.com \
--storage-name my-storage-zone \
--region DE \
--admin-user admin \
--admin-password adminpass
```
Regions: `DE`, `NY`, `LA`, `SG`, `SYD`
### Attach Domain to Existing Storage
```bash
uv run python main.py attach-domain \
--name "Another Site" \
--domain www.another.com \
--storage-name my-storage-zone \
--admin-user admin \
--admin-password adminpass
```
### Sync Existing Bunny.net Sites
```bash
# Dry run
uv run python main.py sync-sites \
--admin-user admin \
--dry-run
# Actually import
uv run python main.py sync-sites \
--admin-user admin
```
### List Sites
```bash
uv run python main.py list-sites \
--admin-user admin \
--admin-password adminpass
```
### Get Site Details
```bash
uv run python main.py get-site \
--domain www.example.com \
--admin-user admin \
--admin-password adminpass
```
### Remove Site
```bash
uv run python main.py remove-site \
--domain www.example.com \
--admin-user admin \
--admin-password adminpass
```
## S3 Bucket Management
The platform supports AWS S3 buckets as storage providers alongside Bunny.net. S3 buckets can be discovered, registered, and managed through the system.
### Prerequisites
Set AWS credentials in `.env`:
```bash
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1 # Optional, defaults to us-east-1
```
### Discover and Register S3 Buckets
**Interactive Mode** (select buckets manually):
```bash
uv run python main.py discover-s3-buckets
```
Or run the script directly:
```bash
uv run python scripts/discover_s3_buckets.py
```
**Auto-Import Mode** (import all unregistered buckets automatically):
```bash
uv run python scripts/discover_s3_buckets.py --auto-import-all
```
Auto-import mode will:
- Discover all S3 buckets in your AWS account
- Skip buckets already registered in the database
- Skip buckets in the exclusion list
- Register remaining buckets as bucket-only sites (no custom domain)
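The auto-import flow above can be sketched as follows. This is illustrative only: `all_buckets`, `registered`, `exclusions`, and the `register` callback are hypothetical stand-ins for the script's internals.

```python
def auto_import(all_buckets, registered, exclusions, register):
    """Register every discovered bucket that is neither already
    registered nor excluded; returns the names that were imported."""
    imported = []
    for name in all_buckets:
        if name in registered:
            continue  # duplicate prevention: already in the database
        if name in exclusions:
            continue  # respects s3_bucket_exclusions.txt
        register(name)  # registered as a bucket-only site, no custom domain
        imported.append(name)
    return imported
```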
### Bucket Exclusion List
To prevent certain buckets from being auto-imported (e.g., buckets manually added with FQDNs), add them to `s3_bucket_exclusions.txt`:
```
# S3 Bucket Exclusion List
# One bucket name per line (comments start with #)
learningeducationtech.com
theteacher.best
airconditionerfixer.com
```
The discovery script automatically loads and respects this exclusion list. Excluded buckets are marked as `[EXCLUDED]` in the display and are skipped during both interactive and auto-import operations.
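A minimal loader for the one-name-per-line format described above might look like this (the real script's parser may differ, e.g. in how it treats trailing comments):

```python
from pathlib import Path

def load_exclusions(path="s3_bucket_exclusions.txt"):
    """Return the set of excluded bucket names, ignoring blank lines
    and anything after a '#' comment marker."""
    exclusions = set()
    p = Path(path)
    if not p.exists():
        return exclusions
    for line in p.read_text().splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            exclusions.add(name)
    return exclusions
```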
### List S3 Sites with FQDNs
To see which S3 buckets have custom domains (and should be excluded):
```bash
uv run python scripts/list_s3_fqdn_sites.py
```
This script lists all S3 sites with `s3_custom_domain` set and outputs bucket names that should be added to the exclusion list.
### S3 Site Types
S3 sites can be registered in two ways:
1. **Bucket-only sites**: No custom domain, accessed via S3 website endpoint
- Created via auto-import or interactive discovery
- Uses bucket name as site identifier
- URL format: `https://bucket-name.s3.region.amazonaws.com/`
2. **FQDN sites**: Manually added with custom domains
- Created manually with `s3_custom_domain` set
- Should be added to exclusion list to prevent re-import
- URL format: `https://custom-domain.com/`
### S3 Storage Features
- **Multi-region support**: Automatically detects bucket region
- **Public read access**: Buckets configured for public read-only access
- **Bucket policy**: Applied automatically for public read access
- **Region mapping**: AWS regions mapped to short codes (US, EU, SG, etc.)
- **Duplicate prevention**: Checks existing registrations before importing
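The "bucket policy applied automatically" feature amounts to attaching a standard public-read policy. A sketch, assuming boto3's `put_bucket_policy` call (the exact policy the platform applies may differ):

```python
import json

def apply_public_read_policy(bucket_name, s3_client=None):
    """Attach a public-read bucket policy (s3:GetObject for everyone)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }],
    }
    if s3_client is None:
        import boto3  # only imported when no client is injected
        s3_client = boto3.client("s3")
    s3_client.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
    return policy
```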
### Helper Scripts
**List S3 FQDN sites**:
```bash
uv run python scripts/list_s3_fqdn_sites.py
```
**Delete sites by ID**:
```bash
# Edit scripts/delete_sites.py to set site_ids, then:
uv run python scripts/delete_sites.py
```
**Check sites around specific IDs**:
```bash
# Edit scripts/list_sites_by_id.py to set target_ids, then:
uv run python scripts/list_sites_by_id.py
```
## Project Management
### Ingest CORA Report
```bash
uv run python main.py ingest-cora \
--file shaft_machining.xlsx \
--name "Shaft Machining Project" \
--custom-anchors "shaft repair,engine parts" \
--username admin \
--password adminpass
```
### List Projects
```bash
uv run python main.py list-projects \
--username admin \
--password adminpass
```
## Content Generation
### Create Job Configuration
```bash
# Tier 1 only
uv run python create_job_config.py 1 tier1 15
# Multi-tier
uv run python create_job_config.py 1 multi 15 50 100
```
### Generate Content Batch
```bash
uv run python main.py generate-batch \
--job-file jobs/project_1_tier1_15articles.json \
--username admin \
--password adminpass
```
With options:
```bash
uv run python main.py generate-batch \
--job-file jobs/my_job.json \
--username admin \
--password adminpass \
--debug \
--continue-on-error \
--model gpt-4o-mini
```
Available models include `gpt-4o-mini` and `claude-sonnet-4.5`, plus any model available on OpenRouter.
**Note:** If your job file contains a `models` config, it will override the `--model` flag and use different models for title, outline, and content generation stages.
## Deployment
### Deploy Batch
```bash
# Automatic deployment (runs after generation)
uv run python main.py generate-batch \
--job-file jobs/my_job.json \
--username admin \
--password adminpass
# Manual deployment
uv run python main.py deploy-batch \
--batch-id 123 \
--admin-user admin \
--admin-password adminpass
```
### Dry Run Deployment
```bash
uv run python main.py deploy-batch \
--batch-id 123 \
--dry-run
```
### Verify Deployment
```bash
# Check all URLs
uv run python main.py verify-deployment --batch-id 123
# Check random sample
uv run python main.py verify-deployment \
--batch-id 123 \
--sample 10 \
--timeout 10
```
## Link Export
### Export Article URLs
```bash
# Tier 1 only
uv run python main.py get-links \
--project-id 123 \
--tier 1
# Tier 2 and above
uv run python main.py get-links \
--project-id 123 \
--tier 2+
# With anchor text and destinations
uv run python main.py get-links \
--project-id 123 \
--tier 2+ \
--with-anchor-text \
--with-destination-url
```
Output is CSV format to stdout. Redirect to save:
```bash
uv run python main.py get-links \
--project-id 123 \
--tier 1 > tier1_urls.csv
```
## Utility Scripts
### Add robots.txt to All Buckets
Add a standardized robots.txt file to all storage buckets (both S3 and Bunny) that blocks SEO tools and bad bots while allowing legitimate search engines and AI crawlers:
```bash
# Preview what would be done (recommended first)
uv run python scripts/add_robots_txt_to_buckets.py --dry-run
# Upload to all buckets
uv run python scripts/add_robots_txt_to_buckets.py
# Only process S3 buckets
uv run python scripts/add_robots_txt_to_buckets.py --provider s3
# Only process Bunny storage zones
uv run python scripts/add_robots_txt_to_buckets.py --provider bunny
```
**robots.txt behavior:**
- Allows: Google, Bing, Yahoo, DuckDuckGo, Baidu, Yandex
- Allows: GPTBot, Claude, Common Crawl, Perplexity, ByteDance AI
- Blocks: Ahrefs, Semrush, Moz, and other SEO tools
- Blocks: HTTrack, Wget, and other scrapers/bad bots
The script is idempotent (safe to run multiple times) and will overwrite existing robots.txt files. It continues processing remaining buckets if one fails and reports all failures at the end.
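Generating the robots.txt content itself is straightforward. A sketch with an illustrative subset of bot names (the script's actual allow/block lists are longer and may use different user-agent strings):

```python
ALLOWED_BOTS = ["Googlebot", "Bingbot", "DuckDuckBot", "GPTBot", "ClaudeBot", "CCBot"]
BLOCKED_BOTS = ["AhrefsBot", "SemrushBot", "rogerbot", "HTTrack", "Wget"]

def build_robots_txt(allowed=ALLOWED_BOTS, blocked=BLOCKED_BOTS):
    """Emit one stanza per bot: explicit Allow for search engines and
    AI crawlers, Disallow: / for SEO tools and scrapers."""
    lines = []
    for bot in allowed:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in blocked:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)
```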
### Update Index Pages and Sitemaps
Automatically generate or update `index.html` and `sitemap.xml` files for all storage buckets (both S3 and Bunny). The script:
- Lists all HTML files in each bucket's root directory
- Extracts titles from `<title>` tags (or formats filenames as fallback)
- Generates article listings sorted by most recent modification date
- Creates or updates `index.html` with article links in `<div id="article_listing">`
- Generates `sitemap.xml` with industry-standard settings (priority, changefreq, lastmod)
- Tracks last run timestamps to avoid unnecessary updates
- Excludes boilerplate pages: `index.html`, `about.html`, `privacy.html`, `contact.html`
**Usage:**
```bash
# Preview what would be updated (recommended first)
uv run python scripts/update_index_pages.py --dry-run
# Update all buckets
uv run python scripts/update_index_pages.py
# Only process S3 buckets
uv run python scripts/update_index_pages.py --provider s3
# Only process Bunny storage zones
uv run python scripts/update_index_pages.py --provider bunny
# Force update even if no changes detected
uv run python scripts/update_index_pages.py --force
# Test on specific site
uv run python scripts/update_index_pages.py --hostname example.com
# Limit number of sites (useful for testing)
uv run python scripts/update_index_pages.py --limit 10
```
**How it works:**
1. Queries database for all site deployments
2. Lists HTML files in root directory (excludes subdirectories and boilerplate pages)
3. Checks if content has changed since last run (unless `--force` is used)
4. Downloads and parses HTML files to extract titles
5. Generates article listing HTML (sorted by most recent first)
6. Creates new `index.html` or updates existing one (inserts into `<div id="article_listing">`)
7. Generates `sitemap.xml` with all HTML files and proper metadata
8. Uploads both files to the bucket
9. Saves state to `.update_index_state.json` for tracking
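Step 4's title extraction can be sketched with the standard library's `HTMLParser`; the filename-formatting fallback is an assumption about how the script normalizes names:

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath

class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html, filename):
    """Title from the <title> tag, else the filename formatted as a
    fallback (e.g. 'my-article.html' -> 'My Article')."""
    parser = _TitleParser()
    parser.feed(html)
    if parser.title.strip():
        return parser.title.strip()
    stem = PurePosixPath(filename).stem
    return stem.replace("-", " ").replace("_", " ").title()
```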
**Sitemap standards:**
- Priority: `1.0` for homepage (`index.html`), `0.8` for other pages
- Change frequency: `weekly` for all pages
- Last modified dates from file metadata
- Includes all HTML files in root directory
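The sitemap standards above translate directly into XML. A minimal sketch (element layout and whitespace are illustrative; the script's output may be formatted differently):

```python
from xml.sax.saxutils import escape

def build_sitemap(base_url, pages):
    """pages: list of (filename, lastmod_iso_date) tuples. Applies the
    priorities and changefreq described above: 1.0 for index.html,
    0.8 for everything else, weekly change frequency."""
    urls = []
    for filename, lastmod in pages:
        priority = "1.0" if filename == "index.html" else "0.8"
        urls.append(
            "  <url>\n"
            f"    <loc>{escape(base_url.rstrip('/') + '/' + filename)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            "    <changefreq>weekly</changefreq>\n"
            f"    <priority>{priority}</priority>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(urls)
        + "\n</urlset>\n"
    )
```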
**Customizing article listing HTML:**
The article listing format can be easily customized by editing the `generate_article_listing_html()` function in `scripts/update_index_pages.py`. The function includes detailed documentation and examples for common variations (cards, dates, descriptions, etc.).
**State tracking:**
The script maintains state in `scripts/.update_index_state.json` to track when each site was last updated. This prevents unnecessary regeneration when content hasn't changed. Use `--force` to bypass this check.
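The change-detection logic can be sketched as a fingerprint comparison against the stored state. The fingerprint input and JSON layout here are assumptions, not the script's actual on-disk schema:

```python
import json
import time
from pathlib import Path

def should_update(hostname, fingerprint, state_file, force=False):
    """True when the site's content fingerprint differs from the one
    recorded at the last run, or when --force semantics are requested."""
    if force:
        return True
    state = {}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    return state.get(hostname, {}).get("fingerprint") != fingerprint

def record_update(hostname, fingerprint, state_file):
    """Persist the fingerprint and timestamp for the site's last update."""
    state = {}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    state[hostname] = {"fingerprint": fingerprint, "updated_at": time.time()}
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state, indent=2))
```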
### Check Last Generated Content
```bash
uv run python check_last_gen.py
```
### List All Users (Direct DB Access)
```bash
uv run python scripts/list_users.py
```
### Add Admin (Direct DB Access)
```bash
uv run python scripts/add_admin_direct.py
```
### Check Migration Status
```bash
uv run python scripts/check_migration.py
```
### Add Tier to Projects
```bash
uv run python scripts/add_tier_to_projects.py
```
## Testing
### Run All Tests
```bash
uv run pytest
```
### Run Unit Tests
```bash
uv run pytest tests/unit/ -v
```
### Run Integration Tests
```bash
uv run pytest tests/integration/ -v
```
### Run Specific Test File
```bash
uv run pytest tests/unit/test_url_generator.py -v
```
### Run Story 3.1 Tests
```bash
uv run pytest tests/unit/test_url_generator.py \
tests/unit/test_site_provisioning.py \
tests/unit/test_site_assignment.py \
tests/unit/test_job_config_extensions.py \
tests/integration/test_story_3_1_integration.py \
-v
```
### Run with Coverage
```bash
uv run pytest --cov=src --cov-report=html
```
## System Information
### Show Configuration
```bash
uv run python main.py config
```
### Health Check
```bash
uv run python main.py health
```
### List Available Models
```bash
uv run python main.py models
```
## Directory Structure
```
Big-Link-Man/
├── main.py # CLI entry point
├── src/ # Source code
│ ├── api/ # FastAPI endpoints
│ ├── auth/ # Authentication
│ ├── cli/ # CLI commands
│ ├── core/ # Configuration
│ ├── database/ # Models, repositories
│ ├── deployment/ # Cloud deployment
│ ├── generation/ # Content generation
│ ├── ingestion/ # CORA parsing
│ ├── interlinking/ # Link injection
│ └── templating/ # HTML templates
├── scripts/ # Database & utility scripts
├── tests/ # Test suite
│ ├── unit/
│ └── integration/
├── jobs/ # Job configuration files
├── docs/ # Documentation
└── deployment_logs/ # Deployed URL logs
```
## Job Configuration Format
Example job config (`jobs/example.json`):
```json
{
"job_name": "Multi-Tier Launch",
"project_id": 1,
"description": "Site build with 165 articles",
"models": {
"title": "openai/gpt-4o-mini",
"outline": "anthropic/claude-3.5-sonnet",
"content": "anthropic/claude-3.5-sonnet"
},
"tiers": [
{
"tier": 1,
"article_count": 15,
"validation_attempts": 3
},
{
"tier": 2,
"article_count": 50,
"validation_attempts": 2
}
],
"failure_config": {
"max_consecutive_failures": 10,
"skip_on_failure": true
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4,
"include_home_link": true
},
"deployment_targets": ["www.primary.com"],
"tier1_preferred_sites": ["www.premium.com"],
"auto_create_sites": true
}
```
### Per-Stage Model Configuration
You can specify different AI models for each generation stage (title, outline, content):
```json
{
"models": {
"title": "openai/gpt-4o-mini",
"outline": "anthropic/claude-3.5-sonnet",
"content": "openai/gpt-4o"
}
}
```
**Available models:**
- `openai/gpt-4o-mini` - Fast and cost-effective
- `openai/gpt-4o` - Higher quality, more expensive
- `anthropic/claude-3.5-sonnet` - Excellent for long-form content
If `models` is not specified in the job file, all stages use the model from the `--model` CLI flag (default: `gpt-4o-mini`).
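The precedence rules above can be sketched as a small resolver; the function name and config shape are illustrative, not the platform's actual internals:

```python
DEFAULT_MODEL = "gpt-4o-mini"
STAGES = ("title", "outline", "content")

def resolve_models(job_config, cli_model=None):
    """Per the rules above: a 'models' block in the job file wins over
    the --model flag; otherwise every stage uses the flag or default."""
    models = job_config.get("models")
    if models:
        return {stage: models[stage] for stage in STAGES}
    fallback = cli_model or DEFAULT_MODEL
    return {stage: fallback for stage in STAGES}
```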
## Common Workflows
### Initial Setup
```bash
uv pip install -r requirements.txt
cp env.example .env
# Edit .env
uv run python scripts/init_db.py
uv run python scripts/create_first_admin.py
uv run python main.py sync-sites --admin-user admin
```
### New Project Workflow
```bash
# 1. Ingest CORA report
uv run python main.py ingest-cora \
--file project.xlsx \
--name "My Project" \
--username admin \
--password adminpass
# 2. Create job config
uv run python create_job_config.py 1 multi 15 50 100
# 3. Generate content (auto-deploys)
uv run python main.py generate-batch \
--job-file jobs/project_1_multi_3tiers_165articles.json \
--username admin \
--password adminpass
# 4. Verify deployment
uv run python main.py verify-deployment --batch-id 1
# 5. Export URLs for link building
uv run python main.py get-links \
--project-id 1 \
--tier 1 > tier1_urls.csv
```
### Re-deploy After Changes
```bash
uv run python main.py deploy-batch \
--batch-id 123 \
--admin-user admin \
--admin-password adminpass
```
## Troubleshooting
### Database locked
```bash
# Stop any running processes, then:
uv run python scripts/init_db.py reset
```
### Missing dependencies
```bash
uv pip install -r requirements.txt --force-reinstall
```
### AI API errors
Check `OPENROUTER_API_KEY` in `.env`
### Bunny.net authentication failed
Check `BUNNY_ACCOUNT_API_KEY` in `.env`
### Storage upload failed
Verify the `storage_zone_password` stored in the database (set during site provisioning)
## Documentation
- **CLI Command Reference**: `docs/CLI_COMMAND_REFERENCE.md` - Comprehensive documentation for all CLI commands
- **Job Configuration Schema**: `docs/job-schema.md` - Complete reference for job configuration files
- **Product Requirements**: `docs/prd.md` - Product requirements and epics
- **Architecture**: `docs/architecture/` - System architecture documentation
- **Story Specifications**: `docs/stories/` - Current story specifications
- **Technical Debt**: `docs/technical-debt.md` - Known technical debt items
- **Historical Documentation**: `docs/archive/` - Archived implementation summaries, QA reports, and analysis documents
### Regenerating CLI Documentation
To regenerate the CLI command reference after adding or modifying commands:
```bash
uv run python scripts/generate_cli_docs.py
```
This will update `docs/CLI_COMMAND_REFERENCE.md` with all current commands and their options.
## License
All rights reserved.