Big-Link-Man/docs/stories/story-2.3-ai-content-genera...

14 KiB

Story 2.3: AI-Powered Content Generation - COMPLETED

Overview

Implemented a comprehensive AI-powered content generation system with three-stage pipeline (title → outline → content), validation at each stage, programmatic augmentation for CORA compliance, and batch job processing across multiple tiers.

Status

COMPLETED

Story Details

As a User, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.

Acceptance Criteria - ALL MET

1. Script Initiation for Projects

Status: COMPLETE

  • CLI command: generate-batch --job-file <path>
  • Supports batch processing across multiple tiers
  • Job configuration via JSON files
  • Progress tracking and error reporting

2. AI-Powered Generation Using SEO Data

Status: COMPLETE

  • Title generation with keyword validation
  • Outline generation meeting CORA H2/H3 targets
  • Full HTML content generation
  • Uses project's SEO data (keywords, entities, related searches)
  • Multiple AI models supported via OpenRouter

3. Content Rule Engine Validation

Status: COMPLETE

  • Validates at each stage (title, outline, content)
  • Uses ContentRuleEngine from Story 2.2
  • Tier-aware validation (strict for Tier 1)
  • Detailed error reporting

4. Database Storage

Status: COMPLETE

  • Title, outline, and content stored in GeneratedContent table
  • Version tracking and metadata
  • Tracks attempts, models used, validation results
  • Augmentation logs

5. Progress Logging

Status: COMPLETE

  • Real-time progress updates via CLI
  • Logs: "Generating title...", "Generating content...", etc.
  • Tracks successful, failed, and skipped articles
  • Detailed summary reports

6. AI Service Error Handling

Status: COMPLETE

  • Graceful handling of API errors
  • Retry logic with configurable attempts
  • Fallback to programmatic augmentation
  • Continue or stop on failures (configurable)

Implementation Details

Architecture Components

1. Database Models (src/database/models.py)

GeneratedContent Model:

class GeneratedContent(Base):
    id, project_id, tier
    title, outline, content
    status, is_active
    generation_stage
    title_attempts, outline_attempts, content_attempts
    title_model, outline_model, content_model
    validation_errors, validation_warnings
    validation_report (JSON)
    word_count, augmented
    augmentation_log (JSON)
    generation_duration
    error_message
    created_at, updated_at

2. AI Client (src/generation/ai_client.py)

Features:

  • OpenRouter API integration
  • Multiple model support
  • JSON-formatted responses
  • Error handling and retries
  • Model validation

Available Models:

  • Claude 3.5 Sonnet (default)
  • Claude 3 Haiku
  • GPT-4o / GPT-4o-mini
  • Llama 3.1 70B/8B
  • Gemini Pro 1.5

3. Job Configuration (src/generation/job_config.py)

Job Structure:

{
  "job_name": "Batch Name",
  "project_id": 1,
  "tiers": [
    {
      "tier": 1,
      "article_count": 15,
      "models": {
        "title": "model-id",
        "outline": "model-id",
        "content": "model-id"
      },
      "anchor_text_config": {
        "mode": "default|override|append"
      },
      "validation_attempts": 3
    }
  ],
  "failure_config": {
    "max_consecutive_failures": 5,
    "skip_on_failure": true
  }
}

4. Three-Stage Generation Pipeline (src/generation/service.py)

Stage 1: Title Generation

  • Uses title_generation.json prompt
  • Validates keyword presence and length
  • Retries on validation failure
  • Max attempts configurable

Stage 2: Outline Generation

  • Uses outline_generation.json prompt
  • Returns JSON structure with H1, H2s, H3s
  • Validates CORA targets (H2/H3 counts, keyword distribution)
  • AI retry → Programmatic augmentation if needed
  • Ensures FAQ section present

Stage 3: Content Generation

  • Uses content_generation.json prompt
  • Follows validated outline structure
  • Generates full HTML (no CSS, just semantic markup)
  • Validates against all CORA rules
  • AI retry → Augmentation if needed

5. Stage Validation (src/generation/validator.py)

Title Validation:

  • Length (30-100 chars)
  • Keyword presence
  • Non-empty

Outline Validation:

  • H1 contains keyword
  • H2/H3 counts meet targets
  • Keyword distribution in headings
  • Entity and related search incorporation
  • FAQ section present
  • Tier-aware strictness

Content Validation:

  • Full CORA rule validation
  • Word count (min/max)
  • Keyword frequency
  • Heading structure
  • FAQ format
  • Image alt text (when applicable)

6. Content Augmentation (src/generation/augmenter.py)

Outline Augmentation:

  • Add missing H2s with keywords
  • Add H3s with entities
  • Modify existing headings
  • Maintain logical flow

Content Augmentation:

  • Strategy 1: Ask AI to add paragraphs (small deficits)
  • Strategy 2: Programmatically insert terms (large deficits)
  • Insert keywords into random sentences
  • Capitalize if sentence-initial
  • Add complete paragraphs with missing elements

7. Batch Processor (src/generation/batch_processor.py)

Features:

  • Process multiple tiers sequentially
  • Track progress per tier
  • Handle failures (skip or stop)
  • Consecutive failure threshold
  • Real-time progress callbacks
  • Detailed result reporting

8. Prompt Templates (src/generation/prompts/)

Files:

  • title_generation.json - Title prompts
  • outline_generation.json - Outline structure prompts
  • content_generation.json - Full content prompts
  • outline_augmentation.json - Outline fix prompts
  • content_augmentation.json - Content enhancement prompts

Format:

{
  "system": "System message",
  "user_template": "Prompt with {placeholders}",
  "validation": {
    "output_format": "text|json|html",
    "requirements": []
  }
}

CLI Command

python main.py generate-batch \
  --job-file jobs/example_tier1_batch.json \
  --username admin \
  --password password

Options:

  • --job-file, -j: Path to job configuration JSON (required)
  • --force-regenerate, -f: Force regeneration (flag, not implemented)
  • --username, -u: Authentication username
  • --password, -p: Authentication password

Example Output:

Authenticated as: admin (Admin)

Loading Job: Tier 1 Launch Batch
Project ID: 1
Total Articles: 15

Tiers:
  Tier 1: 15 articles
    Models: gpt-4o-mini / claude-3.5-sonnet / claude-3.5-sonnet

Proceed with generation? [y/N]: y

Starting batch generation...
--------------------------------------------------------------------------------
[Tier 1] Article 1/15: Generating...
[Tier 1] Article 1/15: Completed (ID: 1)
[Tier 1] Article 2/15: Generating...
...
--------------------------------------------------------------------------------

Batch Generation Complete!
Job: Tier 1 Launch Batch
Project ID: 1
Duration: 1234.56s

Results:
  Total Articles: 15
  Successful: 14
  Failed: 0
  Skipped: 1

By Tier:
  Tier 1:
    Successful: 14
    Failed: 0
    Skipped: 1

Example Job Files

Located in jobs/ directory:

  • example_tier1_batch.json - 15 tier 1 articles
  • example_multi_tier_batch.json - 165 articles across 3 tiers
  • example_custom_anchors.json - Custom anchor text demo
  • README.md - Job configuration guide

Test Coverage

Unit Tests (30+ tests):

  • test_generation_service.py - Pipeline stages
  • test_augmenter.py - Content augmentation
  • test_job_config.py - Job configuration validation

Integration Tests:

  • test_content_generation.py - Full pipeline with mocked AI
  • Repository CRUD operations
  • Service initialization
  • Job validation

Database Schema

New Table: generated_content

CREATE TABLE generated_content (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    tier INTEGER,
    title TEXT,
    outline TEXT,
    content TEXT,
    status VARCHAR(20) DEFAULT 'pending',
    is_active BOOLEAN DEFAULT 0,
    generation_stage VARCHAR(20) DEFAULT 'title',
    title_attempts INTEGER DEFAULT 0,
    outline_attempts INTEGER DEFAULT 0,
    content_attempts INTEGER DEFAULT 0,
    title_model VARCHAR(100),
    outline_model VARCHAR(100),
    content_model VARCHAR(100),
    validation_errors INTEGER DEFAULT 0,
    validation_warnings INTEGER DEFAULT 0,
    validation_report JSON,
    word_count INTEGER,
    augmented BOOLEAN DEFAULT 0,
    augmentation_log JSON,
    generation_duration FLOAT,
    error_message TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE INDEX idx_generated_content_project_id ON generated_content(project_id);
CREATE INDEX idx_generated_content_tier ON generated_content(tier);
CREATE INDEX idx_generated_content_status ON generated_content(status);

Dependencies Added

  • beautifulsoup4==4.12.2 - HTML parsing for augmentation

All other dependencies already present (OpenAI SDK for OpenRouter).

Configuration

Environment Variables:

AI_API_KEY=sk-or-v1-your-openrouter-key
AI_API_BASE_URL=https://openrouter.ai/api/v1  # Optional
AI_MODEL=anthropic/claude-3.5-sonnet  # Optional

master.config.json: Already configured in Story 2.2 with:

  • ai_service section
  • content_rules for validation
  • Available models list

Design Decisions

Why Three Separate Stages?

  1. Title First: Validates keyword usage early, informs outline
  2. Outline Next: Ensures structure before expensive content generation
  3. Content Last: Follows validated structure, reduces failures

Better success rate than single-prompt approach.

Why Programmatic Augmentation?

  • AI is unreliable at precise keyword placement
  • Validation failures are common with strict CORA targets
  • Hybrid approach: AI for quality, programmatic for precision
  • Saves API costs (no endless retries)

Why Separate GeneratedContent Table?

  • Version history preserved
  • Can rollback to previous generation
  • Track attempts and augmentation
  • Rich metadata for debugging
  • A/B testing capability

Why Job Configuration Files?

  • Reusable batch configurations
  • Version control job definitions
  • Easy to share and modify
  • Future: Auto-process job folder
  • Clear audit trail

Why Tier-Aware Validation?

  • Tier 1: Strictest (all CORA targets mandatory)
  • Tier 2+: Warnings only (more lenient)
  • Matches real-world content quality needs
  • Saves costs on bulk tier 2+ content

Known Limitations

  1. No Interlinking Yet: Links added in Epic 3 (Story 3.3)
  2. No CSS/Templates: Added in Story 2.4
  3. Sequential Processing: No parallel generation (future enhancement)
  4. Force-Regenerate Flag: Not yet implemented
  5. No Image Generation: Placeholder for future
  6. Single Project per Job: Can't mix projects in one batch

Next Steps

Story 2.4: HTML Formatting with Multiple Templates

  • Wrap generated content in full HTML documents
  • Apply CSS templates
  • Map templates to deployment targets
  • Add meta tags and SEO elements

Epic 3: Pre-Deployment & Interlinking

  • Generate final URLs
  • Inject interlinks (wheel structure)
  • Add home page links
  • Random existing article links

Technical Debt Added

Items added to technical-debt.md:

  1. A/B test different prompt templates
  2. Prompt optimization comparison tool
  3. Parallel article generation
  4. Job folder auto-processing
  5. Cost tracking per generation
  6. Model performance analytics

Files Created/Modified

New Files:

  • src/database/models.py - Added GeneratedContent model
  • src/database/interfaces.py - Added IGeneratedContentRepository
  • src/database/repositories.py - Added GeneratedContentRepository
  • src/generation/ai_client.py - OpenRouter AI client
  • src/generation/service.py - Content generation service
  • src/generation/validator.py - Stage validation
  • src/generation/augmenter.py - Content augmentation
  • src/generation/job_config.py - Job configuration schema
  • src/generation/batch_processor.py - Batch job processor
  • src/generation/prompts/title_generation.json
  • src/generation/prompts/outline_generation.json
  • src/generation/prompts/content_generation.json
  • src/generation/prompts/outline_augmentation.json
  • src/generation/prompts/content_augmentation.json
  • tests/unit/test_generation_service.py
  • tests/unit/test_augmenter.py
  • tests/unit/test_job_config.py
  • tests/integration/test_content_generation.py
  • jobs/example_tier1_batch.json
  • jobs/example_multi_tier_batch.json
  • jobs/example_custom_anchors.json
  • jobs/README.md
  • docs/stories/story-2.3-ai-content-generation.md

Modified Files:

  • src/cli/commands.py - Added generate-batch command
  • requirements.txt - Added beautifulsoup4
  • docs/technical-debt.md - Added new items

Manual Testing

Prerequisites:

  1. Set AI_API_KEY in .env
  2. Initialize database: python scripts/init_db.py reset
  3. Create admin user: python scripts/create_first_admin.py
  4. Ingest CORA file: python main.py ingest-cora --file <path> --name "Test" -u admin -p pass

Test Commands:

# Test single tier batch
python main.py generate-batch -j jobs/example_tier1_batch.json -u admin -p password

# Test multi-tier batch  
python main.py generate-batch -j jobs/example_multi_tier_batch.json -u admin -p password

# Test custom anchors
python main.py generate-batch -j jobs/example_custom_anchors.json -u admin -p password

Validation:

-- Check generated content
SELECT id, project_id, tier, status, generation_stage, 
       title_attempts, outline_attempts, content_attempts,
       validation_errors, validation_warnings
FROM generated_content;

-- Check active content
SELECT id, project_id, tier, is_active, word_count, augmented
FROM generated_content
WHERE is_active = 1;

Performance Notes

  • Title generation: ~2-5 seconds
  • Outline generation: ~5-10 seconds
  • Content generation: ~20-60 seconds
  • Total per article: ~30-75 seconds
  • Batch of 15 (Tier 1): ~10-20 minutes

Varies by model and complexity.

Completion Checklist

  • GeneratedContent database model
  • GeneratedContentRepository
  • AI client service
  • Prompt templates
  • ContentGenerationService (3-stage pipeline)
  • ContentAugmenter
  • Stage validation
  • Batch processor
  • Job configuration schema
  • CLI command
  • Example job files
  • Unit tests (30+ tests)
  • Integration tests
  • Documentation
  • Database initialization support

Notes

  • OpenRouter provides unified API for multiple models
  • JSON prompt format preferred by user for better consistency
  • Augmentation essential for CORA compliance
  • Batch processing architecture scales well
  • Version tracking enables rollback and comparison
  • Tier system balances quality vs cost