Big-Link-Man/docs/stories/story-2.3-ai-content-genera...

# Story 2.3: AI-Powered Content Generation - COMPLETED

## Overview
Implemented a comprehensive AI-powered content generation system with three-stage pipeline (title → outline → content), validation at each stage, programmatic augmentation for CORA compliance, and batch job processing across multiple tiers.

## Status
**COMPLETED**

## Story Details
**As a User**, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.

## Acceptance Criteria - ALL MET

### 1. Script Initiation for Projects
**Status:** COMPLETE

- CLI command: `generate-batch --job-file <path>`
- Supports batch processing across multiple tiers
- Job configuration via JSON files
- Progress tracking and error reporting

### 2. AI-Powered Generation Using SEO Data
**Status:** COMPLETE

- Title generation with keyword validation
- Outline generation meeting CORA H2/H3 targets
- Full HTML content generation
- Uses project's SEO data (keywords, entities, related searches)
- Multiple AI models supported via OpenRouter

### 3. Content Rule Engine Validation
**Status:** COMPLETE

- Validates at each stage (title, outline, content)
- Uses ContentRuleEngine from Story 2.2
- Tier-aware validation (strict for Tier 1)
- Detailed error reporting

### 4. Database Storage
**Status:** COMPLETE

- Title, outline, and content stored in GeneratedContent table
- Version tracking and metadata
- Tracks attempts, models used, validation results
- Augmentation logs

### 5. Progress Logging
**Status:** COMPLETE

- Real-time progress updates via CLI
- Logs: "Generating title...", "Generating content...", etc.
- Tracks successful, failed, and skipped articles
- Detailed summary reports

### 6. AI Service Error Handling
**Status:** COMPLETE

- Graceful handling of API errors
- Retry logic with configurable attempts
- Fallback to programmatic augmentation
- Continue or stop on failures (configurable)

## Implementation Details

### Architecture Components

#### 1. Database Models (`src/database/models.py`)

**GeneratedContent Model:**
```python
class GeneratedContent(Base):
    id, project_id, tier
    title, outline, content
    status, is_active
    generation_stage
    title_attempts, outline_attempts, content_attempts
    title_model, outline_model, content_model
    validation_errors, validation_warnings
    validation_report (JSON)
    word_count, augmented
    augmentation_log (JSON)
    generation_duration
    error_message
    created_at, updated_at
```

#### 2. AI Client (`src/generation/ai_client.py`)

**Features:**
- OpenRouter API integration
- Multiple model support
- JSON-formatted responses
- Error handling and retries
- Model validation

**Available Models:**
- Claude 3.5 Sonnet (default)
- Claude 3 Haiku
- GPT-4o / GPT-4o-mini
- Llama 3.1 70B/8B
- Gemini Pro 1.5

#### 3. Job Configuration (`src/generation/job_config.py`)

**Job Structure:**
```json
{
  "job_name": "Batch Name",
  "project_id": 1,
  "tiers": [
    {
      "tier": 1,
      "article_count": 15,
      "models": {
        "title": "model-id",
        "outline": "model-id",
        "content": "model-id"
      },
      "anchor_text_config": {
        "mode": "default|override|append"
      },
      "validation_attempts": 3
    }
  ],
  "failure_config": {
    "max_consecutive_failures": 5,
    "skip_on_failure": true
  }
}
```

#### 4. Three-Stage Generation Pipeline (`src/generation/service.py`)

**Stage 1: Title Generation**
- Uses title_generation.json prompt
- Validates keyword presence and length
- Retries on validation failure
- Max attempts configurable

**Stage 2: Outline Generation**
- Uses outline_generation.json prompt
- Returns JSON structure with H1, H2s, H3s
- Validates CORA targets (H2/H3 counts, keyword distribution)
- AI retry → Programmatic augmentation if needed
- Ensures FAQ section present

**Stage 3: Content Generation**
- Uses content_generation.json prompt
- Follows validated outline structure
- Generates full HTML (no CSS, just semantic markup)
- Validates against all CORA rules
- AI retry → Augmentation if needed

#### 5. Stage Validation (`src/generation/validator.py`)

**Title Validation:**
- Length (30-100 chars)
- Keyword presence
- Non-empty

**Outline Validation:**
- H1 contains keyword
- H2/H3 counts meet targets
- Keyword distribution in headings
- Entity and related search incorporation
- FAQ section present
- Tier-aware strictness

**Content Validation:**
- Full CORA rule validation
- Word count (min/max)
- Keyword frequency
- Heading structure
- FAQ format
- Image alt text (when applicable)

#### 6. Content Augmentation (`src/generation/augmenter.py`)

**Outline Augmentation:**
- Add missing H2s with keywords
- Add H3s with entities
- Modify existing headings
- Maintain logical flow

**Content Augmentation:**
- Strategy 1: Ask AI to add paragraphs (small deficits)
- Strategy 2: Programmatically insert terms (large deficits)
- Insert keywords into random sentences
- Capitalize if sentence-initial
- Add complete paragraphs with missing elements

#### 7. Batch Processor (`src/generation/batch_processor.py`)

**Features:**
- Process multiple tiers sequentially
- Track progress per tier
- Handle failures (skip or stop)
- Consecutive failure threshold
- Real-time progress callbacks
- Detailed result reporting

#### 8. Prompt Templates (`src/generation/prompts/`)

**Files:**
- `title_generation.json` - Title prompts
- `outline_generation.json` - Outline structure prompts
- `content_generation.json` - Full content prompts
- `outline_augmentation.json` - Outline fix prompts
- `content_augmentation.json` - Content enhancement prompts

**Format:**
```json
{
  "system": "System message",
  "user_template": "Prompt with {placeholders}",
  "validation": {
    "output_format": "text|json|html",
    "requirements": []
  }
}
```

### CLI Command

```bash
python main.py generate-batch \
  --job-file jobs/example_tier1_batch.json \
  --username admin \
  --password password
```

**Options:**
- `--job-file, -j`: Path to job configuration JSON (required)
- `--force-regenerate, -f`: Force regeneration (flag, not implemented)
- `--username, -u`: Authentication username
- `--password, -p`: Authentication password

**Example Output:**
```
Authenticated as: admin (Admin)

Loading Job: Tier 1 Launch Batch
Project ID: 1
Total Articles: 15

Tiers:
  Tier 1: 15 articles
    Models: gpt-4o-mini / claude-3.5-sonnet / claude-3.5-sonnet

Proceed with generation? [y/N]: y

Starting batch generation...
--------------------------------------------------------------------------------
[Tier 1] Article 1/15: Generating...
[Tier 1] Article 1/15: Completed (ID: 1)
[Tier 1] Article 2/15: Generating...
...
--------------------------------------------------------------------------------

Batch Generation Complete!
Job: Tier 1 Launch Batch
Project ID: 1
Duration: 1234.56s

Results:
  Total Articles: 15
  Successful: 14
  Failed: 0
  Skipped: 1

By Tier:
  Tier 1:
    Successful: 14
    Failed: 0
    Skipped: 1
```

### Example Job Files

Located in `jobs/` directory:
- `example_tier1_batch.json` - 15 tier 1 articles
- `example_multi_tier_batch.json` - 165 articles across 3 tiers
- `example_custom_anchors.json` - Custom anchor text demo
- `README.md` - Job configuration guide

### Test Coverage

**Unit Tests (30+ tests):**
- `test_generation_service.py` - Pipeline stages
- `test_augmenter.py` - Content augmentation
- `test_job_config.py` - Job configuration validation

**Integration Tests:**
- `test_content_generation.py` - Full pipeline with mocked AI
- Repository CRUD operations
- Service initialization
- Job validation

### Database Schema

**New Table: generated_content**
```sql
CREATE TABLE generated_content (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    tier INTEGER,
    title TEXT,
    outline TEXT,
    content TEXT,
    status VARCHAR(20) DEFAULT 'pending',
    is_active BOOLEAN DEFAULT 0,
    generation_stage VARCHAR(20) DEFAULT 'title',
    title_attempts INTEGER DEFAULT 0,
    outline_attempts INTEGER DEFAULT 0,
    content_attempts INTEGER DEFAULT 0,
    title_model VARCHAR(100),
    outline_model VARCHAR(100),
    content_model VARCHAR(100),
    validation_errors INTEGER DEFAULT 0,
    validation_warnings INTEGER DEFAULT 0,
    validation_report JSON,
    word_count INTEGER,
    augmented BOOLEAN DEFAULT 0,
    augmentation_log JSON,
    generation_duration FLOAT,
    error_message TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE INDEX idx_generated_content_project_id ON generated_content(project_id);
CREATE INDEX idx_generated_content_tier ON generated_content(tier);
CREATE INDEX idx_generated_content_status ON generated_content(status);
```

### Dependencies Added

- `beautifulsoup4==4.12.2` - HTML parsing for augmentation

All other dependencies already present (OpenAI SDK for OpenRouter).

### Configuration

**Environment Variables:**
```bash
AI_API_KEY=sk-or-v1-your-openrouter-key
AI_API_BASE_URL=https://openrouter.ai/api/v1  # Optional
AI_MODEL=anthropic/claude-3.5-sonnet  # Optional
```

**master.config.json:**
Already configured in Story 2.2 with:
- `ai_service` section
- `content_rules` for validation
- Available models list

## Design Decisions

### Why Three Separate Stages?

1. **Title First**: Validates keyword usage early, informs outline
2. **Outline Next**: Ensures structure before expensive content generation
3. **Content Last**: Follows validated structure, reduces failures

Better success rate than single-prompt approach.

### Why Programmatic Augmentation?

- AI is unreliable at precise keyword placement
- Validation failures are common with strict CORA targets
- Hybrid approach: AI for quality, programmatic for precision
- Saves API costs (no endless retries)

### Why Separate GeneratedContent Table?

- Version history preserved
- Can rollback to previous generation
- Track attempts and augmentation
- Rich metadata for debugging
- A/B testing capability

### Why Job Configuration Files?

- Reusable batch configurations
- Version control job definitions
- Easy to share and modify
- Future: Auto-process job folder
- Clear audit trail

### Why Tier-Aware Validation?

- Tier 1: Strictest (all CORA targets mandatory)
- Tier 2+: Warnings only (more lenient)
- Matches real-world content quality needs
- Saves costs on bulk tier 2+ content

## Known Limitations

1. **No Interlinking Yet**: Links added in Epic 3 (Story 3.3)
2. **No CSS/Templates**: Added in Story 2.4
3. **Sequential Processing**: No parallel generation (future enhancement)
4. **Force-Regenerate Flag**: Not yet implemented
5. **No Image Generation**: Placeholder for future
6. **Single Project per Job**: Can't mix projects in one batch

## Next Steps

**Story 2.4: HTML Formatting with Multiple Templates**
- Wrap generated content in full HTML documents
- Apply CSS templates
- Map templates to deployment targets
- Add meta tags and SEO elements

**Epic 3: Pre-Deployment & Interlinking**
- Generate final URLs
- Inject interlinks (wheel structure)
- Add home page links
- Random existing article links

## Technical Debt Added

Items added to `technical-debt.md`:
1. A/B test different prompt templates
2. Prompt optimization comparison tool
3. Parallel article generation
4. Job folder auto-processing
5. Cost tracking per generation
6. Model performance analytics

## Files Created/Modified

### New Files:
- `src/database/models.py` - Added GeneratedContent model
- `src/database/interfaces.py` - Added IGeneratedContentRepository
- `src/database/repositories.py` - Added GeneratedContentRepository
- `src/generation/ai_client.py` - OpenRouter AI client
- `src/generation/service.py` - Content generation service
- `src/generation/validator.py` - Stage validation
- `src/generation/augmenter.py` - Content augmentation
- `src/generation/job_config.py` - Job configuration schema
- `src/generation/batch_processor.py` - Batch job processor
- `src/generation/prompts/title_generation.json`
- `src/generation/prompts/outline_generation.json`
- `src/generation/prompts/content_generation.json`
- `src/generation/prompts/outline_augmentation.json`
- `src/generation/prompts/content_augmentation.json`
- `tests/unit/test_generation_service.py`
- `tests/unit/test_augmenter.py`
- `tests/unit/test_job_config.py`
- `tests/integration/test_content_generation.py`
- `jobs/example_tier1_batch.json`
- `jobs/example_multi_tier_batch.json`
- `jobs/example_custom_anchors.json`
- `jobs/README.md`
- `docs/stories/story-2.3-ai-content-generation.md`

### Modified Files:
- `src/cli/commands.py` - Added generate-batch command
- `requirements.txt` - Added beautifulsoup4
- `docs/technical-debt.md` - Added new items

## Manual Testing

### Prerequisites:
1. Set AI_API_KEY in `.env`
2. Initialize database: `python scripts/init_db.py reset`
3. Create admin user: `python scripts/create_first_admin.py`
4. Ingest CORA file: `python main.py ingest-cora --file <path> --name "Test" -u admin -p pass`

### Test Commands:

```bash
# Test single tier batch
python main.py generate-batch -j jobs/example_tier1_batch.json -u admin -p password

# Test multi-tier batch
python main.py generate-batch -j jobs/example_multi_tier_batch.json -u admin -p password

# Test custom anchors
python main.py generate-batch -j jobs/example_custom_anchors.json -u admin -p password
```

### Validation:

```sql
-- Check generated content
SELECT id, project_id, tier, status, generation_stage,
       title_attempts, outline_attempts, content_attempts,
       validation_errors, validation_warnings
FROM generated_content;

-- Check active content
SELECT id, project_id, tier, is_active, word_count, augmented
FROM generated_content
WHERE is_active = 1;
```

## Performance Notes

- Title generation: ~2-5 seconds
- Outline generation: ~5-10 seconds
- Content generation: ~20-60 seconds
- Total per article: ~30-75 seconds
- Batch of 15 (Tier 1): ~10-20 minutes

Varies by model and complexity.

## Completion Checklist

- [x] GeneratedContent database model
- [x] GeneratedContentRepository
- [x] AI client service
- [x] Prompt templates
- [x] ContentGenerationService (3-stage pipeline)
- [x] ContentAugmenter
- [x] Stage validation
- [x] Batch processor
- [x] Job configuration schema
- [x] CLI command
- [x] Example job files
- [x] Unit tests (30+ tests)
- [x] Integration tests
- [x] Documentation
- [x] Database initialization support

## Notes

- OpenRouter provides unified API for multiple models
- JSON prompt format preferred by user for better consistency
- Augmentation essential for CORA compliance
- Batch processing architecture scales well
- Version tracking enables rollback and comparison
- Tier system balances quality vs cost