# Story 2.3: AI-Powered Content Generation - COMPLETED

## Overview

Implemented a comprehensive AI-powered content generation system with three-stage pipeline (title → outline → content), validation at each stage, programmatic augmentation for CORA compliance, and batch job processing across multiple tiers.

## Status

**COMPLETED**

## Story Details

**As a User**, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.

## Acceptance Criteria - ALL MET

### 1. Script Initiation for Projects

**Status:** COMPLETE

- CLI command: `generate-batch --job-file`
- Supports batch processing across multiple tiers
- Job configuration via JSON files
- Progress tracking and error reporting

### 2. AI-Powered Generation Using SEO Data

**Status:** COMPLETE

- Title generation with keyword validation
- Outline generation meeting CORA H2/H3 targets
- Full HTML content generation
- Uses project's SEO data (keywords, entities, related searches)
- Multiple AI models supported via OpenRouter

### 3. Content Rule Engine Validation

**Status:** COMPLETE

- Validates at each stage (title, outline, content)
- Uses ContentRuleEngine from Story 2.2
- Tier-aware validation (strict for Tier 1)
- Detailed error reporting

### 4. Database Storage

**Status:** COMPLETE

- Title, outline, and content stored in GeneratedContent table
- Version tracking and metadata
- Tracks attempts, models used, validation results
- Augmentation logs

### 5. Progress Logging

**Status:** COMPLETE

- Real-time progress updates via CLI
- Logs: "Generating title...", "Generating content...", etc.
- Tracks successful, failed, and skipped articles
- Detailed summary reports

### 6. AI Service Error Handling

**Status:** COMPLETE

- Graceful handling of API errors
- Retry logic with configurable attempts
- Fallback to programmatic augmentation
- Continue or stop on failures (configurable)

## Implementation Details

### Architecture Components

#### 1. Database Models (`src/database/models.py`)

**GeneratedContent Model:**

```python
class GeneratedContent(Base):
    id, project_id, tier
    title, outline, content
    status, is_active
    generation_stage
    title_attempts, outline_attempts, content_attempts
    title_model, outline_model, content_model
    validation_errors, validation_warnings
    validation_report (JSON)
    word_count, augmented
    augmentation_log (JSON)
    generation_duration
    error_message
    created_at, updated_at
```

#### 2. AI Client (`src/generation/ai_client.py`)

**Features:**

- OpenRouter API integration
- Multiple model support
- JSON-formatted responses
- Error handling and retries
- Model validation

**Available Models:**

- Claude 3.5 Sonnet (default)
- Claude 3 Haiku
- GPT-4o / GPT-4o-mini
- Llama 3.1 70B/8B
- Gemini Pro 1.5

#### 3. Job Configuration (`src/generation/job_config.py`)

**Job Structure:**

```json
{
  "job_name": "Batch Name",
  "project_id": 1,
  "tiers": [
    {
      "tier": 1,
      "article_count": 15,
      "models": {
        "title": "model-id",
        "outline": "model-id",
        "content": "model-id"
      },
      "anchor_text_config": {
        "mode": "default|override|append"
      },
      "validation_attempts": 3
    }
  ],
  "failure_config": {
    "max_consecutive_failures": 5,
    "skip_on_failure": true
  }
}
```

#### 4. Three-Stage Generation Pipeline (`src/generation/service.py`)

**Stage 1: Title Generation**

- Uses title_generation.json prompt
- Validates keyword presence and length
- Retries on validation failure
- Max attempts configurable

**Stage 2: Outline Generation**

- Uses outline_generation.json prompt
- Returns JSON structure with H1, H2s, H3s
- Validates CORA targets (H2/H3 counts, keyword distribution)
- AI retry → Programmatic augmentation if needed
- Ensures FAQ section present

**Stage 3: Content Generation**

- Uses content_generation.json prompt
- Follows validated outline structure
- Generates full HTML (no CSS, just semantic markup)
- Validates against all CORA rules
- AI retry → Augmentation if needed

#### 5. Stage Validation (`src/generation/validator.py`)

**Title Validation:**

- Length (30-100 chars)
- Keyword presence
- Non-empty

**Outline Validation:**

- H1 contains keyword
- H2/H3 counts meet targets
- Keyword distribution in headings
- Entity and related search incorporation
- FAQ section present
- Tier-aware strictness

**Content Validation:**

- Full CORA rule validation
- Word count (min/max)
- Keyword frequency
- Heading structure
- FAQ format
- Image alt text (when applicable)

#### 6. Content Augmentation (`src/generation/augmenter.py`)

**Outline Augmentation:**

- Add missing H2s with keywords
- Add H3s with entities
- Modify existing headings
- Maintain logical flow

**Content Augmentation:**

- Strategy 1: Ask AI to add paragraphs (small deficits)
- Strategy 2: Programmatically insert terms (large deficits)
- Insert keywords into random sentences
- Capitalize if sentence-initial
- Add complete paragraphs with missing elements

#### 7. Batch Processor (`src/generation/batch_processor.py`)

**Features:**

- Process multiple tiers sequentially
- Track progress per tier
- Handle failures (skip or stop)
- Consecutive failure threshold
- Real-time progress callbacks
- Detailed result reporting

#### 8. Prompt Templates (`src/generation/prompts/`)

**Files:**

- `title_generation.json` - Title prompts
- `outline_generation.json` - Outline structure prompts
- `content_generation.json` - Full content prompts
- `outline_augmentation.json` - Outline fix prompts
- `content_augmentation.json` - Content enhancement prompts

**Format:**

```json
{
  "system": "System message",
  "user_template": "Prompt with {placeholders}",
  "validation": {
    "output_format": "text|json|html",
    "requirements": []
  }
}
```

### CLI Command

```bash
python main.py generate-batch \
  --job-file jobs/example_tier1_batch.json \
  --username admin \
  --password password
```

**Options:**

- `--job-file, -j`: Path to job configuration JSON (required)
- `--force-regenerate, -f`: Force regeneration (flag, not implemented)
- `--username, -u`: Authentication username
- `--password, -p`: Authentication password

**Example Output:**

```
Authenticated as: admin (Admin)

Loading Job: Tier 1 Launch Batch
Project ID: 1
Total Articles: 15

Tiers:
  Tier 1: 15 articles
    Models: gpt-4o-mini / claude-3.5-sonnet / claude-3.5-sonnet

Proceed with generation? [y/N]: y

Starting batch generation...
--------------------------------------------------------------------------------
[Tier 1] Article 1/15: Generating...
[Tier 1] Article 1/15: Completed (ID: 1)
[Tier 1] Article 2/15: Generating...
...
--------------------------------------------------------------------------------

Batch Generation Complete!
Job: Tier 1 Launch Batch
Project ID: 1
Duration: 1234.56s

Results:
  Total Articles: 15
  Successful: 14
  Failed: 0
  Skipped: 1

By Tier:
  Tier 1:
    Successful: 14
    Failed: 0
    Skipped: 1
```

### Example Job Files

Located in `jobs/` directory:

- `example_tier1_batch.json` - 15 tier 1 articles
- `example_multi_tier_batch.json` - 165 articles across 3 tiers
- `example_custom_anchors.json` - Custom anchor text demo
- `README.md` - Job configuration guide

### Test Coverage

**Unit Tests (30+ tests):**

- `test_generation_service.py` - Pipeline stages
- `test_augmenter.py` - Content augmentation
- `test_job_config.py` - Job configuration validation

**Integration Tests:**

- `test_content_generation.py` - Full pipeline with mocked AI
- Repository CRUD operations
- Service initialization
- Job validation

### Database Schema

**New Table: generated_content**

```sql
CREATE TABLE generated_content (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    tier INTEGER,
    title TEXT,
    outline TEXT,
    content TEXT,
    status VARCHAR(20) DEFAULT 'pending',
    is_active BOOLEAN DEFAULT 0,
    generation_stage VARCHAR(20) DEFAULT 'title',
    title_attempts INTEGER DEFAULT 0,
    outline_attempts INTEGER DEFAULT 0,
    content_attempts INTEGER DEFAULT 0,
    title_model VARCHAR(100),
    outline_model VARCHAR(100),
    content_model VARCHAR(100),
    validation_errors INTEGER DEFAULT 0,
    validation_warnings INTEGER DEFAULT 0,
    validation_report JSON,
    word_count INTEGER,
    augmented BOOLEAN DEFAULT 0,
    augmentation_log JSON,
    generation_duration FLOAT,
    error_message TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE INDEX idx_generated_content_project_id ON generated_content(project_id);
CREATE INDEX idx_generated_content_tier ON generated_content(tier);
CREATE INDEX idx_generated_content_status ON generated_content(status);
```

### Dependencies Added

- `beautifulsoup4==4.12.2` - HTML parsing for augmentation

All other dependencies already present (OpenAI SDK for OpenRouter).
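
While beautifulsoup4 handles locating text inside the generated HTML, the term insertion described under Content Augmentation ("insert keywords into random sentences", "capitalize if sentence-initial") operates on the extracted text. The following is only a minimal sketch of that idea using the standard library; `insert_term` is a hypothetical helper, and the actual logic in `src/generation/augmenter.py` may differ.

```python
import random
import re


def insert_term(paragraph: str, term: str, rng: random.Random) -> str:
    """Insert `term` before a random word of a random sentence.

    Hypothetical helper for illustration; not the project's real augmenter.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if not sentences:
        # Nothing to insert into: return the capitalized term on its own.
        return term[:1].upper() + term[1:]

    idx = rng.randrange(len(sentences))
    words = sentences[idx].split()
    pos = rng.randrange(len(words))  # 0 means sentence-initial

    if pos == 0:
        # Capitalize only when the term starts the sentence.
        term = term[:1].upper() + term[1:]

    words.insert(pos, term)
    sentences[idx] = " ".join(words)
    return " ".join(sentences)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs give the same result
    print(insert_term("Roof repair is costly. Get a quote first.", "metal roofing", rng))
```

Passing the random generator in as a parameter keeps the helper testable: a seeded `random.Random` makes runs repeatable, which matters when augmentation results are logged for later debugging.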
### Configuration

**Environment Variables:**

```bash
AI_API_KEY=sk-or-v1-your-openrouter-key
AI_API_BASE_URL=https://openrouter.ai/api/v1  # Optional
AI_MODEL=anthropic/claude-3.5-sonnet  # Optional
```

**master.config.json:**

Already configured in Story 2.2 with:

- `ai_service` section
- `content_rules` for validation
- Available models list

## Design Decisions

### Why Three Separate Stages?

1. **Title First**: Validates keyword usage early, informs outline
2. **Outline Next**: Ensures structure before expensive content generation
3. **Content Last**: Follows validated structure, reduces failures

Better success rate than single-prompt approach.

### Why Programmatic Augmentation?

- AI is unreliable at precise keyword placement
- Validation failures are common with strict CORA targets
- Hybrid approach: AI for quality, programmatic for precision
- Saves API costs (no endless retries)

### Why Separate GeneratedContent Table?

- Version history preserved
- Can rollback to previous generation
- Track attempts and augmentation
- Rich metadata for debugging
- A/B testing capability

### Why Job Configuration Files?

- Reusable batch configurations
- Version control job definitions
- Easy to share and modify
- Future: Auto-process job folder
- Clear audit trail

### Why Tier-Aware Validation?

- Tier 1: Strictest (all CORA targets mandatory)
- Tier 2+: Warnings only (more lenient)
- Matches real-world content quality needs
- Saves costs on bulk tier 2+ content

## Known Limitations

1. **No Interlinking Yet**: Links added in Epic 3 (Story 3.3)
2. **No CSS/Templates**: Added in Story 2.4
3. **Sequential Processing**: No parallel generation (future enhancement)
4. **Force-Regenerate Flag**: Not yet implemented
5. **No Image Generation**: Placeholder for future
6. **Single Project per Job**: Can't mix projects in one batch

## Next Steps

**Story 2.4: HTML Formatting with Multiple Templates**

- Wrap generated content in full HTML documents
- Apply CSS templates
- Map templates to deployment targets
- Add meta tags and SEO elements

**Epic 3: Pre-Deployment & Interlinking**

- Generate final URLs
- Inject interlinks (wheel structure)
- Add home page links
- Random existing article links

## Technical Debt Added

Items added to `technical-debt.md`:

1. A/B test different prompt templates
2. Prompt optimization comparison tool
3. Parallel article generation
4. Job folder auto-processing
5. Cost tracking per generation
6. Model performance analytics

## Files Created/Modified

### New Files:

- `src/database/models.py` - Added GeneratedContent model
- `src/database/interfaces.py` - Added IGeneratedContentRepository
- `src/database/repositories.py` - Added GeneratedContentRepository
- `src/generation/ai_client.py` - OpenRouter AI client
- `src/generation/service.py` - Content generation service
- `src/generation/validator.py` - Stage validation
- `src/generation/augmenter.py` - Content augmentation
- `src/generation/job_config.py` - Job configuration schema
- `src/generation/batch_processor.py` - Batch job processor
- `src/generation/prompts/title_generation.json`
- `src/generation/prompts/outline_generation.json`
- `src/generation/prompts/content_generation.json`
- `src/generation/prompts/outline_augmentation.json`
- `src/generation/prompts/content_augmentation.json`
- `tests/unit/test_generation_service.py`
- `tests/unit/test_augmenter.py`
- `tests/unit/test_job_config.py`
- `tests/integration/test_content_generation.py`
- `jobs/example_tier1_batch.json`
- `jobs/example_multi_tier_batch.json`
- `jobs/example_custom_anchors.json`
- `jobs/README.md`
- `docs/stories/story-2.3-ai-content-generation.md`

### Modified Files:

- `src/cli/commands.py` - Added generate-batch command
- `requirements.txt` - Added beautifulsoup4
- `docs/technical-debt.md` - Added new items

## Manual Testing

### Prerequisites:

1. Set AI_API_KEY in `.env`
2. Initialize database: `python scripts/init_db.py reset`
3. Create admin user: `python scripts/create_first_admin.py`
4. Ingest CORA file: `python main.py ingest-cora --file --name "Test" -u admin -p pass`

### Test Commands:

```bash
# Test single tier batch
python main.py generate-batch -j jobs/example_tier1_batch.json -u admin -p password

# Test multi-tier batch
python main.py generate-batch -j jobs/example_multi_tier_batch.json -u admin -p password

# Test custom anchors
python main.py generate-batch -j jobs/example_custom_anchors.json -u admin -p password
```

### Validation:

```sql
-- Check generated content
SELECT id, project_id, tier, status, generation_stage,
       title_attempts, outline_attempts, content_attempts,
       validation_errors, validation_warnings
FROM generated_content;

-- Check active content
SELECT id, project_id, tier, is_active, word_count, augmented
FROM generated_content
WHERE is_active = 1;
```

## Performance Notes

- Title generation: ~2-5 seconds
- Outline generation: ~5-10 seconds
- Content generation: ~20-60 seconds
- Total per article: ~30-75 seconds
- Batch of 15 (Tier 1): ~10-20 minutes

Varies by model and complexity.
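
The batch figure can be sanity-checked by summing the per-stage ranges and multiplying by the article count. The sketch below does only that arithmetic; `estimate_batch_minutes` is a hypothetical helper, it assumes strictly sequential processing, and it ignores validation retries and augmentation, which is one plausible reason the observed range sits somewhat above the raw lower bound.

```python
# Per-stage timing ranges in seconds, copied from the notes above.
STAGE_SECONDS = {
    "title": (2, 5),
    "outline": (5, 10),
    "content": (20, 60),
}


def estimate_batch_minutes(article_count: int) -> tuple[float, float]:
    """Return (low, high) batch duration in minutes for sequential generation."""
    low = sum(lo for lo, _ in STAGE_SECONDS.values())    # 27 seconds per article
    high = sum(hi for _, hi in STAGE_SECONDS.values())   # 75 seconds per article
    return (article_count * low / 60, article_count * high / 60)


if __name__ == "__main__":
    print(estimate_batch_minutes(15))  # (6.75, 18.75)
```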
## Completion Checklist

- [x] GeneratedContent database model
- [x] GeneratedContentRepository
- [x] AI client service
- [x] Prompt templates
- [x] ContentGenerationService (3-stage pipeline)
- [x] ContentAugmenter
- [x] Stage validation
- [x] Batch processor
- [x] Job configuration schema
- [x] CLI command
- [x] Example job files
- [x] Unit tests (30+ tests)
- [x] Integration tests
- [x] Documentation
- [x] Database initialization support

## Notes

- OpenRouter provides unified API for multiple models
- JSON prompt format preferred by user for better consistency
- Augmentation essential for CORA compliance
- Batch processing architecture scales well
- Version tracking enables rollback and comparison
- Tier system balances quality vs cost
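
To make the "AI retry → programmatic augmentation" pattern behind the augmentation note more concrete, here is a minimal sketch of a single stage. It is not taken from `src/generation/service.py`: `run_stage` and its three callables are hypothetical stand-ins for the AI call, the stage validator, and the augmenter.

```python
from typing import Callable, Tuple


def run_stage(
    generate: Callable[[], str],
    validate: Callable[[str], bool],
    augment: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[str, int, bool]:
    """Return (output, attempts_used, augmented) for one pipeline stage.

    Hypothetical sketch: retry the AI call until validation passes, then
    fall back to programmatic augmentation of the last draft.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate()
        if validate(draft):
            return draft, attempt, False

    # All AI attempts failed validation: try to repair the last draft.
    fixed = augment(draft)
    if validate(fixed):
        return fixed, max_attempts, True

    raise ValueError("stage failed validation after retries and augmentation")


if __name__ == "__main__":
    keyword = "metal roofing"
    result = run_stage(
        generate=lambda: "Best Roofs",                # stand-in for an AI call
        validate=lambda text: keyword in text.lower(),
        augment=lambda text: f"{text}: {keyword}",
    )
    print(result)  # ('Best Roofs: metal roofing', 3, True)
```

Capping the AI attempts before falling back keeps API spend bounded, which matches the cost rationale given under Design Decisions.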