536 lines
14 KiB
Markdown
536 lines
14 KiB
Markdown
# Story 2.3: AI-Powered Content Generation - COMPLETED
|
|
|
|
## Overview
|
|
Implemented a comprehensive AI-powered content generation system with three-stage pipeline (title → outline → content), validation at each stage, programmatic augmentation for CORA compliance, and batch job processing across multiple tiers.
|
|
|
|
## Status
|
|
**COMPLETED**
|
|
|
|
## Story Details
|
|
**As a User**, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.
|
|
|
|
## Acceptance Criteria - ALL MET
|
|
|
|
### 1. Script Initiation for Projects
|
|
**Status:** COMPLETE
|
|
|
|
- CLI command: `generate-batch --job-file <path>`
|
|
- Supports batch processing across multiple tiers
|
|
- Job configuration via JSON files
|
|
- Progress tracking and error reporting
|
|
|
|
### 2. AI-Powered Generation Using SEO Data
|
|
**Status:** COMPLETE
|
|
|
|
- Title generation with keyword validation
|
|
- Outline generation meeting CORA H2/H3 targets
|
|
- Full HTML content generation
|
|
- Uses project's SEO data (keywords, entities, related searches)
|
|
- Multiple AI models supported via OpenRouter
|
|
|
|
### 3. Content Rule Engine Validation
|
|
**Status:** COMPLETE
|
|
|
|
- Validates at each stage (title, outline, content)
|
|
- Uses ContentRuleEngine from Story 2.2
|
|
- Tier-aware validation (strict for Tier 1)
|
|
- Detailed error reporting
|
|
|
|
### 4. Database Storage
|
|
**Status:** COMPLETE
|
|
|
|
- Title, outline, and content stored in GeneratedContent table
|
|
- Version tracking and metadata
|
|
- Tracks attempts, models used, validation results
|
|
- Augmentation logs
|
|
|
|
### 5. Progress Logging
|
|
**Status:** COMPLETE
|
|
|
|
- Real-time progress updates via CLI
|
|
- Logs: "Generating title...", "Generating content...", etc.
|
|
- Tracks successful, failed, and skipped articles
|
|
- Detailed summary reports
|
|
|
|
### 6. AI Service Error Handling
|
|
**Status:** COMPLETE
|
|
|
|
- Graceful handling of API errors
|
|
- Retry logic with configurable attempts
|
|
- Fallback to programmatic augmentation
|
|
- Continue or stop on failures (configurable)
|
|
|
|
## Implementation Details
|
|
|
|
### Architecture Components
|
|
|
|
#### 1. Database Models (`src/database/models.py`)
|
|
|
|
**GeneratedContent Model:**
|
|
```python
|
|
class GeneratedContent(Base):
|
|
id, project_id, tier
|
|
title, outline, content
|
|
status, is_active
|
|
generation_stage
|
|
title_attempts, outline_attempts, content_attempts
|
|
title_model, outline_model, content_model
|
|
validation_errors, validation_warnings
|
|
validation_report (JSON)
|
|
word_count, augmented
|
|
augmentation_log (JSON)
|
|
generation_duration
|
|
error_message
|
|
created_at, updated_at
|
|
```
|
|
|
|
#### 2. AI Client (`src/generation/ai_client.py`)
|
|
|
|
**Features:**
|
|
- OpenRouter API integration
|
|
- Multiple model support
|
|
- JSON-formatted responses
|
|
- Error handling and retries
|
|
- Model validation
|
|
|
|
**Available Models:**
|
|
- Claude 3.5 Sonnet (default)
|
|
- Claude 3 Haiku
|
|
- GPT-4o / GPT-4o-mini
|
|
- Llama 3.1 70B/8B
|
|
- Gemini Pro 1.5
|
|
|
|
#### 3. Job Configuration (`src/generation/job_config.py`)
|
|
|
|
**Job Structure:**
|
|
```json
|
|
{
|
|
"job_name": "Batch Name",
|
|
"project_id": 1,
|
|
"tiers": [
|
|
{
|
|
"tier": 1,
|
|
"article_count": 15,
|
|
"models": {
|
|
"title": "model-id",
|
|
"outline": "model-id",
|
|
"content": "model-id"
|
|
},
|
|
"anchor_text_config": {
|
|
"mode": "default|override|append"
|
|
},
|
|
"validation_attempts": 3
|
|
}
|
|
],
|
|
"failure_config": {
|
|
"max_consecutive_failures": 5,
|
|
"skip_on_failure": true
|
|
}
|
|
}
|
|
```
|
|
|
|
#### 4. Three-Stage Generation Pipeline (`src/generation/service.py`)
|
|
|
|
**Stage 1: Title Generation**
|
|
- Uses title_generation.json prompt
|
|
- Validates keyword presence and length
|
|
- Retries on validation failure
|
|
- Max attempts configurable
|
|
|
|
**Stage 2: Outline Generation**
|
|
- Uses outline_generation.json prompt
|
|
- Returns JSON structure with H1, H2s, H3s
|
|
- Validates CORA targets (H2/H3 counts, keyword distribution)
|
|
- AI retry → Programmatic augmentation if needed
|
|
- Ensures FAQ section present
|
|
|
|
**Stage 3: Content Generation**
|
|
- Uses content_generation.json prompt
|
|
- Follows validated outline structure
|
|
- Generates full HTML (no CSS, just semantic markup)
|
|
- Validates against all CORA rules
|
|
- AI retry → Augmentation if needed
|
|
|
|
#### 5. Stage Validation (`src/generation/validator.py`)
|
|
|
|
**Title Validation:**
|
|
- Length (30-100 chars)
|
|
- Keyword presence
|
|
- Non-empty
|
|
|
|
**Outline Validation:**
|
|
- H1 contains keyword
|
|
- H2/H3 counts meet targets
|
|
- Keyword distribution in headings
|
|
- Entity and related search incorporation
|
|
- FAQ section present
|
|
- Tier-aware strictness
|
|
|
|
**Content Validation:**
|
|
- Full CORA rule validation
|
|
- Word count (min/max)
|
|
- Keyword frequency
|
|
- Heading structure
|
|
- FAQ format
|
|
- Image alt text (when applicable)
|
|
|
|
#### 6. Content Augmentation (`src/generation/augmenter.py`)
|
|
|
|
**Outline Augmentation:**
|
|
- Add missing H2s with keywords
|
|
- Add H3s with entities
|
|
- Modify existing headings
|
|
- Maintain logical flow
|
|
|
|
**Content Augmentation:**
|
|
- Strategy 1: Ask AI to add paragraphs (small deficits)
|
|
- Strategy 2: Programmatically insert terms (large deficits)
|
|
- Insert keywords into random sentences
|
|
- Capitalize if sentence-initial
|
|
- Add complete paragraphs with missing elements
|
|
|
|
#### 7. Batch Processor (`src/generation/batch_processor.py`)
|
|
|
|
**Features:**
|
|
- Process multiple tiers sequentially
|
|
- Track progress per tier
|
|
- Handle failures (skip or stop)
|
|
- Consecutive failure threshold
|
|
- Real-time progress callbacks
|
|
- Detailed result reporting
|
|
|
|
#### 8. Prompt Templates (`src/generation/prompts/`)
|
|
|
|
**Files:**
|
|
- `title_generation.json` - Title prompts
|
|
- `outline_generation.json` - Outline structure prompts
|
|
- `content_generation.json` - Full content prompts
|
|
- `outline_augmentation.json` - Outline fix prompts
|
|
- `content_augmentation.json` - Content enhancement prompts
|
|
|
|
**Format:**
|
|
```json
|
|
{
|
|
"system": "System message",
|
|
"user_template": "Prompt with {placeholders}",
|
|
"validation": {
|
|
"output_format": "text|json|html",
|
|
"requirements": []
|
|
}
|
|
}
|
|
```
|
|
|
|
### CLI Command
|
|
|
|
```bash
|
|
python main.py generate-batch \
|
|
--job-file jobs/example_tier1_batch.json \
|
|
--username admin \
|
|
--password password
|
|
```
|
|
|
|
**Options:**
|
|
- `--job-file, -j`: Path to job configuration JSON (required)
|
|
- `--force-regenerate, -f`: Force regeneration (flag, not implemented)
|
|
- `--username, -u`: Authentication username
|
|
- `--password, -p`: Authentication password
|
|
|
|
**Example Output:**
|
|
```
|
|
Authenticated as: admin (Admin)
|
|
|
|
Loading Job: Tier 1 Launch Batch
|
|
Project ID: 1
|
|
Total Articles: 15
|
|
|
|
Tiers:
|
|
Tier 1: 15 articles
|
|
Models: gpt-4o-mini / claude-3.5-sonnet / claude-3.5-sonnet
|
|
|
|
Proceed with generation? [y/N]: y
|
|
|
|
Starting batch generation...
|
|
--------------------------------------------------------------------------------
|
|
[Tier 1] Article 1/15: Generating...
|
|
[Tier 1] Article 1/15: Completed (ID: 1)
|
|
[Tier 1] Article 2/15: Generating...
|
|
...
|
|
--------------------------------------------------------------------------------
|
|
|
|
Batch Generation Complete!
|
|
Job: Tier 1 Launch Batch
|
|
Project ID: 1
|
|
Duration: 1234.56s
|
|
|
|
Results:
|
|
Total Articles: 15
|
|
Successful: 14
|
|
Failed: 0
|
|
Skipped: 1
|
|
|
|
By Tier:
|
|
Tier 1:
|
|
Successful: 14
|
|
Failed: 0
|
|
Skipped: 1
|
|
```
|
|
|
|
### Example Job Files
|
|
|
|
Located in `jobs/` directory:
|
|
- `example_tier1_batch.json` - 15 tier 1 articles
|
|
- `example_multi_tier_batch.json` - 165 articles across 3 tiers
|
|
- `example_custom_anchors.json` - Custom anchor text demo
|
|
- `README.md` - Job configuration guide
|
|
|
|
### Test Coverage
|
|
|
|
**Unit Tests (30+ tests):**
|
|
- `test_generation_service.py` - Pipeline stages
|
|
- `test_augmenter.py` - Content augmentation
|
|
- `test_job_config.py` - Job configuration validation
|
|
|
|
**Integration Tests:**
|
|
- `test_content_generation.py` - Full pipeline with mocked AI
|
|
- Repository CRUD operations
|
|
- Service initialization
|
|
- Job validation
|
|
|
|
### Database Schema
|
|
|
|
**New Table: generated_content**
|
|
```sql
|
|
CREATE TABLE generated_content (
|
|
id INTEGER PRIMARY KEY,
|
|
project_id INTEGER REFERENCES projects(id),
|
|
tier INTEGER,
|
|
title TEXT,
|
|
outline TEXT,
|
|
content TEXT,
|
|
status VARCHAR(20) DEFAULT 'pending',
|
|
is_active BOOLEAN DEFAULT 0,
|
|
generation_stage VARCHAR(20) DEFAULT 'title',
|
|
title_attempts INTEGER DEFAULT 0,
|
|
outline_attempts INTEGER DEFAULT 0,
|
|
content_attempts INTEGER DEFAULT 0,
|
|
title_model VARCHAR(100),
|
|
outline_model VARCHAR(100),
|
|
content_model VARCHAR(100),
|
|
validation_errors INTEGER DEFAULT 0,
|
|
validation_warnings INTEGER DEFAULT 0,
|
|
validation_report JSON,
|
|
word_count INTEGER,
|
|
augmented BOOLEAN DEFAULT 0,
|
|
augmentation_log JSON,
|
|
generation_duration FLOAT,
|
|
error_message TEXT,
|
|
created_at TIMESTAMP,
|
|
updated_at TIMESTAMP
|
|
);
|
|
|
|
CREATE INDEX idx_generated_content_project_id ON generated_content(project_id);
|
|
CREATE INDEX idx_generated_content_tier ON generated_content(tier);
|
|
CREATE INDEX idx_generated_content_status ON generated_content(status);
|
|
```
|
|
|
|
### Dependencies Added
|
|
|
|
- `beautifulsoup4==4.12.2` - HTML parsing for augmentation
|
|
|
|
All other dependencies already present (OpenAI SDK for OpenRouter).
|
|
|
|
### Configuration
|
|
|
|
**Environment Variables:**
|
|
```bash
|
|
AI_API_KEY=sk-or-v1-your-openrouter-key
|
|
AI_API_BASE_URL=https://openrouter.ai/api/v1 # Optional
|
|
AI_MODEL=anthropic/claude-3.5-sonnet # Optional
|
|
```
|
|
|
|
**master.config.json:**
|
|
Already configured in Story 2.2 with:
|
|
- `ai_service` section
|
|
- `content_rules` for validation
|
|
- Available models list
|
|
|
|
## Design Decisions
|
|
|
|
### Why Three Separate Stages?
|
|
|
|
1. **Title First**: Validates keyword usage early, informs outline
|
|
2. **Outline Next**: Ensures structure before expensive content generation
|
|
3. **Content Last**: Follows validated structure, reduces failures
|
|
|
|
Better success rate than single-prompt approach.
|
|
|
|
### Why Programmatic Augmentation?
|
|
|
|
- AI is unreliable at precise keyword placement
|
|
- Validation failures are common with strict CORA targets
|
|
- Hybrid approach: AI for quality, programmatic for precision
|
|
- Saves API costs (no endless retries)
|
|
|
|
### Why Separate GeneratedContent Table?
|
|
|
|
- Version history preserved
|
|
- Can rollback to previous generation
|
|
- Track attempts and augmentation
|
|
- Rich metadata for debugging
|
|
- A/B testing capability
|
|
|
|
### Why Job Configuration Files?
|
|
|
|
- Reusable batch configurations
|
|
- Version control job definitions
|
|
- Easy to share and modify
|
|
- Future: Auto-process job folder
|
|
- Clear audit trail
|
|
|
|
### Why Tier-Aware Validation?
|
|
|
|
- Tier 1: Strictest (all CORA targets mandatory)
|
|
- Tier 2+: Warnings only (more lenient)
|
|
- Matches real-world content quality needs
|
|
- Saves costs on bulk tier 2+ content
|
|
|
|
## Known Limitations
|
|
|
|
1. **No Interlinking Yet**: Links added in Epic 3 (Story 3.3)
|
|
2. **No CSS/Templates**: Added in Story 2.4
|
|
3. **Sequential Processing**: No parallel generation (future enhancement)
|
|
4. **Force-Regenerate Flag**: Not yet implemented
|
|
5. **No Image Generation**: Placeholder for future
|
|
6. **Single Project per Job**: Can't mix projects in one batch
|
|
|
|
## Next Steps
|
|
|
|
**Story 2.4: HTML Formatting with Multiple Templates**
|
|
- Wrap generated content in full HTML documents
|
|
- Apply CSS templates
|
|
- Map templates to deployment targets
|
|
- Add meta tags and SEO elements
|
|
|
|
**Epic 3: Pre-Deployment & Interlinking**
|
|
- Generate final URLs
|
|
- Inject interlinks (wheel structure)
|
|
- Add home page links
|
|
- Random existing article links
|
|
|
|
## Technical Debt Added
|
|
|
|
Items added to `technical-debt.md`:
|
|
1. A/B test different prompt templates
|
|
2. Prompt optimization comparison tool
|
|
3. Parallel article generation
|
|
4. Job folder auto-processing
|
|
5. Cost tracking per generation
|
|
6. Model performance analytics
|
|
|
|
## Files Created/Modified
|
|
|
|
### New Files:
|
|
- `src/database/models.py` - Added GeneratedContent model
|
|
- `src/database/interfaces.py` - Added IGeneratedContentRepository
|
|
- `src/database/repositories.py` - Added GeneratedContentRepository
|
|
- `src/generation/ai_client.py` - OpenRouter AI client
|
|
- `src/generation/service.py` - Content generation service
|
|
- `src/generation/validator.py` - Stage validation
|
|
- `src/generation/augmenter.py` - Content augmentation
|
|
- `src/generation/job_config.py` - Job configuration schema
|
|
- `src/generation/batch_processor.py` - Batch job processor
|
|
- `src/generation/prompts/title_generation.json`
|
|
- `src/generation/prompts/outline_generation.json`
|
|
- `src/generation/prompts/content_generation.json`
|
|
- `src/generation/prompts/outline_augmentation.json`
|
|
- `src/generation/prompts/content_augmentation.json`
|
|
- `tests/unit/test_generation_service.py`
|
|
- `tests/unit/test_augmenter.py`
|
|
- `tests/unit/test_job_config.py`
|
|
- `tests/integration/test_content_generation.py`
|
|
- `jobs/example_tier1_batch.json`
|
|
- `jobs/example_multi_tier_batch.json`
|
|
- `jobs/example_custom_anchors.json`
|
|
- `jobs/README.md`
|
|
- `docs/stories/story-2.3-ai-content-generation.md`
|
|
|
|
### Modified Files:
|
|
- `src/cli/commands.py` - Added generate-batch command
|
|
- `requirements.txt` - Added beautifulsoup4
|
|
- `docs/technical-debt.md` - Added new items
|
|
|
|
## Manual Testing
|
|
|
|
### Prerequisites:
|
|
1. Set AI_API_KEY in `.env`
|
|
2. Initialize database: `python scripts/init_db.py reset`
|
|
3. Create admin user: `python scripts/create_first_admin.py`
|
|
4. Ingest CORA file: `python main.py ingest-cora --file <path> --name "Test" -u admin -p pass`
|
|
|
|
### Test Commands:
|
|
|
|
```bash
|
|
# Test single tier batch
|
|
python main.py generate-batch -j jobs/example_tier1_batch.json -u admin -p password
|
|
|
|
# Test multi-tier batch
|
|
python main.py generate-batch -j jobs/example_multi_tier_batch.json -u admin -p password
|
|
|
|
# Test custom anchors
|
|
python main.py generate-batch -j jobs/example_custom_anchors.json -u admin -p password
|
|
```
|
|
|
|
### Validation:
|
|
|
|
```sql
|
|
-- Check generated content
|
|
SELECT id, project_id, tier, status, generation_stage,
|
|
title_attempts, outline_attempts, content_attempts,
|
|
validation_errors, validation_warnings
|
|
FROM generated_content;
|
|
|
|
-- Check active content
|
|
SELECT id, project_id, tier, is_active, word_count, augmented
|
|
FROM generated_content
|
|
WHERE is_active = 1;
|
|
```
|
|
|
|
## Performance Notes
|
|
|
|
- Title generation: ~2-5 seconds
|
|
- Outline generation: ~5-10 seconds
|
|
- Content generation: ~20-60 seconds
|
|
- Total per article: ~30-75 seconds
|
|
- Batch of 15 (Tier 1): ~10-20 minutes
|
|
|
|
Varies by model and complexity.
|
|
|
|
## Completion Checklist
|
|
|
|
- [x] GeneratedContent database model
|
|
- [x] GeneratedContentRepository
|
|
- [x] AI client service
|
|
- [x] Prompt templates
|
|
- [x] ContentGenerationService (3-stage pipeline)
|
|
- [x] ContentAugmenter
|
|
- [x] Stage validation
|
|
- [x] Batch processor
|
|
- [x] Job configuration schema
|
|
- [x] CLI command
|
|
- [x] Example job files
|
|
- [x] Unit tests (30+ tests)
|
|
- [x] Integration tests
|
|
- [x] Documentation
|
|
- [x] Database initialization support
|
|
|
|
## Notes
|
|
|
|
- OpenRouter provides unified API for multiple models
|
|
- JSON prompt format preferred by user for better consistency
|
|
- Augmentation essential for CORA compliance
|
|
- Batch processing architecture scales well
|
|
- Version tracking enables rollback and comparison
|
|
- Tier system balances quality vs cost
|
|
|
|
|