Big-Link-Man/docs/stories/story-2.2-task-breakdown.md


# Story 2.2: Simplified AI Content Generation - Detailed Task Breakdown
## Overview
This document breaks down Story 2.2 into detailed tasks with specific implementation notes.
---
## **PHASE 1: Data Model & Schema Design**
### Task 1.1: Create GeneratedContent Database Model
**File**: `src/database/models.py`
**Add new model class:**
```python
class GeneratedContent(Base):
    __tablename__ = "generated_content"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
    tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    title: Mapped[str] = mapped_column(Text, nullable=False)
    outline: Mapped[dict] = mapped_column(JSON, nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    status: Mapped[str] = mapped_column(String(20), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime,
        default=datetime.utcnow,
        onupdate=datetime.utcnow,
        nullable=False
    )
```
**Status values**: `generated`, `augmented`, `failed`
**Update**: `scripts/init_db.py` to create the table
---
### Task 1.2: Create GeneratedContent Repository
**File**: `src/database/repositories.py`
**Add repository class:**
```python
class GeneratedContentRepository(BaseRepository[GeneratedContent]):
    def __init__(self, session: Session):
        super().__init__(GeneratedContent, session)

    def get_by_project_id(self, project_id: int) -> list[GeneratedContent]:
        pass

    def get_by_project_and_tier(self, project_id: int, tier: str) -> list[GeneratedContent]:
        pass

    def get_by_keyword(self, keyword: str) -> list[GeneratedContent]:
        pass
```
---
### Task 1.3: Define Job File JSON Schema
**File**: `jobs/README.md` (create/update)
**Job file structure** (one project per job, multiple jobs per file):
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        },
        "tier2": {
          "count": 10,
          "min_word_count": 1500,
          "max_word_count": 2000,
          "min_h2_tags": 2,
          "max_h2_tags": 4,
          "min_h3_tags": 3,
          "max_h3_tags": 8
        },
        "tier3": {
          "count": 15,
          "min_word_count": 1000,
          "max_word_count": 1500,
          "min_h2_tags": 2,
          "max_h2_tags": 3,
          "min_h3_tags": 2,
          "max_h3_tags": 6
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": { ... }
      }
    }
  ]
}
```
**Tier defaults** (constants if not specified in job file):
```python
TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}
```
**Future extensibility note**: This structure allows adding more fields per job in future stories.
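As a sketch of how these defaults might be merged with per-tier overrides from a job file (the helper name `resolve_tier_config` is illustrative, and `TIER_DEFAULTS` is abbreviated to one tier here):

```python
# Sketch: apply job-file overrides on top of tier defaults.
# TIER_DEFAULTS is abbreviated; the full constant lives in job_config.py.
TIER_DEFAULTS = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
}

def resolve_tier_config(tier_name: str, overrides: dict) -> dict:
    """Return tier settings with job-file overrides applied over defaults."""
    config = dict(TIER_DEFAULTS[tier_name])  # copy so defaults stay untouched
    config.update({k: v for k, v in overrides.items() if k != "count"})
    config["count"] = overrides["count"]  # count has no default; must be given
    return config

cfg = resolve_tier_config("tier1", {"count": 5, "min_word_count": 2200})
# cfg keeps max_word_count 2500 from defaults, takes min_word_count 2200 from the job
```

A plain `dict.update` is enough here because tier settings are flat; nested job fields added in later stories would need a deep merge.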
---
## **PHASE 2: AI Client & Prompt Management**
### Task 2.1: Implement AIClient for OpenRouter
**File**: `src/generation/ai_client.py`
**OpenRouter API details**:
- Base URL: `https://openrouter.ai/api/v1`
- Compatible with OpenAI SDK
- Requires `OPENROUTER_API_KEY` env variable
**Initial model list**:
```python
AVAILABLE_MODELS = {
    "gpt-4o-mini": "openai/gpt-4o-mini",
    "claude-3.5-sonnet": "anthropic/claude-3.5-sonnet",
    # Many more are available; see the OpenRouter models list
}
```
**Implementation**:
```python
class AIClient:
    def __init__(self, api_key: str, model: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def generate_completion(
        self,
        prompt: str,
        system_message: Optional[str] = None,
        max_tokens: int = 4000,
        temperature: float = 0.7,
        json_mode: bool = False
    ) -> str:
        """
        Generate a completion from the OpenRouter API.

        json_mode: if True, adds response_format={"type": "json_object"}
        """
        pass
```
**Error handling**: Retry 3x with exponential backoff for network/rate limit errors
---
### Task 2.2: Create Prompt Templates
**Files**: `src/generation/prompts/*.json`
**title_generation.json**:
```json
{
  "system_message": "You are an expert SEO content writer...",
  "user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting."
}
```
**outline_generation.json**:
```json
{
  "system_message": "You are an expert content outliner...",
  "user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- {min_h2} to {max_h2} H2 headings\n- {min_h3} to {max_h3} H3 subheadings total\n\nEntities: {entities}\nRelated searches: {related_searches}\n\nReturn as JSON: {\"outline\": [{\"h2\": \"...\", \"h3\": [\"...\", \"...\"]}]}"
}
```
**content_generation.json**:
```json
{
  "system_message": "You are an expert content writer...",
  "user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include: {entities}\nRelated searches: {related_searches}\n\nReturn as HTML fragment with <h2>, <h3>, <p> tags. Do NOT include <html>, <head>, or <body> tags."
}
```
**content_augmentation.json**:
```json
{
  "system_message": "You are an expert content editor...",
  "user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count}\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment."
}
```
---
### Task 2.3: Create PromptManager
**File**: `src/generation/ai_client.py` (add to same file)
```python
class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = prompts_dir
        self.prompts = {}

    def load_prompt(self, prompt_name: str) -> dict:
        """Load prompt from JSON file"""
        pass

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """
        Format prompt with variables
        Returns: (system_message, user_prompt)
        """
        pass
```
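A minimal sketch of how these two methods could be implemented, assuming one JSON file per prompt name as laid out in Task 2.2 (caching and `Path` usage are implementation choices, not requirements):

```python
import json
from pathlib import Path

class PromptManager:
    """Sketch: cache prompt JSON files and format them with str.format."""
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = Path(prompts_dir)
        self.prompts: dict[str, dict] = {}

    def load_prompt(self, prompt_name: str) -> dict:
        # Load once, then serve from the in-memory cache
        if prompt_name not in self.prompts:
            path = self.prompts_dir / f"{prompt_name}.json"
            self.prompts[prompt_name] = json.loads(path.read_text(encoding="utf-8"))
        return self.prompts[prompt_name]

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        prompt = self.load_prompt(prompt_name)
        return prompt["system_message"], prompt["user_prompt"].format(**kwargs)
```

Note that `str.format` treats every `{` and `}` as a placeholder, so templates that embed literal JSON, like the outline example in `outline_generation.json` above, must double those braces (`{{` / `}}`) or use a different substitution mechanism.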
---
## **PHASE 3: Core Generation Pipeline**
### Task 3.1: Implement ContentGenerator Service
**File**: `src/generation/service.py`
```python
class ContentGenerator:
    def __init__(
        self,
        ai_client: AIClient,
        prompt_manager: PromptManager,
        project_repo: ProjectRepository,
        content_repo: GeneratedContentRepository
    ):
        self.ai_client = ai_client
        self.prompt_manager = prompt_manager
        self.project_repo = project_repo
        self.content_repo = content_repo
```
---
### Task 3.2: Implement Stage 1 - Title Generation
**File**: `src/generation/service.py`
```python
def generate_title(self, project_id: int, debug: bool = False) -> str:
    """
    Generate an SEO-optimized title.

    Returns: title string
    Saves to debug_output/title_project_{id}_{timestamp}.txt if debug=True
    """
    # Fetch project
    # Load prompt
    # Call AI
    # If debug: save response to debug_output/
    # Return title
    pass
```
---
### Task 3.3: Implement Stage 2 - Outline Generation
**File**: `src/generation/service.py`
```python
def generate_outline(
    self,
    project_id: int,
    title: str,
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
    debug: bool = False
) -> dict:
    """
    Generate article outline in JSON format.

    Returns: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
    Uses json_mode=True in the AI call to ensure a JSON response.
    Validates: at least min_h2 headings, at least min_h3 total subheadings.
    Saves to debug_output/outline_project_{id}_{timestamp}.json if debug=True
    """
    pass
```
**Validation**:
- Parse JSON response
- Count h2 tags (must be >= min_h2)
- Count total h3 tags across all h2s (must be >= min_h3)
- Raise error if validation fails
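These checks can be sketched as a standalone helper (the name and the choice of `ValueError` are illustrative):

```python
def validate_outline(outline: dict, min_h2: int, min_h3: int) -> None:
    """Raise if the parsed outline has too few H2 or H3 headings."""
    sections = outline.get("outline", [])
    h2_count = len(sections)
    # H3s are nested under each H2 section; count them all
    h3_count = sum(len(section.get("h3", [])) for section in sections)
    if h2_count < min_h2:
        raise ValueError(f"Outline has {h2_count} H2 headings; need at least {min_h2}")
    if h3_count < min_h3:
        raise ValueError(f"Outline has {h3_count} H3 subheadings; need at least {min_h3}")
```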
---
### Task 3.4: Implement Stage 3 - Content Generation
**File**: `src/generation/service.py`
```python
def generate_content(
    self,
    project_id: int,
    title: str,
    outline: dict,
    debug: bool = False
) -> str:
    """
    Generate the full article as an HTML fragment.

    Returns: HTML string with <h2>, <h3>, <p> tags
    Does NOT include <html>, <head>, or <body> tags.
    Saves to debug_output/content_project_{id}_{timestamp}.html if debug=True
    """
    pass
```
**HTML fragment format**:
```html
<h2>First Heading</h2>
<p>Paragraph content...</p>
<h3>Subheading</h3>
<p>More content...</p>
```
---
### Task 3.5: Implement Word Count Validation
**File**: `src/generation/service.py`
```python
def validate_word_count(self, content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """
    Validate content word count.

    Returns: (is_valid, actual_count)
    - is_valid: True if min_words <= actual_count <= max_words
    - actual_count: number of words in content
    Implementation: strip HTML tags, split on whitespace, count tokens.
    """
    pass
```
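A self-contained sketch of this validator, reusing the tag-stripping approach from the Word Count Method in the Dev Notes (shown here as a free function rather than a method):

```python
import re
from html import unescape

def validate_word_count(content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """Strip HTML, count whitespace-separated words, check the range."""
    # Replace tags with a space so words in adjacent elements don't fuse
    text = unescape(re.sub(r'<[^>]+>', ' ', content))
    count = len(text.split())
    return (min_words <= count <= max_words, count)
```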
---
### Task 3.6: Implement Simple Augmentation
**File**: `src/generation/service.py`
```python
def augment_content(
    self,
    project_id: int,
    content: str,
    target_word_count: int,
    debug: bool = False
) -> str:
    """
    Expand article content to meet the minimum word count.

    Called ONLY if word_count < min_word_count.
    Makes ONE API call only.
    Saves to debug_output/augmented_project_{project_id}_{timestamp}.html if debug=True
    """
    pass
```
---
## **PHASE 4: Batch Processing**
### Task 4.1: Create JobConfig Parser
**File**: `src/generation/job_config.py`
```python
from dataclasses import dataclass
from typing import Optional

TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}

@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int

@dataclass
class Job:
    project_id: int
    tiers: dict[str, TierConfig]

class JobConfig:
    def __init__(self, job_file_path: str):
        """Load and parse job file, apply defaults"""
        pass

    def get_jobs(self) -> list[Job]:
        """Return list of all jobs in file"""
        pass

    def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
        """Get tier config with defaults applied"""
        pass
```
---
### Task 4.2: Create BatchProcessor
**File**: `src/generation/batch_processor.py`
```python
class BatchProcessor:
    def __init__(
        self,
        content_generator: ContentGenerator,
        content_repo: GeneratedContentRepository,
        project_repo: ProjectRepository
    ):
        pass

    def process_job(
        self,
        job_file_path: str,
        debug: bool = False,
        continue_on_error: bool = False
    ):
        """
        Process all jobs in the job file.

        For each job:
          0. Validate project configuration (fail fast if invalid)
             - Check project exists
             - Validate money_site_url is set (required for tiered linking strategy)
          For each tier:
            For count times:
              1. Generate title (log to console)
              2. Generate outline
              3. Generate content
              4. Validate word count
              5. If below min, augment once
              6. Save to GeneratedContent table

        Logs progress to console.
        If debug=True, saves AI responses to debug_output/.
        """
        pass
```
**Console output format**:
```
Processing Job 1/3: Project ID 5
  Tier 1: Generating 5 articles
    [1/5] Generating title... "Ultimate Guide to SEO in 2025"
    [1/5] Generating outline... 4 H2s, 8 H3s
    [1/5] Generating content... 1,845 words
    [1/5] Below minimum (2000), augmenting... 2,123 words
    [1/5] Saved (ID: 42, Status: augmented)
    [2/5] Generating title... "Advanced SEO Techniques"
    ...
  Tier 2: Generating 10 articles
  ...
Summary:
  Jobs processed: 3/3
  Articles generated: 45/45
  Augmented: 12
  Failed: 0
```
---
### Task 4.3: Error Handling & Retry Logic
**File**: `src/generation/batch_processor.py`
**Error handling strategy**:
- Project validation errors: Fail fast before generation starts
- Missing project: Abort with clear error
- Missing money_site_url: Abort with clear error (required for all jobs)
- AI API errors: Log error, mark as `status='failed'`, save to DB
- If `continue_on_error=True`: continue to next article
- If `continue_on_error=False`: stop batch processing
- Database errors: Always abort (data integrity)
- Invalid job file: Fail fast with validation error
**Retry logic** (in AIClient):
- Network errors: 3 retries with exponential backoff (1s, 2s, 4s)
- Rate limit errors: Respect Retry-After header
- Other errors: No retry, raise immediately
---
## **PHASE 5: CLI Integration**
### Task 5.1: Add generate-batch Command
**File**: `src/cli/commands.py`
```python
@app.command("generate-batch")
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
              help='Path to job JSON file')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
@click.option('--continue-on-error', is_flag=True,
              help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
              help='AI model to use (gpt-4o-mini, claude-3.5-sonnet)')
def generate_batch(
    job_file: str,
    username: Optional[str],
    password: Optional[str],
    debug: bool,
    continue_on_error: bool,
    model: str
):
    """Generate content batch from job file"""
    # Authenticate user
    # Initialize AIClient with OpenRouter
    # Initialize PromptManager, ContentGenerator, BatchProcessor
    # Call process_job()
    # Show summary
    pass
```
---
### Task 5.2: Add Progress Logging & Debug Output
**File**: `src/generation/batch_processor.py`
**Debug output** (when `--debug` flag used):
- Create `debug_output/` directory if not exists
- For each AI call, save response to file:
- `debug_output/title_project{id}_tier{tier}_{n}_{timestamp}.txt`
- `debug_output/outline_project{id}_tier{tier}_{n}_{timestamp}.json`
- `debug_output/content_project{id}_tier{tier}_{n}_{timestamp}.html`
- `debug_output/augmented_project{id}_tier{tier}_{n}_{timestamp}.html`
- Also echo to console with `click.echo()`
**Normal output** (without `--debug`):
- Always show title when generated: `"Generated title: {title}"`
- Show word counts and status
- Show progress counter `[n/total]`
---
## **PHASE 6: Testing & Validation**
### Task 6.1: Create Unit Tests
#### `tests/unit/test_ai_client.py`
```python
def test_generate_completion_success():
    """Test successful AI completion"""
    pass

def test_generate_completion_json_mode():
    """Test JSON mode returns valid JSON"""
    pass

def test_generate_completion_retry_on_network_error():
    """Test retry logic for network errors"""
    pass
```
#### `tests/unit/test_content_generator.py`
```python
def test_generate_title():
    """Test title generation with mocked AI response"""
    pass

def test_generate_outline_valid_structure():
    """Test outline generation returns valid JSON with min h2/h3"""
    pass

def test_generate_content_html_fragment():
    """Test content is HTML fragment (no <html> tag)"""
    pass

def test_validate_word_count():
    """Test word count validation with various HTML inputs"""
    pass

def test_augment_content_called_once():
    """Test augmentation only called once"""
    pass
```
#### `tests/unit/test_job_config.py`
```python
def test_load_job_config_valid():
    """Test loading valid job file"""
    pass

def test_tier_defaults_applied():
    """Test defaults applied when not in job file"""
    pass

def test_multiple_jobs_in_file():
    """Test parsing file with multiple jobs"""
    pass
```
#### `tests/unit/test_batch_processor.py`
```python
def test_process_job_success():
    """Test successful batch processing"""
    pass

def test_process_job_with_augmentation():
    """Test articles below min word count are augmented"""
    pass

def test_process_job_continue_on_error():
    """Test continue_on_error flag behavior"""
    pass
```
---
### Task 6.2: Create Integration Test
**File**: `tests/integration/test_generate_batch.py`
```python
def test_generate_batch_end_to_end(test_db, mock_ai_client):
    """
    End-to-end test:
    1. Create test project in DB
    2. Create test job file
    3. Run batch processor
    4. Verify GeneratedContent records created
    5. Verify word counts within range
    6. Verify HTML structure
    """
    pass
```
---
### Task 6.3: Create Example Job Files
#### `jobs/example_tier1_batch.json`
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    }
  ]
}
```
(Uses all defaults for tier1)
#### `jobs/example_multi_tier_batch.json`
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2200,
          "max_word_count": 2600
        },
        "tier2": {
          "count": 10
        },
        "tier3": {
          "count": 15,
          "max_h2_tags": 4
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": {
          "count": 3
        }
      }
    }
  ]
}
```
#### `jobs/README.md`
Document job file format and examples
---
## **PHASE 7: Cleanup & Deprecation**
### Task 7.1: Remove Old ContentRuleEngine
**Action**: Delete or gut `src/generation/rule_engine.py`
Only keep if it has reusable utilities. Otherwise remove entirely.
---
### Task 7.2: Remove Old Validator Logic
**Action**: Review `src/generation/validator.py` (if exists)
Remove any strict CORA validation beyond word count. Keep only simple validation utilities.
---
### Task 7.3: Update Documentation
**Files to update**:
- `docs/stories/story-2.2-simplified-ai-content-generation.md` - update status from "In Progress" to "Done"
- `docs/architecture/workflows.md` - Document simplified generation flow
- `docs/architecture/components.md` - Update generation component description
---
## Implementation Order Recommendation
1. **Phase 1** (Data Layer) - Required foundation
2. **Phase 2** (AI Client) - Required for generation
3. **Phase 3** (Core Logic) - Implement one stage at a time, test each
4. **Phase 4** (Batch Processing) - Orchestrate stages
5. **Phase 5** (CLI) - Make accessible to users
6. **Phase 6** (Testing) - Can be done in parallel with implementation
7. **Phase 7** (Cleanup) - Final polish
**Estimated effort**:
- Phase 1-2: 4-6 hours
- Phase 3: 6-8 hours
- Phase 4: 3-4 hours
- Phase 5: 2-3 hours
- Phase 6: 4-6 hours
- Phase 7: 1-2 hours
- **Total**: 20-29 hours
---
## Critical Dev Notes
### OpenRouter Specifics
- API key from environment: `OPENROUTER_API_KEY`
- Model format: `"provider/model-name"`
- Supports OpenAI SDK drop-in replacement
- Rate limits vary by model (check OpenRouter docs)
### HTML Fragment Format
Content generation returns HTML like:
```html
<h2>Main Topic</h2>
<p>Introduction paragraph with relevant keywords and entities.</p>
<h3>Subtopic One</h3>
<p>Detailed content about subtopic.</p>
<h3>Subtopic Two</h3>
<p>More detailed content.</p>
<h2>Second Main Topic</h2>
<p>Content continues...</p>
```
**No document structure**: No `<!DOCTYPE>`, `<html>`, `<head>`, or `<body>` tags.
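Tests (Task 6.1) could enforce the fragment rule with a rough guard like the following (helper name is illustrative; it only looks for document-level tags and deliberately skips `<head`, which would false-positive on `<header>`):

```python
def is_html_fragment(content: str) -> bool:
    """Return True if content looks like a bare fragment, i.e. contains
    no document-level markup such as <!DOCTYPE>, <html>, or <body>."""
    lowered = content.lower()
    return not any(tag in lowered for tag in ("<!doctype", "<html", "<body"))
```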
### Word Count Method
```python
import re
from html import unescape

def count_words(html_content: str) -> int:
    # Replace tags with a space so words in adjacent elements don't fuse
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Unescape HTML entities
    text = unescape(text)
    # Split on whitespace and count
    words = text.split()
    return len(words)
```
### Debug Output Directory
- Create `debug_output/` at project root if not exists
- Add to `.gitignore`
- Filename format: `{stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}`
- Example: `title_project5_tier1_article3_20251020_143022.txt`
### Tier Constants Location
Define in `src/generation/job_config.py` as module-level constant for easy reference.
### Future Extensibility
Job file structure designed to support:
- Custom interlinking rules (Story 2.4+)
- Template selection (Story 3.x)
- Deployment targets (Story 4.x)
- SEO metadata overrides
Keep job parsing flexible to add new fields without breaking existing jobs.
---
## Testing Strategy
### Unit Test Mocking
Mock `AIClient.generate_completion()` to return realistic HTML:
```python
@pytest.fixture
def mock_title_response():
    return "The Ultimate Guide to Sustainable Gardening in 2025"

@pytest.fixture
def mock_outline_response():
    return {
        "outline": [
            {"h2": "Getting Started", "h3": ["Tools", "Planning"]},
            {"h2": "Best Practices", "h3": ["Watering", "Composting"]}
        ]
    }

@pytest.fixture
def mock_content_response():
    return """<h2>Getting Started</h2>
<p>Sustainable gardening begins with proper planning...</p>
<h3>Tools</h3>
<p>Essential tools include...</p>"""
```
### Integration Test Database
Use `conftest.py` fixture with in-memory SQLite and test data:
```python
@pytest.fixture
def test_project(test_db):
    project_repo = ProjectRepository(test_db)
    return project_repo.create(
        user_id=1,
        name="Test Project",
        data={
            "main_keyword": "sustainable gardening",
            "entities": ["composting", "organic soil"],
            "related_searches": ["how to compost", "organic gardening tips"]
        }
    )
```
---
## Success Criteria
Story is complete when:
1. All database models and repositories implemented
2. AIClient successfully calls OpenRouter API
3. Three-stage generation pipeline works end-to-end
4. Batch processor handles multiple jobs/tiers
5. CLI command `generate-batch` functional
6. Debug output saves to `debug_output/` when `--debug` used
7. All unit tests pass
8. Integration test demonstrates full workflow
9. Example job files work correctly
10. Documentation updated
**Acceptance**: Run `generate-batch` on real project, verify content saved to database with correct word count and structure.