Big-Link-Man/docs/stories/story-2.2-task-breakdown.md


# Story 2.2: Simplified AI Content Generation - Detailed Task Breakdown
## Overview
This document breaks down Story 2.2 into detailed tasks with specific implementation notes.
---
## **PHASE 1: Data Model & Schema Design**
### Task 1.1: Create GeneratedContent Database Model
**File**: `src/database/models.py`
**Add new model class:**
```python
class GeneratedContent(Base):
    __tablename__ = "generated_content"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
    tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    title: Mapped[str] = mapped_column(Text, nullable=False)
    outline: Mapped[dict] = mapped_column(JSON, nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    status: Mapped[str] = mapped_column(String(20), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime,
        default=datetime.utcnow,
        onupdate=datetime.utcnow,
        nullable=False
    )
```
**Status values**: `generated`, `augmented`, `failed`
**Update**: `scripts/init_db.py` to create the table
---
### Task 1.2: Create GeneratedContent Repository
**File**: `src/database/repositories.py`
**Add repository class:**
```python
class GeneratedContentRepository(BaseRepository[GeneratedContent]):
    def __init__(self, session: Session):
        super().__init__(GeneratedContent, session)

    def get_by_project_id(self, project_id: int) -> list[GeneratedContent]:
        pass

    def get_by_project_and_tier(self, project_id: int, tier: str) -> list[GeneratedContent]:
        pass

    def get_by_keyword(self, keyword: str) -> list[GeneratedContent]:
        pass
```
---
### Task 1.3: Define Job File JSON Schema
**File**: `jobs/README.md` (create/update)
**Job file structure** (one project per job, multiple jobs per file):
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        },
        "tier2": {
          "count": 10,
          "min_word_count": 1500,
          "max_word_count": 2000,
          "min_h2_tags": 2,
          "max_h2_tags": 4,
          "min_h3_tags": 3,
          "max_h3_tags": 8
        },
        "tier3": {
          "count": 15,
          "min_word_count": 1000,
          "max_word_count": 1500,
          "min_h2_tags": 2,
          "max_h2_tags": 3,
          "min_h3_tags": 2,
          "max_h3_tags": 6
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": { ... }
      }
    }
  ]
}
```
**Tier defaults** (constants if not specified in job file):
```python
TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}
```
**Future extensibility note**: This structure allows adding more fields per job in future stories.
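As a sketch of how these defaults might be merged with per-tier overrides from a job file (the helper name `resolve_tier_config` is illustrative, and `TIER_DEFAULTS` is abbreviated to one tier here):

```python
# Sketch: apply job-file overrides on top of tier defaults.
# TIER_DEFAULTS is abbreviated; the full constant lives in job_config.py.
TIER_DEFAULTS = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
}

def resolve_tier_config(tier_name: str, overrides: dict) -> dict:
    """Return tier settings with job-file overrides applied over defaults."""
    config = dict(TIER_DEFAULTS[tier_name])  # copy so defaults stay untouched
    config.update({k: v for k, v in overrides.items() if k != "count"})
    config["count"] = overrides["count"]  # count has no default; must be given
    return config

cfg = resolve_tier_config("tier1", {"count": 5, "min_word_count": 2200})
# cfg keeps max_word_count 2500 from defaults, takes min_word_count 2200 from the job
```

A plain `dict.update` is enough here because tier settings are flat; nested job fields added in later stories would need a deep merge.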
---
## **PHASE 2: AI Client & Prompt Management**
### Task 2.1: Implement AIClient for OpenRouter
**File**: `src/generation/ai_client.py`
**OpenRouter API details**:
- Base URL: `https://openrouter.ai/api/v1`
- Compatible with OpenAI SDK
- Requires `OPENROUTER_API_KEY` env variable
**Initial model list**:
```python
AVAILABLE_MODELS = {
    "gpt-4o-mini": "openai/gpt-4o-mini",
    "claude-3.5-sonnet": "anthropic/claude-3.5-sonnet",
    # Many more are available; see the OpenRouter models list
}
```
**Implementation**:
```python
class AIClient:
    def __init__(self, api_key: str, model: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def generate_completion(
        self,
        prompt: str,
        system_message: Optional[str] = None,
        max_tokens: int = 4000,
        temperature: float = 0.7,
        json_mode: bool = False
    ) -> str:
        """
        Generate a completion from the OpenRouter API.

        json_mode: if True, adds response_format={"type": "json_object"}
        """
        pass
```
**Error handling**: Retry 3x with exponential backoff for network/rate limit errors
---
### Task 2.2: Create Prompt Templates
**Files**: `src/generation/prompts/*.json`
**title_generation.json**:
```json
{
  "system_message": "You are an expert SEO content writer...",
  "user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting."
}
```
**outline_generation.json**:
```json
{
  "system_message": "You are an expert content outliner...",
  "user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- {min_h2} to {max_h2} H2 headings\n- {min_h3} to {max_h3} H3 subheadings total\n\nEntities: {entities}\nRelated searches: {related_searches}\n\nReturn as JSON: {\"outline\": [{\"h2\": \"...\", \"h3\": [\"...\", \"...\"]}]}"
}
```
**content_generation.json**:
```json
{
  "system_message": "You are an expert content writer...",
  "user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include: {entities}\nRelated searches: {related_searches}\n\nReturn as HTML fragment with <h2>, <h3>, <p> tags. Do NOT include <html>, <head>, or <body> tags."
}
```
**content_augmentation.json**:
```json
{
  "system_message": "You are an expert content editor...",
  "user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count}\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment."
}
```
---
### Task 2.3: Create PromptManager
**File**: `src/generation/ai_client.py` (add to same file)
```python
class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = prompts_dir
        self.prompts = {}

    def load_prompt(self, prompt_name: str) -> dict:
        """Load prompt from JSON file"""
        pass

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """
        Format prompt with variables
        Returns: (system_message, user_prompt)
        """
        pass
```
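A minimal sketch of how these two methods could be implemented, assuming one JSON file per prompt name as laid out in Task 2.2 (caching and `Path` usage are implementation choices, not requirements):

```python
import json
from pathlib import Path

class PromptManager:
    """Sketch: cache prompt JSON files and format them with str.format."""
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = Path(prompts_dir)
        self.prompts: dict[str, dict] = {}

    def load_prompt(self, prompt_name: str) -> dict:
        # Load once, then serve from the in-memory cache
        if prompt_name not in self.prompts:
            path = self.prompts_dir / f"{prompt_name}.json"
            self.prompts[prompt_name] = json.loads(path.read_text(encoding="utf-8"))
        return self.prompts[prompt_name]

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        prompt = self.load_prompt(prompt_name)
        return prompt["system_message"], prompt["user_prompt"].format(**kwargs)
```

Note that `str.format` treats every `{` and `}` as a placeholder, so templates that embed literal JSON, like the outline example in `outline_generation.json` above, must double those braces (`{{` / `}}`) or use a different substitution mechanism.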
---
## **PHASE 3: Core Generation Pipeline**
### Task 3.1: Implement ContentGenerator Service
**File**: `src/generation/service.py`
```python
class ContentGenerator:
    def __init__(
        self,
        ai_client: AIClient,
        prompt_manager: PromptManager,
        project_repo: ProjectRepository,
        content_repo: GeneratedContentRepository
    ):
        self.ai_client = ai_client
        self.prompt_manager = prompt_manager
        self.project_repo = project_repo
        self.content_repo = content_repo
```
---
### Task 3.2: Implement Stage 1 - Title Generation
**File**: `src/generation/service.py`
```python
def generate_title(self, project_id: int, debug: bool = False) -> str:
    """
    Generate an SEO-optimized title.

    Returns: title string
    Saves to debug_output/title_project_{id}_{timestamp}.txt if debug=True
    """
    # Fetch project
    # Load prompt
    # Call AI
    # If debug: save response to debug_output/
    # Return title
    pass
```
---
### Task 3.3: Implement Stage 2 - Outline Generation
**File**: `src/generation/service.py`
```python
def generate_outline(
    self,
    project_id: int,
    title: str,
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
    debug: bool = False
) -> dict:
    """
    Generate article outline in JSON format.

    Returns: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
    Uses json_mode=True in the AI call to ensure a JSON response.
    Validates: at least min_h2 headings, at least min_h3 total subheadings.
    Saves to debug_output/outline_project_{id}_{timestamp}.json if debug=True
    """
    pass
```
**Validation**:
- Parse JSON response
- Count h2 tags (must be >= min_h2)
- Count total h3 tags across all h2s (must be >= min_h3)
- Raise error if validation fails
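These checks can be sketched as a standalone helper (the name and the choice of `ValueError` are illustrative):

```python
def validate_outline(outline: dict, min_h2: int, min_h3: int) -> None:
    """Raise if the parsed outline has too few H2 or H3 headings."""
    sections = outline.get("outline", [])
    h2_count = len(sections)
    # H3s are nested under each H2 section; count them all
    h3_count = sum(len(section.get("h3", [])) for section in sections)
    if h2_count < min_h2:
        raise ValueError(f"Outline has {h2_count} H2 headings; need at least {min_h2}")
    if h3_count < min_h3:
        raise ValueError(f"Outline has {h3_count} H3 subheadings; need at least {min_h3}")
```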
---
### Task 3.4: Implement Stage 3 - Content Generation
**File**: `src/generation/service.py`
```python
def generate_content(
    self,
    project_id: int,
    title: str,
    outline: dict,
    debug: bool = False
) -> str:
    """
    Generate the full article as an HTML fragment.

    Returns: HTML string with <h2>, <h3>, <p> tags
    Does NOT include <html>, <head>, or <body> tags.
    Saves to debug_output/content_project_{id}_{timestamp}.html if debug=True
    """
    pass
```
**HTML fragment format**:
```html
<h2>First Heading</h2>
<p>Paragraph content...</p>
<h3>Subheading</h3>
<p>More content...</p>
```
---
### Task 3.5: Implement Word Count Validation
**File**: `src/generation/service.py`
```python
def validate_word_count(self, content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """
    Validate content word count.

    Returns: (is_valid, actual_count)
    - is_valid: True if min_words <= actual_count <= max_words
    - actual_count: number of words in content
    Implementation: strip HTML tags, split on whitespace, count tokens.
    """
    pass
```
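A self-contained sketch of this validator, reusing the tag-stripping approach from the Word Count Method in the Dev Notes (shown here as a free function rather than a method):

```python
import re
from html import unescape

def validate_word_count(content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """Strip HTML, count whitespace-separated words, check the range."""
    # Replace tags with a space so words in adjacent elements don't fuse
    text = unescape(re.sub(r'<[^>]+>', ' ', content))
    count = len(text.split())
    return (min_words <= count <= max_words, count)
```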
---
### Task 3.6: Implement Simple Augmentation
**File**: `src/generation/service.py`
```python
def augment_content(
    self,
    project_id: int,
    content: str,
    target_word_count: int,
    debug: bool = False
) -> str:
    """
    Expand article content to meet the minimum word count.

    Called ONLY if word_count < min_word_count.
    Makes ONE API call only.
    Saves to debug_output/augmented_project_{project_id}_{timestamp}.html if debug=True
    """
    pass
```
---
## **PHASE 4: Batch Processing**
### Task 4.1: Create JobConfig Parser
**File**: `src/generation/job_config.py`
```python
from dataclasses import dataclass
from typing import Optional

TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}

@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int

@dataclass
class Job:
    project_id: int
    tiers: dict[str, TierConfig]

class JobConfig:
    def __init__(self, job_file_path: str):
        """Load and parse job file, apply defaults"""
        pass

    def get_jobs(self) -> list[Job]:
        """Return list of all jobs in file"""
        pass

    def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
        """Get tier config with defaults applied"""
        pass
```
---
### Task 4.2: Create BatchProcessor
**File**: `src/generation/batch_processor.py`
```python
class BatchProcessor:
    def __init__(
        self,
        content_generator: ContentGenerator,
        content_repo: GeneratedContentRepository,
        project_repo: ProjectRepository
    ):
        pass

    def process_job(
        self,
        job_file_path: str,
        debug: bool = False,
        continue_on_error: bool = False
    ):
        """
        Process all jobs in the job file.

        For each job:
          0. Validate project configuration (fail fast if invalid)
             - Check project exists
             - Validate money_site_url is set (required for tiered linking strategy)
          For each tier:
            For count times:
              1. Generate title (log to console)
              2. Generate outline
              3. Generate content
              4. Validate word count
              5. If below min, augment once
              6. Save to GeneratedContent table

        Logs progress to console.
        If debug=True, saves AI responses to debug_output/.
        """
        pass
```
**Console output format**:
```
Processing Job 1/3: Project ID 5
  Tier 1: Generating 5 articles
    [1/5] Generating title... "Ultimate Guide to SEO in 2025"
    [1/5] Generating outline... 4 H2s, 8 H3s
    [1/5] Generating content... 1,845 words
    [1/5] Below minimum (2000), augmenting... 2,123 words
    [1/5] Saved (ID: 42, Status: augmented)
    [2/5] Generating title... "Advanced SEO Techniques"
    ...
  Tier 2: Generating 10 articles
  ...
Summary:
  Jobs processed: 3/3
  Articles generated: 45/45
  Augmented: 12
  Failed: 0
```
---
### Task 4.3: Error Handling & Retry Logic
**File**: `src/generation/batch_processor.py`
**Error handling strategy**:
- Project validation errors: Fail fast before generation starts
- Missing project: Abort with clear error
- Missing money_site_url: Abort with clear error (required for all jobs)
- AI API errors: Log error, mark as `status='failed'`, save to DB
- If `continue_on_error=True`: continue to next article
- If `continue_on_error=False`: stop batch processing
- Database errors: Always abort (data integrity)
- Invalid job file: Fail fast with validation error
**Retry logic** (in AIClient):
- Network errors: 3 retries with exponential backoff (1s, 2s, 4s)
- Rate limit errors: Respect Retry-After header
- Other errors: No retry, raise immediately
---
## **PHASE 5: CLI Integration**
### Task 5.1: Add generate-batch Command
**File**: `src/cli/commands.py`
```python
@app.command("generate-batch")
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
              help='Path to job JSON file')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
@click.option('--continue-on-error', is_flag=True,
              help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
              help='AI model to use (gpt-4o-mini, claude-3.5-sonnet)')
def generate_batch(
    job_file: str,
    username: Optional[str],
    password: Optional[str],
    debug: bool,
    continue_on_error: bool,
    model: str
):
    """Generate content batch from job file"""
    # Authenticate user
    # Initialize AIClient with OpenRouter
    # Initialize PromptManager, ContentGenerator, BatchProcessor
    # Call process_job()
    # Show summary
    pass
```
---
### Task 5.2: Add Progress Logging & Debug Output
**File**: `src/generation/batch_processor.py`
**Debug output** (when `--debug` flag used):
- Create `debug_output/` directory if not exists
- For each AI call, save response to file:
- `debug_output/title_project{id}_tier{tier}_{n}_{timestamp}.txt`
- `debug_output/outline_project{id}_tier{tier}_{n}_{timestamp}.json`
- `debug_output/content_project{id}_tier{tier}_{n}_{timestamp}.html`
- `debug_output/augmented_project{id}_tier{tier}_{n}_{timestamp}.html`
- Also echo to console with `click.echo()`
**Normal output** (without `--debug`):
- Always show title when generated: `"Generated title: {title}"`
- Show word counts and status
- Show progress counter `[n/total]`
---
## **PHASE 6: Testing & Validation**
### Task 6.1: Create Unit Tests
#### `tests/unit/test_ai_client.py`
```python
def test_generate_completion_success():
    """Test successful AI completion"""
    pass

def test_generate_completion_json_mode():
    """Test JSON mode returns valid JSON"""
    pass

def test_generate_completion_retry_on_network_error():
    """Test retry logic for network errors"""
    pass
```
#### `tests/unit/test_content_generator.py`
```python
def test_generate_title():
    """Test title generation with mocked AI response"""
    pass

def test_generate_outline_valid_structure():
    """Test outline generation returns valid JSON with min h2/h3"""
    pass

def test_generate_content_html_fragment():
    """Test content is HTML fragment (no <html> tag)"""
    pass

def test_validate_word_count():
    """Test word count validation with various HTML inputs"""
    pass

def test_augment_content_called_once():
    """Test augmentation only called once"""
    pass
```
#### `tests/unit/test_job_config.py`
```python
def test_load_job_config_valid():
    """Test loading valid job file"""
    pass

def test_tier_defaults_applied():
    """Test defaults applied when not in job file"""
    pass

def test_multiple_jobs_in_file():
    """Test parsing file with multiple jobs"""
    pass
```
#### `tests/unit/test_batch_processor.py`
```python
def test_process_job_success():
    """Test successful batch processing"""
    pass

def test_process_job_with_augmentation():
    """Test articles below min word count are augmented"""
    pass

def test_process_job_continue_on_error():
    """Test continue_on_error flag behavior"""
    pass
```
---
### Task 6.2: Create Integration Test
**File**: `tests/integration/test_generate_batch.py`
```python
def test_generate_batch_end_to_end(test_db, mock_ai_client):
    """
    End-to-end test:
    1. Create test project in DB
    2. Create test job file
    3. Run batch processor
    4. Verify GeneratedContent records created
    5. Verify word counts within range
    6. Verify HTML structure
    """
    pass
```
---
### Task 6.3: Create Example Job Files
#### `jobs/example_tier1_batch.json`
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    }
  ]
}
```
(Uses all defaults for tier1)
#### `jobs/example_multi_tier_batch.json`
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2200,
          "max_word_count": 2600
        },
        "tier2": {
          "count": 10
        },
        "tier3": {
          "count": 15,
          "max_h2_tags": 4
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": {
          "count": 3
        }
      }
    }
  ]
}
```
#### `jobs/README.md`
Document job file format and examples
---
## **PHASE 7: Cleanup & Deprecation**
### Task 7.1: Remove Old ContentRuleEngine
**Action**: Delete or gut `src/generation/rule_engine.py`
Only keep if it has reusable utilities. Otherwise remove entirely.
---
### Task 7.2: Remove Old Validator Logic
**Action**: Review `src/generation/validator.py` (if exists)
Remove any strict CORA validation beyond word count. Keep only simple validation utilities.
---
### Task 7.3: Update Documentation
**Files to update**:
- `docs/stories/story-2.2-simplified-ai-content-generation.md` - update status from "In Progress" to "Done"
- `docs/architecture/workflows.md` - Document simplified generation flow
- `docs/architecture/components.md` - Update generation component description
---
## Implementation Order Recommendation
1. **Phase 1** (Data Layer) - Required foundation
2. **Phase 2** (AI Client) - Required for generation
3. **Phase 3** (Core Logic) - Implement one stage at a time, test each
4. **Phase 4** (Batch Processing) - Orchestrate stages
5. **Phase 5** (CLI) - Make accessible to users
6. **Phase 6** (Testing) - Can be done in parallel with implementation
7. **Phase 7** (Cleanup) - Final polish
**Estimated effort**:
- Phase 1-2: 4-6 hours
- Phase 3: 6-8 hours
- Phase 4: 3-4 hours
- Phase 5: 2-3 hours
- Phase 6: 4-6 hours
- Phase 7: 1-2 hours
- **Total**: 20-29 hours
---
## Critical Dev Notes
### OpenRouter Specifics
- API key from environment: `OPENROUTER_API_KEY`
- Model format: `"provider/model-name"`
- Supports OpenAI SDK drop-in replacement
- Rate limits vary by model (check OpenRouter docs)
### HTML Fragment Format
Content generation returns HTML like:
```html
<h2>Main Topic</h2>
<p>Introduction paragraph with relevant keywords and entities.</p>
<h3>Subtopic One</h3>
<p>Detailed content about subtopic.</p>
<h3>Subtopic Two</h3>
<p>More detailed content.</p>
<h2>Second Main Topic</h2>
<p>Content continues...</p>
```
**No document structure**: No `<!DOCTYPE>`, `<html>`, `<head>`, or `<body>` tags.
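Tests (Task 6.1) could enforce the fragment rule with a rough guard like the following (helper name is illustrative; it only looks for document-level tags and deliberately skips `<head`, which would false-positive on `<header>`):

```python
def is_html_fragment(content: str) -> bool:
    """Return True if content looks like a bare fragment, i.e. contains
    no document-level markup such as <!DOCTYPE>, <html>, or <body>."""
    lowered = content.lower()
    return not any(tag in lowered for tag in ("<!doctype", "<html", "<body"))
```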
### Word Count Method
```python
import re
from html import unescape

def count_words(html_content: str) -> int:
    # Replace tags with a space so words in adjacent elements don't fuse
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Unescape HTML entities
    text = unescape(text)
    # Split on whitespace and count
    words = text.split()
    return len(words)
```
### Debug Output Directory
- Create `debug_output/` at project root if not exists
- Add to `.gitignore`
- Filename format: `{stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}`
- Example: `title_project5_tier1_article3_20251020_143022.txt`
### Tier Constants Location
Define in `src/generation/job_config.py` as module-level constant for easy reference.
### Future Extensibility
Job file structure designed to support:
- Custom interlinking rules (Story 2.4+)
- Template selection (Story 3.x)
- Deployment targets (Story 4.x)
- SEO metadata overrides
Keep job parsing flexible to add new fields without breaking existing jobs.
---
## Testing Strategy
### Unit Test Mocking
Mock `AIClient.generate_completion()` to return realistic HTML:
```python
@pytest.fixture
def mock_title_response():
    return "The Ultimate Guide to Sustainable Gardening in 2025"

@pytest.fixture
def mock_outline_response():
    return {
        "outline": [
            {"h2": "Getting Started", "h3": ["Tools", "Planning"]},
            {"h2": "Best Practices", "h3": ["Watering", "Composting"]}
        ]
    }

@pytest.fixture
def mock_content_response():
    return """<h2>Getting Started</h2>
<p>Sustainable gardening begins with proper planning...</p>
<h3>Tools</h3>
<p>Essential tools include...</p>"""
```
### Integration Test Database
Use `conftest.py` fixture with in-memory SQLite and test data:
```python
@pytest.fixture
def test_project(test_db):
    project_repo = ProjectRepository(test_db)
    return project_repo.create(
        user_id=1,
        name="Test Project",
        data={
            "main_keyword": "sustainable gardening",
            "entities": ["composting", "organic soil"],
            "related_searches": ["how to compost", "organic gardening tips"]
        }
    )
```
---
## Success Criteria
Story is complete when:
1. All database models and repositories implemented
2. AIClient successfully calls OpenRouter API
3. Three-stage generation pipeline works end-to-end
4. Batch processor handles multiple jobs/tiers
5. CLI command `generate-batch` functional
6. Debug output saves to `debug_output/` when `--debug` used
7. All unit tests pass
8. Integration test demonstrates full workflow
9. Example job files work correctly
10. Documentation updated
**Acceptance**: Run `generate-batch` on real project, verify content saved to database with correct word count and structure.