Story 2.2: Simplified AI Content Generation - Detailed Task Breakdown

Overview

This document breaks down Story 2.2 into detailed tasks with specific implementation notes.


PHASE 1: Data Model & Schema Design

Task 1.1: Create GeneratedContent Database Model

File: src/database/models.py

Add new model class:

class GeneratedContent(Base):
    __tablename__ = "generated_content"
    
    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
    tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    title: Mapped[str] = mapped_column(Text, nullable=False)
    outline: Mapped[dict] = mapped_column(JSON, nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    status: Mapped[str] = mapped_column(String(20), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime, 
        default=datetime.utcnow, 
        onupdate=datetime.utcnow, 
        nullable=False
    )

Status values: generated, augmented, failed

Update scripts/init_db.py to create the new table.


Task 1.2: Create GeneratedContent Repository

File: src/database/repositories.py

Add repository class:

class GeneratedContentRepository(BaseRepository[GeneratedContent]):
    def __init__(self, session: Session):
        super().__init__(GeneratedContent, session)
    
    def get_by_project_id(self, project_id: int) -> list[GeneratedContent]:
        pass
    
    def get_by_project_and_tier(self, project_id: int, tier: str) -> list[GeneratedContent]:
        pass
    
    def get_by_keyword(self, keyword: str) -> list[GeneratedContent]:
        pass

Task 1.3: Define Job File JSON Schema

File: jobs/README.md (create/update)

Job file structure (one project per job, multiple jobs per file):

{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        },
        "tier2": {
          "count": 10,
          "min_word_count": 1500,
          "max_word_count": 2000,
          "min_h2_tags": 2,
          "max_h2_tags": 4,
          "min_h3_tags": 3,
          "max_h3_tags": 8
        },
        "tier3": {
          "count": 15,
          "min_word_count": 1000,
          "max_word_count": 1500,
          "min_h2_tags": 2,
          "max_h2_tags": 3,
          "min_h3_tags": 2,
          "max_h3_tags": 6
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": { ... }
      }
    }
  ]
}

Tier defaults (constants if not specified in job file):

TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}

Future extensibility note: This structure allows adding more fields per job in future stories.


PHASE 2: AI Client & Prompt Management

Task 2.1: Implement AIClient for OpenRouter

File: src/generation/ai_client.py

OpenRouter API details:

  • Base URL: https://openrouter.ai/api/v1
  • Compatible with OpenAI SDK
  • Requires OPENROUTER_API_KEY env variable

Initial model list:

AVAILABLE_MODELS = {
    "gpt-4o-mini": "openai/gpt-4o-mini",
    "claude-3.5-sonnet": "anthropic/claude-3.5-sonnet",
    # Many other models are available; see the OpenRouter model list
}

Implementation:

class AIClient:
    def __init__(self, api_key: str, model: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model
    
    def generate_completion(
        self, 
        prompt: str, 
        system_message: Optional[str] = None,
        max_tokens: int = 4000,
        temperature: float = 0.7,
        json_mode: bool = False
    ) -> str:
        """
        Generate completion from OpenRouter API
        json_mode: if True, adds response_format={"type": "json_object"}
        """
        pass

Error handling: Retry 3x with exponential backoff for network/rate limit errors
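The retry policy can be isolated into a small helper. A stdlib-only sketch: the real client should catch the OpenAI SDK's exception types (connection and rate-limit errors, honoring Retry-After) rather than the built-in ConnectionError/TimeoutError used here, and the injectable sleep parameter exists purely to make the backoff testable:

```python
import time


def with_retries(call, max_attempts=3, base_delay=1.0,
                 retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry 'call' on retryable errors with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

generate_completion() would wrap its chat-completion call in with_retries().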


Task 2.2: Create Prompt Templates

Files: src/generation/prompts/*.json

title_generation.json:

{
  "system_message": "You are an expert SEO content writer...",
  "user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting."
}

outline_generation.json:

{
  "system_message": "You are an expert content outliner...",
  "user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- {min_h2} to {max_h2} H2 headings\n- {min_h3} to {max_h3} H3 subheadings total\n\nEntities: {entities}\nRelated searches: {related_searches}\n\nReturn as JSON: {\"outline\": [{\"h2\": \"...\", \"h3\": [\"...\", \"...\"]}]}"
}

content_generation.json:

{
  "system_message": "You are an expert content writer...",
  "user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include: {entities}\nRelated searches: {related_searches}\n\nReturn as HTML fragment with <h2>, <h3>, <p> tags. Do NOT include <html>, <head>, or <body> tags."
}

content_augmentation.json:

{
  "system_message": "You are an expert content editor...",
  "user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count}\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment."
}

Task 2.3: Create PromptManager

File: src/generation/ai_client.py (add to same file)

class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = prompts_dir
        self.prompts = {}
    
    def load_prompt(self, prompt_name: str) -> dict:
        """Load prompt from JSON file"""
        pass
    
    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """
        Format prompt with variables
        Returns: (system_message, user_prompt)
        """
        pass
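A minimal sketch of the two methods. One caveat worth baking in: str.format treats every "{" as a placeholder, so literal braces in the template files (e.g. the embedded JSON example in outline_generation.json) must be doubled as "{{" / "}}":

```python
import json
from pathlib import Path


class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = Path(prompts_dir)
        self.prompts: dict[str, dict] = {}

    def load_prompt(self, prompt_name: str) -> dict:
        # Cache prompt files after the first read
        if prompt_name not in self.prompts:
            path = self.prompts_dir / f"{prompt_name}.json"
            self.prompts[prompt_name] = json.loads(path.read_text(encoding="utf-8"))
        return self.prompts[prompt_name]

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        prompt = self.load_prompt(prompt_name)
        # Literal braces in templates must be escaped as {{ / }} for str.format
        return prompt["system_message"], prompt["user_prompt"].format(**kwargs)
```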

PHASE 3: Core Generation Pipeline

Task 3.1: Implement ContentGenerator Service

File: src/generation/service.py

class ContentGenerator:
    def __init__(
        self,
        ai_client: AIClient,
        prompt_manager: PromptManager,
        project_repo: ProjectRepository,
        content_repo: GeneratedContentRepository
    ):
        self.ai_client = ai_client
        self.prompt_manager = prompt_manager
        self.project_repo = project_repo
        self.content_repo = content_repo

Task 3.2: Implement Stage 1 - Title Generation

File: src/generation/service.py

def generate_title(self, project_id: int, debug: bool = False) -> str:
    """
    Generate SEO-optimized title
    
    Returns: title string
    Saves to debug_output/title_project_{id}_{timestamp}.txt if debug=True
    """
    # Fetch project
    # Load prompt
    # Call AI
    # If debug: save response to debug_output/
    # Return title
    pass

Task 3.3: Implement Stage 2 - Outline Generation

File: src/generation/service.py

def generate_outline(
    self, 
    project_id: int, 
    title: str, 
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
    debug: bool = False
) -> dict:
    """
    Generate article outline in JSON format
    
    Returns: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
    
    Uses json_mode=True in AI call to ensure JSON response
    Validates: at least min_h2 headings, at least min_h3 total subheadings
    Saves to debug_output/outline_project_{id}_{timestamp}.json if debug=True
    """
    pass

Validation:

  • Parse JSON response
  • Count h2 tags (must be >= min_h2)
  • Count total h3 tags across all h2s (must be >= min_h3)
  • Raise error if validation fails
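The validation steps above can be sketched as a standalone helper (hypothetical name validate_outline) operating on the parsed JSON:

```python
def validate_outline(outline: dict, min_h2: int, min_h3: int) -> None:
    """Raise ValueError if the outline misses the minimum H2/H3 counts."""
    sections = outline.get("outline")
    if not isinstance(sections, list):
        raise ValueError("Outline response missing top-level 'outline' list")
    h2_count = len(sections)
    # H3s are counted across all H2 sections
    h3_count = sum(len(section.get("h3", [])) for section in sections)
    if h2_count < min_h2:
        raise ValueError(f"Expected at least {min_h2} H2 headings, got {h2_count}")
    if h3_count < min_h3:
        raise ValueError(f"Expected at least {min_h3} H3 subheadings, got {h3_count}")
```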

Task 3.4: Implement Stage 3 - Content Generation

File: src/generation/service.py

def generate_content(
    self, 
    project_id: int, 
    title: str, 
    outline: dict,
    debug: bool = False
) -> str:
    """
    Generate full article HTML fragment
    
    Returns: HTML string with <h2>, <h3>, <p> tags
    Does NOT include <html>, <head>, or <body> tags
    
    Saves to debug_output/content_project_{id}_{timestamp}.html if debug=True
    """
    pass

HTML fragment format:

<h2>First Heading</h2>
<p>Paragraph content...</p>
<h3>Subheading</h3>
<p>More content...</p>

Task 3.5: Implement Word Count Validation

File: src/generation/service.py

def validate_word_count(self, content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """
    Validate content word count
    
    Returns: (is_valid, actual_count)
    - is_valid: True if min_words <= actual_count <= max_words
    - actual_count: number of words in content
    
    Implementation: Strip HTML tags, split on whitespace, count tokens
    """
    pass
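A minimal implementation of the strip-split-count approach described in the docstring. Tags are replaced with a space rather than removed outright so that adjacent elements (e.g. "</h2><p>") do not glue two words together:

```python
import re
from html import unescape


def validate_word_count(content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    # Replace tags with spaces, unescape entities, then count whitespace-split tokens
    text = unescape(re.sub(r"<[^>]+>", " ", content))
    actual = len(text.split())
    return (min_words <= actual <= max_words, actual)
```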

Task 3.6: Implement Simple Augmentation

File: src/generation/service.py

def augment_content(
    self, 
    content: str, 
    target_word_count: int,
    debug: bool = False
) -> str:
    """
    Expand article content to meet minimum word count
    
    Called ONLY if word_count < min_word_count
    Makes ONE API call only
    
    Saves to debug_output/augmented_project_{id}_{timestamp}.html if debug=True
    """
    pass
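The one-augmentation-pass policy can be isolated into a small guard. In this sketch, count_fn and augment_fn are injected callables (hypothetical names) so the control flow is testable without an AI client; the returned status maps onto the model's generated/augmented values:

```python
def ensure_min_word_count(content, min_words, target_words, count_fn, augment_fn):
    """Apply at most one augmentation pass; never loop."""
    if count_fn(content) >= min_words:
        return content, "generated"
    # Exactly one augmentation API call, even if the result is still short
    expanded = augment_fn(content, target_words)
    return expanded, "augmented"
```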

PHASE 4: Batch Processing

Task 4.1: Create JobConfig Parser

File: src/generation/job_config.py

from dataclasses import dataclass
from typing import Optional

TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}

@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int

@dataclass
class Job:
    project_id: int
    tiers: dict[str, TierConfig]

class JobConfig:
    def __init__(self, job_file_path: str):
        """Load and parse job file, apply defaults"""
        pass
    
    def get_jobs(self) -> list[Job]:
        """Return list of all jobs in file"""
        pass
    
    def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
        """Get tier config with defaults applied"""
        pass
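The defaults-merging behavior reduces to a plain dict merge where explicit job-file values win. A sketch (TIER_DEFAULTS abbreviated to tier1 here; note that count has no default, so it is assumed to be required in the job file):

```python
TIER_DEFAULTS = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
    # tier2 / tier3 omitted here for brevity
}


def apply_tier_defaults(tier_name: str, raw: dict) -> dict:
    """Merge job-file tier settings over the tier defaults."""
    defaults = TIER_DEFAULTS.get(tier_name, {})
    merged = {**defaults, **raw}  # explicit values override defaults
    if "count" not in merged:
        raise ValueError(f"'{tier_name}' must specify 'count'")
    return merged
```

The merged dict can then be unpacked into TierConfig(**merged).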

Task 4.2: Create BatchProcessor

File: src/generation/batch_processor.py

class BatchProcessor:
    def __init__(
        self,
        content_generator: ContentGenerator,
        content_repo: GeneratedContentRepository,
        project_repo: ProjectRepository
    ):
        pass
    
    def process_job(
        self, 
        job_file_path: str, 
        debug: bool = False,
        continue_on_error: bool = False
    ):
        """
        Process all jobs in job file
        
        For each job:
          0. Validate project configuration (fail fast if invalid)
             - Check project exists
             - Validate money_site_url is set (required for tiered linking strategy)
          For each tier:
            For count times:
              1. Generate title (log to console)
              2. Generate outline
              3. Generate content
              4. Validate word count
              5. If below min, augment once
              6. Save to GeneratedContent table
        
        Logs progress to console
        If debug=True, saves AI responses to debug_output/
        """
        pass
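The per-article loop body (steps 1-6) can be sketched with the stage functions injected as a dict of callables, which keeps the control flow testable without an AI client or database; persisting the returned record to GeneratedContent is left to the caller:

```python
def generate_article(tier_cfg: dict, stages: dict) -> dict:
    """Run title -> outline -> content, augmenting once if below minimum."""
    title = stages["title"]()
    outline = stages["outline"](title)
    content = stages["content"](title, outline)
    status = "generated"
    word_count = stages["count"](content)
    if word_count < tier_cfg["min_word_count"]:
        # One augmentation pass only (see Task 3.6)
        content = stages["augment"](content, tier_cfg["min_word_count"])
        word_count = stages["count"](content)
        status = "augmented"
    return {"title": title, "outline": outline, "content": content,
            "word_count": word_count, "status": status}
```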

Console output format:

Processing Job 1/3: Project ID 5
  Tier 1: Generating 5 articles
    [1/5] Generating title... "Ultimate Guide to SEO in 2025"
    [1/5] Generating outline... 4 H2s, 8 H3s
    [1/5] Generating content... 1,845 words
    [1/5] Below minimum (2000), augmenting... 2,123 words
    [1/5] Saved (ID: 42, Status: augmented)
    [2/5] Generating title... "Advanced SEO Techniques"
    ...
  Tier 2: Generating 10 articles
    ...
  
Summary:
  Jobs processed: 3/3
  Articles generated: 45/45
  Augmented: 12
  Failed: 0

Task 4.3: Error Handling & Retry Logic

File: src/generation/batch_processor.py

Error handling strategy:

  • Project validation errors: Fail fast before generation starts
    • Missing project: Abort with clear error
    • Missing money_site_url: Abort with clear error (required for all jobs)
  • AI API errors: Log error, mark as status='failed', save to DB
  • If continue_on_error=True: continue to next article
  • If continue_on_error=False: stop batch processing
  • Database errors: Always abort (data integrity)
  • Invalid job file: Fail fast with validation error

Retry logic (in AIClient):

  • Network errors: 3 retries with exponential backoff (1s, 2s, 4s)
  • Rate limit errors: Respect Retry-After header
  • Other errors: No retry, raise immediately

PHASE 5: CLI Integration

Task 5.1: Add generate-batch Command

File: src/cli/commands.py

@app.command("generate-batch")
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True), 
              help='Path to job JSON file')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
@click.option('--continue-on-error', is_flag=True, 
              help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
              help='AI model to use (e.g. gpt-4o-mini, claude-3.5-sonnet)')
def generate_batch(
    job_file: str, 
    username: Optional[str], 
    password: Optional[str],
    debug: bool,
    continue_on_error: bool,
    model: str
):
    """Generate content batch from job file"""
    # Authenticate user
    # Initialize AIClient with OpenRouter
    # Initialize PromptManager, ContentGenerator, BatchProcessor
    # Call process_job()
    # Show summary
    pass

Task 5.2: Add Progress Logging & Debug Output

File: src/generation/batch_processor.py

Debug output (when --debug flag used):

  • Create debug_output/ directory if not exists
  • For each AI call, save response to file:
    • debug_output/title_project{id}_tier{tier}_article{n}_{timestamp}.txt
    • debug_output/outline_project{id}_tier{tier}_article{n}_{timestamp}.json
    • debug_output/content_project{id}_tier{tier}_article{n}_{timestamp}.html
    • debug_output/augmented_project{id}_tier{tier}_article{n}_{timestamp}.html
  • Also echo to console with click.echo()

Normal output (without --debug):

  • Always show title when generated: "Generated title: {title}"
  • Show word counts and status
  • Show progress counter [n/total]

PHASE 6: Testing & Validation

Task 6.1: Create Unit Tests

tests/unit/test_ai_client.py

def test_generate_completion_success():
    """Test successful AI completion"""
    pass

def test_generate_completion_json_mode():
    """Test JSON mode returns valid JSON"""
    pass

def test_generate_completion_retry_on_network_error():
    """Test retry logic for network errors"""
    pass

tests/unit/test_content_generator.py

def test_generate_title():
    """Test title generation with mocked AI response"""
    pass

def test_generate_outline_valid_structure():
    """Test outline generation returns valid JSON with min h2/h3"""
    pass

def test_generate_content_html_fragment():
    """Test content is HTML fragment (no <html> tag)"""
    pass

def test_validate_word_count():
    """Test word count validation with various HTML inputs"""
    pass

def test_augment_content_called_once():
    """Test augmentation only called once"""
    pass

tests/unit/test_job_config.py

def test_load_job_config_valid():
    """Test loading valid job file"""
    pass

def test_tier_defaults_applied():
    """Test defaults applied when not in job file"""
    pass

def test_multiple_jobs_in_file():
    """Test parsing file with multiple jobs"""
    pass

tests/unit/test_batch_processor.py

def test_process_job_success():
    """Test successful batch processing"""
    pass

def test_process_job_with_augmentation():
    """Test articles below min word count are augmented"""
    pass

def test_process_job_continue_on_error():
    """Test continue_on_error flag behavior"""
    pass

Task 6.2: Create Integration Test

File: tests/integration/test_generate_batch.py

def test_generate_batch_end_to_end(test_db, mock_ai_client):
    """
    End-to-end test:
    1. Create test project in DB
    2. Create test job file
    3. Run batch processor
    4. Verify GeneratedContent records created
    5. Verify word counts within range
    6. Verify HTML structure
    """
    pass

Task 6.3: Create Example Job Files

jobs/example_tier1_batch.json

{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    }
  ]
}

(Uses all defaults for tier1)

jobs/example_multi_tier_batch.json

{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2200,
          "max_word_count": 2600
        },
        "tier2": {
          "count": 10
        },
        "tier3": {
          "count": 15,
          "max_h2_tags": 4
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": {
          "count": 3
        }
      }
    }
  ]
}

jobs/README.md

Document job file format and examples


PHASE 7: Cleanup & Deprecation

Task 7.1: Remove Old ContentRuleEngine

Action: Delete src/generation/rule_engine.py

Keep it only if it contains reusable utilities; otherwise remove it entirely.


Task 7.2: Remove Old Validator Logic

Action: Review src/generation/validator.py (if it exists)

Remove any strict CORA validation beyond word count. Keep only simple validation utilities.


Task 7.3: Update Documentation

Files to update:

  • docs/stories/story-2.2-simplified-ai-content-generation.md - Change status from "In Progress" to "Done"
  • docs/architecture/workflows.md - Document simplified generation flow
  • docs/architecture/components.md - Update generation component description

Implementation Order Recommendation

  1. Phase 1 (Data Layer) - Required foundation
  2. Phase 2 (AI Client) - Required for generation
  3. Phase 3 (Core Logic) - Implement one stage at a time, test each
  4. Phase 4 (Batch Processing) - Orchestrate stages
  5. Phase 5 (CLI) - Make accessible to users
  6. Phase 6 (Testing) - Can be done in parallel with implementation
  7. Phase 7 (Cleanup) - Final polish

Estimated effort:

  • Phase 1-2: 4-6 hours
  • Phase 3: 6-8 hours
  • Phase 4: 3-4 hours
  • Phase 5: 2-3 hours
  • Phase 6: 4-6 hours
  • Phase 7: 1-2 hours
  • Total: 20-29 hours

Critical Dev Notes

OpenRouter Specifics

  • API key from environment: OPENROUTER_API_KEY
  • Model format: "provider/model-name"
  • Supports OpenAI SDK drop-in replacement
  • Rate limits vary by model (check OpenRouter docs)

HTML Fragment Format

Content generation returns HTML like:

<h2>Main Topic</h2>
<p>Introduction paragraph with relevant keywords and entities.</p>
<h3>Subtopic One</h3>
<p>Detailed content about subtopic.</p>
<h3>Subtopic Two</h3>
<p>More detailed content.</p>
<h2>Second Main Topic</h2>
<p>Content continues...</p>

No document structure: No <!DOCTYPE>, <html>, <head>, or <body> tags.

Word Count Method

import re
from html import unescape

def count_words(html_content: str) -> int:
    # Replace tags with a space so adjacent elements ("</h2><p>") don't merge words
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Unescape HTML entities
    text = unescape(text)
    # Split and count
    words = text.split()
    return len(words)

Debug Output Directory

  • Create debug_output/ at project root if not exists
  • Add to .gitignore
  • Filename format: {stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}
  • Example: title_project5_tier1_article3_20251020_143022.txt

Tier Constants Location

Define in src/generation/job_config.py as module-level constant for easy reference.

Future Extensibility

Job file structure designed to support:

  • Custom interlinking rules (Story 2.4+)
  • Template selection (Story 3.x)
  • Deployment targets (Story 4.x)
  • SEO metadata overrides

Keep job parsing flexible to add new fields without breaking existing jobs.


Testing Strategy

Unit Test Mocking

Mock AIClient.generate_completion() to return realistic HTML:

@pytest.fixture
def mock_title_response():
    return "The Ultimate Guide to Sustainable Gardening in 2025"

@pytest.fixture
def mock_outline_response():
    return {
        "outline": [
            {"h2": "Getting Started", "h3": ["Tools", "Planning"]},
            {"h2": "Best Practices", "h3": ["Watering", "Composting"]}
        ]
    }

@pytest.fixture
def mock_content_response():
    return """<h2>Getting Started</h2>
<p>Sustainable gardening begins with proper planning...</p>
<h3>Tools</h3>
<p>Essential tools include...</p>"""

Integration Test Database

Use conftest.py fixture with in-memory SQLite and test data:

@pytest.fixture
def test_project(test_db):
    project_repo = ProjectRepository(test_db)
    return project_repo.create(
        user_id=1,
        name="Test Project",
        data={
            "main_keyword": "sustainable gardening",
            "entities": ["composting", "organic soil"],
            "related_searches": ["how to compost", "organic gardening tips"]
        }
    )

Success Criteria

Story is complete when:

  1. All database models and repositories implemented
  2. AIClient successfully calls OpenRouter API
  3. Three-stage generation pipeline works end-to-end
  4. Batch processor handles multiple jobs/tiers
  5. CLI command generate-batch functional
  6. Debug output saves to debug_output/ when --debug used
  7. All unit tests pass
  8. Integration test demonstrates full workflow
  9. Example job files work correctly
  10. Documentation updated

Acceptance: Run generate-batch on real project, verify content saved to database with correct word count and structure.