Story 2.6: Batch Title Generation

Overview

Refactor title generation so that all titles for a tier are produced in batches before article generation begins. When titles are generated sequentially, one call per article, the titles tend to come out too similar; requesting a whole batch in a single call lets the AI diversify the titles within that batch.

Status

PLANNED

Story Details

As a User, I want all article titles for a tier to be generated together in batches, so that the AI can ensure title diversity and prevent repetitive titles.

Acceptance Criteria

1. Batch Title Generation Before Articles

Status: PENDING

All titles for a tier are generated before any article content generation begins
Titles are generated in batches of 25 (or the tier count if less than 25)
AI prompt instructs generation of N distinct titles in a single call
Each batch request includes instructions to ensure title diversity

2. Title File Persistence

Status: PENDING

Generated titles written to: debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt
One title per line
File is written before article generation loop begins
Titles loaded from file and used sequentially during article generation

3. Console Output

Status: PENDING

Print complete list of generated titles to console after generation
Show title count and batch information
Format: numbered list for easy review

4. Error Handling

Status: PENDING

Retry entire batch on generation failure (up to 3 attempts)
Fail tier processing after 3 failed batch attempts
If AI returns fewer titles than requested (e.g., 20 instead of 25):
- Log warning to console
- Continue with partial batch
- Generate remaining titles in next batch or individually
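The partial-batch rule above can be sketched as a loop that carries any shortfall into the next request. This is a minimal sketch rather than the project's implementation: `collect_titles`, `fetch_batch`, and `max_calls` are hypothetical names introduced here, and the call cap is an added assumption so that a model that keeps under-delivering cannot loop forever.

```python
import math
from typing import Callable, List, Optional


def collect_titles(
    fetch_batch: Callable[[int], List[str]],
    total: int,
    batch_size: int = 25,
    max_calls: Optional[int] = None,
) -> List[str]:
    """Gather `total` titles, asking for at most `batch_size` per call."""
    if max_calls is None:
        # Assumption: allow twice the ideal number of calls before giving up.
        max_calls = 2 * math.ceil(total / batch_size)
    titles: List[str] = []
    calls = 0
    while len(titles) < total and calls < max_calls:
        wanted = min(batch_size, total - len(titles))
        # Keep at most `wanted` titles; extras are discarded.
        received = fetch_batch(wanted)[:wanted]
        calls += 1
        if len(received) < wanted:
            print(
                f"Warning: Requested {wanted} titles but received "
                f"{len(received)}. Continuing with partial batch."
            )
        titles.extend(received)
    return titles
```

If the cap is reached, the function returns fewer than `total` titles, which fits the article loop's behaviour of skipping articles that have no title.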

5. Existing Title Validation

Status: PENDING

Continue to validate individual titles (keyword presence, length)
No new diversity or similarity validation required
Existing validation logic unchanged

6. Backward Compatibility

Status: PENDING

No changes to job file schema
No changes to CLI interface
Transparent change to users
Article generation loop works with pre-generated titles

Implementation Details

Architecture Changes

1. New Prompt Template

File: src/generation/prompts/batch_title_generation.json

Format:

{
  "system_message": "You are an expert creative content writer who creates compelling, search-optimized titles that attract clicks while accurately representing the content topic. When generating multiple titles, ensure each takes a unique angle or approach to maximize diversity. Be creative - the titles just need to be tangentially related to the search topic {keyword}.  ",
  "user_prompt": "Generate {count} distinct, creative titles for articles about: {keyword}\n\nRelated entities: {entities}\nRelated searches: {related_searches}\n\nIMPORTANT: Each title should take a different angle or approach. Ensure diversity across all titles.\n\nReturn exactly {count} titles, one per line. No numbering, quotes, or formatting - just the title text."
}
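As an illustration of how the placeholders in this template might be filled, the sketch below uses plain `str.format`. It is a sketch under stated assumptions: `render_batch_prompt` is a hypothetical helper, the template text is a shortened stand-in for the real file, and joining entities and related searches with commas is a guess at the formatting.

```python
import json
from typing import Dict, List

# Shortened stand-in for batch_title_generation.json (not the full text).
TEMPLATE_JSON = r"""
{
  "system_message": "You write diverse, search-optimized titles about {keyword}.",
  "user_prompt": "Generate {count} distinct, creative titles for articles about: {keyword}\n\nRelated entities: {entities}\nRelated searches: {related_searches}\n\nReturn exactly {count} titles, one per line."
}
"""


def render_batch_prompt(
    template: Dict[str, str],
    keyword: str,
    entities: List[str],
    related_searches: List[str],
    count: int,
) -> Dict[str, str]:
    """Fill the {keyword}, {entities}, {related_searches}, {count} slots."""
    values = {
        "keyword": keyword,
        "entities": ", ".join(entities),  # assumed separator
        "related_searches": ", ".join(related_searches),  # assumed separator
        "count": count,
    }
    # str.format ignores unused keyword arguments, so one dict serves both.
    return {name: text.format(**values) for name, text in template.items()}


prompt = render_batch_prompt(
    json.loads(TEMPLATE_JSON),
    keyword="shaft machining",
    entities=["CNC lathe", "spindle"],
    related_searches=["shaft turning"],
    count=5,
)
print(prompt["user_prompt"])
```

Because `str.format` treats every brace as a placeholder, any literal `{` or `}` added to the prompt text later would need to be doubled.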

2. ContentGenerator Service Enhancement

File: src/generation/service.py

New Method:

from typing import List, Optional  # module-level import needed by the signature

def generate_titles_batch(
    self,
    project_id: int,
    count: int,
    batch_size: int = 25,
    debug: bool = False,
    model: Optional[str] = None
) -> List[str]:
    """
    Generate multiple titles in batches
    
    Args:
        project_id: Project ID to generate titles for
        count: Total number of titles needed
        batch_size: Number of titles per AI call (default: 25)
        debug: If True, save responses to debug_output/
        model: Optional model override for this generation stage
    
    Returns:
        List of generated title strings
    """
    # Load project data
    # Loop in batches of batch_size
    # For each batch:
    #   - Call AI with batch_title_generation prompt
    #   - Parse newline-separated titles
    #   - Validate each title
    #   - Retry batch up to 3 times on failure
    #   - Warn if fewer titles returned than requested
    # Aggregate all titles
    # Return list

Key Details:

Use max_tokens: 100 * batch_size (e.g., 2500 for 25 titles)
Temperature: 0.7 (same as current)
Parse response by splitting on newlines
Strip whitespace, quotes, numbering from each line
Validate each title using existing validation logic
3 retry attempts per batch
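The parsing and token-budget details above can be sketched as follows. This is a sketch, not the service's actual code: `parse_titles` and `max_tokens_for` are hypothetical names, and the set of list markers and quote characters handled is an assumption about what the model might emit.

```python
import re
from typing import List

# Leading list markers such as "1. ", "2) ", "- ", "* " (assumed shapes).
_LEADING_MARKER = re.compile(r"^\s*(?:\d+[.)]\s+|[-*•]\s+)")
# Opening quote -> matching closing quote.
_QUOTE_PAIRS = {'"': '"', "'": "'", "“": "”", "‘": "’"}


def _strip_wrapping_quotes(text: str) -> str:
    """Remove quotes only when they wrap the whole title."""
    if len(text) >= 2 and _QUOTE_PAIRS.get(text[0]) == text[-1]:
        return text[1:-1].strip()
    return text


def parse_titles(response_text: str) -> List[str]:
    """Split on newlines, then strip numbering, whitespace, and quotes."""
    titles = []
    for line in response_text.splitlines():
        cleaned = _LEADING_MARKER.sub("", line).strip()
        cleaned = _strip_wrapping_quotes(cleaned)
        if cleaned:  # skip blank lines
            titles.append(cleaned)
    return titles


def max_tokens_for(batch_size: int, tokens_per_title: int = 100) -> int:
    """Token ceiling for one batch call: 100 tokens per requested title."""
    return tokens_per_title * batch_size
```

The marker pattern requires a `.` or `)` after the digits, so a title that legitimately starts with a number, such as "5 Ways to Cut Cycle Time", is left intact.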

3. BatchProcessor Refactoring

File: src/generation/batch_processor.py

New Method:

def _generate_all_titles_for_tier(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    debug: bool
) -> str:
    """
    Generate all titles for a tier and save to file
    
    Args:
        project_id: Project ID
        tier_name: Name of tier (e.g., "tier1")
        tier_config: Tier configuration
        debug: Debug mode flag
    
    Returns:
        Path to generated titles file
    """
    # Generate timestamp
    # Call self.generator.generate_titles_batch(project_id, count=tier_config.count, debug=debug)
    # Create filename: debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt
    # Write titles to file (one per line)
    # Print titles to console (numbered list)
    # Return file path

Modified Method: _process_tier()

def _process_tier(...):
    """Process a single tier with pre-generated titles"""
    
    # NEW: Generate all titles first
    click.echo(f"\n[{tier_name}] Generating {tier_config.count} titles in batches...")
    titles_file = self._generate_all_titles_for_tier(
        project_id, tier_name, tier_config, debug
    )
    
    # NEW: Load titles from file
    with open(titles_file, 'r', encoding='utf-8') as f:
        titles = [line.strip() for line in f if line.strip()]
    
    click.echo(f"[{tier_name}] Generated {len(titles)} titles")
    click.echo(f"[{tier_name}] Titles saved to: {titles_file}")
    
    # NEW: Print titles to console
    click.echo(f"\n[{tier_name}] Title List:")
    for i, title in enumerate(titles, 1):
        click.echo(f"  {i}. {title}")
    click.echo()
    
    # EXISTING: Loop through articles
    for article_num in range(1, tier_config.count + 1):
        article_index = article_num - 1
        
        # NEW: Get pre-generated title
        if article_index < len(titles):
            title = titles[article_index]
        else:
            click.echo(f"  Warning: Not enough titles generated, skipping article {article_num}")
            continue
        
        # MODIFIED: Call with pre-generated title
        self._generate_single_article(
            project_id=project_id,
            tier_name=tier_name,
            tier_config=tier_config,
            article_num=article_num,
            article_index=article_index,
            title=title,  # NEW PARAMETER
            keyword=keyword,
            resolved_targets=resolved_targets,
            debug=debug
        )

Modified Method: _generate_single_article()

def _generate_single_article(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    article_num: int,
    article_index: int,
    title: str,  # NEW PARAMETER
    keyword: str,
    resolved_targets: Dict[str, int],
    debug: bool
):
    """Generate a single article with pre-generated title"""
    prefix = f"    [{article_num}/{tier_config.count}]"
    
    # ... site assignment logic ...
    
    # REMOVED: Title generation block
    # click.echo(f"{prefix} Generating title...")
    # title = self.generator.generate_title(...)
    
    # NEW: Just use the provided title
    click.echo(f"{prefix} Using title: \"{title}\"")
    
    # EXISTING: Generate outline and content
    click.echo(f"{prefix} Generating outline...")
    outline = self.generator.generate_outline(...)
    # ... rest of method unchanged ...

Console Output Example

[tier1] Generating 5 titles in batches...
[tier1] Generated 5 titles
[tier1] Titles saved to: debug_output/project_1_tier1_titles_20251024_143052.txt

[tier1] Title List:
  1. Complete Guide to Shaft Machining: Techniques and Best Practices
  2. Advanced CNC Shaft Machining: From Setup to Finish
  3. Troubleshooting Common Shaft Machining Challenges
  4. Precision Shaft Manufacturing: Tools and Equipment Guide
  5. How to Optimize Shaft Machining Operations for Higher Output

Processing tier1: 5 articles...
    [1/5] Assigned to site: getcnc.info (ID: 1)
    [1/5] Using title: "Complete Guide to Shaft Machining: Techniques and Best Practices"
    [1/5] Generating outline...
    [1/5] Generated outline: 4 H2s, 8 H3s
    [1/5] Generating content...
    ...

Batch Size Logic

Determining Batch Size:

If tier count <= 25: Use tier count (single batch)
If tier count > 25: Use batches of 25

Examples:

5 articles: 1 batch of 5
20 articles: 1 batch of 20
25 articles: 1 batch of 25
50 articles: 2 batches of 25 each
100 articles: 4 batches of 25 each
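The batch-size rule and the examples above can be expressed as a small planning function. This is a sketch with a hypothetical name, `plan_batches`. The examples only cover counts that fit in one batch or are exact multiples of 25, so the trailing smaller batch for other counts (for example 60 articles) is an assumption.

```python
from typing import List


def plan_batches(total: int, batch_size: int = 25) -> List[int]:
    """Return the number of titles to request in each AI call."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        # Assumption: leftover titles go into one final, smaller batch.
        sizes.append(remainder)
    return sizes


for count in (5, 20, 25, 50, 100, 60):
    print(count, plan_batches(count))
```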

Error Scenarios

Scenario 1: AI Call Fails

Retry entire batch (up to 3 attempts)
After 3 failures: Fail tier processing
Log error message to console

Scenario 2: AI Returns Fewer Titles Than Requested

Warning: Requested 25 titles but received 20. Continuing with partial batch.

Continue with titles received
Process remaining count in next batch

Scenario 3: AI Returns More Titles Than Requested

Use first N titles (where N = requested count)
Discard extras

Scenario 4: Malformed Response

Retry batch (counts toward 3 attempts)
Log parsing error
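The four scenarios can be combined into one retry wrapper around a single batch request, sketched below. The names `request_batch_with_retry`, `BatchGenerationError`, and the `call_ai` callable are hypothetical. Treating a response with no usable lines as "malformed" is an assumption, since the story does not define the term.

```python
from typing import Callable, List, Optional


class BatchGenerationError(RuntimeError):
    """Raised when a batch still fails after the allowed attempts."""


def request_batch_with_retry(
    call_ai: Callable[[int], str],
    requested: int,
    max_attempts: int = 3,
) -> List[str]:
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            raw = call_ai(requested)  # Scenario 1: the call itself may raise
            titles = [ln.strip() for ln in raw.splitlines() if ln.strip()]
            if not titles:  # Scenario 4: nothing usable in the response
                raise ValueError("no titles found in response")
        except Exception as exc:
            last_error = exc
            print(f"Batch attempt {attempt}/{max_attempts} failed: {exc}")
            continue
        if len(titles) < requested:  # Scenario 2: partial batch
            print(
                f"Warning: Requested {requested} titles but received "
                f"{len(titles)}. Continuing with partial batch."
            )
        return titles[:requested]  # Scenario 3: discard extras
    raise BatchGenerationError(
        f"batch failed after {max_attempts} attempts"
    ) from last_error
```

A caller that catches `BatchGenerationError` can then fail the tier, as the first scenario requires.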

File Management

Title File Format:

Complete Guide to Shaft Machining: Techniques and Best Practices
Advanced CNC Shaft Machining: From Setup to Finish
Troubleshooting Common Shaft Machining Challenges
Precision Shaft Manufacturing: Tools and Equipment Guide
How to Optimize Shaft Machining Operations for Higher Output

File Location:

Directory: debug_output/
Naming: project_{project_id}_tier_{tier_name}_titles_{timestamp}.txt
Encoding: UTF-8
Format: One title per line, no extra formatting

File Lifecycle:

Created at start of tier processing
Read once after creation
Preserved for debugging/review
Not deleted after processing
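The file handling described above could look like the following sketch. The helper names (`titles_path`, `write_titles`, `read_titles`) are hypothetical, and the timestamp format `%Y%m%d_%H%M%S` is inferred from the console example. The sketch follows the written naming pattern literally, so a tier named "tier1" yields `project_1_tier_tier1_titles_...`, whereas the console example shows `project_1_tier1_titles_...`.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def titles_path(
    project_id: int,
    tier_name: str,
    now: Optional[datetime] = None,
    directory: str = "debug_output",
) -> Path:
    """Build debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    name = f"project_{project_id}_tier_{tier_name}_titles_{stamp}.txt"
    return Path(directory) / name


def write_titles(path: Path, titles: List[str]) -> None:
    """Write one title per line, UTF-8, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{title}\n" for title in titles), encoding="utf-8")


def read_titles(path: Path) -> List[str]:
    """Read titles back, skipping blank lines."""
    with path.open("r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Accepting `now` as a parameter keeps the filename deterministic in tests.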

Testing Strategy

Unit Tests

File: tests/unit/test_generation_service.py

New tests:

test_generate_titles_batch_single_batch() - 5 titles
test_generate_titles_batch_multiple_batches() - 50 titles
test_generate_titles_batch_exact_25() - 25 titles
test_generate_titles_batch_retry_on_failure() - Failure handling
test_generate_titles_batch_partial_return() - Fewer titles returned
test_generate_titles_batch_validation() - Individual title validation

Integration Tests

File: tests/integration/test_batch_title_generation.py

New tests:

test_tier_processing_with_batch_titles() - Full tier with pre-generated titles
test_title_file_creation_and_loading() - File I/O
test_console_output_formatting() - Output validation
test_multiple_batches_aggregation() - 100 articles across 4 batches

Manual Testing

# Small batch (5 articles)
python main.py generate-batch -j jobs/test_shaft_machining.json -u admin -p password

# Medium batch (20 articles)  
python main.py generate-batch -j jobs/tier2_20articles.json -u admin -p password

# Large batch (100 articles)
python main.py generate-batch -j jobs/tier3_100articles.json -u admin -p password

Validation Checklist:

Titles file created in debug_output/
All titles printed to console
No duplicate/similar titles in batch
Article generation uses pre-generated titles
"Generating title..." message removed from article loop
"Using title: ..." message present instead

Design Decisions

Why Batches of 25?

Balances context window usage vs API efficiency
Allows AI to see enough titles to ensure diversity
Reasonable token budget (max_tokens of 2,500 for a full batch, at 100 tokens per title)
Easy to retry on failure

Why Write to File?

Provides debugging artifact
Separates title generation from article pipeline
Enables manual review if needed
Fault tolerance: titles preserved if article generation crashes

Why Not Store in Database First?

Simpler implementation
No partial GeneratedContent records
Clear separation of concerns
File serves as intermediate format

Why Print to Console?

Immediate visibility for user
Quick sanity check on title quality
Helps identify if batch generation is working
Minimal cost (just console output)

Why Allow Partial Batches?

More resilient to AI inconsistencies
Better than failing entire tier
Warning provides visibility
Can continue processing with available titles

Known Limitations

No Similarity Scoring: Does not quantitatively measure title diversity
No Manual Review Step: Fully automated, no approval gate
Sequential Batches: Batches generated sequentially, not in parallel
Fixed Batch Size: batch_size defaults to 25 in generate_titles_batch() and is not configurable per job
No Title Regeneration: Can't regenerate individual bad titles

Migration Notes

No Breaking Changes:

CLI interface unchanged
Job file schema unchanged
Database schema unchanged
Existing validation unchanged

Transparent to Users:

Only console output differs
New debug files appear
Articles generated same way

Files Created/Modified

New Files:

src/generation/prompts/batch_title_generation.json - Batch title prompt
tests/unit/test_batch_title_generation.py - Unit tests
tests/integration/test_batch_title_generation.py - Integration tests
docs/stories/story-2.6-batch-title-generation.md - This document

Modified Files:

src/generation/service.py - Add generate_titles_batch() method
src/generation/batch_processor.py - Refactor _process_tier() and _generate_single_article()
src/generation/ai_client.py - May need token limit adjustments (if hardcoded)

Performance Impact

Before (Sequential):

Title per article: ~3-5 seconds
25 articles: ~75-125 seconds for titles alone

After (Batch):

25 titles in 1 batch: ~8-12 seconds
25 articles: ~8-12 seconds for all titles

Improvement:

Roughly 84-94% less time spent on title generation (75-125 seconds down to 8-12 seconds for 25 titles)
Better API efficiency (fewer calls)
Improved title diversity (subjective)
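The improvement figure can be checked directly from the estimates above. The short calculation below uses only the numbers stated in this section (3-5 seconds per sequential title, 8-12 seconds per batch of 25); `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(before_s: float, after_s: float) -> float:
    """Percentage of time saved when going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / before_s


ARTICLES = 25
per_title_s = (3, 5)  # sequential: seconds per title (low, high)
batch_s = (8, 12)     # batch: seconds for all 25 titles (low, high)

before_low = per_title_s[0] * ARTICLES   # 75 seconds
before_high = per_title_s[1] * ARTICLES  # 125 seconds

# Least favourable pairing: fastest sequential run vs slowest batch.
worst_case = reduction_pct(before_low, batch_s[1])
# Most favourable pairing: slowest sequential run vs fastest batch.
best_case = reduction_pct(before_high, batch_s[0])

print(f"Time saved: {worst_case:.1f}% to {best_case:.1f}%")  # 84.0% to 93.6%
print(f"API calls for titles: {ARTICLES} -> 1")
```

The quoted ~85% sits at the conservative end of this range.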

Next Steps

After Story 2.6 completion:

Monitor title quality and diversity in production
Consider adding similarity scoring if issues persist
Potential future: Manual review step for Tier 1 titles
Potential future: Configurable batch size in job files

Completion Checklist

Create batch_title_generation.json prompt
Add generate_titles_batch() to ContentGenerator
Add _generate_all_titles_for_tier() to BatchProcessor
Refactor _process_tier() for batch titles
Modify _generate_single_article() signature
Implement title file I/O
Add console output formatting
Implement retry logic (3 attempts)
Implement partial batch handling
Write unit tests
Write integration tests
Manual testing with 5, 20, 100 article batches
Update documentation
Code review

Success Metrics

Primary:

All titles generated before article content generation
Titles stored in debug_output files
Article generation uses pre-generated titles

Secondary:

Subjectively less repetitive titles (manual review)
Faster title generation (roughly 84-94% less time for 25 titles, per the Performance Impact estimates)
No regression in title quality validation

Notes

This change addresses user feedback about title similarity
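Batch generation lets the AI "see" every title requested in the same call (up to 25) and diversify within that batch; the prompt as specified does not include titles from other batches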
File-based approach provides debugging capability
No changes to downstream systems (outline, content, interlinking)
Maintains existing validation and error handling patterns
