# Story 2.6: Batch Title Generation

## Overview
Refactor title generation so that all titles for a tier are generated in batches before article generation begins. This reduces the title similarity that occurs when titles are generated one at a time, because a single batch call lets the AI see the other titles it is producing.

## Status
**PLANNED**

## Story Details
**As a User**, I want all article titles for a tier to be generated together in batches, so that the AI can ensure title diversity and prevent repetitive titles.

## Acceptance Criteria

### 1. Batch Title Generation Before Articles
**Status:** PENDING

- All titles for a tier are generated before any article content generation begins
- Titles are generated in batches of 25 (or the tier count if less than 25)
- AI prompt instructs generation of N distinct titles in a single call
- Each batch request includes instructions to ensure title diversity

### 2. Title File Persistence
**Status:** PENDING

- Generated titles written to: `debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt`
- One title per line
- File is written before article generation loop begins
- Titles loaded from file and used sequentially during article generation

### 3. Console Output
**Status:** PENDING

- Print complete list of generated titles to console after generation
- Show title count and batch information
- Format: numbered list for easy review

### 4. Error Handling
**Status:** PENDING

- Retry entire batch on generation failure (up to 3 attempts)
- Fail tier processing after 3 failed batch attempts
- If AI returns fewer titles than requested (e.g., 20 instead of 25):
  - Log warning to console
  - Continue with partial batch
  - Generate remaining titles in next batch or individually

### 5. Existing Title Validation
**Status:** PENDING

- Continue to validate individual titles (keyword presence, length)
- No new diversity or similarity validation required
- Existing validation logic unchanged

### 6. Backward Compatibility
**Status:** PENDING

- No changes to job file schema
- No changes to CLI interface
- Transparent change to users
- Article generation loop works with pre-generated titles

## Implementation Details

### Architecture Changes

#### 1. New Prompt Template
**File:** `src/generation/prompts/batch_title_generation.json`

**Format:**
```json
{
  "system_message": "You are an expert creative content writer who creates compelling, search-optimized titles that attract clicks while accurately representing the content topic. When generating multiple titles, ensure each takes a unique angle or approach to maximize diversity. Be creative - the titles just need to be tangentially related to the search topic {keyword}.  ",
  "user_prompt": "Generate {count} distinct, creative titles for articles about: {keyword}\n\nRelated entities: {entities}\nRelated searches: {related_searches}\n\nIMPORTANT: Each title should take a different angle or approach. Ensure diversity across all titles.\n\nReturn exactly {count} titles, one per line. No numbering, quotes, or formatting - just the title text."
}
```
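The placeholders in this template (`{count}`, `{keyword}`, `{entities}`, `{related_searches}`) can be filled with plain `str.format`. The following is a minimal sketch, not the project's actual loader: `render_batch_title_prompt` is a hypothetical helper, the template text is abbreviated, and joining entities and related searches with commas is an assumption.

```python
import json
from typing import Dict, List

# Abbreviated copy of the template, for illustration only. The real file is
# src/generation/prompts/batch_title_generation.json.
TEMPLATE_JSON = """
{
  "system_message": "You create search-optimized titles related to {keyword}.",
  "user_prompt": "Generate {count} distinct, creative titles for articles about: {keyword}\\n\\nRelated entities: {entities}\\nRelated searches: {related_searches}\\n\\nReturn exactly {count} titles, one per line."
}
"""


def render_batch_title_prompt(
    template: Dict[str, str],
    keyword: str,
    count: int,
    entities: List[str],
    related_searches: List[str],
) -> Dict[str, str]:
    """Fill every placeholder in both the system message and the user prompt."""
    values = {
        "keyword": keyword,
        "count": count,
        # Assumption: list values are rendered as comma-separated text.
        "entities": ", ".join(entities),
        "related_searches": ", ".join(related_searches),
    }
    return {name: text.format(**values) for name, text in template.items()}


template = json.loads(TEMPLATE_JSON)
prompt = render_batch_title_prompt(
    template,
    keyword="shaft machining",
    count=5,
    entities=["CNC lathe", "tolerance"],
    related_searches=["shaft machining process"],
)
print(prompt["user_prompt"])
```

Because `str.format` treats every brace in the template as a placeholder, any literal `{` or `}` added to the prompt text later would need to be doubled (`{{`, `}}`).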

#### 2. ContentGenerator Service Enhancement
**File:** `src/generation/service.py`

**New Method:**
```python
def generate_titles_batch(
    self, 
    project_id: int, 
    count: int, 
    batch_size: int = 25,
    debug: bool = False,
    model: Optional[str] = None
) -> List[str]:
    """
    Generate multiple titles in batches
    
    Args:
        project_id: Project ID to generate titles for
        count: Total number of titles needed
        batch_size: Number of titles per AI call (default: 25)
        debug: If True, save responses to debug_output/
        model: Optional model override for this generation stage
    
    Returns:
        List of generated title strings
    """
    # Load project data
    # Loop in batches of batch_size
    # For each batch:
    #   - Call AI with batch_title_generation prompt
    #   - Parse newline-separated titles
    #   - Validate each title
    #   - Retry batch up to 3 times on failure
    #   - Warn if fewer titles returned than requested
    # Aggregate all titles
    # Return list
```

**Key Details:**
- Use max_tokens: 100 * batch_size (e.g., 2500 for 25 titles)
- Temperature: 0.7 (same as current)
- Parse response by splitting on newlines
- Strip whitespace, quotes, numbering from each line
- Validate each title using existing validation logic
- 3 retry attempts per batch
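
The parsing steps above (split on newlines, strip whitespace, quotes, and numbering) could look roughly like the sketch below. `parse_titles` is a hypothetical helper name, and the exact set of numbering and quote styles it strips is an assumption; the existing per-title validation would run on its output.

```python
import re
from typing import List

# Leading list markers such as "1. ", "2) ", "- ", "* ". A bare number followed
# by a space ("5 Ways to ...") is deliberately NOT treated as numbering.
_NUMBERING = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")

# Opening quote -> matching closing quote.
_QUOTE_PAIRS = {'"': '"', "'": "'", "“": "”", "‘": "’"}


def parse_titles(response_text: str) -> List[str]:
    """Turn a newline-separated AI response into a clean list of titles."""
    titles = []
    for line in response_text.splitlines():
        title = _NUMBERING.sub("", line).strip()
        # Remove quotes only when they wrap the whole title.
        if len(title) >= 2 and _QUOTE_PAIRS.get(title[0]) == title[-1]:
            title = title[1:-1].strip()
        if title:  # skip blank lines
            titles.append(title)
    return titles


raw = '1. "Alpha Guide"\n\n  2) Beta Tips  \n- Gamma\n5 Ways to Win'
print(parse_titles(raw))
```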

#### 3. BatchProcessor Refactoring
**File:** `src/generation/batch_processor.py`

**New Method:**
```python
def _generate_all_titles_for_tier(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    debug: bool
) -> str:
    """
    Generate all titles for a tier and save to file
    
    Args:
        project_id: Project ID
        tier_name: Name of tier (e.g., "tier1")
        tier_config: Tier configuration
        debug: Debug mode flag
    
    Returns:
        Path to generated titles file
    """
    # Generate timestamp
    # Call service.generate_titles_batch(count=tier_config.count)
    # Create filename: debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt
    # Write titles to file (one per line)
    # (The numbered title list is printed by the caller, _process_tier())
    # Return file path
```

**Modified Method:** `_process_tier()`
```python
def _process_tier(...):
    """Process a single tier with pre-generated titles"""
    
    # NEW: Generate all titles first
    click.echo(f"\n[{tier_name}] Generating {tier_config.count} titles in batches...")
    titles_file = self._generate_all_titles_for_tier(
        project_id, tier_name, tier_config, debug
    )
    
    # NEW: Load titles from file
    with open(titles_file, 'r', encoding='utf-8') as f:
        titles = [line.strip() for line in f if line.strip()]
    
    click.echo(f"[{tier_name}] Generated {len(titles)} titles")
    click.echo(f"[{tier_name}] Titles saved to: {titles_file}")
    
    # NEW: Print titles to console
    click.echo(f"\n[{tier_name}] Title List:")
    for i, title in enumerate(titles, 1):
        click.echo(f"  {i}. {title}")
    click.echo()
    
    # EXISTING: Loop through articles
    for article_num in range(1, tier_config.count + 1):
        article_index = article_num - 1
        
        # NEW: Get pre-generated title
        if article_index < len(titles):
            title = titles[article_index]
        else:
            click.echo(f"  Warning: Not enough titles generated, skipping article {article_num}")
            continue
        
        # MODIFIED: Call with pre-generated title
        self._generate_single_article(
            project_id=project_id,
            tier_name=tier_name,
            tier_config=tier_config,
            article_num=article_num,
            article_index=article_index,
            title=title,  # NEW PARAMETER
            keyword=keyword,
            resolved_targets=resolved_targets,
            debug=debug
        )
```

**Modified Method:** `_generate_single_article()`
```python
def _generate_single_article(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    article_num: int,
    article_index: int,
    title: str,  # NEW PARAMETER
    keyword: str,
    resolved_targets: Dict[str, int],
    debug: bool
):
    """Generate a single article with pre-generated title"""
    prefix = f"    [{article_num}/{tier_config.count}]"
    
    # ... site assignment logic ...
    
    # REMOVED: Title generation block
    # click.echo(f"{prefix} Generating title...")
    # title = self.generator.generate_title(...)
    
    # NEW: Just use the provided title
    click.echo(f"{prefix} Using title: \"{title}\"")
    
    # EXISTING: Generate outline and content
    click.echo(f"{prefix} Generating outline...")
    outline = self.generator.generate_outline(...)
    # ... rest of method unchanged ...
```

### Console Output Example

```
[tier1] Generating 5 titles in batches...
[tier1] Generated 5 titles
[tier1] Titles saved to: debug_output/project_1_tier_tier1_titles_20251024_143052.txt

[tier1] Title List:
  1. Complete Guide to Shaft Machining: Techniques and Best Practices
  2. Advanced CNC Shaft Machining: From Setup to Finish
  3. Troubleshooting Common Shaft Machining Challenges
  4. Precision Shaft Manufacturing: Tools and Equipment Guide
  5. How to Optimize Shaft Machining Operations for Higher Output

Processing tier1: 5 articles...
    [1/5] Assigned to site: getcnc.info (ID: 1)
    [1/5] Using title: "Complete Guide to Shaft Machining: Techniques and Best Practices"
    [1/5] Generating outline...
    [1/5] Generated outline: 4 H2s, 8 H3s
    [1/5] Generating content...
    ...
```

### Batch Size Logic

**Determining Batch Size:**
- If tier count <= 25: Use tier count (single batch)
- If tier count > 25: Use batches of 25

**Examples:**
- 5 articles: 1 batch of 5
- 20 articles: 1 batch of 20
- 25 articles: 1 batch of 25
- 50 articles: 2 batches of 25 each
- 100 articles: 4 batches of 25 each
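
The batch-size rule can be expressed as a small helper. This is a sketch under one assumption: the examples above only cover counts up to 25 or exact multiples of 25, so the handling of a remainder (for example 60 articles) as a smaller final batch is inferred rather than stated. `plan_batches` is a hypothetical name.

```python
from typing import List


def plan_batches(total: int, batch_size: int = 25) -> List[int]:
    """Return the number of titles to request in each AI call."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total, batch_size)
    # Assumption: any remainder becomes one smaller final batch.
    return [batch_size] * full_batches + ([remainder] if remainder else [])


for total in (5, 20, 25, 50, 60, 100):
    print(total, plan_batches(total))
```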

### Error Scenarios

**Scenario 1: AI Call Fails**
- Retry entire batch (up to 3 attempts)
- After 3 failures: Fail tier processing
- Log error message to console

**Scenario 2: AI Returns Fewer Titles Than Requested**
```
Warning: Requested 25 titles but received 20. Continuing with partial batch.
```
- Continue with titles received
- Request the missing titles as part of the next batch

**Scenario 3: AI Returns More Titles Than Requested**
- Use first N titles (where N = requested count)
- Discard extras

**Scenario 4: Malformed Response**
- Retry batch (counts toward 3 attempts)
- Log parsing error
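
The four scenarios can be combined into one loop. The sketch below is illustrative only: `request_batch` stands in for whatever callable performs the AI call and parsing, and `TitleGenerationError`, `request_with_retry`, and `collect_titles` are hypothetical names. It also assumes that an empty response counts as a malformed response.

```python
from typing import Callable, List


class TitleGenerationError(RuntimeError):
    """Raised when a batch still fails after all retry attempts."""


def request_with_retry(
    request_batch: Callable[[int], List[str]], count: int, max_attempts: int = 3
) -> List[str]:
    """Scenarios 1, 3, 4: retry on failure, discard extra titles."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            titles = request_batch(count)
            if not titles:
                # Assumption: an empty result is treated as malformed.
                raise ValueError("response contained no parseable titles")
            return titles[:count]  # Scenario 3: keep only the first N
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            print(f"Batch attempt {attempt}/{max_attempts} failed: {exc}")
    raise TitleGenerationError(
        f"batch failed after {max_attempts} attempts"
    ) from last_error


def collect_titles(
    request_batch: Callable[[int], List[str]],
    total: int,
    batch_size: int = 25,
    max_attempts: int = 3,
) -> List[str]:
    """Scenario 2: a short batch is kept and the shortfall is requested next."""
    titles: List[str] = []
    while len(titles) < total:
        wanted = min(batch_size, total - len(titles))
        batch = request_with_retry(request_batch, wanted, max_attempts)
        if len(batch) < wanted:
            print(
                f"Warning: Requested {wanted} titles but received {len(batch)}. "
                "Continuing with partial batch."
            )
        titles.extend(batch)
    return titles
```

Every successful batch contributes at least one title, so the loop always terminates. A real implementation might still cap the total number of calls in case the AI repeatedly returns very short batches.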

### File Management

**Title File Format:**
```
Complete Guide to Shaft Machining: Techniques and Best Practices
Advanced CNC Shaft Machining: From Setup to Finish
Troubleshooting Common Shaft Machining Challenges
Precision Shaft Manufacturing: Tools and Equipment Guide
How to Optimize Shaft Machining Operations for Higher Output
```

**File Location:**
- Directory: `debug_output/`
- Naming: `project_{project_id}_tier_{tier_name}_titles_{timestamp}.txt`
- Encoding: UTF-8
- Format: One title per line, no extra formatting

**File Lifecycle:**
- Created at start of tier processing
- Read once after creation
- Preserved for debugging/review
- Not deleted after processing
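
A minimal sketch of the file handling described above, following the naming pattern literally. The helper names are hypothetical, and the timestamp format `%Y%m%d_%H%M%S` is inferred from the console output example rather than specified by this story.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def titles_file_path(
    project_id: int,
    tier_name: str,
    output_dir: str = "debug_output",
    now: Optional[datetime] = None,
) -> Path:
    """Build project_{id}_tier_{name}_titles_{timestamp}.txt under output_dir."""
    # Assumption: timestamp format inferred from the console example.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    name = f"project_{project_id}_tier_{tier_name}_titles_{timestamp}.txt"
    return Path(output_dir) / name


def write_titles(path: Path, titles: List[str]) -> None:
    """Write one title per line as UTF-8, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(titles) + "\n", encoding="utf-8")


def read_titles(path: Path) -> List[str]:
    """Load titles back, ignoring blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Round trip in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = titles_file_path(1, "tier1", output_dir=tmp)
    write_titles(path, ["Troubleshooting Common Shaft Machining Challenges"])
    print(path.name, read_titles(path))
```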

## Testing Strategy

### Unit Tests
**File:** `tests/unit/test_generation_service.py`

New tests:
- `test_generate_titles_batch_single_batch()` - 5 titles
- `test_generate_titles_batch_multiple_batches()` - 50 titles
- `test_generate_titles_batch_exact_25()` - 25 titles
- `test_generate_titles_batch_retry_on_failure()` - Failure handling
- `test_generate_titles_batch_partial_return()` - Fewer titles returned
- `test_generate_titles_batch_validation()` - Individual title validation

### Integration Tests
**File:** `tests/integration/test_batch_title_generation.py`

New tests:
- `test_tier_processing_with_batch_titles()` - Full tier with pre-generated titles
- `test_title_file_creation_and_loading()` - File I/O
- `test_console_output_formatting()` - Output validation
- `test_multiple_batches_aggregation()` - 100 articles across 4 batches

### Manual Testing
```bash
# Small batch (5 articles)
python main.py generate-batch -j jobs/test_shaft_machining.json -u admin -p password

# Medium batch (20 articles)
python main.py generate-batch -j jobs/tier2_20articles.json -u admin -p password

# Large batch (100 articles)
python main.py generate-batch -j jobs/tier3_100articles.json -u admin -p password
```

**Validation Checklist:**
- [ ] Titles file created in debug_output/
- [ ] All titles printed to console
- [ ] No duplicate/similar titles in batch
- [ ] Article generation uses pre-generated titles
- [ ] "Generating title..." message removed from article loop
- [ ] "Using title: ..." message present instead

## Design Decisions

### Why Batches of 25?
- Balances context window usage vs API efficiency
- Allows AI to see enough titles to ensure diversity
- Reasonable output budget (max_tokens of 2500 for a full batch of 25)
- Easy to retry on failure

### Why Write to File?
- Provides debugging artifact
- Separates title generation from article pipeline
- Enables manual review if needed
- Fault tolerance: titles preserved if article generation crashes

### Why Not Store in Database First?
- Simpler implementation
- No partial GeneratedContent records
- Clear separation of concerns
- File serves as intermediate format

### Why Print to Console?
- Immediate visibility for user
- Quick sanity check on title quality
- Helps identify if batch generation is working
- Minimal cost (just console output)

### Why Allow Partial Batches?
- More resilient to AI inconsistencies
- Better than failing entire tier
- Warning provides visibility
- Can continue processing with available titles

## Known Limitations

1. **No Similarity Scoring**: Does not quantitatively measure title diversity
2. **No Manual Review Step**: Fully automated, no approval gate
3. **Sequential Batches**: Batches generated sequentially, not in parallel
4. **Fixed Batch Size**: 25 is hardcoded (not configurable per job)
5. **No Title Regeneration**: Can't regenerate individual bad titles

## Migration Notes

**No Breaking Changes:**
- CLI interface unchanged
- Job file schema unchanged
- Database schema unchanged
- Existing validation unchanged

**Transparent to Users:**
- Only console output differs
- New debug files appear
- Articles generated same way

## Files Created/Modified

### New Files:
- `src/generation/prompts/batch_title_generation.json` - Batch title prompt
- `tests/unit/test_batch_title_generation.py` - Unit tests
- `tests/integration/test_batch_title_generation.py` - Integration tests
- `docs/stories/story-2.6-batch-title-generation.md` - This document

### Modified Files:
- `src/generation/service.py` - Add generate_titles_batch() method
- `src/generation/batch_processor.py` - Refactor _process_tier() and _generate_single_article()
- `src/generation/ai_client.py` - May need token limit adjustments (if hardcoded)

## Performance Impact

**Before (Sequential):**
- Title per article: ~3-5 seconds
- 25 articles: ~75-125 seconds for titles alone

**After (Batch):**
- 25 titles in 1 batch: ~8-12 seconds
- 25 articles: ~8-12 seconds for all titles

**Improvement:**
- Roughly 85-90% less time spent on title generation for a 25-article tier (based on the estimates above)
- Better API efficiency (fewer calls)
- Improved title diversity (subjective)
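
As a quick check of that figure, the arithmetic can be reproduced from the timing ranges above. Those ranges are this story's estimates, not measurements, so the result is only as reliable as they are.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage of wall-clock time saved going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100


articles = 25
sequential_s = (articles * 3, articles * 5)  # 3-5 s per title -> 75-125 s
batch_s = (8, 12)                            # one batch call for 25 titles

# Least favourable pairing: fastest sequential run vs. slowest batch run.
worst = percent_reduction(sequential_s[0], batch_s[1])
# Most favourable pairing: slowest sequential run vs. fastest batch run.
best = percent_reduction(sequential_s[1], batch_s[0])

print(f"Estimated time saved: {worst:.1f}% to {best:.1f}%")
```

With these inputs the saving ranges from 84.0% to 93.6%, so the quoted figure sits at the conservative end of the range.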

## Next Steps

After Story 2.6 completion:
- Monitor title quality and diversity in production
- Consider adding similarity scoring if issues persist
- Potential future: Manual review step for Tier 1 titles
- Potential future: Configurable batch size in job files

## Completion Checklist

- [ ] Create batch_title_generation.json prompt
- [ ] Add generate_titles_batch() to ContentGenerator
- [ ] Add _generate_all_titles_for_tier() to BatchProcessor
- [ ] Refactor _process_tier() for batch titles
- [ ] Modify _generate_single_article() signature
- [ ] Implement title file I/O
- [ ] Add console output formatting
- [ ] Implement retry logic (3 attempts)
- [ ] Implement partial batch handling
- [ ] Write unit tests
- [ ] Write integration tests
- [ ] Manual testing with 5, 20, 100 article batches
- [ ] Update documentation
- [ ] Code review

## Success Metrics

**Primary:**
- All titles generated before article content generation
- Titles stored in debug_output files
- Article generation uses pre-generated titles

**Secondary:**
- Subjectively less repetitive titles (manual review)
- Faster title generation (roughly 85% less time, per the estimates in Performance Impact)
- No regression in title quality validation

## Notes

- This change addresses user feedback about title similarity
- Batch generation allows the AI to "see" all titles within a batch (up to 25) and vary them; titles from separate batches are not compared with each other
- File-based approach provides debugging capability
- No changes to downstream systems (outline, content, interlinking)
- Maintains existing validation and error handling patterns