# Story 2.6: Batch Title Generation

## Overview

Refactor title generation so that all titles for a tier are generated in batches before article generation begins. This prevents the title similarity issues that occur when titles are generated sequentially, one at a time.

## Status

**PLANNED**

## Story Details

**As a User**, I want all article titles for a tier to be generated together in batches, so that the AI can ensure title diversity and prevent repetitive titles.

## Acceptance Criteria

### 1. Batch Title Generation Before Articles

**Status:** PENDING

- All titles for a tier are generated before any article content generation begins
- Titles are generated in batches of 25 (or the tier count if less than 25)
- AI prompt instructs generation of N distinct titles in a single call
- Each batch request includes instructions to ensure title diversity

### 2. Title File Persistence

**Status:** PENDING

- Generated titles written to: `debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt`
- One title per line
- File is written before article generation loop begins
- Titles loaded from file and used sequentially during article generation

### 3. Console Output

**Status:** PENDING

- Print complete list of generated titles to console after generation
- Show title count and batch information
- Format: numbered list for easy review

### 4. Error Handling

**Status:** PENDING

- Retry entire batch on generation failure (up to 3 attempts)
- Fail tier processing after 3 failed batch attempts
- If AI returns fewer titles than requested (e.g., 20 instead of 25):
  - Log warning to console
  - Continue with partial batch
  - Generate remaining titles in next batch or individually

### 5. Existing Title Validation

**Status:** PENDING

- Continue to validate individual titles (keyword presence, length)
- No new diversity or similarity validation required
- Existing validation logic unchanged

### 6. Backward Compatibility

**Status:** PENDING

- No changes to job file schema
- No changes to CLI interface
- Transparent change to users
- Article generation loop works with pre-generated titles

## Implementation Details

### Architecture Changes

#### 1. New Prompt Template

**File:** `src/generation/prompts/batch_title_generation.json`

**Format:**

```json
{
  "system_message": "You are an expert creative content writer who creates compelling, search-optimized titles that attract clicks while accurately representing the content topic. When generating multiple titles, ensure each takes a unique angle or approach to maximize diversity. Be creative - the titles just need to be tangentially related to the search topic {keyword}.",
  "user_prompt": "Generate {count} distinct, creative titles for articles about: {keyword}\n\nRelated entities: {entities}\nRelated searches: {related_searches}\n\nIMPORTANT: Each title should take a different angle or approach. Ensure diversity across all titles.\n\nReturn exactly {count} titles, one per line. No numbering, quotes, or formatting - just the title text."
}
```

#### 2. ContentGenerator Service Enhancement

**File:** `src/generation/service.py`

**New Method:**

```python
from typing import List, Optional


def generate_titles_batch(
    self,
    project_id: int,
    count: int,
    batch_size: int = 25,
    debug: bool = False,
    model: Optional[str] = None
) -> List[str]:
    """
    Generate multiple titles in batches

    Args:
        project_id: Project ID to generate titles for
        count: Total number of titles needed
        batch_size: Number of titles per AI call (default: 25)
        debug: If True, save responses to debug_output/
        model: Optional model override for this generation stage

    Returns:
        List of generated title strings
    """
    # Load project data
    # Loop in batches of batch_size
    # For each batch:
    #   - Call AI with batch_title_generation prompt
    #   - Parse newline-separated titles
    #   - Validate each title
    #   - Retry batch up to 3 times on failure
    #   - Warn if fewer titles returned than requested
    # Aggregate all titles
    # Return list
```

**Key Details:**

- Use max_tokens: 100 * batch_size (e.g., 2500 for 25 titles)
- Temperature: 0.7 (same as current)
- Parse response by splitting on newlines
- Strip whitespace, quotes, numbering from each line
- Validate each title using existing validation logic
- 3 retry attempts per batch

#### 3. BatchProcessor Refactoring

**File:** `src/generation/batch_processor.py`

**New Method:**

```python
def _generate_all_titles_for_tier(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    debug: bool
) -> str:
    """
    Generate all titles for a tier and save to file

    Args:
        project_id: Project ID
        tier_name: Name of tier (e.g., "tier1")
        tier_config: Tier configuration
        debug: Debug mode flag

    Returns:
        Path to generated titles file
    """
    # Generate timestamp
    # Call service.generate_titles_batch(count=tier_config.count)
    # Create filename: debug_output/project_{id}_tier_{name}_titles_{timestamp}.txt
    # Write titles to file (one per line)
    # Return file path (the numbered title list is printed by _process_tier)
```

**Modified Method:** `_process_tier()`

```python
def _process_tier(...):
    """Process a single tier with pre-generated titles"""
    # NEW: Generate all titles first
    click.echo(f"\n[{tier_name}] Generating {tier_config.count} titles in batches...")
    titles_file = self._generate_all_titles_for_tier(
        project_id, tier_name, tier_config, debug
    )

    # NEW: Load titles from file
    with open(titles_file, 'r', encoding='utf-8') as f:
        titles = [line.strip() for line in f if line.strip()]

    click.echo(f"[{tier_name}] Generated {len(titles)} titles")
    click.echo(f"[{tier_name}] Titles saved to: {titles_file}")

    # NEW: Print titles to console
    click.echo(f"\n[{tier_name}] Title List:")
    for i, title in enumerate(titles, 1):
        click.echo(f" {i}. {title}")
    click.echo()

    # EXISTING: Loop through articles
    for article_num in range(1, tier_config.count + 1):
        article_index = article_num - 1

        # NEW: Get pre-generated title
        if article_index < len(titles):
            title = titles[article_index]
        else:
            click.echo(f" Warning: Not enough titles generated, skipping article {article_num}")
            continue

        # MODIFIED: Call with pre-generated title
        self._generate_single_article(
            project_id=project_id,
            tier_name=tier_name,
            tier_config=tier_config,
            article_num=article_num,
            article_index=article_index,
            title=title,  # NEW PARAMETER
            keyword=keyword,
            resolved_targets=resolved_targets,
            debug=debug
        )
```

**Modified Method:** `_generate_single_article()`

```python
from typing import Dict


def _generate_single_article(
    self,
    project_id: int,
    tier_name: str,
    tier_config: TierConfig,
    article_num: int,
    article_index: int,
    title: str,  # NEW PARAMETER
    keyword: str,
    resolved_targets: Dict[str, int],
    debug: bool
):
    """Generate a single article with pre-generated title"""
    prefix = f" [{article_num}/{tier_config.count}]"

    # ... site assignment logic ...

    # REMOVED: Title generation block
    # click.echo(f"{prefix} Generating title...")
    # title = self.generator.generate_title(...)

    # NEW: Just use the provided title
    click.echo(f"{prefix} Using title: \"{title}\"")

    # EXISTING: Generate outline and content
    click.echo(f"{prefix} Generating outline...")
    outline = self.generator.generate_outline(...)
    # ... rest of method unchanged ...
```

### Console Output Example

```
[tier1] Generating 5 titles in batches...
[tier1] Generated 5 titles
[tier1] Titles saved to: debug_output/project_1_tier_tier1_titles_20251024_143052.txt

[tier1] Title List:
 1. Complete Guide to Shaft Machining: Techniques and Best Practices
 2. Advanced CNC Shaft Machining: From Setup to Finish
 3. Troubleshooting Common Shaft Machining Challenges
 4. Precision Shaft Manufacturing: Tools and Equipment Guide
 5. How to Optimize Shaft Machining Operations for Higher Output

Processing tier1: 5 articles...
 [1/5] Assigned to site: getcnc.info (ID: 1)
 [1/5] Using title: "Complete Guide to Shaft Machining: Techniques and Best Practices"
 [1/5] Generating outline...
 [1/5] Generated outline: 4 H2s, 8 H3s
 [1/5] Generating content...
 ...
```

### Batch Size Logic

**Determining Batch Size:**

- If tier count <= 25: Use tier count (single batch)
- If tier count > 25: Use batches of 25 (the final batch holds any remainder)

**Examples:**

- 5 articles: 1 batch of 5
- 20 articles: 1 batch of 20
- 25 articles: 1 batch of 25
- 50 articles: 2 batches of 25 each
- 100 articles: 4 batches of 25 each

### Error Scenarios

**Scenario 1: AI Call Fails**

- Retry entire batch (up to 3 attempts)
- After 3 failures: Fail tier processing
- Log error message to console

**Scenario 2: AI Returns Fewer Titles Than Requested**

```
Warning: Requested 25 titles but received 20. Continuing with partial batch.
```

- Continue with titles received
- Process remaining count in next batch

**Scenario 3: AI Returns More Titles Than Requested**

- Use first N titles (where N = requested count)
- Discard extras

**Scenario 4: Malformed Response**

- Retry batch (counts toward 3 attempts)
- Log parsing error

### File Management

**Title File Format:**

```
Complete Guide to Shaft Machining: Techniques and Best Practices
Advanced CNC Shaft Machining: From Setup to Finish
Troubleshooting Common Shaft Machining Challenges
Precision Shaft Manufacturing: Tools and Equipment Guide
How to Optimize Shaft Machining Operations for Higher Output
```

**File Location:**

- Directory: `debug_output/`
- Naming: `project_{project_id}_tier_{tier_name}_titles_{timestamp}.txt`
- Encoding: UTF-8
- Format: One title per line, no extra formatting

**File Lifecycle:**

- Created at start of tier processing
- Read once after creation
- Preserved for debugging/review
- Not deleted after processing

## Testing Strategy

### Unit Tests

**File:** `tests/unit/test_generation_service.py`

New tests:

- `test_generate_titles_batch_single_batch()` - 5 titles
- `test_generate_titles_batch_multiple_batches()` - 50 titles
- `test_generate_titles_batch_exact_25()` - 25 titles
- `test_generate_titles_batch_retry_on_failure()` - Failure handling
- `test_generate_titles_batch_partial_return()` - Fewer titles returned
- `test_generate_titles_batch_validation()` - Individual title validation

### Integration Tests

**File:** `tests/integration/test_batch_title_generation.py`

New tests:

- `test_tier_processing_with_batch_titles()` - Full tier with pre-generated titles
- `test_title_file_creation_and_loading()` - File I/O
- `test_console_output_formatting()` - Output validation
- `test_multiple_batches_aggregation()` - 100 articles across 4 batches

### Manual Testing

```bash
# Small batch (5 articles)
python main.py generate-batch -j jobs/test_shaft_machining.json -u admin -p password

# Medium batch (20 articles)
python main.py generate-batch -j jobs/tier2_20articles.json -u admin -p password

# Large batch (100 articles)
python main.py generate-batch -j jobs/tier3_100articles.json -u admin -p password
```

**Validation Checklist:**

- [ ] Titles file created in debug_output/
- [ ] All titles printed to console
- [ ] No duplicate/similar titles in batch
- [ ] Article generation uses pre-generated titles
- [ ] "Generating title..." message removed from article loop
- [ ] "Using title: ..." message present instead

## Design Decisions

### Why Batches of 25?

- Balances context window usage vs API efficiency
- Allows AI to see enough titles to ensure diversity
- Reasonable token count (~2500 output tokens)
- Easy to retry on failure

### Why Write to File?

- Provides debugging artifact
- Separates title generation from article pipeline
- Enables manual review if needed
- Fault tolerance: titles preserved if article generation crashes

### Why Not Store in Database First?

- Simpler implementation
- No partial GeneratedContent records
- Clear separation of concerns
- File serves as intermediate format

### Why Print to Console?

- Immediate visibility for user
- Quick sanity check on title quality
- Helps identify if batch generation is working
- Minimal cost (just console output)

### Why Allow Partial Batches?

- More resilient to AI inconsistencies
- Better than failing entire tier
- Warning provides visibility
- Can continue processing with available titles

## Known Limitations

1. **No Similarity Scoring**: Does not quantitatively measure title diversity
2. **No Manual Review Step**: Fully automated, no approval gate
3. **Sequential Batches**: Batches generated sequentially, not in parallel
4. **Fixed Batch Size**: 25 is hardcoded (not configurable per job)
5. **No Title Regeneration**: Can't regenerate individual bad titles

## Migration Notes

**No Breaking Changes:**

- CLI interface unchanged
- Job file schema unchanged
- Database schema unchanged
- Existing validation unchanged

**Transparent to Users:**

- Only console output differs
- New debug files appear
- Articles generated same way

## Files Created/Modified

### New Files:

- `src/generation/prompts/batch_title_generation.json` - Batch title prompt
- `tests/unit/test_batch_title_generation.py` - Unit tests
- `tests/integration/test_batch_title_generation.py` - Integration tests
- `docs/stories/story-2.6-batch-title-generation.md` - This document

### Modified Files:

- `src/generation/service.py` - Add generate_titles_batch() method
- `src/generation/batch_processor.py` - Refactor _process_tier() and _generate_single_article()
- `src/generation/ai_client.py` - May need token limit adjustments (if hardcoded)

## Performance Impact

**Before (Sequential):**

- Title per article: ~3-5 seconds
- 25 articles: ~75-125 seconds for titles alone

**After (Batch):**

- 25 titles in 1 batch: ~8-12 seconds
- 25 articles: ~8-12 seconds for all titles

**Improvement:**

- ~85% faster title generation
- Better API efficiency (fewer calls)
- Improved title diversity (subjective)

## Next Steps

After Story 2.6 completion:

- Monitor title quality and diversity in production
- Consider adding similarity scoring if issues persist
- Potential future: Manual review step for Tier 1 titles
- Potential future: Configurable batch size in job files

## Completion Checklist

- [ ] Create batch_title_generation.json prompt
- [ ] Add generate_titles_batch() to ContentGenerator
- [ ] Add _generate_all_titles_for_tier() to BatchProcessor
- [ ] Refactor _process_tier() for batch titles
- [ ] Modify _generate_single_article() signature
- [ ] Implement title file I/O
- [ ] Add console output formatting
- [ ] Implement retry logic (3 attempts)
- [ ] Implement partial batch handling
- [ ] Write unit tests
- [ ] Write integration tests
- [ ] Manual testing with 5, 20, 100 article batches
- [ ] Update documentation
- [ ] Code review

## Success Metrics

**Primary:**

- All titles generated before article content generation
- Titles stored in debug_output files
- Article generation uses pre-generated titles

**Secondary:**

- Subjectively less repetitive titles (manual review)
- Faster title generation (85% improvement)
- No regression in title quality validation

## Notes

- This change addresses user feedback about title similarity
- Batch generation allows AI to "see" all titles and ensure diversity
- File-based approach provides debugging capability
- No changes to downstream systems (outline, content, interlinking)
- Maintains existing validation and error handling patterns
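
## Appendix: Batch Planning and Parsing Sketch

The batch-size rules and the response clean-up steps described under Implementation Details (split on newlines, strip whitespace, quotes, and numbering, keep only the first N titles) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper names `plan_batches` and `parse_titles` and the list-marker pattern are assumptions introduced here, and the existing per-title validation is deliberately left out.

```python
import re
from typing import List

DEFAULT_BATCH_SIZE = 25  # batch size named in this story


def plan_batches(total: int, batch_size: int = DEFAULT_BATCH_SIZE) -> List[int]:
    """Split a tier's title count into per-call batch sizes (hypothetical helper)."""
    if total <= 0:
        return []
    full, remainder = divmod(total, batch_size)
    # Full batches first, then one smaller batch for any remainder.
    return [batch_size] * full + ([remainder] if remainder else [])


# Leading list markers such as "1. ", "2) ", "- " or "* " (assumed pattern).
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def parse_titles(response_text: str, requested: int) -> List[str]:
    """Turn a newline-separated response into at most `requested` titles."""
    titles: List[str] = []
    for line in response_text.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        # Remove quotes only when they wrap the whole title.
        if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "\"'":
            cleaned = cleaned[1:-1].strip()
        if cleaned:
            titles.append(cleaned)
    # Scenario 3: discard extras beyond the requested count.
    return titles[:requested]
```

A caller would compare `len(parse_titles(...))` with the requested count to decide whether to log the partial-batch warning and carry the shortfall into the next batch.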