# Technical Debt & Future Enhancements

This document tracks technical debt, future enhancements, and features that were deferred from the MVP.

---

## Story 1.6: Deployment Infrastructure Management

### Domain Health Check / Verification Status

**Priority**: Medium
**Epic Suggestion**: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
**Estimated Effort**: Small (1-2 days)

#### Problem

After importing or provisioning sites, there's no way to verify that:

- Domain ownership is still valid (the user didn't let the domain expire)
- DNS configuration is correct and pointing to bunny.net
- The custom domain is actually serving content
- SSL certificates are valid

With 50+ domains, manual checking is impractical.

#### Proposed Solution

**Option 1: Active Health Check**

1. Create a health check file in each Storage Zone (e.g., `.health-check.txt`)
2. Periodically attempt to fetch it via the custom domain
3. Record results in the database

**Option 2: Use the bunny.net API**

- Check whether bunny.net exposes domain verification status via its API
- Query verification status for each custom hostname

**Database Changes**

Add a `health_status` field to the `SiteDeployment` table:

- `unknown` - Not yet checked
- `healthy` - Domain resolving and serving content
- `dns_failure` - Cannot resolve domain
- `ssl_error` - Certificate issues
- `unreachable` - Domain not responding
- `expired` - Likely domain ownership lost

Add a `last_health_check` timestamp field.
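Option 1 could look roughly like the following sketch, which fetches the health-check file over HTTPS and maps failures onto the proposed `health_status` values. This is illustrative only: the file path and function names are assumptions, and distinguishing `expired` from `dns_failure` would need extra signals (e.g., WHOIS) not shown here.

```python
# Sketch of Option 1: fetch .health-check.txt via each custom domain and
# classify the outcome as one of the proposed health_status values.
# Names and the marker-file path are illustrative, not an existing API.
import socket
import ssl
import urllib.error
import urllib.request


def status_from_error(exc: Exception) -> str:
    """Map a fetch failure to a health_status value."""
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, Exception):
        exc = exc.reason  # URLError wraps the underlying cause
    if isinstance(exc, ssl.SSLError):
        return "ssl_error"
    if isinstance(exc, socket.gaierror):
        return "dns_failure"  # DNS resolution failed
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return "unreachable"
    return "unknown"


def check_site_health(domain: str, timeout: float = 10.0) -> str:
    """Return a health_status string for one custom domain."""
    url = f"https://{domain}/.health-check.txt"  # assumed marker-file path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except Exception as exc:  # classify the failure, don't abort the batch
        return status_from_error(exc)
```

A batch runner would call `check_site_health()` per deployment and write the result plus a `last_health_check` timestamp back to `SiteDeployment`.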
**CLI Commands**

```bash
# Check single domain
check-site-health --domain www.example.com

# Check all domains
check-all-sites-health

# List unhealthy sites
list-sites --status unhealthy
```

**Use Cases**

- Automated monitoring to detect when domains expire
- Pre-deployment validation before pushing new content
- Dashboard showing health of the entire portfolio
- Alert system for broken domains

#### Impact

- Prevents wasted effort deploying to expired domains
- Early detection of DNS/SSL issues
- Better operational visibility across large domain portfolios

---

## Story 2.3: AI-Powered Content Generation

### Prompt Template A/B Testing & Optimization

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:

- Test different prompt variations
- Compare results objectively
- Select optimal prompts for different scenarios
- Track which prompts work best with which models

#### Proposed Solution

**Prompt Versioning System:**

1. Support multiple versions of each prompt template
2. Name prompts with a version suffix (e.g., `title_generation_v1.json`, `title_generation_v2.json`)
3. Job config specifies which prompt version to use per stage

**Comparison Tool:**

```bash
# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline

# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation
```

**Metrics to Track:**

- Validation pass rate
- Augmentation frequency
- Average attempts per stage
- Word count variance
- Keyword density accuracy
- Generation time
- API cost

**Database Changes:**

Add `prompt_version` fields to `GeneratedContent`:

- `title_prompt_version`
- `outline_prompt_version`
- `content_prompt_version`

#### Impact

- Higher quality content
- Reduced augmentation needs
- Lower API costs
- Model-specific optimizations
- Data-driven prompt improvements

---

### Parallel Article Generation

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

Articles are generated sequentially, which is slow for large batches:

- 15 tier 1 articles: ~10-20 minutes
- 150 tier 2 articles: ~2-3 hours

This could be parallelized, since articles are independent of one another.

#### Proposed Solution

**Multi-threading/Multi-processing:**

1. Add a `--parallel N` flag to the `generate-batch` command
2. Process N articles simultaneously
3. Share a database session pool
4. Rate limit API calls to avoid throttling

**Considerations:**

- Database connection pooling
- OpenRouter rate limits
- Memory usage (N concurrent AI calls)
- Progress tracking complexity
- Error handling across threads

**Example:**

```bash
# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
```

#### Impact

- 3-4x faster for large batches
- Better resource utilization
- Reduced total job time

---

### Job Folder Auto-Processing

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Small (1-2 days)

#### Problem

Currently, each job file must be run individually. For large operations with many batches, we want to:

- Queue multiple jobs
- Process a jobs/ folder automatically
- Run overnight batches

#### Proposed Solution

**Job Queue System:**

```bash
# Process all jobs in folder
generate-batch --folder jobs/pending/

# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/

# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60
```

**Features:**

- Process jobs in order (alphabetical or by timestamp)
- Move completed jobs to an archive folder
- Skip failed jobs or retry them
- Summary report for all jobs

**Database Changes:**

Add a `JobRun` table to track batch job executions:

- job_file_path
- start_time, end_time
- total_articles, successful, failed
- status (running/completed/failed)

#### Impact

- Hands-off batch processing
- Better for large-scale operations
- Easier job management

---

### Cost Tracking & Analytics

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (2-4 days)

#### Problem

No visibility into:

- API costs per article/batch
- Which models are most cost-effective
- Cost per tier/quality level
- Budget tracking

#### Proposed Solution

**Track API Usage:**

1. Log tokens used per API call
2. Store in the database with cost calculation
3. Dashboard showing costs

**Cost Fields in GeneratedContent:**

- `title_tokens_used`
- `title_cost_usd`
- `outline_tokens_used`
- `outline_cost_usd`
- `content_tokens_used`
- `content_cost_usd`
- `total_cost_usd`

**Analytics Commands:**

```bash
# Show costs for project
cost-report --project-id 1

# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o

# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31
```

**Reports:**

- Cost per article by tier
- Model efficiency (cost vs quality)
- Daily/weekly/monthly spend
- Budget alerts

#### Impact

- Cost optimization
- Better budget planning
- Model selection data
- ROI tracking

---

### Model Performance Analytics

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

No data on which models perform best for:

- Different tiers
- Different content types
- Title vs outline vs content generation
- Pass rates and quality scores

#### Proposed Solution

**Performance Tracking:**

1. Track validation metrics per model
2. Generate comparison reports
3. Recommend optimal models for scenarios

**Metrics:**

- First-attempt pass rate
- Average attempts to success
- Augmentation frequency
- Validation score distributions
- Generation time
- Cost per successful article

**Dashboard:**

```bash
# Model performance report
model-performance --days 30

# Output:
Model: claude-3.5-sonnet
  Title:   98% pass rate, 1.02 avg attempts, $0.05 avg cost
  Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost
  Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost

Model: gpt-4o
  ...

Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)
```

#### Impact

- Data-driven model selection
- Optimize quality vs cost
- Identify model strengths/weaknesses
- Better tier-model mapping

---

### Improved Content Augmentation

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Enhancement
**Estimated Effort**: Medium (3-5 days)

#### Problem

Current augmentation is basic:

- Random word insertion can break sentence flow
- Doesn't consider context
- Can feel unnatural
- No quality scoring

#### Proposed Solution

**Smarter Augmentation:**

1. Use AI to rewrite sentences with missing terms
2. Analyze sentence structure before insertion
3. Add quality scoring for augmented vs original
4. User-reviewable augmentation suggestions

**Example:**

```python
# Instead of:
#   "The process involves machine learning techniques."
# Random insert:
#   "The process involves keyword machine learning techniques."
# Smarter:
#   "The process involves keyword-driven machine learning techniques."
# Or:
#   "The process, focused on keyword optimization, involves machine learning."
```

**Features:**

- Context-aware term insertion
- Sentence rewriting option
- A/B comparison (original vs augmented)
- Quality scoring
- Manual review mode

#### Impact

- More natural augmented content
- Better readability
- Higher quality scores
- User confidence in output

---

## Story 3.1: URL Generation and Site Assignment

### Fuzzy Keyword/Entity Matching for Site Assignment

**Priority**: Medium
**Epic Suggestion**: Epic 3 (Pre-deployment) - Enhancement
**Estimated Effort**: Medium (5-8 story points)

#### Problem

Currently, tier1 site assignment uses:

1. Explicit preferred sites from the job config
2. Random selection from the available pool

This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.
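As a rough illustration of what scoring article-to-site relevance could involve, the sketch below tokenizes a hostname and computes a weighted keyword/entity overlap (the 0.6/0.4 weights mirror the job-config split proposed for this story). All function names are hypothetical; a real version would add fuzzy matching and stemming.

```python
# Hypothetical sketch of keyword/entity-to-hostname relevance scoring.
# Exact substring membership only; a real implementation might add
# Levenshtein distance or stemming for partial matches.
import re


def tokenize_hostname(hostname: str) -> set[str]:
    """Split 'auto-repair-tips.com' into {'auto', 'repair', 'tips'}."""
    name = hostname.split(".")[0]  # drop the TLD
    return {t for t in re.split(r"[-_]", name.lower()) if t}


def overlap(terms: list[str], tokens: set[str]) -> float:
    """Fraction of terms whose words all appear among the hostname tokens."""
    if not terms:
        return 0.0
    hits = sum(1 for term in terms
               if all(word in tokens for word in term.lower().split()))
    return hits / len(terms)


def match_score(keywords: list[str], entities: list[str], hostname: str,
                weight_keywords: float = 0.6,
                weight_entities: float = 0.4) -> float:
    """Weighted combination of keyword overlap and entity overlap."""
    tokens = tokenize_hostname(hostname)
    return (weight_keywords * overlap(keywords, tokens)
            + weight_entities * overlap(entities, tokens))
```

The assignment step would then score every available site for each tier1 article and pick the highest scorer above the configured threshold, falling back to random selection below it.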
#### Proposed Solution

**Intelligent Site Matching:**

1. Extract article keywords and entities from GeneratedContent
2. Parse keywords/entities from site hostnames and names
3. Score each (article, site) pair based on keyword/entity overlap
4. Assign tier1 articles to the highest-scoring available sites
5. Fall back to random selection if there are no good matches

**Example:**

```
Article: "Engine Repair Basics"
Keywords: ["engine repair", "automotive", "maintenance"]
Entities: ["engine", "carburetor", "cylinder"]

Available Sites:
- auto-repair-tips.com            Score: 0.85 (high match)
- engine-maintenance-guide.com    Score: 0.92 (very high match)
- cooking-recipes.com             Score: 0.05 (low match)

Assignment: engine-maintenance-guide.com (best match)
```

**Implementation Details:**

- Scoring algorithm: weighted combination of keyword match + entity match
- Fuzzy matching: use Levenshtein distance or similar for partial matches
- Track assignments to avoid reusing sites within the same batch
- Configurable threshold (e.g., only assign if score > 0.5, else random)

**Job Configuration:**

```json
{
  "tier1_site_matching": {
    "enabled": true,
    "min_score": 0.5,
    "weight_keywords": 0.6,
    "weight_entities": 0.4
  }
}
```

**Database Changes:**

None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name).

#### Complexity Factors

- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
- Entity recognition and normalization
- Scoring algorithm design and tuning
- Testing with various domain/content combinations
- Performance optimization for large site pools

#### Impact

- Better SEO through topical site clustering
- More organized content portfolio
- Easier to identify which sites cover which topics
- Improved content discoverability

#### Alternative: Simpler Keyword-Only Matching

If full fuzzy matching is too complex, start with exact keyword substring matching:

```python
# Simple version: check if the article keyword appears in the hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
    score = 1.0
else:
    score = 0.0
```

This would still provide value with much less complexity (2-3 story points instead of 5-8).

---

## Story 3.3: Content Interlinking Injection

### Boilerplate Site Pages (About, Contact, Privacy)

**Priority**: High
**Epic Suggestion**: Epic 3 (Pre-deployment) - Story 3.4
**Estimated Effort**: Medium (20 story points, 2-3 days)
**Status**: ✅ **PROMOTED TO STORY 3.4** (specification complete)

#### Problem

During Story 3.3 implementation, we added navigation menus to all HTML templates with links to:

- `about.html`
- `contact.html`
- `privacy.html`
- `/index.html`

However, these pages don't exist, creating broken links on every deployed site.

#### Impact

- Unprofessional appearance (404 errors on nav links)
- Poor user experience
- A privacy policy may be legally required for public sites
- No contact mechanism for users

#### Solution (Now Story 3.4)

See the full specification: `docs/stories/story-3.4-boilerplate-site-pages.md`

**Summary:**

- Automatically generate boilerplate pages for each site during batch generation
- Store them in a new `site_pages` table
- Use the same template as articles for visual consistency
- Generic but professional content suitable for any niche
- Generated once per site; skip if the pages already exist

**Implementation tracked in Story 3.4.**

---

## Future Sections

Add new technical debt items below as they're identified during development.