Technical Debt & Future Enhancements
This document tracks technical debt, future enhancements, and features that were deferred from the MVP.
Story 1.6: Deployment Infrastructure Management
Domain Health Check / Verification Status
Priority: Medium
Epic Suggestion: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
Estimated Effort: Small (1-2 days)
Problem
After importing or provisioning sites, there's no way to verify:
- Domain ownership is still valid (user didn't let domain expire)
- DNS configuration is correct and pointing to bunny.net
- Custom domain is actually serving content
- SSL certificates are valid
With 50+ domains, manual checking is impractical.
Proposed Solution
Option 1: Active Health Check
- Create a health check file in each Storage Zone (e.g., .health-check.txt)
- Periodically attempt to fetch it via the custom domain
- Record results in database
Option 2: Use bunny.net API
- Check if bunny.net exposes domain verification status via API
- Query verification status for each custom hostname
Database Changes
Add health_status field to SiteDeployment table:
- unknown - Not yet checked
- healthy - Domain resolving and serving content
- dns_failure - Cannot resolve domain
- ssl_error - Certificate issues
- unreachable - Domain not responding
- expired - Likely domain ownership lost
Add last_health_check timestamp field.
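Option 1 could be sketched as a small probe that maps fetch failures onto the proposed health_status values. A minimal sketch using only the standard library; the function names (check_site_health, classify_failure) and classification rules are illustrative, not a final design:

```python
import socket
import ssl
import urllib.error
import urllib.request

def classify_failure(exc: urllib.error.URLError) -> str:
    """Map a fetch failure onto the proposed health_status values."""
    reason = getattr(exc, "reason", None)
    if isinstance(reason, socket.gaierror):
        return "dns_failure"   # hostname did not resolve
    if isinstance(reason, ssl.SSLError):
        return "ssl_error"     # certificate problem
    return "unreachable"       # timeout, refused connection, HTTP error, etc.

def check_site_health(hostname: str, timeout: float = 10.0) -> str:
    url = f"https://{hostname}/.health-check.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except urllib.error.URLError as exc:
        return classify_failure(exc)
```

Distinguishing "expired" from "dns_failure" likely needs a WHOIS lookup or registrar data on top of this.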
CLI Commands
# Check single domain
check-site-health --domain www.example.com
# Check all domains
check-all-sites-health
# List unhealthy sites
list-sites --status unhealthy
Use Cases
- Automated monitoring to detect when domains expire
- Pre-deployment validation before pushing new content
- Dashboard showing health of entire portfolio
- Alert system for broken domains
Impact
- Prevents wasted effort deploying to expired domains
- Early detection of DNS/SSL issues
- Better operational visibility across large domain portfolios
Story 2.3: AI-Powered Content Generation
Prompt Template A/B Testing & Optimization
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:
- Test different prompt variations
- Compare results objectively
- Select optimal prompts for different scenarios
- Track which prompts work best with which models
Proposed Solution
Prompt Versioning System:
- Support multiple versions of each prompt template
- Name prompts with version suffix (e.g., title_generation_v1.json, title_generation_v2.json)
- Job config specifies which prompt version to use per stage
Comparison Tool:
# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline
# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation
Metrics to Track:
- Validation pass rate
- Augmentation frequency
- Average attempts per stage
- Word count variance
- Keyword density accuracy
- Generation time
- API cost
Database Changes:
Add prompt_version fields to GeneratedContent:
- title_prompt_version
- outline_prompt_version
- content_prompt_version
Impact
- Higher quality content
- Reduced augmentation needs
- Lower API costs
- Model-specific optimizations
- Data-driven prompt improvements
Parallel Article Generation
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
Articles are generated sequentially, which is slow for large batches:
- 15 tier 1 articles: ~10-20 minutes
- 150 tier 2 articles: ~2-3 hours
This could be parallelized since articles are independent.
Proposed Solution
Multi-threading/Multi-processing:
- Add --parallel N flag to generate-batch command
- Process N articles simultaneously
- Share database session pool
- Rate limit API calls to avoid throttling
Considerations:
- Database connection pooling
- OpenRouter rate limits
- Memory usage (N concurrent AI calls)
- Progress tracking complexity
- Error handling across threads
Example:
# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
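A minimal sketch of the worker pool, assuming a per-article entry point (generate_article is a stand-in for the real pipeline). The semaphore caps in-flight API calls independently of the thread count, which is one way to stay under OpenRouter rate limits:

```python
import concurrent.futures
import threading

def generate_batch_parallel(articles, generate_article, parallel=4, max_inflight=4):
    """Run generate_article over articles with N workers; collect per-article results."""
    gate = threading.Semaphore(max_inflight)
    results = {}

    def worker(article):
        with gate:  # rate-limit concurrent API calls
            return generate_article(article)

    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(worker, a): a for a in articles}
        for fut in concurrent.futures.as_completed(futures):
            article = futures[fut]
            try:
                results[article] = ("ok", fut.result())
            except Exception as exc:  # one failure must not kill the batch
                results[article] = ("error", exc)
    return results
```

Threads are enough here because the work is I/O-bound (API calls); per-thread database sessions would still need to come from a connection pool.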
Impact
- 3-4x faster for large batches
- Better resource utilization
- Reduced total job time
Job Folder Auto-Processing
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Small (1-2 days)
Problem
Currently each job file must be run individually. For large operations with many batches, we want to:
- Queue multiple jobs
- Process jobs/folder automatically
- Run overnight batches
Proposed Solution
Job Queue System:
# Process all jobs in folder
generate-batch --folder jobs/pending/
# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/
# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60
Features:
- Process jobs in order (alphabetical or by timestamp)
- Move completed jobs to archive folder
- Skip failed jobs or retry
- Summary report for all jobs
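The folder-processing features above can be sketched as a simple loop; run_job stands in for the existing single-job entry point and is assumed to raise on failure:

```python
import shutil
from pathlib import Path

def process_job_folder(pending_dir, run_job, completed_dir=None):
    """Process *.json jobs in alphabetical order; optionally archive completed ones."""
    summary = {"completed": [], "failed": []}
    for job_file in sorted(Path(pending_dir).glob("*.json")):
        try:
            run_job(job_file)
        except Exception:
            summary["failed"].append(job_file.name)  # skip failed jobs, keep going
            continue
        summary["completed"].append(job_file.name)
        if completed_dir is not None:  # --move-on-complete behaviour
            Path(completed_dir).mkdir(parents=True, exist_ok=True)
            shutil.move(str(job_file), str(Path(completed_dir) / job_file.name))
    return summary
```

The --watch mode would wrap this in a sleep loop; the summary dict is what the JobRun table would persist.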
Database Changes:
Add JobRun table to track batch job executions:
- job_file_path
- start_time, end_time
- total_articles, successful, failed
- status (running/completed/failed)
Impact
- Hands-off batch processing
- Better for large-scale operations
- Easier job management
Cost Tracking & Analytics
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (2-4 days)
Problem
No visibility into:
- API costs per article/batch
- Which models are most cost-effective
- Cost per tier/quality level
- Budget tracking
Proposed Solution
Track API Usage:
- Log tokens used per API call
- Store in database with cost calculation
- Dashboard showing costs
Cost Fields in GeneratedContent:
- title_tokens_used, title_cost_usd
- outline_tokens_used, outline_cost_usd
- content_tokens_used, content_cost_usd
- total_cost_usd
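The per-call cost computation is straightforward once token counts are logged. A sketch; the prices below are placeholders, not real OpenRouter rates, and should be loaded from config in practice:

```python
# USD per 1M tokens; placeholder values, not actual provider pricing
PRICE_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of one API call from its token counts."""
    prices = PRICE_PER_MTOK[model]
    cost = (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000
    return round(cost, 6)
```

total_cost_usd is then just the sum of the per-stage costs.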
Analytics Commands:
# Show costs for project
cost-report --project-id 1
# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o
# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31
Reports:
- Cost per article by tier
- Model efficiency (cost vs quality)
- Daily/weekly/monthly spend
- Budget alerts
Impact
- Cost optimization
- Better budget planning
- Model selection data
- ROI tracking
Model Performance Analytics
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
No data on which models perform best for:
- Different tiers
- Different content types
- Title vs outline vs content generation
- Pass rates and quality scores
Proposed Solution
Performance Tracking:
- Track validation metrics per model
- Generate comparison reports
- Recommend optimal models for scenarios
Metrics:
- First-attempt pass rate
- Average attempts to success
- Augmentation frequency
- Validation score distributions
- Generation time
- Cost per successful article
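Most of these metrics fall out of a simple aggregation over per-article generation records. A sketch, assuming each record carries model, stage, attempts, passed, and cost_usd fields (the record shape is illustrative):

```python
from collections import defaultdict
from statistics import mean

def model_stage_report(records):
    """Aggregate pass rates, attempts, and cost per (model, stage) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["stage"])].append(r)
    report = {}
    for (model, stage), rs in buckets.items():
        passed = [r for r in rs if r["passed"]]
        report[(model, stage)] = {
            "pass_rate": len(passed) / len(rs),
            "first_attempt_pass_rate":
                sum(1 for r in passed if r["attempts"] == 1) / len(rs),
            "avg_attempts": mean(r["attempts"] for r in rs),
            "avg_cost_usd": mean(r["cost_usd"] for r in rs),
        }
    return report
```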
Dashboard:
# Model performance report
model-performance --days 30
# Output:
Model: claude-3.5-sonnet
Title: 98% pass rate, 1.02 avg attempts, $0.05 avg cost
Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost
Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost
Model: gpt-4o
...
Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)
Impact
- Data-driven model selection
- Optimize quality vs cost
- Identify model strengths/weaknesses
- Better tier-model mapping
Improved Content Augmentation
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Enhancement
Estimated Effort: Medium (3-5 days)
Problem
Current augmentation is basic:
- Random word insertion can break sentence flow
- Doesn't consider context
- Can feel unnatural
- No quality scoring
Proposed Solution
Smarter Augmentation:
- Use AI to rewrite sentences with missing terms
- Analyze sentence structure before insertion
- Add quality scoring for augmented vs original
- User-reviewable augmentation suggestions
Example:
# Instead of: "The process involves machine learning techniques."
# Random insert: "The process involves keyword machine learning techniques."
# Smarter: "The process involves keyword-driven machine learning techniques."
# Or: "The process, focused on keyword optimization, involves machine learning."
Features:
- Context-aware term insertion
- Sentence rewriting option
- A/B comparison (original vs augmented)
- Quality scoring
- Manual review mode
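Even before the full AI-rewrite path, context-aware insertion can beat random insertion with a small heuristic: attach the missing term as a hyphenated modifier in front of a known anchor phrase, as in the example above. A sketch (insert_as_modifier is a hypothetical helper, one of several fallbacks before AI rewriting):

```python
import re

def insert_as_modifier(sentence: str, term: str, anchor: str):
    """Attach term as a hyphenated modifier before the anchor phrase.

    Returns the rewritten sentence, or None if the anchor is absent
    (so the caller can fall back to AI rewriting or manual review).
    """
    pattern = re.compile(rf"\b{re.escape(anchor)}\b")
    if not pattern.search(sentence):
        return None
    return pattern.sub(f"{term}-driven {anchor}", sentence, count=1)
```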
Impact
- More natural augmented content
- Better readability
- Higher quality scores
- User confidence in output
Story 3.1: URL Generation and Site Assignment
Fuzzy Keyword/Entity Matching for Site Assignment
Priority: Medium
Epic Suggestion: Epic 3 (Pre-deployment) - Enhancement
Estimated Effort: Medium (5-8 story points)
Problem
Currently tier1 site assignment uses:
- Explicit preferred sites from job config
- Random selection from available pool
This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.
Proposed Solution
Intelligent Site Matching:
- Extract article keywords and entities from GeneratedContent
- Parse keywords/entities from site hostnames and names
- Score each (article, site) pair based on keyword/entity overlap
- Assign tier1 articles to highest-scoring available sites
- Fall back to random if no good matches
Example:
Article: "Engine Repair Basics"
Keywords: ["engine repair", "automotive", "maintenance"]
Entities: ["engine", "carburetor", "cylinder"]
Available Sites:
- auto-repair-tips.com Score: 0.85 (high match)
- engine-maintenance-guide.com Score: 0.92 (very high match)
- cooking-recipes.com Score: 0.05 (low match)
Assignment: engine-maintenance-guide.com (best match)
Implementation Details:
- Scoring algorithm: weighted combination of keyword match + entity match
- Fuzzy matching: use Levenshtein distance or similar for partial matches
- Track assignments to avoid reusing sites within same batch
- Configurable threshold (e.g., only assign if score > 0.5, else random)
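The scoring could be sketched with stdlib tools; difflib's SequenceMatcher stands in for Levenshtein distance here, and the weights mirror the proposed job-config fields. Function names and the stopword set are illustrative:

```python
import re
from difflib import SequenceMatcher

STOPWORDS = {"com", "net", "org", "www"}

def hostname_tokens(hostname: str):
    # "auto-repair-tips.com" -> ["auto", "repair", "tips"]
    return [t for t in re.split(r"[^a-z0-9]+", hostname.lower())
            if t and t not in STOPWORDS]

def _best_ratio(term: str, tokens) -> float:
    # Fuzzy partial match against the closest hostname token
    return max((SequenceMatcher(None, term.lower(), t).ratio() for t in tokens),
               default=0.0)

def site_score(keywords, entities, hostname,
               weight_keywords=0.6, weight_entities=0.4) -> float:
    """Weighted keyword/entity overlap between an article and a site hostname."""
    tokens = hostname_tokens(hostname)
    if not tokens:
        return 0.0
    kw = (sum(_best_ratio(k, tokens) for k in keywords) / len(keywords)
          if keywords else 0.0)
    en = (sum(_best_ratio(e, tokens) for e in entities) / len(entities)
          if entities else 0.0)
    return weight_keywords * kw + weight_entities * en
```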
Job Configuration:
{
"tier1_site_matching": {
"enabled": true,
"min_score": 0.5,
"weight_keywords": 0.6,
"weight_entities": 0.4
}
}
Database Changes: None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name)
Complexity Factors
- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
- Entity recognition and normalization
- Scoring algorithm design and tuning
- Testing with various domain/content combinations
- Performance optimization for large site pools
Impact
- Better SEO through topical site clustering
- More organized content portfolio
- Easier to identify which sites cover which topics
- Improved content discoverability
Alternative: Simpler Keyword-Only Matching
If full fuzzy matching is too complex, start with exact keyword substring matching:
# Simple version: check if article keyword appears in hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
score = 1.0
else:
score = 0.0
This would still provide value with much less complexity (2-3 story points instead of 5-8).
Story 3.3: Content Interlinking Injection
Boilerplate Site Pages (About, Contact, Privacy)
Priority: High
Epic Suggestion: Epic 3 (Pre-deployment) - Story 3.4
Estimated Effort: Medium (20 story points, 2-3 days)
Status: ✅ PROMOTED TO STORY 3.4 (specification complete)
Problem
During Story 3.3 implementation, we added navigation menus to all HTML templates with links to:
- about.html
- contact.html
- privacy.html
- /index.html
However, these pages don't exist, creating broken links on every deployed site.
Impact
- Unprofessional appearance (404 errors on nav links)
- Poor user experience
- Privacy policy may be legally required for public sites
- No contact mechanism for users
Solution (Now Story 3.4)
See full specification: docs/stories/story-3.4-boilerplate-site-pages.md
Summary:
- Automatically generate boilerplate pages for each site during batch generation
- Store in new site_pages table
- Use same template as articles for visual consistency
- Generic but professional content suitable for any niche
- Generated once per site, skip if already exists
Implementation tracked in Story 3.4.
Epic 4: Cloud Deployment
Multi-Cloud Storage Support
Priority: Low
Epic: Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)
Status: Deferred from Story 4.1
Problem
Story 4.1 implements deployment to Bunny.net storage only. Support for other cloud providers (AWS S3, Azure Blob Storage, DigitalOcean Spaces, Backblaze B2, etc.) was deferred.
Impact
- Limited flexibility for users who prefer or require other providers
- Cannot leverage existing infrastructure on other platforms
- Vendor lock-in to Bunny.net
Solution
Implement a storage provider abstraction layer with pluggable backends:
- Abstract StorageClient interface
- Provider-specific implementations (S3Client, AzureClient, etc.)
- Provider selection via site deployment configuration
- All credentials via .env file
Dependencies: None (can be implemented anytime)
CDN Cache Purging After Deployment
Priority: Medium
Epic: Epic 4 (Deployment)
Estimated Effort: Small (2-3 story points)
Problem
After deploying updated content, old versions may remain cached in CDN, causing users to see stale content until cache naturally expires.
Impact
- Content updates not immediately visible
- Confusing for testing/verification
- May take hours for changes to propagate
Solution
Add cache purging step after successful deployment:
- Bunny.net: Use Pull Zone purge API
- Purge specific URLs or entire zone
- Optional flag to skip purging (for performance)
- Report purge status in deployment summary
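A sketch of the per-URL purge call. The endpoint shape (POST to https://api.bunny.net/purge with an AccessKey header) reflects my reading of bunny.net's purge API and should be verified against the current API docs before use:

```python
import urllib.error
import urllib.parse
import urllib.request

def purge_endpoint(url: str) -> str:
    """Build the purge URL for a single cached object (assumed endpoint shape)."""
    return "https://api.bunny.net/purge?url=" + urllib.parse.quote(url, safe="")

def purge_url(api_key: str, url: str, timeout: float = 10.0) -> bool:
    req = urllib.request.Request(purge_endpoint(url), method="POST",
                                 headers={"AccessKey": api_key})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.URLError:
        return False  # surface this in the deployment summary
```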
Dependencies: Story 4.1 (deployment must work first)
Boilerplate Page Storage Optimization
Priority: Low
Epic: Epic 3/4 (Pre-deployment/Deployment)
Estimated Effort: Small (2-3 story points)
Problem
Story 3.4 stores full HTML for boilerplate pages (about, contact, privacy) in the database. This is inefficient and creates consistency issues if templates change.
Impact
- Database bloat (HTML is large)
- Template changes don't retroactively apply to existing pages
- Difficult to update content across all sites
Solution
Store only metadata, regenerate HTML on-the-fly during deployment:
- Database: Store only page_type marker (not full HTML)
- Deployment: Generate HTML using current template at deploy time
- Ensures consistency with latest templates
- Reduces storage requirements
Alternative: Keep current approach if regeneration adds too much complexity.
Dependencies: Story 3.4 and 4.1 (both must exist first)
Homepage (index.html) Generation
Priority: Medium
Epic: Epic 3 (Pre-deployment) or Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)
Problem
Sites have navigation with /index.html link, but no homepage exists. Users landing on root domain see 404 or directory listing.
Impact
- Poor user experience for site visitors
- Unprofessional appearance
- Lost SEO opportunity (homepage is important)
Solution
Generate index.html for each site with:
- List of recent articles (with links)
- Site branding/header
- Brief description
- Professional layout using same template system
Options:
- Static page generated once during site creation
- Dynamic listing updated after each deployment
- Simple redirect to first article
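The static-page option could be sketched as a small renderer; the real version would reuse the article template system rather than the inline HTML skeleton shown here, and the (title, url) article shape is assumed:

```python
from html import escape

def render_index(site_name: str, description: str, articles) -> str:
    """Render a minimal homepage listing articles as (title, relative_url) pairs."""
    items = "\n".join(
        f'      <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in articles
    )
    return f"""<!DOCTYPE html>
<html lang="en">
  <head><meta charset="utf-8"><title>{escape(site_name)}</title></head>
  <body>
    <h1>{escape(site_name)}</h1>
    <p>{escape(description)}</p>
    <ul>
{items}
    </ul>
  </body>
</html>"""
```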
Dependencies: Story 3.4 (boilerplate page infrastructure)
www vs root in domain imports
Problem
Domains are stored as either www.domain.com or domain.com in the table. Searching for the wrong variant through any of the scripts (e.g., main.py get-site, or during a job.json import) fails to find the site.
Solution
Possible approaches (not yet fleshed out): partial matching on the search term, or checking both the www and root variants in the lookup logic.
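The both-variants idea amounts to normalizing the hostname at lookup time. A minimal sketch, assuming sites is a mapping from stored hostname to site record:

```python
def hostname_variants(domain: str):
    """Return both the www and root forms of a domain, normalized."""
    d = domain.strip().lower().rstrip(".")
    bare = d[4:] if d.startswith("www.") else d
    return {bare, "www." + bare}

def find_site(sites, domain):
    """Look up a site record regardless of which variant was stored."""
    for candidate in hostname_variants(domain):
        if candidate in sites:
            return sites[candidate]
    return None
```

Normalizing to one canonical form at import time would remove the need for the dual lookup entirely.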
Future Sections
Add new technical debt items below as they're identified during development.