Technical Debt & Future Enhancements
This document tracks technical debt, future enhancements, and features that were deferred from the MVP.
Story 1.6: Deployment Infrastructure Management
Domain Health Check / Verification Status
Priority: Medium
Epic Suggestion: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
Estimated Effort: Small (1-2 days)
Problem
After importing or provisioning sites, there's no way to verify:
- Domain ownership is still valid (user didn't let domain expire)
- DNS configuration is correct and pointing to bunny.net
- Custom domain is actually serving content
- SSL certificates are valid
With 50+ domains, manual checking is impractical.
Proposed Solution
Option 1: Active Health Check
- Create a health check file in each Storage Zone (e.g., .health-check.txt)
- Periodically attempt to fetch it via the custom domain
- Record results in database
Option 2: Use bunny.net API
- Check if bunny.net exposes domain verification status via API
- Query verification status for each custom hostname
Database Changes
Add health_status field to SiteDeployment table:
- unknown - Not yet checked
- healthy - Domain resolving and serving content
- dns_failure - Cannot resolve domain
- ssl_error - Certificate issues
- unreachable - Domain not responding
- expired - Likely domain ownership lost
Add last_health_check timestamp field.
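Option 1 could be sketched as a small probe that maps fetch failures onto the proposed health_status values. A minimal sketch using only the standard library; the function names (check_site_health, classify_failure) and classification rules are illustrative, not a final design:

```python
import socket
import ssl
import urllib.error
import urllib.request

def classify_failure(exc: urllib.error.URLError) -> str:
    """Map a fetch failure onto the proposed health_status values."""
    reason = getattr(exc, "reason", None)
    if isinstance(reason, socket.gaierror):
        return "dns_failure"   # hostname did not resolve
    if isinstance(reason, ssl.SSLError):
        return "ssl_error"     # certificate problem
    return "unreachable"       # timeout, refused connection, HTTP error, etc.

def check_site_health(hostname: str, timeout: float = 10.0) -> str:
    url = f"https://{hostname}/.health-check.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except urllib.error.URLError as exc:
        return classify_failure(exc)
```

Distinguishing "expired" from "dns_failure" likely needs a WHOIS lookup or registrar data on top of this.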
CLI Commands
# Check single domain
check-site-health --domain www.example.com
# Check all domains
check-all-sites-health
# List unhealthy sites
list-sites --status unhealthy
Use Cases
- Automated monitoring to detect when domains expire
- Pre-deployment validation before pushing new content
- Dashboard showing health of entire portfolio
- Alert system for broken domains
Impact
- Prevents wasted effort deploying to expired domains
- Early detection of DNS/SSL issues
- Better operational visibility across large domain portfolios
Story 2.3: AI-Powered Content Generation
Prompt Template A/B Testing & Optimization
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:
- Test different prompt variations
- Compare results objectively
- Select optimal prompts for different scenarios
- Track which prompts work best with which models
Proposed Solution
Prompt Versioning System:
- Support multiple versions of each prompt template
- Name prompts with version suffix (e.g., title_generation_v1.json, title_generation_v2.json)
- Job config specifies which prompt version to use per stage
Comparison Tool:
# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline
# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation
Metrics to Track:
- Validation pass rate
- Augmentation frequency
- Average attempts per stage
- Word count variance
- Keyword density accuracy
- Generation time
- API cost
Database Changes:
Add prompt_version fields to GeneratedContent:
- title_prompt_version
- outline_prompt_version
- content_prompt_version
Impact
- Higher quality content
- Reduced augmentation needs
- Lower API costs
- Model-specific optimizations
- Data-driven prompt improvements
Parallel Article Generation
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
Articles are generated sequentially, which is slow for large batches:
- 15 tier 1 articles: ~10-20 minutes
- 150 tier 2 articles: ~2-3 hours
This could be parallelized since articles are independent.
Proposed Solution
Multi-threading/Multi-processing:
- Add --parallel N flag to generate-batch command
- Process N articles simultaneously
- Share database session pool
- Rate limit API calls to avoid throttling
Considerations:
- Database connection pooling
- OpenRouter rate limits
- Memory usage (N concurrent AI calls)
- Progress tracking complexity
- Error handling across threads
Example:
# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
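A minimal sketch of the worker pool, assuming a per-article entry point (generate_article is a stand-in for the real pipeline). The semaphore caps in-flight API calls independently of the thread count, which is one way to stay under OpenRouter rate limits:

```python
import concurrent.futures
import threading

def generate_batch_parallel(articles, generate_article, parallel=4, max_inflight=4):
    """Run generate_article over articles with N workers; collect per-article results."""
    gate = threading.Semaphore(max_inflight)
    results = {}

    def worker(article):
        with gate:  # rate-limit concurrent API calls
            return generate_article(article)

    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(worker, a): a for a in articles}
        for fut in concurrent.futures.as_completed(futures):
            article = futures[fut]
            try:
                results[article] = ("ok", fut.result())
            except Exception as exc:  # one failure must not kill the batch
                results[article] = ("error", exc)
    return results
```

Threads are enough here because the work is I/O-bound (API calls); per-thread database sessions would still need to come from a connection pool.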
Impact
- 3-4x faster for large batches
- Better resource utilization
- Reduced total job time
Job Folder Auto-Processing
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Small (1-2 days)
Problem
Currently each job file must be run individually. For large operations with many batches, we want to:
- Queue multiple jobs
- Process jobs/folder automatically
- Run overnight batches
Proposed Solution
Job Queue System:
# Process all jobs in folder
generate-batch --folder jobs/pending/
# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/
# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60
Features:
- Process jobs in order (alphabetical or by timestamp)
- Move completed jobs to archive folder
- Skip failed jobs or retry
- Summary report for all jobs
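The folder-processing features above can be sketched as a simple loop; run_job stands in for the existing single-job entry point and is assumed to raise on failure:

```python
import shutil
from pathlib import Path

def process_job_folder(pending_dir, run_job, completed_dir=None):
    """Process *.json jobs in alphabetical order; optionally archive completed ones."""
    summary = {"completed": [], "failed": []}
    for job_file in sorted(Path(pending_dir).glob("*.json")):
        try:
            run_job(job_file)
        except Exception:
            summary["failed"].append(job_file.name)  # skip failed jobs, keep going
            continue
        summary["completed"].append(job_file.name)
        if completed_dir is not None:  # --move-on-complete behaviour
            Path(completed_dir).mkdir(parents=True, exist_ok=True)
            shutil.move(str(job_file), str(Path(completed_dir) / job_file.name))
    return summary
```

The --watch mode would wrap this in a sleep loop; the summary dict is what the JobRun table would persist.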
Database Changes:
Add JobRun table to track batch job executions:
- job_file_path
- start_time, end_time
- total_articles, successful, failed
- status (running/completed/failed)
Impact
- Hands-off batch processing
- Better for large-scale operations
- Easier job management
Cost Tracking & Analytics
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (2-4 days)
Problem
No visibility into:
- API costs per article/batch
- Which models are most cost-effective
- Cost per tier/quality level
- Budget tracking
Proposed Solution
Track API Usage:
- Log tokens used per API call
- Store in database with cost calculation
- Dashboard showing costs
Cost Fields in GeneratedContent:
- title_tokens_used, title_cost_usd
- outline_tokens_used, outline_cost_usd
- content_tokens_used, content_cost_usd
- total_cost_usd
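The per-call cost computation is straightforward once token counts are logged. A sketch; the prices below are placeholders, not real OpenRouter rates, and should be loaded from config in practice:

```python
# USD per 1M tokens; placeholder values, not actual provider pricing
PRICE_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of one API call from its token counts."""
    prices = PRICE_PER_MTOK[model]
    cost = (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000
    return round(cost, 6)
```

total_cost_usd is then just the sum of the per-stage costs.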
Analytics Commands:
# Show costs for project
cost-report --project-id 1
# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o
# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31
Reports:
- Cost per article by tier
- Model efficiency (cost vs quality)
- Daily/weekly/monthly spend
- Budget alerts
Impact
- Cost optimization
- Better budget planning
- Model selection data
- ROI tracking
Model Performance Analytics
Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)
Problem
No data on which models perform best for:
- Different tiers
- Different content types
- Title vs outline vs content generation
- Pass rates and quality scores
Proposed Solution
Performance Tracking:
- Track validation metrics per model
- Generate comparison reports
- Recommend optimal models for scenarios
Metrics:
- First-attempt pass rate
- Average attempts to success
- Augmentation frequency
- Validation score distributions
- Generation time
- Cost per successful article
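Most of these metrics fall out of a simple aggregation over per-article generation records. A sketch, assuming each record carries model, stage, attempts, passed, and cost_usd fields (the record shape is illustrative):

```python
from collections import defaultdict
from statistics import mean

def model_stage_report(records):
    """Aggregate pass rates, attempts, and cost per (model, stage) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["stage"])].append(r)
    report = {}
    for (model, stage), rs in buckets.items():
        passed = [r for r in rs if r["passed"]]
        report[(model, stage)] = {
            "pass_rate": len(passed) / len(rs),
            "first_attempt_pass_rate":
                sum(1 for r in passed if r["attempts"] == 1) / len(rs),
            "avg_attempts": mean(r["attempts"] for r in rs),
            "avg_cost_usd": mean(r["cost_usd"] for r in rs),
        }
    return report
```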
Dashboard:
# Model performance report
model-performance --days 30
# Output:
Model: claude-3.5-sonnet
Title: 98% pass rate, 1.02 avg attempts, $0.05 avg cost
Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost
Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost
Model: gpt-4o
...
Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)
Impact
- Data-driven model selection
- Optimize quality vs cost
- Identify model strengths/weaknesses
- Better tier-model mapping
Improved Content Augmentation
Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Enhancement
Estimated Effort: Medium (3-5 days)
Problem
Current augmentation is basic:
- Random word insertion can break sentence flow
- Doesn't consider context
- Can feel unnatural
- No quality scoring
Proposed Solution
Smarter Augmentation:
- Use AI to rewrite sentences with missing terms
- Analyze sentence structure before insertion
- Add quality scoring for augmented vs original
- User-reviewable augmentation suggestions
Example:
# Instead of: "The process involves machine learning techniques."
# Random insert: "The process involves keyword machine learning techniques."
# Smarter: "The process involves keyword-driven machine learning techniques."
# Or: "The process, focused on keyword optimization, involves machine learning."
Features:
- Context-aware term insertion
- Sentence rewriting option
- A/B comparison (original vs augmented)
- Quality scoring
- Manual review mode
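Even before the full AI-rewrite path, context-aware insertion can beat random insertion with a small heuristic: attach the missing term as a hyphenated modifier in front of a known anchor phrase, as in the example above. A sketch (insert_as_modifier is a hypothetical helper, one of several fallbacks before AI rewriting):

```python
import re

def insert_as_modifier(sentence: str, term: str, anchor: str):
    """Attach term as a hyphenated modifier before the anchor phrase.

    Returns the rewritten sentence, or None if the anchor is absent
    (so the caller can fall back to AI rewriting or manual review).
    """
    pattern = re.compile(rf"\b{re.escape(anchor)}\b")
    if not pattern.search(sentence):
        return None
    return pattern.sub(f"{term}-driven {anchor}", sentence, count=1)
```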
Impact
- More natural augmented content
- Better readability
- Higher quality scores
- User confidence in output
Story 3.1: URL Generation and Site Assignment
Fuzzy Keyword/Entity Matching for Site Assignment
Priority: Medium
Epic Suggestion: Epic 3 (Pre-deployment) - Enhancement
Estimated Effort: Medium (5-8 story points)
Problem
Currently tier1 site assignment uses:
- Explicit preferred sites from job config
- Random selection from available pool
This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.
Proposed Solution
Intelligent Site Matching:
- Extract article keywords and entities from GeneratedContent
- Parse keywords/entities from site hostnames and names
- Score each (article, site) pair based on keyword/entity overlap
- Assign tier1 articles to highest-scoring available sites
- Fall back to random if no good matches
Example:
Article: "Engine Repair Basics"
Keywords: ["engine repair", "automotive", "maintenance"]
Entities: ["engine", "carburetor", "cylinder"]
Available Sites:
- auto-repair-tips.com Score: 0.85 (high match)
- engine-maintenance-guide.com Score: 0.92 (very high match)
- cooking-recipes.com Score: 0.05 (low match)
Assignment: engine-maintenance-guide.com (best match)
Implementation Details:
- Scoring algorithm: weighted combination of keyword match + entity match
- Fuzzy matching: use Levenshtein distance or similar for partial matches
- Track assignments to avoid reusing sites within same batch
- Configurable threshold (e.g., only assign if score > 0.5, else random)
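The scoring could be sketched with stdlib tools; difflib's SequenceMatcher stands in for Levenshtein distance here, and the weights mirror the proposed job-config fields. Function names and the stopword set are illustrative:

```python
import re
from difflib import SequenceMatcher

STOPWORDS = {"com", "net", "org", "www"}

def hostname_tokens(hostname: str):
    # "auto-repair-tips.com" -> ["auto", "repair", "tips"]
    return [t for t in re.split(r"[^a-z0-9]+", hostname.lower())
            if t and t not in STOPWORDS]

def _best_ratio(term: str, tokens) -> float:
    # Fuzzy partial match against the closest hostname token
    return max((SequenceMatcher(None, term.lower(), t).ratio() for t in tokens),
               default=0.0)

def site_score(keywords, entities, hostname,
               weight_keywords=0.6, weight_entities=0.4) -> float:
    """Weighted keyword/entity overlap between an article and a site hostname."""
    tokens = hostname_tokens(hostname)
    if not tokens:
        return 0.0
    kw = (sum(_best_ratio(k, tokens) for k in keywords) / len(keywords)
          if keywords else 0.0)
    en = (sum(_best_ratio(e, tokens) for e in entities) / len(entities)
          if entities else 0.0)
    return weight_keywords * kw + weight_entities * en
```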
Job Configuration:
{
"tier1_site_matching": {
"enabled": true,
"min_score": 0.5,
"weight_keywords": 0.6,
"weight_entities": 0.4
}
}
Database Changes: None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name)
Complexity Factors
- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
- Entity recognition and normalization
- Scoring algorithm design and tuning
- Testing with various domain/content combinations
- Performance optimization for large site pools
Impact
- Better SEO through topical site clustering
- More organized content portfolio
- Easier to identify which sites cover which topics
- Improved content discoverability
Alternative: Simpler Keyword-Only Matching
If full fuzzy matching is too complex, start with exact keyword substring matching:
# Simple version: check if article keyword appears in hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
score = 1.0
else:
score = 0.0
This would still provide value with much less complexity (2-3 story points instead of 5-8).
Story 3.3: Content Interlinking Injection
Boilerplate Site Pages (About, Contact, Privacy)
Priority: High
Epic Suggestion: Epic 3 (Pre-deployment) - Story 3.4
Estimated Effort: Medium (20 story points, 2-3 days)
Status: ✅ PROMOTED TO STORY 3.4 (specification complete)
Problem
During Story 3.3 implementation, we added navigation menus to all HTML templates with links to:
- about.html
- contact.html
- privacy.html
- /index.html
However, these pages don't exist, creating broken links on every deployed site.
Impact
- Unprofessional appearance (404 errors on nav links)
- Poor user experience
- Privacy policy may be legally required for public sites
- No contact mechanism for users
Solution (Now Story 3.4)
See full specification: docs/stories/story-3.4-boilerplate-site-pages.md
Summary:
- Automatically generate boilerplate pages for each site during batch generation
- Store in new site_pages table
- Use same template as articles for visual consistency
- Generic but professional content suitable for any niche
- Generated once per site, skip if already exists
Implementation tracked in Story 3.4.
Epic 4: Cloud Deployment
Multi-Cloud Storage Support
Priority: Low
Epic: Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)
Status: Deferred from Story 4.1
Problem
Story 4.1 implements deployment to Bunny.net storage only. Support for other cloud providers (AWS S3, Azure Blob Storage, DigitalOcean Spaces, Backblaze B2, etc.) was deferred.
Impact
- Limited flexibility for users who prefer or require other providers
- Cannot leverage existing infrastructure on other platforms
- Vendor lock-in to Bunny.net
Solution
Implement a storage provider abstraction layer with pluggable backends:
- Abstract StorageClient interface
- Provider-specific implementations (S3Client, AzureClient, etc.)
- Provider selection via site deployment configuration
- All credentials via .env file
Dependencies: None (can be implemented anytime)
CDN Cache Purging After Deployment
Priority: Medium
Epic: Epic 4 (Deployment)
Estimated Effort: Small (2-3 story points)
Problem
After deploying updated content, old versions may remain cached in CDN, causing users to see stale content until cache naturally expires.
Impact
- Content updates not immediately visible
- Confusing for testing/verification
- May take hours for changes to propagate
Solution
Add cache purging step after successful deployment:
- Bunny.net: Use Pull Zone purge API
- Purge specific URLs or entire zone
- Optional flag to skip purging (for performance)
- Report purge status in deployment summary
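A sketch of the per-URL purge call. The endpoint shape (POST to https://api.bunny.net/purge with an AccessKey header) reflects my reading of bunny.net's purge API and should be verified against the current API docs before use:

```python
import urllib.error
import urllib.parse
import urllib.request

def purge_endpoint(url: str) -> str:
    """Build the purge URL for a single cached object (assumed endpoint shape)."""
    return "https://api.bunny.net/purge?url=" + urllib.parse.quote(url, safe="")

def purge_url(api_key: str, url: str, timeout: float = 10.0) -> bool:
    req = urllib.request.Request(purge_endpoint(url), method="POST",
                                 headers={"AccessKey": api_key})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.URLError:
        return False  # surface this in the deployment summary
```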
Dependencies: Story 4.1 (deployment must work first)
Boilerplate Page Storage Optimization
Priority: Low
Epic: Epic 3/4 (Pre-deployment/Deployment)
Estimated Effort: Small (2-3 story points)
Problem
Story 3.4 stores full HTML for boilerplate pages (about, contact, privacy) in the database. This is inefficient and creates consistency issues if templates change.
Impact
- Database bloat (HTML is large)
- Template changes don't retroactively apply to existing pages
- Difficult to update content across all sites
Solution
Store only metadata, regenerate HTML on-the-fly during deployment:
- Database: Store only page_type marker (not full HTML)
- Deployment: Generate HTML using current template at deploy time
- Ensures consistency with latest templates
- Reduces storage requirements
Alternative: Keep current approach if regeneration adds too much complexity.
Dependencies: Story 3.4 and 4.1 (both must exist first)
Homepage (index.html) Generation
Priority: Medium
Epic: Epic 3 (Pre-deployment) or Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)
Problem
Sites have navigation with /index.html link, but no homepage exists. Users landing on root domain see 404 or directory listing.
Impact
- Poor user experience for site visitors
- Unprofessional appearance
- Lost SEO opportunity (homepage is important)
Solution
Generate index.html for each site with:
- List of recent articles (with links)
- Site branding/header
- Brief description
- Professional layout using same template system
Options:
- Static page generated once during site creation
- Dynamic listing updated after each deployment
- Simple redirect to first article
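The static-page option could be sketched as a small renderer; the real version would reuse the article template system rather than the inline HTML skeleton shown here, and the (title, url) article shape is assumed:

```python
from html import escape

def render_index(site_name: str, description: str, articles) -> str:
    """Render a minimal homepage listing articles as (title, relative_url) pairs."""
    items = "\n".join(
        f'      <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in articles
    )
    return f"""<!DOCTYPE html>
<html lang="en">
  <head><meta charset="utf-8"><title>{escape(site_name)}</title></head>
  <body>
    <h1>{escape(site_name)}</h1>
    <p>{escape(description)}</p>
    <ul>
{items}
    </ul>
  </body>
</html>"""
```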
Dependencies: Story 3.4 (boilerplate page infrastructure)
www vs root in domain imports
Problem
Domains are stored as either www.domain.com or domain.com in the table. Searching for the wrong variant through any of the scripts (e.g., main.py get-site, or during a job.json import) fails to find the site.
Solution
Possible approaches (not yet fleshed out): partial matching on the search term, or checking both the www and root variants in the lookup logic.
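The both-variants idea amounts to normalizing the hostname at lookup time. A minimal sketch, assuming sites is a mapping from stored hostname to site record:

```python
def hostname_variants(domain: str):
    """Return both the www and root forms of a domain, normalized."""
    d = domain.strip().lower().rstrip(".")
    bare = d[4:] if d.startswith("www.") else d
    return {bare, "www." + bare}

def find_site(sites, domain):
    """Look up a site record regardless of which variant was stored."""
    for candidate in hostname_variants(domain):
        if candidate in sites:
            return sites[candidate]
    return None
```

Normalizing to one canonical form at import time would remove the need for the dual lookup entirely.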
Future Sections
Add new technical debt items below as they're identified during development.