
Technical Debt & Future Enhancements

This document tracks technical debt, future enhancements, and features that were deferred from the MVP.


Story 1.6: Deployment Infrastructure Management

Domain Health Check / Verification Status

Priority: Medium
Epic Suggestion: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
Estimated Effort: Small (1-2 days)

Problem

After importing or provisioning sites, there's no way to verify:

  • Domain ownership is still valid (user didn't let domain expire)
  • DNS configuration is correct and pointing to bunny.net
  • Custom domain is actually serving content
  • SSL certificates are valid

With 50+ domains, manual checking is impractical.

Proposed Solution

Option 1: Active Health Check

  1. Create a health check file in each Storage Zone (e.g., .health-check.txt)
  2. Periodically attempt to fetch it via the custom domain
  3. Record results in database

Option 2: Use bunny.net API

  • Check if bunny.net exposes domain verification status via API
  • Query verification status for each custom hostname

Database Changes: Add a health_status field to the SiteDeployment table:

  • unknown - Not yet checked
  • healthy - Domain resolving and serving content
  • dns_failure - Cannot resolve domain
  • ssl_error - Certificate issues
  • unreachable - Domain not responding
  • expired - Likely domain ownership lost

Add last_health_check timestamp field.
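The active check from Option 1 could map fetch outcomes onto these statuses. A minimal sketch (the function names are hypothetical; distinguishing expired from unreachable would need an extra signal such as a WHOIS lookup):

```python
import socket
import ssl
import urllib.error
import urllib.request

def classify_failure(exc: urllib.error.URLError) -> str:
    """Map a fetch failure onto the proposed health_status values."""
    if isinstance(exc.reason, socket.gaierror):
        return "dns_failure"
    if isinstance(exc.reason, ssl.SSLError):
        return "ssl_error"
    return "unreachable"

def check_site_health(hostname: str, path: str = "/.health-check.txt") -> str:
    """Fetch the health-check file via the custom domain and classify."""
    try:
        with urllib.request.urlopen(f"https://{hostname}{path}", timeout=10) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except urllib.error.URLError as exc:
        return classify_failure(exc)
```

The check-all command would loop this over every SiteDeployment row, writing health_status and last_health_check.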

CLI Commands

# Check single domain
check-site-health --domain www.example.com

# Check all domains
check-all-sites-health

# List unhealthy sites
list-sites --status unhealthy

Use Cases

  • Automated monitoring to detect when domains expire
  • Pre-deployment validation before pushing new content
  • Dashboard showing health of entire portfolio
  • Alert system for broken domains

Impact

  • Prevents wasted effort deploying to expired domains
  • Early detection of DNS/SSL issues
  • Better operational visibility across large domain portfolios

Story 2.3: AI-Powered Content Generation

Prompt Template A/B Testing & Optimization

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:

  • Test different prompt variations
  • Compare results objectively
  • Select optimal prompts for different scenarios
  • Track which prompts work best with which models

Proposed Solution

Prompt Versioning System:

  1. Support multiple versions of each prompt template
  2. Name prompts with version suffix (e.g., title_generation_v1.json, title_generation_v2.json)
  3. Job config specifies which prompt version to use per stage
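Resolving a versioned prompt file could be a small helper; the fall-back-to-v1 policy below is an assumption, added so older job configs keep working:

```python
import json
from pathlib import Path

def load_prompt(stage: str, version: str, prompt_dir: Path = Path("prompts")) -> dict:
    """Resolve e.g. prompts/title_generation_v2.json for stage='title_generation'.

    Falls back to v1 when the requested version does not exist (assumed
    policy), so older job configs keep working after new versions land.
    """
    path = prompt_dir / f"{stage}_{version}.json"
    if not path.exists():
        path = prompt_dir / f"{stage}_v1.json"
    return json.loads(path.read_text())
```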

Comparison Tool:

# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline

# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation

Metrics to Track:

  • Validation pass rate
  • Augmentation frequency
  • Average attempts per stage
  • Word count variance
  • Keyword density accuracy
  • Generation time
  • API cost

Database Changes: Add prompt_version fields to GeneratedContent:

  • title_prompt_version
  • outline_prompt_version
  • content_prompt_version

Impact

  • Higher quality content
  • Reduced augmentation needs
  • Lower API costs
  • Model-specific optimizations
  • Data-driven prompt improvements

Parallel Article Generation

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

Articles are generated sequentially, which is slow for large batches:

  • 15 tier 1 articles: ~10-20 minutes
  • 150 tier 2 articles: ~2-3 hours

This could be parallelized since articles are independent.

Proposed Solution

Multi-threading/Multi-processing:

  1. Add --parallel N flag to generate-batch command
  2. Process N articles simultaneously
  3. Share database session pool
  4. Rate limit API calls to avoid throttling

Considerations:

  • Database connection pooling
  • OpenRouter rate limits
  • Memory usage (N concurrent AI calls)
  • Progress tracking complexity
  • Error handling across threads

Example:

# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
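Under the hood, --parallel could be a thread pool with a semaphore capping concurrent AI calls; a sketch, where generate_article is a hypothetical stand-in for the existing per-article pipeline:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_batch_parallel(articles, generate_article, parallel=4):
    """Generate independent articles concurrently.

    A semaphore caps in-flight AI calls so OpenRouter rate limits are
    respected even if the pool is enlarged later.
    """
    gate = threading.Semaphore(parallel)
    results, errors = {}, {}

    def run(article):
        with gate:  # rate-limit concurrent API calls
            return generate_article(article)

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(run, a): a for a in articles}
        for fut in as_completed(futures):
            article = futures[fut]
            try:
                results[article] = fut.result()
            except Exception as exc:  # one failed article must not stop the batch
                errors[article] = exc
    return results, errors
```

Because each worker holds its own database session from the shared pool, the per-article pipeline itself needs no changes.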

Impact

  • 3-4x faster for large batches
  • Better resource utilization
  • Reduced total job time

Job Folder Auto-Processing

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Small (1-2 days)

Problem

Currently, each job file must be run individually. For large operations with many batches, we want to:

  • Queue multiple jobs
  • Process jobs/folder automatically
  • Run overnight batches

Proposed Solution

Job Queue System:

# Process all jobs in folder
generate-batch --folder jobs/pending/

# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/

# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60

Features:

  • Process jobs in order (alphabetical or by timestamp)
  • Move completed jobs to archive folder
  • Skip failed jobs or retry
  • Summary report for all jobs

Database Changes: Add JobRun table to track batch job executions:

  • job_file_path
  • start_time, end_time
  • total_articles, successful, failed
  • status (running/completed/failed)

Impact

  • Hands-off batch processing
  • Better for large-scale operations
  • Easier job management

Cost Tracking & Analytics

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (2-4 days)

Problem

No visibility into:

  • API costs per article/batch
  • Which models are most cost-effective
  • Cost per tier/quality level
  • Budget tracking

Proposed Solution

Track API Usage:

  1. Log tokens used per API call
  2. Store in database with cost calculation
  3. Dashboard showing costs

Cost Fields in GeneratedContent:

  • title_tokens_used
  • title_cost_usd
  • outline_tokens_used
  • outline_cost_usd
  • content_tokens_used
  • content_cost_usd
  • total_cost_usd
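Turning logged token counts into these cost fields is a lookup plus arithmetic. The prices below are illustrative only; real values must come from the provider's current pricing, ideally loaded from configuration:

```python
# Illustrative per-million-token prices in USD (NOT authoritative; load
# real prices from configuration synced with the provider's pricing page).
MODEL_PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert the token counts logged per API call into a USD cost."""
    price = MODEL_PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
```

Summing these per stage fills title_cost_usd, outline_cost_usd, content_cost_usd, and total_cost_usd.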

Analytics Commands:

# Show costs for project
cost-report --project-id 1

# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o

# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31

Reports:

  • Cost per article by tier
  • Model efficiency (cost vs quality)
  • Daily/weekly/monthly spend
  • Budget alerts

Impact

  • Cost optimization
  • Better budget planning
  • Model selection data
  • ROI tracking

Model Performance Analytics

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

No data on which models perform best for:

  • Different tiers
  • Different content types
  • Title vs outline vs content generation
  • Pass rates and quality scores

Proposed Solution

Performance Tracking:

  1. Track validation metrics per model
  2. Generate comparison reports
  3. Recommend optimal models for scenarios

Metrics:

  • First-attempt pass rate
  • Average attempts to success
  • Augmentation frequency
  • Validation score distributions
  • Generation time
  • Cost per successful article
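Aggregating these metrics per (model, stage) pair could look like the sketch below; the record shape is an assumption about how the tracked data would be stored:

```python
from collections import defaultdict

def summarize_model_performance(records):
    """Aggregate first-attempt pass rate and average attempts per (model, stage).

    `records` is a hypothetical shape: dicts with 'model', 'stage',
    'attempts', and 'passed_first_try' keys.
    """
    totals = defaultdict(lambda: {"runs": 0, "first_pass": 0, "attempts": 0})
    for rec in records:
        t = totals[(rec["model"], rec["stage"])]
        t["runs"] += 1
        t["first_pass"] += int(rec["passed_first_try"])
        t["attempts"] += rec["attempts"]
    return {
        key: {
            "pass_rate": t["first_pass"] / t["runs"],
            "avg_attempts": t["attempts"] / t["runs"],
        }
        for key, t in totals.items()
    }
```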

Dashboard:

# Model performance report
model-performance --days 30

# Output:
Model: claude-3.5-sonnet
  Title: 98% pass rate, 1.02 avg attempts, $0.05 avg cost
  Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost  
  Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost
  
Model: gpt-4o
  ...
  
Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)

Impact

  • Data-driven model selection
  • Optimize quality vs cost
  • Identify model strengths/weaknesses
  • Better tier-model mapping

Improved Content Augmentation

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Enhancement
Estimated Effort: Medium (3-5 days)

Problem

Current augmentation is basic:

  • Random word insertion can break sentence flow
  • Doesn't consider context
  • Can feel unnatural
  • No quality scoring

Proposed Solution

Smarter Augmentation:

  1. Use AI to rewrite sentences with missing terms
  2. Analyze sentence structure before insertion
  3. Add quality scoring for augmented vs original
  4. User-reviewable augmentation suggestions

Example:

# Instead of: "The process involves machine learning techniques."
# Random insert: "The process involves keyword machine learning techniques."

# Smarter: "The process involves keyword-driven machine learning techniques."
# Or: "The process, focused on keyword optimization, involves machine learning."
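The AI-rewrite option could be guarded so a bad rewrite never replaces the original. ai_complete below is a hypothetical callable wrapping the OpenRouter API:

```python
def augment_sentence(sentence: str, term: str, ai_complete) -> str:
    """Rewrite `sentence` so it naturally contains `term`.

    `ai_complete` is a hypothetical hook wrapping the AI client. The
    rewrite is kept only if the term actually appears in it; otherwise
    the original sentence is returned (e.g. for manual review).
    """
    if term.lower() in sentence.lower():
        return sentence  # nothing to do
    prompt = (
        f"Rewrite the sentence below so it naturally includes the phrase "
        f"'{term}' without changing its meaning.\n\n{sentence}"
    )
    rewritten = ai_complete(prompt).strip()
    return rewritten if term.lower() in rewritten.lower() else sentence
```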

Features:

  • Context-aware term insertion
  • Sentence rewriting option
  • A/B comparison (original vs augmented)
  • Quality scoring
  • Manual review mode

Impact

  • More natural augmented content
  • Better readability
  • Higher quality scores
  • User confidence in output

Story 3.1: URL Generation and Site Assignment

Fuzzy Keyword/Entity Matching for Site Assignment

Priority: Medium
Epic Suggestion: Epic 3 (Pre-deployment) - Enhancement
Estimated Effort: Medium (5-8 story points)

Problem

Currently tier1 site assignment uses:

  1. Explicit preferred sites from job config
  2. Random selection from available pool

This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.

Proposed Solution

Intelligent Site Matching:

  1. Extract article keywords and entities from GeneratedContent
  2. Parse keywords/entities from site hostnames and names
  3. Score each (article, site) pair based on keyword/entity overlap
  4. Assign tier1 articles to highest-scoring available sites
  5. Fall back to random if no good matches

Example:

Article: "Engine Repair Basics" 
  Keywords: ["engine repair", "automotive", "maintenance"]
  Entities: ["engine", "carburetor", "cylinder"]

Available Sites:
  - auto-repair-tips.com           Score: 0.85 (high match)
  - engine-maintenance-guide.com   Score: 0.92 (very high match)
  - cooking-recipes.com            Score: 0.05 (low match)

Assignment: engine-maintenance-guide.com (best match)

Implementation Details:

  • Scoring algorithm: weighted combination of keyword match + entity match
  • Fuzzy matching: use Levenshtein distance or similar for partial matches
  • Track assignments to avoid reusing sites within same batch
  • Configurable threshold (e.g., only assign if score > 0.5, else random)
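A sketch of the scoring pieces, using difflib's SequenceMatcher for the fuzzy part (function names and weighting details are illustrative, mirroring the proposed config):

```python
import re
from difflib import SequenceMatcher

def tokenize_hostname(hostname: str) -> list[str]:
    """'auto-repair-tips.com' -> ['auto', 'repair', 'tips']"""
    stem = hostname.split(".")[0]
    return [tok for tok in re.split(r"[-_]", stem) if tok]

def fuzzy_overlap(terms, tokens) -> float:
    """Average best fuzzy-match ratio of each term against the tokens."""
    if not terms or not tokens:
        return 0.0
    def best(term):
        return max(SequenceMatcher(None, term.lower(), tok).ratio() for tok in tokens)
    return sum(best(t) for t in terms) / len(terms)

def score_site(keywords, entities, hostname,
               weight_keywords=0.6, weight_entities=0.4) -> float:
    """Weighted keyword + entity overlap, matching the proposed job config."""
    tokens = tokenize_hostname(hostname)
    # flatten multi-word keywords into single words for matching
    kw_terms = [word for kw in keywords for word in kw.split()]
    return (weight_keywords * fuzzy_overlap(kw_terms, tokens)
            + weight_entities * fuzzy_overlap(entities, tokens))
```

Assignment would then sort available sites by score and fall back to random selection below min_score.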

Job Configuration:

{
  "tier1_site_matching": {
    "enabled": true,
    "min_score": 0.5,
    "weight_keywords": 0.6,
    "weight_entities": 0.4
  }
}

Database Changes: None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name)

Complexity Factors

  • Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
  • Entity recognition and normalization
  • Scoring algorithm design and tuning
  • Testing with various domain/content combinations
  • Performance optimization for large site pools

Impact

  • Better SEO through topical site clustering
  • More organized content portfolio
  • Easier to identify which sites cover which topics
  • Improved content discoverability

Alternative: Simpler Keyword-Only Matching

If full fuzzy matching is too complex, start with exact keyword substring matching:

# Simple version: check if article keyword appears in hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
    score = 1.0
else:
    score = 0.0

This would still provide value with much less complexity (2-3 story points instead of 5-8).


Story 3.3: Content Interlinking Injection

Boilerplate Site Pages (About, Contact, Privacy)

Priority: High
Epic Suggestion: Epic 3 (Pre-deployment) - Story 3.4
Estimated Effort: Medium (20 story points, 2-3 days)
Status: PROMOTED TO STORY 3.4 (specification complete)

Problem

During Story 3.3 implementation, we added navigation menus to all HTML templates with links to:

  • about.html
  • contact.html
  • privacy.html
  • /index.html

However, these pages don't exist, creating broken links on every deployed site.

Impact

  • Unprofessional appearance (404 errors on nav links)
  • Poor user experience
  • Privacy policy may be legally required for public sites
  • No contact mechanism for users

Solution (Now Story 3.4)

See full specification: docs/stories/story-3.4-boilerplate-site-pages.md

Summary:

  • Automatically generate boilerplate pages for each site during batch generation
  • Store in new site_pages table
  • Use same template as articles for visual consistency
  • Generic but professional content suitable for any niche
  • Generated once per site, skip if already exists

Implementation tracked in Story 3.4.


Epic 4: Cloud Deployment

Multi-Cloud Storage Support

Priority: Low
Epic: Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)
Status: Deferred from Story 4.1

Problem

Story 4.1 implements deployment to Bunny.net storage only. Support for other cloud providers (AWS S3, Azure Blob Storage, DigitalOcean Spaces, Backblaze B2, etc.) was deferred.

Impact

  • Limited flexibility for users who prefer or require other providers
  • Cannot leverage existing infrastructure on other platforms
  • Vendor lock-in to Bunny.net

Solution

Implement a storage provider abstraction layer with pluggable backends:

  • Abstract StorageClient interface
  • Provider-specific implementations (S3Client, AzureClient, etc.)
  • Provider selection via site deployment configuration
  • All credentials via .env file
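The abstraction could be a structural Protocol plus a provider registry; the in-memory backend below is a toy stand-in showing the interface, not a real provider implementation:

```python
from typing import Protocol

class StorageClient(Protocol):
    """The abstract interface every provider backend implements."""
    def upload(self, remote_path: str, data: bytes) -> None: ...
    def list_files(self, prefix: str = "") -> list[str]: ...

class InMemoryStorageClient:
    """Toy backend illustrating the interface; real backends (Bunny, S3,
    Azure, ...) would wrap the provider SDK and read credentials from .env."""
    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def upload(self, remote_path: str, data: bytes) -> None:
        self._files[remote_path] = data

    def list_files(self, prefix: str = "") -> list[str]:
        return sorted(p for p in self._files if p.startswith(prefix))

# Registry keyed by the provider name in site deployment configuration.
PROVIDERS: dict[str, type] = {"memory": InMemoryStorageClient}

def make_storage_client(provider: str) -> StorageClient:
    return PROVIDERS[provider]()
```

Deployment code then depends only on StorageClient, so adding a provider means registering one new class.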

Dependencies: None (can be implemented anytime)


CDN Cache Purging After Deployment

Priority: Medium
Epic: Epic 4 (Deployment)
Estimated Effort: Small (2-3 story points)

Problem

After deploying updated content, old versions may remain cached in CDN, causing users to see stale content until cache naturally expires.

Impact

  • Content updates not immediately visible
  • Confusing for testing/verification
  • May take hours for changes to propagate

Solution

Add cache purging step after successful deployment:

  • Bunny.net: Use Pull Zone purge API
  • Purge specific URLs or entire zone
  • Optional flag to skip purging (for performance)
  • Report purge status in deployment summary
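Building the per-URL purge call could look like the sketch below. The endpoint shape and AccessKey header follow the bunny.net API docs as understood at the time of writing; verify against the current documentation before relying on them:

```python
import os
import urllib.parse
import urllib.request

def build_purge_request(file_url: str) -> urllib.request.Request:
    """Build a single-URL purge request for bunny.net.

    Endpoint (POST https://api.bunny.net/purge?url=...) and AccessKey
    header are assumptions taken from the bunny.net API docs; confirm
    before use. The key comes from .env, like other credentials.
    """
    query = urllib.parse.urlencode({"url": file_url})
    return urllib.request.Request(
        f"https://api.bunny.net/purge?{query}",
        method="POST",
        headers={"AccessKey": os.environ["BUNNY_API_KEY"]},
    )
```

After a successful deployment, the tool would issue one request per changed URL (or a whole-zone purge) and record the result in the deployment summary.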

Dependencies: Story 4.1 (deployment must work first)


Boilerplate Page Storage Optimization

Priority: Low
Epic: Epic 3/4 (Pre-deployment/Deployment)
Estimated Effort: Small (2-3 story points)

Problem

Story 3.4 stores full HTML for boilerplate pages (about, contact, privacy) in the database. This is inefficient and creates consistency issues if templates change.

Impact

  • Database bloat (HTML is large)
  • Template changes don't retroactively apply to existing pages
  • Difficult to update content across all sites

Solution

Store only metadata, regenerate HTML on-the-fly during deployment:

  • Database: Store only page_type marker (not full HTML)
  • Deployment: Generate HTML using current template at deploy time
  • Ensures consistency with latest templates
  • Reduces storage requirements
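The regenerate-at-deploy idea reduces to a lookup from the stored page_type marker into the current template set; the templates below are illustrative stand-ins, not the project's real ones:

```python
from string import Template

# Illustrative stand-ins; the real implementation would load the current
# HTML templates from the existing template system at deploy time.
BOILERPLATE_TEMPLATES = {
    "about": Template("<h1>About $site_name</h1>"),
    "contact": Template("<h1>Contact $site_name</h1>"),
}

def render_boilerplate_page(page_type: str, site_name: str) -> str:
    """Regenerate a page at deploy time from its stored page_type marker."""
    return BOILERPLATE_TEMPLATES[page_type].substitute(site_name=site_name)
```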

Alternative: Keep current approach if regeneration adds too much complexity.

Dependencies: Story 3.4 and 4.1 (both must exist first)


Homepage (index.html) Generation

Priority: Medium
Epic: Epic 3 (Pre-deployment) or Epic 4 (Deployment)
Estimated Effort: Medium (5-8 story points)

Problem

Site navigation includes an /index.html link, but no homepage exists. Users landing on the root domain see a 404 or a directory listing.

Impact

  • Poor user experience for site visitors
  • Unprofessional appearance
  • Lost SEO opportunity (homepage is important)

Solution

Generate index.html for each site with:

  • List of recent articles (with links)
  • Site branding/header
  • Brief description
  • Professional layout using same template system

Options:

  1. Static page generated once during site creation
  2. Dynamic listing updated after each deployment
  3. Simple redirect to first article
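A sketch of option 2's article listing, with inline HTML standing in for the template system (article dicts with 'title' and 'url' keys are an assumed shape):

```python
from html import escape

def render_homepage(site_name: str, articles) -> str:
    """Render a minimal index.html listing recent articles.

    Each article is assumed to carry 'title' and 'url' keys; the real
    implementation would reuse the existing template system instead of
    inline HTML.
    """
    items = "\n".join(
        f'    <li><a href="{escape(a["url"])}">{escape(a["title"])}</a></li>'
        for a in articles
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{escape(site_name)}</title></head>\n"
        f"<body>\n  <h1>{escape(site_name)}</h1>\n"
        f"  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"
    )
```

Rerunning this after each deployment gives the dynamic-listing behavior of option 2; generating it once at site creation gives option 1.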

Dependencies: Story 3.4 (boilerplate page infrastructure)


Future Sections

Add new technical debt items below as they're identified during development.