
Technical Debt & Future Enhancements

This document tracks technical debt, future enhancements, and features that were deferred from the MVP.


Story 1.6: Deployment Infrastructure Management

Domain Health Check / Verification Status

Priority: Medium
Epic Suggestion: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
Estimated Effort: Small (1-2 days)

Problem

After importing or provisioning sites, there's no way to verify:

  • Domain ownership is still valid (user didn't let domain expire)
  • DNS configuration is correct and pointing to bunny.net
  • Custom domain is actually serving content
  • SSL certificates are valid

With 50+ domains, manual checking is impractical.

Proposed Solution

Option 1: Active Health Check

  1. Create a health check file in each Storage Zone (e.g., .health-check.txt)
  2. Periodically attempt to fetch it via the custom domain
  3. Record results in database
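The three steps above can be implemented with the Python standard library alone. A minimal sketch, where the health-check path and status strings follow this document's proposal and the classification of edge cases is an assumption:

```python
import socket
import ssl
import urllib.error
import urllib.request

HEALTH_CHECK_PATH = "/.health-check.txt"  # the file proposed in step 1

def check_site_health(hostname: str, timeout: float = 10.0) -> str:
    """Fetch the health-check file via the custom domain and classify the
    result into the status strings proposed under Database Changes."""
    url = f"https://{hostname}{HEALTH_CHECK_PATH}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except urllib.error.URLError as exc:
        # urllib wraps SSL and DNS failures in URLError; inspect the cause.
        reason = getattr(exc, "reason", None)
        if isinstance(reason, ssl.SSLError):
            return "ssl_error"
        if isinstance(reason, socket.gaierror):
            return "dns_failure"
        return "unreachable"
    except (ssl.SSLError, TimeoutError):
        return "unreachable"
```

The `expired` status would need a domain-registration lookup (e.g. WHOIS) on top of this, which is out of scope for the sketch.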

Option 2: Use bunny.net API

  • Check if bunny.net exposes domain verification status via API
  • Query verification status for each custom hostname

Database Changes: Add a health_status field to the SiteDeployment table:

  • unknown - Not yet checked
  • healthy - Domain resolving and serving content
  • dns_failure - Cannot resolve domain
  • ssl_error - Certificate issues
  • unreachable - Domain not responding
  • expired - Likely domain ownership lost

Add last_health_check timestamp field.
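The proposed statuses could live in a single enum shared by the database layer and the CLI, so the vocabulary stays consistent; a sketch, assuming Python (the class name is an assumption):

```python
from enum import Enum

class HealthStatus(str, Enum):
    """Proposed values for the health_status field on SiteDeployment."""
    UNKNOWN = "unknown"          # Not yet checked
    HEALTHY = "healthy"          # Domain resolving and serving content
    DNS_FAILURE = "dns_failure"  # Cannot resolve domain
    SSL_ERROR = "ssl_error"      # Certificate issues
    UNREACHABLE = "unreachable"  # Domain not responding
    EXPIRED = "expired"          # Likely domain ownership lost
```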

CLI Commands

# Check single domain
check-site-health --domain www.example.com

# Check all domains
check-all-sites-health

# List unhealthy sites
list-sites --status unhealthy

Use Cases

  • Automated monitoring to detect when domains expire
  • Pre-deployment validation before pushing new content
  • Dashboard showing health of entire portfolio
  • Alert system for broken domains

Impact

  • Prevents wasted effort deploying to expired domains
  • Early detection of DNS/SSL issues
  • Better operational visibility across large domain portfolios

Story 2.3: AI-Powered Content Generation

Prompt Template A/B Testing & Optimization

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:

  • Test different prompt variations
  • Compare results objectively
  • Select optimal prompts for different scenarios
  • Track which prompts work best with which models

Proposed Solution

Prompt Versioning System:

  1. Support multiple versions of each prompt template
  2. Name prompts with version suffix (e.g., title_generation_v1.json, title_generation_v2.json)
  3. Job config specifies which prompt version to use per stage
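Step 2's naming convention implies a small lookup helper; a sketch, assuming templates live in a prompts/ directory (the directory layout and fallback-to-v1 behaviour are assumptions):

```python
from pathlib import Path

def resolve_prompt(stage: str, version: str = "v1",
                   prompts_dir: Path = Path("prompts")) -> Path:
    """Map a stage and version to its template file, e.g.
    ("title_generation", "v2") -> prompts/title_generation_v2.json."""
    candidate = prompts_dir / f"{stage}_{version}.json"
    if candidate.exists():
        return candidate
    # Fall back to v1 so older job configs keep working.
    fallback = prompts_dir / f"{stage}_v1.json"
    if fallback.exists():
        return fallback
    raise FileNotFoundError(f"No prompt template for {stage} ({version})")
```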

Comparison Tool:

# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline

# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation

Metrics to Track:

  • Validation pass rate
  • Augmentation frequency
  • Average attempts per stage
  • Word count variance
  • Keyword density accuracy
  • Generation time
  • API cost

Database Changes: Add prompt_version fields to GeneratedContent:

  • title_prompt_version
  • outline_prompt_version
  • content_prompt_version

Impact

  • Higher quality content
  • Reduced augmentation needs
  • Lower API costs
  • Model-specific optimizations
  • Data-driven prompt improvements

Parallel Article Generation

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

Articles are generated sequentially, which is slow for large batches:

  • 15 tier 1 articles: ~10-20 minutes
  • 150 tier 2 articles: ~2-3 hours

This could be parallelized since articles are independent.

Proposed Solution

Multi-threading/Multi-processing:

  1. Add --parallel N flag to generate-batch command
  2. Process N articles simultaneously
  3. Share database session pool
  4. Rate limit API calls to avoid throttling

Considerations:

  • Database connection pooling
  • OpenRouter rate limits
  • Memory usage (N concurrent AI calls)
  • Progress tracking complexity
  • Error handling across threads

Example:

# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
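Under the hood, --parallel could map onto a thread pool, with a semaphore capping concurrent OpenRouter calls; a sketch where generate_one stands in for the real per-article pipeline:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Cap concurrent API calls independently of worker count (value is illustrative).
api_slots = threading.Semaphore(4)

def generate_one(article_spec):
    """Stand-in for the per-article pipeline (title -> outline -> content)."""
    with api_slots:  # rate-limit the API-bound portion
        return {"spec": article_spec, "status": "completed"}

def generate_batch_parallel(specs, parallel=4):
    """Run independent article generations concurrently, collecting
    results and per-article errors without killing the whole batch."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(generate_one, s): s for s in specs}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append((futures[future], exc))
    return results, errors
```

Threads suit this workload because it is dominated by waiting on API responses; true multiprocessing would only be needed if local post-processing became CPU-bound.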

Impact

  • 3-4x faster for large batches
  • Better resource utilization
  • Reduced total job time

Job Folder Auto-Processing

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Small (1-2 days)

Problem

Currently, each job file must be run individually. For large operations with many batches, it would be useful to:

  • Queue multiple jobs
  • Process jobs/folder automatically
  • Run overnight batches

Proposed Solution

Job Queue System:

# Process all jobs in folder
generate-batch --folder jobs/pending/

# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/

# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60

Features:

  • Process jobs in order (alphabetical or by timestamp)
  • Move completed jobs to archive folder
  • Skip failed jobs or retry
  • Summary report for all jobs
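The ordered-processing and move-on-complete features can be sketched as a simple loop, where run_job stands in for the existing single-job entry point (an assumption):

```python
import shutil
from pathlib import Path

def process_job_folder(pending: Path, completed: Path, run_job) -> dict:
    """Run every *.json job in `pending` in sorted (alphabetical) order,
    moving each file to `completed` once it finishes."""
    completed.mkdir(parents=True, exist_ok=True)
    summary = {"successful": 0, "failed": 0}
    for job_file in sorted(pending.glob("*.json")):
        try:
            run_job(job_file)
            summary["successful"] += 1
            shutil.move(str(job_file), str(completed / job_file.name))
        except Exception:
            summary["failed"] += 1  # leave failed jobs in place for retry
    return summary
```

The --watch mode would wrap this in a loop that sleeps for the configured interval between scans.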

Database Changes: Add JobRun table to track batch job executions:

  • job_file_path
  • start_time, end_time
  • total_articles, successful, failed
  • status (running/completed/failed)

Impact

  • Hands-off batch processing
  • Better for large-scale operations
  • Easier job management

Cost Tracking & Analytics

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (2-4 days)

Problem

No visibility into:

  • API costs per article/batch
  • Which models are most cost-effective
  • Cost per tier/quality level
  • Budget tracking

Proposed Solution

Track API Usage:

  1. Log tokens used per API call
  2. Store in database with cost calculation
  3. Dashboard showing costs
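The cost calculation in step 2 is a straightforward multiply; a sketch with illustrative per-million-token prices (not the providers' actual rates):

```python
# Illustrative prices in USD per million tokens; real rates come from the provider.
MODEL_PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of one API call from its logged token counts."""
    prices = MODEL_PRICES[model]
    cost = (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000
    return round(cost, 6)
```

Summing these per-stage values gives the total_cost_usd field proposed below.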

Cost Fields in GeneratedContent:

  • title_tokens_used
  • title_cost_usd
  • outline_tokens_used
  • outline_cost_usd
  • content_tokens_used
  • content_cost_usd
  • total_cost_usd

Analytics Commands:

# Show costs for project
cost-report --project-id 1

# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o

# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31

Reports:

  • Cost per article by tier
  • Model efficiency (cost vs quality)
  • Daily/weekly/monthly spend
  • Budget alerts

Impact

  • Cost optimization
  • Better budget planning
  • Model selection data
  • ROI tracking

Model Performance Analytics

Priority: Low
Epic Suggestion: Epic 2 (Content Generation) - Post-MVP
Estimated Effort: Medium (3-5 days)

Problem

No data on which models perform best for:

  • Different tiers
  • Different content types
  • Title vs outline vs content generation
  • Pass rates and quality scores

Proposed Solution

Performance Tracking:

  1. Track validation metrics per model
  2. Generate comparison reports
  3. Recommend optimal models for scenarios

Metrics:

  • First-attempt pass rate
  • Average attempts to success
  • Augmentation frequency
  • Validation score distributions
  • Generation time
  • Cost per successful article
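The first two metrics above could be aggregated from per-article generation records; a sketch, assuming a record shape that is not the project's actual schema:

```python
def model_stats(records):
    """Aggregate records of the assumed shape
    {"model": str, "stage": str, "attempts": int, "passed": bool}
    into per-(model, stage) pass rates and average attempts."""
    stats = {}
    for rec in records:
        key = (rec["model"], rec["stage"])
        s = stats.setdefault(key, {"n": 0, "first_pass": 0, "attempts": 0})
        s["n"] += 1
        s["attempts"] += rec["attempts"]
        if rec["passed"] and rec["attempts"] == 1:
            s["first_pass"] += 1
    return {
        key: {
            "first_attempt_pass_rate": s["first_pass"] / s["n"],
            "avg_attempts": s["attempts"] / s["n"],
        }
        for key, s in stats.items()
    }
```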

Dashboard:

# Model performance report
model-performance --days 30

# Output:
Model: claude-3.5-sonnet
  Title: 98% pass rate, 1.02 avg attempts, $0.05 avg cost
  Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost  
  Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost
  
Model: gpt-4o
  ...
  
Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)

Impact

  • Data-driven model selection
  • Optimize quality vs cost
  • Identify model strengths/weaknesses
  • Better tier-model mapping

Improved Content Augmentation

Priority: Medium
Epic Suggestion: Epic 2 (Content Generation) - Enhancement
Estimated Effort: Medium (3-5 days)

Problem

Current augmentation is basic:

  • Random word insertion can break sentence flow
  • Doesn't consider context
  • Can feel unnatural
  • No quality scoring

Proposed Solution

Smarter Augmentation:

  1. Use AI to rewrite sentences with missing terms
  2. Analyze sentence structure before insertion
  3. Add quality scoring for augmented vs original
  4. User-reviewable augmentation suggestions

Example:

# Instead of: "The process involves machine learning techniques."
# Random insert: "The process involves keyword machine learning techniques."

# Smarter: "The process involves keyword-driven machine learning techniques."
# Or: "The process, focused on keyword optimization, involves machine learning."

Features:

  • Context-aware term insertion
  • Sentence rewriting option
  • A/B comparison (original vs augmented)
  • Quality scoring
  • Manual review mode

Impact

  • More natural augmented content
  • Better readability
  • Higher quality scores
  • User confidence in output

Story 3.1: URL Generation and Site Assignment

Fuzzy Keyword/Entity Matching for Site Assignment

Priority: Medium
Epic Suggestion: Epic 3 (Pre-deployment) - Enhancement
Estimated Effort: Medium (5-8 story points)

Problem

Currently, tier 1 site assignment uses:

  1. Explicit preferred sites from job config
  2. Random selection from available pool

This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.

Proposed Solution

Intelligent Site Matching:

  1. Extract article keywords and entities from GeneratedContent
  2. Parse keywords/entities from site hostnames and names
  3. Score each (article, site) pair based on keyword/entity overlap
  4. Assign tier1 articles to highest-scoring available sites
  5. Fall back to random if no good matches

Example:

Article: "Engine Repair Basics" 
  Keywords: ["engine repair", "automotive", "maintenance"]
  Entities: ["engine", "carburetor", "cylinder"]

Available Sites:
  - auto-repair-tips.com           Score: 0.85 (high match)
  - engine-maintenance-guide.com   Score: 0.92 (very high match)
  - cooking-recipes.com            Score: 0.05 (low match)

Assignment: engine-maintenance-guide.com (best match)

Implementation Details:

  • Scoring algorithm: weighted combination of keyword match + entity match
  • Fuzzy matching: use Levenshtein distance or similar for partial matches
  • Track assignments to avoid reusing sites within same batch
  • Configurable threshold (e.g., only assign if score > 0.5, else random)
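The hostname tokenization and weighted scoring described above could look like this; the weights mirror the weight_keywords / weight_entities options in the job configuration, and everything else is an assumption:

```python
import re

def hostname_tokens(hostname: str) -> set[str]:
    """'engine-maintenance-guide.com' -> {'engine', 'maintenance', 'guide'}."""
    tokens = set(re.split(r"[-.]", hostname.lower()))
    return tokens - {"com", "net", "org", "www", ""}

def match_score(keywords, entities, hostname,
                weight_keywords=0.6, weight_entities=0.4) -> float:
    """Score one (article, site) pair as the weighted fraction of
    keyword/entity words that appear in the hostname."""
    site = hostname_tokens(hostname)

    def overlap(terms):
        if not terms:
            return 0.0
        words = {w for term in terms for w in term.lower().split()}
        return len(words & site) / len(words)

    return weight_keywords * overlap(keywords) + weight_entities * overlap(entities)
```

Fuzzy matching (Levenshtein or similar) would replace the exact set intersection in `overlap` with a partial-match similarity.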

Job Configuration:

{
  "tier1_site_matching": {
    "enabled": true,
    "min_score": 0.5,
    "weight_keywords": 0.6,
    "weight_entities": 0.4
  }
}

Database Changes: None required; uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name).

Complexity Factors

  • Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
  • Entity recognition and normalization
  • Scoring algorithm design and tuning
  • Testing with various domain/content combinations
  • Performance optimization for large site pools

Impact

  • Better SEO through topical site clustering
  • More organized content portfolio
  • Easier to identify which sites cover which topics
  • Improved content discoverability

Alternative: Simpler Keyword-Only Matching

If full fuzzy matching is too complex, start with exact keyword substring matching:

# Simple version: check if article keyword appears in hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
    score = 1.0
else:
    score = 0.0

This would still provide value with much less complexity (2-3 story points instead of 5-8).


Future Sections

Add new technical debt items below as they're identified during development.