# Technical Debt & Future Enhancements

This document tracks technical debt, future enhancements, and features that were deferred from the MVP.

---

## Story 1.6: Deployment Infrastructure Management

### Domain Health Check / Verification Status

**Priority**: Medium
**Epic Suggestion**: Epic 4 (Deployment) or Epic 3 (Pre-deployment)
**Estimated Effort**: Small (1-2 days)

#### Problem

After importing or provisioning sites, there's no way to verify:

- Domain ownership is still valid (user didn't let domain expire)
- DNS configuration is correct and pointing to bunny.net
- Custom domain is actually serving content
- SSL certificates are valid

With 50+ domains, manual checking is impractical.

#### Proposed Solution

**Option 1: Active Health Check**

1. Create a health check file in each Storage Zone (e.g., `.health-check.txt`)
2. Periodically attempt to fetch it via the custom domain
3. Record results in database

**Option 2: Use bunny.net API**

- Check if bunny.net exposes domain verification status via API
- Query verification status for each custom hostname

**Database Changes**

Add `health_status` field to `SiteDeployment` table:

- `unknown` - Not yet checked
- `healthy` - Domain resolving and serving content
- `dns_failure` - Cannot resolve domain
- `ssl_error` - Certificate issues
- `unreachable` - Domain not responding
- `expired` - Likely domain ownership lost

Add `last_health_check` timestamp field.
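Option 1 could be sketched roughly as below. The `.health-check.txt` path comes from the proposal above; the function names and the exception-to-status mapping are assumptions, not existing code.

```python
"""Sketch of Option 1 (active health check). Function names and the
exception-to-status mapping are illustrative assumptions."""
import socket
import ssl
import urllib.error
import urllib.request


def classify_failure(exc: Exception) -> str:
    """Map a fetch failure to a health_status value."""
    if isinstance(exc, ssl.SSLError):
        return "ssl_error"
    if isinstance(exc, urllib.error.URLError):
        reason = exc.reason
        if isinstance(reason, ssl.SSLError):
            return "ssl_error"
        if isinstance(reason, socket.gaierror):
            return "dns_failure"  # name resolution failed; possibly an expired domain
        return "unreachable"
    return "unreachable"


def check_site_health(hostname: str, timeout: float = 10.0) -> str:
    """Fetch the health-check file over the custom domain."""
    url = f"https://{hostname}/.health-check.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unreachable"
    except Exception as exc:  # classify, don't crash the batch
        return classify_failure(exc)
```

A scheduler would call `check_site_health()` per site and write the result plus a `last_health_check` timestamp back to `SiteDeployment`. An `expired` status likely cannot be detected from the fetch alone; it would need a heuristic such as repeated `dns_failure` results over several days.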
**CLI Commands**

```bash
# Check single domain
check-site-health --domain www.example.com

# Check all domains
check-all-sites-health

# List unhealthy sites
list-sites --status unhealthy
```

**Use Cases**

- Automated monitoring to detect when domains expire
- Pre-deployment validation before pushing new content
- Dashboard showing health of entire portfolio
- Alert system for broken domains

#### Impact

- Prevents wasted effort deploying to expired domains
- Early detection of DNS/SSL issues
- Better operational visibility across large domain portfolios

---

## Story 2.3: AI-Powered Content Generation

### Prompt Template A/B Testing & Optimization

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

Content quality and AI compliance with CORA targets vary based on prompt wording. There is no systematic way to:

- Test different prompt variations
- Compare results objectively
- Select optimal prompts for different scenarios
- Track which prompts work best with which models

#### Proposed Solution

**Prompt Versioning System:**

1. Support multiple versions of each prompt template
2. Name prompts with a version suffix (e.g., `title_generation_v1.json`, `title_generation_v2.json`)
3. Job config specifies which prompt version to use per stage

**Comparison Tool:**

```bash
# Generate with multiple prompt versions
compare-prompts --project-id 1 --variants v1,v2,v3 --stages title,outline

# Outputs:
# - Side-by-side content comparison
# - Validation scores
# - Augmentation requirements
# - Generation time/cost
# - Recommendation
```

**Metrics to Track:**

- Validation pass rate
- Augmentation frequency
- Average attempts per stage
- Word count variance
- Keyword density accuracy
- Generation time
- API cost

**Database Changes:**

Add `prompt_version` fields to `GeneratedContent`:

- `title_prompt_version`
- `outline_prompt_version`
- `content_prompt_version`

#### Impact

- Higher quality content
- Reduced augmentation needs
- Lower API costs
- Model-specific optimizations
- Data-driven prompt improvements

---

### Parallel Article Generation

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

Articles are generated sequentially, which is slow for large batches:

- 15 tier 1 articles: ~10-20 minutes
- 150 tier 2 articles: ~2-3 hours

This could be parallelized, since articles are independent.

#### Proposed Solution

**Multi-threading/Multi-processing:**

1. Add a `--parallel N` flag to the `generate-batch` command
2. Process N articles simultaneously
3. Share a database session pool
4. Rate limit API calls to avoid throttling

**Considerations:**

- Database connection pooling
- OpenRouter rate limits
- Memory usage (N concurrent AI calls)
- Progress tracking complexity
- Error handling across threads

**Example:**

```bash
# Generate 4 articles in parallel
generate-batch -j job.json --parallel 4
```

#### Impact

- 3-4x faster for large batches
- Better resource utilization
- Reduced total job time

---

### Job Folder Auto-Processing

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Small (1-2 days)

#### Problem

Currently, each job file must be run individually. For large operations with many batches, we want to:

- Queue multiple jobs
- Process a jobs/ folder automatically
- Run overnight batches

#### Proposed Solution

**Job Queue System:**

```bash
# Process all jobs in folder
generate-batch --folder jobs/pending/

# Process and move to completed/
generate-batch --folder jobs/pending/ --move-on-complete jobs/completed/

# Watch folder for new jobs
generate-batch --watch jobs/queue/ --interval 60
```

**Features:**

- Process jobs in order (alphabetical or by timestamp)
- Move completed jobs to an archive folder
- Skip failed jobs or retry them
- Summary report for all jobs

**Database Changes:**

Add a `JobRun` table to track batch job executions:

- job_file_path
- start_time, end_time
- total_articles, successful, failed
- status (running/completed/failed)

#### Impact

- Hands-off batch processing
- Better for large-scale operations
- Easier job management

---

### Cost Tracking & Analytics

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (2-4 days)

#### Problem

No visibility into:

- API costs per article/batch
- Which models are most cost-effective
- Cost per tier/quality level
- Budget tracking

#### Proposed Solution

**Track API Usage:**

1. Log tokens used per API call
2. Store in database with cost calculation
3. Dashboard showing costs

**Cost Fields in GeneratedContent:**

- `title_tokens_used`
- `title_cost_usd`
- `outline_tokens_used`
- `outline_cost_usd`
- `content_tokens_used`
- `content_cost_usd`
- `total_cost_usd`

**Analytics Commands:**

```bash
# Show costs for project
cost-report --project-id 1

# Compare model costs
model-cost-comparison --models claude-3.5-sonnet,gpt-4o

# Budget tracking
cost-summary --date-range 2025-10-01:2025-10-31
```

**Reports:**

- Cost per article by tier
- Model efficiency (cost vs quality)
- Daily/weekly/monthly spend
- Budget alerts

#### Impact

- Cost optimization
- Better budget planning
- Model selection data
- ROI tracking

---

### Model Performance Analytics

**Priority**: Low
**Epic Suggestion**: Epic 2 (Content Generation) - Post-MVP
**Estimated Effort**: Medium (3-5 days)

#### Problem

No data on which models perform best for:

- Different tiers
- Different content types
- Title vs outline vs content generation
- Pass rates and quality scores

#### Proposed Solution

**Performance Tracking:**

1. Track validation metrics per model
2. Generate comparison reports
3. Recommend optimal models for scenarios

**Metrics:**

- First-attempt pass rate
- Average attempts to success
- Augmentation frequency
- Validation score distributions
- Generation time
- Cost per successful article

**Dashboard:**

```bash
# Model performance report
model-performance --days 30

# Output:
Model: claude-3.5-sonnet
  Title:   98% pass rate, 1.02 avg attempts, $0.05 avg cost
  Outline: 85% pass rate, 1.35 avg attempts, $0.15 avg cost
  Content: 72% pass rate, 1.67 avg attempts, $0.89 avg cost

Model: gpt-4o
  ...

Recommendations:
- Use claude-3.5-sonnet for titles (best pass rate)
- Use gpt-4o for content (better quality scores)
```

#### Impact

- Data-driven model selection
- Optimize quality vs cost
- Identify model strengths/weaknesses
- Better tier-model mapping

---

### Improved Content Augmentation

**Priority**: Medium
**Epic Suggestion**: Epic 2 (Content Generation) - Enhancement
**Estimated Effort**: Medium (3-5 days)

#### Problem

Current augmentation is basic:

- Random word insertion can break sentence flow
- Doesn't consider context
- Can feel unnatural
- No quality scoring

#### Proposed Solution

**Smarter Augmentation:**

1. Use AI to rewrite sentences with missing terms
2. Analyze sentence structure before insertion
3. Add quality scoring for augmented vs original
4. User-reviewable augmentation suggestions

**Example:**

```python
# Instead of:
#   "The process involves machine learning techniques."
# Random insert:
#   "The process involves keyword machine learning techniques."
# Smarter:
#   "The process involves keyword-driven machine learning techniques."
# Or:
#   "The process, focused on keyword optimization, involves machine learning."
```

**Features:**

- Context-aware term insertion
- Sentence rewriting option
- A/B comparison (original vs augmented)
- Quality scoring
- Manual review mode

#### Impact

- More natural augmented content
- Better readability
- Higher quality scores
- User confidence in output

---

## Story 3.1: URL Generation and Site Assignment

### Fuzzy Keyword/Entity Matching for Site Assignment

**Priority**: Medium
**Epic Suggestion**: Epic 3 (Pre-deployment) - Enhancement
**Estimated Effort**: Medium (5-8 story points)

#### Problem

Currently, tier1 site assignment uses:

1. Explicit preferred sites from the job config
2. Random selection from the available pool

This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.
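A minimal sketch of what such keyword/entity relevance scoring could look like; the tokenizer, default weights, and all names here are hypothetical, not existing code.

```python
"""Sketch: keyword-overlap scoring between an article and a candidate site.
All names and default weights are illustrative assumptions."""
import re


def tokenize_hostname(hostname: str) -> set[str]:
    """'auto-repair-tips.com' -> {'auto', 'repair', 'tips'}"""
    stem = hostname.removeprefix("www.").rsplit(".", 1)[0]
    return {t for t in re.split(r"[-._]", stem) if t}


def match_score(keywords: list[str], entities: list[str],
                hostname: str, w_kw: float = 0.6, w_ent: float = 0.4) -> float:
    """Weighted fraction of keyword/entity words found in the hostname."""
    tokens = tokenize_hostname(hostname)

    def overlap(terms: list[str]) -> float:
        if not terms:
            return 0.0
        words = {w for term in terms for w in term.lower().split()}
        return len(words & tokens) / len(words)

    return w_kw * overlap(keywords) + w_ent * overlap(entities)
```

For example, `match_score(["engine repair"], ["engine"], "engine-maintenance-guide.com")` scores well above `cooking-recipes.com` for the same article. A real implementation would add fuzzy (e.g., Levenshtein) matching on top of this exact-word overlap.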
#### Proposed Solution

**Intelligent Site Matching:**

1. Extract article keywords and entities from GeneratedContent
2. Parse keywords/entities from site hostnames and names
3. Score each (article, site) pair based on keyword/entity overlap
4. Assign tier1 articles to the highest-scoring available sites
5. Fall back to random selection if there are no good matches

**Example:**

```
Article: "Engine Repair Basics"
Keywords: ["engine repair", "automotive", "maintenance"]
Entities: ["engine", "carburetor", "cylinder"]

Available Sites:
- auto-repair-tips.com            Score: 0.85 (high match)
- engine-maintenance-guide.com    Score: 0.92 (very high match)
- cooking-recipes.com             Score: 0.05 (low match)

Assignment: engine-maintenance-guide.com (best match)
```

**Implementation Details:**

- Scoring algorithm: weighted combination of keyword match + entity match
- Fuzzy matching: use Levenshtein distance or similar for partial matches
- Track assignments to avoid reusing sites within the same batch
- Configurable threshold (e.g., only assign if score > 0.5, else random)

**Job Configuration:**

```json
{
  "tier1_site_matching": {
    "enabled": true,
    "min_score": 0.5,
    "weight_keywords": 0.6,
    "weight_entities": 0.4
  }
}
```

**Database Changes:**

None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name).

#### Complexity Factors

- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
- Entity recognition and normalization
- Scoring algorithm design and tuning
- Testing with various domain/content combinations
- Performance optimization for large site pools

#### Impact

- Better SEO through topical site clustering
- More organized content portfolio
- Easier to identify which sites cover which topics
- Improved content discoverability

#### Alternative: Simpler Keyword-Only Matching

If full fuzzy matching is too complex, start with exact keyword substring matching:

```python
# Simple version: check if the article keyword appears in the hostname
if article.main_keyword.lower() in site.custom_hostname.lower():
    score = 1.0
else:
    score = 0.0
```

This would still provide value with much less complexity (2-3 story points instead of 5-8).

---

## Story 3.3: Content Interlinking Injection

### Anchor Text Variation Insertion

**Priority**: Medium
**Epic Suggestion**: Epic 3 (Pre-deployment) - Enhancement
**Estimated Effort**: Small (1-2 story points)

#### Problem

Currently, when anchor text (the main keyword or one of its variations) is not found in the generated article content, the system falls back to inserting only the main keyword. The system searches for variations like "learn about {keyword}" and "{keyword} guide", but these variations almost never exist in the AI-generated content. This means we always end up inserting the exact same anchor text (the main keyword), reducing anchor text diversity.

#### Current Behavior

In `src/interlinking/content_injection.py`, the `_try_inject_link()` function:

1. Searches for anchor text variations in content (main keyword first, then variations)
2. If found, wraps that text with a link
3. **If not found, only inserts the first anchor text (main keyword) into content**

Example for "shaft machining":

- Searches for: "shaft machining", "learn about shaft machining", "shaft machining guide", etc.
- Variations are almost never in the content
- Always falls back to inserting just "shaft machining"

#### Proposed Solution

When the anchor text is not found in the content, randomly select from ALL available anchor text variations (not just the first one) for insertion:

**Change in `_try_inject_link()`:**

```python
import random

# Current: always inserts anchor_texts[0] (the main keyword)
# Proposed: randomly select from all anchor_texts for insertion
if anchor_texts:
    anchor_text = random.choice(anchor_texts)  # random variation instead of [0]
    updated_html = _insert_link_into_random_paragraph(html, anchor_text, target_url)
```

#### Impact

- Improved anchor text diversity
- More natural linking patterns
- Better SEO through varied anchor text
- Leverages all generated variations instead of just one

#### Dependencies

None - can be implemented immediately.

---

### Boilerplate Site Pages (About, Contact, Privacy)

**Priority**: High
**Epic Suggestion**: Epic 3 (Pre-deployment) - Story 3.4
**Estimated Effort**: Medium (20 story points, 2-3 days)
**Status**: ✅ **PROMOTED TO STORY 3.4** (specification complete)

#### Problem

During Story 3.3 implementation, we added navigation menus to all HTML templates with links to:

- `about.html`
- `contact.html`
- `privacy.html`
- `/index.html`

However, these pages don't exist, creating broken links on every deployed site.
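As a small sketch of how pre-deployment tooling could flag this gap before it ships, assuming the nav links listed above (the page list constant and function name are illustrative, not existing code):

```python
"""Sketch: detect which nav-linked boilerplate pages a site is missing.
REQUIRED_PAGES and the function name are illustrative assumptions."""
REQUIRED_PAGES = ["about.html", "contact.html", "privacy.html", "index.html"]


def missing_pages(existing_files: set[str]) -> list[str]:
    """Return the nav-linked pages not present in a site's file set."""
    return [p for p in REQUIRED_PAGES if p not in existing_files]
```

A deployment step could run this against each site's Storage Zone listing and either warn or trigger generation of the missing pages.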
#### Impact

- Unprofessional appearance (404 errors on nav links)
- Poor user experience
- Privacy policy may be legally required for public sites
- No contact mechanism for users

#### Solution (Now Story 3.4)

See full specification: `docs/stories/story-3.4-boilerplate-site-pages.md`

**Summary:**

- Automatically generate boilerplate pages for each site during batch generation
- Store in new `site_pages` table
- Use same template as articles for visual consistency
- Generic but professional content suitable for any niche
- Generated once per site, skip if already exists

**Implementation tracked in Story 3.4.**

---

## Epic 4: Cloud Deployment

### Multi-Cloud Storage Support

**Priority**: Low
**Epic**: Epic 4 (Deployment)
**Estimated Effort**: Medium (5-8 story points)
**Status**: Deferred from Story 4.1

#### Problem

Story 4.1 implements deployment to Bunny.net storage only. Support for other cloud providers (AWS S3, Azure Blob Storage, DigitalOcean Spaces, Backblaze B2, etc.) was deferred.

#### Impact

- Limited flexibility for users who prefer or require other providers
- Cannot leverage existing infrastructure on other platforms
- Vendor lock-in to Bunny.net

#### Solution

Implement a storage provider abstraction layer with pluggable backends:

- Abstract `StorageClient` interface
- Provider-specific implementations (S3Client, AzureClient, etc.)
- Provider selection via site deployment configuration
- All credentials via `.env` file

**Dependencies**: None (can be implemented anytime)

---

### CDN Cache Purging After Deployment

**Priority**: Medium
**Epic**: Epic 4 (Deployment)
**Estimated Effort**: Small (2-3 story points)

#### Problem

After deploying updated content, old versions may remain cached in the CDN, causing users to see stale content until the cache naturally expires.
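Returning to the Multi-Cloud Storage Support item above, its `StorageClient` abstraction could be sketched as follows; the method names and the in-memory test double are assumptions, not the Story 4.1 API.

```python
"""Sketch of a pluggable StorageClient interface for multi-cloud support.
The interface shape and InMemoryStorageClient are illustrative assumptions."""
from abc import ABC, abstractmethod


class StorageClient(ABC):
    """Minimal contract every storage backend would implement."""

    @abstractmethod
    def upload(self, remote_path: str, data: bytes) -> None: ...

    @abstractmethod
    def exists(self, remote_path: str) -> bool: ...


class InMemoryStorageClient(StorageClient):
    """Test double; real backends (Bunny, S3, Azure) would wrap their SDKs."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def upload(self, remote_path: str, data: bytes) -> None:
        self._files[remote_path] = data

    def exists(self, remote_path: str) -> bool:
        return remote_path in self._files
```

Deployment code would depend only on `StorageClient`, with the concrete backend chosen from the site deployment configuration and credentials read from `.env`.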
#### Impact

- Content updates are not immediately visible
- Confusing for testing/verification
- Changes may take hours to propagate

#### Solution

Add a cache purging step after successful deployment:

- Bunny.net: use the Pull Zone purge API
- Purge specific URLs or the entire zone
- Optional flag to skip purging (for performance)
- Report purge status in the deployment summary

**Dependencies**: Story 4.1 (deployment must work first)

---

### Boilerplate Page Storage Optimization

**Priority**: Low
**Epic**: Epic 3/4 (Pre-deployment/Deployment)
**Estimated Effort**: Small (2-3 story points)

#### Problem

Story 3.4 stores the full HTML for boilerplate pages (about, contact, privacy) in the database. This is inefficient and creates consistency issues if templates change.

#### Impact

- Database bloat (HTML is large)
- Template changes don't retroactively apply to existing pages
- Difficult to update content across all sites

#### Solution

Store only metadata and regenerate HTML on the fly during deployment:

- Database: store only a `page_type` marker (not the full HTML)
- Deployment: generate HTML using the current template at deploy time
- Ensures consistency with the latest templates
- Reduces storage requirements

**Alternative**: Keep the current approach if regeneration adds too much complexity.

**Dependencies**: Stories 3.4 and 4.1 (both must exist first)

---

### Homepage (index.html) Generation

**Priority**: Medium
**Epic**: Epic 3 (Pre-deployment) or Epic 4 (Deployment)
**Estimated Effort**: Medium (5-8 story points)

#### Problem

Sites have navigation with an `/index.html` link, but no homepage exists. Users landing on the root domain see a 404 or a directory listing.

#### Impact

- Poor user experience for site visitors
- Unprofessional appearance
- Lost SEO opportunity (the homepage is important)

#### Solution

Generate an `index.html` for each site with:

- List of recent articles (with links)
- Site branding/header
- Brief description
- Professional layout using the same template system

**Options:**

1. Static page generated once during site creation
2. Dynamic listing updated after each deployment
3. Simple redirect to the first article

**Dependencies**: Story 3.4 (boilerplate page infrastructure)

---

### www vs root in domain imports

#### Problem

Domains are stored as either `www.domain.com` or `domain.com` in the table, but searching for the wrong form through any of the scripts (such as `main.py get-site` or a `job.json` import) fails to find the site.

#### Solution

Ideas only, not yet fleshed out:

- Partial match on search?
- Search for both the www and root forms in the lookup logic?

## Future Sections

Add new technical debt items below as they're identified during development.