Story 4.1: Deploy Content to Cloud Storage
Status
DRAFT - Needs Review
Story
As a developer, I want to upload all generated HTML files for a batch to their designated Bunny.net storage buckets so that the content is hosted and ready to be served.
Context
- Epic 4 is about deploying finalized content to cloud storage
- Story 3.4 implemented boilerplate site pages (about, contact, privacy)
- Articles have URLs and are assigned to sites (Story 3.1)
- Interlinking is complete (Story 3.3)
- Content is ready to deploy after batch processing completes
- Bunny.net is the only cloud provider for now (multi-cloud is technical debt)
Acceptance Criteria
Core Deployment Functionality
- CLI command `deploy-batch --batch_id <id>` deploys all content in a batch
- Deployment is also triggered automatically after batch generation completes
- Deployment uploads both articles and boilerplate pages (about, contact, privacy)
- For boilerplate pages: check the `site_pages` table and deploy pages that exist
- Read HTML content directly from the `site_pages.content` field (stored in Story 3.4)
- Authentication uses:
  - `BUNNY_API_KEY` from `.env` (storage API operations)
  - `storage_zone_password` from the SiteDeployment model (per-zone)
  - `BUNNY_ACCOUNT_API_KEY` from `.env` (only for creating zones, not uploads)
- For each piece of content, identify the correct destination storage bucket/path
- Upload the final HTML to the target path (e.g., `about.html`, `my-article-slug.html`)
Error Handling
- Continue on error (don't halt entire deployment if one file fails)
- Log errors for individual file failures
- Report summary at end: successful uploads, failed uploads, total time
- Both screen output and log file
URL Tracking (Story 4.2 Preview)
- After an article is successfully deployed, log its public URL to a tier-segregated text file
- Create the `deployment_logs/` folder if it doesn't exist
- Two files per day: `YYYY-MM-DD_tier1_urls.txt` and `YYYY-MM-DD_other_tiers_urls.txt`
- URLs for Tier 1 articles → `_tier1_urls.txt`
- URLs for Tier 2+ articles → `_other_tiers_urls.txt`
- Boilerplate pages (about, contact, privacy) are NOT logged to these files
- Must avoid duplicate URLs: read the file, check whether the URL exists, and only append if it is new
- Prevents duplicates from manual re-runs after automatic deployment
Database Updates (Story 4.3 Preview)
- Update article status to 'deployed' after successful upload
- Store final public URL in database
- Transactional updates to ensure data integrity
Tasks / Subtasks
1. Create Bunny.net Storage Upload Client
Effort: 3 story points
- Create `src/deployment/bunny_storage.py` module
- Implement a `BunnyStorageClient` class for uploading files
- Use the Bunny.net Storage API (different from the Account API)
- Authentication using:
  - `BUNNY_API_KEY` from `.env` (account-level storage API key)
  - `storage_zone_password` from the SiteDeployment model (per-zone password)
  - Determine the correct authentication method during implementation
- Methods:
  - `upload_file(zone_name, zone_password, file_path, content, content_type='text/html')`
  - `file_exists(zone_name, zone_password, file_path) -> bool`
  - `list_files(zone_name, zone_password, prefix='') -> List[str]`
- Handle HTTP errors, timeouts, and retries (3 retries with exponential backoff)
- Logging at INFO level for uploads, ERROR for failures
2. Create Deployment Service
Effort: 3 story points
- Create `src/deployment/deployment_service.py` module
- Implement a `DeploymentService` class with:
  - `deploy_batch(batch_id, project_id, continue_on_error=True)`
  - `deploy_article(content_id, site_deployment)`
  - `deploy_boilerplate_page(site_page, site_deployment)`
- Query all `GeneratedContent` records for the project_id
- Query all `SitePage` records for sites in the batch
- For each article:
  - Get site deployment info (storage zone, region, hostname)
  - Generate the file path (slug-based, e.g., `my-article-slug.html`)
  - Upload HTML content to Bunny.net storage
  - Log success/failure
- For each boilerplate page (if it exists):
  - Get site deployment info
  - Generate the file path (e.g., `about.html`, `contact.html`, `privacy.html`)
  - Upload HTML content
  - Log success/failure
- Track deployment results (successful, failed, skipped)
- Return a deployment summary
3. Implement URL Generation for Deployment
Effort: 2 story points
- Extend `src/generation/url_generator.py` module
- Add `generate_public_url(site_deployment, file_path) -> str`:
  - Use `custom_hostname` if available, else `pull_zone_bcdn_hostname`
  - Return the full URL: `https://{hostname}/{file_path}`
- Add `generate_file_path(content) -> str`:
  - For articles: use a slug from the title or keyword (lowercase, hyphens, `.html` extension)
  - For boilerplate pages: fixed names (`about.html`, `contact.html`, `privacy.html`)
  - Handle edge cases (special characters, long slugs, conflicts)
4. Implement URL Logging to Text Files
Effort: 2 story points
- Create `src/deployment/url_logger.py` module
- Implement a `URLLogger` class with:
  - `log_article_url(url, tier, date=None)`
  - `get_existing_urls(tier, date=None) -> Set[str]`
- Create the `deployment_logs/` directory if it doesn't exist
- Determine the file based on tier and date:
  - Tier 1: `deployment_logs/YYYY-MM-DD_tier1_urls.txt`
  - Tier 2+: `deployment_logs/YYYY-MM-DD_other_tiers_urls.txt`
- Check whether the URL already exists in the file before appending
- Append the URL to the file (one per line)
- Thread-safe file writing (use file locks)
5. Implement Database Status Updates
Effort: 2 story points
- Update `src/database/models.py`:
  - Add `deployed_url` field to `GeneratedContent` (nullable string)
  - Add `deployed_at` field to `GeneratedContent` (nullable datetime)
- Create migration script `scripts/migrate_add_deployment_fields.py`
- Update `GeneratedContentRepository` with:
  - `mark_as_deployed(content_id, url, timestamp=None)`
  - `get_deployed_content(project_id) -> List[GeneratedContent]`
- Use transactions to ensure atomicity
- Log status updates at INFO level
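The transactional shape of `mark_as_deployed` can be illustrated with stdlib `sqlite3`; the real code presumably goes through `GeneratedContentRepository` and whatever ORM the project uses, so treat this as a sketch of the semantics only:

```python
# Hedged sketch: status, URL, and timestamp land atomically or not at all.
import sqlite3
from datetime import datetime, timezone


def mark_as_deployed(conn: sqlite3.Connection, content_id: int, url: str,
                     timestamp: str = None) -> None:
    ts = timestamp or datetime.now(timezone.utc).isoformat()
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE generated_content "
            "SET status = 'deployed', deployed_url = ?, deployed_at = ? "
            "WHERE id = ?",
            (url, ts, content_id),
        )
```

Because all three fields change in one statement inside one transaction, a crash mid-deployment cannot leave an article marked `deployed` without its URL.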
6. Create CLI Command: deploy-batch
Effort: 2 story points
- Add `deploy-batch` command to `src/cli/commands.py`
- Arguments:
  - `--batch_id` (required): Batch/project ID to deploy
  - `--admin-user` (optional): Admin username for authentication
  - `--admin-password` (optional): Admin password
  - `--continue-on-error` (default: True): Continue if a file fails
  - `--dry-run` (default: False): Preview what would be deployed
- Authenticate the admin user
- Load Bunny.net credentials from `.env`
- Call `DeploymentService.deploy_batch()`
- Display progress (articles uploaded, pages uploaded, errors)
- Show a final summary with statistics
- Exit code 0 if all succeeded, 1 if any failures
7. Integrate Deployment into Batch Processing
Effort: 2 story points
- Update `src/generation/batch_processor.py`
- Add an optional `auto_deploy` parameter to `process_job()`
- After interlinking completes, trigger deployment if `auto_deploy=True`
- Use the same deployment service as the CLI command
- Log deployment results
- Handle deployment errors gracefully (don't fail the batch if deployment fails)
- Make `auto_deploy=True` the default (deploy immediately after generation)
- Allow an `auto_deploy=False` flag for testing/debugging scenarios
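One way the hook could slot into `process_job()`; everything except the `deploy_batch` call is a placeholder, and the real batch processor's structure will differ. The key behavior is that a deployment failure is logged but never propagates out of the batch:

```python
# Placeholder sketch of the auto_deploy integration in batch_processor.py.
import logging

logger = logging.getLogger(__name__)


def process_job(job: dict, deployment_service=None, auto_deploy: bool = True):
    # ... generation and interlinking happen here (elided) ...
    if auto_deploy and deployment_service is not None:
        try:
            results = deployment_service.deploy_batch(job["project_id"])
            logger.info("Auto-deploy results: %s", results)
        except Exception:
            # Deployment failure must not fail the batch itself
            logger.exception("Auto-deploy failed for project %s", job["project_id"])
    return job
```

The CLI `deploy-batch` command remains available to re-run deployment manually whenever the auto-deploy step fails.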
8. Environment Variable Validation
Effort: 1 story point
- Confirm `src/core/config.py` loads Bunny.net keys from `.env` only
- Add validation in the deployment service to check required env vars:
  - `BUNNY_API_KEY` (for storage uploads)
  - `BUNNY_ACCOUNT_API_KEY` (for account operations, if needed)
- Raise a clear error if keys are missing
- Document in the technical notes which keys are required
- Do NOT reference `master.config.json` for any API keys
9. Unit Tests
Effort: 3 story points
- Test `BunnyStorageClient` upload functionality (mock HTTP calls)
- Test URL generation for various content types
- Test file path generation (slug creation, special characters)
- Test URL logger (file creation, duplicate prevention)
- Test deployment service (successful upload, failed upload, mixed results)
- Test database status updates
- Mock Bunny.net API responses
- Achieve >80% code coverage for new modules
10. Integration Tests
Effort: 2 story points
- Test end-to-end deployment of small batch (2-3 articles)
- Test deployment with boilerplate pages
- Test deployment without boilerplate pages
- Test URL logging (multiple deployments, different days)
- Test database updates (status changes, URLs stored)
- Test CLI command with dry-run mode
- Test continue-on-error behavior
- Verify no duplicate URLs in log files
Technical Notes
Bunny.net Storage API
Bunny.net has two separate APIs:
- Account API (existing `BunnyNetClient`): for creating storage zones and pull zones
  - Uses `BUNNY_ACCOUNT_API_KEY` from `.env`
- Storage API (new `BunnyStorageClient`): for uploading/managing files
  - Uses `BUNNY_API_KEY` from `.env` (account-level storage access)
  - Uses `storage_zone_password` from the `SiteDeployment` model (per-zone password)
  - Requires BOTH credentials for authentication
Storage API authentication:
- Base URL: `https://storage.bunnycdn.com/{zone_name}/{file_path}`
- Authentication method to be determined during implementation:
  - `BUNNY_API_KEY` from `.env` (account-level)
  - `storage_zone_password` from the database (per-zone, returned in JSON when the zone is created)
  - May require one or both keys depending on Bunny.net's API requirements
- The storage API key can be extracted from the Bunny.net JSON response during zone creation
- If implementation issues arise, reference code/examples can be provided
Upload example:
```
# Get site from database
site = site_repo.get_by_id(site_deployment_id)

# Get API key from .env
bunny_api_key = os.getenv("BUNNY_API_KEY")

# Upload (authentication method TBD during implementation)
PUT https://storage.bunnycdn.com/{site.storage_zone_name}/my-article.html
Headers:
  AccessKey: {bunny_api_key OR site.storage_zone_password}  # TBD
  Content-Type: text/html
Body:
  <html>...</html>
```
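For illustration, the raw PUT can be built with stdlib `urllib` without committing to the auth decision. `build_put_request` is a hypothetical helper that only constructs the request (no network I/O), and which credential ends up in `AccessKey` is still TBD:

```python
# Hypothetical helper showing the Storage API PUT shape with stdlib urllib.
import urllib.request


def build_put_request(zone_name: str, file_path: str, html: str,
                      access_key: str) -> urllib.request.Request:
    url = f"https://storage.bunnycdn.com/{zone_name}/{file_path}"
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        method="PUT",
        # access_key is whichever credential the auth decision settles on (TBD)
        headers={"AccessKey": access_key, "Content-Type": "text/html"},
    )

# Sending would be: urllib.request.urlopen(build_put_request(...), timeout=30)
```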
File Path Structure
```
Storage Zone: my-zone
Region: DE (Germany)

Articles:
  /my-article-slug.html
  /another-article.html
  /third-article-title.html

Boilerplate pages:
  /about.html
  /contact.html
  /privacy.html
```
- Not using subdirectories, for simplicity
- Future: could organize by date or category
URL Logger Implementation
```python
# src/deployment/url_logger.py
import fcntl  # For file locking on Unix (not portable to Windows)
from datetime import datetime
from pathlib import Path
from typing import Set


class URLLogger:
    def __init__(self, logs_dir: str = "deployment_logs"):
        self.logs_dir = Path(logs_dir)
        self.logs_dir.mkdir(exist_ok=True)

    def log_article_url(self, url: str, tier: str, date: datetime = None):
        if date is None:
            date = datetime.utcnow()
        filepath = self.logs_dir / self._filename_for(tier, date)

        # Check for duplicates
        existing = self.get_existing_urls(tier, date)
        if url in existing:
            return  # Skip duplicate

        # Append to file (with lock)
        with open(filepath, 'a') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write(f"{url}\n")
            fcntl.flock(f, fcntl.LOCK_UN)

    def get_existing_urls(self, tier: str, date: datetime = None) -> Set[str]:
        """
        Get existing URLs from the log file to prevent duplicates.

        This is critical for preventing duplicate entries when:
        - Auto-deployment runs, then a manual re-run happens
        - Deployment fails partway and is restarted
        """
        if date is None:
            date = datetime.utcnow()
        filepath = self.logs_dir / self._filename_for(tier, date)
        if not filepath.exists():
            return set()
        with open(filepath, 'r') as f:
            return set(line.strip() for line in f if line.strip())

    def _filename_for(self, tier: str, date: datetime) -> str:
        # Tier 1 gets its own file; all other tiers share one
        if self._extract_tier_number(tier) == 1:
            return f"{date.strftime('%Y-%m-%d')}_tier1_urls.txt"
        return f"{date.strftime('%Y-%m-%d')}_other_tiers_urls.txt"

    def _extract_tier_number(self, tier: str) -> int:
        # Extract the number from "tier1", "tier2", etc.
        return int(''.join(c for c in tier if c.isdigit()))
```
Deployment Service Implementation
```python
# src/deployment/deployment_service.py
import logging
from typing import Any, Dict

from src.database.repositories import (
    GeneratedContentRepository,
    SiteDeploymentRepository,
    SitePageRepository,
)
from src.deployment.bunny_storage import BunnyStorageClient
from src.deployment.url_logger import URLLogger
from src.generation.url_generator import generate_file_path, generate_public_url

logger = logging.getLogger(__name__)


class DeploymentService:
    def __init__(
        self,
        storage_client: BunnyStorageClient,
        content_repo: GeneratedContentRepository,
        site_repo: SiteDeploymentRepository,
        page_repo: SitePageRepository,
        url_logger: URLLogger,
    ):
        self.storage = storage_client
        self.content_repo = content_repo
        self.site_repo = site_repo
        self.page_repo = page_repo
        self.url_logger = url_logger

    def deploy_batch(self, project_id: int, continue_on_error: bool = True) -> Dict[str, Any]:
        """
        Deploy all content for a project/batch.

        Returns:
            Dict with deployment statistics:
            {
                'articles_deployed': 10,
                'articles_failed': 1,
                'pages_deployed': 6,
                'pages_failed': 0,
                'total_time': 45.2
            }
        """
        results = {
            'articles_deployed': 0,
            'articles_failed': 0,
            'pages_deployed': 0,
            'pages_failed': 0,
            'errors': []
        }

        # Get all articles for the project
        articles = self.content_repo.get_by_project_id(project_id)
        logger.info(f"Found {len(articles)} articles to deploy for project {project_id}")

        # Deploy articles
        for article in articles:
            if not article.site_deployment_id:
                logger.warning(f"Article {article.id} has no site assigned, skipping")
                continue
            try:
                site = self.site_repo.get_by_id(article.site_deployment_id)
                if not site:
                    raise ValueError(f"Site {article.site_deployment_id} not found")

                # Deploy article
                url = self.deploy_article(article, site)

                # Log URL to text file
                self.url_logger.log_article_url(url, article.tier)

                # Update database
                self.content_repo.mark_as_deployed(article.id, url)

                results['articles_deployed'] += 1
                logger.info(f"Deployed article {article.id} to {url}")
            except Exception as e:
                results['articles_failed'] += 1
                results['errors'].append({
                    'type': 'article',
                    'id': article.id,
                    'error': str(e)
                })
                logger.error(f"Failed to deploy article {article.id}: {e}")
                if not continue_on_error:
                    raise

        # Get unique sites from articles
        site_ids = set(a.site_deployment_id for a in articles if a.site_deployment_id)

        # Deploy boilerplate pages for each site
        for site_id in site_ids:
            site = self.site_repo.get_by_id(site_id)
            pages = self.page_repo.get_by_site(site_id)
            if not pages:
                logger.debug(f"Site {site_id} has no boilerplate pages, skipping")
                continue
            logger.info(f"Found {len(pages)} boilerplate pages for site {site_id}")
            for page in pages:
                try:
                    # Read HTML from database (stored in page.content from Story 3.4)
                    url = self.deploy_boilerplate_page(page, site)
                    results['pages_deployed'] += 1
                    logger.info(f"Deployed page {page.page_type} to {url}")
                except Exception as e:
                    results['pages_failed'] += 1
                    results['errors'].append({
                        'type': 'page',
                        'site_id': site_id,
                        'page_type': page.page_type,
                        'error': str(e)
                    })
                    logger.error(f"Failed to deploy page {page.page_type} for site {site_id}: {e}")
                    if not continue_on_error:
                        raise

        return results

    def deploy_article(self, article, site) -> str:
        """Deploy a single article, return its public URL."""
        file_path = generate_file_path(article)
        url = generate_public_url(site, file_path)
        # Upload using both BUNNY_API_KEY and the zone password;
        # BunnyStorageClient determines which auth method to use.
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,  # Per-zone password from DB
            file_path=file_path,
            content=article.formatted_html,
            content_type='text/html'
        )
        return url

    def deploy_boilerplate_page(self, page, site) -> str:
        """
        Deploy a boilerplate page, return its public URL.

        Note: Uses stored HTML from page.content (from Story 3.4).
        Technical debt: could regenerate on-the-fly instead of storing.
        """
        file_path = f"{page.page_type}.html"
        url = generate_public_url(site, file_path)
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,
            file_path=file_path,
            content=page.content,  # Full HTML stored in DB
            content_type='text/html'
        )
        return url
```
CLI Command Example
```bash
# Deploy a batch manually
uv run python -m src.cli deploy-batch \
  --batch_id 123 \
  --admin-user admin \
  --admin-password mypass

# Output:
# Authenticating...
# Loading Bunny.net credentials...
# Deploying batch 123...
# [1/50] Deploying article "How to Fix Engines"... ✓
# [2/50] Deploying article "Engine Maintenance Tips"... ✓
# ...
# [50/50] Deploying article "Common Engine Problems"... ✓
# Deploying boilerplate pages...
# [1/6] Deploying about.html for site1.b-cdn.net... ✓
# [2/6] Deploying contact.html for site1.b-cdn.net... ✓
# ...
#
# Deployment Summary:
# ==================
# Articles deployed: 48
# Articles failed: 2
# Pages deployed: 6
# Pages failed: 0
# Total time: 2m 34s
#
# Failed articles:
# - Article 15: Connection timeout
# - Article 32: Invalid HTML content

# Dry-run mode
uv run python -m src.cli deploy-batch \
  --batch_id 123 \
  --dry-run

# Output shows what would be deployed without actually uploading
```
Environment Variables
Required in .env file:
```bash
# Bunny.net Account API (for creating/managing storage zones and pull zones)
BUNNY_ACCOUNT_API_KEY=your_account_api_key_here

# Bunny.net Storage API (for uploading files to storage)
BUNNY_API_KEY=your_storage_api_key_here

# Note: storage_zone_password is per-zone and stored in the database.
# Both BUNNY_API_KEY and storage_zone_password may be needed for uploads.
# API keys should ONLY be in the .env file, NOT in master.config.json.
```
Database Schema Updates
```sql
-- Add deployment tracking fields to generated_content
ALTER TABLE generated_content ADD COLUMN deployed_url TEXT NULL;
ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP NULL;
CREATE INDEX idx_generated_content_deployed ON generated_content(deployed_at);
```
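A hedged sketch of what `scripts/migrate_add_deployment_fields.py` could look like for a SQLite database, guarded so re-running it is safe; the project's actual database engine and migration tooling may differ:

```python
# Hypothetical idempotent migration: checks existing columns before altering.
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    # Column names already present on generated_content
    existing = {row[1] for row in conn.execute("PRAGMA table_info(generated_content)")}
    with conn:
        if "deployed_url" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_url TEXT NULL")
        if "deployed_at" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP NULL")
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_generated_content_deployed "
            "ON generated_content(deployed_at)"
        )
```

Because each step checks for prior state, running the script twice (e.g., after a partial deploy of the migration) is harmless.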
Dependencies
- Story 3.1: Site assignment (need site_deployment_id on articles)
- Story 3.3: Content interlinking (HTML must be finalized)
- Story 3.4: Boilerplate pages (need SitePage table)
- Bunny.net Storage API access
- Environment variables configured in `.env`
Future Considerations
- Story 4.2: URL logging (partially implemented here)
- Story 4.3: Database status updates (partially implemented here)
- Story 4.4: Post-deployment verification
- Multi-cloud support (AWS S3, Azure, DigitalOcean, etc.)
- CDN cache purging after deployment
- Parallel uploads for faster deployment
- Resumable uploads for large files
- Deployment rollback mechanism
Technical Debt Created
- Multi-cloud support deferred (only Bunny.net for now)
- No CDN cache purging yet (Story 4.x)
- No deployment verification yet (Story 4.4)
- URL logging is simple (no database tracking of logged URLs)
- Boilerplate pages stored as full HTML in the database (inefficient)
  - Better approach: store just a page_type marker and regenerate HTML on-the-fly at deployment
  - Reduces storage, ensures consistency with current templates
  - Defer optimization to a later story
Total Effort
22 story points
Effort Breakdown
- Bunny Storage Client (3 points)
- Deployment Service (3 points)
- URL Generation (2 points)
- URL Logging (2 points)
- Database Updates (2 points)
- CLI Command (2 points)
- Batch Integration (2 points)
- Environment Audit (1 point)
- Unit Tests (3 points)
- Integration Tests (2 points)
Questions & Clarifications
Question 1: Boilerplate Page Deployment Strategy
Status: ✓ RESOLVED
The approach:
- Check the `site_pages` table in the database
- Only deploy boilerplate pages if they exist in the DB
- Read HTML content from the `site_pages.content` field
- Most sites won't have them (only newly created sites from Story 3.4+)
- Don't check remote buckets (the database is the source of truth)
Question 2: URL Duplicate Prevention
Status: ✓ RESOLVED
Approach:
- Read entire file before appending
- Check if URL exists in memory (set), skip if duplicate
- File locking for thread-safety
- This prevents duplicate URLs from manual re-runs after automatic deployment
- No database tracking needed (file is source of truth)
Question 3: Auto-deploy Default Behavior
Status: ✓ RESOLVED
Decision: ON by default
- Auto-deploy after batch generation completes
- No reason to delay deployment in normal workflow
- CLI command still available for manual re-deployment if auto-deploy fails
- Can be disabled for testing via flag if needed
Question 4: API Keys in master.config.json
Status: ✓ RESOLVED
Decision: Ignore master.config.json for API keys
- All API keys come from the `.env` file only
- Even if keys exist in master.config.json now, they'll be removed in future epics
- Don't reference master.config.json for any authentication
- Only use `.env` for credentials
Notes
- Keep deployment simple for first iteration
- Focus on reliability over speed
- Auto-deploy is ON by default (deploy immediately after batch generation)
- Manual CLI command available for re-deployment or testing
- Comprehensive error reporting is critical
- URL logging format is simple (one URL per line)
- All API keys come from the `.env` file, NOT master.config.json
- Storage API authentication details will be determined during implementation