Story 4.1: Deploy Content to Cloud Storage
Status
DRAFT - Needs Review
Story
As a developer, I want to upload all generated HTML files for a batch to their designated Bunny.net storage buckets so that the content is hosted and ready to be served.
Context
- Epic 4 is about deploying finalized content to cloud storage
- Story 3.4 implemented boilerplate site pages (about, contact, privacy)
- Articles have URLs and are assigned to sites (Story 3.1)
- Interlinking is complete (Story 3.3)
- Content is ready to deploy after batch processing completes
- Bunny.net is the only cloud provider for now (multi-cloud is technical debt)
Acceptance Criteria
Core Deployment Functionality
- CLI command `deploy-batch --batch_id <id>` deploys all content in a batch
- Deployment is also triggered automatically after batch generation completes
- Deployment uploads both articles and boilerplate pages (about, contact, privacy)
- For boilerplate pages: check the `site_pages` table and deploy pages that exist
- Read HTML content directly from the `site_pages.content` field (stored in Story 3.4)
- Authentication uses:
  - `BUNNY_API_KEY` from `.env` (storage API operations)
  - `storage_zone_password` from the SiteDeployment model (per-zone)
  - `BUNNY_ACCOUNT_API_KEY` from `.env` (only for creating zones, not uploads)
- For each piece of content, identify the correct destination storage bucket/path
- Upload the final HTML to the target path (e.g., `about.html`, `my-article-slug.html`)
Error Handling
- Continue on error (don't halt entire deployment if one file fails)
- Log errors for individual file failures
- Report summary at end: successful uploads, failed uploads, total time
- Both screen output and log file
URL Tracking (Story 4.2 Preview)
- After an article is successfully deployed, log its public URL to a tier-segregated text file
- Create the `deployment_logs/` folder if it doesn't exist
- Two files per day: `YYYY-MM-DD_tier1_urls.txt` and `YYYY-MM-DD_other_tiers_urls.txt`
- URLs for Tier 1 articles → `_tier1_urls.txt`
- URLs for Tier 2+ articles → `_other_tiers_urls.txt`
- Boilerplate pages (about, contact, privacy) are NOT logged to these files
- Must avoid duplicate URLs: read the file, check whether the URL exists, and only append if it is new
- Prevents duplicates from manual re-runs after automatic deployment
Database Updates (Story 4.3 Preview)
- Update article status to 'deployed' after successful upload
- Store final public URL in database
- Transactional updates to ensure data integrity
Tasks / Subtasks
1. Create Bunny.net Storage Upload Client
Effort: 3 story points
- Create `src/deployment/bunny_storage.py` module
- Implement a `BunnyStorageClient` class for uploading files
- Use the Bunny.net Storage API (different from the Account API)
- Authentication using:
  - `BUNNY_API_KEY` from `.env` (account-level storage API key)
  - `storage_zone_password` from the SiteDeployment model (per-zone password)
  - Determine the correct authentication method during implementation
- Methods:
  - `upload_file(zone_name, zone_password, file_path, content, content_type='text/html')`
  - `file_exists(zone_name, zone_password, file_path) -> bool`
  - `list_files(zone_name, zone_password, prefix='') -> List[str]`
- Handle HTTP errors, timeouts, and retries (3 retries with exponential backoff)
- Logging at INFO level for uploads, ERROR for failures
2. Create Deployment Service
Effort: 3 story points
- Create `src/deployment/deployment_service.py` module
- Implement a `DeploymentService` class with:
  - `deploy_batch(batch_id, project_id, continue_on_error=True)`
  - `deploy_article(content_id, site_deployment)`
  - `deploy_boilerplate_page(site_page, site_deployment)`
- Query all `GeneratedContent` records for the project_id
- Query all `SitePage` records for sites in the batch
- For each article:
  - Get site deployment info (storage zone, region, hostname)
  - Generate the file path (slug-based, e.g., `my-article-slug.html`)
  - Upload HTML content to Bunny.net storage
  - Log success/failure
- For each boilerplate page (if it exists):
  - Get site deployment info
  - Generate the file path (e.g., `about.html`, `contact.html`, `privacy.html`)
  - Upload HTML content
  - Log success/failure
- Track deployment results (successful, failed, skipped)
- Return a deployment summary
3. Implement URL Generation for Deployment
Effort: 2 story points
- Extend `src/generation/url_generator.py` module
- Add `generate_public_url(site_deployment, file_path) -> str`:
  - Use `custom_hostname` if available, else `pull_zone_bcdn_hostname`
  - Return the full URL: `https://{hostname}/{file_path}`
- Add `generate_file_path(content) -> str`:
  - For articles: use a slug from the title or keyword (lowercase, hyphens, `.html` extension)
  - For boilerplate pages: fixed names (`about.html`, `contact.html`, `privacy.html`)
  - Handle edge cases (special characters, long slugs, conflicts)
4. Implement URL Logging to Text Files
Effort: 2 story points
- Create `src/deployment/url_logger.py` module
- Implement a `URLLogger` class with:
  - `log_article_url(url, tier, date=None)`
  - `get_existing_urls(tier, date=None) -> Set[str]`
- Create the `deployment_logs/` directory if it doesn't exist
- Determine the file based on tier and date:
  - Tier 1: `deployment_logs/YYYY-MM-DD_tier1_urls.txt`
  - Tier 2+: `deployment_logs/YYYY-MM-DD_other_tiers_urls.txt`
- Check whether the URL already exists in the file before appending
- Append the URL to the file (one per line)
- Thread-safe file writing (use file locks)
5. Implement Database Status Updates
Effort: 2 story points
- Update `src/database/models.py`:
  - Add `deployed_url` field to `GeneratedContent` (nullable string)
  - Add `deployed_at` field to `GeneratedContent` (nullable datetime)
- Create migration script `scripts/migrate_add_deployment_fields.py`
- Update `GeneratedContentRepository` with:
  - `mark_as_deployed(content_id, url, timestamp=None)`
  - `get_deployed_content(project_id) -> List[GeneratedContent]`
- Use transactions to ensure atomicity
- Log status updates at INFO level
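The transactional shape of `mark_as_deployed` can be illustrated with stdlib `sqlite3`; the real code presumably goes through `GeneratedContentRepository` and whatever ORM the project uses, so treat this as a sketch of the semantics only:

```python
# Hedged sketch: status, URL, and timestamp land atomically or not at all.
import sqlite3
from datetime import datetime, timezone


def mark_as_deployed(conn: sqlite3.Connection, content_id: int, url: str,
                     timestamp: str = None) -> None:
    ts = timestamp or datetime.now(timezone.utc).isoformat()
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE generated_content "
            "SET status = 'deployed', deployed_url = ?, deployed_at = ? "
            "WHERE id = ?",
            (url, ts, content_id),
        )
```

Because all three fields change in one statement inside one transaction, a crash mid-deployment cannot leave an article marked `deployed` without its URL.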
6. Create CLI Command: deploy-batch
Effort: 2 story points
- Add `deploy-batch` command to `src/cli/commands.py`
- Arguments:
  - `--batch_id` (required): Batch/project ID to deploy
  - `--admin-user` (optional): Admin username for authentication
  - `--admin-password` (optional): Admin password
  - `--continue-on-error` (default: True): Continue if a file fails
  - `--dry-run` (default: False): Preview what would be deployed
- Authenticate the admin user
- Load Bunny.net credentials from `.env`
- Call `DeploymentService.deploy_batch()`
- Display progress (articles uploaded, pages uploaded, errors)
- Show a final summary with statistics
- Exit code 0 if all succeeded, 1 if any failures
7. Integrate Deployment into Batch Processing
Effort: 2 story points
- Update `src/generation/batch_processor.py`
- Add an optional `auto_deploy` parameter to `process_job()`
- After interlinking completes, trigger deployment if `auto_deploy=True`
- Use the same deployment service as the CLI command
- Log deployment results
- Handle deployment errors gracefully (don't fail the batch if deployment fails)
- Make `auto_deploy=True` the default (deploy immediately after generation)
- Allow an `auto_deploy=False` flag for testing/debugging scenarios
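One way the hook could slot into `process_job()`; everything except the `deploy_batch` call is a placeholder, and the real batch processor's structure will differ. The key behavior is that a deployment failure is logged but never propagates out of the batch:

```python
# Placeholder sketch of the auto_deploy integration in batch_processor.py.
import logging

logger = logging.getLogger(__name__)


def process_job(job: dict, deployment_service=None, auto_deploy: bool = True):
    # ... generation and interlinking happen here (elided) ...
    if auto_deploy and deployment_service is not None:
        try:
            results = deployment_service.deploy_batch(job["project_id"])
            logger.info("Auto-deploy results: %s", results)
        except Exception:
            # Deployment failure must not fail the batch itself
            logger.exception("Auto-deploy failed for project %s", job["project_id"])
    return job
```

The CLI `deploy-batch` command remains available to re-run deployment manually whenever the auto-deploy step fails.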
8. Environment Variable Validation
Effort: 1 story point
- Confirm `src/core/config.py` loads Bunny.net keys from `.env` only
- Add validation in the deployment service to check required env vars:
  - `BUNNY_API_KEY` (for storage uploads)
  - `BUNNY_ACCOUNT_API_KEY` (for account operations, if needed)
- Raise a clear error if keys are missing
- Document in the technical notes which keys are required
- Do NOT reference `master.config.json` for any API keys
9. Unit Tests
Effort: 3 story points
- Test `BunnyStorageClient` upload functionality (mock HTTP calls)
- Test URL generation for various content types
- Test file path generation (slug creation, special characters)
- Test URL logger (file creation, duplicate prevention)
- Test deployment service (successful upload, failed upload, mixed results)
- Test database status updates
- Mock Bunny.net API responses
- Achieve >80% code coverage for new modules
10. Integration Tests
Effort: 2 story points
- Test end-to-end deployment of small batch (2-3 articles)
- Test deployment with boilerplate pages
- Test deployment without boilerplate pages
- Test URL logging (multiple deployments, different days)
- Test database updates (status changes, URLs stored)
- Test CLI command with dry-run mode
- Test continue-on-error behavior
- Verify no duplicate URLs in log files
Technical Notes
Bunny.net Storage API
Bunny.net has two separate APIs:
- Account API (existing `BunnyNetClient`): for creating storage zones and pull zones
  - Uses `BUNNY_ACCOUNT_API_KEY` from `.env`
- Storage API (new `BunnyStorageClient`): for uploading/managing files
  - Uses `BUNNY_API_KEY` from `.env` (account-level storage access)
  - Uses `storage_zone_password` from the `SiteDeployment` model (per-zone password)
  - Requires BOTH credentials for authentication
Storage API authentication:
- Base URL: `https://storage.bunnycdn.com/{zone_name}/{file_path}`
- Authentication method to be determined during implementation:
  - `BUNNY_API_KEY` from `.env` (account-level)
  - `storage_zone_password` from the database (per-zone, returned in JSON when the zone is created)
  - May require one or both keys depending on Bunny.net's API requirements
- The storage API key can be extracted from the Bunny.net JSON response during zone creation
- If implementation issues arise, reference code/examples can be provided
Upload example:
```
# Get site from database
site = site_repo.get_by_id(site_deployment_id)

# Get API key from .env
bunny_api_key = os.getenv("BUNNY_API_KEY")

# Upload (authentication method TBD during implementation)
PUT https://storage.bunnycdn.com/{site.storage_zone_name}/my-article.html
Headers:
  AccessKey: {bunny_api_key OR site.storage_zone_password}  # TBD
  Content-Type: text/html
Body:
  <html>...</html>
```
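For illustration, the raw PUT can be built with stdlib `urllib` without committing to the auth decision. `build_put_request` is a hypothetical helper that only constructs the request (no network I/O), and which credential ends up in `AccessKey` is still TBD:

```python
# Hypothetical helper showing the Storage API PUT shape with stdlib urllib.
import urllib.request


def build_put_request(zone_name: str, file_path: str, html: str,
                      access_key: str) -> urllib.request.Request:
    url = f"https://storage.bunnycdn.com/{zone_name}/{file_path}"
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        method="PUT",
        # access_key is whichever credential the auth decision settles on (TBD)
        headers={"AccessKey": access_key, "Content-Type": "text/html"},
    )

# Sending would be: urllib.request.urlopen(build_put_request(...), timeout=30)
```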
File Path Structure
```
Storage Zone: my-zone
Region: DE (Germany)

Articles:
  /my-article-slug.html
  /another-article.html
  /third-article-title.html

Boilerplate pages:
  /about.html
  /contact.html
  /privacy.html
```
- Not using subdirectories, for simplicity
- Future: could organize by date or category
URL Logger Implementation
```python
# src/deployment/url_logger.py
import fcntl  # For file locking on Unix (not portable to Windows)
from datetime import datetime
from pathlib import Path
from typing import Set


class URLLogger:
    def __init__(self, logs_dir: str = "deployment_logs"):
        self.logs_dir = Path(logs_dir)
        self.logs_dir.mkdir(exist_ok=True)

    def log_article_url(self, url: str, tier: str, date: datetime = None):
        if date is None:
            date = datetime.utcnow()
        filepath = self.logs_dir / self._filename_for(tier, date)

        # Check for duplicates
        existing = self.get_existing_urls(tier, date)
        if url in existing:
            return  # Skip duplicate

        # Append to file (with lock)
        with open(filepath, 'a') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write(f"{url}\n")
            fcntl.flock(f, fcntl.LOCK_UN)

    def get_existing_urls(self, tier: str, date: datetime = None) -> Set[str]:
        """
        Get existing URLs from the log file to prevent duplicates.

        This is critical for preventing duplicate entries when:
        - Auto-deployment runs, then a manual re-run happens
        - Deployment fails partway and is restarted
        """
        if date is None:
            date = datetime.utcnow()
        filepath = self.logs_dir / self._filename_for(tier, date)
        if not filepath.exists():
            return set()
        with open(filepath, 'r') as f:
            return set(line.strip() for line in f if line.strip())

    def _filename_for(self, tier: str, date: datetime) -> str:
        # Tier 1 gets its own file; all other tiers share one
        if self._extract_tier_number(tier) == 1:
            return f"{date.strftime('%Y-%m-%d')}_tier1_urls.txt"
        return f"{date.strftime('%Y-%m-%d')}_other_tiers_urls.txt"

    def _extract_tier_number(self, tier: str) -> int:
        # Extract the number from "tier1", "tier2", etc.
        return int(''.join(c for c in tier if c.isdigit()))
```
Deployment Service Implementation
```python
# src/deployment/deployment_service.py
import logging
from typing import Any, Dict

from src.database.repositories import (
    GeneratedContentRepository,
    SiteDeploymentRepository,
    SitePageRepository,
)
from src.deployment.bunny_storage import BunnyStorageClient
from src.deployment.url_logger import URLLogger
from src.generation.url_generator import generate_file_path, generate_public_url

logger = logging.getLogger(__name__)


class DeploymentService:
    def __init__(
        self,
        storage_client: BunnyStorageClient,
        content_repo: GeneratedContentRepository,
        site_repo: SiteDeploymentRepository,
        page_repo: SitePageRepository,
        url_logger: URLLogger,
    ):
        self.storage = storage_client
        self.content_repo = content_repo
        self.site_repo = site_repo
        self.page_repo = page_repo
        self.url_logger = url_logger

    def deploy_batch(self, project_id: int, continue_on_error: bool = True) -> Dict[str, Any]:
        """
        Deploy all content for a project/batch.

        Returns:
            Dict with deployment statistics:
            {
                'articles_deployed': 10,
                'articles_failed': 1,
                'pages_deployed': 6,
                'pages_failed': 0,
                'total_time': 45.2
            }
        """
        results = {
            'articles_deployed': 0,
            'articles_failed': 0,
            'pages_deployed': 0,
            'pages_failed': 0,
            'errors': []
        }

        # Get all articles for the project
        articles = self.content_repo.get_by_project_id(project_id)
        logger.info(f"Found {len(articles)} articles to deploy for project {project_id}")

        # Deploy articles
        for article in articles:
            if not article.site_deployment_id:
                logger.warning(f"Article {article.id} has no site assigned, skipping")
                continue
            try:
                site = self.site_repo.get_by_id(article.site_deployment_id)
                if not site:
                    raise ValueError(f"Site {article.site_deployment_id} not found")

                # Deploy article
                url = self.deploy_article(article, site)

                # Log URL to text file
                self.url_logger.log_article_url(url, article.tier)

                # Update database
                self.content_repo.mark_as_deployed(article.id, url)

                results['articles_deployed'] += 1
                logger.info(f"Deployed article {article.id} to {url}")
            except Exception as e:
                results['articles_failed'] += 1
                results['errors'].append({
                    'type': 'article',
                    'id': article.id,
                    'error': str(e)
                })
                logger.error(f"Failed to deploy article {article.id}: {e}")
                if not continue_on_error:
                    raise

        # Get unique sites from articles
        site_ids = set(a.site_deployment_id for a in articles if a.site_deployment_id)

        # Deploy boilerplate pages for each site
        for site_id in site_ids:
            site = self.site_repo.get_by_id(site_id)
            pages = self.page_repo.get_by_site(site_id)
            if not pages:
                logger.debug(f"Site {site_id} has no boilerplate pages, skipping")
                continue
            logger.info(f"Found {len(pages)} boilerplate pages for site {site_id}")
            for page in pages:
                try:
                    # Read HTML from database (stored in page.content from Story 3.4)
                    url = self.deploy_boilerplate_page(page, site)
                    results['pages_deployed'] += 1
                    logger.info(f"Deployed page {page.page_type} to {url}")
                except Exception as e:
                    results['pages_failed'] += 1
                    results['errors'].append({
                        'type': 'page',
                        'site_id': site_id,
                        'page_type': page.page_type,
                        'error': str(e)
                    })
                    logger.error(f"Failed to deploy page {page.page_type} for site {site_id}: {e}")
                    if not continue_on_error:
                        raise

        return results

    def deploy_article(self, article, site) -> str:
        """Deploy a single article, return its public URL."""
        file_path = generate_file_path(article)
        url = generate_public_url(site, file_path)
        # Upload using both BUNNY_API_KEY and the zone password;
        # BunnyStorageClient determines which auth method to use.
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,  # Per-zone password from DB
            file_path=file_path,
            content=article.formatted_html,
            content_type='text/html'
        )
        return url

    def deploy_boilerplate_page(self, page, site) -> str:
        """
        Deploy a boilerplate page, return its public URL.

        Note: Uses stored HTML from page.content (from Story 3.4).
        Technical debt: could regenerate on-the-fly instead of storing.
        """
        file_path = f"{page.page_type}.html"
        url = generate_public_url(site, file_path)
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,
            file_path=file_path,
            content=page.content,  # Full HTML stored in DB
            content_type='text/html'
        )
        return url
```
CLI Command Example
```bash
# Deploy a batch manually
uv run python -m src.cli deploy-batch \
  --batch_id 123 \
  --admin-user admin \
  --admin-password mypass

# Output:
# Authenticating...
# Loading Bunny.net credentials...
# Deploying batch 123...
# [1/50] Deploying article "How to Fix Engines"... ✓
# [2/50] Deploying article "Engine Maintenance Tips"... ✓
# ...
# [50/50] Deploying article "Common Engine Problems"... ✓
# Deploying boilerplate pages...
# [1/6] Deploying about.html for site1.b-cdn.net... ✓
# [2/6] Deploying contact.html for site1.b-cdn.net... ✓
# ...
#
# Deployment Summary:
# ==================
# Articles deployed: 48
# Articles failed: 2
# Pages deployed: 6
# Pages failed: 0
# Total time: 2m 34s
#
# Failed articles:
# - Article 15: Connection timeout
# - Article 32: Invalid HTML content

# Dry-run mode
uv run python -m src.cli deploy-batch \
  --batch_id 123 \
  --dry-run

# Output shows what would be deployed without actually uploading
```
Environment Variables
Required in .env file:
```bash
# Bunny.net Account API (for creating/managing storage zones and pull zones)
BUNNY_ACCOUNT_API_KEY=your_account_api_key_here

# Bunny.net Storage API (for uploading files to storage)
BUNNY_API_KEY=your_storage_api_key_here

# Note: storage_zone_password is per-zone and stored in the database.
# Both BUNNY_API_KEY and storage_zone_password may be needed for uploads.
# API keys should ONLY be in the .env file, NOT in master.config.json.
```
Database Schema Updates
```sql
-- Add deployment tracking fields to generated_content
ALTER TABLE generated_content ADD COLUMN deployed_url TEXT NULL;
ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP NULL;
CREATE INDEX idx_generated_content_deployed ON generated_content(deployed_at);
```
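A hedged sketch of what `scripts/migrate_add_deployment_fields.py` could look like for a SQLite database, guarded so re-running it is safe; the project's actual database engine and migration tooling may differ:

```python
# Hypothetical idempotent migration: checks existing columns before altering.
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    # Column names already present on generated_content
    existing = {row[1] for row in conn.execute("PRAGMA table_info(generated_content)")}
    with conn:
        if "deployed_url" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_url TEXT NULL")
        if "deployed_at" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP NULL")
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_generated_content_deployed "
            "ON generated_content(deployed_at)"
        )
```

Because each step checks for prior state, running the script twice (e.g., after a partial deploy of the migration) is harmless.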
Dependencies
- Story 3.1: Site assignment (need site_deployment_id on articles)
- Story 3.3: Content interlinking (HTML must be finalized)
- Story 3.4: Boilerplate pages (need SitePage table)
- Bunny.net Storage API access
- Environment variables configured in `.env`
Future Considerations
- Story 4.2: URL logging (partially implemented here)
- Story 4.3: Database status updates (partially implemented here)
- Story 4.4: Post-deployment verification
- Multi-cloud support (AWS S3, Azure, DigitalOcean, etc.)
- CDN cache purging after deployment
- Parallel uploads for faster deployment
- Resumable uploads for large files
- Deployment rollback mechanism
Technical Debt Created
- Multi-cloud support deferred (only Bunny.net for now)
- No CDN cache purging yet (Story 4.x)
- No deployment verification yet (Story 4.4)
- URL logging is simple (no database tracking of logged URLs)
- Boilerplate pages stored as full HTML in the database (inefficient)
  - Better approach: store just a page_type marker and regenerate HTML on-the-fly at deployment
  - Reduces storage, ensures consistency with current templates
  - Defer optimization to a later story
Total Effort
22 story points
Effort Breakdown
- Bunny Storage Client (3 points)
- Deployment Service (3 points)
- URL Generation (2 points)
- URL Logging (2 points)
- Database Updates (2 points)
- CLI Command (2 points)
- Batch Integration (2 points)
- Environment Audit (1 point)
- Unit Tests (3 points)
- Integration Tests (2 points)
Questions & Clarifications
Question 1: Boilerplate Page Deployment Strategy
Status: ✓ RESOLVED
The approach:
- Check the `site_pages` table in the database
- Only deploy boilerplate pages if they exist in the DB
- Read HTML content from the `site_pages.content` field
- Most sites won't have them (only newly created sites from Story 3.4+)
- Don't check remote buckets (the database is the source of truth)
Question 2: URL Duplicate Prevention
Status: ✓ RESOLVED
Approach:
- Read entire file before appending
- Check if URL exists in memory (set), skip if duplicate
- File locking for thread-safety
- This prevents duplicate URLs from manual re-runs after automatic deployment
- No database tracking needed (file is source of truth)
Question 3: Auto-deploy Default Behavior
Status: ✓ RESOLVED
Decision: ON by default
- Auto-deploy after batch generation completes
- No reason to delay deployment in normal workflow
- CLI command still available for manual re-deployment if auto-deploy fails
- Can be disabled for testing via flag if needed
Question 4: API Keys in master.config.json
Status: ✓ RESOLVED
Decision: Ignore master.config.json for API keys
- All API keys come from the `.env` file only
- Even if keys exist in master.config.json now, they'll be removed in future epics
- Don't reference master.config.json for any authentication
- Only use `.env` for credentials
Notes
- Keep deployment simple for first iteration
- Focus on reliability over speed
- Auto-deploy is ON by default (deploy immediately after batch generation)
- Manual CLI command available for re-deployment or testing
- Comprehensive error reporting is critical
- URL logging format is simple (one URL per line)
- All API keys come from the `.env` file, NOT master.config.json
- Storage API authentication details will be determined during implementation