# Story 4.1: Deploy Content to Cloud Storage
## Status
**DRAFT** - Needs Review
## Story
**As a developer**, I want to upload all generated HTML files for a batch to their designated Bunny.net storage buckets so that the content is hosted and ready to be served.
## Context
- Epic 4 is about deploying finalized content to cloud storage
- Story 3.4 implemented boilerplate site pages (about, contact, privacy)
- Articles have URLs and are assigned to sites (Story 3.1)
- Interlinking is complete (Story 3.3)
- Content is ready to deploy after batch processing completes
- Bunny.net is the only cloud provider for now (multi-cloud is technical debt)
## Acceptance Criteria
### Core Deployment Functionality
- CLI command `deploy-batch --batch_id <id>` deploys all content in a batch
- Deployment is also triggered automatically after batch generation completes
- Deployment uploads both articles and boilerplate pages (about, contact, privacy)
- For boilerplate pages: check the `site_pages` table and deploy the pages that exist
  - Read HTML content directly from the `site_pages.content` field (stored in Story 3.4)
- Authentication uses:
  - `BUNNY_API_KEY` from `.env` (storage API operations)
  - `storage_zone_password` from the SiteDeployment model (per-zone)
  - `BUNNY_ACCOUNT_API_KEY` from `.env` (only for creating zones, not uploads)
- For each piece of content, identify correct destination storage bucket/path
- Upload final HTML to target path (e.g., `about.html`, `my-article-slug.html`)
### Error Handling
- Continue on error (don't halt entire deployment if one file fails)
- Log errors for individual file failures
- Report a summary at the end: successful uploads, failed uploads, total time
  - Output to both the screen and the log file
### URL Tracking (Story 4.2 Preview)
- After article is successfully deployed, log its public URL to tier-segregated text file
- Create `deployment_logs/` folder if it doesn't exist
- Two files per day: `YYYY-MM-DD_tier1_urls.txt` and `YYYY-MM-DD_other_tiers_urls.txt`
  - URLs for Tier 1 articles → `_tier1_urls.txt`
  - URLs for Tier 2+ articles → `_other_tiers_urls.txt`
- Boilerplate pages (about, contact, privacy) are NOT logged to these files
- **Must avoid duplicate URLs**: read the file, check whether the URL already exists, and append only if it is new
  - Prevents duplicates from manual re-runs after automatic deployment
### Database Updates (Story 4.3 Preview)
- Update article status to 'deployed' after successful upload
- Store final public URL in database
- Transactional updates to ensure data integrity
## Tasks / Subtasks
### 1. Create Bunny.net Storage Upload Client
**Effort:** 3 story points
- [ ] Create `src/deployment/bunny_storage.py` module
- [ ] Implement `BunnyStorageClient` class for uploading files
- [ ] Use Bunny.net Storage API (different from Account API)
- [ ] Authentication using:
  - `BUNNY_API_KEY` from `.env` (account-level storage API key)
  - `storage_zone_password` from the SiteDeployment model (per-zone password)
  - Determine the correct authentication method during implementation
- [ ] Methods:
  - `upload_file(zone_name, zone_password, file_path, content, content_type='text/html')`
  - `file_exists(zone_name, zone_password, file_path) -> bool`
  - `list_files(zone_name, zone_password, prefix='') -> List[str]`
- [ ] Handle HTTP errors, timeouts, retries (3 retries with exponential backoff)
- [ ] Logging at INFO level for uploads, ERROR for failures
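The retry behavior above can be sketched as follows. This is a minimal skeleton, not the final client: the HTTP transport is injected as a plain callable so the auth details (still TBD) and the network stay out of the retry logic, and the `AccessKey` header is an assumption based on the notes later in this story.

```python
import time
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)

class BunnyStorageClient:
    """Sketch of the upload client: 3 retries with exponential backoff.

    `http_put` is an injected callable (url, headers, body) -> status code,
    so unit tests can stub the network. Real auth headers are TBD.
    """
    BASE_URL = "https://storage.bunnycdn.com"

    def __init__(self, http_put: Callable[[str, Dict[str, str], bytes], int],
                 max_retries: int = 3, base_delay: float = 1.0):
        self._http_put = http_put
        self.max_retries = max_retries
        self.base_delay = base_delay

    def upload_file(self, zone_name: str, zone_password: str, file_path: str,
                    content: str, content_type: str = "text/html") -> None:
        url = f"{self.BASE_URL}/{zone_name}/{file_path}"
        # Assumed header; the exact credential (zone password vs. account key) is TBD
        headers = {"AccessKey": zone_password, "Content-Type": content_type}
        delay = self.base_delay
        for attempt in range(1, self.max_retries + 1):
            try:
                status = self._http_put(url, headers, content.encode("utf-8"))
                if status in (200, 201):
                    logger.info("Uploaded %s (attempt %d)", url, attempt)
                    return
                raise RuntimeError(f"HTTP {status} for {url}")
            except Exception:
                if attempt == self.max_retries:
                    logger.error("Giving up on %s after %d attempts", url, attempt)
                    raise
                time.sleep(delay)
                delay *= 2  # exponential backoff
```

Injecting the transport keeps the retry loop trivially unit-testable without mocking a whole HTTP library.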
### 2. Create Deployment Service
**Effort:** 3 story points
- [ ] Create `src/deployment/deployment_service.py` module
- [ ] Implement `DeploymentService` class with:
  - `deploy_batch(batch_id, project_id, continue_on_error=True)`
  - `deploy_article(content_id, site_deployment)`
  - `deploy_boilerplate_page(site_page, site_deployment)`
- [ ] Query all `GeneratedContent` records for project_id
- [ ] Query all `SitePage` records for sites in batch
- [ ] For each article:
  - Get site deployment info (storage zone, region, hostname)
  - Generate the file path (slug-based, e.g., `my-article-slug.html`)
  - Upload HTML content to Bunny.net storage
  - Log success/failure
- [ ] For each boilerplate page (if it exists):
  - Get site deployment info
  - Generate the file path (e.g., `about.html`, `contact.html`, `privacy.html`)
  - Upload HTML content
  - Log success/failure
- [ ] Track deployment results (successful, failed, skipped)
- [ ] Return deployment summary
### 3. Implement URL Generation for Deployment
**Effort:** 2 story points
- [ ] Extend `src/generation/url_generator.py` module
- [ ] Add `generate_public_url(site_deployment, file_path) -> str`:
  - Use `custom_hostname` if available, otherwise `pull_zone_bcdn_hostname`
  - Return the full URL: `https://{hostname}/{file_path}`
- [ ] Add `generate_file_path(content) -> str`:
  - For articles: use a slug derived from the title or keyword (lowercase, hyphens, `.html` extension)
  - For boilerplate pages: fixed names (`about.html`, `contact.html`, `privacy.html`)
- [ ] Handle edge cases (special characters, long slugs, conflicts)
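A minimal sketch of the slug and URL helpers described above. The function signatures here are simplified for illustration (taking a title string and hostname directly rather than the `content` and `site_deployment` objects); the edge-case handling shown (accents, special characters, long or empty slugs) matches the checklist:

```python
import re
import unicodedata

def generate_file_path(title: str, max_len: int = 80) -> str:
    """Slug-based file path: lowercase, hyphens, .html extension."""
    # Fold accented characters to ASCII; drop anything unmappable
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore").decode("ascii"))
    # Collapse runs of non-alphanumerics into single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Guard against over-long or empty slugs
    slug = slug[:max_len].rstrip("-") or "untitled"
    return f"{slug}.html"

def generate_public_url(hostname: str, file_path: str) -> str:
    """Caller picks custom_hostname or pull_zone_bcdn_hostname upstream."""
    return f"https://{hostname}/{file_path}"
```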
### 4. Implement URL Logging to Text Files
**Effort:** 2 story points
- [ ] Create `src/deployment/url_logger.py` module
- [ ] Implement `URLLogger` class with:
  - `log_article_url(url, tier, date=None)`
  - `get_existing_urls(tier, date=None) -> Set[str]`
- [ ] Create the `deployment_logs/` directory if it doesn't exist
- [ ] Determine the file based on tier and date:
  - Tier 1: `deployment_logs/YYYY-MM-DD_tier1_urls.txt`
  - Tier 2+: `deployment_logs/YYYY-MM-DD_other_tiers_urls.txt`
- [ ] Check if URL already exists in file before appending
- [ ] Append URL to file (one per line)
- [ ] Thread-safe file writing (use file locks)
### 5. Implement Database Status Updates
**Effort:** 2 story points
- [ ] Update `src/database/models.py`:
  - Add a `deployed_url` field to `GeneratedContent` (nullable string)
  - Add a `deployed_at` field to `GeneratedContent` (nullable datetime)
- [ ] Create migration script `scripts/migrate_add_deployment_fields.py`
- [ ] Update `GeneratedContentRepository` with:
  - `mark_as_deployed(content_id, url, timestamp=None)`
  - `get_deployed_content(project_id) -> List[GeneratedContent]`
- [ ] Use transactions to ensure atomicity
- [ ] Log status updates at INFO level
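The transactional update can be sketched as below. This uses `sqlite3` purely for illustration (the story doesn't pin a database driver), and the column/status names follow the schema notes later in this story:

```python
import sqlite3
from datetime import datetime, timezone

def mark_as_deployed(conn: sqlite3.Connection, content_id: int,
                     url: str, timestamp: datetime = None) -> None:
    """Set status, deployed_url, and deployed_at atomically."""
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    # `with conn:` opens a transaction: commits on success, rolls back on error
    with conn:
        conn.execute(
            "UPDATE generated_content "
            "SET status = 'deployed', deployed_url = ?, deployed_at = ? "
            "WHERE id = ?",
            (url, ts, content_id),
        )
```

Wrapping the whole statement in one transaction means a failure mid-update can never leave a row marked deployed without its URL.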
### 6. Create CLI Command: deploy-batch
**Effort:** 2 story points
- [ ] Add `deploy-batch` command to `src/cli/commands.py`
- [ ] Arguments:
  - `--batch_id` (required): batch/project ID to deploy
  - `--admin-user` (optional): admin username for authentication
  - `--admin-password` (optional): admin password
  - `--continue-on-error` (default: True): continue if a file fails
  - `--dry-run` (default: False): preview what would be deployed
- [ ] Authenticate admin user
- [ ] Load Bunny.net credentials from `.env`
- [ ] Call `DeploymentService.deploy_batch()`
- [ ] Display progress (articles uploaded, pages uploaded, errors)
- [ ] Show final summary with statistics
- [ ] Exit code 0 if all succeeded, 1 if any failures
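The argument surface above could be wired up roughly like this with `argparse`; the parser-builder function name is hypothetical, and the actual CLI framework in `src/cli/commands.py` may differ:

```python
import argparse

def build_deploy_parser() -> argparse.ArgumentParser:
    """CLI surface for `deploy-batch`, mirroring the argument list above."""
    p = argparse.ArgumentParser(
        prog="deploy-batch",
        description="Deploy all content in a batch to cloud storage",
    )
    p.add_argument("--batch_id", required=True, type=int,
                   help="Batch/project ID to deploy")
    p.add_argument("--admin-user", help="Admin username for authentication")
    p.add_argument("--admin-password", help="Admin password")
    # BooleanOptionalAction gives --continue-on-error / --no-continue-on-error
    p.add_argument("--continue-on-error", action=argparse.BooleanOptionalAction,
                   default=True, help="Continue if a file fails")
    p.add_argument("--dry-run", action="store_true", default=False,
                   help="Preview what would be deployed without uploading")
    return p
```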
### 7. Integrate Deployment into Batch Processing
**Effort:** 2 story points
- [ ] Update `src/generation/batch_processor.py`
- [ ] Add optional `auto_deploy` parameter to `process_job()`
- [ ] After interlinking completes, trigger deployment if `auto_deploy=True`
- [ ] Use same deployment service as CLI command
- [ ] Log deployment results
- [ ] Handle deployment errors gracefully (don't fail batch if deployment fails)
- [ ] Make `auto_deploy=True` by default (deploy immediately after generation)
- [ ] Allow `auto_deploy=False` flag for testing/debugging scenarios
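The "don't fail the batch if deployment fails" rule can be sketched as a small wrapper; `run_auto_deploy` and `deploy_fn` are hypothetical names standing in for the batch processor's hook and `DeploymentService.deploy_batch`:

```python
import logging

logger = logging.getLogger(__name__)

def run_auto_deploy(job_id: int, deploy_fn, auto_deploy: bool = True) -> bool:
    """Trigger deployment after interlinking; never let it fail the batch.

    Returns True if deployment succeeded, False if it was skipped or failed.
    """
    if not auto_deploy:
        logger.info("auto_deploy disabled for job %s, skipping", job_id)
        return False
    try:
        deploy_fn(job_id)
        return True
    except Exception:
        # Deployment failure is logged but must not propagate to the batch
        logger.exception("Auto-deploy failed for job %s", job_id)
        return False
```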
### 8. Environment Variable Validation
**Effort:** 1 story point
- [ ] Confirm `src/core/config.py` loads Bunny.net keys from `.env` only
- [ ] Add validation in the deployment service to check required env vars:
  - `BUNNY_API_KEY` (for storage uploads)
  - `BUNNY_ACCOUNT_API_KEY` (for account operations, if needed)
- [ ] Raise clear error if keys are missing
- [ ] Document in technical notes which keys are required
- [ ] Do NOT reference `master.config.json` for any API keys
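A minimal validation sketch for the checklist above; the function name is an assumption, and the mapping argument exists only so the check is testable without touching the real environment:

```python
import os

REQUIRED_ENV_KEYS = ("BUNNY_API_KEY", "BUNNY_ACCOUNT_API_KEY")

def validate_bunny_env(env=None) -> None:
    """Raise one clear error listing every missing Bunny.net credential."""
    env = os.environ if env is None else env
    missing = [k for k in REQUIRED_ENV_KEYS if not env.get(k)]
    if missing:
        raise EnvironmentError(
            "Missing required Bunny.net credentials in .env: "
            + ", ".join(missing)
        )
```

Reporting all missing keys at once (rather than failing on the first) saves a round-trip of fix-and-retry.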
### 9. Unit Tests
**Effort:** 3 story points
- [ ] Test `BunnyStorageClient` upload functionality (mock HTTP calls)
- [ ] Test URL generation for various content types
- [ ] Test file path generation (slug creation, special characters)
- [ ] Test URL logger (file creation, duplicate prevention)
- [ ] Test deployment service (successful upload, failed upload, mixed results)
- [ ] Test database status updates
- [ ] Mock Bunny.net API responses
- [ ] Achieve >80% code coverage for new modules
### 10. Integration Tests
**Effort:** 2 story points
- [ ] Test end-to-end deployment of small batch (2-3 articles)
- [ ] Test deployment with boilerplate pages
- [ ] Test deployment without boilerplate pages
- [ ] Test URL logging (multiple deployments, different days)
- [ ] Test database updates (status changes, URLs stored)
- [ ] Test CLI command with dry-run mode
- [ ] Test continue-on-error behavior
- [ ] Verify no duplicate URLs in log files
## Technical Notes
### Bunny.net Storage API
Bunny.net has two separate APIs:
1. **Account API** (existing `BunnyNetClient`): for creating storage zones and pull zones
   - Uses `BUNNY_ACCOUNT_API_KEY` from `.env`
2. **Storage API** (new `BunnyStorageClient`): for uploading and managing files
   - Uses `BUNNY_API_KEY` from `.env` (account-level storage access)
   - Uses `storage_zone_password` from the `SiteDeployment` model (per-zone password)
   - May require both credentials for authentication

Storage API authentication:
- Base URL: `https://storage.bunnycdn.com/{zone_name}/{file_path}`
- Authentication method to be determined during implementation:
  - `BUNNY_API_KEY` from `.env` (account-level)
  - `storage_zone_password` from the database (per-zone, returned in the JSON response when the zone is created)
  - May require one or both keys, depending on Bunny.net's API requirements
- The storage zone password can be extracted from Bunny.net's JSON response during zone creation
- If implementation issues arise, reference code/examples can be provided
Upload example (illustrative sketch using `requests`; the exact credential in the `AccessKey` header is still TBD):
```python
import os
import requests

# Get the site from the database
site = site_repo.get_by_id(site_deployment_id)
# Get the account-level storage key from .env
bunny_api_key = os.getenv("BUNNY_API_KEY")

# PUT https://storage.bunnycdn.com/{zone_name}/{file_path}
response = requests.put(
    f"https://storage.bunnycdn.com/{site.storage_zone_name}/my-article.html",
    headers={
        # TBD during implementation: bunny_api_key OR site.storage_zone_password
        "AccessKey": site.storage_zone_password,
        "Content-Type": "text/html",
    },
    data="<html>...</html>",
)
response.raise_for_status()
```
### File Path Structure
```
Storage Zone: my-zone
Region: DE (Germany)
Articles:
/my-article-slug.html
/another-article.html
/third-article-title.html
Boilerplate pages:
/about.html
/contact.html
/privacy.html
Not using subdirectories for simplicity
Future: Could organize by date or category
```
### URL Logger Implementation
```python
# src/deployment/url_logger.py
import os
from datetime import datetime
from typing import Set
from pathlib import Path
import fcntl # For file locking on Unix
class URLLogger:
    def __init__(self, logs_dir: str = "deployment_logs"):
        self.logs_dir = Path(logs_dir)
        self.logs_dir.mkdir(exist_ok=True)

    def log_article_url(self, url: str, tier: str, date: datetime = None):
        if date is None:
            date = datetime.utcnow()
        filepath = self._filepath_for(tier, date)
        # Check for duplicates before appending
        if url in self.get_existing_urls(tier, date):
            return  # Skip duplicate
        # Append to file under an exclusive lock
        with open(filepath, 'a') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write(f"{url}\n")
            fcntl.flock(f, fcntl.LOCK_UN)

    def get_existing_urls(self, tier: str, date: datetime = None) -> Set[str]:
        """
        Get existing URLs from the log file to prevent duplicates.

        This is critical for preventing duplicate entries when:
        - Auto-deployment runs, then a manual re-run happens
        - Deployment fails partway and is restarted
        """
        if date is None:
            date = datetime.utcnow()
        filepath = self._filepath_for(tier, date)
        if not filepath.exists():
            return set()
        with open(filepath, 'r') as f:
            return set(line.strip() for line in f if line.strip())

    def _filepath_for(self, tier: str, date: datetime) -> Path:
        # Tier 1 gets its own file; all other tiers share one
        if self._extract_tier_number(tier) == 1:
            filename = f"{date.strftime('%Y-%m-%d')}_tier1_urls.txt"
        else:
            filename = f"{date.strftime('%Y-%m-%d')}_other_tiers_urls.txt"
        return self.logs_dir / filename

    def _extract_tier_number(self, tier: str) -> int:
        # Extract the number from "tier1", "tier2", etc.
        return int(''.join(c for c in tier if c.isdigit()))
```
### Deployment Service Implementation
```python
# src/deployment/deployment_service.py
from typing import List, Dict, Any
from src.deployment.bunny_storage import BunnyStorageClient
from src.deployment.url_logger import URLLogger
from src.database.repositories import GeneratedContentRepository, SitePageRepository, SiteDeploymentRepository
from src.generation.url_generator import generate_public_url, generate_file_path
import logging
logger = logging.getLogger(__name__)
class DeploymentService:
    def __init__(
        self,
        storage_client: BunnyStorageClient,
        content_repo: GeneratedContentRepository,
        site_repo: SiteDeploymentRepository,
        page_repo: SitePageRepository,
        url_logger: URLLogger,
    ):
        self.storage = storage_client
        self.content_repo = content_repo
        self.site_repo = site_repo
        self.page_repo = page_repo
        self.url_logger = url_logger

    def deploy_batch(self, project_id: int, continue_on_error: bool = True) -> Dict[str, Any]:
        """
        Deploy all content for a project/batch.

        Returns:
            Dict with deployment statistics:
            {
                'articles_deployed': 10,
                'articles_failed': 1,
                'pages_deployed': 6,
                'pages_failed': 0,
                'total_time': 45.2
            }
        """
        results = {
            'articles_deployed': 0,
            'articles_failed': 0,
            'pages_deployed': 0,
            'pages_failed': 0,
            'errors': []
        }
        # Get all articles for the project
        articles = self.content_repo.get_by_project_id(project_id)
        logger.info(f"Found {len(articles)} articles to deploy for project {project_id}")
        # Deploy articles
        for article in articles:
            if not article.site_deployment_id:
                logger.warning(f"Article {article.id} has no site assigned, skipping")
                continue
            try:
                site = self.site_repo.get_by_id(article.site_deployment_id)
                if not site:
                    raise ValueError(f"Site {article.site_deployment_id} not found")
                # Deploy the article
                url = self.deploy_article(article, site)
                # Log the URL to the tier-segregated text file
                self.url_logger.log_article_url(url, article.tier)
                # Update the database
                self.content_repo.mark_as_deployed(article.id, url)
                results['articles_deployed'] += 1
                logger.info(f"Deployed article {article.id} to {url}")
            except Exception as e:
                results['articles_failed'] += 1
                results['errors'].append({
                    'type': 'article',
                    'id': article.id,
                    'error': str(e)
                })
                logger.error(f"Failed to deploy article {article.id}: {e}")
                if not continue_on_error:
                    raise
        # Get the unique sites referenced by the articles
        site_ids = set(a.site_deployment_id for a in articles if a.site_deployment_id)
        # Deploy boilerplate pages for each site
        for site_id in site_ids:
            site = self.site_repo.get_by_id(site_id)
            pages = self.page_repo.get_by_site(site_id)
            if not pages:
                logger.debug(f"Site {site_id} has no boilerplate pages, skipping")
                continue
            logger.info(f"Found {len(pages)} boilerplate pages for site {site_id}")
            for page in pages:
                try:
                    # Reads HTML from the database (stored in page.content since Story 3.4)
                    url = self.deploy_boilerplate_page(page, site)
                    results['pages_deployed'] += 1
                    logger.info(f"Deployed page {page.page_type} to {url}")
                except Exception as e:
                    results['pages_failed'] += 1
                    results['errors'].append({
                        'type': 'page',
                        'site_id': site_id,
                        'page_type': page.page_type,
                        'error': str(e)
                    })
                    logger.error(f"Failed to deploy page {page.page_type} for site {site_id}: {e}")
                    if not continue_on_error:
                        raise
        return results

    def deploy_article(self, article, site) -> str:
        """Deploy a single article and return its public URL."""
        file_path = generate_file_path(article)
        url = generate_public_url(site, file_path)
        # Upload using BUNNY_API_KEY and/or the zone password;
        # BunnyStorageClient determines which auth method to use
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,  # Per-zone password from DB
            file_path=file_path,
            content=article.formatted_html,
            content_type='text/html'
        )
        return url

    def deploy_boilerplate_page(self, page, site) -> str:
        """
        Deploy a boilerplate page and return its public URL.

        Note: uses the stored HTML from page.content (from Story 3.4).
        Technical debt: could regenerate on the fly instead of storing.
        """
        file_path = f"{page.page_type}.html"
        url = generate_public_url(site, file_path)
        self.storage.upload_file(
            zone_name=site.storage_zone_name,
            zone_password=site.storage_zone_password,
            file_path=file_path,
            content=page.content,  # Full HTML stored in the DB
            content_type='text/html'
        )
        return url
```
### CLI Command Example
```bash
# Deploy a batch manually
uv run python -m src.cli deploy-batch \
--batch_id 123 \
--admin-user admin \
--admin-password mypass
# Output:
# Authenticating...
# Loading Bunny.net credentials...
# Deploying batch 123...
# [1/50] Deploying article "How to Fix Engines"... ✓
# [2/50] Deploying article "Engine Maintenance Tips"... ✓
# ...
# [50/50] Deploying article "Common Engine Problems"... ✓
# Deploying boilerplate pages...
# [1/6] Deploying about.html for site1.b-cdn.net... ✓
# [2/6] Deploying contact.html for site1.b-cdn.net... ✓
# ...
#
# Deployment Summary:
# ==================
# Articles deployed: 48
# Articles failed: 2
# Pages deployed: 6
# Pages failed: 0
# Total time: 2m 34s
#
# Failed articles:
# - Article 15: Connection timeout
# - Article 32: Invalid HTML content
# Dry-run mode
uv run python -m src.cli deploy-batch \
--batch_id 123 \
--dry-run
# Output shows what would be deployed without actually uploading
```
### Environment Variables
Required in `.env` file:
```bash
# Bunny.net Account API (for creating/managing storage zones and pull zones)
BUNNY_ACCOUNT_API_KEY=your_account_api_key_here
# Bunny.net Storage API (for uploading files to storage)
BUNNY_API_KEY=your_storage_api_key_here
# Note: storage_zone_password is per-zone and stored in database
# Both BUNNY_API_KEY and storage_zone_password may be needed for uploads
# API keys should ONLY be in .env file, NOT in master.config.json
```
### Database Schema Updates
```sql
-- Add deployment tracking fields to generated_content
ALTER TABLE generated_content ADD COLUMN deployed_url TEXT NULL;
ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP NULL;
CREATE INDEX idx_generated_content_deployed ON generated_content(deployed_at);
```
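The migration script called for in Task 5 could be sketched as an idempotent wrapper around the SQL above. `sqlite3` is assumed here for illustration, and the column-existence check makes the script safe to re-run:

```python
import sqlite3

def migrate_add_deployment_fields(conn: sqlite3.Connection) -> None:
    """Idempotent migration: safe to run more than once."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(generated_content)")}
    with conn:  # single transaction
        if "deployed_url" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_url TEXT")
        if "deployed_at" not in existing:
            conn.execute("ALTER TABLE generated_content ADD COLUMN deployed_at TIMESTAMP")
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_generated_content_deployed "
            "ON generated_content(deployed_at)"
        )
```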
## Dependencies
- Story 3.1: Site assignment (need site_deployment_id on articles)
- Story 3.3: Content interlinking (HTML must be finalized)
- Story 3.4: Boilerplate pages (need SitePage table)
- Bunny.net Storage API access
- Environment variables configured in `.env`
## Future Considerations
- Story 4.2: URL logging (partially implemented here)
- Story 4.3: Database status updates (partially implemented here)
- Story 4.4: Post-deployment verification
- Multi-cloud support (AWS S3, Azure, DigitalOcean, etc.)
- CDN cache purging after deployment
- Parallel uploads for faster deployment
- Resumable uploads for large files
- Deployment rollback mechanism
## Technical Debt Created
- Multi-cloud support deferred (only Bunny.net for now)
- No CDN cache purging yet (Story 4.x)
- No deployment verification yet (Story 4.4)
- URL logging is simple (no database tracking of logged URLs)
- Boilerplate pages stored as full HTML in the database (inefficient)
  - Better approach: store just a `page_type` marker and regenerate the HTML on the fly at deployment
  - Reduces storage and ensures consistency with current templates
  - Defer this optimization to a later story
## Total Effort
22 story points
### Effort Breakdown
1. Bunny Storage Client (3 points)
2. Deployment Service (3 points)
3. URL Generation (2 points)
4. URL Logging (2 points)
5. Database Updates (2 points)
6. CLI Command (2 points)
7. Batch Integration (2 points)
8. Environment Audit (1 point)
9. Unit Tests (3 points)
10. Integration Tests (2 points)
## Questions & Clarifications
### Question 1: Boilerplate Page Deployment Strategy
**Status:** ✓ RESOLVED
The approach:
- Check `site_pages` table in database
- Only deploy boilerplate pages if they exist in DB
- Read HTML content from `site_pages.content` field
- Most sites won't have them (only newly created sites from Story 3.4+)
- Don't check remote buckets (database is source of truth)
### Question 2: URL Duplicate Prevention
**Status:** ✓ RESOLVED
Approach:
- Read entire file before appending
- Check if URL exists in memory (set), skip if duplicate
- File locking for thread-safety
- This prevents duplicate URLs from manual re-runs after automatic deployment
- No database tracking needed (file is source of truth)
### Question 3: Auto-deploy Default Behavior
**Status:** ✓ RESOLVED
Decision: **ON by default**
- Auto-deploy after batch generation completes
- No reason to delay deployment in normal workflow
- CLI command still available for manual re-deployment if auto-deploy fails
- Can be disabled for testing via flag if needed
### Question 4: API Keys in master.config.json
**Status:** ✓ RESOLVED
Decision: **Ignore master.config.json for API keys**
- All API keys come from `.env` file only
- Even if keys exist in master.config.json now, they'll be removed in future epics
- Don't reference master.config.json for any authentication
- Only use .env for credentials
## Notes
- Keep deployment simple for first iteration
- Focus on reliability over speed
- Auto-deploy is ON by default (deploy immediately after batch generation)
- Manual CLI command available for re-deployment or testing
- Comprehensive error reporting is critical
- URL logging format is simple (one URL per line)
- All API keys come from `.env` file, NOT master.config.json
- Storage API authentication details will be determined during implementation