# Story 3.1: Generate and Validate Article URLs

## Status
Finished

## Story
**As a developer**, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.

## Context
- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
- Remaining articles have `site_deployment_id = null` and need site assignment
- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
- Sites can have custom domains (`custom_hostname`) OR just bunny.net CDN hostnames (`pull_zone_bcdn_hostname`)
- System has 400+ existing bunny.net buckets without custom domains that should be usable
- Job config can specify preferred sites for tier1 articles
- Job config can request auto-creation of new sites if pool is insufficient
- Job config can request pre-creation of sites for specific keywords/entities

## Acceptance Criteria

### Database Changes
- `custom_hostname` field in `SiteDeployment` table is nullable (was previously required)
- `pull_zone_bcdn_hostname` field in `SiteDeployment` table has unique constraint
- Database migration script updates existing schema without data loss

### Repository Updates
- `SiteDeploymentRepository.create()` accepts optional `custom_hostname` parameter (defaults to `None`)
- `SiteDeploymentRepository.get_by_bcdn_hostname()` method added to query by bunny.net hostname
- Repository interface (`ISiteDeploymentRepository`) updated to reflect optional `custom_hostname`

### Job Configuration Extensions
- Job config supports optional `tier1_preferred_sites` array (list of hostnames for tier1 assignment)
- Job config supports optional `auto_create_sites` boolean (default: false)
- Job config supports optional `create_sites_for_keywords` array of `{keyword: str, count: int}` objects
- Invalid hostnames in `tier1_preferred_sites` are rejected with a clear error message
- Validation occurs at start of Story 3.1 workflow

### Site Creation Logic
- If `create_sites_for_keywords` specified:
  - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
  - Site names: `{keyword-slug}-{random-suffix}` (e.g., "engine-repair-a8f3")
  - Creates in default region (configurable, default "DE")
  - Stores in database with `custom_hostname = null`
  - These sites are added to the available pool BEFORE assignment
- If `auto_create_sites: true` and not enough sites after assignment:
  - Creates additional generic bunny.net sites on-demand
  - Generic site names: `{project-keyword-slug}-{random-suffix}`
  - Creates only the minimum needed
- All created sites use bunny.net API (same as `provision-site` command, but without custom domain step)

### Site Assignment Logic
- A function accepts a batch of `GeneratedContent` records (all from same batch/tier)
- For tier1 articles with `site_deployment_id = null`:
  - **Priority 1:** Assign from `tier1_preferred_sites` (if specified in job config)
  - **Priority 2:** Random selection from available pool
- For tier2+ articles with `site_deployment_id = null`:
  - Random selection from available pool
- Assignment rules:
  - Ensures no two articles in the batch get assigned the same site
  - Can reuse sites from previous batches (only same-batch collision matters)
  - Pre-created keyword sites are in the available pool
- Updates the `GeneratedContent.site_deployment_id` field in database
- For articles that already have `site_deployment_id` set (from Story 2.5): leaves them unchanged
- If not enough available sites exist:
  - If `auto_create_sites: true`: create more sites
  - If `auto_create_sites: false`: raise clear error with count of sites needed vs available

### URL Generation Logic
- A function generates the final public URL for each article based on its assigned site
- URL structure: `https://{hostname}/{slug}.html`
- Hostname selection:
  - If site has `custom_hostname`: use `custom_hostname`
  - If site has no `custom_hostname`: use `pull_zone_bcdn_hostname`
- Slug generation from article title:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters (keep only alphanumeric and hyphens)
  - Trim to reasonable length (e.g., max 100 characters)
  - Example: "How to Fix Your Engine" → "how-to-fix-your-engine"

### Output
- Function returns a list of URL mappings: `[{content_id: int, title: str, url: str, tier: int}, ...]`
- URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")

### Error Handling
- If not enough sites available: clear error message with count needed
- If site lookup fails: clear error with site_id
- If slug generation produces empty string: use fallback (e.g., "article-{content_id}")

## Tasks / Subtasks

### 1. Database Schema Changes
**Effort:** 2 story points

- [ ] Create migration script to alter `site_deployments` table:
  - Make `custom_hostname` nullable
  - Add unique constraint to `pull_zone_bcdn_hostname`
- [ ] Update `src/database/models.py`:
  - Change `custom_hostname: Mapped[str]` to `Mapped[Optional[str]]`
  - Change `nullable=False` to `nullable=True`
  - Add `unique=True` to `pull_zone_bcdn_hostname` field
- [ ] Test migration on development database

### 2. Update Repository Layer
**Effort:** 2 story points

- [ ] Update `src/database/interfaces.py`:
  - Change `custom_hostname: str` to `custom_hostname: Optional[str]` in `create()` signature
  - Add `get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment]` method signature
- [ ] Update `src/database/repositories.py`:
  - Make `custom_hostname` parameter optional in `create()` with default `None`
  - Update uniqueness validation to handle nullable `custom_hostname`
  - Implement `get_by_bcdn_hostname()` method
  - Update `exists()` method to check both hostname types

### 3. Implement Site Creation Logic
**Effort:** 3 story points

- [ ] Create new module: `src/generation/site_provisioning.py`
- [ ] Implement `create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment`:
  - Call bunny.net API to create Storage Zone
  - Call bunny.net API to create Pull Zone linked to storage
  - Generate unique site name with random suffix
  - Save to database with `custom_hostname = null`
  - Return SiteDeployment record
- [ ] Implement `provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]`:
  - For each keyword+count, create N sites
  - Use keyword in site name (slugified)
  - Return list of created sites
- [ ] Log site creation at INFO level

### 4. Implement Site Assignment Logic
**Effort:** 4 story points (increased from 3)

- [ ] Update job config schema (`src/generation/job_config.py`):
  - Add `tier1_preferred_sites: Optional[List[str]]`
  - Add `auto_create_sites: Optional[bool] = False`
  - Add `create_sites_for_keywords: Optional[List[Dict]]`
- [ ] Create new module: `src/generation/site_assignment.py`
- [ ] Implement `assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None`:
  - Pre-create sites for keywords if specified
  - Query all available sites from database
  - Filter out sites already assigned to articles in this batch
  - For tier1 articles: try preferred sites first, then random
  - For tier2+ articles: random only
  - If insufficient sites and auto_create=true: create more
  - Validate enough sites are available (raise error if auto_create=false)
  - Update `GeneratedContent.site_deployment_id` in database
- [ ] Log assignment decisions at INFO level

### 5. Implement URL Generation Logic
**Effort:** 2 story points

- [ ] Create `src/generation/url_generator.py`
- [ ] Implement `generate_slug(title: str) -> str`:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters
  - Trim to max length
  - Handle edge cases (empty result, etc.)
- [ ] Implement `generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]`:
  - For each article, lookup its site
  - Determine hostname (custom or bcdn)
  - Generate slug from title
  - Build complete URL
  - Return list of mappings

### 6. Update Template Service
**Effort:** 1 story point

- [ ] Update `src/templating/service.py`:
  - Line 92: Change to `hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname`
  - Ensure template mapping works for both hostname types

### 7. Update CLI for Undomained Sites
**Effort:** 2 story points

- [ ] Update `sync-sites` command in `src/cli/commands.py`:
  - Remove filter that skips sites without custom hostnames (lines 688-689)
  - Import sites that only have b-cdn.net hostnames
  - Set `custom_hostname = None` for these sites
- [ ] Test importing undomained sites from bunny.net

### 8. Unit Tests
**Effort:** 4 story points (increased from 3)

- [ ] Test slug generation with various inputs (special chars, long titles, empty strings)
- [ ] Test URL generation with custom hostname
- [ ] Test URL generation with only bcdn hostname
- [ ] Test site assignment with exact count of available sites
- [ ] Test site assignment with insufficient sites (error case)
- [ ] Test site assignment skips already-assigned articles
- [ ] Test tier1 preferred sites logic
- [ ] Test tier2+ random assignment only
- [ ] Test site creation with keyword prefix
- [ ] Test auto-creation on-demand
- [ ] Achieve >80% code coverage for new modules

### 9. Integration Tests
**Effort:** 3 story points (increased from 2)

- [ ] Test full flow: batch generation → site assignment → URL generation
- [ ] Test with mix of pre-assigned (Story 2.5) and null articles
- [ ] Test tier1_preferred_sites assignment
- [ ] Test auto_create_sites when insufficient pool
- [ ] Test create_sites_for_keywords pre-creation
- [ ] Test database updates persist correctly
- [ ] Test with sites that have custom domains vs only bcdn hostnames
- [ ] Verify no duplicate site assignments within same batch

## Technical Notes

### Job Configuration Example
```json
{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}
```

**Assignment example with 10 tier1 articles:**
- Articles 0-1: Assigned via `deployment_targets` (Story 2.5, already done)
- Articles 2-3: Assigned via `tier1_preferred_sites` (Story 3.1)
- Articles 4-8: Assigned via keyword sites if available, else random
- Article 9: Random or auto-created if pool exhausted

**Tier2 articles:** All random

### Site Naming Convention
```
Keyword sites:
  engine-repair-a8f3
  car-maintenance-9x2k

Generic sites:
  shaft-machining-7m4p
  {project-keyword}-{random-4-char}
```

### Slug Generation Example
```python
import re


def generate_slug(title: str, max_length: int = 100) -> str:
    """
    Generate URL-safe slug from article title

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    slug = re.sub(r'[^a-z0-9\s-]', '', slug)  # Keep only ASCII alphanumerics, whitespace, hyphens
    slug = re.sub(r'[-\s]+', '-', slug)       # Collapse spaces/hyphens into a single hyphen
    slug = slug[:max_length].strip('-')       # Limit length first, then trim stray hyphens
    # Generic fallback if empty; a caller that knows the content ID
    # can substitute "article-{content_id}" instead
    return slug or "article"
```

### URL Structure Examples
```
Custom domain:  https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only: https://mysite123.b-cdn.net/how-to-fix-your-engine.html
```

### Site Assignment Algorithm
```
1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
   a. Try tier1_preferred_sites first (if specified)
   b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
   a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
   a. Create minimum needed via bunny.net API
   b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
   a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
```

### Site Creation via Bunny.net API
```python
def create_bunnynet_site(name_prefix: str, region: str = "DE"):
    # bunny_client, site_repo and random_suffix() are assumed to be
    # available in the enclosing scope

    # Generate the unique name once so the Storage Zone, Pull Zone
    # and database record all share it
    site_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=site_name,
        region=region
    )

    # Step 2: Create Pull Zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=site_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=site_name,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )

    return site
```

### Database Migration
```sql
-- Syntax shown is MySQL-style (MODIFY COLUMN); adjust for the target database engine

-- Make custom_hostname nullable
ALTER TABLE site_deployments
  MODIFY COLUMN custom_hostname VARCHAR(255) NULL;

-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments
  ADD CONSTRAINT uq_pull_zone_bcdn_hostname UNIQUE (pull_zone_bcdn_hostname);
```

## Dependencies
- Story 1.6: `SiteDeployment` table exists
- Story 2.3: Content generation creates `GeneratedContent` records
- Story 2.5: Some articles may already have `site_deployment_id` set

## Future Considerations
- Story 4.x will use generated URLs for deployment
- Story 3.3 will use URL list for interlinking
- Future: S3-compatible storage support (custom_hostname nullable enables this)

## Deferred to Technical Debt
- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
  - This would compare article keywords/entities to site hostnames and assign based on relevance score
  - Adds complexity: keyword extraction, scoring algorithm, match tracking
  - Estimated effort if implemented later: 5-8 story points

## Total Effort
21 story points (increased from 17)
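
## Appendix: Assignment and URL Sketch

The assignment rules and hostname selection described in this story can be sketched in memory, without the database or the bunny.net API. This is a minimal illustration only, not the project's implementation: `Site` and `Article` are hypothetical stand-ins for `SiteDeployment` and `GeneratedContent`, and `assign_sites`, `build_urls`, and `site_hostname` are names invented for this example. Auto-creation and persistence are left out, and the sketch assumes every article in the list belongs to the same tier.

```python
import random
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Site:  # hypothetical stand-in for SiteDeployment
    id: int
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


@dataclass
class Article:  # hypothetical stand-in for GeneratedContent
    id: int
    title: str
    tier: int
    site_deployment_id: Optional[int] = None


def site_hostname(site: Site) -> str:
    """Prefer the custom domain; fall back to the bunny.net CDN hostname."""
    return site.custom_hostname or site.pull_zone_bcdn_hostname


def generate_slug(title: str, max_length: int = 100) -> str:
    """Compact slug helper; returns "" when nothing usable remains."""
    slug = re.sub(r"[^a-z0-9\s-]", "", title.lower())
    slug = re.sub(r"[-\s]+", "-", slug)
    return slug[:max_length].strip("-")


def assign_sites(articles: List[Article], sites: List[Site],
                 preferred_hostnames: Optional[List[str]] = None,
                 rng: Optional[random.Random] = None) -> None:
    """Give every unassigned article a site not yet used in this batch (in place)."""
    rng = rng or random.Random()
    used = {a.site_deployment_id for a in articles if a.site_deployment_id is not None}
    available = {s.id: s for s in sites if s.id not in used}
    needing = [a for a in articles if a.site_deployment_id is None]
    if len(needing) > len(available):
        # Corresponds to auto_create_sites=false; site creation is out of scope here
        raise ValueError(f"Not enough sites: need {len(needing)}, have {len(available)}")

    # Preferred hostnames may match either the custom or the b-cdn hostname
    by_host: Dict[str, Site] = {}
    for s in available.values():
        by_host[s.pull_zone_bcdn_hostname] = s
        if s.custom_hostname:
            by_host[s.custom_hostname] = s
    preferred = [by_host[h] for h in (preferred_hostnames or []) if h in by_host]

    for article in needing:
        site = None
        if article.tier == 1:
            # Priority 1: first preferred site still unused in this batch
            site = next((s for s in preferred if s.id in available), None)
        if site is None:
            # Priority 2 (and all tier2+ articles): random pick from what is left
            site = rng.choice(list(available.values()))
        del available[site.id]
        article.site_deployment_id = site.id


def build_urls(articles: List[Article], sites: List[Site]) -> List[Dict]:
    """Build the URL mapping for every article, pre-assigned or newly assigned."""
    by_id = {s.id: s for s in sites}
    urls = []
    for a in articles:
        site = by_id.get(a.site_deployment_id)
        if site is None:
            raise LookupError(f"Site lookup failed for site_id={a.site_deployment_id}")
        slug = generate_slug(a.title) or f"article-{a.id}"  # empty-slug fallback
        urls.append({"content_id": a.id, "title": a.title, "tier": a.tier,
                     "url": f"https://{site_hostname(site)}/{slug}.html"})
    return urls
```

Calling `assign_sites(articles, sites, preferred_hostnames=[...])` and then `build_urls(articles, sites)` mirrors the order of the workflow in this story: assignment first, URL generation second. Passing a seeded `random.Random` as `rng` makes the random fallback reproducible in tests.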