Story 3.1: Generate and Validate Article URLs
Status
Approved
Story
As a developer, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.
Context
- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
- Remaining articles have site_deployment_id = null and need site assignment
- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
- Sites can have custom domains (custom_hostname) OR just bunny.net CDN hostnames (pull_zone_bcdn_hostname)
- System has 400+ existing bunny.net buckets without custom domains that should be usable
- Job config can specify preferred sites for tier1 articles
- Job config can request auto-creation of new sites if pool is insufficient
- Job config can request pre-creation of sites for specific keywords/entities
Acceptance Criteria
Database Changes
- custom_hostname field in SiteDeployment table is nullable (was previously required)
- pull_zone_bcdn_hostname field in SiteDeployment table has unique constraint
- Database migration script updates existing schema without data loss
Repository Updates
- SiteDeploymentRepository.create() accepts optional custom_hostname parameter (defaults to None)
- SiteDeploymentRepository.get_by_bcdn_hostname() method added to query by bunny.net hostname
- Repository interface (ISiteDeploymentRepository) updated to reflect optional custom_hostname
Job Configuration Extensions
- Job config supports optional tier1_preferred_sites array (list of hostnames for tier1 assignment)
- Job config supports optional auto_create_sites boolean (default: false)
- Job config supports optional create_sites_for_keywords array of {keyword: str, count: int} objects
- Invalid hostnames in tier1_preferred_sites cause graceful errors
- Validation occurs at start of Story 3.1 workflow
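The shape checks for the three optional fields could look roughly like the sketch below. SiteOptions and parse_site_options are hypothetical names, not part of the story; the sketch only validates types and defaults, and leaves the check that preferred hostnames actually exist in the database to the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SiteOptions:
    """Hypothetical container for the three optional job-config fields."""
    tier1_preferred_sites: List[str] = field(default_factory=list)
    auto_create_sites: bool = False
    create_sites_for_keywords: List[Dict[str, Any]] = field(default_factory=list)


def parse_site_options(config: Dict[str, Any]) -> SiteOptions:
    """Read the optional fields, apply the documented defaults, reject bad shapes."""
    preferred = config.get("tier1_preferred_sites") or []
    if not isinstance(preferred, list) or not all(
        isinstance(host, str) and host for host in preferred
    ):
        raise ValueError("tier1_preferred_sites must be a list of non-empty hostnames")

    auto_create = config.get("auto_create_sites", False)
    if not isinstance(auto_create, bool):
        raise ValueError("auto_create_sites must be a boolean")

    keywords = config.get("create_sites_for_keywords") or []
    if not isinstance(keywords, list):
        raise ValueError("create_sites_for_keywords must be a list")
    for entry in keywords:
        if not isinstance(entry, dict):
            raise ValueError("each create_sites_for_keywords entry must be an object")
        keyword, count = entry.get("keyword"), entry.get("count")
        if not isinstance(keyword, str) or not keyword.strip():
            raise ValueError("keyword must be a non-empty string")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(count, bool) or not isinstance(count, int) or count < 1:
            raise ValueError(f"count for keyword {keyword!r} must be a positive integer")

    return SiteOptions(list(preferred), auto_create, list(keywords))
```

Raising ValueError with a specific message is one way to satisfy the "graceful errors" criterion; the real workflow may prefer its own exception type.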
Site Creation Logic
- If create_sites_for_keywords specified:
  - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
  - Site names: {keyword-slug}-{random-suffix} (e.g., "engine-repair-a8f3")
  - Creates in default region (configurable, default "DE")
  - Stores in database with custom_hostname = null
  - These sites added to available pool BEFORE assignment
- If auto_create_sites: true and not enough sites after assignment:
  - Creates additional generic bunny.net sites on-demand
  - Generic site names: {project-keyword-slug}-{random-suffix}
  - Creates only the minimum needed
- All created sites use bunny.net API (same as provision-site command, but without custom domain step)
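A minimal sketch of the {keyword-slug}-{random-suffix} naming scheme follows. The helper names random_suffix and build_site_name, the lowercase-alphanumeric suffix alphabet, and the default suffix length of 4 are assumptions inferred from the "engine-repair-a8f3" example, not confirmed details of the implementation.

```python
import re
import secrets
import string

# Assumed alphabet: lowercase letters and digits, as in "a8f3"
_SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def random_suffix(length: int = 4) -> str:
    """Return a random lowercase alphanumeric suffix."""
    return "".join(secrets.choice(_SUFFIX_ALPHABET) for _ in range(length))


def build_site_name(keyword: str, suffix_length: int = 4) -> str:
    """Slugify the keyword and append a random suffix."""
    # Collapse every run of non-alphanumeric characters into a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")
    if not slug:
        raise ValueError(f"keyword {keyword!r} produced an empty slug")
    return f"{slug}-{random_suffix(suffix_length)}"
```

A four-character suffix can collide, so a caller would still want to rely on the unique constraint on pull_zone_bcdn_hostname (or retry) rather than assume the generated name is free.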
Site Assignment Logic
- A function accepts a batch of GeneratedContent records (all from same batch/tier)
- For tier1 articles with site_deployment_id = null:
  - Priority 1: Assign from tier1_preferred_sites (if specified in job config)
  - Priority 2: Random selection from available pool
- For tier2+ articles with site_deployment_id = null:
  - Random selection from available pool
- Assignment rules:
  - Ensures no two articles in the batch get assigned the same site
  - Can reuse sites from previous batches (only same-batch collision matters)
  - Pre-created keyword sites are in the available pool
- Updates the GeneratedContent.site_deployment_id field in database
- For articles that already have site_deployment_id set (from Story 2.5): leaves them unchanged
- If not enough available sites exist:
  - If auto_create_sites: true: create more sites
  - If auto_create_sites: false: raise clear error with count of sites needed vs available
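The assignment rules above can be sketched over plain in-memory data. Article, InsufficientSitesError, and assign_sites are illustrative names; the sketch works with site IDs rather than SiteDeployment records, skips database writes, and does not attempt auto-creation.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Article:
    """Stand-in for a GeneratedContent record (hypothetical, minimal fields)."""
    content_id: int
    tier: int
    site_deployment_id: Optional[int] = None


class InsufficientSitesError(Exception):
    """Raised when the pool cannot cover every unassigned article."""


def assign_sites(
    articles: List[Article],
    available_site_ids: Sequence[int],
    preferred_site_ids: Sequence[int] = (),
    rng: Optional[random.Random] = None,
) -> Dict[int, int]:
    """Assign one distinct site per unassigned article; return {content_id: site_id}."""
    rng = rng or random.Random()
    # Sites already used in THIS batch (e.g. from Story 2.5) are off limits
    used = {a.site_deployment_id for a in articles if a.site_deployment_id is not None}
    pool = [s for s in dict.fromkeys(available_site_ids) if s not in used]
    pending = [a for a in articles if a.site_deployment_id is None]
    if len(pending) > len(pool):
        raise InsufficientSitesError(
            f"need {len(pending)} sites, only {len(pool)} available"
        )

    preferred = [s for s in dict.fromkeys(preferred_site_ids) if s in pool]
    assignments: Dict[int, int] = {}
    # Handle tier1 first so those articles get first pick of preferred sites
    for article in sorted(pending, key=lambda a: a.tier):
        if article.tier == 1 and preferred:
            site_id = preferred.pop(0)
        else:
            site_id = rng.choice(pool)
            if site_id in preferred:
                preferred.remove(site_id)
        pool.remove(site_id)  # guarantees no same-batch collision
        article.site_deployment_id = site_id
        assignments[article.content_id] = site_id
    return assignments
```

Passing a seeded random.Random makes the random branch reproducible in tests, which is the main reason the generator is injectable here.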
URL Generation Logic
- A function generates the final public URL for each article based on its assigned site
- URL structure: https://{hostname}/{slug}.html
- Hostname selection:
  - If site has custom_hostname: use custom_hostname
  - If site has no custom_hostname: use pull_zone_bcdn_hostname
- Slug generation from article title:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters (keep only alphanumeric and hyphens)
  - Trim to reasonable length (e.g., max 100 characters)
  - Example: "How to Fix Your Engine" → "how-to-fix-your-engine"
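The hostname selection rule is small enough to show in full. In this sketch, Site is a hypothetical stand-in for SiteDeployment carrying only the two hostname fields, and build_article_url is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    """Stand-in for SiteDeployment with only the hostname fields (hypothetical)."""
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


def build_article_url(site: Site, slug: str) -> str:
    """Prefer the custom domain; fall back to the bunny.net CDN hostname."""
    hostname = site.custom_hostname or site.pull_zone_bcdn_hostname
    return f"https://{hostname}/{slug}.html"
```

Using `or` treats an empty-string custom_hostname the same as null, which seems the safer reading for imported sites.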
Output
- Function returns a list of URL mappings: [{content_id: int, title: str, url: str, tier: int}, ...]
- URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")
Error Handling
- If not enough sites available: clear error message with count needed
- If site lookup fails: clear error with site_id
- If slug generation produces empty string: use fallback (e.g., "article-{content_id}")
Tasks / Subtasks
1. Database Schema Changes
Effort: 2 story points
- Create migration script to alter site_deployments table:
  - Make custom_hostname nullable
  - Add unique constraint to pull_zone_bcdn_hostname
- Update src/database/models.py:
  - Change custom_hostname: Mapped[str] to Mapped[Optional[str]]
  - Change nullable=False to nullable=True
  - Add unique=True to pull_zone_bcdn_hostname field
- Test migration on development database
2. Update Repository Layer
Effort: 2 story points
- Update src/database/interfaces.py:
  - Change custom_hostname: str to custom_hostname: Optional[str] in create() signature
  - Add get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment] method signature
- Update src/database/repositories.py:
  - Make custom_hostname parameter optional in create() with default None
  - Update uniqueness validation to handle nullable custom_hostname
  - Implement get_by_bcdn_hostname() method
  - Update exists() method to check both hostname types
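To make the intended repository behaviour concrete, here is an in-memory sketch. SiteRecord and InMemorySiteRepository are hypothetical; the real SiteDeploymentRepository is database-backed and its create() takes more fields than shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SiteRecord:
    """Reduced stand-in for SiteDeployment (hypothetical)."""
    id: int
    site_name: str
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


class InMemorySiteRepository:
    """Mimics the updated repository contract without a database."""

    def __init__(self) -> None:
        self._sites: Dict[int, SiteRecord] = {}
        self._next_id = 1

    def create(
        self,
        site_name: str,
        pull_zone_bcdn_hostname: str,
        custom_hostname: Optional[str] = None,
    ) -> SiteRecord:
        if self.get_by_bcdn_hostname(pull_zone_bcdn_hostname) is not None:
            raise ValueError(f"bcdn hostname already exists: {pull_zone_bcdn_hostname}")
        # Only check custom_hostname uniqueness when one is supplied, so
        # many sites can share custom_hostname = None
        if custom_hostname is not None and self.exists(custom_hostname):
            raise ValueError(f"custom hostname already exists: {custom_hostname}")
        site = SiteRecord(self._next_id, site_name, pull_zone_bcdn_hostname, custom_hostname)
        self._sites[site.id] = site
        self._next_id += 1
        return site

    def get_by_bcdn_hostname(self, hostname: str) -> Optional[SiteRecord]:
        for site in self._sites.values():
            if site.pull_zone_bcdn_hostname == hostname:
                return site
        return None

    def exists(self, hostname: str) -> bool:
        """True if the hostname matches either hostname type of any site."""
        return any(
            hostname in (site.custom_hostname, site.pull_zone_bcdn_hostname)
            for site in self._sites.values()
        )
```

A fake like this can also stand in for the real repository in the unit tests for site assignment and URL generation.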
3. Implement Site Creation Logic
Effort: 3 story points
- Create new module: src/generation/site_provisioning.py
- Implement create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment:
  - Call bunny.net API to create Storage Zone
  - Call bunny.net API to create Pull Zone linked to storage
  - Generate unique site name with random suffix
  - Save to database with custom_hostname = null
  - Return SiteDeployment record
- Implement provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]:
  - For each keyword+count, create N sites
  - Use keyword in site name (slugified)
  - Return list of created sites
- Log site creation at INFO level
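The keyword loop might be structured as in the sketch below. To keep it runnable without the bunny.net client, it takes a create_site callable instead of the bunny_client and site_repo parameters named in the task; that substitution, and the slugify_keyword helper, are assumptions made for illustration.

```python
import logging
import re
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def slugify_keyword(keyword: str) -> str:
    """Lowercase the keyword and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")


def provision_keyword_sites(
    keywords: List[Dict[str, Any]],
    create_site: Callable[[str], Any],
) -> List[Any]:
    """Create `count` sites per keyword entry and return them in creation order."""
    created: List[Any] = []
    for entry in keywords:
        prefix = slugify_keyword(entry["keyword"])
        for _ in range(entry["count"]):
            # create_site is expected to append the random suffix itself
            site = create_site(prefix)
            logger.info("Created site %r for keyword %r", site, entry["keyword"])
            created.append(site)
    return created
```

In the real module, create_site would presumably be a thin wrapper around create_bunnynet_site with the region bound.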
4. Implement Site Assignment Logic
Effort: 4 story points (increased from 3)
- Update job config schema (src/generation/job_config.py):
  - Add tier1_preferred_sites: Optional[List[str]]
  - Add auto_create_sites: Optional[bool] = False
  - Add create_sites_for_keywords: Optional[List[Dict]]
- Create new module: src/generation/site_assignment.py
- Implement assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None:
  - Pre-create sites for keywords if specified
  - Query all available sites from database
  - Filter out sites already assigned to articles in this batch
  - For tier1 articles: try preferred sites first, then random
  - For tier2+ articles: random only
  - If insufficient sites and auto_create=true: create more
  - Update GeneratedContent.site_deployment_id in database
  - Validate enough sites are available (raise error if auto_create=false)
- Log assignment decisions at INFO level
5. Implement URL Generation Logic
Effort: 2 story points
- Create src/generation/url_generator.py
- Implement generate_slug(title: str) -> str:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters
  - Trim to max length
  - Handle edge cases (empty result, etc.)
- Implement generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]:
  - For each article, lookup its site
  - Determine hostname (custom or bcdn)
  - Generate slug from title
  - Build complete URL
  - Return list of mappings
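A sketch of the batch function, including the empty-slug fallback and the site-lookup error from the acceptance criteria, is given below. Site and Content are hypothetical stand-ins for the ORM models, and a plain dict keyed by site ID replaces the site_repo lookup so the example runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Site:
    id: int
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


@dataclass
class Content:
    id: int
    title: str
    tier: int
    site_deployment_id: Optional[int]


def slug_or_fallback(title: str, content_id: int, max_length: int = 100) -> str:
    """Slugify the title; fall back to article-{content_id} if nothing survives."""
    slug = re.sub(r"[^a-z0-9\s-]", "", title.lower())
    slug = re.sub(r"[-\s]+", "-", slug)[:max_length].strip("-")
    return slug or f"article-{content_id}"


def generate_urls_for_batch(
    records: List[Content], sites_by_id: Dict[int, Site]
) -> List[Dict[str, Any]]:
    """Return one {content_id, title, url, tier} mapping per article."""
    mappings: List[Dict[str, Any]] = []
    for record in records:
        site = sites_by_id.get(record.site_deployment_id)
        if site is None:
            raise LookupError(
                f"site lookup failed for site_id={record.site_deployment_id} "
                f"(content_id={record.id})"
            )
        hostname = site.custom_hostname or site.pull_zone_bcdn_hostname
        slug = slug_or_fallback(record.title, record.id)
        mappings.append({
            "content_id": record.id,
            "title": record.title,
            "url": f"https://{hostname}/{slug}.html",
            "tier": record.tier,
        })
    return mappings
```

Because every record passes through the same loop, pre-assigned (Story 2.5) and newly assigned articles both end up in the returned list.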
6. Update Template Service
Effort: 1 story point
- Update src/templating/service.py:
  - Line 92: Change to hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname
  - Ensure template mapping works for both hostname types
7. Update CLI for Undomained Sites
Effort: 2 story points
- Update sync-sites command in src/cli/commands.py:
  - Remove filter that skips sites without custom hostnames (lines 688-689)
  - Import sites that only have b-cdn.net hostnames
  - Set custom_hostname = None for these sites
- Test importing undomained sites from bunny.net
8. Unit Tests
Effort: 4 story points (increased from 3)
- Test slug generation with various inputs (special chars, long titles, empty strings)
- Test URL generation with custom hostname
- Test URL generation with only bcdn hostname
- Test site assignment with exact count of available sites
- Test site assignment with insufficient sites (error case)
- Test site assignment skips already-assigned articles
- Test tier1 preferred sites logic
- Test tier2+ random assignment only
- Test site creation with keyword prefix
- Test auto-creation on-demand
- Achieve >80% code coverage for new modules
9. Integration Tests
Effort: 3 story points (increased from 2)
- Test full flow: batch generation → site assignment → URL generation
- Test with mix of pre-assigned (Story 2.5) and null articles
- Test tier1_preferred_sites assignment
- Test auto_create_sites when insufficient pool
- Test create_sites_for_keywords pre-creation
- Test database updates persist correctly
- Test with sites that have custom domains vs only bcdn hostnames
- Verify no duplicate site assignments within same batch
Technical Notes
Job Configuration Example
{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}
Assignment example with 10 tier1 articles:
- Articles 0-1: Assigned via deployment_targets (Story 2.5, already done)
- Articles 2-3: Assigned via tier1_preferred_sites (Story 3.1)
- Articles 4-8: Assigned via keyword sites if available, else random
- Article 9: Random or auto-created if pool exhausted
Tier2 articles: All random
Site Naming Convention
Keyword sites: engine-repair-a8f3
               car-maintenance-9x2k
Generic sites: shaft-machining-7m4p
               {project-keyword}-{random-4-char}
Slug Generation Example
import re

def generate_slug(title: str, max_length: int = 100) -> str:
    """
    Generate URL-safe slug from article title.

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    # \w would also keep underscores and non-ASCII letters, so list the allowed set explicitly
    slug = re.sub(r'[^a-z0-9\s-]', '', slug)  # Remove special chars
    slug = re.sub(r'[-\s]+', '-', slug)       # Replace spaces/hyphens with single hyphen
    slug = slug[:max_length].strip('-')       # Limit length, then trim so truncation leaves no edge hyphen
    # Generic fallback if empty; a caller that knows content_id can use "article-{content_id}" instead
    return slug or "article"
URL Structure Examples
Custom domain: https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only: https://mysite123.b-cdn.net/how-to-fix-your-engine.html
Site Assignment Algorithm
1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
a. Try tier1_preferred_sites first (if specified)
b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
a. Create minimum needed via bunny.net API
b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
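Steps 8 and 9 reduce to a shortfall calculation, sketched below. The function name sites_to_create and the use of RuntimeError are illustrative choices, not taken from the story.

```python
def sites_to_create(articles_needing_sites: int, available_sites: int,
                    auto_create_sites: bool) -> int:
    """Return how many sites must be created to cover the batch (0 if none)."""
    shortfall = max(0, articles_needing_sites - available_sites)
    if shortfall and not auto_create_sites:
        # Step 9: fail loudly with both counts so the operator can act
        raise RuntimeError(
            f"not enough sites: need {articles_needing_sites}, "
            f"only {available_sites} available"
        )
    # Step 8: create only the minimum needed
    return shortfall
```

Computing the shortfall before assigning anything avoids a partially assigned batch when the pool turns out to be too small.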
Site Creation via Bunny.net API
def create_bunnynet_site(name_prefix: str, region: str = "DE"):
    # bunny_client, site_repo, and random_suffix() are assumed to be available in the enclosing module.
    # Generate the name once so the Storage Zone, Pull Zone, and database record all share it
    # (calling random_suffix() per zone would give each a different name).
    site_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=site_name,
        region=region
    )

    # Step 2: Create Pull Zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=site_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=site_name,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )
    return site
Database Migration
-- Make custom_hostname nullable
ALTER TABLE site_deployments
MODIFY COLUMN custom_hostname VARCHAR(255) NULL;
-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments
ADD CONSTRAINT uq_pull_zone_bcdn_hostname
UNIQUE (pull_zone_bcdn_hostname);
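The effect of the migrated schema can be checked quickly with an in-memory SQLite database. This is only an illustration: the migration above uses MySQL-style MODIFY COLUMN syntax, the table here is reduced to three columns, and treating pull_zone_bcdn_hostname as NOT NULL is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced table reflecting the post-migration constraints (assumed column set)
conn.execute(
    """
    CREATE TABLE site_deployments (
        id INTEGER PRIMARY KEY,
        custom_hostname VARCHAR(255) NULL,
        pull_zone_bcdn_hostname VARCHAR(255) NOT NULL,
        CONSTRAINT uq_pull_zone_bcdn_hostname UNIQUE (pull_zone_bcdn_hostname)
    )
    """
)

insert = (
    "INSERT INTO site_deployments (custom_hostname, pull_zone_bcdn_hostname) "
    "VALUES (?, ?)"
)
# Several sites may have no custom domain at all
conn.execute(insert, (None, "engine-repair-a8f3.b-cdn.net"))
conn.execute(insert, (None, "car-maintenance-9x2k.b-cdn.net"))

# A repeated bunny.net hostname violates the unique constraint
try:
    conn.execute(insert, (None, "engine-repair-a8f3.b-cdn.net"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM site_deployments").fetchone()[0]
```

After running, duplicate_rejected is True and only the two distinct rows remain, which is the behaviour the unit tests for the repository layer would want to pin down.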
Dependencies
- Story 1.6: SiteDeployment table exists
- Story 2.3: Content generation creates GeneratedContent records
- Story 2.5: Some articles may already have site_deployment_id set
Future Considerations
- Story 4.x will use generated URLs for deployment
- Story 3.3 will use URL list for interlinking
- Future: S3-compatible storage support (custom_hostname nullable enables this)
Deferred to Technical Debt
- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
- This would compare article keywords/entities to site hostnames and assign based on relevance score
- Adds complexity: keyword extraction, scoring algorithm, match tracking
- Estimated effort if implemented later: 5-8 story points
Total Effort
23 story points (sum of the nine task estimates above; increased from 17)