# Story 3.1: Generate and Validate Article URLs
## Status
Finished
## Story
**As a developer**, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.
## Context
- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
- Remaining articles have `site_deployment_id = null` and need site assignment
- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
- Sites can have custom domains (`custom_hostname`) OR just bunny.net CDN hostnames (`pull_zone_bcdn_hostname`)
- System has 400+ existing bunny.net buckets without custom domains that should be usable
- Job config can specify preferred sites for tier1 articles
- Job config can request auto-creation of new sites if pool is insufficient
- Job config can request pre-creation of sites for specific keywords/entities
## Acceptance Criteria
### Database Changes
- `custom_hostname` field in `SiteDeployment` table is nullable (was previously required)
- `pull_zone_bcdn_hostname` field in `SiteDeployment` table has unique constraint
- Database migration script updates existing schema without data loss
### Repository Updates
- `SiteDeploymentRepository.create()` accepts optional `custom_hostname` parameter (defaults to `None`)
- `SiteDeploymentRepository.get_by_bcdn_hostname()` method added to query by bunny.net hostname
- Repository interface (`ISiteDeploymentRepository`) updated to reflect optional `custom_hostname`
### Job Configuration Extensions
- Job config supports optional `tier1_preferred_sites` array (list of hostnames for tier1 assignment)
- Job config supports optional `auto_create_sites` boolean (default: false)
- Job config supports optional `create_sites_for_keywords` array of `{keyword: str, count: int}` objects
- Invalid hostnames in `tier1_preferred_sites` produce a clear error message rather than an unhandled exception
- Validation occurs at start of Story 3.1 workflow
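
The checks above could be sketched as follows. This is a minimal illustration using plain dataclasses, not the project's actual `job_config.py` schema; `SiteJobOptions`, `KeywordSiteRequest`, `parse_site_options`, and the `known_hostnames` argument are hypothetical names, and "invalid" is assumed here to mean "not among the known site hostnames".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class KeywordSiteRequest:
    keyword: str
    count: int


@dataclass
class SiteJobOptions:
    tier1_preferred_sites: List[str] = field(default_factory=list)
    auto_create_sites: bool = False
    create_sites_for_keywords: List[KeywordSiteRequest] = field(default_factory=list)


def parse_site_options(config: Dict[str, Any], known_hostnames: Set[str]) -> SiteJobOptions:
    """Parse the optional site fields, raising ValueError on bad input."""
    preferred = config.get("tier1_preferred_sites") or []
    # Assumption: a preferred hostname is invalid when no known site uses it.
    unknown = [h for h in preferred if not isinstance(h, str) or h not in known_hostnames]
    if unknown:
        raise ValueError(f"Unknown hostnames in tier1_preferred_sites: {unknown}")

    requests = []
    for item in config.get("create_sites_for_keywords") or []:
        keyword, count = item.get("keyword"), item.get("count")
        if not isinstance(keyword, str) or not keyword.strip():
            raise ValueError(f"create_sites_for_keywords entry needs a keyword: {item!r}")
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(count, int) or isinstance(count, bool) or count < 1:
            raise ValueError(f"create_sites_for_keywords count must be >= 1: {item!r}")
        requests.append(KeywordSiteRequest(keyword=keyword.strip(), count=count))

    return SiteJobOptions(
        tier1_preferred_sites=list(preferred),
        auto_create_sites=bool(config.get("auto_create_sites", False)),
        create_sites_for_keywords=requests,
    )
```

Raising a single `ValueError` that lists every unknown hostname lets the caller report all problems at once before any site is created or assigned.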
### Site Creation Logic
- If `create_sites_for_keywords` is specified:
  - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
  - Site names: `{keyword-slug}-{random-suffix}` (e.g., "engine-repair-a8f3")
  - Creates in default region (configurable, default "DE")
  - Stores in database with `custom_hostname = null`
  - These sites are added to the available pool BEFORE assignment
- If `auto_create_sites: true` and not enough sites after assignment:
  - Creates additional generic bunny.net sites on-demand
  - Generic site names: `{project-keyword-slug}-{random-suffix}`
  - Creates only the minimum needed
- All created sites use bunny.net API (same as `provision-site` command, but without custom domain step)
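
The "creates only the minimum needed" rule might look like the sketch below. `top_up_pool` is a hypothetical helper, and `create_site` is a placeholder callback standing in for whatever actually calls the bunny.net API.

```python
from typing import Callable, List, TypeVar

Site = TypeVar("Site")


def top_up_pool(
    available: List[Site],
    needed: int,
    auto_create_sites: bool,
    create_site: Callable[[], Site],
) -> List[Site]:
    """Return a pool with at least `needed` sites, creating only the shortfall."""
    shortfall = max(0, needed - len(available))
    if shortfall == 0:
        return list(available)
    if not auto_create_sites:
        raise ValueError(
            f"Not enough sites: need {needed}, have {len(available)} "
            f"(short by {shortfall})"
        )
    # Create exactly the missing number of sites, no more.
    created = [create_site() for _ in range(shortfall)]
    return list(available) + created
```

Passing the creation step in as a callback keeps the counting logic testable without touching the network.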
### Site Assignment Logic
- A function accepts a batch of `GeneratedContent` records (all from same batch/tier)
- For tier1 articles with `site_deployment_id = null`:
  - **Priority 1:** Assign from `tier1_preferred_sites` (if specified in job config)
  - **Priority 2:** Random selection from available pool
- For tier2+ articles with `site_deployment_id = null`:
  - Random selection from available pool
- Assignment rules:
  - Ensures no two articles in the batch get assigned the same site
  - Can reuse sites from previous batches (only same-batch collision matters)
  - Pre-created keyword sites are in the available pool
- Updates the `GeneratedContent.site_deployment_id` field in database
- For articles that already have `site_deployment_id` set (from Story 2.5): leaves them unchanged
- If not enough available sites exist:
  - If `auto_create_sites: true`: create more sites
  - If `auto_create_sites: false`: raise clear error with count of sites needed vs available
### URL Generation Logic
- A function generates the final public URL for each article based on its assigned site
- URL structure: `https://{hostname}/{slug}.html`
- Hostname selection:
  - If site has `custom_hostname`: use `custom_hostname`
  - If site has no `custom_hostname`: use `pull_zone_bcdn_hostname`
- Slug generation from article title:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters (keep only alphanumeric and hyphens)
  - Trim to reasonable length (e.g., max 100 characters)
  - Example: "How to Fix Your Engine" → "how-to-fix-your-engine"
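
As a rough illustration of the hostname rule and URL structure above, the sketch below builds one URL from a site's two hostname fields and an already-generated slug. The function names `select_hostname` and `build_article_url` are hypothetical.

```python
from typing import Optional


def select_hostname(custom_hostname: Optional[str], pull_zone_bcdn_hostname: str) -> str:
    """Prefer the custom domain; fall back to the bunny.net CDN hostname."""
    return custom_hostname or pull_zone_bcdn_hostname


def build_article_url(
    custom_hostname: Optional[str],
    pull_zone_bcdn_hostname: str,
    slug: str,
) -> str:
    """Build https://{hostname}/{slug}.html for one article."""
    hostname = select_hostname(custom_hostname, pull_zone_bcdn_hostname)
    return f"https://{hostname}/{slug}.html"
```

Using `or` means an empty-string `custom_hostname` also falls back to the CDN hostname, which seems the safer behavior for sites imported without a domain.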
### Output
- Function returns a list of URL mappings: `[{content_id: int, title: str, url: str, tier: int}, ...]`
- URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")
### Error Handling
- If not enough sites available: clear error message with count needed
- If site lookup fails: clear error with site_id
- If slug generation produces empty string: use fallback (e.g., "article-{content_id}")
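
One possible shape for the last two cases is sketched below. `SiteLookupError`, `resolve_site`, and `slug_or_fallback` are hypothetical names; the sites are assumed to be available as a mapping keyed by site id.

```python
from typing import Mapping, TypeVar

Site = TypeVar("Site")


class SiteLookupError(LookupError):
    """Raised when an article points at a site_id that cannot be found."""

    def __init__(self, site_id: int):
        super().__init__(f"Site lookup failed for site_id={site_id}")
        self.site_id = site_id


def resolve_site(site_id: int, sites_by_id: Mapping[int, Site]) -> Site:
    """Return the site for site_id, or raise an error that names the id."""
    try:
        return sites_by_id[site_id]
    except KeyError:
        raise SiteLookupError(site_id) from None


def slug_or_fallback(slug: str, content_id: int) -> str:
    """Use the generated slug, or "article-{content_id}" when it is empty."""
    return slug if slug else f"article-{content_id}"
```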
## Tasks / Subtasks
### 1. Database Schema Changes
**Effort:** 2 story points
- [ ] Create migration script to alter `site_deployments` table:
  - Make `custom_hostname` nullable
  - Add unique constraint to `pull_zone_bcdn_hostname`
- [ ] Update `src/database/models.py`:
  - Change `custom_hostname: Mapped[str]` to `Mapped[Optional[str]]`
  - Change `nullable=False` to `nullable=True`
  - Add `unique=True` to `pull_zone_bcdn_hostname` field
- [ ] Test migration on development database
### 2. Update Repository Layer
**Effort:** 2 story points
- [ ] Update `src/database/interfaces.py`:
  - Change `custom_hostname: str` to `custom_hostname: Optional[str]` in `create()` signature
  - Add `get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment]` method signature
- [ ] Update `src/database/repositories.py`:
  - Make `custom_hostname` parameter optional in `create()` with default `None`
  - Update uniqueness validation to handle nullable `custom_hostname`
  - Implement `get_by_bcdn_hostname()` method
  - Update `exists()` method to check both hostname types
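
To make the nullable-uniqueness point concrete, here is a small in-memory stand-in for the repository. It is illustrative only: `SiteRecord` and `InMemorySiteRepository` are hypothetical, the real repository is database-backed, and `get_by_hostname` is an assumed lookup by custom hostname.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SiteRecord:
    """Hypothetical stand-in for the SiteDeployment model."""
    id: int
    site_name: str
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


class InMemorySiteRepository:
    def __init__(self) -> None:
        self._by_id: Dict[int, SiteRecord] = {}

    def get_by_hostname(self, hostname: str) -> Optional[SiteRecord]:
        return next((s for s in self._by_id.values() if s.custom_hostname == hostname), None)

    def get_by_bcdn_hostname(self, hostname: str) -> Optional[SiteRecord]:
        return next(
            (s for s in self._by_id.values() if s.pull_zone_bcdn_hostname == hostname), None
        )

    def exists(self, hostname: str) -> bool:
        # Check both hostname types.
        return (
            self.get_by_hostname(hostname) is not None
            or self.get_by_bcdn_hostname(hostname) is not None
        )

    def create(
        self,
        site_name: str,
        pull_zone_bcdn_hostname: str,
        custom_hostname: Optional[str] = None,
    ) -> SiteRecord:
        # Enforce custom_hostname uniqueness only when one is given,
        # since many sites may legitimately have None.
        if custom_hostname is not None and self.get_by_hostname(custom_hostname):
            raise ValueError(f"custom_hostname already exists: {custom_hostname}")
        if self.get_by_bcdn_hostname(pull_zone_bcdn_hostname):
            raise ValueError(f"bcdn hostname already exists: {pull_zone_bcdn_hostname}")
        record = SiteRecord(
            id=len(self._by_id) + 1,
            site_name=site_name,
            pull_zone_bcdn_hostname=pull_zone_bcdn_hostname,
            custom_hostname=custom_hostname,
        )
        self._by_id[record.id] = record
        return record
```

The key design point is that `None` is never compared for uniqueness, so any number of undomained sites can coexist while the bunny.net hostname stays unique.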
### 3. Implement Site Creation Logic
**Effort:** 3 story points
- [ ] Create new module: `src/generation/site_provisioning.py`
- [ ] Implement `create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment`:
  - Call bunny.net API to create Storage Zone
  - Call bunny.net API to create Pull Zone linked to storage
  - Generate unique site name with random suffix
  - Save to database with `custom_hostname = null`
  - Return SiteDeployment record
- [ ] Implement `provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]`:
  - For each keyword+count, create N sites
  - Use keyword in site name (slugified)
  - Return list of created sites
- [ ] Log site creation at INFO level
### 4. Implement Site Assignment Logic
**Effort:** 4 story points (increased from 3)
- [ ] Update job config schema (`src/generation/job_config.py`):
  - Add `tier1_preferred_sites: Optional[List[str]]`
  - Add `auto_create_sites: Optional[bool] = False`
  - Add `create_sites_for_keywords: Optional[List[Dict]]`
- [ ] Create new module: `src/generation/site_assignment.py`
- [ ] Implement `assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None`:
  - Pre-create sites for keywords if specified
  - Query all available sites from database
  - Filter out sites already assigned to articles in this batch
  - For tier1 articles: try preferred sites first, then random
  - For tier2+ articles: random only
  - If insufficient sites and auto_create=true: create more
  - Validate enough sites are available (raise error if auto_create=false)
  - Update `GeneratedContent.site_deployment_id` in database
- [ ] Log assignment decisions at INFO level
### 5. Implement URL Generation Logic
**Effort:** 2 story points
- [ ] Create `src/generation/url_generator.py`
- [ ] Implement `generate_slug(title: str) -> str`:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters
  - Trim to max length
  - Handle edge cases (empty result, etc.)
- [ ] Implement `generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]`:
  - For each article, lookup its site
  - Determine hostname (custom or bcdn)
  - Generate slug from title
  - Build complete URL
  - Return list of mappings
### 6. Update Template Service
**Effort:** 1 story point
- [ ] Update `src/templating/service.py`:
  - Line 92: Change to `hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname`
  - Ensure template mapping works for both hostname types
### 7. Update CLI for Undomained Sites
**Effort:** 2 story points
- [ ] Update `sync-sites` command in `src/cli/commands.py`:
  - Remove filter that skips sites without custom hostnames (lines 688-689)
  - Import sites that only have b-cdn.net hostnames
  - Set `custom_hostname = None` for these sites
- [ ] Test importing undomained sites from bunny.net
### 8. Unit Tests
**Effort:** 4 story points (increased from 3)
- [ ] Test slug generation with various inputs (special chars, long titles, empty strings)
- [ ] Test URL generation with custom hostname
- [ ] Test URL generation with only bcdn hostname
- [ ] Test site assignment with exact count of available sites
- [ ] Test site assignment with insufficient sites (error case)
- [ ] Test site assignment skips already-assigned articles
- [ ] Test tier1 preferred sites logic
- [ ] Test tier2+ random assignment only
- [ ] Test site creation with keyword prefix
- [ ] Test auto-creation on-demand
- [ ] Achieve >80% code coverage for new modules
### 9. Integration Tests
**Effort:** 3 story points (increased from 2)
- [ ] Test full flow: batch generation → site assignment → URL generation
- [ ] Test with mix of pre-assigned (Story 2.5) and null articles
- [ ] Test tier1_preferred_sites assignment
- [ ] Test auto_create_sites when insufficient pool
- [ ] Test create_sites_for_keywords pre-creation
- [ ] Test database updates persist correctly
- [ ] Test with sites that have custom domains vs only bcdn hostnames
- [ ] Verify no duplicate site assignments within same batch
## Technical Notes
### Job Configuration Example
```json
{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}
```
**Assignment example with 10 tier1 articles:**
- Articles 0-1: Assigned via `deployment_targets` (Story 2.5, already done)
- Articles 2-3: Assigned via `tier1_preferred_sites` (Story 3.1)
- Articles 4-8: Assigned via keyword sites if available, else random
- Article 9: Random or auto-created if pool exhausted

**Tier2 articles:** All assigned randomly from the available pool
### Site Naming Convention
```
Keyword sites: engine-repair-a8f3
               car-maintenance-9x2k
Generic sites: shaft-machining-7m4p
               {project-keyword}-{random-4-char}
```
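
A name following this convention could be produced roughly as below. This is a sketch under assumptions: `slugify_keyword` and `build_site_name` are hypothetical helpers, the suffix alphabet (lowercase letters and digits) is inferred from the examples above, and any naming limits imposed by bunny.net are not checked here.

```python
import re
import secrets
import string

# Assumed suffix alphabet, based on examples such as "a8f3" and "9x2k".
_SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def slugify_keyword(keyword: str) -> str:
    """Lowercase the keyword and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", keyword.lower()))


def build_site_name(keyword: str, suffix_length: int = 4) -> str:
    """Return "{keyword-slug}-{random-suffix}", e.g. "engine-repair-a8f3"."""
    slug = slugify_keyword(keyword)
    if not slug:
        raise ValueError(f"Keyword has no usable characters: {keyword!r}")
    suffix = "".join(secrets.choice(_SUFFIX_ALPHABET) for _ in range(suffix_length))
    return f"{slug}-{suffix}"
```

A four-character suffix does not guarantee uniqueness on its own, so a real implementation would probably retry on a name collision reported by the API or the database.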
### Slug Generation Example
```python
import re


def generate_slug(title: str, max_length: int = 100) -> str:
    """
    Generate a URL-safe slug from an article title.

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    # Keep only ASCII alphanumerics, whitespace, and hyphens
    # (\w would also let underscores and non-ASCII letters through)
    slug = re.sub(r'[^a-z0-9\s-]', '', slug)
    # Collapse runs of whitespace/hyphens into a single hyphen
    slug = re.sub(r'[-\s]+', '-', slug)
    # Trim, limit length, then drop any hyphen left dangling by the cut
    slug = slug.strip('-')[:max_length].rstrip('-')
    # Fallback if empty; a caller that knows the content_id can use "article-{content_id}"
    return slug or "article"
```
### URL Structure Examples
```
Custom domain: https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only: https://mysite123.b-cdn.net/how-to-fix-your-engine.html
```
### Site Assignment Algorithm
```
1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
   a. Try tier1_preferred_sites first (if specified)
   b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
   a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
   a. Create minimum needed via bunny.net API
   b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
   a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
```
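
Steps 4 to 7 and 9 of the algorithm can be sketched as a pure function over plain data. This is a simplified illustration, not the planned `assign_sites_to_batch`: articles are dicts rather than `GeneratedContent` records, sites are referred to by id, and `assign_sites` with its parameters is a hypothetical name. Site creation (step 8) and the database update (step 10) are left out.

```python
import random
from typing import Dict, List, Optional, Sequence


def assign_sites(
    articles: List[Dict],
    available_site_ids: Sequence[int],
    preferred_site_ids: Sequence[int] = (),
    rng: Optional[random.Random] = None,
) -> Dict[int, int]:
    """Return {content_id: site_id} for articles whose site_deployment_id is None.

    Each article is a dict with "content_id", "tier" and "site_deployment_id".
    """
    rng = rng or random.Random()
    # Sites already used by pre-assigned articles in this batch are off limits.
    used = {a["site_deployment_id"] for a in articles if a["site_deployment_id"] is not None}
    pool = [s for s in dict.fromkeys(available_site_ids) if s not in used]
    pending = [a for a in articles if a["site_deployment_id"] is None]

    if len(pending) > len(pool):
        raise ValueError(f"Not enough sites: need {len(pending)}, have {len(pool)}")

    preferred = [s for s in dict.fromkeys(preferred_site_ids) if s in pool]
    assignments: Dict[int, int] = {}
    # Handle tier1 articles first so they get first pick of preferred sites.
    for article in sorted(pending, key=lambda a: a["tier"]):
        if article["tier"] == 1 and preferred:
            site_id = preferred.pop(0)
        else:
            site_id = rng.choice(pool)
        # Remove the chosen site so no other article in the batch can get it.
        pool.remove(site_id)
        if site_id in preferred:
            preferred.remove(site_id)
        assignments[article["content_id"]] = site_id
    return assignments
```

Accepting an optional `random.Random` instance makes the random branch reproducible in tests while keeping production behavior unchanged.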
### Site Creation via Bunny.net API
```python
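import secrets
import string


def random_suffix(length: int = 4) -> str:
    """Random lowercase alphanumeric suffix, e.g. "a8f3"."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def create_bunnynet_site(name_prefix: str, region: str = "DE", *, bunny_client, site_repo):
    # Generate the name once so the Storage Zone, Pull Zone, and DB record agree
    site_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=site_name,
        region=region
    )

    # Step 2: Create Pull Zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=site_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=site_name,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )
    return site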
```
### Database Migration
```sql
-- MySQL/MariaDB syntax (MODIFY COLUMN); other databases need a different ALTER form
-- Make custom_hostname nullable
ALTER TABLE site_deployments
    MODIFY COLUMN custom_hostname VARCHAR(255) NULL;

-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments
    ADD CONSTRAINT uq_pull_zone_bcdn_hostname
    UNIQUE (pull_zone_bcdn_hostname);
```
## Dependencies
- Story 1.6: `SiteDeployment` table exists
- Story 2.3: Content generation creates `GeneratedContent` records
- Story 2.5: Some articles may already have `site_deployment_id` set
## Future Considerations
- Story 4.x will use generated URLs for deployment
- Story 3.3 will use URL list for interlinking
- Future: S3-compatible storage support (custom_hostname nullable enables this)
## Deferred to Technical Debt
- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
  - This would compare article keywords/entities to site hostnames and assign based on relevance score
  - Adds complexity: keyword extraction, scoring algorithm, match tracking
  - Estimated effort if implemented later: 5-8 story points
## Total Effort
23 story points (sum of the nine task estimates above; increased from 20)