# Story 3.1: Generate and Validate Article URLs

## Status

Finished

## Story

**As a developer**, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.

## Context

- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
- Remaining articles have `site_deployment_id = null` and need site assignment
- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
- Sites can have custom domains (`custom_hostname`) OR just bunny.net CDN hostnames (`pull_zone_bcdn_hostname`)
- System has 400+ existing bunny.net buckets without custom domains that should be usable
- Job config can specify preferred sites for tier1 articles
- Job config can request auto-creation of new sites if pool is insufficient
- Job config can request pre-creation of sites for specific keywords/entities

## Acceptance Criteria

### Database Changes

- `custom_hostname` field in `SiteDeployment` table is nullable (was previously required)
- `pull_zone_bcdn_hostname` field in `SiteDeployment` table has unique constraint
- Database migration script updates existing schema without data loss

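The effect of these two schema rules can be checked with a small in-memory SQLite sketch. The table below is a trimmed, hypothetical version of `site_deployments` (the real table has more columns), and the `UNIQUE` constraint on `custom_hostname` is an assumption. The point it illustrates is that a unique column still accepts many `NULL` values, so any number of undomained sites can coexist, while a repeated bunny.net hostname is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE site_deployments (
        id INTEGER PRIMARY KEY,
        custom_hostname TEXT UNIQUE,                  -- nullable after the migration
        pull_zone_bcdn_hostname TEXT NOT NULL UNIQUE  -- new unique constraint
    )
    """
)


def add_site(custom_hostname, bcdn_hostname):
    """Insert one row; raises sqlite3.IntegrityError on a constraint violation."""
    conn.execute(
        "INSERT INTO site_deployments (custom_hostname, pull_zone_bcdn_hostname) "
        "VALUES (?, ?)",
        (custom_hostname, bcdn_hostname),
    )


add_site(None, "site-a.b-cdn.net")
add_site(None, "site-b.b-cdn.net")  # a second NULL custom_hostname is accepted

try:
    add_site(None, "site-a.b-cdn.net")  # same bunny.net hostname again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM site_deployments").fetchone()[0]
```

Other database engines may differ in detail, so the same check is worth repeating against the development database when testing the migration.
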
### Repository Updates

- `SiteDeploymentRepository.create()` accepts optional `custom_hostname` parameter (defaults to `None`)
- `SiteDeploymentRepository.get_by_bcdn_hostname()` method added to query by bunny.net hostname
- Repository interface (`ISiteDeploymentRepository`) updated to reflect optional `custom_hostname`

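To make the intended repository behavior concrete, here is a minimal in-memory sketch. `InMemorySiteRepository`, the `Site` dataclass, and `get_by_hostname()` are hypothetical stand-ins for illustration, not the project's actual classes, and the real `create()` takes more arguments than shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Site:
    id: int
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


class InMemorySiteRepository:
    """Hypothetical stand-in for SiteDeploymentRepository, for illustration only."""

    def __init__(self) -> None:
        self._by_bcdn: Dict[str, Site] = {}

    def create(self, pull_zone_bcdn_hostname: str,
               custom_hostname: Optional[str] = None) -> Site:
        # Uniqueness check must tolerate custom_hostname being None
        if self.exists(pull_zone_bcdn_hostname) or (
            custom_hostname and self.exists(custom_hostname)
        ):
            raise ValueError("hostname already registered")
        site = Site(len(self._by_bcdn) + 1, pull_zone_bcdn_hostname, custom_hostname)
        self._by_bcdn[pull_zone_bcdn_hostname] = site
        return site

    def get_by_bcdn_hostname(self, hostname: str) -> Optional[Site]:
        return self._by_bcdn.get(hostname)

    def get_by_hostname(self, hostname: str) -> Optional[Site]:
        """Match either hostname type (hypothetical helper)."""
        site = self.get_by_bcdn_hostname(hostname)
        if site is not None:
            return site
        return next(
            (s for s in self._by_bcdn.values() if s.custom_hostname == hostname),
            None,
        )

    def exists(self, hostname: str) -> bool:
        return self.get_by_hostname(hostname) is not None
```

A lookup that accepts either hostname type is also what validating `tier1_preferred_sites` would need, since that list may mix custom domains and `b-cdn.net` hostnames.
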
### Job Configuration Extensions

- Job config supports optional `tier1_preferred_sites` array (list of hostnames for tier1 assignment)
- Job config supports optional `auto_create_sites` boolean (default: false)
- Job config supports optional `create_sites_for_keywords` array of `{keyword: str, count: int}` objects
- Invalid hostnames in `tier1_preferred_sites` cause graceful errors
- Validation occurs at start of Story 3.1 workflow

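The three optional keys and their defaults could be parsed along these lines. This is a sketch only: `SiteOptions`, `KeywordSiteRequest`, and `parse_site_options` are hypothetical names, and the real schema lives in the project's job config module. It checks shape only; confirming that each preferred hostname exists in the database is a separate step.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class KeywordSiteRequest:
    keyword: str
    count: int


@dataclass
class SiteOptions:
    tier1_preferred_sites: List[str] = field(default_factory=list)
    auto_create_sites: bool = False
    create_sites_for_keywords: List[KeywordSiteRequest] = field(default_factory=list)


def parse_site_options(raw: Dict[str, Any]) -> SiteOptions:
    """Read the three optional keys, applying the documented defaults."""
    preferred = raw.get("tier1_preferred_sites") or []
    if not all(isinstance(h, str) and h.strip() for h in preferred):
        raise ValueError("tier1_preferred_sites must be a list of non-empty hostnames")

    auto_create = raw.get("auto_create_sites", False)
    if not isinstance(auto_create, bool):
        raise ValueError("auto_create_sites must be a boolean")

    requests = []
    for item in raw.get("create_sites_for_keywords") or []:
        keyword, count = item.get("keyword"), item.get("count")
        if not isinstance(keyword, str) or not keyword.strip():
            raise ValueError(f"invalid keyword entry: {item!r}")
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(count, int) or isinstance(count, bool) or count < 1:
            raise ValueError(f"count must be a positive integer: {item!r}")
        requests.append(KeywordSiteRequest(keyword.strip(), count))

    return SiteOptions([h.strip() for h in preferred], auto_create, requests)
```

Raising a single `ValueError` with the offending entry in the message is one way to get the "graceful error" behavior before any site is created or assigned.
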
### Site Creation Logic

- If `create_sites_for_keywords` specified:
  - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
  - Site names: `{keyword-slug}-{random-suffix}` (e.g., "engine-repair-a8f3")
  - Creates in default region (configurable, default "DE")
  - Stores in database with `custom_hostname = null`
  - These sites added to available pool BEFORE assignment
- If `auto_create_sites: true` and not enough sites after assignment:
  - Creates additional generic bunny.net sites on-demand
  - Generic site names: `{project-keyword-slug}-{random-suffix}`
  - Creates only the minimum needed
- All created sites use bunny.net API (same as `provision-site` command, but without custom domain step)

### Site Assignment Logic

- A function accepts a batch of `GeneratedContent` records (all from same batch/tier)
- For tier1 articles with `site_deployment_id = null`:
  - **Priority 1:** Assign from `tier1_preferred_sites` (if specified in job config)
  - **Priority 2:** Random selection from available pool
- For tier2+ articles with `site_deployment_id = null`:
  - Random selection from available pool
- Assignment rules:
  - Ensures no two articles in the batch get assigned the same site
  - Can reuse sites from previous batches (only same-batch collision matters)
  - Pre-created keyword sites are in the available pool
- Updates the `GeneratedContent.site_deployment_id` field in database
- For articles that already have `site_deployment_id` set (from Story 2.5): leaves them unchanged
- If not enough available sites exist:
  - If `auto_create_sites: true`: create more sites
  - If `auto_create_sites: false`: raise clear error with count of sites needed vs available

### URL Generation Logic

- A function generates the final public URL for each article based on its assigned site
- URL structure: `https://{hostname}/{slug}.html`
- Hostname selection:
  - If site has `custom_hostname`: use `custom_hostname`
  - If site has no `custom_hostname`: use `pull_zone_bcdn_hostname`
- Slug generation from article title:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters (keep only alphanumeric and hyphens)
  - Trim to reasonable length (e.g., max 100 characters)
  - Example: "How to Fix Your Engine" → "how-to-fix-your-engine"

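The hostname rule and URL structure above can be sketched as two small functions. The function names are illustrative assumptions, and the slug is taken as an already-generated input.

```python
from typing import Optional


def pick_hostname(custom_hostname: Optional[str], pull_zone_bcdn_hostname: str) -> str:
    """Prefer the custom domain; fall back to the bunny.net CDN hostname."""
    return custom_hostname or pull_zone_bcdn_hostname


def build_article_url(
    custom_hostname: Optional[str],
    pull_zone_bcdn_hostname: str,
    slug: str,
) -> str:
    """Assemble https://{hostname}/{slug}.html for one article."""
    hostname = pick_hostname(custom_hostname, pull_zone_bcdn_hostname)
    return f"https://{hostname}/{slug}.html"
```

Using `or` treats an empty-string `custom_hostname` the same as `None`, which seems the safer reading for rows imported without a domain.
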
### Output

- Function returns a list of URL mappings: `[{content_id: int, title: str, url: str, tier: int}, ...]`
- URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")

### Error Handling

- If not enough sites available: clear error message with count needed
- If site lookup fails: clear error with site_id
- If slug generation produces empty string: use fallback (e.g., "article-{content_id}")

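One way to satisfy these three cases is a pair of dedicated exception types plus a small fallback helper. The names `InsufficientSitesError`, `SiteNotFoundError`, and `slug_or_fallback` are hypothetical and chosen only for this sketch.

```python
class InsufficientSitesError(Exception):
    """Raised when the pool cannot cover every unassigned article."""

    def __init__(self, needed: int, available: int) -> None:
        self.needed = needed
        self.available = available
        super().__init__(
            f"Not enough sites for batch: need {needed}, only {available} available "
            f"({needed - available} short)"
        )


class SiteNotFoundError(Exception):
    """Raised when an article references a site that cannot be loaded."""

    def __init__(self, site_id: int) -> None:
        self.site_id = site_id
        super().__init__(f"Site lookup failed for site_id={site_id}")


def slug_or_fallback(slug: str, content_id: int) -> str:
    """Return the slug, or article-{content_id} when slug generation came back empty."""
    return slug or f"article-{content_id}"
```

Keeping the counts and the `site_id` as attributes lets callers log or report them without parsing the message.
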
## Tasks / Subtasks

### 1. Database Schema Changes

**Effort:** 2 story points

- [ ] Create migration script to alter `site_deployments` table:
  - Make `custom_hostname` nullable
  - Add unique constraint to `pull_zone_bcdn_hostname`
- [ ] Update `src/database/models.py`:
  - Change `custom_hostname: Mapped[str]` to `Mapped[Optional[str]]`
  - Change `nullable=False` to `nullable=True`
  - Add `unique=True` to `pull_zone_bcdn_hostname` field
- [ ] Test migration on development database

### 2. Update Repository Layer

**Effort:** 2 story points

- [ ] Update `src/database/interfaces.py`:
  - Change `custom_hostname: str` to `custom_hostname: Optional[str]` in `create()` signature
  - Add `get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment]` method signature
- [ ] Update `src/database/repositories.py`:
  - Make `custom_hostname` parameter optional in `create()` with default `None`
  - Update uniqueness validation to handle nullable `custom_hostname`
  - Implement `get_by_bcdn_hostname()` method
  - Update `exists()` method to check both hostname types

### 3. Implement Site Creation Logic

**Effort:** 3 story points

- [ ] Create new module: `src/generation/site_provisioning.py`
- [ ] Implement `create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment`:
  - Call bunny.net API to create Storage Zone
  - Call bunny.net API to create Pull Zone linked to storage
  - Generate unique site name with random suffix
  - Save to database with `custom_hostname = null`
  - Return SiteDeployment record
- [ ] Implement `provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]`:
  - For each keyword+count, create N sites
  - Use keyword in site name (slugified)
  - Return list of created sites
- [ ] Log site creation at INFO level

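The keyword loop in this task might look roughly like the sketch below. To keep it self-contained, it takes a `create_site` callable in place of `bunny_client` and `site_repo`; that parameter is an assumption made for illustration (in practice it could wrap `create_bunnynet_site`), and the slug step is deliberately simplified.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def provision_keyword_sites(
    keywords: List[Dict],
    create_site: Callable[[str], object],
) -> List[object]:
    """Create `count` sites per keyword entry and return them in creation order."""
    created = []
    for entry in keywords:
        # "engine repair" -> "engine-repair" (simplified slug for this sketch)
        prefix = "-".join(entry["keyword"].lower().split())
        for _ in range(entry["count"]):
            site = create_site(prefix)
            logger.info("Created site for keyword %r: %s", entry["keyword"], site)
            created.append(site)
    return created
```

Injecting the creation step also makes the function easy to unit test without calling the bunny.net API.
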
### 4. Implement Site Assignment Logic

**Effort:** 4 story points (increased from 3)

- [ ] Update job config schema (`src/generation/job_config.py`):
  - Add `tier1_preferred_sites: Optional[List[str]]`
  - Add `auto_create_sites: Optional[bool] = False`
  - Add `create_sites_for_keywords: Optional[List[Dict]]`
- [ ] Create new module: `src/generation/site_assignment.py`
- [ ] Implement `assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None`:
  - Pre-create sites for keywords if specified
  - Query all available sites from database
  - Filter out sites already assigned to articles in this batch
  - For tier1 articles: try preferred sites first, then random
  - For tier2+ articles: random only
  - If insufficient sites and auto_create=true: create more
  - Update `GeneratedContent.site_deployment_id` in database
  - Validate enough sites are available (raise error if auto_create=false)
- [ ] Log assignment decisions at INFO level

### 5. Implement URL Generation Logic

**Effort:** 2 story points

- [ ] Create `src/generation/url_generator.py`
- [ ] Implement `generate_slug(title: str) -> str`:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters
  - Trim to max length
  - Handle edge cases (empty result, etc.)
- [ ] Implement `generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]`:
  - For each article, lookup its site
  - Determine hostname (custom or bcdn)
  - Generate slug from title
  - Build complete URL
  - Return list of mappings

### 6. Update Template Service

**Effort:** 1 story point

- [ ] Update `src/templating/service.py`:
  - Line 92: Change to `hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname`
  - Ensure template mapping works for both hostname types

### 7. Update CLI for Undomained Sites

**Effort:** 2 story points

- [ ] Update `sync-sites` command in `src/cli/commands.py`:
  - Remove filter that skips sites without custom hostnames (lines 688-689)
  - Import sites that only have b-cdn.net hostnames
  - Set `custom_hostname = None` for these sites
- [ ] Test importing undomained sites from bunny.net

### 8. Unit Tests

**Effort:** 4 story points (increased from 3)

- [ ] Test slug generation with various inputs (special chars, long titles, empty strings)
- [ ] Test URL generation with custom hostname
- [ ] Test URL generation with only bcdn hostname
- [ ] Test site assignment with exact count of available sites
- [ ] Test site assignment with insufficient sites (error case)
- [ ] Test site assignment skips already-assigned articles
- [ ] Test tier1 preferred sites logic
- [ ] Test tier2+ random assignment only
- [ ] Test site creation with keyword prefix
- [ ] Test auto-creation on-demand
- [ ] Achieve >80% code coverage for new modules

### 9. Integration Tests

**Effort:** 3 story points (increased from 2)

- [ ] Test full flow: batch generation → site assignment → URL generation
- [ ] Test with mix of pre-assigned (Story 2.5) and null articles
- [ ] Test tier1_preferred_sites assignment
- [ ] Test auto_create_sites when insufficient pool
- [ ] Test create_sites_for_keywords pre-creation
- [ ] Test database updates persist correctly
- [ ] Test with sites that have custom domains vs only bcdn hostnames
- [ ] Verify no duplicate site assignments within same batch

## Technical Notes

### Job Configuration Example

```json
{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}
```

**Assignment example with 10 tier1 articles:**

- Articles 0-1: Assigned via `deployment_targets` (Story 2.5, already done)
- Articles 2-3: Assigned via `tier1_preferred_sites` (Story 3.1)
- Articles 4-8: Assigned via keyword sites if available, else random
- Article 9: Random or auto-created if pool exhausted

**Tier2 articles:** All random

### Site Naming Convention

```
Keyword sites:  engine-repair-a8f3
                car-maintenance-9x2k
Generic sites:  shaft-machining-7m4p
                {project-keyword}-{random-4-char}
```

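A name following this convention could be produced as in the sketch below. `build_site_name` and `random_suffix` are illustrative names, and the lowercase-letters-plus-digits alphabet is an assumption inferred from the sample suffixes above.

```python
import re
import secrets
import string

_SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def random_suffix(length: int = 4) -> str:
    """Return a random lowercase alphanumeric suffix such as "a8f3"."""
    return "".join(secrets.choice(_SUFFIX_ALPHABET) for _ in range(length))


def build_site_name(keyword: str) -> str:
    """Slugify the keyword and append a 4-character random suffix."""
    slug = re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")
    # Fall back to a generic prefix if the keyword has no usable characters
    return f"{slug or 'site'}-{random_suffix()}"
```

Four characters from a 36-symbol alphabet give about 1.7 million combinations per prefix, so collisions are unlikely but still possible; the unique constraint on `pull_zone_bcdn_hostname` remains the final safeguard.
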
### Slug Generation Example

```python
import re


def generate_slug(title: str, max_length: int = 100) -> str:
    """
    Generate URL-safe slug from article title

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    # Remove special chars; \w also matches "_", so drop underscores explicitly
    slug = re.sub(r'[^\w\s-]|_', '', slug)
    # Replace runs of spaces/hyphens with a single hyphen
    slug = re.sub(r'[-\s]+', '-', slug)
    # Limit length first, then trim, so truncation cannot leave a trailing hyphen
    slug = slug[:max_length].strip('-')

    # Fallback if empty (callers with a content_id can use "article-{content_id}")
    return slug or "article"
```

### URL Structure Examples

```
Custom domain:  https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only: https://mysite123.b-cdn.net/how-to-fix-your-engine.html
```

### Site Assignment Algorithm

```
1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
   a. Try tier1_preferred_sites first (if specified)
   b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
   a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
   a. Create minimum needed via bunny.net API
   b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
   a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
```

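The core of steps 4 through 9 can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's implementation: articles and sites are plain dicts, `create_site` is a hypothetical callable standing in for `auto_create_sites`, each site carries a single `hostname` key (the real check would match either hostname type), the shortage is detected up front instead of through a retry, and no database write is performed.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def assign_sites(
    articles: Sequence[dict],            # each: {"id", "tier", "site_id" or None}
    pool: Sequence[dict],                # each: {"id", "hostname"}
    preferred_hostnames: Sequence[str] = (),
    create_site: Optional[Callable[[], dict]] = None,
    rng: Optional[random.Random] = None,
) -> Dict[int, int]:
    """Return {article_id: site_id} for every article that lacked a site."""
    rng = rng or random.Random()

    # Steps 4-5: find unassigned articles and drop sites already used in this batch
    used = {a["site_id"] for a in articles if a["site_id"] is not None}
    available = [s for s in pool if s["id"] not in used]
    pending = [a for a in articles if a["site_id"] is None]

    # Steps 8-9: create only the minimum needed, or fail with both counts
    shortfall = len(pending) - len(available)
    if shortfall > 0:
        if create_site is None:
            raise ValueError(
                f"Need {len(pending)} sites but only {len(available)} are available"
            )
        available.extend(create_site() for _ in range(shortfall))

    # Steps 6-7: preferred sites first for tier1, random otherwise
    assignments: Dict[int, int] = {}
    for article in pending:
        site = None
        if article["tier"] == 1:
            site = next(
                (s for s in available if s["hostname"] in preferred_hostnames), None
            )
        if site is None:
            site = rng.choice(available)
        available.remove(site)           # guarantees no same-batch reuse
        assignments[article["id"]] = site["id"]
    return assignments
```

Passing a seeded `random.Random` makes the random branch reproducible in unit tests, and removing each chosen site from `available` is what enforces the no-duplicates rule within a batch.
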
### Site Creation via Bunny.net API

```python
import secrets
import string


def random_suffix(length: int = 4) -> str:
    """Return a short random lowercase alphanumeric suffix, e.g. "a8f3"."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


# bunny_client and site_repo are passed in explicitly so the function has no hidden globals
def create_bunnynet_site(name_prefix: str, region: str = "DE", *, bunny_client, site_repo):
    # Generate the name once so the Storage Zone, Pull Zone, and database record match
    site_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=site_name,
        region=region
    )

    # Step 2: Create Pull Zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=site_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=site_name,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )

    return site
```

### Database Migration

```sql
-- Make custom_hostname nullable
ALTER TABLE site_deployments
MODIFY COLUMN custom_hostname VARCHAR(255) NULL;

-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments
ADD CONSTRAINT uq_pull_zone_bcdn_hostname
UNIQUE (pull_zone_bcdn_hostname);
```

## Dependencies

- Story 1.6: `SiteDeployment` table exists
- Story 2.3: Content generation creates `GeneratedContent` records
- Story 2.5: Some articles may already have `site_deployment_id` set

## Future Considerations

- Story 4.x will use generated URLs for deployment
- Story 3.3 will use URL list for interlinking
- Future: S3-compatible storage support (custom_hostname nullable enables this)

## Deferred to Technical Debt

- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
  - This would compare article keywords/entities to site hostnames and assign based on relevance score
  - Adds complexity: keyword extraction, scoring algorithm, match tracking
  - Estimated effort if implemented later: 5-8 story points

## Total Effort

21 story points (increased from 17)