Story 3.1 written - will implement bucket assignment and creation logic
parent 6a3e65ef7c
commit 1c19d514c2
@ -0,0 +1,368 @@
# Story 3.1: Generate and Validate Article URLs
|
||||
|
||||
## Status
|
||||
Approved
|
||||
|
||||
## Story
|
||||
**As a developer**, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.
|
||||
|
||||
## Context
|
||||
- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
|
||||
- Remaining articles have `site_deployment_id = null` and need site assignment
|
||||
- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
|
||||
- Sites can have custom domains (`custom_hostname`) OR just bunny.net CDN hostnames (`pull_zone_bcdn_hostname`)
|
||||
- System has 400+ existing bunny.net buckets without custom domains that should be usable
|
||||
- Job config can specify preferred sites for tier1 articles
|
||||
- Job config can request auto-creation of new sites if pool is insufficient
|
||||
- Job config can request pre-creation of sites for specific keywords/entities
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
### Database Changes
|
||||
- `custom_hostname` field in `SiteDeployment` table is nullable (was previously required)
|
||||
- `pull_zone_bcdn_hostname` field in `SiteDeployment` table has unique constraint
|
||||
- Database migration script updates existing schema without data loss
|
||||
|
||||
### Repository Updates
|
||||
- `SiteDeploymentRepository.create()` accepts optional `custom_hostname` parameter (defaults to `None`)
|
||||
- `SiteDeploymentRepository.get_by_bcdn_hostname()` method added to query by bunny.net hostname
|
||||
- Repository interface (`ISiteDeploymentRepository`) updated to reflect optional `custom_hostname`
|
||||
|
||||
### Job Configuration Extensions
|
||||
- Job config supports optional `tier1_preferred_sites` array (list of hostnames for tier1 assignment)
|
||||
- Job config supports optional `auto_create_sites` boolean (default: false)
|
||||
- Job config supports optional `create_sites_for_keywords` array of `{keyword: str, count: int}` objects
|
||||
- Invalid hostnames in `tier1_preferred_sites` produce a clear validation error rather than an unhandled exception
|
||||
- Validation occurs at start of Story 3.1 workflow
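
A minimal sketch of how the three new job-config fields could be parsed and validated up front. The names `SiteJobOptions`, `parse_site_options`, and the hostname regex are hypothetical and not part of the existing `job_config.py`; the check here is syntactic only, so confirming that each preferred hostname exists in the database would still be a separate step.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical syntactic check: dot-separated labels of letters, digits, hyphens.
_HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


@dataclass
class SiteJobOptions:
    """Hypothetical container for the Story 3.1 job-config extensions."""
    tier1_preferred_sites: List[str] = field(default_factory=list)
    auto_create_sites: bool = False
    create_sites_for_keywords: List[Dict[str, Any]] = field(default_factory=list)


def parse_site_options(config: Dict[str, Any]) -> SiteJobOptions:
    """Collect every problem, then raise one ValueError listing all of them."""
    errors: List[str] = []

    preferred = config.get("tier1_preferred_sites") or []
    for host in preferred:
        if not isinstance(host, str) or not _HOSTNAME_RE.match(host.lower()):
            errors.append(f"tier1_preferred_sites: invalid hostname {host!r}")

    auto_create = config.get("auto_create_sites", False)
    if not isinstance(auto_create, bool):
        errors.append("auto_create_sites must be a boolean")

    keywords = config.get("create_sites_for_keywords") or []
    for entry in keywords:
        keyword = entry.get("keyword") if isinstance(entry, dict) else None
        count = entry.get("count") if isinstance(entry, dict) else None
        if not isinstance(keyword, str) or not keyword.strip():
            errors.append(f"create_sites_for_keywords: missing keyword in {entry!r}")
        # bool is a subclass of int, so reject it explicitly
        if not isinstance(count, int) or isinstance(count, bool) or count < 1:
            errors.append(f"create_sites_for_keywords: count must be >= 1 in {entry!r}")

    if errors:
        raise ValueError("Invalid job config: " + "; ".join(errors))

    return SiteJobOptions(list(preferred), auto_create, list(keywords))
```

Collecting all errors before raising lets the operator fix a job file in one pass instead of rerunning after each failure.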
|
||||
|
||||
### Site Creation Logic
|
||||
- If `create_sites_for_keywords` is specified:
  - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
  - Site names: `{keyword-slug}-{random-suffix}` (e.g., "engine-repair-a8f3")
  - Creates in default region (configurable, default "DE")
  - Stores in database with `custom_hostname = null`
  - These sites are added to the available pool BEFORE assignment
- If `auto_create_sites: true` and not enough sites after assignment:
  - Creates additional generic bunny.net sites on-demand
  - Generic site names: `{project-keyword-slug}-{random-suffix}`
  - Creates only the minimum needed
- All created sites use bunny.net API (same as `provision-site` command, but without custom domain step)
|
||||
|
||||
### Site Assignment Logic
|
||||
- A function accepts a batch of `GeneratedContent` records (all from same batch/tier)
|
||||
- For tier1 articles with `site_deployment_id = null`:
  - **Priority 1:** Assign from `tier1_preferred_sites` (if specified in job config)
  - **Priority 2:** Random selection from available pool
- For tier2+ articles with `site_deployment_id = null`:
  - Random selection from available pool
- Assignment rules:
  - Ensures no two articles in the batch get assigned the same site
  - Can reuse sites from previous batches (only same-batch collision matters)
  - Pre-created keyword sites are in the available pool
- Updates the `GeneratedContent.site_deployment_id` field in database
- For articles that already have `site_deployment_id` set (from Story 2.5): leaves them unchanged
- If not enough available sites exist:
  - If `auto_create_sites: true`: create more sites
  - If `auto_create_sites: false`: raise clear error with count of sites needed vs available
|
||||
|
||||
### URL Generation Logic
|
||||
- A function generates the final public URL for each article based on its assigned site
|
||||
- URL structure: `https://{hostname}/{slug}.html`
|
||||
- Hostname selection:
  - If site has `custom_hostname`: use `custom_hostname`
  - If site has no `custom_hostname`: use `pull_zone_bcdn_hostname`
- Slug generation from article title:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters (keep only alphanumeric and hyphens)
  - Trim to reasonable length (e.g., max 100 characters)
  - Example: "How to Fix Your Engine" → "how-to-fix-your-engine"
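
The hostname rule and URL structure above can be sketched as follows. `Site` is a hypothetical stand-in for the `SiteDeployment` model carrying only the two hostname fields; `pick_hostname` and `build_url` are illustrative names, not existing functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    """Hypothetical stand-in for SiteDeployment (only the fields used here)."""
    id: int
    pull_zone_bcdn_hostname: str
    custom_hostname: Optional[str] = None


def pick_hostname(site: Site) -> str:
    # Prefer the custom domain; fall back to the bunny.net CDN hostname.
    return site.custom_hostname or site.pull_zone_bcdn_hostname


def build_url(site: Site, slug: str) -> str:
    # URL structure from the acceptance criteria: https://{hostname}/{slug}.html
    return f"https://{pick_hostname(site)}/{slug}.html"
```

Using `or` treats an empty-string `custom_hostname` the same as `None`, which is probably the safer reading for rows imported without a domain.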
|
||||
|
||||
### Output
|
||||
- Function returns a list of URL mappings: `[{content_id: int, title: str, url: str, tier: int}, ...]`
|
||||
- URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
|
||||
- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")
|
||||
|
||||
### Error Handling
|
||||
- If not enough sites available: clear error message with count needed
|
||||
- If site lookup fails: clear error with site_id
|
||||
- If slug generation produces empty string: use fallback (e.g., "article-{content_id}")
|
||||
|
||||
## Tasks / Subtasks
|
||||
|
||||
### 1. Database Schema Changes
|
||||
**Effort:** 2 story points
|
||||
|
||||
- [ ] Create migration script to alter `site_deployments` table:
  - Make `custom_hostname` nullable
  - Add unique constraint to `pull_zone_bcdn_hostname`
- [ ] Update `src/database/models.py`:
  - Change `custom_hostname: Mapped[str]` to `Mapped[Optional[str]]`
  - Change `nullable=False` to `nullable=True`
  - Add `unique=True` to `pull_zone_bcdn_hostname` field
- [ ] Test migration on development database
|
||||
|
||||
### 2. Update Repository Layer
|
||||
**Effort:** 2 story points
|
||||
|
||||
- [ ] Update `src/database/interfaces.py`:
  - Change `custom_hostname: str` to `custom_hostname: Optional[str]` in `create()` signature
  - Add `get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment]` method signature
- [ ] Update `src/database/repositories.py`:
  - Make `custom_hostname` parameter optional in `create()` with default `None`
  - Update uniqueness validation to handle nullable `custom_hostname`
  - Implement `get_by_bcdn_hostname()` method
  - Update `exists()` method to check both hostname types
|
||||
|
||||
### 3. Implement Site Creation Logic
|
||||
**Effort:** 3 story points
|
||||
|
||||
- [ ] Create new module: `src/generation/site_provisioning.py`
|
||||
- [ ] Implement `create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment`:
  - Call bunny.net API to create Storage Zone
  - Call bunny.net API to create Pull Zone linked to storage
  - Generate unique site name with random suffix
  - Save to database with `custom_hostname = null`
  - Return SiteDeployment record
- [ ] Implement `provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]`:
  - For each keyword+count, create N sites
  - Use keyword in site name (slugified)
  - Return list of created sites
- [ ] Log site creation at INFO level
|
||||
|
||||
### 4. Implement Site Assignment Logic
|
||||
**Effort:** 4 story points (increased from 3)
|
||||
|
||||
- [ ] Update job config schema (`src/generation/job_config.py`):
  - Add `tier1_preferred_sites: Optional[List[str]]`
  - Add `auto_create_sites: Optional[bool] = False`
  - Add `create_sites_for_keywords: Optional[List[Dict]]`
- [ ] Create new module: `src/generation/site_assignment.py`
- [ ] Implement `assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None`:
  - Pre-create sites for keywords if specified
  - Query all available sites from database
  - Filter out sites already assigned to articles in this batch
  - For tier1 articles: try preferred sites first, then random
  - For tier2+ articles: random only
  - If insufficient sites and auto_create=true: create more
  - Update `GeneratedContent.site_deployment_id` in database
  - Validate enough sites are available (raise error if auto_create=false)
- [ ] Log assignment decisions at INFO level
|
||||
|
||||
### 5. Implement URL Generation Logic
|
||||
**Effort:** 2 story points
|
||||
|
||||
- [ ] Create `src/generation/url_generator.py`
|
||||
- [ ] Implement `generate_slug(title: str) -> str`:
  - Convert to lowercase
  - Replace spaces with hyphens
  - Remove special characters
  - Trim to max length
  - Handle edge cases (empty result, etc.)
- [ ] Implement `generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]`:
  - For each article, lookup its site
  - Determine hostname (custom or bcdn)
  - Generate slug from title
  - Build complete URL
  - Return list of mappings
|
||||
|
||||
### 6. Update Template Service
|
||||
**Effort:** 1 story point
|
||||
|
||||
- [ ] Update `src/templating/service.py`:
  - Line 92: Change to `hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname`
  - Ensure template mapping works for both hostname types
|
||||
|
||||
### 7. Update CLI for Undomained Sites
|
||||
**Effort:** 2 story points
|
||||
|
||||
- [ ] Update `sync-sites` command in `src/cli/commands.py`:
  - Remove filter that skips sites without custom hostnames (lines 688-689)
  - Import sites that only have b-cdn.net hostnames
  - Set `custom_hostname = None` for these sites
- [ ] Test importing undomained sites from bunny.net
|
||||
|
||||
### 8. Unit Tests
|
||||
**Effort:** 4 story points (increased from 3)
|
||||
|
||||
- [ ] Test slug generation with various inputs (special chars, long titles, empty strings)
|
||||
- [ ] Test URL generation with custom hostname
|
||||
- [ ] Test URL generation with only bcdn hostname
|
||||
- [ ] Test site assignment with exact count of available sites
|
||||
- [ ] Test site assignment with insufficient sites (error case)
|
||||
- [ ] Test site assignment skips already-assigned articles
|
||||
- [ ] Test tier1 preferred sites logic
|
||||
- [ ] Test tier2+ random assignment only
|
||||
- [ ] Test site creation with keyword prefix
|
||||
- [ ] Test auto-creation on-demand
|
||||
- [ ] Achieve >80% code coverage for new modules
|
||||
|
||||
### 9. Integration Tests
|
||||
**Effort:** 3 story points (increased from 2)
|
||||
|
||||
- [ ] Test full flow: batch generation → site assignment → URL generation
|
||||
- [ ] Test with mix of pre-assigned (Story 2.5) and null articles
|
||||
- [ ] Test tier1_preferred_sites assignment
|
||||
- [ ] Test auto_create_sites when insufficient pool
|
||||
- [ ] Test create_sites_for_keywords pre-creation
|
||||
- [ ] Test database updates persist correctly
|
||||
- [ ] Test with sites that have custom domains vs only bcdn hostnames
|
||||
- [ ] Verify no duplicate site assignments within same batch
|
||||
|
||||
## Technical Notes
|
||||
|
||||
### Job Configuration Example
|
||||
```json
{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}
```
|
||||
|
||||
**Assignment example with 10 tier1 articles:**
|
||||
- Articles 0-1: Assigned via `deployment_targets` (Story 2.5, already done)
|
||||
- Articles 2-3: Assigned via `tier1_preferred_sites` (Story 3.1)
|
||||
- Articles 4-8: Assigned via keyword sites if available, else random
|
||||
- Article 9: Random or auto-created if pool exhausted
|
||||
|
||||
**Tier2 articles:** All random
|
||||
|
||||
### Site Naming Convention
|
||||
```
Keyword sites:  engine-repair-a8f3
                car-maintenance-9x2k
Generic sites:  shaft-machining-7m4p
                {project-keyword}-{random-4-char}
```
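
One way the naming convention could be produced is sketched below. `make_site_name` and this particular `random_suffix` are hypothetical helpers; the 4-character lowercase-alphanumeric suffix follows the examples above, and any additional naming limits imposed by bunny.net are not checked here.

```python
import re
import secrets
import string

# Lowercase letters and digits, matching suffixes such as "a8f3" or "9x2k".
_SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def random_suffix(length: int = 4) -> str:
    """Return a random lowercase alphanumeric suffix."""
    return "".join(secrets.choice(_SUFFIX_ALPHABET) for _ in range(length))


def make_site_name(keyword: str, suffix_length: int = 4) -> str:
    """Build `{keyword-slug}-{random-suffix}` from a free-text keyword."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")
    # "site" is an assumed fallback for keywords that slugify to nothing.
    return f"{slug or 'site'}-{random_suffix(suffix_length)}"
```

A 4-character suffix gives 36^4 (about 1.7 million) combinations per keyword, so collisions are unlikely but possible; the unique constraint on `pull_zone_bcdn_hostname` and the API's own name checks remain the real safeguard.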
|
||||
|
||||
### Slug Generation Example
|
||||
```python
import re


def generate_slug(title: str, max_length: int = 100) -> str:
    """
    Generate a URL-safe slug from an article title.

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    # Keep only ASCII alphanumerics, whitespace, and hyphens
    # (\w would also keep underscores and non-ASCII letters)
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)
    slug = re.sub(r"[-\s]+", "-", slug)  # Collapse spaces/hyphens into a single hyphen
    # Trim, limit length, then drop a trailing hyphen left by the cut
    slug = slug.strip("-")[:max_length].rstrip("-")

    # Fallback if empty; the caller can substitute "article-{content_id}"
    return slug or "article"
```
|
||||
|
||||
### URL Structure Examples
|
||||
```
Custom domain:   https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only:  https://mysite123.b-cdn.net/how-to-fix-your-engine.html
```
|
||||
|
||||
### Site Assignment Algorithm
|
||||
```
1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
   a. Try tier1_preferred_sites first (if specified)
   b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
   a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
   a. Create minimum needed via bunny.net API
   b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
   a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
```
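
The in-memory part of this algorithm (steps 4 through 9, without the database and bunny.net calls) could look roughly like the sketch below. `Article`, `InsufficientSitesError`, and `plan_assignments` are hypothetical names; sites are represented by bare integer ids, and preferred sites are assumed to have already been resolved from hostnames to ids.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Article:
    """Hypothetical stand-in for GeneratedContent (only the fields used here)."""
    id: int
    tier: int
    site_deployment_id: Optional[int] = None


class InsufficientSitesError(RuntimeError):
    """Raised when the pool cannot cover every unassigned article."""


def plan_assignments(
    articles: Sequence[Article],
    available_site_ids: Sequence[int],
    preferred_site_ids: Sequence[int] = (),
    rng: Optional[random.Random] = None,
) -> Dict[int, int]:
    """Return {article_id: site_id} for articles that still need a site."""
    rng = rng or random.Random()

    # Sites already taken within THIS batch (e.g. by Story 2.5) are off limits.
    used = {a.site_deployment_id for a in articles if a.site_deployment_id is not None}
    pending = [a for a in articles if a.site_deployment_id is None]

    pool = [s for s in dict.fromkeys(available_site_ids) if s not in used]
    if len(pending) > len(pool):
        raise InsufficientSitesError(
            f"need {len(pending)} sites, only {len(pool)} available"
        )

    pool_set = set(pool)
    preferred = [s for s in dict.fromkeys(preferred_site_ids) if s in pool_set]
    rng.shuffle(pool)

    plan: Dict[int, int] = {}
    # Handle tier1 first so those articles get first pick of preferred sites.
    for article in sorted(pending, key=lambda a: a.tier != 1):
        if article.tier == 1 and preferred:
            site_id = preferred.pop(0)
            pool.remove(site_id)
        else:
            site_id = pool.pop()
        plan[article.id] = site_id
    return plan
```

With `auto_create_sites: true`, a caller could catch `InsufficientSitesError`, create the shortfall via the bunny.net API, and call `plan_assignments` again with the expanded pool; persisting the plan to `GeneratedContent.site_deployment_id` stays outside this function so the selection logic is easy to unit test.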
|
||||
|
||||
### Site Creation via Bunny.net API
|
||||
```python
# bunny_client, site_repo, and random_suffix() are assumed to be available in scope
def create_bunnynet_site(name_prefix: str, region: str = "DE"):
    # Generate the name once so the Storage Zone and Pull Zone share it
    zone_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=zone_name,
        region=region
    )

    # Step 2: Create Pull Zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=zone_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=name_prefix,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )

    return site
```
|
||||
|
||||
### Database Migration
|
||||
```sql
-- Make custom_hostname nullable
-- (MODIFY COLUMN is MySQL/MariaDB syntax; adjust for other database engines)
ALTER TABLE site_deployments
    MODIFY COLUMN custom_hostname VARCHAR(255) NULL;

-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments
    ADD CONSTRAINT uq_pull_zone_bcdn_hostname
    UNIQUE (pull_zone_bcdn_hostname);
```
|
||||
|
||||
## Dependencies
|
||||
- Story 1.6: `SiteDeployment` table exists
|
||||
- Story 2.3: Content generation creates `GeneratedContent` records
|
||||
- Story 2.5: Some articles may already have `site_deployment_id` set
|
||||
|
||||
## Future Considerations
|
||||
- Story 4.x will use generated URLs for deployment
|
||||
- Story 3.3 will use URL list for interlinking
|
||||
- Future: S3-compatible storage support (custom_hostname nullable enables this)
|
||||
|
||||
## Deferred to Technical Debt
|
||||
- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
  - This would compare article keywords/entities to site hostnames and assign based on relevance score
  - Adds complexity: keyword extraction, scoring algorithm, match tracking
  - Estimated effort if implemented later: 5-8 story points
|
||||
|
||||
## Total Effort
|
||||
21 story points (increased from 17)
|
||||
|
||||
|
|
@ -369,6 +369,92 @@ Current augmentation is basic:
|
|||
|
||||
---
|
||||
|
||||
## Story 3.1: URL Generation and Site Assignment
|
||||
|
||||
### Fuzzy Keyword/Entity Matching for Site Assignment
|
||||
|
||||
**Priority**: Medium
|
||||
**Epic Suggestion**: Epic 3 (Pre-deployment) - Enhancement
|
||||
**Estimated Effort**: Medium (5-8 story points)
|
||||
|
||||
#### Problem
|
||||
Currently tier1 site assignment uses:
|
||||
1. Explicit preferred sites from job config
|
||||
2. Random selection from available pool
|
||||
|
||||
This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance.
|
||||
|
||||
#### Proposed Solution
|
||||
|
||||
**Intelligent Site Matching:**
|
||||
1. Extract article keywords and entities from GeneratedContent
|
||||
2. Parse keywords/entities from site hostnames and names
|
||||
3. Score each (article, site) pair based on keyword/entity overlap
|
||||
4. Assign tier1 articles to highest-scoring available sites
|
||||
5. Fall back to random if no good matches
|
||||
|
||||
**Example:**
|
||||
```
Article: "Engine Repair Basics"
Keywords: ["engine repair", "automotive", "maintenance"]
Entities: ["engine", "carburetor", "cylinder"]

Available Sites:
- auto-repair-tips.com          Score: 0.85 (high match)
- engine-maintenance-guide.com  Score: 0.92 (very high match)
- cooking-recipes.com           Score: 0.05 (low match)

Assignment: engine-maintenance-guide.com (best match)
```
|
||||
|
||||
**Implementation Details:**
|
||||
- Scoring algorithm: weighted combination of keyword match + entity match
|
||||
- Fuzzy matching: use Levenshtein distance or similar for partial matches
|
||||
- Track assignments to avoid reusing sites within same batch
|
||||
- Configurable threshold (e.g., only assign if score > 0.5, else random)
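
A rough sketch of the weighted scoring idea, under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` ratio as the fuzzy measure in place of Levenshtein distance, treats every hostname label (including the TLD) as a candidate token, and all function names are hypothetical. The scores it produces will not match the illustrative numbers in the example above.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Split text or a hostname into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_ratio(token: str, candidates: Iterable[str]) -> float:
    """Similarity (0..1) between a token and its closest candidate."""
    return max(
        (SequenceMatcher(None, token, c).ratio() for c in candidates),
        default=0.0,
    )


def overlap_score(terms: Iterable[str], site_tokens: List[str]) -> float:
    """Average best-match similarity of every term token against the site tokens."""
    tokens = [t for term in terms for t in tokenize(term)]
    if not tokens or not site_tokens:
        return 0.0
    return sum(best_ratio(t, site_tokens) for t in tokens) / len(tokens)


def score_site(
    keywords: Iterable[str],
    entities: Iterable[str],
    hostname: str,
    weight_keywords: float = 0.6,
    weight_entities: float = 0.4,
) -> float:
    """Weighted combination of keyword overlap and entity overlap."""
    site_tokens = tokenize(hostname)
    return (
        weight_keywords * overlap_score(keywords, site_tokens)
        + weight_entities * overlap_score(entities, site_tokens)
    )
```

Because the weights default to 0.6 and 0.4 and each overlap is bounded by 1, the result stays in the 0 to 1 range, so it can be compared directly against a `min_score`-style threshold.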
|
||||
|
||||
**Job Configuration:**
|
||||
```json
{
  "tier1_site_matching": {
    "enabled": true,
    "min_score": 0.5,
    "weight_keywords": 0.6,
    "weight_entities": 0.4
  }
}
```
|
||||
|
||||
**Database Changes:**
|
||||
None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name)
|
||||
|
||||
#### Complexity Factors
|
||||
- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"])
|
||||
- Entity recognition and normalization
|
||||
- Scoring algorithm design and tuning
|
||||
- Testing with various domain/content combinations
|
||||
- Performance optimization for large site pools
|
||||
|
||||
#### Impact
|
||||
- Better SEO through topical site clustering
|
||||
- More organized content portfolio
|
||||
- Easier to identify which sites cover which topics
|
||||
- Improved content discoverability
|
||||
|
||||
#### Alternative: Simpler Keyword-Only Matching
|
||||
If full fuzzy matching is too complex, start with exact keyword substring matching:
|
||||
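```python
# Simple version: check if article keyword appears in hostname.
# custom_hostname is nullable as of Story 3.1, so fall back to the bcdn hostname.
hostname = (site.custom_hostname or site.pull_zone_bcdn_hostname).lower()
# Hostnames use hyphens where keywords use spaces
keyword = article.main_keyword.lower().replace(" ", "-")

if keyword in hostname:
    score = 1.0
else:
    score = 0.0
```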
|
||||
|
||||
This would still provide value with much less complexity (2-3 story points instead of 5-8).
|
||||
|
||||
---
|
||||
|
||||
## Future Sections
|
||||
|
||||
Add new technical debt items below as they're identified during development.
|
||||
|