From 1c19d514c28b127ba276c18f184667002621bae3 Mon Sep 17 00:00:00 2001 From: PeninsulaInd Date: Tue, 21 Oct 2025 09:11:17 -0500 Subject: [PATCH] Story 3.1 written - will implement bucket assignment and creation logic --- ...-3.1-url-generation-and-site-assignment.md | 368 ++++++++++++++++++ docs/technical-debt.md | 86 ++++ 2 files changed, 454 insertions(+) create mode 100644 docs/stories/story-3.1-url-generation-and-site-assignment.md diff --git a/docs/stories/story-3.1-url-generation-and-site-assignment.md b/docs/stories/story-3.1-url-generation-and-site-assignment.md new file mode 100644 index 0000000..418f6cc --- /dev/null +++ b/docs/stories/story-3.1-url-generation-and-site-assignment.md @@ -0,0 +1,368 @@ +# Story 3.1: Generate and Validate Article URLs + +## Status +Approved + +## Story +**As a developer**, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking. 
+ +## Context +- Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains) +- Remaining articles have `site_deployment_id = null` and need site assignment +- Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches) +- Sites can have custom domains (`custom_hostname`) OR just bunny.net CDN hostnames (`pull_zone_bcdn_hostname`) +- System has 400+ existing bunny.net buckets without custom domains that should be usable +- Job config can specify preferred sites for tier1 articles +- Job config can request auto-creation of new sites if pool is insufficient +- Job config can request pre-creation of sites for specific keywords/entities + +## Acceptance Criteria + +### Database Changes +- `custom_hostname` field in `SiteDeployment` table is nullable (was previously required) +- `pull_zone_bcdn_hostname` field in `SiteDeployment` table has unique constraint +- Database migration script updates existing schema without data loss + +### Repository Updates +- `SiteDeploymentRepository.create()` accepts optional `custom_hostname` parameter (defaults to `None`) +- `SiteDeploymentRepository.get_by_bcdn_hostname()` method added to query by bunny.net hostname +- Repository interface (`ISiteDeploymentRepository`) updated to reflect optional `custom_hostname` + +### Job Configuration Extensions +- Job config supports optional `tier1_preferred_sites` array (list of hostnames for tier1 assignment) +- Job config supports optional `auto_create_sites` boolean (default: false) +- Job config supports optional `create_sites_for_keywords` array of `{keyword: str, count: int}` objects +- Invalid hostnames in `tier1_preferred_sites` cause graceful errors +- Validation occurs at start of Story 3.1 workflow + +### Site Creation Logic +- If `create_sites_for_keywords` specified: + - Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API + - Site names: `{keyword-slug}-{random-suffix}` (e.g., "engine-repair-a8f3") 
+ - Creates in default region (configurable, default "DE") + - Stores in database with `custom_hostname = null` + - These sites added to available pool BEFORE assignment +- If `auto_create_sites: true` and not enough sites after assignment: + - Creates additional generic bunny.net sites on-demand + - Generic site names: `{project-keyword-slug}-{random-suffix}` + - Creates only the minimum needed +- All created sites use bunny.net API (same as `provision-site` command, but without custom domain step) + +### Site Assignment Logic +- A function accepts a batch of `GeneratedContent` records (all from same batch/tier) +- For tier1 articles with `site_deployment_id = null`: + - **Priority 1:** Assign from `tier1_preferred_sites` (if specified in job config) + - **Priority 2:** Random selection from available pool +- For tier2+ articles with `site_deployment_id = null`: + - Random selection from available pool +- Assignment rules: + - Ensures no two articles in the batch get assigned the same site + - Can reuse sites from previous batches (only same-batch collision matters) + - Pre-created keyword sites are in the available pool +- Updates the `GeneratedContent.site_deployment_id` field in database +- For articles that already have `site_deployment_id` set (from Story 2.5): leaves them unchanged +- If not enough available sites exist: + - If `auto_create_sites: true`: create more sites + - If `auto_create_sites: false`: raise clear error with count of sites needed vs available + +### URL Generation Logic +- A function generates the final public URL for each article based on its assigned site +- URL structure: `https://{hostname}/{slug}.html` +- Hostname selection: + - If site has `custom_hostname`: use `custom_hostname` + - If site has no `custom_hostname`: use `pull_zone_bcdn_hostname` +- Slug generation from article title: + - Convert to lowercase + - Replace spaces with hyphens + - Remove special characters (keep only alphanumeric and hyphens) + - Trim to reasonable 
length (e.g., max 100 characters) + - Example: "How to Fix Your Engine" → "how-to-fix-your-engine" + +### Output +- Function returns a list of URL mappings: `[{content_id: int, title: str, url: str, tier: int}, ...]` +- URL list includes ALL articles in the batch (both pre-assigned and newly assigned) +- Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15") + +### Error Handling +- If not enough sites available: clear error message with count needed +- If site lookup fails: clear error with site_id +- If slug generation produces empty string: use fallback (e.g., "article-{content_id}") + +## Tasks / Subtasks + +### 1. Database Schema Changes +**Effort:** 2 story points + +- [ ] Create migration script to alter `site_deployments` table: + - Make `custom_hostname` nullable + - Add unique constraint to `pull_zone_bcdn_hostname` +- [ ] Update `src/database/models.py`: + - Change `custom_hostname: Mapped[str]` to `Mapped[Optional[str]]` + - Change `nullable=False` to `nullable=True` + - Add `unique=True` to `pull_zone_bcdn_hostname` field +- [ ] Test migration on development database + +### 2. Update Repository Layer +**Effort:** 2 story points + +- [ ] Update `src/database/interfaces.py`: + - Change `custom_hostname: str` to `custom_hostname: Optional[str]` in `create()` signature + - Add `get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment]` method signature +- [ ] Update `src/database/repositories.py`: + - Make `custom_hostname` parameter optional in `create()` with default `None` + - Update uniqueness validation to handle nullable `custom_hostname` + - Implement `get_by_bcdn_hostname()` method + - Update `exists()` method to check both hostname types + +### 3. 
Implement Site Creation Logic +**Effort:** 3 story points + +- [ ] Create new module: `src/generation/site_provisioning.py` +- [ ] Implement `create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment`: + - Call bunny.net API to create Storage Zone + - Call bunny.net API to create Pull Zone linked to storage + - Generate unique site name with random suffix + - Save to database with `custom_hostname = null` + - Return SiteDeployment record +- [ ] Implement `provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]`: + - For each keyword+count, create N sites + - Use keyword in site name (slugified) + - Return list of created sites +- [ ] Log site creation at INFO level + +### 4. Implement Site Assignment Logic +**Effort:** 4 story points (increased from 3) + +- [ ] Update job config schema (`src/generation/job_config.py`): + - Add `tier1_preferred_sites: Optional[List[str]]` + - Add `auto_create_sites: Optional[bool] = False` + - Add `create_sites_for_keywords: Optional[List[Dict]]` +- [ ] Create new module: `src/generation/site_assignment.py` +- [ ] Implement `assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None`: + - Pre-create sites for keywords if specified + - Query all available sites from database + - Filter out sites already assigned to articles in this batch + - For tier1 articles: try preferred sites first, then random + - For tier2+ articles: random only + - If insufficient sites and auto_create=true: create more + - Update `GeneratedContent.site_deployment_id` in database + - Validate enough sites are available (raise error if auto_create=false) +- [ ] Log assignment decisions at INFO level + +### 5. 
Implement URL Generation Logic +**Effort:** 2 story points + +- [ ] Create `src/generation/url_generator.py` +- [ ] Implement `generate_slug(title: str) -> str`: + - Convert to lowercase + - Replace spaces with hyphens + - Remove special characters + - Trim to max length + - Handle edge cases (empty result, etc.) +- [ ] Implement `generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]`: + - For each article, lookup its site + - Determine hostname (custom or bcdn) + - Generate slug from title + - Build complete URL + - Return list of mappings + +### 6. Update Template Service +**Effort:** 1 story point + +- [ ] Update `src/templating/service.py`: + - Line 92: Change to `hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname` + - Ensure template mapping works for both hostname types + +### 7. Update CLI for Undomained Sites +**Effort:** 2 story points + +- [ ] Update `sync-sites` command in `src/cli/commands.py`: + - Remove filter that skips sites without custom hostnames (lines 688-689) + - Import sites that only have b-cdn.net hostnames + - Set `custom_hostname = None` for these sites +- [ ] Test importing undomained sites from bunny.net + +### 8. Unit Tests +**Effort:** 4 story points (increased from 3) + +- [ ] Test slug generation with various inputs (special chars, long titles, empty strings) +- [ ] Test URL generation with custom hostname +- [ ] Test URL generation with only bcdn hostname +- [ ] Test site assignment with exact count of available sites +- [ ] Test site assignment with insufficient sites (error case) +- [ ] Test site assignment skips already-assigned articles +- [ ] Test tier1 preferred sites logic +- [ ] Test tier2+ random assignment only +- [ ] Test site creation with keyword prefix +- [ ] Test auto-creation on-demand +- [ ] Achieve >80% code coverage for new modules + +### 9. 
Integration Tests +**Effort:** 3 story points (increased from 2) + +- [ ] Test full flow: batch generation → site assignment → URL generation +- [ ] Test with mix of pre-assigned (Story 2.5) and null articles +- [ ] Test tier1_preferred_sites assignment +- [ ] Test auto_create_sites when insufficient pool +- [ ] Test create_sites_for_keywords pre-creation +- [ ] Test database updates persist correctly +- [ ] Test with sites that have custom domains vs only bcdn hostnames +- [ ] Verify no duplicate site assignments within same batch + +## Technical Notes + +### Job Configuration Example +```json +{ + "job_name": "Test Run", + "project_id": 2, + "deployment_targets": [ + "www.domain1.com", + "www.domain2.com" + ], + "tier1_preferred_sites": [ + "site123.b-cdn.net", + "www.otherdomain.com" + ], + "auto_create_sites": true, + "create_sites_for_keywords": [ + {"keyword": "engine repair", "count": 3}, + {"keyword": "car maintenance", "count": 2} + ], + "tiers": [ + { + "tier": 1, + "article_count": 10 + }, + { + "tier": 2, + "article_count": 50 + } + ] +} +``` + +**Assignment example with 10 tier1 articles:** +- Articles 0-1: Assigned via `deployment_targets` (Story 2.5, already done) +- Articles 2-3: Assigned via `tier1_preferred_sites` (Story 3.1) +- Articles 4-8: Assigned via keyword sites if available, else random +- Article 9: Random or auto-created if pool exhausted + +**Tier2 articles:** All random + +### Site Naming Convention +``` +Keyword sites: engine-repair-a8f3 + car-maintenance-9x2k +Generic sites: shaft-machining-7m4p + {project-keyword}-{random-4-char} +``` + +### Slug Generation Example +```python +def generate_slug(title: str, max_length: int = 100) -> str: + """ + Generate URL-safe slug from article title + + Examples: + "How to Fix Your Engine" -> "how-to-fix-your-engine" + "10 Best SEO Tips for 2024!" 
-> "10-best-seo-tips-for-2024" + "C++ Programming Guide" -> "c-programming-guide" + """ + import re + + slug = title.lower() + slug = re.sub(r'[^a-z0-9\s-]', '', slug) # Keep only ASCII alphanumerics, whitespace, and hyphens + slug = re.sub(r'[-\s]+', '-', slug) # Collapse runs of spaces/hyphens into a single hyphen + slug = slug[:max_length].strip('-') # Limit length, then trim so truncation cannot leave a trailing hyphen + + return slug or "article" # Fallback if empty +``` + +### URL Structure Examples +``` +Custom domain: https://www.example.com/how-to-fix-your-engine.html +Bunny CDN only: https://mysite123.b-cdn.net/how-to-fix-your-engine.html +``` + +### Site Assignment Algorithm +``` +1. Load job config and GeneratedContent records for batch +2. Pre-create sites for keywords if create_sites_for_keywords specified +3. Query all available SiteDeployment records +4. Identify articles with site_deployment_id = null (need assignment) +5. Filter out sites already used by articles in THIS batch +6. For each tier1 article needing assignment: + a. Try tier1_preferred_sites first (if specified) + b. Fallback to random from available pool +7. For each tier2+ article needing assignment: + a. Random from available pool +8. If insufficient sites and auto_create_sites=true: + a. Create minimum needed via bunny.net API + b. Retry assignment with expanded pool +9. If insufficient sites and auto_create_sites=false: + a. Raise error with count needed vs available +10.
Update database with new site_deployment_id values +``` + +### Site Creation via Bunny.net API +```python +def create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment: + # bunny_client, site_repo, and random_suffix() are assumed to be available in module scope + # Generate the unique site name once so both zones and the database record share it + site_name = f"{name_prefix}-{random_suffix()}" + + # Step 1: Create Storage Zone + storage = bunny_client.create_storage_zone( + name=site_name, + region=region + ) + + # Step 2: Create Pull Zone (no custom hostname step) + pull = bunny_client.create_pull_zone( + name=site_name, + storage_zone_id=storage.id + ) + + # Step 3: Save to database + site = site_repo.create( + site_name=site_name, + custom_hostname=None, # No custom domain + storage_zone_id=storage.id, + storage_zone_name=storage.name, + storage_zone_password=storage.password, + storage_zone_region=region, + pull_zone_id=pull.id, + pull_zone_bcdn_hostname=pull.hostname + ) + + return site +``` + +### Database Migration +```sql +-- Make custom_hostname nullable (MySQL/MariaDB syntax) +ALTER TABLE site_deployments + MODIFY COLUMN custom_hostname VARCHAR(255) NULL; + +-- Add unique constraint to pull_zone_bcdn_hostname +ALTER TABLE site_deployments + ADD CONSTRAINT uq_pull_zone_bcdn_hostname + UNIQUE (pull_zone_bcdn_hostname); +``` + +## Dependencies +- Story 1.6: `SiteDeployment` table exists +- Story 2.3: Content generation creates `GeneratedContent` records +- Story 2.5: Some articles may already have `site_deployment_id` set + +## Future Considerations +- Story 4.x will use generated URLs for deployment +- Story 3.3 will use URL list for interlinking +- Future: S3-compatible storage support (custom_hostname nullable enables this) + +## Deferred to Technical Debt +- Fuzzy keyword/entity matching for intelligent site assignment (T1 articles) +- This would compare article keywords/entities to site hostnames and assign based on relevance score +- Adds complexity: keyword extraction, scoring algorithm, match tracking +- Estimated effort if implemented later: 5-8 story points + +## Total Effort +23 story points (increased from 17) + +diff --git 
a/docs/technical-debt.md b/docs/technical-debt.md index 0f801ce..fe17c0c 100644 --- a/docs/technical-debt.md +++ b/docs/technical-debt.md @@ -369,6 +369,92 @@ Current augmentation is basic: --- +## Story 3.1: URL Generation and Site Assignment + +### Fuzzy Keyword/Entity Matching for Site Assignment + +**Priority**: Medium +**Epic Suggestion**: Epic 3 (Pre-deployment) - Enhancement +**Estimated Effort**: Medium (5-8 story points) + +#### Problem +Currently tier1 site assignment uses: +1. Explicit preferred sites from job config +2. Random selection from available pool + +This doesn't leverage semantic matching between article content and site domains/names. For SEO and organizational purposes, it would be valuable to assign articles to sites based on topic/keyword relevance. + +#### Proposed Solution + +**Intelligent Site Matching:** +1. Extract article keywords and entities from GeneratedContent +2. Parse keywords/entities from site hostnames and names +3. Score each (article, site) pair based on keyword/entity overlap +4. Assign tier1 articles to highest-scoring available sites +5. 
Fall back to random if no good matches + +**Example:** +``` +Article: "Engine Repair Basics" + Keywords: ["engine repair", "automotive", "maintenance"] + Entities: ["engine", "carburetor", "cylinder"] + +Available Sites: + - auto-repair-tips.com Score: 0.85 (high match) + - engine-maintenance-guide.com Score: 0.92 (very high match) + - cooking-recipes.com Score: 0.05 (low match) + +Assignment: engine-maintenance-guide.com (best match) +``` + +**Implementation Details:** +- Scoring algorithm: weighted combination of keyword match + entity match +- Fuzzy matching: use Levenshtein distance or similar for partial matches +- Track assignments to avoid reusing sites within same batch +- Configurable threshold (e.g., only assign if score > 0.5, else random) + +**Job Configuration:** +```json +{ + "tier1_site_matching": { + "enabled": true, + "min_score": 0.5, + "weight_keywords": 0.6, + "weight_entities": 0.4 + } +} +``` + +**Database Changes:** +None required - uses existing GeneratedContent fields (keyword, entities) and SiteDeployment fields (custom_hostname, site_name) + +#### Complexity Factors +- Keyword extraction from domain names (e.g., "auto-repair-tips.com" → ["auto", "repair", "tips"]) +- Entity recognition and normalization +- Scoring algorithm design and tuning +- Testing with various domain/content combinations +- Performance optimization for large site pools + +#### Impact +- Better SEO through topical site clustering +- More organized content portfolio +- Easier to identify which sites cover which topics +- Improved content discoverability + +#### Alternative: Simpler Keyword-Only Matching +If full fuzzy matching is too complex, start with exact keyword substring matching: +```python +# Simple version: check if article keyword appears in hostname +# custom_hostname is nullable after Story 3.1, so fall back to the b-cdn hostname +hostname = site.custom_hostname or site.pull_zone_bcdn_hostname +if article.main_keyword.lower() in hostname.lower(): + score = 1.0 +else: + score = 0.0 +``` + +This would still provide value with much less complexity (2-3 story points instead of 5-8). 
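
One limitation of plain substring matching is that a multi-word keyword such as "engine repair" never appears verbatim in a hyphenated hostname like `engine-repair-a8f3.b-cdn.net`. A minimal token-based sketch of the same idea follows. The function names `hostname_tokens` and `keyword_match_score` are hypothetical, and the sketch takes plain strings rather than the project's model objects, so it is an illustration under those assumptions and not the planned implementation.

```python
import re
from typing import Optional, Set


def hostname_tokens(hostname: str) -> Set[str]:
    """Split a hostname into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", hostname.lower()))


def keyword_match_score(keyword: str,
                        custom_hostname: Optional[str],
                        bcdn_hostname: str) -> float:
    """Return the fraction of keyword words found among the hostname tokens."""
    # Assumption: sites without a custom domain are matched on their b-cdn hostname.
    hostname = custom_hostname or bcdn_hostname
    words = re.findall(r"[a-z0-9]+", keyword.lower())
    if not words:
        return 0.0
    tokens = hostname_tokens(hostname)
    return sum(1 for word in words if word in tokens) / len(words)
```

Scoring by word overlap gives partial credit (for example, "engine repair" against `auto-repair-tips.com` matches one of two words), which could feed the same `min_score` threshold proposed for the full fuzzy version.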
+ +--- + ## Future Sections Add new technical debt items below as they're identified during development.