Story 3.1: Generate and Validate Article URLs

Status

Approved

Story

As a developer, I want to assign unique sites to all articles in a batch, validate those sites exist, and generate final public URLs for each article, so that I have a definitive URL list before interlinking.

Context

  • Story 2.5 assigns the first N tier1 articles to deployment targets (sites with custom domains)
  • Remaining articles have site_deployment_id = null and need site assignment
  • Articles from the same batch/tier must NOT share a site (but can reuse sites from previous batches)
  • Sites can have custom domains (custom_hostname) OR just bunny.net CDN hostnames (pull_zone_bcdn_hostname)
  • System has 400+ existing bunny.net buckets without custom domains that should be usable
  • Job config can specify preferred sites for tier1 articles
  • Job config can request auto-creation of new sites if pool is insufficient
  • Job config can request pre-creation of sites for specific keywords/entities

Acceptance Criteria

Database Changes

  • custom_hostname field in SiteDeployment table is nullable (was previously required)
  • pull_zone_bcdn_hostname field in SiteDeployment table has unique constraint
  • Database migration script updates existing schema without data loss

Repository Updates

  • SiteDeploymentRepository.create() accepts optional custom_hostname parameter (defaults to None)
  • SiteDeploymentRepository.get_by_bcdn_hostname() method added to query by bunny.net hostname
  • Repository interface (ISiteDeploymentRepository) updated to reflect optional custom_hostname

Job Configuration Extensions

  • Job config supports optional tier1_preferred_sites array (list of hostnames for tier1 assignment)
  • Job config supports optional auto_create_sites boolean (default: false)
  • Job config supports optional create_sites_for_keywords array of {keyword: str, count: int} objects
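  • Invalid hostnames in tier1_preferred_sites raise a clear validation error that names the offending entries
  • Job config validation runs at the start of the Story 3.1 workflow, before any sites are created or assigned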
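
The validation criteria above could be sketched roughly as follows. This is a minimal illustration, not the project's actual schema code: `SiteOptions`, `JobConfigError`, and `parse_site_options` are hypothetical names, and the hostname check is syntactic only (confirming that a hostname exists in the database would need a repository lookup).

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Loose syntactic hostname check: dot-separated labels of letters, digits, hyphens.
_HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


class JobConfigError(ValueError):
    """Hypothetical error type for invalid job configuration."""


@dataclass
class SiteOptions:
    tier1_preferred_sites: List[str] = field(default_factory=list)
    auto_create_sites: bool = False
    create_sites_for_keywords: List[Dict[str, Any]] = field(default_factory=list)


def parse_site_options(config: Dict[str, Any]) -> SiteOptions:
    """Validate the three optional Story 3.1 keys and apply their defaults."""
    preferred = config.get("tier1_preferred_sites") or []
    bad = [h for h in preferred
           if not isinstance(h, str) or not _HOSTNAME_RE.match(h.lower())]
    if bad:
        raise JobConfigError(f"Invalid tier1_preferred_sites entries: {bad}")

    auto_create = config.get("auto_create_sites", False)
    if not isinstance(auto_create, bool):
        raise JobConfigError("auto_create_sites must be a boolean")

    keywords = config.get("create_sites_for_keywords") or []
    for entry in keywords:
        keyword = entry.get("keyword") if isinstance(entry, dict) else None
        count = entry.get("count") if isinstance(entry, dict) else None
        if not isinstance(keyword, str) or not keyword.strip():
            raise JobConfigError(f"keyword must be a non-empty string: {entry!r}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int) or count < 1:
            raise JobConfigError(f"count must be a positive integer: {entry!r}")

    return SiteOptions(list(preferred), auto_create, list(keywords))
```

Collecting all invalid hostnames before raising lets the error message list every offending entry at once instead of failing on the first.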

Site Creation Logic

  • If create_sites_for_keywords specified:
    • Pre-creates bunny.net sites (Storage Zone + Pull Zone) via API
    • Site names: {keyword-slug}-{random-suffix} (e.g., "engine-repair-a8f3")
    • Creates in default region (configurable, default "DE")
    • Stores in database with custom_hostname = null
    • These sites added to available pool BEFORE assignment
  • If auto_create_sites: true and not enough sites after assignment:
    • Creates additional generic bunny.net sites on-demand
    • Generic site names: {project-keyword-slug}-{random-suffix}
    • Creates only the minimum needed
  • All created sites use bunny.net API (same as provision-site command, but without custom domain step)

Site Assignment Logic

  • A function accepts a batch of GeneratedContent records (all from same batch/tier)
  • For tier1 articles with site_deployment_id = null:
    • Priority 1: Assign from tier1_preferred_sites (if specified in job config)
    • Priority 2: Random selection from available pool
  • For tier2+ articles with site_deployment_id = null:
    • Random selection from available pool
  • Assignment rules:
    • Ensures no two articles in the batch get assigned the same site
    • Can reuse sites from previous batches (only same-batch collision matters)
    • Pre-created keyword sites are in the available pool
  • Updates the GeneratedContent.site_deployment_id field in database
  • For articles that already have site_deployment_id set (from Story 2.5): leaves them unchanged
  • If not enough available sites exist:
    • If auto_create_sites: true: create more sites
    • If auto_create_sites: false: raise clear error with count of sites needed vs available

URL Generation Logic

  • A function generates the final public URL for each article based on its assigned site
  • URL structure: https://{hostname}/{slug}.html
  • Hostname selection:
    • If site has custom_hostname: use custom_hostname
    • If site has no custom_hostname: use pull_zone_bcdn_hostname
  • Slug generation from article title:
    • Convert to lowercase
    • Replace spaces with hyphens
    • Remove special characters (keep only alphanumeric and hyphens)
    • Trim to reasonable length (e.g., max 100 characters)
    • Example: "How to Fix Your Engine" → "how-to-fix-your-engine"
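
The hostname selection and URL structure above amount to a few lines. The sketch below assumes the slug has already been generated; `select_hostname` and `build_article_url` are illustrative names, not existing functions in the codebase.

```python
from typing import Optional


def select_hostname(custom_hostname: Optional[str], pull_zone_bcdn_hostname: str) -> str:
    """Prefer the custom domain; fall back to the bunny.net CDN hostname."""
    return custom_hostname or pull_zone_bcdn_hostname


def build_article_url(
    custom_hostname: Optional[str],
    pull_zone_bcdn_hostname: str,
    slug: str,
) -> str:
    """Build https://{hostname}/{slug}.html for an article."""
    hostname = select_hostname(custom_hostname, pull_zone_bcdn_hostname)
    return f"https://{hostname}/{slug}.html"
```

Using `or` rather than an `is None` check also treats an empty-string custom_hostname as "no custom domain", which seems the safer reading for imported sites.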

Output

  • Function returns a list of URL mappings: [{content_id: int, title: str, url: str, tier: int}, ...]
  • URL list includes ALL articles in the batch (both pre-assigned and newly assigned)
  • Logging shows assignment decisions at INFO level (e.g., "Assigned content_id=42 to site_id=15")

Error Handling

  • If not enough sites are available and auto_create_sites is false: clear error message with count needed vs available
  • If site lookup fails: clear error with site_id
  • If slug generation produces empty string: use fallback (e.g., "article-{content_id}")
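
One way to make these errors "clear" is to carry the relevant numbers on the exception itself. The following is a sketch under that assumption; the exception names and message wording are hypothetical, not part of the existing codebase.

```python
class InsufficientSitesError(RuntimeError):
    """Raised when the pool cannot cover the batch and auto-creation is off."""

    def __init__(self, needed: int, available: int) -> None:
        self.needed = needed
        self.available = available
        super().__init__(
            f"Not enough sites for batch: need {needed}, have {available} "
            f"(short by {needed - available})"
        )


class SiteLookupError(LookupError):
    """Raised when an article references a site that cannot be loaded."""

    def __init__(self, site_id: int) -> None:
        self.site_id = site_id
        super().__init__(f"Site lookup failed for site_id={site_id}")


def slug_or_fallback(slug: str, content_id: int) -> str:
    """Return the slug, or "article-{content_id}" when the slug is empty."""
    return slug or f"article-{content_id}"
```

Keeping `needed`, `available`, and `site_id` as attributes lets callers and tests inspect the values without parsing the message.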

Tasks / Subtasks

1. Database Schema Changes

Effort: 2 story points

  • Create migration script to alter site_deployments table:
    • Make custom_hostname nullable
    • Add unique constraint to pull_zone_bcdn_hostname
  • Update src/database/models.py:
    • Change custom_hostname: Mapped[str] to Mapped[Optional[str]]
    • Change nullable=False to nullable=True
    • Add unique=True to pull_zone_bcdn_hostname field
  • Test migration on development database

2. Update Repository Layer

Effort: 2 story points

  • Update src/database/interfaces.py:
    • Change custom_hostname: str to custom_hostname: Optional[str] in create() signature
    • Add get_by_bcdn_hostname(hostname: str) -> Optional[SiteDeployment] method signature
  • Update src/database/repositories.py:
    • Make custom_hostname parameter optional in create() with default None
    • Update uniqueness validation to handle nullable custom_hostname
    • Implement get_by_bcdn_hostname() method
    • Update exists() method to check both hostname types

3. Implement Site Creation Logic

Effort: 3 story points

  • Create new module: src/generation/site_provisioning.py
  • Implement create_bunnynet_site(name_prefix: str, region: str = "DE") -> SiteDeployment:
    • Call bunny.net API to create Storage Zone
    • Call bunny.net API to create Pull Zone linked to storage
    • Generate unique site name with random suffix
    • Save to database with custom_hostname = null
    • Return SiteDeployment record
  • Implement provision_keyword_sites(keywords: List[Dict], bunny_client, site_repo) -> List[SiteDeployment]:
    • For each keyword+count, create N sites
    • Use keyword in site name (slugified)
    • Return list of created sites
  • Log site creation at INFO level
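
The keyword loop in provision_keyword_sites could look roughly like this. To keep the sketch self-contained, it takes a `create_site` callable in place of the bunny_client and site_repo arguments named above; that substitution and the `slugify_keyword` helper are assumptions for illustration.

```python
import logging
import re
from typing import Callable, Dict, List, TypeVar

logger = logging.getLogger(__name__)
Site = TypeVar("Site")


def slugify_keyword(keyword: str) -> str:
    """Lowercase the keyword and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")


def provision_keyword_sites(
    keywords: List[Dict],
    create_site: Callable[[str], Site],
) -> List[Site]:
    """Create `count` sites per keyword entry and return them in creation order."""
    created: List[Site] = []
    for entry in keywords:
        prefix = slugify_keyword(entry["keyword"])
        for _ in range(entry["count"]):
            site = create_site(prefix)
            logger.info("Created site with prefix %s for keyword %r", prefix, entry["keyword"])
            created.append(site)
    return created
```

In the real module, `create_site` would presumably wrap create_bunnynet_site with the client and repository already bound.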

4. Implement Site Assignment Logic

Effort: 4 story points (increased from 3)

  • Update job config schema (src/generation/job_config.py):
    • Add tier1_preferred_sites: Optional[List[str]]
    • Add auto_create_sites: Optional[bool] = False
    • Add create_sites_for_keywords: Optional[List[Dict]]
  • Create new module: src/generation/site_assignment.py
  • Implement assign_sites_to_batch(content_records: List[GeneratedContent], job_config, site_repo, bunny_client) -> None:
    • Pre-create sites for keywords if specified
    • Query all available sites from database
    • Filter out sites already assigned to articles in this batch
    • For tier1 articles: try preferred sites first, then random
    • For tier2+ articles: random only
    • If insufficient sites and auto_create_sites=true: create the minimum number of additional sites
    • If insufficient sites and auto_create_sites=false: raise an error with the count needed vs available
    • Update GeneratedContent.site_deployment_id in database
  • Log assignment decisions at INFO level

5. Implement URL Generation Logic

Effort: 2 story points

  • Create src/generation/url_generator.py
  • Implement generate_slug(title: str) -> str:
    • Convert to lowercase
    • Replace spaces with hyphens
    • Remove special characters
    • Trim to max length
    • Handle edge cases (empty result, etc.)
  • Implement generate_urls_for_batch(content_records: List[GeneratedContent], site_repo: SiteDeploymentRepository) -> List[Dict]:
    • For each article, lookup its site
    • Determine hostname (custom or bcdn)
    • Generate slug from title
    • Build complete URL
    • Return list of mappings

6. Update Template Service

Effort: 1 story point

  • Update src/templating/service.py:
    • Line 92: Change to hostname = site_deployment.custom_hostname or site_deployment.pull_zone_bcdn_hostname
    • Ensure template mapping works for both hostname types

7. Update CLI for Undomained Sites

Effort: 2 story points

  • Update sync-sites command in src/cli/commands.py:
    • Remove filter that skips sites without custom hostnames (lines 688-689)
    • Import sites that only have b-cdn.net hostnames
    • Set custom_hostname = None for these sites
  • Test importing undomained sites from bunny.net

8. Unit Tests

Effort: 4 story points (increased from 3)

  • Test slug generation with various inputs (special chars, long titles, empty strings)
  • Test URL generation with custom hostname
  • Test URL generation with only bcdn hostname
  • Test site assignment with exact count of available sites
  • Test site assignment with insufficient sites (error case)
  • Test site assignment skips already-assigned articles
  • Test tier1 preferred sites logic
  • Test tier2+ random assignment only
  • Test site creation with keyword prefix
  • Test auto-creation on-demand
  • Achieve >80% code coverage for new modules

9. Integration Tests

Effort: 3 story points (increased from 2)

  • Test full flow: batch generation → site assignment → URL generation
  • Test with mix of pre-assigned (Story 2.5) and null articles
  • Test tier1_preferred_sites assignment
  • Test auto_create_sites when insufficient pool
  • Test create_sites_for_keywords pre-creation
  • Test database updates persist correctly
  • Test with sites that have custom domains vs only bcdn hostnames
  • Verify no duplicate site assignments within same batch

Technical Notes

Job Configuration Example

{
  "job_name": "Test Run",
  "project_id": 2,
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com"
  ],
  "tier1_preferred_sites": [
    "site123.b-cdn.net",
    "www.otherdomain.com"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {"keyword": "engine repair", "count": 3},
    {"keyword": "car maintenance", "count": 2}
  ],
  "tiers": [
    {
      "tier": 1,
      "article_count": 10
    },
    {
      "tier": 2,
      "article_count": 50
    }
  ]
}

Assignment example with 10 tier1 articles:

  • Articles 0-1: Assigned via deployment_targets (Story 2.5, already done)
  • Articles 2-3: Assigned via tier1_preferred_sites (Story 3.1)
  • Articles 4-8: Random from the available pool, which includes the five pre-created keyword sites
  • Article 9: Random or auto-created if pool exhausted

Tier2 articles: All random

Site Naming Convention

Keyword sites:     engine-repair-a8f3
                   car-maintenance-9x2k
Generic sites:     shaft-machining-7m4p
                   {project-keyword}-{random-4-char}

Slug Generation Example

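import re
from typing import Optional


def generate_slug(title: str, max_length: int = 100, content_id: Optional[int] = None) -> str:
    """
    Generate a URL-safe slug from an article title.

    Examples:
        "How to Fix Your Engine" -> "how-to-fix-your-engine"
        "10 Best SEO Tips for 2024!" -> "10-best-seo-tips-for-2024"
        "C++ Programming Guide" -> "c-programming-guide"
    """
    slug = title.lower()
    slug = re.sub(r'[^a-z0-9\s-]', '', slug)  # Keep only ASCII alphanumerics, whitespace, hyphens
    slug = re.sub(r'[-\s]+', '-', slug)       # Collapse runs of spaces/hyphens into a single hyphen
    slug = slug.strip('-')[:max_length]       # Trim and limit length
    slug = slug.rstrip('-')                   # Drop a hyphen left dangling by the length cut

    if slug:
        return slug
    # Fallback if empty: "article-{content_id}" when the id is known (see Error Handling)
    return f"article-{content_id}" if content_id is not None else "article"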

URL Structure Examples

Custom domain:     https://www.example.com/how-to-fix-your-engine.html
Bunny CDN only:    https://mysite123.b-cdn.net/how-to-fix-your-engine.html

Site Assignment Algorithm

1. Load job config and GeneratedContent records for batch
2. Pre-create sites for keywords if create_sites_for_keywords specified
3. Query all available SiteDeployment records
4. Identify articles with site_deployment_id = null (need assignment)
5. Filter out sites already used by articles in THIS batch
6. For each tier1 article needing assignment:
   a. Try tier1_preferred_sites first (if specified)
   b. Fallback to random from available pool
7. For each tier2+ article needing assignment:
   a. Random from available pool
8. If insufficient sites and auto_create_sites=true:
   a. Create minimum needed via bunny.net API
   b. Retry assignment with expanded pool
9. If insufficient sites and auto_create_sites=false:
   a. Raise error with count needed vs available
10. Update database with new site_deployment_id values
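
The algorithm above can be sketched as a pure function over site ids, leaving database access and bunny.net calls to the caller. Everything here is an illustrative assumption: the dict shape for articles, the `create_site` callback, and the up-front shortfall check, which is equivalent to steps 8-9 because every pending article needs exactly one distinct site.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def assign_sites(
    articles: Sequence[Dict[str, Optional[int]]],  # {"id", "tier", "site_id"}
    pool: Sequence[int],                           # ids of all available sites
    preferred: Sequence[int] = (),                 # ids resolved from tier1_preferred_sites
    auto_create: bool = False,
    create_site: Optional[Callable[[], int]] = None,
    rng: Optional[random.Random] = None,
) -> Dict[int, int]:
    """Return {article_id: site_id} for articles that had no site yet."""
    rng = rng or random.Random()

    # Steps 4-5: find pending articles and drop sites already used in THIS batch.
    used = {a["site_id"] for a in articles if a["site_id"] is not None}
    pending = [a for a in articles if a["site_id"] is None]
    free: List[int] = [s for s in dict.fromkeys(pool) if s not in used]

    # Steps 8-9: create the minimum needed, or fail with both counts.
    shortfall = len(pending) - len(free)
    if shortfall > 0:
        if not auto_create or create_site is None:
            raise RuntimeError(
                f"Need {len(pending)} sites, only {len(free)} available"
            )
        free.extend(create_site() for _ in range(shortfall))

    preferred_free = [s for s in dict.fromkeys(preferred) if s in free]

    # Steps 6-7: tier1 first (preferred, then random); tier2+ random only.
    assignments: Dict[int, int] = {}
    for article in sorted(pending, key=lambda a: a["tier"]):
        if article["tier"] == 1 and preferred_free:
            site = preferred_free.pop(0)
        else:
            site = rng.choice(free)
        free.remove(site)  # guarantees no two articles share a site
        assignments[article["id"]] = site
    return assignments
```

Passing a seeded `random.Random` makes the random branch reproducible in tests; step 10 (persisting site_deployment_id) would then iterate over the returned mapping.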

Site Creation via Bunny.net API

def create_bunnynet_site(name_prefix: str, region: str = "DE"):
    # Assumes bunny_client, site_repo, and random_suffix() are available in module scope.
    # Generate the name once so the Storage Zone, Pull Zone, and database record share it.
    site_name = f"{name_prefix}-{random_suffix()}"

    # Step 1: Create Storage Zone
    storage = bunny_client.create_storage_zone(
        name=site_name,
        region=region
    )

    # Step 2: Create Pull Zone linked to the storage zone (no custom hostname step)
    pull = bunny_client.create_pull_zone(
        name=site_name,
        storage_zone_id=storage.id
    )

    # Step 3: Save to database
    site = site_repo.create(
        site_name=site_name,
        custom_hostname=None,  # No custom domain
        storage_zone_id=storage.id,
        storage_zone_name=storage.name,
        storage_zone_password=storage.password,
        storage_zone_region=region,
        pull_zone_id=pull.id,
        pull_zone_bcdn_hostname=pull.hostname
    )

    return site

Database Migration

-- Make custom_hostname nullable
-- (MySQL syntax; PostgreSQL would use ALTER COLUMN custom_hostname DROP NOT NULL)
ALTER TABLE site_deployments 
  MODIFY COLUMN custom_hostname VARCHAR(255) NULL;

-- Add unique constraint to pull_zone_bcdn_hostname
ALTER TABLE site_deployments 
  ADD CONSTRAINT uq_pull_zone_bcdn_hostname 
  UNIQUE (pull_zone_bcdn_hostname);
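
Adding the unique constraint fails if duplicate pull_zone_bcdn_hostname values already exist, so a pre-check may be worth running before the migration. The sketch below uses the standard-library sqlite3 module only to stay runnable; the query itself is plain SQL, and the function name is hypothetical.

```python
import sqlite3
from typing import List, Tuple


def find_duplicate_bcdn_hostnames(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (hostname, row_count) for every hostname that appears more than once."""
    rows = conn.execute(
        """
        SELECT pull_zone_bcdn_hostname, COUNT(*) AS n
        FROM site_deployments
        GROUP BY pull_zone_bcdn_hostname
        HAVING COUNT(*) > 1
        ORDER BY pull_zone_bcdn_hostname
        """
    ).fetchall()
    return [(row[0], row[1]) for row in rows]
```

An empty result means the constraint can be added safely; otherwise the duplicates need to be resolved first to avoid a failed migration.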

Dependencies

  • Story 1.6: SiteDeployment table exists
  • Story 2.3: Content generation creates GeneratedContent records
  • Story 2.5: Some articles may already have site_deployment_id set

Future Considerations

  • Story 4.x will use generated URLs for deployment
  • Story 3.3 will use URL list for interlinking
  • Future: S3-compatible storage support (custom_hostname nullable enables this)

Deferred to Technical Debt

  • Fuzzy keyword/entity matching for intelligent site assignment (T1 articles)
  • This would compare article keywords/entities to site hostnames and assign based on relevance score
  • Adds complexity: keyword extraction, scoring algorithm, match tracking
  • Estimated effort if implemented later: 5-8 story points

Total Effort

23 story points, the sum of the nine task estimates above (increased from 17)