# Story 3.2: Find Tiered Links
## Status
Complete - QA Approved
## Story
**As a developer**, I want a module that finds all required tiered links (money site or lower-tier) based on the current batch's tier, so I have them ready for injection.
## Context
- Story 3.1 generates URLs for articles in the current batch
- Articles are organized in tiers (T1, T2, T3, etc.) where higher tiers link to lower tiers
- Tier 1 articles link to the money site (client's actual website)
- Tier 2+ articles link to random articles from the tier immediately below
- All articles in a batch are from the same project and tier
- URLs are generated on-the-fly from `GeneratedContent` records (not stored in DB yet)
- The link relationships (which article links to which) will be tracked in Story 4.2
## Acceptance Criteria
### Core Functionality
- A function accepts a batch of `GeneratedContent` records and job configuration
- It determines the tier of the batch (all articles in batch are same tier)
- **If Tier 1:**
  - It retrieves the `money_site_url` from the project settings
  - Returns a single money site URL
- **If Tier 2 or higher:**
  - It queries the `GeneratedContent` table for articles from the tier immediately below (e.g., T2 queries T1)
  - Filters to same project only
  - Selects random articles from the lower tier
  - Generates URLs for those articles using `generate_urls_for_batch()`
  - Returns list of lower-tier URLs
- Function signature: `find_tiered_links(content_records: List[GeneratedContent], job_config, project_repo, content_repo, site_repo) -> Dict`
### Link Count Configuration
- By default: select 2-4 random lower-tier URLs (random count between 2 and 4)
- Job config supports optional `tiered_link_count_range: {min: int, max: int}`
- If min == max, always returns exactly that many links (e.g., `{min: 8, max: 8}` returns 8 links)
- If min < max, returns random count between min and max (inclusive)
- Default if not specified: `{min: 2, max: 4}`
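The range rules above can be sketched as follows. This is a minimal illustration, not the shipped code: `resolve_link_range` and the `rng` parameter are hypothetical names introduced here, and the job config is assumed to behave like a plain mapping.

```python
import random
from typing import Mapping, Optional, Tuple

# Default taken from the acceptance criteria above.
DEFAULT_LINK_RANGE = {"min": 2, "max": 4}


def resolve_link_range(job_config: Mapping) -> Tuple[int, int]:
    """Return (min, max), falling back to the default when the key is absent."""
    link_range = job_config.get("tiered_link_count_range") or DEFAULT_LINK_RANGE
    return int(link_range["min"]), int(link_range["max"])


def select_random_count(min_count: int, max_count: int,
                        rng: Optional[random.Random] = None) -> int:
    """Pick a count in [min_count, max_count], both ends inclusive.

    random.randint is inclusive on both ends, so min == max always
    yields exactly that value.
    """
    source = rng if rng is not None else random
    return source.randint(min_count, max_count)
```

Passing a seeded `random.Random` as `rng` keeps unit tests deterministic without patching the global generator.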
### Return Format
- **Tier 1 batches:** `{tier: 1, money_site_url: "https://example.com"}`
- **Tier 2+ batches:** `{tier: N, lower_tier_urls: ["https://...", "https://..."], lower_tier: N-1}`
### Error Handling
- **Tier 2+ with no lower-tier articles:** Raise error and quit
  - Error message: "Cannot generate tier {N} batch: no tier {N-1} articles found in project {project_id}"
- **Tier 1 with no money_site_url:** Raise error and quit
  - Error message: "Cannot generate tier 1 batch: money_site_url not set in project {project_id}"
- **Fewer lower-tier URLs than min requested:** Log warning and continue
  - Warning: "Only {count} tier {N-1} articles available, requested min {min}. Using all available."
  - Returns all available lower-tier URLs even if fewer than min
- **Empty content_records list:** Raise ValueError
- **Mixed tiers in content_records:** Raise ValueError
### Logging
- INFO: Log tier detection (e.g., "Batch is tier 2, querying tier 1 articles")
- INFO: Log link selection (e.g., "Selected 3 random tier 1 URLs from 15 available")
- WARNING: If fewer articles available than requested minimum
- ERROR: If no lower-tier articles found or money_site_url missing
## Tasks / Subtasks
### 1. Create Article Links Table
**Effort:** 2 story points
- [ ] Create migration script for `article_links` table:
  - `id` (primary key, auto-increment)
  - `from_content_id` (foreign key to generated_content.id, indexed)
  - `to_content_id` (foreign key to generated_content.id, indexed)
  - `to_url` (text, nullable - for money site URLs that aren't in our DB)
  - `link_type` (varchar: "tiered", "wheel_next", "wheel_prev", "homepage")
  - `created_at` (timestamp)
- [ ] Add unique constraint on (from_content_id, to_content_id, link_type) to prevent duplicates
- [ ] Create `ArticleLink` model in `src/database/models.py`
- [ ] Test migration on development database
### 2. Create Article Links Repository
**Effort:** 2 story points
- [ ] Create `IArticleLinkRepository` interface in `src/database/interfaces.py`:
  - `create(from_content_id, to_content_id, to_url, link_type) -> ArticleLink`
  - `get_by_source_article(from_content_id) -> List[ArticleLink]`
  - `get_by_target_article(to_content_id) -> List[ArticleLink]`
  - `get_by_link_type(link_type) -> List[ArticleLink]`
  - `delete(link_id) -> bool`
- [ ] Implement `ArticleLinkRepository` in `src/database/repositories.py`
- [ ] Handle both internal links (to_content_id) and external links (to_url for money site)
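For unit tests that should not touch a database, the repository methods listed above could be backed by an in-memory stand-in along these lines. Both `LinkRecord` and `InMemoryArticleLinkRepository` are hypothetical names used only for illustration; the real `ArticleLink` model and `ArticleLinkRepository` live in `src/database/`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LinkRecord:
    """Hypothetical stand-in for the ArticleLink model."""
    id: int
    from_content_id: int
    to_content_id: Optional[int]
    to_url: Optional[str]
    link_type: str


class InMemoryArticleLinkRepository:
    """Dict-backed sketch exposing the same method names as the interface."""

    def __init__(self) -> None:
        self._links: Dict[int, LinkRecord] = {}
        self._next_id = 1

    def create(self, from_content_id: int, to_content_id: Optional[int],
               to_url: Optional[str], link_type: str) -> LinkRecord:
        # A link needs an internal target or an external URL.
        if to_content_id is None and to_url is None:
            raise ValueError("Either to_content_id or to_url must be set")
        link = LinkRecord(self._next_id, from_content_id, to_content_id, to_url, link_type)
        self._links[link.id] = link
        self._next_id += 1
        return link

    def get_by_source_article(self, from_content_id: int) -> List[LinkRecord]:
        return [l for l in self._links.values() if l.from_content_id == from_content_id]

    def get_by_target_article(self, to_content_id: int) -> List[LinkRecord]:
        return [l for l in self._links.values() if l.to_content_id == to_content_id]

    def get_by_link_type(self, link_type: str) -> List[LinkRecord]:
        return [l for l in self._links.values() if l.link_type == link_type]

    def delete(self, link_id: int) -> bool:
        # True only when a link with that id existed.
        return self._links.pop(link_id, None) is not None
```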
### 3. Extend Job Configuration Schema
**Effort:** 1 story point
- [ ] Add `tiered_link_count_range: Optional[Dict]` to job config schema
- [ ] Default: `{min: 2, max: 4}` if not specified
- [ ] Validation: min >= 1, max >= min
- [ ] Example: `{"tiered_link_count_range": {"min": 3, "max": 6}}`
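The validation rules above (`min >= 1`, `max >= min`, default when omitted) might look like the sketch below. The function name `validate_link_count_range` is an assumption made for this example, and the actual job config schema may enforce these rules through a different mechanism.

```python
from typing import Any, Dict, Optional


def validate_link_count_range(value: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Return a validated range, or the default when none is configured."""
    if value is None:
        return {"min": 2, "max": 4}
    try:
        min_count, max_count = value["min"], value["max"]
    except (KeyError, TypeError) as exc:
        raise ValueError("tiered_link_count_range needs 'min' and 'max'") from exc
    # bool is a subclass of int, so it is rejected explicitly.
    for name, number in (("min", min_count), ("max", max_count)):
        if isinstance(number, bool) or not isinstance(number, int):
            raise ValueError(f"tiered_link_count_range.{name} must be an integer")
    if min_count < 1:
        raise ValueError("tiered_link_count_range.min must be >= 1")
    if max_count < min_count:
        raise ValueError("tiered_link_count_range.max must be >= min")
    return {"min": min_count, "max": max_count}
```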
### 4. Add Money Site URL to Project
**Effort:** 1 story point
- [ ] Add `money_site_url` field to Project model (nullable string, indexed)
- [ ] Create migration script to add column to existing projects table
- [ ] Update ProjectRepository.create() to accept money_site_url parameter
- [ ] Test migration on development database
### 5. Implement Tiered Link Finder
**Effort:** 3 story points
- [ ] Create new module: `src/interlinking/tiered_links.py`
- [ ] Implement `find_tiered_links()` function:
  - Validate content_records is not empty
  - Validate all records are same tier
  - Detect tier from first record
  - Handle Tier 1 case (money site)
  - Handle Tier 2+ case (lower-tier articles)
  - Apply link count range configuration
  - Generate URLs using `url_generator.generate_urls_for_batch()`
  - Return formatted result
- [ ] Implement `_select_random_count(min_count: int, max_count: int) -> int` helper
- [ ] Implement `_validate_batch_tier(content_records: List[GeneratedContent]) -> int` helper
### 6. Unit Tests
**Effort:** 4 story points
- [ ] Test ArticleLink model creation and relationships
- [ ] Test ArticleLinkRepository CRUD operations
- [ ] Test duplicate link prevention (unique constraint)
- [ ] Test Tier 1 batch returns money_site_url
- [ ] Test Tier 1 batch with missing money_site_url raises error
- [ ] Test Tier 2 batch queries Tier 1 articles from same project only
- [ ] Test Tier 3 batch queries Tier 2 articles
- [ ] Test random selection with default range (2-4)
- [ ] Test custom link count range from job config
- [ ] Test exact count (min == max)
- [ ] Test empty content_records raises error
- [ ] Test mixed tiers in batch raises error
- [ ] Test no lower-tier articles available raises error
- [ ] Test fewer lower-tier articles than min logs warning and continues
- [ ] Mock GeneratedContent, Project, and URL generation
- [ ] Achieve >85% code coverage
### 7. Integration Tests
**Effort:** 2 story points
- [ ] Test article_links table migration and constraints
- [ ] Test full flow with real database: create T1 articles, then query for T2 batch
- [ ] Test with multiple projects to verify same-project filtering
- [ ] Test URL generation integration with Story 3.1 url_generator
- [ ] Test with different link count configurations
- [ ] Verify lower-tier article selection is truly random
- [ ] Test storing links in article_links table (for Story 3.3/4.2 usage)
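One way to approach the randomness check is a statistical smoke test with a seeded generator: run the selection many times and confirm that every lower-tier article is eventually chosen, that counts stay inside the configured range, and that no selection repeats an article. The helper `select_articles` below is a hypothetical, simplified version of the selection step, not the project's function.

```python
import random
from collections import Counter
from typing import List


def select_articles(article_ids: List[int], min_count: int, max_count: int,
                    rng: random.Random) -> List[int]:
    """Pick a random count within the range, capped by what is available."""
    count = min(rng.randint(min_count, max_count), len(article_ids))
    return rng.sample(article_ids, count)


rng = random.Random(42)          # fixed seed keeps the run repeatable
article_ids = list(range(1, 16)) # 15 hypothetical tier 1 article ids
hits: Counter = Counter()
sizes = set()
no_duplicates = True

for _ in range(2000):
    chosen = select_articles(article_ids, 2, 4, rng)
    sizes.add(len(chosen))
    hits.update(chosen)
    no_duplicates = no_duplicates and len(set(chosen)) == len(chosen)
```

A check like this cannot prove randomness, but it catches common regressions such as always returning the first N rows of the query.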
## Technical Notes
### Article Links Table Schema
```sql
CREATE TABLE article_links (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_content_id INTEGER NOT NULL,
    to_content_id INTEGER NULL,
    to_url TEXT NULL,
    anchor_text TEXT NULL,
    link_type VARCHAR(20) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (from_content_id) REFERENCES generated_content(id) ON DELETE CASCADE,
    FOREIGN KEY (to_content_id) REFERENCES generated_content(id) ON DELETE CASCADE,
    UNIQUE (from_content_id, to_content_id, link_type),
    CHECK (to_content_id IS NOT NULL OR to_url IS NOT NULL)
);
CREATE INDEX idx_article_links_from ON article_links(from_content_id);
CREATE INDEX idx_article_links_to ON article_links(to_content_id);
CREATE INDEX idx_article_links_type ON article_links(link_type);
```
**Note:** The `anchor_text` field was added in Story 4.5 to store the actual anchor text used for each link, improving query performance and data integrity.
**Link Types:**
- `tiered`: Link from tier N article to tier N-1 article (or money site for tier 1)
- `wheel_next`: Link to next article in batch wheel
- `wheel_prev`: Link to previous article in batch wheel
- `homepage`: Link to site homepage
**Usage:**
- For tier 1 articles linking to money site: `to_content_id = NULL`, `to_url = money_site_url`
- For tier 2+ linking to lower tiers: `to_content_id = lower_tier_article.id`, `to_url = NULL`
- For wheel/homepage links: `to_content_id = other_article.id`, `to_url = NULL`
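How the `CHECK` and `UNIQUE` constraints interact with these usage patterns can be explored with an in-memory database. The sketch below assumes SQLite (suggested by the `AUTOINCREMENT` syntax, though the story does not name the engine) and uses a trimmed copy of the table without foreign keys. `try_insert` is a helper invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_links (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        from_content_id INTEGER NOT NULL,
        to_content_id INTEGER NULL,
        to_url TEXT NULL,
        link_type VARCHAR(20) NOT NULL,
        UNIQUE (from_content_id, to_content_id, link_type),
        CHECK (to_content_id IS NOT NULL OR to_url IS NOT NULL)
    )
""")


def try_insert(from_id, to_id, to_url, link_type):
    """Return True if the row was accepted, False on a constraint violation."""
    try:
        conn.execute(
            "INSERT INTO article_links (from_content_id, to_content_id, to_url, link_type) "
            "VALUES (?, ?, ?, ?)",
            (from_id, to_id, to_url, link_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "internal_first": try_insert(10, 1, None, "tiered"),
    "internal_duplicate": try_insert(10, 1, None, "tiered"),  # UNIQUE rejects
    "no_target_at_all": try_insert(10, None, None, "tiered"), # CHECK rejects
    "money_site_first": try_insert(1, None, "https://www.moneysite.com", "tiered"),
    "money_site_duplicate": try_insert(1, None, "https://www.moneysite.com", "tiered"),
}
```

In SQLite, `NULL` values count as distinct for `UNIQUE` purposes, so the second money-site row is accepted. If duplicate tier 1 links matter, the caller would need to check for them before inserting.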
### ArticleLink Model
```python
from datetime import datetime
from typing import Optional

from sqlalchemy import DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import Mapped, mapped_column

# Base is assumed to be the declarative base already defined in src/database/models.py


class ArticleLink(Base):
    __tablename__ = "article_links"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    from_content_id: Mapped[int] = mapped_column(
        Integer,
        ForeignKey('generated_content.id', ondelete='CASCADE'),
        nullable=False,
        index=True
    )
    # NULL for external targets such as the money site
    to_content_id: Mapped[Optional[int]] = mapped_column(
        Integer,
        ForeignKey('generated_content.id', ondelete='CASCADE'),
        nullable=True,
        index=True
    )
    to_url: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
    anchor_text: Mapped[Optional[str]] = mapped_column(Text, nullable=True)  # Added in Story 4.5
    link_type: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
```
### Project Model Extension
```python
# Add to Project model in src/database/models.py
class Project(Base):
    # ... existing fields ...
    money_site_url: Mapped[Optional[str]] = mapped_column(String(500), nullable=True, index=True)
```
```sql
-- Migration script to add money_site_url to projects table
ALTER TABLE projects ADD COLUMN money_site_url VARCHAR(500) NULL;
CREATE INDEX idx_projects_money_site_url ON projects(money_site_url);
```
### ArticleLink Repository Usage Examples
```python
# Story 3.3: Record wheel link
link_repo.create(
    from_content_id=article_a.id,
    to_content_id=article_b.id,
    to_url=None,
    anchor_text="Next Article",
    link_type="wheel_next"
)

# Story 4.2: Record tier 1 article linking to money site
link_repo.create(
    from_content_id=tier1_article.id,
    to_content_id=None,
    to_url="https://www.moneysite.com",
    anchor_text="expert services",  # Added in Story 4.5
    link_type="tiered"
)

# Story 4.2: Record tier 2 article linking to tier 1 article
link_repo.create(
    from_content_id=tier2_article.id,
    to_content_id=tier1_article.id,
    to_url=None,
    anchor_text="learn more",  # Added in Story 4.5
    link_type="tiered"
)

# Query all outbound links from an article
outbound_links = link_repo.get_by_source_article(article.id)

# Query all articles that link TO a specific article
inbound_links = link_repo.get_by_target_article(article.id)
```
### Job Configuration Example
```json
{
  "job_name": "Test Batch",
  "project_id": 2,
  "tiered_link_count_range": {
    "min": 3,
    "max": 5
  },
  "tiers": [
    {
      "tier": 2,
      "article_count": 20
    }
  ]
}
```
### Function Signature
```python
def find_tiered_links(
    content_records: List[GeneratedContent],
    job_config: JobConfig,
    project_repo: IProjectRepository,
    content_repo: IGeneratedContentRepository,
    site_repo: ISiteDeploymentRepository
) -> Dict:
    """
    Find tiered links for a batch of articles

    Args:
        content_records: Batch of articles (all same tier, same project)
        job_config: Job configuration with optional link count range
        project_repo: For retrieving money_site_url
        content_repo: For querying lower-tier articles
        site_repo: For URL generation

    Returns:
        Tier 1: {tier: 1, money_site_url: "https://..."}
        Tier 2+: {tier: N, lower_tier_urls: [...], lower_tier: N-1}

    Raises:
        ValueError: If batch is invalid or required data is missing
    """
    pass
```
### Implementation Example
```python
import logging
import random
from typing import Dict, List

from src.database.models import GeneratedContent
from src.generation.url_generator import generate_urls_for_batch

logger = logging.getLogger(__name__)


def find_tiered_links(content_records, job_config, project_repo, content_repo, site_repo):
    if not content_records:
        raise ValueError("content_records cannot be empty")

    tier = _validate_batch_tier(content_records)
    project_id = content_records[0].project_id
    logger.info(f"Finding tiered links for tier {tier} batch (project {project_id})")

    # Tier 1 links to the money site, so no lower-tier query is needed
    if tier == 1:
        project = project_repo.get_by_id(project_id)
        if not project or not project.money_site_url:
            message = f"Cannot generate tier 1 batch: money_site_url not set in project {project_id}"
            logger.error(message)
            raise ValueError(message)
        return {
            "tier": 1,
            "money_site_url": project.money_site_url
        }

    # Tier 2+ links to articles from the tier immediately below
    lower_tier = tier - 1
    logger.info(f"Batch is tier {tier}, querying tier {lower_tier} articles")
    lower_tier_articles = content_repo.get_by_project_and_tier(project_id, lower_tier)
    if not lower_tier_articles:
        message = (
            f"Cannot generate tier {tier} batch: "
            f"no tier {lower_tier} articles found in project {project_id}"
        )
        logger.error(message)
        raise ValueError(message)

    # Assumes job_config supports dict-style .get()
    link_range = job_config.get("tiered_link_count_range", {"min": 2, "max": 4})
    min_count = link_range["min"]
    max_count = link_range["max"]
    available_count = len(lower_tier_articles)
    desired_count = random.randint(min_count, max_count)  # inclusive on both ends

    if available_count < min_count:
        logger.warning(
            f"Only {available_count} tier {lower_tier} articles available, "
            f"requested min {min_count}. Using all available."
        )
        selected_articles = lower_tier_articles
    else:
        # Never ask random.sample for more items than exist
        actual_count = min(desired_count, available_count)
        selected_articles = random.sample(lower_tier_articles, actual_count)

    logger.info(
        f"Selected {len(selected_articles)} random tier {lower_tier} URLs "
        f"from {available_count} available"
    )

    url_mappings = generate_urls_for_batch(selected_articles, site_repo)
    lower_tier_urls = [mapping["url"] for mapping in url_mappings]
    return {
        "tier": tier,
        "lower_tier": lower_tier,
        "lower_tier_urls": lower_tier_urls
    }


def _validate_batch_tier(content_records: List[GeneratedContent]) -> int:
    tiers = set(record.tier for record in content_records)
    if len(tiers) > 1:
        raise ValueError(f"All articles in batch must be same tier, found: {tiers}")
    return int(list(tiers)[0])
```
### Database Queries Needed
```python
def get_by_project_and_tier(self, project_id: int, tier: int) -> List[GeneratedContent]:
    """
    Get all articles for a specific project and tier

    Returns articles that have site_deployment_id set (from Story 3.1)
    """
    return self.session.query(GeneratedContent)\
        .filter(
            GeneratedContent.project_id == project_id,
            GeneratedContent.tier == tier,
            GeneratedContent.site_deployment_id.isnot(None)
        )\
        .all()
```
### Return Value Examples
```python
# Tier 1 batch
{
    "tier": 1,
    "money_site_url": "https://www.mymoneysite.com"
}

# Tier 2 batch
{
    "tier": 2,
    "lower_tier": 1,
    "lower_tier_urls": [
        "https://site1.b-cdn.net/article-title-1.html",
        "https://www.customdomain.com/article-title-2.html",
        "https://site2.b-cdn.net/article-title-3.html"
    ]
}

# Tier 3 batch with custom range (8 links)
{
    "tier": 3,
    "lower_tier": 2,
    "lower_tier_urls": [
        "https://site3.b-cdn.net/...",
        "https://site4.b-cdn.net/...",
        # ... 6 more URLs
    ]
}
```
## Dependencies
- Story 3.1: Site assignment and URL generation must be complete
- Story 2.3: GeneratedContent records exist in database
- Story 1.x: Project and GeneratedContent tables exist
## Future Considerations
- Story 3.3 will use the tiered links found by this module for actual content injection
- Story 3.3 will populate article_links table with wheel and homepage link relationships
- Story 4.2 will use article_links table to log tiered link relationships after deployment
- Future: Intelligent link distribution (ensure even link spread across lower-tier articles)
- Future: Analytics dashboard showing link structure and tier relationships using article_links table
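As a rough idea of what "even link spread" could mean, one option is to prefer the lower-tier articles with the fewest inbound links so far and break ties randomly. This is only a sketch of one possible approach, not a committed design; `select_least_linked` and the `inbound_counts` mapping are hypothetical, and in practice the counts might come from `get_by_target_article()`.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def select_least_linked(candidate_ids: Sequence[int], inbound_counts: Dict[int, int],
                        count: int, rng: random.Random) -> List[int]:
    """Pick `count` candidates, favouring those with the fewest inbound links."""
    shuffled = list(candidate_ids)
    rng.shuffle(shuffled)  # randomise order so ties are broken randomly
    # list.sort is stable, so the shuffled order survives among equal counts
    shuffled.sort(key=lambda cid: inbound_counts.get(cid, 0))
    return shuffled[:count]


# Simulate 12 tier 2 articles, each linking to 3 of 6 tier 1 articles.
rng = random.Random(7)
tier1_ids = [1, 2, 3, 4, 5, 6]
inbound: Counter = Counter()
for _ in range(12):
    for target in select_least_linked(tier1_ids, inbound, 3, rng):
        inbound[target] += 1
```

The trade-off is that selection becomes less random as counts diverge, which may or may not be acceptable for the link profile the project wants.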
## Link Relationship Tracking
This story creates the `article_links` table infrastructure. The actual population of link relationships will happen in:
- **Story 3.3**: Stores wheel and homepage links when injecting them into content
- **Story 4.2**: Stores tiered links when logging final URLs after deployment
- The table enables future analytics on link distribution, tier structure, and interlinking patterns
## Total Effort
15 story points