# Story 3.3: Content Interlinking Injection

## Status
✅ **COMPLETE** - Implemented, Integrated, and Tested

## Summary
This story injects three types of links into article HTML:
1. **Tiered Links** - T1 articles link to the money site; T2+ articles link to lower-tier articles
2. **Homepage Links** - Link to the site's homepage (base domain)
3. **"See Also" Section** - Links to all other articles in the batch

Uses the existing `anchor_text_generator.py` for tier-based anchor text, with support for job config overrides (default/override/append modes).

## Story
**As a developer**, I want to inject all required links (batch "wheel", homepage, and tiered/money site) into each new article's HTML content, so that the articles are fully interlinked and ready for deployment.

## Context
- Story 3.1 generates final URLs for all articles in the batch
- Story 3.2 finds the required tiered links (money site or lower-tier URLs)
- Articles have raw HTML content from Epic 2 (h2, h3, p tags)
- Project contains anchor text lists for each tier
- Articles need wheel links (a "See Also" section, which replaces the earlier next/previous concept), homepage links, and tiered links

## Acceptance Criteria

### Core Functionality
- A function takes raw HTML content, URL list, tiered links, and project data
- **Wheel Links:** Each article links to the other articles in the batch
  - Original design: circular "next" and "previous" links (the last article's "next" links to the first article; the first article's "previous" links to the last article)
  - Implemented design: a "See Also" section linking to all other articles in the batch, which replaces next/previous (see the note under Database Updates)
- **Homepage Links:** Each article gets a link to its site's homepage
- **Tiered Links:** Articles get links based on their tier
  - Tier 1: Links to money site using T1 anchor text
  - Tier 2+: Links to lower-tier articles using appropriate tier anchor text

### Input Requirements
- Raw HTML content (from Epic 2)
- List of article URLs with titles (from Story 3.1)
- Tiered links object (from Story 3.2)
- Project data (for anchor text lists)
- Batch tier information

### Output Requirements
- Final HTML content with all links injected
- Updated content stored in database
- Link relationships recorded in `article_links` table

## Implementation Details

### Anchor Text Generation
**RESOLVED:** Use existing `src/interlinking/anchor_text_generator.py` with job config overrides
- **Default tier-based anchor text:**
  - Tier 1: Uses main keyword variations
  - Tier 2: Uses related searches
  - Tier 3: Uses main keyword variations
  - Tier 4+: Uses entities
- **Job config overrides via `anchor_text_config`:**
  - `mode: "default"` - Use tier-based defaults
  - `mode: "override"` - Replace defaults with `custom_text` list
  - `mode: "append"` - Add `custom_text` to tier-based defaults
- Import and use `get_anchor_text_for_tier()` function

### Homepage URL Generation
**RESOLVED:** Strip the path from the article URL, keeping only the scheme and host
- Example: `https://site.com/article-slug.html` → `https://site.com/`
- Use base domain as homepage URL

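A minimal sketch of the homepage derivation described above, using only the standard library. The name `extract_base_url` matches the helper referenced later in this story, but the body is an assumption, not the project's actual implementation. Rejecting relative URLs with `ValueError` is also an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def extract_base_url(article_url: str) -> str:
    """Return scheme + host with a trailing slash, dropping path, query, and fragment."""
    parts = urlsplit(article_url)
    if not parts.scheme or not parts.netloc:
        # Assumption: article URLs from Story 3.1 are always absolute.
        raise ValueError(f"Not an absolute URL: {article_url!r}")
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
```

Parsing the URL rather than splitting on `/` handles nested paths such as `https://www.custom.com/path/to/article.html`, which also reduce to the bare host.
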
### Link Placement Strategy

#### Tiered Links (Money Site / Lower Tier)
1. **First Priority:** Find anchor text already in the document
   - Search for anchor text in HTML content
   - Add link to FIRST match only (prevent duplicate links)
   - Case-insensitive matching
2. **Fallback:** If anchor text not found in document
   - Insert anchor text into a sentence in the article
   - Make it a link to the target URL

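The two-step strategy above could be sketched roughly as follows. The names `wrap_first_occurrence` and `insert_link_into_content` are the helpers referenced in this story's snippets, but the bodies here are assumptions. The sketch splits the HTML into tags and text with a regular expression rather than a full HTML parser, skips text already inside an `<a>` element, and simplifies the fallback to appending the linked anchor text to the end of the first paragraph.

```python
import html
import re

TAG_SPLIT = re.compile(r"(<[^>]+>)")
OPEN_A = re.compile(r"<a[\s>]", re.IGNORECASE)
CLOSE_A = re.compile(r"</a\s*>", re.IGNORECASE)
CLOSE_P = re.compile(r"</p\s*>", re.IGNORECASE)


def _link(url: str, text: str) -> str:
    """Build an anchor element; the URL is escaped for use in an attribute."""
    return f'<a href="{html.escape(url, quote=True)}">{text}</a>'


def wrap_first_occurrence(html_content: str, anchor_text: str, target_url: str) -> str:
    """Link the first case-insensitive match of anchor_text found in a text node."""
    pattern = re.compile(re.escape(anchor_text), re.IGNORECASE)
    pieces = TAG_SPLIT.split(html_content)
    inside_link = False
    for i, piece in enumerate(pieces):
        if i % 2 == 1:  # odd indices are the captured tags
            if OPEN_A.match(piece):
                inside_link = True
            elif CLOSE_A.match(piece):
                inside_link = False
            continue
        match = None if inside_link else pattern.search(piece)
        if match:
            # Keep the document's original casing for the visible link text.
            linked = _link(target_url, match.group(0))
            pieces[i] = piece[:match.start()] + linked + piece[match.end():]
            break
    return "".join(pieces)


def insert_link_into_content(html_content: str, anchor_text: str, target_url: str) -> str:
    """Fallback: append the linked anchor text to the end of the first paragraph."""
    link = _link(target_url, html.escape(anchor_text))
    close = CLOSE_P.search(html_content)
    if close is None:
        # No paragraph found: add the link as its own paragraph.
        return html_content + f"<p>{link}</p>"
    return html_content[:close.start()] + " " + link + html_content[close.start():]
```

This version links the first match in any text node, including headings. A stricter variant could restrict matching to text inside `<p>` elements.
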
#### Wheel Links (See Also Section)
- Add a "See Also" section after the last paragraph
- Format as heading + unordered list
- Include ALL other articles in the batch (excluding current article)
- Each list item is an article title as a link
- Example:

```html
<h3>See Also</h3>
<ul>
  <li><a href="url1">Article Title 1</a></li>
  <li><a href="url2">Article Title 2</a></li>
  <li><a href="url3">Article Title 3</a></li>
</ul>
```

#### Homepage Links
- Same as tiered links: find anchor text in content or insert it
- Link to site homepage (base domain)

## Implementation Approach

### Function Signature
```python
from typing import Dict, List

# GeneratedContent, Project, and the two repositories are existing project classes.


def inject_interlinks(
    content_records: List[GeneratedContent],
    article_urls: List[Dict],  # [{content_id, title, url}, ...]
    tiered_links: Dict,  # From Story 3.2
    project: Project,
    content_repo: GeneratedContentRepository,
    link_repo: ArticleLinkRepository,
) -> None:
    """Inject all links into each article; updates content in the database."""
    ...
```

### Processing Flow
1. For each article in the batch:
   a. Load its raw HTML content
   b. Generate tier-appropriate anchor text using `get_anchor_text_for_tier()`
   c. Inject tiered links (money site or lower tier)
   d. Inject homepage link
   e. Inject wheel links ("See Also" section)
   f. Update content in database
   g. Record all links in `article_links` table

### Link Injection Details

#### Tiered Link Injection
```python
# Get anchor text for this tier
from src.interlinking.anchor_text_generator import get_anchor_text_for_tier

# Get default tier-based anchor text
default_anchors = get_anchor_text_for_tier(tier, project, count=5)

# Apply job config overrides if present
if job_config.anchor_text_config:
    if job_config.anchor_text_config.mode == "override":
        anchor_texts = job_config.anchor_text_config.custom_text or default_anchors
    elif job_config.anchor_text_config.mode == "append":
        anchor_texts = default_anchors + (job_config.anchor_text_config.custom_text or [])
    else:  # "default"
        anchor_texts = default_anchors
else:
    anchor_texts = default_anchors

# Link the first anchor text that already appears in the content (case-insensitive)
for anchor_text in anchor_texts:
    if anchor_text.lower() in html_content.lower():
        # Wrap FIRST occurrence with link
        html_content = wrap_first_occurrence(html_content, anchor_text, target_url)
        break
else:
    # No anchor text was found: insert anchor text + link into a paragraph.
    # The first candidate is used here; the selection rule is not specified.
    if anchor_texts:
        html_content = insert_link_into_content(html_content, anchor_texts[0], target_url)
```

#### Homepage Link Injection
```python
# Derive homepage URL
homepage_url = extract_base_url(article_url)  # https://site.com/article.html → https://site.com/

# Use main keyword as anchor text
anchor_text = project.main_keyword
# Find or insert link (same strategy as tiered links)
```

#### Wheel Link Injection
```python
# Build "See Also" section with ALL other articles in batch
other_articles = [a for a in article_urls if a['content_id'] != current_article.id]

see_also_html = "<h3>See Also</h3>\n<ul>\n"
for article in other_articles:
    see_also_html += f' <li><a href="{article["url"]}">{article["title"]}</a></li>\n'
see_also_html += "</ul>\n"

# Append after last paragraph (before closing tags)
html_content = insert_before_closing_tags(html_content, see_also_html)
```

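The `insert_before_closing_tags` helper used above is not defined in this story. One possible shape for it, as a sketch: the choice of which trailing tags count as "closing tags" (`</article>`, `</body>`, `</html>`) is an assumption. Since Epic 2 content is described as bare `h2`/`h3`/`p` markup, the common case is probably the plain append branch.

```python
import re

# Trailing run of closing wrapper tags at the very end of the document (assumed set).
CLOSING_TAGS = re.compile(r"(?:\s*</(?:article|body|html)\s*>)+\s*$", re.IGNORECASE)


def insert_before_closing_tags(html_content: str, fragment: str) -> str:
    """Place fragment after the last content element, before any trailing wrapper tags."""
    match = CLOSING_TAGS.search(html_content)
    if match is None:
        # Bare fragment with no wrapper tags: simply append.
        return html_content.rstrip() + "\n" + fragment
    return html_content[:match.start()] + "\n" + fragment + html_content[match.start():]
```
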
### Database Updates
- Update `GeneratedContent.content` with final HTML
- Create `ArticleLink` records for all injected links:
  - `link_type="tiered"` for money site / lower tier links
  - `link_type="homepage"` for homepage links
  - `link_type="wheel_see_also"` for "See Also" section links
- Track both internal (`to_content_id`) and external (`to_url`) links

**Note:** The "See Also" section replaces the previous wheel_next/wheel_prev concept. Each article links to all other articles in the batch via the "See Also" section.

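To illustrate how the three link types might map onto records, here is a hedged sketch. `LinkRecord` and `build_link_records` are hypothetical stand-ins, not the project's `ArticleLink` model or repository API. Only `link_type`, `to_content_id`, and `to_url` come from this story; `from_content_id` is an assumed field name. The tiered target is recorded as a URL for simplicity, although a lower-tier target could equally be stored by content ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LinkRecord:
    """Hypothetical stand-in for one article_links row."""
    from_content_id: int                 # assumed field name
    link_type: str                       # "tiered", "homepage", or "wheel_see_also"
    to_content_id: Optional[int] = None  # set for internal links
    to_url: Optional[str] = None         # set for external links


def build_link_records(
    current_id: int,
    tiered_url: str,
    homepage_url: str,
    article_urls: List[Dict],
) -> List[LinkRecord]:
    """Collect one record per injected link for a single article."""
    records = [
        LinkRecord(current_id, "tiered", to_url=tiered_url),
        LinkRecord(current_id, "homepage", to_url=homepage_url),
    ]
    for article in article_urls:
        if article["content_id"] != current_id:  # "See Also" excludes the current article
            records.append(
                LinkRecord(current_id, "wheel_see_also", to_content_id=article["content_id"])
            )
    return records
```
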
## Tasks / Subtasks

### 1. Create Content Injection Module
**Effort:** 3 story points

- [ ] Create `src/interlinking/content_injection.py`
- [ ] Implement `inject_interlinks()` main function
- [ ] Implement "See Also" section builder (all batch articles)
- [ ] Implement homepage URL extraction (base domain)
- [ ] Implement tiered link injection with anchor text matching

### 2. Anchor Text Processing
**Effort:** 2 story points

- [ ] Import `get_anchor_text_for_tier()` from existing module
- [ ] Apply job config `anchor_text_config` overrides (default/override/append)
- [ ] Implement case-insensitive anchor text search in HTML
- [ ] Wrap first occurrence of anchor text with link
- [ ] Implement fallback: insert anchor text + link if not found in content

### 3. HTML Link Injection
**Effort:** 2 story points

- [ ] Implement safe HTML parsing (avoid breaking existing tags)
- [ ] Implement link insertion before closing article/body tags
- [ ] Ensure proper link formatting (`<a href="...">text</a>`)
- [ ] Handle edge cases (empty content, malformed HTML)
- [ ] Preserve HTML structure and formatting

### 4. Database Integration
**Effort:** 2 story points

- [ ] Update `GeneratedContent.content` with final HTML
- [ ] Create `ArticleLink` records for all links
- [ ] Handle both internal (content_id) and external (URL) links
- [ ] Ensure proper link type categorization

### 5. Unit Tests
**Effort:** 3 story points

- [ ] Test "See Also" section generation (all batch articles)
- [ ] Test homepage URL extraction (remove slug after `/`)
- [ ] Test tiered link injection for T1 (money site) and T2+ (lower tier)
- [ ] Test anchor text config modes: default, override, append
- [ ] Test case-insensitive anchor text matching (first occurrence only)
- [ ] Test fallback anchor text insertion when not found in content
- [ ] Test HTML structure preservation after link injection
- [ ] Test database record creation (ArticleLink for all link types)
- [ ] Test with different tier configurations (T1, T2, T3, T4+)

### 6. Integration Tests
**Effort:** 2 story points

- [ ] Test full flow: Story 3.1 URLs → Story 3.2 tiered links → Story 3.3 injection
- [ ] Test with different batch sizes (5, 10, 20 articles)
- [ ] Test with various HTML content structures
- [ ] Verify link relationships in `article_links` table
- [ ] Test with different tiers and project configurations
- [ ] Verify final HTML is deployable (well-formed)

## Dependencies
- Story 3.1: URL generation must be complete
- Story 3.2: Tiered link finding must be complete
- Story 2.3: Generated content must exist
- Story 1.x: Project and database models must exist

## Future Considerations
- Story 4.x will use the final HTML content for deployment
- Analytics dashboard will use `article_links` data
- Future: Advanced link placement strategies
- Future: Link density optimization

## Total Effort
14 story points

## Technical Notes

### Existing Code to Use
```python
# Use existing anchor text generator
from src.interlinking.anchor_text_generator import get_anchor_text_for_tier

# Example usage - Default tier-based
default_anchors = get_anchor_text_for_tier("tier1", project, count=5)
# Returns: ["shaft machining", "learn about shaft machining", "shaft machining guide", ...]
anchor_texts = default_anchors

# Example usage - With job config override
if job_config.anchor_text_config:
    if job_config.anchor_text_config.mode == "override":
        anchor_texts = job_config.anchor_text_config.custom_text
        # e.g. ["click here for more info", "learn more about this topic", ...]
    elif job_config.anchor_text_config.mode == "append":
        anchor_texts = default_anchors + job_config.anchor_text_config.custom_text
        # e.g. ["shaft machining", "learn about...", "click here...", ...]
```

### Anchor Text Configuration (Job Config)
Job configuration supports three modes for anchor text:

```json
{
  "anchor_text_config": {
    "mode": "default|override|append",
    "custom_text": ["anchor 1", "anchor 2", ...]
  }
}
```

**Modes:**
- `default`: Use tier-based anchor text from `anchor_text_generator.py`
- `override`: Replace tier-based anchors with `custom_text` list
- `append`: Add `custom_text` to tier-based anchors

**Example - Override Mode:**
```json
{
  "anchor_text_config": {
    "mode": "override",
    "custom_text": [
      "click here for more info",
      "learn more about this topic",
      "discover the best practices"
    ]
  }
}
```

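The three modes could be folded into one small helper, sketched below. `AnchorTextConfig` and `resolve_anchor_texts` are hypothetical names chosen for illustration; the real job config classes may differ. An empty `custom_text` in override mode falls back to the defaults, matching the inline snippet under Tiered Link Injection. Unlike that snippet, which treats any other mode as `default`, this sketch raises on an unknown mode. That strictness is a design choice, not a stated requirement.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnchorTextConfig:
    """Hypothetical mirror of the anchor_text_config JSON object."""
    mode: str = "default"
    custom_text: List[str] = field(default_factory=list)


def resolve_anchor_texts(
    default_anchors: List[str],
    config: Optional[AnchorTextConfig],
) -> List[str]:
    """Combine tier-based defaults with job config overrides."""
    if config is None or config.mode == "default":
        return list(default_anchors)
    if config.mode == "override":
        # Empty custom_text falls back to the tier-based defaults.
        return list(config.custom_text) or list(default_anchors)
    if config.mode == "append":
        return list(default_anchors) + list(config.custom_text)
    raise ValueError(f"Unknown anchor text mode: {config.mode!r}")
```
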
### Link Injection Rules
1. **One link per anchor text** - Only link the FIRST occurrence
2. **Case-insensitive search** - Match "Shaft Machining" with "shaft machining"
3. **Preserve HTML structure** - Don't break existing tags
4. **Fallback insertion** - If anchor text not in content, insert it naturally
5. **Config overrides** - Job config can override/append to tier-based defaults

### "See Also" Section Format
```html
<!-- Appended after last paragraph -->
<h3>See Also</h3>
<ul>
  <li><a href="https://site1.com/article1.html">Article Title 1</a></li>
  <li><a href="https://site2.com/article2.html">Article Title 2</a></li>
  <li><a href="https://site3.com/article3.html">Article Title 3</a></li>
</ul>
```

### Homepage URL Examples
```
https://example.com/article-slug.html → https://example.com/
https://site.b-cdn.net/my-article.html → https://site.b-cdn.net/
https://www.custom.com/path/to/article.html → https://www.custom.com/
```

## Notes
This story uses the existing tier-based anchor text generation. There is no need to implement anchor text logic from scratch: import and use the existing functions, which already handle the edge cases.