Story 3.3: Content Interlinking Injection

Status

COMPLETE - Implemented, Integrated, and Tested

Summary

This story injects three types of links into article HTML:

  1. Tiered Links - T1 articles link to money site, T2+ link to lower-tier articles
  2. Homepage Links - Link to the site's homepage (base domain)
  3. "See Also" Section - Links to all other articles in the batch

Uses existing anchor_text_generator.py for tier-based anchor text with support for job config overrides (default/override/append modes).

Story

As a developer, I want to inject all required links (batch "wheel", home page, and tiered/money site) into each new article's HTML content, so that the articles are fully interlinked and ready for deployment.

Context

  • Story 3.1 generates final URLs for all articles in the batch
  • Story 3.2 finds the required tiered links (money site or lower-tier URLs)
  • Articles have raw HTML content from Epic 2 (h2, h3, p tags)
  • Project contains anchor text lists for each tier
  • Articles need wheel links (next/previous), homepage links, and tiered links

Acceptance Criteria

Core Functionality

  • A function takes raw HTML content, URL list, tiered links, and project data
  • Wheel Links: Each article gets "next" and "previous" links to other articles in the batch
    • Last article's "next" links to first article (circular)
    • First article's "previous" links to last article (circular)
  • Homepage Links: Each article gets a link to its site's homepage
  • Tiered Links: Articles get links based on their tier
    • Tier 1: Links to money site using T1 anchor text
    • Tier 2+: Links to lower-tier articles using appropriate tier anchor text

Input Requirements

  • Raw HTML content (from Epic 2)
  • List of article URLs with titles (from Story 3.1)
  • Tiered links object (from Story 3.2)
  • Project data (for anchor text lists)
  • Batch tier information

Output Requirements

  • Final HTML content with all links injected
  • Updated content stored in database
  • Link relationships recorded in article_links table

Implementation Details

Anchor Text Generation

RESOLVED: Use existing src/interlinking/anchor_text_generator.py with job config overrides

  • Default tier-based anchor text:
    • Tier 1: Uses main keyword variations
    • Tier 2: Uses related searches
    • Tier 3: Uses main keyword variations
    • Tier 4+: Uses entities
  • Job config overrides via anchor_text_config:
    • mode: "default" - Use tier-based defaults
    • mode: "override" - Replace defaults with custom_text list
    • mode: "append" - Add custom_text to tier-based defaults
  • Import and use get_anchor_text_for_tier() function

Homepage URL Generation

RESOLVED: Remove the slug after / from the article URL

  • Example: https://site.com/article-slug.html → https://site.com/
  • Use base domain as homepage URL

Tiered Link Injection

  1. First Priority: Find anchor text already in the document
    • Search for anchor text in HTML content
    • Add link to FIRST match only (prevent duplicate links)
    • Case-insensitive matching
  2. Fallback: If anchor text not found in document
    • Insert anchor text into a sentence in the article
    • Make it a link to the target URL

"See Also" Section

  • Add a "See Also" section after the last paragraph
  • Format as heading + unordered list
  • Include ALL other articles in the batch (excluding current article)
  • Each list item is an article title as a link
  • Example:
    <h3>See Also</h3>
    <ul>
      <li><a href="url1">Article Title 1</a></li>
      <li><a href="url2">Article Title 2</a></li>
      <li><a href="url3">Article Title 3</a></li>
    </ul>

Homepage Link Injection

  • Same as tiered links: find anchor text in content or insert it
  • Link to site homepage (base domain)
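
The find-or-insert strategy shared by tiered and homepage links can be sketched as follows. This is a minimal illustration using only the standard library, not the project's implementation: the names try_wrap_first_occurrence, insert_link_fallback, and inject_link are hypothetical, and the fallback (appending the link to the last paragraph, using the first candidate anchor text) is an assumption, since the story only says the text is inserted into a sentence. The sketch searches text nodes only, so it does not match inside tag attributes or inside existing links; it does not treat script or style content specially.

```python
import html
import re
from typing import List, Tuple

_TAG_SPLIT = re.compile(r"(<[^>]+>)")            # capture tags so split keeps them
_OPEN_A = re.compile(r"<a[\s>]", re.IGNORECASE)
_CLOSE_A = re.compile(r"</a\s*>", re.IGNORECASE)
_CLOSE_P = re.compile(r"</p\s*>", re.IGNORECASE)


def try_wrap_first_occurrence(
    html_content: str, anchor_text: str, target_url: str
) -> Tuple[str, bool]:
    """Link the first case-insensitive match found in a text node.

    Returns (html, found). Hypothetical helper for illustration.
    """
    pattern = re.compile(re.escape(anchor_text), re.IGNORECASE)
    parts = _TAG_SPLIT.split(html_content)  # even index: text, odd index: tag
    inside_anchor = False
    for i, part in enumerate(parts):
        if i % 2 == 1:  # a tag: only track whether we are inside <a>...</a>
            if _OPEN_A.match(part):
                inside_anchor = True
            elif _CLOSE_A.match(part):
                inside_anchor = False
            continue
        if inside_anchor:
            continue  # never nest a link inside an existing link
        match = pattern.search(part)
        if match:
            href = html.escape(target_url, quote=True)
            # Keep the matched text as written in the article (original casing).
            link = f'<a href="{href}">{match.group(0)}</a>'
            parts[i] = part[: match.start()] + link + part[match.end():]
            return "".join(parts), True
    return html_content, False


def insert_link_fallback(html_content: str, anchor_text: str, target_url: str) -> str:
    """Assumed fallback: append the linked anchor text to the last paragraph."""
    href = html.escape(target_url, quote=True)
    link = f'<a href="{href}">{html.escape(anchor_text)}</a>'
    closings = list(_CLOSE_P.finditer(html_content))
    if not closings:
        return html_content + f"<p>{link}</p>"
    pos = closings[-1].start()
    return html_content[:pos] + " " + link + html_content[pos:]


def inject_link(html_content: str, anchor_texts: List[str], target_url: str) -> str:
    """Try each candidate anchor text; fall back to inserting the first one."""
    if not anchor_texts:
        return html_content
    for anchor_text in anchor_texts:
        updated, found = try_wrap_first_occurrence(html_content, anchor_text, target_url)
        if found:
            return updated
    return insert_link_fallback(html_content, anchor_texts[0], target_url)
```

Returning a found flag from the wrapping helper keeps the "first match only" rule and the fallback decision in one place, so a caller never inserts a second link for the same target.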

Implementation Approach

Function Signature

def inject_interlinks(
    content_records: List[GeneratedContent],
    article_urls: List[Dict],  # [{content_id, title, url}, ...]
    tiered_links: Dict,       # From Story 3.2
    project: Project,
    content_repo: GeneratedContentRepository,
    link_repo: ArticleLinkRepository
) -> None:  # Updates content in database

Processing Flow

  1. For each article in the batch:
    a. Load its raw HTML content
    b. Generate tier-appropriate anchor text using get_anchor_text_for_tier()
    c. Inject tiered links (money site or lower tier)
    d. Inject homepage link
    e. Inject wheel links ("See Also" section)
    f. Update content in database
    g. Record all links in article_links table

# Get anchor text for this tier
from src.interlinking.anchor_text_generator import get_anchor_text_for_tier

# Get default tier-based anchor text
default_anchors = get_anchor_text_for_tier(tier, project, count=5)

# Apply job config overrides if present
if job_config.anchor_text_config:
    if job_config.anchor_text_config.mode == "override":
        anchor_texts = job_config.anchor_text_config.custom_text or default_anchors
    elif job_config.anchor_text_config.mode == "append":
        anchor_texts = default_anchors + (job_config.anchor_text_config.custom_text or [])
    else:  # "default"
        anchor_texts = default_anchors
else:
    anchor_texts = default_anchors

# Link the first anchor text that already appears in the content
for anchor_text in anchor_texts:
    if anchor_text.lower() in html_content.lower():
        # Wrap FIRST occurrence with link (case-insensitive match)
        html_content = wrap_first_occurrence(html_content, anchor_text, target_url)
        break
else:
    # No anchor text was found: insert one anchor text + link into a paragraph
    # (using the first candidate here is an assumption)
    html_content = insert_link_into_content(html_content, anchor_texts[0], target_url)

# Derive homepage URL
homepage_url = extract_base_url(article_url)  # https://site.com/article.html → https://site.com/

# Use main keyword as anchor text
anchor_text = project.main_keyword
# Find or insert link (same strategy as tiered links)

# Build "See Also" section with ALL other articles in batch
other_articles = [a for a in article_urls if a['content_id'] != current_article.id]

see_also_html = "<h3>See Also</h3>\n<ul>\n"
for article in other_articles:
    see_also_html += f'  <li><a href="{article["url"]}">{article["title"]}</a></li>\n'
see_also_html += "</ul>\n"

# Append after last paragraph (before closing tags)
html_content = insert_before_closing_tags(html_content, see_also_html)
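
The snippet above calls insert_before_closing_tags() without defining it. One possible shape for that helper is sketched below, assuming the content is either a bare fragment (h2, h3, p tags) or is wrapped in an article, body, or html element. The regex-based approach and the choice to insert before the first such closing tag are assumptions, not the project's confirmed behavior.

```python
import re

# Closing tags that the fragment should be placed in front of (assumed set).
_CLOSING_RE = re.compile(r"</(?:article|body|html)\s*>", re.IGNORECASE)


def insert_before_closing_tags(html_content: str, fragment: str) -> str:
    """Insert fragment before the first closing article/body/html tag.

    If no such tag exists (a bare fragment or empty content), append instead.
    """
    match = _CLOSING_RE.search(html_content)
    if match is None:
        # Keep the fragment on its own line when appending to non-empty content.
        needs_newline = bool(html_content) and not html_content.endswith("\n")
        return html_content + ("\n" if needs_newline else "") + fragment
    return html_content[: match.start()] + fragment + html_content[match.start():]
```

With empty content the function simply returns the fragment, which covers the "empty content" edge case without a separate branch.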

Database Updates

  • Update GeneratedContent.content with final HTML
  • Create ArticleLink records for all injected links:
    • link_type="tiered" for money site / lower tier links
    • link_type="homepage" for homepage links
    • link_type="wheel_see_also" for "See Also" section links
  • Track both internal (to_content_id) and external (to_url) links

Note: The "See Also" section replaces the previous wheel_next/wheel_prev concept. Each article links to all other articles in the batch via the "See Also" section.
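
As an illustration of the internal/external distinction, the records for one article could be assembled as below before being handed to the link repository. LinkRecord and both builder functions are hypothetical stand-ins for the real ArticleLink model, whose exact fields are not shown in this story; only link_type, to_content_id, and to_url come from the text above, and from_content_id is an assumed field name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LINK_TYPES = {"tiered", "homepage", "wheel_see_also"}


@dataclass(frozen=True)
class LinkRecord:
    """Hypothetical stand-in for an ArticleLink row."""
    from_content_id: int
    link_type: str
    to_content_id: Optional[int] = None  # set for internal links
    to_url: Optional[str] = None         # set for external links


def build_see_also_records(current_id: int, article_urls: List[Dict]) -> List[LinkRecord]:
    """One internal record per other article in the batch."""
    return [
        LinkRecord(current_id, "wheel_see_also", to_content_id=a["content_id"])
        for a in article_urls
        if a["content_id"] != current_id
    ]


def build_url_record(current_id: int, link_type: str, url: str) -> LinkRecord:
    """One external record, e.g. a homepage link or a money-site tiered link."""
    if link_type not in LINK_TYPES:
        raise ValueError(f"Unknown link type: {link_type!r}")
    return LinkRecord(current_id, link_type, to_url=url)
```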

Tasks / Subtasks

1. Create Content Injection Module

Effort: 3 story points

  • Create src/interlinking/content_injection.py
  • Implement inject_interlinks() main function
  • Implement "See Also" section builder (all batch articles)
  • Implement homepage URL extraction (base domain)
  • Implement tiered link injection with anchor text matching

2. Anchor Text Processing

Effort: 2 story points

  • Import get_anchor_text_for_tier() from existing module
  • Apply job config anchor_text_config overrides (default/override/append)
  • Implement case-insensitive anchor text search in HTML
  • Wrap first occurrence of anchor text with link
  • Implement fallback: insert anchor text + link if not found in content

3. HTML Link Injection

Effort: 2 story points

  • Implement safe HTML parsing (avoid breaking existing tags)
  • Implement link insertion before closing article/body tags
  • Ensure proper link formatting (<a href="...">text</a>)
  • Handle edge cases (empty content, malformed HTML)
  • Preserve HTML structure and formatting

4. Database Integration

Effort: 2 story points

  • Update GeneratedContent.content with final HTML
  • Create ArticleLink records for all links
  • Handle both internal (content_id) and external (URL) links
  • Ensure proper link type categorization

5. Unit Tests

Effort: 3 story points

  • Test "See Also" section generation (all batch articles)
  • Test homepage URL extraction (remove slug after /)
  • Test tiered link injection for T1 (money site) and T2+ (lower tier)
  • Test anchor text config modes: default, override, append
  • Test case-insensitive anchor text matching (first occurrence only)
  • Test fallback anchor text insertion when not found in content
  • Test HTML structure preservation after link injection
  • Test database record creation (ArticleLink for all link types)
  • Test with different tier configurations (T1, T2, T3, T4+)

6. Integration Tests

Effort: 2 story points

  • Test full flow: Story 3.1 URLs → Story 3.2 tiered links → Story 3.3 injection
  • Test with different batch sizes (5, 10, 20 articles)
  • Test with various HTML content structures
  • Verify link relationships in article_links table
  • Test with different tiers and project configurations
  • Verify final HTML is deployable (well-formed)

Dependencies

  • Story 3.1: URL generation must be complete
  • Story 3.2: Tiered link finding must be complete
  • Story 2.3: Generated content must exist
  • Story 1.x: Project and database models must exist

Future Considerations

  • Story 4.x will use the final HTML content for deployment
  • Analytics dashboard will use article_links data
  • Future: Advanced link placement strategies
  • Future: Link density optimization

Total Effort

14 story points

Technical Notes

Existing Code to Use

# Use existing anchor text generator
from src.interlinking.anchor_text_generator import get_anchor_text_for_tier

# Example usage - Default tier-based
default_anchors = get_anchor_text_for_tier("tier1", project, count=5)
# Returns: ["shaft machining", "learn about shaft machining", "shaft machining guide", ...]
anchor_texts = default_anchors

# Example usage - With job config override
if job_config.anchor_text_config:
    custom_text = job_config.anchor_text_config.custom_text or []
    if job_config.anchor_text_config.mode == "override":
        # Fall back to the defaults if no custom text was supplied
        anchor_texts = custom_text or default_anchors
        # Returns: ["click here for more info", "learn more about this topic", ...]
    elif job_config.anchor_text_config.mode == "append":
        anchor_texts = default_anchors + custom_text
        # Returns: ["shaft machining", "learn about...", "click here...", ...]

Anchor Text Configuration (Job Config)

Job configuration supports three modes for anchor text:

{
  "anchor_text_config": {
    "mode": "default|override|append",
    "custom_text": ["anchor 1", "anchor 2", ...]
  }
}

Modes:

  • default: Use tier-based anchor text from anchor_text_generator.py
  • override: Replace tier-based anchors with custom_text list
  • append: Add custom_text to tier-based anchors
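
The three modes can also be expressed as one small pure function, which is convenient for unit-testing the override logic in isolation. This is a sketch under assumptions: resolve_anchor_texts is a hypothetical name, the config is taken as a plain dict shaped like the JSON above rather than the project's job config object, and rejecting unknown modes with ValueError is a design choice not stated in the story.

```python
from typing import Any, Dict, List, Optional

VALID_MODES = {"default", "override", "append"}


def resolve_anchor_texts(
    default_anchors: List[str],
    anchor_text_config: Optional[Dict[str, Any]],
) -> List[str]:
    """Combine tier-based defaults with the job config's anchor_text_config."""
    if not anchor_text_config:
        return list(default_anchors)

    mode = anchor_text_config.get("mode", "default")
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown anchor text mode: {mode!r}")

    custom = list(anchor_text_config.get("custom_text") or [])
    if mode == "override":
        # An empty override list falls back to the tier-based defaults.
        return custom or list(default_anchors)
    if mode == "append":
        return list(default_anchors) + custom
    return list(default_anchors)
```

For example, with defaults ["shaft machining"] and custom_text ["learn more"], override mode yields ["learn more"] and append mode yields ["shaft machining", "learn more"].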

Example - Override Mode:

{
  "anchor_text_config": {
    "mode": "override",
    "custom_text": [
      "click here for more info",
      "learn more about this topic",
      "discover the best practices"
    ]
  }
}

Link Injection Rules

  1. One link per anchor text - Only link the FIRST occurrence
  2. Case-insensitive search - Match "Shaft Machining" with "shaft machining"
  3. Preserve HTML structure - Don't break existing tags
  4. Fallback insertion - If anchor text not in content, insert it naturally
  5. Config overrides - Job config can override/append to tier-based defaults

"See Also" Section Format

<!-- Appended after last paragraph -->
<h3>See Also</h3>
<ul>
  <li><a href="https://site1.com/article1.html">Article Title 1</a></li>
  <li><a href="https://site2.com/article2.html">Article Title 2</a></li>
  <li><a href="https://site3.com/article3.html">Article Title 3</a></li>
</ul>
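
A builder that produces this format could look like the sketch below. It assumes the article_urls entries carry content_id, title, and url keys as in the function signature; build_see_also_html is a hypothetical name, and escaping titles and URLs with html.escape is an added precaution rather than something the story requires.

```python
import html
from typing import Dict, List


def build_see_also_html(article_urls: List[Dict], current_id: int) -> str:
    """Build the "See Also" block for every article except the current one."""
    items = [
        f'  <li><a href="{html.escape(a["url"], quote=True)}">{html.escape(a["title"])}</a></li>'
        for a in article_urls
        if a["content_id"] != current_id
    ]
    if not items:
        # A batch of one article has nothing to link to, so emit no section.
        return ""
    return "<h3>See Also</h3>\n<ul>\n" + "\n".join(items) + "\n</ul>\n"
```

Escaping matters because article titles may contain characters such as & or <, which would otherwise produce malformed HTML in the final content.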

Homepage URL Examples

https://example.com/article-slug.html → https://example.com/
https://site.b-cdn.net/my-article.html → https://site.b-cdn.net/
https://www.custom.com/path/to/article.html → https://www.custom.com/
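
These mappings can be reproduced with the standard library's urllib.parse. The sketch below is one way extract_base_url() might be written; the function name comes from the snippet earlier in this story, but the body and the ValueError on URLs without a scheme or host are assumptions.

```python
from urllib.parse import urlsplit


def extract_base_url(article_url: str) -> str:
    """Return scheme://host/ for an article URL, dropping path, query, and fragment."""
    parts = urlsplit(article_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {article_url!r}")
    return f"{parts.scheme}://{parts.netloc}/"
```

Parsing the URL instead of trimming after the last slash handles nested paths such as /path/to/article.html, which a simple "remove the slug" string operation would reduce only to /path/to/.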

Notes

This story reuses the existing tier-based anchor text generation. There is no need to implement anchor text logic from scratch: import the existing functions, which already handle tier-based selection, and apply the job config overrides on top of their output.