Fixed NOT TESTED: now actually listens to # of links. Also makes See Also smaller.

main
PeninsulaInd 2025-10-23 15:37:31 -05:00
parent b168d33e2d
commit 083a8cacdd
7 changed files with 546 additions and 11 deletions


@ -0,0 +1,247 @@
# Deploy-Batch Analysis for test_shaft_machining.json
## Quick Answers to Your Questions
### 1. What should the anchor text be at each level?
**Tier 1 Articles (5 articles):**
- **Money Site Links:** Uses `main_keyword` variations from the project. The system searches the article content for these phrases and uses the first one that matches:
  - "shaft machining"
  - "learn about shaft machining"
  - "shaft machining guide"
  - "best shaft machining"
  - "shaft machining tips"
- **Home Link:** Now in navigation menu (not injected into content)
- **See Also Links:** Uses article titles as anchor text
**Tier 2 Articles (20 articles):**
- **Lower Tier Links:** Uses `related_searches` from CORA data
  - Depends on which related searches were in the shaft_machining.xlsx file
  - If no related searches exist, falls back to `main_keyword` variations
- **Home Link:** Now in navigation menu (not injected into content)
- **See Also Links:** Uses article titles as anchor text
**Configuration:**
- Anchor text rules come from `master.config.json` → `interlinking.tier_anchor_text_rules`
- Can be overridden in job config with `anchor_text_config`
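
The "first phrase that matches" rule for money-site anchors can be sketched as follows. This is a minimal illustration, not the project's code: the helper name `pick_anchor_text` and the case-insensitive substring match are assumptions, and the sample article text is made up.

```python
from typing import Optional, Sequence


def pick_anchor_text(content: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate phrase found in the content, or None.

    Hypothetical helper: the case-insensitive substring search is an
    assumption about how phrases are compared against the article.
    """
    lowered = content.lower()
    for phrase in candidates:
        if phrase.lower() in lowered:
            return phrase
    return None


variations = [
    "shaft machining",
    "learn about shaft machining",
    "shaft machining guide",
    "best shaft machining",
    "shaft machining tips",
]
article = "<p>Our shaft machining guide covers turning and grinding.</p>"
chosen = pick_anchor_text(article, variations)
print(chosen)
```

Because candidates are tried in list order, the bare keyword wins here even though a longer variation also appears in the text. Under this assumed matching rule, the order of the variations matters.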
### 2. How many links should be in each article?
**Tier 1 Articles:**
- 1 link to money site (https://fzemanufacturing.com/capabilities/shaft-machining-services)
- 4 "See Also" links (to the other 4 tier1 articles)
- **Total: 5 links per tier1 article** (plus Home in nav menu)
**Tier 2 Articles:**
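- 2-4 links to tier1 articles (random selection; the count is drawn between `interlinking.links_per_article_min` and `interlinking.links_per_article_max`)
- 4-5 "See Also" links (random selection from the other 19 tier2 articles, using the `see_also_min`/`see_also_max` defaults added in this commit)
- **Total: 6-9 links per tier2 article** (plus Home in nav menu)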
**Your JSON Configuration:**
```json
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4
}
```
This controls the tiered links (tier2 → tier1). Each tier2 article will get between 2-4 random tier1 articles to link to.
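
As a rough sketch of what "between 2-4 random tier1 articles" means in practice, the selection could look like the code below. The function name `pick_tier1_targets`, the optional `rng` parameter, and the placeholder URLs are assumptions for illustration only.

```python
import random
from typing import List, Optional


def pick_tier1_targets(
    tier1_urls: List[str],
    min_links: int = 2,
    max_links: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose between min_links and max_links distinct tier1 URLs at random."""
    rng = rng or random.Random()
    # Never ask for more targets than there are tier1 articles.
    count = min(rng.randint(min_links, max_links), len(tier1_urls))
    return rng.sample(tier1_urls, count)


# Placeholder URLs standing in for the five tier1 articles.
tier1_urls = [f"https://getcnc.info/article-{i}.html" for i in range(5)]
targets = pick_tier1_targets(tier1_urls, rng=random.Random(0))
print(len(targets), targets)
```

Passing a seeded `random.Random` makes a run reproducible, which can help when comparing the selection against rows in `article_links`.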
### 3. Should "Home" be a link?
**YES** - Home is a link in the navigation menu at the top of every page.
**How it works:**
- The HTML template (`basic.html`) includes a `<nav>` menu with Home link
- Template line 113: `<li><a href="/index.html">Home</a></li>`
- This is part of the template wrapper, not injected into article content
**Old behavior (now removed):**
- Previously, system searched article content for "Home" and tried to link it
- This was redundant since Home is already in the nav menu
- Code has been updated to remove this injection
## Step-by-Step: What Happens During deploy-batch
### Step 1: Load Articles from Database
```
- Project 1 has generated content already
- Tier 1: 5 articles
- Tier 2: 20 articles
- Each article has: title, content (HTML), site_deployment_id
```
### Step 2: URL Generation (already done during generate-batch)
```
Tier 1 URLs (round-robin between getcnc.info and textbullseye.com):
- Article 0: https://getcnc.info/{slug}.html
- Article 1: https://www.textbullseye.com/{slug}.html
- Article 2: https://getcnc.info/{slug}.html
- Article 3: https://www.textbullseye.com/{slug}.html
- Article 4: https://getcnc.info/{slug}.html
Tier 2 URLs (round-robin):
- Articles 0-19 distributed across both domains
```
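
The round-robin assignment listed above can be reproduced with a short sketch. The function name `assign_urls` and the placeholder slugs are assumptions; only the two domains and the `/{slug}.html` pattern come from the analysis.

```python
from typing import List


def assign_urls(slugs: List[str], domains: List[str]) -> List[str]:
    """Assign each slug to a domain in round-robin order."""
    return [
        f"https://{domains[i % len(domains)]}/{slug}.html"
        for i, slug in enumerate(slugs)
    ]


domains = ["getcnc.info", "www.textbullseye.com"]
slugs = [f"article-{i}" for i in range(5)]  # placeholder slugs
urls = assign_urls(slugs, domains)
for url in urls:
    print(url)
```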
### Step 3: Tiered Links (already injected during generate-batch)
**For Tier 1:**
- Target: Money site URL from project database
- Anchor text: main_keyword variations
- Links already in `generated_content.content` HTML
**For Tier 2:**
- Target: Random selection of tier1 URLs (2-4 per article)
- Anchor text: related_searches from project
- Links already in HTML
### Step 4: Homepage Links
- Home link is in the navigation menu (template)
- No longer injected into article content
### Step 5: See Also Section (already injected)
- HTML section with links to other articles in same tier
### Step 6: Template Application (already done)
- HTML wrapped in template from `src/templating/templates/basic.html`
- Navigation menu added
- Stored in `generated_content.formatted_html`
### Step 7: Upload to Bunny.net
```
For each article:
1. Get site deployment credentials
2. Upload formatted_html to storage zone
3. File path: /{slug}.html
4. Log URL to deployment_logs/
5. Update database: deployed_url, status='deployed'
For each site's boilerplate pages:
1. Upload index.html (if exists)
2. Upload about.html
3. Upload contact.html
4. Upload privacy.html
```
## Database Link Tracking
All links are tracked in `article_links` table:
**Tier 1 Article Example (ID: 43):**
```
| from_content_id | to_content_id | to_url | anchor_text | link_type |
|-----------------|---------------|--------|-------------|-----------|
| 43 | NULL | https://fzemanufacturing.com/... | "shaft machining" | tiered |
| 43 | 44 | NULL | "Understanding CNC..." | wheel_see_also |
| 43 | 45 | NULL | "Advanced Shaft..." | wheel_see_also |
| 43 | 46 | NULL | "Precision Machining..." | wheel_see_also |
| 43 | 47 | NULL | "Modern Shaft..." | wheel_see_also |
```
**Tier 2 Article Example (ID: 48):**
```
| from_content_id | to_content_id | to_url | anchor_text | link_type |
|-----------------|---------------|--------|-------------|-----------|
| 48 | NULL | https://getcnc.info/{slug1}.html | "cnc machining services" | tiered |
| 48 | NULL | https://www.textbullseye.com/{slug2}.html | "precision shaft work" | tiered |
| 48 | NULL | https://getcnc.info/{slug3}.html | "shaft turning operations" | tiered |
| 48 | 49 | NULL | "Tier 2 Article 2 Title" | wheel_see_also |
| ... | ... | ... | ... | ... |
| 48 | 67 | NULL | "Tier 2 Article 20 Title" | wheel_see_also |
```
**Note:** Home link is no longer tracked in the database since it's in the template, not injected into content.
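
To check link counts against these tables, a per-type count is usually enough. The sketch below runs against an in-memory SQLite database; the column types and the sample rows are assumptions based only on the column names shown above, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: column names come from the tables above, types are guesses.
conn.execute(
    """
    CREATE TABLE article_links (
        from_content_id INTEGER NOT NULL,
        to_content_id   INTEGER,
        to_url          TEXT,
        anchor_text     TEXT,
        link_type       TEXT NOT NULL
    )
    """
)
rows = [
    (43, None, "https://example.com/money-page", "shaft machining", "tiered"),
    (43, 44, None, "Article 44 title", "wheel_see_also"),
    (43, 45, None, "Article 45 title", "wheel_see_also"),
]
conn.executemany("INSERT INTO article_links VALUES (?, ?, ?, ?, ?)", rows)

# Count outgoing links for one article, grouped by link type.
counts = dict(
    conn.execute(
        "SELECT link_type, COUNT(*) FROM article_links "
        "WHERE from_content_id = ? GROUP BY link_type",
        (43,),
    ).fetchall()
)
conn.close()
print(counts)
```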
## Your Specific JSON File Analysis
```json
{
"jobs": [
{
"project_id": 1,
"deployment_targets": [
"getcnc.info",
"www.textbullseye.com"
],
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 1500,
"max_word_count": 2000,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "anthropic/claude-3.5-sonnet"
}
},
"tier2": {
"count": 20,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "openai/gpt-4o-mini"
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4
}
}
}
}
]
}
```
**What This Configuration Does:**
1. **Tier 1 (5 articles):**
   - Uses Claude Sonnet for content, GPT-4o-mini for titles/outlines
   - 1500-2000 words per article
   - Distributed across getcnc.info and www.textbullseye.com
   - Each links to: money site (1) + See Also (4) = 5 total links (plus Home in nav menu)
2. **Tier 2 (20 articles):**
   - Uses GPT-4o-mini for everything (cheaper)
   - Default word count (1100-1500)
   - Each links to: 2-4 tier1 articles + See Also (4-5, chosen at random from the other 19) = 6-9 total links (plus Home in nav menu)
   - Distributed across both domains
3. **Missing Configurations (using defaults):**
   - `tier1.interlinking`: Not specified → uses defaults (but tier1 always gets 1 money site link anyway)
   - `anchor_text_config`: Not specified → uses master.config.json rules
## All JSON Fields That Affect Behavior
See `MASTER_JSON.json` for the complete reference. Key fields:
**Top-level job fields:**
- `project_id` - Which project's data to use
- `deployment_targets` - Which domains to deploy to
- `models` - Which AI models to use
- `tiered_link_count_range` - How many tiered links (job-level default)
- `anchor_text_config` - Override anchor text generation
- `interlinking` - Job-level interlinking defaults
**Tier-level fields:**
- `count` - Number of articles
- `min_word_count`, `max_word_count` - Content length
- `min_h2_tags`, `max_h2_tags`, `min_h3_tags`, `max_h3_tags` - Outline structure
- `models` - Tier-specific model overrides
- `interlinking` - Tier-specific interlinking overrides
**Fields in master.config.json:**
- `interlinking.tier_anchor_text_rules` - Defines anchor text sources per tier
- `interlinking.include_home_link` - Global default for Home links
- `interlinking.wheel_links` - Enable/disable See Also sections
**Fields in project database:**
- `main_keyword` - Used for tier1 anchor text
- `related_searches` - Used for tier2 anchor text
- `entities` - Used for tier3+ anchor text
- `money_site_url` - Destination for tier1 links


@ -0,0 +1,161 @@
# Job Configuration Field Reference
## Quick Field List
### Job Level (applies to all tiers)
```
project_id - Required, integer
deployment_targets - Array of domain strings
tier1_preferred_sites - Array of domain strings (subset of deployment_targets)
auto_create_sites - Boolean (NOT IMPLEMENTED - parsed but doesn't work)
create_sites_for_keywords - Array of {keyword, count} objects (NOT IMPLEMENTED - parsed but doesn't work)
models - {title, outline, content} with model strings
tiered_link_count_range - {min, max} integers
anchor_text_config - {mode, custom_text}
failure_config - {max_consecutive_failures, skip_on_failure}
interlinking - {links_per_article_min, links_per_article_max, see_also_min, see_also_max}
tiers - Required, object with tier1/tier2/tier3
```
### Tier Level (per tier configuration)
```
count - Required, integer (number of articles)
min_word_count - Integer
max_word_count - Integer
min_h2_tags - Integer
max_h2_tags - Integer
min_h3_tags - Integer
max_h3_tags - Integer
models - {title, outline, content} - overrides job-level
interlinking - {links_per_article_min, links_per_article_max, see_also_min, see_also_max} - overrides job-level
```
## Field Behaviors
**deployment_targets**: Sites to deploy to (round-robin distribution)
**tier1_preferred_sites**: If set, tier1 only uses these sites
**models**: Use format "provider/model-name" (e.g., "openai/gpt-4o-mini")
**anchor_text_config**: Job-level only, applies to ALL tiers (no tier-specific option)
- "default" = Use master.config.json tier rules
- "override" = Replace with custom_text for all tiers
- "append" = Add custom_text to tier rules for all tiers
**tiered_link_count_range**: How many links to lower tier
- Tier1: Always 1 link to money site (this setting ignored)
- Tier2+: Random between min and max links to lower tier
**interlinking.links_per_article_min/max**: Same as tiered_link_count_range
**interlinking.see_also_min/max**: How many See Also links (default 4-5)
- Randomly selects this many articles from same tier for See Also section
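
The See Also count is capped by the number of other articles in the tier, which is why a 5-article tier always ends up with 4 links under the 4-5 default. The sketch below mirrors the selection logic shown in this commit's diff; the wrapper name `select_see_also` and the `rng` parameter are assumptions.

```python
import random
from typing import Dict, List, Optional


def select_see_also(
    other_articles: List[Dict],
    see_also_min: int = 4,
    see_also_max: int = 5,
    rng: Optional[random.Random] = None,
) -> List[Dict]:
    """Pick a random subset of same-tier articles for the See Also section."""
    rng = rng or random.Random()
    # Cap the drawn count at the number of articles actually available.
    count = min(rng.randint(see_also_min, see_also_max), len(other_articles))
    return rng.sample(other_articles, count)


tier1_others = [{"content_id": i} for i in range(44, 48)]  # 4 other tier1 articles
tier2_others = [{"content_id": i} for i in range(49, 68)]  # 19 other tier2 articles
tier1_picks = select_see_also(tier1_others)
tier2_picks = select_see_also(tier2_others)
print(len(tier1_picks), len(tier2_picks))
```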
## Defaults
If not specified, these defaults apply:
### Tier1 Defaults
```json
{
"min_word_count": 2000,
"max_word_count": 2500,
"min_h2_tags": 3,
"max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
}
```
### Tier2 Defaults
```json
{
"min_word_count": 1100,
"max_word_count": 1500,
"min_h2_tags": 2,
"max_h2_tags": 4,
"min_h3_tags": 3,
"max_h3_tags": 8
}
```
### Tier3 Defaults
```json
{
"min_word_count": 850,
"max_word_count": 1350,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 2,
"max_h3_tags": 6
}
```
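
One way to picture how these defaults combine with a job file is a simple overlay, where any field present in the tier config wins. This is only a sketch: `TIER_DEFAULTS` and `resolve_tier_config` are hypothetical names, and the real loader may merge fields differently.

```python
from typing import Any, Dict

# Values copied from the default blocks above.
TIER_DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
    "tier2": {"min_word_count": 1100, "max_word_count": 1500,
              "min_h2_tags": 2, "max_h2_tags": 4,
              "min_h3_tags": 3, "max_h3_tags": 8},
    "tier3": {"min_word_count": 850, "max_word_count": 1350,
              "min_h2_tags": 2, "max_h2_tags": 3,
              "min_h3_tags": 2, "max_h3_tags": 6},
}


def resolve_tier_config(tier_name: str, tier_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay the fields given in the job file on top of the tier defaults."""
    return {**TIER_DEFAULTS[tier_name], **tier_config}


tier1 = resolve_tier_config(
    "tier1", {"count": 5, "min_word_count": 1500, "max_word_count": 2000}
)
tier2 = resolve_tier_config("tier2", {"count": 20})
print(tier1)
print(tier2)
```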
## Minimal Working Example
```json
{
"jobs": [{
"project_id": 1,
"deployment_targets": ["example.com"],
"tiers": {
"tier1": {"count": 5},
"tier2": {"count": 20}
}
}]
}
```
## Your Current Example
```json
{
"jobs": [{
"project_id": 1,
"deployment_targets": ["getcnc.info", "www.textbullseye.com"],
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 1500,
"max_word_count": 2000,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "anthropic/claude-3.5-sonnet"
}
},
"tier2": {
"count": 20,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "openai/gpt-4o-mini"
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4
}
}
}
}]
}
```
## Result Behavior
**Tier 1 Articles (5):**
- 1 link to money site
- 4 See Also links to other tier1 articles
- Home link in nav menu
**Tier 2 Articles (20):**
- 2-4 links to random tier1 articles
- 4-5 See Also links to randomly selected tier2 articles (default `see_also_min`/`see_also_max`)
- Home link in nav menu
**Anchor Text:**
- Tier1: Uses main_keyword from project
- Tier2: Uses related_searches from project
- Can override with anchor_text_config


@ -0,0 +1,34 @@
import sqlite3

conn = sqlite3.connect('content_automation.db')
cursor = conn.cursor()

# Get the first tier2 article (parameterised so the tier is a string value, not an identifier)
cursor.execute(
    'SELECT id FROM generated_content WHERE project_id=? AND tier=? LIMIT 1',
    (1, 'tier2'),
)
row = cursor.fetchone()
if row is None:
    conn.close()
    raise SystemExit('No tier2 articles found for project 1')
tier2_id = row[0]

# Count links by type
cursor.execute('''
    SELECT link_type, COUNT(*)
    FROM article_links
    WHERE from_content_id=?
    GROUP BY link_type
''', (tier2_id,))
print(f'Tier2 article {tier2_id} link counts:')
for link_type, count in cursor.fetchall():
    print(f'  {link_type}: {count}')

# Show the actual tiered links
cursor.execute('''
    SELECT to_url, anchor_text
    FROM article_links
    WHERE from_content_id=? AND link_type='tiered'
''', (tier2_id,))
print(f'\nTiered links for article {tier2_id}:')
for i, (to_url, anchor_text) in enumerate(cursor.fetchall(), 1):
    # to_url can be NULL, so guard the slice
    print(f'  {i}. {anchor_text} -> {(to_url or "")[:60]}...')

conn.close()


@ -365,6 +365,17 @@ class BatchProcessor:
             click.echo(f" {tier_name}: No articles with site assignments to post-process")
             return
+        # Skip articles already post-processed (idempotency check)
+        unprocessed = [a for a in content_records if not a.formatted_html]
+        if not unprocessed:
+            click.echo(f" {tier_name}: All {len(content_records)} articles already post-processed, skipping")
+            return
+        if len(unprocessed) < len(content_records):
+            click.echo(f" {tier_name}: Skipping {len(content_records) - len(unprocessed)} already processed articles")
+            content_records = unprocessed
         click.echo(f" {tier_name}: Post-processing {len(content_records)} articles...")
         # Step 1: Generate URLs (Story 3.1)


@ -63,6 +63,8 @@ class InterlinkingConfig:
     links_per_article_min: int = 2
     links_per_article_max: int = 4
     include_home_link: bool = True
+    see_also_min: int = 4
+    see_also_max: int = 5
 @dataclass
@ -265,16 +267,24 @@ class JobConfig:
         min_links = interlinking_data.get("links_per_article_min", 2)
         max_links = interlinking_data.get("links_per_article_max", 4)
         include_home = interlinking_data.get("include_home_link", True)
+        see_also_min = interlinking_data.get("see_also_min", 4)
+        see_also_max = interlinking_data.get("see_also_max", 5)
         if not isinstance(min_links, int) or min_links < 0:
             raise ValueError("'interlinking' links_per_article_min must be a non-negative integer")
         if not isinstance(max_links, int) or max_links < min_links:
             raise ValueError("'interlinking' links_per_article_max must be >= links_per_article_min")
         if not isinstance(include_home, bool):
             raise ValueError("'interlinking' include_home_link must be a boolean")
+        if not isinstance(see_also_min, int) or see_also_min < 0:
+            raise ValueError("'interlinking' see_also_min must be a non-negative integer")
+        if not isinstance(see_also_max, int) or see_also_max < see_also_min:
+            raise ValueError("'interlinking' see_also_max must be >= see_also_min")
         interlinking = InterlinkingConfig(
             links_per_article_min=min_links,
             links_per_article_max=max_links,
-            include_home_link=include_home
+            include_home_link=include_home,
+            see_also_min=see_also_min,
+            see_also_max=see_also_max
         )
         return Job(


@ -64,14 +64,11 @@ def inject_interlinks(
             html, content, tiered_links, project, job_config, link_repo
         )
-        # Inject homepage link
-        html = _inject_homepage_link(
-            html, content, article_url, project, link_repo
-        )
+        # Note: Home link is now in the navigation menu (template), no need to inject into content
         # Inject See Also section
         html = _inject_see_also_section(
-            html, content, article_urls, link_repo
+            html, content, article_urls, link_repo, job_config
         )
         # Update content in database
@ -199,9 +196,10 @@ def _inject_see_also_section(
     html: str,
     content: GeneratedContent,
     article_urls: List[Dict],
-    link_repo: ArticleLinkRepository
+    link_repo: ArticleLinkRepository,
+    job_config=None
 ) -> str:
-    """Inject See Also section with all other batch articles"""
+    """Inject See Also section with random selection of batch articles"""
     # Get all other articles (excluding current)
     other_articles = [a for a in article_urls if a['content_id'] != content.id]
@ -209,9 +207,18 @@ def _inject_see_also_section(
         logger.info(f"No other articles for See Also section in content {content.id}")
         return html
+    # Get See Also link count (default 4-5)
+    see_also_config = _get_see_also_config(job_config)
+    min_links = see_also_config['min']
+    max_links = see_also_config['max']
+    # Select random articles
+    count = min(random.randint(min_links, max_links), len(other_articles))
+    selected_articles = random.sample(other_articles, count)
     # Build See Also HTML
     see_also_html = "<h3>See Also</h3>\n<ul>\n"
-    for article in other_articles:
+    for article in selected_articles:
         see_also_html += f' <li><a href="{article["url"]}">{article["title"]}</a></li>\n'
     see_also_html += "</ul>\n"
@ -219,7 +226,7 @@ def _inject_see_also_section(
     html = _insert_before_closing_tags(html, see_also_html)
     # Record links
-    for article in other_articles:
+    for article in selected_articles:
         link_repo.create(
             from_content_id=content.id,
             to_content_id=article['content_id'],
@ -228,10 +235,41 @@ def _inject_see_also_section(
             link_type="wheel_see_also"
         )
-    logger.info(f"Injected See Also section with {len(other_articles)} links for content {content.id}")
+    logger.info(f"Injected See Also section with {len(selected_articles)} links for content {content.id}")
     return html
+def _get_see_also_config(job_config) -> Dict[str, int]:
+    """Get See Also link count config, default 4-5"""
+    default_config = {"min": 4, "max": 5}
+    if job_config is None:
+        return default_config
+    # Check for see_also_min/max in interlinking config
+    interlinking = None
+    if hasattr(job_config, 'interlinking'):
+        interlinking = job_config.interlinking
+    elif isinstance(job_config, dict):
+        interlinking = job_config.get('interlinking')
+    if not interlinking:
+        return default_config
+    # Get min/max from interlinking config
+    if isinstance(interlinking, dict):
+        min_val = interlinking.get('see_also_min')
+        max_val = interlinking.get('see_also_max')
+    else:
+        min_val = getattr(interlinking, 'see_also_min', None)
+        max_val = getattr(interlinking, 'see_also_max', None)
+    if min_val is not None and max_val is not None:
+        return {"min": min_val, "max": max_val}
+    return default_config
 def _get_anchor_texts_for_tier(
     tier: str,
     project: Project,


@ -0,0 +1,34 @@
from src.database.session import db_manager
from src.database.repositories import GeneratedContentRepository, ProjectRepository, SiteDeploymentRepository
from src.interlinking.tiered_links import find_tiered_links
from src.generation.job_config import JobConfig
session = db_manager.get_session()
content_repo = GeneratedContentRepository(session)
project_repo = ProjectRepository(session)
site_repo = SiteDeploymentRepository(session)
# Get tier2 articles
tier2_articles = content_repo.get_by_project_and_tier(1, "tier2")
print(f"Found {len(tier2_articles)} tier2 articles")
# Load job config
job_config = JobConfig("jobs/test_shaft_machining.json")
job = job_config.get_jobs()[0]
print(f"\nJob config:")
print(f" tiered_link_count_range: {job.tiered_link_count_range}")
print(f" interlinking: {job.interlinking}")
# Test the function
print(f"\nCalling find_tiered_links()...")
result = find_tiered_links(tier2_articles, job, project_repo, content_repo, site_repo)
print(f"\nResult:")
print(f" tier: {result.get('tier')}")
print(f" lower_tier: {result.get('lower_tier')}")
print(f" Number of URLs selected: {len(result.get('lower_tier_urls', []))}")
print(f" URLs: {result.get('lower_tier_urls', [])}")
session.close()