Add file-first content creation pipeline with Cora inbox watcher

Content tasks now trigger from Cora xlsx files dropped in Z:/content-cora-inbox/
instead of auto-firing from ClickUp polling. The watcher fuzzy-matches files to
ClickUp tasks and auto-detects content type from URL presence (optimization vs
new content). Adds cli_flags support for service page hints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix/customer-field-migration
PeninsulaInd 2026-02-25 17:29:04 -06:00
parent 2ef7ae2607
commit 41487c8d6b
18 changed files with 5807 additions and 34 deletions


@@ -0,0 +1,182 @@
# CNC Swiss Screw Machining: Precision, Process, and When to Use It
CNC Swiss screw machining is a precision turning process for producing small, complex parts at tight tolerances and high volumes. This guide covers how Swiss screw machines work, what makes them different from conventional CNC turning, and how to evaluate a machining partner.
---
## What Is CNC Swiss Screw Machining and How Does It Work?
Swiss screw machining is a CNC turning process that uses a sliding headstock and guide bushing to support bar stock close to the cutting point. The result is reduced deflection, minimal vibration, and tolerances that conventional lathes struggle to achieve.
### Origins and Definition
The Swiss screw machine was developed in Switzerland in the late 1800s to produce the tiny screws and pins required for watchmaking. This early form of precision metalworking used cam-driven automatic lathes — mechanical automation that could repeat the same cuts with consistent accuracy. The design became the foundation for precision small-part manufacturing and fabrication worldwide.
Today's CNC Swiss lathes add programmable multi-axis motion, live tooling, and sub-spindle capability. These Swiss lathes handle complex geometries, tight tolerances, and high production volumes that cam-driven machines could not. Modern CNC machining controls allow manufacturers to program intricate tool paths across multiple axes, producing parts that would have been impossible on earlier automatic lathe designs.
The key distinction from a conventional CNC lathe: on a Swiss lathe, the workpiece moves through a guide bushing while the tools remain in a fixed cutting zone. On a conventional lathe, the tools traverse along a stationary workpiece held by the tailstock and headstock. This difference determines how much deflection occurs during cutting.
### The Sliding Headstock and Guide Bushing
Bar stock feeds through a collet in the sliding headstock, which moves along the Z-axis to advance material into the cutting zone. A guide bushing supports the bar just 1-3mm from where the tool contacts the workpiece.
With the material held rigidly near the cutting point, there is almost no leverage for cutting forces to deflect the workpiece. Vibration is dampened and chatter is reduced, delivering tighter tolerances and better surface finish than conventional turning on the same part geometry.
Guide bushings come in two types. Rotary guide bushings rotate with the workpiece and handle tolerances down to about ±0.0005". Fixed guide bushings do not rotate and are used when even tighter tolerances are required.
### Multi-Tool Simultaneous Operation
CNC Swiss screw machines can mount up to 20 tools and operate several simultaneously. A main spindle handles turning while a sub-spindle machines the back end — all in one setup. This level of automation eliminates manual handling and keeps cycle times short.
Live tooling adds milling, cross-drilling, threading, and tapping directly on the Swiss lathe. Parts that would require three or four setups across different CNC machines come off a Swiss screw machine complete, with no secondary operations needed.
---
## Benefits of CNC Swiss Screw Machining
### Precision and Production Advantages
- **Tolerances of ±0.0002"** are standard, with tighter tolerances achievable on specific features
- **Spindle speeds up to 10,000 RPM** enable efficient cutting of both metals and engineering plastics
- **Continuous bar-fed operation** — bar stock feeds automatically, parts drop off complete, minimal operator intervention
- **Reduced secondary operations** eliminate the cost of moving parts between machines
- **Automation** — bar feeders and CNC machining controls enable lights-out production, reducing labor costs on long runs
Setup is the largest cost driver. After that, per-part costs drop significantly, making Swiss screw machining cost-effective at medium to high production volumes.
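The amortization math behind that claim is simple enough to sketch. The dollar figures below are invented for illustration; the article does not publish actual costs.

```python
# Back-of-envelope setup amortization. All numbers are hypothetical.
def per_part_cost(setup_cost: float, unit_cost: float, qty: int) -> float:
    """Total cost per part once one-time setup is spread across the run."""
    return setup_cost / qty + unit_cost

# With a hypothetical $1,500 setup and $2.00 per-piece machining cost:
short_run = per_part_cost(1500, 2.0, 100)     # setup dominates: $17.00/part
long_run = per_part_cost(1500, 2.0, 10_000)   # setup nearly vanishes: about $2.15/part
```

The fixed setup term shrinks with quantity, which is why the same part that is uneconomical at 100 pieces becomes cheap at 10,000.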
### Materials for Swiss Screw Machining
Swiss screw machines work with a broad range of materials:
- **Stainless steel** — 303, 304, and 316 grades
- **Aluminum** — lightweight aerospace and electronics parts
- **Brass and copper** — electrical contacts, fittings, and connectors
- **Titanium** — medical implants and aerospace fasteners
- **Nickel alloys** — corrosion-resistant components for harsh environments
- **Bronze** — bushings, bearings, and wear components
- **Engineering plastics** — PEEK, Delrin, and nylon
Bar stock must be centerless-ground to ±0.0002" diametric tolerance to feed smoothly through the guide bushing. Exotic alloys like Inconel are workable but require specialized carbide tooling and experienced programming.
### Industries and Common Applications
- **Medical devices** — bone screws, dental implants, surgical instrument shafts, cannulas, and orthopedic pins. Medical applications often require biocompatible materials like titanium or surgical-grade stainless steel, plus full lot traceability.
- **Aerospace** — fasteners, sensor housings, hydraulic fittings, and electrical connectors. Aerospace machining demands tight tolerances, exotic materials, and documented quality processes.
- **Automotive** — fuel injector components, transmission pins, and valve parts produced in high volumes with consistent quality.
- **Electronics** — connector pins, contact sockets, terminal posts, and micro-components where dimensional precision directly affects electrical performance.
- **Defense** — ITAR-compliant precision components for weapons systems, communication equipment, and guidance systems.
Common machined parts include screws, pins, shafts, bushings, contacts, fittings, and cylindrical components with high length-to-diameter ratios.
---
## CNC Swiss Machining vs. Conventional CNC Turning
### Key Differences
| Factor | CNC Swiss Screw Machining | Conventional CNC Turning |
| ------ | ------------------------- | ------------------------ |
| Part diameter | Up to ~32mm (1.25") | Larger parts, no practical limit |
| Tolerances | ±0.0002" standard | ±0.001" typical |
| Complexity | Multi-axis, live tooling, sub-spindle | Typically 2-axis |
| Best volume | Medium to high | Flexible |
| L/D ratio | Excels at 10:1 or more | Limited by deflection |
| Setup cost | Higher | Lower |
| Per-part cost | Lower for small, complex parts | Lower for larger, simpler parts |
The guide bushing is the fundamental differentiator. It allows Swiss lathes to cut long, thin parts without the deflection that makes the same part impossible to hold tolerance on a conventional CNC lathe.
### When NOT to Use Swiss Screw Machining
Consider conventional CNC turning or milling when:
- **Parts exceed 32mm diameter** — larger parts need a conventional CNC lathe or mill
- **Production runs are very short** — for runs of 10-50 pieces, a conventional CNC lathe is more economical
- **Tolerances are relaxed** — if the spec calls for ±0.005" or wider, Swiss machining is overkill
- **The geometry is not cylindrical** — prismatic parts are better suited to 3-axis or 5-axis CNC milling
- **No features benefit from simultaneous operations** — simple turned profiles cost less on a conventional lathe
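The checklist above reduces to a rough screening heuristic. The thresholds come straight from this article; the function name and structure are invented, and a real quoting decision would weigh far more factors.

```python
# The article's rules of thumb, encoded as an illustrative screen.
def swiss_machining_candidate(diameter_mm: float, run_qty: int,
                              tolerance_in: float, cylindrical: bool) -> bool:
    """True if a part fits the Swiss screw machining profile."""
    if diameter_mm > 32:        # beyond typical Swiss bar capacity
        return False
    if run_qty < 50:            # setup cost won't amortize
        return False
    if tolerance_in >= 0.005:   # relaxed spec: Swiss is overkill
        return False
    if not cylindrical:         # prismatic parts suit 3- or 5-axis milling
        return False
    return True
```

A 12mm titanium bone screw at ±0.0005" in a 5,000-piece run passes every test; a 40mm housing or a 20-piece prototype run fails immediately.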
---
## Quality, Certification, and Choosing a Partner
### Industry Certifications and Inspection
Quality in Swiss screw machining depends as much on the management systems behind the machines as on the machines themselves.
Key certifications:
- **ISO 9001:2015** — baseline quality management system standard
- **ISO 13485** — required for medical device component manufacturing
- **ITAR registration** — mandatory for defense-related machining
- **IATF 16949** — automotive quality standard with defect prevention requirements
Inspection methods to ask about:
- **Statistical process control (SPC)** — monitors dimensional trends during production
- **Coordinate measuring machines (CMM)** — 3D dimensional verification of finished parts
- **First article inspection (FAI)** — full dimensional report verifying the setup matches the print
Material traceability is standard in medical and aerospace work and increasingly expected across all industries.
### What to Look for in a Swiss Screw Machining Supplier
- **Machine fleet** — modern CNC Swiss lathes with multi-axis capability, live tooling, and sub-spindles
- **Relevant certifications** — ISO 9001 baseline, plus ISO 13485, ITAR, or IATF 16949 as your industry requires
- **Demonstrated tolerance capability** — sample parts or dimensional reports in your materials
- **In-house secondary operations** — deburring, heat treating, plating, and passivation under one roof
- **Engineering support** — a good partner reviews prints and suggests design optimizations for manufacturability
---
## Get Started with CNC Swiss Screw Machining
CNC Swiss screw machining delivers precision, speed, and repeatability for small-diameter parts that demand tight tolerances. Whether you are producing medical implants, aerospace fasteners, or high-volume electronic connectors, Swiss machining is a proven process for turning complex designs into finished components. Contact us to discuss your project and request a quote.
---
<!-- FOQ SECTION START -->
## Frequently Asked Questions About CNC Swiss Screw Machining
### What Is the Difference Between Swiss Screw Machining and CNC Turning?
Swiss screw machining differs from conventional CNC turning in how the workpiece is supported during cutting. A Swiss screw machine uses a guide bushing to hold the bar stock within 1-3mm of the cutting tool, virtually eliminating deflection and enabling tolerances of ±0.0002". Conventional CNC turning clamps the workpiece without a guide bushing, which limits precision on long, slender parts and typically holds tolerances of ±0.001".
### How Tight Are Swiss Screw Machining Tolerances?
Swiss screw machining tolerances are typically ±0.0002" as a standard capability. This precision is possible because the guide bushing supports the workpiece close to the cutting tool, reducing deflection and vibration that would otherwise compromise dimensional accuracy.
### What Materials Can Be Swiss Screw Machined?
Swiss screw machines can process stainless steel, aluminum, brass, copper, titanium, nickel alloys, bronze, and engineering plastics like PEEK, Delrin, and nylon. Bar stock must be centerless-ground to ±0.0002" diametric tolerance to feed properly through the guide bushing.
### Is Swiss Screw Machining Cost-Effective for Small Production Runs?
Swiss screw machining is generally not cost-effective for very small runs due to significant setup time and tooling costs. The process becomes economical at medium to high volumes where setup cost is amortized across many parts. For runs under 50 pieces, conventional CNC turning is often more economical.
### What Industries Use CNC Swiss Screw Machining?
CNC Swiss screw machining is used extensively in medical device, aerospace, automotive, electronics, and defense manufacturing. These industries require small, complex, precision components produced at tight tolerances and in high volumes — exactly the part profile Swiss screw machines are designed to handle.
### How Does a Guide Bushing Work on a Swiss Screw Machine?
A guide bushing on a Swiss screw machine acts as a stationary support that holds the bar stock just 1-3mm from the cutting tool. As the sliding headstock feeds the workpiece through the bushing along the Z-axis, the bushing prevents the material from deflecting, enabling tighter tolerances and smoother surface finishes.
### What Part Sizes Can a Swiss Screw Machine Handle?
Swiss screw machines handle bar stock up to 32mm (1.25") in diameter. They excel at parts with high length-to-diameter ratios — 10:1 or greater — where conventional lathes would struggle with deflection. Larger parts are better suited to conventional CNC turning or milling.
### Does Swiss Screw Machining Require Secondary Operations?
Swiss screw machining often eliminates secondary operations entirely. With live tooling, sub-spindles, and multi-axis capability, a CNC Swiss machine can perform turning, milling, cross-drilling, threading, tapping, and knurling in a single setup. Parts frequently come off the machine complete.
### What Certifications Should a Swiss Screw Machining Supplier Have?
A Swiss screw machining supplier should hold ISO 9001:2015 as a baseline. Medical work requires ISO 13485, defense applications require ITAR registration, and automotive work calls for IATF 16949. Look for documented inspection processes including SPC, CMM measurement, and first article inspection.
### When Should You Choose Conventional CNC Over Swiss Machining?
Choose conventional CNC turning or milling over Swiss machining when parts exceed 32mm in diameter, production volumes are very low, tolerances are wider than ±0.005", or the geometry is primarily non-cylindrical. Conventional CNC is also better for simple turned profiles that don't benefit from simultaneous multi-tool operations.
<!-- FOQ SECTION END -->


@@ -0,0 +1,167 @@
# Outline: CNC Swiss Screw Machining
**Format:** Comprehensive Guide
**Target word count:** ~1,400 words (cluster target from Cora: 1,342)
**Primary keyword:** cnc swiss screw machining
**Target audience:** Engineers, procurement professionals, and manufacturing decision-makers evaluating Swiss screw machining for their parts
**Heading targets (from Cora Structure):** 1 H1, 4+ H2s, ~10 H3s
---
## H1: CNC Swiss Screw Machining: Precision, Process, and When to Use It
Brief intro: What Swiss screw machining is in one sentence, why it matters for precision small parts, and what the reader will learn.
---
## H2: What Is CNC Swiss Screw Machining and How Does It Work?
Definition + the mechanical process combined into one major section.
### H3: Origins and Definition
- Precision turning process using a sliding headstock and guide bushing
- Developed in Switzerland in the 1800s for watchmaking
- Key distinction from conventional CNC lathes: the workpiece moves, not just the tool
- Modern CNC Swiss machines: programmable, multi-axis, live tooling capable
### H3: The Sliding Headstock and Guide Bushing
- Bar stock feeds through collet in the sliding headstock
- Guide bushing supports material 1-3mm from the cutting tool
- Headstock moves along Z-axis, feeding stock into the tooling zone
- Result: minimal deflection, vibration dampened, tighter tolerances possible
- Guide bushing types: rotary (>±0.0005") vs. fixed (tighter tolerances)
### H3: Multi-Tool Simultaneous Operation
- Up to 20 tools can operate simultaneously
- Main spindle + sub-spindle: machine both ends of a part in one setup
- Live tooling: milling, cross-drilling, threading, tapping without removing the part
- Parts come off the machine complete — minimal secondary operations
~350 words
---
## H2: Benefits of CNC Swiss Screw Machining
### H3: Precision and Production Advantages
- **Precision:** ±0.0002" tolerances, up to 10,000 RPM, micron-level accuracy
- **Reduced secondary operations:** complete parts in one chucking
- **Production speed:** continuous bar-fed operation, minimal downtime
- **Material efficiency:** less waste than conventional machining
- **Cost-effective at volume:** low per-part cost once setup is complete
### H3: Materials for Swiss Screw Machining
- Metals: stainless steel, aluminum, brass, copper, bronze, titanium, nickel alloys
- Plastics: PEEK, Delrin, nylon
- Bar stock requirements: must be centerless-ground to ±0.0002" for optimal results
- Exotic alloys are workable but require specific tooling and speeds
### H3: Industries and Common Applications
- **Medical:** surgical instruments, implants, bone screws, dental components
- **Aerospace:** fasteners, connectors, sensor housings
- **Automotive:** high-volume small precision parts, fuel system components
- **Electronics:** pins, connectors, contacts, micro-components
- **Defense:** ITAR-compliant precision components
- Common part types: screws, pins, shafts, bushings, contacts, fittings
~350 words
---
## H2: CNC Swiss Machining vs. Conventional CNC Turning
### H3: Key Differences
| Factor | Swiss CNC | Conventional CNC |
| ------ | --------- | ---------------- |
| Part diameter | Up to ~32mm (1.25") | Larger parts |
| Tolerances | ±0.0002" standard | ±0.001" typical |
| Complexity | High (multi-axis, live tooling) | Moderate |
| Volume | Best at high volume | Better for short runs |
| Length-to-diameter ratio | Excels at high L/D ratios | Limited by deflection |
### H3: When NOT to Use Swiss Screw Machining
Parts larger than 32mm diameter, very short production runs where setup cost doesn't amortize, parts that don't require tight tolerances, non-cylindrical geometries better suited to 3- or 5-axis milling.
~250 words
---
## H2: Quality, Certification, and Choosing a Partner
### H3: Industry Certifications and Inspection
- ISO 9001:2015 (general quality management)
- ISO 13485 (medical device manufacturing)
- ITAR registration (defense applications)
- IATF 16949 (automotive)
- Inspection methods: SPC, CMM, optical measurement, laser micrometers
- First article inspection, in-process monitoring, material traceability
### H3: What to Look for in a Swiss Screw Machining Supplier
- Machine fleet: modern CNC Swiss machines with multi-axis capability
- Certifications relevant to your industry
- Tolerance capabilities demonstrated with similar materials
- Secondary operations available in-house
- Production volume capacity and lead times
~200 words
---
## Conclusion
Recap + CTA. ~50 words
---
## Structure Summary
| Level | Count | Cora Target (min) |
| ----- | ----- | ----------------- |
| H1 | 1 | 1 |
| H2 | 5 | 4 |
| H3 | 11 | 10 |
## Unique Angles
1. **"When NOT to use Swiss"** — honest guidance that builds trust and captures comparison traffic
2. **Quality/inspection detail** — goes beyond just listing ISO numbers
3. **Supplier selection guidance** — practical buyer help that competitors skip
---
## Fan-Out Query Headings
Separate from main content. Do NOT count against word count or heading targets.
Style as accordions, FAQs, or hidden divs.
Answer format: restate the question in the answer ("How does X work? X works by...").
Each answer: 2-3 sentences max, self-contained.
### H3: What Is the Difference Between Swiss Screw Machining and CNC Turning?
### H3: How Tight Are Swiss Screw Machining Tolerances?
### H3: What Materials Can Be Swiss Screw Machined?
### H3: Is Swiss Screw Machining Cost-Effective for Small Production Runs?
### H3: What Industries Use CNC Swiss Screw Machining?
### H3: How Does a Guide Bushing Work on a Swiss Screw Machine?
### H3: What Part Sizes Can a Swiss Screw Machine Handle?
### H3: Does Swiss Screw Machining Require Secondary Operations?
### H3: What Certifications Should a Swiss Screw Machining Supplier Have?
### H3: When Should You Choose Conventional CNC Over Swiss Machining?


@@ -0,0 +1,122 @@
# Research Summary: CNC Swiss Screw Machining
## Search Term
cnc swiss screw machining
## Sources Analyzed
| Source | URL | Word Count | Angle |
|--------|-----|------------|-------|
| Kerr Screw | kerrscrew.com/swiss-screw-machining-explained/ | ~1,300 | Historical context, automation evolution, applications |
| Avanti Engineering | avantiengineering.com/swiss-screw-machining-benefits-applications/ | ~900 | Benefits, applications, how it works |
| IQS Directory | iqsdirectory.com/.../swiss-screw-machining.html | ~6,500 | Deep technical guide: process, types, tools, materials, prep |
| Hogge Precision | hoggeprecision.com/benefits-of-cnc-swiss-screw-machining/ | ~800 | CNC vs automatic types, benefits, capabilities |
| Cox Manufacturing | coxmanufacturing.com/blog/what-is-swiss-screw-machining/ | ~250 | Brief intro, guide bushing emphasis |
| Nolte Precise | nolteprecise.com/cnc-swiss-screw-machining/ | ~1,100 | High-volume production focus |
| Hartford Technologies | resources.hartfordtechnologies.com/... | — | Swiss vs traditional machining comparison |
| Impro Precision | improprecision.com/introduction-swiss-screw-machining/ | — | Industry applications deep dive |
---
## Common Themes (what everyone covers)
### 1. Definition & History
Every competitor explains that Swiss screw machining originated in Switzerland in the late 1800s for watchmaking. They define it as a precision turning process using a sliding headstock and guide bushing. This is table stakes — must be covered.
### 2. How It Works (Guide Bushing + Sliding Headstock)
Core technical differentiator from conventional CNC lathes:
- Bar stock feeds through a chucking collet in the sliding headstock
- Guide bushing supports the workpiece 1-3mm from the cutting tool
- Headstock moves along Z-axis (vs. conventional lathes where the tool moves)
- Reduces deflection and vibration, enabling tighter tolerances
- Guide bushing types: synchronous rotary (for >±0.0005") and fixed (for tighter tolerances)
### 3. Precision & Tolerances
Consistently cited numbers:
- ±0.0002" to ±0.0005" tolerances standard
- Up to 10,000 RPM spindle speeds
- Bar stock must be centerless-ground to ±0.0002" diametric tolerance
- Surface finish quality superior to conventional turning
### 4. Benefits Over Conventional CNC
Every competitor lists some version of:
- Tighter tolerances (guide bushing reduces deflection)
- Reduced secondary operations (multi-spindle, live tooling)
- Higher production speed for small parts
- Lower per-part cost at volume
- Less material waste
- Simultaneous multi-tool operation (up to 20 tools at once)
### 5. Materials
Standard list: stainless steel, aluminum, brass, copper, bronze, titanium, nickel alloys, and engineering plastics (PEEK, Delrin, nylon). Exotic alloys also mentioned.
### 6. Industries & Applications
Medical (implants, surgical instruments), aerospace (fasteners, connectors), automotive (high-volume small parts), electronics (connectors, pins), defense, hydraulics, telecommunications.
### 7. CNC vs. Automatic (Cam-Driven)
Most competitors distinguish between:
- Automatic/cam-driven machines: simpler geometry, extremely high volume, lower setup flexibility
- CNC Swiss machines: complex geometry, tighter tolerances, programmable, more flexible
---
## Content Structure Patterns
**Short-form competitors** (~250-800 words): Kerr Screw, Hogge, Cox
- Definition → Benefits list → Industries → CTA
- Minimal technical depth, service-page style
**Mid-form competitors** (~900-1,400 words): Avanti, Nolte, Hartford
- Definition → How it works → Benefits → Applications → Swiss vs. conventional comparison
- Moderate technical depth, educational blog style
**Long-form competitors** (~6,500 words): IQS Directory
- Comprehensive guide with chapters: definition → process → types → tools → materials → components → benefits → preparation
- Deep technical reference, encyclopedia style
**Observation:** Most competitors are in the 800-1,400 word range. IQS is an outlier at 6,500+. There's a gap in the 2,000-3,000 word range — content that's thorough enough to be a real resource but not a textbook chapter.
---
## Gaps (what competitors miss or cover poorly)
### 1. Design for Swiss Machining
Only IQS Directory touches on preparation/design considerations. Nobody provides practical guidance for engineers on how to design parts specifically for Swiss screw machining (feature sizes, wall thickness, corner radii, tolerance callouts that are realistic).
### 2. When NOT to Use Swiss Machining
Competitors focus on benefits but rarely discuss limitations or when conventional CNC is actually better (larger parts, short runs, parts without rotational symmetry).
### 3. Cost Breakdown / Economics
Everyone says "cost-effective" but nobody provides actual cost drivers: setup costs, material costs (centerless-ground bar stock premium), tooling costs, volume thresholds where Swiss becomes economical vs. conventional CNC.
### 4. Quality & Inspection Process
Certifications get mentioned (ISO 9001, ISO 13485, ITAR) but the actual inspection process — SPC, CMM measurement, optical inspection, first article inspection — is barely explained.
### 5. Machine Selection (Brand/Model Landscape)
Brief mentions of Tsugami, Citizen, Star, Tornos — but no meaningful comparison of what machines are used or why. Buyers researching this topic often need to understand what machine capabilities their supplier should have.
### 6. Modern Capabilities Beyond Turning
Swiss machines today can do milling, drilling, cross-drilling, threading, knurling, and even gear cutting — but most competitors undersell these capabilities, making Swiss machining sound like it's only for round turned parts.
---
## Potential Unique Angles
1. **"Design for Swiss" section** — Practical engineering guidance on how to design parts that are optimized for Swiss screw machining. This is genuinely useful and nobody covers it well.
2. **Economics / When to Choose Swiss** — Honest cost analysis: volume thresholds, setup costs, when conventional CNC or multi-spindle screw machines are actually better choices. This builds trust and captures comparison-search traffic.
3. **Modern Swiss capabilities** — Position Swiss machining as more than just turning. Cover live tooling, secondary operations, and complex multi-axis work that today's CNC Swiss machines can handle.
---
## Entity Landscape (from competitor content)
Frequently mentioned entities across sources:
- **Machine components:** guide bushing, sliding headstock, spindle, collet, bar feeder, turret, live tooling
- **Materials:** stainless steel, aluminum, brass, titanium, PEEK, Delrin, copper, bronze, nickel
- **Industries:** medical devices, aerospace, automotive, electronics, defense, telecommunications
- **Processes:** turning, milling, drilling, threading, tapping, knurling, parting
- **Quality:** ISO 9001, ISO 13485, ITAR, SPC, CMM, first article inspection
- **Machine brands:** Tsugami, Citizen, Star, Tornos
- **Specifications:** tolerance (±0.0002"), RPM (10,000), bar stock diameter (up to 32mm or 1.25")


@@ -0,0 +1,180 @@
# Brand Voice & Tone Guidelines
Reference for maintaining consistent voice across all written content. These are defaults — override with client-specific guidelines when available.
---
## Voice Archetypes
Choose one primary archetype per brand. A secondary archetype can add nuance but should never dominate.
### Expert
- **Sounds like:** A senior practitioner sharing hard-won knowledge.
- **Characteristics:** Precise, evidence-backed, confident without arrogance. Cites data, references real-world experience, and isn't afraid to say "it depends."
- **Typical vocabulary:** "In practice," "the tradeoff is," "based on our benchmarks," "here's why this matters."
- **Risk to avoid:** Coming across as condescending or overly academic.
- **Best for:** Technical audiences, B2B SaaS, engineering blogs, whitepapers.
### Guide
- **Sounds like:** A patient teacher walking you through something step by step.
- **Characteristics:** Clear, encouraging, anticipates confusion. Breaks complex ideas into digestible pieces. Uses analogies.
- **Typical vocabulary:** "Let's start with," "think of it like," "the key thing to remember," "don't worry if this seems complex."
- **Risk to avoid:** Being patronizing or oversimplifying for an advanced audience.
- **Best for:** Tutorials, onboarding content, documentation, beginner-to-intermediate audiences.
### Innovator
- **Sounds like:** Someone who sees around corners and wants to bring you along.
- **Characteristics:** Forward-looking, curious, willing to challenge assumptions. Connects dots across domains. Thinks in systems.
- **Typical vocabulary:** "What if," "the shift we're seeing," "this changes the calculus," "the next wave."
- **Risk to avoid:** Sounding like hype or vaporware. Must ground vision in evidence.
- **Best for:** Thought leadership, industry analysis, product vision content, founder blogs.
### Friend
- **Sounds like:** A sharp colleague sharing advice over coffee.
- **Characteristics:** Warm, direct, conversational. Uses "you" and "we." Comfortable with humor when it's natural. Doesn't hide behind jargon.
- **Typical vocabulary:** "Here's the thing," "honestly," "we've all been there," "the trick is."
- **Risk to avoid:** Being too casual for high-stakes topics or enterprise audiences.
- **Best for:** Community content, newsletters, brand blogs aimed at practitioners.
### Motivator
- **Sounds like:** A coach who believes in your potential and pushes you to act.
- **Characteristics:** Energetic, action-oriented, focused on outcomes. Uses imperatives. Celebrates progress.
- **Typical vocabulary:** "Start today," "you can do this," "here's your edge," "stop waiting for perfect."
- **Risk to avoid:** Empty cheerleading. Must pair motivation with substance.
- **Best for:** Career content, productivity content, entrepreneurship, course marketing.
---
## Core Writing Principles
These apply regardless of archetype.
### 1. Clarity First
- If a sentence can be misread, rewrite it.
- Use the simplest word that conveys the precise meaning. "Use" over "utilize." "Start" over "commence."
- One idea per paragraph. One purpose per section.
- Define jargon on first use, or skip it entirely.
### 2. Customer-Centric
- Frame everything from the reader's perspective, not the company's.
- **Instead of:** "We built a new feature that enables real-time collaboration."
- **Write:** "You can now edit documents with your team in real time."
- Lead with the reader's problem or goal, not the product or solution.
### 3. Active Voice
- Active voice is the default. Passive voice is acceptable only when the actor is unknown or irrelevant.
- **Active:** "The script generates a report every morning."
- **Passive (acceptable):** "The logs are rotated every 24 hours." (The actor doesn't matter.)
- **Passive (avoid):** "A decision was made to deprecate the endpoint." (Who decided?)
### 4. Show, Don't Claim
- Replace vague claims with specific evidence.
- **Claim:** "Our platform is incredibly fast."
- **Show:** "Queries return in under 50ms at the 99th percentile."
- If you can't provide evidence, soften the language or cut the sentence.
---
## Tone Attributes
Tone shifts based on content type and audience. Use these spectrums to calibrate.
### Formality Spectrum
```
Casual -------|-------|-------|-------|------- Formal
1 2 3 4 5
```
| Level | Description | Use When |
|-------|-------------|----------|
| 1 | Slang OK, sentence fragments, first person | Internal team comms, very informal blogs |
| 2 | Conversational, contractions, direct address | Newsletters, community posts, most blog content |
| 3 | Professional but approachable, minimal contractions | Product announcements, mid-funnel content |
| 4 | Polished, structured, no contractions | Whitepapers, enterprise case studies, executive briefs |
| 5 | Formal, third person, precise terminology | Legal, compliance, academic partnerships |
**Default for most blog/article content: Level 2-3.**
### Technical Depth Spectrum
```
General -------|-------|-------|-------|------- Deep Technical
1 2 3 4 5
```
| Level | Description | Use When |
|-------|-------------|----------|
| 1 | No jargon, analogy-heavy, conceptual | Non-technical stakeholders, general audience |
| 2 | Light jargon (defined inline), practical focus | Business audience with some domain familiarity |
| 3 | Industry-standard terminology, code snippets OK | Practitioners who do the work daily |
| 4 | Assumes working knowledge, implementation details | Developers, engineers, technical decision-makers |
| 5 | Deep internals, performance analysis, tradeoff math | Senior engineers, architects, researchers |
**Default: Match the audience. When unsure, aim one level below what you think the audience can handle. Accessibility wins.**
---
## Language Preferences
### Use Action Verbs
Lead sentences — especially headings and CTAs — with strong verbs.
| Weak | Strong |
|------|--------|
| There is a way to improve | Improve |
| This section is a discussion of | This section covers |
| You should consider using | Use |
| It is important to note that | Note: |
| We are going to walk through | Let's walk through |
### Be Concrete and Specific
Vague language erodes trust. Replace generalities with specifics.
| Vague | Concrete |
|-------|----------|
| "significantly faster" | "3x faster" or "reduced from 12s to 2s" |
| "a large number of users" | "over 40,000 monthly active users" |
| "best-in-class" | describe the specific advantage |
| "seamless integration" | "connects via a single API call" |
| "in the near future" | "by Q2" or "in the next release" |
### Avoid These Patterns
- **Empty intensifiers:** "very," "really," "extremely," "quite," "somewhat" — cut them or replace with data.
- **Nominalizations:** "implementation" when you mean "implement," "utilization" when you mean "use."
- **Hedge stacking:** "It might potentially be possible to perhaps consider..." — commit to a position or state the uncertainty once, clearly.
- **Buzzword chains:** "AI-powered next-gen synergistic platform" — describe what it actually does.
---
## Pre-Publication Checklist
Run through this before publishing any piece of content.
### Voice Consistency
- [ ] Does the piece sound like one person wrote it, beginning to end?
- [ ] Does it match the target voice archetype?
- [ ] Are there jarring shifts in tone between sections?
- [ ] If multiple authors contributed, has it been edited for a unified voice?
### Clarity
- [ ] Can a reader in the target audience understand every sentence on the first read?
- [ ] Is jargon defined or avoided?
- [ ] Are all acronyms expanded on first use?
- [ ] Do headings accurately describe the content beneath them?
- [ ] Is the article scannable? (subheadings every 2-4 paragraphs, short paragraphs, lists where appropriate)
### Value
- [ ] Does the introduction make clear what the reader will gain?
- [ ] Does every section earn its place? (Cut anything that doesn't serve the reader's goal.)
- [ ] Are claims supported by evidence, examples, or data?
- [ ] Is the advice actionable — can the reader do something with it today?
- [ ] Does the conclusion provide a clear next step?
### Formatting
- [ ] Title is under 70 characters and includes the core keyword or topic.
- [ ] Meta description is 140-160 characters and summarizes the value proposition.
- [ ] Headings use parallel structure (all questions, all noun phrases, or all verb phrases — not mixed).
- [ ] Code blocks, tables, and images have context (a sentence before them explaining what the reader is looking at).
- [ ] Links use descriptive anchor text, not "click here."
- [ ] No walls of text — maximum 4 sentences per paragraph for web content.
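The measurable rules in this checklist (title length, meta description length, paragraph size) can be automated. A minimal sketch, assuming plain-text input with paragraphs separated by blank lines; the function name and the punctuation-based sentence counter are illustrative, not part of any existing tooling:

```python
import re

def lint_formatting(title: str, meta_description: str, body: str) -> list[str]:
    """Return human-readable violations of the formatting checklist."""
    issues = []
    # Title: under 70 characters.
    if len(title) >= 70:
        issues.append(f"Title is {len(title)} chars; keep it under 70.")
    # Meta description: 140-160 characters.
    if not 140 <= len(meta_description) <= 160:
        issues.append(
            f"Meta description is {len(meta_description)} chars; aim for 140-160."
        )
    # Paragraphs: maximum 4 sentences. Sentences are approximated as
    # terminal punctuation followed by whitespace or end of paragraph.
    for i, para in enumerate(body.split("\n\n"), start=1):
        sentences = re.findall(r"[.!?](?:\s|$)", para.strip())
        if len(sentences) > 4:
            issues.append(
                f"Paragraph {i} has {len(sentences)} sentences; max is 4."
            )
    return issues
```

A check like this catches the mechanical violations; voice, clarity, and value still need a human pass.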

# Content Frameworks Reference
Quick-reference guide for structuring blog posts and articles. Use these templates as starting points, then adapt to the topic and audience.
---
## Article Templates
### How-To Guide
```
Title: How to [Achieve Specific Outcome] (in [Timeframe/Steps])
Introduction
- State the outcome the reader will achieve
- Briefly explain why this matters or who this is for
- Set expectations: what they need, how long it takes
Prerequisites / What You'll Need (optional)
- Tools, knowledge, or setup required before starting
Step 1: [Action Verb] + [Object]
- What to do and why
- Concrete details, examples, or code snippets
- Common mistake to avoid at this step
Step 2: [Action Verb] + [Object]
- (same pattern)
... (repeat for each step)
Troubleshooting / Common Issues (optional)
- Problem → Cause → Fix, in a quick table or list
Conclusion
- Recap what the reader accomplished
- Suggest a logical next step or related guide
```
**Key principle:** Each step starts with an action verb. One action per step. If a step has sub-steps, break it out.
---
### Listicle
```
Title: [Number] [Adjective] [Things] for [Audience/Goal]
Examples: "9 Underrated Tools for Frontend Performance"
"5 Strategies That Reduced Our Build Time by 60%"
Introduction (2-3 sentences)
- Who this list is for
- What criteria you used to select items
Item 1: [Name or Short Description]
- What it is (1 sentence)
- Why it matters or when to use it (1-2 sentences)
- Concrete example, stat, or tip
Item 2: ...
(repeat)
Wrap-Up
- Quick summary of top picks or situational recommendations
- CTA: ask readers to share their own picks, or link to a deeper dive
```
**Key principle:** Each item must stand alone. Readers skim listicles — front-load the value in each entry. Order by impact (strongest first or last) or by logical progression.
---
### Comparison / Vs Article
```
Title: [Option A] vs [Option B]: [Decision Context]
Example: "Postgres vs MySQL: Which Database Fits Your SaaS in 2026?"
Introduction
- The decision the reader faces
- Who this comparison is for (skill level, use case)
- Summary verdict (give the answer up front, then prove it)
Quick Comparison Table
| Criteria | Option A | Option B |
|-----------------|----------------|----------------|
| [Criterion 1] | ... | ... |
| [Criterion 2] | ... | ... |
| Pricing | ... | ... |
| Best for | ... | ... |
Section: [Criterion 1] Deep Dive
- How A handles it
- How B handles it
- Verdict for this criterion
(repeat for each major criterion)
When to Choose A
- Bullet list of scenarios, use cases, or team profiles
When to Choose B
- Same structure
Final Recommendation
- Restate the summary verdict with nuance
- Suggest next steps (trial links, related guides)
```
**Key principle:** Be opinionated. Readers come to comparison articles for a recommendation, not a feature dump. State your pick early, then support it.
---
### Case Study
```
Title: How [Company/Person] [Achieved Result] with [Method/Tool]
Snapshot (sidebar or callout box)
- Company/person profile
- Challenge in one line
- Result in one line (with numbers)
- Timeline
The Challenge
- Situation before: pain points, constraints, failed attempts
- Why existing solutions weren't working
- Stakes: what would happen if unsolved
The Approach
- What they decided to do and why
- Implementation details (tools, process, decisions)
- Obstacles encountered during execution
The Results
- Quantified outcomes (before/after metrics)
- Qualitative outcomes (team sentiment, workflow changes)
- Timeline to results
Key Takeaways
- 2-4 lessons the reader can apply to their own situation
- What the subject would do differently next time (if anything)
```
**Key principle:** Specifics beat generalities. Use real numbers, timelines, and named tools. A case study without measurable results is just a testimonial.
---
### Thought Leadership
```
Title: [Contrarian Claim] or [Reframed Problem]
Examples: "Your Microservices Migration Will Fail — Here's Why"
"We've Been Thinking About Developer Productivity Wrong"
The Hook
- A bold claim, surprising stat, or industry assumption to challenge
- One paragraph max
The Conventional View
- What most people believe or do today
- Why it seems reasonable on the surface
The Shift
- What's changed (new data, your experience, a trend)
- Why the conventional view no longer holds
- Evidence: data, examples, analogies
The New Mental Model
- Your proposed way of thinking about this
- How it changes decisions or priorities
- 1-2 concrete examples of the new model applied
Implications
- What readers should do differently starting now
- What this means for the industry over the next 1-3 years
Close
- Restate the core insight in one sentence
- Invite discussion or point to your deeper work on this topic
```
**Key principle:** Thought leadership requires a genuine point of view. The article should change how the reader thinks, not just inform them.
---
## Persuasion Frameworks
### AIDA (Attention, Interest, Desire, Action)
Use AIDA to structure the emotional arc of an article, especially product-adjacent or tutorial content.
| Stage | Purpose | Tactics |
|-------|---------|---------|
| **Attention** | Stop the scroll. Earn the click. | Surprising stat, bold claim, relatable pain point in the title and opening line. |
| **Interest** | Convince them to keep reading. | Show you understand their situation. Introduce the core concept or framework. Use subheadings that promise value. |
| **Desire** | Make them want the outcome. | Show results: examples, screenshots, before/after. Paint a picture of life after applying the advice. |
| **Action** | Tell them what to do next. | Specific, low-friction CTA. One action, not five. "Clone the repo," "Try this query," "Read part 2." |
---
### PAS (Problem, Agitate, Solution)
Use PAS for introductions, email content, and articles addressing a known pain point.
| Stage | Purpose | Tactics |
|-------|---------|---------|
| **Problem** | Name the pain clearly. | Describe the situation in the reader's own words. Be specific — "your CI pipeline takes 40 minutes" beats "slow builds." |
| **Agitate** | Make the pain feel urgent. | Show the consequences: wasted time, lost revenue, compounding tech debt. Use "what happens if you don't fix this" framing. |
| **Solution** | Present the path forward. | Introduce your approach, tool, or framework. Transition into the body of the article. |
PAS works best in the first 3-5 paragraphs, then hand off to a structural template (How-To, Listicle, etc.) for the body.
---
## Introduction Patterns
Use one of these patterns for the opening 2-4 sentences. Match the pattern to the article type and audience.
**The Stat Drop**
Open with a surprising number, then connect it to the reader's world.
> "73% of API integrations fail in the first year — not because of bad code, but because of bad documentation."
**The Contrarian Hook**
Challenge a common belief head-on.
> "You don't need a content calendar. What you need is a content system."
**The Pain Mirror**
Describe the reader's frustration in their own words.
> "You've rewritten the onboarding flow three times this quarter. Each time, engagement drops again within a month."
**The Outcome Lead**
Start with the result, then explain how to get there.
> "Our deploy frequency went from weekly to 12x per day. Here's the infrastructure change that made it possible."
**The Story Open**
Begin with a brief, relevant anecdote (3 sentences max).
> "Last March, our team pushed a migration that broke checkout for 6 hours. The post-mortem revealed something we didn't expect."
**The Question**
Ask a question the reader is already asking themselves.
> "Why does every database migration guide assume you have zero traffic?"
---
## Conclusion Patterns
Every conclusion should do two things: (1) reinforce the core takeaway, and (2) give the reader a next step.
**The Recap + CTA**
Summarize the 2-3 key points, then give one clear action.
> "To recap: validate early, test with real data, and deploy incrementally. Ready to try it? Start with [specific first step]."
**The Implication Close**
Zoom out. Connect the article's advice to a bigger trend or outcome.
> "This isn't just about faster deploys — it's about building a team that ships with confidence."
**The Next Step Bridge**
Point to a logical follow-up resource or action.
> "Now that your monitoring is in place, the next step is setting up alerting thresholds. We cover that in [linked article]."
**The Challenge Close**
Issue a direct, friendly challenge to the reader.
> "Pick one of these patterns and apply it to your next pull request. See what changes."
**The Open Loop**
Tease upcoming content or unresolved questions to drive return visits.
> "We've covered the read path. In part 2, we'll tackle the write path — where the real complexity lives."

"""
Competitor Content Scraper
Fetches web pages and extracts clean text content for analysis.
Used as a utility when the user provides a list of URLs to examine.
Usage:
uv run --with requests,beautifulsoup4 python competitor_scraper.py URL1 URL2 ...
[--output-dir ./working/competitor_content/]
[--format json|text]
"""
import argparse
import json
import re
import sys
import time
from pathlib import Path
from urllib.parse import urlparse
try:
import requests
from bs4 import BeautifulSoup
except ImportError:
print(
"Error: requests and beautifulsoup4 are required.\n"
"Install with: uv add requests beautifulsoup4",
file=sys.stderr,
)
sys.exit(1)
UNWANTED_TAGS = [
"nav", "footer", "header", "aside", "script", "style", "noscript",
"iframe", "form", "button", "svg", "img", "video", "audio",
]
UNWANTED_CLASSES = [
"nav", "navbar", "navigation", "menu", "sidebar", "footer", "header",
"breadcrumb", "cookie", "popup", "modal", "advertisement", "ad-",
"social", "share", "comment", "related-posts",
]
DEFAULT_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
}
class CompetitorScraper:
"""Fetches and cleans web page content for competitor analysis."""
def __init__(self, timeout: int = 15, delay: float = 1.0):
"""
Args:
timeout: Request timeout in seconds.
delay: Delay between requests in seconds (rate limiting).
"""
self.timeout = timeout
self.delay = delay
self.session = requests.Session()
self.session.headers.update(DEFAULT_HEADERS)
def scrape_url(self, url: str) -> dict:
"""Scrape a single URL and extract clean content.
Returns:
Dict with: url, host, title, meta_description, headings, text, word_count, error
"""
result = {
"url": url,
"host": urlparse(url).netloc,
"title": "",
"meta_description": "",
"headings": [],
"text": "",
"word_count": 0,
"error": None,
}
try:
response = self.session.get(url, timeout=self.timeout)
response.raise_for_status()
response.encoding = response.apparent_encoding or "utf-8"
html = response.text
except requests.RequestException as e:
result["error"] = str(e)
return result
soup = BeautifulSoup(html, "html.parser")
# Extract title
title_tag = soup.find("title")
if title_tag:
result["title"] = title_tag.get_text(strip=True)
# Extract meta description
meta_desc = soup.find("meta", attrs={"name": "description"})
if meta_desc and meta_desc.get("content"):
result["meta_description"] = meta_desc["content"].strip()
# Extract headings before cleaning
result["headings"] = self._extract_headings(soup)
# Clean the HTML and extract main text
result["text"] = self._extract_text(soup)
result["word_count"] = len(result["text"].split())
return result
def scrape_urls(self, urls: list[str]) -> list[dict]:
"""Scrape multiple URLs with rate limiting.
Args:
urls: List of URLs to scrape.
Returns:
List of result dicts from scrape_url.
"""
results = []
for i, url in enumerate(urls):
if i > 0:
time.sleep(self.delay)
print(f" Scraping [{i + 1}/{len(urls)}]: {url}", file=sys.stderr)
result = self.scrape_url(url)
if result["error"]:
print(f" Error: {result['error']}", file=sys.stderr)
else:
print(f" OK: {result['word_count']} words", file=sys.stderr)
results.append(result)
return results
def save_results(self, results: list[dict], output_dir: str) -> list[str]:
"""Save scraped results as individual text files.
Args:
results: List of result dicts from scrape_urls.
output_dir: Directory to write files to.
Returns:
List of file paths written.
"""
out_path = Path(output_dir)
out_path.mkdir(parents=True, exist_ok=True)
saved = []
for result in results:
if result["error"] or not result["text"]:
continue
# Create filename from host
host = result["host"].replace("www.", "")
safe_name = re.sub(r'[^\w\-.]', '_', host)
filepath = out_path / f"{safe_name}.txt"
content = self._format_output(result)
filepath.write_text(content, encoding="utf-8")
saved.append(str(filepath))
return saved
def _extract_headings(self, soup: BeautifulSoup) -> list[dict]:
"""Extract all headings (h1-h6) with their level and text."""
headings = []
for tag in soup.find_all(re.compile(r'^h[1-6]$')):
level = int(tag.name[1])
text = tag.get_text(strip=True)
if text:
headings.append({"level": level, "text": text})
return headings
def _extract_text(self, soup: BeautifulSoup) -> str:
"""Extract clean body text from HTML, stripping navigation and boilerplate."""
# Remove unwanted tags
for tag_name in UNWANTED_TAGS:
for tag in soup.find_all(tag_name):
tag.decompose()
# Remove elements with unwanted class names
for element in list(soup.find_all(True)):
if element.attrs is None:
continue
classes = element.get("class", [])
if isinstance(classes, list):
class_str = " ".join(classes).lower()
else:
class_str = str(classes).lower()
el_id = str(element.get("id", "")).lower()
for pattern in UNWANTED_CLASSES:
if pattern in class_str or pattern in el_id:
element.decompose()
break
# Try to find main content area
main_content = (
soup.find("main")
or soup.find("article")
or soup.find("div", {"role": "main"})
or soup.find("div", class_=re.compile(r'content|article|post|entry', re.I))
or soup.body
or soup
)
# Extract text with some structure preserved
text = main_content.get_text(separator="\n", strip=True)
# Clean up excessive whitespace
lines = []
for line in text.splitlines():
line = line.strip()
if line:
lines.append(line)
return "\n".join(lines)
def _format_output(self, result: dict) -> str:
"""Format a single result as a readable text file."""
lines = [
f"URL: {result['url']}",
f"Title: {result['title']}",
f"Meta Description: {result['meta_description']}",
f"Word Count: {result['word_count']}",
"",
"--- HEADINGS ---",
]
for h in result["headings"]:
indent = " " * (h["level"] - 1)
lines.append(f"{indent}H{h['level']}: {h['text']}")
lines.extend(["", "--- CONTENT ---", "", result["text"]])
return "\n".join(lines)
def main():
parser = argparse.ArgumentParser(description="Scrape competitor web pages for content analysis")
parser.add_argument("urls", nargs="+", help="URLs to scrape")
parser.add_argument(
"--output-dir",
default="./working/competitor_content",
help="Directory to save scraped content (default: ./working/competitor_content/)",
)
parser.add_argument(
"--format",
choices=["json", "text"],
default="text",
help="Output format for stdout (default: text)",
)
parser.add_argument(
"--timeout",
type=int,
default=15,
help="Request timeout in seconds (default: 15)",
)
parser.add_argument(
"--delay",
type=float,
default=1.0,
help="Delay between requests in seconds (default: 1.0)",
)
args = parser.parse_args()
scraper = CompetitorScraper(timeout=args.timeout, delay=args.delay)
results = scraper.scrape_urls(args.urls)
# Save files
saved = scraper.save_results(results, args.output_dir)
print(f"\nSaved {len(saved)} files to {args.output_dir}", file=sys.stderr)
# Output to stdout
successful = [r for r in results if not r["error"]]
if args.format == "json":
print(json.dumps(successful, indent=2))
else:
for r in successful:
print(scraper._format_output(r))
print("\n" + "=" * 80 + "\n")
if __name__ == "__main__":
main()

"""
Cora SEO Report Parser
Reads a Cora XLSX file and extracts structured data from relevant sheets.
Used as a foundation module by entity_optimizer, lsi_optimizer, and seo_optimizer.
Usage:
uv run --with openpyxl python cora_parser.py <xlsx_path> [--sheet SHEET] [--format FORMAT]
Options:
--sheet Which data to extract: entities, lsi, variations, results, tunings,
structure, densities, targets, summary, all (default: summary)
--format Output format: json, text (default: text)
"""
import argparse
import json
import math
import re
import sys
from pathlib import Path
try:
import openpyxl
except ImportError:
print("Error: openpyxl is required. Install with: uv add openpyxl", file=sys.stderr)
sys.exit(1)
# =============================================================================
# Optimization Rules
#
# Hard-wired overrides that apply regardless of what Cora data says.
# These encode expert SEO knowledge and practical constraints.
# =============================================================================
OPTIMIZATION_RULES = {
# Heading rules
"h1_max": 1, # Never more than 1 H1
"h1_min": 1, # Always have exactly 1 H1
"optimize_headings": ["h1", "h2", "h3"], # Primary optimization targets
"low_priority_headings": ["h4"], # Only add if most competitors have them
"ignore_headings": ["h5", "h6"], # Skip entirely
# Keyword density
"exact_match_density_min": 0.02, # 2% minimum for exact match keyword
"no_keyword_stuffing_limit": True, # Do NOT flag for keyword stuffing
# Variations capture exact match, so hitting variation density covers it
# Word count strategy
"word_count_strategy": "cluster", # "cluster" = nearest competitive cluster, not raw average
"word_count_acceptable_max": 1500, # Up to 1500 is always acceptable even if target is lower
# Density awareness
"density_interdependent": True, # Adding content changes all density calculations
# Entity / LSI filtering
"exclude_competitor_entities": True, # Never use competitor company names as entities or LSI
"exclude_measurement_entities": True, # Ignore measurements (dimensions, tolerances) as entities
"allow_organization_entities": True, # Organizations like ISO, ANSI, etc. are OK
"never_mention_competitors": True, # Never mention competitors by name in content
}
class CoraReport:
"""Parses a Cora SEO XLSX report and provides structured access to its data."""
def __init__(self, xlsx_path: str):
self.path = Path(xlsx_path)
if not self.path.exists():
raise FileNotFoundError(f"XLSX file not found: {xlsx_path}")
self.wb = openpyxl.load_workbook(str(self.path), data_only=True)
self._site_domain = None # Cached after first detection
# -------------------------------------------------------------------------
# Core metadata
# -------------------------------------------------------------------------
def get_sheet_names(self) -> list[str]:
return self.wb.sheetnames
def get_search_term(self) -> str:
"""Extract the target keyword from the report."""
for sheet_name in ["Basic Tunings", "Strategic Overview", "Structure"]:
if sheet_name not in self.wb.sheetnames:
continue
ws = self.wb[sheet_name]
for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
if row and row[0] == "Search Terms" and len(row) > 1 and row[1]:
return str(row[1])
return ""
def get_variations_list(self) -> list[str]:
"""Extract the keyword variations list from Strategic Overview B10.
These are pipe-delimited inside curly braces:
{cnc screw|cnc screw machining|cnc swiss|...}
"""
if "Strategic Overview" not in self.wb.sheetnames:
return []
ws = self.wb["Strategic Overview"]
rows = list(ws.iter_rows(min_row=1, max_row=12, values_only=True))
for row in rows:
if row and row[0] == "Keywords" and len(row) > 1 and row[1]:
raw = str(row[1]).strip()
# Remove curly braces and split on pipe
raw = raw.strip("{}")
return [v.strip() for v in raw.split("|") if v.strip()]
return []
def get_site_domain(self) -> str:
"""Detect the user's site domain from the report.
Looks for the domain in the Entities sheet header (column with a .com/.net etc.
that isn't a standard Cora column) or the site column in other sheets.
"""
if self._site_domain:
return self._site_domain
# Try Entities sheet first
if "Entities" in self.wb.sheetnames:
ws = self.wb["Entities"]
rows = list(ws.iter_rows(min_row=1, max_row=5, values_only=True))
for row in rows:
if row and row[0] == "Entity":
for h in row:
if h and isinstance(h, str):
h = h.strip()
if re.match(r'^[a-zA-Z0-9-]+\.[a-zA-Z]{2,}$', h):
self._site_domain = h
return h
# Try LSI Keywords sheet — header like "#40.7 hoggeprecision.com"
if "LSI Keywords" in self.wb.sheetnames:
ws = self.wb["LSI Keywords"]
rows = list(ws.iter_rows(min_row=1, max_row=10, values_only=True))
for row in rows:
if row and row[0] == "LSI Keyword":
for h in row:
if h and isinstance(h, str):
match = re.search(r'([a-zA-Z0-9-]+\.[a-zA-Z]{2,})', h.strip())
if match:
self._site_domain = match.group(1)
return self._site_domain
return ""
# -------------------------------------------------------------------------
# Entities
# -------------------------------------------------------------------------
def get_entities(self) -> list[dict]:
"""Extract entities from the Entities sheet.
Returns list of dicts with: name, freebase_id, wikidata_id, wiki_link,
relevance, confidence, type, correlation, current_count, max_count, deficit
"""
if "Entities" not in self.wb.sheetnames:
return []
ws = self.wb["Entities"]
rows = list(ws.iter_rows(values_only=True))
# Find header row containing "Entity", "Freebase ID", etc.
header_idx = None
for i, row in enumerate(rows):
if row and row[0] == "Entity" and len(row) > 1 and row[1] == "Freebase ID":
header_idx = i
break
if header_idx is None:
return []
headers = rows[header_idx]
col_map = {str(h).strip(): j for j, h in enumerate(headers) if h}
# Find the site-specific column (domain name like "hoggeprecision.com")
site_col_idx = None
site_domain = self.get_site_domain()
if site_domain:
site_col_idx = col_map.get(site_domain)
entities = []
for row in rows[header_idx + 1:]:
if not row or not row[0]:
continue
name = str(row[0]).strip()
if not name:
continue
# Skip rows that look like metadata (e.g., "critical values: ...")
if name.startswith("critical") or name.startswith("http"):
continue
entity = {
"name": name,
"freebase_id": _safe_str(row, col_map.get("Freebase ID")),
"wikidata_id": _safe_str(row, col_map.get("Wikidata ID")),
"wiki_link": _safe_str(row, col_map.get("Wiki Link")),
"relevance": _safe_float(row, col_map.get("Relevance")),
"confidence": _safe_float(row, col_map.get("Confidence")),
"type": _safe_str(row, col_map.get("Type")),
"correlation": _safe_float(row, col_map.get("Best of Both")),
"current_count": _safe_int(row, site_col_idx),
"max_count": _safe_int(row, col_map.get("Max")),
"deficit": _safe_int(row, col_map.get("Deficit")),
}
entities.append(entity)
return entities
# -------------------------------------------------------------------------
# LSI Keywords
# -------------------------------------------------------------------------
def get_lsi_keywords(self) -> list[dict]:
"""Extract LSI keywords from the LSI Keywords sheet.
Returns list of dicts with: keyword, spearmans, pearsons, best_of_both,
pages, max, avg, current_count, deficit
"""
if "LSI Keywords" not in self.wb.sheetnames:
return []
ws = self.wb["LSI Keywords"]
rows = list(ws.iter_rows(values_only=True))
# Find header row containing "LSI Keyword", "Spearmans", etc.
header_idx = None
for i, row in enumerate(rows):
if row and row[0] == "LSI Keyword":
header_idx = i
break
if header_idx is None:
return []
headers = rows[header_idx]
col_map = {str(h).strip(): j for j, h in enumerate(headers) if h}
# Find site column — pattern like "#40.7 hoggeprecision.com"
site_col_idx = None
site_domain = self.get_site_domain()
if site_domain:
for j, h in enumerate(headers):
if h and isinstance(h, str) and site_domain in h:
site_col_idx = j
break
if site_col_idx is None:
site_col_idx = _find_site_col_idx(headers)
lsi_keywords = []
for row in rows[header_idx + 1:]:
if not row or not row[0]:
continue
keyword = str(row[0]).strip()
if not keyword:
continue
lsi = {
"keyword": keyword,
"spearmans": _safe_float(row, col_map.get("Spearmans")),
"pearsons": _safe_float(row, col_map.get("Pearsons")),
"best_of_both": _safe_float(row, col_map.get("Best of Both")),
"pages": _safe_int(row, col_map.get("Pages")),
"max": _safe_int(row, col_map.get("Max")),
"avg": _safe_float(row, col_map.get("Avg")),
"current_count": _safe_int(row, site_col_idx),
"deficit": _safe_float(row, col_map.get("Deficit")),
}
lsi_keywords.append(lsi)
return lsi_keywords
# -------------------------------------------------------------------------
# Keyword Variations
# -------------------------------------------------------------------------
def get_keyword_variations(self) -> list[dict]:
"""Extract keyword variation counts from the Variations sheet.
Returns list of dicts with: variation, page1_max, page1_avg
"""
if "Variations" not in self.wb.sheetnames:
return []
ws = self.wb["Variations"]
rows = list(ws.iter_rows(values_only=True))
if not rows or len(rows) < 3:
return []
header_row = rows[0]
# Find where variation columns start (after "# used" column)
var_start = 3 # default
for j, h in enumerate(header_row):
if h and str(h).strip() == "# used":
var_start = j + 1
break
max_row = rows[1] if len(rows) > 1 else None
avg_row = rows[2] if len(rows) > 2 else None
variations = []
for j in range(var_start, len(header_row)):
name = header_row[j]
if not name:
continue
variation = {
"variation": str(name).strip(),
"page1_max": _safe_int(max_row, j) if max_row else 0,
"page1_avg": _safe_int(avg_row, j) if avg_row else 0,
}
variations.append(variation)
return variations
# -------------------------------------------------------------------------
# Structure Targets (per-element targets from Structure sheet)
# -------------------------------------------------------------------------
def get_structure_targets(self) -> dict:
"""Extract per-element optimization targets from the Structure sheet.
Returns a dict keyed by element type with sub-targets:
{
"title_tag": {"exact_match": 0.2, "variations": 1.3, "entities": 5.8, "lsi_words": 10.7},
"meta_description": {...},
"all_h_tags": {"count": 20.7, "exact_match": 0.4, "variations": 5.7, "entities": 45.8, "lsi_words": 77.4},
"h1": {"count": 1.1, "exact_match": 0.1, "variations": 1, "entities": 3.8, "lsi_words": 7.3},
"h2": {...},
"h3": {...},
"h4": {...},
}
Page 1 Average values are in column D (index 3).
"""
if "Structure" not in self.wb.sheetnames:
return {}
ws = self.wb["Structure"]
rows = list(ws.iter_rows(values_only=True))
# Find the header row with "Factor Name", "Page 1 Avg" etc.
header_idx = None
for i, row in enumerate(rows):
if row and len(row) > 3:
if row[2] == "Factor Name" or (row[1] == "Factor ID" and row[2] == "Factor Name"):
header_idx = i
break
# Also check for the combined "Best of Both Correlation" header
if row[0] and "Best of Both" in str(row[0]):
header_idx = i
break
if header_idx is None:
return {}
# Parse factor rows into sections
# Section headers: "TITLE TAG", "META DESCRIPTION", "TOTAL FOR ALL H TAGS",
# "H1 Data", "H2 Data", "H3 Data", "H4 Data", "H5 Data", "H6 Data"
section_map = {
"TITLE TAG": "title_tag",
"META DESCRIPTION": "meta_description",
"TOTAL FOR ALL H TAGS": "all_h_tags",
"H1 Data": "h1",
"H2 Data": "h2",
"H3 Data": "h3",
"H4 Data": "h4",
}
# Factor name patterns to field names
factor_patterns = {
"Number of": "count",
"Exact Match": "exact_match",
"Variation": "variations",
"Entities": "entities",
"LSI": "lsi_words",
"Search Term": "search_terms",
"Keywords": "keywords",
}
targets = {}
current_section = None
for row in rows[header_idx + 1:]:
if not row or len(row) < 4:
continue
factor_name = _safe_str(row, 2)
# Check if this is a section header
if factor_name in section_map:
current_section = section_map[factor_name]
targets[current_section] = {}
continue
# Skip sections we don't care about (H5, H6)
if factor_name in ("H5 Data", "H6 Data"):
current_section = None
continue
if current_section is None:
continue
# Get the Page 1 Average (column D, index 3)
avg_val = _safe_float(row, 3)
if avg_val is None:
continue
# Map factor name to field
field_name = None
for pattern, field in factor_patterns.items():
if pattern.lower() in factor_name.lower():
field_name = field
break
if field_name and current_section:
# Also grab correlation from column A
correlation = _safe_float(row, 0)
# Outlier detection: check if one of the top 10 results
# contributes >50% of the sum. If so, exclude it and
# recompute the average — that outlier is skewing the target.
top10 = [_safe_float(row, j) or 0 for j in range(4, 14)]
top10_sum = sum(top10)
adjusted_avg = avg_val
outlier_detected = False
if top10_sum > 0:
max_val = max(top10)
if max_val > top10_sum * 0.5 and avg_val > 1:
# One result is >50% of the total — outlier.
# Skip adjustment when avg <= 1: a single "1" among
# zeros triggers the rule but the target is already
# small enough that adjustment would zero it out.
# Remove exactly one instance of the outlier. (A list
# comprehension on v != max_val would drop every duplicate.)
remaining = top10[:]
remaining.remove(max_val)
if remaining:
adjusted_avg = sum(remaining) / len(remaining)
outlier_detected = True
target_val = math.ceil(adjusted_avg)
entry = {
"avg": avg_val,
"target": target_val,
"correlation": correlation,
}
if outlier_detected:
entry["outlier_adjusted"] = True
entry["original_target"] = math.ceil(avg_val)
targets[current_section][field_name] = entry
return targets
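
The outlier rule above can be exercised in isolation. A minimal sketch with made-up numbers (`adjust_for_outlier` is a hypothetical standalone helper for illustration; the real method reads its values from the worksheet row):

```python
import math

def adjust_for_outlier(top10: list[float], avg: float) -> tuple[float, bool]:
    """Recompute the average without a single dominant result.

    Mirrors the rule above: if one of the top-10 values contributes more
    than half of the sum (and the average is large enough to adjust),
    drop one instance of it and average the rest.
    """
    total = sum(top10)
    if total <= 0 or avg <= 1:
        return avg, False
    max_val = max(top10)
    if max_val <= total * 0.5:
        return avg, False
    remaining = top10[:]
    remaining.remove(max_val)  # remove exactly one instance
    if not remaining:
        return avg, False
    return sum(remaining) / len(remaining), True

# One site with 60 mentions among sites with ~5 would inflate the target;
# dropping it yields 40/9 ≈ 4.44, so target = ceil(4.44) = 5 instead of 10.
adjusted, flagged = adjust_for_outlier([60, 5, 4, 6, 5, 3, 4, 5, 6, 2], 10.0)
```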
# -------------------------------------------------------------------------
# Density Targets (from Strategic Overview rows 46-48)
# -------------------------------------------------------------------------
def get_density_targets(self) -> dict:
"""Extract density targets from Strategic Overview rows 46-48.
Row 46: Variation density
Row 47: Entity density
Row 48: LSI density
Column D (index 3) = Page 1 Average.
Returns per-result values so we can show distribution.
"""
if "Strategic Overview" not in self.wb.sheetnames:
return {}
ws = self.wb["Strategic Overview"]
rows = list(ws.iter_rows(values_only=True))
# Find the density rows — they're the last 3 non-empty rows in the data section
# Look for them near row 46-48 area, identified by having floats in col D
# and being near the bottom of the data
# Approach: find the row with "Relevant Density" and the 3 rows after the gap
density_area_start = None
for i, row in enumerate(rows):
if row and len(row) > 2 and row[2] == "Relevant Density":
# Density target rows are a few rows below this
density_area_start = i
break
if density_area_start is None:
return {}
# The 3 density rows come after a gap. They have NO values in cols A, B, C —
# only numeric values from col D onward. Row 44 (which has a correlation in
# col A) is a count row, not a density row, so we skip it.
density_rows = []
for i in range(density_area_start + 1, min(density_area_start + 10, len(rows))):
row = rows[i]
if not row:
continue
col_a = row[0] if len(row) > 0 else None
col_b = row[1] if len(row) > 1 else None
col_c = row[2] if len(row) > 2 else None
col_d = row[3] if len(row) > 3 else None
# Density rows have None in A, B, C and a float in D
if col_a is None and col_b is None and col_c is None and col_d is not None:
try:
float(col_d)
density_rows.append(row)
except (ValueError, TypeError):
pass
# Get result domains from row 22 area for the site column
result_start_col = 4 # Results start at col E (index 4)
result = {}
labels = ["variation_density", "entity_density", "lsi_density"]
for idx, label in enumerate(labels):
if idx >= len(density_rows):
break
row = density_rows[idx]
avg = _safe_float(row, 3)
# Collect per-competitor values
competitor_vals = []
for j in range(result_start_col, min(result_start_col + 10, len(row))):
v = _safe_float(row, j)
if v is not None:
competitor_vals.append(v)
result[label] = {
"avg": avg,
"avg_pct": f"{avg * 100:.2f}%" if avg else "N/A",
"competitor_values": competitor_vals,
}
return result
# -------------------------------------------------------------------------
# Content Targets (word count, distinct entities, etc.)
# -------------------------------------------------------------------------
def get_content_targets(self) -> dict:
"""Extract key content-level targets from Strategic Overview.
Includes: word count distribution, distinct entities target, variations in HTML, etc.
"""
if "Strategic Overview" not in self.wb.sheetnames:
return {}
ws = self.wb["Strategic Overview"]
rows = list(ws.iter_rows(values_only=True))
targets = {}
result_start_col = 4
for i, row in enumerate(rows):
if not row or len(row) < 4:
continue
factor_name = _safe_str(row, 2)
factor_id = _safe_str(row, 1)
correlation = _safe_float(row, 0)
avg = _safe_float(row, 3)
if not factor_name or avg is None:
continue
# Key factors we care about
if factor_name == "Number of Distinct Entities Used":
competitor_vals = []
for j in range(result_start_col, min(result_start_col + 10, len(row))):
v = _safe_float(row, j)
if v is not None:
competitor_vals.append(int(v))
targets["distinct_entities"] = {
"factor_id": factor_id,
"avg": avg,
"target": math.ceil(avg),
"correlation": correlation,
"competitor_values": competitor_vals,
}
elif factor_name == "Variations in HTML Tags":
targets["variations_in_html"] = {
"factor_id": factor_id,
"avg": avg,
"target": math.ceil(avg),
"correlation": correlation,
}
elif factor_name == "Entities in the HTML Tag":
targets["entities_in_html"] = {
"factor_id": factor_id,
"avg": avg,
"target": math.ceil(avg),
"correlation": correlation,
}
return targets
def get_word_count_distribution(self) -> dict:
"""Get word count data for competitive cluster analysis.
Returns the clean word count for each competitor from the Keywords sheet,
sorted ascending, plus the Page 1 Average and suggested cluster target.
"""
if "Keywords" not in self.wb.sheetnames:
return {}
ws = self.wb["Keywords"]
rows = list(ws.iter_rows(values_only=True))
if not rows:
return {}
headers = rows[0]
col_map = {str(h).strip(): j for j, h in enumerate(headers) if h}
host_idx = col_map.get("Host")
clean_wc_idx = col_map.get("Clean Word Count")
if host_idx is None or clean_wc_idx is None:
return {}
# Collect word counts for page 1 results (top 10)
competitors = []
for row in rows[1:11]:
if not row or not row[host_idx]:
continue
wc = _safe_int(row, clean_wc_idx)
if wc and wc > 0:
competitors.append({
"host": str(row[host_idx]),
"clean_word_count": wc,
})
if not competitors:
return {}
# Sort by word count
competitors.sort(key=lambda x: x["clean_word_count"])
counts = [c["clean_word_count"] for c in competitors]
# Calculate cluster target
avg = sum(counts) / len(counts)
median = counts[len(counts) // 2]
cluster_target = _find_cluster_target(counts)
return {
"competitors": competitors,
"counts_sorted": counts,
"average": round(avg),
"median": median,
"cluster_target": cluster_target,
"min": counts[0],
"max": counts[-1],
}
# -------------------------------------------------------------------------
# Basic Tunings
# -------------------------------------------------------------------------
def get_basic_tunings(self) -> list[dict]:
"""Extract on-page tuning factors from the Basic Tunings sheet."""
if "Basic Tunings" not in self.wb.sheetnames:
return []
ws = self.wb["Basic Tunings"]
rows = list(ws.iter_rows(values_only=True))
# Find sub-header row with "Factor ID", "Factor"
header_idx = None
for i, row in enumerate(rows):
if row and len(row) > 2 and row[1] == "Factor ID" and row[2] == "Factor":
header_idx = i
break
if header_idx is None:
return []
tunings = []
for row in rows[header_idx + 1:]:
if not row:
continue
factor_id = row[1] if len(row) > 1 else None
if not factor_id or not str(factor_id).strip():
continue
factor_id_str = str(factor_id).strip()
if not re.match(r'^[A-Z]{2,}\d+', factor_id_str):
continue
tuning = {
"factor_id": factor_id_str,
"factor": _safe_str(row, 2),
"current": _safe_str(row, 3),
"goal": _safe_str(row, 4),
"percent": _safe_float(row, 5),
"recommendation": _safe_str(row, 6),
}
tunings.append(tuning)
return tunings
# -------------------------------------------------------------------------
# Competitor URLs (Results sheet)
# -------------------------------------------------------------------------
def get_competitor_urls(self) -> list[dict]:
"""Extract competitor URLs from the Results sheet."""
if "Results" not in self.wb.sheetnames:
return []
ws = self.wb["Results"]
rows = list(ws.iter_rows(values_only=True))
if not rows:
return []
headers = rows[0]
col_map = {str(h).strip(): j for j, h in enumerate(headers) if h}
results = []
for row in rows[1:]:
if not row or not row[0]:
continue
result = {
"rank": _safe_int(row, col_map.get("Rank")),
"host": _safe_str(row, col_map.get("Host")),
"url": _safe_str(row, col_map.get("URL")),
"title": _safe_str(row, col_map.get("Link Text")),
"summary": _safe_str(row, col_map.get("Summary")),
}
results.append(result)
return results
# -------------------------------------------------------------------------
# Summary
# -------------------------------------------------------------------------
def get_summary(self) -> dict:
"""Get a high-level summary of the Cora report with all key targets."""
entities = self.get_entities()
lsi = self.get_lsi_keywords()
variations = self.get_variations_list()
tunings = self.get_basic_tunings()
results = self.get_competitor_urls()
density = self.get_density_targets()
content = self.get_content_targets()
wc_dist = self.get_word_count_distribution()
# Find word count goal from tunings
word_count_goal = None
for t in tunings:
if t["factor"] == "Word Count":
word_count_goal = t["goal"]
break
entities_with_deficit = [e for e in entities if e["deficit"] and e["deficit"] > 0]
lsi_with_deficit = [l for l in lsi if l["deficit"] and l["deficit"] > 0]
return {
"search_term": self.get_search_term(),
"site_domain": self.get_site_domain(),
"keyword_variations": variations,
"total_entities": len(entities),
"entities_with_deficit": len(entities_with_deficit),
"total_lsi_keywords": len(lsi),
"lsi_with_deficit": len(lsi_with_deficit),
"word_count_goal": word_count_goal,
"word_count_cluster_target": wc_dist.get("cluster_target"),
"word_count_distribution": wc_dist.get("counts_sorted", []),
"variation_density_avg": density.get("variation_density", {}).get("avg_pct"),
"entity_density_avg": density.get("entity_density", {}).get("avg_pct"),
"lsi_density_avg": density.get("lsi_density", {}).get("avg_pct"),
"distinct_entities_target": content.get("distinct_entities", {}).get("target"),
"competitors_analyzed": len(results),
"tuning_factors": len(tunings),
"optimization_rules": OPTIMIZATION_RULES,
}
# =============================================================================
# Helper functions
# =============================================================================
def _safe_str(row, idx) -> str:
if idx is None or idx >= len(row) or row[idx] is None:
return ""
return str(row[idx]).strip()
def _safe_float(row, idx) -> float | None:
if idx is None or idx >= len(row) or row[idx] is None:
return None
try:
return float(row[idx])
except (ValueError, TypeError):
return None
def _safe_int(row, idx) -> int | None:
if idx is None or idx >= len(row) or row[idx] is None:
return None
try:
return int(float(row[idx]))
except (ValueError, TypeError):
return None
def _find_site_col_idx(headers) -> int | None:
"""Find site column by looking for domain pattern in header values."""
for j, h in enumerate(headers):
if h and isinstance(h, str):
h_str = h.strip()
if re.search(r'[a-zA-Z0-9-]+\.[a-zA-Z]{2,}', h_str):
# Skip known non-site headers
if h_str in ("Best of Both", "LSI Keyword"):
continue
return j
return None
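
The domain heuristic above boils down to one regex test. A small sketch (`looks_like_domain` is an illustrative name, not part of the module):

```python
import re

def looks_like_domain(header: str) -> bool:
    # Same pattern as _find_site_col_idx: a token, a dot, then a 2+ letter TLD.
    return bool(re.search(r'[a-zA-Z0-9-]+\.[a-zA-Z]{2,}', header))

# "example.com" matches; plain header labels like "Factor Name" do not.
```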
def _find_cluster_target(counts: list[int]) -> int:
"""Find the nearest competitive cluster target for word count.
Strategy: Don't use the raw average (skewed by outliers).
Instead, find the densest cluster of competitors within 40% of each
other and target slightly above that cluster's average.
"""
if not counts:
return 0
if len(counts) <= 3:
return math.ceil(max(counts) * 1.05)
# Simple clustering: find the densest grouping
best_cluster = []
for i in range(len(counts)):
cluster = [counts[i]]
for j in range(i + 1, len(counts)):
# Within 40% range of the cluster start
if counts[j] <= counts[i] * 1.4:
cluster.append(counts[j])
else:
break
if len(cluster) >= len(best_cluster):
best_cluster = cluster
if best_cluster:
cluster_avg = sum(best_cluster) / len(best_cluster)
# Target slightly above the cluster average
return math.ceil(cluster_avg * 1.05)
# Fallback: median + 5%
median = counts[len(counts) // 2]
return math.ceil(median * 1.05)
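
With hypothetical word counts, the clustering heuristic behaves like this (condensed, self-contained copy of `_find_cluster_target` for illustration):

```python
import math

def find_cluster_target(counts: list[int]) -> int:
    """Condensed copy of _find_cluster_target above, for illustration."""
    if not counts:
        return 0
    if len(counts) <= 3:
        return math.ceil(max(counts) * 1.05)
    best: list[int] = []
    for i in range(len(counts)):
        cluster = [counts[i]]
        for j in range(i + 1, len(counts)):
            if counts[j] <= counts[i] * 1.4:  # within 40% of the cluster start
                cluster.append(counts[j])
            else:
                break
        if len(cluster) >= len(best):
            best = cluster
    return math.ceil(sum(best) / len(best) * 1.05)

# Four pages cluster near ~1000 words; the 5000-word outlier is ignored
# instead of dragging the raw average (1840) up as the target.
target = find_cluster_target([900, 1000, 1100, 1200, 5000])  # 1103
```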
# =============================================================================
# Output formatting
# =============================================================================
def format_text(data, label: str = "") -> str:
"""Format data as human-readable text."""
lines = []
if label:
lines.append(f"=== {label} ===")
lines.append("")
if isinstance(data, dict):
for key, value in data.items():
if isinstance(value, list) and len(value) > 5:
lines.append(f" {key}: [{len(value)} items]")
elif isinstance(value, dict):
lines.append(f" {key}:")
for k2, v2 in value.items():
lines.append(f" {k2}: {v2}")
else:
lines.append(f" {key}: {value}")
elif isinstance(data, list):
for i, item in enumerate(data):
if isinstance(item, dict):
lines.append(f" [{i + 1}]")
for key, value in item.items():
lines.append(f" {key}: {value}")
else:
lines.append(f" [{i + 1}] {item}")
lines.append("")
return "\n".join(lines)
# =============================================================================
# CLI
# =============================================================================
def main():
parser = argparse.ArgumentParser(description="Parse a Cora SEO XLSX report")
parser.add_argument("xlsx_path", help="Path to the Cora XLSX file")
parser.add_argument(
"--sheet",
choices=[
"entities", "lsi", "variations", "results", "tunings",
"structure", "densities", "targets", "wordcount", "summary", "all",
],
default="summary",
help="Which data to extract (default: summary)",
)
parser.add_argument(
"--format",
choices=["json", "text"],
default="text",
help="Output format (default: text)",
)
parser.add_argument(
"--top-n",
type=int,
default=0,
help="Limit output to top N results (0 = all)",
)
args = parser.parse_args()
report = CoraReport(args.xlsx_path)
extractors = {
"entities": ("Entities", report.get_entities),
"lsi": ("LSI Keywords", report.get_lsi_keywords),
"variations": ("Keyword Variations", lambda: report.get_keyword_variations()),
"results": ("Competitor URLs", report.get_competitor_urls),
"tunings": ("Basic Tunings", report.get_basic_tunings),
"structure": ("Structure Targets", report.get_structure_targets),
"densities": ("Density Targets", report.get_density_targets),
"targets": ("Content Targets", report.get_content_targets),
"wordcount": ("Word Count Distribution", report.get_word_count_distribution),
"summary": ("Summary", report.get_summary),
}
if args.sheet == "all":
sheets_to_show = ["summary", "structure", "densities", "targets", "wordcount"]
else:
sheets_to_show = [args.sheet]
for sheet_key in sheets_to_show:
label, extractor = extractors[sheet_key]
data = extractor()
if args.top_n > 0 and isinstance(data, list):
data = data[:args.top_n]
if args.format == "json":
print(json.dumps(data, indent=2, default=str))
else:
print(format_text(data, label))
if __name__ == "__main__":
main()


@ -0,0 +1,455 @@
#!/usr/bin/env python3
"""
Entity Optimizer: Cora Entity Analysis for Content Drafts
Counts Cora-defined entities in a markdown content draft and recommends
additions based on relevance and deficit data from a Cora XLSX report.
Usage:
uv run --with openpyxl python entity_optimizer.py <draft_path> <cora_xlsx_path> [--format json|text] [--top-n 30]
Options:
--format Output format: json or text (default: text)
--top-n Number of top recommendations to show (default: 30)
"""
import argparse
import json
import re
import sys
from pathlib import Path
from cora_parser import CoraReport
class EntityOptimizer:
"""Analyzes a content draft against Cora entity targets and recommends additions."""
def __init__(self, cora_xlsx_path: str):
"""Load entity targets from a Cora XLSX report.
Args:
cora_xlsx_path: Path to the Cora SEO XLSX file.
"""
self.report = CoraReport(cora_xlsx_path)
self.entities = self.report.get_entities()
self.search_term = self.report.get_search_term()
# Populated after analyze_draft() is called
self.draft_text = ""
self.sections = [] # list of {"heading": str, "level": int, "text": str}
self.entity_counts = {} # entity name -> {"total": int, "per_section": {heading: count}}
def analyze_draft(self, draft_path: str) -> dict:
"""Run a full analysis of a content draft against Cora entity targets.
Args:
draft_path: Path to a markdown content draft file.
Returns:
dict with keys: summary, entity_counts, deficits, recommendations, section_density
"""
path = Path(draft_path)
if not path.exists():
raise FileNotFoundError(f"Draft file not found: {draft_path}")
self.draft_text = path.read_text(encoding="utf-8")
self.sections = self._parse_sections(self.draft_text)
self.entity_counts = self.count_entities(self.draft_text)
deficits = self.calculate_deficits()
recommendations = self.recommend_additions()
section_density = self._section_density()
# Build summary stats
entities_found = sum(
1 for name, counts in self.entity_counts.items() if counts["total"] > 0
)
entities_with_deficit = sum(1 for d in deficits if d["remaining_deficit"] > 0)
summary = {
"search_term": self.search_term,
"total_entities_tracked": len(self.entities),
"entities_found_in_draft": entities_found,
"entities_with_deficit": entities_with_deficit,
"total_sections": len(self.sections),
}
return {
"summary": summary,
"entity_counts": self.entity_counts,
"deficits": deficits,
"recommendations": recommendations,
"section_density": section_density,
}
def count_entities(self, text: str) -> dict:
"""Count occurrences of each Cora entity in the text, total and per section.
Uses case-insensitive matching with word boundaries so partial matches
inside larger words are excluded.
Args:
text: The full draft text.
Returns:
dict mapping entity name to {"total": int, "per_section": {heading: int}}
"""
counts = {}
sections = self.sections if self.sections else self._parse_sections(text)
for entity in self.entities:
name = entity["name"]
pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
total = len(pattern.findall(text))
per_section = {}
for section in sections:
section_count = len(pattern.findall(section["text"]))
if section_count > 0:
per_section[section["heading"]] = section_count
counts[name] = {
"total": total,
"per_section": per_section,
}
return counts
def calculate_deficits(self) -> list[dict]:
"""Calculate which entities are still below their Cora deficit target.
Compares the count found in the draft against the deficit value from
the Cora report. An entity with a Cora deficit of 20 and a draft count
of 5 has a remaining deficit of 15.
Returns:
List of dicts with: name, relevance, correlation, cora_deficit,
draft_count, remaining_deficit sorted by remaining_deficit descending.
"""
deficits = []
for entity in self.entities:
name = entity["name"]
cora_deficit = entity.get("deficit") or 0
draft_count = self.entity_counts.get(name, {}).get("total", 0)
remaining = max(0, cora_deficit - draft_count)
deficits.append({
"name": name,
"relevance": entity.get("relevance") or 0,
"correlation": entity.get("correlation") or 0,
"cora_deficit": cora_deficit,
"draft_count": draft_count,
"remaining_deficit": remaining,
})
deficits.sort(key=lambda d: d["remaining_deficit"], reverse=True)
return deficits
def recommend_additions(self) -> list[dict]:
"""Generate prioritized recommendations for entity additions.
Priority is calculated as relevance * remaining_deficit, so entities
that are both highly relevant and far below target rank highest.
Each recommendation includes suggested sections where the entity
could naturally be added, based on where related entities already appear.
Returns:
List of recommendation dicts sorted by priority descending. Each dict
has: name, relevance, correlation, cora_deficit, draft_count,
remaining_deficit, priority, suggested_sections.
"""
deficits = self.calculate_deficits()
recommendations = []
for deficit_entry in deficits:
if deficit_entry["remaining_deficit"] <= 0:
continue
relevance = deficit_entry["relevance"]
remaining = deficit_entry["remaining_deficit"]
priority = relevance * remaining
suggested = self._suggest_sections(deficit_entry["name"])
recommendations.append({
"name": deficit_entry["name"],
"relevance": relevance,
"correlation": deficit_entry["correlation"],
"cora_deficit": deficit_entry["cora_deficit"],
"draft_count": deficit_entry["draft_count"],
"remaining_deficit": remaining,
"priority": round(priority, 4),
"suggested_sections": suggested,
})
recommendations.sort(key=lambda r: r["priority"], reverse=True)
return recommendations
# ------------------------------------------------------------------
# Internal helpers
# ------------------------------------------------------------------
def _parse_sections(self, text: str) -> list[dict]:
"""Split markdown text into sections by headings.
Each section captures the heading text, heading level, and the body
text under that heading (up to the start of the next heading of any level).
A virtual "Introduction" section is created for content before the first heading.
Returns:
list of {"heading": str, "level": int, "text": str}
"""
heading_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
matches = list(heading_pattern.finditer(text))
sections = []
# Content before the first heading becomes the Introduction section
if matches:
intro_text = text[:matches[0].start()].strip()
if intro_text:
sections.append({
"heading": "Introduction",
"level": 0,
"text": intro_text,
})
else:
# No headings at all — treat the entire text as one section
return [{
"heading": "Full Document",
"level": 0,
"text": text,
}]
for i, match in enumerate(matches):
level = len(match.group(1))
heading = match.group(2).strip()
start = match.end()
end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
body = text[start:end].strip()
sections.append({
"heading": heading,
"level": level,
"text": body,
})
return sections
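
The splitting logic above can be sketched end to end on a toy draft (minimal standalone version; `split_sections` and the sample text are illustrative):

```python
import re

def split_sections(text: str) -> list[dict]:
    """Minimal version of the heading-splitting logic above."""
    heading_re = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
    matches = list(heading_re.finditer(text))
    if not matches:
        return [{"heading": "Full Document", "level": 0, "text": text}]
    sections = []
    intro = text[:matches[0].start()].strip()
    if intro:
        sections.append({"heading": "Introduction", "level": 0, "text": intro})
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append({
            "heading": m.group(2).strip(),
            "level": len(m.group(1)),
            "text": text[m.end():end].strip(),
        })
    return sections

draft = "Intro para.\n\n# Process\nBody A.\n\n## Tolerances\nBody B.\n"
parts = split_sections(draft)
# headings: Introduction (level 0), Process (level 1), Tolerances (level 2)
```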
def _suggest_sections(self, entity_name: str) -> list[str]:
"""Suggest sections where an entity could naturally be added.
Strategy: find sections that already contain other entities from the
same Cora report. Sections with higher concentrations of related
entities are better candidates because the topic is contextually aligned.
If no sections have related entities, return all non-empty sections
as general candidates.
Args:
entity_name: The entity to find placement for.
Returns:
List of section heading strings, ordered by relevance.
"""
if not self.sections:
return []
# Build a score for each section: count how many other entities appear there
section_scores = []
for section in self.sections:
heading = section["heading"]
other_entity_count = 0
for name, counts in self.entity_counts.items():
if name.lower() == entity_name.lower():
continue
if heading in counts.get("per_section", {}):
other_entity_count += counts["per_section"][heading]
if other_entity_count > 0:
section_scores.append((heading, other_entity_count))
# Sort by entity richness descending
section_scores.sort(key=lambda x: x[1], reverse=True)
if section_scores:
return [heading for heading, _score in section_scores]
# Fallback: return all sections with non-trivial content
return [
s["heading"]
for s in self.sections
if len(s["text"].split()) > 20
]
def _section_density(self) -> list[dict]:
"""Calculate per-section entity density.
Returns:
List of dicts with: heading, level, word_count, entities_found,
entity_mentions, density (mentions per 100 words).
"""
densities = []
for section in self.sections:
heading = section["heading"]
word_count = len(section["text"].split())
entities_found = 0
total_mentions = 0
for name, counts in self.entity_counts.items():
section_count = counts.get("per_section", {}).get(heading, 0)
if section_count > 0:
entities_found += 1
total_mentions += section_count
density = round((total_mentions / word_count) * 100, 2) if word_count > 0 else 0.0
densities.append({
"heading": heading,
"level": section["level"],
"word_count": word_count,
"entities_found": entities_found,
"entity_mentions": total_mentions,
"density_per_100_words": density,
})
return densities
# ------------------------------------------------------------------
# Output formatting
# ------------------------------------------------------------------
def format_text_report(analysis: dict, top_n: int = 30) -> str:
"""Format the analysis result as a human-readable text report."""
lines = []
summary = analysis["summary"]
# --- Header ---
lines.append("=" * 70)
lines.append(" ENTITY OPTIMIZATION REPORT")
if summary.get("search_term"):
lines.append(f" Target keyword: {summary['search_term']}")
lines.append("=" * 70)
lines.append("")
# --- Summary ---
lines.append("SUMMARY")
lines.append("-" * 40)
lines.append(f" Total entities tracked: {summary['total_entities_tracked']}")
lines.append(f" Entities found in draft: {summary['entities_found_in_draft']}")
lines.append(f" Entities with deficit: {summary['entities_with_deficit']}")
lines.append(f" Total sections in draft: {summary['total_sections']}")
lines.append("")
# --- Top Recommendations ---
recommendations = analysis["recommendations"]
shown = recommendations[:top_n]
lines.append(f"TOP {min(top_n, len(recommendations))} RECOMMENDATIONS (sorted by priority)")
lines.append("-" * 70)
if not shown:
lines.append(" No entity deficits found — the draft covers all targets.")
else:
for i, rec in enumerate(shown, 1):
sections_str = ", ".join(rec["suggested_sections"][:3]) if rec["suggested_sections"] else "any section"
lines.append(
f" {i:>3}. Entity '{rec['name']}' found {rec['draft_count']} times, "
f"target deficit is {rec['cora_deficit']}. "
f"Remaining: {rec['remaining_deficit']}. "
f"Priority: {rec['priority']}"
)
lines.append(
f" Relevance: {rec['relevance']} | Correlation: {rec['correlation']}"
)
lines.append(
f" Suggested sections: [{sections_str}]"
)
lines.append("")
# --- Per-Section Entity Density ---
lines.append("PER-SECTION ENTITY DENSITY")
lines.append("-" * 70)
lines.append(f" {'Section':<40} {'Words':>6} {'Entities':>9} {'Mentions':>9} {'Density':>8}")
lines.append(f" {'-' * 40} {'-' * 6} {'-' * 9} {'-' * 9} {'-' * 8}")
for sd in analysis["section_density"]:
indent = " " * sd["level"] if sd["level"] > 0 else ""
heading_display = indent + sd["heading"]
if len(heading_display) > 38:
heading_display = heading_display[:35] + "..."
lines.append(
f" {heading_display:<40} {sd['word_count']:>6} {sd['entities_found']:>9} "
f"{sd['entity_mentions']:>9} {sd['density_per_100_words']:>7.2f}%"
)
lines.append("")
lines.append("=" * 70)
return "\n".join(lines)
def format_json_report(analysis: dict, top_n: int = 30) -> str:
"""Format the analysis result as machine-readable JSON."""
output = {
"summary": analysis["summary"],
"recommendations": analysis["recommendations"][:top_n],
"section_density": analysis["section_density"],
"entity_counts": analysis["entity_counts"],
"deficits": analysis["deficits"],
}
return json.dumps(output, indent=2, default=str)
# ------------------------------------------------------------------
# CLI entry point
# ------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Analyze a content draft against Cora entity targets and recommend additions.",
usage="uv run --with openpyxl python entity_optimizer.py <draft_path> <cora_xlsx_path> [options]",
)
parser.add_argument(
"draft_path",
help="Path to the markdown content draft",
)
parser.add_argument(
"cora_xlsx_path",
help="Path to the Cora SEO XLSX report",
)
parser.add_argument(
"--format",
choices=["json", "text"],
default="text",
help="Output format (default: text)",
)
parser.add_argument(
"--top-n",
type=int,
default=30,
help="Number of top recommendations to display (default: 30)",
)
args = parser.parse_args()
try:
optimizer = EntityOptimizer(args.cora_xlsx_path)
analysis = optimizer.analyze_draft(args.draft_path)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error analyzing draft: {e}", file=sys.stderr)
sys.exit(1)
if args.format == "json":
print(format_json_report(analysis, top_n=args.top_n))
else:
print(format_text_report(analysis, top_n=args.top_n))
if __name__ == "__main__":
main()


@ -0,0 +1,414 @@
"""
LSI Keyword Optimizer
Counts Cora-defined LSI keywords in a content draft and recommends additions.
Reads LSI targets from a Cora XLSX report via cora_parser.CoraReport, then
scans a markdown draft to measure per-keyword usage and calculate deficits.
Recommendations are prioritized by |correlation| x deficit so the most
ranking-impactful gaps surface first.
Usage:
uv run --with openpyxl python lsi_optimizer.py <draft_path> <cora_xlsx_path> \
[--format json|text] [--min-correlation 0.2] [--top-n 50]
"""
import argparse
import json
import re
import sys
from pathlib import Path
from cora_parser import CoraReport
class LSIOptimizer:
"""Analyzes a content draft against Cora LSI keyword targets."""
def __init__(self, cora_xlsx_path: str):
"""Load LSI keyword targets from a Cora XLSX report.
Args:
cora_xlsx_path: Path to the Cora SEO report XLSX file.
"""
self.report = CoraReport(cora_xlsx_path)
self.lsi_keywords = self.report.get_lsi_keywords()
self.draft_text = ""
self.sections: list[dict] = []
self._keyword_counts: dict[str, int] = {}
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def analyze_draft(self, draft_path: str) -> dict:
"""Run full LSI analysis on a markdown draft.
Args:
draft_path: Path to a markdown content draft.
Returns:
Analysis dict with keys: summary, keyword_counts, deficits,
recommendations, section_coverage.
"""
path = Path(draft_path)
if not path.exists():
raise FileNotFoundError(f"Draft file not found: {draft_path}")
self.draft_text = path.read_text(encoding="utf-8")
self.sections = self._parse_sections(self.draft_text)
self._keyword_counts = self.count_lsi_keywords(self.draft_text)
deficits = self.calculate_deficits()
recommendations = self.recommend_additions()
section_coverage = self._section_coverage()
total_tracked = len(self.lsi_keywords)
found_in_draft = sum(1 for c in self._keyword_counts.values() if c > 0)
with_deficit = len(deficits)
return {
"summary": {
"total_lsi_tracked": total_tracked,
"found_in_draft": found_in_draft,
"with_deficit": with_deficit,
"fully_satisfied": total_tracked - with_deficit,
},
"keyword_counts": self._keyword_counts,
"deficits": deficits,
"recommendations": recommendations,
"section_coverage": section_coverage,
}
def count_lsi_keywords(self, text: str) -> dict[str, int]:
"""Count occurrences of each LSI keyword in the given text.
Uses word-boundary-aware regex matching so multi-word phrases like
"part that" are matched correctly and case-insensitively.
Args:
text: The content string to scan.
Returns:
Dict mapping keyword string to its occurrence count.
"""
counts: dict[str, int] = {}
for kw_data in self.lsi_keywords:
keyword = kw_data["keyword"]
pattern = self._keyword_pattern(keyword)
matches = pattern.findall(text)
counts[keyword] = len(matches)
return counts
def calculate_deficits(self) -> list[dict]:
"""Identify LSI keywords whose draft count is below the Cora target.
A keyword has a deficit when the Cora report indicates a positive
deficit value (target minus current usage in the report) AND the
draft count has not yet closed that gap.
Returns:
List of dicts with: keyword, draft_count, target, deficit,
spearmans, pearsons, best_of_both. Only keywords with
remaining deficit > 0 are included.
"""
deficits = []
for kw_data in self.lsi_keywords:
keyword = kw_data["keyword"]
cora_deficit = kw_data.get("deficit") or 0
if cora_deficit <= 0:
continue
# The Cora deficit is based on the original page. The draft may
# have added some occurrences, so we re-compute: how many more
# are still needed?
cora_current = kw_data.get("current_count") or 0
target = cora_current + cora_deficit
draft_count = self._keyword_counts.get(keyword, 0)
remaining_deficit = target - draft_count
if remaining_deficit <= 0:
continue
deficits.append({
"keyword": keyword,
"draft_count": draft_count,
"target": target,
"deficit": remaining_deficit,
"spearmans": kw_data.get("spearmans"),
"pearsons": kw_data.get("pearsons"),
"best_of_both": kw_data.get("best_of_both"),
})
return deficits
def recommend_additions(
self,
min_correlation: float = 0.0,
top_n: int = 0,
) -> list[dict]:
"""Produce a prioritized list of LSI keyword additions.
Priority score = abs(best_of_both) x deficit. Keywords with higher
correlation to ranking AND larger deficits sort to the top.
Args:
min_correlation: Only include keywords whose
abs(best_of_both) >= this threshold.
top_n: Limit to top N results (0 = no limit).
Returns:
Sorted list of dicts with: keyword, priority, deficit,
draft_count, target, best_of_both, spearmans, pearsons.
"""
deficits = self.calculate_deficits()
recommendations = []
for d in deficits:
correlation = abs(d["best_of_both"]) if d["best_of_both"] else 0.0
if correlation < min_correlation:
continue
priority = correlation * d["deficit"]
recommendations.append({
"keyword": d["keyword"],
"priority": round(priority, 4),
"deficit": d["deficit"],
"draft_count": d["draft_count"],
"target": d["target"],
"best_of_both": d["best_of_both"],
"spearmans": d["spearmans"],
"pearsons": d["pearsons"],
})
recommendations.sort(key=lambda r: r["priority"], reverse=True)
if top_n > 0:
recommendations = recommendations[:top_n]
return recommendations
# ------------------------------------------------------------------
# Internal helpers
# ------------------------------------------------------------------
@staticmethod
def _keyword_pattern(keyword: str) -> re.Pattern:
"""Build a word-boundary-aware regex for an LSI keyword.
Handles multi-word phrases by joining escaped tokens with flexible
whitespace. Case-insensitive.
"""
tokens = keyword.strip().split()
escaped = [re.escape(t) for t in tokens]
# Allow flexible whitespace between tokens in multi-word phrases
pattern_str = r"\b" + r"\s+".join(escaped) + r"\b"
return re.compile(pattern_str, re.IGNORECASE)
@staticmethod
def _parse_sections(text: str) -> list[dict]:
"""Split markdown text into sections by headings.
Returns list of dicts with: heading, level, content.
The content before the first heading gets heading="(intro)".
"""
heading_re = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
matches = list(heading_re.finditer(text))
sections: list[dict] = []
if not matches:
# No headings — treat entire text as one section
sections.append({
"heading": "(intro)",
"level": 0,
"content": text,
})
return sections
# Content before first heading
if matches[0].start() > 0:
intro = text[: matches[0].start()]
if intro.strip():
sections.append({
"heading": "(intro)",
"level": 0,
"content": intro,
})
for i, match in enumerate(matches):
level = len(match.group(1))
heading = match.group(2).strip()
start = match.end()
end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
content = text[start:end]
sections.append({
"heading": heading,
"level": level,
"content": content,
})
return sections
def _section_coverage(self) -> list[dict]:
"""Calculate LSI keyword coverage per section.
Returns list of dicts with: heading, level, total_keywords_found,
keyword_details (list of keyword/count pairs present in that section).
"""
coverage = []
for section in self.sections:
section_counts = self.count_lsi_keywords(section["content"])
found = {kw: cnt for kw, cnt in section_counts.items() if cnt > 0}
coverage.append({
"heading": section["heading"],
"level": section["level"],
"total_keywords_found": len(found),
"keyword_details": [
{"keyword": kw, "count": cnt}
for kw, cnt in sorted(found.items(), key=lambda x: x[1], reverse=True)
],
})
return coverage
# ----------------------------------------------------------------------
# Output formatting
# ----------------------------------------------------------------------
def format_text_report(analysis: dict) -> str:
"""Format the analysis dict as a human-readable text report."""
lines: list[str] = []
summary = analysis["summary"]
# --- Summary ---
lines.append("=" * 60)
lines.append(" LSI KEYWORD OPTIMIZATION REPORT")
lines.append("=" * 60)
lines.append("")
lines.append(f" Total LSI keywords tracked : {summary['total_lsi_tracked']}")
lines.append(f" Found in draft : {summary['found_in_draft']}")
lines.append(f" With deficit (need more) : {summary['with_deficit']}")
lines.append(f" Fully satisfied : {summary['fully_satisfied']}")
lines.append("")
# --- Top Recommendations ---
recs = analysis["recommendations"]
if recs:
lines.append("-" * 60)
lines.append(" TOP RECOMMENDATIONS (sorted by priority)")
lines.append("-" * 60)
lines.append("")
lines.append(
f" {'#':<4} {'Keyword':<30} {'Priority':>9} "
f"{'Deficit':>8} {'Draft':>6} {'Target':>7} {'Corr':>7}"
)
lines.append(f" {'-'*4} {'-'*30} {'-'*9} {'-'*8} {'-'*6} {'-'*7} {'-'*7}")
for i, rec in enumerate(recs, 1):
corr = rec["best_of_both"]
corr_str = f"{corr:.3f}" if corr is not None else "N/A"
keyword_display = rec["keyword"]
if len(keyword_display) > 28:
keyword_display = keyword_display[:25] + "..."
lines.append(
f" {i:<4} {keyword_display:<30} {rec['priority']:>9.4f} "
f"{rec['deficit']:>8} {rec['draft_count']:>6} "
f"{rec['target']:>7} {corr_str:>7}"
)
lines.append("")
else:
lines.append(" No recommendations — all LSI targets met or no deficits found.")
lines.append("")
# --- Section Coverage ---
sections = analysis["section_coverage"]
if sections:
lines.append("-" * 60)
lines.append(" PER-SECTION LSI COVERAGE")
lines.append("-" * 60)
lines.append("")
for sec in sections:
indent = " " * (sec["level"] + 1)
heading = sec["heading"]
kw_count = sec["total_keywords_found"]
lines.append(f"{indent}{heading} ({kw_count} LSI keyword{'s' if kw_count != 1 else ''})")
if sec["keyword_details"]:
for detail in sec["keyword_details"][:10]:
lines.append(f"{indent} - \"{detail['keyword']}\" x{detail['count']}")
remaining = len(sec["keyword_details"]) - 10
if remaining > 0:
lines.append(f"{indent} ... and {remaining} more")
lines.append("")
lines.append("=" * 60)
return "\n".join(lines)
# ----------------------------------------------------------------------
# CLI entry point
# ----------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Analyze a content draft against Cora LSI keyword targets.",
)
parser.add_argument(
"draft_path",
help="Path to the markdown content draft",
)
parser.add_argument(
"cora_xlsx_path",
help="Path to the Cora SEO XLSX report",
)
parser.add_argument(
"--format",
choices=["json", "text"],
default="text",
help="Output format (default: text)",
)
parser.add_argument(
"--min-correlation",
type=float,
default=0.2,
help="Minimum |correlation| to include in recommendations (default: 0.2)",
)
parser.add_argument(
"--top-n",
type=int,
default=50,
help="Limit recommendations to top N (default: 50, 0 = unlimited)",
)
args = parser.parse_args()
try:
optimizer = LSIOptimizer(args.cora_xlsx_path)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
try:
analysis = optimizer.analyze_draft(args.draft_path)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
# Apply CLI filters to recommendations
analysis["recommendations"] = optimizer.recommend_additions(
min_correlation=args.min_correlation,
top_n=args.top_n,
)
if args.format == "json":
print(json.dumps(analysis, indent=2, default=str))
else:
print(format_text_report(analysis))
if __name__ == "__main__":
main()
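The keyword matching above hinges on `_keyword_pattern`: word boundaries on both ends, with flexible whitespace between the tokens of a multi-word phrase. A minimal standalone sketch of that helper (the sample text is made up for illustration) shows why "swiss  screw" with a doubled space still counts, while "screwdriver" does not:

```python
import re

def keyword_pattern(keyword: str) -> re.Pattern:
    # Mirror of LSIOptimizer._keyword_pattern: escape each token, join with
    # flexible whitespace, and anchor with word boundaries, case-insensitive.
    tokens = keyword.strip().split()
    pattern_str = r"\b" + r"\s+".join(re.escape(t) for t in tokens) + r"\b"
    return re.compile(pattern_str, re.IGNORECASE)

text = "Swiss screw machining uses a Swiss  screw machine. Swiss screwdriver sets differ."
pattern = keyword_pattern("swiss screw")
print(len(pattern.findall(text)))  # 2 — the double space matches; "screwdriver" is rejected by \b
```

Because the escaped tokens contain no capture groups, `findall` returns whole-match strings, so `len()` is a direct occurrence count.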

View File

@ -0,0 +1,402 @@
"""
SEO Content Optimizer
Checks keyword density and content structure of a draft against Cora targets.
Usage:
uv run --with openpyxl python seo_optimizer.py <draft_path>
[--keyword <kw>] [--cora-xlsx <path>] [--format json|text]
Works standalone for basic checks, or with a Cora XLSX report for
keyword-specific targets via cora_parser.CoraReport.
"""
import argparse
import json
import re
import sys
from pathlib import Path
# Optional Cora integration — script works without it
try:
from cora_parser import CoraReport
except ImportError:
CoraReport = None
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _split_words(text: str) -> list[str]:
"""Extract words from text (alphabetic sequences)."""
return re.findall(r"[a-zA-Z']+", text)
def _strip_markdown_headings(text: str) -> str:
"""Remove markdown heading markers from text for word counting."""
return re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
def _extract_headings(text: str) -> list[dict]:
"""Extract markdown-style headings with their levels."""
headings = []
for match in re.finditer(r"^(#{1,6})\s+(.+)$", text, re.MULTILINE):
level = len(match.group(1))
headings.append({"level": level, "text": match.group(2).strip()})
return headings
# ---------------------------------------------------------------------------
# SEOOptimizer
# ---------------------------------------------------------------------------
class SEOOptimizer:
"""Analyze a content draft for keyword density and structure."""
def __init__(self):
self._results = {}
# -- public entry point -------------------------------------------------
def analyze(
self,
draft_path: str,
primary_keyword: str | None = None,
cora_xlsx_path: str | None = None,
) -> dict:
"""Run checks on *draft_path* and return an analysis dict."""
path = Path(draft_path)
if not path.exists():
raise FileNotFoundError(f"Draft not found: {draft_path}")
text = path.read_text(encoding="utf-8")
# Optionally load Cora data
cora = None
if cora_xlsx_path:
if CoraReport is None:
print(
"Warning: cora_parser not available. "
"Install openpyxl and ensure cora_parser.py is importable.",
file=sys.stderr,
)
else:
cora = CoraReport(cora_xlsx_path)
# Determine keyword list
keywords = []
if primary_keyword:
keywords.append(primary_keyword)
if cora:
search_term = cora.get_search_term()
if search_term and search_term.lower() not in [k.lower() for k in keywords]:
keywords.insert(0, search_term)
for var in cora.get_keyword_variations():
v = var["variation"]
if v.lower() not in [k.lower() for k in keywords]:
keywords.append(v)
# If still no keywords but Cora gave a search term, use it
if not keywords and cora:
st = cora.get_search_term()
if st:
keywords.append(st)
# Word-count target from Cora
word_count_target = None
if cora:
for t in cora.get_basic_tunings():
if t["factor"] == "Word Count":
try:
word_count_target = int(float(t["goal"]))
except (ValueError, TypeError):
pass
break
# Build Cora keyword targets (page1_avg) for comparison
cora_keyword_targets = {}
if cora:
for var in cora.get_keyword_variations():
cora_keyword_targets[var["variation"].lower()] = {
"page1_avg": var.get("page1_avg", 0),
"page1_max": var.get("page1_max", 0),
}
# Run checks
self._results["content_length"] = self.check_content_length(text, target=word_count_target)
self._results["structure"] = self.check_structure(text)
self._results["keyword_density"] = self.check_keyword_density(
text, keywords=keywords or None, cora_targets=cora_keyword_targets,
)
return self._results
# -- individual checks --------------------------------------------------
def check_keyword_density(
self,
text: str,
keywords: list[str] | None = None,
cora_targets: dict | None = None,
) -> dict:
"""Return per-keyword density information.
Only reports variations that have page1_avg > 0 (competitors actually
use them) when Cora targets are available.
"""
clean_text = _strip_markdown_headings(text).lower()
words = _split_words(clean_text)
total_words = len(words)
if total_words == 0:
return {"total_words": 0, "keywords": []}
results: list[dict] = []
if keywords:
for kw in keywords:
kw_lower = kw.lower()
# Skip zero-avg variations — competitors don't use them
if cora_targets and kw_lower in cora_targets:
if cora_targets[kw_lower].get("page1_avg", 0) == 0:
continue
kw_words = kw_lower.split()
if len(kw_words) > 1:
pattern = re.compile(r"\b" + re.escape(kw_lower) + r"\b")
count = len(pattern.findall(clean_text))
else:
count = sum(1 for w in words if w == kw_lower)
density = (count / total_words) * 100 if total_words else 0
entry = {
"keyword": kw,
"count": count,
"density_pct": round(density, 2),
}
# Add Cora target if available
if cora_targets and kw_lower in cora_targets:
entry["target_avg"] = cora_targets[kw_lower]["page1_avg"]
entry["target_max"] = cora_targets[kw_lower]["page1_max"]
results.append(entry)
else:
# Fallback: top frequent words (>= 4 chars)
freq: dict[str, int] = {}
for w in words:
if len(w) >= 4:
freq[w] = freq.get(w, 0) + 1
top = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
for w, count in top:
density = (count / total_words) * 100
results.append({
"keyword": w,
"count": count,
"density_pct": round(density, 2),
})
return {"total_words": total_words, "keywords": results}
def check_structure(self, text: str) -> dict:
"""Analyze heading hierarchy, paragraph count, and list usage."""
headings = _extract_headings(text)
# Count headings per level
heading_counts = {f"h{i}": 0 for i in range(1, 7)}
for h in headings:
heading_counts[f"h{h['level']}"] += 1
# Check nesting issues
nesting_issues: list[str] = []
if heading_counts["h1"] > 1:
nesting_issues.append(f"Multiple H1 tags found ({heading_counts['h1']}); use exactly one.")
prev_level = 0
for h in headings:
if prev_level > 0 and h["level"] > prev_level + 1:
label = h["text"][:40] + "..." if len(h["text"]) > 40 else h["text"]
nesting_issues.append(f"Heading skip: H{prev_level} -> H{h['level']} (at \"{label}\")")
prev_level = h["level"]
# Paragraphs
paragraphs = []
for block in re.split(r"\n\s*\n", text):
block = block.strip()
if not block:
continue
if re.match(r"^#{1,6}\s+", block) and "\n" not in block:
continue
if all(re.match(r"^\s*[-*+]\s|^\s*\d+\.\s", line) for line in block.splitlines() if line.strip()):
continue
paragraphs.append(block)
paragraph_count = len(paragraphs)
# List usage
unordered_items = len(re.findall(r"^\s*[-*+]\s", text, re.MULTILINE))
ordered_items = len(re.findall(r"^\s*\d+\.\s", text, re.MULTILINE))
return {
"heading_counts": heading_counts,
"headings": [{"level": h["level"], "text": h["text"]} for h in headings],
"nesting_issues": nesting_issues,
"paragraph_count": paragraph_count,
"unordered_list_items": unordered_items,
"ordered_list_items": ordered_items,
}
def check_content_length(self, text: str, target: int | None = None) -> dict:
"""Compare word count against an optional target."""
clean = _strip_markdown_headings(text)
words = _split_words(clean)
word_count = len(words)
result: dict = {"word_count": word_count}
if target is not None:
result["target"] = target
result["difference"] = word_count - target
if word_count >= target:
result["status"] = "meets_target"
elif word_count >= target * 0.8:
result["status"] = "close"
else:
result["status"] = "below_target"
return result
# ---------------------------------------------------------------------------
# Text-mode formatting
# ---------------------------------------------------------------------------
def _format_text_report(results: dict) -> str:
"""Format analysis results as a human-readable text report."""
lines: list[str] = []
sep = "-" * 60
# 1. Content Stats
cl = results.get("content_length", {})
lines.append(sep)
lines.append(" CONTENT STATS")
lines.append(sep)
lines.append(f" Word count: {cl.get('word_count', 0)}")
if cl.get("target"):
lines.append(f" Target: {cl['target']} ({cl.get('status', '')})")
diff = cl.get("difference", 0)
sign = "+" if diff >= 0 else ""
lines.append(f" Difference: {sign}{diff}")
lines.append("")
# 2. Structure
st = results.get("structure", {})
lines.append(sep)
lines.append(" STRUCTURE")
lines.append(sep)
hc = st.get("heading_counts", {})
for lvl in range(1, 7):
count = hc.get(f"h{lvl}", 0)
if count > 0:
lines.append(f" H{lvl}: {count}")
issues = st.get("nesting_issues", [])
if issues:
lines.append(" Nesting issues:")
for issue in issues:
lines.append(f" - {issue}")
else:
lines.append(" Nesting: OK")
lines.append("")
# 3. Keyword Density (only variations with targets)
kd = results.get("keyword_density", {})
kw_list = kd.get("keywords", [])
lines.append(sep)
lines.append(" KEYWORD DENSITY")
lines.append(sep)
if kw_list:
lines.append(f" {'Variation':<30s} {'Count':>5s} {'Density':>7s} {'Avg':>5s} {'Max':>5s}")
lines.append(f" {'-'*30} {'-'*5} {'-'*7} {'-'*5} {'-'*5}")
for kw in kw_list:
avg_str = str(kw.get("target_avg", "")) if "target_avg" in kw else ""
max_str = str(kw.get("target_max", "")) if "target_max" in kw else ""
lines.append(
f" {kw['keyword']:<30s} "
f"{kw['count']:>5d} "
f"{kw['density_pct']:>6.2f}% "
f"{avg_str:>5s} "
f"{max_str:>5s}"
)
else:
lines.append(" No keywords specified.")
lines.append("")
lines.append(sep)
return "\n".join(lines)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Check keyword density and structure of a content draft.",
epilog="Example: uv run --with openpyxl python seo_optimizer.py draft.md --cora-xlsx report.xlsx",
)
parser.add_argument(
"draft_path",
help="Path to the content draft (plain text or markdown)",
)
parser.add_argument(
"--keyword",
dest="keyword",
default=None,
help="Primary keyword to evaluate",
)
parser.add_argument(
"--cora-xlsx",
dest="cora_xlsx",
default=None,
help="Path to a Cora XLSX report for keyword-specific targets",
)
parser.add_argument(
"--format",
choices=["json", "text"],
default="text",
help="Output format (default: text)",
)
args = parser.parse_args()
optimizer = SEOOptimizer()
try:
results = optimizer.analyze(
draft_path=args.draft_path,
primary_keyword=args.keyword,
cora_xlsx_path=args.cora_xlsx,
)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error during analysis: {e}", file=sys.stderr)
sys.exit(1)
if args.format == "json":
print(json.dumps(results, indent=2, default=str))
else:
print(_format_text_report(results))
if __name__ == "__main__":
main()
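The density math in `check_keyword_density` reduces to: strip heading markers, count whole-phrase matches in the lowercased body, and divide by the total word count. A minimal sketch of that calculation (with a made-up sample draft) makes the arithmetic concrete:

```python
import re

def density(text: str, keyword: str) -> tuple[int, float]:
    # Sketch of SEOOptimizer.check_keyword_density for one multi-word keyword:
    # remove markdown heading markers, then count and normalize by word count.
    clean = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE).lower()
    words = re.findall(r"[a-zA-Z']+", clean)
    count = len(re.findall(r"\b" + re.escape(keyword.lower()) + r"\b", clean))
    pct = round(count / len(words) * 100, 2) if words else 0.0
    return count, pct

sample = "## Swiss machining\nSwiss machining supports tight tolerances. Swiss machining scales well."
print(density(sample, "swiss machining"))  # (3, 27.27) — 3 matches over 11 words
```

Note the heading occurrence counts toward density here because only the `#` markers are stripped, not the heading text itself, matching the behavior of `_strip_markdown_headings`.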

View File

@ -0,0 +1,466 @@
#!/usr/bin/env python3
"""
Test Block Generator: Programmatically Assemble Test Blocks from Templates
Takes LLM-generated sentence templates (with {N} slots for body text) and
pre-written headings, plus an LLM-curated entity list, and assembles a test
block. Tracks aggregate densities in real-time and stops when targets are met.
The LLM handles all intelligence: filtering entities for topical relevance,
writing headings, creating body templates. This script handles all math:
slot filling, density tracking, stop conditions.
Usage:
uv run --with openpyxl python test_block_generator.py <templates_path> <prep_json_path> <cora_xlsx_path>
--entities-file <path> [--output-dir ./working/] [--min-sentences 5]
"""
import argparse
import json
import re
import sys
from pathlib import Path
from cora_parser import CoraReport
# ---------------------------------------------------------------------------
# Term selection
# ---------------------------------------------------------------------------
def load_entity_names(entities_file: str) -> list[str]:
"""Load LLM-curated entity names from file (one per line)."""
path = Path(entities_file)
if not path.exists():
print(f"Error: entities file not found: {path}", file=sys.stderr)
sys.exit(1)
names = []
for line in path.read_text(encoding="utf-8").splitlines():
name = line.strip()
if name:
names.append(name)
return names
def build_term_queue(
filtered_entity_names: list[str],
variations: list[str],
) -> list[str]:
"""Build a flat priority-ordered term list.
Order: filtered entities (LLM-curated, in provided order) -> keyword variations.
"""
terms = []
seen = set()
# 1. Filtered entities from LLM (already curated for topical relevance)
for name in filtered_entity_names:
if name.lower() not in seen:
terms.append(name)
seen.add(name.lower())
# 2. Keyword variations
for v in variations:
if v.lower() not in seen:
terms.append(v)
seen.add(v.lower())
return terms
# ---------------------------------------------------------------------------
# Generator
# ---------------------------------------------------------------------------
class TestBlockGenerator:
"""Fills body templates with entity/variation terms, inserts pre-written
headings, and tracks aggregate densities."""
def __init__(self, cora_xlsx_path: str, prep_data: dict, filtered_entity_names: list[str]):
self.report = CoraReport(cora_xlsx_path)
self.prep = prep_data
self.entities = self.report.get_entities()
self.variations = self.report.get_variations_list()
# Compile regex patterns for counting (built once, used per sentence)
self.entity_patterns = {}
for e in self.entities:
name = e["name"]
self.entity_patterns[name] = re.compile(
r"\b" + re.escape(name) + r"\b", re.IGNORECASE
)
self.variation_patterns = {}
for v in self.variations:
self.variation_patterns[v] = re.compile(
r"\b" + re.escape(v) + r"\b", re.IGNORECASE
)
# Build term queue from LLM-curated entity list
self.term_queue = build_term_queue(filtered_entity_names, self.variations)
self.term_idx = 0
# Track which 0->1 entities have been introduced
# Use the full missing list from prep to track introductions accurately
missing = prep_data.get("distinct_entities", {}).get("missing_entities", [])
self.missing_names = {e["name"] for e in missing}
self.introduced = set()
# Running totals for new content
self.new_words = 0
self.new_entity_mentions = 0
self.new_variation_mentions = 0
self.new_h2_count = 0
self.new_h3_count = 0
# Baseline from prep
self.base_words = prep_data["word_count"]["current"]
self.base_entity_mentions = prep_data["entity_density"]["current_mentions"]
self.base_variation_mentions = prep_data["variation_density"]["current_mentions"]
self.target_entity_d = prep_data["entity_density"]["target_decimal"]
self.target_variation_d = prep_data["variation_density"]["target_decimal"]
def pick_term(self, used_in_sentence: set) -> str:
"""Pick next term from the queue, skipping duplicates within a sentence."""
if not self.term_queue:
return "equipment"
used_lower = {u.lower() for u in used_in_sentence}
for _ in range(len(self.term_queue)):
term = self.term_queue[self.term_idx % len(self.term_queue)]
self.term_idx = (self.term_idx + 1) % len(self.term_queue)
if term.lower() not in used_lower:
return term
# All exhausted for this sentence, return next anyway
term = self.term_queue[self.term_idx % len(self.term_queue)]
self.term_idx = (self.term_idx + 1) % len(self.term_queue)
return term
def fill_template(self, template: str) -> str:
"""Fill a template's {N} slots with terms."""
slots = re.findall(r"\{(\d+)\}", template)
used = set()
filled = template
for slot_num in slots:
term = self.pick_term(used)
used.add(term)
filled = filled.replace(f"{{{slot_num}}}", term, 1)
return filled
def count_sentence(self, text: str) -> tuple[int, int, int]:
"""Count words, entity mentions, and variation mentions in text.
Also tracks which 0->1 entities have been introduced.
Returns: (word_count, entity_mentions, variation_mentions)
"""
entity_mentions = 0
for name, pattern in self.entity_patterns.items():
count = len(pattern.findall(text))
entity_mentions += count
if count > 0 and name in self.missing_names:
self.introduced.add(name)
variation_mentions = 0
for v, pattern in self.variation_patterns.items():
variation_mentions += len(pattern.findall(text))
words = len(re.findall(r"[a-zA-Z']+", text))
return words, entity_mentions, variation_mentions
def projected_density(self, metric: str) -> float:
"""Calculate projected density after current additions."""
total_words = self.base_words + self.new_words
if total_words == 0:
return 0.0
if metric == "entity":
return (self.base_entity_mentions + self.new_entity_mentions) / total_words
elif metric == "variation":
return (self.base_variation_mentions + self.new_variation_mentions) / total_words
return 0.0
def targets_met(self, min_reached: bool) -> bool:
"""Check if all density targets are met and minimums reached."""
if not min_reached:
return False
entity_ok = self.projected_density("entity") >= self.target_entity_d
variation_ok = self.projected_density("variation") >= self.target_variation_d
distinct_deficit = self.prep["distinct_entities"]["deficit"]
distinct_ok = len(self.introduced) >= distinct_deficit
wc_deficit = self.prep["word_count"]["deficit"]
wc_ok = self.new_words >= wc_deficit
return entity_ok and variation_ok and distinct_ok and wc_ok
def generate(
self,
templates: list[str],
min_sentences: int = 5,
) -> dict:
"""Generate the test block by filling body templates and inserting
pre-written headings.
Args:
templates: List of template strings. Lines starting with "H2:" or
"H3:" are pre-written headings (inserted as-is, no slot filling).
Everything else is a body template with {N} slots.
min_sentences: Minimum sentences before checking stop condition.
Returns:
Dict with "sentences" list and "stats" summary.
"""
h2_headings = []
h3_headings = []
body_templates = []
for t in templates:
t = t.strip()
if not t:
continue
if t.upper().startswith("H2:"):
h2_headings.append(t[3:].strip())
elif t.upper().startswith("H3:"):
h3_headings.append(t[3:].strip())
else:
body_templates.append(t)
if not body_templates:
return {"error": "No body templates found", "sentences": [], "stats": {}}
h2_needed = self.prep["headings"]["h2"]["deficit"]
h3_needed = self.prep["headings"]["h3"]["deficit"]
sentences = []
count = 0
body_idx = 0
h2_idx = 0
h3_idx = 0
max_iter = max(len(body_templates) * 3, 60)
for _ in range(max_iter):
# Insert pre-written heading if deficit exists and we're at a paragraph break
if h2_needed > 0 and h2_headings and count % 5 == 0:
text = h2_headings[h2_idx % len(h2_headings)]
w, e, v = self.count_sentence(text)
self.new_words += w
self.new_entity_mentions += e
self.new_variation_mentions += v
self.new_h2_count += 1
h2_needed -= 1
h2_idx += 1
sentences.append({"text": text, "type": "h2"})
count += 1
continue
if h3_needed > 0 and h3_headings and count > 0 and count % 3 == 0:
text = h3_headings[h3_idx % len(h3_headings)]
w, e, v = self.count_sentence(text)
self.new_words += w
self.new_entity_mentions += e
self.new_variation_mentions += v
self.new_h3_count += 1
h3_needed -= 1
h3_idx += 1
sentences.append({"text": text, "type": "h3"})
count += 1
continue
# Body sentence — fill template slots
tmpl = body_templates[body_idx % len(body_templates)]
filled = self.fill_template(tmpl)
w, e, v = self.count_sentence(filled)
self.new_words += w
self.new_entity_mentions += e
self.new_variation_mentions += v
body_idx += 1
sentences.append({"text": filled, "type": "body"})
count += 1
if self.targets_met(count >= min_sentences):
break
return {
"sentences": sentences,
"stats": {
"total_sentences": count,
"new_words": self.new_words,
"new_entity_mentions": self.new_entity_mentions,
"new_variation_mentions": self.new_variation_mentions,
"new_distinct_entities_introduced": len(self.introduced),
"introduced_entities": sorted(self.introduced),
"new_h2_count": self.new_h2_count,
"new_h3_count": self.new_h3_count,
"projected_entity_density_pct": round(
self.projected_density("entity") * 100, 2
),
"projected_variation_density_pct": round(
self.projected_density("variation") * 100, 2
),
"target_entity_density_pct": round(self.target_entity_d * 100, 2),
"target_variation_density_pct": round(self.target_variation_d * 100, 2),
},
}
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def format_markdown(sentences: list[dict]) -> str:
"""Convert sentence list to markdown with test block markers."""
lines = ["<!-- HIDDEN TEST BLOCK START -->", ""]
paragraph = []
for s in sentences:
if s["type"] in ("h2", "h3"):
# Flush paragraph before heading
if paragraph:
lines.append(" ".join(paragraph))
lines.append("")
paragraph = []
prefix = "##" if s["type"] == "h2" else "###"
lines.append(f"{prefix} {s['text']}")
lines.append("")
else:
paragraph.append(s["text"])
if len(paragraph) >= 4:
lines.append(" ".join(paragraph))
lines.append("")
paragraph = []
if paragraph:
lines.append(" ".join(paragraph))
lines.append("")
lines.append("<!-- HIDDEN TEST BLOCK END -->")
return "\n".join(lines)
def format_html(sentences: list[dict]) -> str:
"""Convert sentence list to HTML with test block markers."""
lines = ["<!-- HIDDEN TEST BLOCK START -->", ""]
paragraph = []
for s in sentences:
if s["type"] in ("h2", "h3"):
if paragraph:
lines.append("<p>" + " ".join(paragraph) + "</p>")
lines.append("")
paragraph = []
tag = "h2" if s["type"] == "h2" else "h3"
lines.append(f"<{tag}>{s['text']}</{tag}>")
lines.append("")
else:
paragraph.append(s["text"])
if len(paragraph) >= 4:
lines.append("<p>" + " ".join(paragraph) + "</p>")
lines.append("")
paragraph = []
if paragraph:
lines.append("<p>" + " ".join(paragraph) + "</p>")
lines.append("")
lines.append("<!-- HIDDEN TEST BLOCK END -->")
return "\n".join(lines)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Generate a test block from templates and deficit data.",
)
parser.add_argument("templates_path", help="Path to templates file (one per line)")
parser.add_argument("prep_json_path", help="Path to prep JSON from test_block_prep.py")
parser.add_argument("cora_xlsx_path", help="Path to Cora XLSX report")
parser.add_argument(
"--entities-file", required=True,
help="Path to LLM-curated entity list (one name per line)",
)
parser.add_argument(
"--output-dir", default="./working",
help="Directory for output files (default: ./working)",
)
parser.add_argument(
"--min-sentences", type=int, default=5,
help="Minimum sentences before checking stop condition (default: 5)",
)
args = parser.parse_args()
# Load inputs
templates_path = Path(args.templates_path)
if not templates_path.exists():
print(f"Error: templates file not found: {templates_path}", file=sys.stderr)
sys.exit(1)
templates = [
line.strip()
for line in templates_path.read_text(encoding="utf-8").splitlines()
if line.strip()
]
prep_path = Path(args.prep_json_path)
if not prep_path.exists():
print(f"Error: prep JSON not found: {prep_path}", file=sys.stderr)
sys.exit(1)
prep_data = json.loads(prep_path.read_text(encoding="utf-8"))
# Load LLM-curated entity list
filtered_entity_names = load_entity_names(args.entities_file)
# Generate
gen = TestBlockGenerator(args.cora_xlsx_path, prep_data, filtered_entity_names)
result = gen.generate(templates, min_sentences=args.min_sentences)
if "error" in result and result["error"]:
print(f"Error: {result['error']}", file=sys.stderr)
sys.exit(1)
# Write outputs
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
md_path = out_dir / "test_block.md"
html_path = out_dir / "test_block.html"
stats_path = out_dir / "test_block_stats.json"
md_content = format_markdown(result["sentences"])
html_content = format_html(result["sentences"])
md_path.write_text(md_content, encoding="utf-8")
html_path.write_text(html_content, encoding="utf-8")
stats_path.write_text(
json.dumps(result["stats"], indent=2, default=str), encoding="utf-8"
)
# Print summary
stats = result["stats"]
print("Test block generated:")
print(f" Sentences: {stats['total_sentences']}")
print(f" Words: {stats['new_words']}")
print(f" Entity mentions: {stats['new_entity_mentions']}")
print(f" Variation mentions: {stats['new_variation_mentions']}")
print(f" New 0->1 entities: {stats['new_distinct_entities_introduced']}")
print(f" Projected entity density: {stats['projected_entity_density_pct']}%"
f" (target: {stats['target_entity_density_pct']}%)")
print(f" Projected variation density: {stats['projected_variation_density_pct']}%"
f" (target: {stats['target_variation_density_pct']}%)")
print("\nFiles written:")
print(f" {md_path}")
print(f" {html_path}")
print(f" {stats_path}")
if __name__ == "__main__":
main()
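The stop condition in `TestBlockGenerator` rests on `projected_density`: running totals for the existing draft plus the generated block, expressed as mentions per word. A small sketch with hypothetical numbers (a 900-word draft with 18 entity mentions, extended by a 100-word block adding 12 more) shows how the projection converges on a target:

```python
def projected_density(base_mentions: int, new_mentions: int,
                      base_words: int, new_words: int) -> float:
    # Mirror of TestBlockGenerator.projected_density: combined mentions
    # divided by combined word count, guarding against an empty document.
    total_words = base_words + new_words
    if total_words == 0:
        return 0.0
    return (base_mentions + new_mentions) / total_words

d = projected_density(18, 12, 900, 100)
print(round(d * 100, 2))  # 3.0 — (18 + 12) / 1000 words
```

Because every generated sentence raises both the numerator and the denominator, adding low-density filler can actually push the projection down, which is why the generator cycles high-value terms from the queue rather than arbitrary text.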

View File

@ -0,0 +1,578 @@
#!/usr/bin/env python3
"""
Test Block Prep: Extract Deficit Data for Test Block Generation
Reads existing content (from competitor_scraper.py output or plain text) and a
Cora XLSX report, then calculates all deficit metrics needed to programmatically
generate a test block.
Outputs structured JSON with:
- Word count vs target + deficit
- Distinct entity count vs target + deficit + list of missing entities
- Variation density vs target + deficit (Cora row 46)
- Entity density vs target + deficit (Cora row 47)
- LSI density vs target + deficit (Cora row 48)
- Heading structure deficits
- Template generation instructions (slots per sentence, sentence count, etc.)
Usage:
uv run --with openpyxl python test_block_prep.py <content_path> <cora_xlsx_path>
[--format json|text]
"""
import argparse
import json
import math
import re
import sys
from pathlib import Path
from cora_parser import CoraReport
# ---------------------------------------------------------------------------
# Content parsing
# ---------------------------------------------------------------------------
def parse_scraper_content(file_path: str) -> dict:
"""Parse a competitor_scraper.py output file or plain text/markdown.
Returns dict with: headings, content, word_count, title, meta_description.
"""
text = Path(file_path).read_text(encoding="utf-8")
result = {
"headings": [],
"content": "",
"word_count": 0,
"title": "",
"meta_description": "",
}
if "--- HEADINGS ---" in text and "--- CONTENT ---" in text:
headings_start = text.index("--- HEADINGS ---")
content_start = text.index("--- CONTENT ---")
# Parse metadata
metadata = text[:headings_start]
for line in metadata.splitlines():
if line.startswith("Title: "):
result["title"] = line[7:].strip()
elif line.startswith("Meta Description: "):
result["meta_description"] = line[18:].strip()
# Parse headings
headings_text = text[headings_start + len("--- HEADINGS ---"):content_start].strip()
for line in headings_text.splitlines():
line = line.strip()
match = re.match(r"H(\d):\s+(.+)", line)
if match:
result["headings"].append({
"level": int(match.group(1)),
"text": match.group(2).strip(),
})
# Parse content
result["content"] = text[content_start + len("--- CONTENT ---"):].strip()
else:
# Plain text/markdown
result["content"] = text.strip()
for match in re.finditer(r"^(#{1,6})\s+(.+)$", text, re.MULTILINE):
result["headings"].append({
"level": len(match.group(1)),
"text": match.group(2).strip(),
})
words = re.findall(r"[a-zA-Z']+", result["content"])
result["word_count"] = len(words)
return result
# ---------------------------------------------------------------------------
# Counting functions
# ---------------------------------------------------------------------------
def count_entity_mentions(text: str, entities: list[dict]) -> dict:
"""Count mentions of each Cora entity in text.
Returns: per_entity dict, total_mentions, distinct_count.
"""
per_entity = {}
total_mentions = 0
distinct_count = 0
for entity in entities:
name = entity["name"]
pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
count = len(pattern.findall(text))
per_entity[name] = count
total_mentions += count
if count > 0:
distinct_count += 1
return {
"per_entity": per_entity,
"total_mentions": total_mentions,
"distinct_count": distinct_count,
}
def count_variation_mentions(text: str, variations: list[str]) -> dict:
"""Count mentions of each keyword variation in text.
Returns: per_variation dict, total_mentions.
"""
per_variation = {}
total_mentions = 0
for var in variations:
pattern = re.compile(r"\b" + re.escape(var) + r"\b", re.IGNORECASE)
count = len(pattern.findall(text))
per_variation[var] = count
total_mentions += count
return {
"per_variation": per_variation,
"total_mentions": total_mentions,
}
def count_lsi_mentions(text: str, lsi_keywords: list[dict]) -> dict:
"""Count mentions of each LSI keyword in text.
Returns: per_keyword dict, total_mentions, distinct_count.
"""
per_keyword = {}
total_mentions = 0
distinct_count = 0
for kw_data in lsi_keywords:
keyword = kw_data["keyword"]
tokens = keyword.strip().split()
escaped = [re.escape(t) for t in tokens]
pattern_str = r"\b" + r"\s+".join(escaped) + r"\b"
pattern = re.compile(pattern_str, re.IGNORECASE)
count = len(pattern.findall(text))
per_keyword[keyword] = count
total_mentions += count
if count > 0:
distinct_count += 1
return {
"per_keyword": per_keyword,
"total_mentions": total_mentions,
"distinct_count": distinct_count,
}
def count_terms_in_headings(
headings: list[dict],
entities: list[dict],
variations: list[str],
) -> dict:
"""Count entity and variation mentions in heading text.
Returns total counts and per-level breakdown.
"""
all_heading_text = " ".join(h["text"] for h in headings)
entity_mentions = 0
for entity in entities:
pattern = re.compile(r"\b" + re.escape(entity["name"]) + r"\b", re.IGNORECASE)
entity_mentions += len(pattern.findall(all_heading_text))
variation_mentions = 0
for var in variations:
pattern = re.compile(r"\b" + re.escape(var) + r"\b", re.IGNORECASE)
variation_mentions += len(pattern.findall(all_heading_text))
per_level = {}
for level in [2, 3]:
level_headings = [h for h in headings if h["level"] == level]
level_text = " ".join(h["text"] for h in level_headings)
lev_entity = 0
for entity in entities:
pattern = re.compile(r"\b" + re.escape(entity["name"]) + r"\b", re.IGNORECASE)
lev_entity += len(pattern.findall(level_text))
lev_var = 0
for var in variations:
pattern = re.compile(r"\b" + re.escape(var) + r"\b", re.IGNORECASE)
lev_var += len(pattern.findall(level_text))
per_level[f"h{level}"] = {
"count": len(level_headings),
"entity_mentions": lev_entity,
"variation_mentions": lev_var,
}
return {
"entity_mentions_total": entity_mentions,
"variation_mentions_total": variation_mentions,
"per_level": per_level,
}
# ---------------------------------------------------------------------------
# Template instruction calculation
# ---------------------------------------------------------------------------
def calculate_template_instructions(
current_words: int,
current_entity_mentions: int,
current_variation_mentions: int,
target_entity_density: float,
target_variation_density: float,
distinct_entity_deficit: int,
word_count_deficit: int,
) -> dict:
"""Calculate template parameters for the generator script.
Figures out how many words the test block needs, how many slots per
sentence, and how many sentences so the LLM knows what to generate.
"""
AVG_WORDS_PER_SENTENCE = 15
MAX_SLOTS = 5
MIN_SLOTS = 2
current_entity_density = current_entity_mentions / current_words if current_words > 0 else 0
current_variation_density = current_variation_mentions / current_words if current_words > 0 else 0
# Minimum test block size from word count deficit
min_words = max(word_count_deficit, 150)
# Calculate minimum words needed to close entity density gap
entity_deficit_pct = target_entity_density - current_entity_density
if entity_deficit_pct > 0:
# At max internal density (MAX_SLOTS / AVG_WORDS), how many words?
max_internal = MAX_SLOTS / AVG_WORDS_PER_SENTENCE
if max_internal > target_entity_density:
needed = (target_entity_density * current_words - current_entity_mentions)
words_for_entity = math.ceil(needed / (max_internal - target_entity_density))
min_words = max(min_words, words_for_entity)
# Same for variation density gap
var_deficit_pct = target_variation_density - current_variation_density
if var_deficit_pct > 0:
max_internal = MAX_SLOTS / AVG_WORDS_PER_SENTENCE
if max_internal > target_variation_density:
needed = (target_variation_density * current_words - current_variation_mentions)
words_for_var = math.ceil(needed / (max_internal - target_variation_density))
min_words = max(min_words, words_for_var)
# If only distinct entities are deficit (densities met), smaller block
if entity_deficit_pct <= 0 and var_deficit_pct <= 0 and distinct_entity_deficit > 0:
min_words = max(150, distinct_entity_deficit * AVG_WORDS_PER_SENTENCE)
# Round up to nearest 50
target_words = math.ceil(max(min_words, 150) / 50) * 50
# Required entity mentions in test block
if target_entity_density > 0:
total_needed = math.ceil(target_entity_density * (current_words + target_words))
entity_mentions_needed = max(0, total_needed - current_entity_mentions)
else:
entity_mentions_needed = max(distinct_entity_deficit, 0)
# Required variation mentions in test block
if target_variation_density > 0:
total_needed = math.ceil(target_variation_density * (current_words + target_words))
variation_mentions_needed = max(0, total_needed - current_variation_mentions)
else:
variation_mentions_needed = 0
# Derive slots per sentence
target_sentences = max(1, math.ceil(target_words / AVG_WORDS_PER_SENTENCE))
total_slots = entity_mentions_needed + variation_mentions_needed
    # Terms that are both an entity and a variation count toward both, so the
    # sum is an upper bound; never plan fewer slots than entity mentions alone
    total_slots = max(total_slots, entity_mentions_needed)
slots_per_sentence = math.ceil(total_slots / target_sentences) if target_sentences > 0 else MIN_SLOTS
slots_per_sentence = max(MIN_SLOTS, min(MAX_SLOTS, slots_per_sentence))
# Number of templates: derived from two factors
# 1. Word deficit: how many sentences to fill the word gap
word_driven = math.ceil(target_words / AVG_WORDS_PER_SENTENCE)
# 2. Entity deficit: how many sentences to introduce all missing entities
entity_driven = math.ceil(distinct_entity_deficit / slots_per_sentence) if slots_per_sentence > 0 else 0
    num_templates = max(word_driven, entity_driven, 5)
    if num_templates > max(word_driven, entity_driven):
        templates_reason = "minimum_floor"
    elif word_driven >= entity_driven:
        templates_reason = "word_deficit"
    else:
        templates_reason = "entity_deficit"
    return {
        "target_word_count": target_words,
        "num_templates": num_templates,
        "num_templates_reason": templates_reason,
"slots_per_sentence": slots_per_sentence,
"avg_words_per_template": AVG_WORDS_PER_SENTENCE,
"entity_mentions_needed": entity_mentions_needed,
"variation_mentions_needed": variation_mentions_needed,
"rationale": (
f"Need ~{entity_mentions_needed} entity mentions and "
f"~{variation_mentions_needed} variation mentions "
f"across ~{target_words} words. "
f"Templates: {num_templates} (driven by {'word deficit' if word_driven >= entity_driven else 'entity deficit'}), "
f"{slots_per_sentence} slots each."
),
}
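The slot and word-count sizing above reduces to one piece of algebra: raising a page of `W` words holding `m` mentions to a target density `d`, while writing new text at internal density `i > d`, needs at least `ceil((d*W - m) / (i - d))` added words. A minimal standalone sketch with illustrative numbers (not from a real Cora report):

```python
import math

def words_to_close_gap(current_words, current_mentions, target_density, internal_density):
    # Smallest x with (current_mentions + internal_density * x) / (current_words + x) >= target
    if internal_density <= target_density:
        raise ValueError("internal density must exceed the target to close the gap")
    needed = target_density * current_words - current_mentions
    return max(0, math.ceil(needed / (internal_density - target_density)))

# 1,000 words holding 10 mentions (1%), target 2%, writing at MAX_SLOTS / AVG_WORDS = 5/15
print(words_to_close_gap(1000, 10, 0.02, 5 / 15))  # 32
```

Adding 32 words at that internal density contributes roughly 10.7 mentions, bringing the page to about 20.7 mentions over 1,032 words, just above the 2% target.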
# ---------------------------------------------------------------------------
# Main prep function
# ---------------------------------------------------------------------------
def run_prep(content_path: str, cora_xlsx_path: str) -> dict:
"""Run the full test block prep analysis."""
report = CoraReport(cora_xlsx_path)
entities = report.get_entities()
lsi_keywords = report.get_lsi_keywords()
variations_list = report.get_variations_list()
density_targets = report.get_density_targets()
content_targets = report.get_content_targets()
structure_targets = report.get_structure_targets()
word_count_dist = report.get_word_count_distribution()
# Parse existing content
parsed = parse_scraper_content(content_path)
content_text = parsed["content"]
current_words = parsed["word_count"]
headings = parsed["headings"]
# --- Word count ---
cluster_target = word_count_dist.get("cluster_target", 0)
wc_target = cluster_target if cluster_target else word_count_dist.get("average", 0)
wc_deficit = max(0, wc_target - current_words)
# --- Entity counts ---
entity_data = count_entity_mentions(content_text, entities)
distinct_target = content_targets.get("distinct_entities", {}).get("target", 0)
distinct_deficit = max(0, distinct_target - entity_data["distinct_count"])
# Missing entities (0 count, sorted by relevance)
missing_entities = []
for entity in entities:
if entity_data["per_entity"].get(entity["name"], 0) == 0:
missing_entities.append({
"name": entity["name"],
"relevance": entity.get("relevance") or 0,
"type": entity.get("type", ""),
})
missing_entities.sort(key=lambda e: e["relevance"], reverse=True)
# --- Variation counts ---
variation_data = count_variation_mentions(content_text, variations_list)
# --- LSI counts ---
lsi_data = count_lsi_mentions(content_text, lsi_keywords)
# --- Density calculations ---
cur_entity_d = entity_data["total_mentions"] / current_words if current_words else 0
cur_var_d = variation_data["total_mentions"] / current_words if current_words else 0
cur_lsi_d = lsi_data["total_mentions"] / current_words if current_words else 0
tgt_entity_d = density_targets.get("entity_density", {}).get("avg") or 0
tgt_var_d = density_targets.get("variation_density", {}).get("avg") or 0
tgt_lsi_d = density_targets.get("lsi_density", {}).get("avg") or 0
# --- Heading analysis ---
heading_data = count_terms_in_headings(headings, entities, variations_list)
h2_target = structure_targets.get("h2", {}).get("count", {}).get("target", 0)
h3_target = structure_targets.get("h3", {}).get("count", {}).get("target", 0)
h2_current = heading_data["per_level"].get("h2", {}).get("count", 0)
h3_current = heading_data["per_level"].get("h3", {}).get("count", 0)
all_h_var_target = structure_targets.get("all_h_tags", {}).get("variations", {}).get("target", 0)
all_h_ent_target = structure_targets.get("all_h_tags", {}).get("entities", {}).get("target", 0)
# --- Template instructions ---
template_inst = calculate_template_instructions(
current_words=current_words,
current_entity_mentions=entity_data["total_mentions"],
current_variation_mentions=variation_data["total_mentions"],
target_entity_density=tgt_entity_d,
target_variation_density=tgt_var_d,
distinct_entity_deficit=distinct_deficit,
word_count_deficit=wc_deficit,
)
return {
"search_term": report.get_search_term(),
"content_file": content_path,
"word_count": {
"current": current_words,
"target": wc_target,
"deficit": wc_deficit,
"status": "meets_target" if wc_deficit == 0 else "below_target",
},
"distinct_entities": {
"current": entity_data["distinct_count"],
"target": distinct_target,
"deficit": distinct_deficit,
"total_tracked": len(entities),
"missing_entities": missing_entities,
},
"entity_density": {
"current_pct": round(cur_entity_d * 100, 2),
"target_pct": round(tgt_entity_d * 100, 2),
"deficit_pct": round(max(0, tgt_entity_d - cur_entity_d) * 100, 2),
"current_mentions": entity_data["total_mentions"],
"target_decimal": tgt_entity_d,
"current_decimal": cur_entity_d,
"status": "meets_target" if cur_entity_d >= tgt_entity_d else "below_target",
},
"variation_density": {
"current_pct": round(cur_var_d * 100, 2),
"target_pct": round(tgt_var_d * 100, 2),
"deficit_pct": round(max(0, tgt_var_d - cur_var_d) * 100, 2),
"current_mentions": variation_data["total_mentions"],
"target_decimal": tgt_var_d,
"current_decimal": cur_var_d,
"status": "meets_target" if cur_var_d >= tgt_var_d else "below_target",
},
"lsi_density": {
"current_pct": round(cur_lsi_d * 100, 2),
"target_pct": round(tgt_lsi_d * 100, 2),
"deficit_pct": round(max(0, tgt_lsi_d - cur_lsi_d) * 100, 2),
"current_mentions": lsi_data["total_mentions"],
"target_decimal": tgt_lsi_d,
"current_decimal": cur_lsi_d,
"status": "meets_target" if cur_lsi_d >= tgt_lsi_d else "below_target",
},
"headings": {
"h2": {
"current": h2_current,
"target": h2_target,
"deficit": max(0, h2_target - h2_current),
},
"h3": {
"current": h3_current,
"target": h3_target,
"deficit": max(0, h3_target - h3_current),
},
"variations_in_headings": {
"current": heading_data["variation_mentions_total"],
"target": all_h_var_target,
"deficit": max(0, all_h_var_target - heading_data["variation_mentions_total"]),
},
"entities_in_headings": {
"current": heading_data["entity_mentions_total"],
"target": all_h_ent_target,
"deficit": max(0, all_h_ent_target - heading_data["entity_mentions_total"]),
},
},
"template_instructions": template_inst,
}
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def format_text_report(data: dict) -> str:
"""Format prep data as a human-readable text report."""
lines = []
sep = "=" * 65
lines.append(sep)
    lines.append(f" TEST BLOCK PREP -- {data['search_term']}")
lines.append(sep)
lines.append("")
# Word count
wc = data["word_count"]
lines.append("WORD COUNT")
lines.append(f" Current: {wc['current']} | Target: {wc['target']} | Deficit: {wc['deficit']} [{wc['status']}]")
lines.append("")
# Distinct entities
de = data["distinct_entities"]
lines.append("DISTINCT ENTITIES")
lines.append(f" Current: {de['current']} | Target: {de['target']} | Deficit: {de['deficit']} (of {de['total_tracked']} tracked)")
if de["missing_entities"]:
lines.append(f" Top missing (0->1):")
for ent in de["missing_entities"][:15]:
lines.append(f" - {ent['name']} (relevance: {ent['relevance']}, type: {ent['type']})")
remaining = len(de["missing_entities"]) - 15
if remaining > 0:
lines.append(f" ... and {remaining} more")
lines.append("")
# Entity density
ed = data["entity_density"]
lines.append("ENTITY DENSITY (Cora row 47)")
lines.append(f" Current: {ed['current_pct']}% | Target: {ed['target_pct']}% | Deficit: {ed['deficit_pct']}% [{ed['status']}]")
lines.append(f" Current mentions: {ed['current_mentions']}")
lines.append("")
# Variation density
vd = data["variation_density"]
lines.append("VARIATION DENSITY (Cora row 46)")
lines.append(f" Current: {vd['current_pct']}% | Target: {vd['target_pct']}% | Deficit: {vd['deficit_pct']}% [{vd['status']}]")
lines.append(f" Current mentions: {vd['current_mentions']}")
lines.append("")
# LSI density
ld = data["lsi_density"]
lines.append("LSI DENSITY (Cora row 48)")
lines.append(f" Current: {ld['current_pct']}% | Target: {ld['target_pct']}% | Deficit: {ld['deficit_pct']}% [{ld['status']}]")
lines.append(f" Current mentions: {ld['current_mentions']}")
lines.append("")
# Headings
hd = data["headings"]
lines.append("HEADING DEFICITS")
lines.append(f" H2: {hd['h2']['current']} current / {hd['h2']['target']} target -- deficit {hd['h2']['deficit']}")
lines.append(f" H3: {hd['h3']['current']} current / {hd['h3']['target']} target -- deficit {hd['h3']['deficit']}")
lines.append(f" Variations in headings: {hd['variations_in_headings']['current']} / {hd['variations_in_headings']['target']} -- deficit {hd['variations_in_headings']['deficit']}")
lines.append(f" Entities in headings: {hd['entities_in_headings']['current']} / {hd['entities_in_headings']['target']} -- deficit {hd['entities_in_headings']['deficit']}")
lines.append("")
# Template instructions
ti = data["template_instructions"]
lines.append("TEMPLATE INSTRUCTIONS")
lines.append(f" {ti['rationale']}")
lines.append(f" >> Generate {ti['num_templates']} templates, ~{ti['avg_words_per_template']} words each, {ti['slots_per_sentence']} slots per template")
lines.append("")
lines.append(sep)
return "\n".join(lines)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Extract deficit data for test block generation.",
)
parser.add_argument("content_path", help="Path to scraper output or content file")
parser.add_argument("cora_xlsx_path", help="Path to Cora XLSX report")
parser.add_argument(
"--format", choices=["json", "text"], default="text",
help="Output format (default: text)",
)
parser.add_argument(
"--output", "-o", default=None,
help="Write output to file instead of stdout",
)
args = parser.parse_args()
try:
data = run_prep(args.content_path, args.cora_xlsx_path)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if args.format == "json":
output = json.dumps(data, indent=2, default=str)
else:
output = format_text_report(data)
if args.output:
Path(args.output).write_text(output, encoding="utf-8")
print(f"Written to {args.output}", file=sys.stderr)
    else:
        # Handle Windows console encodings, matching test_block_validate.py
        try:
            print(output)
        except UnicodeEncodeError:
            sys.stdout.buffer.write(output.encode("utf-8"))
if __name__ == "__main__":
main()

#!/usr/bin/env python3
"""
Test Block Validator: Before/After Comparison
Runs the same deficit analysis from test_block_prep.py on:
1. Existing content alone (before)
2. Existing content + test block (after)
Produces a deterministic comparison showing exactly how each metric changed.
Usage:
uv run --with openpyxl python test_block_validate.py <content_path> <test_block_path> <cora_xlsx_path>
[--format json|text] [--output PATH]
"""
import argparse
import json
import re
import sys
from pathlib import Path
from cora_parser import CoraReport
from test_block_prep import (
parse_scraper_content,
count_entity_mentions,
count_variation_mentions,
count_lsi_mentions,
count_terms_in_headings,
)
def extract_test_block_text(file_path: str) -> str:
"""Read test block file and return the text content.
Strips HTML tags and test block markers. Returns plain text for counting.
"""
text = Path(file_path).read_text(encoding="utf-8")
# Remove test block markers
text = text.replace("<!-- HIDDEN TEST BLOCK START -->", "")
text = text.replace("<!-- HIDDEN TEST BLOCK END -->", "")
# Remove HTML tags
text = re.sub(r"<[^>]+>", " ", text)
# Remove markdown heading markers
text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
return text.strip()
def extract_test_block_headings(file_path: str) -> list[dict]:
"""Extract heading structure from test block (HTML or markdown)."""
text = Path(file_path).read_text(encoding="utf-8")
headings = []
# Try HTML headings first
for match in re.finditer(r"<h(\d)>(.+?)</h\d>", text, re.IGNORECASE):
headings.append({
"level": int(match.group(1)),
"text": match.group(2).strip(),
})
# If no HTML headings, try markdown
if not headings:
for match in re.finditer(r"^(#{1,6})\s+(.+)$", text, re.MULTILINE):
headings.append({
"level": len(match.group(1)),
"text": match.group(2).strip(),
})
return headings
def run_validation(
content_path: str,
test_block_path: str,
cora_xlsx_path: str,
) -> dict:
"""Run before/after validation.
Returns dict with: before, after, delta, targets, status.
"""
report = CoraReport(cora_xlsx_path)
entities = report.get_entities()
lsi_keywords = report.get_lsi_keywords()
variations_list = report.get_variations_list()
density_targets = report.get_density_targets()
content_targets = report.get_content_targets()
structure_targets = report.get_structure_targets()
word_count_dist = report.get_word_count_distribution()
# --- Parse existing content ---
parsed = parse_scraper_content(content_path)
existing_text = parsed["content"]
existing_headings = parsed["headings"]
# --- Parse test block ---
block_text = extract_test_block_text(test_block_path)
block_headings = extract_test_block_headings(test_block_path)
# --- Combined ---
combined_text = existing_text + "\n\n" + block_text
combined_headings = existing_headings + block_headings
# --- Count words ---
    def count_words(t: str) -> int:
        return len(re.findall(r"[a-zA-Z']+", t))
before_words = count_words(existing_text)
block_words = count_words(block_text)
after_words = count_words(combined_text)
# --- Count entities ---
before_ent = count_entity_mentions(existing_text, entities)
after_ent = count_entity_mentions(combined_text, entities)
# --- Count variations ---
before_var = count_variation_mentions(existing_text, variations_list)
after_var = count_variation_mentions(combined_text, variations_list)
# --- Count LSI ---
before_lsi = count_lsi_mentions(existing_text, lsi_keywords)
after_lsi = count_lsi_mentions(combined_text, lsi_keywords)
# --- Heading analysis ---
before_hdg = count_terms_in_headings(existing_headings, entities, variations_list)
after_hdg = count_terms_in_headings(combined_headings, entities, variations_list)
# --- Targets ---
tgt_entity_d = density_targets.get("entity_density", {}).get("avg") or 0
tgt_var_d = density_targets.get("variation_density", {}).get("avg") or 0
tgt_lsi_d = density_targets.get("lsi_density", {}).get("avg") or 0
distinct_target = content_targets.get("distinct_entities", {}).get("target", 0)
cluster_target = word_count_dist.get("cluster_target", 0)
wc_target = cluster_target if cluster_target else word_count_dist.get("average", 0)
h2_target = structure_targets.get("h2", {}).get("count", {}).get("target", 0)
h3_target = structure_targets.get("h3", {}).get("count", {}).get("target", 0)
# --- Build comparison ---
def density(mentions, words):
return mentions / words if words > 0 else 0
def pct(d):
return round(d * 100, 2)
# Find new 0->1 entities
new_entities = []
for name, after_count in after_ent["per_entity"].items():
before_count = before_ent["per_entity"].get(name, 0)
if before_count == 0 and after_count > 0:
new_entities.append(name)
before_h2 = len([h for h in existing_headings if h["level"] == 2])
after_h2 = len([h for h in combined_headings if h["level"] == 2])
before_h3 = len([h for h in existing_headings if h["level"] == 3])
after_h3 = len([h for h in combined_headings if h["level"] == 3])
return {
"search_term": report.get_search_term(),
"test_block_words": block_words,
"word_count": {
"before": before_words,
"after": after_words,
"target": wc_target,
"before_status": "meets" if before_words >= wc_target else "below",
"after_status": "meets" if after_words >= wc_target else "below",
},
"distinct_entities": {
"before": before_ent["distinct_count"],
"after": after_ent["distinct_count"],
"target": distinct_target,
"new_0_to_1": len(new_entities),
"new_entity_names": sorted(new_entities),
"before_status": "meets" if before_ent["distinct_count"] >= distinct_target else "below",
"after_status": "meets" if after_ent["distinct_count"] >= distinct_target else "below",
},
"entity_density": {
"before_pct": pct(density(before_ent["total_mentions"], before_words)),
"after_pct": pct(density(after_ent["total_mentions"], after_words)),
"target_pct": pct(tgt_entity_d),
"before_mentions": before_ent["total_mentions"],
"after_mentions": after_ent["total_mentions"],
"delta_mentions": after_ent["total_mentions"] - before_ent["total_mentions"],
"before_status": "meets" if density(before_ent["total_mentions"], before_words) >= tgt_entity_d else "below",
"after_status": "meets" if density(after_ent["total_mentions"], after_words) >= tgt_entity_d else "below",
},
"variation_density": {
"before_pct": pct(density(before_var["total_mentions"], before_words)),
"after_pct": pct(density(after_var["total_mentions"], after_words)),
"target_pct": pct(tgt_var_d),
"before_mentions": before_var["total_mentions"],
"after_mentions": after_var["total_mentions"],
"delta_mentions": after_var["total_mentions"] - before_var["total_mentions"],
"before_status": "meets" if density(before_var["total_mentions"], before_words) >= tgt_var_d else "below",
"after_status": "meets" if density(after_var["total_mentions"], after_words) >= tgt_var_d else "below",
},
"lsi_density": {
"before_pct": pct(density(before_lsi["total_mentions"], before_words)),
"after_pct": pct(density(after_lsi["total_mentions"], after_words)),
"target_pct": pct(tgt_lsi_d),
"before_mentions": before_lsi["total_mentions"],
"after_mentions": after_lsi["total_mentions"],
"delta_mentions": after_lsi["total_mentions"] - before_lsi["total_mentions"],
"before_status": "meets" if density(before_lsi["total_mentions"], before_words) >= tgt_lsi_d else "below",
"after_status": "meets" if density(after_lsi["total_mentions"], after_words) >= tgt_lsi_d else "below",
},
"headings": {
"h2": {
"before": before_h2,
"after": after_h2,
"target": h2_target,
},
"h3": {
"before": before_h3,
"after": after_h3,
"target": h3_target,
},
"entities_in_headings": {
"before": before_hdg["entity_mentions_total"],
"after": after_hdg["entity_mentions_total"],
},
"variations_in_headings": {
"before": before_hdg["variation_mentions_total"],
"after": after_hdg["variation_mentions_total"],
},
},
}
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def format_text_report(data: dict) -> str:
"""Format validation as a human-readable before/after comparison."""
lines = []
sep = "=" * 70
lines.append(sep)
lines.append(f" TEST BLOCK VALIDATION -- {data['search_term']}")
lines.append(f" Test block added {data['test_block_words']} words")
lines.append(sep)
lines.append("")
# Helper for status indicator
def status(s):
return "[OK]" if s == "meets" else "[!!]"
# Word count
wc = data["word_count"]
lines.append(f" {'METRIC':<30} {'BEFORE':>10} {'AFTER':>10} {'TARGET':>10} {'STATUS':>8}")
lines.append(f" {'-'*30} {'-'*10} {'-'*10} {'-'*10} {'-'*8}")
lines.append(
f" {'Word count':<30} {wc['before']:>10} {wc['after']:>10} "
f"{wc['target']:>10} {status(wc['after_status']):>8}"
)
# Distinct entities
de = data["distinct_entities"]
lines.append(
f" {'Distinct entities':<30} {de['before']:>10} {de['after']:>10} "
f"{de['target']:>10} {status(de['after_status']):>8}"
)
# Entity density
ed = data["entity_density"]
lines.append(
f" {'Entity density %':<30} {ed['before_pct']:>9}% {ed['after_pct']:>9}% "
f"{ed['target_pct']:>9}% {status(ed['after_status']):>8}"
)
# Variation density
vd = data["variation_density"]
lines.append(
f" {'Variation density %':<30} {vd['before_pct']:>9}% {vd['after_pct']:>9}% "
f"{vd['target_pct']:>9}% {status(vd['after_status']):>8}"
)
# LSI density
ld = data["lsi_density"]
lines.append(
f" {'LSI density %':<30} {ld['before_pct']:>9}% {ld['after_pct']:>9}% "
f"{ld['target_pct']:>9}% {status(ld['after_status']):>8}"
)
lines.append("")
# Mention counts
lines.append(f" {'MENTION COUNTS':<30} {'BEFORE':>10} {'AFTER':>10} {'DELTA':>10}")
lines.append(f" {'-'*30} {'-'*10} {'-'*10} {'-'*10}")
lines.append(
f" {'Entity mentions':<30} {ed['before_mentions']:>10} "
f"{ed['after_mentions']:>10} {'+' + str(ed['delta_mentions']):>10}"
)
lines.append(
f" {'Variation mentions':<30} {vd['before_mentions']:>10} "
f"{vd['after_mentions']:>10} {'+' + str(vd['delta_mentions']):>10}"
)
lines.append(
f" {'LSI mentions':<30} {ld['before_mentions']:>10} "
f"{ld['after_mentions']:>10} {'+' + str(ld['delta_mentions']):>10}"
)
lines.append("")
# Headings
hd = data["headings"]
lines.append(f" {'HEADINGS':<30} {'BEFORE':>10} {'AFTER':>10} {'TARGET':>10}")
lines.append(f" {'-'*30} {'-'*10} {'-'*10} {'-'*10}")
lines.append(f" {'H2 count':<30} {hd['h2']['before']:>10} {hd['h2']['after']:>10} {hd['h2']['target']:>10}")
lines.append(f" {'H3 count':<30} {hd['h3']['before']:>10} {hd['h3']['after']:>10} {hd['h3']['target']:>10}")
lines.append(
f" {'Entities in headings':<30} {hd['entities_in_headings']['before']:>10} "
f"{hd['entities_in_headings']['after']:>10}"
)
lines.append(
f" {'Variations in headings':<30} {hd['variations_in_headings']['before']:>10} "
f"{hd['variations_in_headings']['after']:>10}"
)
lines.append("")
# New entities
de = data["distinct_entities"]
if de["new_entity_names"]:
lines.append(f" NEW ENTITIES INTRODUCED (0->1): {de['new_0_to_1']}")
for name in de["new_entity_names"]:
lines.append(f" + {name}")
lines.append("")
lines.append(sep)
return "\n".join(lines)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Validate a test block with before/after comparison.",
)
parser.add_argument("content_path", help="Path to existing content (scraper output)")
parser.add_argument("test_block_path", help="Path to test block (.md or .html)")
parser.add_argument("cora_xlsx_path", help="Path to Cora XLSX report")
parser.add_argument(
"--format", choices=["json", "text"], default="text",
help="Output format (default: text)",
)
parser.add_argument(
"--output", "-o", default=None,
help="Write output to file instead of stdout",
)
args = parser.parse_args()
try:
data = run_validation(args.content_path, args.test_block_path, args.cora_xlsx_path)
except FileNotFoundError as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if args.format == "json":
output = json.dumps(data, indent=2, default=str)
else:
output = format_text_report(data)
if args.output:
Path(args.output).write_text(output, encoding="utf-8")
print(f"Written to {args.output}", file=sys.stderr)
else:
# Handle Windows encoding
try:
print(output)
except UnicodeEncodeError:
sys.stdout.buffer.write(output.encode("utf-8"))
if __name__ == "__main__":
main()

---
name: content-researcher
description: Research, outline, draft, and optimize SEO web content (service pages, blog posts, product pages) against Cora SEO reports. Create new content. Entity, LSI, and keyword density optimization. Generate entity test blocks (hidden divs).
---
# Content Research & Creation Skill
Write and optimize SEO web content — service pages, blog posts, product pages, landing pages. Covers the full pipeline: competitor research, outline, drafting, and quantitative optimization against a Cora SEO report (XLSX).
---
## Invocation
Use this skill when the user asks to write, research, outline, draft, or optimize web content. Common triggers:
- "Write a service page about [topic]"
- "Let's work on the [topic] page"
- "Create content about [topic] for [company]"
- "I have a Cora report for [keyword]"
- "Optimize this page against the Cora report"
- "Help me build an outline for [topic]"
- "Research [topic] and write an article"
- Any mention of writing web pages, blog posts, or SEO content for a website
**Routing logic — ask two questions up front:**
1. "Do you have a Cora report (XLSX) for this keyword?"
2. "Do you have existing content to optimize?" (could be a URL to a live page, pasted text, or a file path)
| Cora report? | Existing content? | Start at |
|--------------|-------------------|----------|
| No | No | Phase 1, Step 1 (full research → draft workflow) |
| Yes | No | Phase 1, Step 1 (research → outline using Cora targets → draft → optimize) |
| Yes | Yes | Phase 2, Step 6 (load Cora, optimize existing content) |
| No | Yes | Ask user to generate the Cora report first — optimization without Cora targets is guesswork |
**Existing content from a URL:** If the user provides a URL to a live page (e.g. their WordPress site), **always use the BS4 competitor scraper** to pull the content — never `web_fetch`. The `web_fetch` tool runs content through an AI summarization layer that loses heading structure, drops sections, and can hallucinate product details. The scraper returns the actual HTML heading hierarchy and verbatim text.
```bash
cd {skill_dir}/scripts && uv run --with requests,beautifulsoup4 python competitor_scraper.py "URL" --output-dir ./working/competitor_content/
```
Read the output file, then use the scraped heading structure and body text to build `./working/draft.md`. Preserve the original text verbatim — do not paraphrase or summarize product descriptions, specifications, or technical details. Only restructure headings and add entity/LSI terms where needed for optimization. The user does NOT need to paste or save the content manually.
---
## Phase 1: Research & First Draft
### Step 1 — Topic Input
Collect from the user:
- **Required:** Topic or keyword
- **Optional:** Competitor URLs to examine, industry context, pasted research they've already done, target audience
- **For service pages:** Company name, what services/capabilities they actually offer, what they do NOT offer. This prevents writing claims about capabilities the company doesn't have. Ask explicitly: "Is this a service page? If so, what does the company offer and what should I avoid mentioning?"
For informational/educational articles, company details are less critical — the content is about the topic, not the company. For service pages, company context is mandatory before drafting.
If the user provides their own research (pasted text, notes, URLs), use that as the primary input. Do not redo research the user has already done.
### Step 2 — Competitor Research
Research what competitors are publishing on this topic. Three modes depending on user input:
**Mode A — Claude researches (default):**
Use `web_search` to find the top competitor content for the topic. Use the BS4 competitor scraper (not `web_fetch`) to read the most relevant 5-10 results — this preserves accurate heading structure and verbatim text. Focus on:
- What subtopics they cover
- How they structure their content (H2/H3 breakdown)
- What angles or claims they make
- What they leave out (gaps)
**Mode B — User provides URLs:**
If the user gives specific URLs, use the competitor scraper to bulk-fetch them:
```bash
cd {skill_dir}/scripts && uv run --with requests,beautifulsoup4 python competitor_scraper.py URL1 URL2 URL3 --output-dir ./working/competitor_content/
```
Then read the output files and analyze them.
**Mode C — User provides research:**
If the user pastes in research, notes, or analysis, skip scraping and work from what they gave you.
**Output:** Write a research summary covering:
1. Common themes across competitors (what everyone covers)
2. Content structure patterns (how they organize it)
3. Key entities, terms, and concepts mentioned repeatedly
4. Gaps — what competitors miss or cover poorly
5. Potential unique angles
Save the research summary to `./working/research_summary.md`.
### Step 3 — Build Outline
Using the research summary, build a structured outline:
1. **Generate fan-out queries** — Before structuring the outline, generate 10-15 search queries you would use to thoroughly research this topic. These are the natural "next searches" someone would run after the primary keyword — questions, comparisons, material/process specifics, use-case queries. Examples for "cnc swiss screw machining":
- "what is swiss screw machining"
- "swiss screw machining vs cnc turning"
- "swiss machining tolerances"
- "what materials can be swiss machined"
- "swiss screw machining for medical devices"
- "when to use swiss machining vs conventional lathe"
These queries represent the search cluster around the topic. The more of them the content answers, the more authoritative it becomes across related searches.
2. **Cover the common ground** — Include the themes that all/most competitors address. Missing these makes content look incomplete.
3. **Identify 1-2 unique angles** — Find something competitors are NOT covering well. This is the content's differentiator.
4. **Shape H3 headings from fan-out queries** — Map the strongest fan-out queries to H3 headings. Headings that match real search patterns give the content more surface area across the query cluster. A heading like "What Materials Can Be Swiss Machined?" is better than "Materials" because it mirrors how people actually search.
5. **Structure for scanning** — Use clear H2 sections with H3 subsections. Each H2 should address one major subtopic.
6. **Include notes on each section** — Brief description of what goes in each section and why.
Consult `references/content_frameworks.md` for structural templates (how-to, listicle, comparison, etc.) and select the best fit for the topic.
**IMPORTANT: You need a Cora report BEFORE building the outline.** The Cora report provides:
- Heading count targets (H2, H3 counts) that shape the outline structure
- Entity lists that inform heading names (pack entity terms into H2/H3 headings)
- Word count targets that determine section depth
- Structure targets (entities per heading level, variations per heading level) that guide how keyword-rich headings should be
If the user has not yet provided the Cora XLSX, **ask for it before proceeding with the outline.** Research can happen without Cora, but the outline should not be built without it.
Save the outline to `./working/outline.md`.
### Step 4 — HUMAN REVIEW (STOP AND WAIT)
**Present the outline to the user and ask:**
> "Here's the outline based on the research. Review it and let me know:
> 1. Any sections to add, remove, or reorder?
> 2. Are the unique angles worth pursuing?
> 3. Any specific points or data you want included?
> 4. Anything else before I draft?"
**Do NOT proceed until the user responds.** This is a critical gate. Incorporate all feedback before moving on.
### Step 5 — Write First Draft
Write the full content based on the approved outline:
- Follow the structure exactly as approved
- Consult `references/brand_guidelines.md` for voice and tone guidance
- Write in clear, scannable paragraphs (max 4 sentences per paragraph)
- Use subheadings every 2-4 paragraphs
- Include lists, examples, and concrete details where appropriate
- Aim for the word count the user specified.
**Fan-Out Query (FOQ) Section:**
After the main content, write a separate FOQ section using the fan-out queries from the outline. This section is **excluded** from word count and heading count targets — it lives outside the core article.
- Each FOQ is an H3 heading phrased as a question
- Answer in 2-3 sentences max, self-contained
- **Restate the question in the answer** — this is the format LLMs and featured snippets prefer for citation: "How does X work? X works by..."
- The user may style these as accordions, FAQ schema, or hidden divs
- Mark the section clearly (e.g. `<!-- FOQ SECTION START -->`) so it's easy to separate from the main content
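A minimal sketch of the FOQ format (the topic and answers here are invented for illustration; real questions come from the fan-out queries in the outline):

```markdown
<!-- FOQ SECTION START -->
### How does Swiss screw machining hold tight tolerances?
Swiss screw machining holds tight tolerances by supporting the bar stock in a guide bushing at the cutting point, which minimizes deflection.

### What materials can be Swiss machined?
Materials that can be Swiss machined include stainless steel, brass, aluminum, and titanium, among other bar-stock alloys.
<!-- FOQ SECTION END -->
```

Note how each answer restates its question in the first sentence, per the citation-friendly format above.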
Save the draft to `./working/draft.md`.
Tell the user: "First draft is ready. If you have a Cora report for this keyword, provide the XLSX path and I'll optimize against it. Otherwise, let me know what changes you'd like."
---
## Phase 2: Cora Optimization
This phase begins when the user provides a Cora XLSX report. The draft may come from Phase 1, or the user may provide an existing draft to optimize.
### Step 6 — Load Cora Report
Parse the Cora XLSX and display a summary of targets:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python cora_parser.py "{cora_xlsx_path}" --sheet summary
```
Show the user:
- Search term and keyword variations
- Entity count and deficit count
- LSI keyword count and deficit count
- Word count target (cluster target, not raw average)
- Density targets (variation, entity, LSI)
- Key optimization rules that will be applied
### Step 7 — Entity Optimization
Run the entity optimizer against the draft:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python entity_optimizer.py "{draft_path}" "{cora_xlsx_path}" --top-n 30
```
Review the output and apply the top recommendations:
- Focus on entities with high relevance AND high remaining deficit
- Add entities naturally — they must fit the context of the section
- Prioritize adding entities to H2 and H3 headings first (these are primary optimization targets)
- Do NOT force entities where they don't make sense — readability always wins
- H1: exactly 1, always. Do not add a second H1.
- H5, H6: ignore completely
- H4: only add if most competitors have them
After applying entity changes, save the updated draft.
### Step 8 — LSI Keyword Optimization
Run the LSI optimizer:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python lsi_optimizer.py "{draft_path}" "{cora_xlsx_path}" --min-correlation 0.2 --top-n 50
```
Apply LSI keyword recommendations:
- Focus on keywords with strongest correlation (highest absolute value = most ranking impact)
- Many LSI keywords are common phrases that may already appear naturally
- Add missing keywords in body text, not just headings
- Some LSI keywords overlap with entities — count these once, benefit twice
After applying LSI changes, save the updated draft.
### Step 9 — Structure & Density Check
Check the overall structure against Cora targets:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python cora_parser.py "{cora_xlsx_path}" --sheet structure --format json
cd {skill_dir}/scripts && uv run --with openpyxl python cora_parser.py "{cora_xlsx_path}" --sheet densities --format json
```
Verify and adjust:
- **Heading counts:** Compare H1, H2, H3, H4 counts against Page 1 Average targets. Add or consolidate headings as needed.
- **Entities per heading level:** Check that each heading level has enough entity mentions vs. the Structure sheet targets.
- **Variations in headings:** Ensure keyword variations appear in H2/H3 headings at target levels.
- **Density targets:** Check variation density, entity density, and LSI density against the Strategic Overview percentages.
- **Word count:** Compare against the cluster target (NOT the raw average). If below target, identify which sections could be expanded.
**Important density note:** Adding content to meet one target changes the denominator for ALL density calculations. After significant word count changes, re-check densities. Usually 1-2 optimization passes are sufficient.
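A quick illustration of why the denominator matters. The function and all numbers below are made up for the sketch; real counts come from the optimizer scripts:

```python
# Hedged sketch: adding words to raise one density lowers every other density.
def density(mentions: int, words: int) -> float:
    """Density as a percentage of total words."""
    return 100.0 * mentions / words

baseline_words = 1200
entity_mentions = 30        # entity density: 2.5%
variation_mentions = 24     # variation density: 2.0%

# Add a 300-word section containing 12 new entity mentions...
new_words, new_entity_mentions = 300, 12
entity_after = density(entity_mentions + new_entity_mentions, baseline_words + new_words)

# ...and variation density drops, because the denominator grew with no new variations:
variation_after = density(variation_mentions, baseline_words + new_words)
print(round(entity_after, 2), round(variation_after, 2))
```

This is why a re-check after any significant word count change is part of the workflow.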
### Step 10 — Keyword Density Check (Optional)
If a quick keyword density check is useful:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python seo_optimizer.py "{draft_path}" --cora-xlsx "{cora_xlsx_path}"
```
Key rules:
- Exact match keyword density: 2% minimum, no upper limit
- Variations capture exact match — hitting variation density targets covers exact match
- Do NOT flag keyword stuffing. There is no practical upper limit that hurts rankings.
### Step 11 — Meta Title, Meta Description, and URL Slug
Generate meta tags and add them as an HTML comment block at the top of the draft file.
**Meta title format:** Pack keyword variations into a pipe-separated title tag. Google reads far more than the ~60 characters it displays — a long title tag with variations gives the page more surface area across related searches. You can go up to 500 characters but do not have to.
Format: `Exact Search Term | Variation 1 | Variation 2 | ... | Company Name`
Use the keyword variations from the Cora report. Only include variations that have a page1_avg > 0 (competitors actually use them). Put the highest-value variations first.
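A minimal sketch of the assembly logic, assuming variation names and page1_avg values invented for illustration (real values come from the Cora variations sheet):

```python
# Hedged sketch: build the pipe-separated meta title from Cora variations.
def build_meta_title(search_term, variations, company, max_len=500):
    """Keep only variations competitors actually use (page1_avg > 0), highest value first."""
    used = [v for v, avg in sorted(variations, key=lambda x: -x[1]) if avg > 0]
    title = " | ".join([search_term] + used + [company])
    # Trim whole variations (never mid-word) if over the character cap.
    while len(title) > max_len and used:
        used.pop()
        title = " | ".join([search_term] + used + [company])
    return title

title = build_meta_title(
    "CNC Swiss Screw Machining",
    [("swiss machining services", 3.2), ("swiss turning", 1.8), ("screw machine products", 0.0)],
    "Acme Precision",  # hypothetical company name
)
print(title)
```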
**Meta description:** Write a keyword-rich summary (~350-500 characters) that hits the primary keyword, key variations, materials, sizes, and company name. This is not just a copy of the intro paragraph — it should be independently optimized.
**URL slug:** Short, keyword-focused. Example: `/custom-spun-hemispheres`
Add to the top of the draft file:
```html
<!--
META TITLE: Exact Search Term | Variation 1 | Variation 2 | Company Name
META DESCRIPTION: Keyword-rich summary here.
URL SLUG: /url-slug-here
-->
```
### Step 12 — Image & Diagram Placement
Read through the draft md file and identify where visuals would enhance the content:
For each recommendation, specify:
- **Location:** After which heading or paragraph
- **Type:** Photo, diagram, chart, infographic, screenshot, illustration
- **Description:** What the visual should show
- **Rationale:** Why it adds value at that point (breaks up text, illustrates a process, makes data tangible, etc.)
Common placement triggers:
- Sections describing a process or workflow (diagram)
- Sections with comparative data (chart or table)
- Long text-only stretches (break up with a relevant image)
- Technical concepts that benefit from visual explanation (diagram)
- Before/after scenarios (side-by-side images)
### Step 13 — HUMAN REVIEW (STOP AND WAIT)
**Present the final draft, optimization summary, and image suggestions to the user:**
> "Here's the optimized draft. Summary of changes:
> - [X] entities added across [Y] sections
> - [X] LSI keywords incorporated
> - Word count: [current] (target: [target])
> - Variation density: [current]% (target: [target]%)
> - Entity density: [current]% (target: [target]%)
> - [X] image/diagram placements suggested
>
> Review the draft. What needs adjusting?"
**Do NOT finalize until the user approves.**
### Step 14 — HTML Export
After the user approves the draft, convert the markdown to plain HTML for WordPress. Save as `./working/draft.html` (or `draft_normal.html`, `draft_storybrand.html` if multiple versions exist).
Rules:
- **Plain HTML only** — no classes, no divs, no wrappers. Just `<h2>`, `<h3>`, `<p>`, `<ul>/<li>`, and `<strong>` tags.
- **Omit the H1** — WordPress sets the page title separately. Do not include an `<h1>` tag in the HTML.
- **Keep the meta comment block** at the top (META TITLE, META DESCRIPTION, URL SLUG).
- **Keep the FOQ comment markers** (`<!-- FOQ SECTION START -->` / `<!-- FOQ SECTION END -->`) so the user can identify that section for special styling.
- The user pastes this HTML into WordPress Gutenberg's Code Editor view, where it maps directly to blocks.
---
## Phase 3: Quick Test Block
A standalone workflow for testing whether adding entities, keywords, and headings moves rankings before investing in full content optimization. The output is a minimal text block placed in a hidden div on the page for A/B testing.
**Key principle:** The LLM handles all intelligence — filtering entities for topical relevance, writing headings, creating body templates. Python scripts handle all math — slot filling, density tracking, stop conditions, validation. There are NO per-entity mention targets — only aggregate density percentages and distinct entity counts.
### When to Use
User says "test block," "hidden div," "quick test," "test the entities," or similar. This is NOT part of Phase 2 — it is an independent workflow. Requirements: a Cora report and existing content (URL or file).
### Step T1 — Load Inputs
- Pull existing content via BS4 scraper if a URL is provided, or read from file if a path is given.
- Save existing content to `{cwd}/working/existing_content.md` if fetched from URL.
```bash
cd {skill_dir}/scripts && uv run --with requests,beautifulsoup4 python competitor_scraper.py "{url}" --output-dir {cwd}/working/
```
Then rename the output file to `{cwd}/working/existing_content.md`.
### Step T2 — Run Prep Script (Programmatic)
Run `test_block_prep.py` to extract all deficit data:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python test_block_prep.py "{content_path}" "{cora_xlsx_path}" --format json -o {cwd}/working/prep_data.json
```
This outputs structured JSON with:
- Word count vs target + deficit
- Distinct entity count vs target + deficit + list of missing 0-count entities
- Variation density % vs target (Cora row 46)
- Entity density % vs target (Cora row 47)
- LSI density % vs target (Cora row 48)
- Heading structure deficits (H2, H3 counts; entities/variations in headings)
- **Template instructions**: how many templates to generate, how many slots per template, target word count
Review the prep output. All numbers come from deterministic script analysis — no estimation.
### Step T3 — Filter Entities for Topical Relevance (LLM Step)
Read the `missing_entities` list from `{cwd}/working/prep_data.json`. This list contains ALL entities with 0 mentions on the existing page, sorted by Cora relevance score. **Many of these will be noise** — navigation terms, competitor names, unrelated concepts that happen to appear on ranking pages.
Review every entity and keep ONLY those that are topically relevant to the page's subject matter. Ask: "Would a subject matter expert writing about [page topic] naturally mention this term?"
**Remove:**
- Competitor company names and brands
- People (athletes, historical figures, etc.)
- Web furniture (blog, menu, privacy, FAQ, social media platforms)
- Geographic entities unrelated to the topic
- Software, media, organisms, and other off-topic typed entities
- Generic terms that only appear due to page chrome (calculator, glossary, children, etc.)
**Keep:**
- Terms directly related to the product/service/topic
- Materials, processes, components, and industry terms
- Related applications and industries where the product is used
- Technical specifications and engineering concepts
Save the filtered entity names to `{cwd}/working/filtered_entities.txt`, one entity per line, ordered from most to least relevant.
### Step T4 — Generate Headings and Body Templates (LLM Creative Step)
This step has two parts. Read the prep JSON for the numbers you need:
- `headings.h2.deficit`: how many H2 headings to generate
- `headings.h3.deficit`: how many H3 headings to generate
- `headings.entities_in_headings.deficit`: how many entity mentions needed across all headings
- `template_instructions.num_templates`: how many body templates to create
- `template_instructions.slots_per_sentence`: how many `{N}` slots per body template
- `template_instructions.avg_words_per_template`: target words per template (~15)
**Part 1 — Write headings:**
Using the filtered entity list from T3 and your understanding of the page topic, write topically relevant H2 and H3 headings. These are final text — NOT templates, no `{N}` slots. The headings should:
- Read like real section headings a subject matter expert would write
- Naturally incorporate entities from the filtered list (aim to hit the entities_in_headings deficit)
- Be relevant to the page's topic and the types of content that would appear under them
**Part 2 — Write body templates:**
Generate body sentence templates with numbered placeholder slots. Follow the numbers from `template_instructions`:
- Create `num_templates` templates
- Each template gets `slots_per_sentence` numbered slots: `{1}`, `{2}`, `{3}`, etc. Slots MUST be numbered — the generator regex matches `{1}`, `{2}`, NOT `{N}`.
- Templates must be topically relevant to the page's subject matter
- Templates should be grammatically coherent but brevity wins over polish
- Do NOT try to specify which entities go in which slot — the generator script handles that
Save everything to `{cwd}/working/templates.txt`, one per line. Headings are prefixed with `H2:` or `H3:`, body templates are plain text with `{N}` slots.
Example for an expansion joints page:
```
H2: Bellows Expansion Joints for Industrial Piping Systems
H2: Metal and Rubber Expansion Joint Applications in Water Treatment
H3: Gasket and Flange Connections for Expansion Joints
{1} and {2} are critical components used to absorb thermal movement and reduce stress in piping systems.
{1} provide reliable performance in demanding {2} environments where thermal cycling is constant.
```
### Step T5 — Run Generator Script (Programmatic)
Run `test_block_generator.py` to fill body template slots and assemble the test block. The script requires the LLM-curated entity list from T3:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python test_block_generator.py {cwd}/working/templates.txt {cwd}/working/prep_data.json "{cora_xlsx_path}" --entities-file {cwd}/working/filtered_entities.txt --output-dir {cwd}/working/ --min-sentences 5
```
The script:
1. Loads the LLM-curated entity list — uses ONLY these entities for slot filling (no script-level filtering)
2. Builds a term queue: filtered entities first, then keyword variations
3. Inserts pre-written headings as-is (no slot filling on heading lines)
4. Fills body template slots, rotating through the term queue (no duplicates within a sentence)
5. Tracks projected densities: (baseline_mentions + new_mentions) / (baseline_words + new_words)
6. Stops when: all density targets met, distinct entity deficit closed, word count deficit closed, AND minimum sentence count reached
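The projected-density bookkeeping in steps 5-6 can be sketched as follows. The real logic lives in `test_block_generator.py`; the function name and every number below are illustrative:

```python
# Hedged sketch of the generator's projected-density stop condition.
def projected(baseline_mentions, baseline_words, new_mentions, new_words):
    """Density the page would have after appending the test block."""
    return 100.0 * (baseline_mentions + new_mentions) / (baseline_words + new_words)

targets = {"entity": 2.5, "variation": 1.2}          # from the Cora report
baseline = {"words": 900, "entity": 15, "variation": 9}  # from prep_data.json

new_words, new_entity, new_variation = 250, 14, 5    # added by the test block
densities = {
    "entity": projected(baseline["entity"], baseline["words"], new_entity, new_words),
    "variation": projected(baseline["variation"], baseline["words"], new_variation, new_words),
}
all_met = all(densities[k] >= targets[k] for k in targets)
print(all_met)
```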
Output files:
- `{cwd}/working/test_block.md` — Markdown version
- `{cwd}/working/test_block.html` — Plain HTML version
- `{cwd}/working/test_block_stats.json` — Generation stats (mentions added, entities introduced, projected densities)
### Step T6 — Rewrite Body Sentences for Readability (LLM Step — use Haiku)
The generator produces grammatically rough sentences because entities get slotted into positions where they don't naturally fit. This step rewrites each body sentence to read naturally while preserving entity strings exactly.
**Use Haiku for this step** — it's fast and cheap enough to handle sentence-by-sentence rewrites.
Read `{cwd}/working/test_block.md`. For each body sentence (NOT headings — leave all H2/H3 lines exactly as they are):
1. Identify which entity terms from `{cwd}/working/filtered_entities.txt` appear in the sentence
2. Rewrite the sentence so it is grammatically correct and reads naturally
3. **Preserve every entity string exactly** — same spelling, same case. Do not paraphrase, hyphenate, abbreviate, or pluralize entity terms. "stainless steel" must remain "stainless steel", not "stainless-steel" or "SS".
4. Keep the sentence under 20 words
5. The rewrite should be topically relevant to the page subject
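The preservation rule in point 3 is easy to spot-check deterministically. A hedged helper (the function and sample sentences are illustrative, not part of the skill's scripts):

```python
# Hedged sketch: verify a rewrite kept every entity string byte-for-byte.
def entities_preserved(original: str, rewritten: str, entities: list[str]) -> bool:
    """Every entity present in the original must appear verbatim (case-sensitive) in the rewrite."""
    return all(e in rewritten for e in entities if e in original)

before = "stainless steel and bar stock are used in swiss machining."
good = "Swiss machining relies on bar stock made of stainless steel."
bad = "Swiss machining relies on bar-stock made of stainless-steel."

print(entities_preserved(before, good, ["stainless steel", "bar stock"]))
print(entities_preserved(before, bad, ["stainless steel", "bar stock"]))
```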
Reassemble the test block with:
- Same `<!-- HIDDEN TEST BLOCK START -->` / `<!-- HIDDEN TEST BLOCK END -->` markers
- Same headings in the same positions
- Rewritten body sentences grouped into paragraphs (4 sentences per paragraph)
Overwrite both files:
- `{cwd}/working/test_block.md` (markdown format)
- `{cwd}/working/test_block.html` (HTML format with `<h2>`, `<p>` tags)
### Step T7 — Run Validation Script (Programmatic)
Run `test_block_validate.py` for a deterministic before/after comparison:
```bash
cd {skill_dir}/scripts && uv run --with openpyxl python test_block_validate.py "{content_path}" {cwd}/working/test_block.md "{cora_xlsx_path}" --format json -o {cwd}/working/validation_report.json
```
This produces a report showing every metric before and after, with targets and status:
- Word count, distinct entities, entity density %, variation density %, LSI density %
- Heading counts (H2, H3), entities/variations in headings
- List of all new 0->1 entities introduced
- All numbers are from the same counting code — no mixing of data sources
Present the validation report to the user. Flag any metric that dropped below target after the test block was added.
---
## Optimization Rules
These override any data from the Cora report:
| Rule | Detail |
|------|--------|
| H1 count | Exactly 1, always |
| H2, H3 | Primary optimization targets — focus entity/variation additions here |
| H4 | Low priority — only add if most competitors have them |
| H5, H6 | Ignore completely |
| Word count | Target the nearest competitive cluster, not the raw average. Up to ~1,500 words is always acceptable even if the target is lower. |
| Exact match density | 2% minimum, no upper limit |
| Keyword stuffing | Do NOT flag or warn about keyword stuffing |
| Variations include exact match | Optimizing variation density inherently covers exact match |
| Density is interdependent | Adding content changes ALL density calculations — re-check after big changes |
| Optimization passes | 1-2 passes is typically sufficient |
| Competitor names | NEVER use competitor company names as entities or LSI keywords. Do not mention competitors by name in content. |
| Measurement entities | Ignore measurements (dimensions, tolerances, etc.) as entities — skip these in entity optimization |
| Organization entities | Organizations like ISO, ANSI, ASTM are fine — keep these as entities |
---
## Scripts Reference
All scripts are in `{skill_dir}/scripts/`. Run them with `uv run --with openpyxl python` (or `--with requests,beautifulsoup4` for the scraper).
### cora_parser.py
Foundation module. Reads a Cora XLSX and extracts structured data.
```
uv run --with openpyxl python cora_parser.py <xlsx_path> [--sheet SHEET] [--format json|text]
```
Sheets: `summary`, `entities`, `lsi`, `variations`, `structure`, `densities`, `targets`, `wordcount`, `results`, `tunings`, `all`
### entity_optimizer.py
Counts entities in a draft against Cora targets, recommends additions sorted by (relevance x deficit).
```
uv run --with openpyxl python entity_optimizer.py <draft_path> <cora_xlsx_path> [--format json|text] [--top-n 30]
```
### lsi_optimizer.py
Counts LSI keywords in a draft against Cora targets, recommends additions sorted by (|correlation| x deficit).
```
uv run --with openpyxl python lsi_optimizer.py <draft_path> <cora_xlsx_path> [--format json|text] [--min-correlation 0.2] [--top-n 50]
```
### seo_optimizer.py
Keyword density, structure, and readability checks. Optional Cora integration.
```
uv run --with openpyxl python seo_optimizer.py <draft_path> [--keyword <kw>] [--cora-xlsx <path>] [--format json|text]
```
### competitor_scraper.py
Utility for bulk-fetching URLs when the user provides a list.
```
uv run --with requests,beautifulsoup4 python competitor_scraper.py <url1> <url2> ... [--output-dir ./working/competitor_content/]
```
### test_block_prep.py
Extracts all deficit data from existing content + Cora XLSX. Outputs structured JSON with word count, entity/variation/LSI density deficits, heading deficits, missing entities list, and calculated template instructions (num_templates, slots_per_sentence).
```
uv run --with openpyxl python test_block_prep.py <content_path> <cora_xlsx_path> [--format json|text] [-o PATH]
```
### test_block_generator.py
Fills body template slots with entities from an LLM-curated entity list. Inserts pre-written headings as-is (no slot filling). Tracks aggregate densities in real-time, stops when all targets are met. Outputs test_block.md, test_block.html, and test_block_stats.json.
```
uv run --with openpyxl python test_block_generator.py <templates_path> <prep_json_path> <cora_xlsx_path> --entities-file <path> [--output-dir DIR] [--min-sentences N]
```
### test_block_validate.py
Deterministic before/after comparison. Runs the same counting logic on existing content alone vs existing content + test block. Shows every metric with before, after, target, and status.
```
uv run --with openpyxl python test_block_validate.py <content_path> <test_block_path> <cora_xlsx_path> [--format json|text] [-o PATH]
```
---
## Reference Files
- `references/content_frameworks.md` — Article templates (how-to, listicle, comparison, case study, thought leadership), persuasion frameworks (AIDA, PAS), introduction and conclusion patterns.
- `references/brand_guidelines.md` — Voice archetypes, writing principles, tone spectrums, language preferences, pre-publication checklist.
---
## Working Directory
**CRITICAL: All output files MUST be written to `{cwd}/working/` — the `working/` subfolder inside the user's current project directory (where Claude Code was launched). NEVER write files to the skill directory, scripts directory, or any location outside the project folder. When running scripts, always use absolute paths for output flags (`-o`, `--output-dir`) pointing to `{cwd}/working/`.**
All intermediate files go in `{cwd}/working/` (the user's project directory):
- `working/research_summary.md` — Research output from Step 2
- `working/outline.md` — Outline from Step 3
- `working/draft.md` — Content draft (updated in place during optimization)
- `working/competitor_content/` — Scraped competitor text files (if URLs were fetched)
- `working/existing_content.md` — BS4-scraped existing page content (Phase 3)
- `working/prep_data.json` — Deficit analysis output from test_block_prep.py (Phase 3)
- `working/filtered_entities.txt` — LLM-curated entity list, one per line (Phase 3, Step T3)
- `working/templates.txt` — Pre-written headings + body templates with numbered slots (Phase 3, Step T4)
- `working/test_block.md` — Quick test block in markdown (Phase 3)
- `working/test_block.html` — Quick test block in plain HTML (Phase 3)
- `working/test_block_stats.json` — Generation stats: mentions added, entities introduced, projected densities (Phase 3)
- `working/validation_report.json` — Before/after comparison from test_block_validate.py (Phase 3)


@@ -56,6 +56,7 @@ class Scheduler:
self._clickup_thread: threading.Thread | None = None
self._folder_watch_thread: threading.Thread | None = None
self._autocora_thread: threading.Thread | None = None
self._content_watch_thread: threading.Thread | None = None
self._force_autocora = threading.Event()
self._clickup_client = None
self._field_filter_cache: dict | None = None
@@ -110,6 +111,21 @@ class Scheduler:
else:
log.info("AutoCora polling disabled")
# Start content folder watcher if configured
content_inbox = self.config.content.cora_inbox
if content_inbox:
self._content_watch_thread = threading.Thread(
target=self._content_watch_loop, daemon=True, name="content-watch"
)
self._content_watch_thread.start()
log.info(
"Content folder watcher started (folder=%s, interval=%dm)",
content_inbox,
self.config.link_building.watch_interval_minutes,
)
else:
log.info("Content folder watcher disabled (no cora_inbox configured)")
log.info(
"Scheduler started (poll=%ds, heartbeat=%dm)",
self.config.scheduler.poll_interval_seconds,
@@ -160,6 +176,7 @@ class Scheduler:
"clickup": self.db.kv_get("system:loop:clickup:last_run"),
"folder_watch": self.db.kv_get("system:loop:folder_watch:last_run"),
"autocora": self.db.kv_get("system:loop:autocora:last_run"),
"content_watch": self.db.kv_get("system:loop:content_watch:last_run"),
}
# ── Scheduled Tasks ──
@@ -894,3 +911,207 @@ class Scheduler:
return task
return None
# ── Content Folder Watcher ──
def _content_watch_loop(self):
"""Poll the content Cora inbox for new .xlsx files on a regular interval."""
interval = self.config.link_building.watch_interval_minutes * 60
# Wait before first scan to let other systems initialize
self._stop_event.wait(60)
while not self._stop_event.is_set():
try:
self._scan_content_folder()
self.db.kv_set(
"system:loop:content_watch:last_run", datetime.now(UTC).isoformat()
)
except Exception as e:
log.error("Content folder watcher error: %s", e)
self._interruptible_wait(interval)
def _scan_content_folder(self):
"""Scan the content Cora inbox for new .xlsx files and match to ClickUp tasks."""
inbox = Path(self.config.content.cora_inbox)
if not inbox.exists():
log.warning("Content Cora inbox does not exist: %s", inbox)
return
xlsx_files = sorted(inbox.glob("*.xlsx"))
if not xlsx_files:
log.debug("No .xlsx files in content Cora inbox")
return
for xlsx_path in xlsx_files:
filename = xlsx_path.name
# Skip Office temp/lock files
if filename.startswith("~$"):
continue
kv_key = f"content:watched:{filename}"
# Skip completed/failed; retry processing/blocked/unmatched
existing = self.db.kv_get(kv_key)
if existing:
try:
state = json.loads(existing)
if state.get("status") in ("completed", "failed"):
continue
if state.get("status") in ("processing", "blocked", "unmatched"):
log.info("Retrying '%s' state for %s", state["status"], filename)
self.db.kv_delete(kv_key)
except json.JSONDecodeError:
continue
log.info("Content watcher: new .xlsx found: %s", filename)
self._process_content_file(xlsx_path, kv_key)
def _process_content_file(self, xlsx_path: Path, kv_key: str):
"""Match a content Cora .xlsx to a ClickUp task and run create_content."""
filename = xlsx_path.name
stem = xlsx_path.stem.lower().replace("-", " ").replace("_", " ")
stem = re.sub(r"\s+", " ", stem).strip()
# Mark as processing
self.db.kv_set(
kv_key,
json.dumps({"status": "processing", "started_at": datetime.now(UTC).isoformat()}),
)
# Try to find matching ClickUp task
matched_task = None
if self.config.clickup.enabled:
matched_task = self._match_xlsx_to_content_task(stem)
if not matched_task:
log.warning("No ClickUp content task match for '%s' — skipping", filename)
self.db.kv_set(
kv_key,
json.dumps(
{
"status": "unmatched",
"filename": filename,
"stem": stem,
"checked_at": datetime.now(UTC).isoformat(),
}
),
)
self._notify(
f"Content watcher: no ClickUp match for **{filename}**.\n"
f"Create a Content Creation or On Page Optimization task with Keyword "
f"matching '{stem}' to enable auto-processing.",
category="content",
)
return
task_id = matched_task.id
log.info("Matched '%s' to ClickUp task %s (%s)", filename, task_id, matched_task.name)
self._notify(
f"Content watcher: matched **{filename}** to ClickUp task "
f"**{matched_task.name}**.\nStarting content creation pipeline...",
category="content",
)
# Extract fields from the matched task
keyword = matched_task.custom_fields.get("Keyword", "") or matched_task.name
url = matched_task.custom_fields.get("IMSURL", "") or ""
cli_flags = matched_task.custom_fields.get("CLIFlags", "") or ""
args = {
"keyword": str(keyword),
"url": str(url),
"cli_flags": str(cli_flags),
"clickup_task_id": task_id,
}
try:
if hasattr(self.agent, "_tools") and self.agent._tools:
result = self.agent._tools.execute("create_content", args)
else:
result = "Error: tool registry not available"
if result.startswith("Error:"):
self.db.kv_set(
kv_key,
json.dumps(
{
"status": "failed",
"filename": filename,
"task_id": task_id,
"error": result[:500],
"failed_at": datetime.now(UTC).isoformat(),
}
),
)
self._notify(
f"Content watcher: pipeline **failed** for **{filename}**.\n"
f"Error: {result[:200]}",
category="content",
)
else:
self.db.kv_set(
kv_key,
json.dumps(
{
"status": "completed",
"filename": filename,
"task_id": task_id,
"completed_at": datetime.now(UTC).isoformat(),
}
),
)
self._notify(
f"Content watcher: pipeline **completed** for **{filename}**.\n"
f"ClickUp task: {matched_task.name}",
category="content",
)
except Exception as e:
log.error("Content watcher pipeline error for %s: %s", filename, e)
self.db.kv_set(
kv_key,
json.dumps(
{
"status": "failed",
"filename": filename,
"task_id": task_id,
"error": str(e)[:500],
"failed_at": datetime.now(UTC).isoformat(),
}
),
)
def _match_xlsx_to_content_task(self, normalized_stem: str):
"""Find a ClickUp content task whose Keyword matches the file stem.
Matches tasks with Work Category in ("Content Creation", "On Page Optimization").
Returns the matched ClickUpTask or None.
"""
from .tools.linkbuilding import _fuzzy_keyword_match, _normalize_for_match
client = self._get_clickup_client()
space_id = self.config.clickup.space_id
if not space_id:
return None
try:
tasks = client.get_tasks_from_overall_lists(space_id)
except Exception as e:
log.warning("ClickUp query failed in _match_xlsx_to_content_task: %s", e)
return None
content_types = ("Content Creation", "On Page Optimization")
for task in tasks:
if task.task_type not in content_types:
continue
keyword = task.custom_fields.get("Keyword", "")
if not keyword:
continue
keyword_norm = _normalize_for_match(str(keyword))
if _fuzzy_keyword_match(normalized_stem, keyword_norm):
return task
return None
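The helpers `_normalize_for_match` and `_fuzzy_keyword_match` are imported from `tools.linkbuilding` and not shown in this diff. One plausible sketch of what they do, assuming token-based matching (the names below drop the leading underscore to flag that this is an illustration, not the real implementation):

```python
import re

def normalize_for_match(text: str) -> str:
    # Same normalization the watcher applies to the file stem:
    # lowercase, separators to spaces, collapsed whitespace.
    text = text.lower().replace("-", " ").replace("_", " ")
    return re.sub(r"\s+", " ", text).strip()

def fuzzy_keyword_match(stem: str, keyword: str) -> bool:
    # Hypothetical scheme: every keyword token must appear in the stem,
    # so a file like "cnc-swiss_screw-machining-cora.xlsx" still matches
    # the keyword "CNC Swiss Screw Machining" despite extra tokens.
    kw_tokens = set(keyword.split())
    return bool(kw_tokens) and kw_tokens <= set(stem.split())

stem = normalize_for_match("CNC-Swiss_Screw-Machining-Cora")
print(fuzzy_keyword_match(stem, normalize_for_match("CNC Swiss Screw Machining")))  # True
```

Whatever the actual matching rule, it only has to be tolerant of filename noise (dates, "Cora", client prefixes) around the keyword, since the stem is normalized the same way on both sides.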

View File

@@ -225,22 +225,65 @@ def _build_phase1_prompt(
content_type: str,
cora_path: str,
capabilities_default: str,
is_service_page: bool = False,
) -> str:
"""Build the Phase 1 prompt that triggers the content-researcher skill."""
"""Build the Phase 1 prompt that triggers the content-researcher skill.
Branches on whether a URL is present:
- URL present → optimization path (scrape existing page, match style)
- No URL → new content path (research competitors, write net-new)
"""
if url:
# ── Optimization path ──
parts = [
f"Research, outline, and draft an optimized {content_type} for {url} "
f"targeting keyword '{keyword}'. This is an SEO content optimization project.",
f"Optimize the existing page at {url} targeting keyword '{keyword}'. "
f"This is an on-page optimization project.",
"\n**Step 1 — Scrape the existing page.**\n"
"Use the BS4 scraper (scripts/competitor_scraper.py) to fetch the "
"current page content — do NOT use web_fetch for this. Analyze its "
"style, tone, heading structure, and content organization.",
"\n**Step 2 — Build an optimization outline.**\n"
"Plan two deliverables:\n"
"1. **Optimized page rewrite** — match the original style/tone/structure "
"while weaving in entity and keyword targets from the Cora report.\n"
"2. **Hidden entity test block** — a `<div style=\"display:none\">` block "
"containing entity terms that didn't fit naturally into the content.",
]
else:
# ── New content path ──
parts = [
f"Research and outline new {content_type} targeting keyword '{keyword}'. "
f"This is a new content creation project.",
"\n**Step 1 — Competitor research.**\n"
"Scrape the top-ranking pages for this keyword using "
"scripts/competitor_scraper.py. Analyze their structure, depth, "
"and content coverage.",
"\n**Step 2 — Build an outline.**\n"
"Plan the content structure with entities woven naturally into "
"the headings and body. No hidden entity div needed for new content.",
]
if cora_path:
parts.append(
f"\nA Cora SEO report is available at: {cora_path}\n"
f"Read this report to extract keyword targets, entity requirements, "
f"and competitive analysis data."
)
if capabilities_default:
if is_service_page:
cap_note = (
f'\nThis is a **service page**. Use the following as the company '
f'capabilities answer: "{capabilities_default}"\n'
f"Do NOT ask the user about capabilities — you are running autonomously. "
f"Avoid making specific claims about services, certifications, or "
f"licenses not already present on the existing page."
)
parts.append(cap_note)
elif capabilities_default:
parts.append(
f'\nWhen asked about company capabilities, respond with: "{capabilities_default}"'
)
parts.append(
"\nDeliver the outline as a complete markdown document with sections, "
"headings, entity targets, and keyword placement notes."
@@ -253,19 +296,53 @@ def _build_phase2_prompt(
keyword: str,
outline_text: str,
cora_path: str,
is_service_page: bool = False,
capabilities_default: str = "",
) -> str:
"""Build the Phase 2 prompt for writing full content from an approved outline."""
"""Build the Phase 2 prompt for writing full content from an approved outline.
Branches on whether a URL is present:
- URL present → write optimized page rewrite + hidden entity div
- No URL → write full new page content
"""
if url:
# ── Optimization path ──
parts = [
f"Write full SEO-optimized content based on this approved outline for {url} "
f"targeting '{keyword}'. This is the content writing phase of a "
f"content optimization project.",
f"Write the final optimized content for {url} targeting '{keyword}'. "
f"This is the writing phase of an on-page optimization project.",
f"\n## Approved Outline\n\n{outline_text}",
"\n**Deliverables:**\n"
"1. **Optimized page rewrite** — match the original page's style, tone, "
"and structure. Weave in all entity and keyword targets from the outline.\n"
"2. **Hidden entity test block** — generate a "
"`<div style=\"display:none\">` block containing entity terms that "
"didn't fit naturally into the body content. Use the entity test block "
"generator (Phase 3 of the content-researcher skill).",
]
else:
# ── New content path ──
parts = [
f"Write full new content targeting '{keyword}'. "
f"This is the writing phase of a new content creation project.",
f"\n## Approved Outline\n\n{outline_text}",
"\nWrite publication-ready content following the outline structure. "
"Weave entities naturally into the content — no hidden entity div "
"needed for new content.",
]
if cora_path:
parts.append(
f"\nThe Cora SEO report is at: {cora_path}\n"
f"Use it for keyword density targets and entity optimization."
)
if is_service_page:
parts.append(
f'\nThis is a **service page**. Company capabilities: "{capabilities_default}"\n'
f"Do NOT make specific claims about services, certifications, or "
f"licenses not found on the existing page."
)
parts.append(
"\nWrite publication-ready content following the outline structure. "
"Include all entity targets and keyword placements as noted in the outline."
@@ -281,27 +358,37 @@ def _build_phase2_prompt(
@tool(
"create_content",
"Two-phase SEO content creation: Phase 1 researches + outlines, Phase 2 writes "
"full content from the approved outline. Auto-detects phase from kv_store state.",
"full content from the approved outline. Auto-detects phase from kv_store state. "
"Auto-detects content type from URL presence if not specified.",
category="content",
)
def create_content(
url: str,
keyword: str,
content_type: str = "service page",
url: str = "",
content_type: str = "",
cli_flags: str = "",
ctx: dict | None = None,
) -> str:
"""Create SEO content in two phases with human review between them.
Args:
url: Target page URL (e.g. "https://example.com/services/plumbing").
keyword: Primary target keyword (e.g. "plumbing services").
content_type: Type of content "service page", "blog post", etc.
url: Target page URL. If provided → on-page optimization; if empty → new content.
content_type: Type of content. Auto-detected from URL if empty.
cli_flags: Optional flags (e.g. "service" for service page hint).
"""
if not url or not keyword:
return "Error: Both 'url' and 'keyword' are required."
if not keyword:
return "Error: 'keyword' is required."
if not ctx or "agent" not in ctx:
return "Error: Tool context with agent is required."
# Auto-detect content_type from URL presence when not explicitly set
if not content_type:
content_type = "on page optimization" if url else "new content"
# Service page hint from cli_flags
is_service_page = bool(cli_flags and "service" in cli_flags.lower())
agent = ctx["agent"]
config = ctx.get("config")
db = ctx.get("db")
@@ -342,6 +429,7 @@ def create_content(
content_type=content_type,
cora_path=cora_path,
capabilities_default=capabilities_default,
is_service_page=is_service_page,
)
else:
return _run_phase2(
@@ -355,6 +443,8 @@ def create_content(
keyword=keyword,
cora_path=cora_path,
existing_state=existing_state,
is_service_page=is_service_page,
capabilities_default=capabilities_default,
)
@@ -376,6 +466,7 @@ def _run_phase1(
content_type: str,
cora_path: str,
capabilities_default: str,
is_service_page: bool = False,
) -> str:
now = datetime.now(UTC).isoformat()
@@ -383,9 +474,11 @@ def _run_phase1(
if task_id:
_sync_clickup_start(ctx, task_id)
prompt = _build_phase1_prompt(url, keyword, content_type, cora_path, capabilities_default)
prompt = _build_phase1_prompt(
url, keyword, content_type, cora_path, capabilities_default, is_service_page
)
log.info("Phase 1 — researching + outlining for '%s' (%s)", keyword, url)
log.info("Phase 1 — researching + outlining for '%s' (%s)", keyword, url or "new content")
try:
result = agent.execute_task(
prompt,
@@ -430,10 +523,11 @@ def _run_phase1(
if task_id:
_sync_clickup_outline_ready(ctx, task_id, outline_path)
url_line = f"**URL:** {url}\n" if url else "**Type:** New content\n"
return (
f"## Phase 1 Complete — Outline Ready for Review\n\n"
f"**Keyword:** {keyword}\n"
f"**URL:** {url}\n"
f"{url_line}"
f"**Outline saved to:** `{outline_path}`\n\n"
f"Please review and edit the outline. When ready, move the ClickUp task "
f"to **outline approved** to trigger Phase 2 (full content writing).\n\n"
@@ -459,6 +553,8 @@ def _run_phase2(
keyword: str,
cora_path: str,
existing_state: dict,
is_service_page: bool = False,
capabilities_default: str = "",
) -> str:
# Read the (possibly edited) outline
outline_path = existing_state.get("outline_path", "")
@@ -483,7 +579,9 @@ def _run_phase2(
if task_id:
_sync_clickup_start(ctx, task_id)
prompt = _build_phase2_prompt(url, keyword, outline_text, cora_path)
prompt = _build_phase2_prompt(
url, keyword, outline_text, cora_path, is_service_page, capabilities_default
)
log.info("Phase 2 — writing full content for '%s' (%s)", keyword, url)
try:
@@ -524,10 +622,11 @@ def _run_phase2(
if task_id:
_sync_clickup_complete(ctx, task_id, content_path)
url_line = f"**URL:** {url}\n" if url else "**Type:** New content\n"
return (
f"## Phase 2 Complete — Content Written\n\n"
f"**Keyword:** {keyword}\n"
f"**URL:** {url}\n"
f"{url_line}"
f"**Content saved to:** `{content_path}`\n\n"
f"---\n\n{result}\n\n"
f"## ClickUp Sync\nPhase 2 complete. Status: internal review."

View File

@@ -60,16 +60,18 @@ clickup:
branded_url: "SocialURL"
"On Page Optimization":
tool: "create_content"
auto_execute: true
auto_execute: false
field_mapping:
url: "IMSURL"
keyword: "Keyword"
cli_flags: "CLIFlags"
"Content Creation":
tool: "create_content"
auto_execute: true
auto_execute: false
field_mapping:
url: "IMSURL"
keyword: "Keyword"
cli_flags: "CLIFlags"
"Link Building":
tool: "run_link_building"
auto_execute: false

View File

@@ -49,9 +49,8 @@ class TestBuildPhase1Prompt:
"",
"",
)
assert "SEO content optimization" in prompt
assert "on-page optimization" in prompt
assert "plumbing services" in prompt
assert "service page" in prompt
assert "https://example.com/plumbing" in prompt
def test_includes_cora_path(self):
@@ -104,7 +103,7 @@ class TestBuildPhase2Prompt:
"",
)
assert outline in prompt
assert "content writing phase" in prompt
assert "writing phase" in prompt
assert "plumbing" in prompt
def test_includes_cora_path(self):
@@ -229,13 +228,12 @@ class TestCreateContentPhase1:
"clickup_task_id": "task123",
}
def test_requires_url_and_keyword(self, tmp_db):
def test_requires_keyword(self, tmp_db):
ctx = {"agent": MagicMock(), "config": Config(), "db": tmp_db}
assert create_content(url="", keyword="test", ctx=ctx).startswith("Error:")
assert create_content(url="http://x", keyword="", ctx=ctx).startswith("Error:")
assert create_content(keyword="", ctx=ctx).startswith("Error:")
def test_requires_context(self):
assert create_content(url="http://x", keyword="kw", ctx=None).startswith("Error:")
assert create_content(keyword="kw", url="http://x", ctx=None).startswith("Error:")
def test_phase1_runs_without_prior_state(self, tmp_db, tmp_path):
ctx = self._make_ctx(tmp_db, tmp_path)