Story 2.2: Configurable Content Rule Engine

Overview

Implementation of a CORA-compliant content validation engine that ensures AI-generated HTML meets both universal quality standards and project-specific CORA targets.

Status

COMPLETED

Implementation Details

1. Database Changes

  • Added tier field to Project model (default=1, indexed)
  • Created migration script: scripts/add_tier_to_projects.py
  • Tier 1 = strictest validation (default)
  • Tier 2+ = warnings only for CORA target misses

2. Configuration Updates

File: master.config.json

Restructured content_rules with two validation levels:

Universal Rules (apply to all tiers, hard failures):

  • min_content_length: 1000 words minimum
  • max_content_length: 5000 words maximum
  • title_exact_match_required: Title must contain main keyword
  • h1_exact_match_required: H1 must contain main keyword
  • h2_exact_match_min: At least 1 H2 with main keyword
  • h3_exact_match_min: At least 1 H3 with main keyword
  • faq_section_required: Must include FAQ section
  • faq_question_restatement_required: Each FAQ answer must restate its question
  • image_alt_text_keyword_required: Alt text must contain keyword
  • image_alt_text_entity_required: Alt text must contain entities

CORA Validation Config:

  • enabled: Toggle CORA validation on/off
  • tier_1_strict: Tier 1 fails on CORA target misses
  • tier_2_plus_warn_only: Tier 2+ only warns
  • round_averages_down: Round CORA averages down (e.g., 5.6 → 5)
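
As a rough illustration of the two-level structure, the restructured block might look like the sketch below. The key names inside each group come from the lists above, and the thresholds are the documented ones. The group name "universal" follows the dot-notation example in section 4, while "cora_validation" and the boolean values are assumptions, not a copy of the real master.config.json.

```python
import json

# Hypothetical shape of the "content_rules" block in master.config.json.
# Group name "cora_validation" and all boolean values are assumptions.
content_rules = {
    "universal": {
        "min_content_length": 1000,
        "max_content_length": 5000,
        "title_exact_match_required": True,
        "h1_exact_match_required": True,
        "h2_exact_match_min": 1,
        "h3_exact_match_min": 1,
        "faq_section_required": True,
        "faq_question_restatement_required": True,
        "image_alt_text_keyword_required": True,
        "image_alt_text_entity_required": True,
    },
    "cora_validation": {
        "enabled": True,
        "tier_1_strict": True,
        "tier_2_plus_warn_only": True,
        "round_averages_down": True,
    },
}

# Print the block as it would appear in a JSON file.
print(json.dumps({"content_rules": content_rules}, indent=2))
```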

3. Core Rule Engine

File: src/generation/rule_engine.py

Classes:

  • ValidationIssue: Single validation error or warning
  • ValidationResult: Complete validation result with errors/warnings
  • ContentHTMLParser: Extracts structure from HTML (H1/H2/H3/images/links/text)
  • ContentRuleEngine: Main validation engine
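
To show how the two result classes could fit together, here is a minimal sketch built only from what the usage example relies on (passed, errors, warnings, message, to_dict()). The field names rule and severity and the add() helper are assumptions for illustration; the real classes in rule_engine.py may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ValidationIssue:
    """A single validation error or warning."""
    rule: str                 # assumed field: name of the rule that fired
    message: str
    severity: str = "error"   # assumed values: "error" or "warning"


@dataclass
class ValidationResult:
    """Collects issues; content passes when there are no errors."""
    errors: List[ValidationIssue] = field(default_factory=list)
    warnings: List[ValidationIssue] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Warnings alone never fail the content.
        return not self.errors

    def add(self, issue: ValidationIssue) -> None:
        # Route the issue to the matching list based on its severity.
        target = self.errors if issue.severity == "error" else self.warnings
        target.append(issue)

    def to_dict(self) -> dict:
        return {
            "passed": self.passed,
            "errors": [asdict(e) for e in self.errors],
            "warnings": [asdict(w) for w in self.warnings],
        }


result = ValidationResult()
result.add(ValidationIssue("h2_total", "Found 4 H2s, CORA target is 5", "warning"))
print(result.passed)  # a warning does not fail the result
```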

Key Features:

  • HTML parsing and element extraction
  • Keyword/entity counting with word boundary matching
  • Universal rule validation (hard failures)
  • CORA target validation (tier-aware)
  • FAQ section detection
  • Image alt text validation
  • Detailed error/warning reporting
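
The element-extraction step can be sketched with the standard library's html.parser. This is not the ContentHTMLParser from the codebase; the class name HeadingAndImageParser and its attributes are hypothetical, and it collects only headings and image alt text.

```python
from html.parser import HTMLParser


class HeadingAndImageParser(HTMLParser):
    """Collects H1/H2/H3 text and image alt text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.image_alts = []
        self._current = None   # heading tag currently open, if any
        self._buffer = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._current = tag
            self._buffer = []
        elif tag == "img":
            # A missing or valueless alt attribute is recorded as "".
            self.image_alts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings[tag].append("".join(self._buffer).strip())
            self._current = None


parser = HeadingAndImageParser()
parser.feed(
    "<h1>Blue Widgets</h1>"
    "<h2>Why <em>blue</em> widgets?</h2>"
    '<img src="a.png" alt="blue widgets on a shelf">'
)
print(parser.headings, parser.image_alts)
```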

4. Config System Updates

File: src/core/config.py

Added:

  • UniversalRulesConfig model
  • CORAValidationConfig model
  • Updated ContentRulesConfig to use nested structure
  • Added Config.get() method for dot notation access (e.g., config.get("content_rules.universal"))
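
Dot-notation access of this kind is usually a short walk over nested dictionaries. The sketch below shows the idea with a standalone helper; get_by_path is a hypothetical name, and the real Config.get() operates on the config models and may behave differently for missing keys.

```python
from typing import Any


def get_by_path(data: dict, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as "content_rules.universal" in nested dicts."""
    node: Any = data
    for key in path.split("."):
        # Stop as soon as the path leaves the dictionary structure.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


config = {"content_rules": {"universal": {"min_content_length": 1000}}}
print(get_by_path(config, "content_rules.universal.min_content_length"))
print(get_by_path(config, "content_rules.missing", default="not set"))
```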

5. Tests

File: tests/unit/test_rule_engine.py

21 comprehensive tests covering:

  • HTML parser functionality (6 tests)
  • ValidationResult class (4 tests)
  • Universal rules validation (6 tests)
  • CORA target validation (4 tests)
  • Fully compliant content (1 test)

All tests passing ✓

Usage Example

from src.generation.rule_engine import ContentRuleEngine
from src.core.config import get_config
from src.database.models import Project

# Initialize engine
config = get_config()
engine = ContentRuleEngine(config)

# Validate content
html_content = "<html>...</html>"
project = ...  # placeholder: load a Project instance from the database
result = engine.validate(html_content, project)

if result.passed:
    print("Content is valid!")
else:
    print(f"Errors: {len(result.errors)}")
    for error in result.errors:
        print(f"  - {error.message}")
    
    print(f"Warnings: {len(result.warnings)}")
    for warning in result.warnings:
        print(f"  - {warning.message}")

# Get detailed report
report = result.to_dict()

Validation Logic

Universal Rules (All Tiers)

  1. Word Count: Content length between min/max bounds
  2. Title: Must contain main keyword
  3. H1: At least one H1 with main keyword
  4. H2/H3 Minimums: At least the configured minimum number of H2s and H3s contain the main keyword
  5. FAQ: Must have FAQ section
  6. Images: Alt text contains keyword + entities

CORA Targets (Tier-Aware)

For each CORA metric (h1_exact, h2_total, h2_entities, etc.):

  • Tier 1: FAIL if actual < target (rounded down)
  • Tier 2+: WARN if actual < target (but pass)
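
The tier-aware decision can be written as a small function. The following is a sketch under assumptions: check_cora_metric and its return format are hypothetical, and it takes the rounding toggle as a plain argument rather than reading round_averages_down from the config.

```python
import math


def check_cora_metric(name, actual, target_average, tier, round_down=True):
    """Return (outcome, message), where outcome is "pass", "error" or "warning"."""
    # With rounding enabled, a CORA average of 5.6 becomes a target of 5.
    target = math.floor(target_average) if round_down else target_average
    if actual >= target:
        return ("pass", f"{name}: {actual} meets target {target}")
    # Tier 1 treats a miss as a hard failure; Tier 2+ only warns.
    outcome = "error" if tier == 1 else "warning"
    return (outcome, f"{name}: found {actual}, target {target}")


print(check_cora_metric("h2_total", 5, 5.6, tier=1))
print(check_cora_metric("h2_total", 4, 5.6, tier=1))
print(check_cora_metric("h2_total", 4, 5.6, tier=2))
```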

Keyword Matching

  • Case-insensitive
  • Word boundary detection (avoids partial matches)
  • Supports related searches and entities
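
One common way to get case-insensitive matching with word boundaries is a regular expression with lookarounds. The helper below is an illustrative sketch, not the engine's actual matcher; count_keyword is a hypothetical name.

```python
import re


def count_keyword(text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    # Lookarounds reject matches that touch another word character,
    # so "cat" does not match inside "category" or "concat".
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


print(count_keyword("Cat, category, concat, cat.", "cat"))
```

Lookarounds are used instead of \b so that keywords beginning or ending with punctuation are still handled predictably.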

Acceptance Criteria

  • System loads "content_rules" from master JSON configuration
  • Validates H1 tag contains main keyword
  • Validates at least one H2 starts with main keyword
  • Validates other H2s incorporate entities and related searches
  • Validates H3 tags similarly to H2s
  • Validates FAQ section format
  • Validates image alt text contains keyword and entities
  • Tier-based validation (strict for Tier 1, warnings for Tier 2+)
  • Rounds CORA averages down as configured
  • All tests passing (21/21)

Files Modified

  1. src/database/models.py - Added tier field to Project
  2. master.config.json - Restructured content_rules
  3. src/core/config.py - Added config models and get() method
  4. src/generation/rule_engine.py - Implemented validation engine
  5. scripts/add_tier_to_projects.py - Database migration
  6. tests/unit/test_rule_engine.py - Comprehensive test suite

Next Steps (Story 2.3)

The rule engine is ready to be integrated into Story 2.3 (AI-Powered Content Generation):

  • Story 2.3 will use this engine to validate AI-generated content
  • Story 2.3 can add retry logic for content that fails validation
  • Engine provides detailed feedback for AI prompt refinement
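
A retry loop of the kind described could look roughly like the sketch below. Everything here is hypothetical: generate_with_retries, fake_generate and fake_validate are stand-ins invented for illustration, and Story 2.3 may wire the generator and the rule engine together differently.

```python
from typing import Callable, List, Tuple


def generate_with_retries(
    generate: Callable[[List[str]], str],
    validate: Callable[[str], List[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call generate until validate reports no errors or attempts run out."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # Errors from the previous attempt are passed back to the generator.
        html = generate(feedback)
        feedback = validate(html)
        if not feedback:
            return html, attempt
    raise RuntimeError(f"Still failing after {max_attempts} attempts: {feedback}")


# Stand-ins for the AI generator and the rule engine.
def fake_generate(feedback: List[str]) -> str:
    return "<h1>Blue Widgets</h1>" if feedback else "<p>draft</p>"


def fake_validate(html: str) -> List[str]:
    return [] if "<h1>" in html else ["H1 with main keyword is missing"]


html, attempts = generate_with_retries(fake_generate, fake_validate)
print(attempts, html)
```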