# Story 2.2: Configurable Content Rule Engine

## Overview
Implementation of a CORA-compliant content validation engine that ensures AI-generated HTML meets both universal quality standards and project-specific CORA targets.

## Status
**COMPLETED**

## Implementation Details

### 1. Database Changes
- Added `tier` field to `Project` model (default=1, indexed)
- Created migration script: `scripts/add_tier_to_projects.py`
- Tier 1 = strictest validation (default)
- Tier 2+ = warnings only for CORA target misses
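
The migration described above could look roughly like the sketch below. It is not the contents of `scripts/add_tier_to_projects.py`: it assumes a SQLite database and a table named `projects`, and the helper name `add_tier_column` and the index name are hypothetical.

```python
import sqlite3


def add_tier_column(conn: sqlite3.Connection, table: str = "projects") -> bool:
    """Add an indexed `tier` column defaulting to 1; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if "tier" in columns:
        return False  # already migrated, so re-running the script is harmless

    # Existing rows pick up the default, i.e. the strictest tier.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN tier INTEGER NOT NULL DEFAULT 1")
    conn.execute(f"CREATE INDEX IF NOT EXISTS ix_{table}_tier ON {table} (tier)")
    conn.commit()
    return True
```

Checking for the column first keeps the migration idempotent, which matters if the script is run against environments in different states.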

### 2. Configuration Updates
**File:** `master.config.json`

Restructured `content_rules` with two validation levels:

**Universal Rules** (apply to all tiers, hard failures):
- `min_content_length`: 1000 words minimum
- `max_content_length`: 5000 words maximum
- `title_exact_match_required`: Title must contain main keyword
- `h1_exact_match_required`: H1 must contain main keyword
- `h2_exact_match_min`: At least 1 H2 with main keyword
- `h3_exact_match_min`: At least 1 H3 with main keyword
- `faq_section_required`: Must include FAQ section
- `faq_question_restatement_required`: FAQ answers must restate the question
- `image_alt_text_keyword_required`: Alt text must contain keyword
- `image_alt_text_entity_required`: Alt text must contain entities

**CORA Validation Config**:
- `enabled`: Toggle CORA validation on/off
- `tier_1_strict`: Tier 1 fails on CORA target misses
- `tier_2_plus_warn_only`: Tier 2+ only warns
- `round_averages_down`: Round CORA averages down (e.g., 5.6 → 5)
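
For orientation, the restructured section might have a shape like the one below, written as a Python dict that mirrors the JSON. The rule names and the `universal` key come from this document; the `cora_validation` key name and all boolean values are assumptions, so check them against the real `master.config.json`.

```python
import json

# Hypothetical shape of the `content_rules` section (not copied from the repo).
CONTENT_RULES_EXAMPLE = {
    "content_rules": {
        "universal": {
            "min_content_length": 1000,
            "max_content_length": 5000,
            "title_exact_match_required": True,
            "h1_exact_match_required": True,
            "h2_exact_match_min": 1,
            "h3_exact_match_min": 1,
            "faq_section_required": True,
            "faq_question_restatement_required": True,
            "image_alt_text_keyword_required": True,
            "image_alt_text_entity_required": True,
        },
        # Assumed key name for the CORA validation block.
        "cora_validation": {
            "enabled": True,
            "tier_1_strict": True,
            "tier_2_plus_warn_only": True,
            "round_averages_down": True,
        },
    }
}

# Serialize to see the equivalent JSON fragment.
print(json.dumps(CONTENT_RULES_EXAMPLE, indent=2))
```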

### 3. Core Rule Engine
**File:** `src/generation/rule_engine.py`

**Classes:**
- `ValidationIssue`: Single validation error or warning
- `ValidationResult`: Complete validation result with errors/warnings
- `ContentHTMLParser`: Extracts structure from HTML (H1/H2/H3/images/links/text)
- `ContentRuleEngine`: Main validation engine

**Key Features:**
- HTML parsing and element extraction
- Keyword/entity counting with word boundary matching
- Universal rule validation (hard failures)
- CORA target validation (tier-aware)
- FAQ section detection
- Image alt text validation
- Detailed error/warning reporting
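
To illustrate the parsing step, here is a minimal sketch of heading and image extraction built on the standard library's `html.parser`. It is not the actual `ContentHTMLParser`: the class name `HeadingAndImageParser` and its attributes are hypothetical, and it skips links and body text.

```python
from html.parser import HTMLParser


class HeadingAndImageParser(HTMLParser):
    """Collects heading text (h1-h3) and image alt text from an HTML string."""

    HEADING_TAGS = ("h1", "h2", "h3")

    def __init__(self) -> None:
        super().__init__()
        self.headings = {tag: [] for tag in self.HEADING_TAGS}
        self.image_alts = []
        self._open_heading = None  # heading tag currently being read, if any
        self._buffer = []          # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._open_heading = tag
            self._buffer = []
        elif tag == "img":
            # attrs is a list of (name, value) pairs; a missing alt becomes "".
            self.image_alts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        # Nested inline tags (e.g. <em>) still deliver their text here.
        if self._open_heading is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            self.headings[tag].append("".join(self._buffer).strip())
            self._open_heading = None
```

After `parser.feed(html)`, `parser.headings["h2"]` holds the text of every H2, which is the input the keyword and entity checks need.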

### 4. Config System Updates
**File:** `src/core/config.py`

Added:
- `UniversalRulesConfig` model
- `CORAValidationConfig` model
- Updated `ContentRulesConfig` to use nested structure
- Added `Config.get()` method for dot notation access (e.g., `config.get("content_rules.universal")`)
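
Dot-notation access can be sketched as a walk over nested dictionaries. This is an illustration only: the real `Config.get()` may operate on config models rather than plain dicts, and the function name `get_by_path` is hypothetical.

```python
from typing import Any


def get_by_path(data: dict, path: str, default: Any = None) -> Any:
    """Walk nested dicts using a dot-separated path such as "content_rules.universal"."""
    current: Any = data
    for key in path.split("."):
        # Stop early if the path runs past a leaf or names a missing key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


settings = {"content_rules": {"universal": {"min_content_length": 1000}}}
min_length = get_by_path(settings, "content_rules.universal.min_content_length")
```

Returning a default instead of raising keeps callers simple when an optional section is absent from the config.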

### 5. Tests
**File:** `tests/unit/test_rule_engine.py`

**21 comprehensive tests covering:**
- HTML parser functionality (6 tests)
- ValidationResult class (4 tests)
- Universal rules validation (6 tests)
- CORA target validation (4 tests)
- Fully compliant content (1 test)

**All tests passing ✓**

## Usage Example

```python
from src.generation.rule_engine import ContentRuleEngine
from src.core.config import get_config
from src.database.models import Project

# Initialize engine
config = get_config()
engine = ContentRuleEngine(config)

# Validate content
html_content = "<html>...</html>"
project: Project = ...  # placeholder: load the Project from the database
result = engine.validate(html_content, project)

if result.passed:
    print("Content is valid!")
else:
    print(f"Errors: {len(result.errors)}")
    for error in result.errors:
        print(f"  - {error.message}")

    print(f"Warnings: {len(result.warnings)}")
    for warning in result.warnings:
        print(f"  - {warning.message}")

# Get detailed report
report = result.to_dict()
```

## Validation Logic

### Universal Rules (All Tiers)
1. **Word Count**: Content length between min/max bounds
2. **Title**: Must contain main keyword
3. **H1**: At least one H1 with main keyword
4. **H2/H3 Minimums**: At least the configured minimum number of H2 and H3 headings contain the main keyword
5. **FAQ**: Must have FAQ section
6. **Images**: Alt text contains keyword + entities

### CORA Targets (Tier-Aware)
For each CORA metric (h1_exact, h2_total, h2_entities, etc.):
- **Tier 1**: FAIL if actual < target (rounded down)
- **Tier 2+**: WARN if actual < target (but pass)
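
The tier-aware rule above can be sketched as a single function. This is an illustration under assumptions, not the engine's code: the function name `check_cora_target` and the `(severity, message)` return shape are hypothetical.

```python
import math


def check_cora_target(metric, actual, target_average, tier, round_down=True):
    """Return (severity, message) for one CORA metric.

    severity is "pass", "error" (Tier 1 miss) or "warning" (Tier 2+ miss).
    """
    # With round_averages_down enabled, a CORA average of 5.6 becomes a target of 5.
    target = math.floor(target_average) if round_down else target_average
    if actual >= target:
        return ("pass", f"{metric}: found {actual}, target {target}")

    # Tier 1 treats a miss as a hard failure; higher tiers only warn.
    severity = "error" if tier == 1 else "warning"
    return (severity, f"{metric}: found {actual}, target {target}")
```

For example, with a CORA average of 5.6 for `h2_total`, five H2s pass on any tier, while four H2s produce an error on Tier 1 and a warning on Tier 2.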

### Keyword Matching
- Case-insensitive
- Word boundary detection (avoids partial matches)
- Supports related searches and entities
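
One way to get case-insensitive, boundary-aware matching is a regular expression with lookarounds, as in the sketch below. The function name `count_keyword` is hypothetical and the engine's actual matching may differ in detail.

```python
import re


def count_keyword(text: str, keyword: str) -> int:
    """Count whole-phrase, case-insensitive occurrences of keyword in text."""
    # re.escape protects keywords that contain regex metacharacters.
    # The lookarounds reject matches embedded in a longer word, and unlike \b
    # they also behave sensibly when the keyword starts or ends with punctuation.
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
    return len(pattern.findall(text))
```

With this approach, `cat` matches "Cat" and "CAT" but not the `cat` inside "concatenate". The same function can be applied to entities and related searches.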

## Acceptance Criteria

- ✅ System loads "content_rules" from master JSON configuration
- ✅ Validates H1 tag contains main keyword
- ✅ Validates at least one H2 starts with main keyword
- ✅ Validates other H2s incorporate entities and related searches
- ✅ Validates H3 tags similarly to H2s
- ✅ Validates FAQ section format
- ✅ Validates image alt text contains keyword and entities
- ✅ Tier-based validation (strict for Tier 1, warnings for Tier 2+)
- ✅ Rounds CORA averages down as configured
- ✅ All tests passing (21/21)

## Files Modified

1. `src/database/models.py` - Added tier field to Project
2. `master.config.json` - Restructured content_rules
3. `src/core/config.py` - Added config models and get() method
4. `src/generation/rule_engine.py` - Implemented validation engine
5. `scripts/add_tier_to_projects.py` - Database migration
6. `tests/unit/test_rule_engine.py` - Comprehensive test suite

## Next Steps (Story 2.3)

The rule engine is ready to be integrated into Story 2.3 (AI-Powered Content Generation):
- Story 2.3 will use this engine to validate AI-generated content
- Story 2.3 can add retry logic for content that fails validation
- Engine provides detailed feedback for AI prompt refinement
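
A retry loop along these lines is one possible integration. Story 2.3 is not specified here, so everything in this sketch is an assumption: `generate` stands in for the AI call, and `validate` is a hypothetical wrapper around `engine.validate(...)` that returns a list of error messages.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_retries(
    generate: Callable[[List[str]], str],
    validate: Callable[[str], List[str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Regenerate content until it validates or attempts run out.

    Returns (html, []) on success, or (None, last_errors) on failure.
    """
    feedback: List[str] = []
    for _ in range(max_attempts):
        # Errors from the previous attempt are passed back to refine the prompt.
        html = generate(feedback)
        feedback = validate(html)
        if not feedback:
            return html, []
    return None, feedback
```

Capping the number of attempts keeps a persistently failing prompt from looping forever, and returning the last error list lets the caller log why generation was abandoned.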