# Story 2.2: Configurable Content Rule Engine
## Overview
Implementation of a CORA-compliant content validation engine that ensures AI-generated HTML meets both universal quality standards and project-specific CORA targets.
## Status
**COMPLETED**
## Implementation Details
### 1. Database Changes
- Added `tier` field to `Project` model (default=1, indexed)
- Created migration script: `scripts/add_tier_to_projects.py`
- Tier 1 = strictest validation (default)
- Tier 2+ = warnings only for CORA target misses
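
As a rough illustration of what such a migration might do, the sketch below adds an indexed `tier` column with a default of 1. It uses SQLite from the standard library and assumes a table named `projects`; the actual script, table name, and database backend may differ.

```python
import sqlite3


def add_tier_column(conn: sqlite3.Connection) -> None:
    """Add an indexed `tier` column (default 1) to a hypothetical `projects` table."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(projects)")}
    if "tier" not in columns:  # keeps the migration safe to run twice
        conn.execute(
            "ALTER TABLE projects ADD COLUMN tier INTEGER NOT NULL DEFAULT 1"
        )
    conn.execute("CREATE INDEX IF NOT EXISTS ix_projects_tier ON projects (tier)")
    conn.commit()


# Demo against an in-memory database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO projects (name) VALUES ('demo')")
add_tier_column(conn)
add_tier_column(conn)  # second run is a no-op
tiers = [row[0] for row in conn.execute("SELECT tier FROM projects")]
```

Existing rows pick up the default, so every project starts in the strictest tier unless it is changed explicitly.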
### 2. Configuration Updates
**File:** `master.config.json`
Restructured `content_rules` with two validation levels:
**Universal Rules** (apply to all tiers, hard failures):
- `min_content_length`: 1000 words minimum
- `max_content_length`: 5000 words maximum
- `title_exact_match_required`: Title must contain main keyword
- `h1_exact_match_required`: H1 must contain main keyword
- `h2_exact_match_min`: At least 1 H2 with main keyword
- `h3_exact_match_min`: At least 1 H3 with main keyword
- `faq_section_required`: Must include FAQ section
- `faq_question_restatement_required`: FAQ answers restate questions
- `image_alt_text_keyword_required`: Alt text must contain keyword
- `image_alt_text_entity_required`: Alt text must contain entities
**CORA Validation Config**:
- `enabled`: Toggle CORA validation on/off
- `tier_1_strict`: Tier 1 fails on CORA target misses
- `tier_2_plus_warn_only`: Tier 2+ only warns
- `round_averages_down`: Round CORA averages down (e.g., 5.6 → 5)
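
The fragment below sketches how the restructured `content_rules` section might look once loaded. The key names come from the lists above; the `cora_validation` key name and the boolean values are assumptions, and `effective_target` is a hypothetical helper that only illustrates the rounding rule.

```python
import json
import math

# Key names follow this story; "cora_validation" and the boolean values
# are assumed for illustration.
CONTENT_RULES_JSON = """
{
  "content_rules": {
    "universal": {
      "min_content_length": 1000,
      "max_content_length": 5000,
      "title_exact_match_required": true,
      "h1_exact_match_required": true,
      "h2_exact_match_min": 1,
      "h3_exact_match_min": 1,
      "faq_section_required": true,
      "faq_question_restatement_required": true,
      "image_alt_text_keyword_required": true,
      "image_alt_text_entity_required": true
    },
    "cora_validation": {
      "enabled": true,
      "tier_1_strict": true,
      "tier_2_plus_warn_only": true,
      "round_averages_down": true
    }
  }
}
"""

rules = json.loads(CONTENT_RULES_JSON)["content_rules"]


def effective_target(cora_average):
    """Hypothetical helper: round a CORA average down when configured to."""
    if rules["cora_validation"]["round_averages_down"]:
        return math.floor(cora_average)
    return cora_average
```

With `round_averages_down` enabled, a CORA average of 5.6 becomes a target of 5.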
### 3. Core Rule Engine
**File:** `src/generation/rule_engine.py`
**Classes:**
- `ValidationIssue`: Single validation error or warning
- `ValidationResult`: Complete validation result with errors/warnings
- `ContentHTMLParser`: Extracts structure from HTML (H1/H2/H3/images/links/text)
- `ContentRuleEngine`: Main validation engine
**Key Features:**
- HTML parsing and element extraction
- Keyword/entity counting with word boundary matching
- Universal rule validation (hard failures)
- CORA target validation (tier-aware)
- FAQ section detection
- Image alt text validation
- Detailed error/warning reporting
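
To make the parsing step concrete, here is a minimal sketch of heading and image extraction built on the standard library's `html.parser`. The class name and attributes are hypothetical and are not the interface of `ContentHTMLParser`.

```python
from html.parser import HTMLParser


class HeadingAndImageParser(HTMLParser):
    """Collects H1/H2/H3 text and image alt attributes (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.image_alts = []
        self._current = None  # heading tag currently open, if any
        self._buffer = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._current = tag
            self._buffer = []
        elif tag == "img":
            # A missing or empty alt attribute is recorded as "".
            self.image_alts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings[tag].append("".join(self._buffer).strip())
            self._current = None


parser = HeadingAndImageParser()
parser.feed(
    "<h1>Best Widgets</h1>"
    "<h2>Why <em>Widgets</em> Matter</h2>"
    '<img src="a.png" alt="widgets on a shelf">'
)
```

Text inside nested inline tags such as `<em>` is kept as part of the enclosing heading, which matters when the keyword is partly emphasized.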
### 4. Config System Updates
**File:** `src/core/config.py`
Added:
- `UniversalRulesConfig` model
- `CORAValidationConfig` model
- Updated `ContentRulesConfig` to use nested structure
- Added `Config.get()` method for dot notation access (e.g., `config.get("content_rules.universal")`)
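
Dot-notation access can be sketched as a walk over nested keys. The function below works on plain dictionaries and is only an illustration; `get_by_path` is a hypothetical name, and the real `Config.get()` operates on the config models and may behave differently.

```python
from typing import Any


def get_by_path(data: dict, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as "content_rules.universal" in nested dicts."""
    node: Any = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default  # any missing segment falls back to the default
        node = node[key]
    return node


sample = {"content_rules": {"universal": {"min_content_length": 1000}}}
universal = get_by_path(sample, "content_rules.universal")
missing = get_by_path(sample, "content_rules.unknown.key", default={})
```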
### 5. Tests
**File:** `tests/unit/test_rule_engine.py`
**21 comprehensive tests covering:**
- HTML parser functionality (6 tests)
- ValidationResult class (4 tests)
- Universal rules validation (6 tests)
- CORA target validation (4 tests)
- Fully compliant content (1 test)
**All tests passing ✓**
## Usage Example
```python
from src.generation.rule_engine import ContentRuleEngine
from src.core.config import get_config
from src.database.models import Project

# Initialize the engine from the loaded configuration
config = get_config()
engine = ContentRuleEngine(config)

# Validate content
html_content = "<html>...</html>"
project = ...  # placeholder: load a Project instance from the database
result = engine.validate(html_content, project)

if result.passed:
    print("Content is valid!")
else:
    print(f"Errors: {len(result.errors)}")
    for error in result.errors:
        print(f"  - {error.message}")

# Warnings can be present even when validation passes (e.g., Tier 2+)
print(f"Warnings: {len(result.warnings)}")
for warning in result.warnings:
    print(f"  - {warning.message}")

# Get detailed report
report = result.to_dict()
```
## Validation Logic
### Universal Rules (All Tiers)
1. **Word Count**: Content length between min/max bounds
2. **Title**: Must contain main keyword
3. **H1**: At least one H1 with main keyword
4. **H2/H3 Minimums**: At least the configured minimum number of H2 and H3 headings contain the main keyword
5. **FAQ**: Must have FAQ section
6. **Images**: Alt text contains keyword + entities
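
The word-count rule is the simplest of these to sketch. The example below strips tags with a rough regular expression and compares the word count with the configured bounds; `check_word_count` is a hypothetical helper, and the engine itself may count words differently.

```python
import re


def check_word_count(html, min_words=1000, max_words=5000):
    """Return an error message if the visible word count is out of bounds, else None."""
    # Rough tag stripping; good enough for a sketch, not for arbitrary HTML.
    text = re.sub(r"<[^>]+>", " ", html)
    count = len(text.split())
    if count < min_words:
        return f"Content has {count} words; minimum is {min_words}"
    if count > max_words:
        return f"Content has {count} words; maximum is {max_words}"
    return None


long_enough = check_word_count("<p>" + "word " * 1200 + "</p>")
too_short = check_word_count("<p>too short</p>")
```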
### CORA Targets (Tier-Aware)
For each CORA metric (h1_exact, h2_total, h2_entities, etc.):
- **Tier 1**: FAIL if actual < target (rounded down)
- **Tier 2+**: WARN if actual < target (but pass)
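
The tier-aware behavior can be sketched as follows. The `Issue` and `Result` classes and the `check_cora_targets` function are simplified, hypothetical stand-ins for `ValidationIssue`, `ValidationResult`, and the engine's own logic; the metric names come from the list above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Issue:
    metric: str
    message: str


@dataclass
class Result:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def passed(self):
        # Only errors fail validation; warnings never do.
        return not self.errors


def check_cora_targets(actuals, targets, tier, round_down=True):
    """Compare actual counts with CORA targets; Tier 1 errors, Tier 2+ warns."""
    result = Result()
    for metric, target in targets.items():
        goal = math.floor(target) if round_down else target
        actual = actuals.get(metric, 0)
        if actual < goal:
            issue = Issue(metric, f"{metric}: found {actual}, target {goal}")
            (result.errors if tier == 1 else result.warnings).append(issue)
    return result


targets = {"h1_exact": 1, "h2_total": 5.6}
actuals = {"h1_exact": 1, "h2_total": 4}
tier_1 = check_cora_targets(actuals, targets, tier=1)
tier_2 = check_cora_targets(actuals, targets, tier=2)
```

Here the same shortfall on `h2_total` (4 against a rounded-down target of 5) fails a Tier 1 project but only produces a warning for Tier 2.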
### Keyword Matching
- Case-insensitive
- Word boundary detection (avoids partial matches)
- Supports related searches and entities
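
A minimal sketch of case-insensitive, boundary-aware counting is shown below. It uses lookarounds instead of `\b` so that multi-word keywords and keywords ending in punctuation still match as whole units; `count_keyword` is a hypothetical name, not the engine's actual function.

```python
import re


def count_keyword(text, keyword):
    """Count whole-word, case-insensitive occurrences of `keyword` in `text`."""
    # (?<!\w) and (?!\w) require that no word character touches the match.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


single = count_keyword("Cat food for cats. CAT toys; concatenate.", "cat")
phrase = count_keyword("Best dog food and dog foods", "dog food")
```

In the first call, "cats" and "concatenate" are not counted because the keyword is only part of a longer word.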
## Acceptance Criteria
- System loads "content_rules" from master JSON configuration
- Validates H1 tag contains main keyword
- Validates at least one H2 starts with main keyword
- Validates other H2s incorporate entities and related searches
- Validates H3 tags similarly to H2s
- Validates FAQ section format
- Validates image alt text contains keyword and entities
- Tier-based validation (strict for Tier 1, warnings for Tier 2+)
- Rounds CORA averages down as configured
- All tests passing (21/21)
## Files Modified
1. `src/database/models.py` - Added tier field to Project
2. `master.config.json` - Restructured content_rules
3. `src/core/config.py` - Added config models and get() method
4. `src/generation/rule_engine.py` - Implemented validation engine
5. `scripts/add_tier_to_projects.py` - Database migration
6. `tests/unit/test_rule_engine.py` - Comprehensive test suite
## Next Steps (Story 2.3)
The rule engine is ready to be integrated into Story 2.3 (AI-Powered Content Generation):
- Story 2.3 will use this engine to validate AI-generated content
- Story 2.3 can retry generation when validation fails
- The engine's detailed errors and warnings can feed back into AI prompt refinement
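
One possible shape for that retry loop is sketched below. Both callables are hypothetical: `generate` stands in for the Story 2.3 content generator and `validate` for a wrapper around the rule engine that returns a list of error messages.

```python
def generate_with_retries(generate, validate, max_attempts=3):
    """Regenerate content until it validates, feeding errors back each time.

    Returns (html, attempts_used), or (None, max_attempts) if every attempt fails.
    """
    feedback = []
    for attempt in range(1, max_attempts + 1):
        html = generate(feedback)  # previous errors can steer the next prompt
        errors = validate(html)
        if not errors:
            return html, attempt
        feedback = errors
    return None, max_attempts


# Stand-in callables for demonstration only.
def fake_generate(feedback):
    return "<h1>Widgets</h1>" if feedback else "<p>no heading</p>"


def fake_validate(html):
    return [] if "<h1>" in html else ["Missing H1 with main keyword"]


html, attempts = generate_with_retries(fake_generate, fake_validate)
```

Capping the number of attempts keeps a persistently failing prompt from looping indefinitely.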