# Story 2.2: Configurable Content Rule Engine

## Overview
Implementation of a CORA-compliant content validation engine that ensures AI-generated HTML meets both universal quality standards and project-specific CORA targets.

## Status
**COMPLETED**

## Implementation Details

### 1. Database Changes
- Added `tier` field to `Project` model (default=1, indexed)
- Created migration script: `scripts/add_tier_to_projects.py`
- Tier 1 = strictest validation (default)
- Tier 2+ = warnings only for CORA target misses

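The migration script itself is not reproduced here. As a rough sketch of what adding an indexed `tier` column with a default of 1 can look like, the example below uses the standard-library `sqlite3` module. The table name `projects`, the index name `ix_projects_tier`, and the function `add_tier_column` are assumptions for illustration, not the contents of `scripts/add_tier_to_projects.py`.

```python
import sqlite3


def add_tier_column(conn: sqlite3.Connection) -> None:
    """Add an indexed `tier` column (default 1) unless it already exists."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(projects)")}
    if "tier" not in columns:
        conn.execute(
            "ALTER TABLE projects ADD COLUMN tier INTEGER NOT NULL DEFAULT 1"
        )
    conn.execute("CREATE INDEX IF NOT EXISTS ix_projects_tier ON projects (tier)")
    conn.commit()


# Demo against an in-memory database holding one pre-existing project.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO projects (name) VALUES ('demo')")
add_tier_column(conn)
add_tier_column(conn)  # Running the migration twice is harmless.
existing_tier = conn.execute("SELECT tier FROM projects").fetchone()[0]
print(existing_tier)
```

Existing rows pick up the default, so projects created before the migration are validated as Tier 1 unless their tier is changed explicitly.
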
### 2. Configuration Updates
**File:** `master.config.json`

Restructured `content_rules` with two validation levels:

**Universal Rules** (apply to all tiers, hard failures):
- `min_content_length`: 1000 words minimum
- `max_content_length`: 5000 words maximum
- `title_exact_match_required`: Title must contain main keyword
- `h1_exact_match_required`: H1 must contain main keyword
- `h2_exact_match_min`: At least 1 H2 with main keyword
- `h3_exact_match_min`: At least 1 H3 with main keyword
- `faq_section_required`: Must include FAQ section
- `faq_question_restatement_required`: FAQ answers restate questions
- `image_alt_text_keyword_required`: Alt text must contain keyword
- `image_alt_text_entity_required`: Alt text must contain entities

**CORA Validation Config**:
- `enabled`: Toggle CORA validation on/off
- `tier_1_strict`: Tier 1 fails on CORA target misses
- `tier_2_plus_warn_only`: Tier 2+ only warns
- `round_averages_down`: Round CORA averages down (e.g., 5.6 → 5)

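To make the two-level structure concrete, here is one way the restructured rules might look once loaded into Python. The rule names and numeric limits come from the lists above. The nesting key `cora_validation`, the boolean values, and the helper `effective_target` are assumptions, and the real `master.config.json` may be laid out differently.

```python
import math

# Hypothetical in-memory view of `content_rules`; booleans are assumed values.
content_rules = {
    "universal": {
        "min_content_length": 1000,
        "max_content_length": 5000,
        "title_exact_match_required": True,
        "h1_exact_match_required": True,
        "h2_exact_match_min": 1,
        "h3_exact_match_min": 1,
        "faq_section_required": True,
        "faq_question_restatement_required": True,
        "image_alt_text_keyword_required": True,
        "image_alt_text_entity_required": True,
    },
    "cora_validation": {
        "enabled": True,
        "tier_1_strict": True,
        "tier_2_plus_warn_only": True,
        "round_averages_down": True,
    },
}


def effective_target(cora_average: float, rules: dict) -> float:
    """Return the target to compare against, rounding down when configured."""
    if rules["cora_validation"]["round_averages_down"]:
        return math.floor(cora_average)
    return cora_average


print(effective_target(5.6, content_rules))
```

With `round_averages_down` enabled, a CORA average of 5.6 becomes a target of 5, matching the example in the list above.
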
### 3. Core Rule Engine
**File:** `src/generation/rule_engine.py`

**Classes:**
- `ValidationIssue`: Single validation error or warning
- `ValidationResult`: Complete validation result with errors/warnings
- `ContentHTMLParser`: Extracts structure from HTML (H1/H2/H3/images/links/text)
- `ContentRuleEngine`: Main validation engine

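The two result classes can be pictured roughly as follows. This is a minimal sketch, not the code in `rule_engine.py`: the attributes `errors`, `warnings`, `passed`, and `message` and the method `to_dict()` appear in the usage example of this document, while the fields `rule` and `severity` and the method `add()` are assumptions added for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ValidationIssue:
    rule: str                 # Assumed field: name of the rule that triggered.
    message: str              # Human-readable description of the problem.
    severity: str = "error"   # Assumed field: "error" or "warning".


@dataclass
class ValidationResult:
    errors: List[ValidationIssue] = field(default_factory=list)
    warnings: List[ValidationIssue] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Warnings never fail validation; only errors do.
        return not self.errors

    def add(self, issue: ValidationIssue) -> None:
        """Route an issue to the errors or warnings list by severity."""
        target = self.errors if issue.severity == "error" else self.warnings
        target.append(issue)

    def to_dict(self) -> dict:
        return {
            "passed": self.passed,
            "errors": [asdict(issue) for issue in self.errors],
            "warnings": [asdict(issue) for issue in self.warnings],
        }


result = ValidationResult()
result.add(ValidationIssue("h2_total", "Found 4 H2 tags, target is 5", "warning"))
print(result.passed, len(result.warnings))
```

Keeping errors and warnings in separate lists is what lets a Tier 2+ project pass while still reporting missed CORA targets.
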
**Key Features:**
- HTML parsing and element extraction
- Keyword/entity counting with word boundary matching
- Universal rule validation (hard failures)
- CORA target validation (tier-aware)
- FAQ section detection
- Image alt text validation
- Detailed error/warning reporting

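As an illustration of the element-extraction step, the sketch below collects heading text and image alt text with the standard-library `html.parser.HTMLParser`. The class name `HeadingAndImageParser` and its attributes are hypothetical. Whether `ContentHTMLParser` is built on the same base class is not stated here, and the real class also extracts links and body text.

```python
from html.parser import HTMLParser


class HeadingAndImageParser(HTMLParser):
    """Collect H1/H2/H3 text and image alt attributes from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.image_alts = []
        self._current = None   # Heading tag currently open, if any.
        self._buffer = []      # Text fragments seen inside that heading.

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._current = tag
            self._buffer = []
        elif tag == "img":
            # A missing or valueless alt attribute is recorded as "".
            self.image_alts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings[tag].append("".join(self._buffer).strip())
            self._current = None


parser = HeadingAndImageParser()
parser.feed(
    "<h1>Blue Widgets</h1>"
    "<h2>Why <em>Blue</em> Widgets</h2>"
    '<img src="a.png" alt="blue widgets photo">'
)
print(parser.headings["h2"], parser.image_alts)
```

Buffering text until the closing tag means inline markup such as `<em>` inside a heading does not split the heading text.
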
### 4. Config System Updates
**File:** `src/core/config.py`

Added:
- `UniversalRulesConfig` model
- `CORAValidationConfig` model
- Updated `ContentRulesConfig` to use nested structure
- Added `Config.get()` method for dot notation access (e.g., `config.get("content_rules.universal")`)

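Dot-notation access of the kind `Config.get()` offers can be sketched as a walk over nested keys. The function below works on plain dictionaries and is only an illustration: the name `get_by_path` and the `default` parameter are assumptions, and the real method operates on the config models rather than on a `dict`.

```python
from typing import Any


def get_by_path(data: dict, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as "content_rules.universal" in nested dicts."""
    current: Any = data
    for key in path.split("."):
        # Stop as soon as a segment is missing or the value is not a mapping.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


settings = {"content_rules": {"universal": {"min_content_length": 1000}}}
print(get_by_path(settings, "content_rules.universal.min_content_length"))
print(get_by_path(settings, "content_rules.missing", default={}))
```

Returning a default instead of raising keeps call sites short, at the cost of hiding typos in the path. Raising `KeyError` would be the stricter alternative.
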
### 5. Tests
**File:** `tests/unit/test_rule_engine.py`

**21 comprehensive tests covering:**
- HTML parser functionality (6 tests)
- ValidationResult class (4 tests)
- Universal rules validation (6 tests)
- CORA target validation (4 tests)
- Fully compliant content (1 test)

**All tests passing ✓**

## Usage Example

```python
from src.generation.rule_engine import ContentRuleEngine
from src.core.config import get_config
from src.database.models import Project

# Initialize engine
config = get_config()
engine = ContentRuleEngine(config)

# Validate content
html_content = "<html>...</html>"
project = ...  # placeholder: load a Project instance from the database
result = engine.validate(html_content, project)

if result.passed:
    print("Content is valid!")
else:
    print(f"Errors: {len(result.errors)}")
    for error in result.errors:
        print(f"  - {error.message}")

    print(f"Warnings: {len(result.warnings)}")
    for warning in result.warnings:
        print(f"  - {warning.message}")

# Get detailed report
report = result.to_dict()
```

## Validation Logic

### Universal Rules (All Tiers)
1. **Word Count**: Content length between min/max bounds
2. **Title**: Must contain main keyword
3. **H1**: At least one H1 with main keyword
4. **H2/H3 Minimums**: Minimum keyword counts
5. **FAQ**: Must have FAQ section
6. **Images**: Alt text contains keyword + entities

### CORA Targets (Tier-Aware)
For each CORA metric (h1_exact, h2_total, h2_entities, etc.):
- **Tier 1**: FAIL if actual < target (rounded down)
- **Tier 2+**: WARN if actual < target (but pass)

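The tier-aware comparison above can be expressed in a few lines. This is a sketch under stated assumptions: the function name `check_cora_target`, its parameters, and the tuple it returns are hypothetical and are not taken from `ContentRuleEngine`.

```python
import math
from typing import Optional, Tuple


def check_cora_target(
    metric: str,
    actual: int,
    target_average: float,
    tier: int,
    round_down: bool = True,
) -> Optional[Tuple[str, str]]:
    """Return None when the target is met, else a (severity, message) pair."""
    target = math.floor(target_average) if round_down else target_average
    if actual >= target:
        return None
    # Tier 1 treats a miss as a hard failure; Tier 2+ only warns.
    severity = "error" if tier == 1 else "warning"
    return severity, f"{metric}: found {actual}, target {target}"


print(check_cora_target("h2_total", 5, 5.6, tier=1))
print(check_cora_target("h2_total", 4, 5.6, tier=1))
print(check_cora_target("h2_total", 4, 5.6, tier=2))
```

Rounding happens before the comparison, so a page with 5 H2 tags meets a CORA average of 5.6 on every tier.
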
### Keyword Matching
- Case-insensitive
- Word boundary detection (avoids partial matches)
- Supports related searches and entities

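Case-insensitive matching on word boundaries is commonly done with a regular expression. The sketch below shows one such approach using the standard `re` module. The function name `count_keyword` is hypothetical, and the engine's actual matching code may differ.

```python
import re


def count_keyword(text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of a keyword or phrase."""
    # re.escape keeps punctuation in the keyword from being read as regex syntax.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


sample = "Cat food for cats. A CAT toy, not a catalog."
print(count_keyword(sample, "cat"))
print(count_keyword("Best blue widgets: Blue Widgets guide", "blue widgets"))
```

Here "cats" and "catalog" are not counted because no word boundary follows "cat" in either word. One caveat of this approach is that `\b` only works as intended when the keyword starts and ends with a word character, so a term such as "c++" would need different handling.
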
## Acceptance Criteria

- ✅ System loads "content_rules" from master JSON configuration
- ✅ Validates H1 tag contains main keyword
- ✅ Validates at least one H2 starts with main keyword
- ✅ Validates other H2s incorporate entities and related searches
- ✅ Validates H3 tags similarly to H2s
- ✅ Validates FAQ section format
- ✅ Validates image alt text contains keyword and entities
- ✅ Tier-based validation (strict for Tier 1, warnings for Tier 2+)
- ✅ Rounds CORA averages down as configured
- ✅ All tests passing (21/21)

## Files Modified

1. `src/database/models.py` - Added `tier` field to `Project`
2. `master.config.json` - Restructured `content_rules`
3. `src/core/config.py` - Added config models and `get()` method
4. `src/generation/rule_engine.py` - Implemented validation engine
5. `scripts/add_tier_to_projects.py` - Database migration
6. `tests/unit/test_rule_engine.py` - Comprehensive test suite

## Next Steps (Story 2.3)

The rule engine is ready to be integrated into Story 2.3 (AI-Powered Content Generation):
- Story 2.3 will use this engine to validate AI-generated content
- Story 2.3 can add retry logic for content that fails validation
- The engine's detailed error and warning messages can feed back into AI prompt refinement

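One possible shape for that retry logic is sketched below. Everything in it is hypothetical: Story 2.3 is not implemented in this document, so `generate_with_retries`, its callable parameters, and the stub generator and validator exist only to show the control flow of feeding validation errors back into the next attempt.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_retries(
    generate: Callable[[List[str]], str],
    validate: Callable[[str], List[str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Regenerate content until validation passes or attempts run out."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        html = generate(feedback)   # Feedback from the last failure, if any.
        errors = validate(html)     # Empty list means the content passed.
        if not errors:
            return html, attempt
        feedback = errors
    return None, max_attempts


# Stand-ins for the AI generator and the rule engine.
def fake_generate(feedback: List[str]) -> str:
    return "<h1>Blue Widgets</h1>" + ("<h2>FAQ</h2>" if feedback else "")


def fake_validate(html: str) -> List[str]:
    return [] if "FAQ" in html else ["Missing FAQ section"]


html, attempts = generate_with_retries(fake_generate, fake_validate)
print(attempts)
```

In a real integration, `validate` would wrap `engine.validate(html_content, project)` and return the messages from `result.errors`.
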