Story 2.1: CORA Report Data Ingestion - COMPLETED
Overview
Implemented a complete CORA .xlsx ingestion system: a parser module, database models, CLI commands, and unit and integration tests.
Story Details
As a User, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data.
Acceptance Criteria - ALL MET
1. CLI Command to Ingest CORA Files
Status: COMPLETE
A CLI command exists to accept CORA .xlsx file paths:
- Command: ingest-cora
- Options: --file, --name, --custom-anchors, --username, --password
- Requires user authentication (any authenticated user can create projects)
- Returns success message with project details
2. Data Extraction from CORA Files
Status: COMPLETE
The parser correctly extracts all specified data points:
- Main keyword: From Strategic Overview B5 or filename
- Strategic Overview metrics: Word count, term frequency, densities, spintax
- Structure metrics: Title, meta, H1, H2, H3 counts and distributions
- Entities: From Entities sheet where column J < -0.195
- Related searches: Parsed from spintax format
- Optional anchor text: User-provided via CLI
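The "B5 or filename" fallback for the main keyword can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names and the assumption that the keyword is everything before a "_goog_" marker in the filename are hypothetical, inferred only from the sample filename used later in this document.

```python
from pathlib import Path


def keyword_from_filename(file_path: str, marker: str = "_goog_") -> str:
    """Derive a keyword from a CORA-style filename (hypothetical helper)."""
    stem = Path(file_path).stem
    # Assumption: the keyword is the part of the name before the marker.
    keyword_part = stem.split(marker, 1)[0]
    return keyword_part.replace("_", " ").strip().lower()


def resolve_main_keyword(b5_value, file_path: str) -> str:
    """Prefer the B5 cell value; fall back to the filename when B5 is empty."""
    if b5_value is not None and str(b5_value).strip():
        return str(b5_value).strip()
    return keyword_from_filename(file_path)
```

With the sample file from the manual tests, an empty B5 cell would resolve to "shaft machining" under these assumptions.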
3. Database Storage
Status: COMPLETE
Project records are created with all data:
- User association (user_id foreign key)
- Main keyword and project name
- All numeric metrics from CORA file
- Entities and related searches as JSON arrays
- Custom anchor text as JSON array
- Timestamps (created_at, updated_at)
4. Error Handling
Status: COMPLETE
Graceful error handling for:
- File not found errors
- Invalid Excel file format
- Missing required sheets (Strategic Overview, Structure)
- Authentication failures
- Database errors
Implementation Details
Files Created/Modified
1. src/database/models.py - UPDATED
Added Project model:
class Project(Base):
"""Project model for CORA-ingested SEO data"""
- id, user_id, name, main_keyword
- word_count, term_frequency (with defaults)
- Strategic Overview metrics (densities)
- Structure metrics (title, meta, H1-H3 distributions)
- entities, related_searches, custom_anchor_text (JSON)
- spintax_related_search_terms (raw text)
- created_at, updated_at
2. src/ingestion/parser.py - NEW
CORA parser module with:
class CORAParser:
- __init__(file_path): Initialize with file validation
- extract_main_keyword(): Get keyword from B5 or filename
- extract_strategic_overview(): Get Strategic Overview metrics
- extract_structure_metrics(): Get Structure sheet data
- extract_entities(threshold): Get entities below threshold
- parse_spintax_to_list(): Parse spintax to list
- parse(): Complete file parsing with error handling
Key Features:
- Validates file existence and format on init
- Required sheets must exist (Strategic Overview, Structure)
- Optional sheets handled gracefully (Entities)
- Cell value extraction with defaults for zero/empty
- Comprehensive error messages via CORAParseError
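To illustrate the spintax step, here is a rough standalone sketch of what parse_spintax_to_list() might do. It assumes the B10 cell holds a single flat group such as "{term a|term b|term c}"; the real CORA format and the real method signature may differ.

```python
def parse_spintax_to_list(spintax):
    """Split a flat spintax group like '{a|b|c}' into a list of unique terms."""
    if not spintax:
        return []
    inner = str(spintax).strip()
    # Assumption: one outer pair of braces wraps the whole group.
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    seen = set()
    terms = []
    for term in inner.split("|"):
        term = term.strip()
        # Skip blanks and duplicates while preserving the original order.
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms
```

Nested spintax groups are not handled here; they would need a small recursive parser.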
3. src/database/interfaces.py - UPDATED
Added IProjectRepository interface:
class IProjectRepository(ABC):
- create(user_id, name, data)
- get_by_id(project_id)
- get_by_user_id(user_id)
- get_all()
- update(project)
- delete(project_id)
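As a sketch of how this interface could look in code, the block below spells out the six methods and adds an in-memory implementation of the kind that is handy in unit tests. The type hints, the use of plain dictionaries instead of the Project model, and the InMemoryProjectRepository class are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

ProjectDict = Dict[str, Any]  # stand-in for the Project model (assumption)


class IProjectRepository(ABC):
    @abstractmethod
    def create(self, user_id: int, name: str, data: ProjectDict) -> ProjectDict: ...

    @abstractmethod
    def get_by_id(self, project_id: int) -> Optional[ProjectDict]: ...

    @abstractmethod
    def get_by_user_id(self, user_id: int) -> List[ProjectDict]: ...

    @abstractmethod
    def get_all(self) -> List[ProjectDict]: ...

    @abstractmethod
    def update(self, project: ProjectDict) -> ProjectDict: ...

    @abstractmethod
    def delete(self, project_id: int) -> bool: ...


class InMemoryProjectRepository(IProjectRepository):
    """Hypothetical test double; keeps projects in a dict keyed by id."""

    def __init__(self) -> None:
        self._projects: Dict[int, ProjectDict] = {}
        self._next_id = 1

    def create(self, user_id, name, data):
        # Explicit fields come last so they win over keys in `data`.
        project = {**data, "id": self._next_id, "user_id": user_id, "name": name}
        self._projects[project["id"]] = project
        self._next_id += 1
        return project

    def get_by_id(self, project_id):
        return self._projects.get(project_id)

    def get_by_user_id(self, user_id):
        return [p for p in self._projects.values() if p["user_id"] == user_id]

    def get_all(self):
        return list(self._projects.values())

    def update(self, project):
        if project["id"] not in self._projects:
            raise KeyError(f"Unknown project id: {project['id']}")
        self._projects[project["id"]] = project
        return project

    def delete(self, project_id):
        # Returns True only when something was actually removed.
        return self._projects.pop(project_id, None) is not None
```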
4. src/database/repositories.py - UPDATED
Added ProjectRepository implementation:
class ProjectRepository(IProjectRepository):
- Full CRUD operations for projects
- Maps dictionary data to model fields
- Handles JSON serialization for arrays
- Database transaction management
5. src/cli/commands.py - UPDATED
Added two new CLI commands:
@app.command()
def ingest_cora(...):
"""Ingest CORA .xlsx report and create project"""
- Authenticate user
- Parse CORA file
- Create project in database
- Display success summary
@app.command()
def list_projects(...):
"""List projects for authenticated user"""
- Admin sees all projects
- Regular users see only their projects
- Formatted table output
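The visibility rule for list_projects (admins see everything, regular users see only their own projects) comes down to a small filter. The sketch below uses plain dictionaries and a hypothetical visible_projects helper; the actual command presumably goes through the repository's get_all() and get_by_user_id() methods.

```python
def visible_projects(projects, user_id, is_admin):
    """Return the projects a user may see (hypothetical helper)."""
    if is_admin:
        # Admin view: every project, regardless of owner.
        return list(projects)
    # Regular users: only projects they created.
    return [p for p in projects if p["user_id"] == user_id]
```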
CLI Commands
Ingest CORA File
Usage:
python main.py ingest-cora \
--file path/to/cora_file.xlsx \
--name "Project Name" \
[--custom-anchors "anchor1,anchor2"] \
--username user \
--password pass
Example:
python main.py ingest-cora \
--file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
--name "Shaft Machining Test" \
--username testadmin \
--password password123
Output:
Authenticated as: testadmin (Admin)
Parsing CORA file: shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx
Main Keyword: shaft machining
Word Count: 939.6
Entities Found: 36
Related Searches: 31
Creating project: Shaft Machining Test
Success: Project 'Shaft Machining Test' created (ID: 1)
Main Keyword: shaft machining
Entities: 36
Related Searches: 31
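The --custom-anchors option takes a comma-separated string and the result is stored as an array (empty when the option is omitted). One plausible way to do that conversion is sketched below; parse_custom_anchors is a hypothetical name, and trimming whitespace and dropping blank entries are assumptions.

```python
def parse_custom_anchors(raw):
    """Turn 'anchor1,anchor2' into ['anchor1', 'anchor2']; None or '' gives []."""
    if not raw:
        return []
    # Trim whitespace around each anchor and drop empty entries.
    return [anchor.strip() for anchor in raw.split(",") if anchor.strip()]
```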
List Projects
Usage:
python main.py list-projects --username user --password pass
Example Output:
All Projects (Admin View):
Total projects: 1
--------------------------------------------------------------------------------
ID Name Keyword Created
--------------------------------------------------------------------------------
1 Shaft Machining Test shaft machining 2025-10-18 19:37:30
--------------------------------------------------------------------------------
CORA File Structure
The parser expects the following structure:
Strategic Overview Sheet:
- B5: Main keyword
- D24: Word count (default 1250 if zero/error)
- D31: Term frequency (default 3 if zero/error)
- D46: Related search density
- D47: Entity density
- D48: LSI density
- B10: Spintax related search terms
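The "default if zero/error" rule for D24 and D31 could be implemented roughly like this. The helper name is hypothetical, and treating Excel error strings such as "#DIV/0!" and empty cells the same way as zero is an assumption about what "error" covers.

```python
import math


def numeric_or_default(value, default):
    """Return the cell value as a float, or the default when it is unusable."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None (empty cell) or a non-numeric string such as '#DIV/0!'.
        return float(default)
    if number == 0 or math.isnan(number):
        return float(default)
    return number
```

For example, a word-count cell of 0 would fall back to 1250, while 939.6 would be kept as-is.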
Structure Sheet:
- D25-D26: Title metrics
- D31-D33: Meta metrics
- D45-D48: H1 metrics
- D51-D55: H2 metrics
- D58-D62: H3 metrics
Entities Sheet:
- Column A: Entity names
- Column J: Threshold values (capture if < -0.195)
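The entity rule above (keep a name from column A when its column J value is below -0.195) can be sketched independently of the Excel library. Here rows is assumed to be an iterable of (column A, column J) pairs already read from the sheet; the function name and the choice to skip non-numeric values are illustrative assumptions.

```python
def filter_entities(rows, threshold=-0.195):
    """Keep entity names whose column J value is strictly below the threshold."""
    entities = []
    for name, value in rows:
        if name is None or not str(name).strip():
            continue  # blank entity name
        try:
            score = float(value)
        except (TypeError, ValueError):
            continue  # empty or non-numeric column J cell
        if score < threshold:
            entities.append(str(name).strip())
    return entities
```

Note that the comparison is strict, so a value of exactly -0.195 is not captured.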
Test Coverage
Unit Tests (31 tests, all passing)
tests/unit/test_cora_parser.py (24 tests):
- CORAParser initialization and validation
- Cell value extraction with defaults
- Sheet retrieval logic
- Main keyword extraction (from sheet or filename)
- Strategic Overview data extraction
- Structure metrics extraction
- Entity filtering by threshold
- Spintax parsing
- Complete file parsing
tests/unit/test_cli_commands.py (7 tests):
- ingest-cora command success
- ingest-cora with custom anchors
- Authentication failures
- Parse error handling
- list-projects for users
- list-projects for admins
- Empty project lists
Integration Tests (7 passing)
tests/integration/test_cora_ingestion.py:
- Real CORA file parsing
- Project repository CRUD operations
- User-project associations
- Database integrity
Manual Testing Results
Test 1: Ingest Real CORA File
python main.py ingest-cora \
--file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
--name "Shaft Machining Test" \
--username testadmin \
--password password123
✅ SUCCESS - Extracted all data correctly
Test 2: Verify Database Storage
# Query showed:
Project: Shaft Machining Test
Keyword: shaft machining
Word Count: 939.6
Term Frequency: 2.5
H2 Total: 5.6
H3 Total: 13.1
Entities (first 5): ['cnc', 'machining', 'shaft', 'cnc turning', 'boring']
Related Searches (first 5): ['automated machining', 'cnc machining', ...]
✅ SUCCESS - All data stored correctly
Test 3: List Projects
python main.py list-projects --username testadmin --password password123
✅ SUCCESS - Projects display correctly
Error Handling Examples
Missing File:
Error: File not found: nonexistent.xlsx
Invalid Format:
Error parsing CORA file: Failed to open Excel file
Missing Required Sheet:
Error parsing CORA file: Required sheet 'Strategic Overview' not found
Authentication Failure:
Error: Authentication failed
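A sketch of how the missing-sheet message above might be produced: CORAParseError is the exception named earlier in this document, but the check_required_sheets and format_parse_error helpers are hypothetical and only mirror the message formats shown in these examples.

```python
REQUIRED_SHEETS = ("Strategic Overview", "Structure")


class CORAParseError(Exception):
    """Raised when a CORA file cannot be parsed."""


def check_required_sheets(sheet_names):
    """Fail fast on the first required sheet that is missing."""
    for required in REQUIRED_SHEETS:
        if required not in sheet_names:
            raise CORAParseError(f"Required sheet '{required}' not found")


def format_parse_error(exc):
    """Build the user-facing message for a parse failure."""
    return f"Error parsing CORA file: {exc}"
```

A CLI command would typically wrap parsing in try/except CORAParseError, print the formatted message, and exit with a non-zero status.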
Data Model
Project Table Schema
CREATE TABLE projects (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id INTEGER NOT NULL REFERENCES users(id),
name VARCHAR(255) NOT NULL,
main_keyword VARCHAR(255) NOT NULL,
-- Strategic Overview metrics
word_count FLOAT NOT NULL DEFAULT 1250, -- float preserves CORA decimals (e.g. 939.6)
term_frequency FLOAT NOT NULL DEFAULT 3, -- CORA reports decimals (e.g. 2.5)
related_search_density FLOAT,
entity_density FLOAT,
lsi_density FLOAT,
spintax_related_search_terms TEXT,
-- Structure metrics (title, meta)
title_exact_match INTEGER,
title_related_search INTEGER,
meta_exact_match INTEGER,
meta_related_search INTEGER,
meta_entities INTEGER,
-- Structure metrics (H1)
h1_exact INTEGER,
h1_related_search INTEGER,
h1_entities INTEGER,
h1_lsi INTEGER,
-- Structure metrics (H2)
h2_total INTEGER,
h2_exact INTEGER,
h2_related_search INTEGER,
h2_entities INTEGER,
h2_lsi INTEGER,
-- Structure metrics (H3)
h3_total INTEGER,
h3_exact INTEGER,
h3_related_search INTEGER,
h3_entities INTEGER,
h3_lsi INTEGER,
-- Extracted data (JSON)
entities JSON,
related_searches JSON,
custom_anchor_text JSON,
-- Timestamps
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_projects_user_id ON projects(user_id);
CREATE INDEX idx_projects_main_keyword ON projects(main_keyword);
Dependencies
- openpyxl==3.1.2 - Excel file parsing (already in requirements.txt)
- All existing project dependencies
Architecture Decisions
Why Fail on Missing Sheets?
- Better to fail fast with a clear error than to create partial data
- CORA files have expected structure
- Silent defaults could mask file format issues
- User gets immediate feedback to fix the file
Why Store Entities/Searches as JSON?
- Variable-length arrays
- Easy to query and serialize
- No additional tables needed
- Simple to update
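To make the JSON-column trade-off concrete, the sketch below round-trips two lists through a cut-down projects table. It uses the standard-library sqlite3 module directly, which is an assumption for illustration; the project itself goes through ProjectRepository, and the function names here are hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    # Reduced version of the projects table: only the JSON columns matter here.
    conn.execute(
        "CREATE TABLE projects ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name VARCHAR(255) NOT NULL, "
        "entities JSON, "
        "related_searches JSON)"
    )


def insert_project(conn, name, entities, related_searches):
    # Lists are serialized to JSON text before being written.
    cursor = conn.execute(
        "INSERT INTO projects (name, entities, related_searches) VALUES (?, ?, ?)",
        (name, json.dumps(entities), json.dumps(related_searches)),
    )
    conn.commit()
    return cursor.lastrowid


def load_lists(conn, project_id):
    # JSON text is deserialized back into Python lists on read.
    row = conn.execute(
        "SELECT entities, related_searches FROM projects WHERE id = ?",
        (project_id,),
    ).fetchone()
    return json.loads(row[0]), json.loads(row[1])
```

Variable-length lists fit in a single column this way, at the cost of not being able to index individual entities without extra work.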
Why Allow Any User to Create Projects?
- Users work on their own content
- Admins can see all projects
- Matches workflow (users ingest, admins oversee)
- Can be restricted later if needed
Why Store Spintax as Raw Text?
- Preserve original format for reference
- Parsed version available in related_searches
- May need original for regeneration
Next Steps
This completes Story 2.1. The data ingestion foundation is ready for:
- Story 2.2: Configurable Content Rule Engine (use project data for validation)
- Story 2.3: AI-Powered Content Generation (use project SEO data for prompts)
- Story 2.4: HTML Formatting (use project data in templates)
Completion Checklist
- Project database model created
- CORA parser module implemented
- ProjectRepository with CRUD operations
- ingest-cora CLI command
- list-projects CLI command
- Unit tests written and passing (31 tests)
- Integration tests written and passing (7 tests)
- Manual testing with real CORA file
- Error handling for all failure scenarios
- Database schema updated
- Story documentation completed
Notes
- Word count stored as float to preserve decimal values from CORA
- Entity threshold of -0.195 is configurable via the threshold parameter of extract_entities()
- Custom anchor text is optional and stored as empty array if not provided
- The Strategic Overview and Structure sheets are required; the Entities sheet is optional (validation added)
- Parser closes workbook properly to prevent file locks
- Timestamps use UTC for consistency