Story 2.1: CORA Report Data Ingestion - COMPLETED

Overview

Implemented a complete ingestion system for CORA .xlsx files: a parser module, database models, CLI commands, and comprehensive test coverage.

Story Details

As a User, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data.

Acceptance Criteria - ALL MET

1. CLI Command to Ingest CORA Files

Status: COMPLETE

A CLI command exists to accept CORA .xlsx file paths:

Command: ingest-cora
Options: --file, --name, --custom-anchors, --username, --password
Requires user authentication (any authenticated user can create projects)
Returns success message with project details

2. Data Extraction from CORA Files

Status: COMPLETE

The parser correctly extracts all specified data points:

Main keyword: From cell B5 of the Strategic Overview sheet, or from the filename
Strategic Overview metrics: Word count, term frequency, densities, spintax
Structure metrics: Title, meta, H1, H2, H3 counts and distributions
Entities: Rows of the Entities sheet where column J is below -0.195
Related searches: Parsed from spintax format
Optional anchor text: User-provided via CLI
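As an illustration of the related-search step, here is a minimal sketch of spintax parsing. It assumes the cell holds a single-level spintax string of the form `{term a|term b|term c}`; the real parser's handling of nesting or other variants is not documented here, so treat this as a guess at the simplest case.

```python
from __future__ import annotations


def parse_spintax_to_list(spintax: str | None) -> list[str]:
    """Split a single-level spintax string such as '{a|b|c}' into its options."""
    text = (spintax or "").strip()
    # Drop one pair of enclosing braces if present (assumed format).
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    # Split on the option separator, trim whitespace, skip empty options.
    return [option.strip() for option in text.split("|") if option.strip()]
```

An empty or missing cell yields an empty list, which keeps the stored JSON array well-formed even when no related searches are present.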

3. Database Storage

Status: COMPLETE

Project records are created with all data:

User association (user_id foreign key)
Main keyword and project name
All numeric metrics from CORA file
Entities and related searches as JSON arrays
Custom anchor text as JSON array
Timestamps (created_at, updated_at)

4. Error Handling

Status: COMPLETE

Graceful error handling for:

File not found errors
Invalid Excel file format
Missing required sheets (Strategic Overview, Structure)
Authentication failures
Database errors

Implementation Details

Files Created/Modified

1. `src/database/models.py` - UPDATED

Added Project model:

class Project(Base):
    """Project model for CORA-ingested SEO data"""
    - id, user_id, name, main_keyword
    - word_count, term_frequency (with defaults)
    - Strategic Overview metrics (densities)
    - Structure metrics (title, meta, H1-H3 distributions)
    - entities, related_searches, custom_anchor_text (JSON)
    - spintax_related_search_terms (raw text)
    - created_at, updated_at

2. `src/ingestion/parser.py` - NEW

CORA parser module with:

class CORAParser:
    - __init__(file_path): Initialize with file validation
    - extract_main_keyword(): Get keyword from B5 or filename
    - extract_strategic_overview(): Get Strategic Overview metrics
    - extract_structure_metrics(): Get Structure sheet data
    - extract_entities(threshold): Get entities below threshold
    - parse_spintax_to_list(): Parse spintax to list
    - parse(): Complete file parsing with error handling
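
The keyword lookup in `extract_main_keyword()` could work roughly as sketched below. Two things are assumptions made for illustration, not confirmed behavior: that a non-empty B5 value takes priority over the filename, and that the keyword is the part of the filename before a `_goog_` marker (inferred from the example file name used in this document). The helper names are hypothetical.

```python
from pathlib import Path


def keyword_from_filename(file_path: str, marker: str = "_goog_") -> str:
    """Derive a keyword from the file name (assumed naming convention)."""
    stem = Path(file_path).stem
    # Keep only the part before the marker; if absent, the whole stem is used.
    head = stem.split(marker, 1)[0]
    return head.replace("_", " ").strip()


def resolve_main_keyword(b5_value, file_path: str) -> str:
    """Prefer the B5 cell value; fall back to the filename (assumed order)."""
    if isinstance(b5_value, str) and b5_value.strip():
        return b5_value.strip()
    return keyword_from_filename(file_path)
```

With the example file from this story, the filename fallback produces the same keyword that the CLI output reports.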

Key Features:

Validates file existence and format on init
Required sheets must exist (Strategic Overview, Structure)
Optional sheets handled gracefully (Entities)
Cell value extraction with defaults for zero/empty
Comprehensive error messages via CORAParseError
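
The validation features above can be sketched without any Excel library. This is an illustrative outline only: `CORAParseError` and the required sheet names come from this document, while the helper names, the use of `FileNotFoundError` for missing files, and the extension check are assumptions.

```python
from pathlib import Path

REQUIRED_SHEETS = ("Strategic Overview", "Structure")


class CORAParseError(Exception):
    """Raised when a CORA file cannot be parsed."""


def validate_path(file_path: str) -> Path:
    """Check that the file exists and looks like an .xlsx file."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {file_path}")
    if path.suffix.lower() != ".xlsx":
        raise CORAParseError(f"Not an .xlsx file: {file_path}")
    return path


def check_required_sheets(sheet_names) -> None:
    """Fail fast if a required sheet is missing; optional sheets are ignored."""
    for name in REQUIRED_SHEETS:
        if name not in sheet_names:
            raise CORAParseError(f"Required sheet '{name}' not found")
```

In a real parser the `sheet_names` argument would come from the opened workbook; here it is any collection of strings, which keeps the check easy to unit test.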

3. `src/database/interfaces.py` - UPDATED

Added IProjectRepository interface:

class IProjectRepository(ABC):
    - create(user_id, name, data)
    - get_by_id(project_id)
    - get_by_user_id(user_id)
    - get_all()
    - update(project)
    - delete(project_id)
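
For readers unfamiliar with the pattern, the interface could look something like the following. Only the method names and parameter names are taken from the outline above; the type hints, the use of plain dictionaries instead of the `Project` model, and the `InMemoryProjectRepository` class are hypothetical, added so the sketch runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

ProjectDict = Dict[str, Any]  # stand-in for the Project model in this sketch


class IProjectRepository(ABC):
    @abstractmethod
    def create(self, user_id: int, name: str, data: ProjectDict) -> ProjectDict: ...

    @abstractmethod
    def get_by_id(self, project_id: int) -> Optional[ProjectDict]: ...

    @abstractmethod
    def get_by_user_id(self, user_id: int) -> List[ProjectDict]: ...

    @abstractmethod
    def get_all(self) -> List[ProjectDict]: ...

    @abstractmethod
    def update(self, project: ProjectDict) -> ProjectDict: ...

    @abstractmethod
    def delete(self, project_id: int) -> bool: ...


class InMemoryProjectRepository(IProjectRepository):
    """Hypothetical dictionary-backed implementation, useful in tests."""

    def __init__(self) -> None:
        self._rows: Dict[int, ProjectDict] = {}
        self._next_id = 1

    def create(self, user_id, name, data):
        project = {"id": self._next_id, "user_id": user_id, "name": name, **data}
        self._rows[project["id"]] = project
        self._next_id += 1
        return project

    def get_by_id(self, project_id):
        return self._rows.get(project_id)

    def get_by_user_id(self, user_id):
        return [p for p in self._rows.values() if p["user_id"] == user_id]

    def get_all(self):
        return list(self._rows.values())

    def update(self, project):
        self._rows[project["id"]] = project
        return project

    def delete(self, project_id):
        # True when a row was removed, False when the id was unknown.
        return self._rows.pop(project_id, None) is not None
```

Coding the CLI against the abstract interface is what lets the unit tests swap in a fake repository like this one instead of a real database.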

4. `src/database/repositories.py` - UPDATED

Added ProjectRepository implementation:

class ProjectRepository(IProjectRepository):
    - Full CRUD operations for projects
    - Maps dictionary data to model fields
    - Handles JSON serialization for arrays
    - Database transaction management

5. `src/cli/commands.py` - UPDATED

Added two new CLI commands:

@app.command()
def ingest_cora(...):
    """Ingest CORA .xlsx report and create project"""
    - Authenticate user
    - Parse CORA file
    - Create project in database
    - Display success summary

@app.command()
def list_projects(...):
    """List projects for authenticated user"""
    - Admin sees all projects
    - Regular users see only their projects
    - Formatted table output

CLI Commands

Ingest CORA File

Usage:

python main.py ingest-cora \
  --file path/to/cora_file.xlsx \
  --name "Project Name" \
  [--custom-anchors "anchor1,anchor2"] \
  --username user \
  --password pass

Example:

python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123

Output:

Authenticated as: testadmin (Admin)

Parsing CORA file: shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx
Main Keyword: shaft machining
Word Count: 939.6
Entities Found: 36
Related Searches: 31

Creating project: Shaft Machining Test

Success: Project 'Shaft Machining Test' created (ID: 1)
Main Keyword: shaft machining
Entities: 36
Related Searches: 31

List Projects

Usage:

python main.py list-projects --username user --password pass

Example Output:

All Projects (Admin View):
Total projects: 1
--------------------------------------------------------------------------------
ID    Name                           Keyword                   Created             
--------------------------------------------------------------------------------
1     Shaft Machining Test           shaft machining           2025-10-18 19:37:30 
--------------------------------------------------------------------------------
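
The visibility rule and the table layout shown above can be approximated as follows. This is a sketch under assumptions: projects are plain dictionaries, the function names are hypothetical, and the column widths are read off the example output rather than taken from the actual command.

```python
def visible_projects(projects, user_id, is_admin):
    """Admins see every project; regular users see only their own."""
    if is_admin:
        return list(projects)
    return [p for p in projects if p["user_id"] == user_id]


def format_table(projects):
    """Render projects as a fixed-width table (widths are approximations)."""
    rule = "-" * 80
    header = f"{'ID':<6}{'Name':<31}{'Keyword':<26}{'Created':<20}"
    lines = [rule, header, rule]
    for p in projects:
        lines.append(
            f"{p['id']:<6}{p['name']:<31}{p['main_keyword']:<26}{p['created_at']:<20}"
        )
    lines.append(rule)
    return "\n".join(lines)
```

Keeping the filter separate from the formatting makes each piece testable on its own, which matches the split between the "list-projects for users" and "list-projects for admins" unit tests.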

CORA File Structure

The parser expects the following structure:

Strategic Overview Sheet:

B5: Main keyword
D24: Word count (default 1250 if zero/error)
D31: Term frequency (default 3 if zero/error)
D46: Related search density
D47: Entity density
D48: LSI density
B10: Spintax related search terms
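
The "default if zero/error" rule for D24 and D31 can be expressed as a small helper. The function name is hypothetical, and treating non-numeric cell contents (for example an Excel error string) as an error is an assumption about what "error" means here.

```python
def numeric_or_default(value, default):
    """Return the cell value as a float, or the default if zero, empty, or invalid."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None (empty cell) or text such as an Excel error marker.
        return default
    # Zero and NaN are treated as "no usable value".
    if number == 0 or number != number:
        return default
    return number


# Defaults named in this document: 1250 for word count, 3 for term frequency.
word_count = numeric_or_default(939.6, 1250)
term_frequency = numeric_or_default(None, 3)
```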

Structure Sheet:

D25-D26: Title metrics
D31-D33: Meta metrics
D45-D48: H1 metrics
D51-D55: H2 metrics
D58-D62: H3 metrics

Entities Sheet:

Column A: Entity names
Column J: Threshold values (capture if < -0.195)
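
The entity filter reduces to a comparison on two columns. The sketch below works on plain row tuples, such as those produced by openpyxl's `iter_rows(values_only=True)`; the function signature and the skipping of non-numeric rows (headers, blanks) are assumptions.

```python
NAME_COL = 0   # column A
SCORE_COL = 9  # column J


def extract_entities(rows, threshold=-0.195):
    """Collect entity names whose column J value is strictly below the threshold."""
    entities = []
    for row in rows:
        if len(row) <= SCORE_COL:
            continue  # row too short to have a column J
        name, score = row[NAME_COL], row[SCORE_COL]
        # Skip header rows, blanks, and anything that is not a real number.
        if not name or isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if score < threshold:
            entities.append(str(name).strip())
    return entities
```

Passing a different `threshold` loosens or tightens the selection without touching the rest of the parser.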

Test Coverage

Unit Tests (31 tests, all passing)

tests/unit/test_cora_parser.py (24 tests):

CORAParser initialization and validation
Cell value extraction with defaults
Sheet retrieval logic
Main keyword extraction (from sheet or filename)
Strategic Overview data extraction
Structure metrics extraction
Entity filtering by threshold
Spintax parsing
Complete file parsing

tests/unit/test_cli_commands.py (7 tests):

ingest-cora command success
ingest-cora with custom anchors
Authentication failures
Parse error handling
list-projects for users
list-projects for admins
Empty project lists

Integration Tests (7 passing)

tests/integration/test_cora_ingestion.py:

Real CORA file parsing
Project repository CRUD operations
User-project associations
Database integrity

Manual Testing Results

Test 1: Ingest Real CORA File

python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123

✅ SUCCESS - Extracted all data correctly

Test 2: Verify Database Storage

# Query showed:
Project: Shaft Machining Test
Keyword: shaft machining
Word Count: 939.6
Term Frequency: 2.5
H2 Total: 5.6
H3 Total: 13.1
Entities (first 5): ['cnc', 'machining', 'shaft', 'cnc turning', 'boring']
Related Searches (first 5): ['automated machining', 'cnc machining', ...]

✅ SUCCESS - All data stored correctly

Test 3: List Projects

python main.py list-projects --username testadmin --password password123

✅ SUCCESS - Projects display correctly

Error Handling Examples

Missing File:

Error: File not found: nonexistent.xlsx

Invalid Format:

Error parsing CORA file: Failed to open Excel file

Missing Required Sheet:

Error parsing CORA file: Required sheet 'Strategic Overview' not found

Authentication Failure:

Error: Authentication failed
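
One way the CLI could turn exceptions into the file and parse messages above is sketched here. The message prefixes are copied from the examples; the exception types caught, the `ingest_or_report` name, and the `(message, data)` return shape are assumptions for illustration.

```python
class CORAParseError(Exception):
    """Raised when a CORA file cannot be parsed."""


def ingest_or_report(parse, file_path):
    """Run a parse callable and return (error_message, data); one of them is None."""
    try:
        data = parse(file_path)
    except FileNotFoundError:
        return f"Error: File not found: {file_path}", None
    except CORAParseError as exc:
        return f"Error parsing CORA file: {exc}", None
    return None, data
```

Returning a message instead of letting the exception propagate keeps tracebacks away from the user while still allowing the caller to exit with a non-zero status.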

Data Model

Project Table Schema

CREATE TABLE projects (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES users(id),
    name VARCHAR(255) NOT NULL,
    main_keyword VARCHAR(255) NOT NULL,
    
    -- Strategic Overview metrics
    word_count FLOAT NOT NULL DEFAULT 1250,
    term_frequency FLOAT NOT NULL DEFAULT 3,
    related_search_density FLOAT,
    entity_density FLOAT,
    lsi_density FLOAT,
    spintax_related_search_terms TEXT,
    
    -- Structure metrics (title, meta)
    title_exact_match INTEGER,
    title_related_search INTEGER,
    meta_exact_match INTEGER,
    meta_related_search INTEGER,
    meta_entities INTEGER,
    
    -- Structure metrics (H1)
    h1_exact INTEGER,
    h1_related_search INTEGER,
    h1_entities INTEGER,
    h1_lsi INTEGER,
    
    -- Structure metrics (H2)
    h2_total INTEGER,
    h2_exact INTEGER,
    h2_related_search INTEGER,
    h2_entities INTEGER,
    h2_lsi INTEGER,
    
    -- Structure metrics (H3)
    h3_total INTEGER,
    h3_exact INTEGER,
    h3_related_search INTEGER,
    h3_entities INTEGER,
    h3_lsi INTEGER,
    
    -- Extracted data (JSON)
    entities JSON,
    related_searches JSON,
    custom_anchor_text JSON,
    
    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_projects_user_id ON projects(user_id);
CREATE INDEX idx_projects_main_keyword ON projects(main_keyword);
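
To show how the JSON columns round-trip, here is a small sketch using Python's built-in `sqlite3` and `json` modules against a reduced version of the table. It is not the project's repository code: the column subset, the function names, and the omission of the `users` foreign key are simplifications made so the example is self-contained.

```python
import json
import sqlite3


def create_schema(conn):
    """Create a cut-down projects table with only the JSON-related columns."""
    conn.execute(
        """
        CREATE TABLE projects (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id INTEGER NOT NULL,
            name VARCHAR(255) NOT NULL,
            main_keyword VARCHAR(255) NOT NULL,
            entities JSON,
            related_searches JSON,
            custom_anchor_text JSON
        )
        """
    )


def insert_project(conn, user_id, name, main_keyword, entities, related, anchors=None):
    """Serialize list fields to JSON text and insert one row; returns the new id."""
    cursor = conn.execute(
        "INSERT INTO projects (user_id, name, main_keyword, entities,"
        " related_searches, custom_anchor_text) VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, name, main_keyword, json.dumps(entities),
         json.dumps(related), json.dumps(anchors or [])),
    )
    conn.commit()
    return cursor.lastrowid


def load_project(conn, project_id):
    """Fetch one row and decode the JSON columns back into Python lists."""
    row = conn.execute(
        "SELECT name, main_keyword, entities, related_searches, custom_anchor_text"
        " FROM projects WHERE id = ?",
        (project_id,),
    ).fetchone()
    if row is None:
        return None
    name, keyword, entities, related, anchors = row
    return {
        "name": name,
        "main_keyword": keyword,
        "entities": json.loads(entities),
        "related_searches": json.loads(related),
        "custom_anchor_text": json.loads(anchors),
    }
```

An ORM-backed repository would normally hide the `json.dumps`/`json.loads` calls behind a JSON column type, but the stored shape is the same idea.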

Dependencies

openpyxl==3.1.2 - Excel file parsing (already in requirements.txt)
All existing project dependencies

Architecture Decisions

Why Fail on Missing Sheets?

Failing fast with a clear error is better than creating partial data
CORA files have a known, expected structure
Silent defaults could mask file format issues
The user gets immediate feedback and can fix the file

Why Store Entities/Searches as JSON?

Variable-length arrays
Easy to query and serialize
No additional tables needed
Simple to update

Why Allow Any User to Create Projects?

Users work on their own content
Admins can see all projects
Matches workflow (users ingest, admins oversee)
Can be restricted later if needed

Why Store Spintax as Raw Text?

Preserve original format for reference
Parsed version available in related_searches
May need original for regeneration

Next Steps

This completes Story 2.1. The data ingestion foundation is ready for:

Story 2.2: Configurable Content Rule Engine (use project data for validation)
Story 2.3: AI-Powered Content Generation (use project SEO data for prompts)
Story 2.4: HTML Formatting (use project data in templates)

Completion Checklist

Project database model created
CORA parser module implemented
ProjectRepository with CRUD operations
ingest-cora CLI command
list-projects CLI command
Unit tests written and passing (31 tests)
Integration tests written and passing (7 tests)
Manual testing with real CORA file
Error handling for all failure scenarios
Database schema updated
Story documentation completed

Notes

Word count stored as float to preserve decimal values from CORA
Entity threshold defaults to -0.195 and is configurable via the threshold parameter of extract_entities
Custom anchor text is optional and stored as empty array if not provided
The Strategic Overview and Structure sheets are required and validated; the Entities sheet is optional
Parser closes workbook properly to prevent file locks
Timestamps use UTC for consistency
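
As a small illustration of the custom anchor note, the comma-separated `--custom-anchors` value could be normalized as below. The function name is hypothetical, and trimming whitespace and dropping empty items are assumptions beyond what this document states.

```python
def parse_custom_anchors(raw):
    """Turn 'anchor1,anchor2' into a list; a missing option becomes an empty list."""
    if not raw:
        return []
    return [anchor.strip() for anchor in raw.split(",") if anchor.strip()]
```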
