Story 2.1: CORA Report Data Ingestion - COMPLETED

Overview

Implemented a complete ingestion system for CORA .xlsx files: a parser module, database models, CLI commands, and unit and integration tests.

Story Details

As a User, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data.

Acceptance Criteria - ALL MET

1. CLI Command to Ingest CORA Files

Status: COMPLETE

A CLI command exists to accept CORA .xlsx file paths:

  • Command: ingest-cora
  • Options: --file, --name, --custom-anchors, --username, --password
  • Requires user authentication (any authenticated user can create projects)
  • Returns success message with project details

2. Data Extraction from CORA Files

Status: COMPLETE

The parser correctly extracts all specified data points:

  • Main keyword: From Strategic Overview B5 or filename
  • Strategic Overview metrics: Word count, term frequency, densities, spintax
  • Structure metrics: Title, meta, H1, H2, H3 counts and distributions
  • Entities: From Entities sheet where column J < -0.195
  • Related searches: Parsed from spintax format
  • Optional anchor text: User-provided via CLI
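
The filename fallback for the main keyword could look something like the sketch below. The rule it uses (take everything before a `_goog_` marker and turn underscores into spaces) is an assumption inferred from the example filename in this document, not a confirmed description of `extract_main_keyword()`. The helper names `keyword_from_filename` and `pick_main_keyword` are hypothetical.

```python
from pathlib import Path


def keyword_from_filename(file_path: str, marker: str = "_goog_") -> str:
    """Guess the main keyword from a CORA export filename (assumed convention)."""
    stem = Path(file_path).stem
    # Assumption: the keyword is the part of the stem before the marker.
    head = stem.split(marker, 1)[0]
    # Underscores stand in for spaces in the filename.
    return head.replace("_", " ").strip()


def pick_main_keyword(cell_b5, file_path: str) -> str:
    """Prefer Strategic Overview B5; fall back to the filename when it is empty."""
    if isinstance(cell_b5, str) and cell_b5.strip():
        return cell_b5.strip()
    return keyword_from_filename(file_path)
```

With the example file used later in this document, `keyword_from_filename` yields `shaft machining`, which matches the keyword shown in the ingestion output.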

3. Database Storage

Status: COMPLETE

Project records are created with all data:

  • User association (user_id foreign key)
  • Main keyword and project name
  • All numeric metrics from CORA file
  • Entities and related searches as JSON arrays
  • Custom anchor text as JSON array
  • Timestamps (created_at, updated_at)

4. Error Handling

Status: COMPLETE

Graceful error handling for:

  • File not found errors
  • Invalid Excel file format
  • Missing required sheets (Strategic Overview, Structure)
  • Authentication failures
  • Database errors

Implementation Details

Files Created/Modified

1. src/database/models.py - UPDATED

Added Project model:

class Project(Base):
    """Project model for CORA-ingested SEO data"""
    # Columns (summary):
    # - id, user_id, name, main_keyword
    # - word_count, term_frequency (with defaults)
    # - Strategic Overview metrics (densities)
    # - Structure metrics (title, meta, H1-H3 distributions)
    # - entities, related_searches, custom_anchor_text (JSON)
    # - spintax_related_search_terms (raw text)
    # - created_at, updated_at

2. src/ingestion/parser.py - NEW

CORA parser module with:

class CORAParser:
    def __init__(self, file_path): ...         # Initialize with file validation
    def extract_main_keyword(self): ...        # Get keyword from B5 or filename
    def extract_strategic_overview(self): ...  # Get Strategic Overview metrics
    def extract_structure_metrics(self): ...   # Get Structure sheet data
    def extract_entities(self, threshold): ... # Get entities below threshold
    def parse_spintax_to_list(self): ...       # Parse spintax to list
    def parse(self): ...                       # Complete file parsing with error handling

Key Features:

  • Validates file existence and format on init
  • Required sheets must exist (Strategic Overview, Structure)
  • Optional sheets handled gracefully (Entities)
  • Cell value extraction with defaults for zero/empty
  • Comprehensive error messages via CORAParseError
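
The required/optional sheet rule can be sketched as below. This is a minimal illustration, not the parser's actual code: the `CORAParseError` class here is a stand-in with the same name, `check_sheets` is a hypothetical helper, and the sheet names are passed in as a plain list instead of being read from a workbook.

```python
class CORAParseError(Exception):
    """Raised when a CORA file cannot be parsed (stand-in for the real class)."""


REQUIRED_SHEETS = ("Strategic Overview", "Structure")
OPTIONAL_SHEETS = ("Entities",)


def check_sheets(sheet_names):
    """Fail fast on missing required sheets; report which optional ones exist."""
    for name in REQUIRED_SHEETS:
        if name not in sheet_names:
            raise CORAParseError(f"Required sheet '{name}' not found")
    # Optional sheets never raise; callers simply skip the ones that are absent.
    return [name for name in OPTIONAL_SHEETS if name in sheet_names]
```

A missing Entities sheet returns an empty list and parsing can continue, while a missing Strategic Overview sheet stops immediately with a message in the same form as the error examples in this document.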

3. src/database/interfaces.py - UPDATED

Added IProjectRepository interface:

class IProjectRepository(ABC):
    def create(self, user_id, name, data): ...
    def get_by_id(self, project_id): ...
    def get_by_user_id(self, user_id): ...
    def get_all(self): ...
    def update(self, project): ...
    def delete(self, project_id): ...

4. src/database/repositories.py - UPDATED

Added ProjectRepository implementation:

class ProjectRepository(IProjectRepository):
    - Full CRUD operations for projects
    - Maps dictionary data to model fields
    - Handles JSON serialization for arrays
    - Database transaction management
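
As a rough illustration of mapping the parsed dictionary onto model fields, the sketch below uses a plain dataclass in place of the real `Project` model. The `ProjectRow` class, the `build_row` helper, and the explicit `json.dumps` step are all assumptions made for the example; the actual repository may rely on the ORM's JSON column type instead.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ProjectRow:
    """Trimmed stand-in for the Project model (hypothetical subset of columns)."""
    user_id: int
    name: str
    main_keyword: str
    word_count: float = 1250
    term_frequency: float = 3
    entities: str = "[]"            # JSON-encoded list
    related_searches: str = "[]"    # JSON-encoded list
    custom_anchor_text: str = "[]"  # JSON-encoded list


JSON_FIELDS = {"entities", "related_searches", "custom_anchor_text"}


def build_row(user_id: int, name: str, data: dict) -> ProjectRow:
    """Copy known keys from the parsed dict, JSON-encoding the list fields."""
    known = {f.name for f in fields(ProjectRow)} - {"user_id", "name"}
    values = {}
    for key in known.intersection(data):
        value = data[key]
        values[key] = json.dumps(value or []) if key in JSON_FIELDS else value
    # Keys missing from `data` fall back to the dataclass defaults.
    return ProjectRow(user_id=user_id, name=name, **values)
```

Unknown keys in the parsed dictionary are ignored rather than raising, which keeps the mapping tolerant of extra parser output.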

5. src/cli/commands.py - UPDATED

Added two new CLI commands:

@app.command()
def ingest_cora(...):
    """Ingest CORA .xlsx report and create project"""
    - Authenticate user
    - Parse CORA file
    - Create project in database
    - Display success summary

@app.command()
def list_projects(...):
    """List projects for authenticated user"""
    - Admin sees all projects
    - Regular users see only their projects
    - Formatted table output

CLI Commands

Ingest CORA File

Usage:

python main.py ingest-cora \
  --file path/to/cora_file.xlsx \
  --name "Project Name" \
  [--custom-anchors "anchor1,anchor2"] \
  --username user \
  --password pass

Example:

python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123

Output:

Authenticated as: testadmin (Admin)

Parsing CORA file: shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx
Main Keyword: shaft machining
Word Count: 939.6
Entities Found: 36
Related Searches: 31

Creating project: Shaft Machining Test

Success: Project 'Shaft Machining Test' created (ID: 1)
Main Keyword: shaft machining
Entities: 36
Related Searches: 31
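
The `--custom-anchors` option takes a comma-separated string. One plausible way to turn it into the list that gets stored is sketched below; `parse_custom_anchors` is a hypothetical helper, and trimming whitespace and dropping empty fragments are assumptions about the desired behavior.

```python
from typing import Optional


def parse_custom_anchors(raw: Optional[str]) -> list:
    """Split a comma-separated --custom-anchors value into a clean list."""
    if not raw:
        return []  # option omitted: store an empty array
    return [part.strip() for part in raw.split(",") if part.strip()]
```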

List Projects

Usage:

python main.py list-projects --username user --password pass

Example Output:

All Projects (Admin View):
Total projects: 1
--------------------------------------------------------------------------------
ID    Name                           Keyword                   Created             
--------------------------------------------------------------------------------
1     Shaft Machining Test           shaft machining           2025-10-18 19:37:30 
--------------------------------------------------------------------------------
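
A fixed-width table like the one above can be produced with format-spec padding. The column widths (6, 31, 26, 20) and the 80-character rule in this sketch are inferred from the sample output, and `format_project_table` is a hypothetical name, so treat it as an approximation of the real command's formatting.

```python
def format_project_table(projects, title="All Projects (Admin View):"):
    """Render projects as fixed-width text rows (widths assumed from the sample)."""
    rule = "-" * 80
    row_fmt = "{:<6}{:<31}{:<26}{:<20}"  # left-align each column to a fixed width
    lines = [
        title,
        f"Total projects: {len(projects)}",
        rule,
        row_fmt.format("ID", "Name", "Keyword", "Created"),
        rule,
    ]
    for p in projects:
        lines.append(
            row_fmt.format(str(p["id"]), p["name"], p["main_keyword"], p["created_at"])
        )
    lines.append(rule)
    return "\n".join(lines)
```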

CORA File Structure

The parser expects the following structure:

Strategic Overview Sheet:

  • B5: Main keyword
  • D24: Word count (default 1250 if zero/error)
  • D31: Term frequency (default 3 if zero/error)
  • D46: Related search density
  • D47: Entity density
  • D48: LSI density
  • B10: Spintax related search terms
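
The "default if zero/error" behavior for D24 and D31 can be sketched as a small coercion helper. `cell_number` is a hypothetical name, and treating any non-numeric cell (empty, or an Excel error string) as an error is an assumption about what the parser counts as invalid.

```python
def cell_number(raw, default=None):
    """Return a cell value as float, or `default` when it is empty, zero, or invalid."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default  # empty cell, error string such as "#DIV/0!", etc.
    if value == 0 and default is not None:
        return default  # zero is treated as "no data" when a default exists
    return value
```

Under these assumptions, `cell_number(0, 1250)` falls back to 1250 for word count, while a density cell read with no default keeps whatever numeric value it holds.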

Structure Sheet:

  • D25-D26: Title metrics
  • D31-D33: Meta metrics
  • D45-D48: H1 metrics
  • D51-D55: H2 metrics
  • D58-D62: H3 metrics

Entities Sheet:

  • Column A: Entity names
  • Column J: Threshold values (capture if < -0.195)
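
The entity filter reduces to a comparison against the threshold. The sketch below works on plain `(name, value)` pairs standing in for columns A and J, so it runs without a workbook; `filter_entities` is a hypothetical helper, and skipping rows with a blank name or a non-numeric value is an assumption.

```python
def filter_entities(rows, threshold=-0.195):
    """Keep entity names whose column-J value is strictly below `threshold`.

    `rows` is an iterable of (name, value) pairs, i.e. columns A and J.
    """
    kept = []
    for name, value in rows:
        if not name or not isinstance(value, (int, float)):
            continue  # skip header rows, blanks, and non-numeric cells
        if value < threshold:
            kept.append(str(name).strip())
    return kept
```

Because the comparison is strict, a value of exactly -0.195 is not captured.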

Test Coverage

Unit Tests (31 tests, all passing)

tests/unit/test_cora_parser.py (24 tests):

  • CORAParser initialization and validation
  • Cell value extraction with defaults
  • Sheet retrieval logic
  • Main keyword extraction (from sheet or filename)
  • Strategic Overview data extraction
  • Structure metrics extraction
  • Entity filtering by threshold
  • Spintax parsing
  • Complete file parsing

tests/unit/test_cli_commands.py (7 tests):

  • ingest-cora command success
  • ingest-cora with custom anchors
  • Authentication failures
  • Parse error handling
  • list-projects for users
  • list-projects for admins
  • Empty project lists

Integration Tests (7 tests, all passing)

tests/integration/test_cora_ingestion.py:

  • Real CORA file parsing
  • Project repository CRUD operations
  • User-project associations
  • Database integrity

Manual Testing Results

Test 1: Ingest Real CORA File

python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123

SUCCESS - Extracted all data correctly

Test 2: Verify Database Storage

# Query showed:
Project: Shaft Machining Test
Keyword: shaft machining
Word Count: 939.6
Term Frequency: 2.5
H2 Total: 5.6
H3 Total: 13.1
Entities (first 5): ['cnc', 'machining', 'shaft', 'cnc turning', 'boring']
Related Searches (first 5): ['automated machining', 'cnc machining', ...]

SUCCESS - All data stored correctly

Test 3: List Projects

python main.py list-projects --username testadmin --password password123

SUCCESS - Projects display correctly

Error Handling Examples

Missing File:

Error: File not found: nonexistent.xlsx

Invalid Format:

Error parsing CORA file: Failed to open Excel file

Missing Required Sheet:

Error parsing CORA file: Required sheet 'Strategic Overview' not found

Authentication Failure:

Error: Authentication failed

Data Model

Project Table Schema

CREATE TABLE projects (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES users(id),
    name VARCHAR(255) NOT NULL,
    main_keyword VARCHAR(255) NOT NULL,
    
    -- Strategic Overview metrics
    word_count FLOAT NOT NULL DEFAULT 1250,
    term_frequency FLOAT NOT NULL DEFAULT 3,
    related_search_density FLOAT,
    entity_density FLOAT,
    lsi_density FLOAT,
    spintax_related_search_terms TEXT,
    
    -- Structure metrics (title, meta)
    title_exact_match INTEGER,
    title_related_search INTEGER,
    meta_exact_match INTEGER,
    meta_related_search INTEGER,
    meta_entities INTEGER,
    
    -- Structure metrics (H1)
    h1_exact INTEGER,
    h1_related_search INTEGER,
    h1_entities INTEGER,
    h1_lsi INTEGER,
    
    -- Structure metrics (H2)
    h2_total FLOAT,
    h2_exact INTEGER,
    h2_related_search INTEGER,
    h2_entities INTEGER,
    h2_lsi INTEGER,
    
    -- Structure metrics (H3)
    h3_total FLOAT,
    h3_exact INTEGER,
    h3_related_search INTEGER,
    h3_entities INTEGER,
    h3_lsi INTEGER,
    
    -- Extracted data (JSON)
    entities JSON,
    related_searches JSON,
    custom_anchor_text JSON,
    
    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_projects_user_id ON projects(user_id);
CREATE INDEX idx_projects_main_keyword ON projects(main_keyword);

Dependencies

  • openpyxl==3.1.2 - Excel file parsing (already in requirements.txt)
  • All existing project dependencies

Architecture Decisions

Why Fail on Missing Sheets?

  • Better to fail fast with clear error than create partial data
  • CORA files have expected structure
  • Silent defaults could mask file format issues
  • User gets immediate feedback to fix the file

Why Store Entities/Searches as JSON?

  • Variable-length arrays
  • Easy to query and serialize
  • No additional tables needed
  • Simple to update
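
To make the JSON-column choice concrete, here is a small round trip using Python's built-in `sqlite3` module. The table is a trimmed, hypothetical subset of the schema above, and the explicit `json.dumps`/`json.loads` calls are used only because this sketch bypasses the ORM.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, hypothetical version of the projects table: only the columns needed here.
conn.execute(
    "CREATE TABLE projects ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name VARCHAR(255) NOT NULL,"
    " entities JSON,"
    " custom_anchor_text JSON)"
)
# Lists of any length fit in a single column once serialized.
conn.execute(
    "INSERT INTO projects (name, entities, custom_anchor_text) VALUES (?, ?, ?)",
    ("Shaft Machining Test", json.dumps(["cnc", "machining", "shaft"]), json.dumps([])),
)
name, entities_json, anchors_json = conn.execute(
    "SELECT name, entities, custom_anchor_text FROM projects WHERE id = 1"
).fetchone()
entities = json.loads(entities_json)  # back to a Python list
anchors = json.loads(anchors_json)    # empty list when no anchors were given
conn.close()
```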

Why Allow Any User to Create Projects?

  • Users work on their own content
  • Admins can see all projects
  • Matches workflow (users ingest, admins oversee)
  • Can be restricted later if needed

Why Store Spintax as Raw Text?

  • Preserve original format for reference
  • Parsed version available in related_searches
  • May need original for regeneration
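
The relationship between the raw spintax and the parsed list can be sketched as follows. This assumes a flat `{option|option|...}` form with no nesting, which is a guess at the CORA format; `spintax_options` is a hypothetical helper and not the parser's `parse_spintax_to_list()`.

```python
def spintax_options(spintax):
    """Split a flat spintax string like "{a|b|c}" into ["a", "b", "c"]."""
    if not spintax:
        return []
    text = str(spintax).strip()
    # Drop one pair of enclosing braces if present.
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    # Options are separated by "|"; ignore empty fragments.
    return [option.strip() for option in text.split("|") if option.strip()]
```

Keeping the raw string alongside the parsed list means the list can be rebuilt if the splitting rule ever changes.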

Next Steps

This completes Story 2.1. The data ingestion foundation is ready for:

  • Story 2.2: Configurable Content Rule Engine (use project data for validation)
  • Story 2.3: AI-Powered Content Generation (use project SEO data for prompts)
  • Story 2.4: HTML Formatting (use project data in templates)

Completion Checklist

  • Project database model created
  • CORA parser module implemented
  • ProjectRepository with CRUD operations
  • ingest-cora CLI command
  • list-projects CLI command
  • Unit tests written and passing (31 tests)
  • Integration tests written and passing (7 tests)
  • Manual testing with real CORA file
  • Error handling for all failure scenarios
  • Database schema updated
  • Story documentation completed

Notes

  • Word count stored as float to preserve decimal values from CORA
  • Entity threshold of -0.195 is configurable via the threshold parameter of extract_entities()
  • Custom anchor text is optional and stored as empty array if not provided
  • The Strategic Overview and Structure sheets are required and validated; the Entities sheet is optional
  • Parser closes workbook properly to prevent file locks
  • Timestamps use UTC for consistency