# Story 2.1: CORA Report Data Ingestion - COMPLETED

## Overview
Implemented complete CORA .xlsx file ingestion system with parser module, database models, CLI commands, and comprehensive test coverage.

## Story Details
**As a User**, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data.

## Acceptance Criteria - ALL MET

### 1. CLI Command to Ingest CORA Files
**Status:** COMPLETE

A CLI command exists to accept CORA .xlsx file paths:
- Command: `ingest-cora`
- Options: `--file`, `--name`, `--custom-anchors`, `--username`, `--password`
- Requires user authentication (any authenticated user can create projects)
- Returns success message with project details

### 2. Data Extraction from CORA Files
**Status:** COMPLETE

The parser correctly extracts all specified data points:
- **Main keyword**: From Strategic Overview B5 or filename
- **Strategic Overview metrics**: Word count, term frequency, densities, spintax
- **Structure metrics**: Title, meta, H1, H2, H3 counts and distributions
- **Entities**: From Entities sheet where column J < -0.195
- **Related searches**: Parsed from spintax format
- **Optional anchor text**: User-provided via CLI

### 3. Database Storage
**Status:** COMPLETE

Project records are created with all data:
- User association (user_id foreign key)
- Main keyword and project name
- All numeric metrics from CORA file
- Entities and related searches as JSON arrays
- Custom anchor text as JSON array
- Timestamps (created_at, updated_at)

### 4. Error Handling
**Status:** COMPLETE

Graceful error handling for:
- File not found errors
- Invalid Excel file format
- Missing required sheets (Strategic Overview, Structure)
- Authentication failures
- Database errors

## Implementation Details

### Files Created/Modified

#### 1. `src/database/models.py` - UPDATED
Added `Project` model:
```python
class Project(Base):
    """Project model for CORA-ingested SEO data"""
    # - id, user_id, name, main_keyword
    # - word_count, term_frequency (with defaults)
    # - Strategic Overview metrics (densities)
    # - Structure metrics (title, meta, H1-H3 distributions)
    # - entities, related_searches, custom_anchor_text (JSON)
    # - spintax_related_search_terms (raw text)
    # - created_at, updated_at
```

#### 2. `src/ingestion/parser.py` - NEW
CORA parser module with:
```python
class CORAParser:
    # - __init__(file_path): Initialize with file validation
    # - extract_main_keyword(): Get keyword from B5 or filename
    # - extract_strategic_overview(): Get Strategic Overview metrics
    # - extract_structure_metrics(): Get Structure sheet data
    # - extract_entities(threshold): Get entities below threshold
    # - parse_spintax_to_list(): Parse spintax to list
    # - parse(): Complete file parsing with error handling
```

**Key Features:**
- Validates file existence and format on init
- Required sheets must exist (Strategic Overview, Structure)
- Optional sheets handled gracefully (Entities)
- Cell value extraction with defaults for zero/empty
- Comprehensive error messages via CORAParseError

#### 3. `src/database/interfaces.py` - UPDATED
Added `IProjectRepository` interface:
```python
class IProjectRepository(ABC):
    # - create(user_id, name, data)
    # - get_by_id(project_id)
    # - get_by_user_id(user_id)
    # - get_all()
    # - update(project)
    # - delete(project_id)
```

#### 4. `src/database/repositories.py` - UPDATED
Added `ProjectRepository` implementation:
```python
class ProjectRepository(IProjectRepository):
    # - Full CRUD operations for projects
    # - Maps dictionary data to model fields
    # - Handles JSON serialization for arrays
    # - Database transaction management
```

#### 5. `src/cli/commands.py` - UPDATED
Added two new CLI commands:
```python
@app.command()
def ingest_cora(...):
    """Ingest CORA .xlsx report and create project"""
    # - Authenticate user
    # - Parse CORA file
    # - Create project in database
    # - Display success summary

@app.command()
def list_projects(...):
    """List projects for authenticated user"""
    # - Admin sees all projects
    # - Regular users see only their projects
    # - Formatted table output
```

### CLI Commands

#### Ingest CORA File
**Usage:**
```bash
python main.py ingest-cora \
  --file path/to/cora_file.xlsx \
  --name "Project Name" \
  [--custom-anchors "anchor1,anchor2"] \
  --username user \
  --password pass
```

**Example:**
```bash
python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123
```

**Output:**
```
Authenticated as: testadmin (Admin)
Parsing CORA file: shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx
Main Keyword: shaft machining
Word Count: 939.6
Entities Found: 36
Related Searches: 31
Creating project: Shaft Machining Test
Success: Project 'Shaft Machining Test' created (ID: 1)
Main Keyword: shaft machining
Entities: 36
Related Searches: 31
```

#### List Projects
**Usage:**
```bash
python main.py list-projects --username user --password pass
```

**Example Output:**
```
All Projects (Admin View):
Total projects: 1
--------------------------------------------------------------------------------
ID    Name                    Keyword            Created
--------------------------------------------------------------------------------
1     Shaft Machining Test    shaft machining    2025-10-18 19:37:30
--------------------------------------------------------------------------------
```

### CORA File Structure

The parser expects the following structure:

**Strategic Overview Sheet:**
- B5: Main keyword
- D24: Word count (default 1250 if zero/error)
- D31: Term frequency (default 3 if zero/error)
- D46: Related search density
- D47: Entity density
- D48: LSI density
- B10: Spintax related search terms

**Structure Sheet:**
- D25-D26: Title metrics
- D31-D33: Meta metrics
- D45-D48: H1 metrics
- D51-D55: H2 metrics
- D58-D62: H3 metrics

**Entities Sheet:**
- Column A: Entity names
- Column J: Threshold values (capture if < -0.195)

### Test Coverage

#### Unit Tests (31 tests, all passing)

**`tests/unit/test_cora_parser.py`** (24 tests):
- CORAParser initialization and validation
- Cell value extraction with defaults
- Sheet retrieval logic
- Main keyword extraction (from sheet or filename)
- Strategic Overview data extraction
- Structure metrics extraction
- Entity filtering by threshold
- Spintax parsing
- Complete file parsing

**`tests/unit/test_cli_commands.py`** (7 tests):
- ingest-cora command success
- ingest-cora with custom anchors
- Authentication failures
- Parse error handling
- list-projects for users
- list-projects for admins
- Empty project lists

#### Integration Tests (7 passing)

**`tests/integration/test_cora_ingestion.py`**:
- Real CORA file parsing
- Project repository CRUD operations
- User-project associations
- Database integrity

### Manual Testing Results

**Test 1: Ingest Real CORA File**
```bash
python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123
```
✅ SUCCESS - Extracted all data correctly

**Test 2: Verify Database Storage**
```
# Query showed:
Project: Shaft Machining Test
Keyword: shaft machining
Word Count: 939.6
Term Frequency: 2.5
H2 Total: 5.6
H3 Total: 13.1
Entities (first 5): ['cnc', 'machining', 'shaft', 'cnc turning', 'boring']
Related Searches (first 5): ['automated machining', 'cnc machining', ...]
```
✅ SUCCESS - All data stored correctly

**Test 3: List Projects**
```bash
python main.py list-projects --username testadmin --password password123
```
✅ SUCCESS - Projects display correctly

### Error Handling Examples

**Missing File:**
```
Error: File not found: nonexistent.xlsx
```

**Invalid Format:**
```
Error parsing CORA file: Failed to open Excel file
```

**Missing Required Sheet:**
```
Error parsing CORA file: Required sheet 'Strategic Overview' not found
```

**Authentication Failure:**
```
Error: Authentication failed
```

## Data Model

### Project Table Schema
```sql
CREATE TABLE projects (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES users(id),
    name VARCHAR(255) NOT NULL,
    main_keyword VARCHAR(255) NOT NULL,

    -- Strategic Overview metrics
    word_count INTEGER NOT NULL DEFAULT 1250,
    term_frequency INTEGER NOT NULL DEFAULT 3,
    related_search_density FLOAT,
    entity_density FLOAT,
    lsi_density FLOAT,
    spintax_related_search_terms TEXT,

    -- Structure metrics (title, meta)
    title_exact_match INTEGER,
    title_related_search INTEGER,
    meta_exact_match INTEGER,
    meta_related_search INTEGER,
    meta_entities INTEGER,

    -- Structure metrics (H1)
    h1_exact INTEGER,
    h1_related_search INTEGER,
    h1_entities INTEGER,
    h1_lsi INTEGER,

    -- Structure metrics (H2)
    h2_total INTEGER,
    h2_exact INTEGER,
    h2_related_search INTEGER,
    h2_entities INTEGER,
    h2_lsi INTEGER,

    -- Structure metrics (H3)
    h3_total INTEGER,
    h3_exact INTEGER,
    h3_related_search INTEGER,
    h3_entities INTEGER,
    h3_lsi INTEGER,

    -- Extracted data (JSON)
    entities JSON,
    related_searches JSON,
    custom_anchor_text JSON,

    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_projects_user_id ON projects(user_id);
CREATE INDEX idx_projects_main_keyword ON projects(main_keyword);
```

## Dependencies
- `openpyxl==3.1.2` - Excel file parsing (already in requirements.txt)
- All existing project dependencies

## Architecture Decisions

**Why Fail on Missing Sheets?**
- Better to fail fast with clear error than create partial data
- CORA files have expected structure
- Silent defaults could mask file format issues
- User gets immediate feedback to fix the file

**Why Store Entities/Searches as JSON?**
- Variable-length arrays
- Easy to query and serialize
- No additional tables needed
- Simple to update

**Why Allow Any User to Create Projects?**
- Users work on their own content
- Admins can see all projects
- Matches workflow (users ingest, admins oversee)
- Can be restricted later if needed

**Why Store Spintax as Raw Text?**
- Preserve original format for reference
- Parsed version available in related_searches
- May need original for regeneration

## Next Steps
This completes Story 2.1. The data ingestion foundation is ready for:
- **Story 2.2**: Configurable Content Rule Engine (use project data for validation)
- **Story 2.3**: AI-Powered Content Generation (use project SEO data for prompts)
- **Story 2.4**: HTML Formatting (use project data in templates)

## Completion Checklist
- [x] Project database model created
- [x] CORA parser module implemented
- [x] ProjectRepository with CRUD operations
- [x] ingest-cora CLI command
- [x] list-projects CLI command
- [x] Unit tests written and passing (31 tests)
- [x] Integration tests written and passing (7 tests)
- [x] Manual testing with real CORA file
- [x] Error handling for all failure scenarios
- [x] Database schema updated
- [x] Story documentation completed

## Notes
- Word count stored as float to preserve decimal values from CORA
- Entity threshold of -0.195 is configurable via parser method parameter
- Custom anchor text is optional and stored as empty array if not provided
- All sheets except Entities are required (proper validation added)
- Parser closes workbook properly to prevent file locks
- Timestamps use UTC for consistency
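
The entity rule described in this story (take the name from column A when the column J value is below a configurable threshold, -0.195 by default) could be sketched as follows. This is only an illustration on plain tuples, not the project's actual `extract_entities` code; the function name `filter_entities`, the `(name, score)` row layout, and the choice to skip rows with missing or non-numeric values are assumptions.

```python
from typing import Iterable, List, Optional, Tuple

DEFAULT_THRESHOLD = -0.195  # threshold quoted in this document


def filter_entities(
    rows: Iterable[Tuple[Optional[str], object]],
    threshold: float = DEFAULT_THRESHOLD,
) -> List[str]:
    """Return entity names whose score is strictly below the threshold.

    Each row is a hypothetical (column A value, column J value) pair.
    """
    entities = []
    for name, score in rows:
        # Skip blank names and cells that do not hold a real number.
        if not name or isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if score < threshold:
            entities.append(str(name).strip())
    return entities


sample_rows = [
    ("cnc", -0.42),      # below threshold: kept
    ("lathe", -0.10),    # above threshold: dropped
    ("shaft", -0.195),   # equal to threshold: dropped (strict comparison)
    (None, -0.9),        # no name: skipped
    ("boring", "n/a"),   # non-numeric score: skipped
]
print(filter_entities(sample_rows))
```

Passing a different `threshold` argument changes the cutoff without touching the default, which mirrors the "configurable via parser method parameter" note.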
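
Related searches are described as being parsed from the raw spintax string while the original text is kept alongside. A minimal sketch of that split is shown below, assuming the spintax is a single flat group of the form `{term one|term two|...}`; the exact format CORA emits is not shown in this document, and this standalone function is hypothetical rather than the parser's real `parse_spintax_to_list` method.

```python
from typing import List, Optional


def parse_spintax_to_list(spintax: Optional[str]) -> List[str]:
    """Split a flat spintax group into a list of trimmed, non-empty terms."""
    if not spintax:
        return []
    text = spintax.strip()
    # Drop one pair of surrounding braces if present (assumed format).
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    return [part.strip() for part in text.split("|") if part.strip()]


raw = "{automated machining|cnc machining| }"
print(parse_spintax_to_list(raw))
```

Nested spintax groups are not handled here; supporting them would need a small recursive parser instead of a single `split`.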
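
The defaults listed for the Strategic Overview sheet (word count falls back to 1250 and term frequency to 3 when the cell is zero or an error) amount to a small "numeric or default" helper. The sketch below shows one way that rule might look; the helper name `numeric_or_default` is an assumption, and treating empty, non-numeric, and NaN cells the same as zero is an interpretation of "zero/error", not confirmed behaviour.

```python
import math
from typing import Any


def numeric_or_default(value: Any, default: float) -> float:
    """Return value as a float, or default if it is zero, empty, or invalid."""
    if isinstance(value, bool):
        return default
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None, empty strings, and Excel error text such as "#DIV/0!" land here.
        return default
    if number == 0 or math.isnan(number):
        return default
    return number


print(numeric_or_default(939.6, 1250))      # real value is kept
print(numeric_or_default(0, 1250))          # zero falls back to the default
print(numeric_or_default("#DIV/0!", 3))     # error text falls back to the default
```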
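
The main keyword is taken from cell B5 or, failing that, from the filename. The document does not spell out the filename rule, so the following is a guess based only on the single example file (`shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx` yielding `shaft machining`): keep everything before a `_goog_` marker and turn underscores into spaces. Both the function name `keyword_from_filename` and the `_goog_` marker are assumptions.

```python
from pathlib import Path


def keyword_from_filename(file_path: str, marker: str = "_goog_") -> str:
    """Guess the main keyword from a CORA filename (inferred convention)."""
    stem = Path(file_path).stem                # filename without the extension
    prefix = stem.split(marker, 1)[0]          # text before the assumed marker
    return prefix.replace("_", " ").strip()    # underscores become spaces


print(keyword_from_filename("shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx"))
```

If the marker is absent, the whole stem is returned with underscores replaced, so the fallback never raises on unexpected names.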