# Story 2.1: CORA Report Data Ingestion - COMPLETED

## Overview
Implemented a complete ingestion system for CORA .xlsx files: a parser module, database models, CLI commands, and unit and integration tests.

## Story Details
**As a User**, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data.

## Acceptance Criteria - ALL MET

### 1. CLI Command to Ingest CORA Files
**Status:** COMPLETE

A CLI command exists to accept CORA .xlsx file paths:
- Command: `ingest-cora`
- Options: `--file`, `--name`, `--custom-anchors`, `--username`, `--password`
- Requires user authentication (any authenticated user can create projects)
- Returns success message with project details

### 2. Data Extraction from CORA Files
**Status:** COMPLETE

The parser correctly extracts all specified data points:
- **Main keyword**: From Strategic Overview B5 or filename
- **Strategic Overview metrics**: Word count, term frequency, densities, spintax
- **Structure metrics**: Title, meta, H1, H2, H3 counts and distributions
- **Entities**: From Entities sheet where column J < -0.195
- **Related searches**: Parsed from spintax format
- **Optional anchor text**: User-provided via CLI
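
Of the extraction steps above, spintax parsing is the least self-explanatory, so a minimal sketch follows. It assumes the report's spintax is a single flat group of the form `{term one|term two}` with no nesting. The helper name `split_flat_spintax` is hypothetical and is not the parser's actual method.

```python
def split_flat_spintax(spintax):
    """Split a flat spintax group such as "{a|b|c}" into ["a", "b", "c"]."""
    if not spintax:
        return []
    text = spintax.strip()
    # Assumption: one un-nested group; strip the outer braces if present.
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    # Drop empty options and surrounding whitespace.
    return [term.strip() for term in text.split("|") if term.strip()]
```

Nested or multi-group spintax would need a real tokenizer; this sketch only covers the flat case.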

### 3. Database Storage
**Status:** COMPLETE

Project records are created with all data:
- User association (user_id foreign key)
- Main keyword and project name
- All numeric metrics from CORA file
- Entities and related searches as JSON arrays
- Custom anchor text as JSON array
- Timestamps (created_at, updated_at)

### 4. Error Handling
**Status:** COMPLETE

Graceful error handling for:
- File not found errors
- Invalid Excel file format
- Missing required sheets (Strategic Overview, Structure)
- Authentication failures
- Database errors

## Implementation Details

### Files Created/Modified

#### 1. `src/database/models.py` - UPDATED
Added `Project` model:
```python
class Project(Base):
    """Project model for CORA-ingested SEO data"""
    # Columns (summary):
    #   id, user_id, name, main_keyword
    #   word_count, term_frequency (with defaults)
    #   Strategic Overview metrics (densities)
    #   Structure metrics (title, meta, H1-H3 distributions)
    #   entities, related_searches, custom_anchor_text (JSON)
    #   spintax_related_search_terms (raw text)
    #   created_at, updated_at
```

#### 2. `src/ingestion/parser.py` - NEW
CORA parser module with:
```python
class CORAParser:
    """Parser for CORA .xlsx reports (method summary)"""
    # __init__(file_path): Initialize with file validation
    # extract_main_keyword(): Get keyword from B5 or filename
    # extract_strategic_overview(): Get Strategic Overview metrics
    # extract_structure_metrics(): Get Structure sheet data
    # extract_entities(threshold): Get entities below threshold
    # parse_spintax_to_list(): Parse spintax to list
    # parse(): Complete file parsing with error handling
```

**Key Features:**
- Validates file existence and format on init
- Required sheets must exist (Strategic Overview, Structure)
- Optional sheets handled gracefully (Entities)
- Cell value extraction with defaults for zero/empty
- Comprehensive error messages via CORAParseError
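
The "defaults for zero/empty" behavior can be sketched as a small helper. This is an illustration only: the function name `value_or_default` is hypothetical, and it assumes the metric cells are numeric. The defaults in the comments (1250 for word count, 3 for term frequency) are the ones this document lists for D24 and D31.

```python
def value_or_default(raw, default):
    """Return raw as a float, or default when raw is empty, zero, or not numeric."""
    try:
        number = float(raw)
    except (TypeError, ValueError):
        # None (empty cell) or a non-numeric string such as an Excel error value.
        return default
    return number if number != 0 else default


word_count = value_or_default(939.6, 1250)    # cell has a usable value
term_frequency = value_or_default(None, 3)    # empty cell falls back to the default
```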

#### 3. `src/database/interfaces.py` - UPDATED
Added `IProjectRepository` interface:
```python
class IProjectRepository(ABC):
    """Repository interface for projects (method summary)"""
    # create(user_id, name, data)
    # get_by_id(project_id)
    # get_by_user_id(user_id)
    # get_all()
    # update(project)
    # delete(project_id)
```

#### 4. `src/database/repositories.py` - UPDATED
Added `ProjectRepository` implementation:
```python
class ProjectRepository(IProjectRepository):
    """Project repository (summary)"""
    # Full CRUD operations for projects
    # Maps dictionary data to model fields
    # Handles JSON serialization for arrays
    # Database transaction management
```
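
As a rough illustration of the dictionary-to-field mapping, the sketch below builds a flat row from parsed data. It is not the repository's actual code: `build_project_row`, the trimmed field sets, and the choice to ignore unknown keys are all assumptions, and an ORM JSON column type may make the explicit `json.dumps` unnecessary.

```python
import json

# Trimmed field lists for brevity; the real model has many more scalar columns.
JSON_FIELDS = {"entities", "related_searches", "custom_anchor_text"}
SCALAR_FIELDS = {"main_keyword", "word_count", "term_frequency"}


def build_project_row(user_id, name, data):
    """Map a parsed-data dict onto column values, JSON-encoding list fields."""
    row = {"user_id": user_id, "name": name}
    for key, value in data.items():
        if key in JSON_FIELDS:
            row[key] = json.dumps(value or [])
        elif key in SCALAR_FIELDS:
            row[key] = value
        # Assumption: unknown keys are silently ignored; real code may raise instead.
    return row
```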

#### 5. `src/cli/commands.py` - UPDATED
Added two new CLI commands:
```python
@app.command()
def ingest_cora(...):
    """Ingest CORA .xlsx report and create project"""
    # 1. Authenticate user
    # 2. Parse CORA file
    # 3. Create project in database
    # 4. Display success summary

@app.command()
def list_projects(...):
    """List projects for authenticated user"""
    # Admin sees all projects
    # Regular users see only their projects
    # Formatted table output
```
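
The visibility rule in `list_projects` (admins see everything, other users see only their own projects) can be sketched as a plain filter. The names `ProjectRow` and `visible_projects` are hypothetical; the real command presumably chooses between the repository's `get_all()` and `get_by_user_id(user_id)` instead of filtering in memory.

```python
from dataclasses import dataclass


@dataclass
class ProjectRow:
    """Hypothetical stand-in for a stored project."""
    id: int
    user_id: int
    name: str


def visible_projects(projects, user_id, is_admin):
    """Return all projects for admins, otherwise only the caller's own."""
    if is_admin:
        return list(projects)
    return [p for p in projects if p.user_id == user_id]
```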

### CLI Commands

#### Ingest CORA File

**Usage:**
```bash
python main.py ingest-cora \
  --file path/to/cora_file.xlsx \
  --name "Project Name" \
  [--custom-anchors "anchor1,anchor2"] \
  --username user \
  --password pass
```

**Example:**
```bash
python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123
```

**Output:**
```
Authenticated as: testadmin (Admin)

Parsing CORA file: shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx
Main Keyword: shaft machining
Word Count: 939.6
Entities Found: 36
Related Searches: 31

Creating project: Shaft Machining Test

Success: Project 'Shaft Machining Test' created (ID: 1)
Main Keyword: shaft machining
Entities: 36
Related Searches: 31
```
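
When B5 is empty, the main keyword comes from the filename. The sketch below shows one way that fallback could work for the file used above. It assumes the keyword is everything before a `_goog_` marker in the filename, which is inferred from this single example; the function name `keyword_from_filename` is hypothetical.

```python
from pathlib import Path


def keyword_from_filename(file_path):
    """Derive a keyword from a CORA filename such as 'shaft_machining_goog_....xlsx'."""
    stem = Path(file_path).stem
    # Assumption: the keyword is the part of the name before "_goog_".
    keyword_part = stem.partition("_goog_")[0]
    return keyword_part.replace("_", " ").strip()


keyword = keyword_from_filename("shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx")
# keyword == "shaft machining"
```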

#### List Projects

**Usage:**
```bash
python main.py list-projects --username user --password pass
```

**Example Output:**
```
All Projects (Admin View):
Total projects: 1
--------------------------------------------------------------------------------
ID    Name                           Keyword                   Created             
--------------------------------------------------------------------------------
1     Shaft Machining Test           shaft machining           2025-10-18 19:37:30 
--------------------------------------------------------------------------------
```

### CORA File Structure

The parser expects the following structure:

**Strategic Overview Sheet:**
- B5: Main keyword
- D24: Word count (default 1250 if zero/error)
- D31: Term frequency (default 3 if zero/error)
- D46: Related search density
- D47: Entity density
- D48: LSI density
- B10: Spintax related search terms

**Structure Sheet:**
- D25-D26: Title metrics
- D31-D33: Meta metrics
- D45-D48: H1 metrics
- D51-D55: H2 metrics
- D58-D62: H3 metrics

**Entities Sheet:**
- Column A: Entity names
- Column J: Threshold values (capture if < -0.195)
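
The Entities rule (keep column A when column J is below -0.195) can be sketched over plain row tuples. The helper name `filter_entities` is hypothetical. Rows in this shape could come from openpyxl's `worksheet.iter_rows(values_only=True)`, but the sketch takes any iterable of tuples so it stays independent of the workbook.

```python
ENTITY_THRESHOLD = -0.195


def filter_entities(rows, threshold=ENTITY_THRESHOLD):
    """Return column A names for rows whose column J value is below threshold.

    Each row is a tuple of cell values: column A is index 0, column J is index 9.
    """
    entities = []
    for row in rows:
        if len(row) < 10:
            continue  # row too short to have a column J
        name, score = row[0], row[9]
        # Skip header and blank rows: the score must be a real number.
        if not name or isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if score < threshold:
            entities.append(str(name).strip())
    return entities
```

The comparison is strict, so a value of exactly -0.195 is not captured.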

### Test Coverage

#### Unit Tests (31 tests, all passing)

**`tests/unit/test_cora_parser.py`** (24 tests):
- CORAParser initialization and validation
- Cell value extraction with defaults
- Sheet retrieval logic
- Main keyword extraction (from sheet or filename)
- Strategic Overview data extraction
- Structure metrics extraction
- Entity filtering by threshold
- Spintax parsing
- Complete file parsing

**`tests/unit/test_cli_commands.py`** (7 tests):
- ingest-cora command success
- ingest-cora with custom anchors
- Authentication failures
- Parse error handling
- list-projects for users
- list-projects for admins
- Empty project lists

#### Integration Tests (7 passing)

**`tests/integration/test_cora_ingestion.py`**:
- Real CORA file parsing
- Project repository CRUD operations
- User-project associations
- Database integrity

### Manual Testing Results

**Test 1: Ingest Real CORA File**
```bash
python main.py ingest-cora \
  --file shaft_machining_goog_251011_C_US_L_EN_M3P1A_GMW.xlsx \
  --name "Shaft Machining Test" \
  --username testadmin \
  --password password123
```
✅ SUCCESS - Extracted all data correctly

**Test 2: Verify Database Storage**
```
# Query showed:
Project: Shaft Machining Test
Keyword: shaft machining
Word Count: 939.6
Term Frequency: 2.5
H2 Total: 5.6
H3 Total: 13.1
Entities (first 5): ['cnc', 'machining', 'shaft', 'cnc turning', 'boring']
Related Searches (first 5): ['automated machining', 'cnc machining', ...]
```
✅ SUCCESS - All data stored correctly

**Test 3: List Projects**
```bash
python main.py list-projects --username testadmin --password password123
```
✅ SUCCESS - Projects display correctly

### Error Handling Examples

**Missing File:**
```
Error: File not found: nonexistent.xlsx
```

**Invalid Format:**
```
Error parsing CORA file: Failed to open Excel file
```

**Missing Required Sheet:**
```
Error parsing CORA file: Required sheet 'Strategic Overview' not found
```

**Authentication Failure:**
```
Error: Authentication failed
```

## Data Model

### Project Table Schema

```sql
CREATE TABLE projects (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES users(id),
    name VARCHAR(255) NOT NULL,
    main_keyword VARCHAR(255) NOT NULL,
    
    -- Strategic Overview metrics
    word_count FLOAT NOT NULL DEFAULT 1250,
    term_frequency INTEGER NOT NULL DEFAULT 3,
    related_search_density FLOAT,
    entity_density FLOAT,
    lsi_density FLOAT,
    spintax_related_search_terms TEXT,
    
    -- Structure metrics (title, meta)
    title_exact_match INTEGER,
    title_related_search INTEGER,
    meta_exact_match INTEGER,
    meta_related_search INTEGER,
    meta_entities INTEGER,
    
    -- Structure metrics (H1)
    h1_exact INTEGER,
    h1_related_search INTEGER,
    h1_entities INTEGER,
    h1_lsi INTEGER,
    
    -- Structure metrics (H2)
    h2_total INTEGER,
    h2_exact INTEGER,
    h2_related_search INTEGER,
    h2_entities INTEGER,
    h2_lsi INTEGER,
    
    -- Structure metrics (H3)
    h3_total INTEGER,
    h3_exact INTEGER,
    h3_related_search INTEGER,
    h3_entities INTEGER,
    h3_lsi INTEGER,
    
    -- Extracted data (JSON)
    entities JSON,
    related_searches JSON,
    custom_anchor_text JSON,
    
    -- Timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_projects_user_id ON projects(user_id);
CREATE INDEX idx_projects_main_keyword ON projects(main_keyword);
```

## Dependencies

- `openpyxl==3.1.2` - Excel file parsing (already in requirements.txt)
- All existing project dependencies

## Architecture Decisions

**Why Fail on Missing Sheets?**
- Better to fail fast with a clear error than to create partial data
- CORA files have expected structure
- Silent defaults could mask file format issues
- User gets immediate feedback to fix the file
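
A minimal sketch of the fail-fast check, assuming the parser has a list of sheet names available (openpyxl exposes one as `workbook.sheetnames`). The `CORAParseError` class here is a stand-in for the real one, and `check_required_sheets` is a hypothetical helper; the error text follows the example message shown earlier in this document.

```python
class CORAParseError(Exception):
    """Stand-in for the parser's error type."""


REQUIRED_SHEETS = ("Strategic Overview", "Structure")


def check_required_sheets(sheet_names):
    """Raise CORAParseError for the first required sheet that is missing."""
    for required in REQUIRED_SHEETS:
        if required not in sheet_names:
            raise CORAParseError(f"Required sheet '{required}' not found")
```

Optional sheets such as Entities are deliberately left out of `REQUIRED_SHEETS`, so their absence does not stop parsing.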

**Why Store Entities/Searches as JSON?**
- Variable-length arrays
- Easy to query and serialize
- No additional tables needed
- Simple to update
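
To make the JSON-column choice concrete, here is a small round trip using the standard-library `sqlite3` and `json` modules against a trimmed, in-memory version of the table. The function name `roundtrip_entities` is hypothetical, and the real code may rely on an ORM JSON type instead of calling `json.dumps` and `json.loads` directly.

```python
import json
import sqlite3


def roundtrip_entities(entities):
    """Store a list in a JSON column and read it back as a Python list."""
    conn = sqlite3.connect(":memory:")
    try:
        # Trimmed schema: only the columns needed for this illustration.
        conn.execute(
            "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL, entities JSON)"
        )
        conn.execute(
            "INSERT INTO projects (name, entities) VALUES (?, ?)",
            ("Shaft Machining Test", json.dumps(entities)),
        )
        (raw,) = conn.execute(
            "SELECT entities FROM projects WHERE name = ?",
            ("Shaft Machining Test",),
        ).fetchone()
        return json.loads(raw)
    finally:
        conn.close()
```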

**Why Allow Any User to Create Projects?**
- Users work on their own content
- Admins can see all projects
- Matches workflow (users ingest, admins oversee)
- Can be restricted later if needed

**Why Store Spintax as Raw Text?**
- Preserve original format for reference
- Parsed version available in related_searches
- May need original for regeneration

## Next Steps

This completes Story 2.1. The data ingestion foundation is ready for:
- **Story 2.2**: Configurable Content Rule Engine (use project data for validation)
- **Story 2.3**: AI-Powered Content Generation (use project SEO data for prompts)
- **Story 2.4**: HTML Formatting (use project data in templates)

## Completion Checklist

- [x] Project database model created
- [x] CORA parser module implemented
- [x] ProjectRepository with CRUD operations
- [x] ingest-cora CLI command
- [x] list-projects CLI command
- [x] Unit tests written and passing (31 tests)
- [x] Integration tests written and passing (7 tests)
- [x] Manual testing with real CORA file
- [x] Error handling for all failure scenarios
- [x] Database schema updated
- [x] Story documentation completed

## Notes

- Word count stored as float to preserve decimal values from CORA
- Entity threshold of -0.195 is configurable via the `threshold` parameter of `extract_entities()`
- Custom anchor text is optional and stored as empty array if not provided
- The Strategic Overview and Structure sheets are required; the Entities sheet is optional (validation added)
- Parser closes workbook properly to prevent file locks
- Timestamps use UTC for consistency
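
The custom anchor handling noted above can be sketched as follows. It assumes `--custom-anchors` receives a comma-separated string, as in the usage example `"anchor1,anchor2"`; the helper name `parse_custom_anchors` is hypothetical.

```python
def parse_custom_anchors(raw):
    """Turn 'anchor1,anchor2' into a list; return an empty list when nothing is given."""
    if not raw:
        return []
    # Trim whitespace and drop empty entries left by stray commas.
    return [anchor.strip() for anchor in raw.split(",") if anchor.strip()]
```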