# Story 2.8: Simple Spreadsheet Ingestion
## Overview
Implement a simplified spreadsheet ingestion path that allows users to quickly create projects from basic data without requiring a full CORA report. This addresses the need for faster project setup when a full CORA run (20-25 minutes) is unnecessary.
## Story Details
**As a User**, I want to ingest a simple spreadsheet with minimal required data, so that I can quickly create a project for content generation without waiting for a full CORA analysis.
## Context
A full CORA run takes 20-25 minutes and includes extensive metrics. Sometimes users only need to supply a handful of values, which they can paste into a few spreadsheet cells. Eventually this data will be entered via a web form, but for now a simpler spreadsheet format is needed.
## Acceptance Criteria
### 1. CLI Command to Ingest Simple Spreadsheets
**Status:** PENDING
A CLI command accepts the path to a simple `.xlsx` file:
- Command: `ingest-simple`
- Options: `--file` (required), `--name` (optional, overrides `project_name` from the spreadsheet), `--money-site-url`, `--username`, `--password`
- Requires user authentication (any authenticated user can create projects)
- Returns success message with project details
### 2. Spreadsheet Format
**Status:** PENDING
The parser accepts a simple single-sheet spreadsheet format:
- **First row**: Headers (column names)
- **Second row**: Data values
**Required columns:**
- `main_keyword`: Single phrase keyword (e.g., "shaft machining")
- `project_name`: Name for the project
- `related_searches`: Comma-delimited list (e.g., "term1, term2, term3")
- `entities`: Comma-delimited list (e.g., "entity1, entity2, entity3")
**Optional columns:**
- `word_count`: Integer (default: 1500)
- `term_frequency`: Integer (default: 3)
### 3. Data Parsing
**Status:** PENDING
The parser correctly extracts and processes data:
- Parses comma-delimited `related_searches` into array
- Parses comma-delimited `entities` into array
- Applies defaults for optional fields (word_count=1500, term_frequency=3)
- Sets all structure metrics (title_exact_match, h1_exact, h2_total, etc.) to `None`
- Validates required fields are present
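
The list-splitting and default-filling steps above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the helper names `parse_comma_delimited` and `apply_defaults` and the `DEFAULTS` mapping are hypothetical, and only the default values themselves come from this story.

```python
from typing import Any, Dict, List

# Default values taken from the acceptance criteria in this story.
DEFAULTS = {"word_count": 1500, "term_frequency": 3}


def parse_comma_delimited(value: Any) -> List[str]:
    """Split a comma-delimited cell into a list of trimmed, non-empty terms."""
    if value is None:
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def apply_defaults(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of row with blank optional fields replaced by DEFAULTS."""
    result = dict(row)
    for field, default in DEFAULTS.items():
        raw = result.get(field)
        # Blank cells fall back to the default; anything else is coerced to int.
        result[field] = default if raw in (None, "") else int(raw)
    return result
```

Under these assumptions, an empty or whitespace-only cell yields an empty list, and a non-numeric `word_count` would raise `ValueError` from `int()`, which the caller would need to report.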
### 4. Database Storage
**Status:** PENDING
Project records are created with all data:
- User association (user_id foreign key)
- Main keyword and project name
- Word count and term frequency (with defaults)
- Entities and related searches as JSON arrays
- Structure metrics as `NULL` (not required for simple ingestion)
- Money site URL (prompted if not provided)
- Timestamps (created_at, updated_at)
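
As a rough sketch of the values listed above, the record for a simple-ingestion project might be assembled like this. The function `build_project_record`, the dictionary keys, and the `STRUCTURE_METRICS` tuple are assumptions for illustration; the real `Project` model may use different field names and defines more structure metrics than the three named in this story. The sketch also assumes a JSON column type that accepts Python lists directly.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Structure metrics named in this story; the real model may define more.
STRUCTURE_METRICS = ("title_exact_match", "h1_exact", "h2_total")


def build_project_record(
    user_id: int,
    parsed: Dict[str, Any],
    money_site_url: str,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble the values to store for a simple-ingestion project."""
    timestamp = now or datetime.now(timezone.utc)
    record: Dict[str, Any] = {
        "user_id": user_id,
        "main_keyword": parsed["main_keyword"],
        "project_name": parsed["project_name"],
        "word_count": parsed.get("word_count", 1500),
        "term_frequency": parsed.get("term_frequency", 3),
        "entities": list(parsed.get("entities", [])),
        "related_searches": list(parsed.get("related_searches", [])),
        "money_site_url": money_site_url,
        "created_at": timestamp,
        "updated_at": timestamp,
    }
    # Simple ingestion has no structure metrics, so they are stored as NULL.
    record.update({metric: None for metric in STRUCTURE_METRICS})
    return record
```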
### 5. Error Handling
**Status:** PENDING
Graceful error handling for:
- File not found errors
- Invalid Excel file format
- Missing required columns (main_keyword, project_name)
- Empty or invalid comma-delimited lists (treated as empty arrays)
- Authentication failures
- Database errors
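
The first two cases above (missing file, invalid Excel file) can be caught before any parsing starts. The sketch below is one possible approach under stated assumptions: `SpreadsheetError` and `check_xlsx_path` are hypothetical names, and the wording of the "File not found" message is illustrative. It relies on the fact that `.xlsx` workbooks are ZIP containers, so the standard-library `zipfile.is_zipfile` check rejects files that cannot be valid workbooks.

```python
import zipfile
from pathlib import Path


class SpreadsheetError(Exception):
    """Hypothetical single error type for the CLI to catch and report."""


def check_xlsx_path(file_path: str) -> Path:
    """Return the path if it points at a plausible .xlsx file, else raise."""
    path = Path(file_path)
    if not path.is_file():
        raise SpreadsheetError(f"File not found: {file_path}")
    # .xlsx workbooks are ZIP containers, so a non-ZIP file cannot be valid.
    if not zipfile.is_zipfile(path):
        raise SpreadsheetError(
            f"Failed to open Excel file: {file_path} is not a valid .xlsx file"
        )
    return path
```

A ZIP check only rules out obviously wrong files; a ZIP archive that is not a workbook would still need to be rejected by the spreadsheet library when it opens the file.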
## Implementation Details
### Files to Create/Modify
#### 1. `src/ingestion/parser.py` - UPDATED
Add `SimpleSpreadsheetParser` class:
```python
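from typing import Any, Dict, List


class SimpleSpreadsheetParser:
    """Parser for simple single-sheet spreadsheets with basic project data."""

    # Method bodies are omitted; only the planned interface is shown.
    def __init__(self, file_path: str) -> None: ...
    def _parse_comma_delimited(self, value: Any) -> List[str]: ...
    def parse(self) -> Dict[str, Any]: ...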
```
**Key Features:**
- Reads first sheet of workbook
- First row as headers (case-insensitive)
- Second row as data values
- Parses comma-delimited strings into arrays
- Applies defaults for optional fields
- Returns data structure compatible with `ProjectRepository.create()`
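
The header handling described above can be sketched without committing to a spreadsheet library. The function below takes the first two rows as plain sequences, which is the shape a library such as `openpyxl` yields from `iter_rows(values_only=True)` (the choice of library is an assumption; this story does not name one). The name `map_rows` is hypothetical, and the sketch validates only `main_keyword` and `project_name`, following the Error Handling criterion.

```python
from typing import Any, Dict, Sequence

# Columns whose absence is treated as an error in this sketch.
REQUIRED_FIELDS = ("main_keyword", "project_name")


def map_rows(header_row: Sequence[Any], data_row: Sequence[Any]) -> Dict[str, Any]:
    """Pair case-insensitive headers from row 1 with the values in row 2."""
    record: Dict[str, Any] = {}
    for index, header in enumerate(header_row):
        # Skip empty header cells but keep column positions aligned.
        if header is None or not str(header).strip():
            continue
        value = data_row[index] if index < len(data_row) else None
        record[str(header).strip().lower()] = value
    if not record:
        raise ValueError("No headers found in spreadsheet")
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            raise ValueError(f"Required field '{field}' not found")
    return record
```

Normalizing headers with `strip().lower()` means `Main_Keyword` and `main_keyword ` both map to the same key, which is one way to read "case-insensitive" here.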
#### 2. `src/cli/commands.py` - UPDATED
Add `ingest-simple` command:
```python
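import click


# `app` is assumed to be the existing click group defined in this module.
@app.command('ingest-simple')
@click.option('--file', '-f', required=True, help='Path to the simple .xlsx file')
@click.option('--name', '-n', help='Override project_name from spreadsheet')
@click.option('--money-site-url', '-m', help='Money site URL (prompted if omitted)')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
def ingest_simple(file, name, money_site_url, username, password):
    """Ingest a simple spreadsheet and create a project."""
    ...  # body omitted; see the feature list below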
```
**Features:**
- Authenticate user
- Parse simple spreadsheet
- Display parsed data summary
- Prompt for money_site_url if not provided
- Create project via ProjectRepository
- Show success summary
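
Two of the steps above, the `--name` override and the money site URL prompt, are small enough to sketch in isolation. The helper names `resolve_project_name` and `resolve_money_site_url` are hypothetical; the prompt text is copied from the expected output in this story. The prompt function is injectable so the logic can be exercised without a terminal.

```python
from typing import Any, Callable, Dict, Optional


def resolve_project_name(parsed: Dict[str, Any], name_override: Optional[str]) -> str:
    """Prefer the --name option over project_name from the spreadsheet."""
    return name_override if name_override else parsed["project_name"]


def resolve_money_site_url(
    option_value: Optional[str],
    prompt: Callable[[str], str] = input,
) -> str:
    """Use --money-site-url when given; otherwise ask the user for it."""
    if option_value:
        return option_value
    return prompt("Enter money site URL (required for tiered linking): ").strip()
```

In a click-based command, `click.prompt` could be passed in place of `input`; which of the two the real command uses is not specified here.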
### Data Model
Uses existing `Project` model - no database changes required. Structure metrics will be `NULL` for simple ingestion projects.
### Spreadsheet Example
**Simple Format:**
| main_keyword | project_name | related_searches | entities | word_count | term_frequency |
|-------------|--------------|------------------|----------|------------|----------------|
| best coffee makers | Coffee Project | best espresso machines, coffee maker reviews, top coffee makers | coffee, espresso, brewing | 1500 | 3 |
**Minimal Format (uses defaults):**
| main_keyword | project_name | related_searches | entities |
|-------------|--------------|------------------|----------|
| shaft machining | Machining Project | CNC machining, precision machining | machining, lathe, milling |
## CLI Usage
**Basic:**
```bash
python main.py ingest-simple \
  --file simple_project.xlsx \
  --username admin \
  --password pass
```
**With Overrides:**
```bash
python main.py ingest-simple \
  --file simple_project.xlsx \
  --name "Custom Project Name" \
  --money-site-url https://example.com \
  --username admin \
  --password pass
```
**Expected Output:**
```
Authenticated as: admin (Admin)
Parsing simple spreadsheet: simple_project.xlsx
Main Keyword: best coffee makers
Project Name: Coffee Project
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3
Entities: coffee, espresso, brewing
Related Searches: best espresso machines, coffee maker reviews, top coffee makers
Enter money site URL (required for tiered linking): https://moneysite.com
Creating project: Coffee Project
Money Site URL: https://moneysite.com
Success: Project 'Coffee Project' created (ID: 1)
Main Keyword: best coffee makers
Money Site URL: https://moneysite.com
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3
```
## Error Handling Examples
**Missing Required Column:**
```
Error parsing spreadsheet: Required field 'main_keyword' not found
```
**Invalid File:**
```
Error parsing spreadsheet: Failed to open Excel file: [details]
```
**Empty Spreadsheet:**
```
Error parsing spreadsheet: No headers found in spreadsheet
```
## Dependencies
- Story 2.1 (CORA ingestion) - Reuses ProjectRepository and Project model
- Existing authentication system
- Existing database models
## Future Enhancements
- Support for multiple projects per spreadsheet (multiple data rows)
- CSV format support (in addition to Excel)
- Web form interface (deferred to future story)
- Validation of comma-delimited format with better error messages