# Story 2.8: Simple Spreadsheet Ingestion

## Overview

Implement a simplified spreadsheet ingestion path that lets users quickly create projects from basic data without requiring a full CORA report. This addresses the need for faster project setup when a full CORA run (20-25 minutes) is unnecessary.

## Story Details

**As a User**, I want to ingest a simple spreadsheet with minimal required data, so that I can quickly create a project for content generation without waiting for a full CORA analysis.

## Context

A full CORA run takes 20-25 minutes and includes extensive metrics. Sometimes users only need to add information from a few cells they pasted into a spreadsheet. Eventually this data will be entered via a web form, but for now a simpler spreadsheet format is needed.

## Acceptance Criteria

### 1. CLI Command to Ingest Simple Spreadsheets

**Status:** PENDING

A CLI command accepts the path to a simple `.xlsx` file:

- Command: `ingest-simple`
- Options: `--file`, `--name` (optional, overrides `project_name` from the spreadsheet), `--money-site-url`, `--username`, `--password`
- Requires user authentication (any authenticated user can create projects)
- Returns a success message with project details

### 2. Spreadsheet Format

**Status:** PENDING

The parser accepts a simple single-sheet spreadsheet format:

- **First row**: Headers (column names)
- **Second row**: Data values

**Required columns:**

- `main_keyword`: Single phrase keyword (e.g., "shaft machining")
- `project_name`: Name for the project
- `related_searches`: Comma-delimited list (e.g., "term1, term2, term3")
- `entities`: Comma-delimited list (e.g., "entity1, entity2, entity3")

**Optional columns:**

- `word_count`: Integer (default: 1500)
- `term_frequency`: Integer (default: 3)

### 3. Data Parsing

**Status:** PENDING

The parser correctly extracts and processes data:

- Parses comma-delimited `related_searches` into an array
- Parses comma-delimited `entities` into an array
- Applies defaults for optional fields (`word_count=1500`, `term_frequency=3`)
- Sets all structure metrics (`title_exact_match`, `h1_exact`, `h2_total`, etc.) to `None`
- Validates that required fields are present
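
The parsing rules above could be sketched as follows. This is a minimal illustration, not the project's actual code: the names `parse_comma_delimited`, `build_project_data`, `DEFAULTS`, and `STRUCTURE_METRICS` are assumptions, and only the three structure metrics named in this story are listed.

```python
from typing import Any, Dict, List

# Defaults taken from the acceptance criteria in this story.
DEFAULTS = {"word_count": 1500, "term_frequency": 3}

# Only the metrics named in this story; the real model likely has more.
STRUCTURE_METRICS = ("title_exact_match", "h1_exact", "h2_total")


def parse_comma_delimited(value: Any) -> List[str]:
    """Split a cell value on commas, trimming whitespace and dropping empties."""
    if value is None:
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def build_project_data(row: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one header-to-value mapping into project data (hypothetical shape)."""
    data: Dict[str, Any] = {
        "main_keyword": str(row["main_keyword"]).strip(),
        "project_name": str(row["project_name"]).strip(),
        "related_searches": parse_comma_delimited(row.get("related_searches")),
        "entities": parse_comma_delimited(row.get("entities")),
    }
    for field, default in DEFAULTS.items():
        raw = row.get(field)
        # Missing or empty cells fall back to the default value.
        data[field] = default if raw in (None, "") else int(raw)
    for metric in STRUCTURE_METRICS:
        data[metric] = None
    return data
```

Passing a row without `word_count` or `term_frequency` yields the defaults, and an empty or missing list cell yields an empty list rather than an error.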

### 4. Database Storage

**Status:** PENDING

Project records are created with all data:

- User association (`user_id` foreign key)
- Main keyword and project name
- Word count and term frequency (with defaults)
- Entities and related searches as JSON arrays
- Structure metrics as `NULL` (not required for simple ingestion)
- Money site URL (prompted if not provided)
- Timestamps (`created_at`, `updated_at`)
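
To make the stored shape concrete, here is a rough sketch using the standard-library `sqlite3` module. The table layout, column names, and the `insert_project` helper are hypothetical; the real project stores records through its existing `Project` model and `ProjectRepository`, whose schema is not shown in this story.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Dict


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical projects table mirroring the fields listed above."""
    conn.execute(
        """
        CREATE TABLE projects (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            name TEXT NOT NULL,
            main_keyword TEXT NOT NULL,
            word_count INTEGER NOT NULL,
            term_frequency INTEGER NOT NULL,
            entities TEXT NOT NULL,          -- JSON array
            related_searches TEXT NOT NULL,  -- JSON array
            h2_total INTEGER,                -- one structure metric, NULL here
            money_site_url TEXT,
            created_at TEXT NOT NULL,
            updated_at TEXT NOT NULL
        )
        """
    )


def insert_project(
    conn: sqlite3.Connection, user_id: int, data: Dict[str, Any], money_site_url: str
) -> int:
    """Insert one project row and return its new ID."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO projects (user_id, name, main_keyword, word_count, "
        "term_frequency, entities, related_searches, h2_total, money_site_url, "
        "created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            user_id,
            data["project_name"],
            data["main_keyword"],
            data["word_count"],
            data["term_frequency"],
            json.dumps(data["entities"]),
            json.dumps(data["related_searches"]),
            data.get("h2_total"),  # None is stored as NULL
            money_site_url,
            now,
            now,
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Serializing the lists with `json.dumps` keeps them readable as JSON arrays, and passing `None` for a structure metric stores `NULL` without any schema change.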

### 5. Error Handling

**Status:** PENDING

Graceful error handling for:

- File not found errors
- Invalid Excel file format
- Missing required columns (`main_keyword`, `project_name`)
- Empty or invalid comma-delimited lists (treated as empty arrays)
- Authentication failures
- Database errors
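
One way the required-field check might produce a user-facing message is sketched below. The exception class `SimpleSpreadsheetError` and both helper functions are hypothetical, and treating a blank cell the same as a missing column is an assumption; the message wording follows the error examples given in this story.

```python
from typing import Any, Dict

# The two columns named in the error-handling criteria above.
REQUIRED_FIELDS = ("main_keyword", "project_name")


class SimpleSpreadsheetError(Exception):
    """Hypothetical exception for any problem found in the spreadsheet."""


def require_fields(row: Dict[str, Any]) -> None:
    """Raise if a required column is absent or its cell is blank."""
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if value is None or not str(value).strip():
            raise SimpleSpreadsheetError(f"Required field '{field}' not found")


def describe_failure(row: Dict[str, Any]) -> str:
    """Return the CLI-style error message, or an empty string for a valid row."""
    try:
        require_fields(row)
    except SimpleSpreadsheetError as exc:
        return f"Error parsing spreadsheet: {exc}"
    return ""
```

Raising a single exception type from the parser lets the CLI catch it in one place and print a short message instead of a traceback.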

## Implementation Details

### Files to Create/Modify

#### 1. `src/ingestion/parser.py` - UPDATED

Add `SimpleSpreadsheetParser` class:

```python
from typing import Any, Dict, List


class SimpleSpreadsheetParser:
    """Parser for simple single-sheet spreadsheets with basic project data."""

    def __init__(self, file_path: str) -> None: ...

    def _parse_comma_delimited(self, value: Any) -> List[str]: ...

    def parse(self) -> Dict[str, Any]: ...
```

**Key Features:**

- Reads the first sheet of the workbook
- Treats the first row as headers (case-insensitive)
- Treats the second row as data values
- Parses comma-delimited strings into arrays
- Applies defaults for optional fields
- Returns a data structure compatible with `ProjectRepository.create()`
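
The header handling described above might look like this sketch, which works on two plain sequences of cell values so it does not depend on any Excel library. The function name `rows_to_record` is an assumption. If the workbook is read with openpyxl (also an assumption, since this story does not name the library), the two rows could come from `ws.iter_rows(min_row=1, max_row=2, values_only=True)`.

```python
from typing import Any, Dict, Sequence


def rows_to_record(headers: Sequence[Any], values: Sequence[Any]) -> Dict[str, Any]:
    """Pair the header row with the data row, normalizing header names."""
    record: Dict[str, Any] = {}
    for index, header in enumerate(headers):
        if header is None or not str(header).strip():
            continue  # skip unnamed columns
        # Lowercase and trim so "Main_Keyword" and " main_keyword " match.
        key = str(header).strip().lower()
        # A short data row leaves the trailing columns as None.
        record[key] = values[index] if index < len(values) else None
    if not record:
        raise ValueError("No headers found in spreadsheet")
    return record
```

Normalizing the keys once, at this step, means every later lookup can use the lowercase column names from the spreadsheet format section.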

#### 2. `src/cli/commands.py` - UPDATED

Add `ingest-simple` command:

```python
@app.command()
@click.option('--file', '-f', required=True)
@click.option('--name', '-n', help='Override project_name from spreadsheet')
@click.option('--money-site-url', '-m')
@click.option('--username', '-u')
@click.option('--password', '-p')
def ingest_simple(file, name, money_site_url, username, password):
    """Ingest a simple spreadsheet and create a project."""
    ...
```

**Features:**

- Authenticate user
- Parse simple spreadsheet
- Display parsed data summary
- Prompt for `money_site_url` if not provided
- Create project via `ProjectRepository`
- Show success summary
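
The override and prompt steps in that flow could be isolated in a small helper such as the one below. The function `apply_cli_overrides` and its `prompt` parameter are hypothetical; the prompt text is copied from the expected output in this story. Injecting the prompt function is a design choice made here so the logic can be exercised without a terminal.

```python
from typing import Any, Callable, Dict, Optional

PROMPT_TEXT = "Enter money site URL (required for tiered linking): "


def apply_cli_overrides(
    data: Dict[str, Any],
    name: Optional[str] = None,
    money_site_url: Optional[str] = None,
    prompt: Callable[[str], str] = input,
) -> Dict[str, Any]:
    """Return a copy of the parsed data with the CLI options applied."""
    result = dict(data)
    if name:
        # --name takes precedence over project_name from the spreadsheet.
        result["project_name"] = name
    if not money_site_url:
        # Ask interactively only when --money-site-url was not given.
        money_site_url = prompt(PROMPT_TEXT).strip()
    result["money_site_url"] = money_site_url
    return result
```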

### Data Model

Uses the existing `Project` model, so no database changes are required. Structure metrics will be `NULL` for simple ingestion projects.

### Spreadsheet Example

**Simple Format:**

| main_keyword | project_name | related_searches | entities | word_count | term_frequency |
|--------------|--------------|------------------|----------|------------|----------------|
| best coffee makers | Coffee Project | best espresso machines, coffee maker reviews, top coffee makers | coffee, espresso, brewing | 1500 | 3 |

**Minimal Format (uses defaults):**

| main_keyword | project_name | related_searches | entities |
|--------------|--------------|------------------|----------|
| shaft machining | Machining Project | CNC machining, precision machining | machining, lathe, milling |

## CLI Usage

**Basic:**

```bash
python main.py ingest-simple \
  --file simple_project.xlsx \
  --username admin \
  --password pass
```

**With Overrides:**

```bash
python main.py ingest-simple \
  --file simple_project.xlsx \
  --name "Custom Project Name" \
  --money-site-url https://example.com \
  --username admin \
  --password pass
```

**Expected Output:**

```
Authenticated as: admin (Admin)

Parsing simple spreadsheet: simple_project.xlsx
Main Keyword: best coffee makers
Project Name: Coffee Project
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3
Entities: coffee, espresso, brewing
Related Searches: best espresso machines, coffee maker reviews, top coffee makers

Enter money site URL (required for tiered linking): https://moneysite.com

Creating project: Coffee Project
Money Site URL: https://moneysite.com

Success: Project 'Coffee Project' created (ID: 1)
Main Keyword: best coffee makers
Money Site URL: https://moneysite.com
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3
```

## Error Handling Examples

**Missing Required Column:**

```
Error parsing spreadsheet: Required field 'main_keyword' not found
```

**Invalid File:**

```
Error parsing spreadsheet: Failed to open Excel file: [details]
```

**Empty Spreadsheet:**

```
Error parsing spreadsheet: No headers found in spreadsheet
```

## Dependencies

- Story 2.1 (CORA ingestion): reuses `ProjectRepository` and the `Project` model
- Existing authentication system
- Existing database models

## Future Enhancements

- Support for multiple projects per spreadsheet (multiple data rows)
- CSV format support (in addition to Excel)
- Web form interface (deferred to a future story)
- Validation of comma-delimited format with better error messages