Story 2.8: Simple Spreadsheet Ingestion

Overview

Implement a simplified spreadsheet ingestion path that allows users to quickly create projects from basic data without requiring a full CORA report. This addresses the need for faster project setup when a full CORA run (20-25 minutes) is unnecessary.

Story Details

As a User, I want to ingest a simple spreadsheet with minimal required data, so that I can quickly create a project for content generation without waiting for a full CORA analysis.

Context

A full CORA run takes 20-25 minutes and produces extensive metrics. Sometimes a user only has a handful of values, pasted into a few spreadsheet cells, and needs none of that analysis. Eventually this data will be entered through a web form; until then, a simpler spreadsheet format is needed.

Acceptance Criteria

1. CLI Command to Ingest Simple Spreadsheets

Status: PENDING

A CLI command accepts the path to a simple .xlsx file:

  • Command: ingest-simple
  • Options: --file, --name (optional, overrides project_name from the spreadsheet), --money-site-url, --username, --password
  • Requires user authentication (any authenticated user can create projects)
  • Returns success message with project details

2. Spreadsheet Format

Status: PENDING

The parser accepts a simple single-sheet spreadsheet format:

  • First row: Headers (column names)
  • Second row: Data values

Required columns:

  • main_keyword: Single phrase keyword (e.g., "shaft machining")
  • project_name: Name for the project
  • related_searches: Comma-delimited list (e.g., "term1, term2, term3")
  • entities: Comma-delimited list (e.g., "entity1, entity2, entity3")

Optional columns:

  • word_count: Integer (default: 1500)
  • term_frequency: Integer (default: 3)

3. Data Parsing

Status: PENDING

The parser correctly extracts and processes data:

  • Parses comma-delimited related_searches into array
  • Parses comma-delimited entities into array
  • Applies defaults for optional fields (word_count=1500, term_frequency=3)
  • Sets all structure metrics (title_exact_match, h1_exact, h2_total, etc.) to None
  • Validates required fields are present
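
The parsing rules above can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not the project's implementation: the names `parse_comma_delimited` and `apply_defaults` are hypothetical, and treating blank cells as "not provided" is an assumption.

```python
from typing import Any, Dict, List

# Defaults taken from the acceptance criteria above.
DEFAULTS = {"word_count": 1500, "term_frequency": 3}


def parse_comma_delimited(value: Any) -> List[str]:
    """Split a cell such as "term1, term2" into a list of trimmed strings.

    Assumption: None, blank cells, and stray commas produce no entries.
    """
    if value is None:
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def apply_defaults(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of row with the optional integer fields filled in."""
    result = dict(row)
    for field, default in DEFAULTS.items():
        raw = result.get(field)
        # Assumption: a missing or blank cell falls back to the default.
        if raw is None or str(raw).strip() == "":
            result[field] = default
        else:
            result[field] = int(raw)
    return result
```

Returning an empty list for a blank cell matches the error-handling criterion that empty or invalid lists are treated as empty arrays.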

4. Database Storage

Status: PENDING

Project records are created with all data:

  • User association (user_id foreign key)
  • Main keyword and project name
  • Word count and term frequency (with defaults)
  • Entities and related searches as JSON arrays
  • Structure metrics as NULL (not required for simple ingestion)
  • Money site URL (prompted if not provided)
  • Timestamps (created_at, updated_at)
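
As a rough sketch of the record shape described above, the helper below assembles the values that would be handed to the repository. The function name `build_project_fields` and the dictionary keys are assumptions for illustration; the actual Project model columns may differ, and only the three structure metrics named in this story are listed.

```python
import json
from typing import Any, Dict

# Only the metric names mentioned in this story; the real model defines more.
STRUCTURE_METRICS = ("title_exact_match", "h1_exact", "h2_total")


def build_project_fields(
    user_id: int, parsed: Dict[str, Any], money_site_url: str
) -> Dict[str, Any]:
    """Assemble the values for a simple-ingestion project (hypothetical keys)."""
    fields: Dict[str, Any] = {
        "user_id": user_id,
        "main_keyword": parsed["main_keyword"],
        "project_name": parsed["project_name"],
        "word_count": parsed.get("word_count", 1500),
        "term_frequency": parsed.get("term_frequency", 3),
        # Lists are kept as plain Python lists so they serialize to JSON arrays.
        "entities": list(parsed.get("entities", [])),
        "related_searches": list(parsed.get("related_searches", [])),
        "money_site_url": money_site_url,
    }
    # Structure metrics are not available without a CORA report, so store NULL.
    for metric in STRUCTURE_METRICS:
        fields[metric] = None
    # Assumption: created_at / updated_at are set by the model, not here.
    return fields


if __name__ == "__main__":
    example = build_project_fields(
        1,
        {
            "main_keyword": "best coffee makers",
            "project_name": "Coffee Project",
            "entities": ["coffee", "espresso", "brewing"],
        },
        "https://example.com",
    )
    print(json.dumps(example["entities"]))
```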

5. Error Handling

Status: PENDING

Graceful error handling for:

  • File not found errors
  • Invalid Excel file format
  • Missing required columns (main_keyword, project_name)
  • Empty or invalid comma-delimited lists (treated as empty arrays)
  • Authentication failures
  • Database errors
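
One way the parsing-related failures above could surface is through a single exception type that the CLI catches and prints. The sketch below is an assumption, not the project's code: `SpreadsheetParseError` and `validate_row` are hypothetical names, and only main_keyword and project_name are treated as hard failures because the list fields fall back to empty arrays.

```python
from typing import Any, Dict, Sequence


class SpreadsheetParseError(Exception):
    """Hypothetical error raised when the spreadsheet cannot be used."""


# Fields whose absence is an error; list fields degrade to empty arrays instead.
REQUIRED_FIELDS = ("main_keyword", "project_name")


def validate_row(headers: Sequence[Any], row: Dict[str, Any]) -> None:
    """Raise SpreadsheetParseError if headers or required values are missing."""
    has_header = any(h is not None and str(h).strip() for h in headers)
    if not has_header:
        raise SpreadsheetParseError("No headers found in spreadsheet")
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if value is None or not str(value).strip():
            raise SpreadsheetParseError(f"Required field '{field}' not found")
```

A caller could wrap the parse step in `try`/`except SpreadsheetParseError` and print the message with an "Error parsing spreadsheet:" prefix.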

Implementation Details

Files to Create/Modify

1. src/ingestion/parser.py - UPDATED

Add SimpleSpreadsheetParser class:

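from typing import Any, Dict, List

class SimpleSpreadsheetParser:
    """Parser for simple single-sheet spreadsheets with basic project data"""

    # Method bodies are omitted; only the interface is shown.
    def __init__(self, file_path: str) -> None: ...
    def _parse_comma_delimited(self, value: Any) -> List[str]: ...
    def parse(self) -> Dict[str, Any]: ...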

Key Features:

  • Reads first sheet of workbook
  • First row as headers (case-insensitive)
  • Second row as data values
  • Parses comma-delimited strings into arrays
  • Applies defaults for optional fields
  • Returns data structure compatible with ProjectRepository.create()
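
The header and data row handling listed above might look like the following. This is a hedged sketch: the story does not name the Excel library, so the use of openpyxl is an assumption, and `load_first_two_rows` and `rows_to_record` are hypothetical helper names.

```python
from typing import Any, Dict, Sequence, Tuple


def load_first_two_rows(file_path: str) -> Tuple[Sequence[Any], Sequence[Any]]:
    """Read the header row and first data row from the first sheet.

    Assumption: openpyxl is the Excel library; it is imported lazily so the
    rest of this sketch runs without it installed.
    """
    from openpyxl import load_workbook

    workbook = load_workbook(file_path, read_only=True, data_only=True)
    try:
        sheet = workbook.worksheets[0]
        rows = list(sheet.iter_rows(min_row=1, max_row=2, values_only=True))
    finally:
        workbook.close()
    header = rows[0] if rows else ()
    data = rows[1] if len(rows) > 1 else ()
    return header, data


def rows_to_record(header: Sequence[Any], data: Sequence[Any]) -> Dict[str, Any]:
    """Map header names (case-insensitive, trimmed) to the data row's values."""
    record: Dict[str, Any] = {}
    for index, name in enumerate(header):
        if name is None or not str(name).strip():
            continue  # skip unnamed columns
        key = str(name).strip().lower()
        # A data row shorter than the header leaves trailing fields as None.
        record[key] = data[index] if index < len(data) else None
    return record
```

Keeping the row-to-dictionary step separate from file access makes the header handling easy to test without a real workbook.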

2. src/cli/commands.py - UPDATED

Add ingest-simple command:

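import click

# `app` is assumed to be the existing click command group in this module.
@app.command()
@click.option('--file', '-f', required=True)
@click.option('--name', '-n', help='Override project_name from spreadsheet')
@click.option('--money-site-url', '-m')
@click.option('--username', '-u')
@click.option('--password', '-p')
def ingest_simple(file, name, money_site_url, username, password):
    ...  # body omitted; see Features below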

Features:

  • Authenticate user
  • Parse simple spreadsheet
  • Display parsed data summary
  • Prompt for money_site_url if not provided
  • Create project via ProjectRepository
  • Show success summary
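
The override and prompt steps in this flow can be illustrated independently of the CLI framework. In the sketch below, `resolve_cli_inputs` is a hypothetical helper and the injectable `prompt` callable is an assumption made so the logic can be exercised without a terminal; a real command would more likely call `click.prompt`.

```python
from typing import Any, Callable, Dict, Optional

PROMPT_TEXT = "Enter money site URL (required for tiered linking): "


def resolve_cli_inputs(
    parsed: Dict[str, Any],
    name_override: Optional[str],
    money_site_url: Optional[str],
    prompt: Callable[[str], str] = input,
) -> Dict[str, Any]:
    """Apply --name and --money-site-url on top of the parsed spreadsheet data."""
    project = dict(parsed)
    # --name, when given, wins over the spreadsheet's project_name.
    if name_override:
        project["project_name"] = name_override
    # Ask interactively only when the option was not supplied.
    if not money_site_url:
        money_site_url = prompt(PROMPT_TEXT)
    project["money_site_url"] = money_site_url.strip()
    return project
```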

Data Model

Uses existing Project model - no database changes required. Structure metrics will be NULL for simple ingestion projects.

Spreadsheet Example

Simple Format:

| main_keyword | project_name | related_searches | entities | word_count | term_frequency |
|---|---|---|---|---|---|
| best coffee makers | Coffee Project | best espresso machines, coffee maker reviews, top coffee makers | coffee, espresso, brewing | 1500 | 3 |

Minimal Format (uses defaults):

| main_keyword | project_name | related_searches | entities |
|---|---|---|---|
| shaft machining | Machining Project | CNC machining, precision machining | machining, lathe, milling |

CLI Usage

Basic:

python main.py ingest-simple \
  --file simple_project.xlsx \
  --username admin \
  --password pass

With Overrides:

python main.py ingest-simple \
  --file simple_project.xlsx \
  --name "Custom Project Name" \
  --money-site-url https://example.com \
  --username admin \
  --password pass

Expected Output:

Authenticated as: admin (Admin)

Parsing simple spreadsheet: simple_project.xlsx
Main Keyword: best coffee makers
Project Name: Coffee Project
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3
  Entities: coffee, espresso, brewing
  Related Searches: best espresso machines, coffee maker reviews, top coffee makers

Enter money site URL (required for tiered linking): https://moneysite.com

Creating project: Coffee Project
Money Site URL: https://moneysite.com

Success: Project 'Coffee Project' created (ID: 1)
Main Keyword: best coffee makers
Money Site URL: https://moneysite.com
Word Count: 1500
Term Frequency: 3
Entities: 3
Related Searches: 3

Error Handling Examples

Missing Required Column:

Error parsing spreadsheet: Required field 'main_keyword' not found

Invalid File:

Error parsing spreadsheet: Failed to open Excel file: [details]

Empty Spreadsheet:

Error parsing spreadsheet: No headers found in spreadsheet

Dependencies

  • Story 2.1 (CORA ingestion) - Reuses ProjectRepository and Project model
  • Existing authentication system
  • Existing database models

Future Enhancements

  • Support for multiple projects per spreadsheet (multiple data rows)
  • CSV format support (in addition to Excel)
  • Web form interface (deferred to future story)
  • Validation of comma-delimited format with better error messages