# Epic 2: Content Ingestion & Generation

## Epic Goal

Implement the core workflow for ingesting CORA data and using AI to generate and format content into HTML that adheres to specific quality and SEO standards.

## Stories

### Story 2.1: CORA Report Data Ingestion

**As a User**, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data, including keywords, entities, related searches, and optional anchor text overrides.

**Acceptance Criteria**

- A CLI command is available to accept the path to a CORA .xlsx file.
- The script correctly extracts the specified data points from the spreadsheet (main keyword, entities, related searches, etc.).
- The script must also check for and store any optional, explicitly defined anchor text provided in the spreadsheet.
- A new project record is created in the database, associated with the authenticated user.
- The extracted SEO data is stored correctly in the new project record.
- The script handles errors gracefully if the file is not found or is in an incorrect format.

### Story 2.2: Configurable Content Rule Engine

**As an Admin**, I want to define specific content structure and quality rules in the master configuration, so that all AI-generated content consistently meets my SEO and quality standards.

**Acceptance Criteria**

- The system must load a "content_rules" object from the master JSON configuration file.
- The rule engine must validate that the
`<h1>` tag contains the main keyword from the project's data.
- The engine must validate that at least one `<h2>` tag starts with the main keyword.
- The engine must validate that other `<h2>` tags incorporate entities and related searches from the project's data.
- The engine must validate that at least one `<h3>` tag starts with the main keyword, and that others contain a mix of the keyword, entities, and related searches.
- The engine must validate a dedicated FAQ section where each question is an `<h3>` tag.
- The engine must enforce that the answer text for each FAQ `<h3>`
begins by restating the question.
- For any AI-generated images, the engine must validate that the alt text contains the main keyword and associated entities.
- For interlinks, the engine must use the explicitly provided anchor text from the project data if it exists.
- If no explicit anchor text is provided, the engine must generate a default anchor text using a combination of the linked article's main keyword, entities, and related searches.
- The anchor text for the link to the home page must be the custom FQDN if one is mapped; otherwise, it must be the main keyword of the site/bucket.
- The anchor text for the link to the existing random article should be that article's main keyword.

### Story 2.3: AI-Powered Content Generation

**As a User**, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.

**Acceptance Criteria**

- A script can be initiated for a specific project ID.
- The script uses the project's SEO data to prompt an AI model for a title, an outline, and the main body content.
- The content generation process must apply and validate against the rules defined and loaded by the content rule engine (Story 2.2).
- The generated title, outline, and text are stored and associated with the project in the database.
- The process logs its progress (e.g., "Generating title...", "Generating content...").
- The script can handle potential API errors from the AI service.

### Story 2.4: HTML Formatting with Multiple Templates

**As a developer**, I want a module that takes the generated text content and formats it into a standard HTML file using one of a few predefined CSS templates, assigning one template per bucket/subdomain, so that all deployed content has a consistent look and feel per site.

**Acceptance Criteria**

- A directory of multiple, predefined HTML/CSS templates exists.
- The master JSON configuration file maps a specific template to each deployment target (e.g., S3 bucket, subdomain).
- A function accepts the generated content and a target identifier (e.g., bucket name).
- The function correctly selects and applies the appropriate template based on the configuration mapping.
- The content is structured into a valid HTML document with the selected CSS.
- The final HTML content is stored and associated with the project in the database.

**Dependencies**

- Story 2.5 (optional): If no `site_deployment_id` is assigned, template selection defaults to random.

### Story 2.5: Deployment Target Assignment

**As a developer**, I want to assign deployment targets to generated content during the content generation process, so that each article knows which site/bucket it will be deployed to and can use the appropriate template.

**Acceptance Criteria**

- The job configuration file supports an optional `deployment_targets` array containing site custom_hostnames or site_deployment_ids.
- The job configuration file supports an optional `deployment_overflow` strategy ("round_robin", "random_available", or "none").
- During content generation, each article is assigned a `site_deployment_id` based on its index in the batch:
  - If `deployment_targets` is specified, cycle through the list (round-robin by default).
  - If the batch size exceeds the length of the target list, apply the overflow strategy.
  - If no `deployment_targets` are specified, `site_deployment_id` remains null (random template in Story 2.4).
- The `site_deployment_id` is stored in the `GeneratedContent` record at creation time.
- Invalid site references in `deployment_targets` cause graceful errors with clear messages.
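
**Illustrative Sketch (non-normative)**

The index-based assignment in Story 2.5 could look roughly like the following. This is a minimal sketch, not the project's implementation. The function name `assign_site_deployment_id` and the `available_site_ids` parameter are hypothetical, and the sketch assumes that `deployment_targets` has already been resolved to integer IDs. It also assumes that "random_available" draws from a pool of other available sites and that "none" leaves overflow articles unassigned. The story does not spell out either behavior.

```python
import random
from typing import Optional, Sequence


def assign_site_deployment_id(
    index: int,
    deployment_targets: Sequence[int],
    deployment_overflow: str = "round_robin",
    available_site_ids: Sequence[int] = (),  # hypothetical pool for "random_available"
) -> Optional[int]:
    """Return the site_deployment_id for the article at `index` in the batch, or None."""
    if not deployment_targets:
        # No targets configured: stays null, so Story 2.4 picks a random template.
        return None

    if index < len(deployment_targets):
        # First pass through the list: one article per target, in order.
        return deployment_targets[index]

    # The batch is larger than the target list: apply the overflow strategy.
    if deployment_overflow == "round_robin":
        return deployment_targets[index % len(deployment_targets)]
    if deployment_overflow == "random_available":
        # Assumption: pick any site from a separately supplied pool.
        if not available_site_ids:
            raise ValueError("No available sites left for overflow assignment")
        return random.choice(list(available_site_ids))
    if deployment_overflow == "none":
        # Assumption: overflow articles are left unassigned.
        return None
    raise ValueError(f"Unknown deployment_overflow strategy: {deployment_overflow!r}")


targets = [11, 12, 13]
print([assign_site_deployment_id(i, targets) for i in range(5)])
# [11, 12, 13, 11, 12]
```

Keeping the function pure (batch index in, ID out) makes each overflow strategy easy to unit test before it is wired into the `GeneratedContent` creation step.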