Epic 2: Content Ingestion & Generation
Epic Goal
Implement the core workflow for ingesting CORA data and using AI to generate and format content into HTML that adheres to specific quality and SEO standards.
Stories
Story 2.1: CORA Report Data Ingestion
As a User, I want to run a script that ingests a CORA .xlsx file, so that a new project is created in the database with the necessary SEO data, including keywords, entities, related searches, and optional anchor text overrides.
Acceptance Criteria
- A CLI command is available to accept the path to a CORA .xlsx file.
- The script correctly extracts the specified data points from the spreadsheet (main keyword, entities, related searches, etc.).
- The script must also check for and store any optional, explicitly defined anchor text provided in the spreadsheet.
- A new project record is created in the database, associated with the authenticated user.
- The extracted SEO data is stored correctly in the new project record.
- The script handles errors gracefully if the file is not found or is in an incorrect format.
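The CLI entry point and its error handling could be sketched as follows. This is a minimal sketch, not the project's actual script: the names `CoraFileError`, `validate_cora_path`, and `main` are hypothetical, and the spreadsheet parsing and database write are left out because the CORA sheet layout is not described here. It only shows the "file not found" and "incorrect format" checks, relying on the fact that an .xlsx workbook is a ZIP container.

```python
import argparse
import sys
import zipfile
from pathlib import Path


class CoraFileError(Exception):
    """Raised when the CORA report cannot be used (hypothetical name)."""


def validate_cora_path(raw_path: str) -> Path:
    """Return the path if it looks like a readable .xlsx file, else raise."""
    path = Path(raw_path)
    if not path.is_file():
        raise CoraFileError(f"File not found: {path}")
    if path.suffix.lower() != ".xlsx":
        raise CoraFileError(f"Expected a .xlsx file, got: {path.name}")
    # An .xlsx workbook is a ZIP container, so a cheap format check is
    # possible before handing the file to a spreadsheet parser.
    if not zipfile.is_zipfile(path):
        raise CoraFileError(f"Not a valid .xlsx workbook: {path.name}")
    return path


def main(argv=None) -> int:
    """Parse arguments, validate the file, and return a process exit code."""
    parser = argparse.ArgumentParser(description="Ingest a CORA .xlsx report")
    parser.add_argument("file", help="path to the CORA .xlsx file")
    args = parser.parse_args(argv)
    try:
        path = validate_cora_path(args.file)
    except CoraFileError as exc:
        # Report a clear message instead of a traceback.
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    # Extraction and project creation would follow here.
    print(f"Ready to ingest {path}")
    return 0
```

Calling `main(["reports/cora.xlsx"])` returns 0 when the file passes both checks and 1 otherwise, so a wrapper script can pass the result to `sys.exit`.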
Story 2.2: Configurable Content Rule Engine
As an Admin, I want to define specific content structure and quality rules in the master configuration, so that all AI-generated content consistently meets my SEO and quality standards.
Acceptance Criteria
- The system must load a "content_rules" object from the master JSON configuration file.
- The rule engine must validate that the <...> tag contains the main keyword from the project's data.
- The engine must validate that at least one <...> tag starts with the main keyword.
- The engine must validate that other <...> tags incorporate entities and related searches from the project's data.
- The engine must validate that at least one <...> tag starts with the main keyword, and that others contain a mix of the keyword, entities, and related searches.
- The engine must validate a dedicated FAQ section where each question is an <...> tag.
- The engine must enforce that the answer text for each FAQ <...> begins by restating the question.
- For any AI-generated images, the engine must validate that the alt text contains the main keyword and associated entities.
- For interlinks, the engine must use the explicitly provided anchor text from the project data if it exists.
- If no explicit anchor text is provided, the engine must generate a default anchor text using a combination of the linked article's main keyword, entities, and related searches.
- The anchor text for the link to the home page must be the custom FQDN if one is mapped; otherwise, it must be the main keyword of the site/bucket.
- The anchor text for the link to the existing random article should be that article's main keyword.
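The anchor text rules above can be sketched as two small functions. This is an illustration under stated assumptions: the function and parameter names are hypothetical, and because the criteria do not say how the keyword, entities, and related searches are combined for the default, the sketch simply picks one of them at random.

```python
import random
from typing import Optional, Sequence


def interlink_anchor(
    explicit_anchor: Optional[str],
    main_keyword: str,
    entities: Sequence[str] = (),
    related_searches: Sequence[str] = (),
    rng: Optional[random.Random] = None,
) -> str:
    """Anchor text for a link to another generated article."""
    # An explicit override from the project data always wins.
    if explicit_anchor and explicit_anchor.strip():
        return explicit_anchor.strip()
    # No override: choose from the linked article's own SEO terms.
    # Picking a single term at random is an assumption, not a stated rule.
    candidates = [c for c in (main_keyword, *entities, *related_searches) if c]
    return (rng or random).choice(candidates)


def home_anchor(custom_fqdn: Optional[str], site_main_keyword: str) -> str:
    """Anchor text for the home page link: custom FQDN if mapped, else keyword."""
    return custom_fqdn if custom_fqdn else site_main_keyword
```

Passing a seeded `random.Random` as `rng` keeps the default choice reproducible in tests. The sketch assumes `main_keyword` is never empty, so the candidate list always has at least one entry.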
Story 2.3: AI-Powered Content Generation
As a User, I want to execute a job for a project that uses AI to generate a title, an outline, and full-text content, so that the core content is created automatically.
Acceptance Criteria
- A script can be initiated for a specific project ID.
- The script uses the project's SEO data to prompt an AI model for a title, an outline, and the main body content.
- The content generation process must apply and validate against the rules defined and loaded by the content rule engine (Story 2.2).
- The generated title, outline, and text are stored and associated with the project in the database.
- The process logs its progress (e.g., "Generating title...", "Generating content...").
- The script can handle potential API errors from the AI service.
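One way to combine the progress logging and API error handling is a small retry wrapper around each generation step. The sketch below is an assumption-heavy illustration: `AIServiceError` stands in for whatever exception the real AI client raises, `generate` is any callable that takes a prompt and returns text, and the prompts, retry count, and delay are placeholders. Rule validation (Story 2.2) and database storage are omitted.

```python
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("content_generation")


class AIServiceError(Exception):
    """Stand-in for the error type of the real AI client (hypothetical)."""


def call_with_retry(step: str, prompt: str, generate: Callable[[str], str],
                    attempts: int = 3, delay: float = 1.0) -> str:
    """Run one generation step, logging progress and retrying on API errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        logger.info("Generating %s... (attempt %d/%d)", step, attempt, attempts)
        try:
            return generate(prompt)
        except AIServiceError as exc:
            logger.warning("AI call for %s failed: %s", step, exc)
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            time.sleep(delay)
    raise AssertionError("unreachable")


def generate_project_content(seo: Dict[str, str],
                             generate: Callable[[str], str],
                             delay: float = 1.0) -> Dict[str, str]:
    """Generate title, then outline, then content, each built on the last."""
    keyword = seo["main_keyword"]
    title = call_with_retry("title", f"Write a title about: {keyword}",
                            generate, delay=delay)
    outline = call_with_retry("outline", f"Outline an article titled: {title}",
                              generate, delay=delay)
    content = call_with_retry("content", f"Write the article for:\n{outline}",
                              generate, delay=delay)
    return {"title": title, "outline": outline, "content": content}
```

Injecting `generate` as a parameter keeps the workflow independent of any particular AI provider and makes the error path easy to exercise with a fake that raises `AIServiceError`.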
Story 2.4: HTML Formatting with Multiple Templates
As a developer, I want a module that takes the generated text content and formats it into a standard HTML file using one of a few predefined CSS templates, assigning one template per bucket/subdomain, so that all deployed content has a consistent look and feel per site.
Acceptance Criteria
- A directory of multiple, predefined HTML/CSS templates exists.
- The master JSON configuration file maps a specific template to each deployment target (e.g., S3 bucket, subdomain).
- A function accepts the generated content and a target identifier (e.g., bucket name).
- The function correctly selects and applies the appropriate template based on the configuration mapping.
- The content is structured into a valid HTML document with the selected CSS.
- The final HTML content is stored and associated with the project in the database.
Dependencies
- Story 2.5 (optional): If no site_deployment_id is assigned, template selection defaults to random.
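Template selection and HTML assembly might look like the sketch below. The names `select_template` and `render_html` are hypothetical, as is the shape of the mapping (target identifier to template name). The random fallback for an unassigned target follows the dependency note above; applying the same fallback to a target that has no mapping entry is an assumption.

```python
import html
import random
from typing import Mapping, Optional, Sequence


def select_template(target: Optional[str],
                    template_map: Mapping[str, str],
                    available: Sequence[str],
                    rng: Optional[random.Random] = None) -> str:
    """Pick the template mapped to a deployment target, else a random one."""
    if target is not None and target in template_map:
        name = template_map[target]
        if name not in available:
            raise ValueError(f"Template '{name}' for '{target}' does not exist")
        return name
    # No target assigned (or, by assumption, no mapping entry): random default.
    return (rng or random).choice(list(available))


def render_html(title: str, body_html: str, css: str) -> str:
    """Wrap already-formatted body HTML in a complete document."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        f"<style>\n{css}\n</style>\n</head>\n"
        f"<body>\n{body_html}\n</body>\n</html>\n"
    )
```

For example, `select_template("site-a.example.com", {"site-a.example.com": "modern"}, ["basic", "modern"])` returns `"modern"`, while passing `None` as the target returns one of the available templates at random.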
Story 2.5: Deployment Target Assignment
As a developer, I want to assign deployment targets to tier1 content during the content generation process, so that high-quality tier1 articles know which site they will be deployed to and can use the appropriate template.
Acceptance Criteria
- The job configuration file supports an optional deployment_targets array containing site custom_hostnames.
- Only tier1 articles are assigned to deployment targets; tier2, tier3, etc. always get site_deployment_id = null.
- During tier1 content generation, each article is assigned a site_deployment_id based on its index:
  - If deployment_targets has N sites, tier1 articles 0 through N-1 get assigned round-robin.
  - Tier1 articles N and beyond get site_deployment_id = null.
  - If no deployment_targets specified, all tier1 articles get site_deployment_id = null.
- The site_deployment_id is stored in the GeneratedContent record at creation time.
- Invalid hostnames in deployment_targets cause graceful errors with clear messages.
- Validation occurs at job start (fail-fast approach).
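The assignment and fail-fast validation described in this story can be sketched as below. The function names and the `known_sites` lookup (hostname to site ID) are hypothetical; in the real system the lookup would presumably come from the database. `None` plays the role of null.

```python
from typing import List, Mapping, Optional, Sequence


def resolve_deployment_targets(hostnames: Sequence[str],
                               known_sites: Mapping[str, int]) -> List[int]:
    """Validate hostnames at job start and map them to site IDs (fail-fast)."""
    unknown = [h for h in hostnames if h not in known_sites]
    if unknown:
        raise ValueError(
            "Unknown deployment_targets hostnames: " + ", ".join(unknown)
        )
    return [known_sites[h] for h in hostnames]


def assign_site_deployment_ids(tier: str, article_count: int,
                               target_site_ids: Sequence[int]
                               ) -> List[Optional[int]]:
    """Return one site_deployment_id (or None) per article index."""
    # Only tier1 articles are ever assigned a deployment target.
    if tier != "tier1":
        return [None] * article_count
    # Articles 0..N-1 take the N targets in order; the rest stay unassigned.
    return [
        target_site_ids[i] if i < len(target_site_ids) else None
        for i in range(article_count)
    ]
```

With two valid targets and four tier1 articles, the first two articles receive the two site IDs in order and the last two receive `None`. An empty target list leaves every article unassigned.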