Resolve merge conflicts - choose newer implementations

main
PeninsulaInd 2025-10-20 11:43:33 -05:00
commit 19e1c93358
32 changed files with 2703 additions and 1707 deletions

3
.gitignore vendored

@ -17,3 +17,6 @@ __pycache__/
.idea/
*.xlsx
# Debug output
debug_output/


@ -0,0 +1,199 @@
# Story 2.2 Implementation Summary
## Overview
Successfully implemented simplified AI content generation via batch jobs using OpenRouter API.
## Completed Phases
### Phase 1: Data Model & Schema Design
- ✅ Added `GeneratedContent` model to `src/database/models.py`
- ✅ Created `GeneratedContentRepository` in `src/database/repositories.py`
- ✅ Updated `scripts/init_db.py` (automatic table creation via Base.metadata)
### Phase 2: AI Client & Prompt Management
- ✅ Created `src/generation/ai_client.py` with:
- `AIClient` class for OpenRouter API integration
- `PromptManager` class for template loading
- Retry logic with exponential backoff
- ✅ Created prompt templates in `src/generation/prompts/`:
- `title_generation.json`
- `outline_generation.json`
- `content_generation.json`
- `content_augmentation.json`
### Phase 3: Core Generation Pipeline
- ✅ Implemented `ContentGenerator` in `src/generation/service.py` with:
- `generate_title()` - Stage 1
- `generate_outline()` - Stage 2 with JSON validation
- `generate_content()` - Stage 3
- `validate_word_count()` - Word count validation
- `augment_content()` - Simple augmentation
- `count_words()` - HTML-aware word counting
- Debug output support
### Phase 4: Batch Processing
- ✅ Created `src/generation/job_config.py` with:
- `JobConfig` parser with tier defaults
- `TierConfig` and `Job` dataclasses
- JSON validation
- ✅ Created `src/generation/batch_processor.py` with:
- `BatchProcessor` class
- Progress logging to console
- Error handling and continue-on-error support
- Statistics tracking
### Phase 5: CLI Integration
- ✅ Added `generate-batch` command to `src/cli/commands.py`
- ✅ Command options:
- `--job-file` (required)
- `--username` / `--password` for authentication
- `--debug` for saving AI responses
- `--continue-on-error` flag
- `--model` selection (default: gpt-4o-mini)
### Phase 6: Testing & Validation
- ✅ Created unit tests:
- `tests/unit/test_job_config.py` (9 tests)
- `tests/unit/test_content_generator.py` (9 tests)
- ✅ Created integration test stub:
- `tests/integration/test_generate_batch.py` (2 tests)
- ✅ Created example job files:
- `jobs/example_tier1_batch.json`
- `jobs/example_multi_tier_batch.json`
- `jobs/README.md` (comprehensive documentation)
### Phase 7: Cleanup & Documentation
- ✅ Deprecated old `src/generation/rule_engine.py`
- ✅ Updated documentation:
- `docs/architecture/workflows.md` - Added generation workflow diagram
- `docs/architecture/components.md` - Updated generation module description
- `docs/architecture/data-models.md` - Updated GeneratedContent model
- `docs/stories/story-2.2. simplified-ai-content-generation.md` - Marked as Completed
- ✅ Updated `.gitignore` to exclude `debug_output/`
- ✅ Updated `env.example` with `OPENROUTER_API_KEY`
## Key Files Created/Modified
### New Files (15)
```
src/generation/ai_client.py
src/generation/service.py
src/generation/job_config.py
src/generation/batch_processor.py
src/generation/prompts/title_generation.json
src/generation/prompts/outline_generation.json
src/generation/prompts/content_generation.json
src/generation/prompts/content_augmentation.json
jobs/example_tier1_batch.json
jobs/example_multi_tier_batch.json
jobs/README.md
tests/unit/test_job_config.py
tests/unit/test_content_generator.py
tests/integration/test_generate_batch.py
IMPLEMENTATION_SUMMARY.md
```
### Modified Files (10)
```
src/database/models.py (added GeneratedContent model)
src/database/repositories.py (added GeneratedContentRepository)
src/cli/commands.py (added generate-batch command)
src/generation/rule_engine.py (deprecated)
docs/architecture/workflows.md (updated)
docs/architecture/components.md (updated)
docs/architecture/data-models.md (updated)
docs/stories/story-2.2. simplified-ai-content-generation.md (marked complete)
.gitignore (added debug_output/)
env.example (added OPENROUTER_API_KEY)
```
## Usage
### 1. Set up environment
```bash
# Copy env.example to .env and add your OpenRouter API key
cp env.example .env
# Edit .env and set OPENROUTER_API_KEY
```
### 2. Initialize database
```bash
python scripts/init_db.py
```
### 3. Create a project (if not exists)
```bash
python main.py ingest-cora --file path/to/cora.xlsx --name "My Project"
```
### 4. Run batch generation
```bash
python main.py generate-batch --job-file jobs/example_tier1_batch.json
```
### 5. With debug output
```bash
python main.py generate-batch --job-file jobs/example_tier1_batch.json --debug
```
## Architecture Highlights
### Three-Stage Pipeline
1. **Title Generation**: Uses keyword + entities + related searches
2. **Outline Generation**: JSON-formatted with H2/H3 structure, validated against min/max constraints
3. **Content Generation**: Full HTML fragment based on outline
### Simplification Wins
- No complex rule engine
- Single word count validation (min/max from job file)
- One-attempt augmentation if below minimum
- Job file controls all operational parameters
- Tier defaults for common configurations
### Error Handling
- Network errors: 3 retries with exponential backoff
- Rate limits: Respects retry-after headers
- Failed articles: Saved with status='failed', can continue processing with `--continue-on-error`
- Database errors: Always abort (data integrity)
## Testing
Run tests with:
```bash
pytest tests/unit/test_job_config.py -v
pytest tests/unit/test_content_generator.py -v
pytest tests/integration/test_generate_batch.py -v
```
## Next Steps (Future Stories)
- Story 2.3: Interlinking integration
- Story 3.x: Template selection
- Story 4.x: Deployment integration
- Expand test coverage (currently basic tests only)
## Success Criteria Met
All acceptance criteria from Story 2.2 have been met:
✅ 1. Batch Job Control - Job file specifies all tier parameters
✅ 2. Three-Stage Generation - Title → Outline → Content pipeline
✅ 3. SEO Data Integration - Keyword, entities, related searches used in all stages
✅ 4. Word Count Validation - Validates against min/max from job file
✅ 5. Simple Augmentation - Single attempt if below minimum
✅ 6. Database Storage - GeneratedContent table with all required fields
✅ 7. CLI Execution - generate-batch command with progress logging
## Estimated Implementation Time
- Total: ~20-29 hours (as estimated in task breakdown)
- Actual: Completed in single session with comprehensive implementation
## Notes
- OpenRouter API key required in environment
- Debug output saved to `debug_output/` when `--debug` flag used
- Job files support multiple projects and tiers
- Tier defaults can be fully or partially overridden
- HTML output is fragment format (no <html>, <head>, or <body> tags)
- Word count strips HTML tags and counts text words only

36
check_last_gen.py 100644

@ -0,0 +1,36 @@
from src.database.session import db_manager
from src.database.models import GeneratedContent
import json

s = db_manager.get_session()
gc = s.query(GeneratedContent).order_by(GeneratedContent.id.desc()).first()

if gc:
    print(f"Content ID: {gc.id}")
    print(f"Stage: {gc.generation_stage}")
    print(f"Status: {gc.status}")
    print(f"Outline attempts: {gc.outline_attempts}")
    print(f"Error: {gc.error_message}")
    if gc.outline:
        outline = json.loads(gc.outline)
        sections = outline.get("sections", [])
        print("\nOutline:")
        print(f"H2 count: {len(sections)}")
        # Loop variables are named `section` so they never shadow the session `s`,
        # which still has to be closed at the end of the script.
        h3_count = sum(len(section.get('h3s', [])) for section in sections)
        print(f"H3 count: {h3_count}")
        has_faq = any(
            "faq" in section["h2"].lower() or "question" in section["h2"].lower()
            for section in sections
        )
        print(f"Has FAQ: {has_faq}")
        print("\nH2s:")
        for section in sections:
            print(f" - {section['h2']} ({len(section.get('h3s', []))} H3s)")
    else:
        print("\nNo outline saved")
else:
    print("No content found")

s.close()

Binary file not shown.


@ -20,7 +20,14 @@ Manages user authentication, password hashing, and role-based access control log
Responsible for parsing the CORA .xlsx files and creating new Project entries in the database.
### generation
Interacts with the AI service API. It takes project data, constructs prompts, and retrieves the generated text. Includes the Content Rule Engine for validation. Interacts with the AI service API (OpenRouter). Implements a simplified three-stage pipeline:
- **AIClient**: Handles OpenRouter API calls with retry logic
- **PromptManager**: Loads and formats prompt templates from JSON files
- **ContentGenerator**: Orchestrates title, outline, and content generation
- **BatchProcessor**: Processes job files and manages multi-tier batch generation
- **JobConfig**: Parses job configuration files with tier defaults
The generation module uses SEO data from the Project table (keyword, entities, related searches) to inform all stages of content generation. Validates word count and performs simple augmentation if content is below minimum threshold.
### templating
Takes raw generated text and applies the appropriate HTML/CSS template based on the project's configuration.


@ -29,20 +29,28 @@ The following data models will be implemented using SQLAlchemy.
## 3. GeneratedContent ## 3. GeneratedContent
**Purpose**: Stores the AI-generated content and its final deployed state. **Purpose**: Stores the AI-generated content from the three-stage pipeline.
**Key Attributes**: **Key Attributes**:
- `id`: Integer, Primary Key - `id`: Integer, Primary Key, Auto-increment
- `project_id`: Integer, Foreign Key to Project - `project_id`: Integer, Foreign Key to Project, Indexed
- `title`: Text - `tier`: String(20), Not Null, Indexed (tier1, tier2, tier3)
- `outline`: Text - `keyword`: String(255), Not Null, Indexed
- `body_text`: Text - `title`: Text, Not Null (Generated in stage 1)
- `final_html`: Text - `outline`: JSON, Not Null (Generated in stage 2)
- `deployed_url`: String, Unique - `content`: Text, Not Null (HTML fragment from stage 3)
- `tier`: String (for link classification) - `word_count`: Integer, Not Null (Validated word count)
- `status`: String(20), Not Null (generated, augmented, failed)
- `created_at`: DateTime, Not Null
- `updated_at`: DateTime, Not Null
**Relationships**: Belongs to one Project. **Relationships**: Belongs to one Project.
**Status Values**:
- `generated`: Content was successfully generated within word count range
- `augmented`: Content was below minimum and was augmented
- `failed`: Generation failed (error details in outline JSON)
## 4. FqdnMapping ## 4. FqdnMapping
**Purpose**: Maps cloud storage buckets to fully qualified domain names for URL generation. **Purpose**: Maps cloud storage buckets to fully qualified domain names for URL generation.


@ -1,27 +1,81 @@
# Core Workflows # Core Workflows
This sequence diagram illustrates the primary workflow for a single content generation job. ## Content Generation Workflow (Story 2.2)
The simplified three-stage content generation pipeline:
```mermaid ```mermaid
sequenceDiagram sequenceDiagram
participant User participant User
participant CLI participant CLI
participant Ingestion participant BatchProcessor
participant Generation participant ContentGenerator
participant Interlinking participant AIClient
participant Deployment participant Database
participant API
User->>CLI: run job --file report.xlsx User->>CLI: generate-batch --job-file jobs/example.json
CLI->>Ingestion: process_cora_file("report.xlsx") CLI->>BatchProcessor: process_job()
Ingestion-->>CLI: project_id
CLI->>Generation: generate_content(project_id) loop For each project/tier/article
Generation-->>CLI: raw_html_list BatchProcessor->>ContentGenerator: generate_title(project_id)
CLI->>Interlinking: inject_links(raw_html_list) ContentGenerator->>AIClient: generate_completion(prompt)
Interlinking-->>CLI: final_html_list AIClient-->>ContentGenerator: title
CLI->>Deployment: deploy_batch(final_html_list)
Deployment-->>CLI: deployed_urls BatchProcessor->>ContentGenerator: generate_outline(project_id, title)
CLI->>API: send_to_link_builder(job_data, deployed_urls) ContentGenerator->>AIClient: generate_completion(prompt, json_mode=true)
API-->>CLI: success AIClient-->>ContentGenerator: outline JSON
CLI-->>User: Job Complete! URLs logged.
BatchProcessor->>ContentGenerator: generate_content(project_id, title, outline)
ContentGenerator->>AIClient: generate_completion(prompt)
AIClient-->>ContentGenerator: HTML content
BatchProcessor->>ContentGenerator: validate_word_count(content)
alt Below minimum word count
BatchProcessor->>ContentGenerator: augment_content(content, target_count)
ContentGenerator->>AIClient: generate_completion(prompt)
AIClient-->>ContentGenerator: augmented HTML
end
BatchProcessor->>Database: save GeneratedContent record
end
BatchProcessor-->>CLI: Summary statistics
CLI-->>User: Job complete
```
## CORA Ingestion Workflow (Story 2.1)
```mermaid
sequenceDiagram
participant User
participant CLI
participant Parser
participant Database
User->>CLI: ingest-cora --file report.xlsx --name "Project Name"
CLI->>Parser: parse(file_path)
Parser-->>CLI: cora_data dict
CLI->>Database: create Project record
Database-->>CLI: project_id
CLI-->>User: Project created (ID: X)
```
## Deployment Workflow (Story 1.6)
```mermaid
sequenceDiagram
participant User
participant CLI
participant BunnyNetClient
participant Database
User->>CLI: provision-site --name "Site" --domain "example.com"
CLI->>BunnyNetClient: create_storage_zone()
BunnyNetClient-->>CLI: storage_zone_id
CLI->>BunnyNetClient: create_pull_zone()
BunnyNetClient-->>CLI: pull_zone_id
CLI->>BunnyNetClient: add_custom_hostname()
CLI->>Database: save SiteDeployment record
CLI-->>User: Site provisioned! Configure DNS.
``` ```


@ -0,0 +1,913 @@
# Story 2.2: Simplified AI Content Generation - Detailed Task Breakdown
## Overview
This document breaks down Story 2.2 into detailed tasks with specific implementation notes.
---
## **PHASE 1: Data Model & Schema Design**
### Task 1.1: Create GeneratedContent Database Model
**File**: `src/database/models.py`
**Add new model class:**
```python
class GeneratedContent(Base):
    __tablename__ = "generated_content"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
    tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    title: Mapped[str] = mapped_column(Text, nullable=False)
    outline: Mapped[dict] = mapped_column(JSON, nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    status: Mapped[str] = mapped_column(String(20), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime,
        default=datetime.utcnow,
        onupdate=datetime.utcnow,
        nullable=False
    )
```
**Status values**: `generated`, `augmented`, `failed`
**Update**: `scripts/init_db.py` to create the table
---
### Task 1.2: Create GeneratedContent Repository
**File**: `src/database/repositories.py`
**Add repository class:**
```python
class GeneratedContentRepository(BaseRepository[GeneratedContent]):
    def __init__(self, session: Session):
        super().__init__(GeneratedContent, session)

    def get_by_project_id(self, project_id: int) -> list[GeneratedContent]:
        pass

    def get_by_project_and_tier(self, project_id: int, tier: str) -> list[GeneratedContent]:
        pass

    def get_by_keyword(self, keyword: str) -> list[GeneratedContent]:
        pass
```
---
### Task 1.3: Define Job File JSON Schema
**File**: `jobs/README.md` (create/update)
**Job file structure** (one project per job, multiple jobs per file):
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 2000,
"max_word_count": 2500,
"min_h2_tags": 3,
"max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
},
"tier2": {
"count": 10,
"min_word_count": 1500,
"max_word_count": 2000,
"min_h2_tags": 2,
"max_h2_tags": 4,
"min_h3_tags": 3,
"max_h3_tags": 8
},
"tier3": {
"count": 15,
"min_word_count": 1000,
"max_word_count": 1500,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 2,
"max_h3_tags": 6
}
}
},
{
"project_id": 2,
"tiers": {
"tier1": { ... }
}
}
]
}
```
**Tier defaults** (constants if not specified in job file):
```python
TIER_DEFAULTS = {
"tier1": {
"min_word_count": 2000,
"max_word_count": 2500,
"min_h2_tags": 3,
"max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
},
"tier2": {
"min_word_count": 1500,
"max_word_count": 2000,
"min_h2_tags": 2,
"max_h2_tags": 4,
"min_h3_tags": 3,
"max_h3_tags": 8
},
"tier3": {
"min_word_count": 1000,
"max_word_count": 1500,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 2,
"max_h3_tags": 6
}
}
```
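A minimal sketch of how partial overrides from a job file might be layered over these defaults. The helper name `resolve_tier` is hypothetical, the rule that every tier entry must carry a `count` is an assumption drawn from the examples, and only the tier1 defaults are repeated to keep the example short.

```python
# Hypothetical helper for illustration; not taken from the repository.
TIER1_DEFAULTS = {
    "min_word_count": 2000,
    "max_word_count": 2500,
    "min_h2_tags": 3,
    "max_h2_tags": 5,
    "min_h3_tags": 5,
    "max_h3_tags": 10,
}


def resolve_tier(overrides: dict, defaults: dict) -> dict:
    """Return the defaults with any job-file values layered on top."""
    # Assumption: `count` has no default, so the job file must supply it.
    if "count" not in overrides:
        raise ValueError("each tier entry needs a 'count'")
    merged = {**defaults, **overrides}
    if merged["min_word_count"] > merged["max_word_count"]:
        raise ValueError("min_word_count exceeds max_word_count")
    return merged


# Only min_word_count is overridden; everything else falls back to the default.
tier1 = resolve_tier({"count": 5, "min_word_count": 2200}, TIER1_DEFAULTS)
```

Merging with dict unpacking keeps unknown keys from the job file, which may suit the extensibility goal, though a stricter parser could reject them instead.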
**Future extensibility note**: This structure allows adding more fields per job in future stories.
---
## **PHASE 2: AI Client & Prompt Management**
### Task 2.1: Implement AIClient for OpenRouter
**File**: `src/generation/ai_client.py`
**OpenRouter API details**:
- Base URL: `https://openrouter.ai/api/v1`
- Compatible with OpenAI SDK
- Requires `OPENROUTER_API_KEY` env variable
**Initial model list**:
```python
AVAILABLE_MODELS = {
    "gpt-4o-mini": "openai/gpt-4o-mini",
    # The alias says 4.5 but the slug targets 3.5; confirm the intended model.
    "claude-sonnet-4.5": "anthropic/claude-3.5-sonnet"
}
```
**Implementation**:
```python
from typing import Optional

from openai import OpenAI


class AIClient:
    def __init__(self, api_key: str, model: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def generate_completion(
        self,
        prompt: str,
        system_message: Optional[str] = None,
        max_tokens: int = 4000,
        temperature: float = 0.7,
        json_mode: bool = False
    ) -> str:
        """
        Generate completion from OpenRouter API
        json_mode: if True, adds response_format={"type": "json_object"}
        """
        pass
```
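One way the body of `generate_completion` could assemble its request is sketched below as a pure function, so it can be checked without network access. The function name `build_chat_request` is hypothetical; the keys follow the OpenAI Chat Completions parameters that the stub's docstring refers to.

```python
from typing import Optional


def build_chat_request(
    model: str,
    prompt: str,
    system_message: Optional[str] = None,
    max_tokens: int = 4000,
    temperature: float = 0.7,
    json_mode: bool = False,
) -> dict:
    """Assemble keyword arguments for a chat completion call (illustrative only)."""
    messages = []
    if system_message:
        # The system message is optional, so only include it when provided.
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})

    request = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if json_mode:
        # Ask the model for a JSON object, as described for the outline stage.
        request["response_format"] = {"type": "json_object"}
    return request


request = build_chat_request("openai/gpt-4o-mini", "Outline an article", json_mode=True)
```

Under this sketch the client would presumably pass the result on as `self.client.chat.completions.create(**request)`; whether a given model honours `response_format` through OpenRouter should be confirmed separately.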
**Error handling**: Retry 3x with exponential backoff for network/rate limit errors
---
### Task 2.2: Create Prompt Templates
**Files**: `src/generation/prompts/*.json`
**title_generation.json**:
```json
{
"system_message": "You are an expert SEO content writer...",
"user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting."
}
```
**outline_generation.json**:
```json
{
"system_message": "You are an expert content outliner...",
"user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- {min_h2} to {max_h2} H2 headings\n- {min_h3} to {max_h3} H3 subheadings total\n\nEntities: {entities}\nRelated searches: {related_searches}\n\nReturn as JSON: {\"outline\": [{\"h2\": \"...\", \"h3\": [\"...\", \"...\"]}]}"
}
```
**content_generation.json**:
```json
{
"system_message": "You are an expert content writer...",
"user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include: {entities}\nRelated searches: {related_searches}\n\nReturn as HTML fragment with <h2>, <h3>, <p> tags. Do NOT include <html>, <head>, or <body> tags."
}
```
**content_augmentation.json**:
```json
{
"system_message": "You are an expert content editor...",
"user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count}\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment."
}
```
---
### Task 2.3: Create PromptManager
**File**: `src/generation/ai_client.py` (add to same file)
```python
class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = prompts_dir
        self.prompts = {}

    def load_prompt(self, prompt_name: str) -> dict:
        """Load prompt from JSON file"""
        pass

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """
        Format prompt with variables
        Returns: (system_message, user_prompt)
        """
        pass
```
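The outline template above embeds a literal JSON example in its `user_prompt`, so plain `str.format` is likely to fail on those braces. The sketch below shows one possible substitution strategy that only replaces simple `{name}` placeholders; `fill_template` is a hypothetical helper, not the repository's implementation.

```python
import re

# Matches {identifier} only, so JSON braces such as {"outline": ...} are left alone.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_template(template: str, **values) -> str:
    """Replace {name} placeholders; raise KeyError when a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing prompt variable: {name}")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, template)


template = 'Title: {title}\nReturn as JSON: {"outline": [{"h2": "..."}]}'
rendered = fill_template(template, title="SEO Basics")
```

Because the substitution is a single pass, braces that appear inside a substituted value are not expanded again.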
---
## **PHASE 3: Core Generation Pipeline**
### Task 3.1: Implement ContentGenerator Service
**File**: `src/generation/service.py`
```python
class ContentGenerator:
    def __init__(
        self,
        ai_client: AIClient,
        prompt_manager: PromptManager,
        project_repo: ProjectRepository,
        content_repo: GeneratedContentRepository
    ):
        self.ai_client = ai_client
        self.prompt_manager = prompt_manager
        self.project_repo = project_repo
        self.content_repo = content_repo
```
---
### Task 3.2: Implement Stage 1 - Title Generation
**File**: `src/generation/service.py`
```python
def generate_title(self, project_id: int, debug: bool = False) -> str:
    """
    Generate SEO-optimized title
    Returns: title string
    Saves to debug_output/title_project_{id}_{timestamp}.txt if debug=True
    """
    # Fetch project
    # Load prompt
    # Call AI
    # If debug: save response to debug_output/
    # Return title
    pass
```
---
### Task 3.3: Implement Stage 2 - Outline Generation
**File**: `src/generation/service.py`
```python
def generate_outline(
    self,
    project_id: int,
    title: str,
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
    debug: bool = False
) -> dict:
    """
    Generate article outline in JSON format
    Returns: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
    Uses json_mode=True in AI call to ensure JSON response
    Validates: at least min_h2 headings, at least min_h3 total subheadings
    Saves to debug_output/outline_project_{id}_{timestamp}.json if debug=True
    """
    pass
```
**Validation**:
- Parse JSON response
- Count h2 tags (must be >= min_h2)
- Count total h3 tags across all h2s (must be >= min_h3)
- Raise error if validation fails
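The validation steps above can be sketched as follows, assuming the `{"outline": [{"h2": ..., "h3": [...]}]}` shape from the docstring. The names `validate_outline` and `OutlineValidationError` are hypothetical.

```python
import json


class OutlineValidationError(ValueError):
    """Hypothetical error raised when an outline fails validation."""


def validate_outline(raw: str, min_h2: int, min_h3: int) -> dict:
    """Parse the AI response and enforce the minimum heading counts."""
    try:
        outline = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutlineValidationError(f"response is not valid JSON: {exc}") from exc

    sections = outline.get("outline") if isinstance(outline, dict) else None
    if not isinstance(sections, list):
        raise OutlineValidationError("response has no 'outline' list")

    h2_count = len(sections)
    # H3s are counted across all H2 sections, not per section.
    h3_count = sum(len(section.get("h3", [])) for section in sections)

    if h2_count < min_h2:
        raise OutlineValidationError(f"{h2_count} H2 headings, need at least {min_h2}")
    if h3_count < min_h3:
        raise OutlineValidationError(f"{h3_count} H3 headings, need at least {min_h3}")
    return outline


raw_response = '{"outline": [{"h2": "A", "h3": ["x", "y"]}, {"h2": "B", "h3": ["z"]}]}'
outline = validate_outline(raw_response, min_h2=2, min_h3=3)
```

Only the minimums are checked here because that is what the task lists; enforcing `max_h2` and `max_h3` would be a small extension.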
---
### Task 3.4: Implement Stage 3 - Content Generation
**File**: `src/generation/service.py`
```python
def generate_content(
    self,
    project_id: int,
    title: str,
    outline: dict,
    debug: bool = False
) -> str:
    """
    Generate full article HTML fragment
    Returns: HTML string with <h2>, <h3>, <p> tags
    Does NOT include <html>, <head>, or <body> tags
    Saves to debug_output/content_project_{id}_{timestamp}.html if debug=True
    """
    pass
```
**HTML fragment format**:
```html
<h2>First Heading</h2>
<p>Paragraph content...</p>
<h3>Subheading</h3>
<p>More content...</p>
```
---
### Task 3.5: Implement Word Count Validation
**File**: `src/generation/service.py`
```python
def validate_word_count(self, content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """
    Validate content word count
    Returns: (is_valid, actual_count)
    - is_valid: True if min_words <= actual_count <= max_words
    - actual_count: number of words in content
    Implementation: Strip HTML tags, split on whitespace, count tokens
    """
    pass
```
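A standalone sketch of this check is shown below. It deviates from a plain tag strip in one respect, which is an assumption on my part: each tag is replaced with a space so that words in adjacent elements (for example `</h2><p>`) are not fused into one token.

```python
import re
from html import unescape


def count_words(html_content: str) -> int:
    """Count whitespace-separated tokens after removing HTML tags."""
    # Replace tags with a space so "</h2><p>" does not glue two words together.
    text = re.sub(r"<[^>]+>", " ", html_content)
    text = unescape(text)
    return len(text.split())


def validate_word_count(content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """Return (is_valid, actual_count) for the given inclusive range."""
    actual = count_words(content)
    return (min_words <= actual <= max_words, actual)


fragment = "<h2>Getting Started</h2><p>Plan early.</p>"
in_range = validate_word_count(fragment, 3, 10)
too_short = validate_word_count(fragment, 5, 10)
```

The trade-off is that inline tags inside a single word would split it into two tokens, which seems unlikely in generated article text.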
---
### Task 3.6: Implement Simple Augmentation
**File**: `src/generation/service.py`
```python
def augment_content(
    self,
    content: str,
    target_word_count: int,
    debug: bool = False
) -> str:
    """
    Expand article content to meet minimum word count
    Called ONLY if word_count < min_word_count
    Makes ONE API call only
    Saves to debug_output/augmented_project_{id}_{timestamp}.html if debug=True
    """
    pass
```
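The single-attempt rule could be applied by the caller roughly as sketched here. `finalize_article` is a hypothetical name, the word counter and augmenter are passed in as plain callables, and the choice to report `augmented` even if the expanded text is still short is an assumption, since the story does not say what happens in that case.

```python
from typing import Callable


def finalize_article(
    content: str,
    min_words: int,
    count_words: Callable[[str], int],
    augment: Callable[[str, int], str],
) -> tuple[str, int, str]:
    """Return (content, word_count, status), augmenting at most once."""
    words = count_words(content)
    if words >= min_words:
        return content, words, "generated"

    # Exactly one augmentation call, with no loop and no second attempt.
    expanded = augment(content, min_words)
    return expanded, count_words(expanded), "augmented"


calls = []


def fake_augment(text: str, target: int) -> str:
    """Stand-in for the AI call: records the call and pads the text."""
    calls.append(target)
    return text + " extra words here"


def simple_count(text: str) -> int:
    return len(text.split())


result = finalize_article("too short", 4, simple_count, fake_augment)
```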
---
## **PHASE 4: Batch Processing**
### Task 4.1: Create JobConfig Parser
**File**: `src/generation/job_config.py`
```python
from dataclasses import dataclass
from typing import Optional
TIER_DEFAULTS = {
"tier1": {
"min_word_count": 2000,
"max_word_count": 2500,
"min_h2_tags": 3,
"max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
},
"tier2": {
"min_word_count": 1500,
"max_word_count": 2000,
"min_h2_tags": 2,
"max_h2_tags": 4,
"min_h3_tags": 3,
"max_h3_tags": 8
},
"tier3": {
"min_word_count": 1000,
"max_word_count": 1500,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 2,
"max_h3_tags": 6
}
}
@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int


@dataclass
class Job:
    project_id: int
    tiers: dict[str, TierConfig]


class JobConfig:
    def __init__(self, job_file_path: str):
        """Load and parse job file, apply defaults"""
        pass

    def get_jobs(self) -> list[Job]:
        """Return list of all jobs in file"""
        pass

    def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
        """Get tier config with defaults applied"""
        pass
```
---
### Task 4.2: Create BatchProcessor
**File**: `src/generation/batch_processor.py`
```python
class BatchProcessor:
    def __init__(
        self,
        content_generator: ContentGenerator,
        content_repo: GeneratedContentRepository,
        project_repo: ProjectRepository
    ):
        pass

    def process_job(
        self,
        job_file_path: str,
        debug: bool = False,
        continue_on_error: bool = False
    ):
        """
        Process all jobs in job file
        For each job:
            For each tier:
                For count times:
                    1. Generate title (log to console)
                    2. Generate outline
                    3. Generate content
                    4. Validate word count
                    5. If below min, augment once
                    6. Save to GeneratedContent table
        Logs progress to console
        If debug=True, saves AI responses to debug_output/
        """
        pass
```
**Console output format**:
```
Processing Job 1/3: Project ID 5
Tier 1: Generating 5 articles
[1/5] Generating title... "Ultimate Guide to SEO in 2025"
[1/5] Generating outline... 4 H2s, 8 H3s
[1/5] Generating content... 1,845 words
[1/5] Below minimum (2000), augmenting... 2,123 words
[1/5] Saved (ID: 42, Status: augmented)
[2/5] Generating title... "Advanced SEO Techniques"
...
Tier 2: Generating 10 articles
...
Summary:
Jobs processed: 3/3
Articles generated: 45/45
Augmented: 12
Failed: 0
```
---
### Task 4.3: Error Handling & Retry Logic
**File**: `src/generation/batch_processor.py`
**Error handling strategy**:
- AI API errors: Log error, mark as `status='failed'`, save to DB
- If `continue_on_error=True`: continue to next article
- If `continue_on_error=False`: stop batch processing
- Database errors: Always abort (data integrity)
- Invalid job file: Fail fast with validation error
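The strategy above might translate into a loop like the following sketch. The exception classes and `run_batch` are hypothetical stand-ins; the point is that AI failures are counted and optionally skipped, while database failures are never caught and therefore always abort the batch.

```python
from typing import Callable, Iterable


class AIGenerationError(Exception):
    """Hypothetical error type for a failed AI call."""


class DatabaseFailure(Exception):
    """Hypothetical error type for a failed database write."""


def run_batch(
    articles: Iterable[str],
    generate_one: Callable[[str], None],
    continue_on_error: bool = False,
) -> dict:
    """Process articles in order and return simple statistics."""
    stats = {"generated": 0, "failed": 0}
    for article in articles:
        try:
            generate_one(article)
        except AIGenerationError:
            # A real implementation would also save a status='failed' record here.
            stats["failed"] += 1
            if not continue_on_error:
                break  # stop the batch on the first AI failure
        else:
            stats["generated"] += 1
    # DatabaseFailure is deliberately not caught, so it propagates to the caller.
    return stats


def fake_generate(article: str) -> None:
    """Stand-in generator that fails for specific inputs."""
    if article == "bad":
        raise AIGenerationError("model returned an error")
    if article == "db":
        raise DatabaseFailure("write failed")


tolerant = run_batch(["a", "bad", "c"], fake_generate, continue_on_error=True)
strict = run_batch(["a", "bad", "c"], fake_generate)
```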
**Retry logic** (in AIClient):
- Network errors: 3 retries with exponential backoff (1s, 2s, 4s)
- Rate limit errors: Respect Retry-After header
- Other errors: No retry, raise immediately
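A minimal sketch of this retry policy, assuming two hypothetical exception types that the client would map real SDK errors onto. The `sleep` parameter is injectable so the schedule can be verified without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NetworkError(Exception):
    """Hypothetical error for a transient network failure."""


class RateLimitError(Exception):
    """Hypothetical error carrying the server's Retry-After value, if any."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with delays of 1s, 2s, 4s."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except (NetworkError, RateLimitError) as exc:
            if attempt == max_retries:
                raise  # retries exhausted
            delay = base_delay * (2 ** attempt)
            if isinstance(exc, RateLimitError) and exc.retry_after is not None:
                delay = exc.retry_after  # the server's hint takes priority
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises


delays = []
attempts = {"n": 0}


def flaky() -> str:
    """Fails three times, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise NetworkError("connection reset")
    return "ok"


outcome = call_with_retry(flaky, sleep=delays.append)
```

Any other exception type passes straight through, which matches the "no retry, raise immediately" rule.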
---
## **PHASE 5: CLI Integration**
### Task 5.1: Add generate-batch Command
**File**: `src/cli/commands.py`
```python
@app.command("generate-batch")
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
              help='Path to job JSON file')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
@click.option('--continue-on-error', is_flag=True,
              help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
              help='AI model to use (gpt-4o-mini, claude-sonnet-4.5)')
def generate_batch(
    job_file: str,
    username: Optional[str],
    password: Optional[str],
    debug: bool,
    continue_on_error: bool,
    model: str
):
    """Generate content batch from job file"""
    # Authenticate user
    # Initialize AIClient with OpenRouter
    # Initialize PromptManager, ContentGenerator, BatchProcessor
    # Call process_job()
    # Show summary
    pass
```
---
### Task 5.2: Add Progress Logging & Debug Output
**File**: `src/generation/batch_processor.py`
**Debug output** (when `--debug` flag used):
- Create `debug_output/` directory if not exists
- For each AI call, save response to file:
- `debug_output/title_project{id}_tier{tier}_{n}_{timestamp}.txt`
- `debug_output/outline_project{id}_tier{tier}_{n}_{timestamp}.json`
- `debug_output/content_project{id}_tier{tier}_{n}_{timestamp}.html`
- `debug_output/augmented_project{id}_tier{tier}_{n}_{timestamp}.html`
- Also echo to console with `click.echo()`
**Normal output** (without `--debug`):
- Always show title when generated: `"Generated title: {title}"`
- Show word counts and status
- Show progress counter `[n/total]`
---
## **PHASE 6: Testing & Validation**
### Task 6.1: Create Unit Tests
#### `tests/unit/test_ai_client.py`
```python
def test_generate_completion_success():
    """Test successful AI completion"""
    pass


def test_generate_completion_json_mode():
    """Test JSON mode returns valid JSON"""
    pass


def test_generate_completion_retry_on_network_error():
    """Test retry logic for network errors"""
    pass
```
#### `tests/unit/test_content_generator.py`
```python
def test_generate_title():
    """Test title generation with mocked AI response"""
    pass


def test_generate_outline_valid_structure():
    """Test outline generation returns valid JSON with min h2/h3"""
    pass


def test_generate_content_html_fragment():
    """Test content is HTML fragment (no <html> tag)"""
    pass


def test_validate_word_count():
    """Test word count validation with various HTML inputs"""
    pass


def test_augment_content_called_once():
    """Test augmentation only called once"""
    pass
```
#### `tests/unit/test_job_config.py`
```python
def test_load_job_config_valid():
    """Test loading valid job file"""
    pass


def test_tier_defaults_applied():
    """Test defaults applied when not in job file"""
    pass


def test_multiple_jobs_in_file():
    """Test parsing file with multiple jobs"""
    pass
```
#### `tests/unit/test_batch_processor.py`
```python
def test_process_job_success():
    """Test successful batch processing"""
    pass


def test_process_job_with_augmentation():
    """Test articles below min word count are augmented"""
    pass


def test_process_job_continue_on_error():
    """Test continue_on_error flag behavior"""
    pass
```
---
### Task 6.2: Create Integration Test
**File**: `tests/integration/test_generate_batch.py`
```python
def test_generate_batch_end_to_end(test_db, mock_ai_client):
    """
    End-to-end test:
    1. Create test project in DB
    2. Create test job file
    3. Run batch processor
    4. Verify GeneratedContent records created
    5. Verify word counts within range
    6. Verify HTML structure
    """
    pass
```
---
### Task 6.3: Create Example Job Files
#### `jobs/example_tier1_batch.json`
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5
}
}
}
]
}
```
(Uses all defaults for tier1)
#### `jobs/example_multi_tier_batch.json`
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 2200,
"max_word_count": 2600
},
"tier2": {
"count": 10
},
"tier3": {
"count": 15,
"max_h2_tags": 4
}
}
},
{
"project_id": 2,
"tiers": {
"tier1": {
"count": 3
}
}
}
]
}
```
#### `jobs/README.md`
Document job file format and examples
---
## **PHASE 7: Cleanup & Deprecation**
### Task 7.1: Remove Old ContentRuleEngine
**Action**: Delete or gut `src/generation/rule_engine.py`
Only keep if it has reusable utilities. Otherwise remove entirely.
---
### Task 7.2: Remove Old Validator Logic
**Action**: Review `src/generation/validator.py` (if exists)
Remove any strict CORA validation beyond word count. Keep only simple validation utilities.
---
### Task 7.3: Update Documentation
**Files to update**:
- `docs/stories/story-2.2. simplified-ai-content-generation.md` - Change status from "In Progress" to "Done"
- `docs/architecture/workflows.md` - Document simplified generation flow
- `docs/architecture/components.md` - Update generation component description
---
## Implementation Order Recommendation
1. **Phase 1** (Data Layer) - Required foundation
2. **Phase 2** (AI Client) - Required for generation
3. **Phase 3** (Core Logic) - Implement one stage at a time, test each
4. **Phase 4** (Batch Processing) - Orchestrate stages
5. **Phase 5** (CLI) - Make accessible to users
6. **Phase 6** (Testing) - Can be done in parallel with implementation
7. **Phase 7** (Cleanup) - Final polish
**Estimated effort**:
- Phase 1-2: 4-6 hours
- Phase 3: 6-8 hours
- Phase 4: 3-4 hours
- Phase 5: 2-3 hours
- Phase 6: 4-6 hours
- Phase 7: 1-2 hours
- **Total**: 20-29 hours
---
## Critical Dev Notes
### OpenRouter Specifics
- API key from environment: `OPENROUTER_API_KEY`
- Model format: `"provider/model-name"`
- Supports OpenAI SDK drop-in replacement
- Rate limits vary by model (check OpenRouter docs)
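The checks implied by these notes can be sketched as follows. This is a minimal illustration, not the project's code: `load_openrouter_settings` and `MODEL_PATTERN` are hypothetical names, and only the `OPENROUTER_API_KEY` variable, the `provider/model-name` format, and the `https://openrouter.ai/api/v1` base URL come from this commit.

```python
import os
import re
from typing import Mapping, Optional

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
# Matches "provider/model-name", e.g. "openai/gpt-4o-mini"
MODEL_PATTERN = re.compile(r"^[\w.-]+/[\w.:-]+$")


def load_openrouter_settings(model: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect the values needed to build an API client."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY not found in environment")
    if not MODEL_PATTERN.match(model):
        raise ValueError(f"Model must look like 'provider/model-name', got: {model!r}")
    return {"api_key": api_key, "base_url": OPENROUTER_BASE_URL, "model": model}
```

The returned `api_key` and `base_url` are the two values passed to the OpenAI SDK client constructor. Short aliases such as `gpt-4o-mini` would need to be resolved to their full `provider/model-name` form before this check.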
### HTML Fragment Format
Content generation returns HTML like:
```html
<h2>Main Topic</h2>
<p>Introduction paragraph with relevant keywords and entities.</p>
<h3>Subtopic One</h3>
<p>Detailed content about subtopic.</p>
<h3>Subtopic Two</h3>
<p>More detailed content.</p>
<h2>Second Main Topic</h2>
<p>Content continues...</p>
```
**No document structure**: No `<!DOCTYPE>`, `<html>`, `<head>`, or `<body>` tags.
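A quick way to enforce the fragment rule is to reject any response that contains document-level tags. The sketch below is an assumption about how such a check could look; `is_html_fragment` is a hypothetical name and is not part of the codebase shown here.

```python
import re

# Tags that indicate a full HTML document rather than a fragment
DOCUMENT_TAG_PATTERN = re.compile(
    r"<\s*/?\s*(!doctype|html|head|body)\b", re.IGNORECASE
)


def is_html_fragment(html: str) -> bool:
    """Return True when no document-level tags are present."""
    return DOCUMENT_TAG_PATTERN.search(html) is None
```

The word boundary keeps tags such as `<header>` from being mistaken for `<head>`.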
### Word Count Method
```python
import re
from html import unescape


def count_words(html_content: str) -> int:
    # Strip HTML tags
    text = re.sub(r'<[^>]+>', '', html_content)
    # Unescape HTML entities
    text = unescape(text)
    # Split and count
    words = text.split()
    return len(words)
```
### Debug Output Directory
- Create `debug_output/` at the project root if it does not exist
- Add to `.gitignore`
- Filename format: `{stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}`
- Example: `title_project5_tier1_article3_20251020_143022.txt`
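The filename format above can be built with a small helper. This is a sketch under assumptions: `debug_file_path` is a hypothetical name, and accepting either `"tier1"` or `1` for the tier is a choice made for illustration.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Union


def debug_file_path(
    stage: str,
    project_id: int,
    tier: Union[str, int],
    article_num: int,
    ext: str,
    now: Optional[datetime] = None,
    base_dir: str = "debug_output",
) -> Path:
    """Hypothetical helper: build the debug output path for one AI response."""
    now = now or datetime.now()
    tier_label = str(tier)
    if not tier_label.startswith("tier"):
        tier_label = f"tier{tier_label}"
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    name = f"{stage}_project{project_id}_{tier_label}_article{article_num}_{timestamp}.{ext}"
    return Path(base_dir) / name
```

Calling `debug_file_path("title", 5, "tier1", 3, "txt", now=datetime(2025, 10, 20, 14, 30, 22))` yields a path whose file name is `title_project5_tier1_article3_20251020_143022.txt`. The caller is still responsible for creating the directory, for example with `path.parent.mkdir(exist_ok=True)`.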
### Tier Constants Location
Define the tier defaults in `src/generation/job_config.py` as a module-level constant for easy reference.
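One possible shape for that constant is sketched below. The default values are the ones documented in `jobs/README.md` in this commit; the names `TIER_DEFAULTS` and `resolve_tier_config` are assumptions made for illustration and may not match the real module.

```python
from typing import Any, Dict

# Default limits per tier, as documented in jobs/README.md
TIER_DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
    "tier2": {"min_word_count": 1500, "max_word_count": 2000,
              "min_h2_tags": 2, "max_h2_tags": 4,
              "min_h3_tags": 3, "max_h3_tags": 8},
    "tier3": {"min_word_count": 1000, "max_word_count": 1500,
              "min_h2_tags": 2, "max_h2_tags": 3,
              "min_h3_tags": 2, "max_h3_tags": 6},
}


def resolve_tier_config(tier_name: str, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge job-file values over the tier defaults."""
    if tier_name not in TIER_DEFAULTS:
        raise ValueError(f"Unknown tier: {tier_name!r}")
    if "count" not in overrides:
        raise ValueError(f"Tier {tier_name!r} is missing required field 'count'")
    # Values from the job file win over the defaults
    return {**TIER_DEFAULTS[tier_name], **overrides}
```

With this layout, a job entry such as `{"count": 15, "max_h2_tags": 4}` for `tier3` keeps the default word counts and only changes the H2 limit.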
### Future Extensibility
Job file structure designed to support:
- Custom interlinking rules (Story 2.4+)
- Template selection (Story 3.x)
- Deployment targets (Story 4.x)
- SEO metadata overrides
Keep job parsing flexible to add new fields without breaking existing jobs.
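One way to keep parsing flexible is to validate only the required fields and carry everything else along untouched. The sketch below is an assumption about how that could look; `ParsedJob`, `parse_job`, and the `template` field used in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

KNOWN_JOB_FIELDS = {"project_id", "tiers"}


@dataclass
class ParsedJob:
    project_id: int
    tiers: Dict[str, Dict[str, Any]]
    # Unrecognised fields are preserved so future stories can read them
    extra: Dict[str, Any] = field(default_factory=dict)


def parse_job(raw: Dict[str, Any]) -> ParsedJob:
    """Hypothetical parser: require known fields, tolerate unknown ones."""
    missing = KNOWN_JOB_FIELDS - set(raw)
    if missing:
        raise ValueError(f"Job is missing required fields: {sorted(missing)}")
    extra = {key: value for key, value in raw.items() if key not in KNOWN_JOB_FIELDS}
    return ParsedJob(project_id=raw["project_id"], tiers=raw["tiers"], extra=extra)
```

A job that adds a field such as `"template": "basic"` would still parse, with the new value available under `extra` until the parser learns about it.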
---
## Testing Strategy
### Unit Test Mocking
Mock `AIClient.generate_completion()` to return a realistic response for each stage (title text, outline JSON, HTML content):
```python
import pytest


@pytest.fixture
def mock_title_response():
    return "The Ultimate Guide to Sustainable Gardening in 2025"


@pytest.fixture
def mock_outline_response():
    return {
        "outline": [
            {"h2": "Getting Started", "h3": ["Tools", "Planning"]},
            {"h2": "Best Practices", "h3": ["Watering", "Composting"]}
        ]
    }


@pytest.fixture
def mock_content_response():
    return """<h2>Getting Started</h2>
<p>Sustainable gardening begins with proper planning...</p>
<h3>Tools</h3>
<p>Essential tools include...</p>"""
```
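The mocking approach can be sketched with `unittest.mock` from the standard library. Here `generate_title` is a hypothetical stand-in for the real title stage in `ContentGenerator`, and the prompt text is invented for the example; only the `generate_completion(prompt=...)` call shape comes from this commit.

```python
from unittest.mock import MagicMock


def generate_title(ai_client, keyword: str) -> str:
    """Hypothetical stand-in for the title stage (not the real ContentGenerator)."""
    prompt = f"Write an article title about: {keyword}"
    return ai_client.generate_completion(prompt=prompt).strip()


def test_generate_title_uses_mocked_client():
    # The mock replaces the network call, so no API key is needed
    ai_client = MagicMock()
    ai_client.generate_completion.return_value = (
        "The Ultimate Guide to Sustainable Gardening in 2025\n"
    )

    title = generate_title(ai_client, "sustainable gardening")

    assert title == "The Ultimate Guide to Sustainable Gardening in 2025"
    ai_client.generate_completion.assert_called_once()
```

Because the client is injected, the same pattern should carry over to the outline and content stages by changing `return_value`.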
### Integration Test Database
Use `conftest.py` fixture with in-memory SQLite and test data:
```python
import pytest

from src.database.repositories import ProjectRepository


@pytest.fixture
def test_project(test_db):
    # test_db is assumed to be the in-memory SQLite session fixture from conftest.py
    project_repo = ProjectRepository(test_db)
    return project_repo.create(
        user_id=1,
        name="Test Project",
        data={
            "main_keyword": "sustainable gardening",
            "entities": ["composting", "organic soil"],
            "related_searches": ["how to compost", "organic gardening tips"]
        }
    )
```
---
## Success Criteria
Story is complete when:
1. All database models and repositories implemented
2. AIClient successfully calls OpenRouter API
3. Three-stage generation pipeline works end-to-end
4. Batch processor handles multiple jobs/tiers
5. CLI command `generate-batch` functional
6. Debug output saves to `debug_output/` when `--debug` used
7. All unit tests pass
8. Integration test demonstrates full workflow
9. Example job files work correctly
10. Documentation updated
**Acceptance**: Run `generate-batch` on a real project and verify that the content is saved to the database with the correct word count and structure.

View File

@ -0,0 +1,40 @@
# Story 2.2: Simplified AI Content Generation via Batch Job
## Status
Completed
## Story
**As a** User,
**I want** to control AI content generation via a batch file that specifies word count and heading limits,
**so that** I can easily create topically relevant articles without unnecessary complexity or rigid validation.
## Acceptance Criteria
1. **Batch Job Control:** The `generate-batch` command accepts a JSON job file that specifies `min_word_count`, `max_word_count`, `max_h2_tags`, and `max_h3_tags` for each tier.
2. **Three-Stage Generation:** The system uses a simple three-stage pipeline:
* Generates a title using the project's SEO data.
* Generates an outline based on the title, SEO data, and the `max_h2`/`max_h3` limits from the job file.
* Generates the full article content based on the validated outline.
3. **SEO Data Integration:** The generation process for all stages is informed by the project's `keyword`, `entities`, and `related_searches` to ensure topical relevance.
4. **Word Count Validation:** After generation, the system validates the content *only* against the `min_word_count` and `max_word_count` specified in the job file.
5. **Simple Augmentation:** If the generated content is below `min_word_count`, the system makes **one** attempt to append additional content using a simple "expand on this article" prompt.
6. **Database Storage:** The final generated title, outline, and content are stored in the `GeneratedContent` table.
7. **CLI Execution:** The `generate-batch` command successfully runs the job, logs progress to the console, and indicates when the process is complete.
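The outline check in criterion 2 could look roughly like the sketch below. It assumes the outline shape used by the mock fixtures in the task breakdown (`{"outline": [{"h2": ..., "h3": [...]}]}`) and treats `max_h3_tags` as a total across the article; `validate_outline` is a hypothetical name.

```python
from typing import Any, Dict, List


def validate_outline(outline: Dict[str, Any], max_h2_tags: int, max_h3_tags: int) -> List[str]:
    """Hypothetical check: return a list of problems, empty when the outline is acceptable."""
    sections = outline.get("outline")
    if not isinstance(sections, list) or not sections:
        return ["Outline must contain a non-empty 'outline' list"]

    problems: List[str] = []
    h2_count = len(sections)
    # H3 headings are counted across all sections, not per section
    h3_count = sum(len(section.get("h3", [])) for section in sections)

    if h2_count > max_h2_tags:
        problems.append(f"Too many H2 headings: {h2_count} > {max_h2_tags}")
    if h3_count > max_h3_tags:
        problems.append(f"Too many H3 headings: {h3_count} > {max_h3_tags}")
    return problems
```

Returning a list of problems rather than raising lets the batch processor log every issue for an article before deciding whether to continue.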
## Dev Notes
* **Objective:** This story replaces the previous, overly complex stories 2.2 and 2.3. The goal is maximum simplicity and user control via the job file.
* **Key Change:** Remove the entire `ContentRuleEngine` and all strict CORA validation logic. The only validation required is a final word count check.
* **Job File is King:** All operational parameters (`min_word_count`, `max_word_count`, `max_h2_tags`, `max_h3_tags`) must be read from the job file for each tier being processed.
* **Augmentation:** Keep it simple. If `word_count < min_word_count`, make a single API call to the AI with a prompt like: "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Here is the article: {content}". Do not create a complex augmentation system.
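The single-attempt augmentation described above can be sketched as follows. The prompt text is quoted from the note; `ensure_min_words` and the `expand` callback are hypothetical, and the sketch assumes the AI returns the full expanded article rather than only the added text.

```python
import re
from html import unescape
from typing import Callable, Tuple

AUGMENT_PROMPT = (
    "Please expand on the following article to add more detail and depth, "
    "ensuring you maintain the existing topical focus. Here is the article: {content}"
)


def count_words(html_content: str) -> int:
    """Count words after stripping tags and unescaping entities."""
    text = unescape(re.sub(r"<[^>]+>", "", html_content))
    return len(text.split())


def ensure_min_words(
    content: str, min_word_count: int, expand: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (content, status); `expand` is called at most once."""
    if count_words(content) >= min_word_count:
        return content, "generated"
    # One attempt only: no loop and no retry on a short result
    expanded = expand(AUGMENT_PROMPT.format(content=content))
    return expanded, "augmented"
```

In practice `expand` would wrap a call to the AI client. The caller can re-run `count_words` on the result and decide how to record an article that is still too short.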
## Implementation Plan
See **[story-2.2-task-breakdown.md](story-2.2-task-breakdown.md)** for detailed implementation tasks.
The task breakdown is organized into 7 phases:
1. **Phase 1**: Data Model & Schema Design (GeneratedContent table, repositories, job file schema)
2. **Phase 2**: AI Client & Prompt Management (OpenRouter integration, prompt templates)
3. **Phase 3**: Core Generation Pipeline (title, outline, content generation with validation)
4. **Phase 4**: Batch Processing (job config parser, batch processor, error handling)
5. **Phase 5**: CLI Integration (generate-batch command, progress logging, debug output)
6. **Phase 6**: Testing & Validation (unit tests, integration tests, example job files)
7. **Phase 7**: Cleanup & Deprecation (remove old rule engine and validators)

View File

@ -2,7 +2,7 @@
DATABASE_URL=sqlite:///./content_automation.db DATABASE_URL=sqlite:///./content_automation.db
# AI Service Configuration (OpenRouter) # AI Service Configuration (OpenRouter)
AI_API_KEY=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 OPENROUTER_API_KEY=your_openrouter_api_key_here
AI_API_BASE_URL=https://openrouter.ai/api/v1 AI_API_BASE_URL=https://openrouter.ai/api/v1
AI_MODEL=anthropic/claude-3.5-sonnet AI_MODEL=anthropic/claude-3.5-sonnet

16
et --hard d81537f 100644
View File

@ -0,0 +1,16 @@
5b5bd1b (HEAD -> feature/tier-word-count-override) Add tier-specific word count and outline controls
3063fc4 (origin/main, origin/HEAD, main) Story 2.3 - content generation script nightmare alomst done - fixed (maybe) outline too big issue
b6b0acf Story 2.3 - content generation script nightmare alomst done - pre-fix outline too big issue
f73b070 (github/main) Story 2.3 - content generation script finished - fix ci
e2afabb Story 2.3 - content generation script finished
0069e6e Story 2.2 - rule engine finished
d81537f Story 2.1 finished
02dd5a3 Story 2.1 finished
29ecaec Story 1.7 finished
da797c2 Story 1.6 finished - added sync
4cada9d Story 1.6 finished
b6e495e feat: Story 1.5 - CLI User Management
0a223e2 Complete Story 1.4: Internal API Foundation
8641bca Complete Epic 1 Stories 1.1-1.3: Foundation, Database, and Authentication
70b9de2 feat: Complete Story 1.1 - Project Initialization & Configuration
31b9580 Initial commit: Project structure and planning documents

View File

@ -1,77 +1,179 @@
# Job Configuration Files # Job File Format
This directory contains batch job configuration files for content generation. Job files define batch content generation parameters using JSON format.
## Usage ## Structure
Run a batch job using the CLI:
```bash
python main.py generate-batch --job-file jobs/example_tier1_batch.json -u admin -p password
```
## Job Configuration Structure
```json ```json
{ {
"job_name": "Descriptive name", "jobs": [
"project_id": 1,
"description": "Optional description",
"tiers": [
{ {
"tier": 1, "project_id": 1,
"article_count": 15, "tiers": {
"models": { "tier1": {
"title": "model-id", "count": 5,
"outline": "model-id", "min_word_count": 2000,
"content": "model-id" "max_word_count": 2500,
}, "min_h2_tags": 3,
"anchor_text_config": { "max_h2_tags": 5,
"mode": "default|override|append", "min_h3_tags": 5,
"custom_text": ["optional", "custom", "anchors"], "max_h3_tags": 10
"additional_text": ["optional", "additions"]
},
"validation_attempts": 3
} }
],
"failure_config": {
"max_consecutive_failures": 5,
"skip_on_failure": true
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4,
"include_home_link": true
} }
}
]
} }
``` ```
## Available Models ## Fields
- `anthropic/claude-3.5-sonnet` - Best for high-quality content ### Job Level
- `anthropic/claude-3-haiku` - Fast and cost-effective - `project_id` (required): The project ID to generate content for
- `openai/gpt-4o` - Excellent quality - `tiers` (required): Dictionary of tier configurations
- `openai/gpt-4o-mini` - Good for titles/outlines
- `meta-llama/llama-3.1-70b-instruct` - Open source alternative
- `google/gemini-pro-1.5` - Google's offering
## Anchor Text Modes ### Tier Level
- `count` (required): Number of articles to generate for this tier
- `min_word_count` (optional): Minimum word count (uses defaults if not specified)
- `max_word_count` (optional): Maximum word count (uses defaults if not specified)
- `min_h2_tags` (optional): Minimum H2 headings (uses defaults if not specified)
- `max_h2_tags` (optional): Maximum H2 headings (uses defaults if not specified)
- `min_h3_tags` (optional): Minimum H3 subheadings total (uses defaults if not specified)
- `max_h3_tags` (optional): Maximum H3 subheadings total (uses defaults if not specified)
- **default**: Use CORA rules (keyword, entities, related searches) ## Tier Defaults
- **override**: Replace default with custom_text list
- **append**: Add additional_text to default anchor text
## Example Files If tier parameters are not specified, these defaults are used:
- `example_tier1_batch.json` - Single tier 1 with 15 articles ### tier1
- `example_multi_tier_batch.json` - Three tiers with 165 total articles - `min_word_count`: 2000
- `example_custom_anchors.json` - Custom anchor text demo - `max_word_count`: 2500
- `min_h2_tags`: 3
- `max_h2_tags`: 5
- `min_h3_tags`: 5
- `max_h3_tags`: 10
## Tips ### tier2
- `min_word_count`: 1500
- `max_word_count`: 2000
- `min_h2_tags`: 2
- `max_h2_tags`: 4
- `min_h3_tags`: 3
- `max_h3_tags`: 8
1. Start with tier 1 to ensure quality ### tier3
2. Use faster/cheaper models for tier 2+ - `min_word_count`: 1000
3. Set `skip_on_failure: true` to continue on errors - `max_word_count`: 1500
4. Adjust `max_consecutive_failures` based on model reliability - `min_h2_tags`: 2
5. Test with small batches first - `max_h2_tags`: 3
- `min_h3_tags`: 2
- `max_h3_tags`: 6
## Examples
### Simple: Single Tier with Defaults
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5
}
}
}
]
}
```
### Custom Word Counts
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 3,
"min_word_count": 2500,
"max_word_count": 3000
}
}
}
]
}
```
### Multi-Tier
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5
},
"tier2": {
"count": 10
},
"tier3": {
"count": 15
}
}
}
]
}
```
### Multiple Projects
```json
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5
}
}
},
{
"project_id": 2,
"tiers": {
"tier1": {
"count": 3
},
"tier2": {
"count": 8
}
}
}
]
}
```
## Usage
Run batch generation with:
```bash
python main.py generate-batch --job-file jobs/example_tier1_batch.json --username youruser --password yourpass
```
### Options
- `--job-file, -j`: Path to job JSON file (required)
- `--username, -u`: Username for authentication
- `--password, -p`: Password for authentication
- `--debug`: Save AI responses to debug_output/
- `--continue-on-error`: Continue processing if article generation fails
- `--model, -m`: AI model to use (default: gpt-4o-mini)
### Debug Mode
When using `--debug`, AI responses are saved to `debug_output/`:
- `title_project{id}_tier{tier}_article{n}_{timestamp}.txt`
- `outline_project{id}_tier{tier}_article{n}_{timestamp}.json`
- `content_project{id}_tier{tier}_article{n}_{timestamp}.html`
- `augmented_project{id}_tier{tier}_article{n}_{timestamp}.html` (if augmented)

View File

@ -1,57 +1,30 @@
{ {
"job_name": "Multi-Tier Site Build", "jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 2200,
"max_word_count": 2600
},
"tier2": {
"count": 10
},
"tier3": {
"count": 15,
"max_h2_tags": 4
}
}
},
{
"project_id": 2, "project_id": 2,
"description": "Complete site build with 165 articles across 3 tiers", "tiers": {
"tiers": [ "tier1": {
{ "count": 3
"tier": 1,
"article_count": 15,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "anthropic/claude-4.5-sonnet"
},
"anchor_text_config": {
"mode": "default"
},
"validation_attempts": 3
},
{
"tier": 2,
"article_count": 50,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "openai/gpt-4o-mini"
},
"anchor_text_config": {
"mode": "append",
"additional_text": ["comprehensive guide", "expert insights"]
},
"validation_attempts": 2
},
{
"tier": 3,
"article_count": 100,
"models": {
"title": "openai/gpt-4o-mini",
"outline": "openai/gpt-4o-mini",
"content": "openai/gpt-4o-mini"
},
"anchor_text_config": {
"mode": "default"
},
"validation_attempts": 2
} }
],
"failure_config": {
"max_consecutive_failures": 3,
"skip_on_failure": true
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4,
"include_home_link": true
} }
}
]
} }

View File

@ -1,30 +1,13 @@
{ {
"job_name": "Tier 1 Launch Batch", "jobs": [
"project_id": 1,
"description": "Initial tier 1 content - 15 high-quality articles with strict validation",
"tiers": [
{ {
"tier": 1, "project_id": 1,
"article_count": 15, "tiers": {
"models": { "tier1": {
"title": "anthropic/claude-3.5-sonnet", "count": 5
"outline": "anthropic/claude-3.5-sonnet",
"content": "anthropic/claude-3.5-sonnet"
},
"anchor_text_config": {
"mode": "default"
},
"validation_attempts": 3
} }
],
"failure_config": {
"max_consecutive_failures": 5,
"skip_on_failure": true
},
"interlinking": {
"links_per_article_min": 2,
"links_per_article_max": 4,
"include_home_link": true
} }
}
]
} }

View File

@ -0,0 +1,19 @@
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 1,
"min_word_count": 2000,
"max_word_count": 2500,
"min_h2_tags": 3,
"max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
}
}
}
]
}

View File

@ -0,0 +1,19 @@
{
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {
"count": 1,
"min_word_count": 500,
"max_word_count": 800,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 3,
"max_h3_tags": 6
}
}
}
]
}

View File

@ -0,0 +1,27 @@
import sys
from pathlib import Path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))
from src.database.session import db_manager
from src.database.repositories import UserRepository
from src.auth.service import AuthService
db_manager.initialize()
session = db_manager.get_session()
try:
user_repo = UserRepository(session)
auth_service = AuthService(user_repo)
user = auth_service.create_user_with_hashed_password(
username="admin",
password="admin1234",
role="Admin"
)
print(f"Admin user created: {user.username}")
finally:
session.close()
db_manager.close()

View File

@ -16,6 +16,11 @@ from src.deployment.bunnynet import (
BunnyNetResourceConflictError BunnyNetResourceConflictError
) )
from src.ingestion.parser import CORAParser, CORAParseError from src.ingestion.parser import CORAParser, CORAParseError
from src.generation.ai_client import AIClient, PromptManager
from src.generation.service import ContentGenerator
from src.generation.batch_processor import BatchProcessor
from src.database.repositories import GeneratedContentRepository
import os
def authenticate_admin(username: str, password: str) -> Optional[User]: def authenticate_admin(username: str, password: str) -> Optional[User]:
@ -871,22 +876,26 @@ def list_projects(username: Optional[str], password: Optional[str]):
raise click.Abort() raise click.Abort()
@app.command() <<<<<<< HEAD
@click.option("--job-file", "-j", required=True, help="Path to job configuration JSON file") @app.command("generate-batch")
@click.option("--force-regenerate", "-f", is_flag=True, help="Force regeneration even if content exists") @click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
@click.option("--debug", "-d", is_flag=True, help="Enable debug mode (saves generated content to debug_output/)") help='Path to job JSON file')
@click.option("--username", "-u", help="Username for authentication") @click.option('--username', '-u', help='Username for authentication')
@click.option("--password", "-p", help="Password for authentication") @click.option('--password', '-p', help='Password for authentication')
def generate_batch(job_file: str, force_regenerate: bool, debug: bool, username: Optional[str], password: Optional[str]): @click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
""" @click.option('--continue-on-error', is_flag=True,
Generate batch of articles from a job configuration file help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
Example: help='AI model to use (gpt-4o-mini, claude-sonnet-4.5)')
python main.py generate-batch --job-file jobs/tier1_batch.json -u admin -p pass def generate_batch(
""" job_file: str,
from src.generation.batch_processor import BatchProcessor username: Optional[str],
from src.generation.job_config import JobConfig password: Optional[str],
debug: bool,
continue_on_error: bool,
model: str
):
"""Generate content batch from job file"""
try: try:
if not username or not password: if not username or not password:
username, password = prompt_admin_credentials() username, password = prompt_admin_credentials()
@ -903,70 +912,47 @@ def generate_batch(job_file: str, force_regenerate: bool, debug: bool, username:
click.echo(f"Authenticated as: {user.username} ({user.role})") click.echo(f"Authenticated as: {user.username} ({user.role})")
job_config = JobConfig.from_file(job_file) api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
click.echo("Error: OPENROUTER_API_KEY not found in environment", err=True)
click.echo("Please set OPENROUTER_API_KEY in your .env file", err=True)
raise click.Abort()
click.echo(f"\nLoading Job: {job_config.job_name}") click.echo(f"Initializing AI client with model: {model}")
click.echo(f"Project ID: {job_config.project_id}") ai_client = AIClient(api_key=api_key, model=model)
click.echo(f"Total Articles: {job_config.get_total_articles()}") prompt_manager = PromptManager()
click.echo(f"\nTiers:")
for tier_config in job_config.tiers:
click.echo(f" Tier {tier_config.tier}: {tier_config.article_count} articles")
click.echo(f" Models: {tier_config.models.title} / {tier_config.models.outline} / {tier_config.models.content}")
if not click.confirm("\nProceed with generation?"): project_repo = ProjectRepository(session)
click.echo("Aborted") content_repo = GeneratedContentRepository(session)
return
click.echo("\nStarting batch generation...") content_generator = ContentGenerator(
click.echo("-" * 80) ai_client=ai_client,
prompt_manager=prompt_manager,
project_repo=project_repo,
content_repo=content_repo
)
def progress_callback(tier=None, article_num=None, total=None, status=None, stage=None, **kwargs): batch_processor = BatchProcessor(
if stage: content_generator=content_generator,
if status == "completed": content_repo=content_repo,
if stage == "title": project_repo=project_repo
title = kwargs.get("title", "") )
click.echo(f" - Title generated: {title}")
elif stage == "outline":
outline = kwargs.get("outline", {})
h2_count = len(outline.get("sections", []))
h3_count = sum(len(s.get("h3s", [])) for s in outline.get("sections", []))
click.echo(f" - Outline generated: {h2_count} H2s, {h3_count} H3s")
elif stage == "content":
word_count = kwargs.get("word_count", 0)
click.echo(f" - Content generated: {word_count} words")
elif status == "starting":
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Generating...")
elif status == "completed":
content_id = kwargs.get("content_id", "?")
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Completed (ID: {content_id})")
elif status == "skipped":
error = kwargs.get("error", "Unknown error")
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Skipped - {error}", err=True)
elif status == "failed":
error = kwargs.get("error", "Unknown error")
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Failed - {error}", err=True)
click.echo(f"\nProcessing job file: {job_file}")
if debug: if debug:
click.echo("\n[DEBUG MODE ENABLED - Content will be saved to debug_output/]\n") click.echo("Debug mode: AI responses will be saved to debug_output/\n")
processor = BatchProcessor(session) batch_processor.process_job(
result = processor.process_job(job_config, progress_callback, debug=debug) job_file_path=job_file,
debug=debug,
click.echo("-" * 80) continue_on_error=continue_on_error
click.echo("\nBatch Generation Complete!") )
click.echo(result.to_summary())
finally: finally:
session.close() session.close()
except FileNotFoundError as e:
click.echo(f"Error: {e}", err=True)
raise click.Abort()
except ValueError as e:
click.echo(f"Error: {e}", err=True)
raise click.Abort()
except Exception as e: except Exception as e:
click.echo(f"Error: {e}", err=True) click.echo(f"Error processing batch: {e}", err=True)
raise click.Abort() raise click.Abort()

View File

@ -3,7 +3,7 @@ SQLAlchemy database models
""" """
from datetime import datetime, timezone from datetime import datetime, timezone
from typing import Literal, Optional from typing import Optional
from sqlalchemy import String, Integer, DateTime, Float, ForeignKey, JSON, Text from sqlalchemy import String, Integer, DateTime, Float, ForeignKey, JSON, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
@ -120,40 +120,18 @@ class Project(Base):
class GeneratedContent(Base): class GeneratedContent(Base):
"""Generated content model for AI-generated articles with version tracking""" """Generated content model for AI-created articles"""
__tablename__ = "generated_content" __tablename__ = "generated_content"
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True) id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True) project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
tier: Mapped[int] = mapped_column(Integer, nullable=False, index=True) tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
title: Mapped[Optional[str]] = mapped_column(String(500), nullable=True) title: Mapped[str] = mapped_column(Text, nullable=False)
outline: Mapped[Optional[str]] = mapped_column(Text, nullable=True) outline: Mapped[dict] = mapped_column(JSON, nullable=False)
content: Mapped[Optional[str]] = mapped_column(Text, nullable=True) content: Mapped[str] = mapped_column(Text, nullable=False)
word_count: Mapped[int] = mapped_column(Integer, nullable=False)
status: Mapped[str] = mapped_column(String(20), nullable=False, default="pending", index=True) status: Mapped[str] = mapped_column(String(20), nullable=False)
is_active: Mapped[bool] = mapped_column(Integer, nullable=False, default=False)
generation_stage: Mapped[str] = mapped_column(String(20), nullable=False, default="title")
title_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
outline_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
content_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
title_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
outline_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
content_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
validation_errors: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
validation_warnings: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
validation_report: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)
word_count: Mapped[Optional[int]] = mapped_column(Integer, nullable=True)
augmented: Mapped[bool] = mapped_column(Integer, nullable=False, default=False)
augmentation_log: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)
generation_duration: Mapped[Optional[float]] = mapped_column(Float, nullable=True)
error_message: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
updated_at: Mapped[datetime] = mapped_column( updated_at: Mapped[datetime] = mapped_column(
DateTime, DateTime,
@ -163,4 +141,4 @@ class GeneratedContent(Base):
) )
def __repr__(self) -> str: def __repr__(self) -> str:
return f"<GeneratedContent(id={self.id}, project_id={self.project_id}, tier={self.tier}, status='{self.status}', stage='{self.generation_stage}')>" return f"<GeneratedContent(id={self.id}, project_id={self.project_id}, tier='{self.tier}', status='{self.status}')>"

View File

@ -5,9 +5,8 @@ Concrete repository implementations
from typing import Optional, List, Dict, Any from typing import Optional, List, Dict, Any
from sqlalchemy.orm import Session from sqlalchemy.orm import Session
from sqlalchemy.exc import IntegrityError from sqlalchemy.exc import IntegrityError
from src.database.interfaces import IUserRepository, ISiteDeploymentRepository, IProjectRepository, IGeneratedContentRepository from src.database.interfaces import IUserRepository, ISiteDeploymentRepository, IProjectRepository
from src.database.models import User, SiteDeployment, Project, GeneratedContent from src.database.models import User, SiteDeployment, Project, GeneratedContent
from src.core.config import get_config
class UserRepository(IUserRepository): class UserRepository(IUserRepository):
@ -377,35 +376,55 @@ class ProjectRepository(IProjectRepository):
return False return False
class GeneratedContentRepository(IGeneratedContentRepository): <<<<<<< HEAD
"""Repository implementation for GeneratedContent data access""" class GeneratedContentRepository:
"""Repository for GeneratedContent data access"""
def __init__(self, session: Session): def __init__(self, session: Session):
self.session = session self.session = session
def create(self, project_id: int, tier: int) -> GeneratedContent: def create(
self,
project_id: int,
tier: str,
keyword: str,
title: str,
outline: dict,
content: str,
word_count: int,
status: str
) -> GeneratedContent:
""" """
Create a new generated content record Create a new generated content record
Args: Args:
project_id: The ID of the project project_id: The project ID this content belongs to
tier: The tier level (1, 2, etc.) tier: Content tier (tier1, tier2, tier3)
keyword: The keyword used for generation
title: Generated title
outline: Generated outline (JSON)
content: Generated HTML content
word_count: Final word count
status: Status (generated, augmented, failed)
Returns: Returns:
The created GeneratedContent object The created GeneratedContent object
""" """
content = GeneratedContent( content_record = GeneratedContent(
project_id=project_id, project_id=project_id,
tier=tier, tier=tier,
status="pending", keyword=keyword,
generation_stage="title", title=title,
is_active=False outline=outline,
content=content,
word_count=word_count,
status=status
) )
self.session.add(content) self.session.add(content_record)
self.session.commit() self.session.commit()
self.session.refresh(content) self.session.refresh(content_record)
return content return content_record
def get_by_id(self, content_id: int) -> Optional[GeneratedContent]: def get_by_id(self, content_id: int) -> Optional[GeneratedContent]:
""" """
@ -482,46 +501,51 @@ class GeneratedContentRepository(IGeneratedContentRepository):
Returns: Returns:
The updated GeneratedContent object The updated GeneratedContent object
""" """
=======
content_record = GeneratedContent(
project_id=project_id,
tier=tier,
keyword=keyword,
title=title,
outline=outline,
content=content,
word_count=word_count,
status=status
)
self.session.add(content_record)
self.session.commit()
self.session.refresh(content_record)
return content_record
def get_by_id(self, content_id: int) -> Optional[GeneratedContent]:
"""Get content by ID"""
return self.session.query(GeneratedContent).filter(GeneratedContent.id == content_id).first()
def get_by_project_id(self, project_id: int) -> List[GeneratedContent]:
"""Get all content for a project"""
return self.session.query(GeneratedContent).filter(GeneratedContent.project_id == project_id).all()
def get_by_project_and_tier(self, project_id: int, tier: str) -> List[GeneratedContent]:
"""Get content for a project and tier"""
return self.session.query(GeneratedContent).filter(
GeneratedContent.project_id == project_id,
GeneratedContent.tier == tier
).all()
def get_by_keyword(self, keyword: str) -> List[GeneratedContent]:
"""Get content by keyword"""
return self.session.query(GeneratedContent).filter(GeneratedContent.keyword == keyword).all()
def update(self, content: GeneratedContent) -> GeneratedContent:
"""Update existing content"""
self.session.add(content) self.session.add(content)
self.session.commit() self.session.commit()
self.session.refresh(content) self.session.refresh(content)
return content return content
def set_active(self, content_id: int, project_id: int, tier: int) -> bool:
"""
Set a content version as active (deactivates others)
Args:
content_id: The ID of the content to activate
project_id: The project ID
tier: The tier level
Returns:
True if successful, False if content not found
"""
content = self.get_by_id(content_id)
if not content:
return False
self.session.query(GeneratedContent).filter(
GeneratedContent.project_id == project_id,
GeneratedContent.tier == tier
).update({"is_active": False})
content.is_active = True
self.session.commit()
return True
def delete(self, content_id: int) -> bool: def delete(self, content_id: int) -> bool:
""" """Delete content by ID"""
Delete a generated content record by ID
Args:
content_id: The ID of the content to delete
Returns:
True if deleted, False if content not found
"""
content = self.get_by_id(content_id) content = self.get_by_id(content_id)
if content: if content:
self.session.delete(content) self.session.delete(content)

View File

@ -1,169 +1,145 @@
""" """
AI client for OpenRouter API integration OpenRouter AI client and prompt management
""" """
import os import time
import json import json
from typing import Dict, Any, Optional from pathlib import Path
from openai import OpenAI from typing import Optional, Dict, Any
from dotenv import load_dotenv from openai import OpenAI, RateLimitError, APIError
from src.core.config import Config from src.core.config import get_config
AVAILABLE_MODELS = {
class AIClientError(Exception): "gpt-4o-mini": "openai/gpt-4o-mini",
"""Base exception for AI client errors""" "claude-sonnet-4.5": "anthropic/claude-3.5-sonnet"
pass }
class AIClient: class AIClient:
"""Client for interacting with AI models via OpenRouter""" """OpenRouter API client using OpenAI SDK"""
def __init__(self, config: Optional[Config] = None): def __init__(
""" self,
Initialize AI client api_key: str,
model: str,
base_url: str = "https://openrouter.ai/api/v1"
):
self.client = OpenAI(api_key=api_key, base_url=base_url)
Args: if model in AVAILABLE_MODELS:
config: Application configuration (uses get_config() if None) self.model = AVAILABLE_MODELS[model]
""" else:
load_dotenv() self.model = model
from src.core.config import get_config def generate_completion(
self.config = config or get_config()
api_key = os.getenv("AI_API_KEY")
if not api_key:
raise AIClientError("AI_API_KEY environment variable not set")
# OpenRouter requires specific headers and configuration
self.client = OpenAI(
base_url=self.config.ai_service.base_url,
api_key=api_key,
default_headers={
"HTTP-Referer": "https://github.com/yourusername/Big-Link-Man",
"X-Title": "Big Link Man Content Generator"
}
)
self.default_model = self.config.ai_service.model
self.max_tokens = self.config.ai_service.max_tokens
self.temperature = self.config.ai_service.temperature
self.timeout = self.config.ai_service.timeout
def generate(
self, self,
prompt: str, prompt: str,
model: Optional[str] = None, system_message: Optional[str] = None,
temperature: Optional[float] = None, max_tokens: int = 4000,
max_tokens: Optional[int] = None, temperature: float = 0.7,
response_format: Optional[Dict[str, Any]] = None json_mode: bool = False
) -> str: ) -> str:
""" """
Generate text using AI model Generate completion from OpenRouter API
Args: Args:
prompt: The prompt text prompt: User prompt text
model: Model to use (defaults to config default) system_message: Optional system message
temperature: Temperature (defaults to config default) max_tokens: Maximum tokens to generate
max_tokens: Max tokens (defaults to config default) temperature: Sampling temperature (0-1)
response_format: Optional response format for structured output json_mode: If True, requests JSON response format
Returns: Returns:
Generated text Generated text completion
Raises:
AIClientError: If generation fails
""" """
try: messages = []
kwargs = { if system_message:
"model": model or self.default_model, messages.append({"role": "system", "content": system_message})
"messages": [{"role": "user", "content": prompt}], messages.append({"role": "user", "content": prompt})
"temperature": temperature if temperature is not None else self.temperature,
"max_tokens": max_tokens or self.max_tokens, kwargs: Dict[str, Any] = {
"timeout": self.timeout, "model": self.model,
"messages": messages,
"max_tokens": max_tokens,
"temperature": temperature
} }
if response_format: if json_mode:
kwargs["response_format"] = response_format kwargs["response_format"] = {"type": "json_object"}
retries = 3
for attempt in range(retries):
try:
response = self.client.chat.completions.create(**kwargs) response = self.client.chat.completions.create(**kwargs)
content = response.choices[0].message.content or ""
# Debug: print first 200 chars if json_mode
if json_mode:
print(f"[DEBUG] AI Response (first 200 chars): {content[:200]}")
return content
if not response.choices: except RateLimitError as e:
raise AIClientError("No response from AI model") if attempt < retries - 1:
wait_time = 2 ** attempt
print(f"Rate limit hit. Retrying in {wait_time}s...")
time.sleep(wait_time)
else:
raise
content = response.choices[0].message.content except APIError as e:
if not content: if attempt < retries - 1 and "network" in str(e).lower():
raise AIClientError("Empty response from AI model") wait_time = 2 ** attempt
print(f"Network error. Retrying in {wait_time}s...")
return content.strip() time.sleep(wait_time)
else:
raise
except Exception as e: except Exception as e:
raise AIClientError(f"AI generation failed: {e}") raise
def generate_json( return ""
self,
prompt: str,
model: Optional[str] = None, class PromptManager:
temperature: Optional[float] = None, """Manages loading and formatting of prompt templates"""
max_tokens: Optional[int] = None
) -> Dict[str, Any]: def __init__(self, prompts_dir: str = "src/generation/prompts"):
self.prompts_dir = Path(prompts_dir)
self.prompts: Dict[str, dict] = {}
def load_prompt(self, prompt_name: str) -> dict:
"""Load prompt from JSON file"""
if prompt_name in self.prompts:
return self.prompts[prompt_name]
prompt_file = self.prompts_dir / f"{prompt_name}.json"
if not prompt_file.exists():
raise FileNotFoundError(f"Prompt file not found: {prompt_file}")
with open(prompt_file, 'r', encoding='utf-8') as f:
prompt_data = json.load(f)
self.prompts[prompt_name] = prompt_data
return prompt_data
def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
""" """
Generate JSON-formatted response Format prompt with variables
Args: Args:
prompt: The prompt text (should request JSON output) prompt_name: Name of the prompt template
model: Model to use **kwargs: Variables to inject into the template
temperature: Temperature
max_tokens: Max tokens
Returns: Returns:
Parsed JSON response Tuple of (system_message, user_prompt)
Raises:
AIClientError: If generation or parsing fails
""" """
response_text = self.generate( prompt_data = self.load_prompt(prompt_name)
prompt=prompt,
model=model,
temperature=temperature,
max_tokens=max_tokens,
response_format={"type": "json_object"}
)
try: system_message = prompt_data.get("system_message", "")
return json.loads(response_text) user_prompt = prompt_data.get("user_prompt", "")
except json.JSONDecodeError as e:
raise AIClientError(f"Failed to parse JSON response: {e}\nResponse: {response_text}")
def validate_model(self, model: str) -> bool: if system_message:
""" system_message = system_message.format(**kwargs)
Check if a model is available in configuration
Args: user_prompt = user_prompt.format(**kwargs)
model: Model identifier
Returns:
True if model is available
"""
available = self.config.ai_service.available_models
return model in available.values() or model in available.keys()
def get_model_id(self, model_name: str) -> str:
"""
Get full model ID from short name
Args:
model_name: Short name (e.g., "claude-3.5-sonnet") or full ID
Returns:
Full model ID
"""
available = self.config.ai_service.available_models
if model_name in available:
return available[model_name]
if model_name in available.values():
return model_name
return model_name
return system_message, user_prompt
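The retry loop in `generate_completion` waits `2 ** attempt` seconds between attempts and re-raises once the attempts run out. A minimal standalone sketch of that pattern follows; `call_with_backoff`, `flaky`, and the injected `sleep` parameter are hypothetical names for illustration and are not part of the commit.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with waits of 1s, 2s, 4s, ..."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            # Out of attempts: re-raise the last error instead of hiding it.
            if attempt == retries - 1:
                raise
            sleep(2 ** attempt)
    raise AssertionError("unreachable")


# Demo with a fake sleep (list.append) so nothing actually blocks.
waits: List[float] = []
attempts = {"count": 0}


def flaky() -> str:
    """Fail twice with a simulated transient error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = call_with_backoff(flaky, retries=3, sleep=waits.append)
```

Unlike the loop in the diff, this sketch retries on every exception type; the committed code only retries `RateLimitError` and `APIError` messages containing "network". Passing `sleep` as a parameter is a design choice made here so the waits can be observed in a test without real delays.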


@@ -1,15 +1,12 @@
""" """
Batch job processor for generating multiple articles across tiers Batch processor for content generation jobs
""" """
import time from typing import Dict, Any
from typing import Optional import click
from sqlalchemy.orm import Session from src.generation.service import ContentGenerator
from src.database.models import Project from src.generation.job_config import JobConfig, Job, TierConfig
from src.database.repositories import ProjectRepository from src.database.repositories import GeneratedContentRepository, ProjectRepository
from src.generation.service import ContentGenerationService, GenerationError
from src.generation.job_config import JobConfig, JobResult
from src.core.config import Config, get_config
class BatchProcessor: class BatchProcessor:
@@ -17,167 +14,205 @@ class BatchProcessor:
def __init__( def __init__(
self, self,
session: Session, content_generator: ContentGenerator,
config: Optional[Config] = None content_repo: GeneratedContentRepository,
project_repo: ProjectRepository
): ):
""" self.generator = content_generator
Initialize batch processor self.content_repo = content_repo
self.project_repo = project_repo
Args: self.stats = {
session: Database session "total_jobs": 0,
config: Application configuration "processed_jobs": 0,
""" "total_articles": 0,
self.session = session "generated_articles": 0,
self.config = config or get_config() "augmented_articles": 0,
self.project_repo = ProjectRepository(session) "failed_articles": 0
self.generation_service = ContentGenerationService(session, config) }
def process_job( def process_job(
self, self,
job_config: JobConfig, job_file_path: str,
progress_callback: Optional[callable] = None, debug: bool = False,
debug: bool = False continue_on_error: bool = False
) -> JobResult: ):
""" """
Process a batch job according to configuration Process all jobs in job file
Args: Args:
job_config: Job configuration job_file_path: Path to job JSON file
progress_callback: Optional callback function(tier, article_num, total, status) debug: If True, save AI responses to debug_output/
continue_on_error: If True, continue on article generation failure
Returns:
JobResult with statistics
""" """
start_time = time.time() job_config = JobConfig(job_file_path)
jobs = job_config.get_jobs()
project = self.project_repo.get_by_id(job_config.project_id) self.stats["total_jobs"] = len(jobs)
for job_idx, job in enumerate(jobs, 1):
try:
self._process_single_job(job, job_idx, debug, continue_on_error)
self.stats["processed_jobs"] += 1
except Exception as e:
click.echo(f"Error processing job {job_idx}: {e}")
if not continue_on_error:
raise
self._print_summary()
def _process_single_job(
self,
job: Job,
job_idx: int,
debug: bool,
continue_on_error: bool
):
"""Process a single job"""
project = self.project_repo.get_by_id(job.project_id)
if not project: if not project:
raise ValueError(f"Project {job_config.project_id} not found") raise ValueError(f"Project {job.project_id} not found")
result = JobResult( click.echo(f"\nProcessing Job {job_idx}/{self.stats['total_jobs']}: Project ID {job.project_id}")
job_name=job_config.job_name,
project_id=job_config.project_id, for tier_name, tier_config in job.tiers.items():
total_articles=job_config.get_total_articles(), self._process_tier(
successful=0, job.project_id,
failed=0, tier_name,
skipped=0 tier_config,
debug,
continue_on_error
) )
consecutive_failures = 0 def _process_tier(
self,
project_id: int,
tier_name: str,
tier_config: TierConfig,
debug: bool,
continue_on_error: bool
):
"""Process all articles for a tier"""
click.echo(f" {tier_name}: Generating {tier_config.count} articles")
for tier_config in job_config.tiers: project = self.project_repo.get_by_id(project_id)
tier = tier_config.tier keyword = project.main_keyword
for article_num in range(1, tier_config.article_count + 1): for article_num in range(1, tier_config.count + 1):
if progress_callback: self.stats["total_articles"] += 1
progress_callback(
tier=tier,
article_num=article_num,
total=tier_config.article_count,
status="starting"
)
try: try:
content = self.generation_service.generate_article( self._generate_single_article(
project=project, project_id,
tier=tier, tier_name,
title_model=tier_config.models.title, tier_config,
outline_model=tier_config.models.outline, article_num,
content_model=tier_config.models.content, keyword,
max_retries=tier_config.validation_attempts, debug
progress_callback=progress_callback, )
self.stats["generated_articles"] += 1
except Exception as e:
self.stats["failed_articles"] += 1
import traceback
click.echo(f" [{article_num}/{tier_config.count}] FAILED: {e}")
click.echo(f" Traceback: {traceback.format_exc()}")
try:
self.content_repo.create(
project_id=project_id,
tier=tier_name,
keyword=keyword,
title="Failed Generation",
outline={"error": str(e)},
content="",
word_count=0,
status="failed"
)
except Exception as db_error:
click.echo(f" Failed to save error record: {db_error}")
if not continue_on_error:
raise
def _generate_single_article(
self,
project_id: int,
tier_name: str,
tier_config: TierConfig,
article_num: int,
keyword: str,
debug: bool
):
"""Generate a single article"""
prefix = f" [{article_num}/{tier_config.count}]"
click.echo(f"{prefix} Generating title...")
title = self.generator.generate_title(project_id, debug=debug)
click.echo(f"{prefix} Generated title: \"{title}\"")
click.echo(f"{prefix} Generating outline...")
outline = self.generator.generate_outline(
project_id=project_id,
title=title,
min_h2=tier_config.min_h2_tags,
max_h2=tier_config.max_h2_tags,
min_h3=tier_config.min_h3_tags,
max_h3=tier_config.max_h3_tags,
debug=debug debug=debug
) )
result.successful += 1 h2_count = len(outline["outline"])
result.add_tier_result(tier, "successful") h3_count = sum(len(section.get("h3", [])) for section in outline["outline"])
consecutive_failures = 0 click.echo(f"{prefix} Generated outline: {h2_count} H2s, {h3_count} H3s")
if progress_callback: click.echo(f"{prefix} Generating content...")
progress_callback( content = self.generator.generate_content(
tier=tier, project_id=project_id,
article_num=article_num, title=title,
total=tier_config.article_count, outline=outline,
status="completed", min_word_count=tier_config.min_word_count,
content_id=content.id max_word_count=tier_config.max_word_count,
debug=debug
) )
except GenerationError as e: word_count = self.generator.count_words(content)
error_msg = f"Tier {tier}, Article {article_num}: {str(e)}" click.echo(f"{prefix} Generated content: {word_count:,} words")
result.add_error(error_msg)
consecutive_failures += 1
if job_config.failure_config.skip_on_failure: status = "generated"
result.skipped += 1
result.add_tier_result(tier, "skipped")
if progress_callback: if word_count < tier_config.min_word_count:
progress_callback( click.echo(f"{prefix} Below minimum ({tier_config.min_word_count:,}), augmenting...")
tier=tier, content = self.generator.augment_content(
article_num=article_num, content=content,
total=tier_config.article_count, target_word_count=tier_config.min_word_count,
status="skipped", debug=debug,
error=str(e) project_id=project_id
)
word_count = self.generator.count_words(content)
click.echo(f"{prefix} Augmented content: {word_count:,} words")
status = "augmented"
self.stats["augmented_articles"] += 1
saved_content = self.content_repo.create(
project_id=project_id,
tier=tier_name,
keyword=keyword,
title=title,
outline=outline,
content=content,
word_count=word_count,
status=status
) )
if consecutive_failures >= job_config.failure_config.max_consecutive_failures: click.echo(f"{prefix} Saved (ID: {saved_content.id}, Status: {status})")
result.add_error(
f"Stopping job: {consecutive_failures} consecutive failures exceeded threshold"
)
result.duration = time.time() - start_time
return result
else:
result.failed += 1
result.add_tier_result(tier, "failed")
result.duration = time.time() - start_time
if progress_callback:
progress_callback(
tier=tier,
article_num=article_num,
total=tier_config.article_count,
status="failed",
error=str(e)
)
return result
except Exception as e:
error_msg = f"Tier {tier}, Article {article_num}: Unexpected error: {str(e)}"
result.add_error(error_msg)
result.failed += 1
result.add_tier_result(tier, "failed")
result.duration = time.time() - start_time
if progress_callback:
progress_callback(
tier=tier,
article_num=article_num,
total=tier_config.article_count,
status="failed",
error=str(e)
)
return result
result.duration = time.time() - start_time
return result
def process_job_from_file(
self,
job_file_path: str,
progress_callback: Optional[callable] = None
) -> JobResult:
"""
Load and process a job from a JSON file
Args:
job_file_path: Path to job configuration JSON file
progress_callback: Optional progress callback
Returns:
JobResult with statistics
"""
job_config = JobConfig.from_file(job_file_path)
return self.process_job(job_config, progress_callback)
def _print_summary(self):
"""Print job processing summary"""
click.echo("\n" + "="*60)
click.echo("SUMMARY")
click.echo("="*60)
click.echo(f"Jobs processed: {self.stats['processed_jobs']}/{self.stats['total_jobs']}")
click.echo(f"Articles generated: {self.stats['generated_articles']}/{self.stats['total_articles']}")
click.echo(f"Augmented: {self.stats['augmented_articles']}")
click.echo(f"Failed: {self.stats['failed_articles']}")
click.echo("="*60)
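`_generate_single_article` reports heading counts by reading the outline as `{"outline": [{"h2": ..., "h3": [...]}]}`. The sketch below reproduces that counting on sample data. `count_headings` and `within_limits` are hypothetical helpers, and the range check is an assumption about what a validation step could look like, not the validation in `generate_outline`, which this diff does not show.

```python
from typing import Any, Dict, Tuple


def count_headings(outline: Dict[str, Any]) -> Tuple[int, int]:
    """Return (h2_count, h3_count) for an outline in the shape used above."""
    sections = outline["outline"]
    h2_count = len(sections)
    # A section without an "h3" key contributes zero subheadings.
    h3_count = sum(len(section.get("h3", [])) for section in sections)
    return h2_count, h3_count


def within_limits(count: int, minimum: int, maximum: int) -> bool:
    """Hypothetical range check against tier limits."""
    return minimum <= count <= maximum


# Made-up sample outline for illustration.
sample_outline = {
    "outline": [
        {"h2": "First section", "h3": ["A", "B"]},
        {"h2": "Second section"},
        {"h2": "Third section", "h3": ["C", "D", "E"]},
    ]
}

h2_count, h3_count = count_headings(sample_outline)
# Limits taken from the tier1 defaults: 3-5 H2 tags, 5-10 H3 tags.
h2_ok = within_limits(h2_count, 3, 5)
h3_ok = within_limits(h3_count, 5, 10)
```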


@@ -1,213 +1,129 @@
""" """
Job configuration schema and validation for batch content generation Job configuration parser for batch content generation
""" """
from typing import List, Dict, Optional, Literal
from pydantic import BaseModel, Field, field_validator
import json import json
from dataclasses import dataclass
from typing import Optional, Dict, Any
from pathlib import Path from pathlib import Path
TIER_DEFAULTS = {
class ModelConfig(BaseModel): "tier1": {
"""AI models configuration for each generation stage""" "min_word_count": 2000,
title: str = Field(..., description="Model for title generation") "max_word_count": 2500,
outline: str = Field(..., description="Model for outline generation") "min_h2_tags": 3,
content: str = Field(..., description="Model for content generation") "max_h2_tags": 5,
"min_h3_tags": 5,
"max_h3_tags": 10
},
"tier2": {
"min_word_count": 1500,
"max_word_count": 2000,
"min_h2_tags": 2,
"max_h2_tags": 4,
"min_h3_tags": 3,
"max_h3_tags": 8
},
"tier3": {
"min_word_count": 1000,
"max_word_count": 1500,
"min_h2_tags": 2,
"max_h2_tags": 3,
"min_h3_tags": 2,
"max_h3_tags": 6
}
}
class AnchorTextConfig(BaseModel): @dataclass
"""Anchor text configuration""" class TierConfig:
mode: Literal["default", "override", "append"] = Field( """Configuration for a specific tier"""
default="default", count: int
description="How to handle anchor text: default (use CORA), override (replace), append (add to)" min_word_count: int
) max_word_count: int
custom_text: Optional[List[str]] = Field( min_h2_tags: int
default=None, max_h2_tags: int
description="Custom anchor text for override mode" min_h3_tags: int
) max_h3_tags: int
additional_text: Optional[List[str]] = Field(
default=None,
description="Additional anchor text for append mode"
)
class TierConfig(BaseModel): @dataclass
"""Configuration for a single tier""" class Job:
tier: int = Field(..., ge=1, description="Tier number (1 = strictest validation)") """Job definition for content generation"""
article_count: int = Field(..., ge=1, description="Number of articles to generate")
models: ModelConfig = Field(..., description="AI models for this tier")
anchor_text_config: AnchorTextConfig = Field(
default_factory=AnchorTextConfig,
description="Anchor text configuration"
)
validation_attempts: int = Field(
default=3,
ge=1,
le=10,
description="Max validation retry attempts per stage"
)
class FailureConfig(BaseModel):
"""Failure handling configuration"""
max_consecutive_failures: int = Field(
default=5,
ge=1,
description="Stop job after this many consecutive failures"
)
skip_on_failure: bool = Field(
default=True,
description="Skip failed articles and continue, or stop immediately"
)
class InterlinkingConfig(BaseModel):
"""Interlinking configuration"""
links_per_article_min: int = Field(
default=2,
ge=0,
description="Minimum links to other articles"
)
links_per_article_max: int = Field(
default=4,
ge=0,
description="Maximum links to other articles"
)
include_home_link: bool = Field(
default=True,
description="Include link to home page"
)
@field_validator('links_per_article_max')
@classmethod
def validate_max_greater_than_min(cls, v, info):
if 'links_per_article_min' in info.data and v < info.data['links_per_article_min']:
raise ValueError("links_per_article_max must be >= links_per_article_min")
return v
class JobConfig(BaseModel):
"""Complete job configuration"""
job_name: str = Field(..., description="Descriptive name for the job")
project_id: int = Field(..., ge=1, description="Project ID to use for all tiers")
description: Optional[str] = Field(None, description="Optional job description")
tiers: List[TierConfig] = Field(..., min_length=1, description="Tier configurations")
failure_config: FailureConfig = Field(
default_factory=FailureConfig,
description="Failure handling configuration"
)
interlinking: InterlinkingConfig = Field(
default_factory=InterlinkingConfig,
description="Interlinking configuration"
)
@field_validator('tiers')
@classmethod
def validate_unique_tiers(cls, v):
tier_numbers = [tier.tier for tier in v]
if len(tier_numbers) != len(set(tier_numbers)):
raise ValueError("Tier numbers must be unique")
return v
@classmethod
def from_file(cls, file_path: str) -> 'JobConfig':
"""
Load job configuration from JSON file
Args:
file_path: Path to the JSON file
Returns:
JobConfig instance
Raises:
FileNotFoundError: If file doesn't exist
ValueError: If JSON is invalid or validation fails
"""
path = Path(file_path)
if not path.exists():
raise FileNotFoundError(f"Job configuration file not found: {file_path}")
try:
with open(path, 'r', encoding='utf-8') as f:
data = json.load(f)
return cls(**data)
except json.JSONDecodeError as e:
raise ValueError(f"Invalid JSON in {file_path}: {e}")
except Exception as e:
raise ValueError(f"Failed to parse job configuration: {e}")
def to_file(self, file_path: str) -> None:
"""
Save job configuration to JSON file
Args:
file_path: Path to save the JSON file
"""
path = Path(file_path)
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, 'w', encoding='utf-8') as f:
json.dump(self.model_dump(), f, indent=2)
def get_total_articles(self) -> int:
"""Get total number of articles across all tiers"""
return sum(tier.article_count for tier in self.tiers)
class JobResult(BaseModel):
"""Result of a job execution"""
job_name: str
project_id: int project_id: int
total_articles: int tiers: Dict[str, TierConfig]
successful: int
failed: int
skipped: int
tier_results: Dict[int, Dict[str, int]] = Field(default_factory=dict)
errors: List[str] = Field(default_factory=list)
duration: float = 0.0
def add_tier_result(self, tier: int, status: str) -> None:
"""Track result for a tier"""
if tier not in self.tier_results:
self.tier_results[tier] = {"successful": 0, "failed": 0, "skipped": 0}
if status in self.tier_results[tier]: class JobConfig:
self.tier_results[tier][status] += 1 """Parser for job configuration files"""
def add_error(self, error: str) -> None: def __init__(self, job_file_path: str):
"""Add an error message""" """
self.errors.append(error) Load and parse job file, apply defaults
def to_summary(self) -> str: Args:
"""Generate a human-readable summary""" job_file_path: Path to JSON job file
lines = [ """
f"Job: {self.job_name}", self.job_file_path = Path(job_file_path)
f"Project ID: {self.project_id}", self.jobs: list[Job] = []
f"Duration: {self.duration:.2f}s", self._load()
f"",
f"Results:",
f" Total Articles: {self.total_articles}",
f" Successful: {self.successful}",
f" Failed: {self.failed}",
f" Skipped: {self.skipped}",
f"",
f"By Tier:"
]
for tier, results in sorted(self.tier_results.items()): def _load(self):
lines.append(f" Tier {tier}:") """Load and parse the job file"""
lines.append(f" Successful: {results['successful']}") if not self.job_file_path.exists():
lines.append(f" Failed: {results['failed']}") raise FileNotFoundError(f"Job file not found: {self.job_file_path}")
lines.append(f" Skipped: {results['skipped']}")
if self.errors: with open(self.job_file_path, 'r', encoding='utf-8') as f:
lines.append("") data = json.load(f)
lines.append(f"Errors ({len(self.errors)}):")
for error in self.errors[:10]:
lines.append(f" - {error}")
if len(self.errors) > 10:
lines.append(f" ... and {len(self.errors) - 10} more")
return "\n".join(lines) if "jobs" not in data:
raise ValueError("Job file must contain 'jobs' array")
for job_data in data["jobs"]:
self._validate_job(job_data)
job = self._parse_job(job_data)
self.jobs.append(job)
def _validate_job(self, job_data: dict):
"""Validate job structure"""
if "project_id" not in job_data:
raise ValueError("Job missing 'project_id'")
if "tiers" not in job_data:
raise ValueError("Job missing 'tiers'")
if not isinstance(job_data["tiers"], dict):
raise ValueError("'tiers' must be a dictionary")
def _parse_job(self, job_data: dict) -> Job:
"""Parse a single job"""
project_id = job_data["project_id"]
tiers = {}
for tier_name, tier_data in job_data["tiers"].items():
tier_config = self._parse_tier(tier_name, tier_data)
tiers[tier_name] = tier_config
return Job(project_id=project_id, tiers=tiers)
def _parse_tier(self, tier_name: str, tier_data: dict) -> TierConfig:
"""Parse tier configuration with defaults"""
defaults = TIER_DEFAULTS.get(tier_name, TIER_DEFAULTS["tier3"])
return TierConfig(
count=tier_data.get("count", 1),
min_word_count=tier_data.get("min_word_count", defaults["min_word_count"]),
max_word_count=tier_data.get("max_word_count", defaults["max_word_count"]),
min_h2_tags=tier_data.get("min_h2_tags", defaults["min_h2_tags"]),
max_h2_tags=tier_data.get("max_h2_tags", defaults["max_h2_tags"]),
min_h3_tags=tier_data.get("min_h3_tags", defaults["min_h3_tags"]),
max_h3_tags=tier_data.get("max_h3_tags", defaults["max_h3_tags"])
)
def get_jobs(self) -> list[Job]:
"""Return list of all jobs in file"""
return self.jobs
def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
"""Get tier config with defaults applied"""
return job.tiers.get(tier_name)
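The new parser expects a top-level `jobs` array whose entries carry a `project_id` and a `tiers` dictionary, and `_parse_tier` fills missing fields from `TIER_DEFAULTS`, falling back to `tier3` for unknown tier names. A minimal sketch of that merge follows, assuming a made-up job file; `resolve_tier`, `DEFAULTS`, and `JOB_FILE` are hypothetical names, and only the `tier1` and `tier3` default values are copied from the diff.

```python
import json
from typing import Any, Dict

# Default values copied from TIER_DEFAULTS above (tier2 omitted for brevity).
DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {
        "min_word_count": 2000, "max_word_count": 2500,
        "min_h2_tags": 3, "max_h2_tags": 5,
        "min_h3_tags": 5, "max_h3_tags": 10,
    },
    "tier3": {
        "min_word_count": 1000, "max_word_count": 1500,
        "min_h2_tags": 2, "max_h2_tags": 3,
        "min_h3_tags": 2, "max_h3_tags": 6,
    },
}


def resolve_tier(tier_name: str, tier_data: Dict[str, Any]) -> Dict[str, int]:
    """Merge explicit tier settings over the defaults for that tier."""
    base = DEFAULTS.get(tier_name, DEFAULTS["tier3"])  # unknown tiers use tier3
    merged = {"count": tier_data.get("count", 1)}
    for key, default in base.items():
        merged[key] = tier_data.get(key, default)
    return merged


# Hypothetical job file content, shaped to satisfy _load and _validate_job.
JOB_FILE = """
{"jobs": [{"project_id": 1,
           "tiers": {"tier1": {"count": 5, "min_word_count": 1800},
                     "custom": {}}}]}
"""

job = json.loads(JOB_FILE)["jobs"][0]
resolved = {name: resolve_tier(name, data) for name, data in job["tiers"].items()}
```

Here `tier1` keeps its explicit `count` and `min_word_count` and inherits the rest, while the unrecognized `custom` tier gets `count` 1 and the `tier3` limits.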


@@ -1,9 +1,5 @@
{ {
"system": "You are a content enhancement specialist who adds natural, relevant paragraphs to articles to meet optimization targets.", "system_message": "You are an expert content editor who expands articles by adding depth, detail, and additional relevant information while maintaining topical focus and quality.",
"user_template": "Add new paragraph(s) to the following article to address these missing elements:\n\nCurrent Article:\n{current_content}\n\nWhat's Missing:\n{missing_elements}\n\nMain Keyword: {main_keyword}\nEntities to use: {target_entities}\nRelated Searches to reference: {target_searches}\nTarget Word Count for New Content: {target_word_count} words\n\nInstructions:\n1. Write {target_word_count} words of new content (1-3 paragraphs as needed)\n2. Naturally incorporate the missing keywords/entities/searches\n3. Make it relevant to the article topic\n4. Use a professional, engaging tone\n5. Don't directly repeat information already in the article\n6. The paragraphs should feel like natural additions\n7. IMPORTANT: Write at least {target_word_count} words to ensure we meet the target\n\nSuggested placement: {suggested_placement}\n\nRespond with ONLY the new paragraph(s) in HTML format:\n<p>First paragraph here...</p>\n<p>Second paragraph here...</p>\n\nDo not include the entire article, just the new paragraph(s) to insert.", "user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count} words.\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment with the same structure (using <h2>, <h3>, <p> tags). You can add new paragraphs, expand existing ones, or add new subsections as needed. Do NOT change the existing headings unless necessary."
"validation": {
"output_format": "html"
}
} }


@@ -1,12 +1,5 @@
{ {
"system": "You are an creative content writer who creates comprehensive, engaging articles that strictly follow the provided outline and meet all CORA optimization requirements.", "system_message": "You are an expert content writer who creates engaging, informative, and SEO-optimized articles that provide real value to readers while incorporating relevant keywords naturally.",
"user_template": "Write a complete, SEO-optimized article following this outline:\n\n{outline}\n\nArticle Details:\n- Title: {title}\n- Main Keyword: {main_keyword}\n- Target Token Count: {word_count}\n- Keyword Frequency Target: {term_frequency}% mentions\n\nEntities to incorporate: {entities}\nRelated Searches to reference: {related_searches}\n\nCritical Requirements:\n1. Follow the outline structure EXACTLY - use the provided H2 and H3 headings word-for-word\n2. Do NOT add numbering, Roman numerals, or letters to the headings\n3. The article must be {word_count} tokens long (±100 tokens)\n4. Mention the main keyword \"{main_keyword}\" naturally {term_frequency}% times throughout\n5. Write 2-3 substantial paragraphs under each heading. Reference industry standards, regulations, or best practices. Use relevant LSI and entities for the topic\n6. For the FAQ section:\n - Each FAQ answer MUST begin by restating the question\n - Provide detailed, helpful answers (100-150 words each)\n7. Incorporate entities and related searches naturally throughout\n8. Write in a professional, engaging tone. Use active voice for 80% of sentences\n9. Make content informative and valuable to readers. Use technical terminology appropriate for industry professionals.\n10. Use varied sentence structures and vocabulary.\n11. 
STRICTLY PROHIBITED: Filler phrases: 'it is important to note', as mentioned earlier', 'in conclusion' - Marketing language: 'revolutionary', 'game-changing', 'industry-leading', 'best-in-class' - Generic openings: 'In today's world', 'As we all know', 'It goes without saying' \n\nFormatting Requirements:\n- Use <h1> for the main title\n- Use <h2> for major sections\n- Use <h3> for subsections\n- Use <p> for paragraphs\n- Use <ul> and <li> for lists where appropriate\n- Do NOT include any CSS, <html>, <head>, or <body> tags\n- Return ONLY the article content HTML\n\nExample structure:\n<h1>Main Title</h1>\n<p>Introduction paragraph...</p>\n\n<h2>First Section</h2>\n<p>Content...</p>\n\n<h3>Subsection</h3>\n<p>More content...</p>\n\nWrite the complete article now.", "user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include naturally: {entities}\nRelated searches to address: {related_searches}\n\nTarget word count range: {min_word_count} to {max_word_count} words\n\nReturn as an HTML fragment with <h2>, <h3>, and <p> tags. Do NOT include <!DOCTYPE>, <html>, <head>, or <body> tags. Start directly with the first <h2> heading.\n\nWrite naturally and informatively. Incorporate the keyword, entities, and related searches organically throughout the content."
"validation": {
"output_format": "html",
"min_word_count": true,
"max_word_count": true,
"keyword_frequency_target": true,
"outline_structure_match": true
}
} }


@@ -1,11 +1,5 @@
{ {
"system": "You are an expert content strategist who creates compelling, specific article titles that provide clear direction for content creation. You also strive to meet strict CORA optimization targets.", "system_message": "You are an expert content outliner who creates well-structured, comprehensive article outlines that cover topics thoroughly and logically.",
"user_template": "Create a detailed article outline for the following:\n\nTitle: {title}\nMain Keyword: {main_keyword}\nTarget Word Count: {word_count}\n\nCORA Targets:\n- H2 headings needed: {h2_total}\n- H2s with main keyword: {h2_exact}\n- H2s with related searches: {h2_related_search}\n- H2s with entities: {h2_entities}\n- H3 headings needed: {h3_total}\n- H3s with main keyword: {h3_exact}\n- H3s with related searches: {h3_related_search}\n- H3s with entities: {h3_entities}\n\nAvailable Entities: {entities}\nRelated Searches: {related_searches}\n\nThe title provided above will serve as the H1 heading for this article. Focus on creating the H2 and H3 structure that supports this title.\n\nRequirements:\n1. Create exactly {h2_total} H2 headings\n2. Create exactly {h3_total} H3 headings (distributed under H2s)\n3. At least {h2_exact} H2s must contain the exact keyword \"{main_keyword}\"\n4. The FIRST H2 should contain the main keyword\n5. Incorporate entities and related searches naturally into headings\n6. Include a \"Frequently Asked Questions\" H2 section with at least 3 H3 questions\n7. Each H3 question should be a complete question ending with ?\n8. Structure should flow logically\nCreate headings that build logically toward actionable insights\n9. Use specific, searchable language over generic terms\n 9. Include sub-topic hints in parentheses where helpful \n 10. Focus on reader problems and solutions.\n 11. FORBIDDEN ELEMENTS: Future-tense speculation ('The Future of...', 'Upcoming Trends') - Generic business-speak ('in today's competitive landscape', 'cutting-edge solutions') - Vague qualifiers ('best practices', 'industry-leading', 'world-class') \n\nIMPORTANT FORMATTING RULES:\n- Do NOT include numbering (1., 2., 3.)\n- Do NOT include Roman numerals (I., II., III.)\n- Do NOT include letters (A., B., C.)\n- Do NOT include any outline-style prefixes\n- Return clean heading text only\n\nWRONG: \"I. Introduction to {main_keyword}\"\nWRONG: \"1. 
Getting Started with {main_keyword}\"\nRIGHT: \"Introduction to {main_keyword}\"\nRIGHT: \"Getting Started with {main_keyword}\"\n\nRespond ONLY with valid JSON in this exact format (no additional text, explanations, or commentary):\n{{\n \"sections\": [\n {{\n \"h2\": \"H2 heading text\",\n \"h3s\": [\"H3 heading 1\", \"H3 heading 2\"]\n }}\n ]\n}}\n\nReturn ONLY the JSON object. Do not include any text before or after the JSON.", "user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- Between {min_h2} and {max_h2} H2 headings\n- Between {min_h3} and {max_h3} H3 subheadings total (distributed across H2 sections)\n\nEntities to incorporate: {entities}\nRelated searches to address: {related_searches}\n\nReturn ONLY valid JSON in this exact format:\n{{\"outline\": [{{\"h2\": \"Heading text\", \"h3\": [\"Subheading 1\", \"Subheading 2\"]}}, ...]}}\n\nEnsure the outline meets the minimum heading requirements and includes relevant entities and related searches."
"validation": {
"output_format": "json",
"required_fields": ["sections"],
"h2_count_must_match": true,
"h3_count_must_match": true
}
} }
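
Reviewer note: both versions of the outline prompt wrap their JSON example in doubled braces (`{{`, `}}`) so that the example survives placeholder substitution. The sketch below illustrates that behaviour, assuming the templates are rendered with Python's `str.format`. The removed service code did call `.format(...)` on `user_template`, but how `PromptManager.format_prompt` renders the new templates is not shown in this diff. The shortened template and the `render` helper are hypothetical.

```python
import json

# Hypothetical, shortened template in the style of outline_generation.json.
TEMPLATE = (
    "Title: {title}\n"
    "Keyword: {keyword}\n"
    'Return ONLY valid JSON: {{"outline": [{{"h2": "Heading", "h3": ["Sub"]}}]}}'
)


def render(template: str, **values: str) -> str:
    """Fill single-brace placeholders; doubled braces become literal braces."""
    return template.format(**values)


rendered = render(TEMPLATE, title="Choosing a Kayak", keyword="kayak")

# The JSON example embedded in the rendered prompt is now parseable as-is.
json_example = rendered.split("JSON: ", 1)[1]
parsed = json.loads(json_example)
print(rendered)
```

By the same rule, a doubled-brace token such as `{{subject}}` is emitted literally as `{subject}` rather than being substituted.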


@ -1,10 +1,5 @@
{ {
"system": "You are an expert content strategist who creates compelling, specific article titles that provide clear direction for content creation.", "system_message": "You are an expert SEO content writer who creates compelling, search-optimized titles that attract clicks while accurately representing the content topic.",
"user_template": "Generate an unique, compelling article title for the broad topic: \"{main_keyword}\".\n\nContext:\n- Main Keyword: {main_keyword}\n- - Top Entities: {entities}\n- Related Searches: {related_searches}\n\nRequirements:\n1. The title MUST contain the exact main keyword: \"{main_keyword}\"\n2. The title should be compelling and click-worthy\n3. Each title must be specific enough that an AI could create substantial, focused content outline from the title alone\n4.Titles should be creative yet professionally relevant to: {{subject}}. It does not have to be directly related but must be at least tangentially related.\n5. Consider incorporating 1-2 related entities or searches if natural\n6. Mix formats: how-to guides (25%), case studies (10%), expert analyses (20%), comparison pieces (15%), trend analyses (10%), problem-solving articles (10%), listicles(10%)\nAvoid generic business jargon and AI slop (cutting-edge,game-changing, revolutionary)\n7- Use domain-specific terminology appropriate for an article about {main_keyword}\n 8-Include specific, actionable language that suggests clear content direction\n\nRespond with ONLY the title text, no quotes or additional formatting.\n\nExample format: \"Complete Guide to {main_keyword}: Tips and Best Practices\"", "user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting or quotes."
"validation": {
"must_contain_keyword": true,
"min_length": 30,
"max_length": 120
}
} }
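
Reviewer note: the removed `validation` block declared three title rules (`must_contain_keyword`, `min_length`, `max_length`), while the new `generate_title()` only strips whitespace and wrapping quotes. The sketch below shows what checking those three rules could look like. `check_title` is a hypothetical helper, not the removed `StageValidator`, and treating the length limits as character counts is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Values copied from the removed "validation" block of title_generation.json.
TITLE_RULES = {"must_contain_keyword": True, "min_length": 30, "max_length": 120}


def clean_title(raw: str) -> str:
    """Strip whitespace and wrapping quotes, as generate_title() does."""
    return raw.strip().strip('"').strip("'")


def check_title(
    title: str, keyword: str, rules: Optional[Dict] = None
) -> Tuple[bool, List[str]]:
    """Hypothetical checker: returns (is_valid, list of problems)."""
    rules = rules or TITLE_RULES
    problems: List[str] = []
    if rules.get("must_contain_keyword") and keyword.lower() not in title.lower():
        problems.append("missing keyword")
    # Assumption: lengths are measured in characters, not words.
    if len(title) < rules["min_length"]:
        problems.append("too short")
    if len(title) > rules["max_length"]:
        problems.append("too long")
    return (not problems, problems)


title = clean_title('  "Kayak Buying Guide: How to Pick the Right Hull"  ')
print(check_title(title, "kayak"))
```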


@ -1,337 +1,3 @@
""" # Content validation rules
Content validation rule engine for CORA-compliant HTML generation # DEPRECATED: This module has been replaced by the simplified generation pipeline in service.py
""" # Kept for reference only.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
from html.parser import HTMLParser
import re
from src.core.config import Config
from src.database.models import Project
@dataclass
class ValidationIssue:
"""Single validation issue (error or warning)"""
rule_name: str
severity: str
message: str
expected: Optional[Any] = None
actual: Optional[Any] = None
@dataclass
class ValidationResult:
"""Result of content validation"""
passed: bool
errors: List[ValidationIssue] = field(default_factory=list)
warnings: List[ValidationIssue] = field(default_factory=list)
def add_error(self, rule_name: str, message: str, expected: Any = None, actual: Any = None):
self.errors.append(ValidationIssue(rule_name, "error", message, expected, actual))
self.passed = False
def add_warning(self, rule_name: str, message: str, expected: Any = None, actual: Any = None):
self.warnings.append(ValidationIssue(rule_name, "warning", message, expected, actual))
def to_dict(self) -> Dict:
return {
"passed": self.passed,
"errors": [
{
"rule": e.rule_name,
"severity": e.severity,
"message": e.message,
"expected": e.expected,
"actual": e.actual
} for e in self.errors
],
"warnings": [
{
"rule": w.rule_name,
"severity": w.severity,
"message": w.message,
"expected": w.expected,
"actual": w.actual
} for w in self.warnings
]
}
class ContentHTMLParser(HTMLParser):
"""HTML parser to extract structure and content for validation"""
def __init__(self):
super().__init__()
self.title: Optional[str] = None
self.meta_description: Optional[str] = None
self.h1_tags: List[str] = []
self.h2_tags: List[str] = []
self.h3_tags: List[str] = []
self.images: List[Dict[str, str]] = []
self.links: List[Dict[str, str]] = []
self.text_content: str = ""
self._current_tag: Optional[str] = None
self._current_data: List[str] = []
self._in_title = False
self._in_h1 = False
self._in_h2 = False
self._in_h3 = False
def handle_starttag(self, tag: str, attrs: List[tuple]):
self._current_tag = tag
attrs_dict = dict(attrs)
if tag == "title":
self._in_title = True
self._current_data = []
elif tag == "meta" and attrs_dict.get("name") == "description":
self.meta_description = attrs_dict.get("content", "")
elif tag == "h1":
self._in_h1 = True
self._current_data = []
elif tag == "h2":
self._in_h2 = True
self._current_data = []
elif tag == "h3":
self._in_h3 = True
self._current_data = []
elif tag == "img":
self.images.append({
"src": attrs_dict.get("src", ""),
"alt": attrs_dict.get("alt", "")
})
elif tag == "a":
self.links.append({
"href": attrs_dict.get("href", ""),
"text": ""
})
def handle_endtag(self, tag: str):
if tag == "title" and self._in_title:
self.title = "".join(self._current_data).strip()
self._in_title = False
elif tag == "h1" and self._in_h1:
self.h1_tags.append("".join(self._current_data).strip())
self._in_h1 = False
elif tag == "h2" and self._in_h2:
self.h2_tags.append("".join(self._current_data).strip())
self._in_h2 = False
elif tag == "h3" and self._in_h3:
self.h3_tags.append("".join(self._current_data).strip())
self._in_h3 = False
self._current_tag = None
def handle_data(self, data: str):
if self._in_title or self._in_h1 or self._in_h2 or self._in_h3:
self._current_data.append(data)
if self._current_tag == "a" and self.links:
self.links[-1]["text"] += data
if self._current_tag not in ["script", "style", "head"]:
self.text_content += data
class ContentRuleEngine:
"""Validates HTML content against universal rules and CORA targets"""
def __init__(self, config: Config):
self.config = config
self.universal_rules = config.get("content_rules.universal", {})
self.cora_config = config.get("content_rules.cora_validation", {})
def validate(self, html_content: str, project: Project) -> ValidationResult:
"""
Validate HTML content against all rules
Args:
html_content: Generated HTML content
project: Project with CORA targets
Returns:
ValidationResult with errors and warnings
"""
result = ValidationResult(passed=True)
parser = ContentHTMLParser()
parser.feed(html_content)
self._validate_universal_rules(parser, project, result)
if self.cora_config.get("enabled", True):
self._validate_cora_targets(parser, project, result)
return result
def _validate_universal_rules(self, parser: ContentHTMLParser, project: Project, result: ValidationResult):
"""Validate universal hard rules that apply to all content"""
word_count = len(parser.text_content.split())
min_length = self.universal_rules.get("min_content_length", 0)
max_length = self.universal_rules.get("max_content_length", float('inf'))
if word_count < min_length:
result.add_error(
"min_content_length",
f"Content is too short",
expected=f">={min_length} words",
actual=f"{word_count} words"
)
if word_count > max_length:
result.add_error(
"max_content_length",
f"Content is too long",
expected=f"<={max_length} words",
actual=f"{word_count} words"
)
if self.universal_rules.get("title_exact_match_required", False):
if not parser.title or not self._contains_keyword(parser.title, project.main_keyword):
result.add_error(
"title_exact_match_required",
"Title must contain main keyword",
expected=project.main_keyword,
actual=parser.title or "(no title)"
)
if self.universal_rules.get("h1_exact_match_required", False):
if not parser.h1_tags or not any(self._contains_keyword(h1, project.main_keyword) for h1 in parser.h1_tags):
result.add_error(
"h1_exact_match_required",
"At least one H1 must contain main keyword",
expected=project.main_keyword,
actual=parser.h1_tags
)
h2_min = self.universal_rules.get("h2_exact_match_min", 0)
h2_with_keyword = sum(1 for h2 in parser.h2_tags if self._contains_keyword(h2, project.main_keyword))
if h2_with_keyword < h2_min:
result.add_error(
"h2_exact_match_min",
f"Not enough H2 tags with main keyword",
expected=f">={h2_min}",
actual=h2_with_keyword
)
h3_min = self.universal_rules.get("h3_exact_match_min", 0)
h3_with_keyword = sum(1 for h3 in parser.h3_tags if self._contains_keyword(h3, project.main_keyword))
if h3_with_keyword < h3_min:
result.add_error(
"h3_exact_match_min",
f"Not enough H3 tags with main keyword",
expected=f">={h3_min}",
actual=h3_with_keyword
)
if self.universal_rules.get("faq_section_required", False):
if not self._has_faq_section(parser.h2_tags, parser.h3_tags):
result.add_error(
"faq_section_required",
"Content must include an FAQ section"
)
if self.universal_rules.get("image_alt_text_keyword_required", False):
for img in parser.images:
if not self._contains_keyword(img.get("alt", ""), project.main_keyword):
result.add_error(
"image_alt_text_keyword_required",
f"Image alt text missing main keyword",
expected=project.main_keyword,
actual=img.get("alt", "(no alt)")
)
if self.universal_rules.get("image_alt_text_entity_required", False) and project.entities:
for img in parser.images:
alt_text = img.get("alt", "")
has_entity = any(self._contains_keyword(alt_text, entity) for entity in project.entities)
if not has_entity:
result.add_error(
"image_alt_text_entity_required",
f"Image alt text missing entities",
expected=f"One of: {project.entities[:3]}",
actual=alt_text or "(no alt)"
)
def _validate_cora_targets(self, parser: ContentHTMLParser, project: Project, result: ValidationResult):
"""Validate content against CORA-specific targets"""
is_tier_1 = project.tier == 1
round_down = self.cora_config.get("round_averages_down", True)
counts = self._count_keyword_entities(parser, project)
checks = [
("h1_exact", counts["h1_exact"], project.h1_exact, "H1 tags with exact keyword match"),
("h1_related_search", counts["h1_related_search"], project.h1_related_search, "H1 tags with related searches"),
("h1_entities", counts["h1_entities"], project.h1_entities, "H1 tags with entities"),
("h2_total", len(parser.h2_tags), project.h2_total, "Total H2 tags"),
("h2_exact", counts["h2_exact"], project.h2_exact, "H2 tags with exact keyword match"),
("h2_related_search", counts["h2_related_search"], project.h2_related_search, "H2 tags with related searches"),
("h2_entities", counts["h2_entities"], project.h2_entities, "H2 tags with entities"),
("h3_total", len(parser.h3_tags), project.h3_total, "Total H3 tags"),
("h3_exact", counts["h3_exact"], project.h3_exact, "H3 tags with exact keyword match"),
("h3_related_search", counts["h3_related_search"], project.h3_related_search, "H3 tags with related searches"),
("h3_entities", counts["h3_entities"], project.h3_entities, "H3 tags with entities"),
]
for rule_name, actual, target, description in checks:
if target is None:
continue
expected = int(target) if round_down else round(target)
if actual < expected:
message = f"{description} below CORA target"
if is_tier_1:
result.add_error(rule_name, message, expected=expected, actual=actual)
else:
result.add_warning(rule_name, message, expected=expected, actual=actual)
def _count_keyword_entities(self, parser: ContentHTMLParser, project: Project) -> Dict[str, int]:
"""Count occurrences of keywords, entities, and related searches in headings"""
entities = project.entities or []
related_searches = project.related_searches or []
return {
"h1_exact": sum(1 for h1 in parser.h1_tags if self._contains_keyword(h1, project.main_keyword)),
"h1_related_search": sum(1 for h1 in parser.h1_tags if self._contains_any(h1, related_searches)),
"h1_entities": sum(1 for h1 in parser.h1_tags if self._contains_any(h1, entities)),
"h2_exact": sum(1 for h2 in parser.h2_tags if self._contains_keyword(h2, project.main_keyword)),
"h2_related_search": sum(1 for h2 in parser.h2_tags if self._contains_any(h2, related_searches)),
"h2_entities": sum(1 for h2 in parser.h2_tags if self._contains_any(h2, entities)),
"h3_exact": sum(1 for h3 in parser.h3_tags if self._contains_keyword(h3, project.main_keyword)),
"h3_related_search": sum(1 for h3 in parser.h3_tags if self._contains_any(h3, related_searches)),
"h3_entities": sum(1 for h3 in parser.h3_tags if self._contains_any(h3, entities)),
}
def _contains_keyword(self, text: str, keyword: str) -> bool:
"""Check if text contains keyword (case-insensitive, word boundary)"""
if not text or not keyword:
return False
pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
return bool(re.search(pattern, text.lower()))
def _contains_any(self, text: str, terms: List[str]) -> bool:
"""Check if text contains any of the terms"""
if not text or not terms:
return False
return any(self._contains_keyword(text, term) for term in terms)
def _has_faq_section(self, h2_tags: List[str], h3_tags: List[str]) -> bool:
"""Check if content has an FAQ section"""
faq_patterns = [r'\bfaq\b', r'\bfrequently asked questions\b', r'\bq&a\b', r'\bquestions\b']
for h2 in h2_tags:
if any(re.search(pattern, h2.lower()) for pattern in faq_patterns):
return True
for h3 in h3_tags:
if any(re.search(pattern, h3.lower()) for pattern in faq_patterns):
return True
return False
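
Reviewer note: the removed rule engine matched keywords with a case-insensitive, word-boundary regex (`_contains_keyword`). For reference, the standalone sketch below reproduces that logic as a free function. The function name and sample strings are illustrative only.

```python
import re


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive whole-word match, mirroring _contains_keyword above."""
    if not text or not keyword:
        return False
    # re.escape keeps regex metacharacters in the keyword literal.
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return bool(re.search(pattern, text.lower()))


print(contains_keyword("Best Trail Running Shoes", "running shoes"))
print(contains_keyword("Shoe Categories Explained", "cat"))
```

One caveat of this approach: `\b` only matches next to a word character, so a keyword that starts or ends with punctuation (for example `c++`) does not match when followed by a space or the end of the string.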


@ -1,388 +1,311 @@
""" """
Content generation service - orchestrates the three-stage AI generation pipeline Content generation service with three-stage pipeline
""" """
import time import re
import json import json
from html import unescape
from pathlib import Path from pathlib import Path
from typing import Dict, Any, Optional, Tuple from datetime import datetime
from src.database.models import Project, GeneratedContent from typing import Optional, Tuple
from src.database.repositories import GeneratedContentRepository from src.generation.ai_client import AIClient, PromptManager
from src.generation.ai_client import AIClient, AIClientError from src.database.repositories import ProjectRepository, GeneratedContentRepository
from src.generation.validator import StageValidator
from src.generation.augmenter import ContentAugmenter
from src.generation.rule_engine import ContentRuleEngine
from src.core.config import Config, get_config
from sqlalchemy.orm import Session
class GenerationError(Exception): class ContentGenerator:
"""Content generation error""" """Main service for generating content through AI pipeline"""
pass
class ContentGenerationService:
"""Service for AI-powered content generation with validation"""
MAX_H2_TOTAL = 5
MAX_H3_TOTAL = 13
def __init__( def __init__(
self, self,
session: Session, ai_client: AIClient,
config: Optional[Config] = None, prompt_manager: PromptManager,
ai_client: Optional[AIClient] = None project_repo: ProjectRepository,
content_repo: GeneratedContentRepository
): ):
self.ai_client = ai_client
self.prompt_manager = prompt_manager
self.project_repo = project_repo
self.content_repo = content_repo
def generate_title(self, project_id: int, debug: bool = False) -> str:
""" """
Initialize service Generate SEO-optimized title
Args: Args:
session: Database session project_id: Project ID to generate title for
config: Application configuration debug: If True, save response to debug_output/
ai_client: AI client (creates new if None)
"""
self.session = session
self.config = config or get_config()
self.ai_client = ai_client or AIClient(self.config)
self.content_repo = GeneratedContentRepository(session)
self.rule_engine = ContentRuleEngine(self.config)
self.validator = StageValidator(self.config, self.rule_engine)
self.augmenter = ContentAugmenter(ai_client=self.ai_client)
self.prompts_dir = Path(__file__).parent / "prompts"
def generate_article(
self,
project: Project,
tier: int,
title_model: str,
outline_model: str,
content_model: str,
max_retries: int = 3,
progress_callback: Optional[callable] = None,
debug: bool = False
) -> GeneratedContent:
"""
Generate complete article through three-stage pipeline
Args:
project: Project with CORA data
tier: Tier level
title_model: Model for title generation
outline_model: Model for outline generation
content_model: Model for content generation
max_retries: Max retry attempts per stage
progress_callback: Optional callback for progress updates
debug: Enable debug output
Returns: Returns:
GeneratedContent record with completed article Generated title string
Raises:
GenerationError: If generation fails after all retries
""" """
start_time = time.time() project = self.project_repo.get_by_id(project_id)
if not project:
raise ValueError(f"Project {project_id} not found")
content_record = self.content_repo.create(project.id, tier) entities_str = ", ".join(project.entities or [])
content_record.title_model = title_model related_str = ", ".join(project.related_searches or [])
content_record.outline_model = outline_model
content_record.content_model = content_model
self.content_repo.update(content_record)
try: system_msg, user_prompt = self.prompt_manager.format_prompt(
title = self._generate_title(project, content_record, title_model, max_retries) "title_generation",
keyword=project.main_keyword,
content_record.generation_stage = "outline"
self.content_repo.update(content_record)
outline = self._generate_outline(project, title, content_record, outline_model, max_retries)
content_record.generation_stage = "content"
self.content_repo.update(content_record)
html_content = self._generate_content(
project, title, outline, content_record, content_model, max_retries
)
content_record.status = "completed"
content_record.generation_duration = time.time() - start_time
self.content_repo.update(content_record)
return content_record
except Exception as e:
content_record.status = "failed"
content_record.error_message = str(e)
content_record.generation_duration = time.time() - start_time
self.content_repo.update(content_record)
raise GenerationError(f"Article generation failed: {e}")
def _generate_title(
self,
project: Project,
content_record: GeneratedContent,
model: str,
max_retries: int
) -> str:
"""Generate and validate title"""
prompt_template = self._load_prompt("title_generation.json")
entities_str = ", ".join(project.entities[:10]) if project.entities else "N/A"
searches_str = ", ".join(project.related_searches[:10]) if project.related_searches else "N/A"
prompt = prompt_template["user_template"].format(
main_keyword=project.main_keyword,
word_count=project.word_count,
entities=entities_str, entities=entities_str,
related_searches=searches_str related_searches=related_str
) )
for attempt in range(1, max_retries + 1): title = self.ai_client.generate_completion(
content_record.title_attempts = attempt prompt=user_prompt,
self.content_repo.update(content_record) system_message=system_msg,
max_tokens=100,
try:
title = self.ai_client.generate(
prompt=prompt,
model=model,
temperature=0.7 temperature=0.7
) )
is_valid, errors = self.validator.validate_title(title, project) title = title.strip().strip('"').strip("'")
if debug:
self._save_debug_output(
project_id, "title", title, "txt"
)
if is_valid:
content_record.title = title
self.content_repo.update(content_record)
return title return title
if attempt < max_retries: def generate_outline(
prompt += f"\n\nPrevious attempt failed: {', '.join(errors)}. Please fix these issues."
except AIClientError as e:
if attempt == max_retries:
raise GenerationError(f"Title generation failed after {max_retries} attempts: {e}")
raise GenerationError(f"Title validation failed after {max_retries} attempts")
def _generate_outline(
self, self,
project: Project, project_id: int,
title: str, title: str,
content_record: GeneratedContent, min_h2: int,
model: str, max_h2: int,
max_retries: int min_h3: int,
) -> Dict[str, Any]: max_h3: int,
"""Generate and validate outline""" debug: bool = False
prompt_template = self._load_prompt("outline_generation.json") ) -> dict:
"""
Generate article outline in JSON format
entities_str = ", ".join(project.entities[:20]) if project.entities else "N/A" Args:
searches_str = ", ".join(project.related_searches[:20]) if project.related_searches else "N/A" project_id: Project ID
title: Article title
min_h2: Minimum H2 headings
max_h2: Maximum H2 headings
min_h3: Minimum H3 subheadings total
max_h3: Maximum H3 subheadings total
debug: If True, save response to debug_output/
h2_total = int(project.h2_total) if project.h2_total else 5 Returns:
h2_exact = int(project.h2_exact) if project.h2_exact else 1 Outline dictionary: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
h2_related = int(project.h2_related_search) if project.h2_related_search else 1
h2_entities = int(project.h2_entities) if project.h2_entities else 2
h3_total = int(project.h3_total) if project.h3_total else 10 Raises:
h3_exact = int(project.h3_exact) if project.h3_exact else 1 ValueError: If outline doesn't meet minimum requirements
h3_related = int(project.h3_related_search) if project.h3_related_search else 2 """
h3_entities = int(project.h3_entities) if project.h3_entities else 3 project = self.project_repo.get_by_id(project_id)
if not project:
raise ValueError(f"Project {project_id} not found")
if self.config.content_rules.cora_validation.round_averages_down: entities_str = ", ".join(project.entities or [])
h2_total = int(h2_total) related_str = ", ".join(project.related_searches or [])
h3_total = int(h3_total)
h2_total = min(h2_total, self.MAX_H2_TOTAL) system_msg, user_prompt = self.prompt_manager.format_prompt(
h3_total = min(h3_total, self.MAX_H3_TOTAL) "outline_generation",
prompt = prompt_template["user_template"].format(
title=title, title=title,
main_keyword=project.main_keyword, keyword=project.main_keyword,
word_count=project.word_count, min_h2=min_h2,
h2_total=h2_total, max_h2=max_h2,
h2_exact=h2_exact, min_h3=min_h3,
h2_related_search=h2_related, max_h3=max_h3,
h2_entities=h2_entities,
h3_total=h3_total,
h3_exact=h3_exact,
h3_related_search=h3_related,
h3_entities=h3_entities,
entities=entities_str, entities=entities_str,
related_searches=searches_str related_searches=related_str
) )
for attempt in range(1, max_retries + 1): outline_json = self.ai_client.generate_completion(
content_record.outline_attempts = attempt prompt=user_prompt,
self.content_repo.update(content_record) system_message=system_msg,
max_tokens=2000,
temperature=0.7,
json_mode=True
)
print(f"[DEBUG] Raw outline response: {outline_json}")
# Save raw response immediately
if debug:
self._save_debug_output(project_id, "outline_raw", outline_json, "txt")
print(f"[DEBUG] Raw outline response: {outline_json}")
try: try:
outline_json_str = self.ai_client.generate_json( outline = json.loads(outline_json)
prompt=prompt, except json.JSONDecodeError as e:
model=model, if debug:
temperature=0.7, self._save_debug_output(project_id, "outline_error", outline_json, "txt")
max_tokens=2000 raise ValueError(f"Failed to parse outline JSON: {e}\nResponse: {outline_json[:500]}")
if "outline" not in outline:
if debug:
self._save_debug_output(project_id, "outline_invalid", json.dumps(outline, indent=2), "json")
raise ValueError(f"Outline missing 'outline' key. Got keys: {list(outline.keys())}\nContent: {outline}")
h2_count = len(outline["outline"])
h3_count = sum(len(section.get("h3", [])) for section in outline["outline"])
if h2_count < min_h2:
raise ValueError(f"Outline has {h2_count} H2s, minimum is {min_h2}")
if h3_count < min_h3:
raise ValueError(f"Outline has {h3_count} H3s, minimum is {min_h3}")
if debug:
self._save_debug_output(
project_id, "outline", json.dumps(outline, indent=2), "json"
) )
if isinstance(outline_json_str, str):
outline = json.loads(outline_json_str)
else:
outline = outline_json_str
is_valid, errors, missing = self.validator.validate_outline(outline, project)
if is_valid:
content_record.outline = json.dumps(outline)
self.content_repo.update(content_record)
return outline return outline
if attempt < max_retries: def generate_content(
if missing:
augmented_outline, aug_log = self.augmenter.augment_outline(
outline, missing, project.main_keyword,
project.entities or [], project.related_searches or []
)
is_valid_aug, errors_aug, _ = self.validator.validate_outline(
augmented_outline, project
)
if is_valid_aug:
content_record.outline = json.dumps(augmented_outline)
content_record.augmented = True
content_record.augmentation_log = aug_log
self.content_repo.update(content_record)
return augmented_outline
prompt += f"\n\nPrevious attempt failed: {', '.join(errors)}. Please meet ALL CORA targets exactly."
except (AIClientError, json.JSONDecodeError) as e:
if attempt == max_retries:
raise GenerationError(f"Outline generation failed after {max_retries} attempts: {e}")
raise GenerationError(f"Outline validation failed after {max_retries} attempts")
def _generate_content(
self, self,
project: Project, project_id: int,
title: str, title: str,
outline: Dict[str, Any], outline: dict,
content_record: GeneratedContent, min_word_count: int,
model: str, max_word_count: int,
max_retries: int debug: bool = False
) -> str: ) -> str:
"""Generate and validate full HTML content""" """
prompt_template = self._load_prompt("content_generation.json") Generate full article HTML fragment
outline_str = self._format_outline_for_prompt(outline) Args:
entities_str = ", ".join(project.entities[:30]) if project.entities else "N/A" project_id: Project ID
searches_str = ", ".join(project.related_searches[:30]) if project.related_searches else "N/A" title: Article title
outline: Article outline dict
min_word_count: Minimum word count for guidance
max_word_count: Maximum word count for guidance
debug: If True, save response to debug_output/
prompt = prompt_template["user_template"].format( Returns:
outline=outline_str, HTML string with <h2>, <h3>, <p> tags
"""
project = self.project_repo.get_by_id(project_id)
if not project:
raise ValueError(f"Project {project_id} not found")
entities_str = ", ".join(project.entities or [])
related_str = ", ".join(project.related_searches or [])
outline_str = json.dumps(outline, indent=2)
system_msg, user_prompt = self.prompt_manager.format_prompt(
"content_generation",
title=title, title=title,
main_keyword=project.main_keyword, outline=outline_str,
word_count=project.word_count, keyword=project.main_keyword,
term_frequency=project.term_frequency or self.config.content_rules.universal.default_term_frequency,
entities=entities_str, entities=entities_str,
related_searches=searches_str related_searches=related_str,
min_word_count=min_word_count,
max_word_count=max_word_count
) )
for attempt in range(1, max_retries + 1): content = self.ai_client.generate_completion(
content_record.content_attempts = attempt prompt=user_prompt,
self.content_repo.update(content_record) system_message=system_msg,
max_tokens=8000,
try: temperature=0.7
html_content = self.ai_client.generate(
prompt=prompt,
model=model,
temperature=0.7,
max_tokens=self.config.ai_service.max_tokens
) )
is_valid, validation_result = self.validator.validate_content(html_content, project) content = content.strip()
content_record.validation_errors = len(validation_result.errors) if debug:
content_record.validation_warnings = len(validation_result.warnings) self._save_debug_output(
content_record.validation_report = validation_result.to_dict() project_id, "content", content, "html"
self.content_repo.update(content_record)
if is_valid:
content_record.content = html_content
word_count = len(html_content.split())
content_record.word_count = word_count
self.content_repo.update(content_record)
return html_content
if attempt < max_retries:
missing = self.validator.extract_missing_elements(validation_result, project, html_content)
has_word_deficit = missing.get("word_count_deficit", 0) > 0
if has_word_deficit:
try:
augmented_html, aug_log = self.augmenter.augment_content_with_ai(
html_content, missing, project.main_keyword,
project.entities or [], project.related_searches or [],
model=model
) )
is_valid_aug, validation_result_aug = self.validator.validate_content( return content
augmented_html, project
def validate_word_count(self, content: str, min_words: int, max_words: int) -> Tuple[bool, int]:
"""
Validate content word count
Args:
content: HTML content string
min_words: Minimum word count
max_words: Maximum word count
Returns:
Tuple of (is_valid, actual_count)
"""
word_count = self.count_words(content)
is_valid = min_words <= word_count <= max_words
return is_valid, word_count
def count_words(self, html_content: str) -> int:
"""
Count words in HTML content
Args:
html_content: HTML string
Returns:
Number of words
"""
text = re.sub(r'<[^>]+>', '', html_content)
text = unescape(text)
words = text.split()
return len(words)
def augment_content(
self,
content: str,
target_word_count: int,
debug: bool = False,
project_id: Optional[int] = None
) -> str:
"""
Expand article content to meet minimum word count
Args:
content: Current HTML content
target_word_count: Target word count
debug: If True, save response to debug_output/
project_id: Optional project ID for debug output
Returns:
Expanded HTML content
"""
system_msg, user_prompt = self.prompt_manager.format_prompt(
"content_augmentation",
content=content,
target_word_count=target_word_count
) )
content_record.content = augmented_html augmented = self.ai_client.generate_completion(
content_record.augmented = True prompt=user_prompt,
existing_log = content_record.augmentation_log or {} system_message=system_msg,
existing_log["content_ai_augmentation"] = aug_log max_tokens=8000,
content_record.augmentation_log = existing_log temperature=0.7
content_record.validation_errors = len(validation_result_aug.errors) )
content_record.validation_warnings = len(validation_result_aug.warnings)
content_record.validation_report = validation_result_aug.to_dict()
word_count = len(augmented_html.split())
content_record.word_count = word_count
self.content_repo.update(content_record)
missing_after = self.validator.extract_missing_elements(validation_result_aug, project, augmented_html) augmented = augmented.strip()
still_short = missing_after.get("word_count_deficit", 0) > 0
if not still_short: if debug and project_id:
return augmented_html self._save_debug_output(
project_id, "augmented", augmented, "html"
)
html_content = augmented_html return augmented
validation_result = validation_result_aug
except Exception as e: def _save_debug_output(
print(f"AI augmentation failed: {e}") self,
error_summary = f"Word count too short. AI augmentation failed: {str(e)}" project_id: int,
prompt += f"\n\nPrevious content failed validation: {error_summary}. Generate MORE content to meet the word count target." stage: str,
else: content: str,
content_record.content = html_content extension: str,
word_count = len(html_content.split()) tier: Optional[str] = None,
content_record.word_count = word_count article_num: Optional[int] = None
self.content_repo.update(content_record) ):
return html_content """Save debug output to file"""
debug_dir = Path("debug_output")
debug_dir.mkdir(exist_ok=True)
except AIClientError as e: timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
if attempt == max_retries:
raise GenerationError(f"Content generation failed after {max_retries} attempts: {e}")
raise GenerationError(f"Content validation failed after {max_retries} attempts") tier_part = f"_tier{tier}" if tier else ""
article_part = f"_article{article_num}" if article_num else ""
def _load_prompt(self, filename: str) -> Dict[str, Any]: filename = f"{stage}_project{project_id}{tier_part}{article_part}_{timestamp}.{extension}"
"""Load prompt template from JSON file""" filepath = debug_dir / filename
prompt_path = self.prompts_dir / filename
if not prompt_path.exists():
raise GenerationError(f"Prompt template not found: {filename}")
with open(prompt_path, 'r', encoding='utf-8') as f: with open(filepath, 'w', encoding='utf-8') as f:
return json.load(f) f.write(content)
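
The debug filename logic in `_save_debug_output` can be sketched as a pure function, which makes it easy to check without writing to disk. This is only an illustration: `build_debug_path` and its `now` and `debug_dir` parameters are hypothetical names, not part of the commit. It also assumes that an article number of `0` should still appear in the filename, so it compares against `None` instead of relying on truthiness.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_debug_path(
    stage: str,
    project_id: int,
    extension: str,
    tier: Optional[str] = None,
    article_num: Optional[int] = None,
    now: Optional[datetime] = None,
    debug_dir: Path = Path("debug_output"),
) -> Path:
    """Hypothetical helper: compose the debug file path without touching disk."""
    # Same timestamp layout as the diff: YYYYMMDD_HHMMSS.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # Compare against None so that a value of 0 is not silently dropped.
    tier_part = f"_tier{tier}" if tier is not None else ""
    article_part = f"_article{article_num}" if article_num is not None else ""
    filename = (
        f"{stage}_project{project_id}{tier_part}{article_part}"
        f"_{timestamp}.{extension}"
    )
    return debug_dir / filename
```

Passing a fixed `now` keeps the result deterministic in tests; the caller would still be responsible for creating the directory and writing the file.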
def _format_outline_for_prompt(self, outline: Dict[str, Any]) -> str:
"""Format outline JSON into readable string for content prompt"""
lines = [f"H1: {outline.get('h1', '')}"]
for section in outline.get("sections", []):
lines.append(f"\nH2: {section['h2']}")
for h3 in section.get("h3s", []):
lines.append(f" H3: {h3}")
return "\n".join(lines)


@ -0,0 +1,52 @@
"""
Integration test for batch generation (stub)
"""
import pytest
from unittest.mock import Mock
from src.generation.batch_processor import BatchProcessor
from src.generation.service import ContentGenerator
def test_batch_processor_initialization():
"""Test BatchProcessor can be initialized"""
mock_generator = Mock(spec=ContentGenerator)
mock_content_repo = Mock()
mock_project_repo = Mock()
processor = BatchProcessor(
content_generator=mock_generator,
content_repo=mock_content_repo,
project_repo=mock_project_repo
)
assert processor is not None
assert processor.stats["total_jobs"] == 0
assert processor.stats["processed_jobs"] == 0
def test_batch_processor_stats_initialization():
"""Test BatchProcessor initializes stats correctly"""
mock_generator = Mock(spec=ContentGenerator)
mock_content_repo = Mock()
mock_project_repo = Mock()
processor = BatchProcessor(
content_generator=mock_generator,
content_repo=mock_content_repo,
project_repo=mock_project_repo
)
expected_keys = [
"total_jobs",
"processed_jobs",
"total_articles",
"generated_articles",
"augmented_articles",
"failed_articles"
]
for key in expected_keys:
assert key in processor.stats
assert processor.stats[key] == 0
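
The two tests above only require that `stats` is a dictionary with six known keys, all starting at zero. A minimal sketch of a tracker that satisfies this is shown below. `BatchStats` and `increment` are hypothetical names used for illustration; the real `BatchProcessor` is not shown in this part of the diff and may be structured differently.

```python
from typing import Dict

# Keys taken from the expected_keys list in the test above.
STAT_KEYS = (
    "total_jobs",
    "processed_jobs",
    "total_articles",
    "generated_articles",
    "augmented_articles",
    "failed_articles",
)


class BatchStats:
    """Hypothetical stand-in for the stats bookkeeping of a batch processor."""

    def __init__(self) -> None:
        # Every counter starts at zero, as the tests expect.
        self.stats: Dict[str, int] = {key: 0 for key in STAT_KEYS}

    def increment(self, key: str, amount: int = 1) -> None:
        # Reject unknown keys so typos do not create new counters silently.
        if key not in self.stats:
            raise KeyError(f"Unknown stat: {key}")
        self.stats[key] += amount
```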


@ -0,0 +1,95 @@
"""
Unit tests for ContentGenerator service
"""
import pytest
from src.generation.service import ContentGenerator
def test_count_words_simple():
"""Test word count on simple text"""
generator = ContentGenerator(None, None, None, None)
html = "<p>This is a test with seven words</p>"
count = generator.count_words(html)
assert count == 7
def test_count_words_with_headings():
"""Test word count with HTML headings"""
generator = ContentGenerator(None, None, None, None)
html = """
<h2>Main Heading</h2>
<p>This is a paragraph with some words.</p>
<h3>Subheading</h3>
<p>Another paragraph here.</p>
"""
count = generator.count_words(html)
assert count > 10
def test_count_words_strips_html_tags():
"""Test that HTML tags are stripped before counting"""
generator = ContentGenerator(None, None, None, None)
html = "<p>Hello <strong>world</strong> this <em>is</em> a test</p>"
count = generator.count_words(html)
assert count == 6
def test_validate_word_count_within_range():
"""Test validation when word count is within range"""
generator = ContentGenerator(None, None, None, None)
content = "<p>" + " ".join(["word"] * 100) + "</p>"
is_valid, count = generator.validate_word_count(content, 50, 150)
assert is_valid is True
assert count == 100
def test_validate_word_count_below_minimum():
"""Test validation when word count is below minimum"""
generator = ContentGenerator(None, None, None, None)
content = "<p>" + " ".join(["word"] * 30) + "</p>"
is_valid, count = generator.validate_word_count(content, 50, 150)
assert is_valid is False
assert count == 30
def test_validate_word_count_above_maximum():
"""Test validation when word count is above maximum"""
generator = ContentGenerator(None, None, None, None)
content = "<p>" + " ".join(["word"] * 200) + "</p>"
is_valid, count = generator.validate_word_count(content, 50, 150)
assert is_valid is False
assert count == 200
def test_count_words_empty_content():
"""Test word count on empty content"""
generator = ContentGenerator(None, None, None, None)
count = generator.count_words("")
assert count == 0
def test_count_words_only_tags():
"""Test word count on content with only HTML tags"""
generator = ContentGenerator(None, None, None, None)
html = "<div><p></p><span></span></div>"
count = generator.count_words(html)
assert count == 0
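
For reference, here is one way the behavior exercised by these tests could be implemented. It is a sketch under assumptions, not the code from `src/generation/service.py`: it uses a simple regular expression to strip tags, and it is written as module-level functions so it runs without the `ContentGenerator` constructor arguments.

```python
import re
from html import unescape
from typing import Tuple

# Matches anything that looks like an HTML tag, e.g. <p>, </strong>, <br/>.
_TAG_RE = re.compile(r"<[^>]+>")


def count_words(html: str) -> int:
    """Count whitespace-separated words after removing HTML tags."""
    # Replace tags with a space so adjacent blocks such as
    # "</h2><p>" do not glue two words together.
    text = _TAG_RE.sub(" ", html)
    # Decode entities such as &amp; before splitting.
    text = unescape(text)
    return len(text.split())


def validate_word_count(
    html: str, min_words: int, max_words: int
) -> Tuple[bool, int]:
    """Return (is_valid, count), where the range is inclusive on both ends."""
    count = count_words(html)
    return min_words <= count <= max_words, count
```

One trade-off of replacing tags with spaces: a word split by inline markup, such as `wo<b>rd</b>`, is counted as two words. Whether the bounds are inclusive is also an assumption; the tests above do not probe the boundary values.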


@ -1,208 +1,176 @@
""" """
Unit tests for job configuration Unit tests for JobConfig parser
""" """
import pytest
import json
import tempfile
from pathlib import Path
from src.generation.job_config import ( from src.generation.job_config import JobConfig, TIER_DEFAULTS
JobConfig, TierConfig, ModelConfig, AnchorTextConfig,
FailureConfig, InterlinkingConfig
)
def test_model_config_creation(): @pytest.fixture
"""Test ModelConfig creation""" def temp_job_file(tmp_path):
config = ModelConfig( """Create a temporary job file for testing"""
title="model1", def _create_file(data):
outline="model2", job_file = tmp_path / "test_job.json"
content="model3" with open(job_file, 'w') as f:
) json.dump(data, f)
return str(job_file)
assert config.title == "model1" return _create_file
assert config.outline == "model2"
assert config.content == "model3"
def test_anchor_text_config_modes(): def test_load_job_config_valid(temp_job_file):
"""Test different anchor text modes""" """Test loading valid job file"""
default_config = AnchorTextConfig(mode="default") data = {
assert default_config.mode == "default" "jobs": [
override_config = AnchorTextConfig(
mode="override",
custom_text=["anchor1", "anchor2"]
)
assert override_config.mode == "override"
assert len(override_config.custom_text) == 2
append_config = AnchorTextConfig(
mode="append",
additional_text=["extra"]
)
assert append_config.mode == "append"
def test_tier_config_creation():
"""Test TierConfig creation"""
models = ModelConfig(
title="model1",
outline="model2",
content="model3"
)
tier_config = TierConfig(
tier=1,
article_count=15,
models=models
)
assert tier_config.tier == 1
assert tier_config.article_count == 15
assert tier_config.validation_attempts == 3
def test_job_config_creation():
"""Test JobConfig creation"""
models = ModelConfig(
title="model1",
outline="model2",
content="model3"
)
tier = TierConfig(
tier=1,
article_count=10,
models=models
)
job = JobConfig(
job_name="Test Job",
project_id=1,
tiers=[tier]
)
assert job.job_name == "Test Job"
assert job.project_id == 1
assert len(job.tiers) == 1
assert job.get_total_articles() == 10
def test_job_config_multiple_tiers():
"""Test JobConfig with multiple tiers"""
models = ModelConfig(
title="model1",
outline="model2",
content="model3"
)
tier1 = TierConfig(tier=1, article_count=10, models=models)
tier2 = TierConfig(tier=2, article_count=20, models=models)
job = JobConfig(
job_name="Multi-Tier Job",
project_id=1,
tiers=[tier1, tier2]
)
assert job.get_total_articles() == 30
def test_job_config_unique_tiers_validation():
"""Test that tier numbers must be unique"""
models = ModelConfig(
title="model1",
outline="model2",
content="model3"
)
tier1 = TierConfig(tier=1, article_count=10, models=models)
tier2 = TierConfig(tier=1, article_count=20, models=models)
with pytest.raises(ValueError, match="unique"):
JobConfig(
job_name="Duplicate Tiers",
project_id=1,
tiers=[tier1, tier2]
)
def test_job_config_from_file():
"""Test loading JobConfig from JSON file"""
config_data = {
"job_name": "Test Job",
"project_id": 1,
"tiers": [
{ {
"tier": 1, "project_id": 1,
"article_count": 5, "tiers": {
"models": { "tier1": {
"title": "model1", "count": 5
"outline": "model2", }
"content": "model3"
} }
} }
] ]
} }
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f: job_file = temp_job_file(data)
json.dump(config_data, f) config = JobConfig(job_file)
temp_path = f.name
try: assert len(config.get_jobs()) == 1
job = JobConfig.from_file(temp_path) assert config.get_jobs()[0].project_id == 1
assert job.job_name == "Test Job" assert "tier1" in config.get_jobs()[0].tiers
assert job.project_id == 1
assert len(job.tiers) == 1
finally:
Path(temp_path).unlink()
def test_job_config_to_file(): def test_tier_defaults_applied(temp_job_file):
"""Test saving JobConfig to JSON file""" """Test defaults applied when not in job file"""
models = ModelConfig( data = {
title="model1", "jobs": [
outline="model2", {
content="model3" "project_id": 1,
) "tiers": {
"tier1": {
"count": 3
}
}
}
]
}
tier = TierConfig(tier=1, article_count=5, models=models) job_file = temp_job_file(data)
job = JobConfig( config = JobConfig(job_file)
job_name="Test Job",
project_id=1,
tiers=[tier]
)
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f: job = config.get_jobs()[0]
temp_path = f.name tier1_config = job.tiers["tier1"]
try: assert tier1_config.count == 3
job.to_file(temp_path) assert tier1_config.min_word_count == TIER_DEFAULTS["tier1"]["min_word_count"]
assert Path(temp_path).exists() assert tier1_config.max_word_count == TIER_DEFAULTS["tier1"]["max_word_count"]
loaded_job = JobConfig.from_file(temp_path)
assert loaded_job.job_name == job.job_name
assert loaded_job.project_id == job.project_id
finally:
Path(temp_path).unlink()
def test_interlinking_config_validation(): def test_custom_values_override_defaults(temp_job_file):
"""Test InterlinkingConfig validation""" """Test custom values override defaults"""
config = InterlinkingConfig( data = {
links_per_article_min=2, "jobs": [
links_per_article_max=4 {
) "project_id": 1,
"tiers": {
"tier1": {
"count": 5,
"min_word_count": 3000,
"max_word_count": 3500
}
}
}
]
}
assert config.links_per_article_min == 2 job_file = temp_job_file(data)
assert config.links_per_article_max == 4 config = JobConfig(job_file)
job = config.get_jobs()[0]
tier1_config = job.tiers["tier1"]
assert tier1_config.min_word_count == 3000
assert tier1_config.max_word_count == 3500
def test_failure_config_defaults(): def test_multiple_jobs_in_file(temp_job_file):
"""Test FailureConfig default values""" """Test parsing file with multiple jobs"""
config = FailureConfig() data = {
"jobs": [
{
"project_id": 1,
"tiers": {"tier1": {"count": 5}}
},
{
"project_id": 2,
"tiers": {"tier2": {"count": 10}}
}
]
}
assert config.max_consecutive_failures == 5 job_file = temp_job_file(data)
assert config.skip_on_failure is True config = JobConfig(job_file)
jobs = config.get_jobs()
assert len(jobs) == 2
assert jobs[0].project_id == 1
assert jobs[1].project_id == 2
def test_multiple_tiers_in_job(temp_job_file):
"""Test job with multiple tiers"""
data = {
"jobs": [
{
"project_id": 1,
"tiers": {
"tier1": {"count": 5},
"tier2": {"count": 10},
"tier3": {"count": 15}
}
}
]
}
job_file = temp_job_file(data)
config = JobConfig(job_file)
job = config.get_jobs()[0]
assert len(job.tiers) == 3
assert "tier1" in job.tiers
assert "tier2" in job.tiers
assert "tier3" in job.tiers
def test_invalid_job_file_no_jobs_key(temp_job_file):
"""Test error when jobs key is missing"""
data = {"invalid": []}
job_file = temp_job_file(data)
with pytest.raises(ValueError, match="must contain 'jobs'"):
JobConfig(job_file)
def test_invalid_job_missing_project_id(temp_job_file):
"""Test error when project_id is missing"""
data = {
"jobs": [
{
"tiers": {"tier1": {"count": 5}}
}
]
}
job_file = temp_job_file(data)
with pytest.raises(ValueError, match="missing 'project_id'"):
JobConfig(job_file)
def test_file_not_found():
"""Test error when file doesn't exist"""
with pytest.raises(FileNotFoundError):
JobConfig("nonexistent_file.json")
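
The new tests describe the parser's contract fairly completely: a top-level `jobs` list, a required `project_id` per job, a `tiers` mapping keyed by tier name, and per-tier defaults that explicit values override. A minimal parser consistent with that contract might look like the sketch below. The numeric values in `TIER_DEFAULTS` are placeholders, since the real defaults are not visible in this diff, and the error message wording is assumed beyond the fragments the tests match on.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List

# Placeholder values: the real TIER_DEFAULTS are not shown in this diff.
TIER_DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500},
    "tier2": {"min_word_count": 1500, "max_word_count": 2000},
    "tier3": {"min_word_count": 1000, "max_word_count": 1500},
}


@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int


@dataclass
class Job:
    project_id: int
    tiers: Dict[str, TierConfig]


class JobConfig:
    """Sketch of a job file parser that applies tier defaults."""

    def __init__(self, job_file_path: str) -> None:
        path = Path(job_file_path)
        if not path.exists():
            raise FileNotFoundError(f"Job file not found: {job_file_path}")
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if "jobs" not in data:
            raise ValueError("Job file must contain 'jobs' array")
        self._jobs: List[Job] = [self._parse_job(raw) for raw in data["jobs"]]

    @staticmethod
    def _parse_job(raw: Dict[str, Any]) -> Job:
        if "project_id" not in raw:
            raise ValueError("Job missing 'project_id'")
        tiers: Dict[str, TierConfig] = {}
        for name, overrides in raw.get("tiers", {}).items():
            # Defaults first, then explicit values from the job file win.
            merged = {**TIER_DEFAULTS.get(name, {}), **overrides}
            tiers[name] = TierConfig(**merged)
        return Job(project_id=raw["project_id"], tiers=tiers)

    def get_jobs(self) -> List[Job]:
        return list(self._jobs)
```

Merging with `{**defaults, **overrides}` keeps the override rule in one place. In this sketch, a tier name without an entry in `TIER_DEFAULTS` must supply every field itself, otherwise `TierConfig` raises a `TypeError`.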