Resolve merge conflicts - choose newer implementations
commit
19e1c93358
|
|
@@ -17,3 +17,6 @@ __pycache__/
|
|||
.idea/
|
||||
|
||||
*.xlsx
|
||||
|
||||
# Debug output
|
||||
debug_output/
|
||||
|
|
@@ -0,0 +1,199 @@
|
|||
# Story 2.2 Implementation Summary
|
||||
|
||||
## Overview
|
||||
Implemented simplified AI content generation via batch jobs using the OpenRouter API.
|
||||
|
||||
## Completed Phases
|
||||
|
||||
### Phase 1: Data Model & Schema Design
- ✅ Added `GeneratedContent` model to `src/database/models.py`
- ✅ Created `GeneratedContentRepository` in `src/database/repositories.py`
- ✅ Updated `scripts/init_db.py` (automatic table creation via Base.metadata)

### Phase 2: AI Client & Prompt Management
- ✅ Created `src/generation/ai_client.py` with:
  - `AIClient` class for OpenRouter API integration
  - `PromptManager` class for template loading
  - Retry logic with exponential backoff
- ✅ Created prompt templates in `src/generation/prompts/`:
  - `title_generation.json`
  - `outline_generation.json`
  - `content_generation.json`
  - `content_augmentation.json`

### Phase 3: Core Generation Pipeline
- ✅ Implemented `ContentGenerator` in `src/generation/service.py` with:
  - `generate_title()` - Stage 1
  - `generate_outline()` - Stage 2 with JSON validation
  - `generate_content()` - Stage 3
  - `validate_word_count()` - Word count validation
  - `augment_content()` - Simple augmentation
  - `count_words()` - HTML-aware word counting
  - Debug output support

### Phase 4: Batch Processing
- ✅ Created `src/generation/job_config.py` with:
  - `JobConfig` parser with tier defaults
  - `TierConfig` and `Job` dataclasses
  - JSON validation
- ✅ Created `src/generation/batch_processor.py` with:
  - `BatchProcessor` class
  - Progress logging to console
  - Error handling and continue-on-error support
  - Statistics tracking

### Phase 5: CLI Integration
- ✅ Added `generate-batch` command to `src/cli/commands.py`
- ✅ Command options:
  - `--job-file` (required)
  - `--username` / `--password` for authentication
  - `--debug` for saving AI responses
  - `--continue-on-error` flag
  - `--model` selection (default: gpt-4o-mini)

### Phase 6: Testing & Validation
- ✅ Created unit tests:
  - `tests/unit/test_job_config.py` (9 tests)
  - `tests/unit/test_content_generator.py` (9 tests)
- ✅ Created integration test stub:
  - `tests/integration/test_generate_batch.py` (2 tests)
- ✅ Created example job files:
  - `jobs/example_tier1_batch.json`
  - `jobs/example_multi_tier_batch.json`
  - `jobs/README.md` (comprehensive documentation)

### Phase 7: Cleanup & Documentation
- ✅ Deprecated old `src/generation/rule_engine.py`
- ✅ Updated documentation:
  - `docs/architecture/workflows.md` - Added generation workflow diagram
  - `docs/architecture/components.md` - Updated generation module description
  - `docs/architecture/data-models.md` - Updated GeneratedContent model
  - `docs/stories/story-2.2. simplified-ai-content-generation.md` - Marked as Completed
- ✅ Updated `.gitignore` to exclude `debug_output/`
- ✅ Updated `env.example` with `OPENROUTER_API_KEY`
|
||||
|
||||
## Key Files Created/Modified
|
||||
|
||||
### New Files (15)
|
||||
```
|
||||
src/generation/ai_client.py
|
||||
src/generation/service.py
|
||||
src/generation/job_config.py
|
||||
src/generation/batch_processor.py
|
||||
src/generation/prompts/title_generation.json
|
||||
src/generation/prompts/outline_generation.json
|
||||
src/generation/prompts/content_generation.json
|
||||
src/generation/prompts/content_augmentation.json
|
||||
jobs/example_tier1_batch.json
|
||||
jobs/example_multi_tier_batch.json
|
||||
jobs/README.md
|
||||
tests/unit/test_job_config.py
|
||||
tests/unit/test_content_generator.py
|
||||
tests/integration/test_generate_batch.py
|
||||
IMPLEMENTATION_SUMMARY.md
|
||||
```
|
||||
|
||||
### Modified Files (10)
|
||||
```
|
||||
src/database/models.py (added GeneratedContent model)
|
||||
src/database/repositories.py (added GeneratedContentRepository)
|
||||
src/cli/commands.py (added generate-batch command)
|
||||
src/generation/rule_engine.py (deprecated)
|
||||
docs/architecture/workflows.md (updated)
|
||||
docs/architecture/components.md (updated)
|
||||
docs/architecture/data-models.md (updated)
|
||||
docs/stories/story-2.2. simplified-ai-content-generation.md (marked complete)
|
||||
.gitignore (added debug_output/)
|
||||
env.example (added OPENROUTER_API_KEY)
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### 1. Set up environment
|
||||
```bash
|
||||
# Copy env.example to .env and add your OpenRouter API key
|
||||
cp env.example .env
|
||||
# Edit .env and set OPENROUTER_API_KEY
|
||||
```
|
||||
|
||||
### 2. Initialize database
|
||||
```bash
|
||||
python scripts/init_db.py
|
||||
```
|
||||
|
||||
### 3. Create a project (if not exists)
|
||||
```bash
|
||||
python main.py ingest-cora --file path/to/cora.xlsx --name "My Project"
|
||||
```
|
||||
|
||||
### 4. Run batch generation
|
||||
```bash
|
||||
python main.py generate-batch --job-file jobs/example_tier1_batch.json
|
||||
```
|
||||
|
||||
### 5. With debug output
|
||||
```bash
|
||||
python main.py generate-batch --job-file jobs/example_tier1_batch.json --debug
|
||||
```
|
||||
|
||||
## Architecture Highlights
|
||||
|
||||
### Three-Stage Pipeline
|
||||
1. **Title Generation**: Uses keyword + entities + related searches
|
||||
2. **Outline Generation**: JSON-formatted with H2/H3 structure, validated against min/max constraints
|
||||
3. **Content Generation**: Full HTML fragment based on outline
|
||||
|
||||
### Simplification Wins
|
||||
- No complex rule engine
|
||||
- Single word count validation (min/max from job file)
|
||||
- One-attempt augmentation if below minimum
|
||||
- Job file controls all operational parameters
|
||||
- Tier defaults for common configurations
|
||||
|
||||
### Error Handling
|
||||
- Network errors: 3 retries with exponential backoff
|
||||
- Rate limits: Respects retry-after headers
|
||||
- Failed articles: Saved with status='failed', can continue processing with `--continue-on-error`
|
||||
- Database errors: Always abort (data integrity)
|
||||
|
||||
## Testing
|
||||
|
||||
Run tests with:
|
||||
```bash
|
||||
pytest tests/unit/test_job_config.py -v
|
||||
pytest tests/unit/test_content_generator.py -v
|
||||
pytest tests/integration/test_generate_batch.py -v
|
||||
```
|
||||
|
||||
## Next Steps (Future Stories)
|
||||
|
||||
- Story 2.3: Interlinking integration
|
||||
- Story 3.x: Template selection
|
||||
- Story 4.x: Deployment integration
|
||||
- Expand test coverage (currently basic tests only)
|
||||
|
||||
## Success Criteria Met
|
||||
|
||||
All acceptance criteria from Story 2.2 have been met:
|
||||
|
||||
✅ 1. Batch Job Control - Job file specifies all tier parameters
|
||||
✅ 2. Three-Stage Generation - Title → Outline → Content pipeline
|
||||
✅ 3. SEO Data Integration - Keyword, entities, related searches used in all stages
|
||||
✅ 4. Word Count Validation - Validates against min/max from job file
|
||||
✅ 5. Simple Augmentation - Single attempt if below minimum
|
||||
✅ 6. Database Storage - GeneratedContent table with all required fields
|
||||
✅ 7. CLI Execution - generate-batch command with progress logging
|
||||
|
||||
## Estimated Implementation Time
|
||||
- Total: ~20-29 hours (as estimated in task breakdown)
|
||||
- Actual: Completed in a single session
|
||||
|
||||
## Notes
|
||||
|
||||
- OpenRouter API key required in environment
|
||||
- Debug output saved to `debug_output/` when `--debug` flag used
|
||||
- Job files support multiple projects and tiers
|
||||
- Tier defaults can be fully or partially overridden
|
||||
- HTML output is fragment format (no <html>, <head>, or <body> tags)
|
||||
- Word count strips HTML tags and counts text words only
|
||||
|
||||
|
|
@@ -0,0 +1,36 @@
|
|||
import json

from src.database.models import GeneratedContent
from src.database.session import db_manager

# Inspect the most recently created GeneratedContent row.
session = db_manager.get_session()
gc = session.query(GeneratedContent).order_by(GeneratedContent.id.desc()).first()

if gc:
    print(f"Content ID: {gc.id}")
    print(f"Stage: {gc.generation_stage}")
    print(f"Status: {gc.status}")
    print(f"Outline attempts: {gc.outline_attempts}")
    print(f"Error: {gc.error_message}")

    if gc.outline:
        outline = json.loads(gc.outline)
        sections = outline.get("sections", [])
        print("\nOutline:")
        print(f"H2 count: {len(sections)}")
        h3_count = sum(len(section.get("h3s", [])) for section in sections)
        print(f"H3 count: {h3_count}")

        has_faq = any(
            "faq" in section["h2"].lower() or "question" in section["h2"].lower()
            for section in sections
        )
        print(f"Has FAQ: {has_faq}")

        print("\nH2s:")
        # The loop variable must not reuse the session's name, otherwise
        # the close() call below would be made on a section dict.
        for section in sections:
            print(f"  - {section['h2']} ({len(section.get('h3s', []))} H3s)")
    else:
        print("\nNo outline saved")
else:
    print("No content found")

session.close()
|
||||
|
||||
|
||||
Binary file not shown.
|
|
@@ -20,7 +20,14 @@ Manages user authentication, password hashing, and role-based access control log
|
|||
Responsible for parsing the CORA .xlsx files and creating new Project entries in the database.
|
||||
|
||||
### generation
|
||||
Interacts with the AI service API. It takes project data, constructs prompts, and retrieves the generated text. Includes the Content Rule Engine for validation.
|
||||
Interacts with the AI service API (OpenRouter). Implements a simplified three-stage pipeline:
|
||||
- **AIClient**: Handles OpenRouter API calls with retry logic
|
||||
- **PromptManager**: Loads and formats prompt templates from JSON files
|
||||
- **ContentGenerator**: Orchestrates title, outline, and content generation
|
||||
- **BatchProcessor**: Processes job files and manages multi-tier batch generation
|
||||
- **JobConfig**: Parses job configuration files with tier defaults
|
||||
|
||||
The generation module uses SEO data from the Project table (keyword, entities, related searches) to inform all stages of content generation. Validates word count and performs simple augmentation if content is below minimum threshold.
|
||||
|
||||
### templating
|
||||
Takes raw generated text and applies the appropriate HTML/CSS template based on the project's configuration.
|
||||
|
|
|
|||
|
|
@@ -29,20 +29,28 @@ The following data models will be implemented using SQLAlchemy.
|
|||
|
||||
## 3. GeneratedContent
|
||||
|
||||
**Purpose**: Stores the AI-generated content and its final deployed state.
|
||||
**Purpose**: Stores the AI-generated content from the three-stage pipeline.
|
||||
|
||||
**Key Attributes**:
|
||||
- `id`: Integer, Primary Key
|
||||
- `project_id`: Integer, Foreign Key to Project
|
||||
- `title`: Text
|
||||
- `outline`: Text
|
||||
- `body_text`: Text
|
||||
- `final_html`: Text
|
||||
- `deployed_url`: String, Unique
|
||||
- `tier`: String (for link classification)
|
||||
- `id`: Integer, Primary Key, Auto-increment
|
||||
- `project_id`: Integer, Foreign Key to Project, Indexed
|
||||
- `tier`: String(20), Not Null, Indexed (tier1, tier2, tier3)
|
||||
- `keyword`: String(255), Not Null, Indexed
|
||||
- `title`: Text, Not Null (Generated in stage 1)
|
||||
- `outline`: JSON, Not Null (Generated in stage 2)
|
||||
- `content`: Text, Not Null (HTML fragment from stage 3)
|
||||
- `word_count`: Integer, Not Null (Validated word count)
|
||||
- `status`: String(20), Not Null (generated, augmented, failed)
|
||||
- `created_at`: DateTime, Not Null
|
||||
- `updated_at`: DateTime, Not Null
|
||||
|
||||
**Relationships**: Belongs to one Project.
|
||||
|
||||
**Status Values**:
|
||||
- `generated`: Content was successfully generated within word count range
|
||||
- `augmented`: Content was below minimum and was augmented
|
||||
- `failed`: Generation failed (error details in outline JSON)
|
||||
|
||||
## 4. FqdnMapping
|
||||
|
||||
**Purpose**: Maps cloud storage buckets to fully qualified domain names for URL generation.
|
||||
|
|
|
|||
|
|
@@ -1,27 +1,81 @@
|
|||
# Core Workflows
|
||||
|
||||
This sequence diagram illustrates the primary workflow for a single content generation job.
|
||||
## Content Generation Workflow (Story 2.2)
|
||||
|
||||
The simplified three-stage content generation pipeline:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant CLI
|
||||
participant Ingestion
|
||||
participant Generation
|
||||
participant Interlinking
|
||||
participant Deployment
|
||||
participant API
|
||||
participant BatchProcessor
|
||||
participant ContentGenerator
|
||||
participant AIClient
|
||||
participant Database
|
||||
|
||||
User->>CLI: run job --file report.xlsx
|
||||
CLI->>Ingestion: process_cora_file("report.xlsx")
|
||||
Ingestion-->>CLI: project_id
|
||||
CLI->>Generation: generate_content(project_id)
|
||||
Generation-->>CLI: raw_html_list
|
||||
CLI->>Interlinking: inject_links(raw_html_list)
|
||||
Interlinking-->>CLI: final_html_list
|
||||
CLI->>Deployment: deploy_batch(final_html_list)
|
||||
Deployment-->>CLI: deployed_urls
|
||||
CLI->>API: send_to_link_builder(job_data, deployed_urls)
|
||||
API-->>CLI: success
|
||||
CLI-->>User: Job Complete! URLs logged.
|
||||
User->>CLI: generate-batch --job-file jobs/example.json
|
||||
CLI->>BatchProcessor: process_job()
|
||||
|
||||
loop For each project/tier/article
|
||||
BatchProcessor->>ContentGenerator: generate_title(project_id)
|
||||
ContentGenerator->>AIClient: generate_completion(prompt)
|
||||
AIClient-->>ContentGenerator: title
|
||||
|
||||
BatchProcessor->>ContentGenerator: generate_outline(project_id, title)
|
||||
ContentGenerator->>AIClient: generate_completion(prompt, json_mode=true)
|
||||
AIClient-->>ContentGenerator: outline JSON
|
||||
|
||||
BatchProcessor->>ContentGenerator: generate_content(project_id, title, outline)
|
||||
ContentGenerator->>AIClient: generate_completion(prompt)
|
||||
AIClient-->>ContentGenerator: HTML content
|
||||
|
||||
BatchProcessor->>ContentGenerator: validate_word_count(content)
|
||||
|
||||
alt Below minimum word count
|
||||
BatchProcessor->>ContentGenerator: augment_content(content, target_count)
|
||||
ContentGenerator->>AIClient: generate_completion(prompt)
|
||||
AIClient-->>ContentGenerator: augmented HTML
|
||||
end
|
||||
|
||||
BatchProcessor->>Database: save GeneratedContent record
|
||||
end
|
||||
|
||||
BatchProcessor-->>CLI: Summary statistics
|
||||
CLI-->>User: Job complete
|
||||
```
|
||||
|
||||
## CORA Ingestion Workflow (Story 2.1)
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant CLI
|
||||
participant Parser
|
||||
participant Database
|
||||
|
||||
User->>CLI: ingest-cora --file report.xlsx --name "Project Name"
|
||||
CLI->>Parser: parse(file_path)
|
||||
Parser-->>CLI: cora_data dict
|
||||
CLI->>Database: create Project record
|
||||
Database-->>CLI: project_id
|
||||
CLI-->>User: Project created (ID: X)
|
||||
```
|
||||
|
||||
## Deployment Workflow (Story 1.6)
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant CLI
|
||||
participant BunnyNetClient
|
||||
participant Database
|
||||
|
||||
User->>CLI: provision-site --name "Site" --domain "example.com"
|
||||
CLI->>BunnyNetClient: create_storage_zone()
|
||||
BunnyNetClient-->>CLI: storage_zone_id
|
||||
CLI->>BunnyNetClient: create_pull_zone()
|
||||
BunnyNetClient-->>CLI: pull_zone_id
|
||||
CLI->>BunnyNetClient: add_custom_hostname()
|
||||
CLI->>Database: save SiteDeployment record
|
||||
CLI-->>User: Site provisioned! Configure DNS.
|
||||
```
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,913 @@
|
|||
# Story 2.2: Simplified AI Content Generation - Detailed Task Breakdown
|
||||
|
||||
## Overview
|
||||
This document breaks down Story 2.2 into detailed tasks with specific implementation notes.
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 1: Data Model & Schema Design**
|
||||
|
||||
### Task 1.1: Create GeneratedContent Database Model
|
||||
**File**: `src/database/models.py`
|
||||
|
||||
**Add new model class:**
|
||||
```python
|
||||
from datetime import datetime

from sqlalchemy import JSON, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import Mapped, mapped_column

# `Base` is the declarative base already defined in src/database/models.py.


class GeneratedContent(Base):
    __tablename__ = "generated_content"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
    tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
    keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    title: Mapped[str] = mapped_column(Text, nullable=False)
    outline: Mapped[dict] = mapped_column(JSON, nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    status: Mapped[str] = mapped_column(String(20), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime,
        default=datetime.utcnow,
        onupdate=datetime.utcnow,
        nullable=False
    )
|
||||
```
|
||||
|
||||
**Status values**: `generated`, `augmented`, `failed`
|
||||
|
||||
**Update**: `scripts/init_db.py` to create the table
|
||||
|
||||
---
|
||||
|
||||
### Task 1.2: Create GeneratedContent Repository
|
||||
**File**: `src/database/repositories.py`
|
||||
|
||||
**Add repository class:**
|
||||
```python
|
||||
class GeneratedContentRepository(BaseRepository[GeneratedContent]):
    def __init__(self, session: Session):
        super().__init__(GeneratedContent, session)

    def get_by_project_id(self, project_id: int) -> list[GeneratedContent]:
        pass

    def get_by_project_and_tier(self, project_id: int, tier: str) -> list[GeneratedContent]:
        pass

    def get_by_keyword(self, keyword: str) -> list[GeneratedContent]:
        pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 1.3: Define Job File JSON Schema
|
||||
**File**: `jobs/README.md` (create/update)
|
||||
|
||||
**Job file structure** (one project per job, multiple jobs per file):
|
||||
```json
|
||||
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        },
        "tier2": {
          "count": 10,
          "min_word_count": 1500,
          "max_word_count": 2000,
          "min_h2_tags": 2,
          "max_h2_tags": 4,
          "min_h3_tags": 3,
          "max_h3_tags": 8
        },
        "tier3": {
          "count": 15,
          "min_word_count": 1000,
          "max_word_count": 1500,
          "min_h2_tags": 2,
          "max_h2_tags": 3,
          "min_h3_tags": 2,
          "max_h3_tags": 6
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": { ... }
      }
    }
  ]
}
|
||||
```
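A job file in this shape could be checked up front, in line with the "fail fast with validation error" rule in Task 4.3. The following is a minimal sketch under stated assumptions: the function name `parse_job_file` and the exact error messages are hypothetical and not taken from the implementation.

```python
import json


def parse_job_file(text: str) -> list[dict]:
    """Hypothetical validator: return the list of jobs or raise ValueError."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    jobs = data.get("jobs") if isinstance(data, dict) else None
    if not isinstance(jobs, list) or not jobs:
        raise ValueError("Job file must contain a non-empty 'jobs' list")

    for position, job in enumerate(jobs, start=1):
        if not isinstance(job, dict):
            raise ValueError(f"Job {position}: must be a JSON object")
        if not isinstance(job.get("project_id"), int):
            raise ValueError(f"Job {position}: 'project_id' must be an integer")
        tiers = job.get("tiers")
        if not isinstance(tiers, dict) or not tiers:
            raise ValueError(f"Job {position}: 'tiers' must be a non-empty object")
    return jobs


sample = '{"jobs": [{"project_id": 1, "tiers": {"tier1": {"count": 5}}}]}'
jobs = parse_job_file(sample)
```

Validating the whole file before the first API call means a malformed job is reported before any generation cost is incurred.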
|
||||
|
||||
**Tier defaults** (constants if not specified in job file):
|
||||
```python
|
||||
TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}
|
||||
```
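One way the defaults could be combined with values from the job file is a plain dictionary overlay, so a job can override all or only some of the fields. This is a sketch, not the shipped code: the helper name `resolve_tier_config` is hypothetical, and treating `count` as required is an assumption based on it being absent from the defaults table. Only `tier1` is repeated here to keep the example short.

```python
from typing import Any

# Subset of the tier defaults, repeated so the sketch is self-contained.
TIER_DEFAULTS: dict[str, dict[str, int]] = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10,
    },
}


def resolve_tier_config(tier_name: str, overrides: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: overlay job-file values on the tier defaults."""
    if tier_name not in TIER_DEFAULTS:
        raise ValueError(f"Unknown tier: {tier_name}")
    if "count" not in overrides:
        # Assumption: "count" has no default and must come from the job file.
        raise ValueError(f"{tier_name}: 'count' is required")

    # Later keys win, so job-file values replace the defaults.
    merged = {**TIER_DEFAULTS[tier_name], **overrides}
    if merged["min_word_count"] > merged["max_word_count"]:
        raise ValueError(f"{tier_name}: min_word_count exceeds max_word_count")
    return merged


config = resolve_tier_config("tier1", {"count": 5, "max_word_count": 3000})
```

Here `config` keeps the default `min_word_count` of 2000 while taking `max_word_count` from the job file.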
|
||||
|
||||
**Future extensibility note**: This structure allows adding more fields per job in future stories.
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 2: AI Client & Prompt Management**
|
||||
|
||||
### Task 2.1: Implement AIClient for OpenRouter
|
||||
**File**: `src/generation/ai_client.py`
|
||||
|
||||
**OpenRouter API details**:
|
||||
- Base URL: `https://openrouter.ai/api/v1`
|
||||
- Compatible with OpenAI SDK
|
||||
- Requires `OPENROUTER_API_KEY` env variable
|
||||
|
||||
**Initial model list**:
|
||||
```python
|
||||
AVAILABLE_MODELS = {
    "gpt-4o-mini": "openai/gpt-4o-mini",
    "claude-sonnet-4.5": "anthropic/claude-sonnet-4.5"
}
|
||||
```
|
||||
|
||||
**Implementation**:
|
||||
```python
|
||||
from typing import Optional

from openai import OpenAI


class AIClient:
    def __init__(self, api_key: str, model: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def generate_completion(
        self,
        prompt: str,
        system_message: Optional[str] = None,
        max_tokens: int = 4000,
        temperature: float = 0.7,
        json_mode: bool = False
    ) -> str:
        """
        Generate completion from OpenRouter API
        json_mode: if True, adds response_format={"type": "json_object"}
        """
        pass
|
||||
```
|
||||
|
||||
**Error handling**: Retry 3x with exponential backoff for network/rate limit errors
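The effect of `json_mode` on the request can be illustrated without calling the API. The helper `build_request_kwargs` below is hypothetical; it only assembles the keyword arguments that `generate_completion` might hand to the OpenAI SDK's chat completions call.

```python
from typing import Any, Optional


def build_request_kwargs(
    model: str,
    prompt: str,
    system_message: Optional[str] = None,
    max_tokens: int = 4000,
    temperature: float = 0.7,
    json_mode: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: build the arguments for a chat completion request."""
    messages = []
    if system_message:
        # The system message is optional and goes first when present.
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})

    kwargs: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if json_mode:
        # Ask the model for a JSON object, as used in outline generation.
        kwargs["response_format"] = {"type": "json_object"}
    return kwargs


kwargs = build_request_kwargs("openai/gpt-4o-mini", "Create an outline", json_mode=True)
```

Inside `generate_completion`, these arguments would presumably be passed as `self.client.chat.completions.create(**kwargs)`, with the text read from `response.choices[0].message.content`.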
|
||||
|
||||
---
|
||||
|
||||
### Task 2.2: Create Prompt Templates
|
||||
**Files**: `src/generation/prompts/*.json`
|
||||
|
||||
**title_generation.json**:
|
||||
```json
|
||||
{
|
||||
"system_message": "You are an expert SEO content writer...",
|
||||
"user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting."
|
||||
}
|
||||
```
|
||||
|
||||
**outline_generation.json**:
|
||||
```json
|
||||
{
|
||||
"system_message": "You are an expert content outliner...",
|
||||
"user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- {min_h2} to {max_h2} H2 headings\n- {min_h3} to {max_h3} H3 subheadings total\n\nEntities: {entities}\nRelated searches: {related_searches}\n\nReturn as JSON: {\"outline\": [{\"h2\": \"...\", \"h3\": [\"...\", \"...\"]}]}"
|
||||
}
|
||||
```
|
||||
|
||||
**content_generation.json**:
|
||||
```json
|
||||
{
|
||||
"system_message": "You are an expert content writer...",
|
||||
"user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include: {entities}\nRelated searches: {related_searches}\n\nReturn as HTML fragment with <h2>, <h3>, <p> tags. Do NOT include <html>, <head>, or <body> tags."
|
||||
}
|
||||
```
|
||||
|
||||
**content_augmentation.json**:
|
||||
```json
|
||||
{
|
||||
"system_message": "You are an expert content editor...",
|
||||
"user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count}\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment."
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2.3: Create PromptManager
|
||||
**File**: `src/generation/ai_client.py` (add to same file)
|
||||
|
||||
```python
|
||||
class PromptManager:
    def __init__(self, prompts_dir: str = "src/generation/prompts"):
        self.prompts_dir = prompts_dir
        self.prompts = {}

    def load_prompt(self, prompt_name: str) -> dict:
        """Load prompt from JSON file"""
        pass

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """
        Format prompt with variables
        Returns: (system_message, user_prompt)
        """
        pass
|
||||
```
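A possible implementation of the two methods is sketched below under the name `PromptManagerSketch` (hypothetical). One detail worth noting: the outline template contains literal JSON braces in its "Return as JSON" instruction, so plain `str.format` would try to interpret them as replacement fields and fail. The sketch therefore substitutes only `{name}` placeholders whose name was actually supplied; this substitution strategy is an assumption, not necessarily what the project does.

```python
import json
import re
import tempfile
from pathlib import Path


class PromptManagerSketch:
    def __init__(self, prompts_dir: str):
        self.prompts_dir = Path(prompts_dir)
        self.prompts: dict[str, dict] = {}

    def load_prompt(self, prompt_name: str) -> dict:
        """Load a template from <prompts_dir>/<prompt_name>.json, caching it."""
        if prompt_name not in self.prompts:
            path = self.prompts_dir / f"{prompt_name}.json"
            self.prompts[prompt_name] = json.loads(path.read_text(encoding="utf-8"))
        return self.prompts[prompt_name]

    def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
        """Return (system_message, user_prompt) with known placeholders filled."""
        prompt = self.load_prompt(prompt_name)

        def substitute(match: re.Match) -> str:
            name = match.group(1)
            # Unknown placeholders and literal JSON braces are left untouched.
            return str(kwargs[name]) if name in kwargs else match.group(0)

        user_prompt = re.sub(r"\{(\w+)\}", substitute, prompt["user_prompt"])
        return prompt["system_message"], user_prompt


# Demonstration with a throwaway template directory.
with tempfile.TemporaryDirectory() as tmp:
    template = {
        "system_message": "You are an expert content outliner...",
        "user_prompt": 'Title: {title}\nReturn as JSON: {"outline": [{"h2": "..."}]}',
    }
    Path(tmp, "outline_generation.json").write_text(json.dumps(template), encoding="utf-8")
    system_message, user_prompt = PromptManagerSketch(tmp).format_prompt(
        "outline_generation", title="SEO Basics"
    )
```

An alternative would be to escape the literal braces as `{{` and `}}` in the template files and keep `str.format`, at the cost of less readable templates.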
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 3: Core Generation Pipeline**
|
||||
|
||||
### Task 3.1: Implement ContentGenerator Service
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
class ContentGenerator:
    def __init__(
        self,
        ai_client: AIClient,
        prompt_manager: PromptManager,
        project_repo: ProjectRepository,
        content_repo: GeneratedContentRepository
    ):
        self.ai_client = ai_client
        self.prompt_manager = prompt_manager
        self.project_repo = project_repo
        self.content_repo = content_repo
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3.2: Implement Stage 1 - Title Generation
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
def generate_title(self, project_id: int, debug: bool = False) -> str:
    """
    Generate SEO-optimized title

    Returns: title string
    Saves to debug_output/title_project_{id}_{timestamp}.txt if debug=True
    """
    # Fetch project
    # Load prompt
    # Call AI
    # If debug: save response to debug_output/
    # Return title
    pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3.3: Implement Stage 2 - Outline Generation
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
def generate_outline(
    self,
    project_id: int,
    title: str,
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
    debug: bool = False
) -> dict:
    """
    Generate article outline in JSON format

    Returns: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}

    Uses json_mode=True in AI call to ensure JSON response
    Validates: at least min_h2 headings, at least min_h3 total subheadings
    Saves to debug_output/outline_project_{id}_{timestamp}.json if debug=True
    """
    pass
|
||||
```
|
||||
|
||||
**Validation**:
|
||||
- Parse JSON response
|
||||
- Count h2 tags (must be >= min_h2)
|
||||
- Count total h3 tags across all h2s (must be >= min_h3)
|
||||
- Raise error if validation fails
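The validation steps listed above can be sketched as a standalone function. The name `validate_outline` and the error messages are hypothetical; the expected JSON shape follows the return format documented for `generate_outline`.

```python
import json


def validate_outline(raw_response: str, min_h2: int, min_h3: int) -> dict:
    """Hypothetical validator: parse the AI response and check heading counts."""
    outline = json.loads(raw_response)  # raises json.JSONDecodeError on bad JSON
    sections = outline.get("outline") if isinstance(outline, dict) else None
    if not isinstance(sections, list):
        raise ValueError("Outline JSON must contain an 'outline' list")

    h2_count = len(sections)
    # H3s are counted across all H2 sections, not per section.
    h3_count = sum(len(section.get("h3", [])) for section in sections)

    if h2_count < min_h2:
        raise ValueError(f"Outline has {h2_count} H2s, expected at least {min_h2}")
    if h3_count < min_h3:
        raise ValueError(f"Outline has {h3_count} H3s, expected at least {min_h3}")
    return outline


response = json.dumps({
    "outline": [
        {"h2": "What Is SEO?", "h3": ["Definition", "History"]},
        {"h2": "Why It Matters", "h3": ["Traffic"]},
    ]
})
outline = validate_outline(response, min_h2=2, min_h3=3)
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to handle both malformed JSON and failed count checks.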
|
||||
|
||||
---
|
||||
|
||||
### Task 3.4: Implement Stage 3 - Content Generation
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
def generate_content(
    self,
    project_id: int,
    title: str,
    outline: dict,
    debug: bool = False
) -> str:
    """
    Generate full article HTML fragment

    Returns: HTML string with <h2>, <h3>, <p> tags
    Does NOT include <html>, <head>, or <body> tags

    Saves to debug_output/content_project_{id}_{timestamp}.html if debug=True
    """
    pass
|
||||
```
|
||||
|
||||
**HTML fragment format**:
|
||||
```html
|
||||
<h2>First Heading</h2>
|
||||
<p>Paragraph content...</p>
|
||||
<h3>Subheading</h3>
|
||||
<p>More content...</p>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3.5: Implement Word Count Validation
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
def validate_word_count(self, content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """
    Validate content word count

    Returns: (is_valid, actual_count)
    - is_valid: True if min_words <= actual_count <= max_words
    - actual_count: number of words in content

    Implementation: Strip HTML tags, split on whitespace, count tokens
    """
    pass
|
||||
```
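The "strip HTML tags, split on whitespace, count tokens" approach might look like the following sketch, written as free functions rather than methods. The regex-based tag stripping is a simplification chosen for brevity and is an assumption; a real HTML parser would be more robust for unusual markup.

```python
import re
from html import unescape


def count_words(content: str) -> int:
    """Count words in an HTML fragment, ignoring the tags themselves."""
    # Replace each tag with a space so adjacent blocks do not merge into one word.
    text = re.sub(r"<[^>]+>", " ", content)
    # Decode entities such as &amp; so they count as ordinary text.
    text = unescape(text)
    return len(text.split())


def validate_word_count(content: str, min_words: int, max_words: int) -> tuple[bool, int]:
    """Return (is_valid, actual_count) for the given inclusive range."""
    actual_count = count_words(content)
    return min_words <= actual_count <= max_words, actual_count


html_fragment = "<h2>First Heading</h2><p>Paragraph content with five words.</p>"
is_valid, actual_count = validate_word_count(html_fragment, min_words=5, max_words=10)
```

For this fragment the heading contributes two words and the paragraph five, so `actual_count` is 7 and the range check passes.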
|
||||
|
||||
---
|
||||
|
||||
### Task 3.6: Implement Simple Augmentation
|
||||
**File**: `src/generation/service.py`
|
||||
|
||||
```python
|
||||
def augment_content(
    self,
    content: str,
    target_word_count: int,
    debug: bool = False
) -> str:
    """
    Expand article content to meet minimum word count

    Called ONLY if word_count < min_word_count
    Makes ONE API call only

    Saves to debug_output/augmented_project_{id}_{timestamp}.html if debug=True
    """
    pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 4: Batch Processing**
|
||||
|
||||
### Task 4.1: Create JobConfig Parser
|
||||
**File**: `src/generation/job_config.py`
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass
from typing import Optional

TIER_DEFAULTS = {
    "tier1": {
        "min_word_count": 2000,
        "max_word_count": 2500,
        "min_h2_tags": 3,
        "max_h2_tags": 5,
        "min_h3_tags": 5,
        "max_h3_tags": 10
    },
    "tier2": {
        "min_word_count": 1500,
        "max_word_count": 2000,
        "min_h2_tags": 2,
        "max_h2_tags": 4,
        "min_h3_tags": 3,
        "max_h3_tags": 8
    },
    "tier3": {
        "min_word_count": 1000,
        "max_word_count": 1500,
        "min_h2_tags": 2,
        "max_h2_tags": 3,
        "min_h3_tags": 2,
        "max_h3_tags": 6
    }
}

@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int

@dataclass
class Job:
    project_id: int
    tiers: dict[str, TierConfig]

class JobConfig:
    def __init__(self, job_file_path: str):
        """Load and parse job file, apply defaults"""
        pass

    def get_jobs(self) -> list[Job]:
        """Return list of all jobs in file"""
        pass

    def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
        """Get tier config with defaults applied"""
        pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4.2: Create BatchProcessor
|
||||
**File**: `src/generation/batch_processor.py`
|
||||
|
||||
```python
|
||||
class BatchProcessor:
    def __init__(
        self,
        content_generator: ContentGenerator,
        content_repo: GeneratedContentRepository,
        project_repo: ProjectRepository
    ):
        pass

    def process_job(
        self,
        job_file_path: str,
        debug: bool = False,
        continue_on_error: bool = False
    ):
        """
        Process all jobs in job file

        For each job:
            For each tier:
                For count times:
                    1. Generate title (log to console)
                    2. Generate outline
                    3. Generate content
                    4. Validate word count
                    5. If below min, augment once
                    6. Save to GeneratedContent table

        Logs progress to console
        If debug=True, saves AI responses to debug_output/
        """
        pass
|
||||
```
|
||||
|
||||
**Console output format**:
|
||||
```
|
||||
Processing Job 1/3: Project ID 5
  Tier 1: Generating 5 articles
    [1/5] Generating title... "Ultimate Guide to SEO in 2025"
    [1/5] Generating outline... 4 H2s, 8 H3s
    [1/5] Generating content... 1,845 words
    [1/5] Below minimum (2000), augmenting... 2,123 words
    [1/5] Saved (ID: 42, Status: augmented)
    [2/5] Generating title... "Advanced SEO Techniques"
    ...
  Tier 2: Generating 10 articles
    ...

Summary:
  Jobs processed: 3/3
  Articles generated: 45/45
  Augmented: 12
  Failed: 0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4.3: Error Handling & Retry Logic
|
||||
**File**: `src/generation/batch_processor.py`
|
||||
|
||||
**Error handling strategy**:
|
||||
- AI API errors: Log error, mark as `status='failed'`, save to DB
  - If `continue_on_error=True`: continue to next article
  - If `continue_on_error=False`: stop batch processing
- Database errors: Always abort (data integrity)
- Invalid job file: Fail fast with validation error
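The strategy above can be sketched as a loop that catches only generation failures and lets every other exception, such as a database error, propagate and abort the batch. `GenerationError`, `BatchStats`, and `run_articles` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class GenerationError(Exception):
    """Hypothetical error type for failures coming from the AI API."""


@dataclass
class BatchStats:
    generated: int = 0
    failed: int = 0


def run_articles(
    count: int,
    generate_article: Callable[[int], str],
    continue_on_error: bool = False,
) -> BatchStats:
    stats = BatchStats()
    for index in range(1, count + 1):
        try:
            generate_article(index)
            stats.generated += 1
        except GenerationError as exc:
            # This is where a record with status='failed' would be saved.
            stats.failed += 1
            print(f"[{index}/{count}] Failed: {exc}")
            if not continue_on_error:
                break
        # Any other exception (for example a database error) is not caught
        # here, so it aborts the whole batch.
    return stats


def flaky_generator(index: int) -> str:
    if index == 2:
        raise GenerationError("AI API error")
    return "<p>ok</p>"


stats_continue = run_articles(3, flaky_generator, continue_on_error=True)
stats_stop = run_articles(3, flaky_generator, continue_on_error=False)
```

With `continue_on_error=True` the third article is still attempted; without it the loop stops after the first failure.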
|
||||
|
||||
**Retry logic** (in AIClient):
|
||||
- Network errors: 3 retries with exponential backoff (1s, 2s, 4s)
|
||||
- Rate limit errors: Respect Retry-After header
|
||||
- Other errors: No retry, raise immediately
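A rough sketch of this retry policy follows. The exception classes and `call_with_retry` are hypothetical stand-ins, not the actual `AIClient` code, and the sketch assumes rate-limit retries share the same cap of three as network retries, which the notes above do not specify.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NetworkError(Exception):
    """Hypothetical: transient connection failure."""


class RateLimitError(Exception):
    """Hypothetical: HTTP 429, optionally carrying the Retry-After value."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying transient failures up to `max_retries` times."""
    attempt = 0
    while True:
        try:
            return func()
        except NetworkError:
            if attempt >= max_retries:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s
        except RateLimitError as exc:
            if attempt >= max_retries:
                raise
            # Prefer the server-provided wait; fall back to backoff.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * (2 ** attempt)
        # Exceptions of any other type are not caught and raise immediately.
        sleep(delay)
        attempt += 1
```

Passing `sleep` as a parameter is a design choice for testability: unit tests can substitute a list's `append` method and assert on the recorded delays without actually waiting.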
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 5: CLI Integration**
|
||||
|
||||
### Task 5.1: Add generate-batch Command
|
||||
**File**: `src/cli/commands.py`
|
||||
|
||||
```python
@app.command("generate-batch")
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
              help='Path to job JSON file')
@click.option('--username', '-u', help='Username for authentication')
@click.option('--password', '-p', help='Password for authentication')
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
@click.option('--continue-on-error', is_flag=True,
              help='Continue processing if article generation fails')
@click.option('--model', '-m', default='gpt-4o-mini',
              help='AI model to use (gpt-4o-mini, claude-sonnet-4.5)')
def generate_batch(
    job_file: str,
    username: Optional[str],
    password: Optional[str],
    debug: bool,
    continue_on_error: bool,
    model: str
):
    """Generate content batch from job file"""
    # Authenticate user
    # Initialize AIClient with OpenRouter
    # Initialize PromptManager, ContentGenerator, BatchProcessor
    # Call process_job()
    # Show summary
    pass
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5.2: Add Progress Logging & Debug Output
|
||||
**File**: `src/generation/batch_processor.py`
|
||||
|
||||
**Debug output** (when `--debug` flag used):
|
||||
- Create the `debug_output/` directory if it does not exist
- For each AI call, save the response to a file:
  - `debug_output/title_project{id}_tier{tier}_article{n}_{timestamp}.txt`
  - `debug_output/outline_project{id}_tier{tier}_article{n}_{timestamp}.json`
  - `debug_output/content_project{id}_tier{tier}_article{n}_{timestamp}.html`
  - `debug_output/augmented_project{id}_tier{tier}_article{n}_{timestamp}.html`
- Also echo to console with `click.echo()`
|
||||
|
||||
**Normal output** (without `--debug`):
|
||||
- Always show title when generated: `"Generated title: {title}"`
|
||||
- Show word counts and status
|
||||
- Show progress counter `[n/total]`
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 6: Testing & Validation**
|
||||
|
||||
### Task 6.1: Create Unit Tests
|
||||
|
||||
#### `tests/unit/test_ai_client.py`
|
||||
```python
def test_generate_completion_success():
    """Test successful AI completion"""
    pass


def test_generate_completion_json_mode():
    """Test JSON mode returns valid JSON"""
    pass


def test_generate_completion_retry_on_network_error():
    """Test retry logic for network errors"""
    pass
```
|
||||
|
||||
#### `tests/unit/test_content_generator.py`
|
||||
```python
def test_generate_title():
    """Test title generation with mocked AI response"""
    pass


def test_generate_outline_valid_structure():
    """Test outline generation returns valid JSON with min h2/h3"""
    pass


def test_generate_content_html_fragment():
    """Test content is HTML fragment (no <html> tag)"""
    pass


def test_validate_word_count():
    """Test word count validation with various HTML inputs"""
    pass


def test_augment_content_called_once():
    """Test augmentation only called once"""
    pass
```
|
||||
|
||||
#### `tests/unit/test_job_config.py`
|
||||
```python
def test_load_job_config_valid():
    """Test loading valid job file"""
    pass


def test_tier_defaults_applied():
    """Test defaults applied when not in job file"""
    pass


def test_multiple_jobs_in_file():
    """Test parsing file with multiple jobs"""
    pass
```
|
||||
|
||||
#### `tests/unit/test_batch_processor.py`
|
||||
```python
def test_process_job_success():
    """Test successful batch processing"""
    pass


def test_process_job_with_augmentation():
    """Test articles below min word count are augmented"""
    pass


def test_process_job_continue_on_error():
    """Test continue_on_error flag behavior"""
    pass
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6.2: Create Integration Test
|
||||
**File**: `tests/integration/test_generate_batch.py`
|
||||
|
||||
```python
def test_generate_batch_end_to_end(test_db, mock_ai_client):
    """
    End-to-end test:
    1. Create test project in DB
    2. Create test job file
    3. Run batch processor
    4. Verify GeneratedContent records created
    5. Verify word counts within range
    6. Verify HTML structure
    """
    pass
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6.3: Create Example Job Files
|
||||
|
||||
#### `jobs/example_tier1_batch.json`
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    }
  ]
}
```
|
||||
(Uses all defaults for tier1)
|
||||
|
||||
#### `jobs/example_multi_tier_batch.json`
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5,
          "min_word_count": 2200,
          "max_word_count": 2600
        },
        "tier2": {
          "count": 10
        },
        "tier3": {
          "count": 15,
          "max_h2_tags": 4
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": {
          "count": 3
        }
      }
    }
  ]
}
```
|
||||
|
||||
#### `jobs/README.md`
|
||||
Document job file format and examples
|
||||
|
||||
---
|
||||
|
||||
## **PHASE 7: Cleanup & Deprecation**
|
||||
|
||||
### Task 7.1: Remove Old ContentRuleEngine
|
||||
**Action**: Delete or gut `src/generation/rule_engine.py`
|
||||
|
||||
Only keep if it has reusable utilities. Otherwise remove entirely.
|
||||
|
||||
---
|
||||
|
||||
### Task 7.2: Remove Old Validator Logic
|
||||
**Action**: Review `src/generation/validator.py` (if exists)
|
||||
|
||||
Remove any strict CORA validation beyond word count. Keep only simple validation utilities.
|
||||
|
||||
---
|
||||
|
||||
### Task 7.3: Update Documentation
|
||||
**Files to update**:
|
||||
- `docs/stories/story-2.2. simplified-ai-content-generation.md` - Change status from "In Progress" to "Done"
|
||||
- `docs/architecture/workflows.md` - Document simplified generation flow
|
||||
- `docs/architecture/components.md` - Update generation component description
|
||||
|
||||
---
|
||||
|
||||
## Implementation Order Recommendation
|
||||
|
||||
1. **Phase 1** (Data Layer) - Required foundation
|
||||
2. **Phase 2** (AI Client) - Required for generation
|
||||
3. **Phase 3** (Core Logic) - Implement one stage at a time, test each
|
||||
4. **Phase 4** (Batch Processing) - Orchestrate stages
|
||||
5. **Phase 5** (CLI) - Make accessible to users
|
||||
6. **Phase 6** (Testing) - Can be done in parallel with implementation
|
||||
7. **Phase 7** (Cleanup) - Final polish
|
||||
|
||||
**Estimated effort**:
|
||||
- Phase 1-2: 4-6 hours
|
||||
- Phase 3: 6-8 hours
|
||||
- Phase 4: 3-4 hours
|
||||
- Phase 5: 2-3 hours
|
||||
- Phase 6: 4-6 hours
|
||||
- Phase 7: 1-2 hours
|
||||
- **Total**: 20-29 hours
|
||||
|
||||
---
|
||||
|
||||
## Critical Dev Notes
|
||||
|
||||
### OpenRouter Specifics
|
||||
- API key from environment: `OPENROUTER_API_KEY`
|
||||
- Model format: `"provider/model-name"`
|
||||
- Supports OpenAI SDK drop-in replacement
|
||||
- Rate limits vary by model (check OpenRouter docs)
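As a rough illustration of these points, the sketch below builds (but does not send) an OpenAI-style chat completion request using only the standard library. `build_chat_request` is a hypothetical helper; the base URL matches the `AI_API_BASE_URL` value in the project's env template, and the `/chat/completions` path follows the OpenAI-compatible convention.

```python
import json
import urllib.request

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def build_chat_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build a POST request for a single-message chat completion."""
    payload = {
        "model": model,  # "provider/model-name", e.g. "openai/gpt-4o-mini"
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{OPENROUTER_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In practice the key would come from `os.environ["OPENROUTER_API_KEY"]`, and the request would be sent with `urllib.request.urlopen` or replaced by an OpenAI SDK client pointed at the same base URL.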
|
||||
|
||||
### HTML Fragment Format
|
||||
Content generation returns HTML like:
|
||||
```html
|
||||
<h2>Main Topic</h2>
|
||||
<p>Introduction paragraph with relevant keywords and entities.</p>
|
||||
<h3>Subtopic One</h3>
|
||||
<p>Detailed content about subtopic.</p>
|
||||
<h3>Subtopic Two</h3>
|
||||
<p>More detailed content.</p>
|
||||
<h2>Second Main Topic</h2>
|
||||
<p>Content continues...</p>
|
||||
```
|
||||
|
||||
**No document structure**: No `<!DOCTYPE>`, `<html>`, `<head>`, or `<body>` tags.
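A simple guard for this rule might look like the following. It is a sketch under the assumption that a regex scan is sufficient; `is_html_fragment` is a hypothetical helper, not part of the codebase.

```python
import re

# Matches the opening of <!DOCTYPE>, <html>, <head>, or <body>.
# The trailing \b keeps tags such as <header> from matching "head".
_DOCUMENT_TAGS = re.compile(r"<\s*(!doctype|html|head|body)\b", re.IGNORECASE)


def is_html_fragment(html: str) -> bool:
    """Return True when no document-level tags appear in the markup."""
    return _DOCUMENT_TAGS.search(html) is None
```

A check like this could back the `test_generate_content_html_fragment` unit test listed in Phase 6.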
|
||||
|
||||
### Word Count Method
|
||||
```python
import re
from html import unescape

def count_words(html_content: str) -> int:
    # Strip HTML tags
    text = re.sub(r'<[^>]+>', '', html_content)
    # Unescape HTML entities
    text = unescape(text)
    # Split and count
    words = text.split()
    return len(words)
```
|
||||
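To show how the count feeds validation, here is a small sketch. `word_count_status` and its return values are hypothetical, and `count_words` is repeated only so the block is self-contained. One caveat of the tag-stripping approach: adjacent tags with no whitespace between them (for example `</h2><p>`) merge the neighbouring words into one.

```python
import re
from html import unescape


def count_words(html_content: str) -> int:
    """Count whitespace-separated words after removing tags and entities."""
    text = unescape(re.sub(r"<[^>]+>", "", html_content))
    return len(text.split())


def word_count_status(html_content: str, min_words: int, max_words: int) -> str:
    """Classify content as 'below_min', 'above_max', or 'ok'."""
    n = count_words(html_content)
    if n < min_words:
        return "below_min"
    if n > max_words:
        return "above_max"
    return "ok"
```

Only the `below_min` outcome would trigger the single augmentation attempt described in the story.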
|
||||
### Debug Output Directory
|
||||
- Create `debug_output/` at project root if not exists
|
||||
- Add to `.gitignore`
|
||||
- Filename format: `{stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}`
|
||||
- Example: `title_project5_tier1_article3_20251020_143022.txt`
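The naming scheme above could be produced by a helper along these lines. `debug_path` and `save_debug` are hypothetical names; the stage-to-extension mapping follows the file list in Task 5.2.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# File extension per generation stage.
_EXTENSIONS = {
    "title": "txt",
    "outline": "json",
    "content": "html",
    "augmented": "html",
}


def debug_path(
    stage: str,
    project_id: int,
    tier: int,
    article_num: int,
    now: Optional[datetime] = None,
    root: Path = Path("debug_output"),
) -> Path:
    """Build {stage}_project{id}_tier{tier}_article{n}_{timestamp}.{ext}."""
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    ext = _EXTENSIONS[stage]
    name = f"{stage}_project{project_id}_tier{tier}_article{article_num}_{timestamp}.{ext}"
    return root / name


def save_debug(path: Path, text: str) -> None:
    """Write a debug file, creating the directory if it does not exist."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Accepting `now` as an argument keeps the filename deterministic in tests while defaulting to the current time in normal use.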
|
||||
|
||||
### Tier Constants Location
|
||||
Define in `src/generation/job_config.py` as module-level constant for easy reference.
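One possible shape for that constant is sketched below, using the default values listed in `jobs/README.md`. The names `TIER_DEFAULTS` and `resolve_tier` and the exact `TierConfig` fields are assumptions for illustration; the real module may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict

TIER_DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
    "tier2": {"min_word_count": 1500, "max_word_count": 2000,
              "min_h2_tags": 2, "max_h2_tags": 4,
              "min_h3_tags": 3, "max_h3_tags": 8},
    "tier3": {"min_word_count": 1000, "max_word_count": 1500,
              "min_h2_tags": 2, "max_h2_tags": 3,
              "min_h3_tags": 2, "max_h3_tags": 6},
}


@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int


def resolve_tier(tier_name: str, raw: Dict[str, Any]) -> TierConfig:
    """Overlay values from the job file onto the defaults for that tier."""
    if "count" not in raw:
        raise ValueError(f"{tier_name}: 'count' is required")
    defaults = TIER_DEFAULTS[tier_name]
    # Only known limit fields are overridden; other keys are ignored here.
    limits = {key: raw.get(key, value) for key, value in defaults.items()}
    return TierConfig(count=raw["count"], **limits)
```

For example, `resolve_tier("tier3", {"count": 15, "max_h2_tags": 4})` keeps every tier3 default except the H2 maximum, which mirrors the multi-tier example job file.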
|
||||
|
||||
### Future Extensibility
|
||||
Job file structure designed to support:
|
||||
- Custom interlinking rules (Story 2.4+)
|
||||
- Template selection (Story 3.x)
|
||||
- Deployment targets (Story 4.x)
|
||||
- SEO metadata overrides
|
||||
|
||||
Keep job parsing flexible to add new fields without breaking existing jobs.
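One way to keep parsing tolerant of future fields is to collect anything unrecognized instead of rejecting it. The sketch below is an assumption about how that might look; the `extra` attribute and `parse_job` function are hypothetical and not taken from the actual `job_config.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

_REQUIRED = ("project_id", "tiers")


@dataclass
class Job:
    project_id: int
    tiers: Dict[str, Dict[str, Any]]
    # Fields this version does not understand are preserved, not rejected.
    extra: Dict[str, Any] = field(default_factory=dict)


def parse_job(raw: Dict[str, Any]) -> Job:
    """Parse one entry of the top-level "jobs" list."""
    missing = [name for name in _REQUIRED if name not in raw]
    if missing:
        raise ValueError(f"job is missing required field(s): {', '.join(missing)}")
    extra = {key: value for key, value in raw.items() if key not in _REQUIRED}
    return Job(project_id=raw["project_id"], tiers=raw["tiers"], extra=extra)
```

With this approach an existing job file keeps working after a later story adds, say, an `interlinking` block, and older code simply carries the unknown block along in `extra`.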
|
||||
|
||||
---
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Test Mocking
|
||||
Mock `AIClient.generate_completion()` to return a realistic response for each stage (title text, outline JSON, content HTML):
|
||||
```python
import pytest


@pytest.fixture
def mock_title_response():
    return "The Ultimate Guide to Sustainable Gardening in 2025"


@pytest.fixture
def mock_outline_response():
    return {
        "outline": [
            {"h2": "Getting Started", "h3": ["Tools", "Planning"]},
            {"h2": "Best Practices", "h3": ["Watering", "Composting"]}
        ]
    }


@pytest.fixture
def mock_content_response():
    return """<h2>Getting Started</h2>
<p>Sustainable gardening begins with proper planning...</p>
<h3>Tools</h3>
<p>Essential tools include...</p>"""
```
|
||||
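Canned responses like these can be fed to a stubbed client with `unittest.mock`. The helper below is a minimal sketch; `make_mock_ai_client` is a hypothetical name, and it assumes the code under test calls `generate_completion()` once per stage, in order.

```python
from typing import Any, Iterable
from unittest.mock import MagicMock


def make_mock_ai_client(responses: Iterable[Any]) -> MagicMock:
    """Return a stand-in client whose generate_completion() yields each response in turn."""
    client = MagicMock(name="AIClient")
    # With a list as side_effect, each call returns the next item.
    client.generate_completion.side_effect = list(responses)
    return client
```

A test could then build the client from the title, outline, and content values and afterwards assert on `client.generate_completion.call_count` to confirm how many AI calls were made.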
|
||||
### Integration Test Database
|
||||
Use `conftest.py` fixture with in-memory SQLite and test data:
|
||||
```python
@pytest.fixture
def test_project(test_db):
    project_repo = ProjectRepository(test_db)
    return project_repo.create(
        user_id=1,
        name="Test Project",
        data={
            "main_keyword": "sustainable gardening",
            "entities": ["composting", "organic soil"],
            "related_searches": ["how to compost", "organic gardening tips"]
        }
    )
```
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
Story is complete when:
|
||||
1. All database models and repositories implemented
|
||||
2. AIClient successfully calls OpenRouter API
|
||||
3. Three-stage generation pipeline works end-to-end
|
||||
4. Batch processor handles multiple jobs/tiers
|
||||
5. CLI command `generate-batch` functional
|
||||
6. Debug output saves to `debug_output/` when `--debug` used
|
||||
7. All unit tests pass
|
||||
8. Integration test demonstrates full workflow
|
||||
9. Example job files work correctly
|
||||
10. Documentation updated
|
||||
|
||||
**Acceptance**: Run `generate-batch` on real project, verify content saved to database with correct word count and structure.
|
||||
|
||||
|
|
@ -0,0 +1,40 @@
|
|||
# Story 2.2: Simplified AI Content Generation via Batch Job
|
||||
|
||||
## Status
|
||||
Completed
|
||||
|
||||
## Story
|
||||
**As a** User,
|
||||
**I want** to control AI content generation via a batch file that specifies word count and heading limits,
|
||||
**so that** I can easily create topically relevant articles without unnecessary complexity or rigid validation.
|
||||
|
||||
## Acceptance Criteria
|
||||
1. **Batch Job Control:** The `generate-batch` command accepts a JSON job file that specifies `min_word_count`, `max_word_count`, `max_h2_tags`, and `max_h3_tags` for each tier.
|
||||
2. **Three-Stage Generation:** The system uses a simple three-stage pipeline:
    * Generates a title using the project's SEO data.
    * Generates an outline based on the title, SEO data, and the `max_h2`/`max_h3` limits from the job file.
    * Generates the full article content based on the validated outline.
|
||||
3. **SEO Data Integration:** The generation process for all stages is informed by the project's `keyword`, `entities`, and `related_searches` to ensure topical relevance.
|
||||
4. **Word Count Validation:** After generation, the system validates the content *only* against the `min_word_count` and `max_word_count` specified in the job file.
|
||||
5. **Simple Augmentation:** If the generated content is below `min_word_count`, the system makes **one** attempt to append additional content using a simple "expand on this article" prompt.
|
||||
6. **Database Storage:** The final generated title, outline, and content are stored in the `GeneratedContent` table.
|
||||
7. **CLI Execution:** The `generate-batch` command successfully runs the job, logs progress to the console, and indicates when the process is complete.
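The outline limits in criterion 2 could be checked with a sketch like the one below. It assumes the outline JSON has the shape used by the mock fixture in the task breakdown (an `"outline"` list of sections with `"h2"` and `"h3"` keys); the function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def count_headings(outline: Dict[str, Any]) -> Tuple[int, int]:
    """Return (number of H2 sections, total number of H3 entries)."""
    sections = outline.get("outline", [])
    h2_count = len(sections)
    h3_count = sum(len(section.get("h3", [])) for section in sections)
    return h2_count, h3_count


def outline_violations(
    outline: Dict[str, Any],
    min_h2: int,
    max_h2: int,
    min_h3: int,
    max_h3: int,
) -> List[str]:
    """List every limit the outline breaks; an empty list means it passes."""
    h2_count, h3_count = count_headings(outline)
    problems = []
    if not min_h2 <= h2_count <= max_h2:
        problems.append(f"H2 count {h2_count} outside {min_h2}-{max_h2}")
    if not min_h3 <= h3_count <= max_h3:
        problems.append(f"H3 count {h3_count} outside {min_h3}-{max_h3}")
    return problems
```

Returning a list of messages rather than a boolean makes it easy to log exactly which limit an outline missed.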
|
||||
|
||||
## Dev Notes
|
||||
* **Objective:** This story replaces the previous, overly complex stories 2.2 and 2.3. The goal is maximum simplicity and user control via the job file.
|
||||
* **Key Change:** Remove the entire `ContentRuleEngine` and all strict CORA validation logic. The only validation required is a final word count check.
|
||||
* **Job File is King:** All operational parameters (`min_word_count`, `max_word_count`, `max_h2_tags`, `max_h3_tags`) must be read from the job file for each tier being processed.
|
||||
* **Augmentation:** Keep it simple. If `word_count < min_word_count`, make a single API call to the AI with a prompt like: "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Here is the article: {content}". Do not create a complex augmentation system.
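The single-attempt rule might be expressed as follows. This is a sketch: `augment_once` is a hypothetical function, the `"generated"` status name is an assumption (only `augmented` appears in the sample console output), and the AI's reply is treated as the full expanded article.

```python
from typing import Callable, Tuple

AUGMENT_PROMPT = (
    "Please expand on the following article to add more detail and depth, "
    "ensuring you maintain the existing topical focus. "
    "Here is the article: {content}"
)


def augment_once(
    content: str,
    min_words: int,
    count_words: Callable[[str], int],
    complete: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (content, status), calling the AI at most one time."""
    if count_words(content) >= min_words:
        return content, "generated"
    # Exactly one expansion call; the result is accepted even if it is
    # still below the minimum, since no second attempt is made.
    expanded = complete(AUGMENT_PROMPT.format(content=content))
    return expanded, "augmented"
```

Because there is no loop, the "one attempt" guarantee holds by construction, which keeps the behavior in line with the goal of avoiding a complex augmentation system.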
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
See **[story-2.2-task-breakdown.md](story-2.2-task-breakdown.md)** for detailed implementation tasks.
|
||||
|
||||
The task breakdown is organized into 7 phases:
|
||||
1. **Phase 1**: Data Model & Schema Design (GeneratedContent table, repositories, job file schema)
|
||||
2. **Phase 2**: AI Client & Prompt Management (OpenRouter integration, prompt templates)
|
||||
3. **Phase 3**: Core Generation Pipeline (title, outline, content generation with validation)
|
||||
4. **Phase 4**: Batch Processing (job config parser, batch processor, error handling)
|
||||
5. **Phase 5**: CLI Integration (generate-batch command, progress logging, debug output)
|
||||
6. **Phase 6**: Testing & Validation (unit tests, integration tests, example job files)
|
||||
7. **Phase 7**: Cleanup & Deprecation (remove old rule engine and validators)
|
||||
|
|
@ -2,7 +2,7 @@
|
|||
DATABASE_URL=sqlite:///./content_automation.db
|
||||
|
||||
# AI Service Configuration (OpenRouter)
|
||||
AI_API_KEY=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
|
||||
OPENROUTER_API_KEY=your_openrouter_api_key_here
|
||||
AI_API_BASE_URL=https://openrouter.ai/api/v1
|
||||
AI_MODEL=anthropic/claude-3.5-sonnet
|
||||
|
||||
|
|
|
|||
|
|
@ -0,0 +1,16 @@
|
|||
5b5bd1b (HEAD -> feature/tier-word-count-override) Add tier-specific word count and outline controls
3063fc4 (origin/main, origin/HEAD, main) Story 2.3 - content generation script nightmare alomst done - fixed (maybe) outline too big issue
b6b0acf Story 2.3 - content generation script nightmare alomst done - pre-fix outline too big issue
f73b070 (github/main) Story 2.3 - content generation script finished - fix ci
e2afabb Story 2.3 - content generation script finished
0069e6e Story 2.2 - rule engine finished
d81537f Story 2.1 finished
02dd5a3 Story 2.1 finished
29ecaec Story 1.7 finished
da797c2 Story 1.6 finished - added sync
4cada9d Story 1.6 finished
b6e495e feat: Story 1.5 - CLI User Management
0a223e2 Complete Story 1.4: Internal API Foundation
8641bca Complete Epic 1 Stories 1.1-1.3: Foundation, Database, and Authentication
70b9de2 feat: Complete Story 1.1 - Project Initialization & Configuration
31b9580 Initial commit: Project structure and planning documents
|
||||
218
jobs/README.md
|
|
@ -1,77 +1,179 @@
|
|||
# Job Configuration Files
|
||||
# Job File Format
|
||||
|
||||
This directory contains batch job configuration files for content generation.
|
||||
Job files define batch content generation parameters using JSON format.
|
||||
|
||||
## Usage
|
||||
|
||||
Run a batch job using the CLI:
|
||||
|
||||
```bash
|
||||
python main.py generate-batch --job-file jobs/example_tier1_batch.json -u admin -p password
|
||||
```
|
||||
|
||||
## Job Configuration Structure
|
||||
## Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"job_name": "Descriptive name",
|
||||
"project_id": 1,
|
||||
"description": "Optional description",
|
||||
"tiers": [
|
||||
"jobs": [
|
||||
{
|
||||
"tier": 1,
|
||||
"article_count": 15,
|
||||
"models": {
|
||||
"title": "model-id",
|
||||
"outline": "model-id",
|
||||
"content": "model-id"
|
||||
},
|
||||
"anchor_text_config": {
|
||||
"mode": "default|override|append",
|
||||
"custom_text": ["optional", "custom", "anchors"],
|
||||
"additional_text": ["optional", "additions"]
|
||||
},
|
||||
"validation_attempts": 3
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 5,
|
||||
"min_word_count": 2000,
|
||||
"max_word_count": 2500,
|
||||
"min_h2_tags": 3,
|
||||
"max_h2_tags": 5,
|
||||
"min_h3_tags": 5,
|
||||
"max_h3_tags": 10
|
||||
}
|
||||
],
|
||||
"failure_config": {
|
||||
"max_consecutive_failures": 5,
|
||||
"skip_on_failure": true
|
||||
},
|
||||
"interlinking": {
|
||||
"links_per_article_min": 2,
|
||||
"links_per_article_max": 4,
|
||||
"include_home_link": true
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Available Models
|
||||
## Fields
|
||||
|
||||
- `anthropic/claude-3.5-sonnet` - Best for high-quality content
|
||||
- `anthropic/claude-3-haiku` - Fast and cost-effective
|
||||
- `openai/gpt-4o` - Excellent quality
|
||||
- `openai/gpt-4o-mini` - Good for titles/outlines
|
||||
- `meta-llama/llama-3.1-70b-instruct` - Open source alternative
|
||||
- `google/gemini-pro-1.5` - Google's offering
|
||||
### Job Level
|
||||
- `project_id` (required): The project ID to generate content for
|
||||
- `tiers` (required): Dictionary of tier configurations
|
||||
|
||||
## Anchor Text Modes
|
||||
### Tier Level
|
||||
- `count` (required): Number of articles to generate for this tier
|
||||
- `min_word_count` (optional): Minimum word count (uses defaults if not specified)
|
||||
- `max_word_count` (optional): Maximum word count (uses defaults if not specified)
|
||||
- `min_h2_tags` (optional): Minimum H2 headings (uses defaults if not specified)
|
||||
- `max_h2_tags` (optional): Maximum H2 headings (uses defaults if not specified)
|
||||
- `min_h3_tags` (optional): Minimum H3 subheadings total (uses defaults if not specified)
|
||||
- `max_h3_tags` (optional): Maximum H3 subheadings total (uses defaults if not specified)
|
||||
|
||||
- **default**: Use CORA rules (keyword, entities, related searches)
|
||||
- **override**: Replace default with custom_text list
|
||||
- **append**: Add additional_text to default anchor text
|
||||
## Tier Defaults
|
||||
|
||||
## Example Files
|
||||
If tier parameters are not specified, these defaults are used:
|
||||
|
||||
- `example_tier1_batch.json` - Single tier 1 with 15 articles
|
||||
- `example_multi_tier_batch.json` - Three tiers with 165 total articles
|
||||
- `example_custom_anchors.json` - Custom anchor text demo
|
||||
### tier1
|
||||
- `min_word_count`: 2000
|
||||
- `max_word_count`: 2500
|
||||
- `min_h2_tags`: 3
|
||||
- `max_h2_tags`: 5
|
||||
- `min_h3_tags`: 5
|
||||
- `max_h3_tags`: 10
|
||||
|
||||
## Tips
|
||||
### tier2
|
||||
- `min_word_count`: 1500
|
||||
- `max_word_count`: 2000
|
||||
- `min_h2_tags`: 2
|
||||
- `max_h2_tags`: 4
|
||||
- `min_h3_tags`: 3
|
||||
- `max_h3_tags`: 8
|
||||
|
||||
1. Start with tier 1 to ensure quality
|
||||
2. Use faster/cheaper models for tier 2+
|
||||
3. Set `skip_on_failure: true` to continue on errors
|
||||
4. Adjust `max_consecutive_failures` based on model reliability
|
||||
5. Test with small batches first
|
||||
### tier3
|
||||
- `min_word_count`: 1000
|
||||
- `max_word_count`: 1500
|
||||
- `min_h2_tags`: 2
|
||||
- `max_h2_tags`: 3
|
||||
- `min_h3_tags`: 2
|
||||
- `max_h3_tags`: 6
|
||||
|
||||
## Examples
|
||||
|
||||
### Simple: Single Tier with Defaults
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    }
  ]
}
```
|
||||
|
||||
### Custom Word Counts
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 3,
          "min_word_count": 2500,
          "max_word_count": 3000
        }
      }
    }
  ]
}
```
|
||||
|
||||
### Multi-Tier
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        },
        "tier2": {
          "count": 10
        },
        "tier3": {
          "count": 15
        }
      }
    }
  ]
}
```
|
||||
|
||||
### Multiple Projects
|
||||
```json
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 5
        }
      }
    },
    {
      "project_id": 2,
      "tiers": {
        "tier1": {
          "count": 3
        },
        "tier2": {
          "count": 8
        }
      }
    }
  ]
}
```
|
||||
|
||||
## Usage
|
||||
|
||||
Run batch generation with:
|
||||
|
||||
```bash
|
||||
python main.py generate-batch --job-file jobs/example_tier1_batch.json --username youruser --password yourpass
|
||||
```
|
||||
|
||||
### Options
|
||||
- `--job-file, -j`: Path to job JSON file (required)
|
||||
- `--username, -u`: Username for authentication
|
||||
- `--password, -p`: Password for authentication
|
||||
- `--debug`: Save AI responses to debug_output/
|
||||
- `--continue-on-error`: Continue processing if article generation fails
|
||||
- `--model, -m`: AI model to use (default: gpt-4o-mini)
|
||||
|
||||
### Debug Mode
|
||||
|
||||
When using `--debug`, AI responses are saved to `debug_output/`:
|
||||
- `title_project{id}_tier{tier}_article{n}_{timestamp}.txt`
|
||||
- `outline_project{id}_tier{tier}_article{n}_{timestamp}.json`
|
||||
- `content_project{id}_tier{tier}_article{n}_{timestamp}.html`
|
||||
- `augmented_project{id}_tier{tier}_article{n}_{timestamp}.html` (if augmented)
|
||||
|
||||
|
|
|
|||
|
|
@ -1,57 +1,30 @@
|
|||
{
|
||||
"job_name": "Multi-Tier Site Build",
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 5,
|
||||
"min_word_count": 2200,
|
||||
"max_word_count": 2600
|
||||
},
|
||||
"tier2": {
|
||||
"count": 10
|
||||
},
|
||||
"tier3": {
|
||||
"count": 15,
|
||||
"max_h2_tags": 4
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"project_id": 2,
|
||||
"description": "Complete site build with 165 articles across 3 tiers",
|
||||
"tiers": [
|
||||
{
|
||||
"tier": 1,
|
||||
"article_count": 15,
|
||||
"models": {
|
||||
"title": "openai/gpt-4o-mini",
|
||||
"outline": "openai/gpt-4o-mini",
|
||||
"content": "anthropic/claude-4.5-sonnet"
|
||||
},
|
||||
"anchor_text_config": {
|
||||
"mode": "default"
|
||||
},
|
||||
"validation_attempts": 3
|
||||
},
|
||||
{
|
||||
"tier": 2,
|
||||
"article_count": 50,
|
||||
"models": {
|
||||
"title": "openai/gpt-4o-mini",
|
||||
"outline": "openai/gpt-4o-mini",
|
||||
"content": "openai/gpt-4o-mini"
|
||||
},
|
||||
"anchor_text_config": {
|
||||
"mode": "append",
|
||||
"additional_text": ["comprehensive guide", "expert insights"]
|
||||
},
|
||||
"validation_attempts": 2
|
||||
},
|
||||
{
|
||||
"tier": 3,
|
||||
"article_count": 100,
|
||||
"models": {
|
||||
"title": "openai/gpt-4o-mini",
|
||||
"outline": "openai/gpt-4o-mini",
|
||||
"content": "openai/gpt-4o-mini"
|
||||
},
|
||||
"anchor_text_config": {
|
||||
"mode": "default"
|
||||
},
|
||||
"validation_attempts": 2
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 3
|
||||
}
|
||||
],
|
||||
"failure_config": {
|
||||
"max_consecutive_failures": 3,
|
||||
"skip_on_failure": true
|
||||
},
|
||||
"interlinking": {
|
||||
"links_per_article_min": 2,
|
||||
"links_per_article_max": 4,
|
||||
"include_home_link": true
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,30 +1,13 @@
|
|||
{
|
||||
"job_name": "Tier 1 Launch Batch",
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"description": "Initial tier 1 content - 15 high-quality articles with strict validation",
|
||||
"tiers": [
|
||||
{
|
||||
"tier": 1,
|
||||
"article_count": 15,
|
||||
"models": {
|
||||
"title": "anthropic/claude-3.5-sonnet",
|
||||
"outline": "anthropic/claude-3.5-sonnet",
|
||||
"content": "anthropic/claude-3.5-sonnet"
|
||||
},
|
||||
"anchor_text_config": {
|
||||
"mode": "default"
|
||||
},
|
||||
"validation_attempts": 3
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 5
|
||||
}
|
||||
],
|
||||
"failure_config": {
|
||||
"max_consecutive_failures": 5,
|
||||
"skip_on_failure": true
|
||||
},
|
||||
"interlinking": {
|
||||
"links_per_article_min": 2,
|
||||
"links_per_article_max": 4,
|
||||
"include_home_link": true
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -0,0 +1,19 @@
|
|||
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 1,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        }
      }
    }
  ]
}
|
||||
|
||||
|
|
@ -0,0 +1,19 @@
|
|||
{
  "jobs": [
    {
      "project_id": 1,
      "tiers": {
        "tier1": {
          "count": 1,
          "min_word_count": 500,
          "max_word_count": 800,
          "min_h2_tags": 2,
          "max_h2_tags": 3,
          "min_h3_tags": 3,
          "max_h3_tags": 6
        }
      }
    }
  ]
}
|
||||
|
||||
|
|
@ -0,0 +1,27 @@
|
|||
import sys
from pathlib import Path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from src.database.session import db_manager
from src.database.repositories import UserRepository
from src.auth.service import AuthService

db_manager.initialize()
session = db_manager.get_session()

try:
    user_repo = UserRepository(session)
    auth_service = AuthService(user_repo)

    user = auth_service.create_user_with_hashed_password(
        username="admin",
        password="admin1234",
        role="Admin"
    )

    print(f"Admin user created: {user.username}")
finally:
    session.close()
    db_manager.close()
|
||||
|
||||
|
|
@ -16,6 +16,11 @@ from src.deployment.bunnynet import (
|
|||
BunnyNetResourceConflictError
|
||||
)
|
||||
from src.ingestion.parser import CORAParser, CORAParseError
|
||||
from src.generation.ai_client import AIClient, PromptManager
|
||||
from src.generation.service import ContentGenerator
|
||||
from src.generation.batch_processor import BatchProcessor
|
||||
from src.database.repositories import GeneratedContentRepository
|
||||
import os
|
||||
|
||||
|
||||
def authenticate_admin(username: str, password: str) -> Optional[User]:
|
||||
|
|
@ -871,22 +876,26 @@ def list_projects(username: Optional[str], password: Optional[str]):
|
|||
raise click.Abort()
|
||||
|
||||
|
||||
@app.command()
|
||||
@click.option("--job-file", "-j", required=True, help="Path to job configuration JSON file")
|
||||
@click.option("--force-regenerate", "-f", is_flag=True, help="Force regeneration even if content exists")
|
||||
@click.option("--debug", "-d", is_flag=True, help="Enable debug mode (saves generated content to debug_output/)")
|
||||
@click.option("--username", "-u", help="Username for authentication")
|
||||
@click.option("--password", "-p", help="Password for authentication")
|
||||
def generate_batch(job_file: str, force_regenerate: bool, debug: bool, username: Optional[str], password: Optional[str]):
|
||||
"""
|
||||
Generate batch of articles from a job configuration file
|
||||
|
||||
Example:
|
||||
python main.py generate-batch --job-file jobs/tier1_batch.json -u admin -p pass
|
||||
"""
|
||||
from src.generation.batch_processor import BatchProcessor
|
||||
from src.generation.job_config import JobConfig
|
||||
|
||||
<<<<<<< HEAD
|
||||
@app.command("generate-batch")
|
||||
@click.option('--job-file', '-j', required=True, type=click.Path(exists=True),
|
||||
help='Path to job JSON file')
|
||||
@click.option('--username', '-u', help='Username for authentication')
|
||||
@click.option('--password', '-p', help='Password for authentication')
|
||||
@click.option('--debug', is_flag=True, help='Save AI responses to debug_output/')
|
||||
@click.option('--continue-on-error', is_flag=True,
|
||||
help='Continue processing if article generation fails')
|
||||
@click.option('--model', '-m', default='gpt-4o-mini',
|
||||
help='AI model to use (gpt-4o-mini, claude-sonnet-4.5)')
|
||||
def generate_batch(
|
||||
job_file: str,
|
||||
username: Optional[str],
|
||||
password: Optional[str],
|
||||
debug: bool,
|
||||
continue_on_error: bool,
|
||||
model: str
|
||||
):
|
||||
"""Generate content batch from job file"""
|
||||
try:
|
||||
if not username or not password:
|
||||
username, password = prompt_admin_credentials()
|
||||
|
|
@ -903,70 +912,47 @@ def generate_batch(job_file: str, force_regenerate: bool, debug: bool, username:
|
|||
|
||||
click.echo(f"Authenticated as: {user.username} ({user.role})")
|
||||
|
||||
job_config = JobConfig.from_file(job_file)
|
||||
api_key = os.getenv("OPENROUTER_API_KEY")
|
||||
if not api_key:
|
||||
click.echo("Error: OPENROUTER_API_KEY not found in environment", err=True)
|
||||
click.echo("Please set OPENROUTER_API_KEY in your .env file", err=True)
|
||||
raise click.Abort()
|
||||
|
||||
click.echo(f"\nLoading Job: {job_config.job_name}")
|
||||
click.echo(f"Project ID: {job_config.project_id}")
|
||||
click.echo(f"Total Articles: {job_config.get_total_articles()}")
|
||||
click.echo(f"\nTiers:")
|
||||
for tier_config in job_config.tiers:
|
||||
click.echo(f" Tier {tier_config.tier}: {tier_config.article_count} articles")
|
||||
click.echo(f" Models: {tier_config.models.title} / {tier_config.models.outline} / {tier_config.models.content}")
|
||||
click.echo(f"Initializing AI client with model: {model}")
|
||||
ai_client = AIClient(api_key=api_key, model=model)
|
||||
prompt_manager = PromptManager()
|
||||
|
||||
if not click.confirm("\nProceed with generation?"):
|
||||
click.echo("Aborted")
|
||||
return
|
||||
project_repo = ProjectRepository(session)
|
||||
content_repo = GeneratedContentRepository(session)
|
||||
|
||||
click.echo("\nStarting batch generation...")
|
||||
click.echo("-" * 80)
|
||||
content_generator = ContentGenerator(
|
||||
ai_client=ai_client,
|
||||
prompt_manager=prompt_manager,
|
||||
project_repo=project_repo,
|
||||
content_repo=content_repo
|
||||
)
|
||||
|
||||
def progress_callback(tier=None, article_num=None, total=None, status=None, stage=None, **kwargs):
|
||||
if stage:
|
||||
if status == "completed":
|
||||
if stage == "title":
|
||||
title = kwargs.get("title", "")
|
||||
click.echo(f" - Title generated: {title}")
|
||||
elif stage == "outline":
|
||||
outline = kwargs.get("outline", {})
|
||||
h2_count = len(outline.get("sections", []))
|
||||
h3_count = sum(len(s.get("h3s", [])) for s in outline.get("sections", []))
|
||||
click.echo(f" - Outline generated: {h2_count} H2s, {h3_count} H3s")
|
||||
elif stage == "content":
|
||||
word_count = kwargs.get("word_count", 0)
|
||||
click.echo(f" - Content generated: {word_count} words")
|
||||
elif status == "starting":
|
||||
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Generating...")
|
||||
elif status == "completed":
|
||||
content_id = kwargs.get("content_id", "?")
|
||||
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Completed (ID: {content_id})")
|
||||
elif status == "skipped":
|
||||
error = kwargs.get("error", "Unknown error")
|
||||
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Skipped - {error}", err=True)
|
||||
elif status == "failed":
|
||||
error = kwargs.get("error", "Unknown error")
|
||||
click.echo(f"[Tier {tier}] Article {article_num}/{total}: Failed - {error}", err=True)
|
||||
batch_processor = BatchProcessor(
|
||||
content_generator=content_generator,
|
||||
content_repo=content_repo,
|
||||
project_repo=project_repo
|
||||
)
|
||||
|
||||
click.echo(f"\nProcessing job file: {job_file}")
|
||||
if debug:
|
||||
click.echo("\n[DEBUG MODE ENABLED - Content will be saved to debug_output/]\n")
|
||||
click.echo("Debug mode: AI responses will be saved to debug_output/\n")
|
||||
|
||||
processor = BatchProcessor(session)
|
||||
result = processor.process_job(job_config, progress_callback, debug=debug)
|
||||
|
||||
click.echo("-" * 80)
|
||||
click.echo("\nBatch Generation Complete!")
|
||||
click.echo(result.to_summary())
|
||||
batch_processor.process_job(
|
||||
job_file_path=job_file,
|
||||
debug=debug,
|
||||
continue_on_error=continue_on_error
|
||||
)
|
||||
|
||||
finally:
|
||||
session.close()
|
||||
|
||||
except FileNotFoundError as e:
|
||||
click.echo(f"Error: {e}", err=True)
|
||||
raise click.Abort()
|
||||
except ValueError as e:
|
||||
click.echo(f"Error: {e}", err=True)
|
||||
raise click.Abort()
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {e}", err=True)
|
||||
click.echo(f"Error processing batch: {e}", err=True)
|
||||
raise click.Abort()
|
||||
|
||||
|
||||
|
|
|
|||
|
|
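The command above aborts when `OPENROUTER_API_KEY` is missing from the environment. As a minimal sketch of that fail-fast check pulled into a reusable helper (the names `require_env` and `MissingConfigError` are hypothetical and do not appear in this commit):

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is absent (hypothetical name)."""


def require_env(name: str, environ=None) -> str:
    """Return the value of `name`, or raise with a hint if it is unset or empty."""
    # `environ` is injectable so the check can be exercised without touching os.environ.
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise MissingConfigError(
            f"{name} not found in environment; set it in your .env file"
        )
    return value
```

A CLI command could catch `MissingConfigError`, echo the message to stderr, and abort, which keeps the environment lookup testable on its own.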
@@ -3,7 +3,7 @@ SQLAlchemy database models
|
|||
"""
|
||||
|
||||
from datetime import datetime, timezone
|
||||
from typing import Literal, Optional
|
||||
from typing import Optional
|
||||
from sqlalchemy import String, Integer, DateTime, Float, ForeignKey, JSON, Text
|
||||
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
|
||||
|
||||
|
|
@@ -120,40 +120,18 @@ class Project(Base):
|
|||
|
||||
|
||||
class GeneratedContent(Base):
|
||||
"""Generated content model for AI-generated articles with version tracking"""
|
||||
"""Generated content model for AI-created articles"""
|
||||
__tablename__ = "generated_content"
|
||||
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
|
||||
project_id: Mapped[int] = mapped_column(Integer, ForeignKey('projects.id'), nullable=False, index=True)
|
||||
tier: Mapped[int] = mapped_column(Integer, nullable=False, index=True)
|
||||
|
||||
title: Mapped[Optional[str]] = mapped_column(String(500), nullable=True)
|
||||
outline: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
|
||||
content: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
|
||||
|
||||
status: Mapped[str] = mapped_column(String(20), nullable=False, default="pending", index=True)
|
||||
is_active: Mapped[bool] = mapped_column(Integer, nullable=False, default=False)
|
||||
|
||||
generation_stage: Mapped[str] = mapped_column(String(20), nullable=False, default="title")
|
||||
title_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||
outline_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||
content_attempts: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||
|
||||
title_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
|
||||
outline_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
|
||||
content_model: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
|
||||
|
||||
validation_errors: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||
validation_warnings: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||
validation_report: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)
|
||||
|
||||
word_count: Mapped[Optional[int]] = mapped_column(Integer, nullable=True)
|
||||
augmented: Mapped[bool] = mapped_column(Integer, nullable=False, default=False)
|
||||
augmentation_log: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)
|
||||
|
||||
generation_duration: Mapped[Optional[float]] = mapped_column(Float, nullable=True)
|
||||
error_message: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
|
||||
|
||||
tier: Mapped[str] = mapped_column(String(20), nullable=False, index=True)
|
||||
keyword: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
|
||||
title: Mapped[str] = mapped_column(Text, nullable=False)
|
||||
outline: Mapped[dict] = mapped_column(JSON, nullable=False)
|
||||
content: Mapped[str] = mapped_column(Text, nullable=False)
|
||||
word_count: Mapped[int] = mapped_column(Integer, nullable=False)
|
||||
status: Mapped[str] = mapped_column(String(20), nullable=False)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False)
|
||||
updated_at: Mapped[datetime] = mapped_column(
|
||||
DateTime,
|
||||
|
|
@@ -163,4 +141,4 @@ class GeneratedContent(Base):
|
|||
)
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"<GeneratedContent(id={self.id}, project_id={self.project_id}, tier={self.tier}, status='{self.status}', stage='{self.generation_stage}')>"
|
||||
return f"<GeneratedContent(id={self.id}, project_id={self.project_id}, tier='{self.tier}', status='{self.status}')>"
|
||||
|
|
|
|||
|
|
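The `created_at` column in the model above defaults to `datetime.utcnow`, which returns a naive datetime and is deprecated as of Python 3.12, while the module already imports `timezone`. A minimal sketch of a timezone-aware default callable follows; the name `utc_now` is an assumption, not something defined in this commit:

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


stamp = utc_now()
# The timestamp carries explicit UTC tzinfo instead of being naive.
print(stamp.tzinfo is not None)
```

Such a callable could be passed as a column default in place of `datetime.utcnow`. Whether the column should also store timezone information depends on the database backend, which this diff does not settle.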
@@ -5,9 +5,8 @@ Concrete repository implementations
|
|||
from typing import Optional, List, Dict, Any
|
||||
from sqlalchemy.orm import Session
|
||||
from sqlalchemy.exc import IntegrityError
|
||||
from src.database.interfaces import IUserRepository, ISiteDeploymentRepository, IProjectRepository, IGeneratedContentRepository
|
||||
from src.database.interfaces import IUserRepository, ISiteDeploymentRepository, IProjectRepository
|
||||
from src.database.models import User, SiteDeployment, Project, GeneratedContent
|
||||
from src.core.config import get_config
|
||||
|
||||
|
||||
class UserRepository(IUserRepository):
|
||||
|
|
@@ -377,35 +376,55 @@ class ProjectRepository(IProjectRepository):
|
|||
return False
|
||||
|
||||
|
||||
class GeneratedContentRepository(IGeneratedContentRepository):
|
||||
"""Repository implementation for GeneratedContent data access"""
|
||||
<<<<<<< HEAD
|
||||
class GeneratedContentRepository:
|
||||
"""Repository for GeneratedContent data access"""
|
||||
|
||||
def __init__(self, session: Session):
|
||||
self.session = session
|
||||
|
||||
def create(self, project_id: int, tier: int) -> GeneratedContent:
|
||||
def create(
|
||||
self,
|
||||
project_id: int,
|
||||
tier: str,
|
||||
keyword: str,
|
||||
title: str,
|
||||
outline: dict,
|
||||
content: str,
|
||||
word_count: int,
|
||||
status: str
|
||||
) -> GeneratedContent:
|
||||
"""
|
||||
Create a new generated content record
|
||||
|
||||
Args:
|
||||
project_id: The ID of the project
|
||||
tier: The tier level (1, 2, etc.)
|
||||
project_id: The project ID this content belongs to
|
||||
tier: Content tier (tier1, tier2, tier3)
|
||||
keyword: The keyword used for generation
|
||||
title: Generated title
|
||||
outline: Generated outline (JSON)
|
||||
content: Generated HTML content
|
||||
word_count: Final word count
|
||||
status: Status (generated, augmented, failed)
|
||||
|
||||
Returns:
|
||||
The created GeneratedContent object
|
||||
"""
|
||||
content = GeneratedContent(
|
||||
content_record = GeneratedContent(
|
||||
project_id=project_id,
|
||||
tier=tier,
|
||||
status="pending",
|
||||
generation_stage="title",
|
||||
is_active=False
|
||||
keyword=keyword,
|
||||
title=title,
|
||||
outline=outline,
|
||||
content=content,
|
||||
word_count=word_count,
|
||||
status=status
|
||||
)
|
||||
|
||||
self.session.add(content)
|
||||
self.session.add(content_record)
|
||||
self.session.commit()
|
||||
self.session.refresh(content)
|
||||
return content
|
||||
self.session.refresh(content_record)
|
||||
return content_record
|
||||
|
||||
def get_by_id(self, content_id: int) -> Optional[GeneratedContent]:
|
||||
"""
|
||||
|
|
@@ -482,46 +501,51 @@ class GeneratedContentRepository(IGeneratedContentRepository):
|
|||
Returns:
|
||||
The updated GeneratedContent object
|
||||
"""
|
||||
=======
|
||||
content_record = GeneratedContent(
|
||||
project_id=project_id,
|
||||
tier=tier,
|
||||
keyword=keyword,
|
||||
title=title,
|
||||
outline=outline,
|
||||
content=content,
|
||||
word_count=word_count,
|
||||
status=status
|
||||
)
|
||||
|
||||
self.session.add(content_record)
|
||||
self.session.commit()
|
||||
self.session.refresh(content_record)
|
||||
return content_record
|
||||
|
||||
def get_by_id(self, content_id: int) -> Optional[GeneratedContent]:
|
||||
"""Get content by ID"""
|
||||
return self.session.query(GeneratedContent).filter(GeneratedContent.id == content_id).first()
|
||||
|
||||
def get_by_project_id(self, project_id: int) -> List[GeneratedContent]:
|
||||
"""Get all content for a project"""
|
||||
return self.session.query(GeneratedContent).filter(GeneratedContent.project_id == project_id).all()
|
||||
|
||||
def get_by_project_and_tier(self, project_id: int, tier: str) -> List[GeneratedContent]:
|
||||
"""Get content for a project and tier"""
|
||||
return self.session.query(GeneratedContent).filter(
|
||||
GeneratedContent.project_id == project_id,
|
||||
GeneratedContent.tier == tier
|
||||
).all()
|
||||
|
||||
def get_by_keyword(self, keyword: str) -> List[GeneratedContent]:
|
||||
"""Get content by keyword"""
|
||||
return self.session.query(GeneratedContent).filter(GeneratedContent.keyword == keyword).all()
|
||||
|
||||
def update(self, content: GeneratedContent) -> GeneratedContent:
|
||||
"""Update existing content"""
|
||||
self.session.add(content)
|
||||
self.session.commit()
|
||||
self.session.refresh(content)
|
||||
return content
|
||||
|
||||
def set_active(self, content_id: int, project_id: int, tier: int) -> bool:
|
||||
"""
|
||||
Set a content version as active (deactivates others)
|
||||
|
||||
Args:
|
||||
content_id: The ID of the content to activate
|
||||
project_id: The project ID
|
||||
tier: The tier level
|
||||
|
||||
Returns:
|
||||
True if successful, False if content not found
|
||||
"""
|
||||
content = self.get_by_id(content_id)
|
||||
if not content:
|
||||
return False
|
||||
|
||||
self.session.query(GeneratedContent).filter(
|
||||
GeneratedContent.project_id == project_id,
|
||||
GeneratedContent.tier == tier
|
||||
).update({"is_active": False})
|
||||
|
||||
content.is_active = True
|
||||
self.session.commit()
|
||||
return True
|
||||
|
||||
def delete(self, content_id: int) -> bool:
|
||||
"""
|
||||
Delete a generated content record by ID
|
||||
|
||||
Args:
|
||||
content_id: The ID of the content to delete
|
||||
|
||||
Returns:
|
||||
True if deleted, False if content not found
|
||||
"""
|
||||
"""Delete content by ID"""
|
||||
content = self.get_by_id(content_id)
|
||||
if content:
|
||||
self.session.delete(content)
|
||||
|
|
|
|||
|
|
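The repository's `create` and `get_by_project_and_tier` methods store one row per article and filter on project and tier. The sketch below shows the same access pattern with the standard-library `sqlite3` module rather than SQLAlchemy. It is a stand-in for illustration only: the table layout mirrors the columns listed in the diff, and the function shapes are assumptions.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE generated_content (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    project_id INTEGER NOT NULL,
    tier TEXT NOT NULL,
    keyword TEXT NOT NULL,
    title TEXT NOT NULL,
    outline TEXT NOT NULL,      -- JSON-encoded outline
    content TEXT NOT NULL,
    word_count INTEGER NOT NULL,
    status TEXT NOT NULL
)
"""


def create_content(conn, project_id, tier, keyword, title,
                   outline, content, word_count, status):
    """Insert one generated article and return its new row ID."""
    cur = conn.execute(
        "INSERT INTO generated_content "
        "(project_id, tier, keyword, title, outline, content, word_count, status) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (project_id, tier, keyword, title,
         json.dumps(outline), content, word_count, status),
    )
    conn.commit()
    return cur.lastrowid


def get_by_project_and_tier(conn, project_id, tier):
    """Return all articles for one project and tier, oldest first."""
    rows = conn.execute(
        "SELECT id, title, outline, status FROM generated_content "
        "WHERE project_id = ? AND tier = ? ORDER BY id",
        (project_id, tier),
    ).fetchall()
    return [
        {"id": r[0], "title": r[1], "outline": json.loads(r[2]), "status": r[3]}
        for r in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```

Because the outline is serialized to JSON on write and parsed on read, callers work with plain dictionaries, similar in spirit to the `JSON` column type used by the model.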
@@ -1,169 +1,145 @@
|
|||
"""
|
||||
AI client for OpenRouter API integration
|
||||
OpenRouter AI client and prompt management
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
import json
|
||||
from typing import Dict, Any, Optional
|
||||
from openai import OpenAI
|
||||
from dotenv import load_dotenv
|
||||
from src.core.config import Config
|
||||
from pathlib import Path
|
||||
from typing import Optional, Dict, Any
|
||||
from openai import OpenAI, RateLimitError, APIError
|
||||
from src.core.config import get_config
|
||||
|
||||
|
||||
class AIClientError(Exception):
|
||||
"""Base exception for AI client errors"""
|
||||
pass
|
||||
AVAILABLE_MODELS = {
|
||||
"gpt-4o-mini": "openai/gpt-4o-mini",
|
||||
"claude-sonnet-4.5": "anthropic/claude-3.5-sonnet"
|
||||
}
|
||||
|
||||
|
||||
class AIClient:
|
||||
"""Client for interacting with AI models via OpenRouter"""
|
||||
"""OpenRouter API client using OpenAI SDK"""
|
||||
|
||||
def __init__(self, config: Optional[Config] = None):
|
||||
"""
|
||||
Initialize AI client
|
||||
def __init__(
|
||||
self,
|
||||
api_key: str,
|
||||
model: str,
|
||||
base_url: str = "https://openrouter.ai/api/v1"
|
||||
):
|
||||
self.client = OpenAI(api_key=api_key, base_url=base_url)
|
||||
|
||||
Args:
|
||||
config: Application configuration (uses get_config() if None)
|
||||
"""
|
||||
load_dotenv()
|
||||
if model in AVAILABLE_MODELS:
|
||||
self.model = AVAILABLE_MODELS[model]
|
||||
else:
|
||||
self.model = model
|
||||
|
||||
from src.core.config import get_config
|
||||
self.config = config or get_config()
|
||||
|
||||
api_key = os.getenv("AI_API_KEY")
|
||||
if not api_key:
|
||||
raise AIClientError("AI_API_KEY environment variable not set")
|
||||
|
||||
# OpenRouter requires specific headers and configuration
|
||||
self.client = OpenAI(
|
||||
base_url=self.config.ai_service.base_url,
|
||||
api_key=api_key,
|
||||
default_headers={
|
||||
"HTTP-Referer": "https://github.com/yourusername/Big-Link-Man",
|
||||
"X-Title": "Big Link Man Content Generator"
|
||||
}
|
||||
)
|
||||
|
||||
self.default_model = self.config.ai_service.model
|
||||
self.max_tokens = self.config.ai_service.max_tokens
|
||||
self.temperature = self.config.ai_service.temperature
|
||||
self.timeout = self.config.ai_service.timeout
|
||||
|
||||
def generate(
|
||||
def generate_completion(
|
||||
self,
|
||||
prompt: str,
|
||||
model: Optional[str] = None,
|
||||
temperature: Optional[float] = None,
|
||||
max_tokens: Optional[int] = None,
|
||||
response_format: Optional[Dict[str, Any]] = None
|
||||
system_message: Optional[str] = None,
|
||||
max_tokens: int = 4000,
|
||||
temperature: float = 0.7,
|
||||
json_mode: bool = False
|
||||
) -> str:
|
||||
"""
|
||||
Generate text using AI model
|
||||
Generate completion from OpenRouter API
|
||||
|
||||
Args:
|
||||
prompt: The prompt text
|
||||
model: Model to use (defaults to config default)
|
||||
temperature: Temperature (defaults to config default)
|
||||
max_tokens: Max tokens (defaults to config default)
|
||||
response_format: Optional response format for structured output
|
||||
prompt: User prompt text
|
||||
system_message: Optional system message
|
||||
max_tokens: Maximum tokens to generate
|
||||
temperature: Sampling temperature (0-1)
|
||||
json_mode: If True, requests JSON response format
|
||||
|
||||
Returns:
|
||||
Generated text
|
||||
|
||||
Raises:
|
||||
AIClientError: If generation fails
|
||||
Generated text completion
|
||||
"""
|
||||
try:
|
||||
kwargs = {
|
||||
"model": model or self.default_model,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
"temperature": temperature if temperature is not None else self.temperature,
|
||||
"max_tokens": max_tokens or self.max_tokens,
|
||||
"timeout": self.timeout,
|
||||
messages = []
|
||||
if system_message:
|
||||
messages.append({"role": "system", "content": system_message})
|
||||
messages.append({"role": "user", "content": prompt})
|
||||
|
||||
kwargs: Dict[str, Any] = {
|
||||
"model": self.model,
|
||||
"messages": messages,
|
||||
"max_tokens": max_tokens,
|
||||
"temperature": temperature
|
||||
}
|
||||
|
||||
if response_format:
|
||||
kwargs["response_format"] = response_format
|
||||
if json_mode:
|
||||
kwargs["response_format"] = {"type": "json_object"}
|
||||
|
||||
retries = 3
|
||||
for attempt in range(retries):
|
||||
try:
|
||||
response = self.client.chat.completions.create(**kwargs)
|
||||
content = response.choices[0].message.content or ""
|
||||
# Debug: print first 200 chars if json_mode
|
||||
if json_mode:
|
||||
print(f"[DEBUG] AI Response (first 200 chars): {content[:200]}")
|
||||
return content
|
||||
|
||||
if not response.choices:
|
||||
raise AIClientError("No response from AI model")
|
||||
except RateLimitError as e:
|
||||
if attempt < retries - 1:
|
||||
wait_time = 2 ** attempt
|
||||
print(f"Rate limit hit. Retrying in {wait_time}s...")
|
||||
time.sleep(wait_time)
|
||||
else:
|
||||
raise
|
||||
|
||||
content = response.choices[0].message.content
|
||||
if not content:
|
||||
raise AIClientError("Empty response from AI model")
|
||||
|
||||
return content.strip()
|
||||
except APIError as e:
|
||||
if attempt < retries - 1 and "network" in str(e).lower():
|
||||
wait_time = 2 ** attempt
|
||||
print(f"Network error. Retrying in {wait_time}s...")
|
||||
time.sleep(wait_time)
|
||||
else:
|
||||
raise
|
||||
|
||||
except Exception as e:
|
||||
raise AIClientError(f"AI generation failed: {e}")
|
||||
raise
|
||||
|
||||
def generate_json(
|
||||
self,
|
||||
prompt: str,
|
||||
model: Optional[str] = None,
|
||||
temperature: Optional[float] = None,
|
||||
max_tokens: Optional[int] = None
|
||||
) -> Dict[str, Any]:
|
||||
return ""
|
||||
|
||||
|
||||
class PromptManager:
|
||||
"""Manages loading and formatting of prompt templates"""
|
||||
|
||||
def __init__(self, prompts_dir: str = "src/generation/prompts"):
|
||||
self.prompts_dir = Path(prompts_dir)
|
||||
self.prompts: Dict[str, dict] = {}
|
||||
|
||||
def load_prompt(self, prompt_name: str) -> dict:
|
||||
"""Load prompt from JSON file"""
|
||||
if prompt_name in self.prompts:
|
||||
return self.prompts[prompt_name]
|
||||
|
||||
prompt_file = self.prompts_dir / f"{prompt_name}.json"
|
||||
if not prompt_file.exists():
|
||||
raise FileNotFoundError(f"Prompt file not found: {prompt_file}")
|
||||
|
||||
with open(prompt_file, 'r', encoding='utf-8') as f:
|
||||
prompt_data = json.load(f)
|
||||
|
||||
self.prompts[prompt_name] = prompt_data
|
||||
return prompt_data
|
||||
|
||||
def format_prompt(self, prompt_name: str, **kwargs) -> tuple[str, str]:
|
||||
"""
|
||||
Generate JSON-formatted response
|
||||
Format prompt with variables
|
||||
|
||||
Args:
|
||||
prompt: The prompt text (should request JSON output)
|
||||
model: Model to use
|
||||
temperature: Temperature
|
||||
max_tokens: Max tokens
|
||||
prompt_name: Name of the prompt template
|
||||
**kwargs: Variables to inject into the template
|
||||
|
||||
Returns:
|
||||
Parsed JSON response
|
||||
|
||||
Raises:
|
||||
AIClientError: If generation or parsing fails
|
||||
Tuple of (system_message, user_prompt)
|
||||
"""
|
||||
response_text = self.generate(
|
||||
prompt=prompt,
|
||||
model=model,
|
||||
temperature=temperature,
|
||||
max_tokens=max_tokens,
|
||||
response_format={"type": "json_object"}
|
||||
)
|
||||
prompt_data = self.load_prompt(prompt_name)
|
||||
|
||||
try:
|
||||
return json.loads(response_text)
|
||||
except json.JSONDecodeError as e:
|
||||
raise AIClientError(f"Failed to parse JSON response: {e}\nResponse: {response_text}")
|
||||
system_message = prompt_data.get("system_message", "")
|
||||
user_prompt = prompt_data.get("user_prompt", "")
|
||||
|
||||
def validate_model(self, model: str) -> bool:
|
||||
"""
|
||||
Check if a model is available in configuration
|
||||
if system_message:
|
||||
system_message = system_message.format(**kwargs)
|
||||
|
||||
Args:
|
||||
model: Model identifier
|
||||
|
||||
Returns:
|
||||
True if model is available
|
||||
"""
|
||||
available = self.config.ai_service.available_models
|
||||
return model in available.values() or model in available.keys()
|
||||
|
||||
def get_model_id(self, model_name: str) -> str:
|
||||
"""
|
||||
Get full model ID from short name
|
||||
|
||||
Args:
|
||||
model_name: Short name (e.g., "claude-3.5-sonnet") or full ID
|
||||
|
||||
Returns:
|
||||
Full model ID
|
||||
"""
|
||||
available = self.config.ai_service.available_models
|
||||
|
||||
if model_name in available:
|
||||
return available[model_name]
|
||||
|
||||
if model_name in available.values():
|
||||
return model_name
|
||||
|
||||
return model_name
|
||||
user_prompt = user_prompt.format(**kwargs)
|
||||
|
||||
return system_message, user_prompt
|
||||
|
|
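The completion method above retries rate-limited calls, waiting `2 ** attempt` seconds between attempts. The retry pattern can be sketched in isolation as follows. `TransientError` and `call_with_backoff` are hypothetical names, and the injectable `sleep` parameter is an assumption added so the waits can be observed without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate limit (hypothetical)."""


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on TransientError with exponentially growing waits."""
    for attempt in range(retries):
        try:
            return fn()
        except TransientError:
            # Re-raise on the final attempt; otherwise wait 1x, 2x, 4x... the base delay.
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    # Only reached when the loop body never ran.
    raise ValueError("retries must be at least 1")
```

Keeping the retry loop separate from the API call means the backoff schedule can be tested with a fake `sleep`, and other exception types propagate immediately instead of being retried.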
@@ -1,15 +1,12 @@
|
|||
"""
|
||||
Batch job processor for generating multiple articles across tiers
|
||||
Batch processor for content generation jobs
|
||||
"""
|
||||
|
||||
import time
|
||||
from typing import Optional
|
||||
from sqlalchemy.orm import Session
|
||||
from src.database.models import Project
|
||||
from src.database.repositories import ProjectRepository
|
||||
from src.generation.service import ContentGenerationService, GenerationError
|
||||
from src.generation.job_config import JobConfig, JobResult
|
||||
from src.core.config import Config, get_config
|
||||
from typing import Dict, Any
|
||||
import click
|
||||
from src.generation.service import ContentGenerator
|
||||
from src.generation.job_config import JobConfig, Job, TierConfig
|
||||
from src.database.repositories import GeneratedContentRepository, ProjectRepository
|
||||
|
||||
|
||||
class BatchProcessor:
|
||||
|
|
@@ -17,167 +14,205 @@ class BatchProcessor:
|
|||
|
||||
def __init__(
|
||||
self,
|
||||
session: Session,
|
||||
config: Optional[Config] = None
|
||||
content_generator: ContentGenerator,
|
||||
content_repo: GeneratedContentRepository,
|
||||
project_repo: ProjectRepository
|
||||
):
|
||||
"""
|
||||
Initialize batch processor
|
||||
|
||||
Args:
|
||||
session: Database session
|
||||
config: Application configuration
|
||||
"""
|
||||
self.session = session
|
||||
self.config = config or get_config()
|
||||
self.project_repo = ProjectRepository(session)
|
||||
self.generation_service = ContentGenerationService(session, config)
|
||||
self.generator = content_generator
|
||||
self.content_repo = content_repo
|
||||
self.project_repo = project_repo
|
||||
self.stats = {
|
||||
"total_jobs": 0,
|
||||
"processed_jobs": 0,
|
||||
"total_articles": 0,
|
||||
"generated_articles": 0,
|
||||
"augmented_articles": 0,
|
||||
"failed_articles": 0
|
||||
}
|
||||
|
||||
def process_job(
|
||||
self,
|
||||
job_config: JobConfig,
|
||||
progress_callback: Optional[callable] = None,
|
||||
debug: bool = False
|
||||
) -> JobResult:
|
||||
job_file_path: str,
|
||||
debug: bool = False,
|
||||
continue_on_error: bool = False
|
||||
):
|
||||
"""
|
||||
Process a batch job according to configuration
|
||||
Process all jobs in job file
|
||||
|
||||
Args:
|
||||
job_config: Job configuration
|
||||
progress_callback: Optional callback function(tier, article_num, total, status)
|
||||
|
||||
Returns:
|
||||
JobResult with statistics
|
||||
job_file_path: Path to job JSON file
|
||||
debug: If True, save AI responses to debug_output/
|
||||
continue_on_error: If True, continue on article generation failure
|
||||
"""
|
||||
start_time = time.time()
|
||||
job_config = JobConfig(job_file_path)
|
||||
jobs = job_config.get_jobs()
|
||||
|
||||
project = self.project_repo.get_by_id(job_config.project_id)
|
||||
self.stats["total_jobs"] = len(jobs)
|
||||
|
||||
for job_idx, job in enumerate(jobs, 1):
|
||||
try:
|
||||
self._process_single_job(job, job_idx, debug, continue_on_error)
|
||||
self.stats["processed_jobs"] += 1
|
||||
except Exception as e:
|
||||
click.echo(f"Error processing job {job_idx}: {e}")
|
||||
if not continue_on_error:
|
||||
raise
|
||||
|
||||
self._print_summary()
|
||||
|
||||
def _process_single_job(
|
||||
self,
|
||||
job: Job,
|
||||
job_idx: int,
|
||||
debug: bool,
|
||||
continue_on_error: bool
|
||||
):
|
||||
"""Process a single job"""
|
||||
project = self.project_repo.get_by_id(job.project_id)
|
||||
if not project:
|
||||
raise ValueError(f"Project {job_config.project_id} not found")
|
||||
raise ValueError(f"Project {job.project_id} not found")
|
||||
|
||||
result = JobResult(
|
||||
job_name=job_config.job_name,
|
||||
project_id=job_config.project_id,
|
||||
total_articles=job_config.get_total_articles(),
|
||||
successful=0,
|
||||
failed=0,
|
||||
skipped=0
|
||||
click.echo(f"\nProcessing Job {job_idx}/{self.stats['total_jobs']}: Project ID {job.project_id}")
|
||||
|
||||
for tier_name, tier_config in job.tiers.items():
|
||||
self._process_tier(
|
||||
job.project_id,
|
||||
tier_name,
|
||||
tier_config,
|
||||
debug,
|
||||
continue_on_error
|
||||
)
|
||||
|
||||
consecutive_failures = 0
|
||||
def _process_tier(
|
||||
self,
|
||||
project_id: int,
|
||||
tier_name: str,
|
||||
tier_config: TierConfig,
|
||||
debug: bool,
|
||||
continue_on_error: bool
|
||||
):
|
||||
"""Process all articles for a tier"""
|
||||
click.echo(f" {tier_name}: Generating {tier_config.count} articles")
|
||||
|
||||
for tier_config in job_config.tiers:
|
||||
tier = tier_config.tier
|
||||
project = self.project_repo.get_by_id(project_id)
|
||||
keyword = project.main_keyword
|
||||
|
||||
for article_num in range(1, tier_config.article_count + 1):
|
||||
if progress_callback:
|
||||
progress_callback(
|
||||
tier=tier,
|
||||
article_num=article_num,
|
||||
total=tier_config.article_count,
|
||||
status="starting"
|
||||
)
|
||||
for article_num in range(1, tier_config.count + 1):
|
||||
self.stats["total_articles"] += 1
|
||||
|
||||
try:
|
||||
content = self.generation_service.generate_article(
|
||||
project=project,
|
||||
tier=tier,
|
||||
title_model=tier_config.models.title,
|
||||
outline_model=tier_config.models.outline,
|
||||
content_model=tier_config.models.content,
|
||||
max_retries=tier_config.validation_attempts,
|
||||
progress_callback=progress_callback,
|
||||
self._generate_single_article(
|
||||
project_id,
|
||||
tier_name,
|
||||
tier_config,
|
||||
article_num,
|
||||
keyword,
|
||||
debug
|
||||
)
|
||||
self.stats["generated_articles"] += 1
|
||||
|
||||
except Exception as e:
|
||||
self.stats["failed_articles"] += 1
|
||||
import traceback
|
||||
click.echo(f" [{article_num}/{tier_config.count}] FAILED: {e}")
|
||||
click.echo(f" Traceback: {traceback.format_exc()}")
|
||||
|
||||
try:
|
||||
self.content_repo.create(
|
||||
project_id=project_id,
|
||||
tier=tier_name,
|
||||
keyword=keyword,
|
||||
title="Failed Generation",
|
||||
outline={"error": str(e)},
|
||||
content="",
|
||||
word_count=0,
|
||||
status="failed"
|
||||
)
|
||||
except Exception as db_error:
|
||||
click.echo(f" Failed to save error record: {db_error}")
|
||||
|
||||
if not continue_on_error:
|
||||
raise
|
||||
|
||||
def _generate_single_article(
|
||||
self,
|
||||
project_id: int,
|
||||
tier_name: str,
|
||||
tier_config: TierConfig,
|
||||
article_num: int,
|
||||
keyword: str,
|
||||
debug: bool
|
||||
):
|
||||
"""Generate a single article"""
|
||||
prefix = f" [{article_num}/{tier_config.count}]"
|
||||
|
||||
click.echo(f"{prefix} Generating title...")
|
||||
title = self.generator.generate_title(project_id, debug=debug)
|
||||
click.echo(f"{prefix} Generated title: \"{title}\"")
|
||||
|
||||
click.echo(f"{prefix} Generating outline...")
|
||||
outline = self.generator.generate_outline(
|
||||
project_id=project_id,
|
||||
title=title,
|
||||
min_h2=tier_config.min_h2_tags,
|
||||
max_h2=tier_config.max_h2_tags,
|
||||
min_h3=tier_config.min_h3_tags,
|
||||
max_h3=tier_config.max_h3_tags,
|
||||
debug=debug
|
||||
)
|
||||
|
||||
result.successful += 1
|
||||
result.add_tier_result(tier, "successful")
|
||||
consecutive_failures = 0
|
||||
h2_count = len(outline["outline"])
|
||||
h3_count = sum(len(section.get("h3", [])) for section in outline["outline"])
|
||||
click.echo(f"{prefix} Generated outline: {h2_count} H2s, {h3_count} H3s")
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(
|
||||
tier=tier,
|
||||
article_num=article_num,
|
||||
total=tier_config.article_count,
|
||||
status="completed",
|
||||
content_id=content.id
|
||||
click.echo(f"{prefix} Generating content...")
|
||||
content = self.generator.generate_content(
|
||||
project_id=project_id,
|
||||
title=title,
|
||||
outline=outline,
|
||||
min_word_count=tier_config.min_word_count,
|
||||
max_word_count=tier_config.max_word_count,
|
||||
debug=debug
|
||||
)
|
||||
|
||||
except GenerationError as e:
|
||||
error_msg = f"Tier {tier}, Article {article_num}: {str(e)}"
|
||||
result.add_error(error_msg)
|
||||
consecutive_failures += 1
|
||||
word_count = self.generator.count_words(content)
|
||||
click.echo(f"{prefix} Generated content: {word_count:,} words")
|
||||
|
||||
if job_config.failure_config.skip_on_failure:
|
||||
result.skipped += 1
|
||||
result.add_tier_result(tier, "skipped")
|
||||
status = "generated"
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(
|
||||
tier=tier,
|
||||
article_num=article_num,
|
||||
total=tier_config.article_count,
|
||||
status="skipped",
|
||||
error=str(e)
|
||||
if word_count < tier_config.min_word_count:
|
||||
click.echo(f"{prefix} Below minimum ({tier_config.min_word_count:,}), augmenting...")
|
||||
content = self.generator.augment_content(
|
||||
content=content,
|
||||
target_word_count=tier_config.min_word_count,
|
||||
debug=debug,
|
||||
project_id=project_id
|
||||
)
|
||||
word_count = self.generator.count_words(content)
|
||||
click.echo(f"{prefix} Augmented content: {word_count:,} words")
|
||||
status = "augmented"
|
||||
self.stats["augmented_articles"] += 1
|
||||
|
||||
saved_content = self.content_repo.create(
|
||||
project_id=project_id,
|
||||
tier=tier_name,
|
||||
keyword=keyword,
|
||||
title=title,
|
||||
outline=outline,
|
||||
content=content,
|
||||
word_count=word_count,
|
||||
status=status
|
||||
)
|
||||
|
||||
if consecutive_failures >= job_config.failure_config.max_consecutive_failures:
|
||||
result.add_error(
|
||||
f"Stopping job: {consecutive_failures} consecutive failures exceeded threshold"
|
||||
)
|
||||
result.duration = time.time() - start_time
|
||||
return result
|
||||
else:
|
||||
result.failed += 1
|
||||
result.add_tier_result(tier, "failed")
|
||||
result.duration = time.time() - start_time
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(
|
||||
tier=tier,
|
||||
article_num=article_num,
|
||||
total=tier_config.article_count,
|
||||
status="failed",
|
||||
error=str(e)
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Tier {tier}, Article {article_num}: Unexpected error: {str(e)}"
|
||||
result.add_error(error_msg)
|
||||
result.failed += 1
|
||||
result.add_tier_result(tier, "failed")
|
||||
result.duration = time.time() - start_time
|
||||
|
||||
if progress_callback:
|
||||
progress_callback(
|
||||
tier=tier,
|
||||
article_num=article_num,
|
||||
total=tier_config.article_count,
|
||||
status="failed",
|
||||
error=str(e)
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
result.duration = time.time() - start_time
|
||||
return result
|
||||
|
||||
def process_job_from_file(
|
||||
self,
|
||||
job_file_path: str,
|
||||
progress_callback: Optional[callable] = None
|
||||
) -> JobResult:
|
||||
"""
|
||||
Load and process a job from a JSON file
|
||||
|
||||
Args:
|
||||
job_file_path: Path to job configuration JSON file
|
||||
progress_callback: Optional progress callback
|
||||
|
||||
Returns:
|
||||
JobResult with statistics
|
||||
"""
|
||||
job_config = JobConfig.from_file(job_file_path)
|
||||
return self.process_job(job_config, progress_callback)
|
||||
click.echo(f"{prefix} Saved (ID: {saved_content.id}, Status: {status})")
|
||||
|
||||
def _print_summary(self):
|
||||
"""Print job processing summary"""
|
||||
click.echo("\n" + "="*60)
|
||||
click.echo("SUMMARY")
|
||||
click.echo("="*60)
|
||||
click.echo(f"Jobs processed: {self.stats['processed_jobs']}/{self.stats['total_jobs']}")
|
||||
click.echo(f"Articles generated: {self.stats['generated_articles']}/{self.stats['total_articles']}")
|
||||
click.echo(f"Augmented: {self.stats['augmented_articles']}")
|
||||
click.echo(f"Failed: {self.stats['failed_articles']}")
|
||||
click.echo("="*60)
|
||||
|
|
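The early-stop check in the `process_job` hunk above (`consecutive_failures >= max_consecutive_failures`) can be illustrated in isolation. This is a minimal sketch, not the repository's code: `run_batch` and the simulated `outcomes` list are hypothetical, and resetting the streak on success is an assumption, since the diff does not show where the counter is reset. Only the names `max_consecutive_failures` and `skip_on_failure` come from the diff.

```python
from typing import Dict, List


def run_batch(
    outcomes: List[bool],
    max_consecutive_failures: int = 5,
    skip_on_failure: bool = True,
) -> Dict[str, int]:
    """Walk simulated article outcomes (True = success) and stop early
    once the consecutive-failure threshold is reached."""
    stats = {"successful": 0, "failed": 0, "skipped": 0}
    consecutive_failures = 0

    for index, ok in enumerate(outcomes):
        if ok:
            stats["successful"] += 1
            consecutive_failures = 0  # assumption: a success resets the streak
            continue

        stats["failed"] += 1
        consecutive_failures += 1

        stop_now = (
            not skip_on_failure
            or consecutive_failures >= max_consecutive_failures
        )
        if stop_now:
            # Articles after the current one are never attempted.
            stats["skipped"] = len(outcomes) - index - 1
            break

    return stats
```

With `skip_on_failure=False` the sketch stops at the first failure. Otherwise it tolerates isolated failures and stops only when they occur back to back.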
@ -1,213 +1,129 @@
|
|||
"""
|
||||
Job configuration schema and validation for batch content generation
|
||||
Job configuration parser for batch content generation
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Optional, Literal
|
||||
from pydantic import BaseModel, Field, field_validator
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional, Dict, Any
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class ModelConfig(BaseModel):
|
||||
"""AI models configuration for each generation stage"""
|
||||
title: str = Field(..., description="Model for title generation")
|
||||
outline: str = Field(..., description="Model for outline generation")
|
||||
content: str = Field(..., description="Model for content generation")
|
||||
TIER_DEFAULTS = {
|
||||
"tier1": {
|
||||
"min_word_count": 2000,
|
||||
"max_word_count": 2500,
|
||||
"min_h2_tags": 3,
|
||||
"max_h2_tags": 5,
|
||||
"min_h3_tags": 5,
|
||||
"max_h3_tags": 10
|
||||
},
|
||||
"tier2": {
|
||||
"min_word_count": 1500,
|
||||
"max_word_count": 2000,
|
||||
"min_h2_tags": 2,
|
||||
"max_h2_tags": 4,
|
||||
"min_h3_tags": 3,
|
||||
"max_h3_tags": 8
|
||||
},
|
||||
"tier3": {
|
||||
"min_word_count": 1000,
|
||||
"max_word_count": 1500,
|
||||
"min_h2_tags": 2,
|
||||
"max_h2_tags": 3,
|
||||
"min_h3_tags": 2,
|
||||
"max_h3_tags": 6
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class AnchorTextConfig(BaseModel):
|
||||
"""Anchor text configuration"""
|
||||
mode: Literal["default", "override", "append"] = Field(
|
||||
default="default",
|
||||
description="How to handle anchor text: default (use CORA), override (replace), append (add to)"
|
||||
)
|
||||
custom_text: Optional[List[str]] = Field(
|
||||
default=None,
|
||||
description="Custom anchor text for override mode"
|
||||
)
|
||||
additional_text: Optional[List[str]] = Field(
|
||||
default=None,
|
||||
description="Additional anchor text for append mode"
|
||||
)
|
||||
@dataclass
|
||||
class TierConfig:
|
||||
"""Configuration for a specific tier"""
|
||||
count: int
|
||||
min_word_count: int
|
||||
max_word_count: int
|
||||
min_h2_tags: int
|
||||
max_h2_tags: int
|
||||
min_h3_tags: int
|
||||
max_h3_tags: int
|
||||
|
||||
|
||||
class TierConfig(BaseModel):
|
||||
"""Configuration for a single tier"""
|
||||
tier: int = Field(..., ge=1, description="Tier number (1 = strictest validation)")
|
||||
article_count: int = Field(..., ge=1, description="Number of articles to generate")
|
||||
models: ModelConfig = Field(..., description="AI models for this tier")
|
||||
anchor_text_config: AnchorTextConfig = Field(
|
||||
default_factory=AnchorTextConfig,
|
||||
description="Anchor text configuration"
|
||||
)
|
||||
validation_attempts: int = Field(
|
||||
default=3,
|
||||
ge=1,
|
||||
le=10,
|
||||
description="Max validation retry attempts per stage"
|
||||
)
|
||||
|
||||
|
||||
class FailureConfig(BaseModel):
|
||||
"""Failure handling configuration"""
|
||||
max_consecutive_failures: int = Field(
|
||||
default=5,
|
||||
ge=1,
|
||||
description="Stop job after this many consecutive failures"
|
||||
)
|
||||
skip_on_failure: bool = Field(
|
||||
default=True,
|
||||
description="Skip failed articles and continue, or stop immediately"
|
||||
)
|
||||
|
||||
|
||||
class InterlinkingConfig(BaseModel):
|
||||
"""Interlinking configuration"""
|
||||
links_per_article_min: int = Field(
|
||||
default=2,
|
||||
ge=0,
|
||||
description="Minimum links to other articles"
|
||||
)
|
||||
links_per_article_max: int = Field(
|
||||
default=4,
|
||||
ge=0,
|
||||
description="Maximum links to other articles"
|
||||
)
|
||||
include_home_link: bool = Field(
|
||||
default=True,
|
||||
description="Include link to home page"
|
||||
)
|
||||
|
||||
@field_validator('links_per_article_max')
|
||||
@classmethod
|
||||
def validate_max_greater_than_min(cls, v, info):
|
||||
if 'links_per_article_min' in info.data and v < info.data['links_per_article_min']:
|
||||
raise ValueError("links_per_article_max must be >= links_per_article_min")
|
||||
return v
|
||||
|
||||
|
||||
class JobConfig(BaseModel):
|
||||
"""Complete job configuration"""
|
||||
job_name: str = Field(..., description="Descriptive name for the job")
|
||||
project_id: int = Field(..., ge=1, description="Project ID to use for all tiers")
|
||||
description: Optional[str] = Field(None, description="Optional job description")
|
||||
tiers: List[TierConfig] = Field(..., min_length=1, description="Tier configurations")
|
||||
failure_config: FailureConfig = Field(
|
||||
default_factory=FailureConfig,
|
||||
description="Failure handling configuration"
|
||||
)
|
||||
interlinking: InterlinkingConfig = Field(
|
||||
default_factory=InterlinkingConfig,
|
||||
description="Interlinking configuration"
|
||||
)
|
||||
|
||||
@field_validator('tiers')
|
||||
@classmethod
|
||||
def validate_unique_tiers(cls, v):
|
||||
tier_numbers = [tier.tier for tier in v]
|
||||
if len(tier_numbers) != len(set(tier_numbers)):
|
||||
raise ValueError("Tier numbers must be unique")
|
||||
return v
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, file_path: str) -> 'JobConfig':
|
||||
"""
|
||||
Load job configuration from JSON file
|
||||
|
||||
Args:
|
||||
file_path: Path to the JSON file
|
||||
|
||||
Returns:
|
||||
JobConfig instance
|
||||
|
||||
Raises:
|
||||
FileNotFoundError: If file doesn't exist
|
||||
ValueError: If JSON is invalid or validation fails
|
||||
"""
|
||||
path = Path(file_path)
|
||||
if not path.exists():
|
||||
raise FileNotFoundError(f"Job configuration file not found: {file_path}")
|
||||
|
||||
try:
|
||||
with open(path, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
return cls(**data)
|
||||
except json.JSONDecodeError as e:
|
||||
raise ValueError(f"Invalid JSON in {file_path}: {e}")
|
||||
except Exception as e:
|
||||
raise ValueError(f"Failed to parse job configuration: {e}")
|
||||
|
||||
def to_file(self, file_path: str) -> None:
|
||||
"""
|
||||
Save job configuration to JSON file
|
||||
|
||||
Args:
|
||||
file_path: Path to save the JSON file
|
||||
"""
|
||||
path = Path(file_path)
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
with open(path, 'w', encoding='utf-8') as f:
|
||||
json.dump(self.model_dump(), f, indent=2)
|
||||
|
||||
def get_total_articles(self) -> int:
|
||||
"""Get total number of articles across all tiers"""
|
||||
return sum(tier.article_count for tier in self.tiers)
|
||||
|
||||
|
||||
class JobResult(BaseModel):
|
||||
"""Result of a job execution"""
|
||||
job_name: str
|
||||
@dataclass
|
||||
class Job:
|
||||
"""Job definition for content generation"""
|
||||
project_id: int
|
||||
total_articles: int
|
||||
successful: int
|
||||
failed: int
|
||||
skipped: int
|
||||
tier_results: Dict[int, Dict[str, int]] = Field(default_factory=dict)
|
||||
errors: List[str] = Field(default_factory=list)
|
||||
duration: float = 0.0
|
||||
tiers: Dict[str, TierConfig]
|
||||
|
||||
def add_tier_result(self, tier: int, status: str) -> None:
|
||||
"""Track result for a tier"""
|
||||
if tier not in self.tier_results:
|
||||
self.tier_results[tier] = {"successful": 0, "failed": 0, "skipped": 0}
|
||||
|
||||
if status in self.tier_results[tier]:
|
||||
self.tier_results[tier][status] += 1
|
||||
class JobConfig:
|
||||
"""Parser for job configuration files"""
|
||||
|
||||
def add_error(self, error: str) -> None:
|
||||
"""Add an error message"""
|
||||
self.errors.append(error)
|
||||
def __init__(self, job_file_path: str):
|
||||
"""
|
||||
Load and parse job file, apply defaults
|
||||
|
||||
def to_summary(self) -> str:
|
||||
"""Generate a human-readable summary"""
|
||||
lines = [
|
||||
f"Job: {self.job_name}",
|
||||
f"Project ID: {self.project_id}",
|
||||
f"Duration: {self.duration:.2f}s",
|
||||
f"",
|
||||
f"Results:",
|
||||
f" Total Articles: {self.total_articles}",
|
||||
f" Successful: {self.successful}",
|
||||
f" Failed: {self.failed}",
|
||||
f" Skipped: {self.skipped}",
|
||||
f"",
|
||||
f"By Tier:"
|
||||
]
|
||||
Args:
|
||||
job_file_path: Path to JSON job file
|
||||
"""
|
||||
self.job_file_path = Path(job_file_path)
|
||||
self.jobs: list[Job] = []
|
||||
self._load()
|
||||
|
||||
for tier, results in sorted(self.tier_results.items()):
|
||||
lines.append(f" Tier {tier}:")
|
||||
lines.append(f" Successful: {results['successful']}")
|
||||
lines.append(f" Failed: {results['failed']}")
|
||||
lines.append(f" Skipped: {results['skipped']}")
|
||||
def _load(self):
|
||||
"""Load and parse the job file"""
|
||||
if not self.job_file_path.exists():
|
||||
raise FileNotFoundError(f"Job file not found: {self.job_file_path}")
|
||||
|
||||
if self.errors:
|
||||
lines.append("")
|
||||
lines.append(f"Errors ({len(self.errors)}):")
|
||||
for error in self.errors[:10]:
|
||||
lines.append(f" - {error}")
|
||||
if len(self.errors) > 10:
|
||||
lines.append(f" ... and {len(self.errors) - 10} more")
|
||||
with open(self.job_file_path, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
|
||||
return "\n".join(lines)
|
||||
if "jobs" not in data:
|
||||
raise ValueError("Job file must contain 'jobs' array")
|
||||
|
||||
for job_data in data["jobs"]:
|
||||
self._validate_job(job_data)
|
||||
job = self._parse_job(job_data)
|
||||
self.jobs.append(job)
|
||||
|
||||
def _validate_job(self, job_data: dict):
|
||||
"""Validate job structure"""
|
||||
if "project_id" not in job_data:
|
||||
raise ValueError("Job missing 'project_id'")
|
||||
|
||||
if "tiers" not in job_data:
|
||||
raise ValueError("Job missing 'tiers'")
|
||||
|
||||
if not isinstance(job_data["tiers"], dict):
|
||||
raise ValueError("'tiers' must be a dictionary")
|
||||
|
||||
def _parse_job(self, job_data: dict) -> Job:
|
||||
"""Parse a single job"""
|
||||
project_id = job_data["project_id"]
|
||||
tiers = {}
|
||||
|
||||
for tier_name, tier_data in job_data["tiers"].items():
|
||||
tier_config = self._parse_tier(tier_name, tier_data)
|
||||
tiers[tier_name] = tier_config
|
||||
|
||||
return Job(project_id=project_id, tiers=tiers)
|
||||
|
||||
def _parse_tier(self, tier_name: str, tier_data: dict) -> TierConfig:
|
||||
"""Parse tier configuration with defaults"""
|
||||
defaults = TIER_DEFAULTS.get(tier_name, TIER_DEFAULTS["tier3"])
|
||||
|
||||
return TierConfig(
|
||||
count=tier_data.get("count", 1),
|
||||
min_word_count=tier_data.get("min_word_count", defaults["min_word_count"]),
|
||||
max_word_count=tier_data.get("max_word_count", defaults["max_word_count"]),
|
||||
min_h2_tags=tier_data.get("min_h2_tags", defaults["min_h2_tags"]),
|
||||
max_h2_tags=tier_data.get("max_h2_tags", defaults["max_h2_tags"]),
|
||||
min_h3_tags=tier_data.get("min_h3_tags", defaults["min_h3_tags"]),
|
||||
max_h3_tags=tier_data.get("max_h3_tags", defaults["max_h3_tags"])
|
||||
)
|
||||
|
||||
def get_jobs(self) -> list[Job]:
|
||||
"""Return list of all jobs in file"""
|
||||
return self.jobs
|
||||
|
||||
def get_tier_config(self, job: Job, tier_name: str) -> Optional[TierConfig]:
|
||||
"""Get tier config with defaults applied"""
|
||||
return job.tiers.get(tier_name)
|
||||
|
|
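The default handling in `_parse_tier` above can be tried outside the class. The following is a sketch under stated assumptions: `parse_tier` is a hypothetical free-function stand-in for the method, and `TIER_DEFAULTS` is trimmed to two of the three entries shown in the diff. It demonstrates the two behaviours visible in the hunk: per-key overrides from the job file, and fallback to the `tier3` defaults for an unknown tier name.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Trimmed copy of two entries from the defaults table in the diff.
TIER_DEFAULTS: Dict[str, Dict[str, int]] = {
    "tier1": {
        "min_word_count": 2000, "max_word_count": 2500,
        "min_h2_tags": 3, "max_h2_tags": 5,
        "min_h3_tags": 5, "max_h3_tags": 10,
    },
    "tier3": {
        "min_word_count": 1000, "max_word_count": 1500,
        "min_h2_tags": 2, "max_h2_tags": 3,
        "min_h3_tags": 2, "max_h3_tags": 6,
    },
}


@dataclass
class TierConfig:
    count: int
    min_word_count: int
    max_word_count: int
    min_h2_tags: int
    max_h2_tags: int
    min_h3_tags: int
    max_h3_tags: int


def parse_tier(tier_name: str, tier_data: Dict[str, Any]) -> TierConfig:
    """Hypothetical stand-in for JobConfig._parse_tier."""
    # Unknown tier names fall back to the tier3 defaults, as in the diff.
    defaults = TIER_DEFAULTS.get(tier_name, TIER_DEFAULTS["tier3"])
    # Keys present in the job file win; everything else comes from defaults.
    overrides = {k: v for k, v in tier_data.items() if k in defaults}
    return TierConfig(count=tier_data.get("count", 1), **{**defaults, **overrides})


tier = parse_tier("tier1", {"count": 5, "max_word_count": 3000})
print(tier.min_word_count, tier.max_word_count)
```

Note that neither the sketch nor the code in the diff checks that a minimum stays below its maximum after overrides are applied.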
@ -1,9 +1,5 @@
|
|||
{
|
||||
"system": "You are a content enhancement specialist who adds natural, relevant paragraphs to articles to meet optimization targets.",
|
||||
"user_template": "Add new paragraph(s) to the following article to address these missing elements:\n\nCurrent Article:\n{current_content}\n\nWhat's Missing:\n{missing_elements}\n\nMain Keyword: {main_keyword}\nEntities to use: {target_entities}\nRelated Searches to reference: {target_searches}\nTarget Word Count for New Content: {target_word_count} words\n\nInstructions:\n1. Write {target_word_count} words of new content (1-3 paragraphs as needed)\n2. Naturally incorporate the missing keywords/entities/searches\n3. Make it relevant to the article topic\n4. Use a professional, engaging tone\n5. Don't directly repeat information already in the article\n6. The paragraphs should feel like natural additions\n7. IMPORTANT: Write at least {target_word_count} words to ensure we meet the target\n\nSuggested placement: {suggested_placement}\n\nRespond with ONLY the new paragraph(s) in HTML format:\n<p>First paragraph here...</p>\n<p>Second paragraph here...</p>\n\nDo not include the entire article, just the new paragraph(s) to insert.",
|
||||
"validation": {
|
||||
"output_format": "html"
|
||||
|
||||
}
|
||||
"system_message": "You are an expert content editor who expands articles by adding depth, detail, and additional relevant information while maintaining topical focus and quality.",
|
||||
"user_prompt": "Please expand on the following article to add more detail and depth, ensuring you maintain the existing topical focus. Target word count: {target_word_count} words.\n\nCurrent article:\n{content}\n\nReturn the expanded article as an HTML fragment with the same structure (using <h2>, <h3>, <p> tags). You can add new paragraphs, expand existing ones, or add new subsections as needed. Do NOT change the existing headings unless necessary."
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,12 +1,5 @@
|
|||
{
|
||||
"system": "You are an creative content writer who creates comprehensive, engaging articles that strictly follow the provided outline and meet all CORA optimization requirements.",
|
||||
"user_template": "Write a complete, SEO-optimized article following this outline:\n\n{outline}\n\nArticle Details:\n- Title: {title}\n- Main Keyword: {main_keyword}\n- Target Token Count: {word_count}\n- Keyword Frequency Target: {term_frequency}% mentions\n\nEntities to incorporate: {entities}\nRelated Searches to reference: {related_searches}\n\nCritical Requirements:\n1. Follow the outline structure EXACTLY - use the provided H2 and H3 headings word-for-word\n2. Do NOT add numbering, Roman numerals, or letters to the headings\n3. The article must be {word_count} tokens long (±100 tokens)\n4. Mention the main keyword \"{main_keyword}\" naturally {term_frequency}% times throughout\n5. Write 2-3 substantial paragraphs under each heading. Reference industry standards, regulations, or best practices. Use relevant LSI and entities for the topic\n6. For the FAQ section:\n - Each FAQ answer MUST begin by restating the question\n - Provide detailed, helpful answers (100-150 words each)\n7. Incorporate entities and related searches naturally throughout\n8. Write in a professional, engaging tone. Use active voice for 80% of sentences\n9. Make content informative and valuable to readers. Use technical terminology appropriate for industry professionals.\n10. Use varied sentence structures and vocabulary.\n11. STRICTLY PROHIBITED: Filler phrases: 'it is important to note', as mentioned earlier', 'in conclusion' - Marketing language: 'revolutionary', 'game-changing', 'industry-leading', 'best-in-class' - Generic openings: 'In today's world', 'As we all know', 'It goes without saying' \n\nFormatting Requirements:\n- Use <h1> for the main title\n- Use <h2> for major sections\n- Use <h3> for subsections\n- Use <p> for paragraphs\n- Use <ul> and <li> for lists where appropriate\n- Do NOT include any CSS, <html>, <head>, or <body> tags\n- Return ONLY the article content HTML\n\nExample structure:\n<h1>Main Title</h1>\n<p>Introduction paragraph...</p>\n\n<h2>First Section</h2>\n<p>Content...</p>\n\n<h3>Subsection</h3>\n<p>More content...</p>\n\nWrite the complete article now.",
|
||||
"validation": {
|
||||
"output_format": "html",
|
||||
"min_word_count": true,
|
||||
"max_word_count": true,
|
||||
"keyword_frequency_target": true,
|
||||
"outline_structure_match": true
|
||||
}
|
||||
"system_message": "You are an expert content writer who creates engaging, informative, and SEO-optimized articles that provide real value to readers while incorporating relevant keywords naturally.",
|
||||
"user_prompt": "Write a complete article based on:\nTitle: {title}\nOutline: {outline}\nKeyword: {keyword}\n\nEntities to include naturally: {entities}\nRelated searches to address: {related_searches}\n\nTarget word count range: {min_word_count} to {max_word_count} words\n\nReturn as an HTML fragment with <h2>, <h3>, and <p> tags. Do NOT include <!DOCTYPE>, <html>, <head>, or <body> tags. Start directly with the first <h2> heading.\n\nWrite naturally and informatively. Incorporate the keyword, entities, and related searches organically throughout the content."
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,11 +1,5 @@
|
|||
{
|
||||
"system": "You are an expert content strategist who creates compelling, specific article titles that provide clear direction for content creation. You also strive to meet strict CORA optimization targets.",
|
||||
"user_template": "Create a detailed article outline for the following:\n\nTitle: {title}\nMain Keyword: {main_keyword}\nTarget Word Count: {word_count}\n\nCORA Targets:\n- H2 headings needed: {h2_total}\n- H2s with main keyword: {h2_exact}\n- H2s with related searches: {h2_related_search}\n- H2s with entities: {h2_entities}\n- H3 headings needed: {h3_total}\n- H3s with main keyword: {h3_exact}\n- H3s with related searches: {h3_related_search}\n- H3s with entities: {h3_entities}\n\nAvailable Entities: {entities}\nRelated Searches: {related_searches}\n\nThe title provided above will serve as the H1 heading for this article. Focus on creating the H2 and H3 structure that supports this title.\n\nRequirements:\n1. Create exactly {h2_total} H2 headings\n2. Create exactly {h3_total} H3 headings (distributed under H2s)\n3. At least {h2_exact} H2s must contain the exact keyword \"{main_keyword}\"\n4. The FIRST H2 should contain the main keyword\n5. Incorporate entities and related searches naturally into headings\n6. Include a \"Frequently Asked Questions\" H2 section with at least 3 H3 questions\n7. Each H3 question should be a complete question ending with ?\n8. Structure should flow logically\nCreate headings that build logically toward actionable insights\n9. Use specific, searchable language over generic terms\n 9. Include sub-topic hints in parentheses where helpful \n 10. Focus on reader problems and solutions.\n 11. FORBIDDEN ELEMENTS: Future-tense speculation ('The Future of...', 'Upcoming Trends') - Generic business-speak ('in today's competitive landscape', 'cutting-edge solutions') - Vague qualifiers ('best practices', 'industry-leading', 'world-class') \n\nIMPORTANT FORMATTING RULES:\n- Do NOT include numbering (1., 2., 3.)\n- Do NOT include Roman numerals (I., II., III.)\n- Do NOT include letters (A., B., C.)\n- Do NOT include any outline-style prefixes\n- Return clean heading text only\n\nWRONG: \"I. Introduction to {main_keyword}\"\nWRONG: \"1. Getting Started with {main_keyword}\"\nRIGHT: \"Introduction to {main_keyword}\"\nRIGHT: \"Getting Started with {main_keyword}\"\n\nRespond ONLY with valid JSON in this exact format (no additional text, explanations, or commentary):\n{{\n \"sections\": [\n {{\n \"h2\": \"H2 heading text\",\n \"h3s\": [\"H3 heading 1\", \"H3 heading 2\"]\n }}\n ]\n}}\n\nReturn ONLY the JSON object. Do not include any text before or after the JSON.",
|
||||
"validation": {
|
||||
"output_format": "json",
|
||||
"required_fields": ["sections"],
|
||||
"h2_count_must_match": true,
|
||||
"h3_count_must_match": true
|
||||
}
|
||||
"system_message": "You are an expert content outliner who creates well-structured, comprehensive article outlines that cover topics thoroughly and logically.",
|
||||
"user_prompt": "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n\nConstraints:\n- Between {min_h2} and {max_h2} H2 headings\n- Between {min_h3} and {max_h3} H3 subheadings total (distributed across H2 sections)\n\nEntities to incorporate: {entities}\nRelated searches to address: {related_searches}\n\nReturn ONLY valid JSON in this exact format:\n{{\"outline\": [{{\"h2\": \"Heading text\", \"h3\": [\"Subheading 1\", \"Subheading 2\"]}}, ...]}}\n\nEnsure the outline meets the minimum heading requirements and includes relevant entities and related searches."
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
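The doubled braces (`{{` and `}}`) in the outline prompt above suggest the templates are filled with Python's `str.format`, where doubled braces render as literal ones. That is an inference from the template text, not something this diff confirms. The sketch below assumes it, using a shortened stand-in for the new `user_prompt`. `render_prompt` and `outline_within_limits` are hypothetical helpers that show how a model response in the requested `{"outline": [...]}` shape could be checked against the tier's H2/H3 limits.

```python
import json
from typing import Any

# Shortened stand-in for the new outline "user_prompt". Doubled braces
# become literal braces after str.format().
USER_PROMPT = (
    "Create an article outline for:\nTitle: {title}\nKeyword: {keyword}\n"
    "- Between {min_h2} and {max_h2} H2 headings\n"
    "Return ONLY valid JSON in this exact format:\n"
    '{{"outline": [{{"h2": "Heading text", "h3": ["Subheading 1"]}}]}}'
)


def render_prompt(template: str, **values: Any) -> str:
    """Fill the named placeholders; raises KeyError if one is missing."""
    return template.format(**values)


def outline_within_limits(
    raw: str, min_h2: int, max_h2: int, min_h3: int, max_h3: int
) -> bool:
    """Return True if raw is JSON in the requested shape and the
    heading counts fall inside the given ranges."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    sections = data.get("outline") if isinstance(data, dict) else None
    if not isinstance(sections, list):
        return False
    if not all(isinstance(s, dict) for s in sections):
        return False
    h2_count = len(sections)
    h3_count = sum(len(s.get("h3", [])) for s in sections)
    return min_h2 <= h2_count <= max_h2 and min_h3 <= h3_count <= max_h3


prompt = render_prompt(USER_PROMPT, title="T", keyword="k", min_h2=2, max_h2=3)
print(prompt.splitlines()[-1])
```

One consequence of this escaping scheme, if `str.format` is indeed what renders the templates: any literal brace added to a prompt later must also be doubled, or rendering will fail.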
@ -1,10 +1,5 @@
|
|||
{
|
||||
"system": "You are an expert content strategist who creates compelling, specific article titles that provide clear direction for content creation.",
|
||||
"user_template": "Generate an unique, compelling article title for the broad topic: \"{main_keyword}\".\n\nContext:\n- Main Keyword: {main_keyword}\n- - Top Entities: {entities}\n- Related Searches: {related_searches}\n\nRequirements:\n1. The title MUST contain the exact main keyword: \"{main_keyword}\"\n2. The title should be compelling and click-worthy\n3. Each title must be specific enough that an AI could create substantial, focused content outline from the title alone\n4.Titles should be creative yet professionally relevant to: {{subject}}. It does not have to be directly related but must be at least tangentially related.\n5. Consider incorporating 1-2 related entities or searches if natural\n6. Mix formats: how-to guides (25%), case studies (10%), expert analyses (20%), comparison pieces (15%), trend analyses (10%), problem-solving articles (10%), listicles(10%)\nAvoid generic business jargon and AI slop (cutting-edge,game-changing, revolutionary)\n7- Use domain-specific terminology appropriate for an article about {main_keyword}\n 8-Include specific, actionable language that suggests clear content direction\n\nRespond with ONLY the title text, no quotes or additional formatting.\n\nExample format: \"Complete Guide to {main_keyword}: Tips and Best Practices\"",
|
||||
"validation": {
|
||||
"must_contain_keyword": true,
|
||||
"min_length": 30,
|
||||
"max_length": 120
|
||||
}
|
||||
"system_message": "You are an expert SEO content writer who creates compelling, search-optimized titles that attract clicks while accurately representing the content topic.",
|
||||
"user_prompt": "Generate an SEO-optimized title for an article about: {keyword}\n\nRelated entities: {entities}\n\nRelated searches: {related_searches}\n\nReturn only the title text, no formatting or quotes."
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,337 +1,3 @@
|
|||
"""
|
||||
Content validation rule engine for CORA-compliant HTML generation
|
||||
"""
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Dict, List, Optional, Any
|
||||
from html.parser import HTMLParser
|
||||
import re
|
||||
from src.core.config import Config
|
||||
from src.database.models import Project
|
||||
|
||||
|
||||
@dataclass
|
||||
class ValidationIssue:
|
||||
"""Single validation issue (error or warning)"""
|
||||
rule_name: str
|
||||
severity: str
|
||||
message: str
|
||||
expected: Optional[Any] = None
|
||||
actual: Optional[Any] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class ValidationResult:
|
||||
"""Result of content validation"""
|
||||
passed: bool
|
||||
errors: List[ValidationIssue] = field(default_factory=list)
|
||||
warnings: List[ValidationIssue] = field(default_factory=list)
|
||||
|
||||
def add_error(self, rule_name: str, message: str, expected: Any = None, actual: Any = None):
|
||||
self.errors.append(ValidationIssue(rule_name, "error", message, expected, actual))
|
||||
self.passed = False
|
||||
|
||||
def add_warning(self, rule_name: str, message: str, expected: Any = None, actual: Any = None):
|
||||
self.warnings.append(ValidationIssue(rule_name, "warning", message, expected, actual))
|
||||
|
||||
def to_dict(self) -> Dict:
|
||||
return {
|
||||
"passed": self.passed,
|
||||
"errors": [
|
||||
{
|
||||
"rule": e.rule_name,
|
||||
"severity": e.severity,
|
||||
"message": e.message,
|
||||
"expected": e.expected,
|
||||
"actual": e.actual
|
||||
} for e in self.errors
|
||||
],
|
||||
"warnings": [
|
||||
{
|
||||
"rule": w.rule_name,
|
||||
"severity": w.severity,
|
||||
"message": w.message,
|
||||
"expected": w.expected,
|
||||
"actual": w.actual
|
||||
} for w in self.warnings
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
class ContentHTMLParser(HTMLParser):
|
||||
"""HTML parser to extract structure and content for validation"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.title: Optional[str] = None
|
||||
self.meta_description: Optional[str] = None
|
||||
self.h1_tags: List[str] = []
|
||||
self.h2_tags: List[str] = []
|
||||
self.h3_tags: List[str] = []
|
||||
self.images: List[Dict[str, str]] = []
|
||||
self.links: List[Dict[str, str]] = []
|
||||
self.text_content: str = ""
|
||||
|
||||
self._current_tag: Optional[str] = None
|
||||
self._current_data: List[str] = []
|
||||
self._in_title = False
|
||||
self._in_h1 = False
|
||||
self._in_h2 = False
|
||||
self._in_h3 = False
|
||||
|
||||
def handle_starttag(self, tag: str, attrs: List[tuple]):
|
||||
self._current_tag = tag
|
||||
attrs_dict = dict(attrs)
|
||||
|
||||
if tag == "title":
|
||||
self._in_title = True
|
||||
self._current_data = []
|
||||
elif tag == "meta" and attrs_dict.get("name") == "description":
|
||||
self.meta_description = attrs_dict.get("content", "")
|
||||
elif tag == "h1":
|
||||
self._in_h1 = True
|
||||
self._current_data = []
|
||||
elif tag == "h2":
|
||||
self._in_h2 = True
|
||||
self._current_data = []
|
||||
elif tag == "h3":
|
||||
self._in_h3 = True
|
||||
self._current_data = []
|
||||
elif tag == "img":
|
||||
self.images.append({
|
||||
"src": attrs_dict.get("src", ""),
|
||||
"alt": attrs_dict.get("alt", "")
|
||||
})
|
||||
elif tag == "a":
|
||||
self.links.append({
|
||||
"href": attrs_dict.get("href", ""),
|
||||
"text": ""
|
||||
})
|
||||
|
||||
def handle_endtag(self, tag: str):
|
||||
if tag == "title" and self._in_title:
|
||||
self.title = "".join(self._current_data).strip()
|
||||
self._in_title = False
|
||||
elif tag == "h1" and self._in_h1:
|
||||
self.h1_tags.append("".join(self._current_data).strip())
|
||||
self._in_h1 = False
|
||||
elif tag == "h2" and self._in_h2:
|
||||
self.h2_tags.append("".join(self._current_data).strip())
|
||||
self._in_h2 = False
|
||||
elif tag == "h3" and self._in_h3:
|
||||
self.h3_tags.append("".join(self._current_data).strip())
|
||||
self._in_h3 = False
|
||||
|
||||
self._current_tag = None
|
||||
|
||||
def handle_data(self, data: str):
|
||||
if self._in_title or self._in_h1 or self._in_h2 or self._in_h3:
|
||||
self._current_data.append(data)
|
||||
|
||||
if self._current_tag == "a" and self.links:
|
||||
self.links[-1]["text"] += data
|
||||
|
||||
if self._current_tag not in ["script", "style", "head"]:
|
||||
self.text_content += data
|
||||
|
||||
|
||||
class ContentRuleEngine:
|
||||
"""Validates HTML content against universal rules and CORA targets"""
|
||||
|
||||
def __init__(self, config: Config):
|
||||
self.config = config
|
||||
self.universal_rules = config.get("content_rules.universal", {})
|
||||
self.cora_config = config.get("content_rules.cora_validation", {})
|
||||
|
||||
def validate(self, html_content: str, project: Project) -> ValidationResult:
|
||||
"""
|
||||
Validate HTML content against all rules
|
||||
|
||||
Args:
|
||||
html_content: Generated HTML content
|
||||
project: Project with CORA targets
|
||||
|
||||
Returns:
|
||||
ValidationResult with errors and warnings
|
||||
"""
|
||||
result = ValidationResult(passed=True)
|
||||
|
||||
parser = ContentHTMLParser()
|
||||
parser.feed(html_content)
|
||||
|
||||
self._validate_universal_rules(parser, project, result)
|
||||
|
||||
if self.cora_config.get("enabled", True):
|
||||
self._validate_cora_targets(parser, project, result)
|
||||
|
||||
return result
|
||||
|
||||
def _validate_universal_rules(self, parser: ContentHTMLParser, project: Project, result: ValidationResult):
|
||||
"""Validate universal hard rules that apply to all content"""
|
||||
|
||||
word_count = len(parser.text_content.split())
|
||||
min_length = self.universal_rules.get("min_content_length", 0)
|
||||
max_length = self.universal_rules.get("max_content_length", float('inf'))
|
||||
|
||||
if word_count < min_length:
|
||||
result.add_error(
|
||||
"min_content_length",
|
||||
f"Content is too short",
|
||||
expected=f">={min_length} words",
|
||||
actual=f"{word_count} words"
|
||||
)
|
||||
|
||||
if word_count > max_length:
|
||||
result.add_error(
|
||||
"max_content_length",
|
||||
f"Content is too long",
|
||||
expected=f"<={max_length} words",
|
||||
actual=f"{word_count} words"
|
||||
)
|
||||
|
||||
if self.universal_rules.get("title_exact_match_required", False):
|
||||
if not parser.title or not self._contains_keyword(parser.title, project.main_keyword):
|
||||
result.add_error(
|
||||
"title_exact_match_required",
|
||||
"Title must contain main keyword",
|
||||
expected=project.main_keyword,
|
||||
actual=parser.title or "(no title)"
|
||||
)
|
||||
|
||||
if self.universal_rules.get("h1_exact_match_required", False):
|
||||
if not parser.h1_tags or not any(self._contains_keyword(h1, project.main_keyword) for h1 in parser.h1_tags):
|
||||
result.add_error(
|
||||
"h1_exact_match_required",
|
||||
"At least one H1 must contain main keyword",
|
||||
expected=project.main_keyword,
|
||||
actual=parser.h1_tags
|
||||
)
|
||||
|
||||
h2_min = self.universal_rules.get("h2_exact_match_min", 0)
|
||||
h2_with_keyword = sum(1 for h2 in parser.h2_tags if self._contains_keyword(h2, project.main_keyword))
|
||||
if h2_with_keyword < h2_min:
|
||||
result.add_error(
|
||||
"h2_exact_match_min",
|
||||
f"Not enough H2 tags with main keyword",
|
||||
expected=f">={h2_min}",
|
||||
actual=h2_with_keyword
|
||||
)
|
||||
|
||||
h3_min = self.universal_rules.get("h3_exact_match_min", 0)
|
||||
h3_with_keyword = sum(1 for h3 in parser.h3_tags if self._contains_keyword(h3, project.main_keyword))
|
||||
if h3_with_keyword < h3_min:
|
||||
result.add_error(
|
||||
"h3_exact_match_min",
|
||||
f"Not enough H3 tags with main keyword",
|
||||
expected=f">={h3_min}",
|
||||
actual=h3_with_keyword
|
||||
)
|
||||
|
||||
if self.universal_rules.get("faq_section_required", False):
|
||||
if not self._has_faq_section(parser.h2_tags, parser.h3_tags):
|
||||
result.add_error(
|
||||
"faq_section_required",
|
||||
"Content must include an FAQ section"
|
||||
)
|
||||
|
||||
if self.universal_rules.get("image_alt_text_keyword_required", False):
|
||||
for img in parser.images:
|
||||
if not self._contains_keyword(img.get("alt", ""), project.main_keyword):
|
||||
result.add_error(
|
||||
"image_alt_text_keyword_required",
|
||||
f"Image alt text missing main keyword",
|
||||
expected=project.main_keyword,
|
||||
actual=img.get("alt", "(no alt)")
|
||||
)
|
||||
|
||||
if self.universal_rules.get("image_alt_text_entity_required", False) and project.entities:
|
||||
for img in parser.images:
|
||||
alt_text = img.get("alt", "")
|
||||
has_entity = any(self._contains_keyword(alt_text, entity) for entity in project.entities)
|
||||
if not has_entity:
|
||||
result.add_error(
|
||||
"image_alt_text_entity_required",
|
||||
f"Image alt text missing entities",
|
||||
expected=f"One of: {project.entities[:3]}",
|
||||
actual=alt_text or "(no alt)"
|
||||
)
|
||||
|
||||
def _validate_cora_targets(self, parser: ContentHTMLParser, project: Project, result: ValidationResult):
|
||||
"""Validate content against CORA-specific targets"""
|
||||
|
||||
is_tier_1 = project.tier == 1
|
||||
round_down = self.cora_config.get("round_averages_down", True)
|
||||
|
||||
counts = self._count_keyword_entities(parser, project)
|
||||
|
||||
checks = [
|
||||
("h1_exact", counts["h1_exact"], project.h1_exact, "H1 tags with exact keyword match"),
|
||||
("h1_related_search", counts["h1_related_search"], project.h1_related_search, "H1 tags with related searches"),
|
||||
("h1_entities", counts["h1_entities"], project.h1_entities, "H1 tags with entities"),
|
||||
("h2_total", len(parser.h2_tags), project.h2_total, "Total H2 tags"),
|
||||
("h2_exact", counts["h2_exact"], project.h2_exact, "H2 tags with exact keyword match"),
|
||||
("h2_related_search", counts["h2_related_search"], project.h2_related_search, "H2 tags with related searches"),
|
||||
("h2_entities", counts["h2_entities"], project.h2_entities, "H2 tags with entities"),
|
||||
("h3_total", len(parser.h3_tags), project.h3_total, "Total H3 tags"),
|
||||
("h3_exact", counts["h3_exact"], project.h3_exact, "H3 tags with exact keyword match"),
|
||||
("h3_related_search", counts["h3_related_search"], project.h3_related_search, "H3 tags with related searches"),
|
||||
("h3_entities", counts["h3_entities"], project.h3_entities, "H3 tags with entities"),
|
||||
]
|
||||
|
||||
for rule_name, actual, target, description in checks:
|
||||
if target is None:
|
||||
continue
|
||||
|
||||
expected = int(target) if round_down else round(target)
|
||||
|
||||
if actual < expected:
|
||||
message = f"{description} below CORA target"
|
||||
if is_tier_1:
|
||||
result.add_error(rule_name, message, expected=expected, actual=actual)
|
||||
else:
|
||||
result.add_warning(rule_name, message, expected=expected, actual=actual)
|
||||
|
||||
def _count_keyword_entities(self, parser: ContentHTMLParser, project: Project) -> Dict[str, int]:
|
||||
"""Count occurrences of keywords, entities, and related searches in headings"""
|
||||
|
||||
entities = project.entities or []
|
||||
related_searches = project.related_searches or []
|
||||
|
||||
return {
|
||||
"h1_exact": sum(1 for h1 in parser.h1_tags if self._contains_keyword(h1, project.main_keyword)),
|
||||
"h1_related_search": sum(1 for h1 in parser.h1_tags if self._contains_any(h1, related_searches)),
|
||||
"h1_entities": sum(1 for h1 in parser.h1_tags if self._contains_any(h1, entities)),
|
||||
"h2_exact": sum(1 for h2 in parser.h2_tags if self._contains_keyword(h2, project.main_keyword)),
|
||||
"h2_related_search": sum(1 for h2 in parser.h2_tags if self._contains_any(h2, related_searches)),
|
||||
"h2_entities": sum(1 for h2 in parser.h2_tags if self._contains_any(h2, entities)),
|
||||
"h3_exact": sum(1 for h3 in parser.h3_tags if self._contains_keyword(h3, project.main_keyword)),
|
||||
"h3_related_search": sum(1 for h3 in parser.h3_tags if self._contains_any(h3, related_searches)),
|
||||
"h3_entities": sum(1 for h3 in parser.h3_tags if self._contains_any(h3, entities)),
|
||||
}
|
||||
|
||||
def _contains_keyword(self, text: str, keyword: str) -> bool:
|
||||
"""Check if text contains keyword (case-insensitive, word boundary)"""
|
||||
if not text or not keyword:
|
||||
return False
|
||||
pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
|
||||
return bool(re.search(pattern, text.lower()))
|
||||
|
||||
def _contains_any(self, text: str, terms: List[str]) -> bool:
|
||||
"""Check if text contains any of the terms"""
|
||||
if not text or not terms:
|
||||
return False
|
||||
return any(self._contains_keyword(text, term) for term in terms)
|
||||
|
||||
def _has_faq_section(self, h2_tags: List[str], h3_tags: List[str]) -> bool:
|
||||
"""Check if content has an FAQ section"""
|
||||
faq_patterns = [r'\bfaq\b', r'\bfrequently asked questions\b', r'\bq&a\b', r'\bquestions\b']
|
||||
|
||||
for h2 in h2_tags:
|
||||
if any(re.search(pattern, h2.lower()) for pattern in faq_patterns):
|
||||
return True
|
||||
|
||||
for h3 in h3_tags:
|
||||
if any(re.search(pattern, h3.lower()) for pattern in faq_patterns):
|
||||
return True
|
||||
|
||||
return False
|
||||
# Content validation rules
|
||||
# DEPRECATED: This module has been replaced by the simplified generation pipeline in service.py
|
||||
# Kept for reference only.
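
Note on the removed helpers: `_contains_keyword` and `_contains_any` above rely on case-insensitive, word-boundary regex matching. The following is a minimal standalone sketch of that behaviour, not the project's code; the function names without the leading underscore and the sample headings are assumptions made for illustration.

```python
import re
from typing import List


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match."""
    if not text or not keyword:
        return False
    # re.escape keeps characters such as '+' or '.' literal.
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return bool(re.search(pattern, text.lower()))


def contains_any(text: str, terms: List[str]) -> bool:
    """True if any term matches as a whole word or phrase."""
    if not text or not terms:
        return False
    return any(contains_keyword(text, term) for term in terms)


# Hypothetical headings, used only to show how the per-heading counts are built.
headings = ["Best Shaft Machining Tips", "Why Shafts Matter", "FAQ"]
exact_matches = sum(1 for h in headings if contains_keyword(h, "shaft machining"))
```

One consequence of the `\b` anchors worth keeping in mind: a keyword that ends in a non-word character (for example `c++`) cannot match at the end of a string, because there is no word boundary after the final `+`.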
|
||||
|
|
@ -1,388 +1,311 @@
|
|||
"""
|
||||
Content generation service - orchestrates the three-stage AI generation pipeline
|
||||
Content generation service with three-stage pipeline
|
||||
"""
|
||||
|
||||
import time
|
||||
import re
|
||||
import json
|
||||
from html import unescape
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, Optional, Tuple
|
||||
from src.database.models import Project, GeneratedContent
|
||||
from src.database.repositories import GeneratedContentRepository
|
||||
from src.generation.ai_client import AIClient, AIClientError
|
||||
from src.generation.validator import StageValidator
|
||||
from src.generation.augmenter import ContentAugmenter
|
||||
from src.generation.rule_engine import ContentRuleEngine
|
||||
from src.core.config import Config, get_config
|
||||
from sqlalchemy.orm import Session
|
||||
from datetime import datetime
|
||||
from typing import Optional, Tuple
|
||||
from src.generation.ai_client import AIClient, PromptManager
|
||||
from src.database.repositories import ProjectRepository, GeneratedContentRepository
|
||||
|
||||
|
||||
class GenerationError(Exception):
|
||||
"""Content generation error"""
|
||||
pass
|
||||
|
||||
|
||||
class ContentGenerationService:
|
||||
"""Service for AI-powered content generation with validation"""
|
||||
|
||||
MAX_H2_TOTAL = 5
|
||||
MAX_H3_TOTAL = 13
|
||||
class ContentGenerator:
|
||||
"""Main service for generating content through AI pipeline"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
session: Session,
|
||||
config: Optional[Config] = None,
|
||||
ai_client: Optional[AIClient] = None
|
||||
ai_client: AIClient,
|
||||
prompt_manager: PromptManager,
|
||||
project_repo: ProjectRepository,
|
||||
content_repo: GeneratedContentRepository
|
||||
):
|
||||
self.ai_client = ai_client
|
||||
self.prompt_manager = prompt_manager
|
||||
self.project_repo = project_repo
|
||||
self.content_repo = content_repo
|
||||
|
||||
def generate_title(self, project_id: int, debug: bool = False) -> str:
|
||||
"""
|
||||
Initialize service
|
||||
Generate SEO-optimized title
|
||||
|
||||
Args:
|
||||
session: Database session
|
||||
config: Application configuration
|
||||
ai_client: AI client (creates new if None)
|
||||
"""
|
||||
self.session = session
|
||||
self.config = config or get_config()
|
||||
self.ai_client = ai_client or AIClient(self.config)
|
||||
self.content_repo = GeneratedContentRepository(session)
|
||||
self.rule_engine = ContentRuleEngine(self.config)
|
||||
self.validator = StageValidator(self.config, self.rule_engine)
|
||||
self.augmenter = ContentAugmenter(ai_client=self.ai_client)
|
||||
|
||||
self.prompts_dir = Path(__file__).parent / "prompts"
|
||||
|
||||
def generate_article(
|
||||
self,
|
||||
project: Project,
|
||||
tier: int,
|
||||
title_model: str,
|
||||
outline_model: str,
|
||||
content_model: str,
|
||||
max_retries: int = 3,
|
||||
progress_callback: Optional[callable] = None,
|
||||
debug: bool = False
|
||||
) -> GeneratedContent:
|
||||
"""
|
||||
Generate complete article through three-stage pipeline
|
||||
|
||||
Args:
|
||||
project: Project with CORA data
|
||||
tier: Tier level
|
||||
title_model: Model for title generation
|
||||
outline_model: Model for outline generation
|
||||
content_model: Model for content generation
|
||||
max_retries: Max retry attempts per stage
|
||||
progress_callback: Optional callback for progress updates
|
||||
debug: Enable debug output
|
||||
project_id: Project ID to generate title for
|
||||
debug: If True, save response to debug_output/
|
||||
|
||||
Returns:
|
||||
GeneratedContent record with completed article
|
||||
|
||||
Raises:
|
||||
GenerationError: If generation fails after all retries
|
||||
Generated title string
|
||||
"""
|
||||
start_time = time.time()
|
||||
project = self.project_repo.get_by_id(project_id)
|
||||
if not project:
|
||||
raise ValueError(f"Project {project_id} not found")
|
||||
|
||||
content_record = self.content_repo.create(project.id, tier)
|
||||
content_record.title_model = title_model
|
||||
content_record.outline_model = outline_model
|
||||
content_record.content_model = content_model
|
||||
self.content_repo.update(content_record)
|
||||
entities_str = ", ".join(project.entities or [])
|
||||
related_str = ", ".join(project.related_searches or [])
|
||||
|
||||
try:
|
||||
title = self._generate_title(project, content_record, title_model, max_retries)
|
||||
|
||||
content_record.generation_stage = "outline"
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
outline = self._generate_outline(project, title, content_record, outline_model, max_retries)
|
||||
|
||||
content_record.generation_stage = "content"
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
html_content = self._generate_content(
|
||||
project, title, outline, content_record, content_model, max_retries
|
||||
)
|
||||
|
||||
content_record.status = "completed"
|
||||
content_record.generation_duration = time.time() - start_time
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
return content_record
|
||||
|
||||
except Exception as e:
|
||||
content_record.status = "failed"
|
||||
content_record.error_message = str(e)
|
||||
content_record.generation_duration = time.time() - start_time
|
||||
self.content_repo.update(content_record)
|
||||
raise GenerationError(f"Article generation failed: {e}")
|
||||
|
||||
def _generate_title(
|
||||
self,
|
||||
project: Project,
|
||||
content_record: GeneratedContent,
|
||||
model: str,
|
||||
max_retries: int
|
||||
) -> str:
|
||||
"""Generate and validate title"""
|
||||
prompt_template = self._load_prompt("title_generation.json")
|
||||
|
||||
entities_str = ", ".join(project.entities[:10]) if project.entities else "N/A"
|
||||
searches_str = ", ".join(project.related_searches[:10]) if project.related_searches else "N/A"
|
||||
|
||||
prompt = prompt_template["user_template"].format(
|
||||
main_keyword=project.main_keyword,
|
||||
word_count=project.word_count,
|
||||
system_msg, user_prompt = self.prompt_manager.format_prompt(
|
||||
"title_generation",
|
||||
keyword=project.main_keyword,
|
||||
entities=entities_str,
|
||||
related_searches=searches_str
|
||||
related_searches=related_str
|
||||
)
|
||||
|
||||
for attempt in range(1, max_retries + 1):
|
||||
content_record.title_attempts = attempt
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
try:
|
||||
title = self.ai_client.generate(
|
||||
prompt=prompt,
|
||||
model=model,
|
||||
title = self.ai_client.generate_completion(
|
||||
prompt=user_prompt,
|
||||
system_message=system_msg,
|
||||
max_tokens=100,
|
||||
temperature=0.7
|
||||
)
|
||||
|
||||
is_valid, errors = self.validator.validate_title(title, project)
|
||||
title = title.strip().strip('"').strip("'")
|
||||
|
||||
if debug:
|
||||
self._save_debug_output(
|
||||
project_id, "title", title, "txt"
|
||||
)
|
||||
|
||||
if is_valid:
|
||||
content_record.title = title
|
||||
self.content_repo.update(content_record)
|
||||
return title
|
||||
|
||||
if attempt < max_retries:
|
||||
prompt += f"\n\nPrevious attempt failed: {', '.join(errors)}. Please fix these issues."
|
||||
|
||||
except AIClientError as e:
|
||||
if attempt == max_retries:
|
||||
raise GenerationError(f"Title generation failed after {max_retries} attempts: {e}")
|
||||
|
||||
raise GenerationError(f"Title validation failed after {max_retries} attempts")
|
||||
|
||||
def _generate_outline(
|
||||
def generate_outline(
|
||||
self,
|
||||
project: Project,
|
||||
project_id: int,
|
||||
title: str,
|
||||
content_record: GeneratedContent,
|
||||
model: str,
|
||||
max_retries: int
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate and validate outline"""
|
||||
prompt_template = self._load_prompt("outline_generation.json")
|
||||
min_h2: int,
|
||||
max_h2: int,
|
||||
min_h3: int,
|
||||
max_h3: int,
|
||||
debug: bool = False
|
||||
) -> dict:
|
||||
"""
|
||||
Generate article outline in JSON format
|
||||
|
||||
entities_str = ", ".join(project.entities[:20]) if project.entities else "N/A"
|
||||
searches_str = ", ".join(project.related_searches[:20]) if project.related_searches else "N/A"
|
||||
Args:
|
||||
project_id: Project ID
|
||||
title: Article title
|
||||
min_h2: Minimum H2 headings
|
||||
max_h2: Maximum H2 headings
|
||||
min_h3: Minimum H3 subheadings total
|
||||
max_h3: Maximum H3 subheadings total
|
||||
debug: If True, save response to debug_output/
|
||||
|
||||
h2_total = int(project.h2_total) if project.h2_total else 5
|
||||
h2_exact = int(project.h2_exact) if project.h2_exact else 1
|
||||
h2_related = int(project.h2_related_search) if project.h2_related_search else 1
|
||||
h2_entities = int(project.h2_entities) if project.h2_entities else 2
|
||||
Returns:
|
||||
Outline dictionary: {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
|
||||
|
||||
h3_total = int(project.h3_total) if project.h3_total else 10
|
||||
h3_exact = int(project.h3_exact) if project.h3_exact else 1
|
||||
h3_related = int(project.h3_related_search) if project.h3_related_search else 2
|
||||
h3_entities = int(project.h3_entities) if project.h3_entities else 3
|
||||
Raises:
|
||||
ValueError: If outline doesn't meet minimum requirements
|
||||
"""
|
||||
project = self.project_repo.get_by_id(project_id)
|
||||
if not project:
|
||||
raise ValueError(f"Project {project_id} not found")
|
||||
|
||||
if self.config.content_rules.cora_validation.round_averages_down:
|
||||
h2_total = int(h2_total)
|
||||
h3_total = int(h3_total)
|
||||
entities_str = ", ".join(project.entities or [])
|
||||
related_str = ", ".join(project.related_searches or [])
|
||||
|
||||
h2_total = min(h2_total, self.MAX_H2_TOTAL)
|
||||
h3_total = min(h3_total, self.MAX_H3_TOTAL)
|
||||
|
||||
prompt = prompt_template["user_template"].format(
|
||||
system_msg, user_prompt = self.prompt_manager.format_prompt(
|
||||
"outline_generation",
|
||||
title=title,
|
||||
main_keyword=project.main_keyword,
|
||||
word_count=project.word_count,
|
||||
h2_total=h2_total,
|
||||
h2_exact=h2_exact,
|
||||
h2_related_search=h2_related,
|
||||
h2_entities=h2_entities,
|
||||
h3_total=h3_total,
|
||||
h3_exact=h3_exact,
|
||||
h3_related_search=h3_related,
|
||||
h3_entities=h3_entities,
|
||||
keyword=project.main_keyword,
|
||||
min_h2=min_h2,
|
||||
max_h2=max_h2,
|
||||
min_h3=min_h3,
|
||||
max_h3=max_h3,
|
||||
entities=entities_str,
|
||||
related_searches=searches_str
|
||||
related_searches=related_str
|
||||
)
|
||||
|
||||
for attempt in range(1, max_retries + 1):
|
||||
content_record.outline_attempts = attempt
|
||||
self.content_repo.update(content_record)
|
||||
outline_json = self.ai_client.generate_completion(
|
||||
prompt=user_prompt,
|
||||
system_message=system_msg,
|
||||
max_tokens=2000,
|
||||
temperature=0.7,
|
||||
json_mode=True
|
||||
)
|
||||
print(f"[DEBUG] Raw outline response: {outline_json}")
|
||||
# Save raw response immediately
|
||||
if debug:
|
||||
self._save_debug_output(project_id, "outline_raw", outline_json, "txt")
|
||||
print(f"[DEBUG] Raw outline response: {outline_json}")
|
||||
|
||||
try:
|
||||
outline_json_str = self.ai_client.generate_json(
|
||||
prompt=prompt,
|
||||
model=model,
|
||||
temperature=0.7,
|
||||
max_tokens=2000
|
||||
outline = json.loads(outline_json)
|
||||
except json.JSONDecodeError as e:
|
||||
if debug:
|
||||
self._save_debug_output(project_id, "outline_error", outline_json, "txt")
|
||||
raise ValueError(f"Failed to parse outline JSON: {e}\nResponse: {outline_json[:500]}")
|
||||
|
||||
if "outline" not in outline:
|
||||
if debug:
|
||||
self._save_debug_output(project_id, "outline_invalid", json.dumps(outline, indent=2), "json")
|
||||
raise ValueError(f"Outline missing 'outline' key. Got keys: {list(outline.keys())}\nContent: {outline}")
|
||||
|
||||
h2_count = len(outline["outline"])
|
||||
h3_count = sum(len(section.get("h3", [])) for section in outline["outline"])
|
||||
|
||||
if h2_count < min_h2:
|
||||
raise ValueError(f"Outline has {h2_count} H2s, minimum is {min_h2}")
|
||||
|
||||
if h3_count < min_h3:
|
||||
raise ValueError(f"Outline has {h3_count} H3s, minimum is {min_h3}")
|
||||
|
||||
if debug:
|
||||
self._save_debug_output(
|
||||
project_id, "outline", json.dumps(outline, indent=2), "json"
|
||||
)
|
||||
|
||||
if isinstance(outline_json_str, str):
|
||||
outline = json.loads(outline_json_str)
|
||||
else:
|
||||
outline = outline_json_str
|
||||
|
||||
is_valid, errors, missing = self.validator.validate_outline(outline, project)
|
||||
|
||||
if is_valid:
|
||||
content_record.outline = json.dumps(outline)
|
||||
self.content_repo.update(content_record)
|
||||
return outline
|
||||
|
||||
if attempt < max_retries:
|
||||
if missing:
|
||||
augmented_outline, aug_log = self.augmenter.augment_outline(
|
||||
outline, missing, project.main_keyword,
|
||||
project.entities or [], project.related_searches or []
|
||||
)
|
||||
|
||||
is_valid_aug, errors_aug, _ = self.validator.validate_outline(
|
||||
augmented_outline, project
|
||||
)
|
||||
|
||||
if is_valid_aug:
|
||||
content_record.outline = json.dumps(augmented_outline)
|
||||
content_record.augmented = True
|
||||
content_record.augmentation_log = aug_log
|
||||
self.content_repo.update(content_record)
|
||||
return augmented_outline
|
||||
|
||||
prompt += f"\n\nPrevious attempt failed: {', '.join(errors)}. Please meet ALL CORA targets exactly."
|
||||
|
||||
except (AIClientError, json.JSONDecodeError) as e:
|
||||
if attempt == max_retries:
|
||||
raise GenerationError(f"Outline generation failed after {max_retries} attempts: {e}")
|
||||
|
||||
raise GenerationError(f"Outline validation failed after {max_retries} attempts")
|
||||
|
||||
def _generate_content(
|
||||
def generate_content(
|
||||
self,
|
||||
project: Project,
|
||||
project_id: int,
|
||||
title: str,
|
||||
outline: Dict[str, Any],
|
||||
content_record: GeneratedContent,
|
||||
model: str,
|
||||
max_retries: int
|
||||
outline: dict,
|
||||
min_word_count: int,
|
||||
max_word_count: int,
|
||||
debug: bool = False
|
||||
) -> str:
|
||||
"""Generate and validate full HTML content"""
|
||||
prompt_template = self._load_prompt("content_generation.json")
|
||||
"""
|
||||
Generate full article HTML fragment
|
||||
|
||||
outline_str = self._format_outline_for_prompt(outline)
|
||||
entities_str = ", ".join(project.entities[:30]) if project.entities else "N/A"
|
||||
searches_str = ", ".join(project.related_searches[:30]) if project.related_searches else "N/A"
|
||||
Args:
|
||||
project_id: Project ID
|
||||
title: Article title
|
||||
outline: Article outline dict
|
||||
min_word_count: Minimum word count for guidance
|
||||
max_word_count: Maximum word count for guidance
|
||||
debug: If True, save response to debug_output/
|
||||
|
||||
prompt = prompt_template["user_template"].format(
|
||||
outline=outline_str,
|
||||
Returns:
|
||||
HTML string with <h2>, <h3>, <p> tags
|
||||
"""
|
||||
project = self.project_repo.get_by_id(project_id)
|
||||
if not project:
|
||||
raise ValueError(f"Project {project_id} not found")
|
||||
|
||||
entities_str = ", ".join(project.entities or [])
|
||||
related_str = ", ".join(project.related_searches or [])
|
||||
outline_str = json.dumps(outline, indent=2)
|
||||
|
||||
system_msg, user_prompt = self.prompt_manager.format_prompt(
|
||||
"content_generation",
|
||||
title=title,
|
||||
main_keyword=project.main_keyword,
|
||||
word_count=project.word_count,
|
||||
term_frequency=project.term_frequency or self.config.content_rules.universal.default_term_frequency,
|
||||
outline=outline_str,
|
||||
keyword=project.main_keyword,
|
||||
entities=entities_str,
|
||||
related_searches=searches_str
|
||||
related_searches=related_str,
|
||||
min_word_count=min_word_count,
|
||||
max_word_count=max_word_count
|
||||
)
|
||||
|
||||
for attempt in range(1, max_retries + 1):
|
||||
content_record.content_attempts = attempt
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
try:
|
||||
html_content = self.ai_client.generate(
|
||||
prompt=prompt,
|
||||
model=model,
|
||||
temperature=0.7,
|
||||
max_tokens=self.config.ai_service.max_tokens
|
||||
content = self.ai_client.generate_completion(
|
||||
prompt=user_prompt,
|
||||
system_message=system_msg,
|
||||
max_tokens=8000,
|
||||
temperature=0.7
|
||||
)
|
||||
|
||||
is_valid, validation_result = self.validator.validate_content(html_content, project)
|
||||
content = content.strip()
|
||||
|
||||
content_record.validation_errors = len(validation_result.errors)
|
||||
content_record.validation_warnings = len(validation_result.warnings)
|
||||
content_record.validation_report = validation_result.to_dict()
|
||||
self.content_repo.update(content_record)
|
||||
|
||||
if is_valid:
|
||||
content_record.content = html_content
|
||||
word_count = len(html_content.split())
|
||||
content_record.word_count = word_count
|
||||
self.content_repo.update(content_record)
|
||||
return html_content
|
||||
|
||||
if attempt < max_retries:
|
||||
missing = self.validator.extract_missing_elements(validation_result, project, html_content)
|
||||
has_word_deficit = missing.get("word_count_deficit", 0) > 0
|
||||
|
||||
if has_word_deficit:
|
||||
try:
|
||||
augmented_html, aug_log = self.augmenter.augment_content_with_ai(
|
||||
html_content, missing, project.main_keyword,
|
||||
project.entities or [], project.related_searches or [],
|
||||
model=model
|
||||
if debug:
|
||||
self._save_debug_output(
|
||||
project_id, "content", content, "html"
|
||||
)
|
||||
|
||||
is_valid_aug, validation_result_aug = self.validator.validate_content(
|
||||
augmented_html, project
|
||||
return content
|
||||
|
||||
def validate_word_count(self, content: str, min_words: int, max_words: int) -> Tuple[bool, int]:
|
||||
"""
|
||||
Validate content word count
|
||||
|
||||
Args:
|
||||
content: HTML content string
|
||||
min_words: Minimum word count
|
||||
max_words: Maximum word count
|
||||
|
||||
Returns:
|
||||
Tuple of (is_valid, actual_count)
|
||||
"""
|
||||
word_count = self.count_words(content)
|
||||
is_valid = min_words <= word_count <= max_words
|
||||
return is_valid, word_count
|
||||
|
||||
def count_words(self, html_content: str) -> int:
|
||||
"""
|
||||
Count words in HTML content
|
||||
|
||||
Args:
|
||||
html_content: HTML string
|
||||
|
||||
Returns:
|
||||
Number of words
|
||||
"""
|
||||
text = re.sub(r'<[^>]+>', '', html_content)
|
||||
text = unescape(text)
|
||||
words = text.split()
|
||||
return len(words)
|
||||
|
||||
def augment_content(
|
||||
self,
|
||||
content: str,
|
||||
target_word_count: int,
|
||||
debug: bool = False,
|
||||
project_id: Optional[int] = None
|
||||
) -> str:
|
||||
"""
|
||||
Expand article content to meet minimum word count
|
||||
|
||||
Args:
|
||||
content: Current HTML content
|
||||
target_word_count: Target word count
|
||||
debug: If True, save response to debug_output/
|
||||
project_id: Optional project ID for debug output
|
||||
|
||||
Returns:
|
||||
Expanded HTML content
|
||||
"""
|
||||
system_msg, user_prompt = self.prompt_manager.format_prompt(
|
||||
"content_augmentation",
|
||||
content=content,
|
||||
target_word_count=target_word_count
|
||||
)
|
||||
|
||||
content_record.content = augmented_html
|
||||
content_record.augmented = True
|
||||
existing_log = content_record.augmentation_log or {}
|
||||
existing_log["content_ai_augmentation"] = aug_log
|
||||
content_record.augmentation_log = existing_log
|
||||
content_record.validation_errors = len(validation_result_aug.errors)
|
||||
content_record.validation_warnings = len(validation_result_aug.warnings)
|
||||
content_record.validation_report = validation_result_aug.to_dict()
|
||||
word_count = len(augmented_html.split())
|
||||
content_record.word_count = word_count
|
||||
self.content_repo.update(content_record)
|
||||
augmented = self.ai_client.generate_completion(
|
||||
prompt=user_prompt,
|
||||
system_message=system_msg,
|
||||
max_tokens=8000,
|
||||
temperature=0.7
|
||||
)
|
||||
|
||||
missing_after = self.validator.extract_missing_elements(validation_result_aug, project, augmented_html)
|
||||
still_short = missing_after.get("word_count_deficit", 0) > 0
|
||||
augmented = augmented.strip()
|
||||
|
||||
if not still_short:
|
||||
return augmented_html
|
||||
if debug and project_id:
|
||||
self._save_debug_output(
|
||||
project_id, "augmented", augmented, "html"
|
||||
)
|
||||
|
||||
html_content = augmented_html
|
||||
validation_result = validation_result_aug
|
||||
return augmented
|
||||
|
||||
except Exception as e:
|
||||
print(f"AI augmentation failed: {e}")
|
||||
error_summary = f"Word count too short. AI augmentation failed: {str(e)}"
|
||||
prompt += f"\n\nPrevious content failed validation: {error_summary}. Generate MORE content to meet the word count target."
|
||||
else:
|
||||
content_record.content = html_content
|
||||
word_count = len(html_content.split())
|
||||
content_record.word_count = word_count
|
||||
self.content_repo.update(content_record)
|
||||
return html_content
|
||||
def _save_debug_output(
|
||||
self,
|
||||
project_id: int,
|
||||
stage: str,
|
||||
content: str,
|
||||
extension: str,
|
||||
tier: Optional[str] = None,
|
||||
article_num: Optional[int] = None
|
||||
):
|
||||
"""Save debug output to file"""
|
||||
debug_dir = Path("debug_output")
|
||||
debug_dir.mkdir(exist_ok=True)
|
||||
|
||||
except AIClientError as e:
|
||||
if attempt == max_retries:
|
||||
raise GenerationError(f"Content generation failed after {max_retries} attempts: {e}")
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
|
||||
raise GenerationError(f"Content validation failed after {max_retries} attempts")
|
||||
tier_part = f"_tier{tier}" if tier else ""
|
||||
article_part = f"_article{article_num}" if article_num else ""
|
||||
|
||||
def _load_prompt(self, filename: str) -> Dict[str, Any]:
|
||||
"""Load prompt template from JSON file"""
|
||||
prompt_path = self.prompts_dir / filename
|
||||
if not prompt_path.exists():
|
||||
raise GenerationError(f"Prompt template not found: {filename}")
|
||||
filename = f"{stage}_project{project_id}{tier_part}{article_part}_{timestamp}.{extension}"
|
||||
filepath = debug_dir / filename
|
||||
|
||||
with open(prompt_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
def _format_outline_for_prompt(self, outline: Dict[str, Any]) -> str:
|
||||
"""Format outline JSON into readable string for content prompt"""
|
||||
lines = [f"H1: {outline.get('h1', '')}"]
|
||||
|
||||
for section in outline.get("sections", []):
|
||||
lines.append(f"\nH2: {section['h2']}")
|
||||
for h3 in section.get("h3s", []):
|
||||
lines.append(f" H3: {h3}")
|
||||
|
||||
return "\n".join(lines)
|
||||
with open(filepath, 'w', encoding='utf-8') as f:
|
||||
f.write(content)
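
Note on the new `generate_outline()`: the structural checks it applies to the model response (parse the JSON, require the `outline` key, count H2/H3 entries against the minimums) can be sketched in isolation as below. This is an illustrative sketch under assumptions, not the service code: `check_outline` and the sample payload are hypothetical, and the extra `isinstance` guard is an addition that the diff does not contain.

```python
import json
from typing import Tuple


def check_outline(raw: str, min_h2: int, min_h3: int) -> Tuple[int, int]:
    """Parse an outline response and return (h2_count, h3_count).

    Raises ValueError when the JSON is malformed, the 'outline' key is
    missing, or either heading count is below its minimum.
    """
    try:
        outline = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse outline JSON: {e}") from e

    # Assumption: guard against a top-level JSON array or scalar.
    if not isinstance(outline, dict) or "outline" not in outline:
        raise ValueError("Outline missing 'outline' key")

    sections = outline["outline"]
    h2_count = len(sections)
    h3_count = sum(len(section.get("h3", [])) for section in sections)

    if h2_count < min_h2:
        raise ValueError(f"Outline has {h2_count} H2s, minimum is {min_h2}")
    if h3_count < min_h3:
        raise ValueError(f"Outline has {h3_count} H3s, minimum is {min_h3}")
    return h2_count, h3_count


# Hypothetical response in the documented shape:
# {"outline": [{"h2": "...", "h3": ["...", "..."]}]}
sample = json.dumps({
    "outline": [
        {"h2": "What Is Shaft Machining?", "h3": ["Definition", "History"]},
        {"h2": "FAQ", "h3": ["How long does it take?"]},
    ]
})
counts = check_outline(sample, min_h2=2, min_h3=3)
```

As in the diff, only the minimums are enforced here; `max_h2` and `max_h3` are passed to the prompt but are not checked after generation.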
|
||||
|
|
@ -0,0 +1,52 @@
|
|||
"""
|
||||
Integration test for batch generation (stub)
"""

import pytest
from unittest.mock import Mock, MagicMock
from src.generation.batch_processor import BatchProcessor
from src.generation.service import ContentGenerator


def test_batch_processor_initialization():
    """Test BatchProcessor can be initialized"""
    mock_generator = Mock(spec=ContentGenerator)
    mock_content_repo = Mock()
    mock_project_repo = Mock()

    processor = BatchProcessor(
        content_generator=mock_generator,
        content_repo=mock_content_repo,
        project_repo=mock_project_repo
    )

    assert processor is not None
    assert processor.stats["total_jobs"] == 0
    assert processor.stats["processed_jobs"] == 0


def test_batch_processor_stats_initialization():
    """Test BatchProcessor initializes stats correctly"""
    mock_generator = Mock(spec=ContentGenerator)
    mock_content_repo = Mock()
    mock_project_repo = Mock()

    processor = BatchProcessor(
        content_generator=mock_generator,
        content_repo=mock_content_repo,
        project_repo=mock_project_repo
    )

    expected_keys = [
        "total_jobs",
        "processed_jobs",
        "total_articles",
        "generated_articles",
        "augmented_articles",
        "failed_articles"
    ]

    for key in expected_keys:
        assert key in processor.stats
        assert processor.stats[key] == 0
|
||||
|
||||
|
|
@ -0,0 +1,95 @@
|
|||
"""
|
||||
Unit tests for ContentGenerator service
"""

import pytest
from src.generation.service import ContentGenerator


def test_count_words_simple():
    """Test word count on simple text"""
    generator = ContentGenerator(None, None, None, None)

    # The sentence contains seven words, matching the assertion below.
    html = "<p>This is a test with seven words</p>"
    count = generator.count_words(html)

    assert count == 7


def test_count_words_with_headings():
    """Test word count with HTML headings"""
    generator = ContentGenerator(None, None, None, None)

    html = """
    <h2>Main Heading</h2>
    <p>This is a paragraph with some words.</p>
    <h3>Subheading</h3>
    <p>Another paragraph here.</p>
    """

    count = generator.count_words(html)

    assert count > 10


def test_count_words_strips_html_tags():
    """Test that HTML tags are stripped before counting"""
    generator = ContentGenerator(None, None, None, None)

    html = "<p>Hello <strong>world</strong> this <em>is</em> a test</p>"
    count = generator.count_words(html)

    assert count == 6


def test_validate_word_count_within_range():
    """Test validation when word count is within range"""
    generator = ContentGenerator(None, None, None, None)

    content = "<p>" + " ".join(["word"] * 100) + "</p>"
    is_valid, count = generator.validate_word_count(content, 50, 150)

    assert is_valid is True
    assert count == 100


def test_validate_word_count_below_minimum():
    """Test validation when word count is below minimum"""
    generator = ContentGenerator(None, None, None, None)

    content = "<p>" + " ".join(["word"] * 30) + "</p>"
    is_valid, count = generator.validate_word_count(content, 50, 150)

    assert is_valid is False
    assert count == 30


def test_validate_word_count_above_maximum():
    """Test validation when word count is above maximum"""
    generator = ContentGenerator(None, None, None, None)

    content = "<p>" + " ".join(["word"] * 200) + "</p>"
    is_valid, count = generator.validate_word_count(content, 50, 150)

    assert is_valid is False
    assert count == 200


def test_count_words_empty_content():
    """Test word count on empty content"""
    generator = ContentGenerator(None, None, None, None)

    count = generator.count_words("")

    assert count == 0


def test_count_words_only_tags():
    """Test word count on content with only HTML tags"""
    generator = ContentGenerator(None, None, None, None)

    html = "<div><p></p><span></span></div>"
    count = generator.count_words(html)

    assert count == 0
|
||||
|
||||
|
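
Note on the tests above: they exercise `count_words()` and `validate_word_count()` from `service.py`. The same logic can be tried without constructing a `ContentGenerator`; the module-level functions below are a standalone sketch that mirrors the two methods in the diff, with the free-function form being an assumption made for illustration.

```python
import re
from html import unescape
from typing import Tuple


def count_words(html_content: str) -> int:
    """Strip tags, decode HTML entities, then count whitespace-separated tokens."""
    text = re.sub(r"<[^>]+>", "", html_content)
    text = unescape(text)
    return len(text.split())


def validate_word_count(content: str, min_words: int, max_words: int) -> Tuple[bool, int]:
    """Return (is_valid, actual_count) for an inclusive [min_words, max_words] range."""
    word_count = count_words(content)
    return min_words <= word_count <= max_words, word_count


simple = count_words("<p>Hello <strong>world</strong> this <em>is</em> a test</p>")
entity = count_words("<p>Tom &amp; Jerry</p>")
# Tags are replaced with an empty string, so adjacent blocks with no
# whitespace between them are merged into a single token.
merged = count_words("<p>one</p><p>two</p>")
in_range = validate_word_count("<p>" + " ".join(["word"] * 100) + "</p>", 50, 150)
```

Because tags are removed rather than replaced with a space, compact HTML such as `<p>one</p><p>two</p>` is counted as one word. If the generated content is not pretty-printed, substituting a space for each tag may give a more accurate count; that would be a change to the behaviour shown in the diff, not something the diff already does.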
|
@ -1,208 +1,176 @@
|
|||
"""
|
||||
Unit tests for job configuration
|
||||
Unit tests for JobConfig parser
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import json
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from src.generation.job_config import (
|
||||
JobConfig, TierConfig, ModelConfig, AnchorTextConfig,
|
||||
FailureConfig, InterlinkingConfig
|
||||
)
|
||||
from src.generation.job_config import JobConfig, TIER_DEFAULTS
|
||||
|
||||
|
||||
def test_model_config_creation():
|
||||
"""Test ModelConfig creation"""
|
||||
config = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
|
||||
assert config.title == "model1"
|
||||
assert config.outline == "model2"
|
||||
assert config.content == "model3"
|
||||
@pytest.fixture
|
||||
def temp_job_file(tmp_path):
|
||||
"""Create a temporary job file for testing"""
|
||||
def _create_file(data):
|
||||
job_file = tmp_path / "test_job.json"
|
||||
with open(job_file, 'w') as f:
|
||||
json.dump(data, f)
|
||||
return str(job_file)
|
||||
return _create_file
|
||||
|
||||
|
||||
def test_anchor_text_config_modes():
|
||||
"""Test different anchor text modes"""
|
||||
default_config = AnchorTextConfig(mode="default")
|
||||
assert default_config.mode == "default"
|
||||
|
||||
override_config = AnchorTextConfig(
|
||||
mode="override",
|
||||
custom_text=["anchor1", "anchor2"]
|
||||
)
|
||||
assert override_config.mode == "override"
|
||||
assert len(override_config.custom_text) == 2
|
||||
|
||||
append_config = AnchorTextConfig(
|
||||
mode="append",
|
||||
additional_text=["extra"]
|
||||
)
|
||||
assert append_config.mode == "append"
|
||||
|
||||
|
||||
def test_tier_config_creation():
|
||||
"""Test TierConfig creation"""
|
||||
models = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
|
||||
tier_config = TierConfig(
|
||||
tier=1,
|
||||
article_count=15,
|
||||
models=models
|
||||
)
|
||||
|
||||
assert tier_config.tier == 1
|
||||
assert tier_config.article_count == 15
|
||||
assert tier_config.validation_attempts == 3
|
||||
|
||||
|
||||
def test_job_config_creation():
|
||||
"""Test JobConfig creation"""
|
||||
models = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
|
||||
tier = TierConfig(
|
||||
tier=1,
|
||||
article_count=10,
|
||||
models=models
|
||||
)
|
||||
|
||||
job = JobConfig(
|
||||
job_name="Test Job",
|
||||
project_id=1,
|
||||
tiers=[tier]
|
||||
)
|
||||
|
||||
assert job.job_name == "Test Job"
|
||||
assert job.project_id == 1
|
||||
assert len(job.tiers) == 1
|
||||
assert job.get_total_articles() == 10
|
||||
|
||||
|
||||
def test_job_config_multiple_tiers():
|
||||
"""Test JobConfig with multiple tiers"""
|
||||
models = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
|
||||
tier1 = TierConfig(tier=1, article_count=10, models=models)
|
||||
tier2 = TierConfig(tier=2, article_count=20, models=models)
|
||||
|
||||
job = JobConfig(
|
||||
job_name="Multi-Tier Job",
|
||||
project_id=1,
|
||||
tiers=[tier1, tier2]
|
||||
)
|
||||
|
||||
assert job.get_total_articles() == 30
|
||||
|
||||
|
||||
def test_job_config_unique_tiers_validation():
|
||||
"""Test that tier numbers must be unique"""
|
||||
models = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
|
||||
tier1 = TierConfig(tier=1, article_count=10, models=models)
|
||||
tier2 = TierConfig(tier=1, article_count=20, models=models)
|
||||
|
||||
with pytest.raises(ValueError, match="unique"):
|
||||
JobConfig(
|
||||
job_name="Duplicate Tiers",
|
||||
project_id=1,
|
||||
tiers=[tier1, tier2]
|
||||
)
|
||||
|
||||
|
||||
def test_job_config_from_file():
|
||||
"""Test loading JobConfig from JSON file"""
|
||||
config_data = {
|
||||
"job_name": "Test Job",
|
||||
"project_id": 1,
|
||||
"tiers": [
|
||||
def test_load_job_config_valid(temp_job_file):
|
||||
"""Test loading valid job file"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"tier": 1,
|
||||
"article_count": 5,
|
||||
"models": {
|
||||
"title": "model1",
|
||||
"outline": "model2",
|
||||
"content": "model3"
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 5
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
|
||||
json.dump(config_data, f)
|
||||
temp_path = f.name
|
||||
job_file = temp_job_file(data)
|
||||
config = JobConfig(job_file)
|
||||
|
||||
try:
|
||||
job = JobConfig.from_file(temp_path)
|
||||
assert job.job_name == "Test Job"
|
||||
assert job.project_id == 1
|
||||
assert len(job.tiers) == 1
|
||||
finally:
|
||||
Path(temp_path).unlink()
|
||||
assert len(config.get_jobs()) == 1
|
||||
assert config.get_jobs()[0].project_id == 1
|
||||
assert "tier1" in config.get_jobs()[0].tiers
|
||||
|
||||
|
||||
def test_job_config_to_file():
|
||||
"""Test saving JobConfig to JSON file"""
|
||||
models = ModelConfig(
|
||||
title="model1",
|
||||
outline="model2",
|
||||
content="model3"
|
||||
)
|
||||
def test_tier_defaults_applied(temp_job_file):
|
||||
"""Test defaults applied when not in job file"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 3
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
tier = TierConfig(tier=1, article_count=5, models=models)
|
||||
job = JobConfig(
|
||||
job_name="Test Job",
|
||||
project_id=1,
|
||||
tiers=[tier]
|
||||
)
|
||||
job_file = temp_job_file(data)
|
||||
config = JobConfig(job_file)
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
|
||||
temp_path = f.name
|
||||
job = config.get_jobs()[0]
|
||||
tier1_config = job.tiers["tier1"]
|
||||
|
||||
try:
|
||||
job.to_file(temp_path)
|
||||
assert Path(temp_path).exists()
|
||||
|
||||
loaded_job = JobConfig.from_file(temp_path)
|
||||
assert loaded_job.job_name == job.job_name
|
||||
assert loaded_job.project_id == job.project_id
|
||||
finally:
|
||||
Path(temp_path).unlink()
|
||||
assert tier1_config.count == 3
|
||||
assert tier1_config.min_word_count == TIER_DEFAULTS["tier1"]["min_word_count"]
|
||||
assert tier1_config.max_word_count == TIER_DEFAULTS["tier1"]["max_word_count"]
|
||||
|
||||
|
||||
def test_interlinking_config_validation():
|
||||
"""Test InterlinkingConfig validation"""
|
||||
config = InterlinkingConfig(
|
||||
links_per_article_min=2,
|
||||
links_per_article_max=4
|
||||
)
|
||||
def test_custom_values_override_defaults(temp_job_file):
|
||||
"""Test custom values override defaults"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {
|
||||
"count": 5,
|
||||
"min_word_count": 3000,
|
||||
"max_word_count": 3500
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
assert config.links_per_article_min == 2
|
||||
assert config.links_per_article_max == 4
|
||||
job_file = temp_job_file(data)
|
||||
config = JobConfig(job_file)
|
||||
|
||||
job = config.get_jobs()[0]
|
||||
tier1_config = job.tiers["tier1"]
|
||||
|
||||
assert tier1_config.min_word_count == 3000
|
||||
assert tier1_config.max_word_count == 3500
|
||||
|
||||
|
||||
def test_failure_config_defaults():
|
||||
"""Test FailureConfig default values"""
|
||||
config = FailureConfig()
|
||||
def test_multiple_jobs_in_file(temp_job_file):
|
||||
"""Test parsing file with multiple jobs"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"tiers": {"tier1": {"count": 5}}
|
||||
},
|
||||
{
|
||||
"project_id": 2,
|
||||
"tiers": {"tier2": {"count": 10}}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
assert config.max_consecutive_failures == 5
|
||||
assert config.skip_on_failure is True
|
||||
job_file = temp_job_file(data)
|
||||
config = JobConfig(job_file)
|
||||
|
||||
jobs = config.get_jobs()
|
||||
assert len(jobs) == 2
|
||||
assert jobs[0].project_id == 1
|
||||
assert jobs[1].project_id == 2
|
||||
|
||||
|
||||
def test_multiple_tiers_in_job(temp_job_file):
|
||||
"""Test job with multiple tiers"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"project_id": 1,
|
||||
"tiers": {
|
||||
"tier1": {"count": 5},
|
||||
"tier2": {"count": 10},
|
||||
"tier3": {"count": 15}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
job_file = temp_job_file(data)
|
||||
config = JobConfig(job_file)
|
||||
|
||||
job = config.get_jobs()[0]
|
||||
assert len(job.tiers) == 3
|
||||
assert "tier1" in job.tiers
|
||||
assert "tier2" in job.tiers
|
||||
assert "tier3" in job.tiers
|
||||
|
||||
|
||||
def test_invalid_job_file_no_jobs_key(temp_job_file):
|
||||
"""Test error when jobs key is missing"""
|
||||
data = {"invalid": []}
|
||||
|
||||
job_file = temp_job_file(data)
|
||||
|
||||
with pytest.raises(ValueError, match="must contain 'jobs'"):
|
||||
JobConfig(job_file)
|
||||
|
||||
|
||||
def test_invalid_job_missing_project_id(temp_job_file):
|
||||
"""Test error when project_id is missing"""
|
||||
data = {
|
||||
"jobs": [
|
||||
{
|
||||
"tiers": {"tier1": {"count": 5}}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
job_file = temp_job_file(data)
|
||||
|
||||
with pytest.raises(ValueError, match="missing 'project_id'"):
|
||||
JobConfig(job_file)
|
||||
|
||||
|
||||
def test_file_not_found():
|
||||
"""Test error when file doesn't exist"""
|
||||
with pytest.raises(FileNotFoundError):
|
||||
JobConfig("nonexistent_file.json")
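The new tests describe a job file with a top-level `jobs` list, a required `project_id` per job, and per-tier settings that fall back to defaults. The sketch below shows one way such a loader could behave. It is an illustration under stated assumptions, not the code in `src/generation/job_config.py`: `load_jobs_sketch`, `SketchTier`, `SketchJob`, and the numeric defaults are all hypothetical, and the error messages are worded only to match the patterns the tests check for.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List

# Hypothetical default values; the real TIER_DEFAULTS are not shown in this diff.
SKETCH_TIER_DEFAULTS = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500},
    "tier2": {"min_word_count": 1500, "max_word_count": 2000},
    "tier3": {"min_word_count": 1000, "max_word_count": 1500},
}
FALLBACK_DEFAULTS = {"min_word_count": 1000, "max_word_count": 1500}


@dataclass
class SketchTier:
    count: int
    min_word_count: int
    max_word_count: int


@dataclass
class SketchJob:
    project_id: int
    tiers: Dict[str, SketchTier]


def load_jobs_sketch(path: str) -> List[SketchJob]:
    """Parse a job file, applying per-tier defaults under explicit values."""
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Job file not found: {path}")

    data = json.loads(file_path.read_text())
    if "jobs" not in data:
        raise ValueError("Job file must contain 'jobs' array")

    jobs = []
    for raw_job in data["jobs"]:
        if "project_id" not in raw_job:
            raise ValueError("Job missing 'project_id'")
        tiers = {}
        for name, overrides in raw_job.get("tiers", {}).items():
            defaults = SKETCH_TIER_DEFAULTS.get(name, FALLBACK_DEFAULTS)
            # Values from the file win over defaults because they come last.
            tiers[name] = SketchTier(**{**defaults, **overrides})
        jobs.append(SketchJob(raw_job["project_id"], tiers))
    return jobs
```

Merging with `{**defaults, **overrides}` keeps the precedence rule in one place. A tier entry with an unrecognized key raises `TypeError` in this sketch, and a real parser would probably want a clearer validation error.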
|
||||