# Job Configuration Schema
This document defines the complete schema for job configuration files used by the Big-Link-Man content automation platform. Every job file is a JSON document that defines the parameters for one or more content generation batches.
## Root Structure
```json
{
  "jobs": [
    {
      // Job object (see Job Object section below)
    }
  ]
}
```
### Root Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `jobs` | `Array<Job>` | Yes | Array of job definitions to process |
## Job Object
Each job object defines a complete content generation batch for a specific project.
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `project_id` | `integer` | The project ID to generate content for |
| `tiers` | `Object` | Dictionary of tier configurations (see Tier Configuration section) |
### Optional Fields
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `models` | `Object` | Uses CLI default | AI models to use for each generation stage (title, outline, content) |
| `deployment_targets` | `Array<string>` | `null` | Array of site custom_hostnames for tier1 deployment assignment (Story 2.5) |
| `tier1_preferred_sites` | `Array<string>` | `null` | Array of hostnames for tier1 site assignment priority (Story 3.1) |
| `auto_create_sites` | `boolean` | `false` | Whether to auto-create sites when pool is insufficient (Story 3.1) |
| `create_sites_for_keywords` | `Array<Object>` | `null` | Array of keyword site creation configs (Story 3.1) |
| `tiered_link_count_range` | `Object` | `null` | Range for the number of tiered links per article; `{"min": 2, "max": 4}` is used when omitted (Story 3.2) |
## Tier Configuration
Each tier in the `tiers` object defines content generation parameters for that specific tier level.
### Tier Keys
- `tier1` - Premium content (highest quality)
- `tier2` - Standard content (medium quality)
- `tier3` - Supporting content (basic quality)
### Tier Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `count` | `integer` | Yes | - | Number of articles to generate for this tier |
| `min_word_count` | `integer` | No | See defaults | Minimum word count for articles |
| `max_word_count` | `integer` | No | See defaults | Maximum word count for articles |
| `min_h2_tags` | `integer` | No | See defaults | Minimum number of H2 headings |
| `max_h2_tags` | `integer` | No | See defaults | Maximum number of H2 headings |
| `min_h3_tags` | `integer` | No | See defaults | Minimum number of H3 subheadings |
| `max_h3_tags` | `integer` | No | See defaults | Maximum number of H3 subheadings |
### Tier Defaults
#### Tier 1 Defaults
```json
{
  "min_word_count": 2000,
  "max_word_count": 2500,
  "min_h2_tags": 3,
  "max_h2_tags": 5,
  "min_h3_tags": 5,
  "max_h3_tags": 10
}
```
#### Tier 2 Defaults
```json
{
  "min_word_count": 1500,
  "max_word_count": 2000,
  "min_h2_tags": 2,
  "max_h2_tags": 4,
  "min_h3_tags": 3,
  "max_h3_tags": 8
}
```
#### Tier 3 Defaults
```json
{
  "min_word_count": 1000,
  "max_word_count": 1500,
  "min_h2_tags": 2,
  "max_h2_tags": 3,
  "min_h3_tags": 2,
  "max_h3_tags": 6
}
```
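
The defaults above can be pictured as a base layer that each tier's job-file values are overlaid on. The following is a minimal sketch, assuming a simple per-key merge; `TIER_DEFAULTS` and `resolve_tier_config` are hypothetical names used only for illustration and are not taken from the Big-Link-Man code base.

```python
# Default values copied from the Tier Defaults section of this document.
TIER_DEFAULTS = {
    "tier1": {"min_word_count": 2000, "max_word_count": 2500,
              "min_h2_tags": 3, "max_h2_tags": 5,
              "min_h3_tags": 5, "max_h3_tags": 10},
    "tier2": {"min_word_count": 1500, "max_word_count": 2000,
              "min_h2_tags": 2, "max_h2_tags": 4,
              "min_h3_tags": 3, "max_h3_tags": 8},
    "tier3": {"min_word_count": 1000, "max_word_count": 1500,
              "min_h2_tags": 2, "max_h2_tags": 3,
              "min_h3_tags": 2, "max_h3_tags": 6},
}


def resolve_tier_config(tier_name: str, tier_config: dict) -> dict:
    """Hypothetical helper: overlay job-file values on the tier's defaults."""
    if tier_name not in TIER_DEFAULTS:
        raise ValueError(f"Unknown tier: {tier_name!r}")
    if "count" not in tier_config:
        raise ValueError(f"{tier_name} is missing required field 'count'")
    # Keys present in the job file win; everything else falls back to defaults.
    return {**TIER_DEFAULTS[tier_name], **tier_config}


if __name__ == "__main__":
    # A tier3 entry that only sets the required `count` field.
    print(resolve_tier_config("tier3", {"count": 100}))
```

With this kind of merge, a tier that specifies only `count` inherits all six limits, while a tier that overrides `min_word_count` keeps the remaining defaults.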
## Deployment Target Assignment (Story 2.5)
### `deployment_targets`
- **Type**: `Array<string>` (optional)
- **Purpose**: Assigns tier1 articles to specific sites in list order, one article per target
- **Behavior**:
  - Only affects tier1 articles
  - Articles 0 through N-1 are assigned to the N deployment targets, in order
  - Articles N and beyond get `site_deployment_id = null` (the target list is not cycled)
  - If not specified, all articles get `site_deployment_id = null`
### Example
```json
{
  "deployment_targets": [
    "www.domain1.com",
    "www.domain2.com",
    "www.domain3.com"
  ]
}
```
**Assignment Result:**
- Article 0 → www.domain1.com
- Article 1 → www.domain2.com
- Article 2 → www.domain3.com
- Articles 3+ → null (no assignment)
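
The assignment result above can be reproduced with a short sketch. It assumes the behavior exactly as documented here (one article per target, no wrap-around) and returns hostnames rather than database IDs for readability; `assign_deployment_targets` is a hypothetical name, not the project's actual function.

```python
from typing import List, Optional


def assign_deployment_targets(
    article_count: int,
    deployment_targets: Optional[List[str]],
) -> List[Optional[str]]:
    """Return the target hostname for each tier1 article index, or None."""
    targets = deployment_targets or []
    # Article i gets target i while targets remain; later articles get None.
    return [
        targets[i] if i < len(targets) else None
        for i in range(article_count)
    ]


if __name__ == "__main__":
    hosts = ["www.domain1.com", "www.domain2.com", "www.domain3.com"]
    for index, host in enumerate(assign_deployment_targets(5, hosts)):
        print(f"Article {index} -> {host}")
```

In the real system the hostname would presumably be resolved to a `site_deployment_id`; that lookup is omitted here because its details are not described in this document.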
## Site Assignment (Story 3.1)
### `tier1_preferred_sites`
- **Type**: `Array<string>` (optional)
- **Purpose**: Preferred sites for tier1 article assignment
- **Behavior**: Used in priority order before random selection
- **Validation**: All hostnames must already exist in the database
### `auto_create_sites`
- **Type**: `boolean` (optional, default: `false`)
- **Purpose**: Auto-create sites when available pool is insufficient
- **Behavior**: Creates generic sites using project keyword as prefix
### `create_sites_for_keywords`
- **Type**: `Array<Object>` (optional)
- **Purpose**: Pre-create sites for specific keywords before assignment
- **Structure**: Each object must have `keyword` (string) and `count` (integer)
#### Keyword Site Creation Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `keyword` | `string` | Yes | Keyword to create sites for |
| `count` | `integer` | Yes | Number of sites to create for this keyword |
### Example
```json
{
  "tier1_preferred_sites": [
    "www.premium-site1.com",
    "site123.b-cdn.net"
  ],
  "auto_create_sites": true,
  "create_sites_for_keywords": [
    {
      "keyword": "engine repair",
      "count": 3
    },
    {
      "keyword": "car maintenance",
      "count": 2
    }
  ]
}
```
## AI Model Configuration
### `models`
- **Type**: `Object` (optional)
- **Purpose**: Specifies AI models to use for each generation stage
- **Behavior**: Allows different models for title, outline, and content generation
- **Note**: If not specified, all stages use the model from CLI `--model` flag (default: `gpt-4o-mini`)
#### Models Object Fields
| Field | Type | Description |
|-------|------|-------------|
| `title` | `string` | Model to use for title generation |
| `outline` | `string` | Model to use for outline generation |
| `content` | `string` | Model to use for content generation |
### Available Models (from master.config.json)
- `anthropic/claude-sonnet-4.5` (Claude Sonnet 4.5)
- `anthropic/claude-3.5-sonnet` (Claude 3.5 Sonnet)
- `openai/gpt-4o` (GPT-4o)
- `openai/gpt-4o-mini` (GPT-4o mini)
- `meta-llama/llama-3.1-70b-instruct` (Llama 3.1 70B)
- `meta-llama/llama-3.1-8b-instruct` (Llama 3.1 8B)
- `google/gemini-2.5-flash` (Gemini 2.5 Flash)
### Example
```json
{
  "models": {
    "title": "openai/gpt-4o-mini",
    "outline": "openai/gpt-4o",
    "content": "anthropic/claude-3.5-sonnet"
  }
}
```
### Implementation Status
**Implemented** - The `models` field is fully functional. Different models can be specified for title, outline, and content generation stages. If a job file contains a `models` configuration and you also use the `--model` CLI flag, the system will warn you that the CLI flag is being ignored in favor of the job config.
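
The precedence rule described above (job-file `models` over the CLI `--model` flag) can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `resolve_models` and its `cli_flag_given` parameter are hypothetical, and the warning text is invented.

```python
import warnings

STAGES = ("title", "outline", "content")


def resolve_models(job: dict, cli_model: str = "gpt-4o-mini",
                   cli_flag_given: bool = False) -> dict:
    """Return the model to use for each generation stage."""
    models = job.get("models")
    if models is None:
        # No job-level config: every stage uses the CLI model.
        return {stage: cli_model for stage in STAGES}
    if cli_flag_given:
        # The job file wins; tell the user the flag has no effect.
        warnings.warn("--model is ignored because the job defines 'models'")
    return {stage: models[stage] for stage in STAGES}


if __name__ == "__main__":
    job = {"models": {"title": "openai/gpt-4o-mini",
                      "outline": "openai/gpt-4o",
                      "content": "anthropic/claude-3.5-sonnet"}}
    print(resolve_models(job))
    print(resolve_models({}))
```

The sketch indexes all three stage keys directly, which matches the validation rule that a `models` object must contain `title`, `outline`, and `content`.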
## Tiered Link Configuration (Story 3.2)
### `tiered_link_count_range`
- **Type**: `Object` (optional)
- **Purpose**: Configures how many tiered links to generate per article
- **Default**: `{"min": 2, "max": 4}` if not specified
#### Tiered Link Range Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `min` | `integer` | Yes | Minimum number of tiered links (must be >= 1) |
| `max` | `integer` | Yes | Maximum number of tiered links (must be >= min) |
### Example
```json
{
  "tiered_link_count_range": {
    "min": 3,
    "max": 5
  }
}
```
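
As a rough illustration of how such a range might be consumed, the sketch below picks a link count for one article. This document does not say how the count is chosen within the range, so the uniform random draw is an assumption, and `pick_tiered_link_count` is a hypothetical name.

```python
import random
from typing import Optional

# Fallback range documented above for jobs that omit the field.
DEFAULT_LINK_RANGE = {"min": 2, "max": 4}


def pick_tiered_link_count(job: dict,
                           rng: Optional[random.Random] = None) -> int:
    """Pick a tiered link count inside the job's configured range."""
    link_range = job.get("tiered_link_count_range") or DEFAULT_LINK_RANGE
    low, high = link_range["min"], link_range["max"]
    if low < 1 or high < low:
        raise ValueError("tiered_link_count_range needs min >= 1 and max >= min")
    # Assumption: both bounds are inclusive and every value is equally likely.
    return (rng or random).randint(low, high)


if __name__ == "__main__":
    job = {"tiered_link_count_range": {"min": 3, "max": 5}}
    print(pick_tiered_link_count(job, random.Random(0)))
```

Passing a seeded `random.Random` keeps runs reproducible, which can help when debugging a batch.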
## Complete Example
```json
{
  "jobs": [
    {
      "project_id": 1,
      "models": {
        "title": "anthropic/claude-3.5-sonnet",
        "outline": "anthropic/claude-3.5-sonnet",
        "content": "openai/gpt-4o"
      },
      "deployment_targets": [
        "www.primary-domain.com",
        "www.secondary-domain.com"
      ],
      "tier1_preferred_sites": [
        "www.premium-site1.com",
        "site123.b-cdn.net"
      ],
      "auto_create_sites": true,
      "create_sites_for_keywords": [
        {
          "keyword": "engine repair",
          "count": 3
        },
        {
          "keyword": "car maintenance",
          "count": 2
        }
      ],
      "tiered_link_count_range": {
        "min": 3,
        "max": 5
      },
      "tiers": {
        "tier1": {
          "count": 10,
          "min_word_count": 2000,
          "max_word_count": 2500,
          "min_h2_tags": 3,
          "max_h2_tags": 5,
          "min_h3_tags": 5,
          "max_h3_tags": 10
        },
        "tier2": {
          "count": 50,
          "min_word_count": 1500,
          "max_word_count": 2000
        },
        "tier3": {
          "count": 100
        }
      }
    }
  ]
}
```
## Validation Rules
### Job Level Validation
- `project_id` must be a positive integer
- `tiers` must be an object with at least one tier
- `models` must be an object with `title`, `outline`, and `content` fields (if specified)
- `deployment_targets` must be an array of strings (if specified)
- `tier1_preferred_sites` must be an array of strings (if specified)
- `auto_create_sites` must be a boolean (if specified)
- `create_sites_for_keywords` must be an array of objects with `keyword` and `count` fields (if specified)
- `tiered_link_count_range` must have `min` >= 1 and `max` >= `min` (if specified)
### Tier Level Validation
- `count` must be a positive integer
- `min_word_count` must be <= `max_word_count`
- `min_h2_tags` must be <= `max_h2_tags`
- `min_h3_tags` must be <= `max_h3_tags`
### Site Assignment Validation
- All hostnames in `deployment_targets` must exist in the database
- All hostnames in `tier1_preferred_sites` must exist in the database
- Keywords in `create_sites_for_keywords` must be non-empty strings
- Count values in `create_sites_for_keywords` must be positive integers
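
A subset of the structural rules above can be expressed as a small validator. This is a minimal sketch, not the platform's validator: `validate_job` is a hypothetical name, the database hostname checks are left out because they need a live connection, and apart from the three messages quoted under Common Validation Errors the error strings are illustrative.

```python
def _is_positive_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_job(job: dict) -> list:
    """Return a list of error messages; an empty list means the job passed."""
    errors = []

    if "project_id" not in job:
        errors.append("Job missing 'project_id'")
    elif not _is_positive_int(job["project_id"]):
        errors.append("'project_id' must be a positive integer")

    tiers = job.get("tiers")
    if tiers is None:
        errors.append("Job missing 'tiers'")
    elif not isinstance(tiers, dict) or not tiers:
        errors.append("'tiers' must be an object with at least one tier")
    else:
        for name, tier in tiers.items():
            if not isinstance(tier, dict):
                errors.append(f"{name} must be an object")
                continue
            if not _is_positive_int(tier.get("count")):
                errors.append(f"{name}: 'count' must be a positive integer")
            # Each min_* value must not exceed its max_* partner.
            for kind in ("word_count", "h2_tags", "h3_tags"):
                low, high = tier.get(f"min_{kind}"), tier.get(f"max_{kind}")
                if low is not None and high is not None and low > high:
                    errors.append(f"{name}: min_{kind} must be <= max_{kind}")

    link_range = job.get("tiered_link_count_range")
    if link_range is not None:
        low, high = link_range.get("min"), link_range.get("max")
        if not _is_positive_int(low):
            errors.append("'tiered_link_count_range' min must be >= 1")
        elif not isinstance(high, int) or high < low:
            errors.append("'tiered_link_count_range' max must be >= min")

    return errors


if __name__ == "__main__":
    print(validate_job({"project_id": 1, "tiers": {"tier1": {"count": 10}}}))
    print(validate_job({"tiers": {"tier1": {"count": 0}}}))
```

Collecting every error before returning, rather than raising on the first one, lets a user fix a job file in a single pass; whether the real validator does this is not stated here.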
## Usage
### CLI Command
```bash
uv run python main.py generate-batch --job-file jobs/example.json --username admin --password secret
```
### Command Options
- `--job-file, -j`: Path to job JSON file (required)
- `--username, -u`: Username for authentication
- `--password, -p`: Password for authentication
- `--debug`: Save AI responses to debug_output/
- `--continue-on-error`: Continue processing if article generation fails
- `--model, -m`: AI model to use (default: gpt-4o-mini). Overridden by job file `models` config if present.
## Implementation History
### Story 2.2: Basic Content Generation
- Added `project_id` and `tiers` fields
- Added tier configuration with word count and heading constraints
- Added tier defaults for common configurations
### Story 2.3: AI Content Generation
- **Implemented**: Per-stage model selection via job config `models` field
- **Implemented**: Dynamic model switching in AIClient with `override_model` parameter
- **Implemented**: CLI warning when job contains models but `--model` flag is used
- **Behavior**: Job file `models` config takes precedence over CLI `--model` flag
### Story 2.5: Deployment Target Assignment
- Added `deployment_targets` field for tier1 site assignment
- Implemented in-order assignment logic (one tier1 article per deployment target)
- Added validation for deployment target hostnames
### Story 3.1: URL Generation and Site Assignment
- Added `tier1_preferred_sites` for priority-based assignment
- Added `auto_create_sites` for on-demand site creation
- Added `create_sites_for_keywords` for pre-creation of keyword sites
- Extended site assignment beyond deployment targets
### Story 3.2: Tiered Link Finding
- Added `tiered_link_count_range` for configurable link counts
- Integrated with tiered link generation system
- Added validation for link count ranges
## Future Extensions
The schema is designed to be extensible for future features:
- **Story 3.3**: Content interlinking injection
- **Story 4.x**: Cloud deployment and handoff
- **Future**: Advanced site matching, cost tracking, analytics
## Error Handling
### Common Validation Errors
- `"Job missing 'project_id'"` - Required field missing
- `"Job missing 'tiers'"` - Required field missing
- `"'deployment_targets' must be an array"` - Wrong data type
- `"Deployment targets not found in database: invalid.com"` - Invalid hostname
- `"'tiered_link_count_range' min must be >= 1"` - Invalid range value
### Graceful Degradation
- Missing optional fields use sensible defaults
- Invalid hostnames cause clear error messages
- Insufficient sites trigger auto-creation (if enabled) or clear errors
- Failed articles are logged but don't stop batch processing (with `--continue-on-error`)
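
The effect of `--continue-on-error` on a batch can be sketched as a loop that either records a failure and moves on or re-raises it. This is an illustration only: `run_batch` and the `generate` callable are hypothetical, and stopping on the first failure when the flag is absent is an assumption inferred from the last bullet above.

```python
import logging

logger = logging.getLogger("batch")


def run_batch(articles, generate, continue_on_error=False):
    """Run `generate` on each article; return (results, failures)."""
    results, failures = [], []
    for article in articles:
        try:
            results.append(generate(article))
        except Exception as exc:
            # Always log the failure so it is visible in either mode.
            logger.error("Article %r failed: %s", article, exc)
            failures.append((article, exc))
            if not continue_on_error:
                raise
    return results, failures


if __name__ == "__main__":
    def fake_generate(title):
        # Stand-in for real article generation.
        if title == "bad":
            raise RuntimeError("generation failed")
        return title.upper()

    done, failed = run_batch(["a", "bad", "c"], fake_generate,
                             continue_on_error=True)
    print(done, [title for title, _ in failed])
```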