CheddahBot/docs/ARCHITECTURE_REVIEW.md

13 KiB

CheddahBot Architecture Review: Data Consistency and Robustness

Date: 2026-02-19 Scope: Full codebase review focusing on ClickUp data consistency, failure modes, and thread safety


1. Data Flow Map: Complete Task Lifecycle

1.1 ClickUp Auto-Execution Path (Scheduler)

ClickUp API (source of truth for task creation)
    |
    v
Scheduler._poll_clickup()              [scheduler.py:235]
    |-- get_tasks_from_space()          [traverses all folders/lists]
    |-- For each task:
    |     |-- Check KV store for existing state (executing/completed/failed -> skip)
    |     |-- Verify task_type in skill_map
    |     |-- Verify auto_execute == true
    |     |-- Verify due_date within 3-week window
    |     v
    |   Scheduler._execute_task()        [scheduler.py:313]
    |     |-- Set KV state to "executing"
    |     |-- Update ClickUp status to "in progress"
    |     |-- Call ToolRegistry.execute() with mapped tool
    |     |-- On success:
    |     |     |-- If tool handled sync ("## ClickUp Sync" in output):
    |     |     |     state = "completed" in KV
    |     |     |-- Else (fallback path):
    |     |     |     extract docx paths -> upload attachments
    |     |     |     state = "completed" in KV
    |     |     |     ClickUp status -> "internal review"
    |     |     |     Post completion comment
    |     |-- On failure:
    |     |     state = "failed" in KV
    |     |     ClickUp status -> "to do"
    |     |     Post error comment

1.2 Press Release Pipeline (Tool-Internal)

write_press_releases()                   [tools/press_release.py:489]
    |-- If no clickup_task_id passed, auto-lookup via _find_clickup_task()
    |-- Set ClickUp to "in progress", post starting comment
    |-- Step 1: Generate 7 headlines (chat brain)
    |-- Step 2: AI judge picks best 2 (chat brain)
    |-- Step 3: Write 2 PRs (execution brain x2)
    |-- Upload docx attachments to ClickUp mid-pipeline
    |-- Step 4: Generate 2 JSON-LD schemas (execution brain x2)
    |-- Set ClickUp to "internal review"
    |-- Update KV state to "completed"
    |-- Output includes "## ClickUp Sync" marker
run_cora_backlinks()                     [tools/linkbuilding.py:455]
    |-- If no clickup_task_id, auto-lookup via _find_clickup_task()
    |-- Step 1: subprocess call to Big-Link-Man ingest-cora
    |-- Step 2: subprocess call to Big-Link-Man generate-batch
    |-- On success: _complete_clickup_task() -> status "complete", KV "completed"
    |-- On failure: _fail_clickup_task() -> status "internal review", KV "failed"

1.4 Folder Watcher Path

Scheduler._folder_watch_loop()           [scheduler.py:461]
    |-- Scan watch folder for *.xlsx
    |-- For each file:
    |     |-- Check KV "linkbuilding:watched:{filename}" (skip completed/processing/failed)
    |     |-- Match filename to ClickUp task via fuzzy keyword match
    |     |-- Set KV to "processing"
    |     |-- Call run_cora_backlinks via tool registry
    |     |-- On success: move file to processed/, KV "completed"
    |     |-- On failure: KV "failed"

2. Single Points of Failure

2.1 CRITICAL: Crash During Execution Leaves Task in "executing" Limbo

File: scheduler.py:328-344

When _execute_task() sets the KV state to "executing" and updates ClickUp to "in progress", then the process crashes before the tool completes, the task ends up in a permanent dead state:

  • KV store says "executing" -- the scheduler will never pick it up again (skips executing, completed, failed)
  • ClickUp says "in progress" -- the poll only queries ["to do"] status
  • Result: The task is permanently orphaned. No automatic recovery. No human alert.

Severity: CRITICAL. A crashed process could silently drop a paid client deliverable with no notification.

Mitigation today: The clickup_reset_task and clickup_reset_all tools exist (clickup_tool.py), but they require someone to manually notice and intervene.

2.2 CRITICAL: ClickUp API Down During Status Update

File: scheduler.py:343

If update_task_status() fails (returns False), the code continues execution anyway. The ClickUp status stays at "to do" while the KV says "executing". On next poll cycle, the task is skipped in KV but still visible as "to do" in ClickUp.

The update_task_status method in clickup.py catches all exceptions and returns False on failure, but the caller at scheduler.py:343 discards this return value.

Severity: CRITICAL. Silent data inconsistency between ClickUp and KV store.

2.3 HIGH: Execution Brain 5-Minute Hard Timeout

File: llm.py:193

The press release pipeline involves 4+ sequential execution brain calls. Each has a 300-second timeout, but none of the callers in press_release.py handle TimeoutExpired. If the execution brain times out mid-pipeline, intermediate files may have been written and partial docx uploaded to ClickUp.

The link building tool does handle subprocess.TimeoutExpired properly with cleanup.

Severity: HIGH. Partial deliverables uploaded to ClickUp with a failed state could confuse client review.

2.4 HIGH: SQLite DB Corruption Recovery

The SQLite database uses WAL mode, which is robust. However:

  • There is no backup mechanism
  • There is no integrity check on startup
  • If the DB file is corrupted, the entire KV store and all task state history is lost

Severity: HIGH. For a production system managing paid deliverables, a single SQLite file with no backups is fragile.

2.5 MEDIUM: Tool Execution Has No Timeout

File: tools/__init__.py:175

ToolRegistry.execute() calls tool functions synchronously with no timeout wrapper. If a tool hangs, the scheduler thread blocks forever. Since there is only one ClickUp polling thread, this halts all future ClickUp processing.

2.6 MEDIUM: Memory Files Can Be Silently Truncated

File: memory.py:78-85

The memory system does unsynchronized read-modify-write on markdown files. If two threads both try to log at the same time, one write can overwrite the other's data.


3. ClickUp Sync Robustness

3.1 Source of Truth: It is Ambiguous

The system has two sources of truth that can disagree:

  1. KV store (clickup:task:{id}:state) -- the scheduler's decision engine
  2. ClickUp API (task.status) -- the human-facing project management system

Neither is authoritative:

  • A task can be "completed" in KV but still "in progress" on ClickUp if the status update API call failed
  • A task can be "to do" on ClickUp but "executing" in KV if the process crashed
  • A human can manually move a task back to "to do" on ClickUp, but if KV says "completed", the scheduler will never pick it up

3.2 Tasks Can Get Lost

Scenario 1: Someone moves a task to "in progress" manually on ClickUp -- it will never be polled (only "to do" is polled).

Scenario 2: Tasks with no due date are skipped by the scheduler. Tasks without a due date that are never assigned one are permanently invisible.

Scenario 3: The DUE_DATE_WINDOW_WEEKS = 3 filter means tasks with due dates more than 3 weeks out are invisible.

3.3 Duplicate Executions CAN Happen

If the same tool is invoked both from the scheduler AND from the chat UI (via _find_clickup_task auto-lookup), both paths can set up KV state independently. Two press releases for the same client could be generated simultaneously.

Severity: HIGH.

3.4 Failed Task Retry Mechanism

The retry path works but is manual:

  1. On failure, scheduler sets KV to "failed" and ClickUp back to "to do"
  2. But since KV says "failed", the scheduler skips it on next poll
  3. To retry, someone must call clickup_reset_task(task_id) via chat

Discrepancy: The scheduler sets failed tasks to "to do", but the link building tool sets them to "internal review". This inconsistency means retry behavior depends on which code path handles the failure.

3.5 Phantom Completion

The "## ClickUp Sync" string check (scheduler.py:385) is fragile -- it could false-positive on debug output. Also, in the fallback path, a task is marked "completed" even if zero docx files were uploaded.


4. Thread Safety

4.1 Threads in Play

Thread Name Purpose
Main uvicorn Gradio UI + FastAPI
Daemon scheduler Cron-based scheduled tasks
Daemon heartbeat Periodic checklist check
Daemon clickup ClickUp polling + execution
Daemon folder-watch CORA file watcher

4.2 Shared ClickUp Client Instance

The scheduler lazy-initializes self._clickup_client once and reuses it across the ClickUp thread. The folder watcher also calls self._get_clickup_client(), which shares the same instance. If both threads are active simultaneously, this could cause request interleaving bugs in httpx.

Tool functions create their own ClickUpClient instances, so tool-level calls are isolated.

4.3 KV Store Race Conditions

Individual kv_set and kv_get calls are safe (atomic SQL). But the read-check-modify-write pattern in the scheduler is NOT atomic. The real risk is the chat-vs-scheduler race described in section 3.3.

4.4 Memory System File Writes

remember() and log_daily() both do unsynchronized read-modify-write on markdown files with no file locking. Concurrent calls can lose data.


5. Error Recovery After Crash and Restart

5.1 What Survives a Restart

  • KV store, ClickUp task statuses, scheduled tasks, conversation history, memory files, notification history -- all persist.

5.2 What Does NOT Recover

  • In-flight task execution: KV says "executing", ClickUp says "in progress". Task is stuck forever.
  • Folder watcher "processing" state: KV says "processing". File is stuck forever.
  • There is no startup reconciliation, no stale-task detector, no watchdog for stuck states.

6. Recommendations

CRITICAL Priority

# Issue Fix Effort
C1 Stuck "executing" after crash On startup, scan KV for stale "executing" states (>1h old), move to "failed", push notification 30-50 lines
C2 Ignored ClickUp API failures Check update_task_status() return value before proceeding with execution 10-20 lines
C3 Chat-vs-scheduler duplicate execution Check KV store in _find_clickup_task() before creating new entries 15-25 lines/file

HIGH Priority

# Issue Fix Effort
H1 No DB backup Copy DB to .bak on startup; periodic backups from heartbeat 5-10 lines
H2 Inconsistent failure status Unify: scheduler and tools both use same failure status 10-15 lines
H3 Fragile "## ClickUp Sync" detection Use structured return type or unique token instead of string parsing 20-40 lines
H4 No KV-ClickUp reconciliation Add heartbeat check comparing ClickUp "in progress" vs KV states 40-60 lines

MEDIUM Priority

# Issue Fix Effort
M1 Memory file race condition Add threading.Lock() to MemorySystem for file writes 10-15 lines
M2 Shared httpx client across threads Make _get_clickup_client() return thread-local instances 10-15 lines
M3 No tool execution timeout Wrap tool execution in a thread with timeout 15-25 lines
M4 Poor auto-flush summary quality Use chat brain for summarization instead of truncated concat 20-30 lines
M5 Unbounded KV growth Add periodic cleanup of completed/failed states older than N days 20-30 lines

LOW Priority

# Issue Fix Effort
L1 No programmatic health check Add /health endpoint checking thread liveness and DB access 20-30 lines
L2 Unstructured state transition logging Log every state transition as structured JSON event 20-40 lines
L3 Test coverage gaps No tests for crash recovery, concurrent execution, folder watcher 200+ lines

Risk Matrix

# Issue Severity Likelihood Impact
C1 Stuck "executing" after crash CRITICAL Medium Task permanently lost
C2 Ignored ClickUp API failures CRITICAL Medium Silent inconsistency
C3 Chat-vs-scheduler duplicate CRITICAL Low Duplicate deliverables
H1 No DB backup HIGH Low Total state loss
H2 Inconsistent failure status HIGH Medium Confusing retry behavior
H3 Fragile sync detection HIGH Low Phantom completion
H4 No KV-ClickUp reconciliation HIGH Medium Undetected inconsistency
M1 Memory file race condition MEDIUM Low Lost memory entries
M2 Shared httpx client MEDIUM Low Corrupted HTTP requests
M3 No tool execution timeout MEDIUM Low Thread deadlock
M4 Poor auto-flush summaries MEDIUM High Lost context
M5 Unbounded KV growth MEDIUM Low Performance degradation