13 KiB

Raw Blame History

CheddahBot Architecture Review: Data Consistency and Robustness

Date: 2026-02-19 Scope: Full codebase review focusing on ClickUp data consistency, failure modes, and thread safety

1. Data Flow Map: Complete Task Lifecycle

1.1 ClickUp Auto-Execution Path (Scheduler)

ClickUp API (source of truth for task creation)
    |
    v
Scheduler._poll_clickup()              [scheduler.py:235]
    |-- get_tasks_from_space()          [traverses all folders/lists]
    |-- For each task:
    |     |-- Check KV store for existing state (executing/completed/failed -> skip)
    |     |-- Verify task_type in skill_map
    |     |-- Verify auto_execute == true
    |     |-- Verify due_date within 3-week window
    |     v
    |   Scheduler._execute_task()        [scheduler.py:313]
    |     |-- Set KV state to "executing"
    |     |-- Update ClickUp status to "in progress"
    |     |-- Call ToolRegistry.execute() with mapped tool
    |     |-- On success:
    |     |     |-- If tool handled sync ("## ClickUp Sync" in output):
    |     |     |     state = "completed" in KV
    |     |     |-- Else (fallback path):
    |     |     |     extract docx paths -> upload attachments
    |     |     |     state = "completed" in KV
    |     |     |     ClickUp status -> "internal review"
    |     |     |     Post completion comment
    |     |-- On failure:
    |     |     state = "failed" in KV
    |     |     ClickUp status -> "to do"
    |     |     Post error comment

1.2 Press Release Pipeline (Tool-Internal)

write_press_releases()                   [tools/press_release.py:489]
    |-- If no clickup_task_id passed, auto-lookup via _find_clickup_task()
    |-- Set ClickUp to "in progress", post starting comment
    |-- Step 1: Generate 7 headlines (chat brain)
    |-- Step 2: AI judge picks best 2 (chat brain)
    |-- Step 3: Write 2 PRs (execution brain x2)
    |-- Upload docx attachments to ClickUp mid-pipeline
    |-- Step 4: Generate 2 JSON-LD schemas (execution brain x2)
    |-- Set ClickUp to "internal review"
    |-- Update KV state to "completed"
    |-- Output includes "## ClickUp Sync" marker

1.3 Link Building Pipeline (Tool-Internal)

run_cora_backlinks()                     [tools/linkbuilding.py:455]
    |-- If no clickup_task_id, auto-lookup via _find_clickup_task()
    |-- Step 1: subprocess call to Big-Link-Man ingest-cora
    |-- Step 2: subprocess call to Big-Link-Man generate-batch
    |-- On success: _complete_clickup_task() -> status "complete", KV "completed"
    |-- On failure: _fail_clickup_task() -> status "internal review", KV "failed"

1.4 Folder Watcher Path

Scheduler._folder_watch_loop()           [scheduler.py:461]
    |-- Scan watch folder for *.xlsx
    |-- For each file:
    |     |-- Check KV "linkbuilding:watched:{filename}" (skip completed/processing/failed)
    |     |-- Match filename to ClickUp task via fuzzy keyword match
    |     |-- Set KV to "processing"
    |     |-- Call run_cora_backlinks via tool registry
    |     |-- On success: move file to processed/, KV "completed"
    |     |-- On failure: KV "failed"

2. Single Points of Failure

2.1 CRITICAL: Crash During Execution Leaves Task in "executing" Limbo

File: scheduler.py:328-344

When _execute_task() sets the KV state to "executing" and updates ClickUp to "in progress", then the process crashes before the tool completes, the task ends up in a permanent dead state:

KV store says "executing" -- the scheduler will never pick it up again (skips executing, completed, failed)
ClickUp says "in progress" -- the poll only queries ["to do"] status
Result: The task is permanently orphaned. No automatic recovery. No human alert.

Severity: CRITICAL. A crashed process could silently drop a paid client deliverable with no notification.

Mitigation today: The clickup_reset_task and clickup_reset_all tools exist (clickup_tool.py), but they require someone to manually notice and intervene.

2.2 CRITICAL: ClickUp API Down During Status Update

File: scheduler.py:343

If update_task_status() fails (returns False), the code continues execution anyway. The ClickUp status stays at "to do" while the KV says "executing". On next poll cycle, the task is skipped in KV but still visible as "to do" in ClickUp.

The update_task_status method in clickup.py catches all exceptions and returns False on failure, but the caller at scheduler.py:343 discards this return value.

Severity: CRITICAL. Silent data inconsistency between ClickUp and KV store.

2.3 HIGH: Execution Brain 5-Minute Hard Timeout

File: llm.py:193

The press release pipeline involves 4+ sequential execution brain calls. Each has a 300-second timeout, but none of the callers in press_release.py handle TimeoutExpired. If the execution brain times out mid-pipeline, intermediate files may have been written and partial docx uploaded to ClickUp.

The link building tool does handle subprocess.TimeoutExpired properly with cleanup.

Severity: HIGH. Partial deliverables uploaded to ClickUp with a failed state could confuse client review.

2.4 HIGH: SQLite DB Corruption Recovery

The SQLite database uses WAL mode, which is robust. However:

There is no backup mechanism
There is no integrity check on startup
If the DB file is corrupted, the entire KV store and all task state history is lost

Severity: HIGH. For a production system managing paid deliverables, a single SQLite file with no backups is fragile.

2.5 MEDIUM: Tool Execution Has No Timeout

File: tools/__init__.py:175

ToolRegistry.execute() calls tool functions synchronously with no timeout wrapper. If a tool hangs, the scheduler thread blocks forever. Since there is only one ClickUp polling thread, this halts all future ClickUp processing.

2.6 MEDIUM: Memory Files Can Be Silently Truncated

File: memory.py:78-85

The memory system does unsynchronized read-modify-write on markdown files. If two threads both try to log at the same time, one write can overwrite the other's data.

3. ClickUp Sync Robustness

3.1 Source of Truth: It is Ambiguous

The system has two sources of truth that can disagree:

KV store (clickup:task:{id}:state) -- the scheduler's decision engine
ClickUp API (task.status) -- the human-facing project management system

Neither is authoritative:

A task can be "completed" in KV but still "in progress" on ClickUp if the status update API call failed
A task can be "to do" on ClickUp but "executing" in KV if the process crashed
A human can manually move a task back to "to do" on ClickUp, but if KV says "completed", the scheduler will never pick it up

3.2 Tasks Can Get Lost

Scenario 1: Someone moves a task to "in progress" manually on ClickUp -- it will never be polled (only "to do" is polled).

Scenario 2: Tasks with no due date are skipped by the scheduler. Tasks without a due date that are never assigned one are permanently invisible.

Scenario 3: The DUE_DATE_WINDOW_WEEKS = 3 filter means tasks with due dates more than 3 weeks out are invisible.

3.3 Duplicate Executions CAN Happen

If the same tool is invoked both from the scheduler AND from the chat UI (via _find_clickup_task auto-lookup), both paths can set up KV state independently. Two press releases for the same client could be generated simultaneously.

Severity: HIGH.

3.4 Failed Task Retry Mechanism

The retry path works but is manual:

On failure, scheduler sets KV to "failed" and ClickUp back to "to do"
But since KV says "failed", the scheduler skips it on next poll
To retry, someone must call clickup_reset_task(task_id) via chat

Discrepancy: The scheduler sets failed tasks to "to do", but the link building tool sets them to "internal review". This inconsistency means retry behavior depends on which code path handles the failure.

3.5 Phantom Completion

The "## ClickUp Sync" string check (scheduler.py:385) is fragile -- it could false-positive on debug output. Also, in the fallback path, a task is marked "completed" even if zero docx files were uploaded.

4. Thread Safety

4.1 Threads in Play

Thread	Name	Purpose
Main	uvicorn	Gradio UI + FastAPI
Daemon	`scheduler`	Cron-based scheduled tasks
Daemon	`heartbeat`	Periodic checklist check
Daemon	`clickup`	ClickUp polling + execution
Daemon	`folder-watch`	CORA file watcher

4.2 Shared ClickUp Client Instance

The scheduler lazy-initializes self._clickup_client once and reuses it across the ClickUp thread. The folder watcher also calls self._get_clickup_client(), which shares the same instance. If both threads are active simultaneously, this could cause request interleaving bugs in httpx.

Tool functions create their own ClickUpClient instances, so tool-level calls are isolated.

4.3 KV Store Race Conditions

Individual kv_set and kv_get calls are safe (atomic SQL). But the read-check-modify-write pattern in the scheduler is NOT atomic. The real risk is the chat-vs-scheduler race described in section 3.3.

4.4 Memory System File Writes

remember() and log_daily() both do unsynchronized read-modify-write on markdown files with no file locking. Concurrent calls can lose data.

5. Error Recovery After Crash and Restart

5.1 What Survives a Restart

KV store, ClickUp task statuses, scheduled tasks, conversation history, memory files, notification history -- all persist.

5.2 What Does NOT Recover

In-flight task execution: KV says "executing", ClickUp says "in progress". Task is stuck forever.
Folder watcher "processing" state: KV says "processing". File is stuck forever.
There is no startup reconciliation, no stale-task detector, no watchdog for stuck states.

6. Recommendations

CRITICAL Priority

#	Issue	Fix	Effort
C1	Stuck "executing" after crash	On startup, scan KV for stale `"executing"` states (>1h old), move to `"failed"`, push notification	30-50 lines
C2	Ignored ClickUp API failures	Check `update_task_status()` return value before proceeding with execution	10-20 lines
C3	Chat-vs-scheduler duplicate execution	Check KV store in `_find_clickup_task()` before creating new entries	15-25 lines/file

HIGH Priority

#	Issue	Fix	Effort
H1	No DB backup	Copy DB to `.bak` on startup; periodic backups from heartbeat	5-10 lines
H2	Inconsistent failure status	Unify: scheduler and tools both use same failure status	10-15 lines
H3	Fragile "## ClickUp Sync" detection	Use structured return type or unique token instead of string parsing	20-40 lines
H4	No KV-ClickUp reconciliation	Add heartbeat check comparing ClickUp `"in progress"` vs KV states	40-60 lines

MEDIUM Priority

#	Issue	Fix	Effort
M1	Memory file race condition	Add `threading.Lock()` to MemorySystem for file writes	10-15 lines
M2	Shared httpx client across threads	Make `_get_clickup_client()` return thread-local instances	10-15 lines
M3	No tool execution timeout	Wrap tool execution in a thread with timeout	15-25 lines
M4	Poor auto-flush summary quality	Use chat brain for summarization instead of truncated concat	20-30 lines
M5	Unbounded KV growth	Add periodic cleanup of completed/failed states older than N days	20-30 lines

LOW Priority

#	Issue	Fix	Effort
L1	No programmatic health check	Add `/health` endpoint checking thread liveness and DB access	20-30 lines
L2	Unstructured state transition logging	Log every state transition as structured JSON event	20-40 lines
L3	Test coverage gaps	No tests for crash recovery, concurrent execution, folder watcher	200+ lines

Risk Matrix

#	Issue	Severity	Likelihood	Impact
C1	Stuck "executing" after crash	CRITICAL	Medium	Task permanently lost
C2	Ignored ClickUp API failures	CRITICAL	Medium	Silent inconsistency
C3	Chat-vs-scheduler duplicate	CRITICAL	Low	Duplicate deliverables
H1	No DB backup	HIGH	Low	Total state loss
H2	Inconsistent failure status	HIGH	Medium	Confusing retry behavior
H3	Fragile sync detection	HIGH	Low	Phantom completion
H4	No KV-ClickUp reconciliation	HIGH	Medium	Undetected inconsistency
M1	Memory file race condition	MEDIUM	Low	Lost memory entries
M2	Shared httpx client	MEDIUM	Low	Corrupted HTTP requests
M3	No tool execution timeout	MEDIUM	Low	Thread deadlock
M4	Poor auto-flush summaries	MEDIUM	High	Lost context
M5	Unbounded KV growth	MEDIUM	Low	Performance degradation

13 KiB Raw Blame History