13 KiB
CheddahBot Architecture Review: Data Consistency and Robustness
Date: 2026-02-19 Scope: Full codebase review focusing on ClickUp data consistency, failure modes, and thread safety
1. Data Flow Map: Complete Task Lifecycle
1.1 ClickUp Auto-Execution Path (Scheduler)
ClickUp API (source of truth for task creation)
|
v
Scheduler._poll_clickup() [scheduler.py:235]
|-- get_tasks_from_space() [traverses all folders/lists]
|-- For each task:
| |-- Check KV store for existing state (executing/completed/failed -> skip)
| |-- Verify task_type in skill_map
| |-- Verify auto_execute == true
| |-- Verify due_date within 3-week window
| v
| Scheduler._execute_task() [scheduler.py:313]
| |-- Set KV state to "executing"
| |-- Update ClickUp status to "in progress"
| |-- Call ToolRegistry.execute() with mapped tool
| |-- On success:
| | |-- If tool handled sync ("## ClickUp Sync" in output):
| | | state = "completed" in KV
| | |-- Else (fallback path):
| | | extract docx paths -> upload attachments
| | | state = "completed" in KV
| | | ClickUp status -> "internal review"
| | | Post completion comment
| |-- On failure:
| | state = "failed" in KV
| | ClickUp status -> "to do"
| | Post error comment
1.2 Press Release Pipeline (Tool-Internal)
write_press_releases() [tools/press_release.py:489]
|-- If no clickup_task_id passed, auto-lookup via _find_clickup_task()
|-- Set ClickUp to "in progress", post starting comment
|-- Step 1: Generate 7 headlines (chat brain)
|-- Step 2: AI judge picks best 2 (chat brain)
|-- Step 3: Write 2 PRs (execution brain x2)
|-- Upload docx attachments to ClickUp mid-pipeline
|-- Step 4: Generate 2 JSON-LD schemas (execution brain x2)
|-- Set ClickUp to "internal review"
|-- Update KV state to "completed"
|-- Output includes "## ClickUp Sync" marker
1.3 Link Building Pipeline (Tool-Internal)
run_cora_backlinks() [tools/linkbuilding.py:455]
|-- If no clickup_task_id, auto-lookup via _find_clickup_task()
|-- Step 1: subprocess call to Big-Link-Man ingest-cora
|-- Step 2: subprocess call to Big-Link-Man generate-batch
|-- On success: _complete_clickup_task() -> status "complete", KV "completed"
|-- On failure: _fail_clickup_task() -> status "internal review", KV "failed"
1.4 Folder Watcher Path
Scheduler._folder_watch_loop() [scheduler.py:461]
|-- Scan watch folder for *.xlsx
|-- For each file:
| |-- Check KV "linkbuilding:watched:{filename}" (skip completed/processing/failed)
| |-- Match filename to ClickUp task via fuzzy keyword match
| |-- Set KV to "processing"
| |-- Call run_cora_backlinks via tool registry
| |-- On success: move file to processed/, KV "completed"
| |-- On failure: KV "failed"
2. Single Points of Failure
2.1 CRITICAL: Crash During Execution Leaves Task in "executing" Limbo
File: scheduler.py:328-344
When _execute_task() sets the KV state to "executing" and updates ClickUp to "in progress", then the process crashes before the tool completes, the task ends up in a permanent dead state:
- KV store says
"executing"-- the scheduler will never pick it up again (skipsexecuting,completed,failed) - ClickUp says
"in progress"-- the poll only queries["to do"]status - Result: The task is permanently orphaned. No automatic recovery. No human alert.
Severity: CRITICAL. A crashed process could silently drop a paid client deliverable with no notification.
Mitigation today: The clickup_reset_task and clickup_reset_all tools exist (clickup_tool.py), but they require someone to manually notice and intervene.
2.2 CRITICAL: ClickUp API Down During Status Update
File: scheduler.py:343
If update_task_status() fails (returns False), the code continues execution anyway. The ClickUp status stays at "to do" while the KV says "executing". On next poll cycle, the task is skipped in KV but still visible as "to do" in ClickUp.
The update_task_status method in clickup.py catches all exceptions and returns False on failure, but the caller at scheduler.py:343 discards this return value.
Severity: CRITICAL. Silent data inconsistency between ClickUp and KV store.
2.3 HIGH: Execution Brain 5-Minute Hard Timeout
File: llm.py:193
The press release pipeline involves 4+ sequential execution brain calls. Each has a 300-second timeout, but none of the callers in press_release.py handle TimeoutExpired. If the execution brain times out mid-pipeline, intermediate files may have been written and partial docx uploaded to ClickUp.
The link building tool does handle subprocess.TimeoutExpired properly with cleanup.
Severity: HIGH. Partial deliverables uploaded to ClickUp with a failed state could confuse client review.
2.4 HIGH: SQLite DB Corruption Recovery
The SQLite database uses WAL mode, which is robust. However:
- There is no backup mechanism
- There is no integrity check on startup
- If the DB file is corrupted, the entire KV store and all task state history is lost
Severity: HIGH. For a production system managing paid deliverables, a single SQLite file with no backups is fragile.
2.5 MEDIUM: Tool Execution Has No Timeout
File: tools/__init__.py:175
ToolRegistry.execute() calls tool functions synchronously with no timeout wrapper. If a tool hangs, the scheduler thread blocks forever. Since there is only one ClickUp polling thread, this halts all future ClickUp processing.
2.6 MEDIUM: Memory Files Can Be Silently Truncated
File: memory.py:78-85
The memory system does unsynchronized read-modify-write on markdown files. If two threads both try to log at the same time, one write can overwrite the other's data.
3. ClickUp Sync Robustness
3.1 Source of Truth: It is Ambiguous
The system has two sources of truth that can disagree:
- KV store (
clickup:task:{id}:state) -- the scheduler's decision engine - ClickUp API (
task.status) -- the human-facing project management system
Neither is authoritative:
- A task can be
"completed"in KV but still"in progress"on ClickUp if the status update API call failed - A task can be
"to do"on ClickUp but"executing"in KV if the process crashed - A human can manually move a task back to
"to do"on ClickUp, but if KV says"completed", the scheduler will never pick it up
3.2 Tasks Can Get Lost
Scenario 1: Someone moves a task to "in progress" manually on ClickUp -- it will never be polled (only "to do" is polled).
Scenario 2: Tasks with no due date are skipped by the scheduler. Tasks without a due date that are never assigned one are permanently invisible.
Scenario 3: The DUE_DATE_WINDOW_WEEKS = 3 filter means tasks with due dates more than 3 weeks out are invisible.
3.3 Duplicate Executions CAN Happen
If the same tool is invoked both from the scheduler AND from the chat UI (via _find_clickup_task auto-lookup), both paths can set up KV state independently. Two press releases for the same client could be generated simultaneously.
Severity: HIGH.
3.4 Failed Task Retry Mechanism
The retry path works but is manual:
- On failure, scheduler sets KV to
"failed"and ClickUp back to"to do" - But since KV says
"failed", the scheduler skips it on next poll - To retry, someone must call
clickup_reset_task(task_id)via chat
Discrepancy: The scheduler sets failed tasks to "to do", but the link building tool sets them to "internal review". This inconsistency means retry behavior depends on which code path handles the failure.
3.5 Phantom Completion
The "## ClickUp Sync" string check (scheduler.py:385) is fragile -- it could false-positive on debug output. Also, in the fallback path, a task is marked "completed" even if zero docx files were uploaded.
4. Thread Safety
4.1 Threads in Play
| Thread | Name | Purpose |
|---|---|---|
| Main | uvicorn | Gradio UI + FastAPI |
| Daemon | scheduler |
Cron-based scheduled tasks |
| Daemon | heartbeat |
Periodic checklist check |
| Daemon | clickup |
ClickUp polling + execution |
| Daemon | folder-watch |
CORA file watcher |
4.2 Shared ClickUp Client Instance
The scheduler lazy-initializes self._clickup_client once and reuses it across the ClickUp thread. The folder watcher also calls self._get_clickup_client(), which shares the same instance. If both threads are active simultaneously, this could cause request interleaving bugs in httpx.
Tool functions create their own ClickUpClient instances, so tool-level calls are isolated.
4.3 KV Store Race Conditions
Individual kv_set and kv_get calls are safe (atomic SQL). But the read-check-modify-write pattern in the scheduler is NOT atomic. The real risk is the chat-vs-scheduler race described in section 3.3.
4.4 Memory System File Writes
remember() and log_daily() both do unsynchronized read-modify-write on markdown files with no file locking. Concurrent calls can lose data.
5. Error Recovery After Crash and Restart
5.1 What Survives a Restart
- KV store, ClickUp task statuses, scheduled tasks, conversation history, memory files, notification history -- all persist.
5.2 What Does NOT Recover
- In-flight task execution: KV says
"executing", ClickUp says"in progress". Task is stuck forever. - Folder watcher "processing" state: KV says
"processing". File is stuck forever. - There is no startup reconciliation, no stale-task detector, no watchdog for stuck states.
6. Recommendations
CRITICAL Priority
| # | Issue | Fix | Effort |
|---|---|---|---|
| C1 | Stuck "executing" after crash | On startup, scan KV for stale "executing" states (>1h old), move to "failed", push notification |
30-50 lines |
| C2 | Ignored ClickUp API failures | Check update_task_status() return value before proceeding with execution |
10-20 lines |
| C3 | Chat-vs-scheduler duplicate execution | Check KV store in _find_clickup_task() before creating new entries |
15-25 lines/file |
HIGH Priority
| # | Issue | Fix | Effort |
|---|---|---|---|
| H1 | No DB backup | Copy DB to .bak on startup; periodic backups from heartbeat |
5-10 lines |
| H2 | Inconsistent failure status | Unify: scheduler and tools both use same failure status | 10-15 lines |
| H3 | Fragile "## ClickUp Sync" detection | Use structured return type or unique token instead of string parsing | 20-40 lines |
| H4 | No KV-ClickUp reconciliation | Add heartbeat check comparing ClickUp "in progress" vs KV states |
40-60 lines |
MEDIUM Priority
| # | Issue | Fix | Effort |
|---|---|---|---|
| M1 | Memory file race condition | Add threading.Lock() to MemorySystem for file writes |
10-15 lines |
| M2 | Shared httpx client across threads | Make _get_clickup_client() return thread-local instances |
10-15 lines |
| M3 | No tool execution timeout | Wrap tool execution in a thread with timeout | 15-25 lines |
| M4 | Poor auto-flush summary quality | Use chat brain for summarization instead of truncated concat | 20-30 lines |
| M5 | Unbounded KV growth | Add periodic cleanup of completed/failed states older than N days | 20-30 lines |
LOW Priority
| # | Issue | Fix | Effort |
|---|---|---|---|
| L1 | No programmatic health check | Add /health endpoint checking thread liveness and DB access |
20-30 lines |
| L2 | Unstructured state transition logging | Log every state transition as structured JSON event | 20-40 lines |
| L3 | Test coverage gaps | No tests for crash recovery, concurrent execution, folder watcher | 200+ lines |
Risk Matrix
| # | Issue | Severity | Likelihood | Impact |
|---|---|---|---|---|
| C1 | Stuck "executing" after crash | CRITICAL | Medium | Task permanently lost |
| C2 | Ignored ClickUp API failures | CRITICAL | Medium | Silent inconsistency |
| C3 | Chat-vs-scheduler duplicate | CRITICAL | Low | Duplicate deliverables |
| H1 | No DB backup | HIGH | Low | Total state loss |
| H2 | Inconsistent failure status | HIGH | Medium | Confusing retry behavior |
| H3 | Fragile sync detection | HIGH | Low | Phantom completion |
| H4 | No KV-ClickUp reconciliation | HIGH | Medium | Undetected inconsistency |
| M1 | Memory file race condition | MEDIUM | Low | Lost memory entries |
| M2 | Shared httpx client | MEDIUM | Low | Corrupted HTTP requests |
| M3 | No tool execution timeout | MEDIUM | Low | Thread deadlock |
| M4 | Poor auto-flush summaries | MEDIUM | High | Lost context |
| M5 | Unbounded KV growth | MEDIUM | Low | Performance degradation |