Add architecture review: data consistency and robustness

Comprehensive review of ClickUp data flow, failure modes, thread safety, and error recovery. Identifies 3 critical issues (stuck tasks after crash, ignored API failures, duplicate execution risk), 4 high priority items, and 5 medium priority improvements. Doc only, no code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# CheddahBot Architecture Review: Data Consistency and Robustness

**Date:** 2026-02-19
**Scope:** Full codebase review focusing on ClickUp data consistency, failure modes, and thread safety

---

## 1. Data Flow Map: Complete Task Lifecycle

### 1.1 ClickUp Auto-Execution Path (Scheduler)

```
ClickUp API (source of truth for task creation)
  |
  v
Scheduler._poll_clickup() [scheduler.py:235]
|-- get_tasks_from_space() [traverses all folders/lists]
|-- For each task:
|   |-- Check KV store for existing state (executing/completed/failed -> skip)
|   |-- Verify task_type in skill_map
|   |-- Verify auto_execute == true
|   |-- Verify due_date within 3-week window
|   v
|   Scheduler._execute_task() [scheduler.py:313]
|   |-- Set KV state to "executing"
|   |-- Update ClickUp status to "in progress"
|   |-- Call ToolRegistry.execute() with mapped tool
|   |-- On success:
|   |   |-- If tool handled sync ("## ClickUp Sync" in output):
|   |   |   state = "completed" in KV
|   |   |-- Else (fallback path):
|   |   |   extract docx paths -> upload attachments
|   |   |   state = "completed" in KV
|   |   |   ClickUp status -> "internal review"
|   |   |   Post completion comment
|   |-- On failure:
|   |   state = "failed" in KV
|   |   ClickUp status -> "to do"
|   |   Post error comment
```

### 1.2 Press Release Pipeline (Tool-Internal)

```
write_press_releases() [tools/press_release.py:489]
|-- If no clickup_task_id passed, auto-lookup via _find_clickup_task()
|-- Set ClickUp to "in progress", post starting comment
|-- Step 1: Generate 7 headlines (chat brain)
|-- Step 2: AI judge picks best 2 (chat brain)
|-- Step 3: Write 2 PRs (execution brain x2)
|-- Upload docx attachments to ClickUp mid-pipeline
|-- Step 4: Generate 2 JSON-LD schemas (execution brain x2)
|-- Set ClickUp to "internal review"
|-- Update KV state to "completed"
|-- Output includes "## ClickUp Sync" marker
```

### 1.3 Link Building Pipeline (Tool-Internal)

```
run_cora_backlinks() [tools/linkbuilding.py:455]
|-- If no clickup_task_id, auto-lookup via _find_clickup_task()
|-- Step 1: subprocess call to Big-Link-Man ingest-cora
|-- Step 2: subprocess call to Big-Link-Man generate-batch
|-- On success: _complete_clickup_task() -> status "complete", KV "completed"
|-- On failure: _fail_clickup_task() -> status "internal review", KV "failed"
```

### 1.4 Folder Watcher Path

```
Scheduler._folder_watch_loop() [scheduler.py:461]
|-- Scan watch folder for *.xlsx
|-- For each file:
|   |-- Check KV "linkbuilding:watched:(unknown)" (skip completed/processing/failed)
|   |-- Match filename to ClickUp task via fuzzy keyword match
|   |-- Set KV to "processing"
|   |-- Call run_cora_backlinks via tool registry
|   |-- On success: move file to processed/, KV "completed"
|   |-- On failure: KV "failed"
```

---

## 2. Single Points of Failure

### 2.1 CRITICAL: Crash During Execution Leaves Task in "executing" Limbo

**File:** `scheduler.py:328-344`

If `_execute_task()` sets the KV state to `"executing"` and updates ClickUp to `"in progress"`, and the process then crashes before the tool completes, the task ends up in a permanently dead state:

- **KV store** says `"executing"` -- the scheduler will never pick it up again (it skips `executing`, `completed`, and `failed`)
- **ClickUp** says `"in progress"` -- the poll only queries the `["to do"]` status
- **Result:** The task is permanently orphaned. No automatic recovery. No human alert.

**Severity:** CRITICAL. A crashed process could silently drop a paid client deliverable with no notification.

**Mitigation today:** The `clickup_reset_task` and `clickup_reset_all` tools exist (`clickup_tool.py`), but they require someone to manually notice and intervene.
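
The startup sweep proposed later (C1 in section 6) can stay small. A sketch, assuming the KV store is a SQLite `kv(key TEXT, value TEXT)` table whose values are JSON objects carrying a `state` and an `updated_at` timestamp -- the real schema may differ:

```python
import json
import sqlite3
import time

STALE_AFTER_SECONDS = 3600  # anything "executing" for over an hour is presumed orphaned


def sweep_stale_executing(conn: sqlite3.Connection, notify=print) -> list[str]:
    """Move stale 'executing' KV entries to 'failed' and report them.

    Assumes a kv(key, value) table where value is JSON like
    {"state": "executing", "updated_at": <unix ts>} -- a hypothetical layout.
    """
    now = time.time()
    stale = []
    rows = conn.execute(
        "SELECT key, value FROM kv WHERE key LIKE 'clickup:task:%:state'"
    ).fetchall()
    for key, raw in rows:
        entry = json.loads(raw)
        too_old = now - entry.get("updated_at", 0) > STALE_AFTER_SECONDS
        if entry.get("state") == "executing" and too_old:
            entry["state"] = "failed"
            entry["updated_at"] = now
            conn.execute("UPDATE kv SET value = ? WHERE key = ?", (json.dumps(entry), key))
            stale.append(key)
            notify(f"Recovered stuck task after restart: {key}")
    conn.commit()
    return stale
```

Run once on startup, before the polling threads start, so a crashed task is surfaced instead of silently skipped forever.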
### 2.2 CRITICAL: ClickUp API Down During Status Update

**File:** `scheduler.py:343`

If `update_task_status()` fails (returns `False`), the code continues anyway: ClickUp stays at `"to do"` while the KV store says `"executing"`. On the next poll cycle the task is skipped because of its KV state, yet it remains visible as `"to do"` in ClickUp.
The `update_task_status` method in `clickup.py` catches all exceptions and returns `False` on failure, but the caller at `scheduler.py:343` discards this return value.
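
A minimal sketch of the fix. The names (`update_task_status`, the KV key format) come from this review; the exact call shapes are assumptions, not the actual scheduler code:

```python
def start_task(clickup, kv_set, task_id: str) -> bool:
    """Claim a task only if both sides of the state transition succeed.

    A False return from clickup.update_task_status (as clickup.py is
    described to behave) must abort execution instead of being discarded.
    `clickup` and `kv_set` are hypothetical stand-ins for the real objects.
    """
    kv_set(f"clickup:task:{task_id}:state", "executing")
    if not clickup.update_task_status(task_id, "in progress"):
        # Roll back the KV claim so a later poll (or a manual reset) can retry.
        kv_set(f"clickup:task:{task_id}:state", "failed")
        return False
    return True
```

The key point is that the two stores move together or not at all, so a ClickUp outage produces a visible `"failed"` state rather than a silent divergence.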

**Severity:** CRITICAL. Silent data inconsistency between ClickUp and the KV store.

### 2.3 HIGH: Execution Brain 5-Minute Hard Timeout

**File:** `llm.py:193`

The press release pipeline involves 4+ sequential execution brain calls. Each has a 300-second timeout, but none of the callers in `press_release.py` handle `TimeoutExpired`. If the execution brain times out mid-pipeline, intermediate files may already have been written and a partial docx uploaded to ClickUp.

The link building tool **does** handle `subprocess.TimeoutExpired` properly, with cleanup.

**Severity:** HIGH. Partial deliverables uploaded to ClickUp alongside a failed state could confuse client review.

### 2.4 HIGH: SQLite DB Corruption Recovery

The SQLite database uses WAL mode, which is robust. However:

- There is **no backup mechanism**
- There is **no integrity check on startup**
- If the DB file is corrupted, the entire KV store and all task state history are lost
**Severity:** HIGH. For a production system managing paid deliverables, a single SQLite file with no backups is fragile.
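
Both gaps are cheap to close with the stdlib's `sqlite3` online backup API and `PRAGMA integrity_check`. A startup sketch (the paths are placeholders, not the app's real filenames):

```python
import sqlite3


def check_and_backup(db_path: str, backup_path: str) -> None:
    """Run an integrity check, then snapshot the DB with the online backup API.

    Connection.backup() takes a consistent copy without downtime, so this is
    safe to call on startup (and periodically from the heartbeat thread)
    even with WAL mode active.
    """
    src = sqlite3.connect(db_path)
    try:
        result = src.execute("PRAGMA integrity_check").fetchone()[0]
        if result != "ok":
            raise RuntimeError(f"SQLite integrity check failed: {result}")
        dst = sqlite3.connect(backup_path)
        src.backup(dst)  # page-by-page consistent snapshot
        dst.close()
    finally:
        src.close()
```

Failing fast on a bad integrity check turns silent corruption into a loud startup error, which is exactly the behavior a system managing paid deliverables wants.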

### 2.5 MEDIUM: Tool Execution Has No Timeout

**File:** `tools/__init__.py:175`

`ToolRegistry.execute()` calls tool functions synchronously with no timeout wrapper. If a tool hangs, the scheduler thread blocks forever. Since there is only one ClickUp polling thread, this halts all future ClickUp processing.
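
One way to bound this -- a sketch, not the registry's actual code -- is to run the tool in a worker thread and give the scheduler a deadline. The caveat is inherent to Python threads: the hung worker cannot be killed, only abandoned, so this unblocks the polling loop without reclaiming the thread.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def execute_with_timeout(fn, args=(), kwargs=None, timeout_s: float = 600.0):
    """Run a tool function with a deadline so a hung tool cannot block polling.

    Returns (True, result) on success, (False, None) on timeout. On timeout
    the worker thread may keep running in the background; only the caller
    (the scheduler loop) is released.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *(args or ()), **(kwargs or {}))
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        return False, None
    finally:
        pool.shutdown(wait=False)  # do not block on a hung worker
```

Note the deliberate `shutdown(wait=False)`: a `with ThreadPoolExecutor(...)` block would join the worker on exit and reintroduce the very hang this is meant to avoid.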

### 2.6 MEDIUM: Memory Files Can Be Silently Truncated

**File:** `memory.py:78-85`

The memory system does unsynchronized read-modify-write on markdown files. If two threads both try to log at the same time, one write can overwrite the other's data.

---

## 3. ClickUp Sync Robustness

### 3.1 The Source of Truth Is Ambiguous

The system has **two** sources of truth that can disagree:

1. **KV store** (`clickup:task:{id}:state`) -- the scheduler's decision engine
2. **ClickUp API** (`task.status`) -- the human-facing project management system

Neither is authoritative:

- A task can be `"completed"` in KV but still `"in progress"` on ClickUp if the status update API call failed
- A task can be `"to do"` on ClickUp but `"executing"` in KV if the process crashed
- A human can manually move a task back to `"to do"` on ClickUp, but if KV says `"completed"`, the scheduler will never pick it up

### 3.2 Tasks Can Get Lost

**Scenario 1:** Someone manually moves a task to `"in progress"` on ClickUp -- it will never be polled, since only `"to do"` tasks are queried.

**Scenario 2:** Tasks with no due date are skipped by the scheduler; unless someone assigns one, they remain permanently invisible.

**Scenario 3:** The `DUE_DATE_WINDOW_WEEKS = 3` filter means tasks with due dates more than 3 weeks out are invisible.

### 3.3 Duplicate Executions CAN Happen

If the same tool is invoked both from the scheduler AND from the chat UI (via `_find_clickup_task` auto-lookup), both paths set up KV state independently. Two press releases for the same client could be generated simultaneously.

**Severity:** HIGH.

### 3.4 Failed Task Retry Mechanism

The retry path works but is manual:

1. On failure, the scheduler sets KV to `"failed"` and ClickUp back to `"to do"`
2. Since KV says `"failed"`, the scheduler skips the task on the next poll
3. To retry, someone must call `clickup_reset_task(task_id)` via chat

**Discrepancy:** The scheduler sets failed tasks to `"to do"`, but the link building tool sets them to `"internal review"`. Retry behavior therefore depends on which code path handled the failure.

### 3.5 Phantom Completion

The `"## ClickUp Sync"` string check (`scheduler.py:385`) is fragile -- it could false-positive on debug output. Worse, in the fallback path a task is marked `"completed"` even if zero docx files were uploaded.

---

## 4. Thread Safety

### 4.1 Threads in Play

| Thread | Name | Purpose |
|--------|------|---------|
| Main | uvicorn | Gradio UI + FastAPI |
| Daemon | `scheduler` | Cron-based scheduled tasks |
| Daemon | `heartbeat` | Periodic checklist check |
| Daemon | `clickup` | ClickUp polling + execution |
| Daemon | `folder-watch` | CORA file watcher |

### 4.2 Shared ClickUp Client Instance

The scheduler lazy-initializes `self._clickup_client` once and reuses it across the ClickUp thread. The folder watcher also calls `self._get_clickup_client()`, which shares the same instance. If both threads are active simultaneously, this could cause request interleaving bugs in httpx.

Tool functions create their **own** `ClickUpClient` instances, so tool-level calls are isolated.
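
The M2 fix proposed later can be a thread-local accessor. A sketch, with `make_client` standing in for `ClickUpClient` construction (whose details are not shown in this review):

```python
import threading


class ClientFactory:
    """Hand each thread its own client instance via threading.local().

    Call sites keep calling get(); each daemon thread lazily builds and
    caches a private client, so no httpx connection state is shared
    across threads. `make_client` is a hypothetical constructor callback.
    """

    def __init__(self, make_client):
        self._local = threading.local()
        self._make_client = make_client

    def get(self):
        if not hasattr(self._local, "client"):
            self._local.client = self._make_client()
        return self._local.client
```

`_get_clickup_client()` could delegate to one module-level `ClientFactory` without changing any caller.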

### 4.3 KV Store Race Conditions

Individual `kv_set` and `kv_get` calls are safe (each is an atomic SQL statement). But the **read-check-modify-write** pattern in the scheduler is NOT atomic. The real risk is the chat-vs-scheduler race described in section 3.3.
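
The pattern becomes atomic if the check is pushed into a single SQL statement -- a compare-and-set sketch over an assumed `kv(key, value)` table, not the actual store code. The skip set mirrors the scheduler's (`executing`/`completed`/`failed`), and whichever caller executes first wins the claim:

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, key: str) -> bool:
    """Atomically claim a task state key; exactly one concurrent caller wins.

    Uses SQLite upsert: insert 'executing' if no row exists, otherwise
    update only when the existing state is still claimable. rowcount is 0
    when the WHERE clause rejects the update, i.e. someone else holds it.
    """
    cur = conn.execute(
        "INSERT INTO kv (key, value) VALUES (?, 'executing') "
        "ON CONFLICT(key) DO UPDATE SET value = 'executing' "
        "WHERE kv.value NOT IN ('executing', 'completed', 'failed')",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1
```

If both the scheduler and `_find_clickup_task()` went through `try_claim()` first, the duplicate-execution race in 3.3 would collapse to one winner and one no-op.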

### 4.4 Memory System File Writes

`remember()` and `log_daily()` both do unsynchronized read-modify-write on markdown files with no file locking. Concurrent calls can lose data.
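
A minimal sketch of the M1 fix: serialize every read-modify-write behind one lock. The class and method names are illustrative stand-ins, not the actual `MemorySystem` API, and a `threading.Lock` only protects callers within one process:

```python
import threading
from pathlib import Path


class SafeMemoryLog:
    """Markdown append writes guarded by a single process-wide lock.

    One lock serializes the read-modify-write, so concurrent threads
    cannot clobber each other's entries. Illustrative stand-in for
    MemorySystem.remember()/log_daily().
    """

    def __init__(self, path: Path):
        self._path = path
        self._lock = threading.Lock()

    def append_entry(self, line: str) -> None:
        with self._lock:
            existing = self._path.read_text() if self._path.exists() else ""
            self._path.write_text(existing + line + "\n")
```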

---

## 5. Error Recovery After Crash and Restart

### 5.1 What Survives a Restart

KV store, ClickUp task statuses, scheduled tasks, conversation history, memory files, and notification history all persist.

### 5.2 What Does NOT Recover

- **In-flight task execution:** KV says `"executing"`, ClickUp says `"in progress"`. The task is stuck forever.
- **Folder watcher "processing" state:** KV says `"processing"`. The file is stuck forever.
- There is no startup reconciliation, no stale-task detector, and no watchdog for stuck states.

---

## 6. Recommendations

### CRITICAL Priority

| # | Issue | Fix | Effort |
|---|-------|-----|--------|
| C1 | Stuck "executing" after crash | On startup, scan KV for stale `"executing"` states (>1h old), move them to `"failed"`, push a notification | 30-50 lines |
| C2 | Ignored ClickUp API failures | Check the `update_task_status()` return value before proceeding with execution | 10-20 lines |
| C3 | Chat-vs-scheduler duplicate execution | Check the KV store in `_find_clickup_task()` before creating new entries | 15-25 lines/file |

### HIGH Priority

| # | Issue | Fix | Effort |
|---|-------|-----|--------|
| H1 | No DB backup | Copy DB to `.bak` on startup; periodic backups from heartbeat | 5-10 lines |
| H2 | Inconsistent failure status | Unify: scheduler and tools both use the same failure status | 10-15 lines |
| H3 | Fragile "## ClickUp Sync" detection | Use a structured return type or unique token instead of string parsing | 20-40 lines |
| H4 | No KV-ClickUp reconciliation | Add heartbeat check comparing ClickUp `"in progress"` vs KV states | 40-60 lines |

### MEDIUM Priority

| # | Issue | Fix | Effort |
|---|-------|-----|--------|
| M1 | Memory file race condition | Add `threading.Lock()` to MemorySystem for file writes | 10-15 lines |
| M2 | Shared httpx client across threads | Make `_get_clickup_client()` return thread-local instances | 10-15 lines |
| M3 | No tool execution timeout | Wrap tool execution in a thread with timeout | 15-25 lines |
| M4 | Poor auto-flush summary quality | Use chat brain for summarization instead of truncated concat | 20-30 lines |
| M5 | Unbounded KV growth | Add periodic cleanup of completed/failed states older than N days | 20-30 lines |

### LOW Priority

| # | Issue | Fix | Effort |
|---|-------|-----|--------|
| L1 | No programmatic health check | Add `/health` endpoint checking thread liveness and DB access | 20-30 lines |
| L2 | Unstructured state transition logging | Log every state transition as structured JSON event | 20-40 lines |
| L3 | Test coverage gaps | No tests for crash recovery, concurrent execution, folder watcher | 200+ lines |
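
L1 can be as small as a payload builder wired into the existing FastAPI app. A sketch; the daemon thread names come from the table in section 4.1, and the DB path is a placeholder:

```python
import sqlite3
import threading

# Daemon threads the scheduler is expected to keep alive (see section 4.1).
EXPECTED_THREADS = {"scheduler", "heartbeat", "clickup", "folder-watch"}


def health_snapshot(db_path: str) -> dict:
    """Build the payload for a /health endpoint: thread liveness + DB access."""
    alive = {t.name for t in threading.enumerate() if t.is_alive()}
    missing = sorted(EXPECTED_THREADS - alive)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT 1").fetchone()
        conn.close()
        db_ok = True
    except sqlite3.Error:
        db_ok = False
    return {
        "status": "ok" if db_ok and not missing else "degraded",
        "db_ok": db_ok,
        "missing_threads": missing,
    }
```

A FastAPI route would just `return health_snapshot(DB_PATH)`, giving monitoring a single probe for both failure classes.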

---

## Risk Matrix

| # | Issue | Severity | Likelihood | Impact |
|---|-------|----------|------------|--------|
| C1 | Stuck "executing" after crash | CRITICAL | Medium | Task permanently lost |
| C2 | Ignored ClickUp API failures | CRITICAL | Medium | Silent inconsistency |
| C3 | Chat-vs-scheduler duplicate | CRITICAL | Low | Duplicate deliverables |
| H1 | No DB backup | HIGH | Low | Total state loss |
| H2 | Inconsistent failure status | HIGH | Medium | Confusing retry behavior |
| H3 | Fragile sync detection | HIGH | Low | Phantom completion |
| H4 | No KV-ClickUp reconciliation | HIGH | Medium | Undetected inconsistency |
| M1 | Memory file race condition | MEDIUM | Low | Lost memory entries |
| M2 | Shared httpx client | MEDIUM | Low | Corrupted HTTP requests |
| M3 | No tool execution timeout | MEDIUM | Low | Thread deadlock |
| M4 | Poor auto-flush summaries | MEDIUM | High | Lost context |
| M5 | Unbounded KV growth | MEDIUM | Low | Performance degradation |