# Linkman-Paperclip-Wrap
A standalone Python package wrapping the Big-Link-Man (BLM) CLI for use by
Paperclip agents. Extracted from CheddahBot (`cheddahbot/tools/linkbuilding.py`)
and simplified for consumption by external callers.
## What it does
Given a task keyword, the package can:
1. **Find a matching CORA `.xlsx`** in an inbox folder (e.g. `Cora-For-Humans/`)
using fuzzy keyword matching with singular/plural awareness.
2. **Invoke Big-Link-Man** to run `ingest-cora` and `generate-batch` on that
xlsx, producing the backlink content.
3. **Return a structured result** the caller can use to update task state.

No folder watching, no task-system coupling, no notifications. The caller owns
task state and polling cadence; this package is pure work.
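
The matching step in item 1 can be pictured roughly as follows. This is a minimal sketch, not the package's implementation: `sketch_normalize` and `sketch_match` are hypothetical names, and the normalization rules (lowercase, punctuation to spaces, collapsed whitespace) are assumptions. Only the overall shape, a cheap equality fast path followed by a pluggable plural check, comes from this README.

```python
import re
from typing import Callable

# Hypothetical alias: same shape as the LLMCheck callable described below.
PluralCheck = Callable[[str, str], bool]


def sketch_normalize(text: str) -> str:
    """Lowercase, turn punctuation/underscores into spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def sketch_match(keyword: str, candidate: str, plural_check: PluralCheck) -> bool:
    """Return True if the two strings name the same keyword."""
    a, b = sketch_normalize(keyword), sketch_normalize(candidate)
    if a == b:
        return True  # fast path: no checker call needed
    return plural_check(a, b)  # slow path: defer to the injected checker
```

With a checker that always answers `False`, only the fast path can succeed, so `sketch_match("CNC Shaft", "cnc-shaft", lambda a, b: False)` matches while `"shaft"` vs `"shafts"` does not.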
## Package layout
```
src/link_building_workflow/
    deps.py      -- Deps, BLMConfig, LLMCheck types
    matching.py  -- Keyword normalization and fuzzy matching
    inbox.py     -- Inbox folder scanning (list / find-by-keyword)
    blm.py       -- BLM subprocess wrapper and stdout parsers
    pipeline.py  -- run_cora_backlinks, blm_ingest_cora, blm_generate_batch
    __init__.py  -- Public API re-exports
```
## Installation
```
uv add git+https://git.peninsulaindustries.com/bryanb/Linkman-Paperclip-Wrap.git
```
Big-Link-Man itself is a separate dependency the caller provides. Install it
on the same host and point `BLMConfig.blm_dir` at the checkout.
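
For orientation, a wrapper of this kind usually launches the CLI from the checkout directory with a hard timeout. The sketch below shows that pattern with the standard library only. `run_in_checkout` is a hypothetical helper, not the package's `run_blm_command`, and the command it runs is whatever the caller passes in; the real BLM entry point and its arguments are not shown here.

```python
import subprocess
import sys
from pathlib import Path


def run_in_checkout(checkout_dir: str, args: list[str], timeout_seconds: int) -> tuple[bool, str]:
    """Run a command with `checkout_dir` as its working directory.

    Returns (ok, output): stdout on success, an error message otherwise.
    """
    cwd = Path(checkout_dir)
    if not cwd.is_dir():
        return False, f"checkout directory not found: {cwd}"
    try:
        proc = subprocess.run(
            args, cwd=cwd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds}s"
    except OSError as exc:  # e.g. the executable does not exist
        return False, str(exc)
    if proc.returncode != 0:
        return False, proc.stderr.strip() or f"exit code {proc.returncode}"
    return True, proc.stdout


if __name__ == "__main__":
    # Demo: the current Python interpreter stands in for the real CLI.
    print(run_in_checkout(".", [sys.executable, "-c", "print('ok')"], timeout_seconds=30))
```

Returning an `(ok, message)` pair instead of raising keeps the failure path easy to fold into a structured result.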
## Public API
All imports available from the top level:
```python
from link_building_workflow import (
    # Dependency types
    Deps, BLMConfig, LLMCheck,
    # Matching primitives
    normalize_for_match, fuzzy_keyword_match, filename_stem_to_keyword,
    # Inbox scanning
    InboxMatch, list_inbox_xlsx, find_xlsx_for_keyword, find_all_xlsx_for_keyword,
    # Pipeline entry points
    PipelineResult, run_cora_backlinks, blm_ingest_cora, blm_generate_batch,
    # Low-level BLM (if you need to run a custom BLM command)
    IngestResult, GenerateResult, build_ingest_args,
    parse_ingest_output, parse_generate_output, run_blm_command,
)
```
## Typical usage (Paperclip)
The caller decides when a task is eligible to run (all required task fields
filled in, xlsx present in the inbox). This package provides the primitives
to check the xlsx gate and to execute the work.
```python
from link_building_workflow import (
    Deps, BLMConfig, find_xlsx_for_keyword, run_cora_backlinks,
)

deps = Deps(
    blm=BLMConfig(
        blm_dir="/opt/big-link-man",
        username="your-blm-user",
        password="your-blm-pass",
        timeout_seconds=1800,
    ),
    llm_check=your_plural_checker,  # Callable[[str, str], bool]
)


def try_run_link_building(task):
    # Caller gates 1-4: task-field checks (LB Method, Keyword, IMSURL, ...)
    if not (task.keyword and task.imsurl):
        return "blocked: missing task fields"

    # Gate 5: does a matching xlsx exist yet?
    match = find_xlsx_for_keyword(
        "/data/Cora-For-Humans",
        task.keyword,
        deps.llm_check,
    )
    if match is None:
        return "blocked: no xlsx in Cora-For-Humans"

    # Execute
    result = run_cora_backlinks(
        xlsx_path=str(match.path),
        project_name=task.keyword,
        money_site_url=task.imsurl,
        custom_anchors=task.custom_anchors or "",
        cli_flags=task.cli_flags or "",
        branded_plus_ratio=task.branded_plus_ratio,  # None -> BLMConfig default
        deps=deps,
    )
    if result.ok:
        # result.summary is a multi-line human-readable string
        # result.ingest.project_id, result.generate.job_moved_to, etc.
        return f"done: {result.summary}"
    else:
        # result.step tells you where it stopped: "ingest" or "generate"
        # result.error has the details
        return f"failed at {result.step}: {result.error}"
```
## The `LLMCheck` callable
Used when the fast-path string equality fails during fuzzy matching. Should
return `True` iff two keywords are the same modulo plural form ("shaft" vs
"shafts", "company" vs "companies"). Return `False` for any other kind of
difference. Implementations should cache -- the workflow may call this
repeatedly with the same pair while scanning an inbox.

Example implementation (the one CheddahBot uses):
```python
import os

import httpx

# Assumption: the API key is supplied via the environment.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

_cache: dict[tuple[str, str], bool] = {}


def openrouter_plural_check(a: str, b: str) -> bool:
    # Order-insensitive key so (a, b) and (b, a) share one cache entry.
    key = (a, b) if a <= b else (b, a)
    if key in _cache:
        return _cache[key]
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": "anthropic/claude-haiku-4.5",
            "max_tokens": 5,
            "messages": [
                {"role": "system", "content":
                    "Reply with only 'YES' or 'NO'. YES iff the two keywords "
                    "are identical except for singular/plural form."},
                {"role": "user", "content": f'A: "{a}"\nB: "{b}"'},
            ],
        },
        timeout=15,
    )
    result = "YES" in resp.json()["choices"][0]["message"]["content"].upper()
    _cache[key] = result
    return result
```
Tests may pass `lambda a, b: False` for the fast-path-only case, or any
deterministic fake.
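
Since implementations are expected to cache, one option is to wrap any checker in an order-insensitive memo so that `(a, b)` and `(b, a)` share an entry. The sketch below is an assumption about how a caller might do this, not part of the package: `cached_symmetric` and `naive_plural_fake` are hypothetical names, and the fake only handles a trailing `s`, which is enough for tests but not for real keywords.

```python
from typing import Callable

PluralCheck = Callable[[str, str], bool]  # same shape as LLMCheck


def cached_symmetric(check: PluralCheck) -> PluralCheck:
    """Wrap `check` with a cache keyed on the unordered pair of keywords."""
    cache: dict[tuple[str, str], bool] = {}

    def wrapper(a: str, b: str) -> bool:
        key = (a, b) if a <= b else (b, a)
        if key not in cache:
            # The wrapped checker always sees the pair in sorted order.
            cache[key] = check(*key)
        return cache[key]

    return wrapper


def naive_plural_fake(a: str, b: str) -> bool:
    """Deterministic test fake: True only if one word is the other plus 's'."""
    return a + "s" == b or b + "s" == a
```

A test can then pass `cached_symmetric(naive_plural_fake)` wherever an `LLMCheck` is expected, and a production caller can wrap a network-backed checker the same way.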
## The `PipelineResult` dataclass
Every pipeline entry point returns the same shape:

| field           | meaning                                                        |
|-----------------|----------------------------------------------------------------|
| `ok` | True if the pipeline completed the phase it was asked to do |
| `step` | "ingest" / "generate" / "complete" (on success) or where it failed |
| `ingest` | `IngestResult` if ingest ran, else None |
| `generate` | `GenerateResult` if generate ran, else None |
| `error` | Human-readable error message (empty on success) |
| `summary` | Multi-line human-readable summary, safe to post as a comment |
| `project_name` | The BLM project name |
| `job_file` | Path to the final job file (post-move on success) |
| `log_lines` | Progress messages captured during the run |
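
As an illustration of consuming these fields, a caller might reduce a result to a status and a comment for its own task system. This is a sketch under assumptions: `ResultLike` and `to_task_update` are hypothetical, the field types are inferred from the table, and the `"done"` / `"failed"` status strings belong to the imagined caller, not to this package.

```python
from types import SimpleNamespace
from typing import Protocol


class ResultLike(Protocol):
    """Subset of the fields in the table above; types are assumed."""

    ok: bool
    step: str
    error: str
    summary: str


def to_task_update(result: ResultLike) -> tuple[str, str]:
    """Map a pipeline result to a (status, comment) pair for the caller."""
    if result.ok:
        return "done", result.summary
    return "failed", f"Stopped at step '{result.step}': {result.error}"


# Stand-in object for illustration; a real call would pass a PipelineResult.
fake = SimpleNamespace(ok=False, step="ingest", error="xlsx unreadable", summary="")
status, comment = to_task_update(fake)
```

Typing against a small protocol rather than the concrete class keeps the handler easy to test with plain stand-in objects.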
## What this package does NOT do
- Does not watch folders. No threads, no polling loops.
- Does not know about ClickUp, Linear, or any task system. The caller owns
task state and decides what status transitions mean.
- Does not sync with shared-folder job queues (the old AutoCora queue).
- Does not manage the Cora tool itself. It only consumes xlsx files that
Cora has already produced.
- Does not pick up where BLM leaves off. When BLM finishes `generate-batch`,
the job is done from this package's perspective.

These were deliberate drops during extraction. CheddahBot had folder-watch
threads, ClickUp auto-matching, AutoCora queue submission, and a multi-inbox
distribution loop. Paperclip owns that scheduling logic in its own code.
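
Because polling stays with the caller, a caller-side wait loop might look roughly like the sketch below. Everything in it is hypothetical: `wait_for_match`, its parameters, and the injected `find` callable, which stands in for something like a `find_xlsx_for_keyword` call with its arguments already bound. The `sleep` parameter is injectable so tests do not actually wait.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_match(
    find: Callable[[], Optional[T]],
    attempts: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `find` up to `attempts` times, sleeping between tries.

    Returns the first non-None result, or None if every attempt came up empty.
    """
    for attempt in range(attempts):
        found = find()
        if found is not None:
            return found
        if attempt < attempts - 1:
            sleep(interval_seconds)  # no sleep after the final attempt
    return None
```

A caller could pass `lambda: find_xlsx_for_keyword("/data/Cora-For-Humans", task.keyword, deps.llm_check)` as `find`, keeping the cadence and retry budget entirely on its own side.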
## Development
Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).
```
uv sync # install dev + test deps
uv run pytest # run the test suite (89 tests, ~96% coverage)
uv run ruff check . # lint
```
## Provenance
Extracted from the CheddahBot repo, specifically:
- `cheddahbot/tools/linkbuilding.py` -- pipeline logic and fuzzy matching
- `cheddahbot/tools/autocora.py` -- only the fuzzy-match helpers were kept;
the shared-folder job queue and result polling were dropped
- `cheddahbot/scheduler.py` -- folder-watch loops were dropped; their
matching logic was converted to a synchronous `find_xlsx_for_keyword` call

The BLM invocation parameters, stdout parsing regexes, and default ratios
match CheddahBot's production behavior exactly.