# Linkman-Paperclip-Wrap

A standalone Python package wrapping the Big-Link-Man (BLM) CLI for use by
Paperclip agents. Extracted from CheddahBot (`cheddahbot/tools/linkbuilding.py`)
and simplified for consumption by external callers.

## What it does

Given a task keyword, the package can:

1. **Find a matching CORA `.xlsx`** in an inbox folder (e.g. `Cora-For-Humans/`)
   using fuzzy keyword matching with singular/plural awareness.
2. **Invoke Big-Link-Man** to run `ingest-cora` and `generate-batch` on that
   xlsx, producing the backlink content.
3. **Return a structured result** the caller can use to update task state.

No folder watching, no task-system coupling, no notifications. The caller owns
task state and polling cadence; this package is pure work.

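As a rough picture of what "fuzzy keyword matching with singular/plural awareness" can mean, the sketch below normalizes both strings, tries plain equality first, and only then defers to a caller-supplied plural checker. The names `sketch_normalize` and `sketch_keywords_match` are hypothetical; they are not the package's `normalize_for_match` / `fuzzy_keyword_match`, whose exact rules may differ.

```python
import re
from typing import Callable

# Hypothetical helpers for illustration only; the package's own
# normalize_for_match / fuzzy_keyword_match may apply different rules.


def sketch_normalize(text: str) -> str:
    """Lowercase and replace every run of non-alphanumerics with one space."""
    spaced = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return spaced.strip()


def sketch_keywords_match(
    a: str, b: str, plural_check: Callable[[str, str], bool]
) -> bool:
    """Fast path on normalized equality, then fall back to the plural checker."""
    na, nb = sketch_normalize(a), sketch_normalize(b)
    if not na or not nb:
        return False  # nothing meaningful to compare
    if na == nb:
        return True  # fast path: no checker call needed
    return plural_check(na, nb)
```

With a checker that always answers `False`, only the fast path can succeed, so `"Drive-Shaft Repair"` still matches `"drive shaft  repair"` while `"shaft"` does not match `"shafts"`.
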
## Package layout

```
src/link_building_workflow/
  deps.py       -- Deps, BLMConfig, LLMCheck types
  matching.py   -- Keyword normalization and fuzzy matching
  inbox.py      -- Inbox folder scanning (list / find-by-keyword)
  blm.py        -- BLM subprocess wrapper and stdout parsers
  pipeline.py   -- run_cora_backlinks, blm_ingest_cora, blm_generate_batch
  __init__.py   -- Public API re-exports
```

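To make the "inbox folder scanning" role concrete, here is a minimal sketch of what listing candidate spreadsheets in a folder might look like. `sketch_list_xlsx` is a hypothetical helper written for this README; it is not the package's `list_inbox_xlsx`, and the choice to skip Excel lock files (names starting with `~$`) is an assumption of the sketch.

```python
from pathlib import Path


def sketch_list_xlsx(inbox_dir: str) -> list[Path]:
    """Return .xlsx files directly inside inbox_dir, sorted by path.

    Skips Excel lock files (names starting with "~$") and anything that is
    not a regular file. Returns an empty list if the folder does not exist.
    """
    folder = Path(inbox_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p
        for p in folder.iterdir()
        if p.is_file()
        and p.suffix.lower() == ".xlsx"
        and not p.name.startswith("~$")
    )
```

Returning an empty list for a missing folder keeps the caller's "is the xlsx there yet?" check a simple truthiness test.
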
## Installation

```
uv add git+https://git.peninsulaindustries.com/bryanb/Linkman-Paperclip-Wrap.git
```

Big-Link-Man itself is a separate dependency the caller provides. Install it
on the same host and point `BLMConfig.blm_dir` at the checkout.

## Public API

All imports are available from the top level:

```python
from link_building_workflow import (
    # Dependency types
    Deps, BLMConfig, LLMCheck,
    # Matching primitives
    normalize_for_match, fuzzy_keyword_match, filename_stem_to_keyword,
    # Inbox scanning
    InboxMatch, list_inbox_xlsx, find_xlsx_for_keyword, find_all_xlsx_for_keyword,
    # Pipeline entry points
    PipelineResult, run_cora_backlinks, blm_ingest_cora, blm_generate_batch,
    # Low-level BLM (if you need to run a custom BLM command)
    IngestResult, GenerateResult, build_ingest_args,
    parse_ingest_output, parse_generate_output, run_blm_command,
)
```

## Typical usage (Paperclip)

The caller decides when a task is eligible to run (all required task fields
filled in, xlsx present in the inbox). This package provides the primitives
to check the xlsx gate and to execute the work.

```python
from link_building_workflow import (
    Deps, BLMConfig, find_xlsx_for_keyword, run_cora_backlinks,
)

deps = Deps(
    blm=BLMConfig(
        blm_dir="/opt/big-link-man",
        username="your-blm-user",
        password="your-blm-pass",
        timeout_seconds=1800,
    ),
    llm_check=your_plural_checker,  # Callable[[str, str], bool]
)

def try_run_link_building(task):
    # Caller gates 1-4: task-field checks (LB Method, Keyword, IMSURL, ...)
    if not (task.keyword and task.imsurl):
        return "blocked: missing task fields"

    # Gate 5: does a matching xlsx exist yet?
    match = find_xlsx_for_keyword(
        "/data/Cora-For-Humans",
        task.keyword,
        deps.llm_check,
    )
    if match is None:
        return "blocked: no xlsx in Cora-For-Humans"

    # Execute
    result = run_cora_backlinks(
        xlsx_path=str(match.path),
        project_name=task.keyword,
        money_site_url=task.imsurl,
        custom_anchors=task.custom_anchors or "",
        cli_flags=task.cli_flags or "",
        branded_plus_ratio=task.branded_plus_ratio,  # None -> BLMConfig default
        deps=deps,
    )

    if result.ok:
        # result.summary is a multi-line human-readable string
        # result.ingest.project_id, result.generate.job_moved_to, etc.
        return f"done: {result.summary}"
    else:
        # result.step tells you where it stopped: "ingest" or "generate"
        # result.error has the details
        return f"failed at {result.step}: {result.error}"
```

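The example above hardcodes the BLM credentials for brevity. One possible alternative is to read them from the environment. In the sketch below, the helper `blm_kwargs_from_env` and the variable names `BLM_DIR`, `BLM_USERNAME`, `BLM_PASSWORD`, and `BLM_TIMEOUT_SECONDS` are assumptions made for illustration; the package does not define them. The returned keys mirror the `BLMConfig` keyword arguments used in the example.

```python
import os
from typing import Mapping, Optional

# Environment variable names are hypothetical, chosen for this sketch.
_REQUIRED = ("BLM_DIR", "BLM_USERNAME", "BLM_PASSWORD")


def blm_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect BLMConfig keyword arguments from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "blm_dir": env["BLM_DIR"],
        "username": env["BLM_USERNAME"],
        "password": env["BLM_PASSWORD"],
        # 1800 mirrors the timeout used in the usage example.
        "timeout_seconds": int(env.get("BLM_TIMEOUT_SECONDS", "1800")),
    }
```

With the package installed, this would be used as `BLMConfig(**blm_kwargs_from_env())`. Failing fast on missing variables keeps a misconfigured host from reaching the BLM subprocess at all.
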
## The `LLMCheck` callable

Used when the fast-path string equality fails during fuzzy matching. Should
return `True` iff two keywords are the same modulo plural form ("shaft" vs
"shafts", "company" vs "companies"). Return `False` for any other kind of
difference. Implementations should cache -- the workflow may call this
repeatedly with the same pair while scanning an inbox.

Example implementation (the one CheddahBot uses):

```python
import os

import httpx

# Assumption: the API key is supplied via the environment.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

_cache = {}

def openrouter_plural_check(a: str, b: str) -> bool:
    # Order-insensitive cache key: (a, b) and (b, a) share one entry.
    key = (a, b) if a <= b else (b, a)
    if key in _cache:
        return _cache[key]
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": "anthropic/claude-haiku-4.5",
            "max_tokens": 5,
            "messages": [
                {"role": "system", "content":
                    "Reply with only 'YES' or 'NO'. YES iff the two keywords "
                    "are identical except for singular/plural form."},
                {"role": "user", "content": f'A: "{a}"\nB: "{b}"'},
            ],
        },
        timeout=15,
    )
    result = "YES" in resp.json()["choices"][0]["message"]["content"].upper()
    _cache[key] = result
    return result
```

Tests may pass `lambda a, b: False` for the fast-path-only case, or any
deterministic fake.

## The `PipelineResult` dataclass

Every pipeline entry point returns the same shape:

| field          | meaning                                                            |
|----------------|--------------------------------------------------------------------|
| `ok`           | True if the pipeline completed the phase it was asked to do        |
| `step`         | "ingest" / "generate" / "complete" (on success) or where it failed |
| `ingest`       | `IngestResult` if ingest ran, else None                            |
| `generate`     | `GenerateResult` if generate ran, else None                        |
| `error`        | Human-readable error message (empty on success)                    |
| `summary`      | Multi-line human-readable summary, safe to post as a comment       |
| `project_name` | The BLM project name                                               |
| `job_file`     | Path to the final job file (post-move on success)                  |
| `log_lines`    | Progress messages captured during the run                          |

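A caller will typically branch on `ok` and `step`. The sketch below shows one way to turn a result into a task-comment string. `StubResult` is a hypothetical stand-in carrying a subset of the fields from the table so the snippet runs without BLM; real code would receive a `PipelineResult` from one of the pipeline entry points. `format_task_comment` is likewise an illustrative name, not part of the package.

```python
from dataclasses import dataclass, field


@dataclass
class StubResult:
    """Hypothetical stand-in with a subset of the PipelineResult fields."""
    ok: bool
    step: str
    error: str = ""
    summary: str = ""
    project_name: str = ""
    log_lines: list[str] = field(default_factory=list)


def format_task_comment(result, max_log_lines: int = 5) -> str:
    """Build a comment body from any object exposing the documented fields."""
    if result.ok:
        return f"Link building done for {result.project_name}.\n{result.summary}"
    # On failure, `step` names the phase that stopped and `error` says why.
    lines = [f"Link building failed at {result.step}: {result.error}"]
    tail = result.log_lines[-max_log_lines:]
    if tail:
        lines.append("Last log lines:")
        lines.extend(f"  {line}" for line in tail)
    return "\n".join(lines)
```

Keeping the formatter duck-typed means the same function works on the stub in tests and on the real result in production.
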
## What this package does NOT do

- Does not watch folders. No threads, no polling loops.
- Does not know about ClickUp, Linear, or any task system. The caller owns
  task state and decides what status transitions mean.
- Does not sync with shared-folder job queues (the old AutoCora queue).
- Does not manage the Cora tool itself. It only consumes xlsx files that
  Cora has already produced.
- Does not pick up where BLM leaves off. When BLM finishes `generate-batch`,
  the job is done from this package's perspective.

These were deliberate drops during extraction. CheddahBot had folder-watch
threads, ClickUp auto-matching, AutoCora queue submission, and a multi-inbox
distribution loop. Paperclip owns that scheduling logic in its own code.

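Since polling is left to the caller, a caller-side loop might look something like the sketch below. Everything here is hypothetical: `poll_once` and `poll_loop` are not part of this package or of Paperclip, and the status-prefix convention simply follows the strings returned by `try_run_link_building` in the usage example (`"done: ..."`, `"blocked: ..."`, `"failed at ...: ..."`).

```python
import time
from typing import Callable, Iterable, Optional


def poll_once(tasks: Iterable[object], try_run: Callable[[object], str]) -> dict[str, list]:
    """Run one pass over the tasks and bucket them by the first word of the status."""
    buckets: dict[str, list] = {"done": [], "blocked": [], "failed": []}
    for task in tasks:
        status = try_run(task)
        # "failed at ingest: ..." -> "failed"; "blocked: ..." -> "blocked"
        prefix = status.split(":", 1)[0].split(" ", 1)[0]
        buckets.setdefault(prefix, []).append(task)
    return buckets


def poll_loop(
    get_tasks: Callable[[], Iterable[object]],
    try_run: Callable[[object], str],
    on_pass: Callable[[dict], None],
    interval_seconds: float = 60.0,
    max_passes: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call poll_once repeatedly; max_passes=None loops until interrupted."""
    passes = 0
    while max_passes is None or passes < max_passes:
        on_pass(poll_once(get_tasks(), try_run))
        passes += 1
        sleep(interval_seconds)
```

Injecting `sleep` and `max_passes` keeps the loop testable without real waiting, and `on_pass` is where the caller would apply its own status transitions.
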
## Development

Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).

```
uv sync              # install dev + test deps
uv run pytest        # run the test suite (89 tests, ~96% coverage)
uv run ruff check .  # lint
```

## Provenance

Extracted from the CheddahBot repo, specifically:

- `cheddahbot/tools/linkbuilding.py` -- pipeline logic and fuzzy matching
- `cheddahbot/tools/autocora.py` -- only the fuzzy-match helpers were kept;
  the shared-folder job queue and result polling were dropped
- `cheddahbot/scheduler.py` -- folder-watch loops were dropped; their
  matching logic was converted to a synchronous `find_xlsx_for_keyword` call

The BLM invocation parameters, stdout parsing regexes, and default ratios
match CheddahBot's production behavior exactly.