
# Linkman-Paperclip-Wrap

A standalone Python package wrapping the Big-Link-Man (BLM) CLI for use by
Paperclip agents. Extracted from CheddahBot (`cheddahbot/tools/linkbuilding.py`)
and simplified for consumption by external callers.

## What it does

Given a task keyword, the package can:

1. **Find a matching CORA `.xlsx`** in an inbox folder (e.g. `Cora-For-Humans/`)
   using fuzzy keyword matching with singular/plural awareness.
2. **Invoke Big-Link-Man** to run `ingest-cora` and `generate-batch` on that
   xlsx, producing the backlink content.
3. **Return a structured result** the caller can use to update task state.

There is no folder watching, no task-system coupling, and no notification
logic. The caller owns task state and polling cadence; this package only
performs the work when called.
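
As a rough illustration of step 1, the sketch below shows one way a fast-path
comparison followed by a plural-check fallback could pick a file from a list
of inbox filenames. The `_sketch` function names and the normalization rule
are assumptions made for illustration; the package's own
`normalize_for_match`, `fuzzy_keyword_match`, and `find_xlsx_for_keyword` may
behave differently.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, Optional

# Same shape as the plural checker described in this README: (a, b) -> bool.
PluralCheck = Callable[[str, str], bool]


def normalize_sketch(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single spaces (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def keywords_match_sketch(a: str, b: str, plural_check: PluralCheck) -> bool:
    """Fast path: normalized equality. Fallback: ask the plural checker."""
    na, nb = normalize_sketch(a), normalize_sketch(b)
    if na == nb:
        return True
    return plural_check(na, nb)


def match_filename_sketch(
    filenames: Iterable[str], keyword: str, plural_check: PluralCheck
) -> Optional[str]:
    """Return the first .xlsx filename (in sorted order) whose stem matches the keyword."""
    for name in sorted(filenames):
        path = Path(name)
        if path.suffix.lower() != ".xlsx":
            continue  # ignore anything that is not a spreadsheet
        if keywords_match_sketch(path.stem, keyword, plural_check):
            return name
    return None
```

A caller could feed this with `[p.name for p in Path(inbox).iterdir()]`. In
practice, prefer the package's `find_xlsx_for_keyword`, which is the supported
entry point.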

## Package layout

```
src/link_building_workflow/
  deps.py       -- Deps, BLMConfig, LLMCheck types
  matching.py   -- Keyword normalization and fuzzy matching
  inbox.py      -- Inbox folder scanning (list / find-by-keyword)
  blm.py        -- BLM subprocess wrapper and stdout parsers
  pipeline.py   -- run_cora_backlinks, blm_ingest_cora, blm_generate_batch
  __init__.py   -- Public API re-exports
```

## Installation

```
uv add git+https://git.peninsulaindustries.com/bryanb/Linkman-Paperclip-Wrap.git
```

Big-Link-Man itself is a separate dependency the caller provides. Install it
on the same host and point `BLMConfig.blm_dir` at the checkout.
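
To keep BLM credentials out of source code, a caller might assemble the
`BLMConfig` arguments from environment variables. The sketch below is one way
to do that. The `BLM_*` variable names and the helper name are hypothetical,
and the fallback timeout of 1800 seconds simply mirrors the value used in the
usage example in this README, not a documented default.

```python
import os
from typing import Mapping


def blm_config_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Collect BLMConfig keyword arguments from environment variables.

    The BLM_* names are hypothetical; use whatever your deployment defines.
    """
    required = ("BLM_DIR", "BLM_USERNAME", "BLM_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "blm_dir": env["BLM_DIR"],
        "username": env["BLM_USERNAME"],
        "password": env["BLM_PASSWORD"],
        # Assumed fallback; matches the value shown in the usage example.
        "timeout_seconds": int(env.get("BLM_TIMEOUT_SECONDS", "1800")),
    }
```

The result can then be passed as `BLMConfig(**blm_config_kwargs_from_env())`.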

## Public API

All public names are importable from the top level:

```python
from link_building_workflow import (
    # Dependency types
    Deps, BLMConfig, LLMCheck,
    # Matching primitives
    normalize_for_match, fuzzy_keyword_match, filename_stem_to_keyword,
    # Inbox scanning
    InboxMatch, list_inbox_xlsx, find_xlsx_for_keyword, find_all_xlsx_for_keyword,
    # Pipeline entry points
    PipelineResult, run_cora_backlinks, blm_ingest_cora, blm_generate_batch,
    # Low-level BLM (if you need to run a custom BLM command)
    IngestResult, GenerateResult, build_ingest_args,
    parse_ingest_output, parse_generate_output, run_blm_command,
)
```

## Typical usage (Paperclip)

The caller decides when a task is eligible to run (all required task fields
filled in, xlsx present in the inbox). This package provides the primitives
to check the xlsx gate and to execute the work.

```python
from link_building_workflow import (
    Deps, BLMConfig, find_xlsx_for_keyword, run_cora_backlinks,
)

deps = Deps(
    blm=BLMConfig(
        blm_dir="/opt/big-link-man",
        username="your-blm-user",
        password="your-blm-pass",
        timeout_seconds=1800,
    ),
    llm_check=your_plural_checker,  # Callable[[str, str], bool]
)

def try_run_link_building(task):
    # Caller gates 1-4: task-field checks (LB Method, Keyword, IMSURL, ...)
    if not (task.keyword and task.imsurl):
        return "blocked: missing task fields"

    # Gate 5: does a matching xlsx exist yet?
    match = find_xlsx_for_keyword(
        "/data/Cora-For-Humans",
        task.keyword,
        deps.llm_check,
    )
    if match is None:
        return "blocked: no xlsx in Cora-For-Humans"

    # Execute
    result = run_cora_backlinks(
        xlsx_path=str(match.path),
        project_name=task.keyword,
        money_site_url=task.imsurl,
        custom_anchors=task.custom_anchors or "",
        cli_flags=task.cli_flags or "",
        branded_plus_ratio=task.branded_plus_ratio,  # None -> BLMConfig default
        deps=deps,
    )

    if result.ok:
        # result.summary is a multi-line human-readable string
        # result.ingest.project_id, result.generate.job_moved_to, etc.
        return f"done: {result.summary}"
    else:
        # result.step tells you where it stopped: "ingest" or "generate"
        # result.error has the details
        return f"failed at {result.step}: {result.error}"
```

## The `LLMCheck` callable

Used when the fast-path string equality fails during fuzzy matching. Should
return `True` iff two keywords are the same modulo plural form ("shaft" vs
"shafts", "company" vs "companies"). Return `False` for any other kind of
difference. Implementations should cache -- the workflow may call this
repeatedly with the same pair while scanning an inbox.

Example implementation (the one CheddahBot uses):

```python
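import os

import httpx

# Read the API key from the environment (the variable name is an assumption).
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

# Results keyed by the order-independent keyword pair.
_cache: dict[tuple[str, str], bool] = {}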

def openrouter_plural_check(a: str, b: str) -> bool:
    key = (a, b) if a <= b else (b, a)
    if key in _cache:
        return _cache[key]
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": "anthropic/claude-haiku-4.5",
            "max_tokens": 5,
            "messages": [
                {"role": "system", "content":
                 "Reply with only 'YES' or 'NO'. YES iff the two keywords "
                 "are identical except for singular/plural form."},
                {"role": "user", "content": f'A: "{a}"\nB: "{b}"'},
            ],
        },
        timeout=15,
    )
    result = "YES" in resp.json()["choices"][0]["message"]["content"].upper()
    _cache[key] = result
    return result
```

Tests may pass `lambda a, b: False` for the fast-path-only case, or any
deterministic fake.
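
One possible deterministic fake is sketched below. It is not part of the
package, and `fake_plural_check` is a hypothetical name. It only recognizes
the simple English suffix rules `-s`, `-es`, and `-y` to `-ies`, applied to
the end of the string, which is usually enough for unit tests.

```python
def fake_plural_check(a: str, b: str) -> bool:
    """Deterministic stand-in for an LLMCheck, covering simple English plurals only."""

    def plural_forms(word: str) -> set:
        # Candidate plurals built by suffix rules; irregular plurals are not covered.
        forms = {word + "s", word + "es"}
        if word.endswith("y"):
            forms.add(word[:-1] + "ies")
        return forms

    a, b = a.strip().lower(), b.strip().lower()
    # Identical strings trivially match; otherwise one must be a plural of the other.
    return a == b or b in plural_forms(a) or a in plural_forms(b)
```

Because the function is pure, it needs no cache and no network access, so
tests stay fast and repeatable.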

## The `PipelineResult` dataclass

Every pipeline entry point returns the same shape:

| field           | meaning                                                                  |
|-----------------|--------------------------------------------------------------------------|
| `ok`            | `True` if the pipeline completed the phase it was asked to do            |
| `step`          | `"ingest"` / `"generate"` / `"complete"` (on success) or where it failed |
| `ingest`        | `IngestResult` if ingest ran, else `None`                                |
| `generate`      | `GenerateResult` if generate ran, else `None`                            |
| `error`         | Human-readable error message (empty on success)                          |
| `summary`       | Multi-line human-readable summary, safe to post as a comment             |
| `project_name`  | The BLM project name                                                     |
| `job_file`      | Path to the final job file (post-move on success)                        |
| `log_lines`     | Progress messages captured during the run                                |
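
To show how a caller might consume these fields, here is a minimal sketch
that maps a result onto a task update. `PipelineResultSketch` is a stand-in
defined only for this example: the field names come from the table above, but
the types and defaults are assumptions, and `to_task_update` with its
`status`/`comment` keys is hypothetical caller-side code.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class PipelineResultSketch:
    """Stand-in mirroring the documented fields; types and defaults are assumed."""

    ok: bool
    step: str
    ingest: Optional[Any] = None
    generate: Optional[Any] = None
    error: str = ""
    summary: str = ""
    project_name: str = ""
    job_file: str = ""
    log_lines: List[str] = field(default_factory=list)


def to_task_update(result: PipelineResultSketch, tail: int = 5) -> dict:
    """Turn a result into a status plus a comment for a task system."""
    if result.ok:
        return {"status": "done", "comment": result.summary}
    comment = f"Failed at {result.step}: {result.error}"
    last_lines = result.log_lines[-tail:]
    if last_lines:
        # Include the end of the progress log to help with diagnosis.
        comment += "\n\nLast log lines:\n" + "\n".join(last_lines)
    return {"status": "failed", "comment": comment}
```

The same function should work with the real `PipelineResult`, provided its
attributes match the table.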

## What this package does NOT do

- Does not watch folders. No threads, no polling loops.
- Does not know about ClickUp, Linear, or any task system. The caller owns
  task state and decides what status transitions mean.
- Does not sync with shared-folder job queues (the old AutoCora queue).
- Does not manage the Cora tool itself. It only consumes xlsx files that
  Cora has already produced.
- Does not pick up where BLM leaves off. When BLM finishes `generate-batch`,
  the job is done from this package's perspective.

These were deliberate drops during extraction. CheddahBot had folder-watch
threads, ClickUp auto-matching, AutoCora queue submission, and a multi-inbox
distribution loop. Paperclip owns that scheduling logic in its own code.
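
For callers other than Paperclip, the sketch below illustrates what
caller-owned polling could look like. Everything in it is hypothetical:
`poll_tasks` is not part of this package, and it assumes a `try_run` callable
that returns strings beginning with `blocked`, `done`, or `failed`, in the
style of `try_run_link_building` from the usage example.

```python
import time
from typing import Callable, Dict, List


def poll_tasks(
    pending: List[str],
    try_run: Callable[[str], str],
    interval_seconds: float = 60.0,
    max_rounds: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Retry blocked tasks on a fixed cadence; return the last outcome per task."""
    outcomes: Dict[str, str] = {}
    remaining = list(pending)
    for round_number in range(max_rounds):
        still_blocked = []
        for task_id in remaining:
            outcome = try_run(task_id)
            outcomes[task_id] = outcome
            if outcome.startswith("blocked"):
                still_blocked.append(task_id)  # e.g. xlsx not in the inbox yet
        remaining = still_blocked
        if not remaining:
            break
        if round_number < max_rounds - 1:
            sleep(interval_seconds)  # wait before the next round
    return outcomes
```

Injecting `sleep` keeps the loop testable without real delays. A production
scheduler would likely add logging and per-task backoff.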

## Development

Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).

```
uv sync                    # install dev + test deps
uv run pytest              # run the test suite (89 tests, ~96% coverage)
uv run ruff check .        # lint
```

## Provenance

Extracted from the CheddahBot repo, specifically:

- `cheddahbot/tools/linkbuilding.py` -- pipeline logic and fuzzy matching
- `cheddahbot/tools/autocora.py` -- only the fuzzy-match helpers were kept;
  the shared-folder job queue and result polling were dropped
- `cheddahbot/scheduler.py` -- folder-watch loops were dropped; their
  matching logic was converted to a synchronous `find_xlsx_for_keyword` call

The BLM invocation parameters, stdout parsing regexes, and default ratios
match CheddahBot's production behavior exactly.