bryanb 381d51e001 Initial commit: link building workflow extracted from CheddahBot
Standalone package wrapping Big-Link-Man (BLM) for Paperclip. Extracted
from cheddahbot/tools/linkbuilding.py and related modules, with
task-system coupling, folder watching, and AutoCora queue logic
stripped out.

Public API:
- Deps, BLMConfig, LLMCheck (injection types)
- normalize_for_match, fuzzy_keyword_match, filename_stem_to_keyword
- list_inbox_xlsx, find_xlsx_for_keyword, find_all_xlsx_for_keyword
- blm_ingest_cora, blm_generate_batch, run_cora_backlinks (pipelines)
- PipelineResult, IngestResult, GenerateResult (return types)

89 tests, 96% coverage.
2026-04-22 12:11:16 +00:00
src/link_building_workflow
tests
.gitignore
README.md
pyproject.toml

README.md

Linkman-Paperclip-Wrap

A standalone Python package wrapping the Big-Link-Man (BLM) CLI for use by Paperclip agents. Extracted from CheddahBot (cheddahbot/tools/linkbuilding.py) and simplified for consumption by external callers.

What it does

Given a task keyword, the package can:

  1. Find a matching CORA .xlsx in an inbox folder (e.g. Cora-For-Humans/) using fuzzy keyword matching with singular/plural awareness.
  2. Invoke Big-Link-Man to run ingest-cora and generate-batch on that xlsx, producing the backlink content.
  3. Return a structured result the caller can use to update task state.

No folder watching, no task-system coupling, no notifications. The caller owns task state and polling cadence; this package only performs the work when called.
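
The matching in step 1 can be pictured as a two-stage check: a cheap comparison of normalized strings first, and the plural-aware callable only when that fails. The sketch below is illustrative only. `sketch_normalize`, `sketch_match`, and `PluralCheck` are hypothetical names, and the package's own `normalize_for_match` and `fuzzy_keyword_match` may normalize differently.

```python
import re
from typing import Callable

# Hypothetical stand-in for the LLMCheck type: (a, b) -> same modulo plural?
PluralCheck = Callable[[str, str], bool]


def sketch_normalize(text: str) -> str:
    """Lowercase, turn separators into spaces, and collapse whitespace."""
    text = re.sub(r"[-_]+", " ", text.lower())
    text = re.sub(r"[^a-z0-9 ]+", "", text)
    return " ".join(text.split())


def sketch_match(a: str, b: str, plural_check: PluralCheck) -> bool:
    """Fast path on normalized equality, then fall back to plural_check."""
    na, nb = sketch_normalize(a), sketch_normalize(b)
    if not na or not nb:
        return False
    if na == nb:
        return True  # fast path: the callable is never invoked
    return plural_check(na, nb)
```

With a checker that always returns False, only the fast path can match, so `sketch_match("Drive_Shaft-Repair", "drive shaft repair", lambda a, b: False)` is True while "drive shaft" vs "drive shafts" is not.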

Package layout

src/link_building_workflow/
  deps.py       -- Deps, BLMConfig, LLMCheck types
  matching.py   -- Keyword normalization and fuzzy matching
  inbox.py      -- Inbox folder scanning (list / find-by-keyword)
  blm.py        -- BLM subprocess wrapper and stdout parsers
  pipeline.py   -- run_cora_backlinks, blm_ingest_cora, blm_generate_batch
  __init__.py   -- Public API re-exports

Installation

uv add git+https://git.peninsulaindustries.com/bryanb/Linkman-Paperclip-Wrap.git

Big-Link-Man itself is a separate dependency the caller provides. Install it on the same host and point BLMConfig.blm_dir at the checkout.

Public API

All imports available from the top level:

from link_building_workflow import (
    # Dependency types
    Deps, BLMConfig, LLMCheck,
    # Matching primitives
    normalize_for_match, fuzzy_keyword_match, filename_stem_to_keyword,
    # Inbox scanning
    InboxMatch, list_inbox_xlsx, find_xlsx_for_keyword, find_all_xlsx_for_keyword,
    # Pipeline entry points
    PipelineResult, run_cora_backlinks, blm_ingest_cora, blm_generate_batch,
    # Low-level BLM (if you need to run a custom BLM command)
    IngestResult, GenerateResult, build_ingest_args,
    parse_ingest_output, parse_generate_output, run_blm_command,
)

Typical usage (Paperclip)

The caller decides when a task is eligible to run (all required task fields filled in, xlsx present in the inbox). This package provides the primitives to check the xlsx gate and to execute the work.

from link_building_workflow import (
    Deps, BLMConfig, find_xlsx_for_keyword, run_cora_backlinks,
)

deps = Deps(
    blm=BLMConfig(
        blm_dir="/opt/big-link-man",
        username="your-blm-user",
        password="your-blm-pass",
        timeout_seconds=1800,
    ),
    llm_check=your_plural_checker,  # Callable[[str, str], bool]
)

def try_run_link_building(task):
    # Caller gates 1-4: task-field checks (LB Method, Keyword, IMSURL, ...)
    if not (task.keyword and task.imsurl):
        return "blocked: missing task fields"

    # Gate 5: does a matching xlsx exist yet?
    match = find_xlsx_for_keyword(
        "/data/Cora-For-Humans",
        task.keyword,
        deps.llm_check,
    )
    if match is None:
        return "blocked: no xlsx in Cora-For-Humans"

    # Execute
    result = run_cora_backlinks(
        xlsx_path=str(match.path),
        project_name=task.keyword,
        money_site_url=task.imsurl,
        custom_anchors=task.custom_anchors or "",
        cli_flags=task.cli_flags or "",
        branded_plus_ratio=task.branded_plus_ratio,  # None -> BLMConfig default
        deps=deps,
    )

    if result.ok:
        # result.summary is a multi-line human-readable string
        # result.ingest.project_id, result.generate.job_moved_to, etc.
        return f"done: {result.summary}"
    else:
        # result.step tells you where it stopped: "ingest" or "generate"
        # result.error has the details
        return f"failed at {result.step}: {result.error}"

The LLMCheck callable

Called when the fast-path string equality check fails during fuzzy matching. It should return True if and only if the two keywords are the same apart from plural form ("shaft" vs "shafts", "company" vs "companies"), and False for any other kind of difference. Implementations should cache results, because the workflow may call the checker repeatedly with the same pair while scanning an inbox.

Example implementation (the one CheddahBot uses):

import os

import httpx

# Assumption: the API key is supplied through the environment.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

# Order-insensitive cache: (a, b) and (b, a) share one entry.
_cache: dict[tuple[str, str], bool] = {}

def openrouter_plural_check(a: str, b: str) -> bool:
    key = (a, b) if a <= b else (b, a)
    if key in _cache:
        return _cache[key]
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": "anthropic/claude-haiku-4.5",
            "max_tokens": 5,
            "messages": [
                {"role": "system", "content":
                 "Reply with only 'YES' or 'NO'. YES iff the two keywords "
                 "are identical except for singular/plural form."},
                {"role": "user", "content": f'A: "{a}"\nB: "{b}"'},
            ],
        },
        timeout=15,
    )
    # Surface HTTP errors directly rather than failing on a missing "choices" key.
    resp.raise_for_status()
    result = "YES" in resp.json()["choices"][0]["message"]["content"].upper()
    _cache[key] = result
    return result
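
The caching in that example can also be factored out so that any checker gets it for free. The following is a minimal sketch, not part of the package: `with_symmetric_cache` and `PluralCheck` are hypothetical names, and the cache is unbounded and lives for the lifetime of the wrapper.

```python
from typing import Callable

# Hypothetical alias for the checker signature described above.
PluralCheck = Callable[[str, str], bool]


def with_symmetric_cache(check: PluralCheck) -> PluralCheck:
    """Wrap `check` so each unordered keyword pair is evaluated at most once."""
    cache: dict[tuple[str, str], bool] = {}

    def cached(a: str, b: str) -> bool:
        key = (a, b) if a <= b else (b, a)  # (a, b) and (b, a) share a slot
        if key not in cache:
            cache[key] = check(a, b)
        return cache[key]

    return cached
```

A wrapped checker can then be passed wherever a plain one is expected, for example `with_symmetric_cache(openrouter_plural_check)` if the inner function is written without its own cache.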

Tests may pass lambda a, b: False for the fast-path-only case, or any deterministic fake.
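
One possible deterministic fake, for tests that need plural pairs to match without a network call, is sketched below. `naive_plural_check` is a hypothetical helper and not part of the package; it only understands the regular "-s" and "-ies" endings, so it is a test aid rather than a substitute for a real checker.

```python
def naive_plural_check(a: str, b: str) -> bool:
    """Deterministic test fake: True only for a few regular plural forms."""

    def singular(word: str) -> str:
        if word.endswith("ies") and len(word) > 3:
            return word[:-3] + "y"  # companies -> company
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]  # shafts -> shaft
        return word

    wa, wb = a.lower().split(), b.lower().split()
    if not wa or len(wa) != len(wb):
        return False
    # Compare word by word after stripping the plural ending.
    return [singular(w) for w in wa] == [singular(w) for w in wb]
```

Irregular plurals and "-es" endings such as "boxes" are deliberately not handled, which keeps the fake predictable.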

The PipelineResult dataclass

Every pipeline entry point returns the same shape:

field         meaning
ok            True if the pipeline completed the phase it was asked to do
step          "ingest" / "generate" / "complete" (on success), or the phase where it failed
ingest        IngestResult if ingest ran, else None
generate      GenerateResult if generate ran, else None
error         Human-readable error message (empty on success)
summary       Multi-line human-readable summary, safe to post as a comment
project_name  The BLM project name
job_file      Path to the final job file (post-move on success)
log_lines     Progress messages captured during the run
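
As an illustration of how a caller might consume these fields, the sketch below renders a result as a short task comment. `ResultStandIn` is a hypothetical stand-in that mirrors only some of the documented field names so the example runs on its own, and `to_task_comment` is likewise not part of the package; the field types shown are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ResultStandIn:
    """Hypothetical stand-in exposing a subset of the documented fields."""
    ok: bool
    step: str
    error: str = ""
    summary: str = ""
    project_name: str = ""
    log_lines: list[str] = field(default_factory=list)


def to_task_comment(result, max_log_lines: int = 5) -> str:
    """Render a result as a short comment a caller could post on a task."""
    if result.ok:
        return f"[{result.project_name}] done\n{result.summary}"
    # On failure, include the tail of the captured progress messages.
    tail = result.log_lines[-max_log_lines:] if max_log_lines > 0 else []
    lines = [f"[{result.project_name}] failed at {result.step}: {result.error}"]
    lines.extend(f"  {line}" for line in tail)
    return "\n".join(lines)
```

Because the function only reads attributes, it should work on any object that exposes the same field names.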

What this package does NOT do

  • Does not watch folders. No threads, no polling loops.
  • Does not know about ClickUp, Linear, or any task system. The caller owns task state and decides what status transitions mean.
  • Does not sync with shared-folder job queues (the old AutoCora queue).
  • Does not manage the Cora tool itself. It only consumes xlsx files that Cora has already produced.
  • Does not pick up where BLM leaves off. When BLM finishes generate-batch, the job is done from this package's perspective.

These were deliberate drops during extraction. CheddahBot had folder-watch threads, ClickUp auto-matching, AutoCora queue submission, and a multi-inbox distribution loop. Paperclip owns that scheduling logic in its own code.
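
Since polling is left to the caller, a caller-side loop could look roughly like the sketch below. It is not how Paperclip schedules work (that is not described here); `poll_until_settled` is a hypothetical helper, and it assumes the status-string convention from the usage example above, where a result starting with "blocked" means "try again later".

```python
import time
from typing import Callable


def poll_until_settled(
    attempt: Callable[[], str],
    interval_seconds: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `attempt` until it stops reporting 'blocked' or attempts run out."""
    status = "blocked: not attempted"
    for n in range(max_attempts):
        status = attempt()
        if not status.startswith("blocked"):
            return status  # e.g. "done: ..." or "failed at ..."
        if n < max_attempts - 1:
            sleep(interval_seconds)  # wait before the next attempt
    return status
```

For example, `poll_until_settled(lambda: try_run_link_building(task))` would retry while the xlsx gate is closed and return as soon as the run finishes or fails. The `sleep` parameter is injectable so tests can run without waiting.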

Development

Requires Python 3.11+ and uv.

uv sync                    # install dev + test deps
uv run pytest              # run the test suite (89 tests, ~96% coverage)
uv run ruff check .        # lint

Provenance

Extracted from the CheddahBot repo, specifically:

  • cheddahbot/tools/linkbuilding.py -- pipeline logic and fuzzy matching
  • cheddahbot/tools/autocora.py -- only the fuzzy-match helpers were kept; the shared-folder job queue and result polling were dropped
  • cheddahbot/scheduler.py -- folder-watch loops were dropped; their matching logic was converted to a synchronous find_xlsx_for_keyword call

The BLM invocation parameters, stdout parsing regexes, and default ratios match CheddahBot's production behavior exactly.