.. _utils_find_dupe: .. index:: single: Utils; find_dupe single: Utils; AI duplicate judge single: Utils; Cluster triage ========================================================== find_dupe --- AI judge for detect-dupe clusters ========================================================== ``find_dupe`` (directory: ``utils/find-dupe/``) consumes a :ref:`detect-dupe ` JSON report and asks Claude (via the ``das-claude`` daspkg, module ``anthropic/anthropic``) to **partition** each cluster into real-duplicate groups vs false positives, with a one-line reason. Output is JSON (machine, for CI gates / future tooling) + Markdown (human, for review). The tool is an install-then-run script, *not* a built-into-binary utility --- the ``anthropic/anthropic`` dependency is fetched at runtime via :ref:`daspkg `, so it can't be linked into ``daslang.exe``'s ``all_utils_exe`` bundle at build time. .. warning:: ``find_dupe`` **sends source code to Anthropic's API.** Each cluster's full function bodies (verbatim, with file paths and line numbers) are uploaded as part of every judge call. Do not run this tool on proprietary, confidential, or otherwise restricted code unless your project's data-handling policy permits sending that source to Anthropic. See `Anthropic's terms `_. Use ``--dry-run`` to preview cluster size and token estimates without making any API calls. .. contents:: :local: :depth: 2 Install ======= From the project root:: bin/daslang utils/daspkg/main.das -- install --root utils/find-dupe This fetches the ``das-claude`` package per ``utils/find-dupe/.das_package`` and unpacks it under ``utils/find-dupe/modules/``. If you skip this step, ``bin/daslang utils/find-dupe/main.das`` fails at compile time with ``module 'anthropic/anthropic' not found`` --- re-run the install command above. See :ref:`utils_daspkg` for daspkg semantics. API key ======= Set ``ANTHROPIC_API_KEY`` so daslang's non-interactive subshells can see it. On macOS the easiest path is ``~/.zshenv`` (sourced by every zsh invocation, interactive or not):: echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshenv ``~/.zshrc`` works for terminals you opened yourself but **not** for GUI-launched processes (VS Code, Claude Code's MCP server) --- those inherit env from launchctl. ``~/.zshenv`` covers both. Smoke test ========== Verify the wiring before running over real clusters:: bin/daslang utils/find-dupe/main.das -- --smoke-test Expected output:: Running das-claude smoke test (model=claude-haiku-4-5-20251001)... Reply: OK Tokens: in=15 out=4 Smoke test PASSED. Workflow ======== 1. Run :ref:`detect-dupe ` to produce a JSON report:: bin/daslang utils/detect-dupe/main.das -- -p --json ./dupes.json 2. Dry-run ``find_dupe`` for a cost preview (no API calls):: bin/daslang utils/find-dupe/main.das -- --input ./dupes.json --dry-run -v Output: cluster counts, estimated input/output tokens, dollar estimate. 3. Live run to produce verdicts:: bin/daslang utils/find-dupe/main.das -- --input ./dupes.json -v Writes ``find_dupe_verdicts.json`` (machine) and ``find_dupe_verdicts.md`` (human) to ``./find-dupe-out/`` (override the directory with ``--out``, the basename with ``--out-basename``). Run from the project root. ``detect-dupe`` records source paths relative to its cwd; ``find_dupe`` extracts each function's body from disk using those paths, so its cwd must match. Flags ===== .. list-table:: :header-rows: 1 :widths: 25 18 57 * - Flag - Default - Meaning * - ``-i``, ``--input`` - required - detect-dupe JSON report file * - ``-o``, ``--out`` - ``./find-dupe-out`` - output directory * - ``--model`` - ``haiku`` - ``haiku`` (Haiku 4.5) or ``sonnet`` (Sonnet 4.6) * - ``--min-lines`` - ``6`` - skip clusters where every member is shorter than this * - ``--max-clusters`` - ``0`` - hard cap on clusters analyzed (0 = no cap) * - ``--sort-by`` - ``default`` - ``default`` (composite), ``size``, or ``fuzzy_first`` * - ``--parallel`` - ``8`` - concurrent judge calls (1 = sequential) * - ``--positives-only`` - off - filter reports to actionable rows only: real + partial * - ``--dry-run`` - off - estimate tokens & cost without making API calls * - ``--smoke-test`` - off - one-shot API call to verify the env wiring * - ``-v``, ``--verbose`` - off - per-cluster progress to stdout * - ``-?``, ``--show-help`` - - print help and exit The default sort puts fuzzy pairs first (highest AI value), then larger clusters, then larger total span --- useful with ``--max-clusters`` and when interrupting a long run. Output schema ============= ``find_dupe_verdicts.json`` --- top-level fields: .. list-table:: :header-rows: 1 :widths: 25 15 60 * - Field - Type - Description * - ``schema_version`` - int - bumps on schema changes * - ``model`` - string - resolved model id * - ``summary`` - object - totals + token usage * - ``verdicts`` - array - one ``VerdictRow`` per cluster Each ``VerdictRow`` carries ``cluster_id``, ``kind`` (``exact``/``fuzzy``), ``similarity`` (for fuzzy), ``members`` (with file/line/end_line), ``verdict`` (``real``/``partial``/``false_positive``/``skipped``/``error``), ``groups`` (the partition), ``false_positives`` (indices), ``reason``, and per-call token usage. ``find_dupe_verdicts.md`` --- human-readable, per-cluster sections with verdict tag, reason, member list, partition, and the original source (collapsed under ``
``). MCP integration =============== The :ref:`utils_mcp` server exposes two tools that wrap this CLI. Both shell out to ``daslang utils/find-dupe/main.das`` --- the ``anthropic/anthropic`` dependency is fetched by daspkg at runtime, so in-process ``require`` would force every MCP user to install daspkg packages they may never use. When the subprocess fails because the package isn't installed, the tool surfaces a structured "anthropic daspkg not installed" error with the exact install command. .. list-table:: :header-rows: 1 :widths: 25 75 * - Tool - Purpose * - ``judge_duplicates`` - Take a detect-dupe JSON report and return the verdict envelope. Parameters: ``input`` (required), ``out``, ``model``, ``max_clusters``, ``positives_only``, ``dry_run``. * - ``find_dupe`` - Convenience: run detect-dupe against ``paths`` and judge the resulting clusters in a single call. Parameters: ``paths`` (required), ``out``, ``model``, ``threshold``, ``max_clusters``, ``dry_run``. The MCP layer hardcodes ``parallel=8`` and ``min_lines=6`` (sensible defaults for an interactive tool); use the CLI directly if you need to tune those. Both tools accept ``dry_run=true`` to estimate cluster count and token cost without making any API calls --- the safe default for "is this worth running?" probes. Pricing ======= Approximate, as of 2026-04: .. list-table:: :header-rows: 1 :widths: 25 35 40 * - Model - Input ($/MTok) - Output ($/MTok) * - Haiku 4.5 - 1.00 - 5.00 * - Sonnet 4.6 - 3.00 - 15.00 A typical cluster (3-4 functions, 50 lines total) consumes ~1 KTok in + 0.3 KTok out → ~$0.0025 on Haiku. Use ``--dry-run`` first if a run looks large. Implementation ============== .. list-table:: :header-rows: 1 :widths: 30 70 * - File - Role * - ``main.das`` - CLI (``daslib/clargs``), pipeline orchestration, parallel fan-out via ``daslib/jobque_boost`` * - ``cluster_input.das`` - detect-dupe JSON shadow structs + ``read_input_report`` * - ``source_extract.das`` - ``FileCache`` + ``extract_source(path, start_line, end_line)`` for pulling function bodies from disk * - ``judge.das`` - ``Verdict`` shape, ``parse_verdict``, ``classify_verdict``, prompt formatter, JSON-Schema for the ``report_verdict`` tool * - ``judge_live.das`` - ``live_judge`` --- the actual Anthropic API call. Imports ``anthropic/anthropic`` (daspkg). * - ``report.das`` - ``OutputReport`` shape, JSON + Markdown writers * - ``tests/test_find_dupe.das`` - dastest suite (hermetic; no API calls) See also ======== * :ref:`utils_detect_dupe` --- the cluster-producing pipeline whose JSON ``find_dupe`` consumes. * :ref:`utils_mcp` --- the MCP server that wraps both tools. * :ref:`utils_daspkg` --- the package manager that fetches ``das-claude``.