6.7. find_dupe — AI judge for detect-dupe clusters

find_dupe (directory: utils/find-dupe/) consumes a detect-dupe JSON report and asks Claude (via the das-claude daspkg, module anthropic/anthropic) to partition each cluster into real-duplicate groups vs false positives, with a one-line reason. Output is JSON (machine, for CI gates / future tooling) + Markdown (human, for review).

The tool is an install-then-run script, not a utility built into the binary: the anthropic/anthropic dependency is fetched at runtime via daspkg, so it can't be linked into daslang.exe's all_utils_exe bundle at build time.

Warning

find_dupe sends source code to Anthropic’s API. Each cluster’s full function bodies (verbatim, with file paths and line numbers) are uploaded as part of every judge call. Do not run this tool on proprietary, confidential, or otherwise restricted code unless your project’s data-handling policy permits sending that source to Anthropic. See Anthropic’s terms. Use --dry-run to preview cluster size and token estimates without making any API calls.

6.7.1. Install

From the project root:

bin/daslang utils/daspkg/main.das -- install --root utils/find-dupe

This fetches the das-claude package per utils/find-dupe/.das_package and unpacks it under utils/find-dupe/modules/. If you skip this step, bin/daslang utils/find-dupe/main.das fails at compile time with module 'anthropic/anthropic' not found — re-run the install command above. See daspkg — Package Manager for daspkg semantics.

6.7.2. API key

Set ANTHROPIC_API_KEY so daslang’s non-interactive subshells can see it. On macOS the easiest path is ~/.zshenv (sourced by every zsh invocation, interactive or not):

echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshenv

~/.zshrc works for terminals you opened yourself but not for GUI-launched processes (VS Code, Claude Code’s MCP server) — those inherit env from launchctl. ~/.zshenv covers both.

6.7.3. Smoke test

Verify the wiring before running over real clusters:

bin/daslang utils/find-dupe/main.das -- --smoke-test

Expected output:

Running das-claude smoke test (model=claude-haiku-4-5-20251001)...
Reply: OK
Tokens: in=15 out=4
Smoke test PASSED.

6.7.4. Workflow

  1. Run detect-dupe to produce a JSON report:

    bin/daslang utils/detect-dupe/main.das -- -p <paths> --json ./dupes.json
    
  2. Dry-run find_dupe for a cost preview (no API calls):

    bin/daslang utils/find-dupe/main.das -- --input ./dupes.json --dry-run -v
    

    Output: cluster counts, estimated input/output tokens, dollar estimate.

  3. Live run to produce verdicts:

    bin/daslang utils/find-dupe/main.das -- --input ./dupes.json -v
    

    Writes find_dupe_verdicts.json (machine) and find_dupe_verdicts.md (human) to ./find-dupe-out/ (override the directory with --out, the basename with --out-basename).

Run from the project root. detect-dupe records source paths relative to its cwd; find_dupe extracts each function’s body from disk using those paths, so its cwd must match.

6.7.5. Flags

Flag              Default          Meaning
-i, --input       required         detect-dupe JSON report file
-o, --out         ./find-dupe-out  output directory
--model           haiku            haiku (Haiku 4.5) or sonnet (Sonnet 4.6)
--min-lines       6                skip clusters where every member is shorter than this
--max-clusters    0                hard cap on clusters analyzed (0 = no cap)
--sort-by         default          default (composite), size, or fuzzy_first
--parallel        8                concurrent judge calls (1 = sequential)
--positives-only  off              filter reports to actionable rows only: real + partial
--dry-run         off              estimate tokens & cost without making API calls
--smoke-test      off              one-shot API call to verify the env wiring
-v, --verbose     off              per-cluster progress to stdout
-?, --show-help   —                print help and exit

The default sort puts fuzzy pairs first (highest AI value), then larger clusters, then larger total span — useful with --max-clusters and when interrupting a long run.

6.7.6. Output schema

find_dupe_verdicts.json — top-level fields:

Field           Type    Description
schema_version  int     bumps on schema changes
model           string  resolved model id
summary         object  totals + token usage
verdicts        array   one VerdictRow per cluster

Each VerdictRow carries cluster_id, kind (exact/fuzzy), similarity (for fuzzy), members (with file/line/end_line), verdict (real/partial/false_positive/skipped/error), groups (the partition), false_positives (indices), reason, and per-call token usage.

find_dupe_verdicts.md — human-readable, per-cluster sections with verdict tag, reason, member list, partition, and the original source (collapsed under <details>).

6.7.7. MCP integration

The MCP Server — AI Tool Integration server exposes two tools that wrap this CLI. Both shell out to daslang utils/find-dupe/main.das — the anthropic/anthropic dependency is fetched by daspkg at runtime, so in-process require would force every MCP user to install daspkg packages they may never use. When the subprocess fails because the package isn’t installed, the tool surfaces a structured “anthropic daspkg not installed” error with the exact install command.

Tool              Purpose
judge_duplicates  Take a detect-dupe JSON report and return the verdict envelope.
                  Parameters: input (required), out, model, max_clusters,
                  positives_only, dry_run.
find_dupe         Convenience: run detect-dupe against paths and judge the
                  resulting clusters in a single call. Parameters: paths
                  (required), out, model, threshold, max_clusters, dry_run.

The MCP layer hardcodes parallel=8 and min_lines=6 (sensible defaults for an interactive tool); use the CLI directly if you need to tune those.

Both tools accept dry_run=true to estimate cluster count and token cost without making any API calls — the safe default for “is this worth running?” probes.
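The exact wire format depends on the MCP client, but using the parameters listed above, a cost-preview call to judge_duplicates might carry arguments like:

```json
{
  "name": "judge_duplicates",
  "arguments": {
    "input": "./dupes.json",
    "dry_run": true
  }
}
```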

6.7.8. Pricing

Approximate, as of 2026-04:

Model       Input ($/MTok)  Output ($/MTok)
Haiku 4.5   1.00            5.00
Sonnet 4.6  3.00            15.00

A typical cluster (3-4 functions, 50 lines total) consumes ~1 KTok in + 0.3 KTok out → ~$0.0025 on Haiku. Use --dry-run first if a run looks large.

6.7.9. Implementation

File                      Role
main.das                  CLI (daslib/clargs), pipeline orchestration, parallel
                          fan-out via daslib/jobque_boost
cluster_input.das         detect-dupe JSON shadow structs + read_input_report
source_extract.das        FileCache + extract_source(path, start_line, end_line)
                          for pulling function bodies from disk
judge.das                 Verdict shape, parse_verdict, classify_verdict, prompt
                          formatter, JSON-Schema for the report_verdict tool
judge_live.das            live_judge, the actual Anthropic API call. Imports
                          anthropic/anthropic (daspkg).
report.das                OutputReport shape, JSON + Markdown writers
tests/test_find_dupe.das  dastest suite (hermetic; no API calls)

6.7.10. See also