6.7. find_dupe — AI judge for detect-dupe clusters

find_dupe (directory: utils/find-dupe/) consumes a detect-dupe JSON report and asks Claude (via the das-claude daspkg, module anthropic/anthropic) to partition each cluster into real-duplicate groups vs false positives, with a one-line reason. Output is JSON (machine, for CI gates / future tooling) + Markdown (human, for review).

The tool is an install-then-run script, not a utility built into the binary: the anthropic/anthropic dependency is fetched at runtime via daspkg, so it can't be linked into daslang.exe's all_utils_exe bundle at build time.

Warning

find_dupe sends source code to Anthropic’s API. Each cluster’s full function bodies (verbatim, with file paths and line numbers) are uploaded as part of every judge call. Do not run this tool on proprietary, confidential, or otherwise restricted code unless your project’s data-handling policy permits sending that source to Anthropic. See Anthropic’s terms. Use --dry-run to preview cluster size and token estimates without making any API calls.

6.7.1. Install

From the project root:

bin/daslang utils/daspkg/main.das -- install --root utils/find-dupe

This fetches the das-claude package per utils/find-dupe/.das_package and unpacks it under utils/find-dupe/modules/. If you skip this step, bin/daslang utils/find-dupe/main.das fails at compile time with module 'anthropic/anthropic' not found — re-run the install command above. See daspkg — Package Manager for daspkg semantics.

6.7.2. API key

Set ANTHROPIC_API_KEY so daslang’s non-interactive subshells can see it. On macOS the easiest path is ~/.zshenv (sourced by every zsh invocation, interactive or not):

echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshenv

~/.zshrc works for terminals you opened yourself but not for GUI-launched processes (VS Code, Claude Code’s MCP server) — those inherit env from launchctl. ~/.zshenv covers both.

6.7.3. Smoke test

Verify the wiring before running over real clusters:

bin/daslang utils/find-dupe/main.das -- --smoke-test

Expected output:

Running das-claude smoke test (model=claude-haiku-4-5-20251001)...
Reply: OK
Tokens: in=15 out=4
Smoke test PASSED.

6.7.4. Workflow

  1. Run detect-dupe to produce a JSON report:

    bin/daslang utils/detect-dupe/main.das -- -p <paths> --json ./dupes.json
    
  2. Dry-run find_dupe for a cost preview (no API calls):

    bin/daslang utils/find-dupe/main.das -- --input ./dupes.json --dry-run -v
    

    Output: cluster counts, estimated input/output tokens, dollar estimate.

  3. Live run to produce verdicts:

    bin/daslang utils/find-dupe/main.das -- --input ./dupes.json -v
    

    Writes find_dupe_verdicts.json (machine) and find_dupe_verdicts.md (human) to ./find-dupe-out/ (override the directory with --out, the basename with --out-basename).

Run from the project root. detect-dupe records source paths relative to its cwd; find_dupe extracts each function’s body from disk using those paths, so its cwd must match.

6.7.5. Flags

Flag              Default          Meaning
-i, --input       required         detect-dupe JSON report file
-o, --out         ./find-dupe-out  output directory
--model           haiku            haiku (Haiku 4.5) or sonnet (Sonnet 4.6)
--min-lines       6                skip clusters where every member is shorter than this
--max-clusters    0                hard cap on clusters analyzed (0 = no cap)
--sort-by         default          default (composite), size, or fuzzy_first
--parallel        8                concurrent judge calls (1 = sequential)
--positives-only  off              filter reports to actionable rows only: real + partial
--dry-run         off              estimate tokens & cost without making API calls
--smoke-test      off              one-shot API call to verify the env wiring
-v, --verbose     off              per-cluster progress to stdout
-?, --show-help   —                print help and exit

The default sort puts fuzzy pairs first (highest AI value), then larger clusters, then larger total span — useful with --max-clusters and when interrupting a long run.

6.7.6. Output schema

find_dupe_verdicts.json — top-level fields:

Field           Type    Description
schema_version  int     bumps on schema changes
model           string  resolved model id
summary         object  totals + token usage
verdicts        array   one VerdictRow per cluster

Each VerdictRow carries cluster_id, kind (exact/fuzzy), similarity (for fuzzy), members (with file/line/end_line), verdict (real/partial/false_positive/skipped/error), groups (the partition), false_positives (indices), reason, and per-call token usage.

find_dupe_verdicts.md — human-readable, per-cluster sections with verdict tag, reason, member list, partition, and the original source (collapsed under <details>).

6.7.7. MCP integration

The MCP Server — AI Tool Integration server exposes two tools that wrap this CLI. Both shell out to daslang utils/find-dupe/main.das — the anthropic/anthropic dependency is fetched by daspkg at runtime, so in-process require would force every MCP user to install daspkg packages they may never use. When the subprocess fails because the package isn’t installed, the tool surfaces a structured “anthropic daspkg not installed” error with the exact install command.

Tool              Purpose
judge_duplicates  Take a detect-dupe JSON report and return the verdict envelope.
                  Parameters: input (required), out, model, max_clusters,
                  positives_only, dry_run.
find_dupe         Convenience: run detect-dupe against paths and judge the
                  resulting clusters in a single call. Parameters: paths
                  (required), out, model, threshold, max_clusters, dry_run.

The MCP layer hardcodes parallel=8 and min_lines=6 (sensible defaults for an interactive tool); use the CLI directly if you need to tune those.

Both tools accept dry_run=true to estimate cluster count and token cost without making any API calls — the safe default for “is this worth running?” probes.
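The exact wire format depends on the MCP client, but using the parameters listed above, a cost-preview call to judge_duplicates might carry arguments like:

```json
{
  "name": "judge_duplicates",
  "arguments": {
    "input": "./dupes.json",
    "dry_run": true
  }
}
```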

6.7.8. Pricing

Approximate, as of 2026-04:

Model       Input ($/MTok)  Output ($/MTok)
Haiku 4.5   1.00            5.00
Sonnet 4.6  3.00            15.00

A typical cluster (3-4 functions, 50 lines total) consumes ~1 KTok in + 0.3 KTok out → ~$0.0025 on Haiku. Use --dry-run first if a run looks large.

6.7.9. Implementation

File                      Role
main.das                  CLI (daslib/clargs), pipeline orchestration, parallel
                          fan-out via daslib/jobque_boost
cluster_input.das         detect-dupe JSON shadow structs + read_input_report
source_extract.das        FileCache + extract_source(path, start_line, end_line)
                          for pulling function bodies from disk
judge.das                 Verdict shape, parse_verdict, classify_verdict, prompt
                          formatter, JSON-Schema for the report_verdict tool
judge_live.das            live_judge, the actual Anthropic API call. Imports
                          anthropic/anthropic (daspkg).
report.das                OutputReport shape, JSON + Markdown writers
tests/test_find_dupe.das  dastest suite (hermetic; no API calls)

6.7.10. See also