6.6. detect-dupe — Cross-file similar-function detector

detect-dupe walks one or more directories of .das files, normalises every user function into an alpha-renamed token stream (identifiers, types, and literals all collapsed), and reports near-identical functions across the corpus. It is useful for surfacing test-suite boilerplate that could be factored, near-clones that drifted apart, or copy-pasted helpers that escaped review.

Two interfaces ship together: a CLI (utils/detect-dupe/main.das) and two MCP tools (export_corpus and detect_duplicates) that expose the same engine to AI coding assistants.

6.6.1. What it reports

Every output is one of two kinds of match:

  • Exact-clone clusters — functions whose canonical token streams are byte-identical. Pure structural duplicates, modulo identifier names, types, and literal values.

  • Fuzzy near-duplicates — pairs scored as sqrt(jaccard × len_ratio) over a 64-slot MinHash signature, with a hard len_ratio >= threshold gate. The length gate suppresses MinHash false-positives on highly periodic boilerplate (otherwise a 4-statement and a 7-statement copy of the same t |> run(...) block both look 100% identical to MinHash).

The geometric-mean score admits a Jaccard somewhat below threshold when lengths match closely. This is intentional and biases toward recall.

6.6.2. Quick start

CLI:

bin/daslang utils/detect-dupe/main.das -- -p tests --json /tmp/dupes.json
bin/daslang utils/detect-dupe/main.das -- -p tests/strings -t 0.85 -n 20
bin/daslang utils/detect-dupe/main.das -- -p daslib --no-fuzzy --min-tokens 32
bin/daslang utils/detect-dupe/main.das -- -p tests --keep all
bin/daslang utils/detect-dupe/main.das -- -?

JIT works too — bin/daslang -jit utils/detect-dupe/main.das -- ... (net runtime improvement is modest because per-file cost is dominated by interpreter compilation of the scanned files).

6.6.3. Flags

Flag

Default

Meaning

-p / --path

required*

File or directory to scan; repeatable. * one of -p, --paths-from, --paths-stdin is required (or --import-functions / --against)

--paths-from

off

Read newline-delimited paths from a file (#-comments and blank lines skipped). Composes with -p. Each entry can be a file or directory; directories recurse via the same scanner as -p

--paths-stdin

off

Read newline-delimited paths from stdin (#-comments and blank lines skipped). Composes with -p. Mutually exclusive with --against-from-stdin (one stdin reader per run)

-j / --workers

0 (auto)

Worker count for parallel --export-functions runs. 0 = hardware threads. 1 = sequential. Files are sorted, split into N contiguous chunks, each compiled by a child detect-dupe process; shards are merged in chunk-index order so output is byte-identical across worker counts. Below 16 input files the export stays sequential regardless. Ignored without --export-functions

-t / --threshold

0.7

Fuzzy similarity floor (0..1). Score is sqrt(jaccard × len_ratio) plus a hard len_ratio >= threshold gate

-n / --top

20

Top-N entries shown in stdout summary

--json

off

Path for full JSON report

-x / --no-fuzzy

off

Skip MinHash pass — exact clusters only

--min-tokens

8

Drop functions with fewer than N tokens (filters trivial wrappers)

-L / --lambdas-only

off

Skip top-level functions, keep only lambdas — useful for clustering dastest t |> run("…") @(t) { } bodies

--export-functions

off

Write all extracted functions to a JSON file and exit before clustering

--import-functions

off

Load functions from a JSON file (produced by --export-functions) instead of compiling. Mutually exclusive with --path and --export-functions

--baseline

off

B1: load corpus JSON; tag records whose member identity (file:line:name) isn’t in the baseline as candidates and filter to those

--baseline-strict

off

B1 modifier: also drop clusters whose canonical exists in the baseline (only fully-new clusters survive)

--against

off

B2 candidate path (file or directory). Repeatable. Compiled in-process; their functions are tagged candidates and the report is filtered

--against-from-stdin

off

B2: read newline-delimited candidate paths from stdin

--check

off

Exit non-zero when the post-filter report contains any clusters/pairs (CI gate)

--flat

off

In --against mode, force the flat clusters/pairs writer (default in --against mode is the per-candidate rollup)

-k / --keep

off

Pattern name to KEEP despite default skip (repeatable). Special value all disables pattern filtering entirely. See Pattern filter

-v / --verbose

off

Per-file progress

-?

Show help

builtin.das, daslib/debugger.das, and daslib/profiler.das are skipped automatically — the latter two install thread-local debug agents at compile time, which would abort the scanner on the second use.

6.6.4. Pattern filter

A “pattern” is a structural shape that signals “this function is boilerplate, not real code” — typically a wrapper or dispatcher whose canonical token stream contains zero unique signal beyond its repetition count. Pattern-matched functions are dropped from clustering by default, on the theory that they explode cluster sizes and fuzzy-pair counts without surfacing real duplicates.

Override per-pattern via --keep <name> (repeatable), or disable filtering wholesale via --keep all.

Currently shipped patterns:

Name

Detects

Why it’s boilerplate

visitor

Class-method whose hook name starts with visit, preVisit, postVisit, before, or after (matched by name, regardless of body)

AstVisitor overrides — the dispatch contract requires one method per AST node type, so cross-class duplication at the canonical level is structural, not actionable. Common in daslib/aot_cpp.das, daslib/ast_print.das, daslib/templates_boost.das, daslib/rst_comment.das, daslib/perf_lint.das.

dispatch

Function whose body is N >= 2 byte-identical top-level statement chunks

dastest’s t |> run("X") @(t) { } outer functions. Lambda bodies are collapsed to ADDR upstream, so two run calls look identical regardless of what the lambdas do. Same shape catches t |> bench(…), repeated-init blocks, and any uniform call list.

emit

1..6 top-level statements, each a single trivial CALL:foo(...) (literal/var/field args only — no nested calls, no control flow) or a RET ...

Emitter shells like def visitX(...) { write(*ss, ")") ; return that }. Catches free-function emitter wrappers that the visitor matcher (which is name-based) doesn’t cover.

Match order in classify() is name-first (visitor), then body-shape (dispatch, emit) — the first match wins. A visitor method whose body happens to fit the emit shape is still classified as visitor.

The summary line surfaces what was filtered:

collected 3097 record(s); 0 compile failure(s), 0 skipped (expect-directive)
patterns skipped: 6 dispatch, 90 emit, 601 visitor (--keep <name>|all to include)

--verbose prints one pattern-skip [name] file:line func (note) line per filtered record — useful for confirming the filter isn’t dropping real signal on a new corpus.

6.6.5. Canonical form

Each function emits a flat tag stream. Examples:

Source

Canonical (excerpt)

def add(a,b:int):int { return a+b }

FN ARG <var_0> TYP ARG <var_1> TYP TYP BODY BLK STMT RET OP2:+ <var_0> <var_1> ENDBLK ENDFN

def add(a,b:float):float { return a+b }

(same — types collapse)

def double(a:int) { return a*2 }

FN ARG <var_0> TYP BODY BLK STMT RET OP2:* <var_0> LIT ENDBLK ENDFN

User identifiers become <var_0>, <var_1>, … All types collapse to TYP. All literals become LIT. Field/swizzle names use .FLD / .SWZ. Called function names are kept (CALL:push vs CALL:emplace is real signal).

6.6.6. Modes

The flat report from -p is the firehose — useful once on a new corpus to calibrate, less useful day-to-day. Two filtered modes are layered on top via a single is_candidate flag inside FuncRecord. A cluster or fuzzy pair is kept iff at least one of its members is a candidate.

An AI judge (find_dupe — AI judge for detect-dupe clusters) can consume the resulting JSON report and triage clusters into real duplicates, partial matches, and false positives — useful when a flat report is too noisy to walk manually.

6.6.6.1. B1 — baseline diff (CI gate)

Snapshot the corpus once, commit the JSON, and on every PR re-scan and diff:

# one-off: build the baseline (commit this)
bin/daslang utils/detect-dupe/main.das -- -p tests --export-functions tests_baseline.json

# CI: scan again, surface only what isn't in the baseline
bin/daslang utils/detect-dupe/main.das -- -p tests --baseline tests_baseline.json --check

Records are tagged candidate when their member identity (file:line:name) is absent from the baseline. A cluster appears in the report if any of its members is a candidate, which catches both (a) brand-new canonicals and (b) growth — a new copy of an already-tracked canonical added in a new location.

Use --baseline-strict to drop case (b) — strict additionally filters out clusters whose canonical was already in the baseline, so only fully-new canonicals survive. Pairs aren’t strict-filtered (the baseline doesn’t carry MinHash signatures), so strict is cluster-only.

file:line:name keying means an unrelated edit that shifts line numbers will look like a “new member” and surface its cluster — acceptable for CI, since touched code is the right default to re-check.

--check turns the filtered report into a CI gate — non-zero exit when any cluster or pair survives the filter.

6.6.6.2. B2 — PR-files / interactive

“Did I just write something that already exists?” Compare a file list against a pre-built corpus:

bin/daslang utils/detect-dupe/main.das -- \
    --import-functions tests_baseline.json --against tests/strings/new_helper.das

# git pipeline:
git diff --name-only master | grep '\.das$' | \
    bin/daslang utils/detect-dupe/main.das -- \
        --import-functions tests_baseline.json --against-from-stdin

When --against and --import-functions are both set, corpus records whose file matches any candidate path are dropped first, then the candidate is freshly compiled — so the file is compared against the rest of the world, never against its own stale copy in the baseline. Look for the dropped N corpus records overridden line.

The default writer in --against mode is a per-candidate rollup (“for each function in the focus set, here are its top siblings”). Use --flat to revert to the legacy clusters+pairs view.

6.6.7. Export / import

Compilation is the expensive step; the canonical-form computation is deterministic. To hand the function list off to an external tool (visualizer, custom clusterer), or to shard compilation across machines and merge later, dump the post-canonicalization records and reload them:

bin/daslang utils/detect-dupe/main.das -- -p tests --export-functions /tmp/funcs.json
bin/daslang utils/detect-dupe/main.das -- --import-functions /tmp/funcs.json --json /tmp/dupes.json

--import-functions is mutually exclusive with both --path and --export-functions. --export-functions always exits before the clustering pass.

6.6.7.1. Parallel export (-j / --workers)

For large corpora the dominant cost is per-file compilation. --workers N fans the export across N child detect-dupe processes:

  • The full file list is sorted, split into N contiguous chunks, and written to per-worker temp files.

  • Each child runs --paths-from <chunk> --export-functions <shard> --workers 1 and produces its own envelope.

  • The parent reads shards back in chunk-index order and concatenates their functions arrays. Output is therefore byte-identical to a sequential --workers 1 run on the same inputs.

  • Below 16 input files the export stays sequential regardless — child-process startup dominates the savings on small lists.

Default (-j 0) is auto, equal to the host’s hardware thread count. Compile is CPU-bound, so oversubscription does not help. -j 1 forces sequential. Compile failures in any child fail the whole export (no partial corpus shipped) — same gate as the sequential path.

6.6.7.2. Explicit file-list inputs

--paths-from <file> and --paths-stdin accept newline- delimited file lists (#-comments and blank lines skipped). They compose with -p (union, deduplicated) and are typically used to scope an export to the files in a PR diff:

# via file (avoids ARG_MAX on big PRs)
git diff --name-only master | grep '\.das$' > /tmp/pr.txt
bin/daslang utils/detect-dupe/main.das -- \
    --paths-from /tmp/pr.txt --export-functions pr.json

# via stdin (same idea, no temp file)
git diff --name-only master | grep '\.das$' | \
    bin/daslang utils/detect-dupe/main.das -- \
        --paths-stdin --export-functions pr.json

Each entry can be a file or a directory; directories recurse via the same scanner -p uses. --paths-stdin is mutually exclusive with --against-from-stdin (one stdin reader per run).

The on-disk schema is a small envelope:

{
  "schema_version": 1,
  "functions": [
    {
      "name": "add_int",
      "file": "tests/foo.das",
      "line": 4,
      "is_lambda": false,
      "canonical": "FN ARG <var_0> TYP ..."
    }
  ]
}

MinHash signatures are not included — they’re recomputed on import (deterministic and cheap). On import, --no-fuzzy and --min-tokens apply just like in the compile path.

6.6.8. MCP integration

The MCP Server — AI Tool Integration server exposes two tools that wrap the engine end-to-end — no shelling out:

Tool

Purpose

export_corpus

Scan paths (files / directories / globs), compile each .das file in-process, and write a corpus JSON to out in the same shape detect_duplicates expects. Replacement for the CLI --export-functions.

detect_duplicates

Wraps B2 mode. Pass paths (newline- or comma-delimited, or a glob) and corpus (a JSON from export_corpus / --export-functions); receive a per-candidate JSON envelope with corpus stats, pattern-skip counts, and a per-candidate report containing the top-N exact and fuzzy matches.

Both tools surface the pattern filter via a keep parameter that mirrors the CLI’s --keep flag. The envelope also reports candidate_functions_pre_filter, so the caller can distinguish “no candidates compiled” from “all candidates were pattern-filtered out”.

6.6.9. Implementation

File

Role

canonical.das

CanonicalVisitor (extends daslib/ast AstVisitor) and tokenize_canonical

minhash.das

64-slot MinHash signatures over 5-grams, Jaccard estimate

cluster.das

Exact-bucket clustering + fuzzy all-pairs with length gate

report.das

JSON + stdout summary writer

main.das

CLI (daslib/clargs), file scan, compile-and-collect orchestration

pipeline.das

compile_and_collect / collect_from_program — compile-and-extract orchestration, shared by main.das and the test suite. Also hosts apply_pattern_filter and the filesystem scan helpers

patterns.das

Pattern matchers — classify(name, canonical) PatternHit. Default-skipped shapes (visitor methods, dispatchers, emitters); override via --keep <name>

exchange.das

On-disk JSON schema + writer/reader for --export-functions / --import-functions

fixture/synth.das

Hand-crafted fixture for smoke-testing the visitor end-to-end

fixture/canonical_cases.das

Narrowly-targeted fixture (one function per canonicalization concern) for unit-testing CanonicalVisitor

test_detect_dupe.das

dastest suite — run with bin/daslang dastest/dastest.das -- --test utils/detect-dupe/test_detect_dupe.das

6.6.10. Notes

  • Compile policy mirrors utils/lint: ignore_shared_modules, export_all. Optimisations and infer-time folding stay ON so dastest macros (e.g. unroll) compile.

  • Default mode drops everything generated (which includes lambdas). The dispatcher already references each lambda via an ADDR token, so the lambda’s structural fingerprint is partially preserved in the parent function. Use -L to flip this and cluster the lambda bodies themselves.

  • Lambda-only mode (``-L``) is dominated at the top by linq’s each macro emissions (a 100+-token GOTO/LABEL/_builtin_iterator_first/next/close shell that recurs hundreds of times). Real test-body signal starts a few clusters down; sort/grep accordingly.

  • Functions whose at.fileInfo points outside the compiled file are filtered out — without this, reified generics from required modules (e.g. dastest/testing.das) flood the report.

  • Per-source-line dedup: a generic reified for N types becomes N FunctionPtr instances all pointing at the same (file, line). detect-dupe keeps the first to avoid the same source location being counted N times.

6.6.11. Out of scope

  • LSH / banding (4.4K2 is fine; revisit at >50K functions).

  • Embedding-based similarity (would need an external service).

  • Auto-fix or refactor suggestions — discovery only.