8.7.5. dasLLAMA-05 — Performance
Three dials matter most for CPU inference speed: the JIT, the thread pool,
and the weight quantization. The tutorial measures each of them with
stats() on your machine:
daslang.exe -jit tutorials/dasLLAMA/05_performance.das -- path/to/model.gguf
8.7.5.1. Always -jit
The kernels are daslang code; the LLVM JIT compiles them to native SIMD.
Interpreted inference is orders of magnitude slower, so never benchmark it.
jit_enabled() tells you which mode you are running in. On top of that,
options _jit_fast_math = true lets the JIT relax FP ordering in the
kernels (non-bit-exact, roughly 10% faster), which matches how llama.cpp
compiles its own kernels; the parity test suite stays bit-exact.
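Why relaxed FP ordering costs bit-exactness can be shown with a small, language-neutral sketch. The Python below is an illustration only, not dasLLAMA or JIT code: `sequential_sum` and `lane_sum` are hypothetical helpers that add the same four numbers in two different orders, the second one mimicking a two-lane SIMD reduction.

```python
def sequential_sum(xs):
    """Add the values strictly left to right."""
    total = 0.0
    for x in xs:
        total += x
    return total


def lane_sum(xs, lanes):
    """Accumulate into `lanes` interleaved partial sums, then combine them.

    This mimics how a vectorized reduction reorders the additions.
    """
    partial = [0.0] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] += x
    return sequential_sum(partial)


values = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(values))  # 1.0: the first 1.0 is absorbed by 1e16
print(lane_sum(values, 2))     # 2.0: both 1.0 values survive in their own lane
```

Both results are valid floating-point sums of the same inputs; they differ only because the additions happen in a different order, which is the kind of freedom a fast-math mode gives the compiler.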
8.7.5.2. Threads: the job queue
The matmul kernels split their rows across the job queue, so they require
one: model code that runs outside with_job_que() panics.
set_jobque_fork_pool(true, true) pools the per-job fork contexts; because the
kernels are pure data-parallel jobs, their contexts can be reused without
re-initialization.
The worker count is fixed when the queue is created; the
DAS_JOBQUE_THREADS environment variable overrides the (deliberately
conservative) default. Re-run the tutorial with DAS_JOBQUE_THREADS=1 to
see what threading buys — the win grows with model size, since tiny models
have tiny matmuls.
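The row-splitting idea described above can be sketched in a few lines. This is a generic Python illustration using the standard library thread pool, not the dasLLAMA job queue: `matvec_rows` and `parallel_matvec` are hypothetical names, and the worker count is fixed when the pool is created, as in the text.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(matrix, vec, start, stop):
    """Compute output rows [start, stop) of matrix @ vec."""
    return [sum(w * x for w, x in zip(matrix[r], vec)) for r in range(start, stop)]


def parallel_matvec(pool, workers, matrix, vec):
    """Split the rows into contiguous chunks, one job per chunk."""
    rows = len(matrix)
    chunk = max(1, (rows + workers - 1) // workers)
    futures = [
        pool.submit(matvec_rows, matrix, vec, start, min(start + chunk, rows))
        for start in range(0, rows, chunk)
    ]
    out = []
    for future in futures:  # collect in submission order to keep rows ordered
        out.extend(future.result())
    return out


WORKERS = 4  # fixed at pool creation
matrix = [[float(r * 3 + c) for c in range(3)] for r in range(10)]
vec = [1.0, 2.0, 3.0]

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    result = parallel_matvec(pool, WORKERS, matrix, vec)

print(result == matvec_rows(matrix, vec, 0, len(matrix)))
```

The sketch shows the partitioning only; pure-Python threads do not speed up arithmetic. It also hints at why tiny models gain little: with ten rows and four workers, each job gets at most three rows, so per-job overhead dominates the useful work.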
8.7.5.3. Prefill vs generation
stats() separates the two phases because their physics differ: the prompt
runs as one batched forward (compute-bound, fast), while generation runs
one token per forward (memory-bandwidth-bound, slower). ttft_s — time
to first token — spans the prefill plus the first generation step.
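How the two phases turn into separate numbers can be sketched as follows. The record below is hypothetical: `PhaseTimings` and its field names are illustrative and are not the actual stats() layout, and the timings are made-up round numbers chosen so the arithmetic is easy to follow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhaseTimings:
    """Hypothetical timing record (not the real stats() structure)."""
    prompt_tokens: int
    prefill_s: float          # one batched forward over the whole prompt
    gen_step_s: List[float]   # one forward per generated token


def summarize(t):
    """Derive per-phase throughput and time to first token."""
    return {
        "prefill_tps": t.prompt_tokens / t.prefill_s,
        "gen_tps": len(t.gen_step_s) / sum(t.gen_step_s),
        # The first token only exists after the prefill AND one generation step.
        "ttft_s": t.prefill_s + t.gen_step_s[0],
    }


timings = PhaseTimings(prompt_tokens=64, prefill_s=0.5,
                       gen_step_s=[0.25, 0.25, 0.25, 0.25])
summary = summarize(timings)
print(summary)
```

Reporting a single blended tokens-per-second figure would hide the gap between the two phases, which is why they are kept apart.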
8.7.5.4. Quantization: memory vs fidelity
QuantMode picks the in-memory weight representation, regardless of what the
GGUF file stores. Measured on SmolLM2-135M (Apple M1 Max, 8 jobs):
| Mode | Resident | Throughput |
|---|---|---|
| fp32 | 546 MB | prefill 471 t/s, gen 86 t/s; the token-exact reference |
| q8 | 241 MB | prefill 1391 t/s, gen 219 t/s; the everyday choice |
| q4 | 189 MB | prefill 70 t/s, gen 69 t/s; smallest, scalar kernel only |
Generation is bandwidth-bound, so between fp32 and q8 smaller weights are also
faster. q4 currently has no batched prefill kernel — it’s the smallest
footprint, not the fastest path.
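A rough back-of-the-envelope check of the bandwidth argument, using only the measurements quoted above. The assumptions are loud ones: the parameter count is read off the model name, 1 MB is taken as 10^6 bytes, and the whole resident footprint is treated as if it were read once per generated token, which overstates the weight traffic if the footprint includes other buffers.

```python
PARAMS = 135e6  # assumption: parameter count implied by the name SmolLM2-135M

# mode -> (resident MB, prefill t/s, gen t/s), copied from the measurements above
MEASURED = {
    "fp32": (546, 471, 86),
    "q8": (241, 1391, 219),
    "q4": (189, 70, 69),
}


def bytes_per_weight(resident_mb):
    """Resident bytes divided by parameter count (an upper bound on weight width)."""
    return resident_mb * 1e6 / PARAMS


def implied_gb_per_s(resident_mb, gen_tps):
    """Memory traffic if every resident byte is streamed once per generated token."""
    return resident_mb * gen_tps / 1000.0


report = {
    mode: (bytes_per_weight(mb), implied_gb_per_s(mb, gen))
    for mode, (mb, _prefill, gen) in MEASURED.items()
}

for mode, (bpw, gbps) in report.items():
    print(f"{mode}: {bpw:.2f} bytes/weight, ~{gbps:.1f} GB/s implied during generation")
```

Under these assumptions fp32 and q8 land at a similar implied rate (about 47 and 53 GB/s), which is what a bandwidth-bound phase would look like: shrinking the weights raises tokens per second almost proportionally. q4 comes out far lower (about 13 GB/s), consistent with its scalar kernel, rather than memory, being the limit.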
Where to go deeper: modules/dasLLAMA/tune_for_this_box.md covers kernel
tuning (token-block size, unrolls, the [dasllama_grid] tuner) when you
want to squeeze a specific machine.
See also
Full source: tutorials/dasLLAMA/05_performance.das
Next tutorial: dasLLAMA-06 — The Architecture Registry