8.7.5. dasLLAMA-05 — Performance

Three dials matter for CPU inference speed: the JIT, the thread pool, and the weight quantization. The tutorial measures all three with stats() on your machine:

daslang.exe -jit tutorials/dasLLAMA/05_performance.das -- path/to/model.gguf

8.7.5.1. Always -jit

The kernels are daslang code; the LLVM JIT compiles them to native SIMD. Interpreted inference is orders of magnitude slower, so never benchmark it. jit_enabled() tells you which mode you are running in. On top of that, options _jit_fast_math = true lets the JIT relax FP ordering in the kernels. Results are then no longer bit-exact, and the speedup is roughly 10%. This matches how llama.cpp compiles its own kernels; the parity test suite stays bit-exact.
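Why relaxed FP ordering costs bit-exactness: floating-point addition is not associative, so a compiler that is free to regroup a reduction can change the last bits of the result. The sketch below is plain Python rather than daslang and only illustrates the arithmetic; it makes no claim about which reorderings the JIT actually performs.

```python
# Floating-point addition is not associative: regrouping the same
# three terms changes the rounded result.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

print(left_to_right == regrouped)        # False
print(abs(left_to_right - regrouped))    # a difference in the last bit
```

The difference is tiny, which is why fast math is usually acceptable for inference but not for a test suite that compares outputs bit for bit.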

8.7.5.2. Threads: the job queue

The matmul kernels split their rows across the job queue — they require one (model code outside with_job_que() panics). set_jobque_fork_pool(true, true) pools the per-job fork contexts: the kernels are pure data-parallel jobs, so their contexts can be reused and skip re-init.
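To make the row split concrete, here is a minimal Python sketch of a matrix-vector product whose rows are divided into contiguous chunks, one job per chunk. It is an illustration of the general technique, not dasLLAMA's kernel: the names matvec_rows and parallel_matvec are hypothetical, and Python threads stand in for the job queue (because of the interpreter lock they show the partitioning, not the speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_rows(weights, x, row_start, row_end):
    """Compute output rows [row_start, row_end) of weights @ x."""
    return [sum(w * v for w, v in zip(weights[r], x))
            for r in range(row_start, row_end)]

def parallel_matvec(weights, x, workers):
    """Split the rows into contiguous chunks and run one job per chunk."""
    n_rows = len(weights)
    chunk = max(1, (n_rows + workers - 1) // workers)   # ceiling division
    ranges = [(s, min(s + chunk, n_rows)) for s in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: matvec_rows(weights, x, *r), ranges)
        # map() yields results in submission order, so chunks
        # concatenate back in row order.
        return [y for part in parts for y in part]

weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 1.0]
print(parallel_matvec(weights, x, workers=2))   # [3.0, 7.0, 11.0]
```

Each job touches only its own rows and writes only its own outputs, which is the sense in which such jobs are pure data-parallel.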

The worker count is fixed when the queue is created; the DAS_JOBQUE_THREADS environment variable overrides the (deliberately conservative) default. Re-run the tutorial with DAS_JOBQUE_THREADS=1 to see what threading buys — the win grows with model size, since tiny models have tiny matmuls.

8.7.5.3. Prefill vs generation

stats() separates the two phases because their physics differ: the prompt runs as one batched forward (compute-bound, fast), while generation runs one token per forward (memory-bandwidth-bound, slower). ttft_s — time to first token — spans the prefill plus the first generation step.
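A rough way to reason about ttft_s from the two throughput figures, assuming prefill time scales linearly with prompt length and the first generation step costs the same as any other. The function name estimate_ttft_s is hypothetical and the inputs are illustrative; this is a back-of-the-envelope model, not how stats() computes the value.

```python
def estimate_ttft_s(prompt_tokens, prefill_tps, gen_tps):
    """Prefill time for the whole prompt plus one generation step."""
    prefill_s = prompt_tokens / prefill_tps
    first_step_s = 1.0 / gen_tps
    return prefill_s + first_step_s

# Illustrative inputs: a 128-token prompt, 1391 t/s prefill, 219 t/s generation.
ttft = estimate_ttft_s(128, 1391.0, 219.0)
print(f"estimated ttft: {ttft * 1000:.1f} ms")
```

With these inputs the estimate comes out just under 0.1 s, and almost all of it is prefill: the single generation step adds only a few milliseconds.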

8.7.5.4. Quantization: memory vs fidelity

QuantMode picks the in-memory weight representation, regardless of what the GGUF file stores. Measured on SmolLM2-135M (Apple M1 Max, 8 jobs):

Mode   Resident   Prefill (t/s)   Generation (t/s)   Notes
-----  ---------  --------------  -----------------  ----------------------------
fp32   546 MB     471             86                 the token-exact reference
q8     241 MB     1391            219                the everyday choice
q4     189 MB     70              69                 smallest, scalar kernel only

Generation is bandwidth-bound, so between fp32 and q8 smaller weights are also faster. q4 currently has no batched prefill kernel — it’s the smallest footprint, not the fastest path.
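The bandwidth-bound claim can be sanity-checked from the table above. The sketch below assumes every resident byte is read once per generated token and that 1 MB is 10^6 bytes. Both are simplifications: the resident figure also includes buffers that are not streamed in full on each token, so the result is an upper-bound estimate, not a measured bandwidth.

```python
# (resident MB, generation tokens/s) as reported in the table above.
measurements = {
    "fp32": (546, 86),
    "q8": (241, 219),
    "q4": (189, 69),
}

def effective_gb_per_s(resident_mb, gen_tps):
    """Upper-bound estimate: every resident byte is read once per token."""
    return resident_mb * gen_tps / 1000.0

bandwidth = {mode: effective_gb_per_s(mb, tps)
             for mode, (mb, tps) in measurements.items()}
for mode, gbps in bandwidth.items():
    print(f"{mode}: ~{gbps:.1f} GB/s")
```

Under these assumptions fp32 and q8 land in the same range (roughly 47 and 53 GB/s), which is what a shared memory-bandwidth ceiling would produce, while q4 comes out near 13 GB/s, consistent with being limited by its scalar kernel and not by memory.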

Where to go deeper: modules/dasLLAMA/tune_for_this_box.md covers kernel tuning (token-block size, unrolls, the [dasllama_grid] tuner) for when you want to squeeze the most out of a specific machine.