.. _tutorial_dasLLAMA_performance:

=========================
dasLLAMA-05 — Performance
=========================

.. index::
    single: Tutorial; dasLLAMA
    single: Tutorial; Performance
    single: Tutorial; Quantization

The three dials that matter for CPU inference speed: the JIT, the thread pool,
and the weight quantization. The tutorial measures all of them with
``stats()`` on your machine::

   daslang.exe -jit tutorials/dasLLAMA/05_performance.das -- path/to/model.gguf

Always -jit
===========

The kernels are daslang code; the LLVM JIT compiles them to native SIMD.
Interpreted inference is orders of magnitude slower — never benchmark it.
``jit_enabled()`` tells you which world you're in. On top of that,
``options _jit_fast_math = true`` lets the JIT relax FP ordering in the
kernels (non-bit-exact, roughly +10%) — matching how llama.cpp compiles its
own; the parity test suite stays bit-exact.

Threads: the job queue
======================

The matmul kernels split their rows across the job queue — they *require* one
(model code outside ``with_job_que()`` panics). ``set_jobque_fork_pool(true,
true)`` pools the per-job fork contexts: the kernels are pure data-parallel
jobs, so their contexts can be reused and skip re-init.

The worker count is fixed when the queue is created; the
``DAS_JOBQUE_THREADS`` environment variable overrides the (deliberately
conservative) default. Re-run the tutorial with ``DAS_JOBQUE_THREADS=1`` to
see what threading buys — the win grows with model size, since tiny models
have tiny matmuls.

Prefill vs generation
=====================

``stats()`` separates the two phases because their physics differ: the prompt
runs as **one batched forward** (compute-bound, fast), while generation runs
**one token per forward** (memory-bandwidth-bound, slower). ``ttft_s`` — time
to first token — spans the prefill plus the first generation step.

Quantization: memory vs fidelity
================================

``QuantMode`` picks the in-memory weight representation, whatever the GGUF
stores. Measured on SmolLM2-135M (Apple M1 Max, 8 jobs):

=========  ==================  ====================================================
Mode       Resident            Throughput
=========  ==================  ====================================================
``fp32``   546 MB              prefill 471 t/s, gen 86 t/s — the token-exact reference
``q8``     241 MB              prefill 1391 t/s, gen 219 t/s — the everyday choice
``q4``     189 MB              prefill 70 t/s, gen 69 t/s — smallest, scalar kernel only
=========  ==================  ====================================================

Generation is bandwidth-bound, so between fp32 and q8 smaller weights are also
faster. ``q4`` currently has no batched prefill kernel — it's the smallest
footprint, not the fastest path.

Where to go deeper: ``modules/dasLLAMA/tune_for_this_box.md`` covers kernel
tuning (token-block size, unrolls, the ``[dasllama_grid]`` tuner) when you
want to squeeze a specific machine.

.. seealso::

   Full source: :download:`tutorials/dasLLAMA/05_performance.das <../../../../tutorials/dasLLAMA/05_performance.das>`

   Next tutorial: :ref:`tutorial_dasLLAMA_add_an_arch`