8.7.4. dasLLAMA-04 — Sessions, the KV Cache, and Memory

The model is the expensive, read-only part (the weights). A Session is one generation stream’s state: the attention KV cache plus scratch buffers. This tutorial sizes that cache, runs independent sessions over one model, hand-rolls the loop inside generate, and frees everything properly.

Run it like tutorial 01:

daslang.exe -jit tutorials/dasLLAMA/04_sessions_and_memory.das -- path/to/model.gguf

8.7.4.1. The KV cache, and why you cap seq_len

The KV cache is sized to config.seq_len at create_session time: roughly 2 * n_layers * seq_len * kv_dim floats, that is, one key vector and one value vector per layer per position. Models ship with big native contexts (a Llama-3-family model with a native seq_len of 131072 needs tens of GB of fp32 KV), so cap seq_len to the context you actually need before creating sessions:

m.config.seq_len = min(m.config.seq_len, 1024l)
var s = create_session(m)

On SmolLM2-135M that’s the difference between ~377 MB per session at the native 8192 and ~47 MB at 1024.
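
The two figures above can be checked with a few lines of arithmetic. The sketch below is not part of dasLLAMA: kv_cache_bytes is a hypothetical helper, and the SmolLM2-135M shape (30 layers, kv_dim = 3 KV heads * 64 = 192) is an assumption taken from the published model config rather than from this tutorial. It assumes fp32 storage at 4 bytes per float.

```daslang
options gen2

// Hypothetical helper, not a dasLLAMA function.
// 2 (K and V) * n_layers * seq_len * kv_dim floats, 4 bytes per fp32 float.
def kv_cache_bytes(n_layers : int64; seq_len : int64; kv_dim : int64) : int64 {
    return 2l * n_layers * seq_len * kv_dim * 4l
}

[export]
def main() {
    // Assumed SmolLM2-135M shape: 30 layers, kv_dim = 192.
    let native_bytes = kv_cache_bytes(30l, 8192l, 192l)   // 377487360, about 377 MB
    let capped_bytes = kv_cache_bytes(30l, 1024l, 192l)   // 47185920, about 47 MB
    print("native 8192: {native_bytes} bytes\n")
    print("capped 1024: {capped_bytes} bytes\n")
}
```

Because the size is linear in seq_len, cutting the context by 8x cuts the per-session cache by the same factor.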

8.7.4.2. One model, many sessions

Each session has its own position (n_past), KV cache, and sampling RNG — two sessions over one model don’t see each other. This is how one loaded model serves many conversations.
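
A minimal sketch of that isolation, written as a fragment in the same style as the other snippets here. It assumes m is an already-loaded model, that prompt_a and prompt_b are hypothetical tokenized prompts (array<int64>) prepared as in the earlier tutorials, and that n_past is readable as a field on the session.

```daslang
// Assumptions: `m` is loaded, `prompt_a` / `prompt_b` are array<int64> token lists,
// and `n_past` is exposed as a session field.
var a = create_session(m)
var b = create_session(m)
eval(m, a, prompt_a)            // advances only a's position and KV cache
eval(m, b, prompt_b)            // advances only b's position and KV cache
print("a.n_past = {a.n_past}, b.n_past = {b.n_past}\n")
```

Each position should equal the length of that session's own prompt. Every extra session costs one more KV cache, while the weights are shared.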

8.7.4.3. eval and sample by hand

generate is a convenience over two primitives. eval runs tokens at the session’s current position and advances it — prefill is just the whole prompt in one call. sample picks the next token from session.logits:

var s = create_session(m)
eval(m, s, prompt)              // prefill: n_past goes 0 -> length(prompt)
var one : array<int64>
one |> resize(1)
for (_step in range64(16l)) {
    let tok = sample(s, SamplingParams())
    print("{piece(m, tok)}")
    one[0] = tok
    eval(m, s, one)             // feed the sampled token back in
}

generate adds only the timing stats and the repetition-penalty window on top of this skeleton; the eval/sample loop itself is all the model ever does.

8.7.4.4. Freeing: persistent_heap + delete

A one-shot script can just exit — the context frees everything. A long-lived process that loads model after model must free explicitly, and that takes both ingredients:

options persistent_heap at the top of the file — a malloc-backed heap where delete returns memory immediately (on the default linear heap a mid-context delete is a no-op);
an explicit delete of every Session and Model you created.

options persistent_heap
// ...
delete s
delete m   // heap drops by the model's full footprint

One practical note baked into the tutorial file itself: each section lives in its own function. A daslang function’s stack frame is statically sized for all its locals, and the default 16 KB context stack must also fit the model’s forward-pass call chain — piling every section’s locals into one fat main() overflows it. Keep model-driving main functions lean (or raise options stack).
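
The layout described above might look like the following sketch. The section function names are hypothetical placeholders. The stack value is an arbitrary illustration, not a recommendation from the tutorial, and the options stack = N form is assumed.

```daslang
options gen2
options stack = 65536           // hypothetical value and assumed syntax; only if lean functions are not enough

// Each section keeps its locals in its own stack frame.
def section_kv_cache() {
    print("section: KV cache sizing\n")
}

def section_sessions() {
    print("section: independent sessions\n")
}

[export]
def main() {
    // main stays lean: no large locals here, just calls.
    section_kv_cache()
    section_sessions()
}
```

Splitting the work this way leaves more of the context stack free for the forward-pass call chain, because main no longer reserves space for every section's locals.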