8.7.4. dasLLAMA-04 — Sessions, the KV Cache, and Memory
The model is the expensive, read-only part (the weights). A Session is one
generation stream’s state: the attention KV cache plus scratch buffers. This
tutorial sizes that cache, runs independent sessions over one model, hand-rolls
the loop inside generate, and frees everything properly.
Run it like tutorial 01:
daslang.exe -jit tutorials/dasLLAMA/04_sessions_and_memory.das -- path/to/model.gguf
8.7.4.1. The KV cache, and why you cap seq_len
The KV cache is sized to config.seq_len at create_session time:
roughly 2 * n_layers * seq_len * kv_dim floats (one key and one value vector
per layer per position, 4 bytes each in fp32). Models ship with big native
contexts (Llama-3.1's native seq_len is 131072, which means tens of GB of
fp32 KV), so cap seq_len to the context you actually need before
creating sessions:
m.config.seq_len = min(m.config.seq_len, 1024l)
var s = create_session(m)
On SmolLM2-135M that’s the difference between ~377 MB per session at the native 8192 and ~47 MB at 1024.
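The figures above can be checked with plain arithmetic. The sketch below is not part of the tutorial file: kv_cache_bytes is a hypothetical helper, and the SmolLM2-135M shape it plugs in (30 layers, kv_dim = 3 KV heads * 64 = 192) is an assumption taken from that model's published configuration, not from this page.

```daslang
options gen2

// Hypothetical helper: estimated fp32 KV-cache size in bytes.
// 2 (K and V) * n_layers * seq_len * kv_dim floats, 4 bytes per float.
def kv_cache_bytes(n_layers, seq_len, kv_dim : int64) : int64 {
    return 2l * n_layers * seq_len * kv_dim * 4l
}

[export]
def main() {
    // Assumed SmolLM2-135M shape: 30 layers, kv_dim = 192.
    let native = kv_cache_bytes(30l, 8192l, 192l)
    let capped = kv_cache_bytes(30l, 1024l, 192l)
    print("native 8192: {native / 1000000l} MB\n")
    print("capped 1024: {capped / 1000000l} MB\n")
}
```

With those assumed dimensions the two results are 377,487,360 and 47,185,920 bytes, which matches the ~377 MB and ~47 MB quoted above (decimal megabytes).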
8.7.4.2. One model, many sessions
Each session has its own position (n_past), KV cache, and sampling RNG —
two sessions over one model don’t see each other. This is how one loaded model
serves many conversations.
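A minimal sketch of that isolation, using only the calls this page names (create_session, eval) and assuming the position is readable as a field called n_past on the session. The function two_sessions is hypothetical, and the dasLLAMA require line, model loading, and tokenizing are omitted here because they are set up as in tutorial 01.

```daslang
// Sketch only: assumes `m` is an already loaded Model and both prompts
// are already tokenized into array<int64>.
def two_sessions(var m : Model; prompt_a, prompt_b : array<int64>) {
    var a = create_session(m)
    var b = create_session(m)
    eval(m, a, prompt_a)    // advances only session a
    // b has not been evaluated yet, so its position should still be 0
    print("a.n_past={a.n_past} b.n_past={b.n_past}\n")
    eval(m, b, prompt_b)    // b fills its own KV cache, a is untouched
    print("a.n_past={a.n_past} b.n_past={b.n_past}\n")
    delete a
    delete b
}
```

Each session costs one KV cache of the size computed in the previous section, so the number of concurrent conversations is bounded by memory rather than by the model itself.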
8.7.4.3. eval and sample by hand
generate is a convenience over two primitives. eval runs tokens at the
session’s current position and advances it — prefill is just the whole prompt
in one call. sample picks the next token from session.logits:
var s = create_session(m)
eval(m, s, prompt)                  // prefill: n_past goes 0 -> length(prompt)
var one : array<int64>
one |> resize(1)
for (_step in range64(16l)) {
    let tok = sample(s, SamplingParams())
    print("{piece(m, tok)}")
    one[0] = tok
    eval(m, s, one)                 // feed the sampled token back in
}
generate adds the timing stats and the repetition-penalty window — this
skeleton is all the model ever does.
8.7.4.4. Freeing: persistent_heap + delete
A one-shot script can just exit — the context frees everything. A long-lived process that loads model after model must free explicitly, and that takes both ingredients:
options persistent_heap at the top of the file: a malloc-backed heap where delete returns memory immediately (on the default linear heap a mid-context delete is a no-op);
an explicit delete of every Session and Model you created.
options persistent_heap
// ...
delete s
delete m // heap drops by the model's full footprint
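The effect of delete can be observed without loading a model. The sketch below is not from the tutorial file: it uses a plain array<float> as a stand-in for a Model, and it assumes the builtin heap_bytes_allocated() reports allocations on the persistent heap. Treat it as an untested illustration of the measurement pattern, not as the tutorial's own code.

```daslang
options gen2
options persistent_heap

[export]
def main() {
    let before = heap_bytes_allocated()
    var buf : array<float>          // stand-in for a large Model allocation
    buf |> resize(1 << 20)          // about 4 MB of floats
    let during = heap_bytes_allocated()
    delete buf                      // with persistent_heap this should return the memory
    let after = heap_bytes_allocated()
    print("before={before} during={during} after={after}\n")
}
```

Running the same program without options persistent_heap is a quick way to compare against the default linear heap, where a mid-context delete is described above as a no-op.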
One practical note baked into the tutorial file itself: each section lives in
its own function. A daslang function’s stack frame is statically sized for
all its locals, and the default 16 KB context stack must also fit the
model’s forward-pass call chain — piling every section’s locals into one fat
main() overflows it. Keep model-driving main functions lean (or raise
options stack).
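If splitting sections into functions is not enough, the stack can be raised at the top of the file. The value below is an arbitrary assumption for illustration (four times the 16 KB default), not a recommendation from the tutorial.

```daslang
options stack = 65536   // assumed value; pick one that fits your call chain
```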
See also
Full source: tutorials/dasLLAMA/04_sessions_and_memory.das
Next tutorial: dasLLAMA-05 — Performance