.. _tutorial_dasLLAMA_sessions_and_memory: ================================================ dasLLAMA-04 — Sessions, the KV Cache, and Memory ================================================ .. index:: single: Tutorial; dasLLAMA single: Tutorial; KV Cache single: Tutorial; Sessions The model is the expensive, read-only part (the weights). A ``Session`` is one generation stream's state: the attention KV cache plus scratch buffers. This tutorial sizes that cache, runs independent sessions over one model, hand-rolls the loop inside ``generate``, and frees everything properly. Run it like tutorial 01:: daslang.exe -jit tutorials/dasLLAMA/04_sessions_and_memory.das -- path/to/model.gguf The KV cache, and why you cap seq_len ===================================== The KV cache is sized to ``config.seq_len`` *at* ``create_session`` time — roughly ``2 * n_layers * seq_len * kv_dim`` floats. Models ship with big native contexts (Llama-3's native ``seq_len`` is 131072, which means tens of GB of fp32 KV), so cap ``seq_len`` to the context you actually need **before** creating sessions: .. code-block:: das m.config.seq_len = min(m.config.seq_len, 1024l) var s = create_session(m) On SmolLM2-135M that's the difference between ~377 MB per session at the native 8192 and ~47 MB at 1024. One model, many sessions ======================== Each session has its own position (``n_past``), KV cache, and sampling RNG — two sessions over one model don't see each other. This is how one loaded model serves many conversations. eval and sample by hand ======================= ``generate`` is a convenience over two primitives. ``eval`` runs tokens at the session's current position and advances it — prefill is just the whole prompt in one call. ``sample`` picks the next token from ``session.logits``: .. code-block:: das var s = create_session(m) eval(m, s, prompt) // prefill: n_past goes 0 -> length(prompt) var one : array one |> resize(1) for (_step in range64(16l)) { let tok = sample(s, SamplingParams()) print("{piece(m, tok)}") one[0] = tok eval(m, s, one) // feed the sampled token back in } ``generate`` adds the timing stats and the repetition-penalty window — this skeleton is all the model ever does. Freeing: persistent_heap + delete ================================= A one-shot script can just exit — the context frees everything. A long-lived process that loads model after model must free explicitly, and that takes both ingredients: * ``options persistent_heap`` at the top of the file — a malloc-backed heap where ``delete`` returns memory immediately (on the default linear heap a mid-context delete is a no-op); * an explicit ``delete`` of every ``Session`` and ``Model`` you created. .. code-block:: das options persistent_heap // ... delete s delete m // heap drops by the model's full footprint One practical note baked into the tutorial file itself: each section lives in its own function. A daslang function's stack frame is statically sized for *all* its locals, and the default 16 KB context stack must also fit the model's forward-pass call chain — piling every section's locals into one fat ``main()`` overflows it. Keep model-driving ``main`` functions lean (or raise ``options stack``). .. seealso:: Full source: :download:`tutorials/dasLLAMA/04_sessions_and_memory.das <../../../../tutorials/dasLLAMA/04_sessions_and_memory.das>` Next tutorial: :ref:`tutorial_dasLLAMA_performance`