17.1. dasLLAMA LLM inference: models, sessions, generation, chat

CPU large-language-model inference in pure daslang: load a GGUF model, tokenize, run the transformer, sample — or hold a full chat — validated token-for-token against llama.cpp on every supported family. Run with -jit; examples/dasLLAMA/run.das and chat.das show the canonical program shape.

Supported model families (GGUF — fp32 / f16 / q8_0 / q4_0 / mxfp4 weights read directly; K-quant files such as Q4_K_M / Q5_K_M / Q6_K run on native K-quant kernels):

Llama — Llama-2 / TinyLlama, Llama-3.1 / 3.2, Mistral-7B-Instruct, SmolLM2, plus llama2.c .bin checkpoints
Qwen — Qwen2.5, Qwen3 (QK-norm), Qwen3.5 / Qwen3.6 (hybrid Gated-DeltaNet attention, incl. the 35B-A3B MoE); MoE: Qwen1.5-MoE (routed + sigmoid-gated shared expert), Qwen3-30B-A3B (routed-only, renormalized top-k)
Phi — Phi-3.5-mini
Gemma — Gemma-2, Gemma-3 (per-layer sliding-window patterns), Gemma-4 (12B / 31B dense, the 26B-A4B MoE, and the E2B / E4B edge series with per-layer embeddings + cross-layer KV sharing)
gpt-oss — gpt-oss-20b (attention sinks, native MXFP4 experts, YaRN long context, Harmony chat format)

The architecture is picked from GGUF metadata at load — the same program runs any of these.

Hands-on tutorials (overview): hello, generation, chat and templates, sampling, sessions and memory, performance, the architecture registry.

17.1.1. Types

The engine types the API below works with. They are created and consumed by the functions of this module; their remaining fields are engine implementation detail.

Model

A loaded model: weights, config, tokenizer, and the architecture’s blocks and chat template, as produced by load_model. User code touches config (e.g. cap config.seq_len before create_session on large-context models) and arch (the GGUF architecture name).

Session

One generation stream over a model: the KV cache, scratch buffers, sampling RNG, and the current position n_past. logits holds the distribution produced by the last eval. A model serves many independent sessions.

BatchWorkspace

Caller-owned scratch for eval_batch: the batched activation buffers a step of B sessions shares. One per concurrent batch, reused across calls; holds no session state — positions, caches and logits stay in the sessions.

KVPool

A caller-owned paged KV-cache pool (create_kv_pool): sessions created over it allocate fixed-size page groups on demand, so cache memory tracks the actual context instead of the full seq_len slab. One pool serves many sessions; an eval_batch batch must share one pool. Keep it alive and in place while its sessions live.

PrefixCache

A page-granular prefix cache over one pool’s sessions (create_prefix_cache): finished streams donate their KV pages keyed by a chained page hash of the token history, and later requests attach the longest cached prefix instead of re-prefilling it. Pages are refcounted with the pool; an LRU budget bounds retention.

KVDtype

Per-session KV-cache codec picked at create_session/create_chat: f16 — the default, half the KV bytes and faster deep-context decode (stores clamp to ±65504); f32 — the bit-exact reference; q8_0 — block-quantized (llama.cpp -ctk/-ctv q8_0), half the f16 bytes again and near-lossless in practice; needs head_size and kv_dim to be multiples of 32 (checked at create).

QuantMode

Weight representation picked at load: fp32 — the token-exact reference; q8 — int8 quantization, the fast CPU path; q4 — 4-bit, smallest footprint.

SamplingParams

Sampling knobs: temp (<= 0 selects greedy argmax), top_k (0 = no cutoff), and repetition penalty (1.0 = none) applied over the last penalty_last_n generated tokens. The defaults are greedy — SamplingParams() reproduces argmax exactly.

Stats

Timing of the last generate/respond call: n_prompt/n_gen token counts, ttft_s (seconds to first token), and prefill_tps/gen_tps throughput in tokens per second.

LlmCaps

What the model honestly supports at the chat layer, as returned by caps: system_prompt is false for architectures with no system role (gemma), where the chat layer silently folds the system prompt into the first user turn. Grows as gaps surface.

ChatSession

A conversation over a model: its session, the resolved chat template, and the running transcript in history. Create with create_chat, then drive with add_user + respond.

AudioTower

A loaded audio encoder: Whisper-family encoder weights plus the model-specific projector tail, as produced by load_audio_tower from an mmproj GGUF. Pass it to create_chat to enable add_user_audio turns.

17.1.2. Model loading and sessions

caps (model: Model) : LlmCaps

create_batch_workspace (model: Model) : BatchWorkspace

create_kv_pool (model: Model; page_rows: int64 = 64; kv_dtype: KVDtype = dasllama_common::KVDtype.f16) : KVPool

create_session (model: Model; kv_dtype: KVDtype = dasllama_common::KVDtype.f16) : Session

create_session (model: Model; var pool: KVPool) : Session

load_model (path: string; mode: QuantMode = dasllama_common::QuantMode.fp32) : Model

release_kv_pages (var session: Session)

setup_dasllama_jobque ()

caps(model: Model ): LlmCaps 

What model honestly supports at the chat layer (see LlmCaps) — e.g. gemma has no system role, so the chat layer folds the system prompt into the first user turn; system_prompt is false there so callers can surface it instead of being silently absorbed.

Arguments:

model : Model

create_batch_workspace(model: Model ): BatchWorkspace 

Create the caller-owned scratch that eval_batch steps through — one per concurrent batch, reused across calls (buffers grow to the largest batch seen). Holds no session state: the sessions keep their own positions, caches and logits.

Arguments:

model : Model

create_kv_pool(model: Model; page_rows: int64 = 64; kv_dtype: KVDtype = dasllama_common::KVDtype.f16 ): KVPool 

Create a caller-owned PAGED KV pool over model’s cache geometry. Sessions created over it (create_session(model, pool)) allocate cache pages of page_rows positions on demand instead of the full seq_len slab up front — KV memory tracks the ACTUAL context, and many sessions share one elastic pool. kv_dtype fixes the codec for every session in the pool. Keep the pool alive (and in place) as long as its sessions live; it is plain data — delete frees everything at the end.

Arguments:

model : Model
page_rows : int64
kv_dtype : KVDtype

17.1.2.1. create_session

create_session(model: Model; kv_dtype: KVDtype = dasllama_common::KVDtype.f16 ): Session 

Create a fresh session (KV cache + scratch) sized to model.config.seq_len. A model serves many sessions, each with its own position and cache — one model, many independent conversations. On large-context models cap model.config.seq_len BEFORE creating sessions to bound KV memory. kv_dtype picks this session’s KV-cache codec — the KVDtype.f16 default halves the KV bytes and speeds up deep-context decode (near-lossless: stores clamp to ±65504; llama.cpp’s default is F16 too); pass KVDtype.f32 for the bit-exact reference, or KVDtype.q8_0 (llama.cpp -ctk/-ctv q8_0) to halve the bytes again — block-quantized, near-lossless in practice, needs head_size/kv_dim multiples of 32 (checked at create). Per-session state, so sessions with different codecs and parallel models never interact.

Arguments:

model : Model
kv_dtype : KVDtype

create_session(model: Model; pool: KVPool ): Session

load_model(path: string; mode: QuantMode = dasllama_common::QuantMode.fp32 ): Model 

Load a model AND its tokenizer from a GGUF file — the one entry point. The architecture and tokenizer backend are auto-selected from GGUF metadata; mode picks the weight quantization (QuantMode.fp32 = the token-exact reference, QuantMode.q8 = the fast int8 path). Q8 loads keep a PREPARED IMAGE next to the gguf (box + knob specific): the first load saves it, later loads map it in milliseconds with zero-copy weight planes. DASLLAMA_IMAGE=0 disables the cache. Passing a .dlim path loads that image DIRECTLY (no gguf needed — wrong identity panics; there is nothing to regenerate from); the image’s baked-in quantization applies and mode is ignored on that path.

Arguments:

path : string
mode : QuantMode

release_kv_pages(session: Session )

Return session’s KV pages to its pool (no-op on flat sessions). The normal shape is release + delete; a released session stays alive but loses its cached context — to reuse it, also reset session.n_past to 0.

Arguments:

session : Session

setup_dasllama_jobque()

Configure the job queue for dasLLAMA’s pure fork/join matmul dispatch: pooled fork contexts, batched dispatch, and the worker spin-before-park window (the tune sidecar’s runtime jobque_spin_us tunes the window per box; 0 disables the spin). Call it INSIDE with_job_que(), before the first generate/eval — see examples/dasLLAMA/run.das.

17.1.3. Prefix cache

create_prefix_cache (max_groups: int64 = 0) : PrefixCache

prefix_attach (var cache: PrefixCache; var pool: KVPool; var session: Session; prompt: array<int64>) : int64

prefix_held_groups (cache: PrefixCache) : int64

prefix_insert (var cache: PrefixCache; var pool: KVPool; session: Session; tokens: array<int64>)

prefix_release (var cache: PrefixCache; var pool: KVPool)

create_prefix_cache(max_groups: int64 = 0 ): PrefixCache 

Create a prefix cache for the paged sessions of one create_kv_pool pool: finished streams donate their KV pages (prefix_insert) and later requests with the same prompt prefix attach them (prefix_attach) instead of re-prefilling — the serving win for shared system prompts and multi-turn chats. max_groups caps how many page groups the cache retains (LRU-dropped past it); 0 = unbounded. Page hashes route, stored token ids verify — a hit attaches only after its tokens compare equal.

Arguments:

max_groups : int64

prefix_attach(cache: PrefixCache; pool: KVPool; session: Session; prompt: array<int64> ): int64 

Attach the longest cached prefix of prompt to a FRESH paged session of pool: matched whole pages join the session’s block table (shared, refcounted) and n_past advances past them, so the caller prefills only the tail. Returns the matched token count — a multiple of the pool’s page_rows, capped one token short of the prompt so the tail eval always produces the sampling logits.

Arguments:

cache : PrefixCache
pool : KVPool
session : Session
prompt : array<int64>

prefix_held_groups(cache: PrefixCache ): int64 

Pages the cache currently holds (== pool groups retained for reuse).

Arguments:

cache : PrefixCache

prefix_insert(cache: PrefixCache; pool: KVPool; session: Session; tokens: array<int64> )

Donate a finished session’s KV pages to the cache. tokens is the session’s full EVALED history (prompt + reply + any closing tokens; only the first n_past count — those rows exist); every full page of it not already cached is registered and survives the session’s release_kv_pages. Call right before releasing.

Arguments:

cache : PrefixCache
pool : KVPool
session : Session
tokens : array<int64>

prefix_release(cache: PrefixCache; pool: KVPool )

Release every cached page back to pool and clear the cache (pages still used by live sessions stay alive until those sessions release them). Call before deleting the pool.

Arguments:

cache : PrefixCache
pool : KVPool

17.1.4. Tokenizer

decode (model: Model; ids: array<int64>) : string

encode (model: Model; text: string; add_special: bool = true; parse_special: bool = false) : array<int64>

piece (model: Model; id: int64) : string

decode(model: Model; ids: array<int64> ): string 

Decode a token-id sequence back to text with the model’s tokenizer.

Arguments:

model : Model
ids : array<int64>

encode(model: Model; text: string; add_special: bool = true; parse_special: bool = false ): array<int64> 

Encode text to token ids with the model’s tokenizer. add_special prepends BOS where the model expects one. parse_special is reserved and not yet honored — special tokens reach the model as atomic ids from the chat layer’s template renderer, never by spelling them in text.

Arguments:

model : Model
text : string
add_special : bool
parse_special : bool

piece(model: Model; id: int64 ): string 

Decode a single token to its text piece — the streaming counterpart of decode.

Arguments:

model : Model
id : int64

17.1.5. Evaluation and sampling

eval (model: Model; var session: Session; tokens: array<int64>)

eval_batch (model: Model; var ws: BatchWorkspace; var sessions: array<Session?>; tokens: array<int64>)

eval_embd (model: Model; var session: Session; embd: array<float>; npos: int64)

sample (var session: Session; params: SamplingParams) : int64

set_seed (var session: Session; seed: int)

stats (session: Session) : Stats

eval(model: Model; session: Session; tokens: array<int64> )

THE eval primitive: run tokens at the session’s current position and advance it. Prefill = eval(prompt); each generation step = eval([token]) — the same call at different batch sizes. Logits land in session.logits.

Arguments:

model : Model
session : Session
tokens : array<int64>

eval_batch(model: Model; ws: BatchWorkspace; sessions: array<Session?>; tokens: array<int64> )

One synchronous batched decode step: row i evals tokens[i] at sessions[i]’s current position, logits land in sessions[i].logits, and each session advances by one — B independent conversations through ONE pass of the weights (the weight GEMVs batch into GEMMs; attention stays per-session). Sessions must be distinct, same-geometry (one model, equal cache sizes, uniform KV dtype; paged sessions must share ONE pool, never mixed with flat rows) and each have room for one more position. Ragged batches: pass only the still-active sessions — B may shrink between calls. B == 1 and q4 weights delegate to the exact per-session forward path.

Arguments:

model : Model
ws : BatchWorkspace
sessions : array< Session?>
tokens : array<int64>

eval_embd(model: Model; session: Session; embd: array<float>; npos: int64 )

eval’s embedding-input twin: prefill npos pre-built embedding rows (npos × dim, token-major) at the session’s current position and advance it. This is the multimodal splice entry — fill text spans with embed_text_rows and media spans with an encoder tower’s soft tokens (see examples/dasLLAMA/audio.das). Logits land in session.logits.

Arguments:

model : Model
session : Session
embd : array<float>
npos : int64

sample(session: Session; params: SamplingParams ): int64 

Sample the next token from session.logits per params: repetition/presence/frequency penalties over the recent window, then temperature + top-k on logits, softmax, top-p / min-p on probabilities, and a CDF draw — or greedy argmax when params.temp <= 0. SamplingParams() defaults are greedy.

Arguments:

session : Session
params : SamplingParams

set_seed(session: Session; seed: int )

Seed the session’s sampling RNG for reproducible generation.

Arguments:

session : Session
seed : int

stats(session: Session ): Stats 

Timing of the most recent generate/respond call on session: prompt/generated token counts, time to first token, prefill and generation tok/s.

Arguments:

session : Session

17.1.6. Generation

generate (model: Model; var session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool>) : int64

generate_embd (model: Model; var session: Session; embd: array<float>; npos: int64; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool>) : int64

generate(model: Model; session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool> ): int64 

Stream-generate up to max_tokens from prompt, invoking the trailing block per token with (id, piece); return false from the block to stop early (e.g. on a stop token). Prefills the prompt in one eval, then samples one token at a time, advancing the session. Returns the number of tokens emitted; timing is available afterwards via stats.

Arguments:

model : Model
session : Session
prompt : array<int64>
params : SamplingParams
max_tokens : int64
blk : block<(id:int64;piece:string):bool>

generate_embd(model: Model; session: Session; embd: array<float>; npos: int64; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool> ): int64 

generate’s embedding-prefill twin: prefill npos pre-built embedding rows (the multimodal splice — see eval_embd), then stream-sample exactly like generate. The chat layer’s audio turns run on this; use it directly for custom multimodal prompts.

Arguments:

model : Model
session : Session
embd : array<float>
npos : int64
params : SamplingParams
max_tokens : int64
blk : block<(id:int64;piece:string):bool>

17.1.7. Embeddings

embed (model: Model; text: string) : array<float>

embed(model: Model; text: string ): array<float> 

Mean-pooled, L2-normalized sentence embedding of text (model.config.dim floats): the decoder’s last-layer hidden state (post-final RMSNorm) averaged over token positions, then unit-normalized. One forward over a fresh session per call. A decoder-only model used as an embedder yields RAG-grade vectors — good for retrieval / similarity, not a substitute for a dedicated embedding model.

Arguments:

model : Model
text : string

17.1.8. Chat

add_assistant (model: Model; var chat: ChatSession; text: string)

add_user (var chat: ChatSession; text: string)

add_user_audio (var chat: ChatSession; samples: array<float>|array<float>#) : auto

create_chat (model: Model; system: string = “”; max_new: int64 = 256; kv_dtype: KVDtype = dasllama_common::KVDtype.f16) : ChatSession

create_chat (model: Model; var tower: AudioTower; system: string = “”; max_new: int64 = 256; kv_dtype: KVDtype = dasllama_common::KVDtype.f16) : ChatSession

create_chat_renderer (model: Model; system: string = “”; max_new: int64 = 256) : ChatSession

render_assistant (model: Model; var chat: ChatSession; text: string; var out: array<int64>)

render_close (model: Model; chat: ChatSession) : array<int64>

render_turn (model: Model; chat: ChatSession) : array<int64>

respond (model: Model; var chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool>) : string

set_thinking (var chat: ChatSession; on: bool)

add_assistant(model: Model; chat: ChatSession; text: string )

Inject a KNOWN assistant reply (no generation): prefill the pending user turn and text into the KV cache, then close the turn — like respond but with a supplied reply. Replay a prior transcript with alternating add_user / add_assistant calls, then respond the final turn — the shape a stateless OpenAI-style server needs (the client resends the whole history each request). Precondition: a user message is pending (add_user first); a no-op otherwise.

Arguments:

model : Model
chat : ChatSession
text : string

add_user(chat: ChatSession; text: string )

Queue a user message for the next respond.

Arguments:

chat : ChatSession
text : string

add_user_audio(chat: ChatSession; samples: array<float>|array<float># ): auto 

Queue audio (16 kHz mono f32 PCM) for the next respond — encoded to soft tokens immediately, spliced at the head of the turn before any add_user text. Multiple clips concatenate. Needs a chat created with create_chat(model, tower); call inside with_job_que() (the encoder threads its kernels).

Arguments:

chat : ChatSession
samples : option<array<float>| array<float>#>

17.1.8.1. create_chat

create_chat(model: Model; system: string = ""; max_new: int64 = 256; kv_dtype: KVDtype = dasllama_common::KVDtype.f16 ): ChatSession 

Start a conversation over model: resolves the model’s chat template (sniffed from the GGUF’s embedded template, falling back to the arch registry) and creates the session. system is the system prompt (empty = none); max_new caps each reply. One model can drive many chats. kv_dtype is the session’s KV-cache codec (f16 default — see create_session).

Arguments:

model : Model
system : string
max_new : int64
kv_dtype : KVDtype

create_chat(model: Model; tower: AudioTower; system: string = ""; max_new: int64 = 256; kv_dtype: KVDtype = dasllama_common::KVDtype.f16 ): ChatSession

create_chat_renderer(model: Model; system: string = ""; max_new: int64 = 256 ): ChatSession 

create_chat’s RENDER-ONLY twin: resolves the chat template, stop ids, and turn close exactly like create_chat, but creates NO KV session — a queued request can render its whole prompt (add_user / render_assistant / render_turn / render_close) holding tokens only, no cache memory. It cannot respond/eval.

Arguments:

model : Model
system : string
max_new : int64

render_assistant(model: Model; chat: ChatSession; text: string; out: array<int64> )

add_assistant’s render half: append the exact token stream a known assistant reply prefills — the pending user turn, the assistant-open prompt, text, and the turn close — to out WITHOUT running the model, advancing the transcript exactly like add_assistant. Replay a stateless request’s history on a create_chat_renderer chat to render its full prompt with no KV memory. Precondition: a user message is pending; a no-op otherwise.

Arguments:

model : Model
chat : ChatSession
text : string
out : array<int64>

render_close(model: Model; chat: ChatSession ): array<int64> 

The tokens that TERMINATE an assistant turn (what respond evals after the reply) — for schedulers that close a finished stream’s turn themselves.

Arguments:

model : Model
chat : ChatSession

render_turn(model: Model; chat: ChatSession ): array<int64> 

Render the next turn’s prefill token ids — BOS + system on the first turn, then the user turn and the generation prompt — WITHOUT running the model. For inspection, token budgeting, tests.

Arguments:

model : Model
chat : ChatSession

respond(model: Model; chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool> ): string 

Generate the assistant’s reply to the queued user message: render the turn, prefill it, then stream pieces through the trailing block (return false to stop early) until a stop token or the max_new budget. Terminates the turn in the KV cache and appends both turns to chat.history. Returns the full reply text; timing via stats(chat.session).

Arguments:

model : Model
chat : ChatSession
params : SamplingParams
blk : block<(piece:string):bool>

set_thinking(chat: ChatSession; on: bool )

Toggle reasoning for a hybrid thinking model (Qwen3 family): false appends the template’s empty think block to every generation prompt (the Jinja enable_thinking=false form), so the model answers directly. No-op for templates with no suppress form or vocabs without the think specials. Default is on.

Arguments:

chat : ChatSession
on : bool

17.1.9. Tool calling

add_tool_results (var chat: ChatSession; results: array<string>)

render_assistant_calls (model: Model; var chat: ChatSession; text: string; calls: array<string>; var out: array<int64>)

set_tools (var chat: ChatSession; var tools: array<string>)

add_tool_results(chat: ChatSession; results: array<string> )

Queue tool results as the next pending turn — the reply to an assistant turn that called tools. Call in place of add_user, then respond/render_turn as usual.

Arguments:

chat : ChatSession
results : array<string>

render_assistant_calls(model: Model; chat: ChatSession; text: string; calls: array<string>; out: array<int64> )

render_assistant’s tool-calling twin: replay an assistant turn that emitted tool calls (verbatim \{"name":…,"arguments":…} objects) plus any text alongside.

Arguments:

model : Model
chat : ChatSession
text : string
calls : array<string>
out : array<int64>

set_tools(chat: ChatSession; tools: array<string> )

Declare the conversation’s tools (verbatim JSON objects, one per tool — the OpenAI tools[] entries, moved in) BEFORE the first turn renders; the system turn then carries the family’s tool block. Families with no tool format (tmpl.tool_call_open empty) ignore them.

Arguments:

chat : ChatSession
tools : array<string>