17.1. dasLLAMA LLM inference: models, sessions, generation, chat
CPU large-language-model inference in pure daslang: load a GGUF model, tokenize, run the
transformer, sample — or hold a full chat — validated token-for-token against llama.cpp on every
supported family. Run with -jit; examples/dasLLAMA/run.das and chat.das show the
canonical program shape.
Supported model families (GGUF, fp32 / q8 / q4 weights):
Llama — Llama-2 / TinyLlama, Llama-3.1 / 3.2, Mistral-7B-Instruct, SmolLM2, plus llama2.c
.bincheckpointsQwen — Qwen2.5, Qwen3 (QK-norm), Qwen1.5-MoE (shared + routed experts)
Phi — Phi-3.5-mini
Gemma — Gemma-2, Gemma-3 (per-layer sliding-window patterns), Gemma-4-12B (p-RoPE, heterogeneous attention geometry)
gpt-oss — gpt-oss-20b (attention sinks, MXFP4 weights, YaRN long context, Harmony chat format)
The architecture is picked from GGUF metadata at load — the same program runs any of these.
Hands-on tutorials (overview): hello, generation, chat and templates, sampling, sessions and memory, performance, the architecture registry.
17.1.1. Types
The engine types the API below works with. They are created and consumed by the functions of this module; their remaining fields are engine implementation detail.
- Model
A loaded model: weights, config, tokenizer, and the architecture’s blocks and chat template, as produced by load_model. User code touches config (e.g. cap config.seq_len before create_session on large-context models) and arch (the GGUF architecture name).
- Session
One generation stream over a model: the KV cache, scratch buffers, sampling RNG, and the current position n_past. logits holds the distribution produced by the last eval. A model serves many independent sessions.
- QuantMode
Weight representation picked at load: fp32 — the token-exact reference; q8 — int8 quantization, the fast CPU path; q4 — 4-bit, smallest footprint.
- SamplingParams
Sampling knobs: temp (<= 0 selects greedy argmax), top_k (0 = no cutoff), and repetition penalty (1.0 = none) applied over the last penalty_last_n generated tokens. The defaults are greedy — SamplingParams() reproduces argmax exactly.
- Stats
Timing of the last generate/respond call: n_prompt/n_gen token counts, ttft_s (seconds to first token), and prefill_tps/gen_tps throughput in tokens per second.
- ChatSession
A conversation over a model: its session, the resolved chat template, and the running transcript in history. Create with create_chat, then drive with add_user + respond.
17.1.2. Model loading and sessions
- create_session(model: Model ): Session
Create a fresh session (KV cache + scratch) sized to model.config.seq_len. A model serves
many sessions, each with its own position and cache — one model, many independent conversations.
On large-context models cap model.config.seq_len BEFORE creating sessions to bound KV memory.
- Arguments:
model : Model
- load_model(path: string; mode: QuantMode = dasllama_common::QuantMode.fp32 ): Model
Load a model AND its tokenizer from a GGUF file — the one entry point. The architecture and
tokenizer backend are auto-selected from GGUF metadata; mode picks the weight quantization
(QuantMode.fp32 = the token-exact reference, QuantMode.q8 = the fast int8 path).
- Arguments:
path : string
mode : QuantMode
17.1.3. Tokenizer
- decode(model: Model; ids: array<int64> ): string
Decode a token-id sequence back to text with the model’s tokenizer.
- Arguments:
model : Model
ids : array<int64>
- encode(model: Model; text: string; add_special: bool = true; parse_special: bool = false ): array<int64>
Encode text to token ids with the model’s tokenizer. add_special prepends BOS where the
model expects one. parse_special is reserved and not yet honored — special tokens reach the
model as atomic ids from the chat layer’s template renderer, never by spelling them in text.
- Arguments:
model : Model
text : string
add_special : bool
parse_special : bool
- piece(model: Model; id: int64 ): string
Decode a single token to its text piece — the streaming counterpart of decode.
- Arguments:
model : Model
id : int64
17.1.4. Evaluation and sampling
- eval(model: Model; session: Session; tokens: array<int64> )
THE eval primitive: run tokens at the session’s current position and advance it. Prefill =
eval(prompt); each generation step = eval([token]) — the same call at different batch
sizes. Logits land in session.logits.
- sample(session: Session; params: SamplingParams ): int64
Sample the next token from session.logits per params: repetition penalty over the recent
window, then temperature + top-k + softmax + CDF — or greedy argmax when params.temp <= 0.
SamplingParams() defaults are greedy.
- Arguments:
session : Session
params : SamplingParams
- set_seed(session: Session; seed: int )
Seed the session’s sampling RNG for reproducible generation.
- Arguments:
session : Session
seed : int
- stats(session: Session ): Stats
Timing of the most recent generate/respond call on session: prompt/generated token
counts, time to first token, prefill and generation tok/s.
- Arguments:
session : Session
17.1.5. Generation
- generate(model: Model; session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool> ): int64
Stream-generate up to max_tokens from prompt, invoking the trailing block per token with
(id, piece); return false from the block to stop early (e.g. on a stop token). Prefills
the prompt in one eval, then samples one token at a time, advancing the session. Returns the
number of tokens emitted; timing is available afterwards via stats.
- Arguments:
model : Model
session : Session
prompt : array<int64>
params : SamplingParams
max_tokens : int64
blk : block<(id:int64;piece:string):bool>
17.1.6. Chat
- add_user(chat: ChatSession; text: string )
Queue a user message for the next respond.
- Arguments:
chat : ChatSession
text : string
- create_chat(model: Model; system: string = ""; max_new: int64 = 256 ): ChatSession
Start a conversation over model: resolves the model’s chat template (sniffed from the GGUF’s
embedded template, falling back to the arch registry) and creates the session. system is the
system prompt (empty = none); max_new caps each reply. One model can drive many chats.
- Arguments:
model : Model
system : string
max_new : int64
- render_turn(model: Model; chat: ChatSession ): array<int64>
Render the next turn’s prefill token ids — BOS + system on the first turn, then the user turn and the generation prompt — WITHOUT running the model. For inspection, token budgeting, tests.
- Arguments:
model : Model
chat : ChatSession
- respond(model: Model; chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool> ): string
Generate the assistant’s reply to the queued user message: render the turn, prefill it, then
stream pieces through the trailing block (return false to stop early) until a stop token or
the max_new budget. Terminates the turn in the KV cache and appends both turns to
chat.history. Returns the full reply text; timing via stats(chat.session).
- Arguments:
model : Model
chat : ChatSession
params : SamplingParams
blk : block<(piece:string):bool>