17.1. dasLLAMA LLM inference: models, sessions, generation, chat

CPU large-language-model inference in pure daslang: load a GGUF model, tokenize, run the transformer, and sample, or hold a full chat. Output is validated token-for-token against llama.cpp on every supported family. Run with -jit; examples/dasLLAMA/run.das and chat.das show the canonical program shape.

Supported model families (GGUF, fp32 / q8 / q4 weights):

  • Llama — Llama-2 / TinyLlama, Llama-3.1 / 3.2, Mistral-7B-Instruct, SmolLM2, plus llama2.c .bin checkpoints

  • Qwen — Qwen2.5, Qwen3 (QK-norm), Qwen1.5-MoE (shared + routed experts)

  • Phi — Phi-3.5-mini

  • Gemma — Gemma-2, Gemma-3 (per-layer sliding-window patterns), Gemma-4-12B (p-RoPE, heterogeneous attention geometry)

  • gpt-oss — gpt-oss-20b (attention sinks, MXFP4 weights, YaRN long context, Harmony chat format)

The architecture is picked from GGUF metadata at load — the same program runs any of these.

Hands-on tutorials (overview): hello, generation, chat and templates, sampling, sessions and memory, performance, the architecture registry.

17.1.1. Types

The engine types the API below works with. They are created and consumed by the functions of this module; their remaining fields are engine implementation detail.

Model

A loaded model: weights, config, tokenizer, and the architecture’s blocks and chat template, as produced by load_model. User code touches config (e.g. cap config.seq_len before create_session on large-context models) and arch (the GGUF architecture name).
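
Capping config.seq_len matters because KV-cache memory grows linearly with context length. The arithmetic can be sketched in Python (used here only for illustration; this is not dasLLAMA code). The sketch assumes a uniform cache of fp32 keys and values in every layer. The parameter names n_layers, n_kv_heads and head_dim are hypothetical stand-ins, and the engine's real layout (for example on sliding-window layers) may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=4):
    """Approximate bytes of keys + values held by one session."""
    # Two tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical geometry: 32 layers, 8 KV heads of dimension 128.
full = kv_cache_bytes(32, 8, 128, seq_len=131072)
capped = kv_cache_bytes(32, 8, 128, seq_len=4096)
print(full / 2**30, capped / 2**30)  # GiB per session: 32.0 1.0
```

Under these assumptions, lowering the context cap from 131072 to 4096 shrinks the per-session cache by the same factor of 32.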

Session

One generation stream over a model: the KV cache, scratch buffers, sampling RNG, and the current position n_past. logits holds the distribution produced by the last eval. A model serves many independent sessions.

QuantMode

Weight representation picked at load: fp32 — the token-exact reference; q8 — int8 quantization, the fast CPU path; q4 — 4-bit, smallest footprint.
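
To give a feel for what an int8 weight path trades away, here is a minimal Python sketch of symmetric block quantization. It assumes one scale per block of 32 weights, in the style of GGUF's Q8_0 format; the source does not say that dasLLAMA's q8 mode uses exactly this scheme, so treat the block size and rounding as illustrative assumptions.

```python
def quantize_q8(values, block=32):
    """Return one (scale, int8 list) pair per block of weights."""
    blocks = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0  # all-zero block: any scale works
        quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8(blocks):
    """Rebuild approximate float weights from the quantized blocks."""
    return [scale * q for scale, quants in blocks for q in quants]

weights = [(i - 16) / 16 for i in range(64)]
restored = dequantize_q8(quantize_q8(weights))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {worst:.4f}")
```

Each weight is stored in one byte plus a shared per-block scale, and the round-trip error is bounded by half a quantization step of its block.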

SamplingParams

Sampling knobs: temp (<= 0 selects greedy argmax), top_k (0 = no cutoff), and repetition penalty (1.0 = none) applied over the last penalty_last_n generated tokens. The defaults are greedy — SamplingParams() reproduces argmax exactly.
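
How these knobs interact can be sketched in Python, following the order documented for sample: repetition penalty, then temperature, top-k, softmax and a CDF walk, with greedy argmax when temp <= 0. This is an illustrative model, not the engine's code. The penalty formula (divide positive logits, multiply negative ones), the field name repeat_penalty and the default window of 64 are assumptions.

```python
import math
import random

def sample(logits, recent, temp=0.0, top_k=0, repeat_penalty=1.0,
           penalty_last_n=64, rng=random):
    logits = list(logits)
    # 1. Repetition penalty on tokens seen in the recent window.
    window = recent[-penalty_last_n:] if penalty_last_n > 0 else []
    if repeat_penalty != 1.0:
        for tok in set(window):
            if logits[tok] > 0:
                logits[tok] /= repeat_penalty
            else:
                logits[tok] *= repeat_penalty
    # 2. temp <= 0 selects greedy argmax.
    if temp <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # 3. Temperature scaling, then keep the top_k largest (0 = no cutoff).
    scaled = [x / temp for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # 4. Softmax weights over the survivors, shifted for numerical stability.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    # 5. Walk the cumulative distribution until it passes a uniform draw.
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in zip(order, weights):
        acc += w
        if r < acc:
            return i
    return order[-1]

print(sample([1.0, 3.0, 2.0], recent=[]))  # defaults are greedy: index of the largest logit
print(sample([1.0, 3.0, 2.9], recent=[1], repeat_penalty=2.0))  # penalty demotes token 1
```

Passing a seeded random.Random instance as rng makes the stochastic path reproducible, which mirrors the role of set_seed on a session.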

Stats

Timing of the last generate/respond call: n_prompt/n_gen token counts, ttft_s (seconds to first token), and prefill_tps/gen_tps throughput in tokens per second.
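
As a reading aid, the relationship between these fields can be sketched in Python. The definitions below are one plausible accounting and are assumptions, not the engine's confirmed formulas: ttft_s is approximated by the prefill time, and each throughput is a token count divided by the seconds spent in that phase.

```python
def throughput(n_tokens, seconds):
    """Tokens per second, guarding against a zero-length interval."""
    return n_tokens / seconds if seconds > 0 else 0.0

def summarize(n_prompt, n_gen, prefill_s, gen_s):
    """Build a Stats-like record from token counts and phase durations."""
    return {
        "n_prompt": n_prompt,
        "n_gen": n_gen,
        "ttft_s": prefill_s,  # assumption: first token arrives right after prefill
        "prefill_tps": throughput(n_prompt, prefill_s),
        "gen_tps": throughput(n_gen, gen_s),
    }

report = summarize(n_prompt=512, n_gen=128, prefill_s=2.0, gen_s=8.0)
print(report["prefill_tps"], report["gen_tps"])  # 256.0 16.0
```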

ChatSession

A conversation over a model: its session, the resolved chat template, and the running transcript in history. Create with create_chat, then drive with add_user + respond.
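
The bookkeeping behind a chat (a queued user message, a first turn that also carries the system prompt, and a transcript that gains both turns per reply) can be modelled with a toy Python class. Everything here is a stand-in: the real render_turn returns token ids and uses the model's own chat template, whereas the angle-bracket markers and the reply_fn callback below are invented for illustration.

```python
class ToyChat:
    """Toy chat state: system prompt, pending message, transcript."""

    def __init__(self, system=""):
        self.system = system
        self.pending = None
        self.history = []  # list of (role, text) pairs

    def add_user(self, text):
        self.pending = text  # queued until the next respond()

    def render_turn(self):
        """Text that would be prefilled for the next turn; runs no model."""
        parts = []
        if not self.history and self.system:
            parts.append(f"<system>{self.system}</system>")  # first turn only
        parts.append(f"<user>{self.pending}</user>")
        parts.append("<assistant>")  # generation prompt
        return "".join(parts)

    def respond(self, reply_fn):
        prompt = self.render_turn()
        reply = reply_fn(prompt)  # stands in for prefill + streamed generation
        self.history.append(("user", self.pending))
        self.history.append(("assistant", reply))
        self.pending = None
        return reply

chat = ToyChat(system="Be brief.")
chat.add_user("Hi")
first_prompt = chat.render_turn()
chat.respond(lambda prompt: "Hello!")
chat.add_user("Again")
second_prompt = chat.render_turn()
print(first_prompt)
print(second_prompt)
```

Later turns render only the new user message and the generation prompt, since earlier turns are assumed to live on in the session's cache.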

17.1.2. Model loading and sessions

create_session(model: Model ): Session

Create a fresh session (KV cache + scratch) sized to model.config.seq_len. A model serves many sessions, each with its own position and cache — one model, many independent conversations. On large-context models cap model.config.seq_len BEFORE creating sessions to bound KV memory.

Arguments:
  • model : Model

load_model(path: string; mode: QuantMode = dasllama_common::QuantMode.fp32 ): Model

Load a model AND its tokenizer from a GGUF file; this is the one entry point. The architecture and tokenizer backend are auto-selected from GGUF metadata; mode picks the weight representation (QuantMode.fp32 = the token-exact reference, QuantMode.q8 = the fast int8 path, QuantMode.q4 = 4-bit, the smallest footprint).

Arguments:
  • path : string

  • mode : QuantMode

17.1.3. Tokenizer

decode(model: Model; ids: array<int64> ): string

Decode a token-id sequence back to text with the model’s tokenizer.

Arguments:
  • model : Model

  • ids : array<int64>

encode(model: Model; text: string; add_special: bool = true; parse_special: bool = false ): array<int64>

Encode text to token ids with the model’s tokenizer. add_special prepends BOS where the model expects one. parse_special is reserved and not yet honored — special tokens reach the model as atomic ids from the chat layer’s template renderer, never by spelling them in text.

Arguments:
  • model : Model

  • text : string

  • add_special : bool

  • parse_special : bool

piece(model: Model; id: int64 ): string

Decode a single token to its text piece — the streaming counterpart of decode.

Arguments:
  • model : Model

  • id : int64

17.1.4. Evaluation and sampling

eval(model: Model; session: Session; tokens: array<int64> )

THE eval primitive: run tokens at the session’s current position and advance it. Prefill = eval(prompt); each generation step = eval([token]) — the same call at different batch sizes. Logits land in session.logits.

Arguments:
  • model : Model

  • session : Session

  • tokens : array<int64>

sample(session: Session; params: SamplingParams ): int64

Sample the next token from session.logits per params: repetition penalty over the recent window, then temperature + top-k + softmax + CDF — or greedy argmax when params.temp <= 0. SamplingParams() defaults are greedy.

Arguments:
  • session : Session

  • params : SamplingParams

set_seed(session: Session; seed: int )

Seed the session’s sampling RNG for reproducible generation.

Arguments:
  • session : Session

  • seed : int

stats(session: Session ): Stats

Timing of the most recent generate/respond call on session: prompt/generated token counts, time to first token, prefill and generation tok/s.

Arguments:
  • session : Session

17.1.5. Generation

generate(model: Model; session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool> ): int64

Stream-generate up to max_tokens from prompt, invoking the trailing block per token with (id, piece); return false from the block to stop early (e.g. on a stop token). Prefills the prompt in one eval, then samples one token at a time, advancing the session. Returns the number of tokens emitted; timing is available afterwards via stats.

Arguments:
  • model : Model

  • session : Session

  • prompt : array<int64>

  • params : SamplingParams

  • max_tokens : int64

  • blk : block<(id:int64;piece:string):bool>
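
The control flow described above (one prefill, then one sample and one single-token eval per step, with a callback that can stop early) can be modelled in a few lines of Python. This is a sketch of the loop shape only: ToySession and its "predict last + 1" rule are invented stand-ins, and whether the engine evaluates or counts the final token exactly this way is an assumption.

```python
class ToySession:
    """Stand-in for a session: tracks position and the last token seen."""

    def __init__(self):
        self.n_past = 0
        self.last = None

    def eval(self, tokens):
        self.n_past += len(tokens)  # advance the position by the batch size
        self.last = tokens[-1]

    def sample(self):
        return (self.last + 1) % 10  # toy rule: always predict last + 1

def generate(session, prompt, max_tokens, on_token):
    """Prefill once, then sample and eval one token at a time."""
    session.eval(prompt)  # prefill: the whole prompt in one call
    emitted = 0
    while emitted < max_tokens:
        tok = session.sample()
        emitted += 1
        if not on_token(tok, str(tok)):  # callback returned False: stop early
            break
        session.eval([tok])  # single-token step
    return emitted

pieces = []

def on_token(tok, piece):
    pieces.append(piece)
    return tok != 7  # treat 7 as the stop token

session = ToySession()
count = generate(session, [3, 4], 16, on_token)
print(count, pieces)  # 3 ['5', '6', '7']
```

Prefill and the per-token step go through the same eval call at different batch sizes, which is why the session position ends at the prompt length plus the number of tokens fed back.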

17.1.6. Chat

add_user(chat: ChatSession; text: string )

Queue a user message for the next respond.

Arguments:
  • chat : ChatSession

  • text : string

create_chat(model: Model; system: string = ""; max_new: int64 = 256 ): ChatSession

Start a conversation over model: resolves the model’s chat template (sniffed from the GGUF’s embedded template, falling back to the arch registry) and creates the session. system is the system prompt (empty = none); max_new caps each reply. One model can drive many chats.

Arguments:
  • model : Model

  • system : string

  • max_new : int64

render_turn(model: Model; chat: ChatSession ): array<int64>

Render the next turn’s prefill token ids — BOS + system on the first turn, then the user turn and the generation prompt — WITHOUT running the model. For inspection, token budgeting, tests.

Arguments:
  • model : Model

  • chat : ChatSession

respond(model: Model; chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool> ): string

Generate the assistant’s reply to the queued user message: render the turn, prefill it, then stream pieces through the trailing block (return false to stop early) until a stop token or the max_new budget. Terminates the turn in the KV cache and appends both turns to chat.history. Returns the full reply text; timing via stats(chat.session).

Arguments:
  • model : Model

  • chat : ChatSession

  • params : SamplingParams

  • blk : block<(piece:string):bool>