.. _stdlib_dasllama:

==========================================================
dasLLAMA LLM inference: models, sessions, generation, chat
==========================================================

.. das:module:: dasllama

CPU large-language-model inference in pure daslang: load a GGUF model, tokenize, run the transformer, and sample or hold a full chat. Output is validated token-for-token against llama.cpp on every supported family. Run with ``-jit``; ``examples/dasLLAMA/run.das`` and ``chat.das`` show the canonical program shape.

Supported model families (GGUF, fp32 / q8 / q4 weights):

* **Llama**: Llama-2 / TinyLlama, Llama-3.1 / 3.2, Mistral-7B-Instruct, SmolLM2, plus llama2.c ``.bin`` checkpoints
* **Qwen**: Qwen2.5, Qwen3 (QK-norm), Qwen1.5-MoE (shared + routed experts)
* **Phi**: Phi-3.5-mini
* **Gemma**: Gemma-2, Gemma-3 (per-layer sliding-window patterns), Gemma-4-12B (p-RoPE, heterogeneous attention geometry)
* **gpt-oss**: gpt-oss-20b (attention sinks, MXFP4 weights, YaRN long context, Harmony chat format)

The architecture is picked from GGUF metadata at load time, so the same program runs any of these.

Hands-on tutorials (:ref:`overview `): :ref:`hello, generation `, :ref:`chat and templates `, :ref:`sampling `, :ref:`sessions and memory `, :ref:`performance `, :ref:`the architecture registry `.

+++++
Types
+++++

The engine types the API below works with. They are created and consumed by the functions of this module; their remaining fields are engine implementation detail.

.. _struct-dasllama_common-Model:

.. das:attribute:: Model

A loaded model: weights, ``config``, tokenizer, and the architecture's blocks and chat template, as produced by ``load_model``. User code touches ``config`` (e.g. cap ``config.seq_len`` before ``create_session`` on large-context models) and ``arch`` (the GGUF architecture name).

.. _struct-dasllama_common-Session:

.. das:attribute:: Session

One generation stream over a model: the KV cache, scratch buffers, sampling RNG, and the current position ``n_past``. ``logits`` holds the distribution produced by the last ``eval``. A model serves many independent sessions.

.. _enum-dasllama_common-QuantMode:

.. das:attribute:: QuantMode

Weight representation picked at load:

* ``fp32``: the token-exact reference
* ``q8``: int8 quantization, the fast CPU path
* ``q4``: 4-bit, the smallest footprint

.. _struct-dasllama_common-SamplingParams:

.. das:attribute:: SamplingParams

Sampling knobs: ``temp`` (<= 0 selects greedy argmax), ``top_k`` (0 = no cutoff), and repetition ``penalty`` (1.0 = none) applied over the last ``penalty_last_n`` generated tokens. The defaults are greedy: ``SamplingParams()`` reproduces argmax exactly.

.. _struct-dasllama_common-Stats:

.. das:attribute:: Stats

Timing of the last ``generate``/``respond`` call: ``n_prompt``/``n_gen`` token counts, ``ttft_s`` (seconds to first token), and ``prefill_tps``/``gen_tps`` throughput in tokens per second.

.. _struct-dasllama_chat-ChatSession:

.. das:attribute:: ChatSession

A conversation over a model: its ``session``, the resolved chat template, and the running transcript in ``history``. Create with ``create_chat``, then drive with ``add_user`` + ``respond``.

++++++++++++++++++++++++++
Model loading and sessions
++++++++++++++++++++++++++

* :ref:`create_session (model: Model) : Session <function-dasllama_create_session_Model>`
* :ref:`load_model (path: string; mode: QuantMode = dasllama_common::QuantMode.fp32) : Model <function-dasllama_load_model_string_QuantMode>`

.. _function-dasllama_create_session_Model:

.. das:function:: create_session(model: Model) : Session

Create a fresh session (KV cache + scratch) sized to ``model.config.seq_len``. A model serves many sessions, each with its own position and cache, so one model can back many independent conversations. On large-context models, cap ``model.config.seq_len`` *before* creating sessions to bound KV-cache memory.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`

.. _function-dasllama_load_model_string_QuantMode:

.. das:function:: load_model(path: string; mode: QuantMode = dasllama_common::QuantMode.fp32) : Model

Load a model and its tokenizer from a GGUF file; this is the single entry point. The architecture and tokenizer backend are auto-selected from GGUF metadata; ``mode`` picks the weight quantization (``QuantMode.fp32`` = the token-exact reference, ``QuantMode.q8`` = the fast int8 path).

:Arguments:
    * **path** : string
    * **mode** : :ref:`QuantMode <enum-dasllama_common-QuantMode>`

+++++++++
Tokenizer
+++++++++

* :ref:`decode (model: Model; ids: array\<int64\>) : string <function-dasllama_decode_Model_array_ls_int64_gr_>`
* :ref:`encode (model: Model; text: string; add_special: bool = true; parse_special: bool = false) : array\<int64\> <function-dasllama_encode_Model_string_bool_bool>`
* :ref:`piece (model: Model; id: int64) : string <function-dasllama_piece_Model_int64>`

.. _function-dasllama_decode_Model_array_ls_int64_gr_:

.. das:function:: decode(model: Model; ids: array<int64>) : string

Decode a token-id sequence back to text with the model's tokenizer.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **ids** : array<int64>

.. _function-dasllama_encode_Model_string_bool_bool:

.. das:function:: encode(model: Model; text: string; add_special: bool = true; parse_special: bool = false) : array<int64>

Encode ``text`` to token ids with the model's tokenizer. ``add_special`` prepends BOS where the model expects one. ``parse_special`` is reserved and not yet honored: special tokens reach the model as atomic ids from the chat layer's template renderer, never by spelling them in text.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **text** : string
    * **add_special** : bool
    * **parse_special** : bool

.. _function-dasllama_piece_Model_int64:

.. das:function:: piece(model: Model; id: int64) : string

Decode a single token to its text piece; this is the streaming counterpart of ``decode``.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **id** : int64

+++++++++++++++++++++++
Evaluation and sampling
+++++++++++++++++++++++

* :ref:`eval (model: Model; var session: Session; tokens: array\<int64\>) <function-dasllama_eval_Model_Session_array_ls_int64_gr_>`
* :ref:`sample (var session: Session; params: SamplingParams) : int64 <function-dasllama_sample_Session_SamplingParams>`
* :ref:`set_seed (var session: Session; seed: int) <function-dasllama_set_seed_Session_int>`
* :ref:`stats (session: Session) : Stats <function-dasllama_stats_Session>`

.. _function-dasllama_eval_Model_Session_array_ls_int64_gr_:

.. das:function:: eval(model: Model; session: Session; tokens: array<int64>)

The single evaluation primitive: run ``tokens`` at the session's current position and advance it. Prefill is ``eval(prompt)`` and each generation step is ``eval([token])``: the same call at different batch sizes. Logits land in ``session.logits``.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **session** : :ref:`Session <struct-dasllama_common-Session>`
    * **tokens** : array<int64>

.. _function-dasllama_sample_Session_SamplingParams:

.. das:function:: sample(session: Session; params: SamplingParams) : int64

Sample the next token from ``session.logits`` per ``params``: repetition penalty over the recent window, then temperature, top-k, softmax, and a CDF draw, or greedy argmax when ``params.temp <= 0``. The ``SamplingParams()`` defaults are greedy.

:Arguments:
    * **session** : :ref:`Session <struct-dasllama_common-Session>`
    * **params** : :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`

.. _function-dasllama_set_seed_Session_int:

.. das:function:: set_seed(session: Session; seed: int)

Seed the session's sampling RNG for reproducible generation.

:Arguments:
    * **session** : :ref:`Session <struct-dasllama_common-Session>`
    * **seed** : int

.. _function-dasllama_stats_Session:

.. das:function:: stats(session: Session) : Stats

Timing of the most recent ``generate``/``respond`` call on ``session``: prompt and generated token counts, time to first token, and prefill and generation throughput in tokens per second.

:Arguments:
    * **session** : :ref:`Session <struct-dasllama_common-Session>`

++++++++++
Generation
++++++++++

* :ref:`generate (model: Model; var session: Session; prompt: array\<int64\>; params: SamplingParams; max_tokens: int64; blk: block\<(id:int64;piece:string):bool\>) : int64 <function-dasllama_generate_Model_Session_array_ls_int64_gr__SamplingParams_int64_block_ls_id_c_int64;piece_c_string_c_bool_gr_>`

.. _function-dasllama_generate_Model_Session_array_ls_int64_gr__SamplingParams_int64_block_ls_id_c_int64;piece_c_string_c_bool_gr_:

.. das:function:: generate(model: Model; session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool>) : int64

Stream-generate up to ``max_tokens`` from ``prompt``, invoking the trailing block per token with ``(id, piece)``; return ``false`` from the block to stop early (e.g. on a stop token). Prefills the prompt in one ``eval``, then samples one token at a time, advancing the session. Returns the number of tokens emitted; timing is available afterwards via ``stats``.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **session** : :ref:`Session <struct-dasllama_common-Session>`
    * **prompt** : array<int64>
    * **params** : :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`
    * **max_tokens** : int64
    * **blk** : block<(id:int64;piece:string):bool>

++++
Chat
++++

* :ref:`add_user (var chat: ChatSession; text: string) <function-dasllama_add_user_ChatSession_string>`
* :ref:`create_chat (model: Model; system: string = ""; max_new: int64 = 256) : ChatSession <function-dasllama_create_chat_Model_string_int64>`
* :ref:`render_turn (model: Model; chat: ChatSession) : array\<int64\> <function-dasllama_render_turn_Model_ChatSession>`
* :ref:`respond (model: Model; var chat: ChatSession; params: SamplingParams; blk: block\<(piece:string):bool\>) : string <function-dasllama_respond_Model_ChatSession_SamplingParams_block_ls_piece_c_string_c_bool_gr_>`

.. _function-dasllama_add_user_ChatSession_string:

.. das:function:: add_user(chat: ChatSession; text: string)

Queue a user message for the next ``respond``.

:Arguments:
    * **chat** : :ref:`ChatSession <struct-dasllama_chat-ChatSession>`
    * **text** : string

.. _function-dasllama_create_chat_Model_string_int64:

.. das:function:: create_chat(model: Model; system: string = ""; max_new: int64 = 256) : ChatSession

Start a conversation over ``model``: resolves the model's chat template (sniffed from the GGUF's embedded template, falling back to the arch registry) and creates the session. ``system`` is the system prompt (empty = none); ``max_new`` caps the number of tokens in each reply. One model can drive many chats.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **system** : string
    * **max_new** : int64

.. _function-dasllama_render_turn_Model_ChatSession:

.. das:function:: render_turn(model: Model; chat: ChatSession) : array<int64>

Render the next turn's prefill token ids (BOS + system on the first turn, then the user turn and the generation prompt) without running the model. Useful for inspection, token budgeting, and tests.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **chat** : :ref:`ChatSession <struct-dasllama_chat-ChatSession>`

.. _function-dasllama_respond_Model_ChatSession_SamplingParams_block_ls_piece_c_string_c_bool_gr_:

.. das:function:: respond(model: Model; chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool>) : string

Generate the assistant's reply to the queued user message: render the turn, prefill it, then stream pieces through the trailing block (return ``false`` to stop early) until a stop token or the ``max_new`` budget is reached. Terminates the turn in the KV cache and appends both turns to ``chat.history``. Returns the full reply text; timing is available via ``stats(chat.session)``.

:Arguments:
    * **model** : :ref:`Model <struct-dasllama_common-Model>`
    * **chat** : :ref:`ChatSession <struct-dasllama_chat-ChatSession>`
    * **params** : :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`
    * **blk** : block<(piece:string):bool>
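
Both ``generate`` and ``respond`` take a ``SamplingParams``. The order of operations documented for ``sample`` (repetition penalty, then temperature, top-k, softmax and a CDF draw, with greedy argmax when ``temp <= 0``) can be illustrated with a small Python sketch. This is not the dasLLAMA implementation: the function name and signature are hypothetical, the penalty formula is an assumption borrowed from the common llama.cpp-style rule (the reference above does not state one), and applying the penalty before the greedy branch is likewise assumed.

.. code-block:: python

    import math
    import random


    def sample(logits, recent_tokens, temp=0.0, top_k=0, penalty=1.0, rng=random):
        """Pick the next token id from raw logits (illustrative sketch only)."""
        scores = list(logits)

        # Repetition penalty over the recent window. ASSUMPTION: divide
        # positive logits and multiply negative ones; the source does not
        # say which formula dasLLAMA uses.
        if penalty != 1.0:
            for tok in set(recent_tokens):
                if scores[tok] > 0:
                    scores[tok] /= penalty
                else:
                    scores[tok] *= penalty

        # temp <= 0 selects greedy argmax.
        if temp <= 0:
            return max(range(len(scores)), key=scores.__getitem__)

        # Top-k cutoff: keep the k highest-scoring candidates (0 = no cutoff).
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        if top_k > 0:
            order = order[:top_k]

        # Temperature-scaled softmax weights, shifted by the maximum score
        # for numerical stability (left unnormalized; the draw scales by total).
        peak = scores[order[0]]
        weights = [math.exp((scores[i] - peak) / temp) for i in order]
        total = sum(weights)

        # Inverse-CDF draw: walk the cumulative sum until it passes r.
        r = rng.random() * total
        acc = 0.0
        for tok, w in zip(order, weights):
            acc += w
            if r < acc:
                return tok
        return order[-1]  # guard against floating-point round-off

With the defaults (``temp=0.0``, ``penalty=1.0``) the sketch reduces to plain argmax, which mirrors the statement that ``SamplingParams()`` is greedy. Passing a seeded ``random.Random`` instance as ``rng`` plays the role that ``set_seed`` plays for a session.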