.. _stdlib_dasllama:

==========================================================
dasLLAMA LLM inference: models, sessions, generation, chat
==========================================================

.. das:module:: dasllama

CPU large-language-model inference in pure daslang: load a GGUF model, tokenize, run the
transformer, sample — or hold a full chat — validated token-for-token against llama.cpp on every
supported family. Run with ``-jit``; ``examples/dasLLAMA/run.das`` and ``chat.das`` show the
canonical program shape.

Supported model families (GGUF, fp32 / q8 / q4 weights):

* **Llama** — Llama-2 / TinyLlama, Llama-3.1 / 3.2, Mistral-7B-Instruct, SmolLM2, plus llama2.c ``.bin`` checkpoints
* **Qwen** — Qwen2.5, Qwen3 (QK-norm), Qwen1.5-MoE (shared + routed experts)
* **Phi** — Phi-3.5-mini
* **Gemma** — Gemma-2, Gemma-3 (per-layer sliding-window patterns), Gemma-4-12B (p-RoPE, heterogeneous attention geometry)
* **gpt-oss** — gpt-oss-20b (attention sinks, MXFP4 weights, YaRN long context, Harmony chat format)

The architecture is picked from GGUF metadata at load — the same program runs any of these.

Hands-on tutorials (:ref:`overview <tutorials_dasllama>`):
:ref:`hello, generation <tutorial_dasLLAMA_hello_generate>`,
:ref:`chat and templates <tutorial_dasLLAMA_chat>`,
:ref:`sampling <tutorial_dasLLAMA_sampling>`,
:ref:`sessions and memory <tutorial_dasLLAMA_sessions_and_memory>`,
:ref:`performance <tutorial_dasLLAMA_performance>`,
:ref:`the architecture registry <tutorial_dasLLAMA_add_an_arch>`.


+++++
Types
+++++

The engine types the API below works with. They are created and consumed by the
functions of this module; their remaining fields are engine implementation detail.

.. _struct-dasllama_common-Model:

.. das:attribute:: Model

A loaded model: weights, ``config``, tokenizer, and the architecture's blocks and chat template, as produced by ``load_model``. User code touches ``config`` (e.g. cap ``config.seq_len`` before ``create_session`` on large-context models) and ``arch`` (the GGUF architecture name).

.. _struct-dasllama_common-Session:

.. das:attribute:: Session

One generation stream over a model: the KV cache, scratch buffers, sampling RNG, and the current position ``n_past``. ``logits`` holds the distribution produced by the last ``eval``. A model serves many independent sessions.

.. _enum-dasllama_common-QuantMode:

.. das:attribute:: QuantMode

Weight representation picked at load: ``fp32`` — the token-exact reference; ``q8`` — int8 quantization, the fast CPU path; ``q4`` — 4-bit, smallest footprint.

.. _struct-dasllama_common-SamplingParams:

.. das:attribute:: SamplingParams

Sampling knobs: ``temp`` (<= 0 selects greedy argmax), ``top_k`` (0 = no cutoff), and repetition ``penalty`` (1.0 = none) applied over the last ``penalty_last_n`` generated tokens. The defaults are greedy — ``SamplingParams()`` reproduces argmax exactly.

.. _struct-dasllama_common-Stats:

.. das:attribute:: Stats

Timing of the last ``generate``/``respond`` call: ``n_prompt``/``n_gen`` token counts, ``ttft_s`` (seconds to first token), and ``prefill_tps``/``gen_tps`` throughput in tokens per second.

.. _struct-dasllama_chat-ChatSession:

.. das:attribute:: ChatSession

A conversation over a model: its ``session``, the resolved chat template, and the running transcript in ``history``. Create with ``create_chat``, then drive with ``add_user`` + ``respond``.


++++++++++++++++++++++++++
Model loading and sessions
++++++++++++++++++++++++++

  *  :ref:`create_session (model: Model) : Session <function-dasllama_create_session_Model>`
  *  :ref:`load_model (path: string; mode: QuantMode = dasllama_common::QuantMode.fp32) : Model <function-dasllama_load_model_string_QuantMode>`

.. _function-dasllama_create_session_Model:

.. das:function:: create_session(model: Model) : Session

Create a fresh session (KV cache + scratch) sized to ``model.config.seq_len``. A model serves
many sessions, each with its own position and cache — one model, many independent conversations.
On large-context models cap ``model.config.seq_len`` BEFORE creating sessions to bound KV memory.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

.. _function-dasllama_load_model_string_QuantMode:

.. das:function:: load_model(path: string; mode: QuantMode = dasllama_common::QuantMode.fp32) : Model

Load a model AND its tokenizer from a GGUF file — the one entry point. The architecture and
tokenizer backend are auto-selected from GGUF metadata; ``mode`` picks the weight quantization
(``QuantMode.fp32`` = the token-exact reference, ``QuantMode.q8`` = the fast int8 path).


:Arguments: * **path** : string

            * **mode** :  :ref:`QuantMode <enum-dasllama_common-QuantMode>`


+++++++++
Tokenizer
+++++++++

  *  :ref:`decode (model: Model; ids: array\<int64\>) : string <function-dasllama_decode_Model_array_ls_int64_gr_>`
  *  :ref:`encode (model: Model; text: string; add_special: bool = true; parse_special: bool = false) : array\<int64\> <function-dasllama_encode_Model_string_bool_bool>`
  *  :ref:`piece (model: Model; id: int64) : string <function-dasllama_piece_Model_int64>`

.. _function-dasllama_decode_Model_array_ls_int64_gr_:

.. das:function:: decode(model: Model; ids: array<int64>) : string

Decode a token-id sequence back to text with the model's tokenizer.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **ids** : array<int64>

.. _function-dasllama_encode_Model_string_bool_bool:

.. das:function:: encode(model: Model; text: string; add_special: bool = true; parse_special: bool = false) : array<int64>

Encode ``text`` to token ids with the model's tokenizer. ``add_special`` prepends BOS where the
model expects one. ``parse_special`` is reserved and not yet honored — special tokens reach the
model as atomic ids from the chat layer's template renderer, never by spelling them in text.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **text** : string

            * **add_special** : bool

            * **parse_special** : bool

.. _function-dasllama_piece_Model_int64:

.. das:function:: piece(model: Model; id: int64) : string

Decode a single token to its text piece — the streaming counterpart of ``decode``.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **id** : int64


+++++++++++++++++++++++
Evaluation and sampling
+++++++++++++++++++++++

  *  :ref:`eval (model: Model; var session: Session; tokens: array\<int64\>) <function-dasllama_eval_Model_Session_array_ls_int64_gr_>`
  *  :ref:`sample (var session: Session; params: SamplingParams) : int64 <function-dasllama_sample_Session_SamplingParams>`
  *  :ref:`set_seed (var session: Session; seed: int) <function-dasllama_set_seed_Session_int>`
  *  :ref:`stats (session: Session) : Stats <function-dasllama_stats_Session>`

.. _function-dasllama_eval_Model_Session_array_ls_int64_gr_:

.. das:function:: eval(model: Model; session: Session; tokens: array<int64>)

THE eval primitive: run ``tokens`` at the session's current position and advance it. Prefill =
``eval(prompt)``; each generation step = ``eval([token])`` — the same call at different batch
sizes. Logits land in ``session.logits``.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **session** :  :ref:`Session <struct-dasllama_common-Session>`

            * **tokens** : array<int64>

.. _function-dasllama_sample_Session_SamplingParams:

.. das:function:: sample(session: Session; params: SamplingParams) : int64

Sample the next token from ``session.logits`` per ``params``: repetition penalty over the recent
window, then temperature + top-k + softmax + CDF — or greedy argmax when ``params.temp <= 0``.
``SamplingParams()`` defaults are greedy.


:Arguments: * **session** :  :ref:`Session <struct-dasllama_common-Session>`

            * **params** :  :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`

.. _function-dasllama_set_seed_Session_int:

.. das:function:: set_seed(session: Session; seed: int)

Seed the session's sampling RNG for reproducible generation.


:Arguments: * **session** :  :ref:`Session <struct-dasllama_common-Session>`

            * **seed** : int

.. _function-dasllama_stats_Session:

.. das:function:: stats(session: Session) : Stats

Timing of the most recent ``generate``/``respond`` call on ``session``: prompt/generated token
counts, time to first token, prefill and generation tok/s.


:Arguments: * **session** :  :ref:`Session <struct-dasllama_common-Session>`


++++++++++
Generation
++++++++++

  *  :ref:`generate (model: Model; var session: Session; prompt: array\<int64\>; params: SamplingParams; max_tokens: int64; blk: block\<(id:int64;piece:string):bool\>) : int64 <function-dasllama_generate_Model_Session_array_ls_int64_gr__SamplingParams_int64_block_ls_id_c_int64;piece_c_string_c_bool_gr_>`

.. _function-dasllama_generate_Model_Session_array_ls_int64_gr__SamplingParams_int64_block_ls_id_c_int64;piece_c_string_c_bool_gr_:

.. das:function:: generate(model: Model; session: Session; prompt: array<int64>; params: SamplingParams; max_tokens: int64; blk: block<(id:int64;piece:string):bool>) : int64

Stream-generate up to ``max_tokens`` from ``prompt``, invoking the trailing block per token with
``(id, piece)``; return ``false`` from the block to stop early (e.g. on a stop token). Prefills
the prompt in one ``eval``, then samples one token at a time, advancing the session. Returns the
number of tokens emitted; timing is available afterwards via ``stats``.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **session** :  :ref:`Session <struct-dasllama_common-Session>`

            * **prompt** : array<int64>

            * **params** :  :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`

            * **max_tokens** : int64

            * **blk** : block<(id:int64;piece:string):bool>


++++
Chat
++++

  *  :ref:`add_user (var chat: ChatSession; text: string) <function-dasllama_add_user_ChatSession_string>`
  *  :ref:`create_chat (model: Model; system: string = ""; max_new: int64 = 256) : ChatSession <function-dasllama_create_chat_Model_string_int64>`
  *  :ref:`render_turn (model: Model; chat: ChatSession) : array\<int64\> <function-dasllama_render_turn_Model_ChatSession>`
  *  :ref:`respond (model: Model; var chat: ChatSession; params: SamplingParams; blk: block\<(piece:string):bool\>) : string <function-dasllama_respond_Model_ChatSession_SamplingParams_block_ls_piece_c_string_c_bool_gr_>`

.. _function-dasllama_add_user_ChatSession_string:

.. das:function:: add_user(chat: ChatSession; text: string)

Queue a user message for the next ``respond``.


:Arguments: * **chat** :  :ref:`ChatSession <struct-dasllama_chat-ChatSession>`

            * **text** : string

.. _function-dasllama_create_chat_Model_string_int64:

.. das:function:: create_chat(model: Model; system: string = ""; max_new: int64 = 256) : ChatSession

Start a conversation over ``model``: resolves the model's chat template (sniffed from the GGUF's
embedded template, falling back to the arch registry) and creates the session. ``system`` is the
system prompt (empty = none); ``max_new`` caps each reply. One model can drive many chats.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **system** : string

            * **max_new** : int64

.. _function-dasllama_render_turn_Model_ChatSession:

.. das:function:: render_turn(model: Model; chat: ChatSession) : array<int64>

Render the next turn's prefill token ids — BOS + system on the first turn, then the user turn
and the generation prompt — WITHOUT running the model. For inspection, token budgeting, tests.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **chat** :  :ref:`ChatSession <struct-dasllama_chat-ChatSession>`

.. _function-dasllama_respond_Model_ChatSession_SamplingParams_block_ls_piece_c_string_c_bool_gr_:

.. das:function:: respond(model: Model; chat: ChatSession; params: SamplingParams; blk: block<(piece:string):bool>) : string

Generate the assistant's reply to the queued user message: render the turn, prefill it, then
stream pieces through the trailing block (return ``false`` to stop early) until a stop token or
the ``max_new`` budget. Terminates the turn in the KV cache and appends both turns to
``chat.history``. Returns the full reply text; timing via ``stats(chat.session)``.


:Arguments: * **model** :  :ref:`Model <struct-dasllama_common-Model>`

            * **chat** :  :ref:`ChatSession <struct-dasllama_chat-ChatSession>`

            * **params** :  :ref:`SamplingParams <struct-dasllama_common-SamplingParams>`

            * **blk** : block<(piece:string):bool>