8.7.1. dasLLAMA-01 — Hello, Generation

This tutorial introduces dasLLAMA, CPU large-language-model inference in pure daslang. It covers three steps: loading a GGUF model, tokenizing text, and streaming a text continuation. Everything goes through the public facade, dasllama/dasllama; requiring it also registers every supported architecture (Llama-2/3, Mistral, Qwen, Phi-3, Gemma-2/3/4, Qwen-MoE, gpt-oss), so the same program runs any of them.

The companion .das files are in tutorials/dasLLAMA/. They need a GGUF model file on disk, because models are not shipped with the repo; a good tiny one is SmolLM2-135M-Instruct Q8_0 (~145 MB). Pass the model path on the command line (or set DASLLAMA_MODEL), and always run with -jit, since interpreted inference is far too slow for model work:

daslang.exe -jit tutorials/dasLLAMA/01_hello_generate.das -- path/to/SmolLM2-135M-Instruct-Q8_0.gguf

8.7.1.1. Loading a model

load_model is the one entry point: it reads the GGUF, picks the architecture and tokenizer from the file’s metadata, and loads the weights at the precision you ask for. QuantMode.q8 (int8) is the fast everyday choice; QuantMode.fp32 is the token-exact reference the test suite validates against llama.cpp.

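require dasllama/dasllama

// path: the GGUF file given on the command line (or via DASLLAMA_MODEL)
var m <- load_model(path, QuantMode.q8)
let c = m.config
// Summarize what was loaded: architecture plus the main config fields
print("arch={m.arch} dim={c.dim} layers={c.n_layers} vocab={c.vocab_size} ctx={c.seq_len}\n")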

8.7.1.2. Tokens: encode, decode, piece

Models consume token ids, not text. encode uses the model’s own tokenizer (BPE or SentencePiece, chosen at load); decode is the inverse; piece decodes a single token — that’s what streaming callbacks receive.

let ids <- encode(m, "Once upon a time")
// -> [ 6403, 1980, 253, 655]
print("{decode(m, ids)}\n")       // -> Once upon a time
print("{piece(m, ids[1])}\n")     // -> " upon"
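To see how a prompt maps onto tokens, it can help to print every id next to its piece and compare the prompt length with the context size. The following is a minimal sketch that uses only the calls shown in this tutorial. The model path is a placeholder, `options gen2` is an assumption matching the brace syntax of the snippets here, and the `int(...)` around `seq_len` is an assumption about that field's numeric type.

```daslang
options gen2   // assumption: brace-style syntax, as in the snippets above

require dasllama/dasllama

[export]
def main() {
    // Placeholder path: point it at a GGUF file on your disk.
    let path = "path/to/SmolLM2-135M-Instruct-Q8_0.gguf"
    var m <- load_model(path, QuantMode.q8)
    let ids <- encode(m, "Once upon a time")
    // One line per token: the id, then the text it decodes to.
    // Brackets make leading spaces in a piece visible.
    for (id in ids) {
        print("{id}\t[{piece(m, id)}]\n")
    }
    // Prompt length versus the context length reported by the config.
    let ctx = int(m.config.seq_len)
    print("prompt uses {length(ids)} of {ctx} context tokens\n")
    // decode turns the whole id list back into text.
    print("{decode(m, ids)}\n")
}
```

The ids and pieces depend on the model's tokenizer, so the same prompt will generally produce different output with a different GGUF file.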

8.7.1.3. Generating

A Session holds the KV cache and scratch buffers for one generation stream. generate prefills the prompt, then samples one token at a time, calling the trailing block with (id, piece) — return false to stop early. SamplingParams() defaults are greedy (always the most likely token); tutorial 03 covers the knobs.

The kernels do their threading through the job queue, so wrap generation in with_job_que() (from daslib/jobque_boost); without one, model code panics when it asks for a queue.

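require daslib/jobque_boost            // provides with_job_que

with_job_que() {
    set_jobque_fork_pool(true, true)   // pool per-job fork contexts
    var s = create_session(m)          // KV cache + scratch for this stream
    generate(m, s, ids, SamplingParams(), 48l) $(_id, piece) {
        // Stream each piece to stdout as soon as it is sampled
        fprint(fstdout(), piece)
        fflush(fstdout())
        return true                    // keep going; return false to stop early
    }
}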
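The boolean returned by the trailing block is the hook for custom stop conditions. As an illustration, the sketch below stops once a character budget is exhausted. It relies only on the calls shown above; the model path and the `budget` value are hypothetical, `options gen2` is assumed to match the brace syntax used here, and `require fio` is assumed to be the module that provides `fprint`, `fstdout`, and `fflush`.

```daslang
options gen2   // assumption: brace-style syntax, as in the snippets above

require dasllama/dasllama
require daslib/jobque_boost
require fio    // assumption: source of fprint / fstdout / fflush

[export]
def main() {
    // Placeholder path: point it at a GGUF file on your disk.
    let path = "path/to/SmolLM2-135M-Instruct-Q8_0.gguf"
    var m <- load_model(path, QuantMode.q8)
    let ids <- encode(m, "Once upon a time")
    with_job_que() {
        set_jobque_fork_pool(true, true)
        var s = create_session(m)
        let budget = 80   // hypothetical limit, in characters of output
        var chars = 0     // running total, updated from inside the block
        generate(m, s, ids, SamplingParams(), 48l) $(_id, piece) {
            fprint(fstdout(), piece)
            fflush(fstdout())
            chars += length(piece)
            // false ends generation early; true asks for the next token
            return chars < budget
        }
        print("\nstopped after {chars} characters\n")
    }
}
```

Generation ends at whichever comes first: the block returning false or the limit passed to generate. The piece that crosses the budget is still printed, because the check runs after the output.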

8.7.1.4. Stats

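stats reports the last generate/respond call on the session: token counts, time to first token (ttft), and prefill / generation throughput in tokens per second. Call it while the session is still in scope; in the snippet above, s lives inside the with_job_que() block.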

let st = stats(s)
print("prompt {st.n_prompt} tok | gen {st.n_gen} tok | ttft {int(st.ttft_s * 1000.0lf)}ms | ")
print("prefill {int(st.prefill_tps)} t/s | gen {int(st.gen_tps)} t/s\n")
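Throughput is often easier to reason about as per-token latency. The sketch below derives an approximate figure from `gen_tps`, reusing the `int(...)` conversion from the print line above, so the result is rounded by integer division. The model path is a placeholder and `options gen2` is an assumption matching the brace syntax used in this tutorial.

```daslang
options gen2   // assumption: brace-style syntax, as in the snippets above

require dasllama/dasllama
require daslib/jobque_boost

[export]
def main() {
    // Placeholder path: point it at a GGUF file on your disk.
    let path = "path/to/SmolLM2-135M-Instruct-Q8_0.gguf"
    var m <- load_model(path, QuantMode.q8)
    let ids <- encode(m, "Once upon a time")
    with_job_que() {
        set_jobque_fork_pool(true, true)
        var s = create_session(m)
        // Generate without printing; only the timing matters here.
        generate(m, s, ids, SamplingParams(), 48l) $(_id, _piece) {
            return true
        }
        let st = stats(s)
        let tps = int(st.gen_tps)   // whole tokens per second
        if (tps > 0) {
            // 1000 ms divided by tokens per second gives ms per token
            print("about {1000 / tps} ms per generated token\n")
        }
    }
}
```

The `tps > 0` guard skips the report when throughput rounds down to zero, which avoids a division by zero.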

8.7.1.5. Quick Reference

| Function | Description |
| --- | --- |
| `load_model(path, mode)` | GGUF → `Model` (arch + tokenizer auto-detected) |
| `create_session(model)` | Fresh KV cache + scratch, sized to `config.seq_len` |
| `encode(model, text)` | Text → token ids (model’s own tokenizer) |
| `decode(model, ids)` / `piece(model, id)` | Token ids → text / one token’s text |
| `generate(model, s, prompt, params, n)` | Streaming generation, trailing block per token |
| `stats(s)` | Counts, ttft, prefill / generation tok/s |