8.7.2. dasLLAMA-02 — Chat and Templates
Chat is plain generation over a templated prompt. This tutorial holds a real multi-turn conversation with a chat-tuned model, inspects the transcript, and then lifts the hood on how turns become tokens.
Run it like tutorial 01 (SmolLM2-135M-Instruct works well):
daslang.exe -jit tutorials/dasLLAMA/02_chat.das -- path/to/model.gguf
8.7.2.1. A conversation in three calls
create_chat resolves the model’s chat template (sniffed from the GGUF’s
embedded template, falling back to the arch registry) and creates the
session. add_user queues a message; respond renders the turn, prefills
it, and streams the reply until it hits a stop token or exhausts the max_new budget.
var chat = create_chat(m, "You are a helpful, friendly assistant.")
add_user(chat, "What is the capital of France?")
respond(m, chat, SamplingParams()) $(piece) {
    fprint(fstdout(), piece)
    fflush(fstdout())
    return true
}
// -> The capital of France is Paris.
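The stopping rule can be sketched without a model. The following is a minimal Python illustration, not dasLLAMA code: stream_reply, token_source, and the reason strings are hypothetical names, and it assumes that a false return from the callback cancels the stream, which the calls above suggest but do not state.

```python
def stream_reply(token_source, on_piece, stop_ids, max_new):
    """Pass decoded pieces to on_piece until a stop token, the budget, or a cancel."""
    produced = 0
    for token_id, piece in token_source:
        if token_id in stop_ids:
            return "stop_token"   # the stop token itself is never streamed
        if produced >= max_new:
            return "max_new"      # budget spent before a stop token arrived
        produced += 1
        if not on_piece(piece):
            return "callback"     # assumption: a false return cancels the stream
    return "exhausted"            # the toy source ran dry


def collect(pieces):
    """Build a callback that stores each piece and asks to keep streaming."""
    def on_piece(piece):
        pieces.append(piece)
        return True
    return on_piece


# Made-up (token id, text piece) pairs; id 2 plays the stop token.
tokens = [(10, "The"), (11, " capital"), (12, " is"), (13, " Paris."), (2, "")]
pieces = []
reason = stream_reply(iter(tokens), collect(pieces), stop_ids={2}, max_new=16)
print("".join(pieces), "|", reason)  # The capital is Paris. | stop_token
```

Checking the stop token before the budget means a reply that ends exactly on its last allowed token still reports a clean stop.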
8.7.2.2. Multi-turn memory
A follow-up like “And of Italy?” only makes sense with the first turn in
context. Nothing is re-prefilled: every turn so far is already in the
session’s KV cache, so each respond only evaluates the new tokens.
add_user(chat, "And of Italy?")
// respond(...) -> The capital of Italy is Rome.
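To see why the cache matters, compare how many tokens get evaluated with and without it. This is a rough Python accounting sketch, not dasLLAMA code: ToySession and reprefill_cost are hypothetical, the token counts are invented, and every prompt and reply token is assumed to pass through the model exactly once.

```python
class ToySession:
    """Counts tokens as a KV-cached session would; no real model involved."""

    def __init__(self):
        self.cached = 0      # tokens already sitting in the cache
        self.evaluated = 0   # tokens run through the model so far

    def respond(self, prompt_tokens, reply_tokens):
        # Only the new turn's prompt and its generated reply are evaluated;
        # everything from earlier turns is reused from the cache.
        new_tokens = prompt_tokens + reply_tokens
        self.evaluated += new_tokens
        self.cached += new_tokens
        return new_tokens


def reprefill_cost(turns):
    """Tokens evaluated if every turn re-prefilled the whole conversation."""
    total = 0
    context = 0
    for prompt_tokens, reply_tokens in turns:
        total += context + prompt_tokens + reply_tokens
        context += prompt_tokens + reply_tokens
    return total


# Invented sizes: (prompt tokens, reply tokens) for two turns.
turns = [(30, 8), (9, 8)]
session = ToySession()
for prompt_tokens, reply_tokens in turns:
    session.respond(prompt_tokens, reply_tokens)

print(session.evaluated, reprefill_cost(turns))  # 55 93
```

The gap widens with every turn: the cached cost grows linearly with conversation length, the re-prefill cost quadratically.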
8.7.2.3. The transcript
chat.history records both sides of the conversation (for Qwen3-style
thinking models, with the <think> block stripped per protocol):
for (msg in chat.history) {
    print(" {msg.role}: {msg.content}\n")
}
8.7.2.4. Under the hood: render_turn
render_turn shows the exact prefill the next respond would evaluate, without
running the model, so you can inspect the template wrapping and count the tokens against your budget:
add_user(chat, "Thanks!")
let turn <- render_turn(m, chat) // read-only: the message stays queued
print("next turn prefill: {length(turn)} tokens: {turn}\n")
print("decoded text: \"{decode(m, turn)}\"\n")
On a ChatML-family model the ids come out as
[1, 4093, 198, 16937, 17, 2, 198, 1, 520, 9531, 198], and the decoded
text is just user\nThanks!\nassistant\n. The template’s
<|im_start|> / <|im_end|> wrapping consists of special tokens: atomic ids that
the model sees but that decode renders as nothing. Only the text between them
survives a round-trip.
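The round-trip can be reproduced with a toy vocabulary. This is a Python sketch, not dasLLAMA code: toy_decode is hypothetical, and the id-to-text table is an assumption chosen to be consistent with the ids and decoded text quoted above. In particular, how "assistant" splits across ids 520 and 9531 is a guess; only the concatenation matters.

```python
# Assumed mapping for illustration; a real tokenizer's table lives in the GGUF.
SPECIAL = {1: "<|im_start|>", 2: "<|im_end|>"}
PIECES = {
    4093: "user",
    198: "\n",
    16937: "Thanks",
    17: "!",
    520: "ass",       # guessed split of "assistant"
    9531: "istant",   # guessed split of "assistant"
}


def toy_decode(ids, show_special=False):
    """Join token texts; special tokens render as nothing unless requested."""
    out = []
    for token_id in ids:
        if token_id in SPECIAL:
            if show_special:
                out.append(SPECIAL[token_id])
        else:
            out.append(PIECES[token_id])
    return "".join(out)


turn = [1, 4093, 198, 16937, 17, 2, 198, 1, 520, 9531, 198]
print(repr(toy_decode(turn)))
# 'user\nThanks!\nassistant\n'
print(repr(toy_decode(turn, show_special=True)))
# '<|im_start|>user\nThanks!<|im_end|>\n<|im_start|>assistant\n'
```

The second line is what the template rendered; the first is what is left after the special ids drop out, which is why the decoded text cannot be re-tokenized into the same prefill.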
See also
Full source: tutorials/dasLLAMA/02_chat.das
Next tutorial: dasLLAMA-03 — Sampling
The interactive chat REPL: examples/dasLLAMA/chat.das