8.7.3. dasLLAMA-03 — Sampling

How the next token gets picked. SamplingParams has four knobs, and their defaults are greedy:

struct SamplingParams {
    temp : float = 0.0             // <= 0 => greedy argmax; otherwise softmax temperature
    top_k : int64 = 0l             // <= 0 or >= vocab => no top-k cutoff
    penalty : float = 1.0          // repetition penalty over recent tokens (1.0 = none)
    penalty_last_n : int64 = 64l   // repetition-penalty window
}

Run it like tutorial 01:

daslang.exe -jit tutorials/dasLLAMA/03_sampling.das -- path/to/model.gguf

8.7.3.1. Greedy: deterministic, and loop-prone

Greedy argmax takes the single most likely token every step. Two runs are identical — but on small models the most likely continuation of a repetitive context is more repetition, so greedy text tends to loop. SmolLM2-135M demonstrates on cue:

greedy:, I was a young man, a young woman, and a young man again. I was a
young man, a young woman, and a young man again. ...

8.7.3.2. The repetition penalty

penalty > 1 scales down the logits of the last penalty_last_n generated tokens before picking, so the argmax can’t keep choosing the same phrase. Still fully deterministic — no randomness involved:

greedy + penalty 1.3:, I was a young man with dreams of becoming an
engineer. My parents were both engineers and they encouraged my passion...

8.7.3.3. Temperature, top-k, and seeds

temp > 0 samples from the softmax distribution (higher = more adventurous); top_k > 0 first cuts it to the k most likely tokens. Sampling draws from the session’s RNG, so variety comes from the seed — and set_seed makes any run exactly reproducible:

def run_once(m : Model; prompt : array<int64>; params : SamplingParams; seed : int) : string {
    var s = create_session(m)   // fresh session, so runs compare cleanly
    set_seed(s, seed)
    var parts : array<string>
    generate(m, s, prompt, params, 40l) $(_id, piece) {
        parts |> push(piece)
        return true
    }
    let out = join(parts, "")
    delete s                    // six sessions in one run — free each (tutorial 04)
    return out
}

let params = SamplingParams(temp = 0.8, top_k = 40l, penalty = 1.1)
let s7 = run_once(m, prompt, params, 7)        // a story about Alex
let s8 = run_once(m, prompt, params, 8)        // a story about Kanaq
let s7again = run_once(m, prompt, params, 7)   // s7 again, token for token