Sampling from LLMs: art & science
Little lies we tell ourselves when talking about LLMs:
- LLMs are simple: text in/text out.
- Only use higher temperatures for more creative outputs.
- Setting temperature=0 when sampling gives the most accurate label, and makes evals deterministic.
- Logits == model confidence, even when using reasoning models.
Let's take a closer look!
LLMs are probability graphs, not text generators
I'm being very pedantic here, but it's worth separating the raw model from how we use it in practice.
The model itself does not really generate a sequence of tokens: it only returns, for each token in its vocabulary, the likelihood that it will be the next one in the sequence. It's only by repeating this process over multiple steps, selecting a single token each time, that we generate text.
What the model actually produces is a lot richer than a single string: it's a complex probability graph with countless possible continuations.
When using a chatbot or your favorite LLM API endpoint, what you get is only one sampled path from this complex probability graph, partly for convenience (you really want to use a single response), but mostly because we would need to pave all of Greenland with data centers to explore this full graph for a single prompt.
It gets trippy when you imagine that every knob you tweak (your prompt, temperature, reasoning level, etc.) is really reshaping this complex graph in very counter-intuitive ways, not just changing which specific response you get.
Temperature
Temperature controls how peaked or flat the distribution is.
At temperature=0, we greedily pick the most-probable token at each step. This might seem like the "obviously right" thing to do, but it's the equivalent of sweeping the uncertainty under the rug: the model might only be 52% confident in its answer, but we squint our eyes and pretend it's 100%.
At temperature=1, you get the raw distribution: much noisier and more unpredictable, but more honest and representative of what the actual distribution looks like.
Below is an example where the likelihood of the ultimate answer being RELEVANT is 70%.
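To make this concrete, here is a minimal sketch of how temperature reshapes a distribution over next tokens. The two-token vocabulary and the logit values are made up for illustration, chosen so that the softmax at temperature=1 lands near the 70/30 split above:

```python
import math

def sample_distribution(logits, temperature):
    """Turn raw logits into a probability distribution at a given temperature.

    temperature -> 0 approaches greedy decoding (all mass on the argmax);
    temperature = 1 keeps the raw softmax distribution.
    """
    if temperature == 0:
        # Greedy: pretend the argmax token has 100% probability.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy two-token vocabulary: ["RELEVANT", "IRRELEVANT"] (illustrative logits).
logits = [1.0, 0.153]
print(sample_distribution(logits, 1.0))  # ~[0.70, 0.30]: honest uncertainty
print(sample_distribution(logits, 0))    # [1.0, 0.0]: uncertainty swept away
```

Note how temperature=0 turns a genuinely uncertain 70/30 call into an apparent certainty.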
Log probabilities
Of course, you can directly read the logprobs from the model. One API call tells you whether a prediction is at 97% confidence (signal) or 51% (coin flip).
Now, this is only feasible when using a single token at a time... but even that is becoming challenging with the popularity of reasoning/thinking models.
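As a sketch of what reading logprobs looks like in practice: APIs typically return log probabilities, so exponentiating recovers the confidence. The response shape and the token names below are assumptions for illustration, not any specific provider's format:

```python
import math

# Hypothetical top logprobs for the final label token, as an LLM API
# might return them (exact response shape varies by provider).
top_logprobs = {"RELEVANT": -0.0305, "IRRELEVANT": -3.50}

# exp(logprob) recovers the probability the model assigned to each token.
confidence = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
print(confidence)  # RELEVANT at ~0.97: a real signal, not a coin flip
```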
Reasoning models
Thinking models fundamentally change the probability graph you're sampling from.
Taking the example of a classifier (e.g. predict a single YES or NO token):
- Without reasoning, you are sampling from P(label | prompt). Simple enough: you can directly use logprobs to get confidence scores.
- With reasoning, the model first generates a thinking trace, and the final answer is conditioned on that trace: P(label | prompt + reasoning).
The reasoning trace isn't neutral: it will often anchor the likelihood of converging to different answers, and a wrong turn early in the trace can lock the model into an incorrect conclusion. By the time it outputs a final answer, its logits will show high confidence in that wrong answer, simply because it's now logically consistent with the flawed trace it just generated.
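A toy calculation makes the anchoring effect visible. All probabilities below are invented for illustration; the point is that marginalizing over traces gives one answer, while committing to a single trace gives another:

```python
# Toy model: two possible reasoning traces, each shifting the final
# answer distribution. Each tuple is (name, P(trace), P(YES | trace)).
traces = [
    ("careful step-by-step", 0.6, 0.9),
    ("flawed shortcut",      0.4, 0.2),
]

# Marginal probability of YES, summed over all traces:
p_yes = sum(p_trace * p_yes_given for _, p_trace, p_yes_given in traces)
print(p_yes)  # 0.62: the model leans YES overall

# But a single greedy run commits to ONE trace. If it happens to take
# the flawed shortcut, the final-token logits will confidently say NO
# (80%), because NO is logically consistent with that flawed trace.
```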
Prompt structure (e.g., "list reasons for and against" vs. "think step by step") steers which branches get explored. On challenging borderline cases in particular, different reasoning styles will often lead to very different outcomes.
What can go wrong in practice
Uncalibrated evals
We all strive for reproducible, noise-free evals.
Using temperature=0 seemingly produces that outcome, but creates a bigger problem: systematic bias.
While random noise is annoying, it can still cancel out over many runs. Systematic bias, on the other hand, is here to stay.
Using temperature=0 in evals will consistently amplify any systematic biases your LLM of choice suffers from. Examples of such systematic biases include:
- preferring results mentioned first or last
- preferring results in English over other languages
- preferring formal language over slang
- ... anything else specific to your domain.
Greedy sampling amplifies a slight bias (e.g. 55% preference) into a 100% likelihood of leaning in that direction.
TL;DR: Consider whether you can live with more variance in exchange for less systematic bias.
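A quick simulation shows the amplification, using a made-up 55% positional bias:

```python
import random

random.seed(42)

# Suppose the model has a mild positional bias: a 55% preference for the
# option listed first, regardless of content (number is illustrative).
p_first = 0.55
n = 10_000

# Greedy decoding (temperature=0) always follows the argmax, so it picks
# the first option every single time.
greedy_picks = n

# Sampling at temperature=1 picks the first option at its true rate.
sampled_picks = sum(1 for _ in range(n) if random.random() < p_first)

print(greedy_picks / n)   # 1.0: the 55% bias becomes a 100% preference
print(sampled_picks / n)  # ~0.55: sampling preserves the true rate
```

The sampled runs are noisier, but that noise averages out; the greedy bias never does.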
Mode collapse when distilling to smaller models
When generating training data for distillation, temperature=0 will confidently assign hard labels to very borderline examples. The student model is then trained on these mistakes (with full conviction!).
This is particularly bad as smaller student models are more prone to overfitting. We also lose the well-documented benefits of soft-label distillation, i.e. training student models on the teacher's probability estimates instead of its discrete predictions.
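Here is a minimal sketch of what soft labels look like, using the temperature-softened softmax from the classic distillation recipe (logit values are illustrative):

```python
import math

def soft_labels(logits, temperature=2.0):
    """Teacher's softened probabilities, the targets used in
    soft-label distillation (temperature > 1 flattens the distribution)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# A genuinely borderline example: classes 0 and 1 are nearly tied.
teacher_logits = [2.0, 1.8, -1.0]
print(soft_labels(teacher_logits))  # soft targets preserve the near-tie

# A hard label (argmax at temperature=0) would train the student on
# class 0 with 100% conviction, erasing that ambiguity entirely.
```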
This is a great excuse to re-read the classic distillation paper, written by the absolute ML dream team (Geoff, Jeff, and Oriol!).
Reasoning collapse when distilling to LLMs
There are many valid and equally good ways to get to the truth. Learning this is a prerequisite to true reasoning, and conversely, only knowing a single path to each answer is more akin to rote memorization.
Sampling reasoning traces greedily (e.g. temperature=0) collapses this reasoning diversity, and will most likely lead to systematically biased data. A student model trained on these traces is likely to parrot "reasoning-like" tokens instead of genuinely learning the semantics and process of reasoning.
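One simple mitigation is to sample several independent traces per prompt at temperature > 0. The `generate` callable below is hypothetical, standing in for whatever LLM API you use; the stub only exists to make the sketch runnable:

```python
import random

def diverse_traces(generate, prompt, n=4, temperature=0.8):
    """Sample n independent reasoning traces at temperature > 0.

    `generate` is a hypothetical callable wrapping your LLM of choice;
    in real use, each call explores a different path through the
    probability graph instead of repeating the single greedy one.
    """
    return [generate(prompt, temperature=temperature) for _ in range(n)]

# Stub standing in for a real model call, just to demo the shape:
random.seed(0)
fake = lambda prompt, temperature: random.choice(["trace A", "trace B", "trace C"])
traces = diverse_traces(fake, "Is this review positive?")
print(len(traces))  # 4
```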
Distribution mismatch
A teacher model will be strong on easy examples and weak on hard ones. Without temperature-based sampling to surface uncertainty, you get dense high-quality labels where you don't need them and sparse low-quality labels where you do.
The borderline examples, the ones the model is uncertain about, are usually the most valuable training signal, and temperature=0 throws that signal away.
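One way to quantify "borderline" is the entropy of the teacher's predicted distribution; the example probabilities below are made up to show the contrast:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Teacher's predicted label distribution per example (illustrative).
examples = {
    "easy":       [0.98, 0.02],
    "borderline": [0.55, 0.45],
}

# Higher entropy = more uncertainty = more valuable training signal,
# which temperature=0 labeling flattens into an overconfident hard label.
for name, probs in examples.items():
    print(name, round(entropy(probs), 3))
```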
Once more, something as simple as sampling strategies can hide quite a lot of useful insights.
Next time, we might get into Simple Self-Distillation (SSD) which is a great example of how understanding the basics can enable us to get the most out of existing models, and build even better ones in the future.