
LLM sampling pitfalls: easy to get it wrong


Take the simplest task one can use an LLM for: classify documents into relevant and irrelevant buckets. Easy enough!

But wait... what should I set temperature to? Should I sample multiple answers and take a majority vote? What if I use thinking models? Or should I just read the model's logprobs?

Read on if you would like to build intuition on why sampling from LLMs isn't trivial, and learn how to pick the right sampling strategy for your problem.


LLMs are probability graphs, not sequence generators

You know what they say:

all models are wrong, but some are useful

"Text in, text out" is a useful abstraction and makes it much easier to deal with LLMs in code... but it quickly breaks as soon as you get to more advanced applications: evals, long trajectory generation, distillation, etc.

In reality, an LLM akshually does not generate a sequence. It only computes a probability distribution over all tokens in its vocabulary at each step, rather than choosing any specific sequence.

How we sample from that graph of possibilities is an entirely different question, and should be informed by how LLMs are trained, and what you intend to do with the sequences you will generate.

Think of it as a directed graph: every node is a token, every edge is the model's confidence in that transition. What you see in the output is one path through that graph. The entire tree of alternatives the model considered is invisible.

That distinction — one sampled path vs. the full distribution — is why this article might be worth your time.


Temperature

Temperature controls how peaked or flat that distribution is before sampling. Low temperature sharpens it — the model almost always picks the highest-probability token. High temperature flattens it — more randomness, more diversity.
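To make this concrete, here's a minimal sketch of how temperature rescales a distribution before sampling, using a made-up pair of logits for a borderline relevant/irrelevant call:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A borderline document: the two labels get nearly tied logits.
logits = [1.2, 1.0]
for t in [0.1, 1.0, 2.0]:
    p_rel, p_irr = softmax_with_temperature(logits, t)
    print(f"temp={t}: relevant={p_rel:.2f}, irrelevant={p_irr:.2f}")
```

Lowering the temperature pushes the near-tie hard toward the higher-logit label; raising it pulls both labels back toward 50/50.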

Default intuition for classification: "I want deterministic output. Set temperature=0."

Sounds reasonable, but this is where things start to get tricky.

temperature=0

The model always returns the same output for the same input. Reproducible, stable, deterministic. Great!

Except... this is most often just an illusion, hiding the model's uncertainty.

Say you have a borderline document. The model internally assigns 52% probability to relevant. At temperature=0, you get relevant every single time, with total apparent confidence. You'd never know it was a coin flip.

temperature=0 doesn't make the model more accurate. It makes uncertainty invisible.

[Interactive demo: token sampling for the prompt "Is this document relevant?", tallying how often RELEVANT vs IRRELEVANT is drawn at an adjustable temperature]

temperature=1

Now you get the raw distribution. That 52%-confidence document returns relevant roughly half the time. More honest, but it introduces variance — borderline examples flip between runs, which is noisy if you're computing metrics or building a pipeline.
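That variance can be put to work: sample the same prompt several times and use label agreement as a confidence estimate. A sketch, where classify_once is a hypothetical stand-in for one API call at temperature=1 (here it just simulates a 52%-confidence borderline document):

```python
import random
from collections import Counter

def classify_once(prompt):
    # Hypothetical stand-in for an LLM call at temperature=1,
    # simulating a document the model rates 52% relevant.
    return "relevant" if random.random() < 0.52 else "irrelevant"

def classify_with_confidence(prompt, n=20):
    """Sample n labels; return the majority label and its empirical frequency."""
    votes = Counter(classify_once(prompt) for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label, count / n

random.seed(0)
label, confidence = classify_with_confidence("Is this document relevant?")
print(label, confidence)  # near a coin flip, the confidence hovers just above 0.5
```

The downside is cost: n calls per example just to estimate one probability.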

Logprobs: reading the distribution directly

Rather than sampling many times to estimate confidence, you can just look at the distribution. Most APIs expose logprobs — the log-probabilities the model assigned to each candidate token:

import math

from openai import OpenAI  # or any OpenAI-compatible client

client = OpenAI()

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

top = response.choices[0].logprobs.content[0].top_logprobs
probs = {t.token: math.exp(t.logprob) for t in top}
# {"relevant": 0.52, "irrelevant": 0.48}

This is reading the graph directly instead of sampling paths through it. A label at 51% is a coin flip. A label at 97% is signal. One API call, no repeated sampling.

A reasonable default: temperature=0 for stable outputs, logprobs to measure confidence where it matters.

Logprob Calibration

Traditional outputs hide the model's internal uncertainty.

[Interactive demo: the text output at temp=0 is always "RELEVANT", while the internal logprobs (RELEVANT 52%, IRRELEVANT 48%) reveal an unreliable coin flip; at 97% the same output would be a certain label]

Chain-of-thought

Ask the model to reason before labeling:

Is the following document relevant to the query?
Document: ...
Query: ...
Think step by step, then answer RELEVANT or IRRELEVANT.

This often improves accuracy. But it fundamentally changes what you're sampling from.

The label is conditioned on the path

Without CoT, you sample the label directly from P(label | prompt). With CoT, the model first generates a reasoning trace, and then the label is drawn from P(label | prompt + reasoning). The label now sits at the end of a longer path through the graph — and which path you took determines where you land.

If the model writes a confident case for relevant, the probability of outputting RELEVANT afterward is very high. The reasoning anchors the label. Good when the reasoning is sound. Bad when the model takes a wrong turn early and doubles down.

CoT Conditioning

Selection probability depends on the reasoning path taken.

[Interactive demo: baseline direct-label probability (RELEVANT 55%) vs the distribution after conditioning on a sampled reasoning path; running a scenario shows the distribution shift]

One path vs. many paths

At temperature=0, you get one deterministic reasoning trace, and one label locked to it. That trace is the greedily most-probable path — but greedy doesn't mean correct. It's the most locally probable sequence of tokens, not necessarily the most accurate chain of reasoning.

At temperature=1, different runs produce different reasoning paths, each conditioning the label differently. This is actually useful: if the label is stable across many different reasoning chains, the model reaches the same conclusion regardless of path. If labels flip, the model is genuinely uncertain — no matter how confident any single trace looks.

Prompt structure shapes the path

How you frame the reasoning changes which paths are likely:

Think step by step, then give your answer.

vs.

First, list reasons this document might be IRRELEVANT.
Then list reasons it might be RELEVANT.
Then give your final answer.

Different structures lead the model down different branches of the graph. On borderline cases, you'll often get different labels. The reasoning isn't a neutral preamble — it's part of the classification.

Practical implication: if you're generating labels with CoT, sampling multiple paths per example and treating label instability as an uncertainty signal is more informative than committing to a single greedy trace.
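A sketch of that workflow, where sample_cot_label is a hypothetical wrapper around one temperature=1 CoT call that returns the parsed final label:

```python
from collections import Counter

def label_stability(example, sample_cot_label, k=5):
    """Sample k reasoning paths; return the majority label and its agreement rate."""
    labels = [sample_cot_label(example) for _ in range(k)]
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label, count / k

def triage(dataset, sample_cot_label, k=5, threshold=0.8):
    """Keep stable labels automatically; flag unstable examples for review."""
    confident, needs_review = [], []
    for ex in dataset:
        label, agreement = label_stability(ex, sample_cot_label, k)
        bucket = confident if agreement >= threshold else needs_review
        bucket.append((ex, label, agreement))
    return confident, needs_review
```

The agreement rate here is exactly the "label stability across reasoning chains" signal described above: 5/5 agreement is trustworthy, 3/5 is a flag.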


Eval vs. training data

You've built this classifier. Now you want to use it for one of two things:

  1. Evaluation — scoring system outputs, running evals
  2. Data labeling — generating training data to distill into a smaller model

Same tool, nearly opposite failure modes.

As an evaluator

The main threat is systematic bias. Random noise averages out over enough examples. Bias doesn't.

Position bias: LLMs tend to prefer content that appears earlier in the prompt. Always put output A first and you'll systematically favor A.

Style bias: Fluent, confident-sounding text gets rated higher even when the content is worse.

Self-similarity bias: An evaluator from the same model family as the system being evaluated will prefer its own style. Evaluating Gemini 3 Flash with Gemini 3 Flash is not a neutral eval.

At temperature=0, these biases are perfectly deterministic — which makes them worse. Clean-looking, consistently skewed results. CoT helps by making the evaluator articulate its reasoning, which surfaces style bias. It doesn't eliminate it.
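Position bias at least has a cheap partial mitigation: judge each pair in both orders and only keep verdicts that survive the swap. A sketch, with judge as a hypothetical LLM call that returns "first" or "second":

```python
def debiased_compare(a, b, judge):
    """Run a pairwise judgment in both orders; trust it only if consistent."""
    v1 = judge(a, b)  # a shown first
    v2 = judge(b, a)  # b shown first
    winner1 = a if v1 == "first" else b
    winner2 = b if v2 == "first" else a
    if winner1 == winner2:
        return winner1  # same winner regardless of position
    return None         # verdict flipped with position: treat as a tie

# A judge that always prefers whichever answer appears first is
# pure position bias, and the swap exposes it.
always_first = lambda a, b: "first"
print(debiased_compare("answer A", "answer B", always_first))  # None
```

This doubles the cost per comparison, but it converts an invisible systematic bias into explicit ties you can count.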

As a training data generator

Here the concern flips to coverage and diversity. Systematic patterns in your labels get baked directly into the student model.

Label collapse: temperature=0 gives hard cases deterministic (often wrong) labels. Training data looks clean, but the student learns the wrong thing with high confidence.

Reasoning collapse: If the teacher always produces the same style of reasoning trace, the student learns to mimic that pattern rather than reason from scratch.

Distribution mismatch: The teacher is strong on some examples and weak on others. You end up with dense, high-quality labels on easy cases and sparse, low-quality labels on hard ones.

For training data, you actually want temperature — not for better accuracy on any single example, but to get diversity in reasoning paths and surface uncertainty on borderline cases. Those uncertain examples, labeled carefully, are usually the most valuable signal.
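One way to surface those borderline cases is to rank examples by the entropy of the teacher's label distribution (recoverable from logprobs, as shown earlier) and route the highest-entropy ones to careful labeling. A sketch with made-up numbers:

```python
import math

def label_entropy(probs):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical teacher label distributions for two documents.
examples = [
    ("doc_a", {"relevant": 0.97, "irrelevant": 0.03}),
    ("doc_b", {"relevant": 0.52, "irrelevant": 0.48}),
]

# Most uncertain (highest-entropy) examples first.
ranked = sorted(examples, key=lambda e: label_entropy(e[1]), reverse=True)
print([name for name, _ in ranked])  # doc_b, the coin flip, comes first
```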


Summary

  • temperature=0 gives stability, hides uncertainty
  • temperature=1 gives calibration, adds variance
  • logprobs read the distribution directly — stable output with explicit confidence, no repeated sampling
  • chain-of-thought makes the label path-dependent: at temperature=0 you're committed to one trace; at temperature=1, label stability across traces is itself a confidence signal
  • for evals, systematic bias is the threat — noise isn't
  • for training data, false confidence and low diversity are the threats — uncertainty is signal

The output isn't just a label. It's a sample from a distribution, shaped by temperature, prompt structure, and whatever the model generated before the final token. How you sample determines what you see.