Generating one token at a time is a blessing in disguise
LLMs generate their output one token at a time. The first reaction when you learn this is that it must be a huge performance bottleneck, since we are used to highly parallelized systems.
However, a large part of what makes LLMs feel so magical comes from this exact bottleneck.
Pointwise scoring only goes so far
Traditional search systems score and rank items independently.
This is what it usually looks like:
- some kind of model/heuristic scores every possible result
- we "greedily" pick from these results, starting from the highest-scoring one
- we cross our fingers and hope that this still leads to a nice result
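Those three steps can be sketched in a few lines of Python (the catalog and its popularity scores are hypothetical stand-ins):

```python
# Minimal sketch of a pointwise ranker: score each item in isolation,
# then greedily take the top k. Nothing here can "see" the slate as a whole.

def pointwise_rank(items, score, k=3):
    """score(item) looks at ONE item at a time -- no slate context."""
    return sorted(items, key=score, reverse=True)[:k]

# Hypothetical catalog with a pointwise popularity score per item.
popularity = {
    "black_sneaker_a": 0.90,
    "black_sneaker_b": 0.88,
    "black_sneaker_c": 0.87,
    "brown_boot": 0.60,
}

top3 = pointwise_rank(popularity, popularity.get)
# All three black sneakers win: the ranker has no notion of redundancy.
```

Because each call to `score` sees one item in isolation, no amount of tuning that function can make the top-k set aware of what else is in it.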
The obvious problem here is that it breaks down when the ideal output isn't a single answer, but a combination of interdependent pieces of information.
Example: Your favorite shopping website recommends the three most popular shoes on sale at the moment. It has no structural awareness that recommending three pairs of black sneakers makes for a terrible, unbalanced outfit.
[Figure: LLM — autoregressive generation vs. Traditional — pointwise scoring]
End-to-end optimization
An inherent limitation of pointwise systems is that... the world is simply not pointwise. We're full-slate machines that ingest a combination of parameters and spit out actions in return. This means traditional systems have to derive some pointwise proxy metric that, when optimized, hopefully correlates with the actual (full-slate) metric we really care about.
For example, for an e-commerce website, the real objective is: I want this person to buy a lot of stuff, and ideally do it frequently. We can't really predict that pointwise; we can only optimize for the expected probability of an isolated interaction (like a click or an individual purchase). Meanwhile, greedily grabbing the top-3 highest-CTR items out of context will almost always produce a very redundant set of products, which actively hurts the actual purchase likelihood: if you dislike the first result, you'll probably dislike the next two as well.
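A toy numeric model makes this concrete. All numbers here are made up for illustration, and the core assumption is loudly hypothetical: similar items share failure modes, so a repeated style's purchase probability gets discounted rather than contributing independently:

```python
# Toy slate-value model (all numbers hypothetical).
# Assumption: if you disliked one black sneaker, a second one is much
# less likely to convert, so repeated styles get a discount.

items = {
    "black_sneaker_a": {"ctr": 0.30, "style": "black_sneaker"},
    "black_sneaker_b": {"ctr": 0.29, "style": "black_sneaker"},
    "black_sneaker_c": {"ctr": 0.28, "style": "black_sneaker"},
    "brown_boot":      {"ctr": 0.22, "style": "boot"},
    "white_runner":    {"ctr": 0.20, "style": "runner"},
}

def slate_value(slate, redundancy_discount=0.5):
    """P(at least one purchase), discounting styles that already appeared."""
    seen_styles = set()
    p_no_purchase = 1.0
    for name in slate:
        p = items[name]["ctr"]
        if items[name]["style"] in seen_styles:
            p *= redundancy_discount  # hypothetical penalty for repetition
        seen_styles.add(items[name]["style"])
        p_no_purchase *= 1 - p
    return 1 - p_no_purchase

greedy  = ["black_sneaker_a", "black_sneaker_b", "black_sneaker_c"]  # top-3 by CTR
diverse = ["black_sneaker_a", "brown_boot", "white_runner"]

print(round(slate_value(greedy), 3))   # → 0.485
print(round(slate_value(diverse), 3))  # → 0.563
```

Under this assumed discount, the diverse slate wins even though its second and third items have strictly lower pointwise CTRs than the greedy slate's, which is exactly what a pointwise proxy metric cannot capture.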
Because LLMs are optimized to generate optimal sequences (e.g. via response-based RL), they naturally learn to produce outputs that are coherent and diverse.
[Figure: LLM — sequence graph vs. Traditional — proxy metrics]
Diversity without the hacks and heuristics
Because pointwise systems lack sequence awareness, engineers historically had to build heavy, heuristic-driven "diversity" algorithms (like Maximal Marginal Relevance) on top of the pointwise scoring to prevent highly redundant results.
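Maximal Marginal Relevance is representative of these layered-on heuristics: re-score each remaining candidate as relevance minus its similarity to what's already been picked. A minimal sketch, where the scores and the similarity function are hypothetical stand-ins:

```python
# Minimal sketch of Maximal Marginal Relevance (MMR), the kind of
# diversity heuristic layered on top of pointwise scores.

def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k items, trading relevance against redundancy.

    lam = 1.0 -> pure relevance; lam = 0.0 -> pure diversity.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Penalize similarity to anything already selected.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: two near-identical black sneakers and a boot.
rel = {"black_a": 0.90, "black_b": 0.85, "boot": 0.60}
sim = lambda a, b: 0.95 if a.startswith("black") and b.startswith("black") else 0.1

mmr_select(["black_a", "black_b", "boot"], rel, sim, k=2)
# → ["black_a", "boot"]: the second black sneaker loses to the boot.
```

Note that the diversity knob `lam` has to be hand-tuned per surface, which is precisely the kind of heuristic maintenance burden the paragraph above describes.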
The harsh truth is that these hacks never really worked: producing optimally diverse outputs is a very hard task, and attempts to do it rigorously (via full-slate optimization or multi-armed bandit approaches) often led to systems so complex that only very large companies, willing to squeeze out that last 0.2% of performance, would ever deploy them.
LLMs, on the other hand, inherently produce diverse outputs: the data they are pre-trained on, and the objectives they are RL'd on, steer them away from generating redundant sequences.