
Generating one token at a time is a blessing in disguise

By kachkach

LLMs generate their output one token at a time. When you first learn this, it sounds like a huge performance bottleneck: we are used to highly parallelized systems.

However, a large part of what makes LLMs feel so magical comes from this exact bottleneck.
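To make "one token at a time" concrete, here is a minimal sketch of an autoregressive decoding loop. The `next_token_probs` function is a toy stand-in for a real model's forward pass; a real LLM would return a distribution over its full vocabulary.

```python
def next_token_probs(prefix):
    # Toy stand-in for an LLM forward pass: the distribution over the next
    # token is conditioned on everything generated so far (here, just the
    # last token, via a hand-written transition table).
    last = prefix[-1] if prefix else "<bos>"
    table = {
        "<bos>": {"the": 0.9, "cat": 0.1},
        "the":   {"cat": 0.8, "sat": 0.2},
        "cat":   {"sat": 0.9, "the": 0.1},
        "sat":   {"<eos>": 1.0},
    }
    return table[last]

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        tok = max(probs, key=probs.get)  # greedy pick; sampling also works
        if tok == "<eos>":
            break
        tokens.append(tok)  # the next step is conditioned on this choice
    return tokens
```

The key property is in the loop body: every new token is appended to the context before the next one is chosen, so each decision can account for all previous ones.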

Pointwise scoring only goes so far

Traditional search systems score and rank items independently.

This is what it usually looks like:

  • some kind of model/heuristic scores every possible result
  • we "greedily" sample from these results, starting from the highest scoring one
  • we cross our fingers and hope that this still leads to a nice result

The obvious problem here is that it breaks down when the ideal output isn't a single answer, but a combination of interdependent pieces of information.

Example: Your favorite shopping website recommends the three most popular shoes on sale at the moment. It has no structural awareness that recommending three pairs of black sneakers makes for a terrible, unbalanced outfit.
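The shoe example above can be sketched in a few lines. The catalog and scores are made up for illustration; the point is that nothing in the ranking step penalizes redundancy.

```python
# Hypothetical catalog: each item gets an independent (pointwise) score.
catalog = {
    "black sneakers A": 0.95,
    "black sneakers B": 0.94,
    "black sneakers C": 0.93,
    "blue jeans":       0.80,
    "white tee":        0.75,
}

def top_k(scores, k=3):
    # Each item is ranked in isolation; the function has no awareness
    # of what else ends up in the result set.
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_k(catalog))  # three near-identical sneakers, no outfit structure
```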

[Animation] Query: "Suggest a casual outfit"

[LLM] Autoregressive. Result: a balanced outfit.

[Traditional] Pointwise. Result: 3 pairs of sneakers.

End-to-end optimization

An inherent limitation of pointwise systems is that... the world is simply not pointwise. We're full-slate machines that ingest a combination of parameters and spit out actions in return. This implies that traditional systems have to derive some kind of pointwise proxy metric that, if optimized, would correlate with the actual (full-slate) metric we really care about.

For example, for an e-commerce website, the real objective is: I want this person to buy a lot of stuff, and ideally do it frequently. We can't really predict that pointwise; we can only optimize for the expected probability of an isolated interaction (like a click or individual purchase). Likewise, greedily grabbing the top-3 highest-CTR items out of context will almost always lead to a very redundant set of products, which actively hurts the actual purchase likelihood (since if you dislike the first result, you'll probably dislike the next two as well).

Because LLMs are optimized to generate optimal sequences (e.g. via response-based RL), they naturally learn to produce outputs that are coherent and diverse.
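Here is a toy sketch of what sequence-level conditioning buys you. The item scores and categories are invented for illustration; an LLM learns this kind of conditioning implicitly from its training objective, rather than through an explicit rule.

```python
# Hypothetical catalog: (category, base score) per item.
items = {
    "graphic tee":  ("top", 0.9),
    "button-down":  ("top", 0.8),
    "jeans":        ("bottom", 0.85),
    "sweatpants":   ("bottom", 0.7),
    "sneakers":     ("shoes", 0.95),
    "dress shoes":  ("shoes", 0.6),
}

def build_outfit(items, slate_size=3):
    slate = []
    for _ in range(slate_size):
        used = {items[name][0] for name in slate}
        # Conditioning step: candidates whose category is already
        # covered by the slate are excluded before picking.
        candidates = {n: s for n, (cat, s) in items.items()
                      if n not in slate and cat not in used}
        if not candidates:
            break
        slate.append(max(candidates, key=candidates.get))
    return slate

print(build_outfit(items))  # one top, one bottom, one pair of shoes
```

Because each pick is made in the context of everything already selected, the greedy-looking loop still produces a coherent slate, which is the same structural trick an autoregressive model performs at the token level.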

[Animation]

[LLM] Sequence graph. Goal: optimize the whole trajectory across tops (Graphic Tee, Button-Down), bottoms (Sweatpants, Jeans, Slacks), and shoes (Sneakers, Dress Shoes). Result: end-to-end success.

[Traditional] Proxy metrics. Goal: maximize per-item click %. Result: a mismatched outfit.

Diversity without the hacks and heuristics

Because pointwise systems lack sequence awareness, engineers historically had to build heavy, heuristic-driven "diversity" algorithms (like Maximal Marginal Relevance) on top of the pointwise scoring to prevent highly redundant results.
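For reference, here is a minimal sketch of the MMR-style re-ranking described above: each pick trades off relevance against similarity to what's already selected. The `relevance` and `similarity` inputs are toy values; real systems would use model scores and embedding similarities.

```python
def mmr(relevance, similarity, k=3, lam=0.7):
    # Maximal Marginal Relevance: pick the item maximizing
    # lam * relevance - (1 - lam) * (max similarity to already-selected items).
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def score(item):
            max_sim = max((similarity[frozenset((item, s))] for s in selected),
                          default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"sneakers A": 0.95, "sneakers B": 0.94, "jeans": 0.80}
similarity = {
    frozenset(("sneakers A", "sneakers B")): 0.99,
    frozenset(("sneakers A", "jeans")): 0.10,
    frozenset(("sneakers B", "jeans")): 0.10,
}
print(mmr(relevance, similarity, k=2))  # the near-duplicate sneaker is skipped
```

Note how much machinery (a similarity function, a hand-tuned lambda, a re-ranking pass) is bolted on just to approximate the diversity a sequence-level model gets natively.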

The harsh truth is that these hacks never really worked: producing optimally diverse outputs is a very hard task, and attempts to do it rigorously (via full-slate optimization or multi-armed bandit approaches) often led to systems so complex that only very large companies willing to squeeze out that last 0.2% of performance would ever deploy them.

LLMs, on the other hand, inherently produce diverse outputs: the data they are pre-trained on, and the objectives they are RL'd on, steer them away from generating redundant sequences.

[Animation] Query: "List 3 distinct gift options"

[LLM] Diversity by design. Result: organic, natural balancing.

[Traditional] Hacky hard-coded heuristics. Result: heavy, hard-coded overhead.