Generating one token at a time is a blessing in disguise
LLMs generate their output one token at a time. The first reaction when you learn this is that it must be a huge performance bottleneck, since we are used to highly parallelized systems.
However, a large part of what makes LLMs feel so magical comes from this exact bottleneck.
Pointwise scoring only goes so far
Traditional search systems score and rank items independently.
This is what it usually looks like:
- some kind of model/heuristic scores every possible result
- we "greedily" pick from these results, starting from the highest-scoring one
- we cross our fingers and hope that this still leads to a nice result
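Those three steps can be sketched in a few lines of Python (the catalog and its popularity scores are hypothetical stand-ins):

```python
# Minimal sketch of a pointwise ranker: score each item in isolation,
# then greedily take the top k. Nothing here can "see" the slate as a whole.

def pointwise_rank(items, score, k=3):
    """score(item) looks at ONE item at a time -- no slate context."""
    return sorted(items, key=score, reverse=True)[:k]

# Hypothetical catalog with a pointwise popularity score per item.
popularity = {
    "black_sneaker_a": 0.90,
    "black_sneaker_b": 0.88,
    "black_sneaker_c": 0.87,
    "brown_boot": 0.60,
}

top3 = pointwise_rank(popularity, popularity.get)
# All three black sneakers win: the ranker has no notion of redundancy.
```

Because each call to `score` sees one item in isolation, no amount of tuning that function can make the top-k set aware of what else is in it.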
The obvious problem here is that it breaks down when the ideal output isn't a single answer, but a combination of interdependent pieces of information.
Example: Your favorite shopping website recommends the three most popular shoes on sale at the moment. It has no structural awareness that recommending three pairs of black sneakers makes for a terrible, unbalanced outfit.
[Figure: LLM — autoregressive generation vs. Traditional — pointwise scoring]
End-to-end optimization
An inherent limitation of pointwise systems is that... the world is simply not pointwise. We're full-slate machines that ingest a combination of parameters and spit out actions in return. This means traditional systems have to derive some pointwise proxy metric that, when optimized, hopefully correlates with the actual (full-slate) metric we really care about.
For example, for an e-commerce website, the real objective is: I want this person to buy a lot of stuff, and ideally do it frequently. We can't really predict that pointwise; we can only optimize for the expected probability of an isolated interaction (like a click or an individual purchase). Meanwhile, greedily grabbing the top-3 highest-CTR items out of context will almost always produce a very redundant set of products, which actively hurts the actual purchase likelihood: if you dislike the first result, you'll probably dislike the next two as well.
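A toy numeric model makes this concrete. All numbers here are made up for illustration, and the core assumption is loudly hypothetical: similar items share failure modes, so a repeated style's purchase probability gets discounted rather than contributing independently:

```python
# Toy slate-value model (all numbers hypothetical).
# Assumption: if you disliked one black sneaker, a second one is much
# less likely to convert, so repeated styles get a discount.

items = {
    "black_sneaker_a": {"ctr": 0.30, "style": "black_sneaker"},
    "black_sneaker_b": {"ctr": 0.29, "style": "black_sneaker"},
    "black_sneaker_c": {"ctr": 0.28, "style": "black_sneaker"},
    "brown_boot":      {"ctr": 0.22, "style": "boot"},
    "white_runner":    {"ctr": 0.20, "style": "runner"},
}

def slate_value(slate, redundancy_discount=0.5):
    """P(at least one purchase), discounting styles that already appeared."""
    seen_styles = set()
    p_no_purchase = 1.0
    for name in slate:
        p = items[name]["ctr"]
        if items[name]["style"] in seen_styles:
            p *= redundancy_discount  # hypothetical penalty for repetition
        seen_styles.add(items[name]["style"])
        p_no_purchase *= 1 - p
    return 1 - p_no_purchase

greedy  = ["black_sneaker_a", "black_sneaker_b", "black_sneaker_c"]  # top-3 by CTR
diverse = ["black_sneaker_a", "brown_boot", "white_runner"]

print(round(slate_value(greedy), 3))   # → 0.485
print(round(slate_value(diverse), 3))  # → 0.563
```

Under this assumed discount, the diverse slate wins even though its second and third items have strictly lower pointwise CTRs than the greedy slate's, which is exactly what a pointwise proxy metric cannot capture.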
Because LLMs are optimized to generate optimal sequences (e.g. via response-based RL), they naturally learn to produce outputs that are coherent and diverse.
[Figure: LLM — sequence graph vs. Traditional — proxy metrics]
Diversity without the hacks and heuristics
Because pointwise systems lack sequence awareness, engineers historically had to build heavy, heuristic-driven "diversity" algorithms (like Maximal Marginal Relevance) on top of the pointwise scoring to prevent highly redundant results.
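Maximal Marginal Relevance is representative of these layered-on heuristics: re-score each remaining candidate as relevance minus its similarity to what's already been picked. A minimal sketch, where the scores and the similarity function are hypothetical stand-ins:

```python
# Minimal sketch of Maximal Marginal Relevance (MMR), the kind of
# diversity heuristic layered on top of pointwise scores.

def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k items, trading relevance against redundancy.

    lam = 1.0 -> pure relevance; lam = 0.0 -> pure diversity.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Penalize similarity to anything already selected.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: two near-identical black sneakers and a boot.
rel = {"black_a": 0.90, "black_b": 0.85, "boot": 0.60}
sim = lambda a, b: 0.95 if a.startswith("black") and b.startswith("black") else 0.1

mmr_select(["black_a", "black_b", "boot"], rel, sim, k=2)
# → ["black_a", "boot"]: the second black sneaker loses to the boot.
```

Note that the diversity knob `lam` has to be hand-tuned per surface, which is precisely the kind of heuristic maintenance burden the paragraph above describes.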
The harsh truth is that these hacks never really worked: producing optimally diverse outputs is a very hard task, and attempts to do it rigorously (via full-slate optimization or multi-armed bandit approaches) often led to systems so complex that only very large companies, willing to squeeze out that last 0.2% of performance, would ever deploy them.
LLMs, on the other hand, inherently produce diverse outputs: the data they are pre-trained on, and the objectives they are RL'd on, steer them away from generating redundant sequences.