Are Autoregressive LLMs Really Doomed? (A Commentary on Yann LeCun's Recent Keynote)

· 4 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Yann LeCun, Chief AI Scientist at Meta and a respected pioneer in AI research, recently stated that autoregressive LLMs (Large Language Models) are doomed, because the probability of generating a sequence of tokens that represents a satisfying answer decreases exponentially with each token. While I hold LeCun in especially high regard, and resonate with many of the insights he shared at the summit, I disagree with him on this particular point.

Yann LeCun giving a keynote at the AI Action Summit

Although he qualified his statement with "assuming independence of errors" (in each token generation), this assumption was precisely the wrong turn in his analysis. Autoregressive LLMs do not actually diverge in the way he implied, and we can demonstrate why.

What is Autoregression?

Under the hood, an LLM is a statistical prediction model trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds that token back into its input context and generates the next one, and so on, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.
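To make this loop concrete, here is a minimal sketch in Python. The `predict_next_token` function is a toy stand-in for a real model's forward pass (an assumption for illustration, not a real API):

```python
# A minimal sketch of autoregressive decoding. `predict_next_token` is a
# toy stand-in for a real model's forward pass, not an actual LLM API.

def predict_next_token(context: list[str]) -> str:
    # A real LLM scores every token in its vocabulary given the context
    # and picks one; this toy version just completes a canned sentence.
    canned = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]
    return canned[len(context)] if len(context) < len(canned) else "<eos>"

def generate(prompt_tokens: list[str], max_new_tokens: int = 50) -> list[str]:
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = predict_next_token(context)  # one forward pass
        if token == "<eos>":                 # something tells it to stop
            break
        context.append(token)                # feed the new token back in
    return context

print(" ".join(generate([])))  # -> The cat sat on the mat
```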

For a deeper walkthrough of this process, see our recent post on autoregression.

Convergent or Divergent?

What LeCun is saying, then, can be unpacked as follows.

  1. Given the set C of all completions of length N (tokens),
  2. Given the subset A ⊂ C of all "acceptable" completions within C (A = C - U, where U ⊂ C is the subset of unacceptable completions),
  3. Let Ci be the completion we are now generating, token by token. Assume that Ci currently contains K < N completed tokens, such that Ci is (still) an acceptable completion (Ci ∈ A).
  4. Suppose some independent constant E (for error) is the probability that generating the next token causes Ci to diverge and become unacceptable (Ci ∈ U).
  5. Then generating the next token of Ci, at position K+1, has probability (1-E) of keeping Ci an acceptable completion.
  6. Likewise, generating all R = N - K remaining tokens such that Ci stays acceptable has probability (1-E)^R.

In Simpler Terms

If we always have, say, a 99% chance to generate a single next token such that the completion stays acceptable, then generating 100 next tokens brings our chance down to 0.99^100, or roughly 36.6%. If we generate 1,000 tokens, then by this logic there is only a 0.004% chance that our final completion is acceptable!
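You can check this arithmetic yourself; here is the constant-error model in a few lines of Python (E = 1% is just the example figure from above):

```python
# Acceptance probability under a constant per-token error rate E:
# P(acceptable after R tokens) = (1 - E) ** R

E = 0.01  # 1% chance that any single token breaks the completion

for R in (100, 1000):
    p = (1 - E) ** R
    print(f"R={R}: {p:.6%}")

# R=100: 36.603234%
# R=1000: 0.004317%
```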

Do you see the problem here? Many of us have generated 1,000-token completions that turned out perfectly fine. Could we all have landed on the lucky side of 0.004%, or is something else going on? Moreover, what about techniques like Chain-of-Thought (CoT) and reasoning models? Notice how they generate hundreds if not thousands of tokens before converging to a response that is often more correct, not less.

The problem here is precisely with assuming that E is constant. It is not.

LLMs, due to their attention mechanism, have a way to bounce back even from initial completions that we would find unacceptable. This is exactly what techniques like CoT or CoVe (Chain-of-Verification) do: they lead the model to generate new tokens that actually increase the completion's likelihood of converging and ultimately being acceptable.
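Here is a toy Monte Carlo sketch of that difference. The recovery probability below is an arbitrary illustrative assumption, not a measured property of any model; the point is only that once later tokens can repair earlier errors, acceptability no longer decays exponentially:

```python
import random

R = 1000          # tokens to generate
E = 0.01          # per-token chance of introducing an error
P_RECOVER = 0.5   # illustrative assumption: chance a later token repairs a live error
TRIALS = 10_000

def constant_error_model() -> bool:
    """LeCun's assumption: any single error is fatal and unrecoverable."""
    return all(random.random() > E for _ in range(R))

def recovering_model() -> bool:
    """Toy CoT-like model: errors can be caught and corrected by later tokens."""
    has_error = False
    for _ in range(R):
        if random.random() < E:
            has_error = True       # an error slips in...
        elif has_error and random.random() < P_RECOVER:
            has_error = False      # ...but a later token can repair it
    return not has_error

for model in (constant_error_model, recovering_model):
    ok = sum(model() for _ in range(TRIALS)) / TRIALS
    print(f"{model.__name__}: {ok:.2%} acceptable")
```

On a typical run, the constant-error model comes out near 0% acceptable while the recovering model stays around 98%, even though both share the same per-token error rate.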

We know this firsthand from developing the Attentive Reasoning Queries (ARQs) technique, which we use in Parlant. We get the model to generate, on its own, a structured thinking process of our design, which keeps it convergent throughout the generation process.
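To give a flavor of what a structured completion schema can look like, here is a simplified toy in Python. The field names below are illustrative, not the actual ARQ design; the idea is that filling such a schema forces the model to re-attend to the instructions and to its own draft before committing to a final answer:

```python
import json

# A simplified, illustrative completion schema (not the actual ARQ design).
# The model is instructed to emit JSON of this shape, field by field, so each
# new token attends to the restated instructions and the draft that precede it.
COMPLETION_SCHEMA = {
    "restated_instructions": "str  # the model restates what it was asked to do",
    "relevant_facts": "list[str]  # context facts the answer must rely on",
    "draft_answer": "str",
    "consistency_check": "str  # does the draft follow the instructions and facts?",
    "final_answer": "str",
}

def build_prompt(user_query: str) -> str:
    return (
        "Answer the query by filling in this JSON object, field by field:\n"
        f"{json.dumps(COMPLETION_SCHEMA, indent=2)}\n\n"
        f"Query: {user_query}"
    )

print(build_prompt("What is your refund policy for damaged items?"))
```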

Depending on your prompting technique and completion schema, not only do you not have to drop to a 0.004% acceptance rate; you can actually stay quite close to 100%.