
2 posts tagged with "machine-learning"


Are Autoregressive LLMs Really Doomed? (A Commentary Upon Yann LeCun's Recent Keynote)

· 4 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Yann LeCun, Chief AI Scientist at Meta and a respected pioneer in AI research, recently stated that autoregressive LLMs (Large Language Models) are doomed because the probability of generating a sequence of tokens that represents a satisfying answer decreases exponentially with each token. While I hold LeCun in especially high regard, and resonate with many of the insights he shared at the summit, I disagree with him on this particular point.

Yann LeCun giving a keynote at the AI Action Summit

Although he qualified his statement with "assuming independence of errors" (in each token generation), that assumption, precisely, was the wrong turn in his analysis. Autoregressive LLMs do not actually diverge in the way he implied, and we can demonstrate it.

What is Autoregression?

Under the hood, an LLM is a statistical prediction model that is trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds it back into its input context, and generates the next one, and so on and so forth, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.

For a deeper walkthrough of this process, see our recent post on autoregression.

Convergent or Divergent?

What LeCun is saying, then, can be unpacked as follows.

  1. Given the set C of all completions of length N (tokens),
  2. Given the subset A ⊂ C of all "acceptable" completions within C (A = C - U, where U ⊂ C is the subset of unacceptable completions),
  3. Let Ci be the completion we are now generating, token by token. Assume that Ci currently contains K < N completed tokens such that Ci is (still) an acceptable completion (Ci ∈ A).
  4. Suppose some independent constant E (for error) is the probability of generating a next token that causes Ci to diverge and become unacceptable (Ci ∈ U).
  5. Then, generating the next token of Ci at position K+1 has probability (1 - E) of maintaining the acceptability of Ci as a valid and correct completion.
  6. Likewise, generating all remaining R = N - K tokens such that Ci stays acceptable has probability (1 - E)^R.

In Simpler Terms

If we always have, say, a 99% chance to generate a single next token such that the completion stays acceptable, then generating 100 next tokens brings our chance down to 0.99^100, or roughly 37%. If we generate 1,000 tokens, then by this logic there is only a roughly 0.004% chance that our final completion is acceptable!
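To make the arithmetic concrete, here's a quick sanity check of those numbers (a throwaway snippet; nothing model-specific is being computed here):

# Probability that a completion stays acceptable if every token
# independently has a 99% chance of not derailing it.
per_token_success = 0.99

print(f"100 tokens:   {per_token_success ** 100:.1%}")   # ~36.6%
print(f"1,000 tokens: {per_token_success ** 1000:.4%}")  # ~0.0043%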

Do you see the problem here? Many of us have generated 1k-token completions that have been perfectly fine. Could we all have landed on the lucky side of that 0.004%, or is something else going on? Moreover, what about techniques like Chain-of-Thought (CoT) and reasoning models? Notice how they generate hundreds if not thousands of tokens before converging to a response that is often more correct, not less.

The problem here is precisely with assuming that E is constant. It is not.

LLMs, due to their attention mechanism, have a way to bounce back even from initial completions that we would find unacceptable. This is exactly what techniques like CoT or CoV (Chain-of-Verification) do—they lead the model to generate new tokens that will actually increase the completion's likelihood to converge and ultimately be acceptable.

We know this firsthand from developing the Attentive Reasoning Queries (ARQs) technique, which we use in Parlant. We get the model to generate, on its own, a structured thinking process of our design, which keeps it convergent throughout the generation process.

Depending on your prompting technique and completion schema, not only do you not have to drop to a 0.004% acceptance rate; you can actually stay quite close to 100%.
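To see numerically why a non-constant E changes the picture, here's a toy calculation (the error and recovery rates below are invented purely for illustration; this is not a model of ARQs or of any particular technique). As soon as a derailed completion has some per-token chance of being pulled back on track, the probability of ending up with an acceptable completion stops decaying exponentially and instead settles near a constant:

# Toy two-state model: at each token the completion can derail with
# probability `error`, but it can also recover from a derailed state
# with probability `recover` (e.g., via self-correcting reasoning tokens).
error, recover = 0.01, 0.20
N = 1000

p_ok = 1.0  # probability that the completion is currently acceptable
for _ in range(N):
    p_ok = p_ok * (1 - error) + (1 - p_ok) * recover

print(f"No recovery (constant error): {(1 - error) ** N:.4%}")  # ~0.0043%
print(f"With recovery:                {p_ok:.1%}")              # ~95.2%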

What Is Autoregression in LLMs?

· 5 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Under the hood, an LLM is a statistical prediction model that is trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds it back into its input context, and generates the next one, and so on and so forth, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.

Let's break that down carefully, step by step. An LLM:

  1. Accepts an input context
  2. Predicts a single token out of a vocabulary
  3. Feeds that token back into the input context
  4. Repeats the process again until something tells it to stop

On Accepting an Input Context

It must be understood that LLMs, for various reasons, do not look at their input (or output, for that matter) as text per se, but as numbers.

Each of these numbers is called a token, and each token directly corresponds to some language fragment, depending on the architecture of the model. These tokens form the basic unit by which LLMs understand and predict text. The set of all supported values of these tokens is called the model's vocabulary.

In some LLMs, a token corresponds to a proper word. In theory, it could even correspond to full sentences or paragraphs. In practice, however, tokens are most commonly word-parts (i.e., not even full words). As such, while vocabulary would have been a perfect name for models in which tokens correspond to words, it's important to remember that in most LLMs today the vocabulary is a set of word-parts. Among other benefits, this allows us to keep the vocabulary relatively small (around 200k word-part tokens).

When you prompt an LLM, the prompt first undergoes tokenization so that the LLM can understand it using its own language—its vocabulary. Tokenization breaks down a text into a series of tokens, or, again, numbers, each representing a unique word-part in the vocabulary. To illustrate this concept:

tokens = tokenizer.encode("I like bananas")

for t in tokens:
    print(f'{t} = "{tokenizer.decode(t)}"')

# Output (for example):
# 4180 = "I"
# 5 = " "
# 918 = "lik"
# 5399 = "e"
# 5 = " "
# 882 = "ba"
# 76893 = "nana"
# 121 = "s"

On Predicting the Next Token

The next step is to do what the model is supposed to do: predict the next token, also known as generating a completion for the input context (or at least the fundamental building block of this completion process).

We won't go into the internal prediction and attention mechanisms here. Instead, I want to focus on the very last stage of the prediction process.

When an LLM has done its best to "figure out" the meaning in its input context, it provides what's called a probability distribution over its vocabulary. This means that every single token is assigned the likelihood that it, among all others, should be chosen as the next predicted token in the completion—a great honor indeed.

The following snippet illustrates what that distribution might look like:

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy ")
next_token_probabilities = prompt(context_tokens)  # conceptual: the model's forward pass

for token, probability in next_token_probabilities:
    print(f'{token} ("{tokenizer.decode(token)}") = {probability}')

# Output (for example):
# ...
# 5 (" ") = 0.0001 (0.01%)
# ...
# 121 ("s") = 0.002 (0.2%)
# ...
# 4180 ("I") = 0.008 (0.8%)
# ...
# 882 ("ba") = 0.013 (1.3%)
# ...
# 918 ("lik") = 0.0004 (0.04%)
# ...
# 1000 ("dog") = 0.975 (97.5%)
# ...
# 76893 ("nana") = 0.0006 (0.06%)

Once we have this probability distribution, we need to actually decide which token to generate. A naive approach would simply choose the one with the highest probability. It turns out, however, that this is a mistake. Not only can it cause models to repeat themselves robotically (a rather uninspiring application of such complex beasts); worse yet, due to biases and issues that often lurk within them, the most probable token—in their eyes—is not necessarily the one that most of us would deem most reasonable. This is because any such statistical machine has inherent flaws and inaccuracies in its representation of our complex world.

Thus, we don't simply choose the token with the highest probability. Instead, we choose one using various sampling techniques that introduce randomization into the choice process, while respecting the assigned probabilities (kind of like rolling a loaded die that's weighted toward the sides that represent the more likely tokens).

Two things we must take note of here:

  1. The token with the highest probability won't necessarily get selected at every prediction iteration—it will just be more likely to get selected
  2. The token with the highest probability isn't even necessarily the most reasonable one from a Human/AI alignment perspective
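For the curious, here is a minimal sketch of what a sampling helper like the weighted_random_choice used in the next snippet might look like (greatly simplified; real decoders add temperature, top-k/top-p truncation, and other refinements):

import random

def weighted_random_choice(token_probabilities):
    # token_probabilities is a list of (token, probability) pairs,
    # like the distribution illustrated above.
    tokens = [t for t, _ in token_probabilities]
    weights = [p for _, p in token_probabilities]
    return random.choices(tokens, weights=weights, k=1)[0]

# Example with a tiny made-up distribution:
print(weighted_random_choice([(1000, 0.975), (882, 0.013), (4180, 0.008)]))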

On Feeding the Predicted Token Back into the Context

The final step of the process—before rinsing and repeating—is to take the newly generated token and feed it back into the input context, appending it to the end. The completion process goes something like this (conceptually):

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy dog ")

while True:
    # Ask the model for the next-token distribution, sample from it,
    # and append the chosen token to the context.
    next_token_probabilities = prompt(context_tokens)
    next_token = weighted_random_choice(next_token_probabilities)
    context_tokens.append(next_token)

    if next_token == STOP_TOKEN:
        break

This iterative feedback loop is directly related to how autoregressive models are trained. During the training process, they are essentially asked to predict, or "fill out", the next token in the context. Once they improve on that, the training process moves on to the next token, and so forth. And this is the principle the inference process follows as well.
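As a rough sketch of that principle (with a stand-in "model"; real training does this with a neural network, computing gradients of the loss and updating weights), the objective for a training sequence is simply: for every prefix, how much probability did the model assign to the token that actually came next?

import math

def next_token_loss(token_probability_fn, training_tokens):
    # Sum of cross-entropy terms: for each prefix of the training
    # sequence, penalize the model for assigning low probability to
    # the token that actually follows it (a.k.a. teacher forcing).
    loss = 0.0
    for i in range(len(training_tokens) - 1):
        prefix = training_tokens[: i + 1]
        target = training_tokens[i + 1]
        loss += -math.log(token_probability_fn(prefix, target))
    return loss

# Hypothetical stand-in "model" that assigns every token in a 200k-entry
# vocabulary equal probability -- terrible, but it keeps the sketch runnable.
uniform_model = lambda prefix, target: 1.0 / 200_000

print(next_token_loss(uniform_model, [4180, 5, 918, 5399]))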

This is what autoregression is!