
What Is Autoregression in LLMs?

· 5 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Under the hood, an LLM is a statistical prediction model that is trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds it back into its input context, and generates the next one, and so on and so forth, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.

Let's break that down carefully, step by step. An LLM:

  1. Accepts an input context
  2. Predicts a single token out of a vocabulary
  3. Feeds that token back into the input context
  4. Repeats the process again until something tells it to stop

On Accepting an Input Context

It must be understood that LLMs, for various reasons, do not look at their input (nor output for that matter) as text per se, but as numbers.

Each of these numbers is called a token, and each token directly corresponds to some language fragment, depending on the architecture of the model. These tokens form the basic unit by which LLMs understand and predict text. The set of all supported values of these tokens is called the model's vocabulary.

In some LLMs, a token corresponds to a proper word. In theory, it could even correspond to full sentences or paragraphs. In practice, however, tokens are most commonly word-parts (i.e., not even full words). As such, while vocabulary would have been a perfect name for models in which tokens correspond to words, it's important to remember that in most LLMs today the vocabulary is a set of word-parts. Among other benefits, this allows us to keep the vocabulary relatively small (around 200k word-part tokens).

When you prompt an LLM, the prompt first undergoes tokenization so that the LLM can understand it using its own language—its vocabulary. Tokenization breaks down a text into a series of tokens, or, again, numbers, each representing a unique word-part in the vocabulary. To illustrate this concept:

tokens = tokenizer.encode("I like bananas")

for t in tokens:
    print(f"{t} = {tokenizer.decode(t)}")

# Output (for example):
# 4180 = "I"
# 5 = " "
# 918 = "lik"
# 5399 = "e"
# 5 = " "
# 882 = "ba"
# 76893 = "nana"
# 121 = "s"
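The token IDs and word-part boundaries above are invented for illustration. If you'd like to see what a real tokenizer produces, here's a minimal sketch using the open-source tiktoken library (my choice here; any tokenizer would demonstrate the same idea), which also shows the roughly 200k-entry vocabulary mentioned earlier:

import tiktoken  # assumes `pip install tiktoken`

# Load a real BPE tokenizer (the word-part vocabulary used by recent OpenAI models)
enc = tiktoken.get_encoding("o200k_base")

print(f"vocabulary size: {enc.n_vocab}")  # on the order of 200k tokens

for t in enc.encode("I like bananas"):
    # decode() maps a token ID back to the word-part it stands for
    print(f"{t} = {enc.decode([t])!r}")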

On Predicting the Next Token

The next step is to do what the model is supposed to do: predict the next token. This single prediction is the fundamental building block of generating a completion for the input context.

We won't go into the internal prediction and attention mechanisms here. Instead, I want to focus on the very last stage of the prediction process.

When an LLM has done its best to "figure out" the meaning in its input context, it provides what's called a probability distribution over its vocabulary. This means that every single token is assigned the likelihood that it, among all others, should be chosen as the next predicted token in the completion—a great honor indeed.

The following snippet illustrates what that distribution might look like:

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy dog ")
next_token_probabilities = prompt(context_tokens)

for token, probability in next_token_probabilities:
print(f"{token} ({tokenizer.decode(token)}) = {probability}")

# Output (for example):
# ...
# 5 (" ") = 0.0001 (0.01%)
# ...
# 121 ("s") = 0.002 (0.2%)
# ...
# 4180 ("I") = 0.008 (0.8%)
# ...
# 882 ("ba") = 0.013 (1.3%)
# ...
# 918 ("lik") = 0.0004 (0.04%)
# ...
# 1000 ("dog") = 0.975 (97.5%)
# ...
# 76893 ("nana") = 0.0006 (0.06%)
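The prompt() function above is just a stand-in. To peek at a real distribution, here's a minimal sketch using the Hugging Face transformers library with GPT-2 (an arbitrary small model chosen for illustration; it assumes transformers and torch are installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the context and run a single forward pass
input_ids = tokenizer("The quick brown fox jumps over the lazy", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits

# The logits at the last position score every token in the vocabulary;
# softmax turns those scores into a probability distribution
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely candidates for the next token
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{token_id.item()} ({tokenizer.decode(token_id)!r}) = {prob.item():.4f}")

The exact numbers will differ from the made-up ones above, but the shape of the output is the same: every token in the vocabulary gets a probability.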

Once we have this probability distribution, we need to actually decide which token to use. A naive approach would simply choose the one with the highest probability. It turns out, however, that this is a mistake. Not only can it cause models to repeat themselves robotically (a rather uninspiring application of such complex beasts); worse yet, due to biases and issues that often lurk within them, the most probable token—in their eyes—is not necessarily the one that most of us would deem most reasonable. This is because any such statistical machine has inherent flaws and inaccuracies in its representation of our complex world.

Thus, we don't simply choose the token with the highest probability. Instead, we choose one using various sampling techniques that introduce randomization into the choice process, while respecting the assigned probabilities (kind of like rolling a loaded die that's heavier on the sides representing the more likely tokens).

Two things we must take note of here:

  1. The token with the highest probability won't necessarily get selected at every prediction iteration—it will just be more likely to get selected
  2. The token with the highest probability isn't even necessarily the most reasonable one from a Human/AI alignment perspective
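To make this concrete, here's a minimal sketch of the kind of weighted_random_choice helper we'll use in the conceptual loop below. It applies simple temperature-based sampling; this is just one of several common strategies (greedy, top-k, and nucleus/top-p sampling being others), and the function name and signature are mine:

import random

def weighted_random_choice(next_token_probabilities, temperature=0.8):
    # next_token_probabilities: a sequence of (token, probability) pairs
    tokens, probs = zip(*next_token_probabilities)

    # Temperature below 1 sharpens the distribution (more deterministic),
    # above 1 flattens it (more adventurous)
    weights = [p ** (1.0 / temperature) for p in probs]

    # The "loaded die" roll: more probable tokens are proportionally
    # more likely to be picked, but nothing is guaranteed
    return random.choices(tokens, weights=weights, k=1)[0]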

On Feeding the Predicted Token Back into the Context

The final step of the process—before rinsing and repeating—is to take the newly generated token and feed it back into the input context, appending it to the end. The completion process goes something like this (conceptually):

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy dog ")

while True:
    # Ask the model for a probability distribution over the next token
    next_token_probabilities = prompt(context_tokens)

    # Sample one token from that distribution (the "loaded die" from before)
    next_token = weighted_random_choice(next_token_probabilities)

    # Feed it back into the context for the next iteration
    context_tokens.append(next_token)

    if next_token == STOP_TOKEN:
        break
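In practice, you rarely write this loop by hand: inference libraries implement it for you, along with sampling parameters and stop conditions. A rough sketch of the same thing at the library level, reusing the GPT-2 tokenizer and model from the earlier snippet (and assuming the standard Hugging Face generate API):

# Reusing `tokenizer`, `model`, and `input_ids` from the GPT-2 example above
output_ids = model.generate(
    input_ids,
    max_new_tokens=20,   # a stopping condition: at most 20 new tokens
    do_sample=True,      # sample from the distribution instead of taking the argmax
    temperature=0.8,     # reshape the distribution before sampling
)

print(tokenizer.decode(output_ids[0]))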

This iterative feed-back loop is directly related to how autoregressive models are trained. During the training process, they are essentially asked to predict or "fill out" the next token in the context. Once they improve on that, the training process moves on to the next token, and so forth. And this is the principle the inference process follows as well.
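Here's a rough sketch of that training objective, again reusing the GPT-2 model from above purely for illustration (real training uses large batches and many optimization details omitted here):

import torch.nn.functional as F

# One toy "training example": a sequence of token IDs with shape (1, seq_len)
token_ids = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids

inputs = token_ids[:, :-1]   # everything except the last token
targets = token_ids[:, 1:]   # the same sequence shifted left by one

logits = model(inputs).logits  # shape: (1, seq_len - 1, vocab_size)

# At every position, score the model on how well it predicted the *next* token
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
print(loss.item())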

This is what autoregression is!