Have you ever wondered what actually happens after you type a prompt into ChatGPT and press Enter? How does the model process your words, understand context, and generate a response one token at a time?

Step 1: Tokenization

Tokenization refers to the process of breaking the received sequence into smaller sub-words (or word-like combinations) called tokens.

Tokens are the fundamental units that language models process: the model works with a fixed vocabulary of such units rather than with raw text.

It's like breaking a big task into smaller tasks, making it easier to understand and execute.

One of the most widely used algorithms for tokenization is Byte Pair Encoding (BPE).

It works iteratively:

Split the entire text into individual characters. Let's take “hello” as an example. E.g.: h e l l o
Count how often each pair of adjacent symbols occurs in the dataset. E.g.: (h,e), (e,l), (l,l), (l,o)
Find the most common pair and merge it into a single token. Suppose (l,l) is the most frequent. E.g.: h e ll o
Assign each new token a unique integer. E.g.: (“he”: 123), (“hell”: 584), (“hello”: 68930), etc.
Repeat the process on the updated text. E.g.: (h,e), (e,ll), (ll,o)
Over iterations, character sequences that frequently occur together merge into larger tokens. Suppose (h,e) is now the most frequent pair. E.g.: he ll o
The process repeats until the vocabulary reaches a predefined number of tokens.
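The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy corpus and the function names (count_pairs, merge_pair, train_bpe) are made up for this example, and real implementations usually work on bytes and handle word boundaries more carefully.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from individual characters, e.g. ("h", "e", "l", "l", "o").
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Hypothetical toy corpus, chosen only for illustration.
corpus = ["hello", "hello", "hell", "help", "yellow"]
merges, words = train_bpe(corpus, num_merges=3)
print(merges)
```

Which pair gets merged first depends entirely on the corpus, so the order here need not match the “hello” walkthrough above.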

Because tokens are sub-words, the model can deal with unknown words it wasn't trained on by splitting them into pieces it already knows. As the model also learns the relationships between tokens, it can pick up subtle nuances in the language like grammar and syntax.

Each token in the vocabulary is mapped to a unique integer, called the Token ID.

Step 2: Word Embeddings and Positional Encodings

Language models don’t understand words or tokens. They understand numbers.

Each token is converted into an n-dimensional vector of floating-point numbers, called an embedding, which represents the token and its semantic meaning to the language model. The values are typically small, but they are not restricted to a fixed range.

An embedding layer, essentially a lookup table learned during training, converts each Token ID into its n-dimensional vector.

More dimensions allow for a richer semantic representation of the token, which helps the language model capture the token's meaning and its relationships with other tokens.

Usually, embeddings have hundreds to thousands of dimensions. E.g., “hello” could look something like:

hello = [[0.0058879545, -0.0029966063, -0.006023083, 0.0030583914, ...]]

Tokens with similar meanings tend to have embeddings that lie close together, while unrelated tokens lie farther apart.
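To make the lookup and the idea of “similar” concrete, here is a small sketch. The vocabulary, the 4-dimensional size, and the random values are all assumptions for illustration; in a real model the table is learned, so the similarities printed here carry no meaning.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary mapping tokens to Token IDs.
vocab = {"hello": 0, "hi": 1, "dog": 2}
embedding_dim = 4

# Embedding table: one row of floats per Token ID (random here, learned in practice).
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(token):
    """Look up the embedding vector of a token via its Token ID."""
    return embedding_table[vocab[token]]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(embed("hello"))
print(cosine_similarity(embed("hello"), embed("hi")))
```

Cosine similarity is one common way to compare embeddings; it is used here as an example, not as the only option.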

Adding dimensions improves the semantic representation only up to a point. Beyond that, the gains shrink while storage and processing requirements keep growing.

About Positional Encodings

The Transformer architecture was essential in scaling and parallelizing language models, but it also introduced a new problem: it has no built-in notion of token order.

This means that tokens can now be processed in parallel, but the model on its own has no idea which token appears where in the sequence.

E.g., without position information, “Hello, Dan, how are you?” would look much the same to the model as “Dan, are Hello, how you?” Weird and nonsensical, right?

To solve this, the architecture adds a positional encoding to each token’s embedding before processing it, so the model knows where each token sits in the sequence.
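One way to build such encodings is the sinusoidal scheme from the original Transformer paper, sketched below. This is only one option, picked as an assumption for the example; many current models use learned or rotary position schemes instead. The function names are hypothetical.

```python
import math

def positional_encoding(seq_len, dim):
    """Return a seq_len x dim table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k/dim).
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(token_embeddings):
    """Add the matching position vector to each token embedding, element-wise."""
    pe = positional_encoding(len(token_embeddings), len(token_embeddings[0]))
    return [
        [e + p for e, p in zip(emb, pos)]
        for emb, pos in zip(token_embeddings, pe)
    ]

# The same embedding at two different positions ends up with different vectors.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
print(add_positions(same_token_twice))
```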

Step 3: Transformer Layer (Attention + Feed-Forward Network)

This is where the magic of LLMs happens, i.e., the Attention mechanism.

Unlike RNNs or LSTMs, which read tokens one at a time, Attention calculates an “attention score” between every pair of tokens, giving the model information about which tokens to prioritize. Since these scores don’t depend on reading the tokens in order, they can be computed in parallel.

An attention score is essentially a dot product between the vector of one token and the vector of itself or another token in the sequence. In practice, these vectors are learned projections of the token representations called queries and keys, and a softmax turns the scores into weights that sum to 1.

Attention doesn’t allow the model to look at future tokens, i.e., the tokens the model has yet to ‘generate’, as that would defeat the purpose of predicting the next token. The attention scores of future tokens are replaced with a very large negative value (usually -∞), so their weights become zero after the softmax and the model pays no attention to them. This process is called ‘Causal Masking’.
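The scoring and masking described above can be sketched as a single attention head in plain Python. As a simplifying assumption, the example passes the same vectors as queries, keys, and values; a real model first applies learned projections and runs many heads in parallel. The scaling by the square root of the vector size follows the standard scaled dot-product formulation.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                # Causal mask: token i must not see any future token j.
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)  # masked positions get weight 0.0
        all_weights.append(weights)
        # Output for token i: weighted sum of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print(row)
```

Each printed row holds one token’s attention weights; every entry to the right of the diagonal is zero because of the mask.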

Feed-Forward Network (FFN)

While Attention allows the model to learn the importance of each token in the sequence, that simply isn’t enough for the model to learn the complexity of the sequence.

This is where the Feed-Forward Network (FFN) comes in.

The FFN is applied to each token’s representation separately and adds another layer of non-linearity, allowing the model to generalize better. Without it, the model would be limited to much simpler mappings, e.g., always predicting ‘y’ whenever it sees the token ‘x’.
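A minimal sketch of such a network for a single token vector follows. The two-layer shape with a wider hidden layer is the common Transformer layout, but the ReLU activation, the sizes, and the random weights are assumptions for illustration; many models use other activations such as GELU.

```python
import random

random.seed(0)

def relu(x):
    """Non-linearity: negative values become zero."""
    return max(0.0, x)

def linear(vec, weights, bias):
    """Multiply a vector by a weight matrix (list of rows) and add a bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def feed_forward(vec, w1, b1, w2, b2):
    """Expand to the hidden size, apply the non-linearity, project back."""
    hidden = [relu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 4-dimensional tokens, hidden layer 4x wider.
dim, hidden_dim = 4, 16
w1, b1 = random_matrix(hidden_dim, dim), [0.0] * hidden_dim
w2, b2 = random_matrix(dim, hidden_dim), [0.0] * dim

token_vector = [0.1, 0.2, 0.3, 0.4]
print(feed_forward(token_vector, w1, b1, w2, b2))
```

The same weights are reused for every token position, which is why the function takes one token vector at a time.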

Step 4: Logits + Sampling

LLMs don’t ‘generate’ the token you see on your screen. They generate raw scores for each token they know.

For example, if a model’s tokenizer has a vocabulary of 10,000 tokens, the model would generate raw scores called logits for each of those 10,000 tokens.

To convert these logits into a probability distribution, a softmax function is used.

Sampling happens when a token ID is selected from the probability distribution. That token ID is decoded back to its value, and that value is what you see being ‘generated’ on your screen.
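Here is a small sketch of that conversion. The four logits are made-up numbers standing in for a vocabulary of four tokens; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print(probs)
```

The highest logit always gets the highest probability, and every token keeps a probability greater than zero.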

It’s essentially a classic classification problem on steroids.

There are many sampling techniques (e.g., greedy decoding, top-k, and top-p sampling), but the most common ones work by selecting a token from among those with the highest probabilities.
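Two such techniques, sketched under simple assumptions: greedy selection always takes the most probable token, while top-k sampling draws at random from the k most probable ones. The probabilities below are made up, and the function names are hypothetical.

```python
import random

def greedy(probs):
    """Pick the token ID with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_top_k(probs, k, rng):
    """Sample a token ID from the k most probable tokens."""
    # Keep only the k token IDs with the highest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices accepts unnormalized weights.
    return rng.choices(top, weights=weights, k=1)[0]

# Hypothetical distribution over a 4-token vocabulary.
probs = [0.05, 0.6, 0.25, 0.1]
rng = random.Random(0)

print(greedy(probs))
print([sample_top_k(probs, 2, rng) for _ in range(10)])
```

With k=2, only token IDs 1 and 2 can ever be chosen; with k=1, top-k sampling reduces to greedy selection.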

Step 5: Token Output

Before you see the token ‘generated’ on your screen, the selected token’s ID is decoded back to its corresponding value.

For example, if the model selects 254, it is decoded back to its token value, i.e., ‘dog’, given that the tokenizer has mapped ‘dog’ to 254.

So, you’ll see “dog” on your screen being ‘generated’.
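In code, decoding is little more than a reverse lookup. The mapping below is hypothetical apart from the 254 to ‘dog’ pair used in the example above, and real tokenizers typically store byte sequences rather than plain strings.

```python
# Hypothetical reverse vocabulary: Token ID -> token text.
id_to_token = {17: "The", 254: " dog", 903: " barks"}

def decode(token_ids):
    """Turn a list of Token IDs back into text."""
    return "".join(id_to_token[i] for i in token_ids)

print(decode([254]).strip())
print(decode([17, 254, 903]))
```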

Step 6: Repeat

The selected token is appended to the sequence, and the loop from the attention mechanism to token selection is repeated on the extended sequence for each new token until either of the following conditions is met:

The model encounters an End-of-Sequence (EOS) token, which indicates that the model doesn’t have to make any more predictions.
The model has ‘generated’ the maximum number of tokens the user requested.
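The whole loop, including both stopping conditions, can be sketched as follows. The toy_model function is a made-up stand-in for a real LLM: it simply favours the next Token ID in a tiny vocabulary. Greedy selection is used so that the run is reproducible.

```python
import math

# Hypothetical 4-token vocabulary; ID 0 is the End-of-Sequence token.
vocab = ["<eos>", "the", "dog", "barks"]
EOS_ID = 0

def toy_model(token_ids):
    """Stand-in for an LLM: return one logit per vocabulary entry."""
    next_id = (token_ids[-1] + 1) % len(vocab)
    return [5.0 if i == next_id else 0.0 for i in range(len(vocab))]

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens):
    """Generate tokens until EOS appears or the token budget runs out."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):          # condition 2: token limit
        probs = softmax(toy_model(token_ids))
        next_id = max(range(len(probs)), key=lambda i: probs[i])  # greedy
        if next_id == EOS_ID:                # condition 1: EOS token
            break
        token_ids.append(next_id)            # feed the new token back in
    return token_ids

output_ids = generate([1], max_new_tokens=10)
print(" ".join(vocab[i] for i in output_ids))
```

With a budget of 10 new tokens, generation stops early at the EOS token; with a budget of 1, it stops after a single new token.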

Conclusion

LLMs predict a probability distribution over all possible next tokens. A sampling strategy then selects one token from that distribution.

So, LLMs don’t really know about things. They are guessing what token will come next based on what you’ve provided them and what they have ‘generated’ so far - like autocomplete on steroids.

The Complete LLM Inference Pipeline
