
Introduction to Attention, Transformers and Large Language Models - Part 2


This post is a continuation of the first post in this series. In the previous post we introduced the attention mechanism and the transformer architecture. The focus of this second post is to break down the different components that make up the transformer, which were introduced in the previous post. Note that some topics, such as skip (residual) connections, feed-forward layers, and linear layers, will not be discussed in as much detail because they are common deep learning building blocks. The ultimate objective of this series of posts is to provide enough detail and background to get machine learning/deep learning developers comfortable implementing and using LLMs.

 

Input and Output Embeddings

 

The first component of the transformer that we will discuss is the input embeddings in the encoder and the decoder. The Multi-Head Attention (MHA) layers, feed-forward layers, linear layers, and even the embeddings themselves have learnable parameters associated with them that are learned during training. Those learnable parameters are what fundamentally make transformers neural networks. The reason I highlight this is that, just like other neural networks, transformers only work with numeric data. The process of turning text into embedded vectors is therefore the first step both in training transformer-based models and in using them to generate new predictions.

 


 

Figure 1. Input and Output Embeddings with Positional Encodings.

 


 

 

Tokenization, Embedding, and Vectorization

 

Language models, transformers included, have a vocabulary, which generally represents the words that the language model knows. Each word in the vocabulary has a specific number, known as a token, that can be used to represent it. The vocabulary size is determined early in the modeling process, and it is usually tied to the text that will be used to train and assess the model. There are also tokens for punctuation marks, special tokens such as the START and END tokens that indicate the beginning and end of a sentence, and tokens for unknown words that are not part of the known vocabulary. There are many techniques used in language models to tokenize text, but most of them go beyond the scope of this post.

 


 

Figure 2. Text Tokenization.

 

When the transformer receives input text, it is turned into tokens that represent the words in the input. In figure 2, we can see the tokens (excluding the [START] and [END] tokens) that represent the input sequence in a BERT model. These tokens can then be used to find the embeddings corresponding to each of those words.
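
To make this concrete, here is a short Python sketch (my own illustration, using the Hugging Face transformers package, which the original example does not depend on) that tokenizes a sentence with a pretrained BERT tokenizer. Note that BERT marks sentence boundaries with [CLS] and [SEP] rather than the generic [START] and [END] tokens used above.

    # A minimal sketch, assuming the Hugging Face "transformers" package is installed.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    text = "The wolf ate the lamb"
    tokens = tokenizer.tokenize(text)                 # word-piece tokens, lowercased by this model
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    print(tokens)
    print(token_ids)                                  # integer ids used to look up embeddings
    print(tokenizer.encode(text))                     # same ids wrapped with [CLS] ... [SEP]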

 


 

Figure 3. Token Vectorization and Embedding Matrix (Right).

 

You can think of tokens as keys in a lookup table that maps each word in the input to its vector representation. The vectors that correspond to words are known as embeddings, and the matrix that houses all those vectors is known as the embedding matrix (shown in figure 3). The dimensionality of the embeddings differs depending on the model being used, and it is usually specified by the modeler early in the modeling process. The original transformer used embeddings with a dimensionality of 512.

 

What truly makes embeddings powerful is the fact that there are learnable parameters associated with the embedding matrix. These learnable parameters are adjusted during training such that they learn the relationships between words in a sequence. The original transformer used the same embedding matrix in the encoder and decoder.
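
To illustrate the lookup itself, here is a minimal NumPy sketch with toy sizes and token ids of my own choosing; in a real model the embedding matrix is a learnable parameter that is updated during training rather than a random array.

    import numpy as np

    vocab_size, d_model = 10_000, 512              # assumed sizes; the original transformer used d_model = 512
    rng = np.random.default_rng(0)

    # The embedding matrix: one learnable row per token in the vocabulary.
    embedding_matrix = rng.normal(size=(vocab_size, d_model))

    token_ids = np.array([101, 2023, 2003, 2019, 7953, 102])   # example token ids
    embedded = embedding_matrix[token_ids]                      # shape: (sequence_length, d_model)
    print(embedded.shape)                                       # (6, 512)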

 

Positional Encodings

 

After the words in an input have been vectorized, both the encoder and the decoder in the original transformer apply positional encoding to the embedded vectors. The positional encoding encodes the position of a word in a sequence.

 


 

Figure 4. Positional Encoding Layers.

 

In the sentence “The wolf ate the lamb”, 'The' is the first word, 'wolf' is the second word, and so on. The order in which those words appear in the sentence matters. “The wolf ate the lamb” is not the same sentence as “The lamb ate the wolf”; a single positional swap can change the entire meaning of a sentence. For this reason, the positional encoding records where each word appears in a sequence. Without it, the order of the inputs is lost, which would be very detrimental to the model.

 

In a transformer, positional encodings replace convolutions and recurrences. It is hypothesized that this function allows the model to easily learn to attend to tokens in a sequence via their relative positions. Positional encodings are designed to have the same dimensionality as the embeddings, so they can simply be added to the embedded vectors. This makes this implementation of positional encodings very computationally efficient.

 


 

Figure 5. Sinusoidal Positional Encodings.

 

Sinusoids provide a way to accomplish positional encoding that is reminiscent of the Fourier transform. Sine and cosine functions of different frequencies (illustrated in figure 5) are used to encode the positions of words in a sequence. Each function takes as arguments the position of the word in the sequence, the index of the element in the embedded vector that is being encoded, and the model embedding dimension. The output of the encoding function is then added to the corresponding element of the embedded vector. The sine function is used for even-numbered elements of the embedding vector and the cosine function for odd-numbered elements.
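
Here is a small NumPy sketch of this encoding, my own implementation of the formulas shown in figure 5:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # positions: 0..seq_len-1, one row per token in the sequence
        positions = np.arange(seq_len)[:, np.newaxis]          # shape (seq_len, 1)
        # i indexes pairs of embedding dimensions; the frequency shrinks as i grows
        i = np.arange(0, d_model, 2)[np.newaxis, :]            # shape (1, d_model/2)
        angle_rates = positions / np.power(10000.0, i / d_model)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle_rates)   # even embedding indices use sine
        pe[:, 1::2] = np.cos(angle_rates)   # odd embedding indices use cosine
        return pe

    # The encodings have the same shape as the embedded input, so they can simply be added.
    embedded = np.random.default_rng(0).normal(size=(5, 512))  # toy embedded sequence
    encoded = embedded + sinusoidal_positional_encoding(5, 512)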

 

It should also be noted that the equations in figure 5 are not the only way to encode the position of text in a sequence. Another alternative explored in the original transformer paper is learned positional embeddings, which yielded similar performance to the sinusoidal encodings. The authors chose the sinusoidal version because it does not require the inputs to have a fixed length: sinusoidal encodings may allow the model to extrapolate to sequence lengths longer than the ones encountered during training, whereas learned positional embeddings are defined only for the fixed set of positions seen during training. Other language models, such as GPT-1 and GPT-2, use learned positional embeddings.

 

Global Self-Attention Layer

 

As we stated in the previous post, each Multi-Head Attention (MHA) layer in the transformer is slightly different, so it is not uncommon to see these layers referred to by specific names. The first MHA layer in the encoder is sometimes called the global self-attention layer. The global self-attention layer allows information to flow in both directions because it performs self-attention without masking: later words in a sequence can attend to words earlier in the sequence and vice versa. This ultimately allows the encoder to truly capture the interrelatedness between all words in a sequence. Think about a scenario in which you are translating a sentence. Generally speaking, you listen to the entire sentence, or at least a certain number of words, before starting to translate. The order of words in a sentence matters, and it varies from language to language. If you translate everything word by word, your translation may be incorrect in the target language. Let's say you're trying to translate the Japanese phrase “Oishii desu!”, which translates to “It is delicious!”. If you were to translate it word by word in the order in which the Japanese words appear, you'd produce the Yoda-like sentence “Delicious it is!”, which is not the best way to translate that sentence.

 

Let's look at an example of what occurs within the global self-attention layer. As we saw earlier, the input to the encoder is a sequence of text. After that text has been tokenized, vectorized, and combined with its positional encoding, we end up with a matrix that represents the input sequence. One axis corresponds to the sequence of words and the other axis corresponds to the embedding dimension.


 

Figure 6. Input Dimensions (Left) with Example (Right).

 

Let's say that the input sequence is “this is an input!”. Tokenizing it gives a sequence length of 5 (the exclamation mark gets its own token), and the embedding dimension is a hyperparameter determined by the modeler. In the original transformer the embedding dimension is 512, but for the sake of this example, let's say it's 3.

 


 

Figure 7. Self-Attention Matrices.

 

Now that we have our matrix of inputs, we can move forward and start to compute self-attention. In self-attention the query, key, and value are all the same, which means this matrix is equal to Q, K, and V. Although technically we have everything that we need to compute self-attention, remember that within transformers, attention appears in the multi-head attention layers. Multi-head attention layers are important because they introduce learnable parameters (or weights) into the attention computation. Without these weights we would just be re-averaging vectors as this matrix moves through the transformer. The query, key, and value each have their own set of learnable parameters that will be used to modify them. The weight matrices have a dimension of embedding by embedding so that they can generate an output of the same dimension as their input.

 


 

Figure 8. Generating Weighted Embeddings.

 

It is important to note that the operation illustrated in figure 8 is performed separately for Q, K, and V, and each of them has its own set of weight matrices. Remember that in the multi-head attention layer, attention is computed once for each attention head, and each attention head has its own set of weights for Q, K, and V. In the original transformer each MHA layer had a total of eight attention heads; however, for the sake of this example we are going to focus on a single attention head. Once we have multiplied the query, key, and value by their weights, we can compute self-attention.

 


 

Figure 9. Visualization of the Self-Attention Computation.

 

The query matrix is multiplied by the transpose of the key matrix, and this product is divided by the square root of the dimension of the keys (a scaling step not displayed here). Once we have scaled the product of these two matrices, we apply a softmax function to it. The resulting matrix is then multiplied by the value matrix. If you have kept track of the dimensions of all the matrices, you will notice that the matrix produced by these operations has the same dimensions as the original input matrix. The outputs of all the heads are then concatenated and passed to one final linear layer, where they are multiplied by one final weight matrix. That weight matrix has a dimension of embedding by embedding so that we output a matrix of the same dimensions as the original input to the encoder.
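
Putting the last few steps together, the following NumPy sketch computes one self-attention head end to end. The shapes and random weights are illustrative choices of my own; a real multi-head layer would project into smaller per-head dimensions and concatenate the eight head outputs before the final linear layer.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model = 5, 512                     # "this is an input !" -> 5 tokens

    X = rng.normal(size=(seq_len, d_model))       # embedded input plus positional encoding

    # In self-attention the same matrix plays all three roles before projection.
    W_q = rng.normal(size=(d_model, d_model))     # learnable projection weights
    W_k = rng.normal(size=(d_model, d_model))
    W_v = rng.normal(size=(d_model, d_model))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    scores = (Q @ K.T) / np.sqrt(d_model)         # scaled dot product, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    output = weights @ V                          # shape (seq_len, d_model), same as the input

    print(output.shape)                           # (5, 512)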

 


 

Figure 10. Attention Head Output Visualization.

 

The operations that occur in the other MHA layers in the transformer are very similar to the steps that we have already outlined. When we discuss the other MHA layers, we will focus on highlighting the differences between those steps and these.

 

Causal Self-Attention Layer

 

Next let’s look at the first MHA layer in the decoder. This layer is sometimes called the causal self-attention layer.

 


 

Figure 11. Causal Self-Attention Layer.

 

As we have stated previously, transformers are autoregressive models, which means that they generate one token at a time and feed that output back to the input. Unlike in the global self-attention layer, here every word in a sequence does not attend to every other word in the sequence. In the encoder block the model tries to learn the interrelatedness between all the words. The decoder requires a different approach: if every word could attend to every other word in the sequence, the model would not learn to predict tokens in a way that makes syntactic sense; it would just output a grab bag of words. That means we need to ensure that during training we don't look at future words in a sequence. Leftward information flow in the decoder needs to be prevented to preserve the decoder's autoregressive property. This is achieved by applying a mask that effectively hides future words in a sequence during training.

 

Teacher Forcing and Masking

 

Now you might be wondering: if the decoder is autoregressive, how is it more parallelizable than RNNs? Part of what makes this possible is a training technique called teacher forcing. Teacher forcing passes the true value of an input to the next time step regardless of the model's output at the current time step. Let's say that we are training a transformer to translate Spanish text to English text, and in this training iteration the sequence we're trying to translate is “El lobo se comió el cordero”, which translates to “The wolf ate the lamb”. In this scenario the input is the Spanish sequence and the target is the English sequence. Let's say we're trying to predict the third token in the sequence, “ate”. Regardless of whether we correctly predicted the tokens for “The wolf” in the previous two iterations, when trying to predict the third token we pass the correct tokens (“The wolf”) as inputs to try to produce the third token (“ate”) in the sequence. Additionally, thanks to the way attention is computed, each word doesn't need to be processed sequentially, so if we are trying to predict the third token, we can process the first two tokens simultaneously. Thanks to teacher forcing we do not need to run the model sequentially; the outputs at different sequence locations can be computed in parallel.
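
The following sketch (with placeholder token strings of my own; a real pipeline works with token ids) shows how the decoder inputs and targets are typically arranged for teacher forcing: the decoder input is the target sequence shifted right.

    # Target sentence for one training example.
    target = ["[START]", "The", "wolf", "ate", "the", "lamb", "[END]"]

    # Decoder input is the target shifted right: the model sees the true previous
    # tokens at every position, regardless of what it would have predicted itself.
    decoder_input = target[:-1]     # ["[START]", "The", "wolf", "ate", "the", "lamb"]
    decoder_target = target[1:]     # ["The", "wolf", "ate", "the", "lamb", "[END]"]

    # Because every position gets the ground-truth history, all positions can be
    # predicted in one parallel pass instead of one step at a time.
    for i in range(len(decoder_target)):
        history = decoder_input[: i + 1]
        print(f"step {i}: history {history} -> predict {decoder_target[i]!r}")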

 

Masking is another major component in training the decoder portion of the transformer. It is implemented in the causal self attention layer as part of the scaled dot product attention calculation.

 


Figure 12. Attention Computation with Masking Inside Causal Self Attention Layer.

 

The mask hides values that an input shouldn't see, depending on which token in the sequence we're trying to predict. It sets all values in the softmax input that correspond to illegal connections (think future words the model should not yet be seeing) equal to negative infinity. The ability to mask values, in combination with teacher forcing, allows us to train language models on much longer sequences than we previously could and, as a result, to build language models larger than ever before.
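
A small NumPy sketch of the masking step, using toy numbers of my own, shows how future positions are set to negative infinity before the softmax so that their attention weights become zero:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    seq_len = 4
    scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # stand-in for Q K^T / sqrt(d_k)

    # Upper-triangular mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)     # illegal (future) connections -> -inf

    weights = softmax(scores, axis=-1)
    print(np.round(weights, 2))                  # row i has zeros to the right of position i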

 

Cross Attention Layer

 

The last MHA layer that we find in the transformer performs cross attention. This layer is important in the transformer architecture because it connects the encoder output to the decoder. Before discussing the cross attention layer in more detail, it is important to highlight another key difference between the encoder and the decoder: the decoder receives the outputs shifted right, which can be seen at the bottom of figure 13.


 

Figure 13. Cross Attention Layer.

 

The first token passed to the decoder is always a special [START] token, so before we try to predict the first token in the sequence, we consult the encoder output. Using the same machine translation example from earlier, with "El lobo se comió el cordero" translating to "The wolf ate the lamb", prior to translating the first word in the sequence we consult the encoder output. During inference we always start by feeding the [START] token, consulting the encoder output, and then feeding the decoder output back to the decoder to predict the next token in the sequence. The attention that we compute here determines how relevant the source sequence is in the computation of the output. It is important to note that there are multiple decoder layers, and every cross-attention layer that appears in the decoder receives the encoder output. For example, in the original transformer there are 6 cross-attention layers, and all six of those layers receive the encoder output.

 

In the cross-attention layer the encoder output is set equal to both the key and the value. The decoder's current sequence, whether that is only the [START] token or a longer sequence of tokens, acts as the query in the cross-attention computation. Padding may need to be applied so that the dimensionality matches between the encoder output and the decoder output.
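
The following NumPy sketch, with toy shapes of my own and the projection weights omitted for brevity, makes these roles explicit:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    d_model = 512
    rng = np.random.default_rng(0)

    encoder_output = rng.normal(size=(6, d_model))   # source: "El lobo se comió el cordero"
    decoder_states = rng.normal(size=(3, d_model))   # target so far: "[START] The wolf"

    Q = decoder_states                               # query comes from the decoder
    K = V = encoder_output                           # key and value come from the encoder

    weights = softmax((Q @ K.T) / np.sqrt(d_model))  # shape (3, 6): each target token attends to the source
    cross_attended = weights @ V                     # shape (3, d_model), one row per target token
    print(cross_attended.shape)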

 


 

Figure 14. Encoder Output Matrix.

 

The cross-attention computation is very similar to the one we saw in the causal self-attention layer. Remember that the query, key, and value matrices are multiplied by their weight matrices before getting to this point. The masking component is not applied in the cross-attention layer because the output of the previous multi-head attention layer already included masking, and that is the matrix used as the query here.

 

Residual, Linear, and SoftMax Layers

 

Throughout the transformer we have residual layers and layer normalization at the output of each multi-head attention layer and each feed forward layer. Residual connections and layer normalization are common techniques in deep learning so we will not do a deep exploration of these topics at this time. However, it is important to point out that these techniques help to prevent the loss of information and improve the overall generalizability of the transformer.

 


 

Figure 15. Residual Layers with Layer Normalization.

 

The feed-forward blocks in the encoder and decoder contain two fully connected layers: the first layer in the block has a user-defined number of hidden units, while the second layer has a number of hidden units that matches the dimensionality of the embeddings.

 


 

Figure 16. Feedforward Layers.

 

This allows the output of the feed-forward block to match the dimensionality of its input. It outputs a matrix with dimensions of sequence length by hidden unit size, and since the second layer's hidden unit size is equal to the embedding dimension, the input matrix to the feed-forward block and its output matrix are the same size. Just like residual connections and layer normalization, fully connected layers are commonly used in deep learning, so we will not explore this topic in much detail. The output of the decoder layers is passed through one final linear layer and a softmax activation to produce output probabilities over the vocabulary, in other words, a prediction for the next word in the sequence.
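
The following NumPy sketch, with toy sizes of my own choosing, puts the position-wise feed-forward block and the final linear-plus-softmax step together:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model, d_ff, vocab_size = 5, 512, 2048, 10_000

    X = rng.normal(size=(seq_len, d_model))          # output of the attention sublayer

    # Position-wise feed-forward block: expand to d_ff, apply ReLU, project back to d_model.
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    ffn_out = np.maximum(0, X @ W1 + b1) @ W2 + b2   # shape (seq_len, d_model), same as the input

    # Final linear layer + softmax: one probability distribution over the vocabulary per position.
    W_out = rng.normal(size=(d_model, vocab_size))
    probs = softmax(ffn_out @ W_out, axis=-1)        # shape (seq_len, vocab_size)
    next_token_id = int(np.argmax(probs[-1]))        # greedy pick for the next token
    print(probs.shape, next_token_id)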

 


 

Figure 17. Decoder Output.

 

During inference, the token chosen from the softmax output is fed back as an input to the decoder to predict the next token in the sequence. This process continues recursively until the decoder produces the final token, the [END] token.
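
Conceptually, the inference loop looks like the sketch below, where decoder_step is a hypothetical placeholder for a full forward pass through the trained encoder-decoder model:

    # Hypothetical greedy decoding loop; decoder_step() stands in for a real model call
    # that returns a probability distribution over the vocabulary for the next token.
    def greedy_decode(encoder_output, decoder_step, start_id, end_id, max_len=50):
        generated = [start_id]                       # decoding always begins with [START]
        for _ in range(max_len):
            probs = decoder_step(encoder_output, generated)           # softmax output for the next token
            next_id = max(range(len(probs)), key=probs.__getitem__)   # pick the most likely token
            generated.append(next_id)
            if next_id == end_id:                    # stop once the [END] token is produced
                break
        return generated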

 

Conclusion

 

In this post we’ve elaborated upon some of the topics discussed in the previous post of this series. We’ve now explored the individual components of the transformer model in more depth and discussed how text moves through the transformer architecture. In the next post of this series we’ll discuss two early LLMs, GPT and BERT and how they’re different from the transformer model.

 

References

 

    1. Attention is All You Need
    2. Serial order: a parallel distributed processing approach
    3. Long Short-Term Memory
    4. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
    5. Neural Machine Translation with a Transformer and Keras
    6. Introduction to Attention, Transformers, and Large Language Models: Part 1

 

 

Find more articles from SAS Global Enablement and Learning here.
