
Introduction to Attention, Transformers and Large Language Models - Part 1


Transformer-based models are the backbone of modern large language models. Unlike language models built on recurrent neural networks, which operate via recurrence, transformer-based models rely entirely on attention. The objective of this post is to introduce the attention mechanism and the transformer model, laying the groundwork for future posts focused on large language models. The ultimate objective of this series is to provide enough detail and background to get machine learning and deep learning developers comfortable implementing and using LLMs.

 

Introduction

 

Language models are used for a variety of tasks including speech recognition, machine translation, text generation, entity recognition, sentiment analysis, and many more. Before the first paper on the transformer model was published in 2017 [1], most language models were based on recurrent neural networks (RNNs) [2]. Earlier approaches, such as n-gram models, predate RNNs, but those are beyond the scope of this post.

 

History of Recurrent Neural Networks (RNN)

 

Early work in language modeling can be traced all the way back to the early 1900s. Although that work is not useful in modern applications, it was influential in the development of these models. Substantial progress came with the advent of the RNN in the 1980s and its variations in the 1990s, aided by the computational resources that became widely available during those periods. A major step forward was taken in 1997 when the Long Short-Term Memory (LSTM) network paper was published [3]. At the time, this model set performance records across many natural language tasks.

 

For roughly two decades, recurrent neural networks and their variations (such as LSTM and GRU [4]) were the basis of the most powerful language models. These models can process sequential data such as time series and text. The order in which words appear is central to the construction of language, so any model used for natural language tasks must take sequences into account. RNNs were, and in some cases still are, used for text classification, machine translation, and entity recognition.

 

Limitations of RNNs

 

In the years since, there have been many developments that have improved the power of these models. Although RNNs achieved impressive results, their design predates the wide availability of parallel computing and extensive computational resources. RNNs require computations over previous elements in a sequence before they can generate outputs for later elements. This inherently sequential nature limits the degree to which they can be parallelized.

 

Relating signals from two arbitrary positions also becomes prohibitively difficult. As sequence lengths increase, it becomes increasingly hard for a model to learn dependencies between distant positions. To make matters worse, the number of operations required to relate two positions grows with the distance between them, linearly or logarithmically depending on the architecture.

 

Lastly, these issues limit the length of sequences that RNNs can handle. Limiting the length of sequences that RNNs can process limits the kinds of data we can use to train models, the applications for which we can use these models, and ultimately how deep an understanding of language a model can develop.

 

Attention is All You Need

 

In 2017, researchers at Google and the University of Toronto published a paper titled “Attention is All You Need” [1]. In this paper, the authors introduced a model called the transformer. The model leverages a technique called attention, applied in the form of multi-head attention layers. Attention is inspired by the way human brains deal with massive amounts of audio and visual input.

 

The transformer relies entirely on attention, without the recurrences or convolutions used in RNNs and CNNs, respectively. One key motivation for using attention without recurrence is that it addresses the shortcomings of sequential models such as RNNs. Attention does not require text to be processed sequentially, which makes these models much more parallelizable. They achieve this while reducing the number of sequential operations required to generate an output and lowering computational complexity. One last key advantage of attention-based models is that they can learn dependencies between distant positions, which allows them to process and learn from longer sequences than RNNs can handle.

 

Figure 1. Scaled Dot Product Attention.

 


 

In Figure 1 we can see a visualization of how to compute the type of attention implemented in transformers: scaled dot product attention. Attention layers are fundamentally weighted mean reductions. They take three inputs, called the query, key, and value vectors (generally packed into matrices). These names were inspired by data retrieval systems. One way to think about attention is as mapping a query and a set of key-value pairs to an output.

 

The process begins by taking the query and key, generally in matrix form, and performing matrix multiplication between the query and the transposed key matrix. We then scale these products and optionally apply a mask to the output. The mask is useful because it can hide words in a sequence; masking can be used to prevent language models from looking at future words in a sequence during training. The resulting matrix is passed to a softmax function, and the softmax output is then multiplied with the value matrix.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Equation 1. Scaled Dot Product Attention.

 

We can see the formula for scaled dot product attention above. We divide the product of the query and key matrices by the square root of the dimension of the keys. This scaling factor is used because, for keys with high dimensionality, the dot products grow large in magnitude, which pushes the softmax into regions where it has very small gradients.
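To make the computation concrete, below is a minimal sketch of scaled dot product attention in PyTorch. The function name, tensor shapes, and the optional mask argument are illustrative assumptions rather than the exact code of any particular library.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute Equation 1 for query/key/value tensors of shape (..., seq_len, d)."""
    d_k = query.size(-1)
    # Compare every query against every key, then scale by the square root of d_k
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Optionally hide positions (for example, future tokens) before the softmax
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # attention weights for each query sum to 1
    return torch.matmul(weights, value), weights  # weighted mean of the value vectors
```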

 

Looking at attention mathematically, it may be difficult to get an intuitive grasp of what it is computing. One way to gain some intuition is to think about a sentence like, "The quick brown fox jumps over the lazy dog." In a given sentence, we often place our attention on a specific word (or words), while the rest of the sentence adds structure and context to what we're paying attention to.

 

Figure 2. Attention Sentence Example.

 

What’s being paid attention to in the sentence is the query; in this case we'll choose "the lazy dog". This is what we'll calculate attention for. The key and value are the words we're evaluating in relation to the query, in other words, how relevant those words are to the query. In a conversation there is always a subject or topic of interest, which is what we focus our attention on. In this example, the topic of interest is the lazy dog. Maybe we're having a conversation about how lazy our friend's dog is. The next few sentences or paragraphs in the conversation may revolve around this lazy dog. To reinforce the idea that the dog is lazy, we give an example of something that happened: a quick fox jumped over the dog and, presumably, the dog had no reaction. In this case, "quick brown fox" and "jumps over" play the role of the keys and values. They're the words that help communicate the idea that the dog is lazy.

 

Researchers often have an idea for how to improve language models based on real-world observations; in this case it was the idea of attention. It is important to remember, however, that many of these ideas may not reflect how things actually work in the brain. They are simply ideas that researchers model mathematically to see if they yield better results on the task at hand, in this case, language modeling.

 

Self-Attention

 

Although we can calculate attention using different queries, keys, and values, another variation of attention found in transformer models is self-attention. Self-attention is a special case of attention in which the query, key, and value are all the same. This means that everything in a sequence ultimately attends to itself. This technique is useful when you're trying to understand the meaning of a given input sequence and how each of the values in that sequence relate to one another.
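As a quick illustration, here is what self-attention looks like using the scaled_dot_product_attention sketch from above, with the same tensor passed in as query, key, and value; the batch size, sequence length, and embedding dimension are arbitrary assumptions.

```python
import torch

# Self-attention: the query, key, and value are all the same sequence.
x = torch.randn(1, 9, 64)  # e.g., embeddings for "The quick brown fox jumps over the lazy dog."
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape)  # torch.Size([1, 9, 9]): every token attends to every token, itself included
```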

 

As an example, think of a translator translating a sentence from one language to another. They generally read or listen to the entire sentence that needs to be translated, process all the words to determine their meanings, and only then start to produce a grammatically correct translation that gets the actual meaning across. Take the Japanese sentence "oishii desu", which translates to "it is tasty". "Oishii" means "tasty" and "desu" means "it is". If the translator were to translate each word as they heard it, the sentence would come out as "Tasty it is". This may get the meaning across, but it is grammatically incorrect and may be confusing. In some cases, translating one word at a time can yield something with an entirely different meaning. Take the sentence "The cat sat on the mat": by swapping just two words you get "The mat sat on the cat", which has an entirely different meaning. Processing all the words in a sentence is what allows a translator to understand the true meaning of what's being said and to produce a grammatically correct sentence. Everything prior to producing the translation is effectively what self-attention helps transformer-based models do. It is useful whenever we need to find how every word in a sequence relates to the others.

 

Multi-Head Attention

 

Attention isn't calculated only once; in fact, it is calculated many times within a transformer. In transformers, attention is used in what are called multi-head attention (MHA) blocks. Within these blocks, attention is computed multiple times in what are called heads; more specifically, attention is computed once per head. It's important to note that we don't just compute attention between Q, K, and V multiple times. That would not accomplish anything, since we would just be repeatedly re-averaging the same vectors. Instead, MHA blocks pass Q, K, and V through their own sets of linear layers, which is where we start to introduce learnable parameters and effectively turn transformers into neural networks.

 

Figure 3. Multi-Head Attention.

 

Each attention head (denoted as h in Figure 3) contains a different set of learnable parameters for each of Q, K, and V. This means that each attention head effectively computes attention using different projections of these three inputs. The output of each attention head is then concatenated and passed through one final linear layer to generate the MHA layer output.

 

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Equation 2. Multi-Head Attention.

 

The equation for multi-head attention is straightforward once we know how to calculate attention. All we need to do is concatenate the outputs of each attention head into a single matrix and then multiply that by a matrix of learnable parameters, which represents the final linear layer in the MHA block. An individual attention head is computed by applying attention to the projected (weighted) query, key, and value matrices. The number of attention heads is a hyperparameter chosen by the modeler that determines how many times Q, K, and V are projected.
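Putting Equation 2 together with the attention function sketched earlier, a minimal multi-head attention block might look like the following. The class name, the default d_model and num_heads values, and the reuse of scaled_dot_product_attention from above are illustrative assumptions, not a definitive implementation.

```python
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        # Separate learnable projections for the query, key, and value inputs
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final linear layer applied after concatenation

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)

        def split_heads(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(query))
        k = split_heads(self.w_k(key))
        v = split_heads(self.w_v(value))
        # Attention is computed once per head, in parallel across the head dimension
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads back together and apply the final linear layer
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.w_o(out)
```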

 

Ultimately, you can think of multi-head attention as a method that allows the model to jointly attend to information from different representation subspaces at different positions. That means that as we train our model using some data set, each attention head will capture and focus on different relationships between these three inputs. 

 

Transformer

 

Now that we understand what attention and multi-head attention are, we can start to discuss the overall architecture of the transformer. The transformer consists of two sections: an encoder and a decoder.

 

Figure 4. Transformer Architecture.

 

The encoder takes an input sequence, learns the interrelatedness of all the words in the sequence, and outputs a representation of it. The decoder consults the encoder output to generate its own output, auto-regressively, one token at a time. The decoder output is the transformer's final output. Notice that multi-head attention layers appear in both the encoder and the decoder. In very simple terms, the learnable parameters, along with the way queries, keys, and values are combined within the transformer, are how these models fundamentally learn the meaning of and interrelatedness between words. There are a handful of other components that make these models work; those will be discussed in parts 2 and 3 of this series of blogs.

 

Let's look at the encoder, the decoder, and all the other components that make transformers work in more detail.

 

Transformer Encoder

Figure 5. Transformer Encoder.

 

The encoder shown in Figure 5 receives the user input, passes it through an input embedding, applies positional encoding, and then passes the result through a stack of N identical layers. Each layer has two sublayers: a multi-head attention layer and a feed-forward layer. You can think of the feed-forward layer as what you would find in a classic neural network. Both the multi-head attention layer and the feed-forward layer are followed by a residual connection and layer normalization. In the "Attention is All You Need" paper [1], this stack is repeated a total of 6 times.
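As a rough sketch of that structure, one encoder layer could be expressed as follows, reusing the MultiHeadAttention block from above. The feed-forward size, the normalization ordering, and the class name are illustrative assumptions rather than an exact reproduction of the paper's implementation.

```python
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward sublayer, like a small classic neural network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention sublayer: Q = K = V = x, followed by a residual connection and layer norm
        x = self.norm1(x + self.self_attn(x, x, x, mask))
        # Feed-forward sublayer, also followed by a residual connection and layer norm
        return self.norm2(x + self.feed_forward(x))
        # The full encoder stacks N = 6 of these layers on top of the embedding and positional encoding.
```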

 

The objective of the encoder is to learn a representation of the sequence and the interrelatedness between all the words in it. Every word in the sequence pays attention to every other word in the sequence. If this sounds familiar, it's because the MHA layer in the encoder performs self-attention. We'll cover the details of what occurs in each of these layers in the next post.

 

Transformer Decoder

 

Figure 6. Transformer Decoder.

 

The architecture of the decoder does not look much different from that of the encoder. The key differences are that the decoder applies masking, contains a second multi-head attention layer, and does not receive the user inputs directly. Instead, the decoder consults the encoder output in order to generate its own output. Just like the encoder, this stack is repeated a total of 6 times in the "Attention is All You Need" paper [1].

 

The objective of the decoder is to predict words one at a time. Unlike the encoder, where the model tries to learn the interrelatedness between all words in a sequence, the decoder must only consult outputs from previous time steps (i.e. previous words) when computing a future output. For this reason, the decoder performs masked multi-head attention in its first MHA layer. The masking in this layer prevents future words in a sequence from being used in the attention computation during training. When the decoder is used to generate predictions, it does not know the future word that it needs to predict; a decoder trained on data where it is allowed to see the future would not perform well on new data where this information is not available.
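To show what that masking looks like in practice, here is a small sketch of a causal (look-ahead) mask that could be passed to the scaled_dot_product_attention function from earlier; the sequence length and the 1/0 convention are assumptions for illustration.

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend to positions 0..i, future positions are hidden
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Passing this mask to scaled_dot_product_attention sets the scores of masked (0) positions
# to -inf, so they receive zero weight after the softmax.
```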

 

Conclusion

 

In this post we’ve introduced the attention mechanism, self-attention, and multi-head attention, and given a quick overview of the transformer architecture. Multi-head attention layers are one of the main mechanisms through which transformer models learn the meaning and interrelatedness of words. You may be struggling to wrap your head around some of these concepts and how they fit into the transformer model, and more broadly, LLMs. In the next post we will dive a bit deeper into the different components of the transformer and break down how information flows through it.

 

References

 

  1. Attention is All You Need
  2. Serial order: a parallel distributed processing approach
  3. Long Short-Term Memory
  4. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
  5. Neural Machine Translation with a Transformer and Keras

 

 

Find more articles from SAS Global Enablement and Learning here.

