Welcome to Part 2 of this two-part series, which presents a high-level introduction to Generative AI. In Part 1, I introduced synthetic data generation and the primary model behind it, the Generative Adversarial Network, or GAN. Now, in Part 2, I’ll introduce the modern Large Language Model. If you didn’t check out Part 1, take a moment to do so, since I’ll use some of the terminology I defined there in this post. To recap, I am a human, and as the title indicates, this series is being generated by a human: me. The reason I, a human, am generating this series rather than using generative AI to do so is that the topic is relatively new to me, so to build a better understanding of it myself, I decided to post on the topic. Sharing my thoughts and understanding of Generative AI with you will deepen my knowledge of it, and hopefully you, also a human, can gain further insight into the topic as well.
Before I jump into Large Language Models, let me note that this topic has already been covered in SAS Communities by a few other authors. My esteemed former colleague Jason Colon wrote a multi-part series on Large Language Models (part 1, part 2), but he provides more technical detail than I aim for in this high-level introduction. Beth Ebersole also demystified Generative AI and Large Language Models in a post that covered some crucial terminology in the area and included a nice history of ChatGPT, arguably the best-known Large Language Model in use today. However, I’d like to cover a bit more about the underlying model and its structure than she did. So, I’m hoping this post is a nice addition and fills a gap in what is already available in SAS Communities.
Large Language Models, or LLMs, are one of the most widely used types of models in the field of GenAI. First, why do we need large language models? The best way to answer this question is by considering the limitations of historical text analytics models such as recurrent neural networks. Traditional text analytics models do not scale well to massive data and can suffer from performance issues. LLMs, on the other hand, scale well to huge amounts of data and have ever-increasing performance. The increase in performance is partially due to the fact that LLMs are more parallelizable than traditional language models. This is quite impressive given the size of modern LLMs: a single LLM can have a size measured in terabytes, with trillions of parameters (akin to weights in a neural network).

Historically, text analytics required separate models for different natural language processing (NLP) tasks, such as document classification, entity recognition, and sentiment analysis. LLMs can perform multiple NLP tasks without the need for separate models. LLMs also allow for multi-step reasoning, such as would be required in math-type word problems, whereas historical text analytics models do not.

What are large language models used for? They are used to process and generate natural language text, and when used as text generators, they simply provide the next most likely word in a sequence. A widely used application where these models really shine is language translation. LLMs can also identify complex relationships in language. Unlike early language models, they can learn patterns and meanings between words that are not necessarily located near each other. In other words, LLMs can learn distant relationships. Large Language Models are trained on massive amounts of text data, typically taken from the internet or other sources of large corpora. (In text analytics parlance, a corpus is a large collection of usually related documents. All of Shakespeare’s works together form a corpus.) So, what is it that led to this massive jump in improvement for the modern LLM? We need to look back about eight years to answer that question.
In 2017, a paper was published by researchers at Google and the University of Toronto that took things in a new direction. The paper was titled “Attention is All You Need” and it introduced a groundbreaking model called a transformer. I’ll take a different approach from what was presented in the original paper, and build up to the full transformer model, starting from its smallest component.
This “smallest component” (as I think of it) of the transformer model is based on a calculation known as attention. Attention is inspired by the way humans “pay attention” to the information they take in through their senses. When a human is listening to audio or watching video, they often pay attention to one specific part of the incoming information, and the rest provides context. For example, while watching a scene from a movie, our attention is typically focused on the main character and what specifically he or she is doing. Other aspects of the scene provide context. Where is the scene taking place? Is it day or night? What is the weather like? Are there other people present, and is the main character interacting with them or not? With that illustrative example of attention in mind, let’s move on to how attention is calculated within a transformer.
Below is a visualization of the type of attention that is the “smallest component” in a transformer.
This is called scaled dot product attention. It gets that name because the name concisely describes how attention is calculated: a matrix multiplication (dot product) that is then scaled. It starts with three inputs: a query (Q), a key (K), and a value (V). Each is typically a matrix but can also be a vector, depending on whether it represents a phrase or a single word. The query is what we are paying attention to, or in terms of the scaled dot product, what we are calculating attention for. If the query is a phrase or combination of words, Q is a matrix, whereas if the query is an individual word, Q is a vector. (The matrix or vector comes from an embedding matrix, which I’ll briefly discuss later.) Attention maps a query and a key-value pair to an output. The pictorial representation is interpreted from the bottom up. It begins with Q and K, where matrix multiplication is performed between Q and the transpose of K. The product is then scaled. Optionally, a mask can be applied to hide certain words in a sequence. Masking prevents the model from looking at future words when learning the meaning of, or relationships between, words. Remember that LLMs are text generators, predicting the next likely word in a sequence. Masking mimics this because when humans speak, for example, they don’t use future words to decide what to say next. The resulting matrix is passed to a softmax function, which is widely used as an activation function in neural networks and is simply a generalization of the logistic function. The softmax output is then multiplied with the V matrix to produce the final attention calculation.
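To make this concrete, below is a minimal sketch of scaled dot product attention written in Python with NumPy. It follows the formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V from the original paper, but the function names, shapes, and the boolean mask convention (True means “hide this position”) are my own illustrative choices, not any particular library’s API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) query vectors; K: (n_k, d_k) keys; V: (n_k, d_v) values.
    mask: optional (n_q, n_k) boolean array where True marks positions
    to hide (for example, future words during training).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # matrix multiply, then scale
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # attention weights sum to 1 per query
    return weights @ V                         # weighted combination of the values
```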
Breaking down these elements for a specific sentence may help you get a better grasp of how attention is calculated. Let’s use the sentence: “This blog post is a high-level introduction to large language models.” The query, Q, is what you are paying attention to. There is no right or wrong answer to what the query is in a sequence or sentence. In fact, each sentence could have multiple words or phrases that serve as a query. For our example sentence, I’ll choose “This blog post” as the query, so the scaled dot product is calculated for that phrase. The key, K, and value, V, provide context for what we are paying attention to. Again, there’s no right or wrong answer. In fact, mixing up the values of Q, K, and V is how the model learns the meanings of words. In our example sentence, I’ll choose “high-level introduction” and “large language models” as K and V. In this case, since Q, K, and V are all phrases, each would be a matrix in the scaled dot product calculation. We are paying attention to “this blog post”. But what is the context? What is special about the blog post, or what do we need to know about it? “This blog post” is a “high-level introduction”. Is there more context? What is it a high-level introduction to? “This blog post” is a “high-level introduction” to “large language models”. We have an element of the sentence we are paying attention to, and the rest of the sentence contains phrases that help us understand the context, and perhaps the meaning, of what we are paying attention to.
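Continuing the sketch from above, here is how the example sentence might map onto those inputs. The embedding vectors are random stand-ins (a real model would use learned embeddings), and note that in the actual calculation K and V must have the same number of rows because they form key-value pairs, so here both are drawn from the full context phrase.

```python
# Toy illustration with made-up 4-dimensional embeddings (not learned values).
rng = np.random.default_rng(0)

Q = rng.normal(size=(3, 4))  # "This blog post" -> 3 query token vectors
K = rng.normal(size=(6, 4))  # "high-level introduction to large language models"
V = rng.normal(size=(6, 4))  # values paired one-to-one with the keys

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per query token
```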
Keep in mind that what we are building up to is the full transformer model. We’ve covered the starting point, but now let’s discuss the next step toward the full model. As I said above, there is no right or wrong answer for what Q, K, and V are. For a given sentence or sequence, one calculation of attention may use different choices for Q, K, and V than another calculation. Further, the Q from a certain sentence might be combined with K and V from an entirely different sentence for yet another attention calculation in a longer sequence. Recall that modern LLMs can learn distant relationships between words, so Q, K, and V do not necessarily even come from the same sentence. Putting all this together implies that attention is calculated not just a single time within a transformer, but many, many times.
Within a transformer, attention is used in multi-head attention (MHA) blocks. Within each block, attention is computed multiple times in “heads”. The head is just one component of the block. As a bit of foreshadowing, the MHA blocks are where transformers begin to behave like massive neural networks. Each head performs its own calculation of attention. However, the calculations must differ, otherwise we would just be repeating the same calculation for Q, K, and V multiple times. Each head passes Q, K, and V through a set of linear layers made up of learnable parameters. These linear layers are just like those in the input layers of neural networks; they are what turn transformers into massive neural networks.
The diagram below illustrates a Multi-head Attention (MHA) block:
Each attention head produces a different output, even for the same values of Q, K, and V. That’s because the linear layers leading into each head have different parameters. The amount of output from the heads grows quickly as we iterate with different values of Q, K, and V as inputs. The output from the multiple heads is concatenated and then, just like the output layer of a classical neural network, passed through one final linear layer to generate the MHA block’s output. Even this final linear layer contains more learnable parameters.
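Here is a rough sketch of that flow, reusing the scaled_dot_product_attention function from earlier. The per-head projection matrices and the toy sizes are made-up stand-ins for what would be learnable parameters in a real model, and for simplicity the same input X plays the role of Q, K, and V (self-attention), with no batching or dropout.

```python
def multi_head_attention(X, num_heads, params):
    """Minimal multi-head self-attention sketch (no batching, no dropout)."""
    heads = []
    for h in range(num_heads):
        Wq, Wk, Wv = params["Wq"][h], params["Wk"][h], params["Wv"][h]
        # Each head projects Q, K, and V through its own linear layer...
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    concat = np.concatenate(heads, axis=-1)  # ...the head outputs are concatenated...
    return concat @ params["Wo"]             # ...and passed through one final linear layer

# Toy sizes: model dimension 8, 2 heads, head size 4.
rng = np.random.default_rng(42)
d_model, num_heads, d_head = 8, 2, 4
params = {
    "Wq": [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)],
    "Wk": [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)],
    "Wv": [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)],
    "Wo": rng.normal(size=(num_heads * d_head, d_model)),
}
X = rng.normal(size=(6, d_model))                        # a 6-token sequence
print(multi_head_attention(X, num_heads, params).shape)  # (6, 8)
```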
Ultimately, MHA is a method that allows the transformer to learn relationships between words, sometimes even those that are far apart, coming into the model from the training data. Each attention head captures and learns different relationships and meanings of words as it considers different values of the three inputs Q, K, and V.
Now to the ultimate model: the transformer. First, it’s helpful to know that there are two main parts to a transformer, and each uses its own MHA block(s). Each of the two parts has a different purpose, and each performs different NLP tasks. (In fact, each part can be used individually for specific NLP needs. A bit more on that later.) The part of the transformer that takes in the input is called the encoder, and the part that provides the output of the model is called the decoder.
The main task of the encoder is to learn the meaning of words and how different words are related. You can think of the encoder as the part of the model that learns a language. It creates an output that is passed to the decoder. This output is different from the original sequences consumed by the encoder. The decoder uses this output from the encoder to generate its own output. The decoder is the output text generator. In other words, it predicts the next most likely word, one at a time, autoregressively. Here is an image of a full transformer.
The encoder is the portion on the left-hand side and the decoder is the portion on the right. Aside from a few small details, the structure of each is similar.
The encoder receives the user input: text. It turns the text into numbers (using embeddings, which I’ll discuss later) and performs positional encoding, which preserves the order of words in a sequence. It then passes the encodings through a stack of N layers. The layers have a similar structure, but each layer has its own set of learnable parameters; the parameters of one layer are different from those in other layers. Each layer consists of two sublayers: an MHA block, as described above, and a feed-forward layer, as can be found in a typical neural network. Each sublayer employs a residual connection followed by normalization. The output of the encoder is then sent to the decoder.

Recall that the purpose of the encoder is to learn the meaning of words and the interrelatedness of all the words in a sequence. The encoder generates encodings that can be used for NLP tasks such as classification, question answering, sentiment analysis, and entity recognition. In fact, a well-known LLM called BERT uses only the encoder portion of a transformer to perform NLP tasks such as document classification. The term “encoder” is even included in the acronym: BERT stands for Bidirectional Encoder Representations from Transformers. Check out this post if you want to see how SAS technical support used a BERT model to accurately classify incoming emails. (A quick side note: although I describe BERT as an LLM here, it is quite small compared to other LLMs. Some may even call BERT a “Small” Language Model.)
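To make that layer structure more concrete, here is a rough sketch of the sinusoidal positional encoding from the original paper and of a single encoder layer, reusing the multi_head_attention sketch above. The layer normalization and feed-forward helpers are simplified stand-ins; real implementations add dropout, batching, and many stacked layers.

```python
def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (sine for even dimensions, cosine for odd)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(X, mha_params, ff_params, num_heads):
    """One encoder layer: MHA sublayer, then feed-forward sublayer,
    each wrapped in a residual connection followed by normalization."""
    X = layer_norm(X + multi_head_attention(X, num_heads, mha_params))
    return layer_norm(X + feed_forward(X, *ff_params))

# Example with the toy parameters from the MHA sketch (d_model=8, 2 heads).
seq_len, d_ff = 6, 16
ff_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
             rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
print(encoder_layer(X, params, ff_params, num_heads).shape)  # (6, 8)
```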
The structure of the decoder is similar to that of the encoder, with a key difference being that the decoder includes an additional masked MHA block. The decoder receives the output of the encoder, and it embeds and positionally encodes its own input (the sequence generated so far) before passing it through its own stack of N layers. Each of the N layers contains a masked MHA sublayer, an MHA sublayer that attends to the encoder’s output, and a feed-forward sublayer. The masked MHA sublayer prevents future words in a sequence from being used when attention is calculated during training. As with the encoder, each of the N layers is similar in structure but has its own learnable parameters, and each sublayer has a residual connection and performs normalization. The output of the N layers is passed through a final linear layer (with more learnable parameters), and in the final step a softmax transformation is applied. Recall that the purpose of the decoder is to generate text autoregressively, one word at a time. Specifically, it provides the next most likely word in a sequence, that is, the word with the highest probability of appearing next. The popular family of GPT models is based on only the decoder portion of the full transformer model. GPT stands for Generative Pretrained Transformer, and these models are used for text generation.
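The masking itself is simple to picture. Below is a sketch of the causal (“look-ahead”) mask, using the same boolean convention as the attention function earlier (True means the position is hidden): each word can attend to itself and to earlier words, but never to later ones.

```python
def causal_mask(seq_len):
    """True above the diagonal: position i may not attend to positions after i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

print(causal_mask(4).astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]

# This array can be passed as the mask argument of scaled_dot_product_attention
# so that masked (future) positions receive essentially zero attention weight.
```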
If text is the input to a transformer, and mathematical calculations such as matrix multiplication are performed within it, one remaining question is: how do we go from words to numbers? This is handled through the process of embedding. Every word the LLM knows is represented by a numerical value known as a token. The number of words the model knows is based on training. The model even has a token for words it does not know (words it did not see during training), tokens for punctuation, and start and end tokens to indicate the beginning and end of a sentence. Each token corresponds to a row (i.e., a vector) in a huge matrix known as the embedding matrix. Each vector in the embedding matrix is essentially made up of learnable parameters that define how the model understands language. As the model learns more about the interrelatedness of words, the rows in the embedding matrix are updated. These vectors from the embedding matrix are the numerical elements used in the transformer. Recall that earlier I said that when calculating attention, the query, Q, can be a vector or a matrix, depending on whether Q is a single word or a phrase, respectively. These vectors are the rows of the embedding matrix.
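Here is a toy illustration of that lookup. The vocabulary, token IDs, and embedding values below are all made up, and real LLMs typically split text with a subword tokenizer rather than by whole words, but the mechanics of “token ID in, row of the embedding matrix out” are the same.

```python
# Toy vocabulary mapping words to token IDs, plus a few special tokens.
vocab = {"<start>": 0, "<end>": 1, "<unk>": 2, "this": 3, "blog": 4, "post": 5}
d_model = 8

# Embedding matrix: one learnable row (vector) per token; random stand-ins here.
embedding_matrix = np.random.default_rng(1).normal(size=(len(vocab), d_model))

def embed(words):
    """Map words to token IDs, then look up their rows in the embedding matrix."""
    ids = [vocab.get(w.lower(), vocab["<unk>"]) for w in words]
    return embedding_matrix[ids]              # one embedding vector per token

Q = embed(["This", "blog", "post"])           # the query phrase as a (3, 8) matrix
print(Q.shape)
```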
For completeness of this series, I need to mention one additional component of GenAI. At SAS, we think of three pillars that together encompass all of GenAI. Synthetic data generation and large language models have been introduced in the two posts in this series. The third pillar of GenAI is digital twins. A digital twin is a digital, animated, dynamic ecosystem made up of an interconnected network of software, generative and non-generative models, and data that may be a combination of historical, real-time, and synthesized. This ecosystem both mirrors and synchronizes with a physical system. So, for example, a digital twin may exist for a large factory. The digital twin of this factory is a computer-simulated representation of the actual, physical factory. The idea is that experiments and “what-if” scenarios can be run on the digital twin as a cost-effective way to try out things that would be too costly or time consuming to do in the actual factory, or to determine what the consequences might be if something were to go wrong within the factory.
In this two-part series I, a human, have presented a human-generated, high-level introduction to two of SAS’s three pillars that encompass GenAI. Synthetic Data Generation and Large Language Models are proving to be key tools in the modern business landscape. I hope this introduction to these two areas of GenAI helps start you on a path to a deeper understanding of the underlying models for each. If you want to learn more, check out the resources I included in the first post in the series, which I linked to above.
Find more articles from SAS Global Enablement and Learning here.
The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.