
Introduction to Attention, Transformers and Large Language Models - Part 3 (GPT and BERT)


This post is the third and final post in this series focused on attention and large language models. Although it can be read as a standalone post, it may be helpful to review the first and second parts of this series. This post focuses on explaining how the transformer architecture serves as the basis for modern LLMs. Decoder-based LLMs, such as the GPT models, and encoder-based models, such as the BERT family, are both built on the transformer architecture.

 

From Transformers to LLMs

 

The transformer model is one of the most influential architectures ever proposed. Thanks to advancements in computing, the use of attention in place of recurrence, and the ability to massively parallelize computations, the door is now open to building language models far more powerful than anything seen before. The original transformer architecture largely focused on machine translation; however, by modifying the layers that follow the decoder output, the architecture can be adapted to a variety of different tasks. As a result, the transformer architecture is the basis for modern large language models, including the family of GPT models, BERT, LLaMA, Gemini, and many others.

 

Many of the differences between LLMs come down to the volume of training data, the number of layers, the number of parameters, the training time, and the training objectives. LLM architectures are otherwise similar with regard to the types of layers and connections that we see. Understanding the transformer architecture can therefore help you understand how modern LLMs work.

 

Key LLM Ideas

 

There are two key ideas that gave rise to modern LLMs: pre-training and fine-tuning. Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in domains that do not have much annotated data. Prior research showed that models able to leverage linguistic information from unlabeled data could serve as an alternative to gathering more annotated data. The goal of one of the first transformer-based large language models, the Generative Pre-trained Transformer (GPT), was to learn a universal representation that transfers to a wide range of tasks with little adaptation.

 

These models are effectively semi-supervised: they are pre-trained with a language modeling objective on large amounts of unlabeled data, and then they can be fine-tuned on a supervised objective with minor, or no, architectural modifications. Fine-tuning does assume that a labeled data set consisting of input tokens and a label is available. Most work on language models prior to the development of the transformer architecture required major architectural changes to apply a language model to a different supervised task. The core idea behind modern LLMs is that, with enough pre-training, these models can easily be applied to a variety of different tasks.

 

GPT-1 and BERT

 

Two models that are directly based on the original transformer architecture are the first GPT model and the BERT model. The two were published a few months apart in 2018 and were among the first transformer-based large language models. It is worth noting that, at the time, both models achieved state-of-the-art performance on a variety of benchmark data sets.

 


 

Figure 1. GPT-1 (Left) and BERT (Right) Architectures.

 


 

In terms of their architectural design, both models are noticeably simpler than the original transformer. This is because GPT models are decoder-only models, while BERT and its variations are encoder-only models. The GPT decoder is even simpler than the transformer decoder because it does not include the second multi-head attention layer (the cross-attention over the encoder outputs).

 

Comparing these two architectures side by side, it might be difficult to immediately spot any differences aside from the positional embeddings. Let's discuss the two models in a bit more detail.

 

GPT-1

 

GPT stands for Generative Pre-trained Transformer [2]. These models are generative, meaning that they produce sequences of tokens. The decoder is what made the original transformer autoregressive; since GPT is a decoder-only model, it is also autoregressive. In this context, autoregressive means that the prediction for the current token is a function of the previous tokens.

 

Pre-trained indicates that the model was trained on a large corpus of text with a standard language modeling objective, similar to the one used in the original transformer: given the previous tokens in a sequence, predict the next token.

 

Lastly, the T stands for Transformer. In summary, the GPT model is a decoder-only model based on the transformer architecture, which replaces recurrence with multi-head attention.

 

GPT-1 Architecture

 


 

Figure 2. Detailed GPT-1 Architecture.

 

The architecture visualization may make it seem as if this is a small model, but it is worth noting that there are 12 layers in this decoder-only transformer. Expanding that out a little, this implies 12 multi-head attention layers. The rest of the architecture hyperparameters are not much different from what we saw in the original transformer. The embedding dimension is 768, but instead of using 8 attention heads per multi-head attention layer, GPT uses 12. All in all, this model has a grand total of 117 million learnable parameters.
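
As a rough sanity check on that number, the back-of-the-envelope count below adds up the embedding and per-layer parameters. The vocabulary size (roughly 40,000 BPE tokens) and context length (512) are the values reported for GPT-1; biases and layer normalization parameters are ignored, so this is an approximation rather than an exact accounting.

```python
# Approximate parameter count for a GPT-1-sized decoder-only transformer.
vocab_size = 40_000       # BPE vocabulary (approximate)
context    = 512          # maximum sequence length
d_model    = 768          # embedding dimension
n_layers   = 12           # decoder blocks
d_ff       = 4 * d_model  # feed-forward inner dimension (3072)

token_embeddings    = vocab_size * d_model   # ~30.7M
position_embeddings = context * d_model      # ~0.4M
attention_per_layer = 4 * d_model * d_model  # Q, K, V, and output projections
ffn_per_layer       = 2 * d_model * d_ff     # two linear layers
per_layer           = attention_per_layer + ffn_per_layer

total = token_embeddings + position_embeddings + n_layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")     # roughly 116M, close to the reported 117M
```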

 

The last few layers after the decoder output can be changed to adapt the model to a variety of different tasks. By default, the output of the model is the probability of a particular token given the inputs. A linear layer followed by a softmax can also be applied to the decoder output to use the model for classification.
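
To make this concrete, here is a minimal sketch of what such a task head might look like in PyTorch. The decoder itself is left as a placeholder (any module that maps token IDs to a sequence of 768-dimensional hidden states), and classifying from the final token's hidden state follows the GPT-1 fine-tuning setup; names like `GPTClassificationHead` are illustrative rather than taken from any library.

```python
import torch
import torch.nn as nn

class GPTClassificationHead(nn.Module):
    """Illustrative task head: map the decoder's final hidden state to class probabilities."""
    def __init__(self, d_model: int = 768, n_classes: int = 2):
        super().__init__()
        self.linear = nn.Linear(d_model, n_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) produced by the decoder stack.
        last_token = hidden_states[:, -1, :]          # summarize the sequence with its last token
        return torch.softmax(self.linear(last_token), dim=-1)

# Usage with dummy decoder outputs: a batch of 4 sequences, 128 tokens each.
head = GPTClassificationHead(d_model=768, n_classes=2)
print(head(torch.randn(4, 128, 768)).shape)           # torch.Size([4, 2])
```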

 

GPT-1 Pre-Training

 

The BookCorpus dataset was used for pre-training. It contains over 7,000 unpublished books from a variety of genres. This dataset was chosen because it contains long stretches of contiguous text, which helps the model learn dependencies between distant pieces of information.

 

The pre-training was not tied to any specific task. The goal was to create a high-capacity language model that could later be tuned without major architectural changes. Pre-training on a large volume of data helps the model learn some of the semantics of language; afterwards, the model can be fine-tuned for a specific task with just a few epochs.

 

The rationale for this approach is that all language tasks require some knowledge of language in the first place. If you think of an RNN, the initial weight updates effectively allow the model to learn some of the semantics of language before it is able to perform the specific supervised task. This is something we see often in our daily lives: before a child can play a sport, they must learn to walk, develop finer motor skills, understand language, and understand the concept of rules. Similarly, the goal of pre-training is for the model to learn the semantics of language before it is trained to perform a supervised task.


 

Figure 3. Pre-Training Objective.

 

The pre-training objective is simply a standard language modeling objective: maximize the likelihood of correctly predicting the next token in the sequence using a neural network with parameters theta. The model was trained using the Adam optimizer.
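
Written out, the objective from the GPT-1 paper maximizes the log-likelihood of each token given the preceding k tokens in the context window:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```

where U = {u_1, ..., u_n} is the corpus of tokens, k is the size of the context window, and Θ are the parameters of the neural network.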

 

Fine-Tuning

 

Given GPT's pre-training, the model can be tuned to produce different types of outputs by passing the decoder output to a softmax layer and then training the model for a few epochs. In their experiments, the researchers found that three epochs of training were sufficient in most cases. Generally speaking, the hyperparameter settings from the unsupervised pre-training phase are reused for fine-tuning. Another clever way to avoid customizing the model architecture too much was to restructure the inputs for the fine-tuning tasks.

 

Input Transformations

 

The inputs to the model can be restructured in specific ways to construct sequences that can be used for fine-tuning. Text inputs can be transformed to handle a variety of tasks such as classification, textual entailment, similarity, and question answering.

 


 

Figure 4. Input Restructuring for Classification.

 

For simple tasks like classification, not much needs to be done beyond adding a linear layer followed by a softmax. For other tasks, such as textual entailment and question answering, sequences of text can be concatenated with a delimiter token in between. The delimiter acts as a breaking point between the sequences.
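
As a sketch of the idea, the helper below builds the token sequence for an entailment-style pair. The special start, delimiter, and extract tokens mirror the ones described in the GPT-1 paper, but the exact token names and the whitespace tokenization used here are placeholder assumptions.

```python
# Illustrative input restructuring for fine-tuning (token names are placeholders).
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def build_pair_sequence(premise_tokens, hypothesis_tokens):
    """Concatenate two token lists with special tokens, GPT-1 style."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

print(build_pair_sequence("a dog is running".split(), "an animal is moving".split()))
```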

 


 

Figure 5. Input Restructuring for Similarity Tasks.

 

In similarity tasks, where two sentences are compared, the sentences have no inherent ordering. Since there is no ordering, the pair is concatenated twice, once in each order. The two sequences are passed through the decoder independently, and the resulting representations are added element-wise before being fed to one final linear layer and softmax.
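
A minimal sketch of that flow, where `decode` is a placeholder standing in for the pre-trained decoder and simply returns a final hidden state:

```python
import torch
import torch.nn as nn

d_model, n_classes = 768, 2
output_head = nn.Linear(d_model, n_classes)

def decode(token_sequence):
    # Placeholder for the pre-trained GPT decoder's final hidden state.
    return torch.randn(d_model)

def similarity_logits(sentence_a, sentence_b, delim="<delim>"):
    # Two orderings of the same pair, processed independently...
    h_ab = decode(sentence_a + [delim] + sentence_b)
    h_ba = decode(sentence_b + [delim] + sentence_a)
    # ...then combined element-wise before the final linear layer.
    return output_head(h_ab + h_ba)

print(similarity_logits("the cat sat".split(), "a cat was sitting".split()))
```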

 


 

Figure 6. Input Restructuring for Q&A.

 

A question answering task requires a document (the context), a question, and a set of possible answers to be passed through the decoder. The document context and question are concatenated with each possible answer, with a delimiter token in between. Each of these sequences is processed independently through the decoder, and a softmax layer produces an output distribution over the possible answers.
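
Sketched out under the same placeholder assumptions as above (a `decode` stand-in for the pre-trained decoder plus a linear scoring layer), each candidate answer gets its own sequence and score:

```python
import torch
import torch.nn as nn

d_model = 768
score_head = nn.Linear(d_model, 1)   # one scalar score per candidate answer

def decode(token_sequence):
    # Placeholder for the pre-trained GPT decoder's final hidden state.
    return torch.randn(d_model)

def answer_distribution(context, question, answers, delim="<delim>"):
    # One sequence per candidate: context + question <delim> answer.
    scores = [score_head(decode(context + question + [delim] + ans)) for ans in answers]
    return torch.softmax(torch.cat(scores), dim=0)   # distribution over the answers

print(answer_distribution("the sky is blue".split(), "what color is the sky".split(),
                          [["blue"], ["green"], ["red"]]))
```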

 


 

Figure 7. Fine-Tuning Objective.

 

The modeling objective also changes for fine-tuning, since the parameters need to be adapted to the new supervised task: the likelihood of the label y, given the sequence of input tokens, is maximized. The researchers also found that keeping the language modeling objective as an auxiliary objective during fine-tuning was useful, helping the model generalize better and accelerating convergence.
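
In the notation of the GPT-1 paper, with a labeled data set C of token sequences x^1, ..., x^m and labels y, the supervised objective and the combined objective with the auxiliary language modeling term (weighted by lambda) are:

```latex
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```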

 

These are just a few examples of how inputs can be restructured to fine-tune these models. For smaller or older LLMs, fine-tuning can still be a viable way to improve performance on certain downstream tasks. That said, modern LLMs such as GPT-3, GPT-4, and Gemini are not open source, which makes it harder to have this level of flexibility in how they are fine-tuned. Even if they were open source, they are so large that fine-tuning is infeasible without access to massive amounts of resources. Fine-tuning these modern LLMs is often reserved for the publishers of the models and for institutions with the time, resources, and data to do so. As a result, modern LLMs are often adapted to specific tasks via techniques such as prompting and retrieval augmented generation. Both of these topics are outside the scope of this post but will be explored in future ones.

 

BERT

 

The other LLM that we will briefly discuss is BERT, which stands for Bidirectional Encoder Representations from Transformers [3]. The B in BERT stands for bidirectional; this comes from the fact that the language modeling objective isn't just predicting the next word in the sequence. Instead, the pre-training objective is to predict a randomly selected word in the sequence using all of the other words in the sequence. For example, given the sentence "The gray dog wagged her tail," we could try to predict the word "dog" using all of the other words in the sequence.

 

The ER in BERT stands for Encoder Representations. BERT is an encoder-only model, and its objective is to extract information from text inputs. This extracted information can then be used to accomplish different tasks, like text classification or sentiment analysis, with minor adaptations to the architecture. Unlike GPT, BERT is not an autoregressive model that generates text. This is part of the reason why the language modeling objective can attend to words in both directions: because the model is not trying to predict future words in a sequence, the entire sequence can be used to extract useful information.

 

The T in BERT also stands for Transformer, because the model uses the transformer architecture proposed in the "Attention Is All You Need" paper [1]. Although there are differences in the training approach, the architecture of BERT's encoder is not too different from the encoder of the original transformer.

 

BERT Encoding

 


Figure 8. BERT Architecture.

 

BERT's architecture also consists of multiple stacked encoders that process the embedded text to create transformed representations of the inputs. Like GPT, BERT is pre-trained on a language modeling objective and can subsequently be fine-tuned to perform different tasks. BERT is pre-trained on the BookCorpus data set and English Wikipedia with two training objectives: masked language modeling and next sentence prediction. BERT's parameters can be fine-tuned end to end by plugging task-specific inputs and targets into the model; compared to pre-training, fine-tuning is relatively inexpensive. Although encoder-only models cannot generate text by default, the encodings created by the model can be used for tasks like text classification, question answering (such as multiple-choice questions), and named entity recognition. It should also be noted that the encodings can be passed to decoder layers, which can then be used to generate sequential outputs.

 

The encoder blocks consist of a combination of multi-head attention layers and fully connected layers, connected with residual connections and layer normalization. The attention layers enable the model to learn long-range relationships between sequence elements, and the rest of the encoder block is designed to efficiently extract this information and convert it into a form that is useful for predicting the target.
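
As a rough sketch of one such block in PyTorch (a simplified version rather than the exact BERT implementation: dropout and a few other details are omitted, and the layer sizes are BERT-base-like assumptions):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Simplified transformer encoder block: self-attention and a feed-forward network,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attention(x, x, x)   # bidirectional self-attention
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # residual connection + layer norm
        return x

block = EncoderBlock()
print(block(torch.randn(2, 16, 768)).shape)     # torch.Size([2, 16, 768])
```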

 

BERT Pre-Training

 

The pre-training objective for many decoder-based models, such as GPT, is to predict the next token in a sequence, which is useful because we want those models to generate syntactically correct text. BERT's masked language modeling task is an important part of its pre-training because it enables the model to learn what words mean and how they are used in text. When a person reads or listens, they have access to the entire sequence and can use all of it to extract information and generate a response. This is similar to what BERT does during pre-training: it has access to the entire sequence and explores the interrelatedness of all of the words in it.

 


 

Figure 9. BERT Language Modeling Task.

 

To enable this pre-training, a portion of the tokens in the training data (roughly 15%) is selected for prediction, and the model is trained to recover each selected token from its context. The catch is that the model will never see the special placeholder token (the [MASK] token) in real input data, so the selected tokens are not always masked. When preparing the training data, the selected token is replaced with the [MASK] token 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time. This prepares the model to handle data that does not include the [MASK] token.
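
A small sketch of that data preparation step, assuming whitespace-tokenized text and a toy vocabulary (the real implementation works on WordPiece tokens):

```python
import random

VOCAB = ["the", "gray", "dog", "wagged", "her", "tail", "cat", "ran"]

def mask_tokens(tokens, select_prob=0.15, seed=0):
    """Select ~15% of tokens for prediction; mask 80% of them, randomize 10%, keep 10%."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)   # None = token is not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            labels[i] = tok                                # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = "[MASK]"
            elif roll < 0.9:
                inputs[i] = rng.choice(VOCAB)              # random replacement
            # else: leave the original token in place
    return inputs, labels

print(mask_tokens("the gray dog wagged her tail".split()))
```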

 

Many language tasks, such as question answering and natural language inference, are based on understanding the relationship between two sentences. These relationships are not directly captured by language modeling, nor by the decoder pre-training employed in the GPT-1 model. This is why the next sentence prediction task is an important part of pre-training the BERT model; without it, the model would not have a good understanding of how sentences fit together.

 


 

Figure 10. Next Sentence Prediction Pre-Training.

 

The model is trained to predict whether sentence A is followed in the corpus by sentence B. This is a simple binary classification task in which the model predicts whether a pair of sentences A and B should have the label [IsNext], meaning that sentence B directly follows sentence A, or the label [NotNext], meaning that sentence B is either a random sentence unrelated to sentence A or that sentences are missing between them. The data is prepared such that 50% of the training pairs have the label [IsNext] and 50% have the label [NotNext].
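
A toy sketch of how such pairs could be sampled from a list of ordered sentences (a simplified stand-in for the actual BERT data pipeline, which draws the random sentence from a different document):

```python
import random

def make_nsp_pair(sentences, seed=0):
    """Build one next-sentence-prediction example: 50% true pairs, 50% random pairs."""
    rng = random.Random(seed)
    i = rng.randrange(len(sentences) - 1)
    if rng.random() < 0.5:
        return sentences[i], sentences[i + 1], "IsNext"     # B really follows A
    return sentences[i], rng.choice(sentences), "NotNext"   # B drawn at random

corpus = ["Two households, both alike in dignity.",
          "In fair Verona, where we lay our scene.",
          "From ancient grudge break to new mutiny."]
print(make_nsp_pair(corpus))
```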

 

In this example, the first pair of sentences appears in order in Shakespeare's Romeo and Juliet, but the second pair of sentences is not contiguous in the text. Instead, those sentences are pulled randomly from the play.

 

BERT and GPT-1 Differences

 

Now that we have discussed both GPT-1 and BERT, let's come back and discuss some of the differences.

 

Although both models are based on the transformer architecture, they are fundamentally different. GPT models are decoder-only models, whereas BERT and its variations are encoder-only. GPT-1 was trained with a unidirectional language modeling objective, whereas BERT's objective is bidirectional. This is a major difference in the design of the two models: in GPT, the prediction for a given position can only depend on the tokens that came before it, while in BERT every token in the sequence attends to every other token without restriction; the modeling objective randomly masks certain words, and every other token in the sequence is used to predict the masked word. These differences make GPT better suited for tasks that require syntactically accurate text generation, including dialogue, creative writing, and more, whereas BERT is better suited for tasks like sentiment analysis, language understanding, and named entity recognition.
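
Mechanically, the difference comes down to the attention mask. A minimal illustration of the two masking patterns (this just builds the boolean masks rather than running a model):

```python
import torch

seq_len = 5

# GPT-style causal mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style bidirectional attention: every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```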

 

There are also other architectural differences, such as the positional embeddings, the number of layers, the hyperparameters, and the optimization settings. In the original BERT paper, one BERT model (BERT-base) was built using an architecture that closely matched GPT's, and another (BERT-large) was a larger version. The GPT-1 model had 117 million parameters, while BERT-large had roughly 340 million parameters.

 

BERT Variations

 

The first BERT model was published by Google in 2018 and set state-of-the-art performance on a variety of language tasks. The model was quickly integrated into Google Search to improve its search functionality. Unlike the family of GPT models, there isn't an ever-growing family of BERT models published by Google. Many of the variations of BERT have been published by other institutions, since BERT is open source. A lot of the models based on BERT have focused on ways to improve its efficiency and reduce its size so that it can be more easily fine-tuned and integrated into existing applications.

 

One such variation of BERT is RoBERTa, which was published in 2019 [4]. This model came out of a replication study of BERT's pre-training that carefully measured the impact of hyperparameters and training data size. The study found that BERT was significantly undertrained and that its performance could be improved with better hyperparameter choices. RoBERTa made few architectural changes and introduced few new techniques relative to BERT, yet when evaluated on the same data sets that BERT was, it achieved state-of-the-art results at the time.

 

Another variation is ALBERT, which was published in late 2019 and early 2020 [5]. The goal of ALBERT was to serve as a "lite" BERT: as model sizes increase, more GPU (or TPU) memory is required and training times become longer. ALBERT uses parameter-reduction techniques such as factorized embedding parameterization and cross-layer parameter sharing, along with an additional inter-sentence coherence loss. These changes allow the model to remain smaller than BERT even while using larger feed-forward components in the network.
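
To give a feel for the factorized embedding parameterization, the quick calculation below compares a direct V x H embedding matrix with a V x E embedding followed by an E x H projection. The sizes are illustrative, roughly BERT-base/ALBERT-base-like, rather than exact figures from the paper.

```python
# Illustrative parameter counts for factorized embedding parameterization.
V, H, E = 30_000, 768, 128   # vocabulary size, hidden size, small embedding size

direct     = V * H           # one big V x H embedding matrix
factorized = V * E + E * H   # V x E embedding plus an E x H projection

print(f"direct: {direct / 1e6:.1f}M parameters, factorized: {factorized / 1e6:.1f}M parameters")
```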

 

The last variation that we will discuss is DistilBERT, which was published in 2019 [6]. This model reduced the size of BERT by 40% while retaining 97% of its language understanding capabilities and running 60% faster. It achieved this by using a technique called knowledge distillation, in which a compact model (known as the student) is trained to reproduce the behavior of a larger model (the teacher). DistilBERT added a distillation loss to the training objective so that the final loss was a linear combination of the masked language modeling loss and the distillation loss. Additional architectural changes, including removing the token-type embeddings, removing the pooler, and reducing the number of layers by a factor of 2, helped the model achieve this level of performance.
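
Sketched generically, a distillation loss blends the usual hard-label loss with a soft-target loss computed against the teacher's output distribution. This is a standard temperature-scaled formulation, not DistilBERT's exact recipe; the temperature and weights are made-up values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Linear combination of a soft-target (distillation) loss and the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy usage: 4 examples, 10-class output.
student, teacher = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```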

 

Other notable work on encoder-based transformers exists; however, it is beyond the scope of this post.

 

GPT Timeline

 

The GPT family of models is published by OpenAI in partnership with Microsoft. The architectures of GPT-1 and GPT-2 are very similar overall. The major differences between the two are that GPT-2 was trained on a much larger volume of data and has substantially more learnable parameters. GPT-1 was pre-trained only on the BookCorpus dataset, whereas GPT-2 was trained on a much larger and more diverse dataset containing text scraped from over 45 million links, amounting to approximately 10 billion tokens after cleaning. GPT-2 also has a larger vocabulary, a larger context size so that it can process even more information, and more layers. Work on GPT-2 showed a lot of promise in zero-shot learning, which is applying the model to a task on which it wasn't trained. The increase in model size showed that the model performed well on several tasks on which it wasn't explicitly trained, such as question answering. The massive increase in performance and state-of-the-art results on multiple tasks, along with other external publications, led OpenAI researchers to continue building larger models with even more diverse datasets.

 

GPT-3 was truly the game changer for large language models. It contains approximately 175 billion parameters and was trained on roughly 300 billion tokens. The GPT-3 and GPT-4 families of models are closed source (now common practice among most major LLM developers), so most of the known information about these models comes directly from OpenAI, although in some cases model details can be estimated. It is estimated that GPT-4 contains a total of roughly 1.7 trillion weights, and, as of early 2024, the size of its training dataset is unknown. The sheer size of these models gives them extremely impressive performance, even on tasks for which they were not trained. With clever prompting alone, they can outperform the best encoder-based models on tasks in which encoders are meant to shine, simply because of the complexity and size of the language model.

 

Conclusion

 

The introduction of the original transformer in 2017 feels like ages ago given the rapid pace of progress in the field. Nowadays, new transformer-based models, applications, and products are released every day. Although there have been many new techniques and ways to improve LLMs, understanding the original transformer and the concepts behind it is still valuable, since it remains the backbone of modern LLMs. Hopefully this series of posts has broadened your knowledge of how transformer-based models work, how they are trained, and how they are developed.

 

References

 

    1. Attention is All You Need
    2. Improving Language Understanding by Generative Pre-Training
    3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    4. RoBERTa: A Robustly Optimized BERT Pre-Training Approach 
    5. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
    6. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
    7. Neural Machine Translation with a Transformer and Keras
    8. Introduction to Attention, Transformers, and Large Language Models: Part 1
    9. Introduction to Attention, Transformers, and Large Language Models: Part 2

 

Find more articles from SAS Global Enablement and Learning here.
