A Human Generated Introduction to Generative AI, Part 1: Synthetic Data Generation
The purpose of this two-part series is to present a high-level introduction to two specific applications of Generative AI. Yes, I am a human. And yes, as the title of the post indicates, I’m writing this myself. So that raises the question: if this is a post on generative AI, why not use generative AI to write it? The simple answer is that Generative AI is a relatively new topic for me (a human!), and I am in the early stages of learning it. The more times I work through the material, the better I will learn it, and if I can summarize Generative AI for you in my own words, it will help me understand the topic at a deeper level. So, this introduction comes with a human touch.
Two commonly used areas of Generative Artificial Intelligence, simply referred to as GenAI, are generating new data as output and generating text as output. I’ll soon be delivering a short workshop on both of these areas of GenAI, live at SAS Innovate in Orlando, FL. But more on that later. As I started to write a single post about both areas of application, it got lengthy, and I don’t want to lose my readers with one long, drawn-out post. So, I’m going to give you the content in easier-to-consume bites. The first application of GenAI, which I’ll introduce in this post, is called Synthetic Data Generation. The second application, which I’ll introduce in the second post in the series, is based on Large Language Models. Before I get into synthetic data generation, allow me to cover some background and definitions.
Let’s build up to the full definition of Generative Artificial Intelligence. First, let’s be sure we understand the “AI” part of GenAI. Artificial intelligence is the science of designing ethical and transparent systems to support and accelerate human decisions and actions. So, in a very simple sense, GenAI is using AI technology to generate things. “Things” could be tabular data, as is the case for synthetic data generation. “Things” could also be text or language, as is the case for large language models. But “things” could also be music, pictures, videos, or audio. All of which, of course, are different types of data. Let’s think about how generating things (i.e., various data) and AI go together. GenAI technology works by learning from massive amounts of real-world data taken from the internet, books, and other sources. The data can be in the form of text, books, images, videos, audio, etc. A human then types or asks a question, which may be referred to as a prompt. GenAI then creates new output based on the human request. And just like the data it learns from, GenAI can create output in multiple forms, as stated above: text, tabular data, images, video, audio, etc. This makes GenAI valuable for many industries, including writing, art, software development, product design, health care, marketing, and finance, just to name a few. Thus, GenAI is transforming the interaction between human intelligence and artificial intelligence.
Synthetic Data Generation:
When we talk about generating synthetic data, let’s not beat around the bush about what is happening: GenAI is creating fake data; that is, artificial data that is meant to be similar to real-world data. Why would we want to generate fake data? For one, real data is expensive, and it can take time to collect all the data our business needs require. There could be data quality concerns or a lack of relevant data, such as with a rare target. Privacy could also be a concern. What are the benefits of generated data? It can preserve the characteristics, in other words the statistical properties, of the real data, so that in theory there are no discernible differences between the generated data and the real data. It can be cost effective compared to collecting real data, and it can meet conditions lacking in real data. Synthetic data can also help protect privacy. The data is fake, after all.
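To make “preserving the statistical properties” concrete, here is a minimal sketch in Python (the file names are hypothetical placeholders, not anything from SAS) of how you might sanity-check a synthetic table against its real counterpart:

```python
# A minimal sketch (file names are hypothetical) of checking whether a
# synthetic table preserves the statistical properties of the real one.
import pandas as pd

real = pd.read_csv("real_customers.csv")            # the original data
synthetic = pd.read_csv("synthetic_customers.csv")  # the generated data

# If the generation worked, these summaries should be very close.
print(real.describe())
print(synthetic.describe())

# The correlation structure should also be preserved; this prints the
# largest absolute difference between the two correlation matrices.
diff = (real.corr(numeric_only=True) - synthetic.corr(numeric_only=True)).abs()
print(diff.max().max())
```

If the generator did its job, the summary statistics and the correlation matrices should be nearly indistinguishable.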
Synthetic data generation is an area of GenAI that is based on a model called a Generative Adversarial Network, or GAN. To understand GAN models, let’s break down each term. “Generative” means we’re using a model that can generate new (fake) data. “Adversarial” means that there are two models that compete against each other. “Networks” means that deep neural networks are used as the underlying models to process data.
A few applications of GAN models are Data Augmentation, Style Transfer, and Photograph Inpainting. In data augmentation, real data is used to generate synthetic data samples. The synthetic samples are typically then combined with the original, real data to create an augmented data set. In style transfer, the content of one image and the style of another can be combined to create a new image. So, suppose I have a picture of my family taken in the mountains at summertime, but I want to use that picture on a Christmas card. Style transfer can take that family photo and make it appear as if it were taken in winter in the mountains. For photograph inpainting, GenAI is used to remove unwanted objects or fix imperfections in photographs or pictures. If I have a photo of my son playing tennis, but he is blurry in the photo, photograph inpainting can reconstruct that blurry region of the image. For the current discussion, I’ll focus on data augmentation as that is the type of synthetic data generation most likely used in a business context.
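Since data augmentation is the focus here, it’s worth seeing how simple the “combine synthetic with real” step is in practice. A minimal Python sketch, with hypothetical file names:

```python
# A minimal sketch of data augmentation: stack generated rows under real
# rows to form the augmented training set. File names are hypothetical.
import pandas as pd

real = pd.read_csv("real_transactions.csv")            # original data
synthetic = pd.read_csv("synthetic_transactions.csv")  # GAN output

# Tag each row's origin so downstream steps can tell them apart if needed.
real["source"] = "real"
synthetic["source"] = "synthetic"

# The augmented data set: real rows plus synthetic rows.
augmented = pd.concat([real, synthetic], ignore_index=True)
print(augmented["source"].value_counts())
```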
The key to understanding GANs at an introductory level is that there are two models involved, and that they are competing. Essentially, one model is trying to trick the other. One model is known as the Generator and the other is the Discriminator. The generator creates the fake samples. Its goal is to maximize the likelihood that the discriminator misclassifies the generator’s fake output as real.
The discriminator is trying to catch the generator making fake data. Its goal is to accurately distinguish between real and fake data. The discriminator receives both generated and real samples and, for each one, predicts the probability that it is real.
So, in the full network, both models work in tandem. In a cops-and-robbers sort of analogy, the generator is the bad guy (trying to get away with something) and the discriminator is the good guy (trying to catch the generator in the act).
The generator and the discriminator are adversarial, but they also communicate with each other during the training process. When the discriminator detects fake samples, the generator learns from this and tries to make samples that look more like the real data. In deep neural network terminology, the generator updates its weights through backpropagation. As the generator makes better samples, the discriminator works harder to catch the generated fake data, learning from the mistakes it makes. The weights of the discriminator are also updated using backpropagation as the system works.
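To make the adversarial training loop concrete, here is a minimal, self-contained sketch in PyTorch. It is a toy illustration of the tug-of-war described above, not the implementation SAS uses; the “real” data here is just a shifted normal distribution:

```python
# A toy GAN in PyTorch, mirroring the description above: the generator tries
# to fool the discriminator, the discriminator tries to separate real from
# fake, and both update their weights via backpropagation.
import torch
import torch.nn as nn

latent_dim, data_dim, batch_size = 8, 2, 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, data_dim),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(),
    nn.Linear(32, 1),  # outputs a logit: higher means "looks real"
)

bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

for step in range(2000):
    real = torch.randn(batch_size, data_dim) * 2.0 + 3.0  # stand-in "real" data
    fake = generator(torch.randn(batch_size, latent_dim))

    # Discriminator update: label real samples 1 and fake samples 0.
    d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
    d_opt.zero_grad()
    d_loss.backward()   # backpropagation through the discriminator
    d_opt.step()

    # Generator update: try to make the discriminator output "real" (1).
    g_loss = bce(discriminator(fake), ones)
    g_opt.zero_grad()
    g_loss.backward()   # backpropagation through the generator
    g_opt.step()
```

Notice the two backpropagation passes in each step: one updates the discriminator to better separate real from fake, and the other updates the generator to fool the freshly updated discriminator.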
The advantages of GANs are that they can be used to generate supervised (data with a target) as well as unsupervised (data without a target) data, making them useful in a variety of business scenarios. They generate data that are similar to the real data by learning the intricacies of the real data. GANs can learn complicated distributions and even create realistic, messy data if needed. Given enough computer resources, they generate new synthetic samples quickly.
There are also some challenges with GANs. They can suffer from a scenario known as Mode Collapse, where the good guy is too weak to do his job and is easily tricked by the bad guy. This results when the discriminator is not powerful enough to distinguish real from fake data, which can happen when the generator finds a way to easily trick the discriminator with a small set of near-homogeneous instances. GANs may also suffer from vanishing gradients during the learning process. During backpropagation, the gradient flows in a backwards direction, from the final layer of the deep neural network to the first layer. As it flows backwards, the gradient can get increasingly smaller. Sometimes the gradient decreases to the point that the weight values of the initial layers barely change during learning, which essentially stops the training in these first layers. In the advantages listed earlier, I said that GANs can generate new synthetic samples quickly “given enough computer resources”. So, to be efficient, they do require a lot of computational power; typically, a computer system with a GPU (graphics processing unit) is needed. To be fair, more computational resources are needed while training the GAN than when generating new samples once the model exists (scoring). We’ve also known for decades that garbage in is garbage out. (I learned the GI-GO principle in my first computer science class in the early ‘90s…but I’m dating myself.) Bias in the training data can lead to biased synthetic data.
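If you’d like to see vanishing gradients for yourself, the following small PyTorch experiment (a deliberately deep stack of sigmoid layers, my own toy example rather than anything GAN-specific) prints the gradient norm of each layer; the early layers typically receive gradients that are orders of magnitude smaller:

```python
# A toy demonstration of vanishing gradients: in a deliberately deep stack
# of sigmoid layers, the gradient norm shrinks as it flows backward from
# the last layer toward the first.
import torch
import torch.nn as nn

layers = []
for _ in range(20):
    layers += [nn.Linear(16, 16), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(32, 16)
net(x).pow(2).mean().backward()  # any scalar loss will do

# Early layers (small index) typically show far smaller gradient norms.
for i, layer in enumerate(net):
    if isinstance(layer, nn.Linear):
        print(f"layer {i:2d}  grad norm = {layer.weight.grad.norm().item():.2e}")
```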
Generating synthetic data in SAS:
How can one generate synthetic data using SAS? Well, I don’t mean to disappoint you, but I will not cover the details of generating synthetic data with SAS in this post. Keep in mind that the purpose of this series is to introduce GenAI applications at a high level. That said, SAS Viya software does currently include two types of GAN models. One is called the Correlation Preserving Conditional Tabular GAN (CPCTGAN) model, and the other is a styleGAN model. The details of these models are beyond the scope of the current post, but they may well be something I post about in the future. So, stay tuned! Both models are implemented in SAS Viya using the “generativeAdversarialNet” CAS action set. Specifically, the CPCTGAN model uses the “tabularGanTrain” action and the styleGAN model uses the “styleGANTrain” action. Feel free to take a look at the product documentation for more information.
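For the curious, here is a rough sketch of what invoking the tabular GAN action from Python might look like via the SWAT package. This is a hedged illustration, not verified syntax: the connection details are placeholders, and the tabularGanTrain parameter names are my assumptions; the product documentation has the authoritative signature.

```python
# A rough sketch of calling the tabular GAN action from Python via SWAT.
# The connection details are placeholders, and the tabularGanTrain parameter
# names below are assumptions for illustration only; consult the SAS Viya
# documentation for the authoritative signature.
import pandas as pd
import swat

conn = swat.CAS("your-cas-host.example.com", 5570)  # placeholder host/port
conn.loadactionset("generativeAdversarialNet")      # load the GAN action set

# Upload a (tiny, made-up) real table to the CAS server.
df = pd.DataFrame({"age": [34, 51, 29], "income": [52000, 87000, 41000]})
conn.upload_frame(df, casout={"name": "real_data", "replace": True})

# Illustrative call; parameter names are assumed, not verified.
result = conn.tabularGanTrain(
    table="real_data",                                   # real training data
    numSamples=1000,                                     # rows to generate (assumed)
    casOut={"name": "synthetic_data", "replace": True},  # output table (assumed)
)
print(result)
```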
Where to learn more:
As I mentioned at the start of the post, I’ll be covering both synthetic data generation and large language models in a pre-conference tutorial at SAS Innovate in Orlando, FL on May 6. The name of the tutorial is Getting Started with Generative AI. This 2-hour-and-15-minute presentation will introduce both of these areas of GenAI at a deeper level than I can go into in these posts. In addition to more detail on the topics, I’ll give demonstrations of using both GANs and Large Language Models in SAS. If you are not yet registered for SAS Innovate, you can do so here.
If you want background on deep neural networks and AI technology, consider taking the self-paced or instructor-led course Deep Learning Using SAS Software.
A great resource on GenAI, and the source of much of the content for this post, is the FREE self-paced e-learning course Generative AI Using SAS.
Finally, if you want a slightly different perspective on this same topic, check out this post by my esteemed colleague Manoj Singh.
Keep an eye out for post #2 in this series, which will introduce Large Language Models and how they are used to generate text output. If all goes well, that post will be published sometime in May.
Find more articles from SAS Global Enablement and Learning here.