There are plenty of reasons to want to synthesize data, from preserving individual privacy, to handling wildly unbalanced classes, to running complex simulations. SAS Data Maker offers a no-code way to do this quickly and easily for tabular data. It really is as simple as feeding in a data set and clicking a few buttons to generate realistic data.
There are advantages to using Data Maker over, for example, simulating multivariate normal data from a mean and covariance matrix, because Data Maker mimics the distributions and types of relationships among variables from the original data using a variety of methods. In addition, Data Maker gives you a panel of assessment plots to determine whether your model has been faithful to the original data patterns while still maintaining individual privacy.
There have been other posts written about SAS Data Maker, and I won’t try to reproduce those here when others have done excellent work already. I want to focus on one little situation that surprised me, but makes perfect sense once I thought carefully about it. Maybe it will save you a minute or two of spinning your wheels.
One of the greatest joys of being a SAS employee is kicking the tires on new software features before anyone else gets to see them. When I started playing with Data Maker last fall, I giddily synthesized all kinds of data, sometimes for no reason at all, and other times for really good reasons. Couldn’t get enough.
One afternoon, I needed to boost a rare event rate for a binary target variable. I had 40,000 training records, but only 500 positive cases (where target event = 1). SMOTE (Synthetic Minority Oversampling Technique) is one of the methods in SAS Data Maker to model and synthesize data. SMOTE is designed for oversampling from a minority class (like a rare target) using a k-nearest neighbor approach. I thought, “Excellent! I will just load up this 40,000 observation data set and get me some of that sweet, sweet balanced data.”
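To make the k-nearest neighbor idea concrete, here is a minimal sketch of classic SMOTE in Python with NumPy. This is not SAS Data Maker's implementation, just the textbook technique: for each synthetic row, pick a minority-class point, pick one of its k nearest minority-class neighbors, and interpolate a random fraction of the way between them. The function name and the toy data are my own.

```python
import numpy as np

def smote_sample(X, n_new, k=5, seed=None):
    """Generate n_new synthetic rows from minority-class matrix X
    using classic SMOTE: interpolate between each sampled point
    and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors per row
    base = rng.integers(0, n, size=n_new)  # pick base points at random
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))           # interpolation fractions in [0, 1)
    return X[base] + gap * (X[neigh] - X[base])

# e.g., 500 rare-event rows -> 5,000 synthetic rows
events = np.random.default_rng(1).normal(size=(500, 4))
synth = smote_sample(events, 5000, k=5, seed=2)
```

Note that every synthetic point lies on a line segment between two real minority points, which is why SMOTE only ever needs the minority class as input.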
I looked at the assessment measures.
The Histogram Similarity plots show source data in blue and synthetic data in pink, so ideally you want to see a lot of purple. That looks good: the synthetic data captures the original patterns. But what about balancing my target? How well did it oversample the event level? Ideally, I want the synthetic data to have a lot more event cases than the source data.
That’s not what I expected. I thought that it would give me something closer to balanced synthetic data. This is just as imbalanced as the original sample. And then I realized: I never told the software what my target variable was. How can it do synthetic minority oversampling from a rare target, if it doesn’t even know what the target is? What am I missing?
And then it hit me. Or, rather, my friend Dan Obermiller hit me (not literally) with what I overlooked. Don’t put 40,000 training cases in if you want to oversample from only the 500 event cases. Just put in the event cases.
This rubs my statistical brain the wrong way. You want me to train a model on ONLY the target event level? Everyone knows that you need events and non-events to train a predictive model, and compare what makes event cases different from non-event cases. What kind of madness is this?
The madness, my friends, is that we AREN’T training a predictive model. We are mimicking a population. The population we are mimicking is the event level. The fraudsters, the responders, the ultra-talented athletes, the patients with extremely rare side-effects. We don’t really much care what the large and plentiful population looks like if we are trying to mimic the rare one, even if there is overlap. In fact, synthesizing data in a way that enhances or exaggerates differences between events and non-events would result in a biased predictive model. The model should mimic the rare events, regardless of what other levels look like.
So, back into the 40k data set I went. A quick data step subset it down to the 500 event cases. There’s one more notable benefit to using only the rare cases: data upload, analysis, and model fitting are laughably fast on 500 records. Data Maker then ran SMOTE on just the event cases, and I generated 5,000 synthetic cases. Download the generated data and I’m ready to go!
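If you prefer to do the subsetting outside SAS, the same step in Python with pandas is just a filter. The column name `target` and the toy data below are hypothetical stand-ins for your own training table:

```python
import pandas as pd

# Hypothetical 40,000-row training table with a binary target column
train = pd.DataFrame({
    "target": [1] * 500 + [0] * 39500,
    "x1": range(40000),
})

# Keep only the rare event cases before feeding them to the synthesizer
events = train[train["target"] == 1]
print(len(events))  # 500
```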
Now that I’m ready to develop a predictive model, I can use all 40,000 training cases plus the newly generated 5,000 rare cases. It’s still an unbalanced data set, but I’m much happier to train ML models on 5,500 events than on 500, making the effective sample size closer to 10,000 instead of 1,000.
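The "effective sample size" figure comes from a common rule of thumb in rare-event modeling (not an exact formula): a heavily imbalanced data set carries roughly as much information as a balanced set of about twice its minority count. A quick back-of-the-envelope check:

```python
def effective_n(events, nonevents):
    # Rough heuristic: an imbalanced sample is about as informative
    # as a balanced sample of twice the minority-class count.
    return 2 * min(events, nonevents)

print(effective_n(500, 39500))   # 1000  (before augmentation)
print(effective_n(5500, 39500))  # 11000 (after adding 5,000 synthetic events)
```

Under that heuristic, augmenting from 500 to 5,500 events moves the effective sample size from about 1,000 to roughly 10,000, which is the claim above.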
Want to see Data Maker in action? Learn more in the Data Maker lesson of the free course, Generative AI Using SAS. We’ve also got plenty of information for you here. But my limited-time-only advice? Meet me in Texas next month! I’ll be teaching a hands-on workshop at SAS Innovate in Grapevine, TX on April 29.
Ready to get Data Maker for yourself and start synthesizing? It’s available on the Azure Marketplace. See this video to learn how to get started.
Find more articles from SAS Global Enablement and Learning here.