Companies and organizations face a growing demand for large, high-quality training data sets for prediction and decision-making. At the same time, data protection and confidentiality requirements limit the use and disclosure of real customer data. Synthetic data offers a solution by preserving the essential statistical properties of the original data without directly disclosing personal information. This enables both a more comprehensive data basis for model training and data-protection-compliant disclosure for testing and cooperation purposes.
This article illustrates the creation of synthetic data with three SAS procedures: SIMNORMAL, SMOTE, and TABULARGAN.
The conclusion section summarizes the advantages of these methods and gives an outlook on other methods in SAS, for example SAS Data Maker.
For our analysis we are going to use the sashelp.cars data.
title Original Cars Data;
proc print data=sashelp.cars(obs=10);
var invoice horsepower mpg_city mpg_highway weight;
run;
Obs    Invoice    Horsepower    MPG_City    MPG_Highway    Weight
  1    $33,337           265          17             23      4451
  2    $21,761           200          24             31      2778
  3    $24,647           200          22             29      3230
  4    $30,299           270          20             28      3575
  5    $39,014           225          18             24      3880
  6    $41,100           225          18             24      3893
  7    $79,978           290          17             24      3153
  8    $23,508           170          22             31      3252
  9    $32,506           170          23             30      3638
 10    $28,846           220          20             28      3462
The SIMNORMAL procedure is part of SAS/STAT and SAS Visual Statistics and lets you create synthetic data based on a variance/covariance matrix. Note that this method requires you to calculate variances and covariances, so its application is limited to interval variables.
In the first step you use PROC CORR to create the variance/covariance matrix and store it in dataset WORK.CARS_COV.
/*********************************************************************************************************************
*** Analyze Relationships and Create Data Generator
*********************************************************************************************************************/
proc corr data=sashelp.cars out=work.cars_cov cov noprint nocorr;
var invoice horsepower mpg_city mpg_highway weight;
run;
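Before passing the covariance data set to the generator, it can help to look at what PROC CORR actually stored. The following is a minimal sketch, not part of the original workflow. With the COV and NOCORR options, the data set should contain one row per analysis variable with _TYPE_='COV', plus rows for MEAN, STD and N, which PROC SIMNORMAL reads as the parameters of the multivariate normal distribution.

```sas
/* Inspect the TYPE=COV data set created by PROC CORR above. */
/* _TYPE_ identifies the kind of statistic in each row,      */
/* _NAME_ identifies the variable for the covariance rows.   */
title "Covariance Data Set Created by PROC CORR";
proc print data=work.cars_cov noobs;
run;
title;
```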
In the next step you use the SIMNORMAL procedure to generate synthetic data.
/*********************************************************************************************************************
*** Use Generator to Create Synthetic Data
*********************************************************************************************************************/
proc simnormal data=work.cars_cov(type=cov)
out = work.CarsSynth_Simnormal
numreal= 500
seed = 1;
var invoice horsepower mpg_city mpg_highway weight;
run;
In the next steps we print and plot the original and synthetic data to be able to review their similarity visually.
title Original Cars Data;
proc print data=sashelp.cars(obs=10);
var invoice horsepower mpg_city mpg_highway weight;
run;
title Synthetic Cars Data with PROC SIMNORMAL;
proc print data=work.CarsSynth_Simnormal(obs=10);
format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
var invoice horsepower mpg_city mpg_highway weight;
run;
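Printing the first ten rows only gives a rough impression. As a complementary check, a sketch like the following compares basic summary statistics of the original and the synthetic data with PROC MEANS. The choice of statistics is just one reasonable option and is not taken from the original example.

```sas
/* Summary statistics of the original data */
title "Original Cars Data: Summary Statistics";
proc means data=sashelp.cars n mean std min max maxdec=1;
   var invoice horsepower mpg_city mpg_highway weight;
run;

/* Same statistics for the data generated by PROC SIMNORMAL */
title "Synthetic Cars Data (PROC SIMNORMAL): Summary Statistics";
proc means data=work.CarsSynth_Simnormal n mean std min max maxdec=1;
   var invoice horsepower mpg_city mpg_highway weight;
run;
title;
```

Means and standard deviations should be close by construction, while the minimum and maximum values show whether the simulated values stay within a plausible range.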
At first sight, the new observations match the ranges of the original data to some extent.
However, it is worth reviewing the scatter plot of two selected variables, HORSEPOWER and MPG_HIGHWAY (using PROC SGPLOT and the SCATTER statement):
title Original Cars Data;
proc sgplot data=sashelp.cars;
scatter x=horsepower y=mpg_highway;
run;
title;
title Synthetic Cars Data created with PROC SIMNORMAL;
proc sgplot data=work.CarsSynth_Simnormal;
scatter x=horsepower y=mpg_highway;
run;
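The two separate plots can also be combined into one. The sketch below stacks both data sets, adds an indicator variable, and overlays the points in a single PROC SGPLOT call. The names WORK.CARS_COMPARE and SOURCE are chosen for this illustration only and do not appear in the original example.

```sas
/* Stack original and synthetic data and flag where each row comes from */
data work.cars_compare;
   length source $9;
   set sashelp.cars(in=inOrig keep=horsepower mpg_highway)
       work.CarsSynth_Simnormal(keep=horsepower mpg_highway);
   source = ifc(inOrig, "Original", "Synthetic");
run;

/* Overlay both point clouds, colored by their source */
title "Original vs. Synthetic Cars Data (PROC SIMNORMAL)";
proc sgplot data=work.cars_compare;
   scatter x=horsepower y=mpg_highway / group=source transparency=0.5;
run;
title;
```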
You can see the following:
You can summarize the features of this method:
However,
Note that you can also apply this method using the RANDNORMAL function in SAS/IML, which Rick Wicklin @Rick_SAS illustrates in tip 5 of his SAS Global Forum 2015 paper "Ten Tips for Simulating Data with SAS".
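As a rough illustration of the SAS/IML route, the following sketch estimates the mean vector and covariance matrix from the five variables and draws 500 observations with RANDNORMAL. It is not the code from the paper, it requires a SAS/IML license, and the output data set name WORK.CARSSYNTH_IML is a placeholder chosen for this example.

```sas
proc iml;
   varNames = {"Invoice" "Horsepower" "MPG_City" "MPG_Highway" "Weight"};

   /* Read the five interval variables into a matrix */
   use sashelp.cars;
   read all var varNames into X;
   close sashelp.cars;

   mu    = mean(X);   /* row vector of column means */
   Sigma = cov(X);    /* sample covariance matrix   */

   /* Draw 500 observations from the multivariate normal distribution */
   call randseed(1);
   Y = randnormal(500, mu, Sigma);

   /* Write the simulated matrix to a data set (hypothetical name) */
   create work.CarsSynth_IML from Y[colname=varNames];
   append from Y;
   close work.CarsSynth_IML;
quit;
```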
In his DO Loop blog article "The SMOTE method for generating synthetic data", Rick Wicklin @Rick_SAS explains the idea and method of the SMOTE algorithm very nicely.
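The core idea of SMOTE is that a new observation is placed at a random position on the line segment between an existing observation and one of its nearest neighbors. The DATA step below is a deliberately simplified sketch of this interpolation step only. The two points are made-up values, and the nearest-neighbor search that the real algorithm performs is omitted.

```sas
data work.smote_idea;
   call streaminit(1);

   /* Hypothetical observation and one of its nearest neighbors */
   x_hp = 200;  x_mpg = 29;
   n_hp = 225;  n_mpg = 24;

   /* Each synthetic point lies on the segment between the two points */
   do i = 1 to 5;
      u = rand("uniform");                    /* random position in (0,1) */
      synth_hp  = x_hp  + u*(n_hp  - x_hp);
      synth_mpg = x_mpg + u*(n_mpg - x_mpg);
      output;
   end;

   keep i u synth_hp synth_mpg;
run;
```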
In the machine learning offering of SAS you find the SMOTE procedure as well as the SMOTE action set to generate data using the SMOTE algorithm. You need a Viya: Machine Learning license to use this procedure or action set. This example uses the SMOTE procedure. Refer to the documentation for an example that implements the SMOTE action set.
If you have not started a CAS session yet, start it using the CAS statement.
CAS cas1;
data casuser.cars;
set sashelp.cars;
run;
proc smote data = casuser.cars seed=1;
input type drivetrain/level = nominal;
input invoice horsepower mpg_city mpg_highway weight;
output out=casuser.CarsSynth_Smote;
sample numsamples=500
EXTRAPOLATIONFACTOR = 0.1
K = 7;
run;
In the next steps we again print and plot the original and synthetic data to be able to review their similarity visually.
title sashelp.cars Data;
proc print data=sashelp.cars(obs=10);
title Original Cars Data;
var invoice horsepower mpg_city mpg_highway weight;
run;
title Synthetic Cars Data with PROC SMOTE;
proc print data=casuser.CarsSynth_Smote(obs=10);
format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
var invoice horsepower mpg_city mpg_highway weight;
run;
proc sgplot data=casuser.cars;
title Original Cars Data;
scatter x=horsepower y=mpg_highway;
run;
proc sgplot data=casuser.CarsSynth_Smote;
title Synthetic Cars Data created with PROC SMOTE;
scatter x=horsepower y=mpg_highway;
run;
The SMOTE method can also handle categorical variables, so it makes sense to compare the distribution of categorical variables in the original and in the synthetic data as well.
You can use the FREQ procedure to show the bivariate distribution of TYPE and DRIVETRAIN. In this example only the mosaic plot for the two variables is shown. You can of course also display the frequency table itself or show other charts.
ods noproctitle;
proc freq data= casuser.cars ;
ods select mosaicplot;
title Original Cars Data;
table type * drivetrain / plots=(mosaicplot);
run;
ods noproctitle;
proc freq data= casuser.CarsSynth_Smote;
ods select mosaicplot;
title Synthetic Cars Data created with PROC SMOTE;
table type * drivetrain / plots=(mosaicplot);
run;
In the code
In the results you can see that the bivariate distributions of DRIVETRAIN and TYPE in the original and in the synthetic data overlap closely. The marginal distribution of DRIVETRAIN is the same in the synthetic data, as are the cell frequencies of the categories of the two variables. When you exchange the order of the variables in the TABLE statement, you will see the vertical columns for TYPE and DRIVETRAIN in the cells.
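The visual impression from the mosaic plots can be backed up with numbers. The following sketch stacks the original and the synthetic table and tabulates row percentages per source. It assumes that the SMOTE output table contains the TYPE and DRIVETRAIN columns. WORK.CARS_BOTH and SOURCE are names introduced for this illustration only.

```sas
/* Stack both tables and flag the origin of each row */
data work.cars_both;
   length source $9;
   set casuser.cars(in=inOrig keep=type drivetrain)
       casuser.CarsSynth_Smote(keep=type drivetrain);
   source = ifc(inOrig, "Original", "Synthetic");
run;

/* Row percentages show the category shares within each source */
title "Category Shares: Original vs. Synthetic (PROC SMOTE)";
proc freq data=work.cars_both;
   tables source*drivetrain source*type / nocol nopercent;
run;
title;
```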
You can summarize the features of this method:
However
In his article "Implement a SMOTE simulation algorithm in SAS", Rick Wicklin @Rick_SAS shows how you can implement the SMOTE simulation in SAS/IML.
The final example looks at generative adversarial networks for the generation of synthetic data. In the SAS machine learning offering you find the TABULARGAN procedure. This procedure trains a correlation-preserving conditional tabular generative adversarial network (CPCTGAN) model on tabular data. It implements the CPCTGAN model by using the PyTorch library.
The procedure trains a model that you can use to generate synthetic data in another environment where you do not have your original data present. David Weik illustrates and describes very nicely the separation of the training of the model and the application of the model in different environments in his article "Using synthetic data to bridge production and development".
In the diagram below the arrow between "Production" and "Development" environment illustrates the move of the model logic into another environment.
Note that generative adversarial networks are very compute-intensive and require a large number of training iterations and considerable computation time. In most cases GPUs are needed to perform these computations in reasonable time. Therefore it might be hard or impossible to create a good model in your local installation.
This section does not dive deep into tuning the model to obtain a good generator. It focuses on an example where you train a basic model, output this model, and then apply it in the same or in a separate environment.
As in the SMOTE example above, you need to have a CAS session running and you need to have your original data in a CAS library. You do this in the same way as shown above for PROC SMOTE.
Next you call the TABULARGAN procedure. This example shows how you can
proc tabularGAN data = casuser.cars seed = 42 numSamples = 500;
input type drivetrain/level = nominal;
input invoice horsepower mpg_city mpg_highway weight;
gmm alpha = 1 maxClusters = 10 seed = 42 VB(maxVbIter = 3);
aeOptimization ADAM LearningRate = 0.0001 numEpochs = 3;
ganOptimization ADAM(beta1 = 0.55 beta2 = 0.95) numEpochs = 5;
train embeddingDim = 32 miniBatchSize = 300 useOrigLevelFreq;
saveState rStore = casuser.ASTORE_CarsSynth_TabGAN;
output out = casuser.CarsSynth_TabGAN;
run; quit;
You could run the same code as for the SMOTE procedure above to validate the synthetic data and compare it with the original data. The above code, however, does not train the model long enough to obtain a representative generator for the data. The validation code is shown here only for completeness.
title sashelp.cars Data;
proc print data=sashelp.cars(obs=10);
title Original Cars Data;
var invoice horsepower mpg_city mpg_highway weight;
run;
title Synthetic Cars Data with PROC TABULARGAN;
proc print data=casuser.CarsSynth_TabGAN(obs=10);
format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
var invoice horsepower mpg_city mpg_highway weight;
run;
proc sgplot data=casuser.cars;
title Original Cars Data;
scatter x=horsepower y=mpg_highway;
run;
proc sgplot data=casuser.CarsSynth_TabGAN;
title Synthetic Cars Data created with PROC TABULARGAN;
scatter x=horsepower y=mpg_highway;
run;
ods noproctitle;
proc freq data= casuser.cars ;
ods select mosaicplot;
title Original Cars Data;
table type * drivetrain / plots=(mosaicplot);
run;
ods noproctitle;
proc freq data= casuser.CarsSynth_TabGAN;
ods select mosaicplot;
title Synthetic Cars Data created with PROC TABULARGAN;
table type * drivetrain / plots=(mosaicplot);
run;
To create synthetic data from the model stored in the ASTORE casuser.ASTORE_CarsSynth_TabGAN, you can use the following code.
In the first step you create a data set with the required number of observations (800 in our example).
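/* Base table with one row per synthetic observation to generate.  */
/* It is created in a CAS library because PROC ASTORE scores CAS   */
/* tables, not data sets in the WORK library.                      */
data casuser.Cars800_Base;
do id = 1 to 800;
output;
end;
run;
/* Apply the stored model to generate one synthetic row per base row */
proc astore;
score data = casuser.Cars800_Base
rstore = casuser.ASTORE_CarsSynth_TabGAN
out = casuser.CarsSynthData_TabGAN800
copyVars = (_all_)
;
run;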
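To apply the model in a different environment, as described at the beginning of this section, the analytic store has to be moved there first. A minimal sketch of this step with the DESCRIBE, DOWNLOAD and UPLOAD statements of PROC ASTORE could look as follows. The file path is a hypothetical placeholder, and how the file is transferred between the two environments depends on your setup.

```sas
/* Source environment: inspect the model and save it to a file */
proc astore;
   /* List the input and output variables stored with the model */
   describe rstore = casuser.ASTORE_CarsSynth_TabGAN;
   /* Hypothetical path; adjust to a location you can write to */
   download rstore = casuser.ASTORE_CarsSynth_TabGAN
            store  = "/tmp/CarsSynth_TabGAN.sasast";
run;

/* Target environment: load the transferred file into CAS again */
proc astore;
   upload rstore = casuser.ASTORE_CarsSynth_TabGAN
          store  = "/tmp/CarsSynth_TabGAN.sasast";
run;
```

After the upload, the scoring step shown above can be run in the target environment without access to the original data.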
With SAS Data Maker, SAS also provides a comprehensive synthetic data solution. It is a low-code/no-code tool for generating high-quality synthetic data that mirrors real-world data sets. It lets you augment existing data or create entirely new data sets, reducing the cost of data acquisition, protecting sensitive information and accelerating AI and analytics development. SAS Data Maker can also synthesize time series data and data from a relational schema of tables.
Find more examples in my webinars on YouTube, in my SAS Press books, and on Medium | LinkedIn | GitHub | SAS-Books.