
Creating Synthetic Data using PROC SMOTE, PROC SIMNORMAL and PROC TABULARGAN


Introduction

 

Companies and organizations face a growing demand for large, high-quality training data sets for prediction and decision-making. At the same time, data protection and confidentiality requirements limit the use and disclosure of real customer data. Synthetic data offers a solution by preserving the essential statistical properties of the original data without directly disclosing personal information. This enables both more comprehensive data sets for model training and data-protection-compliant disclosure for testing and cooperation purposes.

 

This article illustrates the creation of synthetic data with three SAS Procedures:

  1. In the first section you learn basic features of synthetic data generation based on multivariate correlations, using the SIMNORMAL procedure. You also see the limitations of this approach.
  2. The second section features the SMOTE procedure, which is based on the synthetic minority oversampling technique (SMOTE).
  3. Finally, section three illustrates how you can package a synthetic data generator into an ASTORE file using PROC TABULARGAN.

In the conclusion section, the advantages of these methods are summarized and an outlook on other methods in SAS, for example SAS Data Maker, is given.

 

 

For our analysis we are going to use the sashelp.cars data.

 

title Original Cars Data;
proc print data=sashelp.cars(obs=10);
 var invoice horsepower mpg_city mpg_highway weight;
run;

 

                                                                                  MPG_
                                   Obs     Invoice    Horsepower    MPG_City    Highway    Weight

                                     1     $33,337        265          17          23       4451 
                                     2     $21,761        200          24          31       2778 
                                     3     $24,647        200          22          29       3230 
                                     4     $30,299        270          20          28       3575 
                                     5     $39,014        225          18          24       3880 
                                     6     $41,100        225          18          24       3893 
                                     7     $79,978        290          17          24       3153 
                                     8     $23,508        170          22          31       3252 
                                     9     $32,506        170          23          30       3638 
                                    10     $28,846        220          20          28       3462 

 

 

 

Method 1 - PROC SIMNORMAL to create data from multivariate normal distributions

 

The SIMNORMAL procedure is part of SAS/STAT and SAS Visual Statistics and lets you create synthetic data based on a variance/covariance matrix. Because this method requires variances and covariances, its application is limited to interval variables.

 

Step a)  Create a "Data Generator"

 

In the first step you use PROC CORR to create the variance/covariance matrix and store it in dataset WORK.CARS_COV. 

 

/*********************************************************************************************************************
 ***  Analyze Relationships and Create Data Generator
 *********************************************************************************************************************/

proc corr data=sashelp.cars out=work.cars_cov cov noprint nocorr;
 var invoice horsepower mpg_city mpg_highway weight;
run;
  • The COV option requests covariances instead of correlations
  • NOCORR suppresses the Pearson correlations
  • OUT= stores the results in a dataset
  • NOPRINT suppresses printed output.
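
Before you pass the "data generator" on to PROC SIMNORMAL, it can help to look at what it actually contains. The following is a minimal sketch that simply prints WORK.CARS_COV. With the options used above you would expect one covariance row per analysis variable plus summary rows such as the means, identified by the _TYPE_ and _NAME_ columns. The exact set of rows can depend on your SAS release.

```sas
/* Print the covariance data set that serves as the "data generator" */
title Contents of the Data Generator WORK.CARS_COV;
proc print data=work.cars_cov noobs;
run;
title;
```

Note that this table holds only aggregated statistics and no individual records from SASHELP.CARS.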

 

Step b)  Generate new (synthetic) data using the SIMNORMAL procedure

 

In the next step you use the SIMNORMAL procedure to generate synthetic data.

 

/*********************************************************************************************************************
 ***  Use Generator to Create Synthetic Data
 *********************************************************************************************************************/

proc simnormal data=work.cars_cov(type=cov)
               out = work.CarsSynth_Simnormal
               numreal= 500
               seed = 1;
  var invoice horsepower mpg_city mpg_highway weight;
run;
  • You use the covariance matrix (WORK.CARS_COV) that you generated with the CORR procedure
  • You specify an output dataset that contains the newly created data.
  • The NUMREAL option specifies how many observations shall be generated
  • SEED lets you set a specific seed for the random number generator

 

Step c)  Compare the created data with the original data

 

In the next steps we print and plot the original and synthetic data to be able to review their similarity visually.

 

title Original Cars Data;
proc print data=sashelp.cars(obs=10);
 var invoice horsepower mpg_city mpg_highway weight;
run;

title Synthetic Cars Data with PROC SIMNORMAL;
proc print data=work.CarsSynth_Simnormal(obs=10);
 format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
 var invoice horsepower mpg_city mpg_highway weight;
run;

 

At first sight we see that the new observations match the ranges of the original data to some extent.

 

[Figures: Print Orig.png (printed original data) and Print Simnormal.png (printed SIMNORMAL data)]

  

However, the scatter plot of two selected variables, HORSEPOWER and MPG_HIGHWAY (created with PROC SGPLOT and the SCATTER statement), reveals some differences:

 


title Original Cars Data;
proc sgplot data=sashelp.cars;
 scatter x=horsepower y=mpg_highway;
run;

title;
title Synthetic Cars Data created with PROC SIMNORMAL;
proc sgplot data=work.CarsSynth_Simnormal;
 scatter x=horsepower y=mpg_highway;
run;


 

You can see the following:

  • The basic negative relationship between HORSEPOWER and MPG_HIGHWAY has been captured.
  • However, this relationship has been captured only as linear, while it is non-linear in the original data.
  • The generated data do not stay within the ranges of the original data. For HORSEPOWER the original range was roughly 80 to 500, while the generated values lie roughly between 0 and 400. There is even a negative horsepower value in the generated data.
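
The range problem can also be checked numerically rather than visually. The following sketch stacks the original and the synthetic rows, flags their origin, and compares minimum, maximum, mean and standard deviation per variable. The data set name WORK.CARS_COMPARE and the variable SOURCE are names chosen for this illustration only.

```sas
/* Stack original and synthetic rows and flag where each row comes from */
data work.cars_compare;
 length source $9;
 set sashelp.cars(in=isOrig keep=invoice horsepower mpg_city mpg_highway weight)
     work.CarsSynth_Simnormal(keep=invoice horsepower mpg_city mpg_highway weight);
 if isOrig then source = "Original";
 else source = "Synthetic";
run;

/* Compare ranges and moments side by side */
title Ranges of Original and Synthetic Cars Data;
proc means data=work.cars_compare n min max mean std maxdec=1;
 class source;
 var invoice horsepower mpg_city mpg_highway weight;
run;
title;
```

Means and standard deviations should be close by construction, while the MIN and MAX columns show where the normal distribution produces values that do not occur in the real data.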

 

 

[Figures: Scatter orig.png (scatter plot of the original data) and scatter simnormal.png (scatter plot of the SIMNORMAL data)]

Step d)  Summary

 

You can summarize the features of this method:

  • Easy to use with the CORR and the SIMNORMAL procedures
  • Only requires a SAS/STAT or SAS Visual Statistics license
  • Does not require a lot of computational power
  • Lets you separate the logic (in our case the WORK.CARS_COV dataset) from the original data for the creation of new observations

However,

  • It can only capture linear relationships
  • It is limited to interval variables
  • It only works for cross-sectional data, not for time series data.

 

Note that you can also apply this method using the RANDNORMAL function in SAS/IML, which is illustrated by Rick Wicklin @Rick_SAS in tip 5 of his SAS Global Forum 2015 paper "Ten Tips for Simulating Data with SAS".
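
For orientation, a SAS/IML version of the same idea could look like the sketch below. It is not taken from the paper mentioned above; it only assumes that a SAS/IML license is available and that the sample means and the sample covariance matrix are used as parameters. The output name WORK.CARSSYNTH_IML is chosen for this illustration.

```sas
proc iml;
 call randseed(1);                          /* make the simulation reproducible */
 varNames = {"Invoice" "Horsepower" "MPG_City" "MPG_Highway" "Weight"};

 /* Read the analysis variables into a matrix */
 use sashelp.cars;
 read all var varNames into X;
 close sashelp.cars;

 mu    = mean(X);                           /* row vector of column means       */
 Sigma = cov(X);                            /* sample covariance matrix         */
 Y     = RandNormal(500, mu, Sigma);        /* 500 draws from MVN(mu, Sigma)    */

 /* Write the simulated rows to a data set */
 create work.CarsSynth_IML from Y[colname=varNames];
 append from Y;
 close work.CarsSynth_IML;
quit;
```

As with PROC SIMNORMAL, the result follows a multivariate normal distribution, so the same limitations regarding non-linear relationships and value ranges apply.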

 

 

Method 2 - PROC SMOTE to create data using a non-parametric method

 

 

In his DO Loop blog article, The SMOTE method for generating synthetic data, Rick Wicklin @Rick_SAS explains very nicely the idea and method of the SMOTE algorithm.
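
To give a feeling for the core step of SMOTE as it is commonly described, the following DATA step sketch interpolates between two observed points: a new point is placed at a random position on the line segment between an observation and one of its nearest neighbors. The two points are rows 3 and 4 of the SASHELP.CARS listing shown earlier and are used only as an example pair; they are not necessarily nearest neighbors. PROC SMOTE additionally searches the neighbors, handles nominal inputs and supports extrapolation, none of which is shown here.

```sas
data work.smote_step;
 call streaminit(1);
 /* Example pair of observed points (HORSEPOWER, MPG_HIGHWAY) */
 hp_i = 200;  mpg_i = 29;
 hp_j = 270;  mpg_j = 28;
 do draw = 1 to 5;
   u       = rand("uniform");               /* random position between 0 and 1 */
   hp_new  = hp_i  + u*(hp_j  - hp_i);      /* interpolate each coordinate     */
   mpg_new = mpg_i + u*(mpg_j - mpg_i);     /* with the same u                 */
   output;
 end;
 keep draw u hp_new mpg_new;
run;

proc print data=work.smote_step noobs;
run;
```

Because every new point lies between two existing points, the generated values stay inside the region covered by the original data.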

 

In the machine learning offering of SAS you find the SMOTE procedure as well as the SMOTE action set for generating data with the SMOTE algorithm. You need a Viya: Machine Learning license to use this procedure or action set. This example uses the SMOTE procedure. Refer to the documentation for an example that uses the SMOTE action set.

 

Step a)  Make the original data available in CAS

 

If you have not started a CAS session yet, start it using the CAS statement.

 

CAS cas1;
 
If your data is not yet available in CAS, create a CAS table of your original data:
 
data casuser.cars;
 set sashelp.cars;
run;
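
The DATA step above requires that the libref CASUSER points to your CASUSER caslib. Many environments assign it automatically. If yours does not, a setup along the following lines should work; it assumes that a CAS session is already active and repeats the load step so that the sketch is complete on its own.

```sas
/* Assumption: a CAS session is already active (for example via CAS cas1;) */
libname casuser cas caslib="casuser";   /* bind the libref to the CASUSER caslib */

/* Load the original data into CAS */
data casuser.cars;
 set sashelp.cars;
run;

/* Confirm that the table has arrived in CAS */
proc contents data=casuser.cars;
run;
```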

 

Step b)  Generate synthetic data using the SMOTE procedure

 
SMOTE is a non-parametric method. This means that no "formula", "rule set" or "logic" is calculated or learned from the original data to create synthetic data (as was the case in the previous example with PROC SIMNORMAL). Consequently, PROC SMOTE uses the original dataset itself to generate synthetic data.
 
proc smote data = casuser.cars seed=1;
 input type drivetrain/level = nominal;
 input invoice horsepower mpg_city mpg_highway weight;
 output out=casuser.CarsSynth_Smote;
 sample numsamples=500 
        EXTRAPOLATIONFACTOR = 0.1 
        K = 7;
run;
 
In the SMOTE procedure, you
  • use the INPUT statement to define the variables that are used for the SMOTE analysis and data generation process. You differentiate between NOMINAL and INTERVAL (the default) variables.
  • specify the name of the output dataset with the OUTPUT statement
  • use the SAMPLE statement to specify details for the SMOTE algorithm
    • NUMSAMPLES: number of observations that shall be generated
    • K: number of nearest neighbors that the algorithm uses
    • EXTRAPOLATIONFACTOR: specifies the standard deviation of the Gaussian noise that is used for extrapolation

 

Step c)  Compare the created data with the original data

 

In the next steps we again print and plot the original and synthetic data to be able to review their similarity visually.

 


title Original Cars Data;
proc print data=sashelp.cars(obs=10);
 var invoice horsepower mpg_city mpg_highway weight;
run;

title Synthetic Cars Data with PROC SMOTE;
proc print data=casuser.CarsSynth_Smote(obs=10);
 format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
 var invoice horsepower mpg_city mpg_highway weight;
run;
 
You see that the values of the synthetic data are of a similar magnitude and range as those of the original data.

[Figures: Print Orig.png (printed original data) and Print Smote.png (printed SMOTE data)]

  

When you look at the scatter plots of HORSEPOWER and MPG_HIGHWAY, you see the similarity in the relationship: both show a negative trend and a non-linear relationship. You also see that the ranges of the synthetic data overlap with those of the original data.
 
proc sgplot data=casuser.cars;
title Original Cars Data;
 scatter x=horsepower y=mpg_highway;
run;


proc sgplot data=casuser.CarsSynth_Smote;
 title Synthetic Cars Data created with PROC SMOTE;
 scatter x=horsepower y=mpg_highway;
run;
[Figures: Scatter orig.png (scatter plot of the original data) and Scatter smote.png (scatter plot of the SMOTE data)]

  

The SMOTE method can also handle categorical variables. It therefore makes sense to compare the distribution of categorical variables in the original and in the synthetic data as well.

You can use the FREQ procedure to show the bivariate distribution of TYPE and DRIVETRAIN. In this example only the mosaic plot for the two variables is shown. You can of course also display the frequency table itself or show other charts.

 

ods noproctitle;
proc freq data= casuser.cars ;
 ods select mosaicplot;
title Original Cars Data;
 table type *  drivetrain / plots=(mosaicplot);
run;

ods noproctitle;
proc freq data= casuser.CarsSynth_Smote;
 ods select mosaicplot;
 title Synthetic Cars Data created with PROC SMOTE;
 table type *  drivetrain / plots=(mosaicplot);
run;

 

In the code

  • the TABLE statement requests an analysis of TYPE and DRIVETRAIN.
  • the PLOTS option specifies that you would like to see the mosaic plot
  • the ODS SELECT statement requests that, for better visibility, only the mosaic plot is shown (and not the frequency table)
  • ODS NOPROCTITLE turns off the display of the procedure title in the output

 

[Figures: crosstab orig.png (mosaic plot of the original data) and crosstab smote.png (mosaic plot of the SMOTE data)]

 

In the results you can see that the bivariate distributions of DRIVETRAIN and TYPE overlap to a high degree. The marginal distribution of DRIVETRAIN in the synthetic data matches the original, as do the cell frequencies of the two variables. When you exchange the order of the variables in the TABLE statement, you see vertical columns for TYPE, with DRIVETRAIN in the cells.
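
If you prefer numbers to a visual comparison, the category shares can be put side by side. The sketch below stacks the nominal variables of both tables, flags the origin of each row, and prints row percentages per source. The names WORK.CAT_COMPARE and SOURCE are chosen for this illustration, and the sketch assumes that TYPE and DRIVETRAIN are stored as character variables with compatible lengths in both tables.

```sas
/* Stack the nominal variables of the original and the synthetic table */
data work.cat_compare;
 length source $9;
 set casuser.cars(in=isOrig keep=type drivetrain)
     casuser.CarsSynth_Smote(keep=type drivetrain);
 if isOrig then source = "Original";
 else source = "Synthetic";
run;

/* Row percentages show the category shares within each source */
title Category Shares in Original and Synthetic Data;
proc freq data=work.cat_compare;
 tables source*type source*drivetrain / nocol nopercent;
run;
title;
```

Similar row percentages in the "Original" and "Synthetic" rows indicate that the marginal distributions have been preserved.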

 

Step d)  Summary

 

You can summarize the features of this method:

  • Non-parametric method that deals well with non-linear relationships and with categorical data
  • Because of the nearest-neighbor method, the generated synthetic data lie within the multivariate distribution and range of the original data.
  • The method is easy to use: you simply call the SMOTE procedure
  • Execution time increases (almost linearly) with the number of observations that you want to generate

However,

  • No "logic" or "formula" is generated by the method that could be separated from the original data to generate synthetic data. You always need a copy or a sample of the original data to generate data.
    • If you do not want to move your original data to another environment for data generation, you might consider creating a synthetic copy of the original data in your trusted environment and transferring that copy to the other environment, where it serves as input for the SMOTE method

 

 In his article "Implement a SMOTE simulation algorithm in SAS" Rick Wicklin @Rick_SAS shows how you can implement the SMOTE simulation in SAS/IML.

 

 

Method 3 - PROC TABULARGAN for generative adversarial networks

 

Overview

 

In the final example you look at generative adversarial networks (GANs) for the generation of synthetic data. The machine learning offering of SAS provides the TABULARGAN procedure. This procedure trains a correlation-preserving conditional tabular generative adversarial network (CPCTGAN) model on tabular data. It implements the CPCTGAN model by using the PyTorch library.

 

The procedure trains a model that you can use to generate synthetic data in another environment where you do not have your original data present. David Weik illustrates and describes very nicely the separation of the training of the model and the application of the model in different environments in his article "Using synthetic data to bridge production and development".

 

In the diagram below the arrow between "Production" and "Development" environment illustrates the move of the model logic into another environment.

[Figure: TabularGAN-2048x289.png, the model logic moving from the "Production" to the "Development" environment]

Note that generative adversarial networks are very compute-intensive and require a significant number of training iterations and considerable computation time. In most cases GPUs are needed to perform these computations in reasonable time. It might therefore be hard or even impossible to create a good model in a local installation.

 

This section does not take a deep dive into tuning the model to obtain a good generator. It focuses on an example where you train a basic model, output this model and then apply it in the same or in a separate environment.

 

Train the model

 

As in the SMOTE example above, you need a running CAS session and your original data in a CAS library. You set this up in the same way as shown above for PROC SMOTE.

 

Next you call the TABULARGAN procedure. This example shows how you can

  1. train a model
  2. output the model logic as an ASTORE file
  3. generate a dataset with synthetic data, which you might want to use for validation purposes

 

proc tabularGAN data = casuser.cars  seed = 42   numSamples = 500;

 input type drivetrain/level = nominal;
 input invoice horsepower mpg_city mpg_highway weight;

    gmm alpha = 1 maxClusters = 10 seed = 42 VB(maxVbIter = 3);
    aeOptimization ADAM LearningRate = 0.0001 numEpochs = 3;
    ganOptimization ADAM(beta1 = 0.55 beta2 = 0.95) numEpochs = 5;
    train embeddingDim = 32 miniBatchSize = 300 useOrigLevelFreq;

    saveState rStore = casuser.ASTORE_CarsSynth_TabGAN;
    output out = casuser.CarsSynth_TabGAN;
run; quit;

 

  • You use the INPUT statement to specify INTERVAL (default) and NOMINAL variables.
  • You define options to train the model using the GMM, AEOPTIMIZATION, GANOPTIMIZATION and TRAIN statements. Check the documentation for more details (item "1" above).
  • Note that the values specified for the MAXVBITER and the NUMEPOCHS options are too small for a representative model. They have been set to these values so that the example code runs quickly and the results can be illustrated.
  • The SAVESTATE statement lets you export the model as an ASTORE and save it in a CAS library (item "2" above).
  • The OUTPUT statement lets you create synthetic data and store them in a dataset (item "3" above). Here the NUMSAMPLES option in the procedure statement specifies the number of observations.
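
After training, you may want to confirm that the ASTORE was written and see which inputs and outputs it expects. One way to do this, sketched below, is the DESCRIBE statement of PROC ASTORE. The content of the listing depends on the model and on your release.

```sas
/* Show basic information about the saved model (inputs, outputs, creation details) */
proc astore;
 describe rstore = casuser.ASTORE_CarsSynth_TabGAN;
run;
```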

 

You could run the same code as for the SMOTE procedure above to validate the synthetic data and compare it with the original data. The above code, however, does not train the model long enough to represent the data well. The validation code is shown here only for completeness.

 

 

title Original Cars Data;
proc print data=sashelp.cars(obs=10);
 var invoice horsepower mpg_city mpg_highway weight;
run;
title Synthetic Cars Data with PROC TABULARGAN;
proc print data=casuser.CarsSynth_TabGAN(obs=10);
 format invoice DOLLAR8. horsepower mpg_city mpg_highway weight 8.;
 var invoice horsepower mpg_city mpg_highway weight;
run;

proc sgplot data=casuser.cars;
title Original Cars Data;
 scatter x=horsepower y=mpg_highway;
run;
proc sgplot data=casuser.CarsSynth_TabGAN;
 title Synthetic Cars Data created with PROC TABULARGAN;
 scatter x=horsepower y=mpg_highway;
run;

ods noproctitle;
proc freq data= casuser.cars ;
 ods select mosaicplot;
title Original Cars Data;
 table type *  drivetrain / plots=(mosaicplot);
run;
ods noproctitle;
proc freq data= casuser.CarsSynth_TabGAN;
 ods select mosaicplot;
 title Synthetic Cars Data created with PROC TABULARGAN;
 table type *  drivetrain / plots=(mosaicplot);
run;

 

 

Applying the model and creating synthetic data

 

To create synthetic data from the model stored in the ASTORE casuser.ASTORE_CarsSynth_TabGAN, you can use the following code.

In the first step you create a dataset with the required number of observations (800 in our example).

 

 

data work.Cars800_Base;
 do id = 1 to 800;
   output;
 end;
run;
 
In the next step you use the ASTORE procedure to apply the model in ASTORE_CARSSYNTH_TABGAN to the 800 records.
 

proc astore;
 score data     = work.Cars800_Base
       rstore   = casuser.ASTORE_CarsSynth_TabGAN
       out      = casuser.CarsSynthData_TabGAN800
       copyVars = (_all_)
;
run;
 
Table CASUSER.CARSSYNTHDATA_TABGAN800 contains the 800 synthetic observations. 
 
As mentioned above, this method lets you separate model building (using PROC TABULARGAN) and model application (using PROC ASTORE) between environments, so that you only move the model logic to the environment where the synthetic data shall be generated.
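
How the model is physically moved is not covered by the example above. One possible route, sketched here, uses the DOWNLOAD and UPLOAD statements of PROC ASTORE to write the ASTORE to a file and to load it again in the target environment. The file path is hypothetical, and the sketch assumes that you copy the file between the environments yourself.

```sas
/* In the training environment: write the ASTORE from CAS to a file */
proc astore;
 download rstore = casuser.ASTORE_CarsSynth_TabGAN
          store  = "/tmp/CarsSynth_TabGAN.sasast";   /* hypothetical path */
run;

/* In the target environment, after the file has been copied there:
   load the file into a CAS table so that it can be used for scoring */
proc astore;
 upload rstore = casuser.ASTORE_CarsSynth_TabGAN
        store  = "/tmp/CarsSynth_TabGAN.sasast";     /* hypothetical path */
run;
```

Only the model file crosses the boundary between the environments; the original data stay where they are.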

 

 

 

 

 

Links and other solutions

 

With SAS Data Maker, SAS also provides a comprehensive synthetic data solution. It is a low-code/no-code tool for generating high-quality synthetic data that mirrors real-world data sets. It lets you augment existing data or create entirely new data sets, reducing the cost of data acquisition, protecting sensitive information and accelerating AI and analytics development. SAS Data Maker can also synthesize time series data and data from a relational schema of tables.

 

 


 

 

 

 

 
