Juletip #12 - Synthetic Data Generation - kind of a Kinderegg of possibilities

OMH · Posted 12-12-2024 02:52 AM

SAS Data Maker: A Deep Dive into Synthetic Data Generation

In today's data-driven world, access to large amounts of data is crucial for developing accurate and effective AI models. However, real-world data often presents challenges related to privacy, bias, cost, and availability. Synthetic data generated by SAS Data Maker offers a solution to these problems, opening the door to new possibilities in artificial intelligence.

Figure 1: SAS Data Maker Canvas

What is Synthetic Data, and Why is it Important?

Synthetic data is artificially generated information that mirrors the statistical properties of real data without containing any sensitive or identifiable details. This makes synthetic data an invaluable tool for:

Protecting privacy: Avoid risks associated with sharing sensitive data.

Reducing bias: Mitigate discrimination and biases in datasets.

Improving AI models: Increase the accuracy and robustness of AI applications.

Democratizing data: Make data accessible to more people without compromising privacy.

SAS Data Maker: A Comprehensive Platform for Synthetic Data Generation

SAS Data Maker is an innovative solution that simplifies and automates the process of generating high-quality synthetic data. The platform offers a variety of features that give users complete control and flexibility over the data generation process.

Key Features and Benefits:

User-friendly interface: Intuitive interface for easy data preparation and model configuration.

Advanced algorithms: Access to a wide range of algorithms, including deep learning models like GANs, for generating realistic synthetic data.

Data quality evaluation: Robust tools for assessing the quality and privacy of synthetic data.

Scalability: Generate large volumes of synthetic data to meet the demands of complex AI models.

Integration with SAS Viya: Seamlessly integrate with SAS Viya for enhanced analytics and model development.

How SAS Data Maker Works

The process of generating synthetic data with SAS Data Maker involves a series of simple steps:

1. Data preparation: Select and preprocess the real-world data you want to use as a basis for your synthetic data.

Figure 2: SAS Data Maker User Interface

2. Model configuration: Choose the appropriate algorithm and configure the parameters based on your specific needs.

3. Model training: Train the chosen model on your real-world data.

4. Synthetic data generation: Generate synthetic data that mirrors the statistical properties of your original data.

Figure 3: SAS Data Maker Data Sampling

5. Data quality evaluation: Evaluate the quality and privacy of your synthetic data using a variety of metrics and visualizations.

Figure 4: Statistical Correlation between input data and synthetic data.

Figure 5: Statistical distributions of the input data and the synthetic data.

6. Data download: Download the results and use the data for further analytical work.

Figure 6: Results Data download
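The evaluation in step 5 compares statistics of the input data with those of the synthetic data. As a rough illustration of that idea only (this is not SAS Data Maker's actual metric, which this post does not detail), the Python sketch below compares the Pearson correlation of two columns in a real and a synthetic table. The helper names pearson and correlation_gap are hypothetical, the "real" pairs are the AgeTotal and Length values from the trout data sample shown later in this post, and the synthetic pairs are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_gap(real, synthetic):
    """Absolute difference between the real and the synthetic correlation."""
    r_real = pearson(*zip(*real))
    r_synthetic = pearson(*zip(*synthetic))
    return abs(r_real - r_synthetic)

# (AgeTotal, Length) pairs from the trout data sample in this post.
real = [(1, 78), (2, 138), (3, 201), (4, 258)]
# Made-up synthetic pairs, for illustration only.
synthetic = [(1.5, 108.0), (2.5, 169.5), (3.5, 229.5), (1.2, 90.0)]

gap = correlation_gap(real, synthetic)
print(f"correlation gap: {gap:.5f}")
```

A small gap suggests that the synthetic table preserves the relationship between the two columns. A real evaluation would look at many column pairs and at privacy metrics as well.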

Use Cases for Synthetic Data

Synthetic data has a wide range of applications across various industries, including:

Healthcare: Simulate patient cohorts for clinical trials, research, and drug discovery.

Finance: Detect fraud by simulating fraudulent transactions and improve risk assessment models.

Climate: Simulate climate-related events to assess risks and develop mitigation strategies.

Manufacturing: Simulate sensor data from manufacturing plants to develop predictive maintenance algorithms.

Public Sector: Simulate demographic information to support public policy development.

Generative AI and Synthetic Data

Synthetic data is an essential component of generative AI. It can be used to train AI models, improve privacy, and address ethical concerns related to using real-world data. SAS Data Maker empowers organizations to leverage the full potential of generative AI while ensuring data privacy and security.

Synthetic Data creation using SAS Studio:

SAS Studio Generate Synthetic Data using Custom Step:

Synthetic Minority Oversampling TEchnique (SMOTE)

This custom step helps you generate synthetic data based on an input table, using the Synthetic Minority Oversampling TEchnique (SMOTE). SMOTE is an oversampling technique that creates new observations in the neighborhood of closely associated original observations.

SMOTE is an alternative approach to Generative Adversarial Networks (GANs) for generating synthetic tabular data. Access to synthetic data helps you make better, data-informed decisions in situations where you have imbalanced, scant, poor quality, unobservable, or restricted data.
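To make the "neighborhood" idea concrete, here is a minimal Python sketch of SMOTE-style interpolation for numeric columns. It illustrates the general technique and is not the SAS implementation. The function name smote_sketch and its parameters (k, seed) are hypothetical, and the input pairs are the AgeTotal and Length values from the trout data sample shown later in this post.

```python
import math
import random

def smote_sketch(rows, n_samples, k=2, seed=10):
    """Create n_samples synthetic rows by interpolating between a randomly
    chosen row and one of its k nearest neighbors (numeric columns only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(rows)
        # Rank all other rows by Euclidean distance to the base row.
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: math.dist(base, r))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the line segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# (AgeTotal, Length) pairs from the trout data sample in this post.
real = [(1, 78), (2, 138), (3, 201), (4, 258)]
new_rows = smote_sketch(real, n_samples=5)
for row in new_rows:
    print(row)
```

Because every synthetic row lies between two real neighbors, the new values stay inside the range of the original data in this sketch.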

Read more about the SAS Studio Custom Step in this GitHub project.

Figure 7: SAS Studio Custom Step generating Synthetic Data

Running SAS Code in SAS Studio is also possible:

This SAS program uses the smoteSample action from the smote action set in SAS Viya to generate synthetic data.

Figure 8: SAS Studio SAS Code running SMOTE

Here's a breakdown of the code and the process involved:

Establishing a CAS Session:

cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");

This line initiates a CAS (Cloud Analytic Services) session named "mySession," setting parameters for the session like default caslib (a library in CAS), timeout duration, and locale.

Loading Data:

proc casutil;

load file="/home/users/XXXX*/Trout/HunderTroutData_Growth_DQ.csv"

outcaslib="casuser" casout="HunderTroutData_Growth_DQ";

run;

*Adjust to your username
This section uses proc casutil to load a CSV file named "HunderTroutData_Growth_DQ.csv" into a CAS table named "HunderTroutData_Growth_DQ" within the "casuser" caslib. This makes the data accessible for analysis within the CAS environment.

Data sample

MarkNo,CaptureNo,ScaleNo,ScaleNoMax,Period,AgeTotal,AgeRiver,AgeLake,Length,SmoltingStatus,MaturationStatus,SpawnStatus,Year,Sex,Origin,HatchYear,AgeAtSmolting,LengthAtSmolting,AgeAtMaturation,LengthAtMaturation,SpawnCount,CaptureYear

H 1151,1,1,1,River,1,1,0,78,0,0,0,1956,female,wild,1955,5,333,7,629,3,1966

H 1151,1,1,1,River,2,2,0,138,0,0,0,1957,female,wild,1955,5,333,7,629,3,1966

H 1151,1,1,1,River,3,3,0,201,0,0,0,1958,female,wild,1955,5,333,7,629,3,1966

H 1151,1,1,1,River,4,4,0,258,0,0,0,1959,female,wild,1955,5,333,7,629,3,1966

Applying SMOTE in SAS Studio:

proc cas;

  loadactionset "smote";

  action smoteSample result=r /

    table="HunderTroutData_Growth_DQ",

    nominals={"LengthAtSmolting", "Sex", "Origin", "MarkNo", "Period"},

    seed=10,

    numSamples=150000,

    extrapolationFactor=0.8,

    casout={name="SyntheticHunderTrout",replace="TRUE"};

  print r;

run;

quit;

This is the core part of the program where synthetic data is generated using the SMOTE (Synthetic Minority Over-sampling Technique) algorithm.

loadactionset "smote"; loads the action set containing the smoteSample action.
action smoteSample result=r / ... invokes the smoteSample action to generate the synthetic data.
	table="HunderTroutData_Growth_DQ" specifies the input CAS table containing the real data.
	nominals={"LengthAtSmolting", "Sex", "Origin", "MarkNo", "Period"} identifies the categorical variables in the dataset.
	seed=10 sets a seed value for reproducibility of the synthetic data generation.
	numSamples=150000 determines the number of synthetic samples to generate.
	extrapolationFactor=0.8 controls the degree of extrapolation when creating new synthetic instances.
	casout={name="SyntheticHunderTrout",replace="TRUE"} specifies the name of the output CAS table ("SyntheticHunderTrout") that stores the generated synthetic data; replace="TRUE" overwrites the table if it already exists.
print r; displays the results of the smoteSample action.
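The nominals list matters because categorical values cannot be interpolated like numbers. How the smoteSample action handles them internally is not described in this post, so the Python sketch below only shows one common approach, under that stated assumption: numeric columns are interpolated, and nominal columns are copied from one of the two parent rows. The function blend_row and its parameters are hypothetical.

```python
import random

def blend_row(base, neighbor, nominals, gap, rng):
    """Combine two rows (dicts): interpolate numeric columns and copy
    nominal columns from one of the two parent rows."""
    out = {}
    for col, value in base.items():
        if col in nominals:
            # Categories cannot be averaged, so take one parent's value.
            out[col] = rng.choice([value, neighbor[col]])
        else:
            out[col] = value + gap * (neighbor[col] - value)
    return out

rng = random.Random(10)
nominals = {"Sex", "Origin", "Period"}
# Two rows built from columns of the trout data sample in this post.
base = {"Sex": "female", "Origin": "wild", "Period": "River", "AgeTotal": 1, "Length": 78}
neighbor = {"Sex": "female", "Origin": "wild", "Period": "River", "AgeTotal": 2, "Length": 138}

row = blend_row(base, neighbor, nominals, gap=0.5, rng=rng)
print(row)
```

With gap=0.5 the numeric columns land halfway between the two parents, while Sex, Origin, and Period keep a valid category value.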

Results:

Figure 9: Results of SMOTE

Data:

Figure 10: Data Sample from SAS Code running

In the context of the provided SAS program, SMOTE is used to generate synthetic data that balances the representation of different categories or groups within the "HunderTroutData_Growth_DQ" dataset. This can be particularly useful if the original data has under-represented groups, leading to more robust and fair AI models trained on the synthetic data.
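To get a feel for what balancing means in practice, the short Python sketch below counts how many synthetic rows each group would need to reach the size of the largest group. The function samples_needed is hypothetical, and the Origin labels are invented for the example. They are not taken from the trout data.

```python
from collections import Counter

def samples_needed(labels):
    """For each class, the number of synthetic rows that would bring it up
    to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Invented Origin labels, for illustration only.
origin = ["wild"] * 8 + ["hatchery"] * 2
needed = samples_needed(origin)
print(needed)
```

In this example the under-represented group would need 6 extra rows, and the majority group none.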

Conclusion

SAS Data Maker is a powerful and versatile tool that enables organizations to generate high-quality synthetic data for a variety of purposes. With its advanced features, user-friendly interface, and seamless integration with SAS Viya, SAS Data Maker is the ideal solution for organizations looking to leverage the power of synthetic data in a responsible and ethical manner.

You can also generate your own synthetic data with other tools on the SAS Viya platform, as shown in the examples above: the SAS Studio Custom Step "SDG - Generate Synthetic Data through SMOTE", or my own example using regular SAS code in SAS Studio. The trout data in that example originates from a 51-year mark-recapture study of a land-locked population of large-sized migratory brown trout (Salmo trutta) in Norway.

SAS emphasizes that synthetic data will be crucial for addressing challenges related to data privacy, scarcity, and bias, and sees it as a key enabler for innovation that allows organizations to develop AI models and make better data-driven decisions.

It is also worth noting that Gartner predicts synthetic data will overshadow real data in AI models by 2030. This highlights the transformative potential of synthetic data in creating more accurate, ethical, and robust AI applications.

In conclusion, both SAS and Gartner foresee a future where synthetic data plays a pivotal role in analytics, driving innovation and improving decision-making across various industries.

As a last bit of information, SAS has recently bought the company Hazy to evolve synthetic data generation even further, so keep an eye out for more features arriving in 2025.

From all of us at SAS, we wish you a Merry Christmas!
