How Can Synthetic Data Improve ML Training and Fairness? Q&A, Slides, and On-Demand Recording

Watch this Ask the Expert session to discover how synthetic data accelerates model development while preserving data integrity.

Watch the webinar

You will learn how to:

Eliminate bottlenecks in data access and sharing that slow model development.
Address issues of sufficiency and balance for your training data.
Evaluate synthetic data for realism, fairness and analytical value.
Accelerate experimentation, model building and deployment while minimizing privacy and compliance risks.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

Answered

Do we need to start with "real" data? Can't we start with summary statistics? For example, if an article in New England Journal of Medicine describes the characteristics of the population being studied, can these data be used to generate a synthetic data set with the same overall characteristics?

Data Maker currently relies on an input sample to derive the distributions for the various tables, the links between them, and the statistics that feed the generator models. However, we are working on an extension of Data Maker that keeps the same overall structure. In this approach, you describe what is in your data (the data types) and, rather than selecting a model to do the job, you specify the distributions you would like to reproduce. The process then works much the same way, except that the training step is skipped, because no distributions are being derived from data. This is in active development, though we do not have a specific schedule. When generating synthetic data through configuration, we do not rely on actual production data. Instead, we start with a configuration sheet (an Excel file or another specification) that incorporates summary statistics, as the questioner described, and uses them to generate distributions. This process can also impose constraints. It is an interesting parallel area that we are exploring. We thank Sufjan for the question and welcome further feedback on your synthetic data generation needs.

Currently, Data Maker is a sample-based solution. However, we are working on a schema-based approach that will allow users to do exactly what you have described: define a multi-table data schema and provide a set of distributions they want the data to follow.
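To make the configuration-driven idea concrete, here is a minimal Python sketch of generating rows from summary statistics alone. It is an illustration only and not how Data Maker works: the SPEC dictionary, the column names, and every number in it are hypothetical stand-ins for what a published study or a configuration sheet might state. Columns are sampled independently, so relationships between variables are not reproduced.

```python
import random
import statistics

# Hypothetical specification: each column is described only by summary
# statistics, as a configuration sheet might do. No real records are used.
SPEC = {
    "age": {"kind": "normal", "mean": 54.0, "sd": 12.0, "min": 18, "max": 90},
    "sex": {"kind": "categorical", "levels": {"F": 0.52, "M": 0.48}},
    "smoker": {"kind": "categorical", "levels": {"yes": 0.2, "no": 0.8}},
}

def sample_value(rule, rng):
    """Draw one value for a column according to its declared distribution."""
    if rule["kind"] == "normal":
        # Constraint handling: redraw until the value falls inside [min, max].
        while True:
            x = rng.gauss(rule["mean"], rule["sd"])
            if rule["min"] <= x <= rule["max"]:
                return round(x, 1)
    if rule["kind"] == "categorical":
        levels = list(rule["levels"])
        weights = list(rule["levels"].values())
        return rng.choices(levels, weights=weights, k=1)[0]
    raise ValueError(f"unknown distribution kind: {rule['kind']}")

def generate_rows(spec, n, seed=0):
    """Generate n synthetic rows; columns are sampled independently."""
    rng = random.Random(seed)
    return [
        {col: sample_value(rule, rng) for col, rule in spec.items()}
        for _ in range(n)
    ]

rows = generate_rows(SPEC, 5000)
mean_age = statistics.mean(r["age"] for r in rows)
share_f = sum(r["sex"] == "F" for r in rows) / len(rows)
print(f"mean age: {mean_age:.1f}, share female: {share_f:.3f}")
```

The generated table should land close to the stated mean age and category shares. Reproducing correlations or links between tables would need additional inputs, such as a covariance matrix or conditional distributions.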

How does SAS Data Maker ensure privacy while maintaining analytical utility?

Data Maker provides privacy metrics and controls, which mainly rely on a technique called differential privacy. Differential privacy is regarded as a stronger form of privacy protection, primarily because it injects noise during the training process. Other techniques, such as anonymization, focus mainly on individual variables and attempt to convert them to synthetic proxies. The danger with anonymization is that you might believe you are protected because you removed a column from your data set or masked it with X marks or redactions, while other attributes still strongly indicate the presence of, or similarity to, real-world observations. To guard against this, you need to "jumble up the works" in some way, and that is exactly what differential privacy does. Because you are in essence introducing noise, there is always a trade-off between the extent of privacy protection and the analytical utility of the data set. It is therefore important to monitor these metrics and find the balance that best serves your purpose.
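The privacy and utility trade-off can be illustrated with a small Python sketch. It uses the Laplace mechanism on a single mean query, which is a simpler setting than injecting noise during model training and is not Data Maker's implementation. The income values, the clipping bounds, and the epsilon values are all made up for the example, and the data set size is assumed to be public.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def clipped_mean(values, lo, hi):
    """Mean after clipping each value to [lo, hi] to bound its influence."""
    return sum(min(max(v, lo), hi) for v in values) / len(values)

def private_mean(values, lo, hi, epsilon, rng):
    """Release the clipped mean with Laplace noise calibrated to epsilon."""
    # Changing one record moves the clipped mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / len(values)
    return clipped_mean(values, lo, hi) + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(values, lo, hi, epsilon, trials=2000, seed=1):
    """Average absolute error of the private mean over repeated releases."""
    rng = random.Random(seed)
    truth = clipped_mean(values, lo, hi)
    total = 0.0
    for _ in range(trials):
        total += abs(private_mean(values, lo, hi, epsilon, rng) - truth)
    return total / trials

# Made-up data: 500 incomes drawn uniformly, purely for illustration.
data_rng = random.Random(0)
incomes = [data_rng.uniform(20_000, 120_000) for _ in range(500)]

# Smaller epsilon means stronger privacy and a noisier answer.
errors_by_epsilon = {
    eps: mean_abs_error(incomes, 0, 150_000, eps) for eps in (0.1, 1.0, 10.0)
}
for eps, err in errors_by_epsilon.items():
    print(f"epsilon={eps}: mean absolute error {err:.1f}")
```

The error shrinks as epsilon grows, which is the balance described above: a tighter privacy budget costs accuracy, and the right setting depends on what the data will be used for.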

Can you describe a real or potential use case in the government space about using protected tax data for revenue forecasting or public policy formulation?

That is a rather specific application, and we do not have details we can disclose at this time. However, we have been working with government agencies, and the main pain point we encounter is usually data sharing. In problems like the one described, data comes into one agency, but that agency relies on data from various other sources, so analysis is better served if the data is shared. Data sharing across multiple agencies is important because they sometimes have different perspectives on the same issue. Beyond taxation and finance, other industries with strong public oversight, such as healthcare, face similar challenges. There is a need to share data with research institutes and other quasi-government bodies, but direct data access cannot always be granted. This is where synthetic data helps: it puts a barrier between the real data and a synthetic data set that is very similar but can be shared across agencies. This facilitates better consortium models and improves overall prediction. I agree that tax agencies and the taxation sector can definitely benefit from synthetic data.

Can synthetic data be used in PhD research and academic publications?

The answer is yes. We have had cases where this has been discussed, primarily in the healthcare and life sciences space, especially prior to publishing results in a journal. Most clinical studies tend to be lengthy, and researchers may not want to publish final results immediately since they must go through various stages of review. However, sharing initial insights is valuable since it helps the scientific community analyse and advance research as a whole. In such cases, journals typically accept a synthetic proxy, provided you have demonstrated that you have taken adequate steps to capture the essence of the data without revealing too much about a specific study or increasing liability for the publisher and the contributing agency. Additionally, when a PhD or research project focuses on sensitive data, accessing the actual data is often a problem. Synthetic data provided by the data owner can allow the research to proceed in cases where researchers would otherwise not be allowed to access the data.

What types of data work best with synthetic generation today?

Structured data, usually a mix of categorical and numeric variables, works best. Categories provide the dimensions that allow you to generate synthetic data and be quite specific in mimicking the distribution for a particular level of a categorical variable. Where synthetic data still has progress to make before it can be considered production grade is free text generation. This is a debatable point, given that we live in a world of large language models (LLMs) and, in that sense, are generating synthetic data all the time. The challenge is that generating free text requires a lot of context, and that context needs to be relevant and specific to the task at hand. This requires considerable access to local data and proper structuring of the unstructured data that is used as evidence for free text generation. There have been cases, for example in the hospitality industry, where organisations generate synthetic customer reviews so they can test their customer service processes. Similarly, the insurance industry generates fake claims to test its claims adjudication processes. However, when these experiments are conducted, the details often turn out to be too specific and too similar to real-world reviews or customer complaints, since that is the data on which the models were trained. Much more careful creation of the data is therefore needed before synthetic data generation can be reliably used for unstructured free text.
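One way to catch the "too similar to real-world reviews" problem is to screen generated text against the source text before using it. The Python sketch below is a rough illustration, assuming a simple character-level similarity from the standard library's difflib and an arbitrary threshold of 0.8. The review strings are invented, and a production check would likely need a more robust similarity measure.

```python
from difflib import SequenceMatcher

def max_similarity(candidate, real_texts):
    """Highest character-level similarity between a candidate and any real text."""
    return max(
        SequenceMatcher(None, candidate.lower(), real.lower()).ratio()
        for real in real_texts
    )

def flag_near_copies(synthetic_texts, real_texts, threshold=0.8):
    """Return synthetic texts that look like near-copies of a real text."""
    return [
        text for text in synthetic_texts
        if max_similarity(text, real_texts) >= threshold
    ]

# Invented examples standing in for real and generated customer reviews.
real_reviews = [
    "The room was clean but the check-in queue took forty minutes.",
    "Breakfast was excellent and the staff were friendly.",
]
synthetic_reviews = [
    "The room was clean but the check-in queue took forty five minutes.",
    "Parking was hard to find near the hotel entrance.",
]

flagged = flag_near_copies(synthetic_reviews, real_reviews)
for text in flagged:
    print("too close to a real review:", text)
```

The threshold is a judgment call: set it too low and ordinary phrasing is flagged, too high and lightly edited copies slip through.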

Recommended Resources

SAS Data Maker Page

Microsoft Marketplace Listing

Unlocking Health Innovation with Synthetic Data | The Health Pulse Podcast

Please see additional resources in the attached slide deck.

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow-up Q&A, slides and recordings from other SAS Ask the Expert webinars.
