
Trust but Verify: How SAS Data Maker Evaluates Synthetic Data Quality and Privacy


Synthetic data has rapidly moved from a niche concept to an essential tool in modern analytics. Whether it’s used for model development, software testing, data sharing, or privacy protection, synthetic data offers tremendous advantages, especially when real data is limited or sensitive.


In my earlier post, we explored how SAS Data Maker helps users generate synthetic datasets quickly and intelligently. But generating synthetic data is only the first step. To truly harness its value, we must ask a critical question: Is the synthetic data any good?


In this article, we’ll walk through how to evaluate the quality and utility of synthetic data so that you can confidently use it for analytics, testing, or machine learning.


Why Evaluating Synthetic Data Is Critical


While synthetic data is powerful, it’s not automatically useful. A dataset may look valid at a glance, but still:


  • misrepresent relationships
  • distort statistical properties
  • leak sensitive information
  • fail to support real analytical use cases


If we don’t evaluate synthetic data rigorously, we risk making decisions based on inaccurate, biased, or unrealistic information. A strong evaluation framework ensures:


  • Statistical Fidelity: statistical properties resemble the original data
  • Privacy & Security: no sensitive information can be traced back to real individuals
  • Utility & Usability: the data continues to serve its core purpose of supporting modeling, analytics, and testing


This leads us to the three essential dimensions of synthetic data evaluation. Let’s explore each in detail.


Statistical Fidelity: This dimension is all about how closely synthetic data mirrors the statistical properties of the original dataset. With SAS Data Maker, this evaluation becomes highly visual and intuitive through its Similarity Metrics. Below is the list of metrics generated by SAS Data Maker:


    • Histogram Similarity- This metric measures how closely the synthetic data preserves the marginal distribution of each column compared to the original dataset. It works by generating histograms that show the percentage of observations falling within each bin range for both the source data and the synthetic data, making it easy to see how well the two distributions align.


[Image: Histogram Similarity report for the credit policy column (01_MS_Histogram-Similarity2-2.jpg)]



In the histogram above, frequency percentages are shown for the two values of credit policy. The purple portion of each bar represents the percentage of observations that fall within that bin range in both the original input data and the generated synthetic data. A blue segment on top of a purple bar marks the additional percentage of observations present in that bin in the original data but not in the synthetic data; a pink segment marks the additional percentage present in the synthetic data but not in the original. The greater the overlap between the generated synthetic data and the source data, the better the score.
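
To make the idea concrete, here is a minimal Python sketch of an overlap-style histogram similarity, assuming pandas and NumPy and hypothetical DataFrames real_df and synth_df. It illustrates the intuition behind the purple overlap above, not SAS Data Maker’s exact computation.

```python
import numpy as np
import pandas as pd

def histogram_similarity(real: pd.Series, synth: pd.Series, bins: int = 10) -> float:
    """Score in [0, 1]: the shared area of the two binned frequency
    distributions (the 'purple' overlap described above)."""
    # Bin both columns with the same edges so the comparison is fair.
    edges = np.histogram_bin_edges(pd.concat([real, synth]), bins=bins)
    real_counts, _ = np.histogram(real, bins=edges)
    synth_counts, _ = np.histogram(synth, bins=edges)
    real_pct = real_counts / real_counts.sum()
    synth_pct = synth_counts / synth_counts.sum()
    # The overlap in each bin is the smaller of the two percentages.
    return float(np.minimum(real_pct, synth_pct).sum())

# Hypothetical usage with a column like the credit policy flag above:
# score = histogram_similarity(real_df["credit_policy"], synth_df["credit_policy"])
```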


    • Mutual Information Similarity- evaluates the degree of dependence between two variables. It is particularly valuable in advanced scenarios where preserving cross‑column relationships is critical, such as analytics, machine learning, or reporting. By measuring how well associations between variables are maintained, this metric helps ensure that synthetic data reflects the same underlying patterns as the original dataset. For instance, a user might apply it to confirm that the relationship between columns like income and age remains intact. That matters because those correlations often drive reporting, analytics, or predictive models down the line.


[Image: Mutual Information Similarity heatmap (02_MS_Mutual-Information1.jpg)]


The heatmap above provides a visual way to examine the mutual information score between pairs of columns. By hovering over any cell, you can see the score for the corresponding column pair in each table. In the example shown, the screenshot highlights a similarity score of 0.967 for the Debt-to-Income Ratio and FICO Score columns, indicating a strong preserved relationship. Keep in mind that the score ranges from 0 to 1.
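
As an illustration, the sketch below estimates how well pairwise dependence is preserved, assuming scikit-learn is available and that numeric columns are binned first. The "1 minus the difference" scoring is a simplification for illustration, not SAS Data Maker’s published formula, and the column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def discretize(s: pd.Series, bins: int = 10) -> pd.Series:
    """Bin a numeric column so mutual information can be estimated.
    Assumes no missing values."""
    return pd.cut(s, bins=bins, labels=False)

def mi_similarity(real: pd.DataFrame, synth: pd.DataFrame,
                  col_a: str, col_b: str) -> float:
    """1.0 means the dependence between the two columns is identical
    in the real and synthetic tables; 0.0 means it is lost entirely."""
    mi_real = normalized_mutual_info_score(discretize(real[col_a]),
                                           discretize(real[col_b]))
    mi_synth = normalized_mutual_info_score(discretize(synth[col_a]),
                                            discretize(synth[col_b]))
    return 1.0 - abs(mi_real - mi_synth)

# Hypothetical names for the column pair highlighted above:
# mi_similarity(real_df, synth_df, "dti", "fico")
```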


    • Degree Distribution Similarity- This metric is relevant only in multi‑table scenarios where tables are related to one another. It measures how well the relationships between related tables—such as one‑to‑many or many‑to‑many associations—are preserved in the synthetic data compared to the source data. This is achieved by calculating the distribution of co‑occurring values across the key columns of related tables and then comparing those distributions between the real and synthetic datasets.


[Image: Degree Distribution Similarity comparison for related tables (03_MS_Degree-Distribution-Similarity.png)]


For every pair of related tables, SAS Data Maker generates histograms on both sides of the relationship—covering the source data as well as the synthetic data. The degree distributions are then compared using total variation distance to measure how closely they align. You can choose the table combinations you want to examine, with a percentage score displayed for each. By hovering over a bar, you can view the frequency score for the corresponding column pair in both the original and synthetic datasets.
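
Here is a minimal sketch of that comparison for a one-to-many relationship, assuming pandas tables that share a key column (all names hypothetical). It computes the "degree" of each parent, that is, how many child rows reference it, compares the real and synthetic degree distributions with total variation distance, and reports one minus the distance so that higher is better.

```python
import pandas as pd

def degree_distribution(child: pd.DataFrame, key: str) -> pd.Series:
    """P(degree = k): the share of key values that have k child rows."""
    degrees = child.groupby(key).size()          # child rows per key value
    return degrees.value_counts(normalize=True)

def degree_similarity(real_child: pd.DataFrame,
                      synth_child: pd.DataFrame, key: str) -> float:
    p = degree_distribution(real_child, key)
    q = degree_distribution(synth_child, key)
    # Align on the union of observed degrees; missing degrees get 0.
    p, q = p.align(q, fill_value=0.0)
    tvd = 0.5 * (p - q).abs().sum()              # total variation distance
    return 1.0 - float(tvd)                      # higher is better

# Hypothetical usage: loan rows linked to customers by customer_id.
# degree_similarity(real_loans, synth_loans, key="customer_id")
```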


    • Cross-Table Mutual Information Similarity- checks whether relationships between columns in different tables are preserved in the synthetic data. Think of it as an extension of the standard mutual information metric, but with a twist: instead of looking at columns within the same table, it compares column pairs across tables. This makes it especially useful in multi‑table settings where maintaining those cross‑table connections is critical for accurate reporting, analytics, or machine learning (see the sketch after this list).
    • Aggregated Sequential Histogram Similarity- measures how accurately the synthetic data maintains the marginal distribution of each column within a sequential table. In other words, it checks whether the overall shape and spread of values in the synthetic dataset align with those in the original source data when viewed across time or sequence.
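
The cross-table variant can be sketched by joining the related tables on their key and reusing the single-table check from the mutual information sketch above (table, key, and column names are hypothetical):

```python
import pandas as pd

def cross_table_mi_similarity(real_parent: pd.DataFrame, real_child: pd.DataFrame,
                              synth_parent: pd.DataFrame, synth_child: pd.DataFrame,
                              key: str, parent_col: str, child_col: str) -> float:
    """Join each pair of tables on the key, then measure how well the
    dependence between one column from each table is preserved."""
    real_joined = real_parent.merge(real_child, on=key)
    synth_joined = synth_parent.merge(synth_child, on=key)
    # mi_similarity is the helper defined in the earlier sketch.
    return mi_similarity(real_joined, synth_joined, parent_col, child_col)
```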


If a histogram shows weak alignment for certain columns, don’t worry—there are ways to fine‑tune the results. By adjusting parameters in SAS Data Maker, you can often boost similarity scores and make the synthetic data more closely reflect the patterns of your original dataset.


Privacy and Security: These are central to the purpose of synthetic data. While synthetic data aims to mimic the patterns of real datasets, it must do so without exposing any identifiable or sensitive information. A well-generated synthetic dataset should never reproduce exact records or rare attribute combinations that could point back to real individuals. Evaluating privacy means checking for potential re-identification risks, verifying that no synthetic records match the original data too closely, and ensuring that statistical patterns are preserved without memorizing personal details. This protects individuals’ confidentiality while allowing teams to safely use the data for analytics, testing, and collaboration, supporting both regulatory compliance and responsible data stewardship.

The core privacy checks in SAS Data Maker are density disclosure and presence disclosure. These metrics gauge how much risk there is of identifying an individual within the dataset, ensuring that synthetic data remains safe and cannot be traced back to real users.


    • Density Disclosure- This metric estimates the risk that someone could, in theory, link synthetic data points back to real individuals, an idea sometimes called reversibility. Unlike traditional anonymization techniques, SAS Data Maker does not create synthetic data by applying a direct transformation to each real record, an approach that a skilled attacker can often reverse. Instead, the density disclosure metric measures privacy risk by looking at how many real data points fall within the “neighborhood” of each synthetic data point. The intuition is simple: if a synthetic record has no nearby real records, or many of them, an attacker cannot reliably link it back to any specific individual. With no real neighbor, there is nothing to map to; with too many neighbors, the mapping becomes too vague to be meaningful. This built-in ambiguity, often referred to as plausible deniability, helps ensure that no synthetic data point can be confidently traced back to a real person, strengthening the overall privacy protection. Use this metric especially when synthetic data will be shared with external parties or any group that isn’t fully trusted. A density disclosure value close to 1 indicates a low risk of disclosure. (A sketch of the neighborhood idea appears after this list.)
    • Presence Disclosure- is closely aligned with the principles of differential privacy. In simple terms, differential privacy ensures that the inclusion or exclusion of any one person’s data has only a minimal impact on the results of any analysis. Presence disclosure follows the same spirit: it measures how confidently an attacker could determine whether a specific individual’s data was included in the training set used to generate the synthetic data. If the synthetic data inadvertently carries identifiable patterns or features that reveal whether someone’s record was part of the original dataset, the presence disclosure risk increases. Evaluate presence disclosure any time synthetic data will be shared with external parties or anyone who shouldn’t have full trust or access to the underlying real data. SAS Data Maker estimates presence disclosure risk by simulating what a motivated attacker might try to do. It assumes the adversary has access to the full synthetic dataset and a collection of data points they want to classify as either present or absent in the original training data. To make this determination, the attacker compares each data point to its closest match in the synthetic dataset using Hamming distance. If the closest synthetic point is within a certain distance threshold, the attacker guesses that the original point was included in the training set. To measure how well this attack would work in practice, SAS Data Maker tests the classifier on a balanced mix of real training data points and points from a separate test dataset that was not used during training. By checking how accurately the classifier can tell these two groups apart, and by repeating the evaluation across multiple threshold values, SAS Data Maker builds a complete picture of the potential risk. The final presence disclosure score is the average performance across all these threshold settings, giving a reliable indicator of how likely it is that an adversary could infer whether someone’s data was part of the training set. A value close to 1 for the presence disclosure metric indicates lower risk, meaning it is very unlikely that anyone can deduce whether a given data point was in the training set. (The second sketch after this list simulates this attack.)
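
To make the neighborhood intuition behind density disclosure concrete, here is a minimal counting sketch, assuming numeric NumPy arrays and scikit-learn; it illustrates the idea only, and SAS Data Maker’s actual score is computed from this kind of evidence in its own way. The array names and radius value are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_counts(real: np.ndarray, synth: np.ndarray,
                        radius: float) -> np.ndarray:
    """For each synthetic point, count real points within `radius`."""
    nn = NearestNeighbors(radius=radius).fit(real)
    neighbors = nn.radius_neighbors(synth, return_distance=False)
    return np.array([len(idx) for idx in neighbors])

# Zero neighbors: nothing to map to. Many neighbors: the mapping is too
# ambiguous to single anyone out. The risky synthetic points are the ones
# with exactly one (or very few) real neighbors.
# counts = neighborhood_counts(real_X, synth_X, radius=0.1)
# risky_fraction = float(np.mean(counts == 1))
```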
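And here is a sketch of the threshold attack described for presence disclosure, assuming categorical data encoded as integer NumPy arrays so Hamming distance applies. One hedge on the return value: this sketch reports raw attacker accuracy, where roughly 0.5 means the attacker is only guessing, whereas SAS Data Maker reports a 0-to-1 score in which values near 1 mean low risk. The threshold values are illustrative.

```python
import numpy as np

def min_hamming(points: np.ndarray, synth: np.ndarray) -> np.ndarray:
    """Normalized Hamming distance from each point to its nearest
    synthetic record (fraction of columns that differ)."""
    return np.array([float(((p != synth).mean(axis=1)).min()) for p in points])

def attacker_accuracy(train: np.ndarray, holdout: np.ndarray,
                      synth: np.ndarray,
                      thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)) -> float:
    """Average accuracy of the 'was this record in the training set?'
    classifier over several distance thresholds."""
    d_train = min_hamming(train, synth)   # these points WERE in training
    d_hold = min_hamming(holdout, synth)  # these points were NOT
    accs = []
    for t in thresholds:
        # Guess "in training" when the nearest synthetic record is close.
        correct = (d_train <= t).sum() + (d_hold > t).sum()
        accs.append(correct / (len(train) + len(holdout)))
    return float(np.mean(accs))
```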


Utility & Usability: When it comes to synthetic data, utility is all about how useful the data is for real-world tasks like analytics, reporting, or machine learning. High utility means the synthetic dataset preserves the patterns and relationships that matter, so teams can confidently run models or reports without losing accuracy. Usability, on the other hand, focuses on how easy the data is to work with. A usable synthetic dataset should feel just like the original: structured, consistent, and ready to plug into existing workflows. Together, utility and usability ensure that synthetic data isn’t just safe and private, but also practical and effective for everyday use.

Statistical evaluation is the bridge between synthetic data and trust. With SAS Data Maker, metrics like histogram similarity, mutual information, and degree distribution give teams a clear, visual way to confirm that synthetic datasets are both faithful to the original and safe to use. By balancing fidelity, privacy, and utility, organizations can unlock the full potential of synthetic data, confidently applying it to analytics, machine learning, and collaboration without compromising security.
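
As a practical illustration of utility, one widely used general-purpose check (not one of the SAS Data Maker metrics above) is “train on synthetic, test on real”: fit a model on the synthetic table and score it on held-out real data. A minimal sketch, assuming scikit-learn, a binary target, and hypothetical feature and target arrays:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(synth_X, synth_y, real_X_test, real_y_test) -> float:
    """Train on synthetic data, test on real data (TSTR)."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(synth_X, synth_y)
    return roc_auc_score(real_y_test, model.predict_proba(real_X_test)[:, 1])

# If this AUC is close to that of the same model trained on real data,
# the synthetic table supports the modeling task well.
```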


The Bottom Line

At its core, synthetic data goes beyond safeguarding sensitive information—it’s a catalyst for innovation. With robust evaluation tools like those in SAS Data Maker, you can ensure your synthetic datasets are not only statistically reliable but also practical, user‑friendly, and primed to fuel the next generation of insights.

Find more articles from SAS Global Enablement and Learning here.
