Use Differential Privacy Thoughtfully — Avoid Bias in Synthetic Data


As synthetic data becomes foundational to enterprise AI, ensuring privacy is non-negotiable. Differential privacy (DP) is a powerful tool that introduces noise into data to protect individual identities. But while it excels at safeguarding information, it can unintentionally skew the quality and fairness of your synthetic datasets. At SAS, we advocate a balanced approach: privacy should never come at the expense of equity or utility.

The Hidden Tradeoff: Fairness and Representation

When applied without nuance, differential privacy can disproportionately impact underrepresented groups. These groups already suffer from reduced visibility in real-world data. Injecting noise can further obscure their presence—making it harder to train equitable models.

In practice, this can mean:

Synthetic datasets that underrepresent minority classes

Machine learning models with reduced accuracy on vulnerable populations

Risk of unfair or biased outcomes downstream in deployment
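Why small groups are hit hardest can be seen in a minimal sketch, assuming the textbook Laplace mechanism applied to a simple counting query. This is an illustration only: the generators discussed in this article use more elaborate mechanisms, and the group sizes (9,000 and 50) and function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def mean_relative_error(true_count, epsilon, trials=10_000, seed=0):
    """Average |noise| / true_count for a count query with sensitivity 1."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon  # Laplace mechanism: scale = sensitivity / epsilon
    total = sum(abs(laplace_noise(scale, rng)) for _ in range(trials))
    return total / trials / true_count


if __name__ == "__main__":
    for eps in (1.0, 0.1):
        majority = mean_relative_error(9_000, eps)  # hypothetical large group
        minority = mean_relative_error(50, eps)     # hypothetical small group
        print(f"epsilon={eps}: majority {majority:.4%}, minority {minority:.4%}")
```

The absolute noise is the same for both groups, so the relative distortion scales inversely with group size. At epsilon = 0.1 the expected error is about 10 records, which is negligible against 9,000 but roughly 20% of a group of 50.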

What the Research Shows

Recent studies have benchmarked state-of-the-art DP generators (PrivBayes, DP-WGAN, and PATE-GAN) on tabular and image datasets. The results? Consistent distortion of minority-group representation and accuracy, particularly as the privacy budget (epsilon) tightens, that is, as epsilon gets smaller and more noise is added.

Three Key Questions

Do DP generators preserve subgroup proportions?
No. Across models, subgroup proportions shift, and the direction depends on the generator. PATE-GAN, for example, widens existing disparities (a "Matthew Effect"), while PrivBayes evens out the classes (a "Robin Hood Effect").
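A quick way to check for either effect is to compare each subgroup's share in the real and synthetic data. The sketch below is a minimal, generator-agnostic check; the function names and the toy label lists are assumptions for illustration, not output from any of the models named above.

```python
from collections import Counter


def proportions(labels):
    """Share of each subgroup label in a list of labels."""
    counts = Counter(labels)
    return {group: n / len(labels) for group, n in counts.items()}


def proportion_shift(real_labels, synth_labels):
    """Synthetic share minus real share, per subgroup."""
    real = proportions(real_labels)
    synth = proportions(synth_labels)
    groups = sorted(set(real) | set(synth))
    return {g: synth.get(g, 0.0) - real.get(g, 0.0) for g in groups}


if __name__ == "__main__":
    # Toy data: an 80/20 split in the real data (hypothetical).
    real = ["majority"] * 80 + ["minority"] * 20
    evened_out = ["majority"] * 70 + ["minority"] * 30  # minority share grows
    widened = ["majority"] * 90 + ["minority"] * 10     # minority share shrinks
    print(proportion_shift(real, evened_out))
    print(proportion_shift(real, widened))
```

A positive shift for the minority group points toward a Robin Hood style redistribution, a negative one toward a Matthew style amplification. A group missing from the synthetic data shows up with a shift equal to minus its real share.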

Does training on DP synthetic data reduce model accuracy on minority classes?
Yes. Underrepresented groups see sharper drops in performance. Alarmingly, even majority groups resembling minority classes experience unexpected accuracy loss.

Do all DP generators behave the same across scenarios?
No. Each reacts differently to dataset composition and privacy settings. PATE-GAN may fail to learn certain subgroups under high imbalance. PrivBayes, on the other hand, maintains utility across more settings but changes class distribution.

Real-World Example: The Texas Hospital Dataset

Let’s consider a real use case: predicting hospital stays longer than one week. This underrepresented class forms only ~20% of the data.

Figure: Top: class proportions change with decreasing epsilon. Bottom: accuracy drops are steeper for the minority class as privacy increases.

When synthetic datasets are generated using different DP models:

Class Distribution Effects (Top Graphs):

PrivBayes reduces the imbalance (more long-stay patients appear in synthetic data).
PATE-GAN worsens it (fewer long-stay patients).
DP-WGAN holds the ratio steady.

Model Accuracy on Minority Class (Bottom Graphs):

All three generators lead to larger accuracy drops for long-stay patients than for the majority class.
The most vulnerable patients become harder to model fairly.
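Gaps like these are easy to miss when only overall accuracy is reported. A minimal sketch of a per-class check follows; the function name and the toy labels are hypothetical and are not drawn from the Texas dataset.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model predicted correctly."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Toy example: 8 short stays, 2 long stays (hypothetical labels).
    y_true = ["short"] * 8 + ["long"] * 2
    # The model misses one of the two long stays.
    y_pred = ["short"] * 8 + ["short", "long"]

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print("overall accuracy:", accuracy)
    print("recall by class:", per_class_recall(y_true, y_pred))
```

In this toy case overall accuracy is 0.9 while recall on long stays is only 0.5, which is why the per-class view matters when comparing models trained on real and synthetic data.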

A Balanced Approach

Privacy should be a shield, not a blindfold. SAS Data Maker enables responsible innovation by pairing strong privacy protections with the tools to preserve fairness and accuracy. In regulated, high-stakes environments, this balance is critical.

SAS Data Maker—coming in Q3 to Microsoft Azure—empowers you to:

Fine-tune epsilon levels with transparency
Monitor fairness metrics across subgroups
Validate utility using multiple performance indicators

BrettWujek · 08-04-2025

Enlightening article on how a hyper-focus on privacy can worsen bias against underrepresented groups. Thanks for sharing the insights @harry_keen