
Use Differential Privacy Thoughtfully — Avoid Bias in Synthetic Data

Started ‎07-09-2025 by
Modified ‎07-15-2025 by

As synthetic data becomes foundational to enterprise AI, ensuring privacy is non-negotiable. Differential privacy (DP) is a powerful tool that adds calibrated random noise, either to the data or to the process that learns from it, so that no individual record can be singled out. But while it excels at safeguarding information, it can unintentionally skew the quality and fairness of your synthetic datasets. At SAS, we advocate a balanced approach: privacy should never come at the expense of equity or utility. 

 

The Hidden Tradeoff: Fairness and Representation  

When applied without nuance, differential privacy can disproportionately impact underrepresented groups. These groups are already sparsely represented in real-world data, so the same amount of noise that barely affects a large group can swamp the signal from a small one, making it harder to train equitable models. 

In practice, this can mean: 

  • Synthetic datasets that underrepresent minority classes 
  • Machine learning models with reduced accuracy on vulnerable populations 
  • Risk of unfair or biased outcomes downstream in deployment 
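
To make the mechanism behind this concrete, here is a minimal Python sketch using the Laplace mechanism on a simple count. This is an illustration only: the generators discussed in this article are far more elaborate, and the group sizes, function names, and epsilon values below are assumptions chosen for the example.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def mean_relative_error(true_count, epsilon, trials=10_000, seed=0):
    """Average |noisy - true| / true over many simulated releases."""
    rng = np.random.default_rng(seed)
    noisy = true_count + rng.laplace(0.0, 1.0 / epsilon, size=trials)
    return float(np.mean(np.abs(noisy - true_count)) / true_count)

if __name__ == "__main__":
    # Hypothetical group sizes, chosen only for illustration.
    majority, minority = 9_500, 500
    rng = np.random.default_rng(42)
    print("one noisy minority count at epsilon=0.1:",
          laplace_count(minority, 0.1, rng))
    for eps in (1.0, 0.1, 0.01):
        print(f"epsilon={eps}: "
              f"majority rel. error={mean_relative_error(majority, eps):.4f}, "
              f"minority rel. error={mean_relative_error(minority, eps):.4f}")
```

The noise scale depends only on epsilon, not on group size, so the expected absolute error is the same for both groups. Relative to its own size, the small group is distorted far more, and the gap grows as epsilon shrinks.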

 

What the Research Shows  

Recent studies have benchmarked state-of-the-art DP generators (PrivBayes, DP-WGAN, and PATE-GAN) on tabular and image datasets. The results? Consistent bias against minority groups, particularly as the privacy budget (epsilon) shrinks, which means a stronger privacy guarantee and more noise. 
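
For readers who want the formal meaning of epsilon, the standard textbook definition of differential privacy is given below. It is not spelled out in the studies summarized here, and the symbols are the conventional ones rather than notation from this article: a randomized mechanism M is epsilon-differentially private if, for any two datasets D and D' that differ in a single record and any set of outputs S,

```latex
\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\]
```

A smaller epsilon forces the two output distributions closer together, which is why lowering it strengthens privacy and requires more noise. Some generators rely on a relaxed variant that adds a small additive term delta to the right-hand side.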

 

Three Key Questions 

  1. Do DP generators preserve subgroup proportions? 
    No. Across models, subgroup proportions shift, and the direction depends on the generator. PATE-GAN, for example, widens the gap between large and small groups (a "Matthew Effect"), while PrivBayes redistributes samples more evenly (a "Robin Hood Effect"). 
  2. Does training on DP synthetic data reduce model accuracy on minority classes? 
    Yes. Underrepresented groups see sharper drops in performance. Alarmingly, even majority groups resembling minority classes experience unexpected accuracy loss. 
  3. Do all DP generators behave the same across scenarios? 
    No. Each reacts differently to dataset composition and privacy settings. PATE-GAN may fail to learn certain subgroups under high imbalance. PrivBayes, on the other hand, maintains utility across more settings but changes the class distribution. 

Real-World Example: The Texas Hospital Dataset  

Let’s consider a real use case: predicting hospital stays longer than one week. This underrepresented class forms only ~20% of the data.  

Figure: Top: class proportions change with decreasing epsilon. Bottom: accuracy drops are steeper for the minority class as privacy increases.

 

When synthetic datasets are generated using different DP models: 

Class Distribution Effects (Top Graphs): 

  • PrivBayes reduces the imbalance (more long-stay patients appear in synthetic data). 
  • PATE-GAN worsens it (fewer long-stay patients). 
  • DP-WGAN holds the ratio steady. 
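
A simple way to check for this kind of shift in your own data is to compare class shares before and after synthesis. The sketch below assumes you have the label column from the real data and from a synthetic sample. The function names and the toy labels are hypothetical and do not come from the Texas dataset or from any SAS product.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def proportion_shift(real_labels, synthetic_labels):
    """Synthetic share minus real share per class.

    Positive values mean the class is over-represented in the
    synthetic data; negative values mean it is under-represented.
    """
    real = class_proportions(real_labels)
    synth = class_proportions(synthetic_labels)
    classes = set(real) | set(synth)
    return {cls: synth.get(cls, 0.0) - real.get(cls, 0.0) for cls in classes}

if __name__ == "__main__":
    # Toy labels: 20% long stays in the "real" data.
    real = ["long"] * 20 + ["short"] * 80
    reduced_imbalance = ["long"] * 35 + ["short"] * 65   # imbalance shrinks
    worsened_imbalance = ["long"] * 8 + ["short"] * 92   # imbalance grows
    print(proportion_shift(real, reduced_imbalance))
    print(proportion_shift(real, worsened_imbalance))
```

Running the same comparison at several epsilon values shows whether a generator drifts toward or away from balance as privacy tightens.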

 

Model Accuracy on Minority Class (Bottom Graphs): 

  • All three generators lead to larger accuracy drops for long-stay patients than for short-stay patients. 
  • The most vulnerable patients become harder to model fairly. 
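
One way to quantify this gap is to score each class separately rather than relying on overall accuracy, which the majority class dominates. The sketch below computes per-class recall, taken here as a stand-in for "accuracy on a class", for a model trained on real data and a model trained on synthetic data, both evaluated on the same real test labels. All names and the toy predictions are assumptions made for illustration.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (1 if truth == pred else 0)
    return {cls: hits[cls] / totals[cls] for cls in totals}

def recall_drop(y_true, pred_real_model, pred_synth_model):
    """Per-class recall lost when training on synthetic instead of real data."""
    base = per_class_recall(y_true, pred_real_model)
    synth = per_class_recall(y_true, pred_synth_model)
    return {cls: base[cls] - synth[cls] for cls in base}

if __name__ == "__main__":
    # Toy test set: 4 long stays, 16 short stays.
    y_true = ["long"] * 4 + ["short"] * 16
    # Hypothetical predictions from a model trained on real data.
    pred_real = ["long", "long", "long", "short"] + ["short"] * 15 + ["long"]
    # Hypothetical predictions from a model trained on DP synthetic data.
    pred_synth = ["long", "short", "short", "short"] + ["short"] * 14 + ["long"] * 2
    print(recall_drop(y_true, pred_real, pred_synth))
```

In this toy case the minority class loses far more recall than the majority class, even though overall accuracy moves much less.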

 

A Balanced Approach  

Privacy should be a shield, not a blindfold. SAS Data Maker enables responsible innovation by pairing strong privacy protections with the tools to preserve fairness and accuracy. In regulated, high-stakes environments, this balance is critical. 

 

SAS Data Maker—coming in Q3 to Microsoft Azure—empowers you to: 

  • Fine-tune epsilon levels with transparency 
  • Monitor fairness metrics across subgroups 
  • Validate utility using multiple performance indicators 
Comments

Enlightening article regarding how hyper focus on privacy can be detrimental to bias against underrepresented groups. Thanks for sharing the insights @harry_keen 
