
Comparing SMOTE, MST, and Bayesian Networks for Synthetic Data Generation in SAS Data Maker


 

Part 2 of an ongoing series on synthetic data generation methods in SAS Data Maker

 

If you read my earlier post on using the SMOTE method in SAS Data Maker, you may recall that I started out with an unbalanced 40,000-observation dataset and discovered that SMOTE generated synthetic data that looked exactly as unbalanced as the data I put in. The fix turned out to be obvious in hindsight: feed SMOTE only the minority class. The lesson was as much about understanding what a method does as it was about using the software correctly.

 

That experience got me thinking about the philosophy behind different synthetic data generation methods, and I fell into a rabbit hole of learning more. So come into the rabbit hole with me! Both SMOTE and Bayesian network approaches are available in SAS Data Maker, SAS's no-code/low-code platform for generating privacy-preserving synthetic tabular data, but they are built on fundamentally different premises about what it means to "understand" a dataset well enough to generate new data from it. This post compares those two families of methods, examines their respective strengths and challenges, and considers when you might choose one over the other. It also lays the groundwork for a follow-up post that will go deeper into the two marginal model-based methods available in SAS Data Maker specifically: MST and PrivBayes.

 

 

A Bit of Context: How SAS Data Maker Was Born

 

SAS Data Maker is SAS's no-code synthetic data platform, available on Microsoft Azure Marketplace, designed to generate high-fidelity, privacy-preserving data at scale. The platform supports a growing portfolio of synthesis methods, and with SAS's November 2024 acquisition of Hazy,15 a recognized leader in enterprise synthetic data, that portfolio expanded considerably. Hazy's technology brings with it a rich lineage of data generation methods, including variants of the approaches we will discuss. SAS has continued to broaden and develop the features and functionality in SAS Data Maker, with exciting new updates coming soon. There are currently three choices of synthesis method: the synthetic minority over-sampling technique (SMOTE), and two differential privacy methods based on marginal models, namely the maximum spanning tree method (MST) and private data release via Bayesian networks (PrivBayes).

 

 

Why Method Choice Matters

 

SAS Data Maker is designed to make synthetic data generation accessible, using trusted SAS algorithms to create synthetic datasets from existing data.1 The platform provides automated evaluation metrics so you can assess whether the synthetic data faithfully captures patterns and relationships from the original, and how well the resulting sample preserves privacy of the source data. What it cannot do is make the choice of method for you. You might choose a different method depending on whether you are trying to oversample a rare event, generate a privacy-safe training dataset for a downstream model, or produce a privacy-safe shareable synthetic dataset for external collaborators in a regulated industry.

 

The synthetic generation methods for tabular data in SAS Data Maker fall broadly into two philosophical families. The first family, which includes SMOTE, works at the local level: it reasons about individual observations and their geometric relationships to one another in feature space. The second family, which includes Bayesian network methods, works at the global level: it tries to learn the entire joint distribution of variables in the dataset and then sample from that distribution. (There is another popular family of methods, based on generative adversarial networks, or GANs, that is more flexible with regard to the types of data it suits; you can code those up yourself in SAS Viya.)

 

Each family of methods has a different conception of what makes synthetic data "good," and so they have different strengths and challenges. A useful overview of how these and related methods differ in scope and mechanism is provided by Tracanella (2024).2

 

 

SMOTE: Interpolation in Feature Space

 

SMOTE, the Synthetic Minority Over-sampling Technique, was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in a 2002 paper published in the Journal of Artificial Intelligence Research.3 The motivating problem for SMOTE is class imbalance in supervised learning: when one class has far fewer observations than another, most classifiers will be biased toward predicting the majority class. SMOTE addresses this by generating new synthetic observations in the neighborhood of existing minority class members.

 

The algorithm is straightforward. For each observation in the minority class, SMOTE identifies its k nearest neighbors (typically k = 5). It then selects one of those neighbors at random and creates a new synthetic observation by linear interpolation between the original point and that neighbor. If x_i is the observation of interest and x_{nn} is a selected neighbor, the synthetic point x_{syn} is:

 

x_{syn} = x_i + λ · (x_{nn} − x_i)

 

where λ is drawn uniformly from [0, 1]. The result is a point that lies somewhere along the line segment connecting x_i and x_{nn} in the feature space.
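
 

The interpolation step can be sketched in a few lines of Python with NumPy. This is a minimal illustration of the formula, not the implementation inside SAS Data Maker: the function name `smote_sample`, the toy data, and the choice to pick base points at random (rather than looping over every minority observation) are all assumptions made for brevity.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances; a point is never its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)         # which x_i to start from
    pick = rng.integers(0, k, size=n_new)         # which of its k neighbors to use
    nn = neighbors[base, pick]                    # index of x_nn
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))  # one lambda per synthetic row

    # x_syn = x_i + lambda * (x_nn - x_i)
    return X[base] + lam * (X[nn] - X[base])

# Toy minority class: 30 points in two continuous features (made-up data).
rng = np.random.default_rng(42)
minority = rng.normal(size=(30, 2))
synthetic = smote_sample(minority, k=5, n_new=200)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real rows, no synthetic value can fall outside the range of the real minority data in any feature.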

 

This approach respects the local geometry of the data and tends to produce synthetic points that fill in the “gaps” between existing minority class observations rather than simply duplicating them. For continuous variables, the interpolation can produce values that do not appear anywhere in the original data. The method is also easy to explain.

 

That said, SMOTE has real limitations, several of which I bumped into firsthand. The most important is that SMOTE does not model the joint distribution of the full dataset. It does not know or care about the relationships among variables that are not directly part of the interpolation. If your dataset has complex correlations among many variables, SMOTE will respect only the local geometry of the feature space in the neighborhood of each observation; it will not ensure that the synthetic data preserves, say, the correlation between income and credit score that exists in your real data at a population level.

 

Another limitation is that SMOTE is designed almost exclusively for the oversampling use case. As I discovered, feeding the full dataset into SMOTE produces synthetic data that mirrors the full distribution, imbalance and all. Luckily, the developers of SAS Data Maker have handled this issue for you. You can either load in data for just the minority class, or you can choose to filter the generated data on the rare event indicator variable.

 

SMOTE is not designed to accommodate sequential data. Only the geometric similarity of observations is used, and the method has no mechanism to directly model the correlational structure inherent in sequences.

 

Finally, there is no privacy guarantee associated with SMOTE in a formal sense. Synthetic points lie on line segments between real observations. An adversary might in principle infer something about the real records that were used to generate those segments.16 Organizations in regulated industries where formal privacy guarantees are required need to be aware of this property.

 

PrivBayes and MST: Learning the Joint Distribution

 

Marginal model-based methods for synthetic data (PrivBayes and MST, in SAS Data Maker) take a different approach from SMOTE. Where SMOTE interpolates between individual observations, marginal model-based methods attempt to learn the full joint probability distribution of the dataset and then draw new samples from that learned distribution. This gives them a better ability to mimic the source distributions, at the cost of greater computational demands.

 

One such method is a Bayesian network, so we'll illustrate with that. A Bayesian network is a directed acyclic graph (DAG) in which each node represents a variable and each directed edge represents a conditional dependence relationship between variables. The joint distribution over all variables X_1, X_2, …, X_d can be factored as:6

 

P(X_1, X_2, …, X_d) = ∏_{i=1}^{d} P(X_i | Pa(X_i))

 

where Pa(X_i) denotes the parent nodes of X_i in the graph. This factorization allows a high-dimensional joint distribution to be approximated through a set of lower-dimensional conditional distributions, which is computationally manageable in ways that directly estimating a d-dimensional joint distribution is not.
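
 

To put rough numbers on that saving, here is a small back-of-the-envelope sketch in Python. It assumes every variable is binary and that each node has at most two parents; both are illustrative assumptions, and the function names are hypothetical.

```python
def full_joint_params(d, levels=2):
    """Free parameters in an unrestricted joint table over d variables."""
    return levels ** d - 1

def factored_params(d, max_parents, levels=2):
    """Upper bound on free parameters when each node has at most max_parents parents."""
    total = 0
    for i in range(d):
        parents = min(i, max_parents)  # node i can only use earlier nodes as parents
        total += (levels - 1) * levels ** parents  # one distribution per parent configuration
    return total

for d in (5, 10, 20):
    print(d, full_joint_params(d), factored_params(d, max_parents=2))
```

With 20 binary variables, the unrestricted joint table has over a million free parameters, while the factored form with at most two parents per node needs 75.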

 

For synthetic data generation, the Bayesian network structure and the associated conditional distributions are learned from the training data. Once learned, new synthetic records are generated by ancestral sampling: start with root nodes (those with no parents), sample from their marginal distributions, and then sample each subsequent variable from its conditional distribution given its already-sampled parents. The result is a full synthetic record that respects the learned dependency structure of the original data.7
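
 

Ancestral sampling is easy to see in code. The sketch below uses a made-up three-variable chain (age_band → income_band → default) with hand-written conditional probability tables. In a real workflow both the structure and the probabilities would be learned from data; here they are hypothetical values chosen only to show the sampling order.

```python
import numpy as np

# Hypothetical network. Each CPT maps a tuple of parent values to a
# probability vector over the node's own levels (0 or 1).
network = {
    "age_band":    {"parents": [],              "cpt": {(): [0.3, 0.7]}},
    "income_band": {"parents": ["age_band"],    "cpt": {(0,): [0.8, 0.2], (1,): [0.4, 0.6]}},
    "default":     {"parents": ["income_band"], "cpt": {(0,): [0.85, 0.15], (1,): [0.97, 0.03]}},
}
# A topological order: every node appears after all of its parents.
order = ["age_band", "income_band", "default"]

def sample_record(network, order, rng):
    """Draw one synthetic record, sampling parents before children."""
    record = {}
    for node in order:
        spec = network[node]
        parent_values = tuple(record[p] for p in spec["parents"])
        probs = spec["cpt"][parent_values]
        record[node] = int(rng.choice(len(probs), p=probs))
    return record

rng = np.random.default_rng(0)
synthetic = [sample_record(network, order, rng) for _ in range(5000)]

# Implied marginal: P(income_band = 1) = 0.3 * 0.2 + 0.7 * 0.6 = 0.48
share_high_income = np.mean([r["income_band"] for r in synthetic])
share_default = np.mean([r["default"] for r in synthetic])
print(round(share_high_income, 3), round(share_default, 3))
```

No real record is ever copied or interpolated here: each synthetic row is drawn entirely from the learned (in this toy case, hand-written) probabilities.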

 

Two marginal model-based methods are available in SAS Data Maker: PrivBayes (which uses a Bayesian network) and maximum spanning tree (MST). Both operate within this general framework but differ substantially in how they build the network structure and how they handle privacy. That comparison is rich enough to deserve its own dedicated post, and I will address it in depth next time (stay tuned for another rabbit hole!). For now, what matters is that both share the fundamental characteristic of marginal model-based methods: they are trying to capture the global structure of your dataset, not just the local geometry of individual observations.

 

 

The Privacy Dimension

 

If you work in a regulated industry or if you deal with sensitive research data, the privacy of synthetic data requires special consideration.

 

As I mentioned above, SMOTE provides no formal privacy guarantee. The synthetic points it generates are, by construction, interpolations among real data points. This means that in principle, real records contribute directly to synthetic ones. There is no mathematical framework for bounding how much someone with access to the synthetic data could learn about the real data. You might still end up with excellent privacy metrics, but that’s a side effect and not the main objective of SMOTE.

 

The marginal model methods in SAS Data Maker (MST and PrivBayes) are designed with differential privacy in mind. Differential privacy is a framework for privacy protection, introduced by Dwork, McSherry, Nissim, and Smith in a 2006 paper that has since become one of the more influential contributions in the field.8 The core idea is that the output of an algorithm should be essentially the same regardless of whether any single individual’s record is included in or excluded from the dataset. Formally, a randomized mechanism M satisfies ε-differential privacy if for any two datasets D and D’ that differ in at most one record, and for any set of possible outputs S:

 

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D’) ∈ S]

 

The parameter ε (epsilon) is the privacy parameter. A smaller epsilon means stronger privacy, because it means the output distributions for neighboring datasets are closer together, making it harder to infer anything about individual records. The practical tradeoff is that achieving strong privacy requires adding noise to the learned distributions, which reduces the statistical fidelity of the synthetic data. This epsilon-fidelity tradeoff is one of the central challenges in privacy-preserving synthetic data generation, and choosing the right epsilon for your application depends on the sensitivity of the underlying data and the downstream use case.9 A useful real-world anchor: when the U.S. Census Bureau adopted differential privacy for the 2020 decennial census, it added noise calibrated to this same epsilon framework, protecting respondents while preserving the demographic utility of aggregate statistics.10
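
 

The tradeoff can be seen with the classic Laplace mechanism applied to a single one-way marginal (a histogram of counts). This is a simplified sketch and not how MST or PrivBayes are implemented in SAS Data Maker: the counts are made up, the function name `noisy_marginal` is hypothetical, and the sensitivity of 1 assumes that neighboring datasets differ by adding or removing one record.

```python
import numpy as np

def noisy_marginal(counts, epsilon, rng):
    """Release a histogram as probabilities after adding Laplace noise."""
    sensitivity = 1.0  # assumption: one record changes exactly one cell count by 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    noisy = np.clip(counts + noise, 0.0, None)  # post-processing: no negative counts
    return noisy / noisy.sum()                  # normalize to a probability vector

rng = np.random.default_rng(1)
counts = np.array([9500.0, 400.0, 90.0, 10.0])  # hypothetical marginal with a rare category
true_probs = counts / counts.sum()

for eps in (0.01, 0.1, 1.0, 10.0):
    errors = [np.abs(noisy_marginal(counts, eps, rng) - true_probs).sum()
              for _ in range(200)]
    print(f"epsilon={eps:<5} mean L1 error={np.mean(errors):.5f}")
```

The noise scale is sensitivity / ε, so a tenfold decrease in ε means roughly tenfold more noise. The rare category (10 records) is the first to be swamped, which is why small-epsilon synthetic data tends to lose rare patterns first.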

 

For finance and insurance practitioners who must document compliance with regulations like GDPR, HIPAA, or GLBA, differential privacy offers a quantifiable, auditable privacy guarantee.10,11 The ability to state that the synthetic data was generated with a specific epsilon value and explain what that means mathematically is a useful compliance artifact. For academic researchers working with human subjects data, the same properties are relevant for IRB compliance and data sharing agreements.

 

Comparative Strengths: What Each Method Does Well

 

SMOTE is well-suited to situations where the primary goal is to address class imbalance for a supervised learning task, where the number of minority observations is not extremely small (SMOTE needs enough real points to find meaningful neighbors), and where formal privacy guarantees are not required. For a data scientist building a fraud detection model who needs more fraud examples to train on, and who has clean continuous transaction features to work with, SMOTE is fast and interpretable.3

 

PrivBayes and MST are better suited to situations where the goal is to generate a complete, statistically faithful synthetic version of a multivariate dataset, where the data contains a mix of continuous and categorical variables (as is common in insurance, healthcare, and consumer finance), where the synthetic data will be shared externally or used in contexts that require formal privacy documentation, and where capturing the associations among variables is important. If a bank wants to share a synthetic version of its loan portfolio with a regulatory sandbox, or a university wants to create a training dataset from electronic health records for a machine learning course, these marginal model-based methods are a much more appropriate choice.

 

Published benchmark evidence supports this. Empirical studies have shown that marginal-based mechanisms, which include both MST and PrivBayes, routinely outperform GAN-based methods on tabular data when the evaluation criterion is fidelity to low-dimensional marginals.12 McKenna, Miklau, and Sheldon’s winning entry to the 2018 NIST Differential Privacy Synthetic Data Challenge was based on the MST approach, providing a high-profile demonstration of what these methods can achieve when properly applied.13

 

 

Challenges of Each Method

 

Neither method is without weaknesses, and honest representation of those weaknesses can help you decide which method, if any, you want to employ.

 

SMOTE can produce unrealistic synthetic observations in regions of feature space where the real data is sparse, and it may inadvertently generate points that violate known domain constraints (values that fall in between existing data values but are not reasonable in reality). These issues require post-processing or domain-specific constraints that are not part of the basic algorithm. When the absolute number of minority records is very low, SMOTE’s ability to increase diversity is also substantially curtailed.5
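
 

A tiny worked example shows how a domain constraint gets violated. The two records and the value of λ below are made up; the rounding step at the end is one possible post-processing fix, not something the basic SMOTE algorithm does for you.

```python
import numpy as np

# Hypothetical minority-class records: [num_dependents, has_mortgage (0/1), income]
x_i = np.array([1.0, 0.0, 42_000.0])
x_nn = np.array([3.0, 1.0, 58_000.0])

lam = 0.37  # one possible draw of lambda from [0, 1]
x_syn = x_i + lam * (x_nn - x_i)
print(x_syn)  # 1.74 dependents and a mortgage flag of 0.37 are not valid values

# One possible repair (an assumption): round the integer and binary
# columns back to valid values and leave the continuous column alone.
x_fixed = x_syn.copy()
x_fixed[:2] = np.round(x_fixed[:2])
print(x_fixed)
```

The interpolated income of 47,920 is perfectly plausible, but the count and the binary flag land on values that cannot occur in real data.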

 

Bayesian network methods can struggle with high-dimensional data, because the number of possible graph structures grows super-exponentially with the number of variables. The differentially private versions add noise to the learned distributions, which can degrade the fidelity of the synthetic data, especially for rare patterns. And as McKenna, Miklau, and Sheldon showed, the performance of these methods depends heavily on the method used to select which marginals to measure and how to allocate the privacy budget across them.13 Getting these choices right is not always straightforward for practitioners.
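
 

To get a feel for how fast the space of graph structures grows, the number of possible DAGs on n labeled nodes can be computed with Robinson's recurrence. The short Python sketch below is only an illustration of the search-space size; it says nothing about how any particular product searches that space.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 11):
    print(n, num_dags(n))
```

Three variables admit 25 possible DAGs, four admit 543, and five admit 29,281. Exhaustive search is out of the question well before a dataset reaches a typical number of columns, which is why structure learning relies on greedy or otherwise restricted searches.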

 

It is also worth noting that your privacy budget epsilon should be set with care; there is a trade-off between accuracy and privacy. This does not undermine the value of differential privacy, but it does reinforce the importance of using reasonable values of epsilon, and of looking at your evaluation metrics after training a model.

 

 

In Summary

 

SAS Data Maker brings both SMOTE and Bayesian network methods together in a single, accessible platform.1 But the platform cannot make the method-selection decision for you, and understanding the conceptual differences between these approaches enables you to get the most out of your synthetic data generation.

 

The key contrast is that SMOTE is a local interpolation method designed for minority class oversampling. It is fast, interpretable, and effective for that narrow use case, but it provides no privacy guarantee and does not attempt to model the full joint distribution of your data. Bayesian network methods learn a global probabilistic model of the data, which allows them to generate complete synthetic datasets that preserve multivariate relationships across all variables, and the privacy-preserving variants offer formal differential privacy assurance.

 

In the next post in this series, I will go deeper into the two marginal model methods available in SAS Data Maker: MST and PrivBayes. These are related methods that share a common foundation but differ in how they construct the network structure and how they allocate the privacy budget. I have had some really good head-scratching moments in researching these two, so you will want to stay tuned.

 

See you in class!


 

If you want to go deeper into the rabbit hole, here are the references that contributed in some way to this post:

 

  1. SAS Institute. SAS Data Maker. https://www.sas.com/en_us/software/data-maker.html; SAS Data Maker Support. https://support.sas.com/en/software/data-maker-support.html
  2. Tracanella, E. (2024). A Review of Synthetic Data Generation Methods: Their Scope and How They Work. Master’s thesis, Politecnico di Milano. https://www.politesi.polimi.it/retrieve/aae90920-41dc-4211-bba5-105e1cd9df5a/2024_10_Tracanella.pdf
  3. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
  4. Blagus, R., and Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14, 106. https://doi.org/10.1186/1471-2105-14-106
  5. Buuren, S. A., et al. (2024). Improving predictions on highly unbalanced data using open source synthetic data upsampling. arXiv preprint. https://arxiv.org/pdf/2507.16419
  6. Koller, D., and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  7. Bao, E., Xiao, X., Zhao, J., Zhang, D., and Ding, B. (2021). Synthetic data generation with differential privacy via Bayesian networks. Journal of Privacy and Confidentiality, 11(3). https://doi.org/10.29012/jpc.776
  8. Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference (TCC 2006), Lecture Notes in Computer Science, vol. 3876, 265–284. https://doi.org/10.1007/11681878_14
  9. Hsu, J., Gaboardi, M., Haeberlen, A., Khanna, S., Narayan, A., Pierce, B. C., and Roth, A. (2014). Differential privacy: An economic method for choosing epsilon. arXiv:1402.3329. https://arxiv.org/abs/1402.3329
  10. WhiteFiber (2025). Understanding Differential Privacy: A Must for IT and AI Leaders. https://www.whitefiber.com/differential-privacy; ThinkAI Corp. (2025). Privacy-Preserving Analytics with Differential Privacy. https://thinkaicorp.com/privacy-preserving-analytics-using-differential-privacy-in-data-pipelines/
  11. Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., and Xiao, X. (2017). PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems, 42(4), 1–41. https://doi.org/10.1145/3134428
  12.  
  13. McKenna, R., Miklau, G., and Sheldon, D. (2021). Winning the NIST Contest: A scalable and general approach to differentially private synthetic data. arXiv:2108.04978. https://arxiv.org/abs/2108.04978
  14. Golob, S., Pentyala, S., Maratkhan, A., and De Cock, M. (2024). High epsilon synthetic data vulnerabilities in MST and PrivBayes. arXiv:2402.06699. https://arxiv.org/abs/2402.06699
  15. SAS Institute Inc. (2024). SAS acquires Hazy synthetic data software to boost generative AI portfolio. Press release, November 12, 2024. Available at: https://www.sas.com/en_us/news/press-releases/2024/november/hazy-syntheticdata.html
  16. Ganev, G., Nazari, R., Davison, R., Dizche, A., Wu, X., Abbey, R., Silva, J., De Cristofaro, E. (2026). SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling. arXiv:2510.15083v3. https://arxiv.org/abs/2510.15083v3

 

SAS Data Maker is available on the Microsoft Azure Marketplace and through SAS. Documentation is available at the SAS Help Center. Learn more in the free e-learning course, AI Literacy Series: Generative AI Using SAS.

 

 

Find more articles from SAS Global Enablement and Learning here.
