Together with my colleague Scott @McClain_Mac, I have been investigating the potential and advantages of synthetic data enrichment in animal research studies using SAS® Data Maker.
Background
Developing pre-clinical safety and mode-of-action data with animal studies is extremely costly and time-consuming, and it raises ethical concerns. Furthermore, due to animal variation and inherent differences between animals and humans, animal studies often fail to predict human outcomes: over 90% of drugs that appear safe and effective in animals do not go on to receive FDA approval in humans, predominantly due to safety and/or efficacy issues. It is often infeasible to run animal studies with treatment group sizes large enough to yield a truly predictive or useful association for assessing the risk to humans.
Due to the limitations of animal testing, both health authorities such as the FDA and life science companies are actively investigating innovative approaches, such as synthetic data, to complement or partly replace animal trials.
Synthetic data is artificially generated information that mirrors the statistical properties of real data. Synthetic data can help in animal studies when:
- Real data from animal studies is limited.
- Additional data is needed to examine specific topics such as safety.
- Ethical concerns call for reducing the need for live animal experiments.
SYNTHETIC DATA GENERATION USING SAS® DATA MAKER
SAS® Data Maker offers a low-code/no-code interface on the Viya platform, enabling users to quickly generate or augment synthetic data without writing code. Its intuitive UI provides built-in governance, auditability, transparency and trust features, making synthetic data generation accessible to both technical and non-technical users.
SAS Data Maker follows a well-designed process which helps teams generate synthetic data in a governed manner.

Figure 1: Synthetic Data Generation follows these common steps
The process comprises three phases:
1. Plan phase
The original data is onboarded and undergoes rigorous assessment, profiling and cleansing to identify inconsistencies, outliers, and personally identifiable and private variables. This establishes a reliable foundation for synthetic data generation.

Figure 2: The seed data loaded into SAS® Data Maker
2. Prepare phase
The Prepare phase focuses on selecting appropriate synthetic data generation techniques. Two methods are at the forefront of real-world data applications: Generative Adversarial Networks (GANs) and the Synthetic Minority Oversampling Technique (SMOTE). Each of these also has several variations. In this phase, the generation parameters are also specified and the synthetic data generation model is trained.

Figure 3: Specification of synthetic data generation method, settings and which similarity and privacy metrics to include
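To make the SMOTE idea mentioned above more concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is not how SAS Data Maker implements the method, and it is not the model used in our example; the function name `smote_like`, the two variables, and all values in `seed_group` are hypothetical and serve only to illustrate the principle of creating a new row on the line segment between a real row and one of its nearest neighbours.

```python
import math
import random


def smote_like(rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly
    chosen real row and one of its k nearest neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base = rows[i]
        # All other rows, ranked by Euclidean distance to the base row.
        # (Unscaled distance is a simplification made for this sketch.)
        others = [r for j, r in enumerate(rows) if j != i]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and partner (1)
        synthetic.append(
            tuple(b + gap * (p - b) for b, p in zip(base, partner))
        )
    return synthetic


# Hypothetical seed group of 10 animals (NOT the study data):
# (body weight change in %, severity score)
seed_group = [
    (-1.0, 0.5), (-2.5, 1.0), (-0.5, 0.0), (-3.0, 1.5), (-1.5, 0.5),
    (-2.0, 1.0), (-4.0, 2.0), (-0.8, 0.5), (-2.2, 1.0), (-3.5, 1.5),
]

new_rows = smote_like(seed_group, n_new=20)
enriched = seed_group + new_rows
print(len(enriched))  # 30
```

Because each synthetic row is a weighted average of two real rows, it always stays within the range of the observed values. That makes this kind of interpolation conservative, but it also means it cannot produce responses more extreme than those already seen in the seed data.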
3. Produce phase
Finally, the Produce phase involves generating and evaluating the synthetic data to validate its quality, realism and utility for the intended use cases.
SAS Data Maker used to expand animal study data
In our example, existing data from a prospective animal study was used as the seed or “source” data. The study had treatment group sizes of 10 animals per group. Its goal was an early phase dose response examination of a hypothetical drug to establish low and no effect doses in rats. The data had 8% missingness, which is typical of real data from animals.
The goal of testing Data Maker was to examine how feasible and useful it is to expand the group size to 30 animals per dose, using the seed data to model the synthetic data.

Figure 4: An example of visual statistical evaluation metrics in SAS® Data Maker. Animal study seed data used in training the Bayes model, ahead of data creation.
Examination of the animal seed data (small dataset) and testing for applicability of the expanded data set
The seed data was typical of an early phase toxicology dose range finding study. Four metrics of animal health and morbidity were included. A summary metric labeled “severity” represented an endpoint measure that can be assessed to determine no and low effect doses in the animals.
Our goal was to test whether SAS Data Maker, using the Bayes model, was useful in a real-world context. To that end, SAS Data Maker was used to triple the animal dose group size.
We compared the original small seed dataset to the larger dataset with the added synthetic data, using two statistical assessments and a clinical threshold approach.
Statistical:
The No Observed (dose) Effect Level (NOEL) and the Lowest Observed Effect Level (LOEL) were the two statistical measures of dose impact on the animals.
Clinical:
A Maximum Tolerated Dose (MTD) was the measure used to assess dose impact by using a threshold response (the severity measure).
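The two statistical measures above can be derived mechanically once each dose group has been compared against control. The sketch below assumes that a significance flag per dose is already available from whatever test is used; the function name `noel_loel` is hypothetical, and the flags are illustrative inputs chosen to be consistent with the results reported in the Conclusion, not output from our analysis. It uses only the three dose levels named in this post (100, 300 and 500 mg/kg).

```python
def noel_loel(significant_by_dose):
    """Return (NOEL, LOEL) from a mapping of dose -> significant vs control.

    LOEL: lowest dose with a significant effect.
    NOEL: highest dose below the LOEL (None if the lowest dose is
          already significant).
    """
    doses = sorted(significant_by_dose)
    loel = next((d for d in doses if significant_by_dose[d]), None)
    if loel is None:
        # No dose is significant: the highest tested dose is the NOEL.
        noel = doses[-1] if doses else None
    else:
        lower = [d for d in doses if d < loel]
        noel = lower[-1] if lower else None
    return noel, loel


# Illustrative flags only (assumed, in mg/kg).
small = {100: False, 300: True, 500: True}
large = {100: True, 300: True, 500: True}

print(noel_loel(small))  # (100, 300)
print(noel_loel(large))  # (None, 100)
```

The second call shows why a NOEL can disappear: once the lowest tested dose is flagged as significant, no tested dose remains below the LOEL.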
Conclusion
Overlayed Metrics: Dose at Which the Metric Indicated Significance (from Control)
| Metric | Small Dataset | Large Dataset from SAS Data Maker |
|---|---|---|
| MTD (mean-based) | 500 mg/kg | 500 mg/kg |
| MTD (CI-based) | 100 mg/kg | 500 mg/kg |
| LOEL | 300 mg/kg | 100 mg/kg |
| NOEL | 100 mg/kg | None (even 100 mg/kg is significant) |
Impact of a larger, synthesized dataset on interpreting dose impact in a pre-clinical safety study
- The LOEL shifts lower in the large dataset (100 mg/kg vs 300 mg/kg), indicating higher sensitivity.
- The NOEL disappears in the large dataset because even the lowest dose shows statistical significance.
- The clinical assessment in the form of an MTD remains 500 mg/kg by the mean-based threshold.
- The Confidence Interval-based MTD is much higher in the large dataset (500 mg/kg vs 100 mg/kg), reflecting improved precision.
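The difference between a mean-based and a CI-based MTD can be illustrated with a small sketch. It rests on several assumptions that are not stated in this post: the MTD is read as the highest dose at which severity stays at or below the clinical threshold, the CI-based variant applies that rule to the upper bound of a 95% confidence interval (normal approximation), and the per-dose means and standard deviations are invented for illustration. Only the threshold of 2.0, the dose levels, and the group sizes of 10 and 30 come from the text.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96, normal approximation


def upper_ci(mean, sd, n):
    """Upper bound of an approximate 95% confidence interval for the mean."""
    return mean + Z95 * sd / math.sqrt(n)


def mtd(summary, n, threshold, use_ci):
    """Highest dose (scanning upward) whose mean, or upper CI bound,
    stays at or below the threshold. Returns None if no dose qualifies."""
    tolerated = None
    for dose in sorted(summary):
        mean, sd = summary[dose]
        value = upper_ci(mean, sd, n) if use_ci else mean
        if value > threshold:
            break
        tolerated = dose
    return tolerated


# Hypothetical (mean severity, standard deviation) per dose in mg/kg.
# These are NOT the study results.
summary = {100: (1.0, 0.8), 300: (1.6, 0.8), 500: (1.7, 0.8)}
THRESHOLD = 2.0

print(mtd(summary, n=10, threshold=THRESHOLD, use_ci=False))  # 500
print(mtd(summary, n=30, threshold=THRESHOLD, use_ci=False))  # 500
print(mtd(summary, n=10, threshold=THRESHOLD, use_ci=True))   # 100
print(mtd(summary, n=30, threshold=THRESHOLD, use_ci=True))   # 500
```

With identical means and spread, tripling the group size narrows the interval by a factor of about 0.58 (the square root of 10/30). In this sketch that is enough to move the upper bound below the threshold at the higher doses, while the mean-based result does not change.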

Figure 5. Dose response similarity between the small “seed” data used for training and the synthetic, enriched data (Large).

Figure 6. Impact of enriched, parameterized synthetic data on determining dose response below and above a “clinical threshold” (2.0 on this graph).
In this small example, using SAS Data Maker to generate synthetic data that complements real animal study data improved the precision available for establishing no observed effect levels and lowest observed effect levels (i.e., dose). These are the key statistical measures of early phase animal toxicology/safety studies for new drugs.
So even though it is early days, complementing real animal study data with synthetic data could reduce the need for live animal experiments and save animal lives. These early studies are often repeated several times because animal group sizes are small (10 animals) and some data is always lost; the resulting loss of statistical power makes safety decisions difficult with only one study.
The impacts of repeated, iterative safety studies are increased animal use, delayed drug development timelines, and increased cost. Properly parameterized synthetic data enrichment has clear value in setting up later safety studies with a stronger data foundation for dose setting.