People have made MILLIONS starting with a back-of-the-napkin sketch! I can't prove this, but it's what I hear.
Here's a back-of-the-napkin sketch showing how data scientists, that group of analytics practitioners who swear by truth and accuracy, make use of synthetic data to deliver added value to their machine learning exercises.
Low baseline event rates (commonly found in fraud detection or rare-disease modelling) push models towards predictions driven by the attributes of non-fraudulent transactions rather than fraudulent ones - and no, the two do not lead to the same result (a minimal illustration follows this list).
Attributes that are significant for minority groups tend to get crowded out during training because of low sample sizes.
Model training with complex, iterative methods tends to terminate early due to a lack of data, leading to subpar outcomes.
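To make the first of these problems concrete, here is a minimal sketch using plain scikit-learn (nothing to do with SAS Data Maker) of how a roughly 1% event rate leads a classifier to favour the majority class. The dataset and parameters are purely illustrative.

```python
# Minimal, illustrative sketch: with a ~1% event rate, a plain classifier
# learns to favour the majority class and recall on the minority class suffers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Simulate a fraud-like problem: ~1% positives, ~99% negatives.
X, y = make_classification(
    n_samples=20_000, n_features=20, n_informative=5,
    weights=[0.99, 0.01], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), digits=3))
# Typical outcome: high overall accuracy, but low recall for class 1 (the
# minority/fraud class), because its attributes are crowded out in training.
```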
Synthetic data, generated through SAS Data Maker, enables organisations to create artificial datasets that resemble production data at many times its scale. Because the synthetic data is generated using the original (production) data as its source, useful patterns are retained. At the time of generation, consumers can dial up the volume of synthetic data they require, and evaluation metrics give data scientists confidence that the characteristics of the synthetic data for minority segments are similar to the same segments in the original data.
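The evaluation metrics built into SAS Data Maker are product specific, but the underlying idea can be sketched generically: compare the distribution of each attribute within the minority segment of the original and synthetic datasets. The snippet below is an illustrative, hypothetical check (the DataFrame names and the is_fraud column are assumptions, not product output), using a two-sample Kolmogorov-Smirnov test per numeric column.

```python
# Illustrative only: a generic distribution check between original and
# synthetic data for the minority segment. DataFrame and column names
# ("is_fraud") are hypothetical, not SAS Data Maker output.
import pandas as pd
from scipy.stats import ks_2samp

def minority_similarity(original: pd.DataFrame,
                        synthetic: pd.DataFrame,
                        target: str = "is_fraud") -> pd.DataFrame:
    """Compare numeric attribute distributions for the minority class."""
    orig_min = original[original[target] == 1]
    synth_min = synthetic[synthetic[target] == 1]
    rows = []
    numeric_cols = orig_min.select_dtypes("number").columns.drop(target, errors="ignore")
    for col in numeric_cols:
        stat, p_value = ks_2samp(orig_min[col], synth_min[col])
        rows.append({"attribute": col, "ks_stat": stat, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("ks_stat")
```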
Downstream, data scientists have options regarding how they use synthetic data. For example, they can:
Generate a large sample of synthetic data, filter out only the records for the minority segment, and append them to the training dataset. In effect, this rebalances the original dataset by adding synthetic records. If this option is chosen, it is recommended to keep a flag recording each record's provenance (e.g. flag = synthetic / original) so that later examination can measure the effect of the synthetic data on the model's outcome (see the sketch after this list).
Generate a large sample, say 10x the size of the original data, and then reduce the size of the majority segment so that the minority segment makes up a larger proportion relative to the previously dominant segments. Here, in addition to rebalancing the data, we "upsample-then-downsample" it to reach a more workable proportion. The downsampling applies only to the majority segment, so the minority segment's share increases. The synthetic dataset is then used in place of the original data for modelling purposes.
Keep the proportions intact, but rely on the increased number of records to provide a more robust and statistically significant analysis. This method is preferable when data scientists do not want to tinker with the proportions as they stand in the original data, but would like to run their training on a larger volume of data.
Try other variants in addition to the above, such as generating a large volume of data and then creating subsets for each class level, each of which goes through its own model training process, and so on.
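As a rough sketch of the first two options, assume pandas DataFrames named original and synthetic with a hypothetical binary is_fraud target (none of these names come from SAS Data Maker):

```python
# Hedged sketch of the first two options above, using pandas. The DataFrames
# `original` and `synthetic` and the "is_fraud" target are hypothetical.
import pandas as pd

TARGET = "is_fraud"

def rebalance_by_appending(original: pd.DataFrame,
                           synthetic: pd.DataFrame) -> pd.DataFrame:
    """Option 1: append only synthetic minority records, keeping a provenance flag."""
    original = original.assign(provenance="original")
    synth_minority = synthetic[synthetic[TARGET] == 1].assign(provenance="synthetic")
    return pd.concat([original, synth_minority], ignore_index=True)

def upsample_then_downsample(synthetic: pd.DataFrame,
                             majority_keep: float = 0.2,
                             seed: int = 42) -> pd.DataFrame:
    """Option 2: keep all synthetic minority rows, downsample the majority segment."""
    minority = synthetic[synthetic[TARGET] == 1]
    majority = synthetic[synthetic[TARGET] == 0].sample(frac=majority_keep,
                                                        random_state=seed)
    return pd.concat([minority, majority], ignore_index=True)
```

The provenance flag from the first helper makes it straightforward to re-run training with and without the synthetic records and measure their effect on the model's outcome.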
To summarise, synthetic data gives data scientists options to enhance their training datasets and thus address problems of sufficiency, balance and representation, all of which can contribute to model bias.
Now, don't lose your millions by throwing away this napkin!