
Back of the napkin: How can synthetic data enhance machine learning outcomes?


People have made MILLIONS starting with a back-of-the-napkin sketch!  I can't prove this, but it's what I hear.

 

Here's a back-of-the-napkin sketch showing how data scientists, that group of analytics practitioners who swear by truth and accuracy, make use of synthetic data to deliver added value to their machine learning exercises.  

  

[Image: back_of_napkin_synth_data_machine_learning.png, a back-of-the-napkin sketch of how synthetic data feeds a machine learning exercise]

 
Coming back to data scientists, they are a bit demanding, aren't they?  I should know, I'm one of them. I guess it's because machine learning models have real consequences when used in decision systems: denying or approving someone for a loan, rejecting a transaction because it seems fraudulent, or approving a drug based on its perceived low adverse effects.
 
The issue is, machine learning models manifest the patterns found in real data, and real data happens to be messy.  It is messy because it captures phenomena that may be rare or infrequent in nature.  Imperfect measurement systems amplify the problem further by underreporting data for certain groups or segments, whether for systemic reasons or because of bias. Whatever the reasons, the consequences are significant.
 
  • Low baseline event rates (common in fraud or rare-disease models) push models towards predictions driven by the attributes of the majority class, such as non-fraudulent transactions, rather than those of the rare class - and no, the two do not lead to the same result

  • Attributes that are significant for minority groups tend to get crowded out during training due to low sample size

  • Model training with complex, iterative methods tends to terminate early due to lack of data, leading to subpar outcomes

 
How have data scientists conventionally handled such problems?  One common approach has been to downsample, i.e. reduce the majority class to a sufficiently low number so as to "balance" the proportions against the rare event or minority class (relative to the original data distribution).  But this essentially means discarding data, and discarding data means rich information never gets a chance to contribute to the machine learning process.
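To make that trade-off concrete, here is a minimal pandas sketch of conventional downsampling. The file name, the is_fraud column and the 3x ratio are assumptions for illustration only; note how many majority rows are simply thrown away.

```python
import pandas as pd

# Hypothetical transactions table with a binary target column "is_fraud"
# (file and column names are illustrative assumptions).
df = pd.read_csv("transactions.csv")

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Downsample the majority class to roughly 3x the minority count so the
# classes are more "balanced" for training.
majority_down = majority.sample(n=3 * len(minority), random_state=42)

# Every majority row not sampled here is discarded - the loss described above.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced["is_fraud"].value_counts(normalize=True))
```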
 

How can Synthetic Data Help?

 

Synthetic data, generated through SAS Data Maker, enables organisations to create artificial datasets that resemble production data, at many times the original scale.  Synthetic data is generated using original (production) data as the source, which ensures useful patterns are retained.  At generation time, consumers can dial up the volume of synthetic data they need.  Evaluation metrics also give data scientists confidence that the characteristics of the synthetic data for minority segments match those of the same segments in the original data.
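As a rough illustration of that last point, the sketch below compares the distribution of one numeric attribute for the minority segment in the original versus the synthetic data using a two-sample Kolmogorov-Smirnov test. This is a generic sanity check, not SAS Data Maker's own evaluation metric; the file and column names are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical file and column names, used only for illustration.
original = pd.read_csv("original_transactions.csv")
synthetic = pd.read_csv("synthetic_transactions.csv")

# Compare a numeric attribute for the minority (fraud) segment only.
orig_minority = original.loc[original["is_fraud"] == 1, "transaction_amount"]
synth_minority = synthetic.loc[synthetic["is_fraud"] == 1, "transaction_amount"]

# Two-sample Kolmogorov-Smirnov test: a small statistic (and a large p-value)
# suggests the two distributions are hard to tell apart.
stat, p_value = ks_2samp(orig_minority, synth_minority)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```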

 

Downstream, data scientists have options regarding how they use synthetic data.  For example, they can:

  • Generate a large sample of synthetic data, filter out only the records for the minority segment, and append them to the training dataset.  In this case they are rebalancing the original dataset by adding synthetic records.  If this option is chosen, it's recommended to keep a provenance flag on each record (e.g. flag = synthetic / original) so that later examination can measure the effect of the synthetic data on the model's outcome (see the sketch after this list).

  • Generate a large sample, let's say 10x the size of the original data, and then reduce the size of the majority segment so that the minority segment forms a larger proportion relative to the previously dominant segments.  Here, in addition to rebalancing the data, we "upsample-then-downsample" to achieve a more workable proportion.  The downsampling applies only to the majority segment, so the minority segment's proportion increases.  The synthetic dataset is then used in place of the original data for modelling purposes.

  • Keep the proportions intact, but rely on the increased number of records to provide more robust and statistically significant analysis.  This method is preferable when data scientists do not want to tinker with the proportions as they stand in the original data, but would like to run their training on a larger volume of data.

  • Try other variants in addition to the above, such as generating a large volume of data and then creating subsets for each class level, each of which goes through its own model training process, etc.
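Here is a minimal sketch of the first option, including the provenance flag it recommends. It assumes a synthetic extract has already been generated from the original data (for example with SAS Data Maker); the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: the original training data and a large synthetic
# sample generated from it (file and column names are illustrative only).
original = pd.read_csv("original_transactions.csv")
synthetic = pd.read_csv("synthetic_transactions.csv")

# Option 1: keep only the synthetic records for the minority (fraud) segment
# and append them to the original training data.
synthetic_minority = synthetic[synthetic["is_fraud"] == 1].copy()

# Provenance flag, so the effect of synthetic records on the model's outcome
# can be examined later.
original["provenance"] = "original"
synthetic_minority["provenance"] = "synthetic"

training = pd.concat([original, synthetic_minority], ignore_index=True)
print(training.groupby(["provenance", "is_fraud"]).size())
```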

 

To summarise, synthetic data gives data scientists options to enhance their training datasets and thereby address problems of sufficiency, balance and representation, all of which can contribute to model bias.
 

Now, don't lose your millions by throwing away this napkin!
