
SAS for Synthetic Data Generation in Martech

Started ‎10-19-2022 by
Modified ‎02-20-2023 by

As 2022 inches closer to the finish line, use cases for customer analytics and digital intelligence continue to evolve with a bottomless hunger for variety and volume of data to supercharge data science. Check in with any of the analytical magicians within your brand, and observe their struggle to acquire the relevant input data (and enough of it) to train their modeling recipes.

 

Sounds strange, right? The yesteryear term "BIG data" used to roll off the tongue as often as we hear "AI" today. The challenge isn't a lack of information in general; the intention of any analyst is to identify meaningful data signals that address business objectives. What's the point of having access to oceans of big data if it's just noise? Garbage in, garbage out. For example, suppose I want to build a classification model using supervised learning to better understand the behaviors and drivers of conversion for my B2C (business-to-consumer) brand.

 

- 99.5% of the customer journeys resulted in non-conversion for the past 90 days.

- 0.5% resulted in a conversion.

 

If my analysis does a great job of accurately predicting non-conversions (a high true negative rate) but a terrible job of classifying conversions (a low true positive rate), what marketing leader is going to get excited about that? Barring special exceptions, not many. This scenario illustrates a well-known modeling problem: predicting a rare event of interest.
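To make the rare-event problem concrete, here is a minimal sketch (hypothetical numbers mirroring the 99.5% / 0.5% split above) showing how a model that always predicts "no conversion" looks excellent by accuracy alone while finding zero actual converters:

```python
def majority_classifier(journeys):
    """Always predict 0 ('no conversion'), the majority class."""
    return [0 for _ in journeys]

# Hypothetical 90-day sample: 50 conversions among 10,000 journeys (0.5%)
labels = [1] * 50 + [0] * 9950
preds = majority_classifier(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_positive_rate = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / 50

print(accuracy)            # 0.995 -- looks impressive on paper
print(true_positive_rate)  # 0.0   -- identifies zero actual converters
```

This is why accuracy alone is the wrong yardstick for imbalanced conversion data, and why rebalancing (including with synthetic data) matters.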

 

Taking it one step further, as deep learning finds practical application in business use cases, it is increasingly viewed as a foundational component of modern AI. But we also know that effectively training models with these algorithms requires a tremendous amount of data. And while it seems like we're practically swimming in data day in and day out, we don't always have enough of the RIGHT data for every process or behavior we're trying to model. In other words, the problem is NOT a lack of choice among machine learning algorithms, but the scarcity of high-quality data.

 

Enter synthetic data for analytics, machine learning and AI: data that replicates, mirrors, or approximates look-alike information, allowing analysts to model use cases that would otherwise be impractical. A few examples of the challenges it addresses:

 

- Data quality concerns

- Privacy

- Lack of relevant data

 

Image 1: Synthetic data generation use cases

 

But I work in the martech industry, so how does this apply to me?

 

Real data is expensive to collect and properly annotate, especially at large scale. This is a drain on both the budgets and the time of high-utility team members who support customer journey management processes. Real data can also be messy, requiring time to clean and/or extract useful features. It can be imbalanced, which makes it harder to train good models in support of journey-based analytics. And it can be sensitive to share or store due to privacy concerns.

 

Now, marketers frequently run campaigns, tasks and activities. They target audience segments. The desire is to deliver personalization that is helpful, relevant and value-enhancing within tailored customer experiences across channels. Two examples of analytically driven marketing to consider:

 

- Propensity scores. These are intended to identify audiences with a high likelihood of converting on your macro- and micro-conversion goals. However, imbalanced training data produces classification models with higher margins of error and less valuable propensity scores, leading to irrelevant personalization and lower conversion rates.

- Look-alike audience insights. You know you love it when an analyst describes the behaviors, demographics, and transactional patterns of high-value customers. The actionable outcome is to hunt for look-alikes within acquisition and upsell/cross-sell marketing. Imbalanced data spreads like an infection, reducing the potential of the insight-driven strategies that shape how marketing budgets are spent.

 

Image 2: Customer journey management

 

Moving on, using real customer data to train models can introduce risks around adherence to regulatory requirements, and can amplify existing biases that lead to negative CX outcomes for brands. While synthetic data alone is not a one-stop solution for data bias, it does create opportunities for identifying and mitigating it. Even brands that consider themselves data-rich should revisit whether and how synthetic data might benefit their analytical use cases.

 

A trending example specific to martech centers on data sharing, clean rooms and co-ops. If a brand cannot keep its promise to customers to protect their data, it compromises its integrity. As a brand that puts your customer first, are you making every effort to do the right thing? Companies often express how seriously they take their customers' privacy and security, but then either can't or won't deliver. Regulators have obligated brands to demonstrate accountability for handling personal data. However, when data is shared and processed across complex ecosystems of partners, assigning clear ownership and accountability is problematic.

 

Across every corner of the martech industry, brands are focused on personalized, dynamic experiences in the customer moments that matter, as part of their digital experience aspirations. To offer increased automation and anticipation of customer needs, brands must collect the necessary zero- and first-party customer data to fuel the insights that enable these experiences. Profitable growth and customer loyalty depend on it.

 

Every data-sharing use case has unique risks and requirements. Considerations about data types (personal or sensitive), the parties involved, the extent of the data-sharing ecosystem, and the purposes of the sharing will force brands to carefully identify their preferred approach. The demand to deliver increasingly intelligent, engaging, and profitable applications and products must be balanced against growing pressure from regulators and the public to protect personal data and privacy. Synthetic data presents a viable solution: it allows brands to use their knowledge of existing customers to create look-alike data that mirrors the original but preserves no identifiable connection to individual customers.

 

Image 3: Real vs. synthetic data correlation plot comparisons
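The kind of correlation-plot comparison shown above can also be checked numerically. Here is a minimal sketch with hypothetical columns (visits and spend; all names and numbers are illustrative) confirming that a synthetic column pair preserves the correlation of the real one:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Toy "real" columns and a look-alike "synthetic" copy (hypothetical data)
real_visits  = [1, 2, 3, 4, 5, 6, 7, 8]
real_spend   = [10, 22, 29, 41, 48, 62, 69, 80]
synth_visits = [1.5, 2.4, 3.1, 4.6, 5.2, 6.8, 7.1, 7.9]
synth_spend  = [14, 25, 27, 45, 50, 65, 66, 78]

r_real  = pearson(real_visits, real_spend)
r_synth = pearson(synth_visits, synth_spend)
print(round(r_real, 3), round(r_synth, 3))  # near-identical correlations
```

A fidelity check like this, repeated across all column pairs, is what the correlation-plot comparison summarizes visually.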

 

Synthetic data can be artificially manufactured by special-purpose machine learning models in a way that captures the data's distributions and patterns, while helping to maintain privacy by not exposing real information. For example, a Generative Adversarial Network (GAN) can learn the patterns and relationships in existing data in order to generate new observations that are indistinguishable from real data. You've probably seen this used for deepfakes (very realistic images of people who don't exist). But the same technology also works for tabular data, which is the most common format for training predictive models with machine learning algorithms. The following demo will showcase the use of tabularGAN to generate synthetic data.
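Training a full GAN is beyond a short snippet, but the core idea of look-alike tabular synthesis, generating new rows from the neighborhood of real ones, can be sketched with a much simpler SMOTE-style interpolation. This is not a GAN and not the SAS tabularGAN workflow; it is only an illustration, and all names and numbers below are hypothetical:

```python
import random

def synthesize_rows(rows, n_new, seed=0):
    """SMOTE-style sketch: build each synthetic row by interpolating
    between two randomly chosen real rows, so every new value stays
    within the range the real data already spans."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rows, 2)   # two distinct real rows
        t = rng.random()             # interpolation weight in [0, 1)
        synthetic.append([ax + t * (bx - ax) for ax, bx in zip(a, b)])
    return synthetic

# Hypothetical minority-class (converter) rows: [visits, spend, recency]
converters = [[3, 120.0, 2], [5, 210.0, 1], [4, 95.0, 3], [6, 300.0, 1]]
new_rows = synthesize_rows(converters, n_new=10)
print(len(new_rows))  # 10 synthetic converter look-alikes
```

A GAN goes much further, learning joint distributions and cross-column relationships rather than interpolating, which is why it can produce realistic rows even for complex tabular data.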

 

 

The demo showcased how powerful SAS is for data scientists and analysts (high-, low-, or no-code) to methodically work through a process for generating synthetic data. So if you are early in data preparation and recognize you need more data or need to balance your data, or if you are in model development and realize you need more data to train or validate your models, you can simply use SAS to kick off training a new GAN or use one that has already been trained on your data. Refer here for the synthetic data generation steps we used in this process.

 

SAS is well positioned to deliver a comprehensive experience to not only generate synthetic data programmatically, but also expand our reach into the no/low-code user base.

 

Image 4: Technology User Personas

 

The use cases for synthetic data are expanding every day, across the entire martech industry. We look forward to what the future brings in our development process – as we enable technology users to access all of the most recent SAS analytical developments. Learn more about how SAS can be applied for customer analytics, journey personalization and integrated marketing here.

 

 

 

Comments

Awesome blog and video, @suneelgrover. Very engaging content and sharing of valuable insight into why and how synthetic data generation is performed.

