SAS for Supervised Learning and Profit Matrices in Martech

Within the martech industry, several factors contribute to the challenges surrounding brand decision-making. Customers and markets are more competitive and demanding, and when you step back and reflect on this, consumer expectations have trended upward year after year. To satisfy that demand, it's well recognized that brands need to respond quicker, but it's often overlooked that accuracy holds equal weight. Personalization, targeting, segmentation, relevance and other fun martech buzzwords all rely on this.

Data continues to flood every organization, both in size and in speed. Sometimes more data is better, but the challenge is that critical decision-making information can get lost. Skilled analytical talent with application experience in the various domains of modern marketing is the key to moving a brand from reactive to proactive. Thus, varying flavors of technology and automation are critically important to augment customer analysts in accelerating their time-to-value.

Machine learning is a branch of artificial intelligence that automates the building of systems that learn iteratively from data, identify patterns, and predict future results. And it does that with minimal human intervention. Machine learning shares many approaches with other related fields, but it focuses on predictive accuracy. Building representative machine learning models that generalize well on future data requires careful consideration of both the data at hand and assumptions about the various available training algorithms.

Supervised Learning

Supervised learning algorithms are trained using labeled examples (conversion vs. non-conversion), such as an input where the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, and the algorithm learns by comparing its actual output with correct outputs to find errors. It then modifies the model accordingly. Supervised learning is commonly used in applications where historical data predict likely future events. For example, supervised learning can anticipate when an insurance customer is likely to file a claim, or when a retail customer has a higher likelihood to be interested in an upsell recommendation.
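
To make the "compare the output with the correct answer, then adjust" loop concrete, here is a minimal sketch in plain Python. It is not how SAS trains its models; it is a tiny logistic regression fitted by stochastic gradient descent, and the function names, learning rate, and toy data are all hypothetical choices made for illustration.

```python
import math


def predict_proba(weights, bias, x):
    """Return the model's estimated probability of the event for one input row."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.01, epochs=5000):
    """Fit a logistic model by repeatedly correcting its prediction errors."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            # Compare the model's actual output with the known correct output.
            error = y - predict_proba(weights, bias, x)
            # Modify the model in the direction that shrinks the error.
            bias += lr * error
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias


# Hypothetical labeled examples: [site visits, emails opened] -> converted (1) or not (0)
rows = [[1, 0], [2, 1], [3, 0], [6, 3], [7, 4], [8, 5]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
print(round(predict_proba(weights, bias, [8, 5]), 3))  # an engaged customer
print(round(predict_proba(weights, bias, [1, 0]), 3))  # a barely engaged customer
```

The labels are what make this "supervised": without the known outcomes there would be no error to learn from.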

Image 1 - Supervised Learning

SAS supports two types of supervised learning problems through natively supported algorithms such as gradient boosting, forests, neural networks, support vector machines, Bayesian networks and more.

Classification – When the data are being used to predict a categorical target, supervised learning is called classification. This is the case when assigning a label or indicator (for example, labeling an image a dog or a cat). When there are only two labels, this is called binary classification. When there are more than two categories, the problems are called nominal classification.
Regression – When the data are being used to predict interval targets, the problems are referred to as regression.

Supervised learning as a category contains a variety of algorithms because no single model is uniformly the best, particularly when considering deployment over time as the data changes. Analysts select a model primarily based on fit statistics and assessment graphics of performance.

Fit statistics transform model performance to numerical scores for easy comparison.
Assessment graphics provide a global view of model performance. They facilitate model comparisons across a variety of deployment scenarios.

Image 2 - Comparing Algorithmic Models

SAS enables users to facilitate selection between models using fit statistics and assessment graphics. The purpose of predictive modeling is generalization, which is the performance of the model on new data (not used during the training process). To compare across several models, SAS computes all assessment measures for each available data partition (train, validate, and test).

Image 3 - SAS for Model Pipelines, Assessment & Comparison

Classifier Performance

Supervised classification does not usually end with an estimate of the posterior probability. For example, in binary classification problems, the ultimate use of a predictive model is to allocate cases (customers) to classes (target / don't target). This is accomplished by appropriately choosing a posterior probability cutoff. The cutoff, or threshold, is the posterior probability above which a case is allocated to the event class.

Image 4 - Probability Cutoff Challenge

Analysts often want to choose the probability cutoff that gives the maximum accuracy. However, care should be taken when the response column is skewed. This is EXTREMELY common in martech and customer analytical use cases.

Have you ever heard of a brand with a 50% conversion rate?

Unless you're reading a fictional story, this never happens. For example, a bank wants to predict the loan defaulters, so model performance needs to be assessed considering the posterior probability cutoff. An allocation rule is merely an assignment of a cutoff probability, where cases above the cutoff are allocated to class 1 (when we predict a customer will convert) and cases below the cutoff are allocated to class 0 (when we predict a customer to not convert). For example, the standard logistic regression model separates the classes by a linear surface (hyperplane), shown on the left of the image below.

Image 5 - Classification Cutoff

The decision boundary is linear, and determining the best cutoff is a fundamental concern. An allocation rule corresponds to a threshold value (cutoff) of the posterior probability that affects the confusion matrix. For example, all cases with probabilities of default greater than 0.04 might be rejected for a loan. For a given cutoff, how well does the classifier perform? This is the question.
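
As a rough illustration of how an allocation rule turns posterior probabilities into a crosstabulation of actual versus predicted classes, the sketch below applies two cutoffs to the same scores. The function names and the six scored customers are hypothetical; only the 0.04 and 0.5 cutoffs echo values mentioned in this article.

```python
def allocate(probabilities, cutoff):
    """Allocation rule: class 1 above the cutoff, class 0 otherwise."""
    return [1 if p > cutoff else 0 for p in probabilities]


def confusion_counts(actuals, predictions):
    """Crosstabulate actual and predicted classes into four cells."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(actuals, predictions):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


# Hypothetical posterior probabilities and true outcomes for six customers
posterior = [0.01, 0.02, 0.03, 0.05, 0.08, 0.30]
actual = [0, 0, 0, 1, 0, 1]

at_default = confusion_counts(actual, allocate(posterior, 0.5))
at_low = confusion_counts(actual, allocate(posterior, 0.04))
print(at_default)  # {'tp': 0, 'fp': 0, 'fn': 2, 'tn': 4}
print(at_low)      # {'tp': 2, 'fp': 1, 'fn': 0, 'tn': 3}
```

With a 0.5 cutoff nobody is allocated to class 1, so both events are missed; the lower cutoff catches both at the price of one false positive.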

The fundamental assessment tool is the confusion matrix. Don't you love the name? The confusion matrix is a crosstabulation of the actual and predicted classes. The confusion matrix contains true positives (events that are correctly classified/predicted), false positives (non-events that are classified/predicted as events), false negatives (events that are classified/predicted as non-events), and true negatives (non-events that are correctly classified/predicted as non-events). It quantifies the confusion of the classifier. Having fun yet?

The event of interest, whether it is unfavorable (like fraud, churn, or default) or favorable (like response, click, or purchase to an offer), is often called a positive, although this convention is arbitrary. Here are the simplest performance statistics:

Accuracy = (true positives + true negatives) / (total cases)
Misclassification rate = (false positives + false negatives) / (total cases)
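
The two formulas above are simple enough to sketch directly. The example counts are hypothetical: 1,000 customers with a 2% conversion rate, scored by a rule that never targets anyone. It is included to show why accuracy alone can flatter a model when the target is rare.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def misclassification_rate(tp, fp, fn, tn):
    """Share of all cases that were classified incorrectly."""
    return (fp + fn) / (tp + fp + fn + tn)


# Hypothetical: 20 converters out of 1,000, and the rule predicts "no conversion" for all
never_target = {"tp": 0, "fp": 0, "fn": 20, "tn": 980}

print(accuracy(**never_target))                # 0.98
print(misclassification_rate(**never_target))  # 0.02
```

A 98% accurate model that finds zero converters is of little use to a marketer, which is why the measures that follow matter.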

Image 6 - Sensitivity and Positive Predicted Value

Image 6 highlights two specialized measures of classifier performance focused on true positives. Large sensitivities do not necessarily correspond to large values of positive predicted value (PV+). Ideally, analysts would like large values of all these statistics. Sensitivity is also known as recall or true positive rate (TPR). Positive predicted value is also known as precision. Precision and recall are popular in applications such as information retrieval and anomaly detection.

Image 7 - Specificity and Negative Predicted Value

Image 7 focuses on the analogs of these measures for true negatives: specificity and negative predicted value. Specificity is also commonly known as true negative rate (TNR).
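
For readers who prefer formulas in code, here is a small sketch of the four measures from Images 6 and 7, using their standard definitions. The function name and the example counts are hypothetical, chosen so that sensitivity is high while positive predicted value is low.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classifier_measures(tp, fp, fn, tn):
    """Compute the measures that focus on true positives and true negatives."""
    return {
        "sensitivity": _ratio(tp, tp + fn),        # recall, true positive rate
        "positive_predicted": _ratio(tp, tp + fp), # precision, PV+
        "specificity": _ratio(tn, tn + fp),        # true negative rate
        "negative_predicted": _ratio(tn, tn + fn),
    }


# Hypothetical counts: 20 actual converters among 1,000 customers, 100 targeted
measures = classifier_measures(tp=15, fp=85, fn=5, tn=895)
print(measures["sensitivity"])         # 0.75
print(measures["positive_predicted"])  # 0.15
```

Here the model finds 75% of the converters, yet only 15% of the targeted customers convert, which is the number a marketer paying per impression will feel.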

The context of the classifier performance problem determines which of these measures is the primary concern. For example, a marketer is likely most concerned with PV+ because it relates to targeted offers (guided by predictions) and the associated response or conversion rate impacting KPIs they measure against. There is a cost to marketing and efficiency matters.

Image 8 - The Use Case Matters

Analysts must balance their desire for a high true positive rate (TPR) against the false positive rate (FPR) that comes with it. There is always a trade-off between them. For a given model, if you lower the cutoff to increase TPR, your FPR will also increase. For example, if you want to increase positive (conversion) outcomes (higher TPR), you must also be willing to incur incremental error in terms of predicting non-conversions (higher FPR). Customer behavior can never be modeled perfectly (human beings can act irrationally or unexpectedly from time-to-time). This is the very core of the challenge in deciding the probability cutoff.

Image 9 - The Interactive Cutoff Plot in SAS

The Cutoff plot is an auto-generated interactive visualization in SAS that shows how different model assessment statistics change as the prediction cutoff value changes. The model assessment statistics are based on the selected event (conversions, responses, clicks, etc.) for the model compared to non-events (non-conversions, non-responses, no clicks, etc.). Analysts can interactively move the cutoff line to represent different prediction cutoff values. As users move the cutoff line, the model assessment statistics are updated. This allows analysts to choose a cutoff that best represents their particular problem (maximizing marketing offer/tactic efficiency) and business objective.

Image 10 - The Event Classification Graph in SAS

The Event Classification graph is another auto-generated visualization in SAS that displays the confusion matrix at various cutoff values for each partition. Recall that the confusion matrix contains true positives (events that are correctly classified as events), false positives (non-events that are classified as events), false negatives (events that are classified as non-events), and true negatives (non-events that are correctly classified as non-events). Each of these segments is displayed in blue or yellow within the corresponding bar associated with the model's classification event level (such as conversion/non-conversion).

Image 11 - The Confusion Matrix in SAS

Different cutoffs produce different allocations and different confusion matrices. To determine the optimal cutoff, a performance criterion needs to be defined. If the goal were to increase the sensitivity of the classifier, then the optimal classifier would allocate all cases to class 1. If the goal were to increase specificity, then the optimal classifier would be to allocate all cases to class 0. For realistic data, there is a trade-off between sensitivity and specificity. Higher cutoffs decrease sensitivity and increase specificity. Lower cutoffs decrease specificity and increase sensitivity.
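
The trade-off described above can be seen by sweeping a few cutoffs over the same scored data. This is only a sketch: the function name, the eight scored customers, and the chosen cutoffs are hypothetical.

```python
def sensitivity_specificity(probabilities, actuals, cutoff):
    """Return (sensitivity, specificity) when class 1 means probability > cutoff."""
    pairs = list(zip(probabilities, actuals))
    tp = sum(1 for p, a in pairs if a == 1 and p > cutoff)
    fn = sum(1 for p, a in pairs if a == 1 and p <= cutoff)
    tn = sum(1 for p, a in pairs if a == 0 and p <= cutoff)
    fp = sum(1 for p, a in pairs if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical posterior probabilities and true outcomes (3 events, 5 non-events)
posterior = [0.02, 0.03, 0.05, 0.07, 0.10, 0.15, 0.20, 0.40]
actual = [0, 0, 0, 1, 0, 0, 1, 1]

sweep = {c: sensitivity_specificity(posterior, actual, c) for c in (0.0, 0.06, 0.12, 0.5)}
for cutoff, (sens, spec) in sweep.items():
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

At a cutoff of 0.0 every case is allocated to class 1 (sensitivity 1, specificity 0), and at 0.5 every case is allocated to class 0 (sensitivity 0, specificity 1); the useful cutoffs sit in between.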

Users in SAS have access to a binary classification cutoff feature that specifies the cutoff for determining the predicted value for a binary target based on the posterior probabilities. If this feature is not customized by an analyst, the default value is 0.5. This is the case not only in SAS, but in most software packages that enable users to build supervised learning models. When applying predictive models for marketing use cases, it is essential to consider the posterior probability cutoff prior to finalizing a modeling exercise. Remember, when have you ever heard of a brand that has a 50% conversion rate for anything they offer to customers? Okay then, let's move on.

The Profit Matrix

Determining an appropriate cutoff is problem specific, and there are many ways of accomplishing this (Bayes' Rule, Central Cutoff, KS Cutoff, etc.). We will focus on one solution referred to as the Profit Matrix, which is a formal approach to determining the optimal cutoff using statistical decision theory (McLachlan 1992, Ripley 1996, Hand 1997). The decision-theoretic approach starts by assigning profit margins to true positives and loss margins to false positives. The optimal decision rule maximizes the total expected profit.

Image 12 - Profit Matrix Example

The profit matrix in Image 12 is meant to portray a simple example. It is based on a marketing effort that costs $1 for every impression (choose your favorite channel/touchpoint) and that, when successful (targeted customer conversion), garners revenue of $100. Hence, the profit (or loss) for targeting a non-responder is -$1, and the profit for targeting a responder is $100 - $1 = $99. Given that everyone in this population has a posterior probability, simple algebra can be used to find the optimum cutoff.
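
The "simple algebra" can be sketched as follows, assuming that not targeting a customer yields a profit of 0 (an assumption, since that cell is only shown in the image). Targeting pays off when p × 99 + (1 − p) × (−1) > 0, which rearranges to p > 1 / (99 + 1). The function name is hypothetical.

```python
def breakeven_cutoff(profit_if_responder, loss_if_nonresponder):
    """Posterior probability above which targeting has positive expected profit.

    Assumes the profit of not targeting is 0, so targeting is worthwhile when
    p * profit - (1 - p) * loss > 0, i.e. p > loss / (profit + loss).
    """
    return loss_if_nonresponder / (profit_if_responder + loss_if_nonresponder)


# $99 profit for targeting a responder, $1 loss for targeting a non-responder
cutoff = breakeven_cutoff(profit_if_responder=99.0, loss_if_nonresponder=1.0)
print(cutoff)  # 0.01
```

Under these numbers, any customer with a posterior probability above 1% is worth an impression, which is a long way from the default of 0.5.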

Here is a typical decision rule. Target a customer if the expected profit for making an offer, given the posterior probability, is higher than the expected profit for ignoring the customer. The optimized cutoff can be identified by calculating the expected profit. Goodbye confusion matrix, hello profit matrix!

When the desired target event is rare, which is common in martech and customer journeys, the cost of a false negative is usually greater than the cost of a false positive. In other words, the monetary cost (or missed opportunity) of not targeting a customer who would have resulted in a conversion is greater than the cost of offering a promotion to someone who does not convert. Such considerations dictate cutoff rates that are less (often much less) than the default 0.5 value set in modeling software.

Image 13 - Envisioning a Profit Matrix for Subscription Business Model

To determine reasonable values for profit and loss information, consider the outcomes and the actions that your subscription-oriented brand would take given knowledge of these outcomes. In Image 13, there are two outcomes (churn and active) and two corresponding actions (offer discount and no action). Knowing that someone is a churner, analysts would naturally want to offer a discount to that person in hopes of preventing them from unsubscribing. Knowing that someone is a non-churner, you would naturally want to not offer any discount to that person. On the other hand, knowledge of an individual’s actual behavior is rarely perfect, so mistakes are made: for example, offering a discount to non-churners (false positives) and not taking any action for churners (false negatives). Taken together, there are four outcome-and-action combinations shown. Each of these outcome-and-action combinations has a profit consequence (positive or negative).

Suppose from the description of the analysis problem that the variable AVG_ARPU_3M gives the customer's average revenue for the past three months. Also, there is a 15% decline in the average revenue of that customer when a discount is offered to retain them. From a statistical perspective, AVG_ARPU_3M is a random variable. Individuals who are identical on every input measurement might be associated with varying revenue amounts. To simplify the analysis, a summary statistic for AVG_ARPU_3M is plugged into the profit consequence matrix.

Image 14 - Profit Matrix - Outcome and Action Combo One - Successful Marketing Intervention

This is when the brand gives a 15% discount to customers who were predicted to churn, and in response, they no longer churn. The brand earns 3 months of average revenue minus the 15% discount. In other words, $60.30 – $9.04. The total profit is equal to $51.26.

Image 15 - Profit Matrix - Outcome and Action Combo Two - Unnecessary Marketing Intervention Wasting Budget

This is when the brand predicts that a segment of customers will churn but they actually stay, indicating that a marketing intervention was not necessary. The brand provided a discount to them, so it experiences a negative consequence monetarily. The amount lost is $9.04 per subscriber.

Image 16 - Profit Matrix - Outcome and Action Combo Three - Incorrect Churn Prediction and No Marketing Intervention

This is when the brand predicts that a segment of customers will not churn but they actually do, indicating that a marketing intervention could have mitigated this behavior. The brand did not provide a discount, so it experiences a larger negative consequence monetarily. The amount lost is $60.30 per churned subscriber.

Image 17 - Profit Matrix - Outcome and Action Combo Four - Successful Non-Marketing Intervention

This is when the brand correctly predicts that customers will not churn, so no discounts are given to them. The brand will earn profits as usual. Because nothing changes for these customers, this combination has no effect on the model's influence on decisioning, so the value can immediately be set to 0.

Image 18 - Completed Profit Consequence Matrix

With the completed profit consequence matrix, analysts can calculate the expected profit associated with each decision. This is equal to the sum of the outcome and action profits multiplied by the outcome probabilities. The best decision for a case is the one that maximizes the expected profit for that observation. When the elements of the profit consequence matrix are constants, prediction decisions depend solely on the estimated probability of response and a constant decision threshold.
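
The expected-profit calculation can be sketched in a few lines using the four values worked out above ($51.26, -$9.04, -$60.30, and 0). This is not the HPDECIDE implementation; the dictionary layout, function names, and the two example churn probabilities are hypothetical.

```python
# Profit consequence matrix keyed by (outcome, decision), from the four combinations above
PROFIT = {
    ("churn", "offer_discount"): 51.26,
    ("active", "offer_discount"): -9.04,
    ("churn", "no_action"): -60.30,
    ("active", "no_action"): 0.0,
}
DECISIONS = ("offer_discount", "no_action")


def expected_profits(p_churn, profit=PROFIT):
    """Sum outcome-and-action profits weighted by the outcome probabilities."""
    probs = {"churn": p_churn, "active": 1.0 - p_churn}
    return {d: sum(probs[o] * profit[(o, d)] for o in probs) for d in DECISIONS}


def best_decision(p_churn, profit=PROFIT):
    """Pick the decision with the highest expected profit for one customer."""
    ep = expected_profits(p_churn, profit)
    return max(ep, key=ep.get)


def decision_threshold(profit=PROFIT):
    """Churn probability at which both decisions have equal expected profit."""
    gain_if_churn = profit[("churn", "offer_discount")] - profit[("churn", "no_action")]
    cost_if_active = profit[("active", "no_action")] - profit[("active", "offer_discount")]
    return cost_if_active / (gain_if_churn + cost_if_active)


print(best_decision(0.20))             # offer_discount
print(best_decision(0.05))             # no_action
print(round(decision_threshold(), 4))  # 0.075
```

Because the matrix entries are constants, the whole decision collapses to comparing each customer's churn probability with one threshold, here about 7.5% rather than 50%.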

SAS enables users to leverage the HPDECIDE procedure, which creates optimal decisions that are based on a decision matrix, on prior probabilities, and on output from a modeling project.

Image 19 - SAS HPDECIDE Procedure

Each model an analyst runs in a project can make a decision for each customer observation in a scoring data set, based on numerical consequences specified via a decision matrix and cost variables (or cost constants). The decision matrix can specify profit, loss, or revenue. The HPDECIDE procedure chooses the optimal decision for each observation, such as maximum expected/estimated profit or minimum expected/estimated loss. For the demonstration example in Image 20, the average revenue for the past three months minus the 15% discount cost is used as the (constant) profit associated with the churn outcome and the offer discount decision.

Image 20 - Using SAS HPDECIDE Procedure in Model Pipelining

Let's review the results of the Profit Matrix node.

Image 21 - SAS HPDECIDE Procedure Scored Customer Data

The table shows the scored data set, which displays the decision for each observation.

• D_DecisionData is the label of the decision chosen by the model. In other words, take action or don't take action.
• EP_DecisionData is the expected profit for the decision chosen by the model.
• CP_DecisionData is the profit computed from the target value. The value 0 signifies no change in the usual profit.

For example, the predicted probability of churn for the first observation is 0.18162 (or approximately 18%). Expected profits for decisions that offer a discount and no action are $1.90 and -$10.95, respectively. Because the first value is larger, the decision is to offer a discount, reflected in the D_ column, and the expected profit for this observation would be $1.90, reflected in the EP_ column.

Further, here comes the statement every executive leader wants to hear during the analyst presentation.

Image 22 - SAS HPDECIDE Procedure - Profit Summary

Average profit can be used to summarize the model's overall performance. For the profit matrix used in this example, average profit is computed by multiplying the number of cases by the corresponding profit in each outcome-and-decision combination, adding across all outcome-and-decision combinations, and dividing by the total number of cases in the assessment data. This example shows that the total profit is $23,432.33 and the average profit is $0.41431, based on the decisions scored from the HPDECIDE procedure.
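
The average-profit calculation described above can be sketched as below. The profit values come from the completed matrix in this article, but the case counts are made up for illustration and do not reproduce the totals in Image 22; the function name is also hypothetical.

```python
# Profit per (outcome, decision) combination, from the completed profit matrix
PROFIT = {
    ("churn", "offer_discount"): 51.26,
    ("active", "offer_discount"): -9.04,
    ("churn", "no_action"): -60.30,
    ("active", "no_action"): 0.0,
}


def profit_summary(counts, profit=PROFIT):
    """Return (total profit, average profit per case) across all combinations."""
    total = sum(counts[combo] * profit[combo] for combo in counts)
    n_cases = sum(counts.values())
    return total, total / n_cases


# Hypothetical number of scored customers falling into each combination
counts = {
    ("churn", "offer_discount"): 100,
    ("active", "offer_discount"): 400,
    ("churn", "no_action"): 20,
    ("active", "no_action"): 480,
}

total, average = profit_summary(counts)
print(round(total, 2), round(average, 3))  # 304.0 0.304
```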

As an analyst, when you can communicate the value of your modeling efforts in monetary terms, every executive paying attention is going to lean in and focus. Passing these insights to influence our marketing teammates will directly impact their segmentation strategies and touchpoint tactics.

Image 23 - Profit Matrix, Segments & SAS Customer Intelligence 360

Oh yeah, let's not forget everyone needs that easy-to-understand report too!

Image 24 - Profit Matrix Optimization - Summary Report

As a life-long student of business and marketing analytics for the last two decades, I consider this concept of a profit matrix one of the most industry-practical topics I have ever learned. I hope readers will consider learning more on how to use this amazing spell of data magic! For those who prefer to see live demos, check this out:

(view in My Videos)

For readers who have a desire for more, go here to learn how SAS can be applied for customer analytics, journey personalization and integrated marketing.
