
Using a two-stage model for a continuous target with an over-abundance of zeros in Visual Statistics


Last year, a student in the SAS Visual Statistics: Interactive Model Building class I was teaching asked about modeling a continuous, skewed response variable with a large number of zeros.  He asked for possible approaches using SAS Visual Statistics, and we ended up talking about two-stage modeling.  In this post, I'll walk through how we used SAS Visual Statistics to fit a two-stage model to the course data set (VS_Bank).

 

VS_Bank data

 

VS_Bank is a simulated data set from a large financial services firm's accounts.  It has 1.1 million rows, with 12 interval-scaled predictors (Logi_RFM1–Logi_RFM12) and 2 categorical predictors (Cat_input1, Cat_input2).  The data set has 3 target variables: a binary target indicating whether a purchase was made (B_tgt), an interval target giving the total value of the purchases (Int_tgt), and a count target giving the number of purchases (Cnt_tgt, not used in this post).  Roughly 20% of the accounts are purchasers (B_tgt=1) and 80% are non-purchasers (B_tgt=0).  The many zero values of total purchase value were coded as missing for the interval target, so Int_tgt is 80% missing.  All missing predictor values were imputed (Int_tgt was not imputed), and the interval predictors were log transformed to reduce skewness prior to modeling (hence the "Logi_" prefix).

 

Two-stage modeling

 

The goal we chose for modeling with the VS_Bank data was to predict the most valuable potential customers for the bank's marketing department to target. The two-stage approach involves first modeling whether a customer is a likely purchaser (the stage-1 model, using the binary target B_tgt) and then using the predicted probability of making a purchase as an input when modeling the total value of purchases (the stage-2 model, using the interval target Int_tgt).  Incorporating the stage-1 predictions into the stage-2 model is likely to improve its predictive power, particularly because the $0 total-value customers (that is, the non-purchasers) are coded as missing values for Int_tgt.   Note that the same or different predictors can be used for the two models.
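
The logic behind combining the two models can be written as an expected-value decomposition.  Because non-purchasers contribute $0, the unconditional expected purchase value for a customer factors into exactly the two quantities the stages estimate:

   E[purchase value]  =  Pr(B_tgt = 1)  x  E[Int_tgt | B_tgt = 1]

The stage-1 model estimates the first factor and the stage-2 model estimates the second, so their product estimates the expected value of each customer.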

 

The predicted values from the two models can then be multiplied, and customers can be ranked from highest to lowest expected value for marketing.  Why multiply the probability of purchasing by the predicted total value of purchases? The product can be thought of as the expected future profit, and it gives a better assessment of customer profitability than looking at the predicted total value of purchases (the stage-2 model target) alone.  The customers predicted to make the largest dollar amount of purchases might not be very likely to make any purchase at all.  If the predicted probability of B_tgt=1 and Int_tgt are negatively correlated, using only the predicted value of Int_tgt would paint a misleading picture of which customers are most valuable to target for marketing.   This is essentially what we found after producing a stage-1 model and plotting the relationship between the stage-1 model predictions and the stage-2 target:

 

01_TE_highest-prob-of-purchasing-is-associated-with-lowest-value-purchasing-1024x550.png

 


 

Stratified partitioning for honest assessment

 

We decided to partition the data into training and validation sets and assess each potential stage-1 and stage-2 model on its validation performance.  The simplest approach to partitioning is to take a simple random sample of, say, 50% for training and 50% for validation.  Instead, I recommend using a stratified random sample for each partition.  For the stage-1 model (predicting a binary target), the partitions should be stratified on the target.   Stratification means that the training and validation data sets contain the same proportion of purchasers, which makes performance on the two partitions more comparable than if there were, say, 22% purchasers in the training data and 18% in the validation data.  A stratified partition can be created using the New Data Item option in Visual Analytics and then choosing Partition:

 

Border_02_TE_partition-1.png
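
If you prefer code to the point-and-click route, a comparable stratified 50/50 split can be produced with PROC PARTITION in SAS Viya.  This is only a sketch: the caslib name mycas, the table name VS_BANK, and the seed are placeholders.

/* Stratified 50/50 split on the binary target; the PARTIND option   */
/* adds a _PartInd_ flag (1 = sampled for training, 0 = validation)  */
proc partition data=mycas.vs_bank partind samppct=50 seed=802;
   by b_tgt;                                   /* stratification variable */
   output out=mycas.vs_bank_part copyvars=(_all_);
run;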

 

For the interval-target models, care should be taken when creating the partition.  Monetary values are often highly skewed, and this is the case with the interval target, Int_tgt.  If, by chance, all the rare high-value purchasers ended up in the validation data with none in training, model performance would suffer.  To avoid this, a partition can be created that is stratified on a binned version of the interval target.  We did this using the New Data Item option to create a measure variable, binned_purchases: missing values went into bin 0, total purchases up to $50K into bin 1, total purchases above $50K and up to $100K into bin 2, and purchases greater than $100K into bin 3 (the maximum was $500K):

 

03_TE_binning-int_tgt-into-4-groups-measure-variable.png

We then used binned_purchases to create a categorical variable (bins) by choosing New Data Item and then Custom Category.  This categorical variable was used to stratify the second partition we created.  SAS Visual Statistics holds only one partition variable at a time, so the stage-1 partition was hidden when the stage-2 partition was created.
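
Programmatically, the binned stratification variable could be built in a short DATA step before creating the second partition (again a sketch with placeholder table names); a second PROC PARTITION run with BY bins would then produce the stage-2 partition.

/* Bin the skewed interval target so the stage-2 partition can be    */
/* stratified on purchase size; a missing Int_tgt means non-purchaser */
data mycas.vs_bank_binned;
   set mycas.vs_bank;
   length bins $ 12;
   if missing(int_tgt)       then bins = '0: $0';
   else if int_tgt <= 50000  then bins = '1: <=$50K';
   else if int_tgt <= 100000 then bins = '2: <=$100K';
   else                           bins = '3: >$100K';
run;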

 

Stage-1 models

 

We tried three stage-1 models for predicting the binary target B_tgt: logistic regression with backward elimination based on SBC, logistic regression with stepwise selection based on AIC, and a decision tree using default settings. The models were compared on their misclassification rate when applied to the validation data:

 

Model                         Validation misclassification
Logistic backward (SBC)       0.1615
Logistic stepwise (AIC)       0.1614
Decision tree                 0.1645

 

The AIC-based stepwise logistic model had the best performance, and the Derive predicted option was used to add its predicted probabilities to the Data pane.
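
If you wanted to reproduce the winning stage-1 model in code, PROC LOGSELECT (the procedure behind the logistic regression object in SAS Viya) supports stepwise selection by AIC.  The sketch below assumes the caslib, table, and output variable names; in practice you would also pass the partition indicator to the procedure so that validation statistics are computed.

/* Stage-1 model: stepwise logistic regression selected by AIC,      */
/* scoring every row with the predicted probability of purchase      */
proc logselect data=mycas.vs_bank_part;
   class cat_input1 cat_input2;
   model b_tgt(event='1') = logi_rfm1-logi_rfm12 cat_input1 cat_input2;
   selection method=stepwise(select=aic choose=aic);
   output out=mycas.stage1_scored copyvars=(_all_) pred=p_b_tgt;
run;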

 

Stage-2 models

 

All of the stage-2 models used the predicted probability from the final stage-1 model as an input. How should we model the interval target? A linear regression model would be inappropriate for these data because of the high skewness of the target, Int_tgt.  We decided to try two generalized linear models: lognormal and gamma regressions.  Both use continuous distributions that can account for skewness in the response.  Below is a picture of the skewed lognormal and gamma distributions, compared to a (truncated) normal distribution.  Note that the lognormal distribution has heavier tails than the gamma, and both are more skewed than a truncated normal distribution:

04_TE_lognormal-gamma.png

Image modified from SAS Statistics 2: ANOVA and regression course notes

 

Another advantage of the lognormal and gamma regression models is that these distributions are strictly positive, so, unlike a linear regression model, they will never produce a nonsensical prediction of a negative purchase value.  The lognormal and gamma models are available in a Generalized Linear Model object by choosing, respectively, the normal distribution with a log link or the gamma distribution with a log link in the options pane:

 

Border_05_TE_GZLM-gamma-options-pic-2.png
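
The same models can be fit in code with PROC GENSELECT, which backs the Generalized Linear Model object.  The sketch below shows the gamma/log-link version; replacing DISTRIBUTION=GAMMA with DISTRIBUTION=NORMAL (keeping LINK=LOG) gives the normal/log-link model described above.  Table and variable names, including the stage-1 probability p_b_tgt, are placeholders.

/* Stage-2 model: gamma regression with a log link, using the        */
/* stage-1 predicted probability as one of the inputs                */
proc genselect data=mycas.stage1_scored;
   class cat_input1 cat_input2;
   model int_tgt = p_b_tgt logi_rfm1-logi_rfm12 cat_input1 cat_input2 /
         distribution=gamma link=log;
   selection method=backward;
   output out=mycas.stage2_scored copyvars=(_all_) pred=p_int_tgt;
run;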

 

How do we get predictions for the 80% of customers who did not make a purchase?  The $0 purchasers were coded as missing values for the target.  When a model is built in SAS on data that include missing values of the target, the model is fit on only the rows with complete data (about 200K rows here), but predictions are made on the complete data set (1.1 million rows). Researchers sometimes use this behavior to score new data: a data set to be scored (with the target missing) is concatenated with the data used to fit the model, and predictions are then calculated for every row.  This is sometimes called the missing-Y trick.
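
As a concrete (hypothetical) illustration of the missing-Y trick, new accounts to be scored get a missing target and are appended to the modeling data before the procedure runs; they pick up predictions from the model's output but do not influence the fit.  Table names below are placeholders.

/* New accounts to score: the target is unknown, so set it to missing */
data work.to_score;
   set work.new_accounts;          /* hypothetical table of new customers */
   int_tgt = .;
run;

/* Append to the modeling data; a downstream OUTPUT statement will    */
/* produce predictions for these rows as well                         */
data work.fit_and_score;
   set work.vs_bank work.to_score;
run;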

 

It's likely that the predicted probability from the stage-1 model is non-linearly associated with the interval target.  How should we account for this non-linear relationship? One way is to use a spline function. A spline is a piecewise polynomial function whose pieces are smoothly joined at knots. Splines are very flexible and can model complex non-linear relationships that are difficult to capture with polynomials or interactions.  The spline effects themselves are typically not interpretable, but the non-spline terms in the regression model retain their interpretability.  To add a spline to the lognormal and gamma models, we used a Generalized Additive Model (GAM) object and created a spline of the predicted probability of B_tgt using the New Data Item menu and choosing Spline effect:

 

Border_06_TE_new-spline-effect.png

 

We let SAS Visual Statistics do the hard work of iteratively finding a spline function for our generalized additive model using the default settings.  For more information on creating generalized additive models in SAS programmatically or using SAS Model Studio, see Beth Ebersol’s post here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-Use-Generalized-Additive-Models-in-SAS....
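
As a programmatic analogue of the GAM object, PROC GAMPL fits this kind of model with a penalized spline of the stage-1 probability.  This is a sketch only: the table and variable names are placeholders, and the GUI's default settings may differ from GAMPL's defaults.

/* Stage-2 GAM: gamma response with a spline of the stage-1          */
/* predicted probability plus the remaining parametric effects       */
proc gampl data=work.stage1_scored;
   class cat_input1 cat_input2;
   model int_tgt = param(logi_rfm1-logi_rfm12 cat_input1 cat_input2)
                   spline(p_b_tgt) / dist=gamma;
   output out=work.gam_scored pred=p_int_gam;
run;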

 

The lognormal and gamma models used backward elimination, with variables removed based on the validation average squared error.  The GAM models used the same predictors as the corresponding generalized linear models, except that the stage-1 predictions were removed from the Continuous effects role and a spline function of those predictions was added. We also tried a decision tree as a stage-2 model, using the default settings.  The validation average squared error (ASE) was compared for each model:

 

Model                                                      Validation ASE
Generalized linear model: normal distribution, log link    55,646,643
GAM: normal distribution, log link                         55,539,930
Generalized linear model: gamma distribution, log link     56,074,933
GAM: gamma distribution, log link                          60,839,972
Decision tree (default settings)                           51,307,883

 

The decision tree performed the best of the five stage-2 models tested. The predictions from the decision tree model were saved using the Derive predicted option.  We then created a list table with the customer account IDs, the predictions from the stage-1 logistic regression, the predictions from the stage-2 decision tree, and the expected future profit (the product of the two predictions).  Finally, we sorted the table by expected future profit:

 

07_TE_expected-future-profits-2-1024x258.png
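
The same ranking can be produced in a few lines of code once both sets of predictions are in one table.  The table and prediction variable names below are placeholders for the derived predictions.

/* Expected future profit = P(purchase) x predicted purchase value   */
data work.ranked;
   set work.two_stage_scored;
   expected_profit = p_b_tgt * p_int_tgt;
run;

/* Rank customers from most to least valuable for marketing          */
proc sort data=work.ranked;
   by descending expected_profit;
run;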

 

To use the two-stage model with new data, the Export model option can be used for both models to produce SAS DATA step scoring code, which can then be run in SAS Studio.
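
A minimal deployment sketch is shown below; the score-code file paths and the names of the predicted columns they create are hypothetical and depend on what the Export model option produces.

/* Score new accounts with the exported DATA step code from both     */
/* models, then compute the expected future profit                   */
data work.new_scored;
   set work.new_accounts;
   %include '/shared/models/stage1_logistic_score.sas';   /* hypothetical path */
   %include '/shared/models/stage2_tree_score.sas';       /* hypothetical path */
   expected_profit = p_b_tgt * p_int_tgt;   /* names depend on the exported code */
run;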

 

Summary and more information

 

This was a brief example of two-stage modeling using SAS Visual Statistics. Would you like to know more? Generalized linear models are a great tool for expanding your modeling skills beyond linear regression.  These models are discussed in the course Statistics 2: ANOVA and Regression and in the SAS Viya course SAS Visual Statistics: Interactive Model Building.  Generalized additive models are addressed in the course Regression Methods Using SAS Viya.  I hope this post has given you some new ideas about how to model your data.

 

See you at the next SAS class!

 

 

Find more articles from SAS Global Enablement and Learning here.

