In a companion blog, "Accuracy vs. Interpretability? With Generalized Additive Models (GAMs), You Can Have Both," we provide an overview of generalized additive models (GAMs) and their beneficial features. GAMs are appealing because they strike a nice balance between flexibility and performance while maintaining a high degree of interpretability.
This article provides the complete step-by-step instructions to reproduce the data analysis in the companion blog, to show you how easy it is to use Model Studio to train a GAM and compare it to other machine learning models. To do that, let’s revisit an example from Lamm & Cai (2020) that trains a GAM to predict the probability that a mortgage applicant will default on a loan. See Example 2 in Lamm & Cai (2020) for more information about the example and the Hmeq data set.
For a snapshot of this example, watch the following demo video:
First, let’s run the following code within SAS® Studio in SAS® Viya® to start a new session with a SAS® Cloud Analytic Services (CAS) server, assign the Mycas libref to the Casuser caslib, and create the data set used by Lamm & Cai (2020) as the in-memory table HmeqLC in the Casuser library. Here, the DATA step code differs from Lamm & Cai (2020) in that it renames the processed data set to distinguish it from the original Hmeq data set. It also includes the PROMOTE= data set option to create the CAS table with global scope so that it is accessible from Model Studio.
libname mycas cas caslib="casuser";
if cmiss(of _all_) then delete;
if CLAge > 1000 then delete;
part = ranbin(1,1,0.2);
Next, let’s switch to Model Studio, part of SAS® Visual Data Mining and Machine Learning, and use the GAM node to train the GAM. To do this, access the Applications menu in the upper-left corner and select Build Models (Figure 1).
Figure 1 Open Model Studio via the Build Models selection in the Applications menu.
Within Model Studio, select New project and create a Data Mining and Machine Learning project with the HmeqLC data table (Figure 2).
Figure 2 Create a new Model Studio project with the preprocessed data set, HmeqLC.
In the Data tab, select the variable BAD and assign it to the role of Target (Figure 3).
Figure 3 Set the variable BAD as the target.
Next, select the part variable, assign it to the role of Partition, select Map Partition Levels, and map Training Level to 0 and Test Level to 1 (Figure 4).
Figure 4 Map the levels to the partition variable to match Lamm & Cai (2020).
Now you are ready to switch to the Pipelines tab to create a model building pipeline. To add a GAM node to your pipeline, right-click the Data node → Add child node → Supervised Learning → GAM (Figure 5).
Figure 5 Steps to add the GAM node to your pipeline.
The GAM node enables you to train a GAM without the need to write any code. The node contains a variety of options to customize your analysis, such as the interval target probability distribution, the target link function, effects options, and options related to the chosen selection method (Figure 6).
Figure 6 The GAM node includes many options to customize your model.
For example, you can use the following steps to train a model similar to the one in Lamm & Cai (2020):
After you use the node to train the model, right-click the GAM node and select Results to view the model’s results. For the effects in the final model, the results include smoothing component plots for the spline terms and parameter estimates for the parametric terms. For example, the results include a smoothing component plot for the Spline(Debtinc) term (Figure 7). You can see that the probability of default is generally higher for an applicant who has a high debt-to-income ratio and the relationship is nonlinear.
Figure 7 The probability of default is generally higher for higher debt-to-income ratios.
Given the importance of the smoothing component plots, the GAM node includes options that govern the display of the plots. For example, you can select Use a common vertical axis to use the same y-axis range for each univariate spline plot. This facilitates comparisons of the magnitudes of the estimated spline effects on the predicted target.
In addition, the ability to interpret the parametric effects like you would with a logistic regression model is another way in which the GAM’s results are relatively easy to understand. For example, an applicant with more delinquent credit lines has a higher predicted probability of default, on average (Figure 8).
Figure 8 More delinquent credit lines typically correspond to a higher probability of default.
Now that you have used the GAM node to train a GAM, let’s train a few other models for comparison, which you can easily do in Model Studio. Repeat the previous steps to add a Supervised Learning node to your pipeline to add the Gradient Boosting, Logistic Regression, Neural Network, and SVM nodes, and then click Run pipeline to train and compare all these models. Right-click the Model Comparison node to view the Results. If you sort the algorithms by Misclassification Rate, you see that the GAM ranks second (Figure 9).
Figure 9 The GAM outperforms some of the more complex models but maintains interpretability.
Even though the GAM has a slightly higher misclassification rate, it is more interpretable than the champion gradient boosting model and still outperforms other complex models such as SVM and neural network.
As you can see, Model Studio enables you to easily train a GAM alongside other common machine learning models and to compare model performance. This is done in just a few clicks and keystrokes to make analytics more accessible to a broader range of people, empowering them to use data to improve decision making.
The author is grateful to Wendy Czika, Michael Lamm, Weijie Cai, and Brett Wujek for their help and feedback during the development of the GAM node. Thanks also to Wendy Czika, Anna Brown, and Ed Huddleston for their helpful comments on an early draft.
In addition to the resources linked in the article, the following resources provide further information about training GAMs with SAS:
 Available in SAS Viya Stable 2020.1.1 (December 2020) and Long-Term Support 2021.1 (May 2021).
 A binary target variable, such as BAD, always uses the binary distribution.
 This option adds bivariate splines for all pairwise combinations of the interval inputs, whereas the GAM in Lamm & Cai (2020) includes only a few bivariate spline terms.
Time is running out to save with the early bird rate. Register by Friday, March 1 for just $695 - $100 off the standard rate.
Check out the agenda and get ready for a jam-packed event featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.