Multinomial Classification in SAS Model Studio

The purpose of this blog is to show how easy it is to classify a nominal target in SAS Viya. There are many examples of binary classification in both real life and published literature. For example, will a customer default on a loan or not? Will a citizen vote Democratic or Republican? Will a patient return to the hospital or recover at home? But sometimes the outcome is non-binary. For example, which one of these 5 cars will a customer purchase? Which one of these 3 fruits matches the image? Which one of these 4 diseases matches the patient's symptoms? When our target has more than two distinct, unordered levels, we will want to perform multinomial classification. Because each data point belongs to one of several classes, multinomial classification allows data scientists to model and address complex, real-world problems that are inherently multi-faceted.

In this blog, I'll be using a dataset created from a higher education institution to address the challenge of academic dropout and failure. The dataset consolidates information from a variety of databases on students pursuing undergraduate degrees in multiple fields. It includes details at the time of enrollment such as demographics, socio-economic factors, and academic performance. The target has three levels: dropout, enrolled, and graduate. The challenge is to build a machine learning model that can accurately predict a student's classification and hopefully reduce academic failure in higher education. Here is a quick snapshot of the data which contains 37 columns and just over 4400 rows.

In one of my previous blogs, An Introduction to Machine Learning in SAS Model Studio, I showed you how easy it was to build machine learning models in SAS Viya. Today we are going to use SAS Model Studio to build several models that can handle the challenge of a multinomial target. From SAS Drive, we open the Model Studio web application by clicking the applications menu icon and selecting Build Models.

In the Projects tab, we select New Project to create a new modeling project for our academic data. We fill out the New Project window. Let’s name this project Student Success Factors and select the Advanced template for a class target. This advanced template has a nice variety of machine learning models including logistic regression, neural network, and gradient boosting. The SSF data table has already been loaded into memory in the PUBLIC library, so we can easily select it. We’ll also change the data partition by selecting Advanced -> Partition Data and changing the Training percentage to 70 and the Test percentage to 0. This eliminates the test partition, so our table now contains only two partitions: training and validation. Select Save and Save again to create the new project.
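To make the partition step concrete, here is a minimal Python sketch of a 70/30 training/validation split. It assumes the split is stratified by target level, which this post does not state; the function name, the seed, and the toy target column are all hypothetical and are not the SSF data.

```python
import random
from collections import defaultdict

def stratified_partition(labels, train_fraction=0.7, seed=12345):
    """Split row indices into training and validation, one target level at a time."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for i, label in enumerate(labels):
        by_level[label].append(i)
    train_idx, valid_idx = [], []
    for level in sorted(by_level):
        idx = by_level[level]
        rng.shuffle(idx)                         # random order within the level
        cut = round(train_fraction * len(idx))   # rows of this level sent to training
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)

# Made-up toy target column, not the SSF data
labels = ["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20
train_idx, valid_idx = stratified_partition(labels)
print(len(train_idx), len(valid_idx))
```

Splitting within each level keeps the mix of Dropout, Enrolled, and Graduate roughly the same in both partitions, which matters when one level is much smaller than the others.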

SAS Model Studio opens to the Data tab. We can scroll down to our dependent multinomial variable and see that it has been conveniently identified as the target variable because it is called Target. Select Target to examine the properties of this column. If we select Specify the Target Event Level, it will reveal that the default event level is Graduate. By default, the software selects the highest value in alphanumeric order. The Graduate level accounts for nearly 50% of the target values. While this high percentage could end up affecting the performance of our model, we won’t address unbalanced target data today. We will move forward in building our model keeping the default level.
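The default event level rule can be illustrated with a few lines of Python. This is only a sketch: plain string sorting stands in for the software's alphanumeric ordering (the two may differ for mixed case or formatted values), and the counts in the toy column are made up.

```python
# Toy target column; the levels match the post, the counts are made up.
target = ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2

levels = sorted(set(target))      # levels in ascending string order
default_event = levels[-1]        # the highest value in that order
event_share = target.count(default_event) / len(target)

print(levels)
print(default_event, f"{event_share:.0%}")
```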

Moving over to the Pipelines tab, it is time to run the advanced template that we selected when we created the project. The Advanced template for a class target contains a pipeline with a total of six models (the purple nodes) and one ensemble model. Even though we could modify the construction of this pipeline (e.g., add or delete nodes) or change the defaults of the existing models, we simply choose to use the template unaltered. Select Run pipeline to run all the included nodes, which will reveal a champion model.

Right-click the Model Comparison node and select Results. The results from the pipeline reveal that the Gradient Boosting model is the champion based on the KS (Youden) model fit statistic.
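For readers who have not met the KS (Youden) statistic, it is commonly defined as the largest gap between the true positive rate and the false positive rate across cutoffs, computed for the event level versus everything else. The sketch below follows that definition on made-up scores. The function name is hypothetical, and using the observed scores as cutoffs is an assumption; the cutoff grid that Model Studio uses is not covered in this post.

```python
def ks_youden(event_flags, event_probs):
    """Return (largest TPR - FPR, cutoff where it occurs) over the observed scores."""
    n_event = sum(event_flags)
    n_nonevent = len(event_flags) - n_event
    best_ks, best_cutoff = 0.0, None
    for cutoff in sorted(set(event_probs)):
        # A row is predicted as the event when its probability reaches the cutoff
        tp = sum(1 for y, p in zip(event_flags, event_probs) if y == 1 and p >= cutoff)
        fp = sum(1 for y, p in zip(event_flags, event_probs) if y == 0 and p >= cutoff)
        ks = tp / n_event - fp / n_nonevent
        if ks > best_ks:
            best_ks, best_cutoff = ks, cutoff
    return best_ks, best_cutoff

# Made-up example: 1 = event level (e.g. Graduate), 0 = any other level
event_flags = [1, 1, 1, 0, 1, 0, 0, 0]
event_probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
ks, cutoff = ks_youden(event_flags, event_probs)
print(ks, cutoff)
```

A value near 1 means the event probabilities separate events from non-events almost perfectly, while a value near 0 means they barely separate them at all.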

Let’s check out the results of our champion model. Right-click the Gradient Boosting node and select Results. On the Node tab, expand the Output to see the results from the GRADBOOST procedure. Scroll to the bottom and note that this machine learning model will calculate three predicted probability variables, one for each target level, along with the predicted target variable.
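The relationship between the three predicted probabilities and the predicted target can be sketched in Python. This assumes the predicted target is simply the level with the highest predicted probability; the dictionary keys and the probability values are illustrative and are not the actual output column names or scores.

```python
def predict_level(probabilities):
    """Pick the target level with the largest predicted probability."""
    return max(probabilities, key=probabilities.get)

# Made-up scored row: one predicted probability per target level
row = {"Dropout": 0.15, "Enrolled": 0.25, "Graduate": 0.60}
predicted = predict_level(row)
print(predicted)
```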

If we close the procedure output and select the Output Data tab, we can examine the actual predictions. Select View output data twice to open a data table that includes the original inputs along with the scored predictions. I opted to manage the columns and rearrange the default order of the input variables so we could focus on the predicted values. The first column is now the actual target value, and the second column is the predicted value for the target. Judging by the first ten rows, it appears that the gradient boosting model is doing a good job of correctly predicting the target value.

Let’s return to the Summary tab for a small journey into model interpretation. Just like neural networks, gradient boosting models are often called “black box” models, meaning models that have little to no interpretability. Unlike a single decision tree, which is very easy to interpret, a gradient boosting model is a combination of many trees (in this case 300) and is difficult to interpret. Still, we sometimes want to be able to say why this model decided to classify a student as a Graduate versus a Dropout. Examining the Variable Importance table attempts to answer that question. At the very least, by expanding the Variable Importance table we can see which inputs are major contributors to this classification decision. It appears that the number of curricular units approved in the 2nd semester is at the top of the list along with inputs such as course number, inflation rate, GDP, grade average in the 2nd semester, and whether tuition fees are up to date.

Closing the Variable Importance table, let’s finally look at some model assessment output. Select the Assessment tab. Many of the assessment features like ROC and cumulative lift are based on binary classification, or “event” versus “non-event”. In other words, since our target event level is Graduate, the levels of Dropout and Enrolled will be grouped together as the non-event. However, there is one chart which reveals what is happening for all three levels independently. Expand Nominal Classification, which defaults to a percentage plot. By examining the validation partition results, we can see that this model correctly predicts Dropouts and Graduates roughly 80-90% of the time, while only correctly classifying Enrolled about 50% of the time. This plot is based on frequency counts of the actual versus predicted target values for all three levels.
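Both ideas in this paragraph can be sketched in Python: grouping the non-event levels together for the binary charts, and turning actual-versus-predicted frequency counts into percentages. The function names and the ten toy rows are hypothetical, and the sketch assumes the percentages are computed within each actual level.

```python
from collections import Counter

def to_event_flags(actual, event_level="Graduate"):
    """Collapse a nominal target to event (1) versus non-event (0)."""
    return [1 if a == event_level else 0 for a in actual]

def classification_percentages(actual, predicted):
    """Percentage of each actual level that received each predicted level."""
    counts = Counter(zip(actual, predicted))
    totals = Counter(actual)
    return {(a, p): 100.0 * n / totals[a] for (a, p), n in counts.items()}

# Made-up actual and predicted values, not results from the post
actual = ["Dropout"] * 4 + ["Enrolled"] * 2 + ["Graduate"] * 4
predicted = ["Dropout", "Dropout", "Dropout", "Enrolled",
             "Enrolled", "Graduate",
             "Graduate", "Graduate", "Graduate", "Dropout"]

pct = classification_percentages(actual, predicted)
for level in sorted(set(actual)):
    print(level, pct.get((level, level), 0.0))   # percent correctly classified
print(to_event_flags(actual))
```

The diagonal entries (actual level equals predicted level) are the per-level percentages correctly classified, which is the kind of number quoted above for Dropout, Enrolled, and Graduate.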

I hope you've enjoyed seeing how easy it is to build and interpret predictive machine learning models using SAS Model Studio. While I'm not an expert in higher education data, I bet with some business knowledge and some tuning of the hyperparameters we could increase the predictive accuracy of this model. Would you like to keep learning? Maybe you would be interested in taking an instructor-led course like Machine Learning Using SAS® Viya® which will get you started. In this course you will learn how to build several different models, tweak them to get better results, and learn how to interpret the results. In fact, this course can prepare you to get certified as a SAS Certified Specialist: Machine Learning Using SAS Viya.

Never stop learning!
