
Model Assessment in SAS Visual Data Mining and Machine Learning


In a previous post, I summarized the supervised learning models (support vector machines and factorization machines). In this post, I'll explore model assessment.

 

PROC ASSESS

SAS Visual Data Mining and Machine Learning (DMML) includes a procedure for assessing model performance called PROC ASSESS. You can take the output data set generated by PROC ASSESS and use PROC SGPANEL to create ROC curves or lift charts. This gives you plots similar to what you would see generated by Enterprise Miner’s Model Comparison node. I built the graph below (and all of the graphs in this post) in SAS Studio, but you will notice that it looks very similar to an Enterprise Miner graph.
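To make the pattern concrete before we walk through the full example, here is a minimal sketch. The table and library names are placeholders; it assumes a scored table with target BAD, event probability P_BAD1, and partition flag _PARTIND_, matching the snippet we use below.

/* Minimal sketch of the PROC ASSESS-to-PROC SGPANEL pattern.           */
/* MYLIB.SCORED is a placeholder for a scored table with target BAD,    */
/* event probability P_BAD1, and partition flag _PARTIND_.              */
proc assess data=mylib.scored;
 input p_bad1;                          /* predicted event probability */
 target bad / level=nominal event='1';
 by _partind_;                          /* keep train/validation apart */
 ods output rocinfo=work.rocinfo;       /* sensitivity and FPR by cutoff */
run;

proc sgpanel data=work.rocinfo;
 panelby _partind_;                     /* one panel per partition */
 series x=fpr y=sensitivity;            /* the ROC curve */
run;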

 

[Figure: ROC curves for the three models overlaid]

 

An ROC (receiver operating characteristic) curve lets you compare different model results. The true positive rate (sensitivity) is plotted on the vertical axis, and the false positive rate (1 minus specificity) is plotted on the horizontal axis. The better the model performance, the farther up and to the left the curve will be, maximizing the true positive rate and minimizing the false positive rate. Above we see that the gradient boosting model performed best (green line), followed by the random forest model (red line). The logistic model (blue line) performed the worst of the three models.
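The two rates come straight from the confusion-matrix counts: sensitivity = TP/(TP + FN) and false positive rate = FP/(FP + TN). As a quick illustration with made-up counts:

/* Illustration only: one ROC coordinate from hypothetical confusion-matrix counts */
data roc_point;
 tp = 80;  fn = 20;   /* events correctly / incorrectly classified     */
 fp = 30;  tn = 170;  /* non-events incorrectly / correctly classified */
 sensitivity = tp / (tp + fn);   /* true positive rate  = 0.80 */
 fpr         = fp / (fp + tn);   /* false positive rate = 0.15 */
run;

proc print data=roc_point; run;

Each cutoff on the predicted probability yields one such (fpr, sensitivity) point; sweeping the cutoff from 0 to 1 traces out the curve.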

 

Start with the Supervised Learning Snippet (ships with the software)

We will start in SAS Studio on SAS Viya by expanding the Snippets tab, navigating to Snippets/Machine Learning, and double-clicking the Supervised Learning snippet to add its code to the SAS Studio code pane, as shown below.

 

[Screenshot: the Supervised Learning snippet in the SAS Studio Snippets pane]

 

 

The Supervised Learning snippet code uses the SAS sample data set HMEQ, historic home equity loan data used to train the models. The target variable is BAD: if the individual defaulted on their loan, BAD = 1.
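If you want to see the target's distribution before modeling, a quick PROC FREQ works. Here it reads SAMPSIO.HMEQ, the sample-library copy of the data; availability can vary by deployment.

/* Quick look at the target: BAD = 1 flags a loan default */
proc freq data=sampsio.hmeq;
 tables bad;
run;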

 

If we scroll to the bottom of this snippet, we will see that code already exists to create an ROC curve and lift chart for a single model. But we want instead to create an ROC curve and lift chart that compare multiple models, so we will delete the code from line 166 to the end of the snippet.

 

Add Two Models

In addition to the random forest model, which is already included in the snippet, we will run two more models. First, we run and score a gradient boosting model using PROC GRADBOOST as follows:

 

 

 
/****************************************************/
/* Build a predictive model using Gradient Boosting */
/****************************************************/
proc gradboost data=&caslibname.._prepped ntrees=50 intervalbins=20 maxdepth=5;
 input &interval_inputs. / level = interval;
 input &class_inputs. / level = nominal;
 target bad / level = nominal;
 partition rolevar=_partind_(train='1' validate='0');
 code file="&outdir./gradboost.sas";
run;
/********************************************/
/* Score the data using the generated model */
/********************************************/
data &caslibname.._scored_gradboost;
 set &caslibname.._prepped;
 %include "&outdir./gradboost.sas";
run;
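As an optional sanity check, you can print a few rows of the scored table to confirm that the generated score code created the event and non-event probability columns, P_BAD1 and P_BAD0, which we will feed to PROC ASSESS later.

/* Optional check: the generated score code should have added P_BAD1 and P_BAD0 */
proc print data=&caslibname.._scored_gradboost (obs=5);
 var bad p_bad1 p_bad0;
run;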

 

Next, we run and score a logistic regression model using PROC LOGSELECT and a data step as follows:

 

 

/******************************************************/
/* Build a predictive model using Logistic Regression */
/******************************************************/
proc logselect data=&caslibname.._prepped;
 class bad &class_inputs.;
 model bad(event='1')=&class_inputs. &interval_inputs.;
 selection method=forward;
 partition rolevar=_partind_(train='1' validate='0');
 code file="&outdir./logselect.sas";
run;
/********************************************/
/* Score the data using the generated model */
/********************************************/
data &caslibname.._scored_logselect;
 set &caslibname.._prepped;
 %include "&outdir./logselect.sas";
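 /* The LOGSELECT score code outputs only P_BAD (the event probability), */
 /* so we derive the non-event probability that PROC ASSESS needs        */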
 p_bad0=1-p_bad;
run;

 

Assess the Three Models: Random Forest, Gradient Boosting and Logistic Regression

Now we are ready to assess our three models. We will create a macro to do this, so that we don’t have to write the PROC ASSESS code three times. Our target is the variable BAD.

 

 

/****************************/
/* Assess model performance */
/****************************/
libname BethWork "/home/sasdemo";
%macro assess_model (prefix=, var_evt=, var_nevt=);
proc assess data=&caslibname.._scored_&prefix.;
 input &var_evt.;
 target bad / level=nominal event='1';
 fitstat pvar=&var_nevt. / pevent='0';
 by _partind_;
 ods output fitstat = BethWork.&prefix._fitstat
  rocinfo = BethWork.&prefix._rocinfo
  liftinfo = BethWork.&prefix._liftinfo;
run;
%mend assess_model;

 

Now we can call the macro, filling in the arguments for the prefix, the variable that holds the probability of the event, loan default (VAR_EVT), and the variable that holds the probability of the non-event, no default (VAR_NEVT). Note that the score code generated by PROC LOGSELECT names its event probability P_BAD, whereas the tree-based models use P_BAD1.

 

 

%assess_model(prefix=forest,    var_evt=p_bad1, var_nevt=p_bad0);
%assess_model(prefix=gradboost, var_evt=p_bad1, var_nevt=p_bad0);
%assess_model(prefix=logselect, var_evt=p_bad,  var_nevt=p_bad0);
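If you want a quick numeric comparison before plotting, you can print any of the fit statistics tables that PROC ASSESS wrote out, for example:

/* Optional: fit statistics by partition for the gradient boosting model */
proc print data=BethWork.gradboost_fitstat;
run;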

 

Combine the ROC Results into a Single Data Set

As shown below, we use PROC FORMAT to create a format for the partition indicator, and we combine our data sets, adding a MODEL variable to distinguish the results of the logistic, random forest, and gradient boosting models.

 

 

/*******************************************/
/* Analyze model using ROC and Lift charts */
/*******************************************/
ods graphics on;
proc format;
 value partindlbl
 0 = 'Validation'
 1 = 'Training';
run;
data BethWork.all_rocinfo;
 set
 BethWork.logselect_rocinfo(keep=sensitivity fpr _partind_ in=l)
 BethWork.forest_rocinfo (keep=sensitivity fpr _partind_ in=f)
 BethWork.gradboost_rocinfo(keep=sensitivity fpr _partind_ in=g);
 length model $ 16;
 select;
  when (l) model = 'Logistic';
  when (f) model = 'Forest';
  when (g) model = 'GradientBoosting';
 end;
run;

 

Create ROC Curves

Finally, we are ready to create some charts. First, we will plot validation and training data together (group=_partind_) on a separate graph for each of the three models (panelby model), giving us three side-by-side graphs.

 

 

/* Plot Validation and Training Together on a Separate ROC Graph for Each Model */
proc sgpanel data=BethWork.all_rocinfo aspect=1;
 panelby model / layout=columnlattice spacing=5;
 title "ROC Curve Panel";
 rowaxis label="True positive rate" values=(0 to 1 by 0.25) grid offsetmin=0.05 offsetmax=0.05;
 colaxis label="False positive rate" values=(0 to 1 by 0.25) grid offsetmin=0.05 offsetmax=0.05;
 lineparm x=0 y=0 slope=1 / transparency=0.7;
 series x=fpr y=sensitivity /group=_partind_;
 format _partind_ partindlbl.;
run;

 

This gives us the following graph, with validation results in blue and training results in red, and each model in its own panel.

 

[Figure: ROC Curve Panel, training and validation curves in one panel per model]

 

 

But perhaps we want to see all of the models on the same graph so that we can easily compare them. We will now change panelby to _partind_ and group to model, as shown below.

 

 

/* Plot ROC Curves for All Models Together */
proc sgpanel data=BethWork.all_ROCinfo;
 panelby _partind_ / layout=columnlattice spacing=5;
 title "ROC Curve Models Overlain";
 rowaxis label="True Positive Rate";
 colaxis label="False Positive Rate" grid;
 lineparm x=0 y=0 slope=1 / transparency=0.7;
 series x=fpr y=sensitivity / group=model;
 format _partind_ partindlbl.;
run;

 

This gives us the three ROC curves for the random forest model, the gradient boosting model, and the logistic regression model overlain on the same graph as shown below. I have separated the graphs for Validation and Training data.

 

[Figure: ROC curves for all three models overlain, with separate validation and training panels]

 

 

Maybe we decide that we want to add markers, so we add the MARKERS option with markerattrs=(symbol=circlefilled), as shown in the code and graph below.

 

 

/* Plot ROC Curves for All Models Together With Markers */
proc sgpanel data=BethWork.all_ROCinfo;
 panelby _partind_ / layout=columnlattice spacing=5;
 title "ROC Curve Models Overlain With Markers";
 rowaxis label="True Positive Rate";
 colaxis label="False Positive Rate" grid;
 lineparm x=0 y=0 slope=1 / transparency=0.7;
 series x=fpr y=sensitivity / group=model markers markerattrs=(symbol=circlefilled);
 format _partind_ partindlbl.;
run;

 

[Figure: ROC curves for all three models overlain, with markers]

 

 

Combine the Lift Results into a Single Data Set

Similarly, we can create lift charts. Again we start by combining the data results as shown below.

 

 

/* Create lift charts */
data BethWork.all_liftinfo;
 set BethWork.logselect_liftinfo(keep=depth lift cumlift _partind_ in=l)
 BethWork.forest_liftinfo (keep=depth lift cumlift _partind_ in=f)
 BethWork.gradboost_liftinfo(keep=depth lift cumlift _partind_ in=g);
 length model $ 16;
 select;
  when (l) model = 'Logistic';
  when (f) model = 'Forest';
  when (g) model = 'GradientBoosting';
 end;
run;

 

 

Create Lift Charts

And again we use PROC SGPANEL to create the charts. In the example below, we create separate charts for the validation and training data, but overlay information from each of the three models on each chart.

 

 

proc sgpanel data=BethWork.all_liftinfo;
 panelby _partind_ / layout=columnlattice spacing=5;
 title "Lift Chart All 3 Models Overlain";
 rowaxis label="Lift";
 colaxis label="Depth" grid;
 series x=depth y=lift / group=model markers markerattrs=(symbol=circlefilled);
 format _partind_ partindlbl.;
run;
title;
ods graphics off;

 

[Figure: Lift chart with all three models overlain]

 

Lift charts indicate how well the model performs compared with no model, by plotting the ratio between the result predicted by the model and the result from using no model. Lift is plotted on the vertical axis, and depth, the percentage of the population targeted, on the horizontal axis. Here we see that the gradient boosting model does well at low depths: with a lift of roughly 5 at a depth of 10%, using the model to reject 10% of the loan applicants would appropriately reject almost 50% of the defaulters (capture rate ≈ lift × depth = 5 × 10%).

 

In this post, I showed you how to use PROC SGPANEL to create ROC curves and lift charts, which make it easy to graphically compare the performance of multiple models created in SAS Studio on SAS Viya. I hope this has been helpful!

 

 
