This post will focus on the differences and similarities between model comparison in SAS Enterprise Miner and Model Studio. Comparing models is a critical step in the analytics lifecycle whenever predictive models are being built. Data scientists typically build several competing models and then need to pick a champion: the best-performing model that generalizes to new data. For analysts switching from SAS Enterprise Miner to Model Studio, you’ll see in this post that there are more differences than similarities when it comes to comparing models in these two analytical workhorses.
Let me start by being brutally honest. I couldn’t think of a catchy title for this post or even an exciting way to open it, so I did an internet search on “quotes about comparison”. And one that kept popping up did not initially seem relevant…until I realized it was. Some sources attribute this quote to Theodore Roosevelt, others to Mark Twain, so who knows really…my source is the internet after all. Anyhow, the quote is “Comparison is the thief of joy”. I like it. So true in nearly all aspects of life, except maybe data science. Clearly comparing yourself, your successes, your looks, or your financial status to others is a recipe for disaster. However, for data scientists building predictive models, comparison actually does lead to joy! The joy of having the “best” model for your data! In data science, results are data dependent. This means that, for example, when building a predictive model, a data scientist does not know ahead of time which modeling algorithm is best for the specific data they are analyzing. The result (which model is best) depends on the data. Therefore, to arrive at the “best” solution for predicting an outcome, data scientists typically build multiple models and then compare their performance on a holdout sample. Whichever model performs best on the holdout sample is the one that is most generalizable and most accurate on new, never-seen-before data. And typically, this is how predictive models in the business world are used: to make predictions on new data. What could bring a data scientist more joy than that?!
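To make that workflow concrete before we get to the tools, here is a minimal sketch of the idea in Python with scikit-learn rather than SAS. Everything in it (the synthetic data, the model settings, the 70/30 split) is an assumption made up for illustration; the point is simply to show competing models being fit on training data, scored on a holdout partition, and ranked to crown a champion.

```python
# A hypothetical, non-SAS illustration of champion selection:
# fit competing models on training data, assess them on a holdout
# partition, and declare the best performer the champion.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a binary churn target (not the commsdata set).
X, y = make_classification(n_samples=5000, n_features=20, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=1)

candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

misclassification = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)            # train on the training partition
    preds = model.predict(X_valid)         # score the holdout partition
    misclassification[name] = 1 - accuracy_score(y_valid, preds)

champion = min(misclassification, key=misclassification.get)
print(misclassification)
print("Champion:", champion)
```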
This post is part 6 in an ongoing series introducing Model Studio to the SAS Enterprise Miner user. I will be using concepts and terminology covered in my prior posts, so be sure to check them out before going further if you have not seen them yet. Links to all prior posts are at the bottom of this one.
One more comment before getting into things: this post will focus on the analytical tools and how each assists the analyst in picking a champion model. But for completeness, I do need to state that there is often much more than just analytical results that goes into picking a champion model. Things like model interpretability, speed of model training, speed of scoring, and the environment where scoring will be done are all considerations data scientists weigh when picking the champion model.
Although I won’t really be focusing on the details of the analysis, the data set I’m using is called commsdata and it comes from a fictitious telecommunications company that is trying to predict customer churn. The models are built on a binary target based on which customers have churned in the past.
SAS Enterprise Miner
Some work has already been done to get us to the point of model comparison: a project has been built, a diagram and data source (commsdata) have been created, the data source has been placed in the diagram, the data have been partitioned (training, validation, and test), and two predictive models have been built (Decision Tree and Gradient Boosting).
In Enterprise Miner, the analyst needs to manually add a Model Comparison node to the diagram. The Model Comparison node is found in the SEMMA tools palette, third from the left above the Assess tab.
The analyst needs to drag and drop the Model Comparison node into the diagram somewhere to the right (or below if building your process flow vertically) of the modeling nodes.
The modeling node of each model to be compared then needs to be connected to the Model Comparison node.
Let’s take a look at the properties panel of the Model Comparison node. There are many properties, so to simplify things, I’ll point out the two that, in my opinion, are the most important or, at a minimum, the ones most likely to be considered by the analyst. Both are found under the Model Selection group: Selection Statistic and Selection Table. Selection Statistic allows the analyst to choose the statistical measure used to determine the champion model, and Selection Table allows the analyst to choose which data partition the champion is selected from.
Notice that initially Selection Statistic is set to Default and Selection Table is set to Train, but the latter is greyed out, meaning the property is inactive. So, what does “Default” mean for Selection Statistic? Well, it depends on a few things, including, but not limited to, the measurement level of the target variable. If a profit or loss matrix is defined, then “Default” uses validation profit or loss to pick the champion. If no profit or loss matrix is defined, then misclassification rate on validation data is used for a categorical (including binary) target, and average squared error on validation data is used for an interval target. If no validation data set exists, training data is used. So even though the Selection Table property shows “Train”, the default partition used is validation. Keep that in mind when we get to Model Studio.
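As a quick illustration of those two default statistics, here is a hedged sketch in Python/NumPy (not SAS, and the validation-partition values are made up). It shows how misclassification rate for a binary target and average squared error are computed from actual outcomes and predicted probabilities.

```python
import numpy as np

# Made-up validation-partition values, for illustration only.
y_valid = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # actual target (1 = churn)
p_valid = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1])  # predicted P(churn)

# Misclassification rate (default for categorical targets): fraction of cases
# where the predicted class (probability cut at 0.5) disagrees with the actual class.
misclassification = np.mean((p_valid >= 0.5).astype(int) != y_valid)

# Average squared error (default for interval targets): mean squared difference
# between the actual outcome and the prediction.
ase = np.mean((y_valid - p_valid) ** 2)

print(f"Validation misclassification: {misclassification:.3f}")
print(f"Validation average squared error: {ase:.3f}")
```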
For our telecommunications example, let’s run the Model Comparison node with the default settings and see what we get. Below is the results window of the Model Comparison node.
First, notice that several graphical items are shown. The ROC Chart and Score Rankings Overlay plots (which include plots like lift, cumulative lift, and percent captured response charts) are shown automatically, but they are not used by Enterprise Miner in determining a champion model. They are provided for the analyst to consider. In the upper right-hand corner is a table of Fit Statistics. Although it cannot be seen above, the table is very wide, with lots of assessment measures provided for each available partition. The first column that provides numerical results is what Enterprise Miner uses to declare a champion model. In this case, misclassification rate from validation data is used: no profit or loss matrix has been defined, and although a three-way partition into training, validation, and test was created, the default table used is validation. Finally, the measurement level of our target is binary, thus misclassification rate is used. Keep in mind that this champion model is really only being “suggested” by Enterprise Miner. That’s why all the other calculated statistics are shown in the Fit Statistics table, and why the graphical methods of assessing models are provided. An analyst may make a more holistic decision about the champion model after considering several sources of information regarding model performance.
What if the analyst is not happy with default settings? Easy. Manually change the properties to the desired values. Suppose Average Squared Error on validation data is to be used. First, change Selection Statistic to Average Squared Error by clicking on the cell that displays “Default” next to the property and selecting the desired setting.
Once the Selection Statistic has been changed, the Selection Table property activates, and it can then be changed to Validation.
Notice now the Selection Table property is active (no longer in grey). The setting stays at Training until changed by the analyst. For this example, after Selection Statistic was changed, Selection Table was manually changed to Validation. Here’s the Fit Statistics table given these new settings in the properties panel:
Recall, as stated above, that the first column of numerical values is what Enterprise Miner uses to suggest a champion. In this case, Enterprise Miner has selected the Decision Tree as the champion model based on it having the minimum average squared error on the validation data. Notice the “Y” in the Selected Model column. You’d think “Y” in the Selected Model column stands for Yes (it actually does), but I think it comes from the last letter in the word “joy”!
Recall that in an earlier post I commented that a single Enterprise Miner project can have multiple diagrams. However, there is no node in the interface itself for comparing models across diagrams. If multiple diagrams were built on the same data set, perhaps with different data preprocessing or different modeling algorithms, the analyst would need to manually compare models across diagrams to pick an overall project champion.
Model Studio
Now let’s move to Model Studio. And I’ll preface with a statement I have made in other posts in the series. There are more differences than similarities when it comes to model comparison in Model Studio compared to Enterprise Miner. So, if you’re making the move to Model Studio, be ready for some changes. Recall that three types of projects can be built in Model Studio. Here, I’ll be discussing a Data Mining and Machine Learning project as that type aligns most closely with out-of-the-box Enterprise Miner functionality.
First, let’s start with where the Model Comparison node comes from. To get to this point, I’ve created a project using a blank pipeline template, kept the default 3-way partition of the data, assigned the target variable on the Data tab, and am looking at pipeline 1. Because I created the project with a blank template for its initial pipeline, only the Data node is in the pipeline. Coming from the Enterprise Miner world, it’s natural to think the Model Comparison node can be found in the Nodes pane. The only logical group of nodes for it to belong to would be Miscellaneous. But take a look at the Miscellaneous group below:
No Model Comparison node. Model Comparison certainly wouldn’t fall under Preprocessing. It’s not a model, so it wouldn’t fall under Supervised Learning. And in an earlier post, I pointed out that there is only a single node, Ensemble, under Postprocessing. So, where’s the “joy”!?!?!?
Here’s the first big difference from Enterprise Miner. The Model Comparison node does not exist within the Nodes pane. So how do we compare models? I didn’t say there is no Model Comparison node, only that it doesn’t exist in the Nodes pane. Watch what happens when I bring a model into the pipeline. Rather than dragging and dropping from the Nodes pane, I’ll right-click on the Data node:
I’ll go with a decision tree.
Ah, the joy!! A Model Comparison node is automatically placed in the pipeline, and the Decision Tree node automatically connects to it. But does it make sense to have a Model Comparison node for a single model? I’ll let you answer that one for yourself. Regardless, that’s how Model Studio operates. If a second supervised learning node were placed in the pipeline, it would automatically connect to the Model Comparison node as well.
Before moving on, I want to comment on the node colors. Look at the colors of the squares next to the group names in the Nodes pane.
We see that the color for the Decision Tree node is purple, which matches the color assigned to the Supervised Learning group. The nodes in blue, Data and Model Comparison, are nodes placed automatically in the pipeline. Blue is not shown in the Nodes pane; thus, we know that color is reserved for nodes that are placed automatically in the pipeline.
After running the pipeline, let’s take a look at the Model Comparison results.
Keep in mind that for right now these results are for a single model. Notice at the top there are two tabs: Node and Assessment. I’ll discuss the Node tab first. Like the results of the Model Comparison node in Enterprise Miner, we see a table of fit statistics. Most common assessment measures are available in the table, and it can be expanded to see them all. A similarity to the fit statistics table in Enterprise Miner is that the first column of numerical values is what Model Studio uses to declare a champion. Notice that in Model Studio, the default statistic for categorical targets is KS (Youden). The first column, labeled Champion, indicates the champion model of the pipeline with a star. A difference between this table and the one shown by the Model Comparison node in Enterprise Miner is that here values are shown only for the partition used to select the champion. On the Node tab, we also see a Properties table. This table does not provide specific information about models or their performance, but rather reminds the analyst of the project settings used for model comparison. For example, the settings for the selection statistics (called Selection Criteria in Model Studio) for categorical and interval targets and the partition used to pick the champion (called the Selection Table) are shown. More on these project settings in a bit.
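For readers curious about what sits behind that KS (Youden) number, here is a small, assumption-laden sketch in Python (not SAS). It computes the statistic as the largest separation between the true positive rate and the false positive rate across probability cutoffs, which is the usual definition of KS/Youden; the targets and scores below are invented for the example.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented validation-partition targets and predicted probabilities.
y_valid = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
p_valid = np.array([0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.5, 0.65, 0.1])

# KS (Youden): the largest gap between TPR and FPR over all cutoffs.
fpr, tpr, thresholds = roc_curve(y_valid, p_valid)
ks = np.max(tpr - fpr)
best_cutoff = thresholds[np.argmax(tpr - fpr)]

print(f"KS (Youden): {ks:.3f} at cutoff {best_cutoff:.2f}")
```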
The Assessment tab provides much more detailed information about model performance.
Like the results in Enterprise Miner, assessment plots are immediately visible and provide results for all models being compared and for all partitions. Model Studio includes the same assessment plots as Enterprise Miner, plus a few more. What Enterprise Miner calls the Score Rankings Overlay plots, Model Studio calls Lift Reports. Via the pull-down menu, this family of plots includes Cumulative Lift, Lift, Gain, Captured Response Percentage, Cumulative Captured Response Percentage, Response Percentage, and Cumulative Response Percentage. In Enterprise Miner, only a traditional ROC plot is provided, but in Model Studio a few additional plots are provided under ROC Reports. This family of plots in Model Studio includes ROC, Accuracy, and F1 Score. In addition, the Assessment tab provides a full table of fit statistics with information for all available partitions for all competing models.
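To show roughly what a Lift Report summarizes, here is a sketch in Python/pandas built on made-up assumptions (random synthetic scores and outcomes, 10 depth bins). It sorts cases by predicted probability, bins them by depth, and computes cumulative captured response percentage and cumulative lift; the exact binning and labels in Model Studio may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic scored data: predicted probabilities and outcomes loosely tied to them.
p_event = rng.uniform(0, 1, 1000)
actual = (rng.uniform(0, 1, 1000) < p_event).astype(int)
df = pd.DataFrame({"actual": actual, "p_event": p_event})

# Sort by predicted probability (best prospects first) and cut into 10 depth bins.
df = df.sort_values("p_event", ascending=False).reset_index(drop=True)
df["depth_bin"] = pd.qcut(np.arange(len(df)), 10, labels=False)

overall_rate = df["actual"].mean()
by_depth = df.groupby("depth_bin")["actual"].agg(["sum", "count"])

# Cumulative captured response %: share of all events found by this depth.
by_depth["cum_captured_pct"] = 100 * by_depth["sum"].cumsum() / df["actual"].sum()

# Cumulative lift: event rate through this depth relative to the overall rate.
by_depth["cum_lift"] = (by_depth["sum"].cumsum() / by_depth["count"].cumsum()) / overall_rate

print(by_depth[["cum_captured_pct", "cum_lift"]].round(3))
```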
We saw on the Node tab that Model Studio used KS to declare a pipeline champion for our binary target. How do we know KS is the default statistic for our target, what’s the default for an interval target, and how would we change these? Similar to Enterprise Miner, we can start by looking at the properties of the Model Comparison node. (We’d need to close the results window to view the properties pane.)
The properties most likely to be changed by the analyst are the Class selection statistic, the Interval selection statistic, and the Selection partition. These are essentially the same properties we focused on above for Enterprise Miner. The other properties, Selection depth and ROC based cutoff, become active depending on the selection statistic chosen. Notice that by default each property shows “Use rule from project settings”. So, at some point we need to make our way there. But the settings for these properties can be changed here, for a specific pipeline. One thing we must keep in mind is that Model Studio has settings that affect all pipelines in the project, but these project-level settings can be overridden at the pipeline level. For this pipeline, if we wanted to choose a Class selection statistic different from the rule from project settings, which is KS, we could use the drop-down menu for that property to do so.
How can I verify that KS (Youden) is the default statistic for class targets? Where can we see the project settings? In the upper right corner of the interface, we can select the shortcut button for Project Settings.
Doing so opens the Project Settings window. (We’ve seen this Project Settings window before in other posts in this series, for example, when we learned about partitioning data.) To see the rules for model comparison, click Rules in the left-hand column.
Now we see the default project-level rules. The Selection statistic for class targets is KS, and the Selection statistic for interval targets is Average squared error. The Selection partition brings us to yet another difference from Enterprise Miner. We saw that the default partition Enterprise Miner uses to select a champion model is validation data, even if a test partition exists. Note that for Model Studio, the default Selection partition is initially Test. (Technically the default setting is “Default”, but the sentence under the property indicates that “Test” is the default.) If no test partition exists, then validation is used, and if no validation data exists, training data is used. Other properties, some of which are not shown above, become active depending on the settings for the selection statistics.
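The fallback order just described for Model Studio (Test, then Validation, then Training) is simple enough to spell out in a couple of lines. This toy Python snippet, with made-up availability flags, is only meant to make the rule explicit.

```python
# Model Studio's selection-partition fallback, expressed as a toy rule.
available = {"TEST": False, "VALIDATE": True, "TRAIN": True}  # made-up flags

selection_partition = next(p for p in ("TEST", "VALIDATE", "TRAIN") if available[p])
print(selection_partition)  # -> VALIDATE, because no test partition exists here
```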
The final item to discuss for Model Studio when it comes to the joy of model comparison is pipeline comparison. This is a key difference from, and an awesome advancement beyond, the capabilities of Enterprise Miner. In Model Studio, each pipeline can have its own pipeline champion, but Model Studio also allows you to have a project-level champion. Think of this model as the champion of champions! This is accomplished through the Pipeline Comparison tab.
Selecting the Pipeline Comparison tab takes you to a view that compares champion models across pipelines. (Challenger models that are not pipeline champions can even be added to the pipeline comparison. This is done from within the pipeline where the challenger model is built: right-click on the node of the challenger model and select Add challenger model.) Below is the Pipeline Comparison view:
The overall project champion model is denoted with a special symbol (a star in a flag) in the Champion column of the table at the top. In the screenshot above, we can see that two pipelines were built, and the project-level champion model is Gradient Boosting, which came from Pipeline 2. Instead of a full fit statistics table, the table at the top provides minimal information about the models, primarily the value of the selection statistic that was used to find the champion. All the other information shown in the window is specific to the declared champion model. This information includes items such as an Error Plot, a table of Variable Importance, score code, Lift Reports, ROC Reports, and Fit Statistics. Results on all partitions are shown, but this information is provided only for the champion model. What if we wanted to see a comparison between all pipeline champions? In that case, select the desired models in the table at the top and then click the Compare button.
The view changes, and at the top of the comparison, we see again the table listing all models being compared and which model is selected as the project champion.
The remainder of the comparison view looks nearly identical to the Assessment tab results when looking at the results of the Model Comparison node within a single pipeline.
We see Lift Reports and ROC Reports and a Fit Statistics table showing results on all available data partitions for all models selected.
Keep in mind that the joy in all this for the data scientist is getting the best model into production where it can be used. A final comment about Model Studio is how seamless it is to move from model building to model deployment. From the Pipeline Comparison tab, actions such as registering a model to a repository for consumption in Model Manager, publishing the model to a destination, and scoring new holdout data are all possible from the “More options” shortcut button menu when a model is selected.
Although in many aspects of life comparison is the thief of joy, to a data scientist there is joy in model comparison. The goal of building predictive models is to put the best-performing, most generalizable model into production for deployment needs. Building several competing models and selecting the champion is how we get to this ultimate goal. If you are moving from SAS Enterprise Miner to Model Studio, I hope the information in this post makes that move a bit easier for you when it comes to the joy of model comparison. Although there are similarities between these two tools when it comes to model comparison, being prepared for the differences should help make your transition smoother.
Prior Posts:
Model Studio for SAS Enterprise Miner Users: Part 1
Model Studio for SAS Enterprise Miner Users: Part 2, Data
Model Studio for SAS Enterprise Miner Users: Part 3, Let’s get philosophical
Model Studio for SAS Enterprise Miner Users: Part 4, Partitioning Data
Model Studio for SAS Enterprise Miner Users: Part 5, Building Models…Let’s get physical!
Find more articles from SAS Global Enablement and Learning here.