Model Studio for SAS Enterprise Miner Users: Part 3, Let’s get philosophical

This post focuses on the differences and similarities between the model building philosophies in SAS Enterprise Miner and Model Studio. It is the third post in an ongoing series I’m writing to introduce Model Studio to SAS Enterprise Miner users. I’ll be honest, I struggled with this topic for a while. If you read my last post in the series, you know my plan for this one was to write about building models in SAS Enterprise Miner versus Model Studio, but starting there just didn’t feel right, so I’ll save that for a future post. This post will still cover some of the topics I teased at the end of the previous one, such as the SEMMA tools in SAS Enterprise Miner and the nodes in Model Studio.

Before I cover the physical act of using each tool to build predictive models, I feel it is necessary to first cover the philosophy, or approach, that each tool takes to analytical projects. SAS Enterprise Miner and Model Studio both base their model building philosophy on the well-known concept of an analytics life cycle, so let’s start there. And if you’ve been keeping up with this series, you know that SAS Enterprise Miner and Model Studio are as much alike as they are different. Let’s see how.

Analytics Life Cycle:

If you have been building predictive models for years, you probably already know about the analytics life cycle. There may be a few different versions of it, but they all share the same basic idea: an approach that, if followed, may not guarantee the BEST analytical solution, but does guarantee a darn good one in most business scenarios. The analytics life cycle (also known as a data mining or predictive modeling life cycle) is like a step-by-step recipe, a philosophy, if you will, that data scientists follow to solve business problems. Although many versions of the life cycle exist, when you get into the details, they all follow the same philosophy that SAS spells out in a simple three-step strategy: Data, Discovery, and Deployment.

The process starts with a business problem, some question or business scenario that needs to be solved. No computer helps with defining the business problem! It is usually arrived at in a meeting room (or on a Zoom call!) by managers and data scientists. Almost all other aspects of the life cycle use a computer. The first part of the process focuses on the data and on preparing it for the analytical method to be applied. The Data portion includes, but is not limited to, obtaining the correct data source, which may require actions such as extracting, merging, and appending. Data exploration is performed here, as is data preprocessing. Data preprocessing includes actions such as imputation, transformations, and feature creation and selection. It’s safe to say that most of the analyst’s time is spent in the Data phase of the life cycle. And more time spent here typically means a bigger payoff in the next phase of the life cycle.
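To make two of the preprocessing actions named above concrete, here is a minimal sketch in plain Python of median imputation followed by a log transformation. It is an illustration only and does not use any SAS Enterprise Miner or Model Studio interface; the function names and the income values are made up for the example.

```python
import math
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def log_transform(values):
    """Apply log(1 + x) to pull in the right tail of a non-negative input."""
    return [math.log1p(v) for v in values]

# Hypothetical input with two missing values and one extreme value.
income = [42000, None, 58000, 1250000, None, 61000]

imputed = impute_median(income)       # missing values filled with the median
transformed = log_transform(imputed)  # skew reduced before modeling
```

The median is used here instead of the mean because a single extreme value, like the 1,250,000 above, would drag a mean-based fill far away from the typical observation.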

The Discovery phase is where the underlying analytical methods to solve the problem are applied. For this post, the discovery phase focuses on building a predictive model. But other analytical methods could be applied during Discovery. A few such examples are performing a clustering or market basket analysis, building a forecasting model, or extracting information from unstructured data. The idea is that the data scientist is discovering the meaning or predictable pattern that exists within the data. Often multiple models may be built during the Discovery phase.

The final phase of the life cycle is really what it’s all about. The Deployment phase is why analysts get a paycheck! Eventually we need to put a model, or whatever our discovery is, into production. This is where the results are used to make the business decisions that solve the initial problem, the one defined before the life cycle even started. And Deployment is much more than just scoring new data. A champion model is selected, model score code is generated, the model is published to a publishing destination where it will be used, and model performance is monitored over time to watch for degradation. This ongoing process is known as model operations, or simply, Model Ops.
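Two of the Deployment steps described above, picking a champion and watching for degradation, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how either SAS tool implements these steps: the candidate names, the error rates, the choice of validation misclassification as the selection statistic, and the 0.02 tolerance are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    validation_misclassification: float  # lower is better

def pick_champion(candidates):
    """Return the candidate with the lowest validation misclassification rate."""
    return min(candidates, key=lambda c: c.validation_misclassification)

def has_degraded(baseline_error, current_error, tolerance=0.02):
    """Flag the model when its error has risen by more than the tolerance."""
    return current_error - baseline_error > tolerance

# Made-up results for three candidate models.
candidates = [
    Candidate("Decision Tree", 0.182),
    Candidate("Gradient Boosting", 0.141),
    Candidate("Logistic Regression", 0.167),
]

champion = pick_champion(candidates)

# Later, compare the error on newly scored data against the baseline.
needs_review = has_degraded(champion.validation_misclassification, 0.19)
```

In practice the selection statistic depends on the business problem, and monitoring usually tracks more than a single number, but the shape of the logic is the same.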

The life cycle is circular simply because in most cases, when one business problem is solved, there is another question to ask. Both SAS Enterprise Miner and Model Studio use this analytics life cycle philosophy in their approaches to building models and it is the basis for how the tools are designed.

SAS Enterprise Miner: Data Source Wizard

The Data phase of the life cycle for SAS Enterprise Miner starts with the Data Source Wizard, which is typically used immediately after a SAS Enterprise Miner project is created. The wizard is how data is brought into a SAS Enterprise Miner project and how metadata is attached to the data. I addressed the Data Source Wizard in part 2 of this series, so check it out if you’re not familiar with the wizard.

Model Studio: Data Tab

Similar to SAS Enterprise Miner (but different at the same time!), Model Studio starts the Data phase of the life cycle during project creation and continues it with the first thing the analyst sees after the project is created. Metadata can be defined during the creation of the project through the Advisor options in the Advanced Project Settings. Then, once the project is created, Model Studio initially displays the Data tab. The Data tab is where user-based refinements to metadata can take place, and it is where metadata rules, such as methods of imputation and types of transformations, can be assigned to individual variables. As with the Data Source Wizard for SAS Enterprise Miner, I discussed the Advisor options and the Data tab in part 2 of this series. Check it out if you haven’t done so yet.

SAS Enterprise Miner: SEMMA

The model building philosophy in SAS Enterprise Miner is best summarized by the acronym SEMMA, which stands for Sample, Explore, Modify, Model, Assess. The tools for each part of SEMMA are found in a tools palette at the top, central part of the SAS Enterprise Miner interface. Each portion of SEMMA has its own tab, and when a tab is selected, the tools (also known as nodes) for that portion are displayed above the tab. Technically there are more tabs on the tools palette than the five that spell out SEMMA, but this post focuses only on SEMMA. The other tabs hold additional data mining features, some of which are available in SAS Enterprise Miner only as add-ons. Here is a look at all the tabs available on the SEMMA tools palette:

In the image above, the Sample tab is selected, so the tools related to sampling are displayed at the top of the palette. As stated above, to display the tools for any one of the tabs, the user clicks on that tab. The tools for each tab are listed in alphabetical order by name, and each tool is represented by an image. And props to the folks in R&D who came up with the images for the tools, because they do a great job of pictorially “showing” what each node is. I’ll discuss a few of these soon.

Here is a complete list of tools available for each SEMMA tab with the image of each tool shown below. I’ll refer you to the SAS Enterprise Miner product documentation to see a description of each item.

SAMPLE: Append, Data Partition, File Import, Filter, Input Data, Merge, Sample

(Note how the image for the Append node, the first image on the left, shows adding a row to a data table, the image for the Data Partition node, the second image, shows a data table being broken into parts, and the image for the Filter node shows a filter capturing the orange observations but letting the blue pass through.)

EXPLORE: Association, Cluster, DMDB, Graph Explore, Link Analysis, Market Basket, MultiPlot, Path Analysis, SOM/Kohonen, Stat Explore, Variable Clustering, Variable Selection

(Note how the image for the Association node shows different items being associated with one another, the image for the Cluster node shows observations going into a funnel and coming out clustered by color, and how the image for the Stat Explore node shows common statistical symbols used to represent summary statistics.)

MODIFY: Drop, Impute, Interactive Binning, Principal Components, Replacement, Rules Builder, Transform Variables

(Note how the Impute image shows a mathematical function being inserted into a blank cell in a column and how the Transform Variables image shows a mathematical function being applied to a data table.)

MODEL: AutoNeural, Decision Tree, DMine Regression, DMNeural, Ensemble, Gradient Boosting, LARS, MBR, Model Import, Neural Network, Partial Least Squares, Regression, Rule Induction, TwoStage

(Note how the Decision Tree image shows a tree structure, the Neural Network image shows a feed-forward multilayer perceptron, and the Regression image shows a simple linear regression on a scatterplot of points.)

ASSESS: Cutoff, Decision, Model Comparison, Score, Segment Profile

(Note how the Score image shows a model, represented as a diamond, being applied to a data table and the Segment Profile image shows a magnifying glass taking a close look at the results of a Cluster analysis.)

So how does the SEMMA philosophy fit into the analytics life cycle? Sample, Explore, and Modify are used for the Data phase. Model (and sometimes Explore) is used in the Discovery phase. And Assess is used in the Deployment phase. It’s pretty straightforward! In the Data phase of the life cycle, we perform tasks such as merging and appending, exploring data graphically and with summary statistics, partitioning the data, and transforming and generating new inputs. All these tasks and more are found under the Sample, Explore, and Modify tabs in SAS Enterprise Miner. This is where data preprocessing takes place. The Discovery phase fits pretty obviously with the Model tab, as that is where SAS Enterprise Miner keeps its supervised learning algorithms. But sometimes the Explore tab is used for Discovery also. The Explore tab has tools for things like clustering, association analysis, and market basket analysis, and these are used for Discovery when the goal of an analytics project is not supervised modeling. Finally, the Deployment phase of the life cycle includes actions such as comparing models to pick a champion and scoring new data. The tools for these actions are found under the Assess tab.
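The mapping described above is small enough to write down as a lookup table. The sketch below is just a restatement of that paragraph in Python; the dictionary and function names are my own and are not part of SAS Enterprise Miner.

```python
# SEMMA tab -> life cycle phase(s) it serves, as described in the text.
SEMMA_TO_PHASES = {
    "Sample": ["Data"],
    "Explore": ["Data", "Discovery"],  # Discovery when the goal is unsupervised
    "Modify": ["Data"],
    "Model": ["Discovery"],
    "Assess": ["Deployment"],
}

def tabs_for_phase(phase):
    """List the SEMMA tabs whose tools are used in the given phase."""
    return [tab for tab, phases in SEMMA_TO_PHASES.items() if phase in phases]

print(tabs_for_phase("Data"))        # ['Sample', 'Explore', 'Modify']
print(tabs_for_phase("Discovery"))   # ['Explore', 'Model']
print(tabs_for_phase("Deployment"))  # ['Assess']
```

Explore is the only tab that appears under two phases, which is the one wrinkle in an otherwise one-to-one mapping.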

To use these tools in a pipeline, the user simply drags and drops them into the diagram area and then manually connects them to build a process flow. I’ll cover more details on this process in another post.

Model Studio: Nodes Pane

For Model Studio, I first want to say that I’ll focus only on models built using SAS Visual Data Mining and Machine Learning (VDMML). Projects built using VDMML are the ones most closely related to SAS Enterprise Miner, especially when the extra add-ons are not licensed for SAS Enterprise Miner. I’ll save any discussion of SAS Visual Text Analytics and SAS Visual Forecasting for another post. Model Studio’s model building philosophy is similar to that of SAS Enterprise Miner, but a bit more simplified. First, terminology: Model Studio refers to the individual tools as nodes. Rather than the five-part SEMMA process that SAS Enterprise Miner uses, Model Studio arranges its nodes into just four groups, and one of those groups contains just a single node. Although the nodes in Model Studio are available directly within a pipeline (for example, by right-clicking a node), I’ll show and discuss them as they appear in the nodes pane. The nodes pane is available to the left of any pipeline and is opened by clicking the Nodes shortcut button.

Before describing how these four groupings of nodes fit into the analytics life cycle, let’s see a complete list of the nodes that each group contains. Just as above, I won’t describe each node here, so if you want to see a description of them, check the Model Studio product documentation.

DATA MINING PREPROCESSING: Anomaly Detection, Clustering, Feature Extraction, Feature Machine, Filtering, Imputation, Interactive Grouping, Manage Variables, Replacement, Text Mining, Transformations, Variable Clustering, Variable Selection

SUPERVISED LEARNING: Batch Code, Bayesian Additive Regression, Bayesian Network, Decision Tree, Factorization Machine, Forest, GAM, Gaussian Process Classification, Gaussian Process Regression, GLM, Gradient Boosting, Linear Regression, Logistic Regression, Model Composer, Neural Network, Quantile Regression, Score Code Import, SVM

POSTPROCESSING: Ensemble

MISCELLANEOUS: Data Exploration, Open Source Code, SAS Code, Save Data, Score Data, Segment Profile

First, if the images next to the name of each node look familiar, that’s because they’re basically the same images used for the same tools in SAS Enterprise Miner. Nice job R&D! Thanks for that consistency. Compared to the five parts of the SEMMA philosophy in SAS Enterprise Miner, Model Studio’s more simplified approach in the nodes pane fits much more intuitively into the analytics life cycle. For example, it’s obvious that the Data phase of the life cycle is primarily accomplished using nodes found under Data Mining Preprocessing. One notable exception is the Data Exploration node, which is found under the Miscellaneous group. (This placement actually makes sense: when we simply explore the data in Model Studio, we’re not making any changes to it, so the node really shouldn’t go under Preprocessing.) So, Model Studio combines many of the SAS Enterprise Miner Sample, Explore, and Modify tools into Data Mining Preprocessing.

The Supervised Learning group in Model Studio fits the Discovery phase of the life cycle just as intuitively as the Model tab does in SAS Enterprise Miner. There are a few exceptions. As stated above for SAS Enterprise Miner, clustering can be the main goal of an analytics project, and in that case it belongs to the Discovery part of the life cycle. The Clustering node is found in the Data Mining Preprocessing group in Model Studio. Likewise, building an Ensemble model could be used for Discovery, and that node is found in the Postprocessing group.

Don’t think that the Miscellaneous group is solely for the Deployment phase just because it is the last group of nodes in the nodes pane. In fact, the Score Data node is really the only node in the group that specifically deals with Deployment. The Save Data node certainly could be used for deployment, in that it could place a scored data set in a specific CASlib, but it could be used during the Discovery, or even Data, phase as well. There is an argument that the Segment Profile node falls into the Deployment phase. But really that node allows analysts to take a deeper look into the results of a Clustering analysis, so it probably fits more accurately into Discovery, since other methods would be used to deploy a Clustering model. The Open Source Code and SAS Code nodes could really be used in any part of the life cycle, as they allow the analyst to perform any desired task using code. User-specific code, whether it is SAS code or open source, could be used in the Data, Discovery, or Deployment phases of the life cycle. So, I think once again the folks in R&D did a great job of simply naming this group Miscellaneous. The tools within it are, for the most part, very flexible in terms of their applications. There is a Model Comparison node in Model Studio, but that node is added automatically to a pipeline when at least one supervised learning node is used.

The way in which these nodes are used is very similar to SAS Enterprise Miner. From the nodes pane, the user can simply drag and drop the nodes into a pipeline. However, Model Studio helps place the nodes in the correct places within a pipeline and will not allow the user to place nodes where they do not make analytical sense, such as placing a supervised learning node prior to a data preprocessing node. Also, the pipelines in Model Studio provide specific “swim lanes” where nodes are connected and arranged automatically. This creates pipelines that are potentially less chaotic or messy than the fully user-created process flows in SAS Enterprise Miner. Again, more on creating pipelines in Model Studio in a future post.
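The placement rule described above can be pictured as a simple ordering check. The sketch below is my own simplification in plain Python, not Model Studio’s actual logic: the numeric ordering of the groups and the choice to leave Miscellaneous nodes unrestricted are assumptions made for illustration.

```python
# Assumed left-to-right order of the node groups in a pipeline.
GROUP_ORDER = {
    "Data Mining Preprocessing": 0,
    "Supervised Learning": 1,
    "Postprocessing": 2,
}

def can_connect(upstream_group, downstream_group):
    """Return True if a node in downstream_group may follow one in upstream_group."""
    up = GROUP_ORDER.get(upstream_group)
    down = GROUP_ORDER.get(downstream_group)
    if up is None or down is None:
        # Groups without a rank (Miscellaneous) are treated as unrestricted
        # in this sketch; the real product applies its own rules.
        return True
    return up <= down

# A preprocessing node feeding a model is fine; the reverse is rejected.
ok = can_connect("Data Mining Preprocessing", "Supervised Learning")
rejected = not can_connect("Supervised Learning", "Data Mining Preprocessing")
```

Enforcing an order like this is what lets a tool arrange nodes into lanes automatically, since every node’s group already says roughly where it belongs.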

Finally, I just want to point out that in only a few short years of being available, VDMML in SAS Viya has really come a long way. SAS Enterprise Miner has been around for over 25 years, but Model Studio has already surpassed it in the number of supervised modeling tools it contains. As “young” a product as Model Studio is, it is very mature when it comes to the Discovery phase of the life cycle. Further, compared to how SAS Enterprise Miner handles many of the tasks in the Data phase, Model Studio’s philosophy of handling the Data phase through the capabilities of the Data tab is quite mature as well. So, for those of you coming into Model Studio from the SAS Enterprise Miner world, get a bigger analytical toolbox, because you’ll need one to hold all the modeling tools and data capabilities Model Studio has!

Find more articles from SAS Global Enablement and Learning here.
