This post will discuss the differences and similarities between the Variable Selection nodes in SAS Enterprise Miner and Model Studio. Although there are several ways to perform variable selection in both analytical workhorses, this post will focus specifically on the Variable Selection node in each. There are more differences than similarities when it comes to the capabilities of the two nodes, so be ready for a good discussion of each!
This post is part 8 in a series I’ve been working on to introduce Model Studio to the SAS Enterprise Miner user. If you haven’t seen the others, you may want to stop and check them out first. Links to all prior posts are at the bottom of this one. And just like all prior posts in this series, coming from the Enterprise Miner world is not a prerequisite to learn something useful here! Any Model Studio user, or anyone interested in variable selection in general, will benefit from the information I’m sharing.
First, why is variable selection a critical step in the analytics lifecycle when building a model for prediction? Selecting the right variables is important in building effective predictive models because the quality of the input data directly shapes the model's accuracy, interpretability, and stability. Including irrelevant, redundant, or noisy variables can hide meaningful patterns, inflate model complexity, and lead to overfitting. Overfitting occurs when a model performs well on training data but fails to generalize to new observations. On the other hand, a thoughtful variable selection process helps highlight the true drivers of an outcome, reduces computational burden, and enhances the transparency of the final model. By focusing on the most informative predictors, analysts can build models that are not only more robust and efficient but also easier to communicate, deploy, and trust.
The Variable Selection node is the last node found above the Explore tab in the SEMMA tools palette of Enterprise Miner.
It is one of the most important tools in Enterprise Miner for preparing high quality, predictive models. The node systematically considers each input variable's relationship with the target and determines which variables should be kept, rejected, or further examined. Its primary purpose is to reduce dimensionality (in other words, reduce the number of inputs going into a model), prevent overfitting, and improve both model interpretability and computational efficiency. Essentially, the node measures how strongly each input variable is associated with the target and assigns the variable a role of input or rejected. Rejected does not mean "deleted" or removed; it simply means that the variable will be ignored by subsequent nodes in the remainder of the diagram.
The properties panel for the Variable Selection node is quite extensive, but I’ll just highlight the most useful and/or essential properties. Here’s a quick glance at the entire properties panel:
The primary property for the node is Target Model. This property is used to establish which of the two selection methods is used. The settings for this property are Default, R-Square, Chi-Square, R and Chi-Square, or None.
When Target Model is set to Default, the selected method depends on the target measurement level and other model information. If the target is binary and the model has greater than 400 degrees of freedom, the Chi-Square method is used. Otherwise, R-Square is used. Here is a look at the properties for each of these methods, many of which I’ll be discussing below.
R-Square is best, in general, for interval targets and where linear relationships dominate in the data. When this property is used, the node calculates the squared correlation between each input and the target. It then operates in two stages. First, an initial screening is performed where variables with extremely low correlation with the target (<0.005 by default) are immediately rejected. In the second stage, forward selection regression is performed where variables are added to the regression model, one at a time, based on how much each contributes to improving R-Square. Inputs that fail to improve R-Square by at least 0.0005 (by default) are rejected.
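The two-stage R-Square process can be sketched in code. The following is an illustrative Python implementation of the general technique (squared-correlation screening followed by forward selection regression), using the node's default thresholds; it is not the actual SAS implementation, and the function and argument names are my own.

```python
import numpy as np

def select_by_r_square(X, y, names, min_corr=0.005, min_improvement=0.0005):
    """Two-stage R-Square selection sketch.

    Stage 1 rejects inputs whose squared correlation with the target
    falls below min_corr. Stage 2 runs forward selection, adding one
    input at a time and keeping only inputs that improve the model's
    R-Square by at least min_improvement.
    """
    # Stage 1: screen out inputs with negligible squared correlation
    r2_with_target = [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])]
    candidates = [j for j, r2 in enumerate(r2_with_target) if r2 >= min_corr]

    # Stage 2: forward selection regression on the survivors
    selected, current_r2, improved = [], 0.0, True
    while improved and candidates:
        improved = False
        best_j, best_r2 = None, current_r2
        for j in candidates:
            cols = selected + [j]
            # Least-squares fit with an intercept, then compute R-Square
            A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        # Accept the best candidate only if it clears the improvement bar
        if best_j is not None and best_r2 - current_r2 >= min_improvement:
            selected.append(best_j)
            candidates.remove(best_j)
            current_r2 = best_r2
            improved = True
    return [names[j] for j in selected]
```

Run against a strongly predictive input and a pure-noise input, the sketch keeps the former and rejects the latter, mirroring what the node reports as Input versus Rejected.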
Chi-Square is best for binary or categorical targets. This method uses Chi-Square tests to examine how different partitions of an input variable relate to target outcomes. For class variables, their levels (i.e., their categories) are evaluated directly. For interval variables, the node can automatically bin variables into up to 16 AOV bins (Analysis of Variance groupings) to detect nonlinear relationships. Each variable’s partitions are evaluated and only those exceeding a chi-square threshold (3.84 by default) are kept. If all partitions for a variable fail to pass the threshold, then the variable is rejected.
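The binned chi-square idea can also be sketched. The function below is a simplified illustration, not SAS code: it quantile-bins an interval input, tests candidate binary splits of the bins against a binary target with a 2x2 chi-square statistic, and keeps the variable if any split clears the 3.84 default threshold (the 0.05 critical value for one degree of freedom). The node's actual AOV16 binning and partition search are more involved.

```python
import numpy as np

def chi_square_keep(x, y, n_bins=16, threshold=3.84):
    """Sketch of chi-square screening for an interval input x
    against a binary target y (0/1)."""
    # Quantile-bin the interval input into up to n_bins groups
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    best = 0.0
    for b in np.unique(bins)[:-1]:
        # 2x2 table: observations at or below bin b vs above, by target level
        left = bins <= b
        table = np.array([[np.sum(left & (y == 0)), np.sum(left & (y == 1))],
                          [np.sum(~left & (y == 0)), np.sum(~left & (y == 1))]], float)
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        if (expected > 0).all():
            best = max(best, float(((table - expected) ** 2 / expected).sum()))
    # Keep the variable if any partition exceeds the threshold
    return best > threshold
```

A variable whose bins separate the target levels well will exceed the threshold at some split; a variable whose every partition fails the test would be rejected.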
The node offers several powerful configuration options that influence how inputs are evaluated and even allows for transformed values of the inputs to be passed to subsequent nodes as inputs. As mentioned above, AOV16 Binning bins interval variables into 16 discrete groups, allowing chi-square methods to capture nonlinear effects. It is possible to pass these AOV16 groupings on as inputs using the Use AOV16 Variables property. For high-cardinality inputs (i.e., categorical variables with many levels) the Use Group Variables property automatically clusters levels into fewer groups before evaluation. The variables created with these new groupings can be passed on as inputs. Finally, the node can explore two-way interactions among class variables when the Use Interactions property is enabled. This is useful when the impact of one categorical variable depends on another.
Here’s what the Results window looks like for the node using the R-square method.
Probably the most important window in the Results is for Variable Selection. This window shows the status for each variable after the node has run: Input or Rejected. If rejected, the window provides the reason in the Reason for Rejection column. The Effects in the Model window shows a rank order of selected variables based on the sequential improvement of the model R-Square value as variables are added. The R2 Value window shows the R-Square value for all effects. The Output window shows the output from the underlying procedure.
There are other nodes that can be used for variable selection in Enterprise Miner. One is the Variable Clustering node, also found in the Explore group in SEMMA. Further, based on how Enterprise Miner operates, several supervised modeling nodes could also be used for variable selection. Some such nodes are regression, decision tree, and partial least squares. In Enterprise Miner, it is possible to connect one model node into another. In these cases, variables not selected by the first model node during its own variable selection method are passed on to the second model node, but with their roles updated to rejected. Thus, such variables would be ignored by the second model.
Just as with the Variable Selection node in Enterprise Miner, the Variable Selection node in Model Studio plays a critical role in the analytics lifecycle in reducing dimensionality of input data, improving model generalization, and simplifying downstream modeling needs such as interpretation and deployment. But this is about all the similarities there are between the two nodes.
Right out of the gate, I’ll state that the Variable Selection node in Model Studio has far more capabilities than the node of the same name in Enterprise Miner and arguably is more powerful. In Model Studio, the Variable Selection node is a flexible and modern feature-reduction tool that evaluates inputs using multiple supervised and unsupervised methods and retains only the variables consistently found to be predictive. Some of the supervised methods include linear models, decision trees, and other tree-based ensembles. Analysts can combine methods with voting rules, apply pre-screening based on missingness and cardinality, and produce transparent reports showing why each variable was selected or rejected.
The Variable Selection node in Model Studio is found in the Data Mining Preprocessing group.
The properties panel is large, especially when properties are expanded, so I won’t show it to you in its entirety. Here’s the first portion, showing the first three properties:
The Pre-screen Input Variables property operates like some of the Advisor Options found under the Advanced project settings when a project is first being created. I covered these Advisor Options in part 2 of this blog series, which dealt with all things data. Here's a look at this property when it is turned on and expanded.
When turned on, this property allows the node to reject categorical input variables which have more than 50 distinct levels or any input variable with more than 50% of its values missing. Both of these default thresholds can be changed using sliders under the property headers.
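The pre-screening logic amounts to two simple checks per input. Here is a minimal Python sketch of the idea, assuming a pandas DataFrame; the function name and its exact handling of types are my own, and the node's real behavior is driven by its property settings, not this code.

```python
import pandas as pd

def pre_screen(df, target, max_levels=50, max_missing=0.5):
    """Sketch of pre-screening: reject any input with too many
    missing values, and any categorical input with too many
    distinct levels (defaults mirror the node's 50 / 50%)."""
    rejected = {}
    for col in df.columns:
        if col == target:
            continue
        if df[col].isna().mean() > max_missing:
            rejected[col] = f"more than {max_missing:.0%} missing"
        elif df[col].dtype == object and df[col].nunique() > max_levels:
            rejected[col] = f"more than {max_levels} levels"
    return rejected
```

For example, an ID-like character column with hundreds of distinct values, or a column that is mostly missing, would be rejected before any selection method runs, just as the sliders' thresholds dictate in the node.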
The Combination criterion property allows the node to combine multiple variable selection methods in different ways to affect the total number of variables selected as inputs by the node. There are four options for this property:
These settings define what must happen among multiple selection methods for a variable to be passed on as an input. The default, Selected by at least 1, passes a variable on as input when at least one of the chosen methods selects the variable. This is the setting which is most generous in terms of allowing the greatest number of inputs to be selected by the node. The setting which is most strict is Selected by all. This setting passes on an input only when selected by all chosen methods. The other options are Selected by a majority and Selected by a tie or majority.
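The four settings are essentially voting rules over the chosen methods. The sketch below illustrates that logic in Python; the rule labels are my paraphrases of the property's settings, and this is not SAS code.

```python
def combine_votes(votes, n_methods, criterion="at least 1"):
    """Sketch of the Combination criterion voting rules.

    votes maps each variable name to the number of selection
    methods that selected it; n_methods is the total number of
    methods run. Returns the variables passed on as inputs.
    """
    rules = {
        "at least 1":      lambda v: v >= 1,             # most generous
        "all":             lambda v: v == n_methods,     # most strict
        "majority":        lambda v: v > n_methods / 2,
        "tie or majority": lambda v: v >= n_methods / 2,
    }
    keep = rules[criterion]
    return sorted(var for var, v in votes.items() if keep(v))
```

With four methods, a variable selected by exactly two of them passes under "tie or majority" but fails under "majority", which is exactly the distinction between those two settings.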
The Limit number of selected variables property allows the analyst to set a limit on the total number of inputs selected. When this option is selected, another property, Maximum variable rank, becomes available. In this case, variables are kept based upon their assigned ranks, with the maximum rank of the selected variables being no greater than the value set in the Maximum variable rank property.
The remaining properties in the properties panel pertain mostly to which selection methods are used. The default settings are shown below.
Each method shown here can be expanded where sub-properties specific to that method can be selected or changed, when the method is turned on. I will not be describing all such sub-properties here. Please look for additional details contained in product documentation.
The Unsupervised Selection method allows for variable selection based on unsupervised methods, meaning the target variable is ignored. This method allows for correlation statistics based on Correlation, Covariance, or Sum of squares and crossproducts to be used to select variables. Unsupervised selection has a property called Selection process that specifies the process used for selecting variables when both unsupervised and supervised methods are being performed. Possible values are Combine with supervised methods and Perform sequential selection. The Combine with supervised methods option allows the unsupervised selection to be combined with supervised methods via the Combination criterion option described above. The Perform sequential selection option allows the unsupervised selection method to be used before the supervised methods are performed.
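To build intuition for what variance-based unsupervised selection does, here is a rough "principal variables" sketch in Python: it greedily picks the input that explains the most remaining variance across all inputs, then projects that variance out and repeats. This is only an analogue for illustration; the node's actual unsupervised procedure and its Correlation/Covariance/SSCP options differ in detail.

```python
import numpy as np

def unsupervised_select(X, names, n_keep=2):
    """Greedy variance-explained sketch: the target is ignored
    entirely; variables are chosen for how much of the overall
    input variance they account for."""
    R = (X - X.mean(0)) / X.std(0)           # standardize columns
    selected = []
    for _ in range(n_keep):
        # Total squared covariance each column has with every column
        scores = ((R.T @ R) ** 2).sum(axis=1)
        scores[selected] = -np.inf            # don't pick twice
        j = int(np.argmax(scores))
        selected.append(j)
        v = R[:, [j]]
        denom = float(v.T @ v)
        if denom > 0:
            R = R - v @ (v.T @ R) / denom     # project out the winner
    return [names[j] for j in selected]
```

Note how two nearly identical inputs contribute only one pick: once the first is chosen and its variance projected out, its near-duplicate explains almost nothing further, so an unrelated input wins the next round.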
The Fast Supervised Selection method is the only method turned on by default. This method performs fast supervised selection which is based on a regression model.
The remaining selection methods are based on specific modeling algorithms. When a method is selected and expanded, properties specific to that algorithm are shown. Here is what the Decision Tree Selection properties are:
Notice that the properties reflect typical hyperparameters for a decision tree. This is true for all the supervised modeling methods. When turned on and expanded, typical model hyperparameters for each model are available.
The final property is Create Validation from training. This property is turned off by default, but it is automatically turned on when any of the supervised selection methods (except for Fast Supervised) is selected. This property allows a temporary validation sample to be created from the incoming training data. Creating a temporary validation data set is highly recommended, even if the data has already been partitioned, so that only the training partition is used for variable selection. This leaves the original validation data to be used only for modeling. When turned on and expanded, two additional properties are available: Validation proportion and Partition seed.
When the node is run, there is a lot of information available in the Results window, especially depending on which specific selection methods are being used. For each selection method, a table is provided summarizing the results of that specific method. I’ll highlight just two of the most important tables in the Results.
The Variable Selection window contains a table like that found in the Variable Selection window in the Results from Enterprise Miner. The window shows the status of each variable after the run of the node: Input or Rejected. The window also has a column for Reason, providing a reason for rejected variables. Here’s a closer look at the Variable Selection window showing rejected variables.
In this case, multiple supervised methods are being used along with the unsupervised method, which has the Selection process method (described above) set to Perform sequential selection. Since sequential selection is being performed, a variable can be rejected based on unsupervised selection alone, since this method is performed before the supervised methods are combined. AVG_DATA_CHRGS_3M is rejected due to the lack of variance it explains based on unsupervised selection. ACCT_AGE is not rejected for unsupervised reasons; rather, it is rejected by a combination of supervised methods, the details of which are available in another window.
The Variable Selection Combination Summary window provides a table detailing specific information about rejected variables when multiple selection methods are being combined in the Combination criterion property (described above).
Here, Fast Supervised, Linear Regression, and Decision Tree selection methods are being used where Combination criterion is set to Selected by at least 1. DAYS_OPENWORKORDERS passed the unsupervised selection method but is rejected during the supervised selection process since it was rejected by all supervised methods (in other words, it was NOT selected by at least one supervised method). DELIQ_INDICATOR is passed on from the node as an input because it was selected by at least one of the supervised methods, specifically Fast Supervised and Linear Regression. EVER_DAYS_OVER_PLAN is also passed on as an input because it, too, was selected by at least one supervised method (in fact, it was selected by all supervised methods). In the screenshot above, only EVER_DAYS_OVER_PLAN would be passed on as an input if the Combination criterion property were set to Selected by all.
For those coming from the Enterprise Miner world, there's a valid question: Why does the Variable Selection node in Model Studio include supervised modeling methods? Recall that in Enterprise Miner, one supervised modeling node can be connected to another supervised modeling node, thus allowing one model to select variables for another. In Model Studio, it is not possible to connect a supervised model into another. So the only way to allow one model to select variables for another is through a separate preprocessing node designed specifically for that purpose. Thus, these capabilities are available in the Variable Selection node.
Another Model Studio node useful for variable selection is the Variable Clustering node. This node is also found under the Data Mining Preprocessing group.
In summary, although the purpose of the Variable Selection nodes in Enterprise Miner and Model Studio is the same, their capabilities are very different. The Variable Selection node in Model Studio is more modern and flexible. For analysts making the move from Enterprise Miner to Model Studio, understanding the differences in these capabilities can be highly beneficial and prevent possible frustration.
Prior Posts:
Model Studio for SAS Enterprise Miner Users: Part 1
Model Studio for SAS Enterprise Miner Users: Part 2, Data
Model Studio for SAS Enterprise Miner Users: Part 3, Let’s get philosophical
Model Studio for SAS Enterprise Miner Users: Part 4, Partitioning Data
Model Studio for SAS Enterprise Miner Users: Part 5, Building Models…Let’s get physical!
Model Studio for SAS Enterprise Miner Users: Part 6, The Joy of Model Comparison
Model Studio for SAS Enterprise Miner Users: Part 7, Backing up your work