Data mining and machine learning problems are often massive in dimension. The dimension of a problem refers to the number of features or input variables (more accurately, degrees of freedom) that are available for creating a prediction. It is imperative to use a subset of the input variables in the final model for several reasons. First, remember that using all available inputs almost always results in a predictive model that is overfit to the training data and does not generalize well to new data. Also, the more inputs you use to build the model, the more cases are required to discover the relationship between the inputs and the target. This problem is known as the “curse of dimensionality.” The curse of dimensionality limits your practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables.
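To make the curse of dimensionality concrete, here is a small illustrative sketch in plain Python (not part of SAS Viya). If each input is scaled to [0, 1] and the data are spread uniformly, the edge length of a hypercube that captures 10% of the observations grows rapidly with the number of dimensions, so "local" neighborhoods stop being local:

```python
# Side length of a hypercube covering 10% of the unit hypercube's volume
# in d dimensions: solving edge**d = 0.10 for edge.
edge = lambda d: 0.10 ** (1 / d)

for d in (1, 2, 10, 100):
    print(f"dimensions={d:>3}  required edge length={edge(d):.3f}")
```

In one dimension you only need 10% of the range, but in 100 dimensions the cube must span almost the entire range of every input, which is why densely populated input spaces become unattainable as dimension grows.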
A densely populated input space is required to fit highly complex machine learning models. There are many methods for selecting inputs for a model. Some of these methods are supervised, which means that the target variable is used in the selection process. Other methods are unsupervised and ignore the target. To learn more about unsupervised methods, check out the post titled: Unsupervised Variable Selection: Identifying Input Spaces That Maximize Data Variance.
SAS Viya offers both types of methods. Supervised methods remove irrelevant inputs, which provide no information about the target. In contrast, unsupervised methods remove redundant inputs, which provide no information beyond what the other inputs already supply.
Variable Selection Approaches
To select useful inputs, we often combine unsupervised variable selection methods with supervised variable selection techniques, such as regression, decision trees, forests, and more. There are two main approaches for combining these methods.
With the Sequential approach, first apply the unsupervised method to all input variables to eliminate redundancy. Then, run one or more supervised methods on the variables selected by the unsupervised method to reduce irrelevancy and obtain the final set of selected variables. This approach addresses redundancy first and then irrelevancy.
With the Parallel approach, run the unsupervised method and one or more supervised methods simultaneously. Each method votes on whether a variable should be selected, thereby addressing redundancy and irrelevancy at the same time.
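The two approaches can be sketched in a few lines of Python. The "methods" below are mocked as simple set operations (the variable names and drop decisions are made up for illustration and have nothing to do with the node's actual algorithms):

```python
all_vars = {"X1", "X2", "X3", "X4", "X5"}

def unsupervised(candidates):
    # Mocked: drops a redundant input
    return candidates - {"X4"}

def supervised(candidates):
    # Mocked: drops irrelevant inputs
    return candidates - {"X2", "X5"}

# Sequential: unsupervised first, then supervised on the survivors.
sequential = supervised(unsupervised(all_vars))

# Parallel: run both methods on all inputs, then combine by voting.
parallel_any = unsupervised(all_vars) | supervised(all_vars)  # at least one vote
parallel_all = unsupervised(all_vars) & supervised(all_vars)  # unanimous vote

print(sorted(sequential))
print(sorted(parallel_any))
print(sorted(parallel_all))
```

Note that in the parallel approach the final result depends on how the votes are combined: the least restrictive rule (union) keeps far more variables than the most restrictive rule (intersection).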
The Variable Selection Node
The Variable Selection node in Model Studio finds and selects the best variables for analysis by using unsupervised and supervised selection methods, and it implements both of the approaches described above. The node helps you reduce the number of inputs by rejecting input variables based on pre-screening criteria and the results of the chosen selection methods.
In the Variable Selection node, you have the option to Pre-screen Input Variables before running the chosen variable selection methods. In pre-screening, if a variable exceeds the maximum number of class levels threshold or the maximum missing percent threshold, that variable is rejected and not processed by the subsequent variable selection methods.
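The pre-screening logic described above can be sketched as a simple filter. The threshold values and column summaries below are illustrative assumptions, not the node's defaults:

```python
# Hypothetical pre-screening sketch: reject a variable before selection
# if it has too many class levels or too high a missing-value percentage.
MAX_CLASS_LEVELS = 20
MAX_MISSING_PCT = 50.0

columns = {
    # name: (number of class levels, or None for an interval input; % missing)
    "C":  (3,    0.0),
    "ID": (5000, 0.0),   # exceeds the class-level threshold -> rejected
    "X1": (None, 2.0),
    "X9": (None, 72.0),  # exceeds the missing-percent threshold -> rejected
}

def prescreen(cols):
    kept = {}
    for name, (levels, miss_pct) in cols.items():
        if levels is not None and levels > MAX_CLASS_LEVELS:
            continue  # too many class levels
        if miss_pct > MAX_MISSING_PCT:
            continue  # too many missing values
        kept[name] = cols[name]
    return kept

print(sorted(prescreen(columns)))
```

Variables that fail either check never reach the selection methods, which mirrors how pre-screened variables are rejected before any supervised or unsupervised processing.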
The node offers six variable selection techniques: Unsupervised Selection, Fast Supervised Selection, Linear Regression Selection, Decision Tree Selection, Forest Selection, and Gradient Boosting Selection. You can choose one, multiple, or all techniques to run simultaneously. This makes the node a versatile and powerful tool for implementing various variable selection methods in one unified solution.
If you choose the unsupervised selection method, you can specify in the Selection process property whether this method is run prior to the supervised methods (sequential selection). If you choose to perform a sequential selection (the default), any variable rejected by the unsupervised method is not used by the subsequent supervised methods. If you are not performing a sequential selection, the results from the unsupervised method are combined with the chosen supervised methods.
If you choose multiple methods, the results from the individual methods are combined through the Combination criterion property to generate the final selection result. This is a voting scheme: each selection method gets a vote on whether a variable is selected, and you choose the voting level (combination criterion) at which a variable is selected. Voting levels range from the least restrictive option (at least one chosen method selects the variable) to the most restrictive option (all chosen methods select the variable). Any variable that is not selected in the final outcome is rejected, and subsequent nodes in the pipeline do not use it.
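A minimal sketch of this voting scheme, with the per-method selections mocked up (the sets below are illustrative, not output from the node):

```python
from collections import Counter

votes = Counter()
method_results = [
    {"C", "X1", "X3"},   # e.g., selections from one supervised method (mocked)
    {"C", "X3", "X10"},  # e.g., selections from a second method (mocked)
]
for selected in method_results:
    votes.update(selected)

def combine(min_votes):
    # Keep every variable that received at least `min_votes` votes.
    return {v for v, n in votes.items() if n >= min_votes}

print(sorted(combine(1)))  # least restrictive: selected by at least one method
print(sorted(combine(2)))  # most restrictive: selected by all methods
```

Raising `min_votes` from 1 to the number of methods moves you from the least restrictive to the most restrictive combination criterion.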
The Limit number of selected variables property specifies whether the number of selected variables is limited to the value of the Maximum variable rank property. Variables are kept based on their assigned ranks.
The Create Validation Sample from Training Data property specifies whether a validation sample is created from the incoming training data. This is recommended even if the data have already been partitioned, so that only the training partition is used for variable selection and the validation partition remains available for modeling. This property is selected by default.
As with modeling, it is critical that the variable selections made on the training data are also applied to any validation, test, or other holdout data during the data preprocessing stage. In other words, information from test or holdout data must not leak into the training process. Information leakage can occur in many ways and can lead to overfitting or overly optimistic error measurements.
Feature Selection in Model Studio Pipeline
We will use the getStarted data set from the SAS Viya documentation, which contains 100 observations and 12 variables: interval inputs generically named X1 through X10, a binary target variable Y, and a categorical input variable C. Categorical variables are dummy coded using the GLM method of parameterization.
A simple pipeline was created with the Data node connected to a Variable Selection node. In the Variable Selection node, the Unsupervised Selection, Fast Supervised Selection, and Forest Selection sliders were enabled. You can select more methods or even all of them at once! The maximum number of selected variables is limited to 5. All other settings were kept at their default values.
You can adjust each method by fine-tuning its detailed property and hyperparameter values. The default values for Unsupervised Selection and Fast Supervised Selection are shown below.
The default values of Forest selection are listed below.
After running the Variable Selection node, the Variable Selection table contains the output role for each variable. At the top of the table are the input variables selected by the node. These variables have a blank cell in the Reason column.
The Variable Selection table also shows the variables that are rejected because of the variable selection and pre-screening process (turned off in this case), as well as the reason for the rejection. Remember that sequential selection (default) is performed, and any variable rejected by the unsupervised method is not used by the subsequent supervised methods.
For each selection method, a rank is assigned to each selected variable, and the variable with the highest rank is assigned a value of 1. A fixed rank whose value is greater than the total number of inputs is assigned to variables that are not selected. The way in which ranks are assigned depends on the selection method. For example, the ranking of variables for the Forest Selection is based on variable importance, whereas the ranking of variables for the Fast Supervised method and the Unsupervised method is based on Proportion of Variance Explained. To derive the assigned ranks when multiple selection methods are enabled, selected variables are first ordered by the sum of their ranks across all of the methods used, and then assigned a rank based on their positioning in that order.
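Following the description above, the rank combination can be sketched as a sum-of-ranks ordering. The per-method ranks below are mocked for illustration (they are not the node's actual output), and ties are broken alphabetically here purely to make the sketch deterministic:

```python
# 11 inputs in the example data (C plus X1..X10); unselected variables
# get a fixed rank larger than the total number of inputs.
N_INPUTS = 11
NOT_SELECTED = N_INPUTS + 1

method_ranks = [
    {"C": 1, "X3": 2, "X1": 3},   # ranks from one method (mocked)
    {"C": 2, "X3": 1, "X10": 3},  # ranks from a second method (mocked)
]

def rank_sum(var):
    return sum(ranks.get(var, NOT_SELECTED) for ranks in method_ranks)

# Order the selected variables by their summed ranks, then assign the
# final rank from their position in that order.
selected = {v for ranks in method_ranks for v in ranks}
ordered = sorted(selected, key=lambda v: (rank_sum(v), v))
final_rank = {v: i + 1 for i, v in enumerate(ordered)}
print(final_rank)
```

A variable ranked highly by every method accumulates a small rank sum and therefore earns a top final rank, while a variable selected by only one method is penalized by the fixed not-selected rank from the other methods.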
For variables rejected by the supervised methods, the Reason column shows the combination criterion (at least one, in this case). To see whether each individual method selected or rejected a variable, look at the Variable Selection Combination Summary table.
The first variable, C, in the Variable Selection Combination Summary table is selected by both the Fast Supervised and Forest methods, so its output role is Input. Any variable rejected by both methods, such as X6 and X7, receives the output role Rejected. A variable selected by only one of the methods, such as X1 or X10, still receives the output role Input because the Combination criterion property is set to Selected by at least 1.
For a quick demonstration, watch the Variable Selection Node in Model Studio tutorial to learn how to use the node to identify key predictors in Model Studio.
Concluding Remarks
In summary, the Variable Selection node is a powerful data preprocessing tool that reduces the number of inputs based on multiple supervised and unsupervised selection results. Although rejected variables are passed to subsequent nodes in the pipeline, they are not used as model inputs by a successor modeling node. The node quickly identifies input variables that are useful for predicting the target variable, and these information-rich inputs can then be evaluated in more detail by one of the modeling nodes.
When multiple selection methods are applied, their results are integrated to produce the final selection, ensuring a comprehensive evaluation. Additionally, assigning ranks to the selected variables provides an opportunity to further refine your inputs for detailed analysis. Embracing this approach can significantly enhance the efficiency and precision of your data modeling process.