Variable selection is an important data preprocessing task that improves model performance by removing irrelevant and redundant inputs, which enhances accuracy and reduces computational complexity. Variable selection can be supervised (using the target variable) or unsupervised (ignoring the target). SAS Viya supports both: supervised methods reduce irrelevant inputs, and unsupervised methods remove redundant ones. Here, we highlight a specific unsupervised method known as variance-based unsupervised variable selection. Another approach is variable clustering. To learn more about clustering, check out the SAS tutorial titled "Feature Selection Using Graphical Lasso" on YouTube.
Unsupervised Variable Selection
In SAS Viya, the unsupervised variable selection method can be implemented programmatically using the VARREDUCE procedure or the UNSUPER action, and interactively through the Variable Selection node in Model Studio pipelines.
The unsupervised variable selection method selects a set of variables that can jointly explain the maximum amount of variance in the input space. Variance is represented using one of three matrices: the Pearson correlation matrix, the covariance matrix, or the sums of squares and cross-products (SSCP) matrix.
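As a quick illustration of these three representations (sketched here in Python with NumPy rather than in SAS, using a made-up 100 x 3 input matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 observations, 3 interval inputs

corr = np.corrcoef(X, rowvar=False)  # Pearson correlation matrix (3 x 3)
cov = np.cov(X, rowvar=False)        # covariance matrix (3 x 3)
sscp = X.T @ X                       # sums of squares and cross-products

# The covariance matrix is the SSCP of the mean-centered data scaled by
# n - 1, so the three representations differ only in how the inputs are
# centered and scaled.
Xc = X - X.mean(axis=0)
```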
This method reduces dimensionality by forward selection of the variables, and the output lists the variables in order of their contribution to the data variance. In forward selection, the process starts with an empty set of variables and iteratively adds the variable that contributes the most to the overall variance. You can specify the fraction of the total variance to be explained, the minimum increment of explained variance, or both.
Forward selection adds inputs one at a time: at each step, it computes the additional variance explained by each candidate variable and selects the one that maximizes the increase, provided that increase exceeds a predetermined threshold value, called the entry cutoff. The process continues until a stopping criterion is met, such as reaching a specified number of selected variables or the point where adding further variables no longer meaningfully increases the overall variance explained. Adding terms in this nested, greedy fashion tends to select the most informative variables while reducing redundancy, although it does not guarantee a globally optimal subset. This method is particularly beneficial in applications where retaining the original variables is important for model exploration and interpretation.
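To make the loop concrete, here is a minimal Python/NumPy sketch of greedy forward selection by explained variance. This illustrates the idea, not the VARREDUCE implementation; the function name, the least-squares projection, and the way the cutoff is applied are all choices made for this example:

```python
import numpy as np

def forward_select(X, max_vars, min_increment=1e-6):
    """Greedily select columns of X that jointly explain the most variance.

    X is an n x p matrix of centered inputs.  At each step the variable
    whose addition most increases the variance of X explained by the
    selected set is added, provided the increase exceeds min_increment
    times the total variance (a simple incremental-variance cutoff)."""
    n, p = X.shape
    total = np.sum(X * X)              # total (unnormalized) variance
    selected, explained = [], 0.0
    while len(selected) < max_vars:
        best_gain, best_j = 0.0, None
        for j in range(p):
            if j in selected:
                continue
            S = X[:, selected + [j]]
            # variance captured by projecting all of X onto span(S)
            proj = S @ np.linalg.lstsq(S, X, rcond=None)[0]
            gain = np.sum(proj * proj) - explained
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None or best_gain <= min_increment * total:
            break                      # no candidate clears the cutoff
        selected.append(best_j)
        explained += best_gain
    return selected, explained / total

# Example: the third column is an exact linear combination of the first
# two, so two variables suffice to explain essentially all the variance.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))
X = np.column_stack([A, A @ [1.0, 1.0]])
X -= X.mean(axis=0)
selected, frac = forward_select(X, max_vars=3)
```

In the example, the selection loop stops after two variables: the redundant third column adds essentially no incremental variance and falls below the cutoff.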
This all sounds interesting, but how is it actually done in sequential forward selection?
An Optimization Algorithm
Unsupervised variable selection can be successfully approached as an optimization problem by means of global optimization heuristics if an appropriate objective function is considered.
We will use the getStarted data set from the SAS Viya documentation, which contains 100 observations and 12 variables: interval inputs generically named X1 through X10, a binary target variable Y, and a categorical input variable C. The binary target Y is needed only for supervised variable selection. Thus, our input space X consists of 100 observations and 11 variables, one of which is categorical. Categorical variables are dummy coded using the GLM method of parameterization. Our objective is to maximize the explained variance in this input space, which makes variable selection an optimization problem.
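For reference, GLM parameterization creates one indicator column per level of a categorical variable, yielding a less-than-full-rank design. A minimal Python sketch (the helper name is made up for this example):

```python
import numpy as np

def glm_dummy(values):
    """GLM-style (less-than-full-rank) dummy coding: one indicator
    column per level, with levels taken in sorted order."""
    levels = sorted(set(values))
    design = np.array([[1.0 if v == lev else 0.0 for lev in levels]
                       for v in values])
    return levels, design

# A categorical input with three levels becomes three indicator columns;
# each row contains exactly one 1.
levels, D = glm_dummy(["B", "A", "B", "C"])
```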
This input space can be thought of as being divided into two parts: the variables selected by the unsupervised method and those that were not. Suppose S contains the five selected variables (X3, X5, X7, X10, and C), while U contains the remaining six variables (X1, X2, X4, X6, X8, and X9). The problem is then recast as a minimization rather than a maximization problem by using the concept of a null space.
The null space of a matrix is the set of all vectors that, when multiplied by the matrix, result in the zero vector. The idea of a null space helps identify relationships or dependencies between variables: if there are nonzero vectors in the null space, some variables can be expressed in terms of others. The unselected-variables matrix U is therefore projected onto the null space of the selected-variables matrix S (more precisely, the orthogonal complement of the column space of S) to determine whether the variables in U account for any variation in the data beyond what S already captures.
The data variance that resides in the null space of the selected variables is then measured. Because U is projected into the null space of S, this variance is exactly what cannot be explained by the variables in the selected set, and it is therefore what we minimize. Since explained and unexplained variance sum to the fixed total variance, minimizing the unexplained variance is equivalent to maximizing the explained variance, so this process selects variables that jointly explain the maximum amount of the variance in the original data.
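This equivalence is easy to verify numerically. The sketch below (Python/NumPy, with made-up centered data and a two-column selected set S) projects U onto the orthogonal complement of the column space of S and checks that explained and unexplained variance sum to the total:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                  # center the inputs

S, U = X[:, :2], X[:, 2:]            # selected and unselected variables

# Orthonormal basis for the column space of S, via a reduced QR
Q, _ = np.linalg.qr(S)

# Project U onto the orthogonal complement ("null space") of S
U_null = U - Q @ (Q.T @ U)

unexplained = np.sum(U_null * U_null)    # variance S cannot explain
explained = np.sum(X * X) - unexplained  # variance S does explain
# explained + unexplained equals the fixed total variance, so
# minimizing one is the same as maximizing the other.
```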
Implementation in Model Studio
The Variable Selection node in Model Studio includes variance-based unsupervised variable selection as one of its many available methods. A simple pipeline has been created with the Data node connected to a Variable Selection node.
In the Variable Selection node, only the Unsupervised Selection slider is enabled. The maximum number of variables to be selected is limited to 5. The Correlation Statistics field is set to its default option, Correlation, while alternative options include Covariance and Sum of Squares and Crossproducts. The cumulative variance cutoff is set to 0.9, and the incremental variance cutoff remains at its default value.
After running the Variable Selection node, the Cumulative Variance Explained window displays a bar chart showing the proportion of variance explained as each variable is added. The variables are ordered by decreasing incremental variance.
The Output window displays the SAS output of the variable selection run including the selection summary. The Selection Summary table displays for each iteration the name of the selected effect, the name of the selected level, and the total variance explained after the iteration.
The Selected Variables table (not shown here) summarizes which variables were selected in the selection process. It also provides information about the variable type of each selected variable. Five variables were selected: X3, X7, X10, C and X5.
You can also implement this programmatically. To learn how to perform unsupervised variable reduction using the VARREDUCE procedure in SAS Viya, watch this video: "Unsupervised Variable Reduction Using the VARREDUCE Procedure in SAS Viya" on sas.com.
Concluding Remarks
Variable selection is the obvious way to thwart the curse of dimensionality. Unfortunately, reducing the dimension is also an easy way to discard important information. In predictive modeling, relying solely on unsupervised variable selection can result in the inclusion of less relevant predictors, because unsupervised selection focuses on eliminating redundant inputs rather than irrelevant ones. To overcome this limitation, it is recommended to combine supervised and unsupervised methods for variable selection. That said, if the goal is purely to remove redundant inputs, as in cluster analysis, unsupervised variable selection alone is appropriate.
Find more articles from SAS Global Enablement and Learning here.