Feature Extraction in SAS Model Studio, Part 1: Principal Components Analysis

This post is the first in a four-part series on feature extraction techniques in SAS Model Studio, in which I’ll explore practical methods for preparing data for modeling. Why four parts? The Feature Extraction node in Model Studio offers four feature extraction methods: Autoencoders, Principal Components Analysis, Robust Principal Components Analysis, and Singular Value Decomposition, and I plan to write a post on each. In case you’re unfamiliar with it, Model Studio is a GUI-based data mining tool that runs on SAS Viya. Although it supports nearly all aspects of the data mining lifecycle, its primary purpose is building predictive models.

I’ll start with Principal Components Analysis (PCA), a well-known and foundational technique for reducing complexity and handling correlated inputs. I’ll cover the concept and purpose of PCA and provide a short, Model Studio-based example. For data scientists with backgrounds in mathematics, statistics, or even engineering, the discussion of PCA will likely be a review, but you’ll also gain new insight into implementing this method in a Model Studio flow.

Why Extract New Features?

In predictive modeling, success rarely comes down to the machine learning algorithm alone. More often, it’s driven by how well the input data is prepared. This is where feature extraction becomes critical. Feature extraction goes by many names; feature engineering and feature creation are two common ones for the same concept. I’ll stick with feature extraction in this series, as that’s the name of the Model Studio node I’m focusing on. So what is feature extraction? It means transforming and/or combining input variables into new inputs that the model uses for prediction. By transforming raw variables into more meaningful and effective inputs, feature extraction helps reduce noise, improve model stability, and increase predictive performance. Without it, even sophisticated models can struggle with redundant, inconsistent, or overly complex data.

A Simple View of PCA

At a high level, PCA is a method for reducing a large set of variables to a smaller, more manageable set while preserving most of the important information contained in the input variables. Rather than working with many input variables that may overlap or move together, PCA combines all of them to create new ones. Even when strong correlations exist in the raw inputs, PCA creates new, uncorrelated variables called principal components. These principal components (PCs) are linear combinations of all available inputs. Although PCA is primarily used on interval inputs, categorical inputs can also be included through dummy coding or a target-based transformation. The method is unsupervised, meaning it ignores the target variable and focuses on capturing the most meaningful variation in the inputs. In simple terms, PCA helps you summarize your data, typically in fewer dimensions, making it easier to work with and often more effective for modeling. One drawback of PCA is the loss of interpretability. Although the linear combinations are interpretable in theory, in practice, especially with a large number of input variables, direct interpretation of the final model is rarely feasible.
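To make the idea concrete, here is a minimal Python sketch (not SAS code, and not what Model Studio runs internally) of PCA on just two correlated inputs. It relies on a standard result: for two standardized variables with correlation r, the correlation matrix has eigenvalues 1 + r and 1 - r, and the PCs are proportional to the sum and the difference of the standardized inputs. The data values and helper names are made up for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_cov(xs, ys):
    """Sample covariance (n - 1 denominator); sample_cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def standardize(xs):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(xs), math.sqrt(sample_cov(xs, xs))
    return [(x - m) / s for x in xs]

# Hypothetical inputs that move together (say, minutes used and charges).
minutes = [120.0, 340.0, 200.0, 510.0, 95.0, 430.0, 280.0, 390.0]
charges = [14.0, 35.5, 24.0, 49.0, 12.5, 47.0, 27.0, 41.0]

z1, z2 = standardize(minutes), standardize(charges)
r = sample_cov(z1, z2)  # covariance of standardized inputs = their correlation

# Eigenvectors of a 2x2 correlation matrix: (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
# With r > 0, the sum is the first PC (largest eigenvalue).
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]

print(f"correlation of the inputs:  {r:.3f}")
print(f"variance of PC1:            {sample_cov(pc1, pc1):.3f}  (eigenvalue 1 + r)")
print(f"variance of PC2:            {sample_cov(pc2, pc2):.3f}  (eigenvalue 1 - r)")
print(f"covariance of PC1 and PC2:  {sample_cov(pc1, pc2):.3f}")
```

The two inputs are strongly correlated, yet the two PCs have zero covariance, and nearly all of the variation lands in the first one. That is the sense in which a single component can stand in for several overlapping inputs.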

Why Use PCA in Data Preprocessing?

PCA is especially useful during data preparation because it addresses several common modeling challenges. One issue that often arises in business data is multicollinearity, that is, highly correlated inputs in the modeling data, which can destabilize models. PCA replaces the correlated inputs with uncorrelated components. Another issue is having too many inputs. Using all (or a large number) of them in a predictive model typically leads to overfitting the training data, so the model generalizes poorly to new data, and it also increases training time. PCA reduces the dimensionality of the input space. Although mathematically as many PCs can be constructed as there are inputs in the data, typically only a small number of them are needed. Lower-importance variation in the raw data, i.e., noise, can be filtered out by keeping only the most informative components. Using PCs in place of raw inputs can also improve model efficiency, since simpler, cleaner inputs often lead to more stable and efficient models. The result of using PCA for data preprocessing is typically a data set that has fewer dimensions and is better structured for analysis.

Example: Constructing Principal Components in Model Studio

Let’s see how to construct PCs in Model Studio. For this example, we analyze data from a fictitious telecommunications company that is trying to predict customer churn. The data has a binary target indicating which customers have churned and over 120 variables, the majority of which are interval, which makes it an ideal scenario for feature extraction. Here’s the pipeline:


After the data node, an Imputation node is added so that observations with missing values are not ignored. A Feature Extraction node, which will implement the PCA, is added after the Imputation node. Both nodes come from the Data Mining Preprocessing group. It is important to point out that the Feature Extraction node operates only on interval inputs. So, if categorical inputs are to be considered, they need to be transformed in some way, as mentioned earlier. In this example, I’ll focus only on interval inputs, so keep in mind that when I mention “input data” from here on, I’m referring ONLY to the interval inputs.

Let’s take a look at the properties panel for the Feature Extraction node. Here’s the first part:

As I mentioned earlier, the Feature Extraction node offers four different methods for creating new features. Four of the options for the first property, Feature extraction method, are Autoencoder, Principal component analysis, Robust PCA, and Singular value decomposition. The fifth option, which is the default setting, is Automatic. Automatic invokes PCA when the number of input variables is less than or equal to 500; otherwise, it uses Singular value decomposition. I’ve changed this property to Principal component analysis for this example, although given the number of inputs in this data set, I could have left it at Automatic.

It is common to leave the property Reject original input variables selected, since the new features are meant to replace the original inputs. The property Eigenvalue source indicates how eigenvalues and eigenvectors are calculated. (Quick side note on terminology: An eigenvalue is essentially a measure of worth for each PC. With the correlation matrix as the source, a PC with an eigenvalue of 2.5 is “worth” about 2.5 of the original variables in terms of the amount of variation it accounts for in the data. There is a more rigorous mathematical definition, but that is outside the scope of this post. The PCs are ordered by decreasing eigenvalue. An eigenvector is the vector of coefficients for the linear combination that defines each PC.) The two options for Eigenvalue source are Correlation (default) and Covariance. Although a detailed discussion of these PCA topics is beyond the scope of this post, I will say that Correlation is the far more standard choice. Covariance should be used only when all original inputs are on the same scale.
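The scale warning is easy to see with two inputs. The Python sketch below is illustrative only: the standard deviations, the correlation, and the helper eig_sym_2x2 are hypothetical and are not taken from the churn data or from the node. It computes the first PC from a covariance matrix and from the corresponding correlation matrix, using the closed-form eigen decomposition of a symmetric 2x2 matrix.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (largest first) and the leading unit eigenvector
    of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # (b, lam1 - a) solves (A - lam1 * I) v = 0 when b != 0.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Hypothetical inputs on very different scales:
# annual income in dollars and number of calls per month.
sd_income, sd_calls, r = 15000.0, 4.0, 0.6

cov_vals, cov_vec = eig_sym_2x2(sd_income**2, r * sd_income * sd_calls, sd_calls**2)
corr_vals, corr_vec = eig_sym_2x2(1.0, r, 1.0)

for label, vals, vec in (("Covariance", cov_vals, cov_vec),
                         ("Correlation", corr_vals, corr_vec)):
    share = vals[0] / (vals[0] + vals[1])
    print(f"{label:<12} PC1 coefficients: ({vec[0]:.4f}, {vec[1]:.4f}), "
          f"share of variance: {share:.4f}")
```

With Covariance, the first PC is essentially the income variable by itself, simply because income has the larger numeric variance. With Correlation, both inputs are weighted equally, which is usually what you want when the inputs are measured in different units.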

The Component prefix property defines the prefix that is prepended to the names of the newly created features. The default for PCA is PC.

The remaining properties all relate to the final number of components that are selected and passed on to successor nodes (again, a detailed discussion of some of them is beyond the scope of this post):

Note the property Apply maximum number, which is turned on by default. When it is on, the node selects no more PCs than the value of the Maximum number property, which defaults to 20. (With fewer input variables than that, fewer PCs are available.) We’ll see this number soon when looking at the results.

Cumulative variance cutoff is another property that can control the final number of PCs passed on to a modeling node. Each component accounts for a percentage of the variation contained in the input variables. The default setting, 0.99, indicates that if the limit set in the Maximum number property is not reached first, the node passes on the number of components that together account for 99% of the variation among the input variables.
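As a rough illustration of how these two properties interact, here is a small Python sketch. It reflects my reading of the behavior described above, not the node’s actual implementation; the function name, parameter names, and eigenvalues are all hypothetical.

```python
def select_component_count(eigenvalues, apply_maximum=True,
                           maximum_number=20, cumulative_cutoff=0.99):
    """Number of PCs to keep: the smallest count whose cumulative share of
    variation reaches the cutoff, capped by the maximum when it applies."""
    ordered = sorted(eigenvalues, reverse=True)  # PCs are ordered by eigenvalue
    total = sum(ordered)
    running = 0.0
    needed = len(ordered)  # fall back to all PCs if the cutoff is never reached
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= cumulative_cutoff:
            needed = k
            break
    return min(needed, maximum_number) if apply_maximum else needed

# Hypothetical eigenvalues that decay quickly: the cutoff decides.
fast_decay = [4.0, 2.0, 1.0, 0.5, 0.25, 0.25]
print(select_component_count(fast_decay))                         # cutoff reached within the cap
print(select_component_count(fast_decay, cumulative_cutoff=0.9))  # a lower cutoff keeps fewer PCs

# Hypothetical flat eigenvalues for 71 inputs: the cap of 20 decides.
flat = [1.0] * 71
print(select_component_count(flat))
```

When the variation is spread across many components, as in the second case, the 99% cutoff would require far more PCs than the cap allows, so the cap determines the result.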

When the node is run, several windows appear in the results. I’ll highlight the key ones. The Eigenvalue Plots window, which by default displays Eigenvalue versus PC ID (as seen in the pull-down menu in the upper left corner), shows the final number of components that will be passed on to modeling nodes. This plot is also known as a scree plot.

The top 30 PCs are displayed. The vertical black line at a Principal Component ID value of 20 indicates that 20 components will be selected. This value is due to the Apply maximum number property discussed above. Sometimes a bend or “elbow” in such a plot can suggest the optimal number of PCs to use. Here, the bends suggest possibly just 2 or 4 PCs. If one of these values were desired, it could be set in the Fixed number property of the node. It may also be of interest to see how much cumulative variation in the input data is accounted for by different numbers of PCs. To see this information, change the drop-down menu to Cumulative Proportional Eigenvalue.

The vertical axis, Cumulative Proportional Eigenvalue, is just another way of saying the cumulative amount of variation accounted for. This plot indicates that the 20 PCs selected by the node cumulatively account for just over 60% of the total variability of the input data. If we chose the number of PCs based on bends in the scree plot, 2 and 4 PCs would account for only around 10% and less than 30% of the total variation in the data, respectively. Often it is desirable to account for a larger amount of the variation, say 80% or 90%. If this were the goal of the PCA, a larger number of PCs could be selected by either increasing the Maximum number property or invoking the Fixed number property.

Another useful window in the results is the Output window from the underlying procedure.

The Output window indicates that the PCA procedure is the Viya procedure the node uses. The output shows that 71 interval variables were used. Keep in mind that the node uses only interval inputs. A few interval inputs in the raw data were rejected based on business needs. Each PC is then a linear combination of these 71 inputs. It is also shown that a total of 50 PCs were constructed. This is due to the upper bound on the Maximum number property. The Simple Statistics table shows the 71 variables used in the PCA.

So, what’s the final punchline? Well, let’s see the newly created features. A sample of the output data can be viewed on the Output Data tab.

Here we see a subset of the newly created PC columns. Keep in mind that all original interval inputs are rejected and these 20 new interval columns have a role of input. Each observation has a specific value, or score, for each PC column. This value is obtained by plugging the observation’s interval input values into the linear combination that defines that PC.
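In code form, one such value could be computed along the following lines. This is a hedged Python sketch: it assumes the inputs are centered and scaled before the eigenvector coefficients are applied, which is the usual convention when the correlation matrix is the eigenvalue source. The exact scoring details of the node are not covered in this post, and the means, standard deviations, coefficients, and function name below are hypothetical.

```python
def pc_score(observation, means, stds, eigenvector):
    """Score of one observation on one PC: standardize each input,
    then take the linear combination given by the eigenvector."""
    z = [(x - m) / s for x, m, s in zip(observation, means, stds)]
    return sum(coef * value for coef, value in zip(eigenvector, z))

# Hypothetical training statistics for three interval inputs.
means = [300.0, 30.0, 2.0]
stds = [100.0, 10.0, 1.0]

# Hypothetical unit-length eigenvector (coefficients) for one PC.
eigenvector = [0.6, 0.64, 0.48]

# One customer, exactly one standard deviation above the mean on every input.
customer = [400.0, 40.0, 3.0]
print(f"PC score: {pc_score(customer, means, stds, eigenvector):.2f}")
```

Because this customer’s standardized values are all 1, the score is simply the sum of the coefficients. A customer sitting exactly at the mean of every input would score 0 on every PC.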

Now let’s see the new features in a model. I connect the Feature Extraction node to a Gradient Boosting node and run it. No need to show all the results as I just want to emphasize the newly created features used by the model. Here’s the Variable Importance table for the results of the gradient boosting model.

Categorical variables were ignored by the Feature Extraction node so they were passed directly to the gradient boosting model. Note above that the top two most important variables based on the gradient boosting model were two of the created PCs.

Summary and Looking Ahead

Feature extraction is an essential data preprocessing task for building effective predictive models. PCA helps reduce complexity and prevent overfitting while retaining critical information. It is especially useful for handling correlated or high-dimensional data. In the upcoming posts in this series, I’ll cover the other three methods in the Feature Extraction node: Autoencoders, Robust PCA, and Singular Value Decomposition. If there’s one method in particular you’re dying to learn more about, let me know in the comments below, and I’ll do my best to write on that one next.

For more on PCA:

Training:

Multivariate Statistics for Understanding Complex Data

Advanced Machine Learning using SAS Viya

Communities Posts:

How many principal components should I keep? Part 1: common approaches

How many principal components should I keep? Part 2: randomization-based significance tests

The Principal Components of Principal Component Analysis
