
How to reduce collinearity prior to modeling using PROC VARCLUS and PROC VARREDUCE


In my previous post What is collinearity and why does it matter?, I described collinearity, how to detect it, its consequences, and ways to reduce its impact. One method of reducing collinearity is to remove redundant predictors. In this blog, I’ll demonstrate two approaches to variable reduction that can reduce collinearity prior to regression modeling.

 

Collinearity, also called multicollinearity, means strong linear associations among sets of predictors.  Its presence can increase the variance of predictions and parameter estimates and can make the estimates unstable.  The problems associated with collinearity can be avoided by removing redundant variables prior to regression modeling.  Redundancy here means that the variables provide much of the same information and that they are typically (but not always) highly correlated.

 

Removing redundancy can be helpful even in the absence of severe collinearity problems, so the methods described below can be useful in other contexts.  For example, predictive modelers often remove redundant predictors to avoid the curse of dimensionality, which refers to the problems that arise when the number of predictors is large relative to the number of observations.  When the dimensionality of the data is large, even enormous data sets may be sparse and lead to unreliable modeling results.

 

The two approaches that I’ll demonstrate for removing redundancy and therefore collinearity involve variable clustering using the SAS 9 procedure PROC VARCLUS and variable reduction using the SAS Viya procedure PROC VARREDUCE.

 

For both demonstrations, I’ll use a banking data set (develop_training) with 61 numeric predictors and a binary target.  The target, ins, indicates whether a customer purchased a variable annuity in response to a marketing campaign.  A version of these data is available in SAS Viya for Learners.   These data show strong collinearity (VIF>10).  Variance inflation factors, calculated by PROC REG, show 5 predictors with VIF between 10 and 22 and 5 predictors with VIF between 45 and 63.
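For reference, here is a minimal sketch of how such a VIF check can be run with PROC REG. It assumes a macro variable &inputs holds the names of the 61 numeric predictors; because VIFs depend only on the predictors, the binary target ins can serve as the response.

proc reg data=work.develop_training;
    model ins = &inputs / vif;    /* VIF option requests variance inflation factors */
run;
quit;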

 

Reducing redundancy through variable clustering (SAS 9)

 

The first approach involves finding clusters of correlated variables using PROC VARCLUS.  Each cluster consists of variables that have high correlations with members of their own cluster but low correlations with members of other clusters.  Once clusters are formed, the analyst can manually discard all but one predictor from each cluster.  So, 100 predictors grouped into 37 variable clusters would be reduced to 37 predictors for modeling.

 

Let’s walk through the code, which is modified from a demonstration in the SAS class Predictive Modeling using Logistic Regression.  The procedure starts with all the variables in a single cluster.  It then performs principal component analysis (PCA), and if the second eigenvalue is greater than a threshold, the cluster is split into two clusters of variables.  This is repeated until the second eigenvalue for all clusters is below the user-chosen threshold.

 

What is the logic behind using the second eigenvalue here?  Here is an illustration to help.  On the left is correlated data showing heights and weights for a group of people, while on the right is relatively uncorrelated data showing heights and incomes of the same people.  For each data set, PCA finds PC1, a new variable formed as the linear combination of the two variables that has the greatest variance.  A second new variable, PC2, is created so that it explains the second greatest proportion of variation in the original variables and is perpendicular to PC1.  The eigenvalues (variances) of PC1 and PC2 are λ1 and λ2 respectively.

 

[Figure: side-by-side scatter plots with PC1 and PC2 axes. Left: correlated data (height vs. weight). Right: relatively uncorrelated data (height vs. income).]

 

In the left picture, λ2 is small, which indicates that height and weight are correlated and thus belong in the same cluster of variables.  In the right picture, λ2 is large, indicating there are two (or more) relatively uncorrelated variables, so the cluster is split into two variable clusters.  The process repeats, with each resulting cluster split further whenever its second eigenvalue is above the chosen cutoff.
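As a quick way to see why this works: for two standardized variables with correlation r, the eigenvalues of their correlation matrix are λ1 = 1 + |r| and λ2 = 1 − |r|, so the stronger the correlation, the closer λ2 is to zero.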

 

The default maximum second eigenvalue for PROC VARCLUS is 1.  Several analysts believe that 1 reduces the number of predictors too much and prefer starting with 0.7.  Smaller values will result in more variable clusters and thus less reduction in the number of predictors.  I’ll show the effect of changing this setting later in this post.

 

PROC VARCLUS produces a lot of output that isn’t currently needed.  So ODS SELECT turns off the printed report and ODS OUTPUT saves the relevant output to SAS data sets. The HI option performs hierarchical clustering. Basically, this prevents variables from switching clusters during splitting.  The summary table is then printed to see how many clusters were created during the final splitting.

 

ods select none;                          /* suppress the full printed report */
ods output clusterquality=work.summary    /* last row has final # of clusters (=37) */
           rsquare=work.clusters;         /* cluster memberships for each solution */

proc varclus data=work.develop_training
    maxeigen=.7 hi;                       /* start with maxeigen=.7, can change later */
    var &inputs;                          /* &inputs holds the 61 numeric predictor names */
run;
ods select all;

title "Variation Explained by Clusters";
proc print data=work.summary label;
run;

 

Here is a partial printout of the summary table:

 

[Figure: PROC VARCLUS summary table, "Variation Explained by Clusters"]

 

We can see that the variables were split into 37 clusters, with 92% of the variability in the original predictors explained by these clusters.  This results in a reduction from the original 61 predictors to 37.

 

What if 37 is more predictors than desired? You could go back and increase the maximum second eigenvalue.  For example, if you set MAXEIGEN=1, splitting would stop at 19 clusters.  You can see this in row 19 of the table, where the maximum second eigenvalue first drops below 1, taking the value 0.999772.
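If the coarser 19-cluster grouping is preferred, the rerun is just the same call with the threshold changed (again assuming &inputs holds the predictor names):

proc varclus data=work.develop_training
    maxeigen=1 hi;    /* MAXEIGEN=1 is the default; splitting stops at 19 clusters here */
    var &inputs;
run;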

 

Now that we know that there are 37 clusters, we can look at the cluster memberships and pick one predictor from each to keep.  Printing out work.clusters where NumberOfClusters = 37 will show cluster membership for the final cluster solution:

 

proc print data=work.clusters noobs label split='*';
    where NumberOfClusters=37;
    var Cluster Variable RSquareRatio VariableLabel;
    label RSquareRatio="1 - RSquare*Ratio";
run;

 

[Figure: cluster membership listing showing each cluster's variables with their 1 − R² ratios and labels]

 

What criterion should be used for picking one predictor per cluster to keep?  Here are several suggestions:

 

  1. Use subject matter expertise to keep the variable that is theoretically meaningful or known to be an important predictor.
  2. Keep the predictor with the strongest association with the target. The right tool for assessing the association depends on the types of the predictor and target.  When both are continuous, use Pearson correlation. For two categorical variables, you can use Cramer’s V or odds ratios (the latter only for 2-level variables). For a binary target and a continuous predictor, you can use biserial correlations; a small sketch follows this list.  See this note for how to calculate biserial correlations in SAS: 24991 - Compute biserial, point biserial, and rank biserial correlations.
  3. Keep the predictor with the higher-quality data. For example, if other variables in the cluster have lots of missing values, keep the one with more complete information.
  4. Keep the more reliably measured variables. It’s easier to get an accurate measure of income or number of vacation days than job satisfaction. Assuming there is no theoretical reason to choose one over the other, the more reliable variable may be a better one to keep.
  5. Keep the variable with the smallest 1 − R² ratio. A smaller value indicates the variable is a better cluster representative (more correlated within its cluster and less correlated with other clusters), all else being equal. I think of this as a last resort and prefer any of the previous criteria, all of which require knowing your data.
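Here is a minimal sketch for criterion 2, assuming the target ins is coded 0/1 so that the point-biserial correlation is simply the Pearson correlation reported by PROC CORR. The predictor names other than ins are hypothetical placeholders for one cluster's members:

proc corr data=work.develop_training;
    with ins;                   /* binary 0/1 target */
    var dep ddabal checks;      /* placeholder names for one cluster's variables */
run;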

 

Did this reduce collinearity? Yes.  Using PROC REG to calculate variance inflation factors, I found the 37 predictors I retained all had VIF < 2.6.
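As a sketch of that check, the same PROC REG diagnostic can be rerun on the retained variables, where &keep_vars is a hypothetical macro variable listing the 37 predictors kept (one per cluster):

proc reg data=work.develop_training;
    model ins = &keep_vars / vif;    /* all retained predictors had VIF < 2.6 */
run;
quit;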

 

Reducing redundancy using PROC VARREDUCE (SAS Viya)

 

The second approach I’ll demonstrate uses the SAS Viya procedure VARREDUCE to reduce redundancy and therefore collinearity.  This procedure can do two kinds of variable reduction: supervised and unsupervised.  Supervised variable reduction involves retaining variables based on their association with the target.  Unsupervised variable reduction retains variables based on the proportion of the variability in the original predictors that they can explain.  I’ll use the unsupervised approach for reducing collinearity.

 

Let’s go through the code I used.  Unlike the VARCLUS procedure, PROC VARREDUCE has a CLASS statement and can read in categorical variables.  The categorical variables in the develop_training data were left out in order to use the same predictors as in the first demonstration. The REDUCE statement uses the UNSUPERVISED keyword to perform unsupervised variable selection.  The VARIANCEEXPLAINED option instructs the procedure to retain enough variables that at least 90% of the variability in the original predictors can be explained.  The MINVARIANCEINCREMENT option retains only variables that increase the proportion of the original variance explained by at least 1%. This option can take values in the interval (0, 1), with a default value of 0.001 (that is, 0.1% of the predictor variability).
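Based on the options just described, the call would look roughly like the following sketch (not necessarily the exact code I ran). It assumes the data have been loaded into a CAS library referenced here as mycas and that &inputs again holds the 61 numeric predictor names:

proc varreduce data=mycas.develop_training;
    reduce unsupervised &inputs /
        varianceexplained=0.9        /* keep enough variables to explain at least 90% */
        minvarianceincrement=0.01;   /* each kept variable must add at least 1% */
run;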

 

Below is some of the PROC VARREDUCE output:

 

[Figure: PROC VARREDUCE selection summary showing the order in which predictors are retained and the cumulative proportion of variance explained]

 

In the table above, the predictor IM_CCBAL is retained first because it explains the greatest proportion of variability among the predictors.  The second variable added to the keep list is Dep, because it produces the greatest increase in the proportion of variance explained.  This process reduces collinearity because once a predictor enters this “keep list”, any predictors highly collinear with it will not appreciably increase the proportion of variance explained and are therefore left off the list.

 

Did this reduce collinearity?  Yes.  The reduced set of 36 predictors shown above has VIF < 2.3 for all predictors.

 

Conclusion

 

I hope these demonstrations will help you in dealing with collinearity problems, or even just with having too many predictors in your data.  Variable reduction is only one approach to reducing the effects of collinearity.  If all predictors are theoretically important and none can be dropped, other approaches will be required.  In these situations, biased regression techniques such as ridge regression may be a better choice.  I will describe and demonstrate ridge regression in my next post.

 

Links

 

  1. For an explanation of collinearity and its consequences please see my previous post: What is collinearity and why does it matter? 
  2. If you’re interested in learning about principal component analysis, please see How many principal components should I keep? Part 1: common approaches.
  3. For an excellent class on developing predictive models try out the SAS course Predictive Modeling Using Logistic Regression.

 

 

Find more articles from SAS Global Enablement and Learning here.
