
Dealing with collinearity through ridge regression


In my previous post What is collinearity and why does it matter?, I described collinearity, how to detect it, its consequences, and ways to reduce its impact. One method of reducing the high variance due to collinearity is to use ridge regression.  In this post, I’ll describe ridge regression and the situations in which it is useful, and I’ll demonstrate how to implement it in SAS 9 and SAS Viya.

 

Collinearity, also called multicollinearity, means strong linear associations among sets of predictors.  Its presence can increase the variance of predictions and parameter estimates and can make the estimates unstable.  The problems associated with collinearity can be avoided by removing redundant variables prior to regression modeling.  For a description of two approaches for variable reduction, please see my previous post: How to reduce collinearity prior to modeling using PROC VARCLUS and PROC VARREDUCE. Another approach to dealing with collinearity problems can be applied post model-fitting.  Variables that have high variance inflation factors (VIF) or condition index values can be dropped from your model.  For a review of these collinearity diagnostics, please see my previous post:  What is collinearity and why does it matter?

 

But what if all the predictors are theoretically important and none can be reasonably thrown out solely to reduce standard errors?  In this case, ridge regression can be a good approach.  Ridge regression, also called L2-regularization, involves shrinking regression coefficients towards zero, which decreases the variability of the model’s predictions. This shrinkage ideally adds a tolerable amount of bias in exchange for a larger reduction in the variance of parameter estimates.  Lower variance in the parameter estimates leads to lower variance of predictions as well.  Reducing the prediction variance of a model makes it more robust to overfitting and may improve its ability to generalize to new data (i.e., make accurate predictions).  This increased accuracy is possible because prediction bias and prediction variance both contribute to prediction error, and there is a trade-off between them.  This bias-variance trade-off is an important concept in predictive modeling. For an explanation of its importance, try the SAS course Statistics You Need to Know for Machine Learning or see my previous post: Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off.

 

In datasets with collinearity, least squares regression might assign erratic weights to correlated predictors, while ridge regression distributes the influence more evenly across them. This not only improves stability but can enhance model interpretability.

 

The parameter shrinkage in ridge regression is based on the square of the regression coefficients.  Least squares regression finds coefficients that minimize the residual sum of squares (RSS), the sum of the squared differences between the actual and predicted response values (the Y’s).  Ridge regression instead minimizes the RSS plus a penalty term.  The penalty term is a function of the sum of the squared regression coefficients.
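Written out (my reconstruction of the two objective functions, using y_i for the observed responses, x_ij for the predictors, β_j for the coefficients, and λ2 ≥ 0 for the penalty weight; the image below shows the same comparison):

\hat{\beta}^{OLS} = \arg\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}}_{RSS}

\hat{\beta}^{ridge} = \arg\min_{\beta}\; RSS + \lambda_2 \sum_{j=1}^{p}\beta_j^{2}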

 

01_taelna-blog12.0-ls-vs-ridge-1024x134.png


 

This penalty in the optimization is equivalent to constraining the sum of the squared regression coefficients to be below some threshold t2:

 

02_taelna-blog12.06-ridge-threshold-t2-300x40.png
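In the same notation as above, the penalized and constrained forms are equivalent in the sense that for every λ2 > 0 there is a corresponding threshold t2 that produces the same coefficients:

\min_{\beta}\; RSS + \lambda_2 \sum_{j=1}^{p}\beta_j^{2} \quad\Longleftrightarrow\quad \min_{\beta}\; RSS \;\;\text{subject to}\;\; \sum_{j=1}^{p}\beta_j^{2} \le t_2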

 

The value of λ2 controls the strength of the penalty, with a value of zero resulting in the OLS coefficients.  The 2 in the subscript distinguishes it from the λ1 used in L1-regularization, also called LASSO regression.  LASSO is similar to ridge, but it constrains the sum of the absolute values of the β’s (instead of the squared coefficients) to be less than some threshold t1.
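For comparison, a sketch of the LASSO objective and its constrained form in the same notation:

\hat{\beta}^{LASSO} = \arg\min_{\beta}\; RSS + \lambda_1 \sum_{j=1}^{p}\lvert\beta_j\rvert \quad\Longleftrightarrow\quad \min_{\beta}\; RSS \;\;\text{subject to}\;\; \sum_{j=1}^{p}\lvert\beta_j\rvert \le t_1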

 

Ridge and LASSO differ in that LASSO can shrink coefficients all the way to zero, removing them from the model. Ridge constrains the sum of the squared coefficients, so it shrinks larger coefficients more than smaller ones, and thus it never removes variables from the model.  This is worth noting because if all predictors are important and the concern is overfitting (high prediction variance) or unstable parameter estimates (collinearity), ridge is a better choice than LASSO.  If the concern also includes removing irrelevant predictors, LASSO regression can be a better choice. LASSO and ridge regression can also be combined in an approach called Elastic Net regression.
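One common way to write the Elastic Net objective is simply with both penalty terms at once (a sketch; SAS procedures may parameterize it differently, for example with a single λ and a mixing parameter):

\hat{\beta}^{EN} = \arg\min_{\beta}\; RSS + \lambda_1 \sum_{j=1}^{p}\lvert\beta_j\rvert + \lambda_2 \sum_{j=1}^{p}\beta_j^{2}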

 

Ridge, LASSO, Elastic Net regression, and the bias-variance trade-off are all discussed in the SAS class Statistics You Need to Know for Machine Learning.

 

How do you find λ2 for ridge regression?

 

The ideal λ2 value will not be known without some trial and error.  This exploration balances two goals: reducing the variance caused by collinearity while still maintaining good model fit. Both PROC REG (SAS 9) and PROC REGSELECT (SAS Viya) can fit ridge regression models over a range of λ2 values.  Ideally, the data would be partitioned into training and validation sets, and the value of λ2 that results in the best validation performance (such as the minimum ASE) would be chosen.  If the data set is too small to partition, information criteria such as AIC and SBC can be used on the training data to choose a model.  Cross-validation could be used to pick a λ2 value as well.

 

For an example of using ridge regression to reduce variance, I’ll use a data set called Fitness.  The analysis goal is to see how aerobic fitness (measured as oxygen consumption after a 1.5-mile run) relates to various predictors such as age, weight, pulse, etc. The model showed high variance inflation factors (VIF > 8) for two of the four predictors remaining after variable selection, RunPulse and MaxPulse:

 

03_taelna-param-est-with-VIF-for-fitness-data-600x220.png

 

While these VIFs are not extremely high, particularly because the standard errors and p-values are relatively low, let’s see how we could use ridge regression to reduce the variance due to collinearity (as measured by the VIF).

 

Ridge regression using PROC REG (SAS 9)

 

proc reg data=fitness
         ridge=0 to 0.2 by 0.002                  /* try out different lambda2 values        */
         outest=ridge_parameters                  /* save parameter estimates to a data set  */
         outvif                                   /* save VIFs to the same data set          */
         plots(only)=ridge(unpack VIFaxis=log);   /* separate plots, VIF axis on log scale   */
   model oxy=Age Runtime RunPulse MaxPulse / vif; /* show the initial model VIFs             */
run;
quit;

 

The ridge plot below shows that as the ridge parameter (λ2) increases, the VIFs for MaxPulse and RunPulse decrease from about 8 to below 1.

 

04_taelna-blog12.01-proc-reg-VIF-for-OXY.png

 

This is because the standardized coefficients are shrinking in magnitude, becoming closer to zero:

 

05_taelna-blog12.02-proc-reg-ridge-trace-for-OXY.png

 

A PROC PRINT step shows us the λ2 values and the ridge regression (shrunk) parameters:

 

proc print data=ridge_parameters (obs=30);
  var _MODEL_ _TYPE_ _DEPVAR_ _RIDGE_ _RMSE_ Intercept Age Runtime RunPulse MaxPulse;
run;

 

06_taelna-blog12.03-VIF-ridge-table-1-30.png

 

Just like in the plots, you can see that the VIFs for RunPulse and MaxPulse (shown in the even-numbered rows) decrease as the ridge parameter increases. The odd-numbered rows of the ridge_parameters data set show the coefficients, which are shrunk closer to zero relative to the OLS estimates.
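If you want to look at just the VIFs or just the shrunken coefficients, you can subset the output data set on the _TYPE_ variable.  A minimal sketch (the _TYPE_ values RIDGEVIF and RIDGE are what my OUTEST= data set contains when OUTVIF and RIDGE= are specified; check your own output data set to confirm):

proc print data=ridge_parameters;
   where _TYPE_ = 'RIDGEVIF';               /* VIF rows only */
   var _RIDGE_ Age Runtime RunPulse MaxPulse;
run;

proc print data=ridge_parameters;
   where _TYPE_ = 'RIDGE';                  /* shrunken coefficient rows only */
   var _RIDGE_ _RMSE_ Intercept Age Runtime RunPulse MaxPulse;
run;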

 

Ridge regression using PROC REGSELECT (SAS Viya)

 

proc regselect data=mycas.fitness;
  model oxy=Age Runtime RunPulse MaxPulse/vif ridge=0 to 0.2 by 0.02;
run;
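The code above reads mycas.fitness, which assumes the Fitness table has already been loaded into CAS.  A minimal sketch of one way to do that (the session name, caslib, and libref here are assumptions; use whatever your environment provides):

/* Start a CAS session and point a libref at a caslib */
cas mysess;
libname mycas cas caslib=casuser;

/* Copy the SAS data set into CAS as mycas.fitness */
data mycas.fitness;
   set fitness;
run;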

 

PROC REGSELECT doesn’t create the nice ridge plots that PROC REG does.  But it does produce the same VIF and ridge coefficient summaries as PROC REG, as well as this additional model summary with fit statistics:

 

07_taelna-blog12.05-ridge-table-proc-regselect.png

 

PROC REGSELECT has a PARTITION statement that can be used to create training and validation data sets.  I didn’t use a partition here for two reasons.  First, this fitness data set is very small (n=31), so splitting the data doesn’t seem worthwhile.  Second, with partitioned data, PROC REGSELECT shows validation error only for the OLS regression model (λ2=0) when using ridge regression.  The ASE values shown above for the different λ2 values are training ASE only.  I’m using Viya 2024.03_LTS for these demos, so this is something that may change in future SAS Viya releases.

 

So, one possibility for choosing a ridge parameter would be to try a cross-validation approach.  This would involve splitting this small Fitness data set into 5 pieces, fitting the different λ2 ridge models to 80% of the data, and applying each model to the remaining 20%.  Each fold (the 20% pieces) would be used for validation once, while the remaining 80% would be used for training the ridge regression model.  The 5 validation ASEs for each λ2 value could then be averaged, and the λ2 with the lowest mean validation error could be used.
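As a rough sketch, PROC SURVEYSELECT can create the folds: GROUPS=5 randomly assigns each observation a GroupID value from 1 to 5, and each GroupID in turn serves as the holdout fold while the other four are used to fit the ridge models (the per-fold fitting and scoring is omitted here; the seed value is arbitrary):

/* Assign each observation in Fitness to one of 5 cross-validation folds */
proc surveyselect data=fitness out=fitness_folds groups=5 seed=20240;
run;

/* Check the fold assignments */
proc print data=fitness_folds (obs=10);
   var GroupID oxy Age Runtime RunPulse MaxPulse;
run;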

 

But this cross-validation approach leaves out the reason for using ridge in the first place: the variances are too high.  The initial reason for using ridge was high variance due to collinearity, and none of the predictors could be omitted because of their research importance.  In this situation, the researcher would need to balance their informed sense of acceptable parameter variance against the model fit from cross-validation.  The rule of thumb to "do something whenever VIF > 10" is just that, a general guideline.  Nothing can replace the thoughtful considerations of a researcher in developing a good statistical model.

 

Links

 

 

 

Find more articles from SAS Global Enablement and Learning here.
