
Folded concave penalized selection methods for linear regression…demystified!


Variable selection is an important topic in linear regression.  Often an analyst will acquire data from many predictors to assist in explaining or predicting their response variable.  Then, weak or unimportant predictors are discarded from regression models through a variety of possible subset selection methods.  These methods include traditional sequential methods such as forward, backward, and stepwise selection, as well as penalized selection methods such as LASSO and elastic net selection.

 

In this post, I’ll describe recently available penalized variable selection methods referred to as folded concave penalized (FCP) selection methods.  I’ll explain when these approaches can be useful, compare them with LASSO and elastic net, and finally describe how to use them with the SAS Viya procedure PROC REGSELECT.

 

 

What are traditional variable selection methods for linear regression?

 

Traditional variable selection methods include forward, backward, and stepwise selection.  These approaches involve adding variables, removing variables, or both, based on significance level, information criteria such as SBC and AIC, or other fit statistics such as adjusted R-square or validation ASE. When the resultant models meet the assumptions of linear regression (e.g., independent, homoscedastic, normally distributed errors), ordinary least squares (OLS) estimation produces, among other beneficial features, unbiased parameter estimates. Unbiased here refers to an estimator whose expected value equals the true value of the parameter it's estimating: on average, across repeated samples, the estimator hits the true parameter.  Unbiasedness helps ensure that an estimator doesn't systematically overestimate or underestimate the true value.

 

While unbiased parameter estimates are crucial for explanatory modeling, predictive modeling is more concerned with making the most accurate predictions possible, and unbiased parameter estimates may not lead to the most accurate predictions.  When unbiased parameter estimates are shrunk toward zero, bias is added to the parameters and thus to the predictions.  Because of the trade-off between prediction bias and prediction variance (both of which contribute to prediction error), adding some bias can often greatly reduce variance, and optimizing this balance increases the model's overall predictive accuracy.  Additionally, shrinking the parameters relative to unbiased estimates makes the model less sensitive to the training data, which reduces the chance of overfitting.  For a general review of the bias-variance tradeoff, see my previous post Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off.
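To make the trade-off concrete, here's a small illustrative simulation (in Python rather than SAS, purely as a sketch).  An unbiased estimator is compared with a deliberately biased, shrunken version of itself across repeated samples; the coefficient values and shrinkage factor are made up for illustration:

```python
import random

random.seed(1)

# True coefficients: many weak signals, so shrinkage should help.
beta = [0.5] * 20
sigma = 1.0          # noise standard deviation of each estimate
shrink = 0.5         # shrinkage factor pulling estimates toward zero

def mse(estimator, n_reps=5000):
    """Average squared estimation error over repeated samples."""
    total = 0.0
    for _ in range(n_reps):
        z = [b + random.gauss(0, sigma) for b in beta]  # unbiased estimates
        est = estimator(z)
        total += sum((e - b) ** 2 for e, b in zip(est, beta))
    return total / n_reps

mse_unbiased = mse(lambda z: z)                       # OLS-like, no shrinkage
mse_shrunk = mse(lambda z: [shrink * x for x in z])   # biased but lower variance

print(mse_unbiased, mse_shrunk)
```

With many weak signals and noisy estimates, the shrunken estimator's added bias is more than paid for by its reduced variance, so its average squared error comes out lower.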

 

Parameter shrinkage can be accomplished by adding penalty functions to the least squares objective used for parameter estimation, as in LASSO and elastic net selection.

 

 

What are penalized selection methods?

 

Penalized selection methods such as LASSO and elastic net selection are variable selection methods that shrink the coefficients compared to those produced by ordinary least squares estimation.

 

[Table: 01_taelna-blog14-FCP-table-1.png]

 

 

LASSO

 

LASSO selection (Least Absolute Shrinkage and Selection Operator) has a tuning parameter Lambda-1 (λ1), which controls the strength of the L1-regularization (shrinkage) applied to the model’s coefficients.  Here’s what it does:

 

  • Shrinks coefficients. Larger λ1 will apply a stronger penalty on the absolute values of the regression coefficients. This pushes some coefficients closer to zero.
  • Performs variable selection.  When λ1 is large enough, some coefficients are shrunk exactly to zero. This effectively removes those variables from the model, hence performing automatic feature selection.
  • Balances bias-variance trade-off.  A higher λ1 typically increases bias but reduces variance, which can improve predictive performance when the underlying model is sparse.

 

The last bullet point merits additional explanation.  LASSO works best when the true underlying model is sparse, meaning only a small subset of predictors meaningfully contribute to the outcome.   LASSO is great at zeroing out irrelevant variables, so it simplifies the model and reduces overfitting. The increased bias from shrinkage is minimal, because many of those coefficients should be zero anyway, and in exchange for that small increase in bias, the variance drops significantly, improving generalization to new data. That's the ideal situation.

 

But when the model is not sparse, many predictors truly contribute to the outcome, and shrinking their coefficients—especially some all the way to zero—introduces more bias than the variance reduction can justify. In this situation, LASSO might discard variables that actually matter, hurting model fidelity. As a result, predictive performance may degrade, especially compared to techniques like ridge regression or elastic net that spread shrinkage more gently across predictors.  In short, when there’s a rich, dense structure in the data (not sparse), LASSO’s tendency to push things to zero becomes a liability rather than a strength.
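Under a simplified (orthonormal-design) view, the LASSO solution for each coefficient is just the OLS estimate pushed toward zero by λ1 and clipped at zero, a rule known as soft-thresholding.  This illustrative Python sketch (with made-up OLS estimates) shows both the shrinkage and the variable selection at work:

```python
def soft_threshold(z, lam):
    """LASSO solution for one coefficient under an orthonormal design:
    shrink the OLS estimate z toward zero by lam, zeroing it if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols = [3.0, 1.2, -0.4, 0.1]          # hypothetical OLS estimates
for lam in (0.0, 0.5, 1.5):
    print(lam, [round(soft_threshold(z, lam), 2) for z in ols])
# lam=0.0 reproduces OLS: [3.0, 1.2, -0.4, 0.1]
# lam=0.5 shrinks all and zeros the small ones: [2.5, 0.7, 0.0, 0.0]
# lam=1.5 keeps only the strongest predictor: [1.5, 0.0, 0.0, 0.0]
```

Note that the surviving large coefficient (3.0) is still shrunk by the full λ1, which is exactly the bias FCP methods will later address.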

 

 

Elastic net

 

Elastic net is another penalized selection method that builds on LASSO by adding a second tuning parameter, Lambda-2 (λ2).  This additional penalty constrains the sum of the squared regression coefficients.  The λ2 tuning parameter controls the strength of the L2-regularization (shrinkage) applied to the model's coefficients, discouraging large coefficient values and favoring more stable, low-variance estimates.  Elastic net with the λ2 penalty only (that is, λ1 = 0) is equivalent to ridge regression.

 

Let’s focus on the lambda-2 part of elastic net, briefly ignoring the lambda-1 constraint.  By constraining the square of the coefficients, λ2:

 

  • Reduces model variance, especially when predictors are highly collinear or the number of predictors is large relative to observations.
  • Introduces some bias since estimates are pulled toward zero.
  • Improves prediction accuracy on new data due to a better bias-variance tradeoff. Like LASSO, adding bias can greatly reduce prediction variance, which can lower prediction error.
  • Handles collinearity well, as similar (i.e., correlated) predictors tend to have similar coefficients, as opposed to traditional selection methods, which tend to discard one correlated predictor in favor of another.
  • Does not select variables, since coefficients are never reduced to zero. All variables remain, even if their influence is weak.
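The λ2-only (ridge) behavior in these bullets can be seen in the same simplified orthonormal-design view, where the ridge solution divides each OLS estimate by 1 + λ2: everything shrinks proportionally, but nothing is ever set exactly to zero.  A quick illustrative Python sketch (with made-up estimates):

```python
def ridge_shrink(z, lam2):
    """Ridge (L2-only) solution for one coefficient under an orthonormal
    design: uniform proportional shrinkage, never exactly zero."""
    return z / (1.0 + lam2)

ols = [3.0, 1.2, -0.4, 0.1]          # hypothetical OLS estimates
for lam2 in (0.5, 2.0):
    print(lam2, [round(ridge_shrink(z, lam2), 3) for z in ols])

# Every coefficient stays nonzero, no matter how large lam2 gets,
# so ridge alone performs no variable selection.
print([ridge_shrink(z, 100.0) for z in ols])
```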

 

When both the lambda-1 and lambda-2 constraints are combined in elastic net selection, the stability, reduced variance, and collinearity handling of ridge regression are combined with the variable selection of LASSO models.

 

 

What are folded concave penalty selection methods?

 

Folded concave penalty (FCP) methods are more recently developed penalized selection approaches.  The goal of these methods is to overcome some of the limitations of LASSO and elastic net while retaining their desirable properties (some shrinkage and variable selection).

 

The main limitation FCP methods seek to overcome is the bias that LASSO produces for the largest regression coefficients. The largest coefficients can indicate the most important predictors in a regression model, but these coefficients are shrunk most heavily and thus are the most biased. Variables with smaller coefficients may actually get removed by having their coefficients shrunk to zero, which is a desirable quality when the true model is sparse.  So, an improved variable selection approach would still discard unimportant predictors with small coefficients but reduce the bias (shrinkage) of the larger coefficients associated with the important predictors. FCP methods achieve this by retaining the shrinkage of LASSO for small coefficients while letting the penalty taper off for larger coefficients. This can be seen in the picture below from SAS Visual Statistics: Procedures (2024.09):

 

[Figure: penalty functions for LASSO, SCAD, and MCP (02_taelna-blog14-FCP-graph-scad-mcp-lasso.png)]

 

The picture shows how the penalty functions change with the magnitude of the parameter estimates for LASSO and two FCP selection methods: smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) selection.  The LASSO curve shows that the penalty applied to coefficients (and thus the bias) keeps increasing with coefficient magnitude.  The two FCP penalties both increase with coefficient magnitude up to a point and then level off.  The SCAD penalty matches the LASSO penalty for small coefficients; its slope then decreases until the penalty levels off.  The MCP penalty's slope, unlike SCAD's, starts decreasing immediately before the penalty flattens out.

 

This picture also makes it clear where the name “folded concave penalty” comes from: “concave” describes the decelerating curve of the penalty as the coefficient moves away from zero, and “folded” refers to this picture being symmetrical at zero due to the penalty’s dependence on the absolute value of the coefficients.  It looks like the picture was folded in half to produce the mirrored lines on either side.
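The penalty curves in the picture can be reproduced from the standard SCAD (Fan and Li) and MCP (Zhang) formulas from the literature; here the shape parameter is written as a (alpha in this post, gamma in some papers).  This illustrative Python sketch evaluates all three penalties at a few magnitudes; note how the LASSO penalty keeps growing while the two FCP penalties flatten out:

```python
def lasso_pen(t, lam):
    """LASSO penalty: grows linearly with |t| forever."""
    return lam * abs(t)

def scad_pen(t, lam, a=3.7):
    """SCAD penalty: linear like LASSO near zero, then concave,
    then constant for |t| > a*lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2

def mcp_pen(t, lam, a=3.0):
    """MCP penalty: concave immediately, constant for |t| > a*lam."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t**2 / (2 * a)
    return a * lam**2 / 2

lam = 1.0
for t in (0.5, 2.0, 10.0):
    print(t, lasso_pen(t, lam), round(scad_pen(t, lam), 3), round(mcp_pen(t, lam), 3))
# At t=0.5 SCAD equals LASSO; at t=10 both FCP penalties have gone flat.
```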

 

The penalties for FCP selection are determined by two parameters: lambda and alpha.  Lambda is a shrinkage parameter like in LASSO: higher lambda leads to more shrinkage and more coefficients set to zero (i.e., a sparser model), and lower lambda means weaker shrinkage and more coefficients retained (less sparsity).  Alpha (sometimes denoted as gamma in the literature) controls the concavity of the penalty function, that is, how quickly the penalty relaxes for larger coefficients.  Lower alpha results in more aggressive concavity and a faster relaxation of the penalty on the coefficients, which means less bias for the large coefficients.  The downside is that this makes the optimization less stable: lower alpha increases the risk of multiple local minima, making it harder to find the global error minimum.  I imagine that reduced stability could be a big concern in the presence of collinearity, which also adds to model instability.  For a review of collinearity, see my previous post "What is collinearity and why does it matter?".  Higher alpha has the opposite effect; basically, FCP selection will behave more like LASSO with a higher alpha.
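To see how the relaxing penalty translates into less bias, consider the MCP solution for a single coefficient in the simplified orthonormal-design case from the literature: below the a*lambda threshold the estimate is a rescaled soft-thresholded value, and above it the raw estimate is returned with no shrinkage at all.  A small illustrative Python sketch (parameter values made up):

```python
def mcp_threshold(z, lam, a=3.0):
    """MCP solution for one coefficient (orthonormal design, a > 1):
    rescaled soft-thresholding for small z, the raw estimate for large z."""
    if abs(z) > a * lam:
        return z                      # large coefficients: no shrinkage at all
    s = max(abs(z) - lam, 0.0)        # soft-threshold magnitude
    return (s / (1.0 - 1.0 / a)) * (1 if z >= 0 else -1)

lam, a = 1.0, 3.0
print(mcp_threshold(0.5, lam, a))   # small estimate: removed, prints 0.0
print(mcp_threshold(5.0, lam, a))   # large estimate: returned unbiased, prints 5.0
```

Compare with LASSO's soft-thresholding, which would return 4.0 for the estimate of 5.0; MCP leaves it untouched.  A lower a widens the region where estimates come back unbiased, at the cost of the optimization stability discussed above.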

 

[Table: 03_taelna-blog14-FCP-table-2-1024x241.png]

 

 

What values of alpha and lambda are used for FCP selection?

 

When using PROC REGSELECT for FCP selection, the default values of alpha tested by the procedure are 2.7, 3.7, 4.7, and 5.7.  The values of lambda tested change depending on several details of the data: the minimum and maximum lambda values depend on the number of non-intercept parameters and the standard deviation of the response variable.  The lambdas tested default to a set of 10 evenly spaced values on the logarithmic scale between the calculated minimum and maximum.  See the SAS Visual Statistics: Procedures documentation for the details.
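For intuition on what a log-spaced lambda grid looks like, here is an illustrative Python sketch (the endpoints here are made up; REGSELECT computes its own minimum and maximum from the data):

```python
import math

def lambda_grid(lam_min, lam_max, n=10):
    """n values evenly spaced on the log scale between lam_min and lam_max."""
    lo, hi = math.log(lam_min), math.log(lam_max)
    return [math.exp(lo + i * (hi - lo) / (n - 1)) for i in range(n)]

grid = lambda_grid(0.01, 10.0, n=10)
print([round(x, 4) for x in grid])
# Consecutive values differ by a constant ratio, so small lambdas are
# explored as densely (in relative terms) as large ones.
```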

 

The REGSELECT output includes the validation ASE for each combination of alpha and lambda tested.  With this information, you can narrow down promising ranges of alpha and lambda to achieve a better fit. The relevant options are mentioned below in the example code.

 

 

When should I use FCP selection methods?

 

FCP methods are useful for predictive regression modeling that requires variable selection due to having many unimportant predictors.   This also describes when to use LASSO selection, but FCP selection can be a better choice than LASSO if you think LASSO is overly shrinking important predictors or if you suspect that it is too aggressive in eliminating variables. So FCP is a good choice if you have strong signals in your predictors and want to perform variable selection while avoiding excessive shrinkage.  FCP will generally perform more aggressive variable selection than elastic net but less than LASSO.

 

Additionally, FCP methods have better asymptotic properties than LASSO: they are better at identifying the true model structure in larger samples.  This is known as the "oracle property," though it comes with assumptions and limitations that are beyond the scope of this post. For a "mathy" description of why FCP methods have this oracle property and how to get the oracle solution using local linear approximation, see "Strong Oracle Optimality of Folded Concave Penalized Estimation" by Fan et al. 2014.

 

Compared with traditional variable selection approaches paired with least squares estimation, FCP methods will likely work better for prediction.  As described earlier, shrinkage methods such as FCP add bias in exchange for reduced variance in order to achieve greater predictive accuracy.

 

 

When should I use other approaches over FCP selection methods?

 

If you prioritize sparsity and thus more interpretability, LASSO could be a better choice than FCP methods.  LASSO can also be better if you have many weak predictors and need stronger variable removal.  Also, LASSO is computationally less expensive than FCP, so if you have high-dimensional data and you’re concerned with computational efficiency, the simpler LASSO algorithm may be preferable.

 

In the presence of highly correlated predictors, elastic net may perform better than FCP methods.  Elastic net includes L2-regularization, which can give improved stability in the presence of collinearity.  Elastic net may also be a better choice when you need some regularization without extreme sparsity.  If you just need regularization without the variable selection, ridge regression can be a better choice (especially in the presence of collinearity).

 

Traditional sequential selection methods have their place too.  Methods like forward, backward, and stepwise selection give you explicit control over the criteria for including or excluding predictors.  You may prefer the hypothesis-testing approach of using p-values for variable selection.  Additionally, the parameter estimates may be easier to explain and interpret than regularized coefficients.  These are all situations where the simpler approach may be preferable to using FCP methods.

 

 

How to use FCP selection in SAS Viya

 

The two FCP selection methods can be used through PROC REGSELECT.  Here is example code, followed by some additional suggested options:

 

PROC REGSELECT DATA=my_data;
   PARTITION FRACTION (VALIDATE=0.4);
   MODEL target = input1 input2 input3;
   SELECTION METHOD=SCAD(CHOOSE=VALIDATE) DETAILS=ALL;
RUN;

 

The code above is straightforward.  If you're not familiar with the PARTITION statement, the code above results in 60% of the data set my_data being used for training the model and 40% being used for validation.

 

Here are some other options for SCAD/MCP selection to try. All but the first suggestion would go in the parentheses where CHOOSE=VALIDATE appears above:

 

  • To use MCP selection, change SCAD above to MCP. MCP reduces the bias applied to large coefficients more than SCAD does.
  • After an initial run using the default settings, you might want to explore a narrower range of alpha and lambda values that seem promising. To change the minimum and maximum values tested, use the MINALPHA=, MAXALPHA=, MINLAMBDA=, and MAXLAMBDA= options.
  • The lambdas tested default to evenly spaced values on a logarithmic scale between the calculated minimum and maximum lambda values. To change this to evenly spaced values on the linear scale, use the LAMBDAGRID=LINSPACE option.
  • Finally, the default optimization method for FCP selection uses the mixed integer linear programming (MILP) solver. To switch to the nonlinear programming (NLP) solver, use the SOLVER=NLP option. Changing the solver won't change the alpha and lambda values that get tested.  Sometimes the MILP solver will produce estimates with lower validation error and sometimes the NLP solver will work better; you'll just have to try with your data.  The NLP solver has a much lower computational cost than the MILP solver, so that may factor into which approach you start with, depending on the size of your data.

 

 

In a follow-up post, I will demonstrate FCP selection methods and compare their performance to LASSO selection.

 

 

References and previous posts

 

  1. A Survey of Methods in Variable Selection and Penalized Regression by Yingwei Wang from SAS Global Forum 2020 (Paper SAS4287-2020)
  2. Introducing Folded Concave Penalized Regression: New Variable Selection Methods in the REGSELECT Procedure in SAS® Viya® by Yingwei Wang (SAS Statistics Research and Applications Paper #2022-04)
  3. Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off SAS Communities post by Tarek Elnaccash
  4. What is collinearity and why does it matter? SAS Communities post by Tarek Elnaccash
  5. Strong Oracle Optimality of Folded Concave Penalized Estimation by Fan et al. 2014.

 

 

Find more articles from SAS Global Enablement and Learning here.
