
Tuning FCP models in PROC REGSELECT: a case study



Tuning folded concave penalties in PROC REGSELECT can feel a bit opaque the first time through. In this post, I use a real case study to show what actually happens when you adjust lambda and alpha, why lambda deserves your attention first, why a logarithmic search grid matters, and how solver choice can matter more than small hyperparameter tweaks. I also start with a short refresher on FCP selection and walk through the key settings you will encounter, including the method, the solver, and the main tuning controls, so you have a clear sense of what each option does before looking at the results. Rather than simply listing features, I focus on the decisions that have the biggest impact on prediction error and on the patterns you are likely to see in your own data.

 

 

Quick refresher on folded concave penalty methods for regression

 

Folded concave penalty (FCP) selection methods are regression approaches that shrink coefficients relative to least squares parameter estimates. This shrinkage can improve predictive accuracy by optimizing the bias-variance trade-off. When coefficients are shrunk to zero, FCP methods also perform variable selection, since those predictors are removed from the regression model. FCP methods also penalize important predictors less: the penalty tapers off for larger coefficients, resulting in less bias in those parameter estimates. For a review of FCP and other penalized regression methods, please see my post: Folded concave penalized selection methods for linear regression…demystified!.

 

 

FCP regression settings

 

Several hyperparameters can be tuned for FCP selection regression models. These include the selection method, the solver, and several options to tune the alpha and lambda parameters. The next section briefly describes some of these options.

 

 

Selection method: SCAD or MCP

 

You can choose Smoothly Clipped Absolute Deviation (SCAD) or Minimax Concave Penalty (MCP). Both FCP penalties increase with coefficient magnitude up to a point, then level off. The SCAD penalty matches the LASSO penalty for small coefficients; beyond that, its slope begins decreasing toward zero. The MCP penalty's slope, unlike SCAD's, begins decreasing immediately, so the penalty falls below LASSO's before flattening out (see the picture below from SAS Visual Statistics: Procedures 2025.09).
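For readers who want the formulas behind the figure, here are the standard SCAD and MCP definitions (my notation, following the original Fan–Li and Zhang formulations; REGSELECT's alpha plays the role of the shape parameters a and γ below, so treat that mapping as my reading of the documentation rather than an official statement):

```latex
% SCAD is defined through its derivative: the slope matches the LASSO slope
% \lambda for small |t|, then tapers linearly to zero, so the penalty flattens.
p'_{\lambda}(t) = \lambda \left\{ I(t \le \lambda)
  + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \right\},
  \qquad t \ge 0,\; a > 2.

% MCP's slope, \lambda - t/\gamma, starts tapering immediately at t = 0
% and reaches zero at t = \gamma\lambda, after which the penalty is constant.
p_{\lambda}(t) =
\begin{cases}
  \lambda t - \dfrac{t^{2}}{2\gamma}, & t \le \gamma\lambda, \\[4pt]
  \dfrac{\gamma\lambda^{2}}{2}, & t > \gamma\lambda,
\end{cases}
\qquad \gamma > 1.
```

In both cases the slope hits zero beyond a fixed multiple of lambda, which is why sufficiently large coefficients end up nearly unbiased.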

 

01_taelna-blog14-FCP-graph-scad-mcp-lasso.png


 

 

Solver: MILP or NLP

 

Finding the penalties to apply requires either Mixed Integer Linear Programming (MILP) or Nonlinear Programming (NLP). The MILP solver (the default in REGSELECT) is generally thought to produce estimates with lower prediction error, but sometimes the NLP solver works better; the optimal choice is data dependent. The NLP solver has a much lower computational cost and produces results far faster than the MILP solver, so the size of your data may affect which approach you try first. For reference, my MCP selection/MILP solver runs using the PVA_Donors data (see below) took about 10 minutes each using Viya for Learners (2025.09), while the NLP solver took under 1 second.

 

 

Lambda parameter

 

Lambda functions the same way as in LASSO selection: it controls how aggressively the parameters are shrunk, with higher lambda producing more shrinkage relative to the LS estimates. A higher lambda can cause more variables to drop out of the model and can add more bias to the larger coefficients.

 

 

Alpha parameter

 

Alpha controls the penalty shape, that is, how quickly the penalty relaxes for larger coefficients. Lower alpha relaxes the penalty faster, leading to less bias in the large coefficients. But lower alpha also makes the optimization less stable, increasing the risk of multiple local minima and making it harder to find the global error minimum.

 

The effects of low/high values of the lambda and alpha parameters are summarized below.

 

02_taelna-blog14-FCP-table-2-1024x241.png

 

 

LAMBDAGRID option: LOGSPACE or LINESPACE

 

PROC REGSELECT defaults to testing 10 lambdas evenly spaced on a logarithmic scale between the minimum and maximum values tested. The minimum and maximum are data dependent. The LAMBDAGRID option can be changed from the default LOGSPACE to LINESPACE to choose values evenly spaced on the linear scale instead. Why is the logarithmic scale the default? Let’s look at some example values to see why.

 

Let’s say we want to test 10 lambdas between a minimum of 1 and a maximum of 100. The base-10 logs of these numbers are 0 and 2, respectively. Ten points means dividing the [0, 2] interval into 9 equal steps, each of size 2/9, or about 0.222.

 

We can then convert back to the original scale to see the lambdas to be tested:

 

data lambda_grid;
  min = 1;
  max = 100;

  log_min = log10(min);
  log_max = log10(max);

  do i = 0 to 9;                     /* 10 values */
    log_val = log_min + i*(log_max - log_min)/9;
    lambda  = 10**log_val;
    output;
  end;

  keep log_val lambda;
run;

proc print data=lambda_grid noobs;
run;

 

03_blog19.2_taelna-log_val_lambda.png

 

What we see is that there is denser coverage of lambdas at lower values.  Penalized regression paths change most rapidly at small lambda values, so small differences in lambda have a much larger effect on sparsity and validation ASE in that region.

 

For SCAD and MCP (and LASSO) selection, coefficients are shrunk toward zero until lambda becomes small enough for predictors to “break free.” This threshold behavior means a small decrease in lambda near zero can suddenly allow a variable to enter the model, while a similar decrease at large lambda does almost nothing for variable selection.
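A minimal sketch of this threshold effect, assuming a single standardized predictor with OLS estimate z (the classic orthonormal-design soft-thresholding result for LASSO; SCAD and MCP behave the same way near the entry threshold):

```latex
\hat{\beta}(\lambda) = \operatorname{sign}(z)\,(|z| - \lambda)_+
```

With |z| = 1.2, moving lambda from 1.3 to 1.1 flips the predictor from excluded to included, while moving lambda from 10 to 9.8 changes nothing at all.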

 

While it is possible to switch to even coverage on a linear scale, linear spacing risks undersampling this important low-lambda region and oversampling the low-impact high-lambda region. Linear and logarithmic coverage might do equally well once lambda reaches the flat part of the folded concave penalty, but overall I see no advantage to changing the default (LAMBDAGRID=LOGSPACE) to linear coverage (LAMBDAGRID=LINESPACE).

 

 

Default Alpha and Lambda values used by PROC REGSELECT

 

When using PROC REGSELECT for FCP selection, the default behavior is to test 4 evenly spaced values of alpha: 1.7, 2.7, 3.7, 4.7 for MCP and 2.7, 3.7, 4.7, 5.7 for SCAD.  The minimum and maximum values of lambda tested are data dependent.  They are based on the number of non-intercept parameters and the standard deviation of the response variable.  See the SAS Visual Statistics: Procedures documentation for details.

 

 

MAXITERLAMBDA, MINLAMBDA, and MAXLAMBDA options

 

By default, 10 lambdas are tested between the minimum and maximum values. To test more lambdas, try setting MAXITERLAMBDA to a higher number such as 20, 30, 50, or 100. This is fast using the NLP solver but may take a long time using MILP. The minimum and maximum values of lambda can be changed with the MINLAMBDA and MAXLAMBDA options, respectively.
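For example (the numbers here are illustrative, not recommendations), a selection statement that scans 30 lambdas between 0.05 and 2 using the fast NLP solver might look like:

```sas
/* Illustrative wide lambda scan: NLP keeps this cheap even at 30 lambdas */
selection method=MCP (choose=validate solver=nlp
                      minlambda=0.05 maxlambda=2 maxiterlambda=30);
```

A cheap NLP scan like this can help you pick a promising MINLAMBDA/MAXLAMBDA window before committing to long MILP runs.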

 

 

MAXITERALPHA, MINALPHA, and MAXALPHA options

 

Similarly, the MAXITERALPHA option can be used to set the number of alpha parameters tested between the minimum and maximum.   The minimum and maximum values of alpha can be changed with the MINALPHA and MAXALPHA options, respectively.

 

 

Hyperparameter tuning strategy

 

In this next section, my goal is to improve the performance of FCP selection models by tuning the previously mentioned hyperparameters.  I’ll evaluate the performance of FCP models applied to the PVA_Donors data by examining validation ASE.  Here’s my approach:

 

  1. Using default options, run the 4 combinations of SCAD/MCP selection along with the MILP/NLP solvers. From there I’ll focus on the combination with the lowest validation error.
  2. Explore promising values of lambda while holding alpha constant at a value associated with low error.
  3. Using the previously found lambda, search for better values of alpha.

 

Why tune lambda before alpha? Lambda has a bigger impact on sparsity (i.e., variable reduction), so I’d like to set that first, then work on lowering bias in the large coefficients by tuning the shape parameter, alpha.

 

 

Tuning step 1: find the best combination of FCP method and solver.

 

Here’s my SAS code, which I ran in Viya for Learners (2025.09).  A version of the PVA_Donors data I used is available in the VFL environment.

 

%let inputs= months_since_origin in_house published_phone mor_hit_rate median_home_value 
median_household_income pct_owner_occupied pct_male_military pct_male_veterans 
pct_vietnam_veterans  pct_wwii_veterans pep_star recent_star_status frequency_status_97nk 
recent_response_prop recent_avg_gift_amt recent_card_response_prop recent_avg_card_gift_amt 
recent_response_count recent_card_response_count months_since_last_prom_resp lifetime_card_prom 
lifetime_prom lifetime_gift_amount lifetime_gift_count lifetime_avg_gift_amt lifetime_gift_range 
lifetime_max_gift_amt card_prom_12 number_prom_12 months_since_last_gift months_since_first_gift 
file_card_gift per_capita_income im_donor_age im_income_group im_wealth_rating last_gift_amt 
urbanicity ses home_owner donor_gender overlay_source recency_status_96nk;

proc regselect data=mylib.pva_final3;
  partition role=role (train="train" validate="valid");
  class urbanicity ses home_owner donor_gender overlay_source recency_status_96nk;
  model target_d = &inputs;   /* 44 inputs: 38 numeric, 6 categorical */
  /* selection statement goes here */
run;

 

Selection methods

 

selection method=SCAD (choose=validate solver=milp);
selection method=SCAD (choose=validate solver=nlp);
selection method=MCP (choose=validate solver=milp);
selection method=MCP (choose=validate solver=nlp);

 

The table below shows validation ASE for different combinations of FCP methods and solvers. The best validation ASE, 77.5, was produced by MCP selection with the MILP solver. Note: over the course of writing this post, the best ASE ranged from about 76 to 79 across many MCP/MILP runs. This combination consistently beat the others for the PVA_Donors data.

 

04_blog19.1_taelna_ASE-solver-and-method.png

 

 

Tuning step 2: find lambda

 

Can we reduce the validation ASE by changing some of the options?  Let’s look at the selection summary report produced by PROC REGSELECT from the MCP/MILP run:

 

05_blog19_table-1-e1775678731262.png

 

What do the Convergence Status messages mean? The documentation currently doesn’t explain the Convergence Status column. “Success” and “No Solution” are self-explanatory, but my interpretation of “Early Termination” is that a better solution might exist, but the optimization stopped before reaching the optimum of the objective function. If the lambda and alpha combination listed produces a low validation ASE, I’ll take it, even if more iterations might produce better results.

 

Since MCP/MILP was best so far, I repeated the analysis with more lambda values by changing the MAXITERLAMBDA option and by restricting the min and max lambdas to a promising range:

 

selection method=MCP (choose=validate solver=milp minlambda=.15 maxlambda=2 maxiterlambda=25);

 

The above selection statement took about 20 minutes of processing in VFL.  With a greater number of lambdas tested, there were several combinations of alpha and lambda that had low error (ASE < 80):

 

06_blog19.4_taelna_selection-summary-2-1024x383.png

 

Alpha=4.7 and alpha=2.7 had lower error than the previous best ASE (which used alpha=1.7). Based on the above results, I decided to hold alpha at 4.7 while searching for good values of lambda, for two reasons. First, alpha=4.7 with lambda=1.447 had the lowest overall error. Second, higher alpha leads to greater stability of the optimization. When exploring alpha=1.7, many combinations of alpha and lambda led to “No Solution” (data not shown), and with 10+ minute runs, I wanted to search for parameter combinations likely to produce a solution. Note: some of the “Early Termination” parameter combinations became “No Solution” on later runs.

 

Here’s my next selection statement, fixing alpha at 4.7 and testing 50 lambdas:

 

selection method=MCP (choose=validate solver=milp alpha=4.7 minlambda=.1 maxlambda=2 maxiterlambda=50);

 

The best lambdas all produced higher error than earlier runs:

 

07_blog19.5_taelna_selection-summary-3-1024x155.png

 

Next, I tried 100 lambdas over a larger range of values:

 

selection method=MCP (choose=validate solver=milp alpha=4.7 minlambda=.05 maxlambda=2 maxiterlambda=100);

 

Twenty minutes later, here are the best parameters:

 

08_blog19.6_taelna_selection-summary-4-1024x198.png

 

None of the above beat the original MCP/MILP combination of alpha=4.7 and lambda=1.4468138369, so I ran that again:

 

selection method=MCP (choose=validate solver=milp alpha=4.7 lambda=1.4468138369);

 

09_blog19.7_taelna_selection-summary-5-1024x88.png

 

Running this code with fixed values of both alpha and lambda still took 10 minutes, but this time it produced a larger ASE.  Multiple runs with the same hyperparameters can lead to differences in performance.

 

 

Tuning step 3: find alpha

 

Next, I fixed lambda and explored alphas close to the value of 4.7 that previously produced good results:

 

selection method=MCP (minalpha=3.7 maxalpha=5.7 maxiteralpha=5 choose=validate solver=milp lambda=1.4468138369);

 

10_blog19.8_taelna_selection-summary-6-1024x238.png

 

Varying alpha wasn’t giving me better results, so I tried increasing the number of alphas tested to ten:

 

selection method=MCP (minalpha=3.7 maxalpha=5.7 maxiteralpha=10 choose=validate solver=milp  lambda=1.4468138369);

 

11_blog19.9_taelna_selection-summary-7-1024x398.png

 

No improvement, so I tried 20 alphas:

 

selection method=MCP (minalpha=3.7 maxalpha=5 maxiteralpha=20 choose=validate solver=milp lambda=1.4468138369);

 

12_blog19.10_taelna_selection-summary-8-1024x774.png

 

OK, I found alpha and lambda parameter combinations that lowered validation ASE (77.25) compared with the initial run. Overall, though, the biggest gains came from trying the different combinations of selection method (SCAD/MCP) and solver (MILP/NLP). If I were trying this again from scratch, I would likely stop after identifying the best method-solver combination, or add a single follow-up pass using that combination with MAXITERLAMBDA (and possibly MAXITERALPHA) set higher than the default.

 

These results come from a single case study, so outcomes on other FCP tuning projects may vary with sample size, signal strength, and how well-behaved the predictors are. FCP models can be powerful, but they also demand more patience and experimentation than convex penalties like LASSO, and the payoff isn’t always proportional to the effort. In this example, most of the performance gains came from choosing the right method-solver pairing rather than from squeezing every last drop out of alpha and lambda. My goal in sharing this walkthrough is to give you a realistic picture of the workflow so you know where the effort pays off, where it doesn’t, and how to approach FCP tuning without getting lost in the weeds.

 


Find more articles from SAS Global Enablement and Learning here.
