🔒 This topic is solved and locked.
dufaultb
Fluorite | Level 6

Hello fellow SAS users and SAS support, 

 

I have been using HPGENSELECT with LASSO selection for a binary dependent variable, and I was hoping for clarification on the details of the LASSO penalization method and the resulting coefficients. My SAS code is posted at the end. My two questions are:

 

  1. When HPGENSELECT has been called with the LASSO option and there are CLASS variables present, does it perform group LASSO optimization, in which the categories of a class variable are either all selected or all set to zero? This is in contrast to regular LASSO, in which some categories might have a non-zero coefficient but others do not; the fact that they belong to a single effect is ignored.

  2. When I use the PARAM=GLM option in the CLASS statement, I seem to invoke the less-than-full-rank parameterization of the categorical variables: each level of a class variable gets its own dummy variable, and all of the dummy variables are entered into the model. Because the dummies for an effect are linearly dependent with the intercept, this is not estimable for OLS or maximum likelihood, so a reference category is forced; LASSO, however, can handle overparameterized models. My question is, how does one then interpret the coefficients? Is it done by calculating the contrasts manually?

    For example, take my screenshot below of the parameter estimates on the log-odds scale. The variable "Location" has only 4 levels in the data, all of which are present in the fitted model. If one were interested in say comparing Locations 2 through 4 to Location 1 as a reference category, would you calculate the difference in estimates on the log-odds scale (e.g. 0.026 versus 0.074) and then exponentiate to obtain familiar odds ratios?

    [Screenshot: HPGENSELECT LASSO parameter estimates on the log-odds scale (HPGENSELECT LASSO.png)]

     

     

    Thanks very much for any insight you can provide! 

    SAS code below, if it helps. Note that this is from SAS version 9.4, SAS/STAT 15.1
PROC HPGENSELECT data=my_data LASSORHO=.80 LASSOSTEPS=20;
WHERE  location NOT IN (5,6);
CLASS  gender location Physiologic_difficult_AW <many more predictors>
         / param=GLM;
MODEL  Number_attempts = 
       gender location Physiologic_difficult_AW <many more predictors> / DISTRIBUTION=BINARY ;
SELECTION METHOD=LASSO(CHOOSE=AIC) DETAILS=ALL;
RUN;
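For concreteness, here is the manual calculation I have in mind from question 2, sketched in Python with the illustrative values above (treating Location 1 as the reference; the coefficient values are only for illustration):

```python
import math

# Illustrative log-odds estimates for two Location levels, as in the
# question above (values assumed; Location 1 treated as the reference).
b_loc1 = 0.026  # Location 1
b_loc2 = 0.074  # Location 2

# Contrast on the log-odds scale, then exponentiate for an odds ratio.
log_odds_diff = b_loc2 - b_loc1       # 0.048
odds_ratio = math.exp(log_odds_diff)  # ~ 1.049

print(f"OR, Location 2 vs Location 1: {odds_ratio:.3f}")
```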
1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

The answer to the second question is "Yes," but there might be better ways of comparing. Once a model is selected, you could refit it with PROC GENMOD and use the LSMEANS statement with the ODDSRATIO option.

 

SteveDenham


8 REPLIES
dufaultb
Fluorite | Level 6
Thank you. This paper suggests that Group LASSO is invoked by HPGENSELECT, which answers my first question. However, the example shown there uses PARAM=REF, so it does not address the second question.
PaigeMiller
Diamond | Level 26

Agreeing with @SteveDenham 

 

The different parameterizations are the same model. Interpreting the coefficients is the part that trips people up, but LSMEANS eliminates all of that confusion. I wrote a post about this issue (although in a simpler example). https://communities.sas.com/t5/Statistical-Procedures/Interpreting-Multivariate-Linear-Regression-wi...

--
Paige Miller
dufaultb
Fluorite | Level 6

Thanks very much for your helpful reply. I think LSMEANS is a lovely tool and certainly would be useful here. 

Just one tangential comment regarding:


The different parameterizations are the same model.


This is generally true: fit statistics are invariant to parameterization for OLS and ML models. With LASSO, however, the choice of parameterization can affect variable selection and the shrinkage estimates! In a way this makes sense. If we choose a reference category that lies in the middle of the other levels with respect to the outcome, the contrast coefficients will be small and could get "shrunk away" to zero during optimization. Group LASSO is less vulnerable to this.
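To make the distinction concrete, here is a small Python sketch of the two penalty terms for the dummies of one CLASS effect (the coefficient values are made up):

```python
import math

# Made-up coefficients for the four dummy variables of one CLASS effect
# (e.g. Location under PARAM=GLM).
beta = [0.30, -0.05, 0.02, -0.27]

# Plain LASSO penalizes each dummy separately (L1 norm), so small
# individual coefficients can be zeroed while others in the same effect
# survive.
plain_penalty = sum(abs(b) for b in beta)            # 0.64

# Group LASSO penalizes the effect as a whole (Euclidean norm of the
# group), so the levels of the CLASS variable enter or leave together.
group_penalty = math.sqrt(sum(b * b for b in beta))  # ~ 0.407

print(plain_penalty, group_penalty)
```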

dufaultb
Fluorite | Level 6
The "yes" confirmation is quite helpful, thank you very much.

I might be reluctant to use a secondary GLM procedure to calculate the contrasts since the regression weights will be re-estimated without shrinkage, whereas the shrunk estimates might be more reliable from a cross-validation / reproducibility point of view. But this is an ongoing conversation in the literature, to my knowledge.
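In the single-predictor case with a standardized regressor, the LASSO estimate is just a soft-thresholded version of the OLS estimate, which is a quick way to see what refitting without the penalty throws away (numbers below are illustrative):

```python
# One standardized predictor: the LASSO solution soft-thresholds the OLS
# estimate. Refitting the selected model without the penalty recovers the
# OLS value and discards the shrinkage. Values are illustrative.
def soft_threshold(b_ols, lam):
    """LASSO estimate for a single standardized predictor."""
    sign = 1.0 if b_ols >= 0 else -1.0
    return sign * max(abs(b_ols) - lam, 0.0)

b_ols = 0.50                               # unpenalized (refit) estimate
b_lasso = soft_threshold(b_ols, lam=0.15)  # shrunk estimate, ~0.35

print(b_ols, b_lasso)
```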

Thanks again.
SteveDenham
Jade | Level 19

Is your dataset so large that you have to use HPGENSELECT rather than GLMSELECT? If you can use the latter to do the LASSO selection, you have access to the STORE statement, from which you can use PROC PLM to get least squares means and odds ratios.

 

SteveDenham

dufaultb
Fluorite | Level 6
Great idea - will proceed as you suggest

