🔒 This topic is solved and locked.
dufaultb
Fluorite | Level 6

Hello fellow SAS users and SAS support, 

 

I have been using HPGENSELECT with LASSO selection for a binary dependent variable, and I was hoping for clarification on the details of the LASSO penalization method and the resulting coefficients. My SAS code is posted at the end. My two questions are:

 

  1. When HPGENSELECT has been called with the LASSO option and there are CLASS variables present, does it perform group LASSO optimization, in which the categories of a class variable are either all selected or all set to zero? This is in contrast to regular LASSO, in which some categories might have a non-zero coefficient but others do not; the fact that they belong to a single effect is ignored.

  2. When I use the PARAM=GLM option in the CLASS statement, I seem to invoke the less-than-full-rank parameterization of the categorical variables: each level of a class variable gets its own dummy variable, and all of the dummy variables are entered into the model. Because the dummies for an effect are linearly dependent with the intercept, this is not estimable for OLS or maximum likelihood, so a reference category is forced; LASSO, however, can handle overparameterized models. My question is, how does one then interpret the coefficients? Is it done by calculating the contrasts manually?

    For example, take my screenshot below of the parameter estimates on the log-odds scale. The variable "Location" has only 4 levels in the data, all of which are present in the fitted model. If one were interested in say comparing Locations 2 through 4 to Location 1 as a reference category, would you calculate the difference in estimates on the log-odds scale (e.g. 0.026 versus 0.074) and then exponentiate to obtain familiar odds ratios?

    [Screenshot: HPGENSELECT LASSO parameter estimates on the log-odds scale (HPGENSELECT LASSO.png)]

     

     

    Thanks very much for any insight you can provide! 

    SAS code below, if it helps. Note that this is from SAS version 9.4, SAS/STAT 15.1
PROC HPGENSELECT data=my_data LASSORHO=.80 LASSOSTEPS=20;
WHERE  location NOT IN (5,6);
CLASS  gender location Physiologic_difficult_AW <many more predictors>
         / param=GLM;
MODEL  Number_attempts = 
       gender location Physiologic_difficult_AW <many more predictors> / DISTRIBUTION=BINARY ;
SELECTION METHOD=LASSO(CHOOSE=AIC) DETAILS=ALL;
RUN;
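For concreteness, here is the manual calculation I have in mind from question 2, sketched in Python with the illustrative values above (treating Location 1 as the reference; the coefficient values are only for illustration):

```python
import math

# Illustrative log-odds estimates for two Location levels, as in the
# question above (values assumed; Location 1 treated as the reference).
b_loc1 = 0.026  # Location 1
b_loc2 = 0.074  # Location 2

# Contrast on the log-odds scale, then exponentiate for an odds ratio.
log_odds_diff = b_loc2 - b_loc1       # 0.048
odds_ratio = math.exp(log_odds_diff)  # ~ 1.049

print(f"OR, Location 2 vs Location 1: {odds_ratio:.3f}")
```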
1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

The answer to the second question is "Yes," but there might be better ways of comparing. Once a model is selected, you could refit it with PROC GENMOD and use the LSMEANS statement with the ODDSRATIO option.

 

SteveDenham


8 REPLIES
dufaultb
Fluorite | Level 6
Thank you. This paper suggests that Group LASSO is invoked by HPGENSELECT, which answers my first question. However, the example shown there uses PARAM=REF, so it does not address the second question.
PaigeMiller
Diamond | Level 26

Agreeing with @SteveDenham 

 

The different parameterizations are the same model. Interpreting the coefficients is the part that trips people up, but LSMEANS eliminates all of that confusion. I wrote a post about this issue (although in a simpler example). https://communities.sas.com/t5/Statistical-Procedures/Interpreting-Multivariate-Linear-Regression-wi...

--
Paige Miller
dufaultb
Fluorite | Level 6

Thanks very much for your helpful reply. I think LSMEANS is a lovely tool and certainly would be useful here. 

Just one tangential comment regarding:


The different parameterizations are the same model.


This is generally true: fit statistics are invariant to parameterization for OLS and ML models. With LASSO, however, the choice of parameterization can affect variable selection and the shrinkage estimates! In a way this makes sense. If we choose a reference category that lies in the middle of the other levels with respect to the outcome, the contrast coefficients will be small and could get "shrunk away" to zero during optimization. Group LASSO is less vulnerable to this.
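To make the distinction concrete, here is a small Python sketch of the two penalty terms for the dummies of one CLASS effect (the coefficient values are made up):

```python
import math

# Made-up coefficients for the four dummy variables of one CLASS effect
# (e.g. Location under PARAM=GLM).
beta = [0.30, -0.05, 0.02, -0.27]

# Plain LASSO penalizes each dummy separately (L1 norm), so small
# individual coefficients can be zeroed while others in the same effect
# survive.
plain_penalty = sum(abs(b) for b in beta)            # 0.64

# Group LASSO penalizes the effect as a whole (Euclidean norm of the
# group), so the levels of the CLASS variable enter or leave together.
group_penalty = math.sqrt(sum(b * b for b in beta))  # ~ 0.407

print(plain_penalty, group_penalty)
```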

dufaultb
Fluorite | Level 6
The "yes" confirmation is quite helpful, thank you very much.

I might be reluctant to use a secondary GLM procedure to calculate the contrasts since the regression weights will be re-estimated without shrinkage, whereas the shrunk estimates might be more reliable from a cross-validation / reproducibility point of view. But this is an ongoing conversation in the literature, to my knowledge.
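In the single-predictor case with a standardized regressor, the LASSO estimate is just a soft-thresholded version of the OLS estimate, which is a quick way to see what refitting without the penalty throws away (numbers below are illustrative):

```python
# One standardized predictor: the LASSO solution soft-thresholds the OLS
# estimate. Refitting the selected model without the penalty recovers the
# OLS value and discards the shrinkage. Values are illustrative.
def soft_threshold(b_ols, lam):
    """LASSO estimate for a single standardized predictor."""
    sign = 1.0 if b_ols >= 0 else -1.0
    return sign * max(abs(b_ols) - lam, 0.0)

b_ols = 0.50                               # unpenalized (refit) estimate
b_lasso = soft_threshold(b_ols, lam=0.15)  # shrunk estimate, ~0.35

print(b_ols, b_lasso)
```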

Thanks again.
SteveDenham
Jade | Level 19

Is your dataset so large that you have to use HPGENSELECT rather than GLMSELECT? If you can use the latter to do the LASSO selection, you have access to the STORE statement, from which you can use PROC PLM to get least squares means and odds ratios.

 

SteveDenham

dufaultb
Fluorite | Level 6
Great idea - will proceed as you suggest

