Re: Lasso and Statistical Inference

PeterBr · Posted 06-01-2020 05:34 PM

I am creating a model to predict the cost of a particular surgery. I have roughly 20,000 observations and more than 300 candidate variables to choose from. I have done some background research, and many papers suggest that penalized methods such as Lasso (proc glmselect) outperform stepwise methods (proc phreg).

I decided to use Lasso with 5-fold cross-validation in proc glmselect to guide model selection. I understand that because Lasso does not select a model based on p values, it does not generate p values. Furthermore, it is my understanding that it is difficult to make statistical inferences about the size of the coefficients of a Lasso model.
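For readers following along, a minimal sketch of the kind of call described above. The data set `work.surgery_costs`, the response `cost`, and the numeric predictors `x1-x300` are placeholder names, not taken from the original post, and the seed is arbitrary.

```sas
/* Sketch only: work.surgery_costs, cost, and x1-x300 are placeholder names. */
/* SEED= fixes the random assignment of observations to the 5 folds so the   */
/* selected model is reproducible from run to run.                           */
proc glmselect data=work.surgery_costs seed=12345;
   model cost = x1-x300
         / selection=lasso(choose=cv stop=none)  /* walk the full lasso path, keep the step with the best CV criterion */
           cvmethod=random(5);                   /* 5-fold cross-validation with random fold assignment */
run;
```

With `stop=none` the procedure follows the whole lasso path and `choose=cv` then picks the step with the smallest cross-validated prediction error, so the output reports selected effects and coefficient estimates but no p values.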

Additionally, when submitting for publication, I am concerned that having no p values will not go over well with reviewers. Perhaps this is why stepwise methodology remains attractive.

*Are there any suggestions on how to go about making statistical inferences in these situations with a large number of potential regressors?*

As of right now my path forward would be:

- Review the model Lasso selected and think critically about what should or shouldn't have been included.

- Make some adjustments (either add or remove variables) and run a standard OLS to get p values, in order to make statistical inferences about what is driving the cost of these surgeries.

PaigeMiller · Posted 06-01-2020 07:48 PM

I think the lasso is well-known enough that if you fit a model using it (and there are no p-values), any reasonable editor ought to accept it. Particularly if you do either validation or cross-validation, which gives statistical validity to the entire model. But that's my opinion, and you will run into unreasonable editors.

The size of coefficients in such a model is generally not interpretable anyway. The correlations between the variables make the variance of the estimates so large that you will get coefficients that are far away from the "true" value. People want to interpret the coefficients as if the variables were independent of each other, and they are not, so any interpretation based on the (usually unstated) assumption that the variables are independent is nonsense. I wish the statistical world understood this. The same goes for p-values in this situation.

In the world of Partial Least Squares regression, papers get accepted all the time where no regression coefficients are provided, but rather the loadings from a PLS model are what is interpretable. In many PLS models, the model predicts well, and no one cares about the regression coefficients.

Lastly, is your goal to get a paper published, or is your goal to get a good predictive model? If you want to get the paper published, well, do whatever you think is best. If you want to get a good predictive model, stay away from stepwise and similar methods, and use the Lasso or Partial Least Squares.

--
Paige Miller

SteveDenham · Posted 06-02-2020 09:59 AM

And if your world is still on a p value basis, you could use GLMSELECT to find a lasso-based model, then use the STORE statement to save everything required to get the p values for the selected variables using PROC PLM. Be aware of @PaigeMiller's warning about the correlation between variables and the effect that could have on p values and on model utility.
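A rough sketch of that workflow, under stated assumptions: `work.surgery_costs`, `cost`, `x1-x300`, and the item store name `work.lasso_fit` are all placeholders, and all predictors are assumed to be numeric (for example 0/1 indicators) so that the selected effect names are real variables in the data set. Which PROC PLM statements are available for a GLMSELECT item store is worth checking against the documentation for your SAS/STAT release. The second half shows an alternative that does not rely on PLM: refitting the selected effects with PROC GLM through the `_GLSIND` macro variable that GLMSELECT creates.

```sas
/* Step 1: lasso selection, saving the fitted model to an item store. */
proc glmselect data=work.surgery_costs seed=12345;
   model cost = x1-x300
         / selection=lasso(choose=cv stop=none) cvmethod=random(5);
   store out=work.lasso_fit;      /* placeholder item store name */
run;

/* Step 2: restore the stored model in PROC PLM and display its parameters. */
proc plm restore=work.lasso_fit;
   show parameters;
run;

/* Alternative: refit only the selected effects by ordinary least squares.  */
/* GLMSELECT stores the list of selected effects in the macro variable      */
/* _GLSIND, so it can be dropped straight into a MODEL statement.           */
%put NOTE: Effects selected by the lasso: &_GLSIND;

proc glm data=work.surgery_costs;
   model cost = &_GLSIND / solution;   /* estimates, standard errors, t tests */
run;
quit;
```

The p values from a refit like this are computed as if the variable list had been fixed in advance. They do not account for the selection step, so they tend to look more favorable than they should.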

SteveDenham

PeterBr · Posted 06-04-2020 12:44 PM

Thanks both for the input. After much reading, there seems to be a gap between predictive modeling and explanatory modeling. Here is a nice simplified paper on this topic: https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf

In my case, I have a lot of dichotomous variables, so multicollinearity is likely. The basis of my paper will be inference rather than discovering a model that can predict cost with high accuracy. My world is definitely still on a p value basis, so I'm going to have to get creative. I've seen some methods, such as Lasso for inference, in Stata packages. Ultimately, in order to perform hypothesis testing, I may have to run Lasso, use intuition about its model to add or remove variables, and subsequently fit an OLS model. I may take some criticism, though, for trying to fit >100 variables in an OLS model.
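A minimal sketch of that last step, with collinearity diagnostics added so the multicollinearity concern can be checked directly. The macro variable `final_vars` and the variable names in it are hypothetical stand-ins for whatever list results from reviewing the Lasso selection; `work.surgery_costs` and `cost` are placeholders as well.

```sas
/* Hypothetical hand-edited list: the Lasso selection after adding or       */
/* removing variables on subject-matter grounds.                            */
%let final_vars = x3 x17 x42 x108;

proc reg data=work.surgery_costs;
   model cost = &final_vars
         / vif      /* variance inflation factor for each coefficient */
           tol      /* tolerance, the reciprocal of the VIF */
           collin;  /* eigenvalue and condition-index diagnostics */
run;
quit;
```

Large variance inflation factors flag coefficients whose standard errors, and therefore p values, are being driven by correlation with other predictors, which is the situation PaigeMiller warned about.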