1. Does GLMSELECT LASSO, by default, assume that the response variable is continuous and approximately normally distributed?
proc glmselect data=lasso_allsample plots=coefficients seed=123;
   partition role=SELECTED(train='1' test='0');
   model return = "list of predictors" / selection=lasso(choose=cv stop=none) cvmethod=random(10);
run;
2. A key assumption of traditional linear regression is that the residuals (the differences between the observed and predicted values) are normally distributed. This allows for statistical inference and hypothesis testing. Can we relax this assumption when doing LASSO, and how can a non-normal error distribution be implemented in GLMSELECT (if the answer to question 1 is that GLMSELECT does assume a normal distribution)?
Thank you.
When you use PROC GLMSELECT (like PROC GLM or PROC REG), you are assuming that the response, conditional on the predictors, is approximately normally distributed. If your response is distributed otherwise (for example, it is a count, is categorical, or is positive-valued and skewed) and you want to use LASSO selection, then you can use PROC HPGENSELECT and specify a suitable response distribution with the DIST= option.
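As a rough illustration of that suggestion, here is a minimal sketch of LASSO selection in PROC HPGENSELECT for a count response. The response name y_count and the predictors x1-x10 are hypothetical placeholders, not variables from the original post, and DIST=POISSON is only one possible choice. The partitioning and cross validation settings from the GLMSELECT call are not carried over here; check the HPGENSELECT documentation for the selection criteria it supports with LASSO.

```sas
/* Sketch only: y_count and x1-x10 are hypothetical variable names. */
proc hpgenselect data=lasso_allsample;
   /* DIST= sets the response distribution; POISSON assumes a count response */
   model y_count = x1-x10 / dist=poisson;
   /* LASSO selection with the procedure's default settings */
   selection method=lasso;
run;
```

For a positive-valued, skewed response, DIST=GAMMA would be a natural alternative to try in the same MODEL statement.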
In a nutshell,
@hewei2005 wrote:
1. Does GLMSELECT LASSO, by default, assume that the response variable is continuous and approximately normally distributed?
The assumptions of a statistical model depend on the model, not on the method of parameter estimation. Therefore, if you are building linear regression models, the normality assumption is required regardless of whether you employ LASSO or not.
@hewei2005 wrote:
2. A key assumption of traditional linear regression is that the residuals (the differences between the observed and predicted values) are normally distributed. This allows for statistical inference and hypothesis testing. Can we relax this assumption when doing LASSO, and how can a non-normal error distribution be implemented in GLMSELECT (if the answer to question 1 is that GLMSELECT does assume a normal distribution)?
Thank you.
No. However, if your residuals do not follow a normal distribution, you can either (1) transform the dependent variable toward normality, for example with a Box-Cox transformation, or (2) resort to a generalized linear model (GLM) if the dependent variable follows a distribution that a GLM can model. Please note that LASSO can also be applied to GLMs.
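For option (1), one possible way to estimate a Box-Cox transformation in SAS is PROC TRANSREG. The sketch below assumes a strictly positive response named y and predictors x1-x10, all of which are hypothetical names; the lambda grid is likewise an arbitrary choice. Box-Cox requires positive values, so a response that can be negative (as a return often can) would need to be shifted or handled differently first.

```sas
/* Sketch only: y and x1-x10 are hypothetical; y must be strictly positive. */
proc transreg data=lasso_allsample;
   /* Search lambda over a grid; CONVENIENT prefers a simple value
      (such as 0 or 0.5) when it lies within the confidence interval */
   model boxcox(y / convenient lambda=-2 to 2 by 0.25) = identity(x1-x10);
run;
```

The selected lambda could then be applied to the response in a DATA step before rerunning the LASSO selection in PROC GLMSELECT.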