Rischi95
Calcite | Level 5

Hi all,

I am planning a risk factor analysis using PROC MIXED (my outcome variable is continuous). I have more than 20 candidate variables. I will screen them one at a time in univariable models, carry the variables with p < 0.15 forward to a multivariable model, and then use backward selection to arrive at the final model. I have read that linear mixed models assume homoscedasticity and normality of the residuals. My questions are:

1) Should these assumptions be checked every time I run a univariable model, or only in the final multivariable model?

2) If the assumptions are not met in the final model and I need to transform the outcome variable, should I redo the screening for all of the univariable models?

3) Lastly, which results should I present for interpretation: the model fitted to the transformed data, or the back-transformed results of the final model?

Please enlighten me on these. Thank you!

Regards.

sbxkoenk
SAS Super FREQ

I have moved your question / topic to the "Statistical Procedures" - board.

 

Koen

PaigeMiller
Diamond | Level 26

These assumptions are needed for linear models because, if they are not true, your hypothesis tests and confidence intervals are technically incorrect. They don't affect the model fit.

 

The other problem with this pseudo-stepwise approach is that multi-collinearity will affect the model fit. One of the assumed benefits of stepwise (although I don't know that this is true) is that it produces a model where the effects of multi-collinearity are reduced (but not gone). As I said, this may or may not be true. There are many criticisms of stepwise regression out there on the internet. Another impact of multi-collinearity is that it inflates the variance of the parameter estimates, which widens the confidence intervals on those estimates and makes hypothesis tests on them less able to detect real effects.
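One common way to look at the variance inflation described above is to request variance inflation factors from PROC REG. This is only a rough sketch under assumptions: the data set work.study and the variables y, x1, x2 and x3 are hypothetical placeholders, PROC REG accepts only numeric predictors (categorical ones would need to be dummy-coded first), and it ignores any random effects that the eventual mixed model would contain.

```sas
/* Hypothetical names: work.study, y and x1-x3 are placeholders. */
proc reg data=work.study;
   /* VIF and TOL print the variance inflation factor and tolerance
      for each predictor; COLLIN adds eigenvalue and condition-index
      diagnostics. */
   model y = x1 x2 x3 / vif tol collin;
run;
quit;
```

A large VIF for a predictor suggests that its standard error is being inflated by correlation with the other predictors.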

 

Typically, people just fit a stepwise regression without worrying about the model assumptions and without worrying about multi-collinearity. Whether or not this is a good thing to do depends on the data and the reason why you are fitting models in the first place (prediction vs understanding effects).

 

Another approach, an alternative to stepwise, is Partial Least Squares (PLS) regression, in which the model fit is robust against multi-collinearity and there are really no distributional assumptions needed. In this paper, Tobias (from SAS Institute!) takes a data set with 1000 independent variables that are highly correlated with one another (extreme multi-collinearity), does no variable reduction step, and is still able to fit a PLS model that produces useful results. This is implemented in PROC PLS in SAS, although I believe the syntax has changed since the paper was written.

--
Paige Miller
Rischi95
Calcite | Level 5

Hi Paige,

Thank you for your response. PROC PLS could be a good option given the high-dimensional data, but as far as I know this procedure only works for continuous variables (correct me if this is wrong). I have both continuous and categorical predictors, so I opted for PROC MIXED (as some papers on risk factor analysis have done). I will take multicollinearity into account as well, and will run a multicollinearity check before building the multivariable model. Regarding my questions above about the assumption tests, would you please share your view on whether the assumptions need to be checked in every univariable screening model or only in the final multivariable model?
Regards.

PaigeMiller
Diamond | Level 26

PROC PLS works fine with categorical predictors. You use the CLASS statement in PROC PLS just the same as in any other modeling PROC.
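A minimal sketch of what that could look like, assuming a hypothetical data set work.study with outcome y, categorical predictors breed and sex, and continuous predictors x1 and x2 (none of these names come from the original question). Note that PROC PLS has no RANDOM statement, so this sketch does not reproduce any random effects a PROC MIXED model would include.

```sas
/* Hypothetical names: work.study, y, breed, sex, x1, x2. */
proc pls data=work.study method=pls cv=one;
   class breed sex;                        /* categorical predictors     */
   model y = breed sex x1 x2 / solution;   /* SOLUTION prints the final
                                              regression coefficients    */
run;
```

Here CV=ONE asks for leave-one-out cross validation to choose the number of extracted factors; other CV= choices may suit larger data sets better.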

 

"I will consider the multicollinearity as well, and will do multicollinearity check before the multivariable model building."

One of the benefits of PLS is that the above step is simply not needed. This can save you a lot of time.

 

 

"would you please share your views whether the assumptions need to be checked in all univariable screening or only in the final multivariable model?"

 

Assumptions about homoscedasticity and normality of residuals (and one other that you didn't mention: that the errors are independent of each other) should be checked whenever you do hypothesis tests or compute confidence intervals.
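As a rough illustration of how such a check might be run in PROC MIXED, the sketch below requests residual diagnostic plots with ODS Graphics. The data set work.study, the variables y, x1 and breed, and the clustering variable herd are hypothetical; the random intercept is only an assumed example of a random-effects structure.

```sas
ods graphics on;

/* Hypothetical names: work.study, y, x1, breed, herd. */
proc mixed data=work.study plots=(residualpanel studentpanel);
   class breed herd;
   /* RESIDUAL requests residual diagnostics;
      OUTP= saves conditional predicted values and residuals. */
   model y = x1 breed / solution residual outp=work.resid;
   random intercept / subject=herd;
run;
```

The residual panels show residuals against predicted values (for homoscedasticity) together with a histogram and Q-Q plot (for normality). The saved work.resid data set can be examined further if needed.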

--
Paige Miller
Rischi95
Calcite | Level 5
Hi Paige, many thanks for your view and suggestion. I have never used PROC PLS before, and will probably try this procedure at some stage. I am just wondering, because others have said that PROC PLS is more powerful for predictive modelling than for explanatory modelling, and my analysis is primarily the latter. Regards.
PaigeMiller
Diamond | Level 26

Actually, I find (and many others do as well) that the opposite is true. PLS loadings are extremely easy to interpret, easier than any stepwise regression (in my opinion), because the interpretation of the loadings generally is not affected (or not affected much) by multi-collinearity. In the paper I linked to, the interpretation of the PLS loadings is used and is valuable.

--
Paige Miller

