Hello,
I am trying to model a continuous outcome variable that is highly skewed. The model has several predictor variables, both continuous and categorical. The Q-Q plot of the residuals is shown below. As you can see, the normality assumption is clearly violated. I tried log-transforming the outcome variable, but that doesn't seem to fix the problem. Does anybody have an idea of how to remedy this issue? Does the central limit theorem apply here?
Thanks.
Here is the code used:
proc glmselect data=b;
class a b c d e / param=reference;
model y=a b c d e f ;
output out=check r=residuals;
run;
proc univariate data=check;
var residuals;
histogram residuals / normal kernel;
qqplot residuals / normal(mu=est sigma=est);
run;
Positive, skewed responses are often modeled with the gamma or inverse Gaussian distribution. Both are available through the DIST= option of the MODEL statement in PROC GENMOD.
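A minimal sketch of the gamma fit, reusing the data set and variable names from the original post. The log link and the output names (check_gamma, predicted, dev_resid) are assumptions for illustration, and the response must be strictly positive for either distribution:

```sas
/* Gamma regression with a log link (the link choice is an assumption). */
/* Swap DIST=GAMMA for DIST=IGAUSSIAN to try the inverse Gaussian.      */
proc genmod data=b;
   class a b c d e / param=reference;
   model y = a b c d e f / dist=gamma link=log;
   /* Save predicted values and deviance residuals for diagnostics */
   output out=check_gamma pred=predicted resdev=dev_resid;
run;
```

Comparing the deviance residuals and the fit statistics from the two DIST= choices is one way to see whether either is adequate.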
Thanks for your input. How do I know which of the two to use? Can either one of them work?
You can use PROC SEVERITY in SAS/ETS to assess the fit of several distributions, including the gamma, the inverse Gaussian, and others. For example:
proc severity data=b crit=aicc;
loss y;
dist _predefined_;
run;
Thank you so much for your input. I used PROC SEVERITY to select the distribution that best fits my data, and the Burr distribution was selected. This is a distribution that I am not very familiar with. How do I fit a Burr distribution in SAS?
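One possible approach, offered as a sketch rather than a confirmed answer from this thread: PROC SEVERITY can itself fit the Burr distribution, and its SCALEMODEL statement lets regressors act on the scale parameter. Only the continuous predictor f is used here; whether the categorical predictors can be added through a CLASS statement depends on the SAS/ETS release, so that part is left out:

```sas
/* Fit only the Burr distribution to y.                              */
/* SCALEMODEL lets the listed regressors affect the scale parameter. */
/* Using f alone is an assumption made to keep the sketch simple.    */
proc severity data=b crit=aicc;
   loss y;
   scalemodel f;
   dist burr;
run;
```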
Here is the partial output from proc severity:
Converged | AICC | Selected |
Yes | 361869 | Yes |
Yes | 382973 | No |
Yes | 371532 | No |
Yes | 368401 | No |
Yes | 365528 | No |
Yes | 382977 | No |
Yes | 382975 | No |
Yes | 377067 | No |
I suggest you look at the plots (CDF/EDF and PDF) to visually assess how close the other distributions are to the EDF of the observed data. It's not so much a matter of picking the one with the lowest AICC as it is rejecting distributions that clearly don't fit well and picking one that does fit reasonably well.
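A sketch of how those comparison plots might be requested, assuming ODS Graphics is available and that the CDFPERDIST and PDFPERDIST plot requests exist in your SAS/ETS release (PLOTS=ALL is a broader alternative):

```sas
ods graphics on;

/* Same call as before, plus per-distribution CDF/EDF and PDF plots */
proc severity data=b crit=aicc plots=(cdfperdist pdfperdist);
   loss y;
   dist _predefined_;
run;
```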
Thank you so much!
What does the histogram of the residuals look like? Is there more than one mode? This would signal that you are missing some important effect, or some important interaction(s).
This is what the histogram looks like:
Personally, I would work on developing a better model. Use regression diagnostic plots to analyze whether you should include second-order interaction terms in the model. Since you are using PROC GLMSELECT, you can add in all second-order terms and use variable selection to see if any interactions improve the fit enough to make it into the final model.
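A minimal sketch of that idea, reusing the names from the original post. The selection method, the SBC criterion, the HIERARCHY= setting, and the output names (check2, predicted) are assumptions, not a recommendation:

```sas
ods graphics on;

/* a|b|c|d|e|f @2 expands to all main effects and two-way interactions. */
/* f*f adds the quadratic term for the continuous predictor.            */
proc glmselect data=b plots=all;
   class a b c d e / param=reference;
   model y = a|b|c|d|e|f @2 f*f
         / selection=stepwise(select=sbc) hierarchy=single;
   output out=check2 r=residuals p=predicted;
run;
```

The residuals in check2 can then go through the same PROC UNIVARIATE step as before to see whether the Q-Q plot and histogram improve.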