Hello,
I am trying to model a continuous outcome variable that is highly skewed. The model has several predictor variables, both continuous and categorical. The Q-Q plot of the residuals is shown below. As you can see, the normality assumption is clearly violated. I tried log-transforming the outcome variable, but that doesn't seem to fix the problem. Does anybody have an idea of how to remedy this issue? Does the central limit theorem apply here?
Thanks.
Here is the code used:
/* Fit the linear model and save the residuals */
proc glmselect data=b;
   class a b c d e / param=reference;
   model y = a b c d e f;
   output out=check r=residuals;
run;

/* Check the residuals for normality */
proc univariate data=check;
   var residuals;
   histogram residuals / normal kernel;
   qqplot residuals / normal(mu=est sigma=est);
run;
Accepted Solutions
Personally, I would work on developing a better model. Use regression diagnostic plots to analyze whether you should include second-order interaction terms in the model. Since you are using PROC GLMSELECT, you can add in all second-order terms and use variable selection to see if any interactions improve the fit enough to make it into the final model.
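A minimal sketch of that suggestion, reusing the data set and variable names from the original post. The choice of stepwise selection with the SBC criterion and the HIERARCHY=SINGLE option are assumptions made for illustration, not part of the reply:

```sas
/* Sketch only: offer all main effects and two-way interactions to the
   selection procedure. The bar operator with @2 expands to every main
   effect plus every second-order interaction. */
proc glmselect data=b;
   class a b c d e / param=reference;
   model y = a|b|c|d|e|f @2
         / selection=stepwise(select=sbc)   /* assumed selection method */
           hierarchy=single;                /* an interaction enters only if its main effects are in the model */
   output out=check2 r=residuals;           /* check2 is a hypothetical output data set name */
run;
```

The residuals in check2 can then be passed to the same PROC UNIVARIATE step used in the question to see whether the Q-Q plot improves.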
Positive, right-skewed responses are often modeled with the gamma or inverse Gaussian distribution. Both are available through the DIST= option of the MODEL statement in PROC GENMOD.
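As a rough illustration, a gamma model for the data in the question might look like the sketch below. The log link is an assumption (it is a common choice for gamma models, but the reply does not specify a link), and the output data set name is hypothetical:

```sas
/* Sketch only: gamma regression with an assumed log link */
proc genmod data=b;
   class a b c d e / param=reference;
   model y = a b c d e f / dist=gamma link=log;
   /* check_gamma is a hypothetical data set name */
   output out=check_gamma pred=predicted resdev=dev_resid;
run;
```

Replacing dist=gamma with dist=igaussian fits the inverse Gaussian model instead. Both distributions require y to be strictly positive.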
Thanks for your input. How do I know which of the two to use? Can either one of them work?
You can use PROC SEVERITY in SAS/ETS to assess the fit of several distributions, including the gamma and inverse Gaussian. For example:
proc severity data=b crit=aicc;
loss y;
dist _predefined_;
run;
Thank you so much for your input. I used PROC SEVERITY to select the distribution that best fits my data, and the Burr distribution was selected. I am not very familiar with this distribution. How do I fit a Burr distribution in SAS?
Here is the partial output from proc severity:
Converged | AICC | Selected |
Yes | 361869 | Yes |
Yes | 382973 | No |
Yes | 371532 | No |
Yes | 368401 | No |
Yes | 365528 | No |
Yes | 382977 | No |
Yes | 382975 | No |
Yes | 377067 | No |
I suggest you look at the plots (CDF/EDF and PDF) to visually assess how close the other distributions are to the EDF of the observed data. It's not so much a matter of picking the one with the lowest AICC as it is rejecting distributions that clearly don't fit well and picking one that does fit reasonably well.
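A sketch of how those plots could be requested. The particular PLOTS= choices and the list of candidate distributions are assumptions made for illustration:

```sas
/* Sketch only: request per-distribution CDF and PDF comparison plots */
ods graphics on;

proc severity data=b crit=aicc plots=(cdfperdist pdfperdist);
   loss y;
   /* assumed subset of the predefined distributions */
   dist burr gamma igauss logn weibull;
run;
```

Each CDF plot overlays a fitted distribution on the EDF of the observed data, which makes clearly poor fits easy to rule out.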
Thank you so much!
What does the histogram of the residuals look like? Is there more than one mode? More than one mode would signal that you are missing an important effect or an important interaction.
This is what the histogram looks like: