04-23-2013 06:40 PM
Hello SAS users (my second post today and ever!)
I am overwhelmed by the number of statistical procedures available in SAS, and am hoping someone can nudge me in the right direction. Stats are not my strength, so any attempt you could make to simplify your explanation would be appreciated.
I am an ecologist. My research question: how do various environmental variables (such as light, soil moisture, etc.) affect the density of particular tree seedlings (or saplings)?
So, basically, for each analysis, I have one response variable (seedling density) and multiple predictor variables.
I ran this analysis in PROC REG, specifying backward elimination. However, because the response variables are based on counts, they are very non-normal (heavily skewed to the right because of many zeroes). Additionally, many of my predictor variables are heavily skewed to the right or left. I have tried various transformations of both the predictor and response variables to satisfy the assumptions of homoscedasticity and normality of residuals, and have given up on that approach.
I am reading about a lot of different procedures now, but am just not sure which would be the best for me to start learning about and working with.
My sample size is n=60.
I should also add that scatter plots of individual predictors vs the dependent variables do not suggest a particularly strong relationship with any one variable.
Any advice would be appreciated. Also, please do not hesitate to ask if you would like me to supply you with more information.
04-23-2013 10:44 PM
If your response is counts, you could consider Poisson regression.
One thing though, there are no assumptions about the distribution of your predictor variables, only the residuals. Can you post your best Q-Q plot to show the violation of normality for the residuals?
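For readers who want to see what the Poisson regression suggested above actually estimates, here is a minimal sketch in plain Python. It is not SAS code and not how any SAS procedure is implemented. It assumes a single predictor and a log link, and the function name `fit_poisson` is made up for illustration. The model is log(mean count) = b0 + b1*x, fitted by Newton-Raphson.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Fit log(mean count) = b0 + b1*x by Newton-Raphson (hypothetical helper)."""
    # Start from the intercept-only fit; assumes at least one nonzero count.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

Because of the log link, exp(b1) reads as a multiplier: the expected count is multiplied by exp(b1) for each one-unit increase in the predictor. Zero counts are handled directly, with no need to transform the response.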
04-24-2013 10:11 AM
Reeza, thank you for your response. I have several follow up questions, which I have tried to break up as neatly as possible.
1. As requested, here are my diagnostic graphs from the SAS output, including the Q-Q plot. Question: What does the Q-Q plot (assuming that is the one that has "quantile" in the label) show you that the plot of "percent" vs "residual" doesn't? Are they both there to let you check for normality (points falling close to the line on the Q-Q plot, and bars following the normal curve on the residual-percent plot)?
2. As you can see, the residuals vs predicted value plot looks bad (above). My approach to attempt to remedy the unequal variance was to try to transform variables - first the dependent, then one or more independent variables. In this case, nothing helped much. Please let me know if this is not a valid approach.
In another similar analysis (where I had fewer zeroes) I transformed the dependent variable with a reciprocal log to make it normal. Then I ran the regression and looked at the residual by regressor plots for individual predictor variables (shown below). For predictor variables where there was a cone shape (e.g. PBS, PCWD below), I tried a transformation to make the predictor more normal, and in some cases this did improve the residual by regressor plots, giving random scatter. Was this a valid approach?
3. Poisson regression. I have considered using this; however, I cannot find a way in SAS to do backward elimination with multiple Poisson regression. I tried PROC GENMOD and couldn't find a way to specify this. If you know of one, please let me know. If I can find out how to do that, I will probably have more questions about Poisson regression, but I'll let it lie for now.
Thanks again, and if you made it this far, you should get cookies or something!
04-24-2013 10:32 AM
1. A Q-Q plot is a plot of the quantiles of the data against the quantiles of a known distribution. If it's not linear, then the distributions are not the same. I don't see it in the output.
As for the difference between the two, it's a personal preference which one you use to see if normality is violated. The residual vs predicted plot doesn't look that bad to me, except for the 3 outliers at the top. Is there a reason for those cases?
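To make the Q-Q description above concrete, here is a small sketch in plain Python (not SAS) of how the points on a normal Q-Q plot can be computed. The function name `qq_points` is made up for illustration, and the plotting position (i - 0.5) / n is just one common convention, so the values may differ slightly from what SAS draws.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantile, observed residual) pairs."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Pair the i-th smallest residual with the normal quantile at (i - 0.5) / n.
    return [(std_normal.inv_cdf((i - 0.5) / n), r)
            for i, r in enumerate(ordered, start=1)]
```

If the residuals are roughly normal, these pairs fall close to a straight line. A histogram with a normal curve overlaid shows the same information in a different form, which is why choosing between them comes down to preference.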
2. It really looks like you have 3 outliers overall that are influencing your data quite a bit; I'd try dealing with those somehow first. They show up in all the residual plots. For PBS you also have a lot of observations at 0 or very close to zero. Are those valid responses? What does a log transform do? All of the scales are different between the variables, so you can also consider standardizing them.
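Standardizing, as mentioned above, usually means converting each variable to z-scores: subtract the mean and divide by the standard deviation. A minimal sketch in plain Python (not SAS), with a made-up function name `standardize` and the sample standard deviation as an assumed choice:

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (z-scores)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]
```

After this, every predictor is on the same scale, so coefficients can be compared in units of "per one standard deviation". It does not change the shape of a variable's distribution, so it will not fix skewness or remove outliers.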
3. I don't know if there's an automatic backward elimination, but you have 6 variables, so it's really easy to do manually IMO. Fit the model with all variables and remove variables iteratively until you're satisfied. I won't go into all the reasons backward elimination isn't a good method.
04-24-2013 10:51 AM
Reeza, Thanks for your thorough answers. I know I'm asking a lot of questions, but I am not getting much help locally, and I will never graduate if I don't seek outside advice to get me headed in the right direction.
1. So the residual vs quantile plot is not a q-q plot (second row of first graph, first graph)? If not, what does this graph show me?
2. I will investigate the outliers.
3. I'm re-pasting an unanswered question I had above.
"In another similar analysis (where I had fewer zeroes in the dependent variable) I transformed the dependent variable with a reciprocal log to make it normal. Then I ran the regression and looked at the residual by regressor plots for individual predictor variables (shown below). For predictor variables where there was a cone shape (e.g. PBS, PCWD below), I tried a transformation to make the predictor more normal, and in some cases this did improve the residual by regressor plots, giving random scatter. Was this a valid approach?"