03-12-2012 02:39 PM
I just wanted to ask a little bit of advice on the goodness of fit normality tests that are generated using proc univariate.
This may be a really silly question, but why are the p-values for the goodness of fit tests different when I let the procedure estimate mu and sigma compared to when I type the same values in myself?
Am I misunderstanding the estimate?
Any advice would be greatly appreciated.
For ease, I have attached a small piece of code as an example.
03-12-2012 03:34 PM
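It's not silly. It's somewhat subtle. The distribution of the GOF statistic depends on whether the parameters are specified in advance or estimated from the data, and on which parameters you are estimating.
Let's say you are estimating two parameters: location and scale. In order to estimate the scale parameter, you first have to estimate the location parameter, and then USE that estimate to estimate the scale parameter. Using an estimate is different from using the true value; there is more uncertainty. So even when the numbers you type in are identical to the estimates, the test is referred to a different null distribution, and the p-values change.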
This comes up a lot in statistics. A familiar example is the formula for the sample variance, in which you divide sum((x-xbar)^2) by (n-1) [rather than n] because you are using an estimate of the location parameter.
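If you want to see the two divisors side by side, here is a minimal sketch. The data set name sim, the variable x, the seed, and the values 10, 2, and n=50 are all made up for illustration. The VARDEF= option in PROC MEANS controls the divisor used for the variance.

```sas
/* Simulated data: all names and values here are illustrative assumptions */
data sim;
   call streaminit(12345);
   do i = 1 to 50;
      x = rand("Normal", 10, 2);   /* mean 10, standard deviation 2 */
      output;
   end;
run;

/* Divisor n-1: the usual sample variance (this is the default) */
proc means data=sim n mean var vardef=df;
   var x;
run;

/* Divisor n: slightly smaller, as if xbar were the known mean */
proc means data=sim n mean var vardef=n;
   var x;
run;
```

The second variance is the first one multiplied by (n-1)/n, so the gap shrinks as the sample grows.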
I think the book by D'Agostino and Stephens, Goodness-Of-Fit Techniques, covers this.
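For what it's worth, the behavior in the original question can be reproduced with a sketch along these lines. This is not the code that was attached to the question; the data set sim, the variable x, the seed, and the parameter values are assumptions chosen for illustration.

```sas
/* Simulated data: all names and values here are illustrative assumptions */
data sim;
   call streaminit(12345);
   do i = 1 to 50;
      x = rand("Normal", 10, 2);   /* true mu = 10, true sigma = 2 */
      output;
   end;
run;

proc univariate data=sim;
   var x;
   /* Fit 1: mu and sigma are estimated from the sample */
   histogram x / normal(mu=est sigma=est);
   /* Fit 2: mu and sigma are typed in as fixed values */
   histogram x / normal(mu=10 sigma=2);
   /* Show only the parameter and goodness-of-fit tables */
   ods select ParameterEstimates GoodnessOfFit;
run;
```

To mimic the original situation more closely, you could replace 10 and 2 in the second HISTOGRAM statement with the estimates printed by the first fit. The test statistics should then match, while the p-values would still be expected to differ, because the procedure treats typed-in values as known parameters.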
03-13-2012 04:54 AM
Thanks Rick, so is it always better to input the true values if you are going to use proc univariate and the goodness of fit tests?
In your opinion, are these statistics useful? I generally look at the normal probability plot and if in doubt the skewness and kurtosis to cement my decision, so I wouldn't particularly use the other tests.
Many thanks for your help and advice, it's really appreciated!
03-13-2012 08:35 AM
Well, yes, but usually you don't know the true values! The true values are the population parameters; typically all we know are the estimates from a sample.
Are these statistics useful? I think so. A probability plot or a Q-Q plot is a useful first step in deciding whether your data might be successfully modeled by a certain distribution (see http://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot/). However, I view the Q-Q plot as complementary to statistical tests, not a replacement for them.
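As a rough sketch of using the two together, the following step requests both the tests for normality and a normal Q-Q plot from one call. The data set sim and the variable x are made-up names, and the data are simulated only so that the example is self-contained.

```sas
/* Simulated data: all names and values here are illustrative assumptions */
data sim;
   call streaminit(12345);
   do i = 1 to 50;
      x = rand("Normal", 10, 2);
      output;
   end;
run;

/* NORMAL on the PROC statement requests the tests for normality */
proc univariate data=sim normal;
   var x;
   /* Q-Q plot with a reference line based on the estimated mu and sigma */
   qqplot x / normal(mu=est sigma=est);
run;
```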
The "usefulness" also depends on what you are trying to do. Ask yourself why you are checking normality. Is it because you want to run some OTHER test (such as a t test) that assumes normality? If so, how robust is the other test to deviations from normality? If it is robust, then from a practical viewpoint it probably doesn't matter whether the p-value in the test of normality is 0.06 or 0.04. Also, modern statistics has a lot of nonparametric methods that do not require assumptions about the distribution of the data.
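One way to compare a t test with nonparametric alternatives for a single sample is the Tests for Location table in proc univariate. This is only a sketch: the data set sim, the variable x, and the hypothesized value 10 are assumptions for illustration.

```sas
/* Simulated data: all names and values here are illustrative assumptions */
data sim;
   call streaminit(12345);
   do i = 1 to 50;
      x = rand("Normal", 10, 2);
      output;
   end;
run;

/* MU0= sets the hypothesized location; the table reports Student's t,
   the sign test, and the Wilcoxon signed rank test side by side */
proc univariate data=sim mu0=10;
   var x;
   ods select TestsForLocation;
run;
```

If the three tests lead to the same conclusion, the normality question probably matters little for that particular analysis.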