04-28-2016 12:06 PM
I want to do a t-test and also an ANOVA, but my data are extremely positively skewed. I used a log transformation to normalize them, and then ran a normality test.
The three tests gave different p-values. For Kolmogorov-Smirnov the p-value is >0.05, but for the other two tests it is <0.05.
The sample size is 67994. Is this transformation acceptable with respect to normality for the t-test and ANOVA?
proc univariate data=test1 normal;
   class gender;
   var newvar;
   histogram / normal kernel;
   qqplot newvar;
run;
Goodness-of-Fit Tests for Normal Distribution
|Test|Statistic|p Value|
|Kolmogorov-Smirnov|D = 0.09694082|Pr > D = 0.119|
|Cramer-von Mises|W-Sq = 0.14977297|Pr > W-Sq = 0.024|
|Anderson-Darling|A-Sq = 0.98653289|Pr > A-Sq = 0.013|
04-28-2016 02:39 PM
04-28-2016 03:22 PM
But also beware that the tests here are immensely over-powered to detect differences. You will learn far more from the QQ plot. That long flat part at the beginning is evidence that there is a mixture here, and a rough look at the data makes me think that you are using a lower limit of quantitation value for a lot of observations. If that is the case, there are a number of ways to address the issues of analysis.
Also, the assumptions of ANOVA (and of the t test) are not that the data are normally distributed, but that the errors/residuals are normally distributed. Try running the analysis on the transformed data, and then testing the residuals for normality.
With a sample size this large, and the known conservatism of tests for normality, p-values in this range should probably not be regarded as strong evidence for lack of normality of the residuals, to which the ANOVA is relatively robust in any case.
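To make the residual check concrete, here is a minimal sketch. It reuses the data set and variable names from the original post (test1, gender, newvar); the output data set name resids and the residual variable resid are names I made up for illustration.

```sas
/* One-way ANOVA on the log-transformed response; save the residuals. */
proc glm data=test1;
   class gender;
   model newvar = gender;
   output out=resids r=resid;   /* resids and resid are illustrative names */
run;
quit;

/* Examine the residuals, not the raw response, for normality. */
proc univariate data=resids normal;
   var resid;
   histogram resid / normal;
   qqplot resid / normal(mu=est sigma=est);
run;
```

With a sample this large, read the Q-Q plot first and treat the test p-values as secondary.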
04-28-2016 04:07 PM
You do not need to transform your variable. The UNIVARIATE procedure can fit a lognormal and other skewed distributions.
As Steve points out, the Q-Q plot contains the graphical information about the fit. To learn more about the Q-Q plot and how to create it in SAS, see "Modeling the distribution of data? Create a Q-Q plot."
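A rough sketch of that approach follows. The variable name rawvar is hypothetical: it stands for the original, untransformed variable, whose name was not given in the thread. The lognormal fit here uses the default threshold of zero, which is an assumption about the data.

```sas
/* Fit a lognormal distribution directly to the untransformed variable. */
/* rawvar is a placeholder for the original (untransformed) variable.   */
proc univariate data=test1;
   class gender;
   var rawvar;
   histogram rawvar / lognormal;            /* threshold theta defaults to 0 */
   qqplot rawvar / lognormal(sigma=est);    /* shape parameter estimated from the data */
run;
```

If the points on the lognormal Q-Q plot fall close to a straight line, the lognormal model is a reasonable description of the data.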
04-28-2016 11:20 PM - edited 04-28-2016 11:21 PM
Compare the p-value you get from ANOVA or a t-test on the log-transformed data with the p-value from the Wilcoxon rank-sum (nonparametric) test from PROC NPAR1WAY on the untransformed data. The latter should confirm the former.
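Something along these lines, as a sketch only. It assumes gender has exactly two levels (required by PROC TTEST), and rawvar is again a hypothetical name for the untransformed variable.

```sas
/* Parametric comparison on the log-transformed variable. */
proc ttest data=test1;
   class gender;
   var newvar;
run;

/* Rank-based comparison on the untransformed variable. */
/* rawvar is a placeholder for the original variable.   */
proc npar1way data=test1 wilcoxon;
   class gender;
   var rawvar;
run;
```

Because the log is a monotone transformation, the ranks are the same on either scale, so the Wilcoxon result does not depend on whether you transform first.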
04-29-2016 09:38 AM
I like this approach @PGStats, except that all of the ties at the lower bound mean a loss of power. I would suggest a tobit analysis on the log-transformed data (say with PROC QLIM), but that might make a newcomer to SAS run screaming, I'm afraid.
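For anyone not yet running, a tobit fit could look roughly like the sketch below. Everything about the limit is an assumption: the value 0.5 is only a placeholder for the actual lower limit of quantitation, log_lloq and test1_cens are made-up names, and I am assuming newvar is the natural log of the original variable with censored observations recorded at the limit.

```sas
%let lloq = 0.5;   /* placeholder only: substitute the actual lower limit */

data test1_cens;                 /* illustrative data set name */
   set test1;
   log_lloq = log(&lloq);        /* censoring point on the natural-log scale */
run;

/* Tobit model: newvar is treated as left-censored at log_lloq. */
proc qlim data=test1_cens;
   class gender;
   model newvar = gender;
   endogenous newvar ~ censored(lb=log_lloq);
run;
```

The gender effect is then estimated for the underlying uncensored log-scale variable rather than for the values piled up at the limit.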