Re: Checking Normality

stefnix · Posted 07-27-2018 06:46 PM

I'm doing research for my thesis and I have created two new continuous variables, "burnout" and "boreout", using factor analysis. These two will be my dependent variables (two models) in a linear regression. Before I can run the regression I need to check normality, but every time I look at the goodness-of-fit tests I get exactly the same results for any variable I use (the same three p-values):

Kolmogorov-Smirnov: <0.010

Cramer-von Mises: <0.005

Anderson-Darling: <0.005

I've used PROC UNIVARIATE and a histogram to check this (the histogram looks normally distributed). I also tried a log transformation first, but again I get the same three p-values. Can anyone help?

ameriel · Posted 07-27-2018 07:48 PM

Hello stefnix,

the p-values you are getting indicate that the tests reject the hypothesis of normality for your variables. However, keep these points in mind:

a) for large sample sizes even small deviations from normality would lead to a significant normality test

b) these tests are highly sensitive to extreme values.

c) normality of the outcome is not such an important assumption to proceed with linear regression. However, normality of the residuals after you fit your model is important.

d) I find QQ plots a lot more useful to assess normality than these tests. proc univariate produces qq plots.

Log-transformation may not be appropriate for your data. If you can post a photo of the histogram we may be able to propose other more suitable transformations.

stefnix · Posted 07-28-2018 10:05 AM

Hi ameriel,

Thanks for helping!

Point A: shouldn't it be: "for large sample sizes even small deviations from normality would lead to a NON significant normality test"?

I added the output of the PROC UNIVARIATE as an attachment (I also added a Q-Q plot). My problem is that the histogram and Q-Q plot look normal (I think), but for my thesis it would be better to base this on a number, because I'm not sure at what point I can say that a Q-Q plot or histogram is no longer normal. With a number there is a clear line.

I also think it is weird that when I use any other variable (as a test), like 'how many hours do you work a week', the p-values for these three tests remain exactly the same.

PGStats · Posted 07-28-2018 04:26 PM

Point A: @ameriel's wording was correct. In statistics, you can't say that your sample is significantly normally distributed. The test asks how likely it is that a truly normal process would generate a summary statistic (such as D for K-S) at least as extreme as the one observed. The result is called significant when that probability is so low that the normality hypothesis is implausible (usually less than 5%).

You have been told to "check for normality" because many statistical methods make that assumption. Unfortunately, maybe, almost no natural phenomenon is perfectly normal. Given enough measurements, normality tests will detect some departure. So why are those classical methods that rely on normality so popular? Well it turns out that the normality assumption doesn't need to be so strict. It has been shown that data distributions that are roughly bell shaped will lead to valid inferences with classical methods, sometimes with a slight loss of power. As a matter of fact, the more data you have, the less important the normality assumption is.
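The point about sample size can be illustrated with a small simulation. This is only a sketch on made-up data: the dataset name sim, the seed, and the choice of a t distribution with 20 degrees of freedom (bell shaped, with slightly heavy tails) are all assumptions for illustration, not anything from the thesis data.

```sas
/* Simulated illustration only; not the thesis data. */
data sim;
    call streaminit(20180728);      /* fixed seed so the run is repeatable */
    do i = 1 to 43000;
        x = rand('T', 20);          /* t with 20 df: roughly bell shaped */
        small = (i <= 200);         /* flag the first 200 rows as a small sample */
        output;
    end;
run;

/* Small sample: 200 draws from the slightly non-normal distribution */
proc univariate data=sim(where=(small = 1)) normal;
    var x;
    qqplot x / normal(mu=est sigma=est);
run;

/* Full sample: same distribution, far more power to detect the departure */
proc univariate data=sim normal;
    var x;
    qqplot x / normal(mu=est sigma=est);
run;
```

Comparing the two runs, I would expect the tests to be much more likely to reject normality on the full 43000 rows than on the 200-row subset, while both Q-Q plots stay close to the reference line except in the far tails.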

Looking at your Q-Q plots, I would not hesitate to apply classical (or parametric) methods. Unfortunately I don't know of any formal test to back my assessment.

That said, there exist non-parametric alternatives to many statistical methods. Do not hesitate to try them. If you ever get contradicting results using those, then be extra careful with the parametric analyses results.

PG

stefnix · Posted 07-29-2018 07:11 AM

thank you!

Reeza · Posted 07-27-2018 11:23 PM

There is no assumption of normality of data for linear regression. There are for other models though.

"Before i can do this regression i need to check normality"

Usually what's most important when comparing is to ensure the data has the same distribution.

"but every time i check the goodness of fits tests i get the exact same results for any variable i use (i get the exact same three p-values)"

Are you sure your code is correct? The log shows no error and other values are changing as expected?

stefnix · Posted 07-28-2018 09:57 AM

This is the code I use for the variable burnout:

proc univariate data=ewcsrotatie normal plot;
   var burnout;
   histogram burnout / normal(mu=est sigma=est);
   qqplot burnout / normal(mu=est sigma=est);
run;

There are no errors or warnings, but even if I take a variable like 'how many hours do you work a week' I still get exactly the same p-values, so there's never a normal distribution.

(Maybe good to know: my dataset has over 43000 cases.)

Thanks!

ameriel · Posted 07-28-2018 05:17 PM

Hello again stefnix,

on Point A, you have a large sample size and therefore these tests become more sensitive. I know it's counter-intuitive. However, because your sample is so large, the tests are also basically meaningless here. So with those histograms and Q-Q plots, as others said, there is no reason to hesitate to use linear regression. What I would do is check the normality of the residuals after fitting the model. If you use PROC REG or PROC GLM you can save the residuals in an output data set and then check them for normality. This, in my opinion, is far more important for the fit of the model than normality of the outcome.
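A minimal sketch of that residual check with PROC REG follows. The three predictors are borrowed from the model posted later in this thread purely for illustration, and the names regout and resid are made up here; substitute your own model and names.

```sas
/* Fit the regression and save the residuals.                  */
/* regout and resid are hypothetical names chosen for this sketch. */
proc reg data=ewcsrotatie;
    model burnout = man leeftijd loon;   /* illustrative subset of predictors */
    output out=regout r=resid;           /* r= stores the raw residuals */
run;
quit;

/* Check the residuals, not the outcome, for normality */
proc univariate data=regout normal;
    var resid;
    histogram resid / normal(mu=est sigma=est);
    qqplot resid / normal(mu=est sigma=est);
run;
```

With this many observations the formal tests will probably still reject, so the Q-Q plot of the residuals is the part worth looking at.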

On why you keep getting the same outcome for the normality tests: there is nothing wrong with your code or model. The p-values of these tests are reported only as bounds, not as exact values. For example, for K-S you get a p-value of <0.010; the actual value could be anything like 0.00049 or 0.0000000000000003. The output does not tell you the exact value, just that it is smaller than 0.01. That doesn't mean it is not changing in the background.
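One way to see that something does change from variable to variable is to capture the test statistics themselves rather than the truncated p-values. The sketch below assumes the second factor score is stored in a variable called boreout and uses a made-up output data set name, normtests.

```sas
/* Capture the normality test table as a data set.            */
/* normtests is a hypothetical name; boreout is assumed to be */
/* the variable name of the second factor score.              */
ods output TestsForNormality=normtests;
proc univariate data=ewcsrotatie normal;
    var burnout boreout;
run;

/* The statistics (D, W-Sq, A-Sq) should differ between variables */
/* even when the displayed p-values look identical.               */
proc print data=normtests noobs;
run;
```

If the statistics differ between burnout and boreout while the p-value column shows the same bounds, the identical p-values are just a display limit and not a sign of a coding error.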

Hope that makes sense and eases your mind that you are not doing anything wrong. Please do ask again if you have any questions.

stefnix · Posted 07-29-2018 06:47 AM

Thanks ameriel, this is very helpful!

So I used this code to request the residuals and store them in 'resburn':

proc mixed data= ewcsrotatie noclprint noitprint covtest;
class countid;
model burnout= ondergekwa overgekwa man leeftijd loon stopstudeer nonprofit privaat combo
ondergekwa*stopstudeer overgekwa*stopstudeer
/solution residual outpm=resburn;
random intercept / solution sub=countid;
run;

Now I want to run a PROC UNIVARIATE with data=resburn, but I don't know what should come after the VAR statement.

Rick_SAS · Posted 07-30-2018 09:01 AM

From your later posts, it looks like you are doing a regression analysis on the 'burnout' variable. There is NO NEED to check for normality of the 'burnout' variable. The "normality assumptions" that you are worried about are for the RESIDUALS of the model. So you should check the distribution of the residuals for normality, not the distribution of the response variable.

stefanievdb · Posted 08-06-2018 06:24 AM

Thanks Rick,

I know now that I have to check the normality of the residuals, but I don't know what code to use to check them (see previous posts). Can you help me?

greetings stefnix

Rick_SAS · Posted 08-06-2018 08:33 AM

Use PROC UNIVARIATE to create a Q-Q plot of the residuals. The third bullet point in that article contains links that tell you how to interpret a Q-Q plot. The last example shows the PROC UNIVARIATE syntax.

stefanievdb · Posted 08-06-2018 09:06 AM

OK, so I wrote this code. Is it correct? I ask because I don't see much difference between the normality check for burnout and the normality check for the residuals of burnout:

proc mixed data= ewcsrotatie noclprint noitprint covtest;
class countid;
model burnout= ondergekwa overgekwa man leeftijd loon stopstudeer nonprofit privaat combo ondergekwa*stopstudeer overgekwa*stopstudeer /solution residual outpm=resburn ;
random intercept / solution sub=countid;
run;

proc univariate data= resburn normal plot;
var burnout;
histogram burnout/ NORMAL (MU=EST SIGMA=EST);
qqplot burnout / NORMAL (MU=EST SIGMA=EST);
run;

greetings stefnix
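A likely reason the two checks look alike is that the PROC UNIVARIATE step still analyzes burnout, which PROC MIXED copies into resburn along with the other input variables. As far as I know, the OUTPM= data set stores the residuals in a variable named Resid, so a sketch of the residual check would be:

```sas
/* Analyze the residual column that PROC MIXED wrote to resburn. */
/* Resid is, to my knowledge, the default residual variable name */
/* in OUTPM= and OUTP= data sets; check with PROC CONTENTS.      */
proc univariate data=resburn normal;
    var Resid;
    histogram Resid / normal(mu=est sigma=est);
    qqplot Resid / normal(mu=est sigma=est);
run;
```

Note that OUTPM= gives marginal residuals, based on the fixed effects only. If you want residuals that also account for the random intercept per countid, OUTP= on the MODEL statement writes the conditional residuals instead; which one to inspect depends on which model assumption you are checking.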
