Checking Normality

Posted 07-27-2018 06:46 PM
(5887 views)

I'm doing research for my thesis and I have created two new continuous variables, "burnout" and "boreout", using factor analysis. These two will be my dependent variables (in two separate models) in a linear regression. Before I can run this regression I need to check normality, but every time I check the goodness-of-fit tests I get exactly the same results for any variable I use (the same three p-values):

Kolmogorov-Smirnov: <0.010

Cramer-von Mises: <0.005

Anderson-Darling: <0.005

I've used PROC UNIVARIATE and a histogram to check this (the histogram looks normally distributed). I also tried a log transformation first, but again I get the same three p-values. Can anyone help?

12 Replies

Hello stefnix,

The p-values you are getting indicate that the tests reject normality for your variables. However, keep these points in mind:

a) for large sample sizes even small deviations from normality would lead to a significant normality test

b) these tests are highly sensitive to extreme values.

c) normality of the outcome is not such an important assumption to proceed with linear regression. However, normality of the residuals after you fit your model is important.

d) I find QQ plots a lot more useful to assess normality than these tests. proc univariate produces qq plots.

Log-transformation may not be appropriate for your data. If you can post a photo of the histogram we may be able to propose other more suitable transformations.

Hi ameriel,

Thanks for helping!

Point A: shouldn't it be: "for large sample sizes even small deviations from normality would lead to a NON significant normality test"?

I added the output of PROC UNIVARIATE as an attachment (I also added a Q-Q plot). My problem is that the histogram and Q-Q plot look normal (I think), but for my thesis it would be better to base this on a number, because I'm not sure at what point I can say that a Q-Q plot or histogram is no longer normal. With a number there is a clear line.

I also think it is weird that when I use any other variable (as a test), like 'how many hours do you work a week', the p-values for these three tests remain exactly the same.

Point A: @ameriel's wording was correct. In statistics, you can't say that your sample is significantly normally distributed. The test asks how likely it is that a random normal process would generate a summary statistic (such as D for K-S) at least as extreme as the one you observed. That probability is called significant when it is so low (usually less than 5%) that the normality hypothesis becomes implausible.

You have been told to "*check for normality*" because many statistical methods make that assumption. Unfortunately, maybe, almost no natural phenomenon is perfectly normal. Given enough measurements, normality tests will detect some departure. So why are those classical methods that rely on normality so popular? Well it turns out that the normality assumption doesn't need to be so strict. It has been shown that data distributions that are roughly bell shaped will lead to valid inferences with classical methods, sometimes with a slight loss of power. As a matter of fact, the more data you have, the less important the normality assumption is.
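The claim that enough measurements will expose even a harmless departure can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the thread: the data set name `sim`, the seed, and the choice of a t distribution with 20 degrees of freedom as a stand-in for "roughly bell shaped" data are all hypothetical.

```sas
/* Hypothetical simulation: 43,000 draws from a t distribution with 20 df, */
/* which is bell shaped but has slightly heavier tails than a normal.      */
data sim;
   call streaminit(2018);       /* fixed seed so the run is repeatable */
   do i = 1 to 43000;
      x = rand('T', 20);        /* Student t variate with 20 df */
      output;
   end;
   drop i;
run;

/* Same checks as in the original post, applied to the simulated data */
proc univariate data=sim normal;
   var x;
   histogram x / normal(mu=est sigma=est);
   qqplot x / normal(mu=est sigma=est);
run;
```

With a sample this large the tests will typically flag the mild tail difference even though the histogram and Q-Q plot look close to normal. Rerunning with a few hundred observations instead usually gives much larger p-values.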

Looking at your Q-Q plots, I would not hesitate to apply classical (or parametric) methods. Unfortunately I don't know of any formal test to back my assessment.

That said, there exist non-parametric alternatives to many statistical methods. Do not hesitate to try them. If you ever get contradictory results using those, then be extra careful with the results of the parametric analyses.

PG

thank you!

There is no assumption of normality of data for linear regression. There are for other models though.

> Before i can do this regression i need to check normality

Usually what's most important when comparing is to ensure the data has the same distribution.

> but every time i check the goodness of fits tests i get the exact same results for any variable i use (i get the exact same three p-values)

Are you sure your code is correct? Does the log show no errors, and are other values changing as expected?

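This is the code I use for the variable burnout:

    /* NORMAL requests the tests for normality; PLOT adds basic diagnostic plots */
    proc univariate data=ewcsrotatie normal plot;
       var burnout;
       /* compare against a normal distribution with estimated mean and SD */
       histogram burnout / normal(mu=est sigma=est);
       qqplot burnout / normal(mu=est sigma=est);
    run;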

There are no errors or warnings, but even if I take a variable like 'how many hours do you work a week' I still get exactly the same p-values, so there is never a normal distribution.

(Maybe good to know: my dataset has over 43,000 cases.)

thanks!

Hello again stefnix,

On Point A: you have a large sample size, and therefore these tests become more sensitive. I know it's counter-intuitive. However, because your sample is so large, the tests are also basically meaningless. So with those histograms and Q-Q plots, as others said, there is no reason to hesitate to use linear regression. What I would do is check the normality of the residuals after fitting the model. If you use PROC REG or PROC GLM you can save the residuals in an output data set and then check them for normality. This, in my opinion, is far more important for the fit of the model than normality of the outcome.
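As a rough illustration of saving residuals and then checking them, here is a minimal sketch using PROC REG. Several things in it are assumptions rather than the poster's actual analysis: the predictor names are borrowed from the model shared later in the thread and are assumed to be numeric, the interaction terms and the country-level random intercept are left out, and the names `regres`, `resid` and `pred` are hypothetical.

```sas
/* Fit an ordinary least squares model and save the residuals.        */
/* OUT= names the output data set; R= and P= name the new variables   */
/* that hold the residuals and the predicted values.                  */
proc reg data=ewcsrotatie;
   model burnout = ondergekwa overgekwa man leeftijd loon;
   output out=regres r=resid p=pred;
run;
quit;

/* Check the residuals, not the outcome, for normality */
proc univariate data=regres normal;
   var resid;
   histogram resid / normal(mu=est sigma=est);
   qqplot resid / normal(mu=est sigma=est);
run;
```

With tens of thousands of cases the formal tests will probably still reject, so the histogram and Q-Q plot of `resid` are the more informative part of this output.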

On why you keep getting the same outcome for the normality tests: there is nothing wrong with your code or model. The p-values reported for these tests are bounded rather than exact. For example, for K-S you get a p-value of <0.01, which could be anything like 0.00049 or 0.0000000000000003. The output does not tell you the exact value, just that it is smaller than 0.01. That doesn't mean it isn't changing in the background.
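One way to confirm that something is changing in the background is to look at the test statistics rather than the bounded p-values. The sketch below captures the normality tests table with ODS OUTPUT. The data set name `normtests` is hypothetical, and `boreout` is assumed to be the name of the second factor variable.

```sas
/* Save the "Tests for Normality" table for each analysis variable */
ods output TestsForNormality=normtests;

proc univariate data=ewcsrotatie normal;
   var burnout boreout;
run;

/* The statistics (D, W-Sq, A-Sq) should differ between variables */
/* even when every reported p-value sits at its lower bound.      */
proc print data=normtests noobs;
run;
```

If the statistics differ from one variable to the next while the p-values stay at <0.010 and <0.005, the identical p-values only reflect the reporting limits.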

Hope that makes sense and eases your mind that you are not doing anything wrong. Please do ask again if you have any questions.

Thanks Ameriel, this is very helpful!

So I used this code to request residuals and store them in 'resburn':

    proc mixed data=ewcsrotatie noclprint noitprint covtest;
       class countid;
       model burnout = ondergekwa overgekwa man leeftijd loon stopstudeer
                       nonprofit privaat combo
                       ondergekwa*stopstudeer overgekwa*stopstudeer
                       / solution residual outpm=resburn;
       random intercept / solution sub=countid;
    run;

Now I want to run a PROC UNIVARIATE with data=resburn, but I don't know what should come after the VAR statement.

**NO NEED** to check for normality of the 'burnout' variable. The "normality assumptions" that you are worried about are for the RESIDUALS of the model. So you should **check the distribution of the residuals for normality**, not the distribution of the response variable.

Thanks Rick,

I know now that I have to check the normality of the residuals, but I don't know what code to use to check them (see previous posts). Can you help me?

greetings stefnix

OK, so I wrote this code. Is it correct? I ask because I don't see much difference between the normality check for burnout and the normality check for the residuals for burnout:

    proc mixed data=ewcsrotatie noclprint noitprint covtest;
       class countid;
       model burnout = ondergekwa overgekwa man leeftijd loon stopstudeer
                       nonprofit privaat combo
                       ondergekwa*stopstudeer overgekwa*stopstudeer
                       / solution residual outpm=resburn;  /* residuals saved in resburn */
       random intercept / solution sub=countid;
    run;

    proc univariate data=resburn normal plot;   /* now reading resburn */
       var burnout;
       histogram burnout / normal(mu=est sigma=est);
       qqplot burnout / normal(mu=est sigma=est);
    run;

greetings stefnix
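A likely reason the two checks look alike is that the VAR, HISTOGRAM and QQPLOT statements still name `burnout`, so the second PROC UNIVARIATE analyses the outcome again rather than the residuals. A minimal sketch of the residual check follows, assuming the OUTPM= data set stores the residuals in a variable named `Resid`, which is the name PROC MIXED normally uses. The PROC CONTENTS step is there to confirm the name before relying on it.

```sas
/* List the variables that PROC MIXED wrote to the OUTPM= data set */
proc contents data=resburn;
run;

/* Check the saved residuals instead of the outcome variable */
proc univariate data=resburn normal plot;
   var Resid;
   histogram Resid / normal(mu=est sigma=est);
   qqplot Resid / normal(mu=est sigma=est);
run;
```

Note that OUTPM= holds marginal residuals, computed from the fixed effects only. Using OUTP= in the MODEL statement instead gives conditional residuals that also account for the random intercept per `countid`, which may be the more relevant ones to inspect for this model.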
