Normality Test

02-02-2015 08:49 AM

I've used the UNIVARIATE procedure to check the normality of the continuous variable 'amount'. In the actual data the mean is 5055 and the median is 68. Similarly, skewness and kurtosis are 8.5 and 166 respectively.

proc univariate data=want;

var amount;

histogram /normal (color=red);

run;

As per the documentation, I understand that for a normal curve the mean and median should be close to each other, and skewness and kurtosis should be close to 0. So do I need to remove the outliers to make my data normal? Or is there a better way to obtain normal data?

What other procedures or techniques can be used in SAS to conduct a normality test?

Thanks for any help you offer.

Accepted Solutions

02-06-2015 02:57 AM

Hi Babloo,

We don't know exactly what the purpose of your study is.

If you just want to determine (**statistically prove**) that there is a difference between mean balances, then you can conduct a nonparametric ANOVA-type test, or maybe (log?) transform your variable and do an ANOVA. You don’t need to back-transform, because at that point you are finished. (But don’t forget to test the normality of the residuals, as Steve mentions.)

If you want to **interpret** the results, you again don’t need to back-transform. If you did a **log** transform, then your parameter estimates represent an (approximate) **percentage change** instead of a change in the value on the original scale.

If you are interested **only in predictions** (“What will be the **expected balance** of a new client with age=30, income=100?”), then you don’t need to worry about normality **so much**; just do an ANOVA (regression with indicator variables in your case, but I would also consider treating age and income as continuous variables, i.e. not binning them).

If you are interested in predictions **and prediction intervals** (“In what range will the balance of a **usual** age=30, income=100 client fall?”), you again need to care about the distribution.

A simple back-transform of the predictions (if you used a log transform) is: *prediction_median_on_original_scale = exp(prediction_mean_on_log_scale)*
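The back-transform above can be sketched as follows. This is only an illustration under stated assumptions: the data set name `accounts` and the variables `balance`, `age` and `income` are hypothetical, and rows with non-positive balances are simply left out of the log model here.

```sas
/* Hypothetical data set ACCOUNTS with variables balance, age, income. */
data accounts_log;
    set accounts;
    /* LOG is undefined for zero or negative values, so those rows stay missing */
    if balance > 0 then log_balance = log(balance);
run;

/* Fit the model on the log scale and save the predicted values */
proc glm data=accounts_log;
    model log_balance = age income;
    output out=pred p=pred_log;      /* pred_log = predicted mean on the log scale */
run;
quit;

/* Back-transform: exp() of the log-scale mean estimates the median on the original scale */
data pred_back;
    set pred;
    pred_median = exp(pred_log);
run;
```

Note that `exp(pred_log)` estimates the median, not the mean, on the original scale, which is why the formula above is written in terms of the median.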

All Replies

02-02-2015 10:27 AM

On the proc univariate statement add the option NORMAL. This will add some output that is the result of tests for normality. Removing outliers, if any, is a BIG topic in analysis.
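As a minimal sketch, reusing the data set and variable names from the original question, the NORMAL option goes on the PROC statement itself:

```sas
/* NORMAL adds the "Tests for Normality" table to the output */
proc univariate data=want normal;
    var amount;
    histogram amount / normal(color=red);   /* overlay a fitted normal curve */
run;
```

The table typically reports the Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling tests; the Shapiro-Wilk statistic is only computed when the sample size is 2000 or less.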

02-02-2015 12:51 PM

I would expect that the variable 'amount' has a lognormal distribution.

Try doing a log transformation of your data, and then look at the various moments.
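A minimal sketch of that suggestion, assuming the data set `want` and variable `amount` from the question; the names `want_log` and `log_amount` are made up for illustration, and non-positive values are left missing because their log is undefined:

```sas
/* Create a log-transformed copy of the variable */
data want_log;
    set want;
    if amount > 0 then log_amount = log(amount);   /* natural log */
run;

/* Moments (mean, skewness, kurtosis) and normality tests on the log scale */
proc univariate data=want_log normal;
    var log_amount;
    histogram log_amount / normal;
run;
```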

Steve Denham

02-03-2015 07:25 AM

The minimum value in my dataset is 0 and the maximum is 124346. I tried the code below.

ods graphics on;

options maxmemquery =6M;

proc univariate data=Anova_data_new normaltest plots;

var tot_balanc;

histogram /midpoints=1 to 124346 by 1000 /*How to find the divisor value*/

lognormal;

run;

ods graphics off;

I encountered the following error: ERROR: The smallest value of amount is less than or equal to the threshold parameter (THETA) for the lognormal fit. According to the documentation,

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= *lognormal-option*. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify ζ and σ with the SCALE= and SHAPE= *lognormal-options*, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.

But I'm not sure how to compute the theta value for my data. Could you please help me with this? Also, please let me know why we are using a lognormal distribution instead of a normal distribution.

02-04-2015 02:33 AM

Any suggestions?

02-04-2015 06:04 AM

Hi,

It is true that for a normal distribution the mean and median should be close to each other, and skewness and kurtosis should be close to 0. But there are also formal statistical tests of normality, which are available in PROC UNIVARIATE.

Obviously your data is not normally distributed, which is why Steve suggested testing the data for lognormality.

Lognormally distributed data cannot contain zeros. But maybe it doesn’t hurt to add 1 to those 0 values. Alternatively, you can apply threshold=-1 (shifting the fitted distribution).
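A minimal sketch of the threshold approach, using the data set and variable names from the failing code earlier in the thread. It assumes the minimum value really is 0; the threshold has to be strictly below the smallest observation, so THETA=-1 would fail again if the data contained values of -1 or less.

```sas
proc univariate data=Anova_data_new;
    var tot_balanc;
    /* Shift the fitted lognormal so that zeros lie above the threshold */
    histogram tot_balanc / lognormal(theta=-1);
run;
```

Specifying `lognormal(theta=est)` instead asks the procedure to compute a maximum likelihood estimate of the threshold rather than fixing it by hand.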

Or you could try other distributions as well.

Why is lognormal better than normal? Because (probably) it fits your data better.

But we don’t know the purpose of your study.

Why do you want to fit a distribution to your data? Why do you want to test normality (or lognormality or something else) of your data? Why do you want to **make** it normal? (By removing outliers maybe.)

Gergely

02-04-2015 06:19 AM

Thanks for your valuable comments.

I need to conduct an ANOVA test on my samples, so I would expect my samples to be normal.

Would it be reasonable to remove the values that are less than or equal to 0 before fitting a lognormal distribution? I am also puzzled that the various moments of my variable go from non-normal to normal when I switch from the normal to the lognormal distribution. What is the significance of the lognormal distribution? Can it be applied to the variable 'balance/amount'?

02-04-2015 06:40 AM

You have different groups, and you want to determine whether mean *balance* is different between groups?

If yes: ANOVA requires that *balance* be normally distributed **within each group**. If it is not, you could transform your variable (log transform?), or apply some other test (PROC GENMOD?).

If you remove some observations (zeros or outliers), your results (inference) will be valid only for a subpopulation. (For example, you may simply want to exclude accounts with 0 or very high balance because they do not represent the “real wealth of the account owners”.)

Otherwise you need to have a very good explanation of why you are removing them. (For example, if 0 is a data error.)
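To look at the distribution within each group, a CLASS statement can be added to PROC UNIVARIATE. This is a minimal sketch; the grouping variable `grp` is a hypothetical name, since the thread does not say what the groups are called.

```sas
/* Normality tests and normal Q-Q plots separately for each level of GRP */
proc univariate data=Anova_data_new normal;
    class grp;                                   /* hypothetical grouping variable */
    var tot_balanc;
    qqplot tot_balanc / normal(mu=est sigma=est);
run;
```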

02-04-2015 07:19 AM

I need to determine whether mean *balance* is different between groups.

I agree with your point that if I remove some observations (zeros or outliers), the results (inference) will be valid only for a subpopulation, for example if I simply exclude accounts with 0 or very high balance because they do not represent the “real wealth of the account owners”.

Now I am wondering what values should replace the negative and zero values of my variable 'balance' before fitting a lognormal distribution, because 25% of my data is less than or equal to zero.

02-04-2015 07:22 AM

This really looks to me like an example where a nonparametric approach will be useful. There are zeroes, negative values and truly huge values in your data. Rather than looking at whether the means are different, consider whether the medians are different. A Mann-Whitney test, run with PROC NPAR1WAY, provides a distribution-free approach to this comparison.
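A minimal sketch of such a call; the grouping variable `grp` is a hypothetical name, and the data set and response names are taken from the earlier posts.

```sas
/* WILCOXON: rank-sum (Mann-Whitney) test for two groups,            */
/*           Kruskal-Wallis test when there are more than two groups */
/* MEDIAN:   additionally requests the median test                   */
proc npar1way data=Anova_data_new wilcoxon median;
    class grp;          /* hypothetical grouping variable */
    var tot_balanc;
run;
```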

Steve Denham

02-04-2015 08:39 AM

Can we make inferences with a nonparametric test like we do with a parametric test? For example, I can test 'model balance=income age' in ANOVA or GLM. Since my data is non-normal, how can I do a similar test with a nonparametric approach?

02-04-2015 09:00 AM

Time to read the documentation for PROC NPAR1WAY, paying particular attention to the examples there.

Steve Denham

02-04-2015 09:43 AM

Thanks, I will have a look at the documentation for PROC NPAR1WAY.

With regard to normality, I got a bell curve with mean=median; skewness is 0.2 and kurtosis is -1.08. But the p-value of the normality test is significant (p<0.001).

So can we assume that the data is normal in the scenario I described above, or do we still need to make it normal? I need to do ANOVA or GLM with that data.

Thanks!

02-04-2015 10:17 AM

RE: the statement:

*With regards to normality, I got a bell curve alongside mean=median and skewness (0.2) is also similar to Kurtosis (-1.08). but 'P' value is significant. e.g. p<0.001*

This is quite different from what was given above:

(With the actual data mean is 5055 and the median is 68. Similarly skewness and kurtosis is 8.5 and 166 respectively.)

So I assume this is after some sort of transform?

It still appears that the data are significantly different from being normally distributed, but that is not necessarily a stopping point. Please state how you got such different results this time.

Also, it is not necessary that the raw data be distributed normally to meet the assumptions of analysis of variance. The assumption is that the errors (residuals) be normally distributed. This can be checked by fitting the model of interest, getting the residuals in an output dataset, and then checking them for normality. For a Shapiro-Wilk test of normality, I would only reject the null hypothesis (of a normal distribution) if the P value were less than 0.001.

But much better than testing for normality would be looking at a QQ plot of the residuals. If those basically fit the diagonal without anything unusual, I would trust that the data were such that the assumption is nearly met, and depend on the robustness of the method.

Now if you get some extreme bends anywhere in the QQ plot, the nonparametric approach is probably more powerful than standard ANOVA.
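The residual check described above could be sketched like this. It is only an outline under assumptions: the grouping variable `grp` and the output names `resid_out` and `resid` are hypothetical, and the model statement should be replaced with the model of interest.

```sas
/* Step 1: fit the model and save the residuals */
proc glm data=Anova_data_new;
    class grp;                        /* hypothetical grouping variable */
    model tot_balanc = grp;
    output out=resid_out r=resid;     /* r= stores the residuals */
run;
quit;

/* Step 2: normality tests and a Q-Q plot of the residuals */
proc univariate data=resid_out normal;
    var resid;
    qqplot resid / normal(mu=est sigma=est);   /* reference line from estimated mean and std dev */
run;
```

If the points follow the reference line without strong bends, the normality assumption for the residuals is probably close enough to rely on the robustness of ANOVA.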

Steve Denham

02-05-2015 03:36 AM

With the actual raw data, the mean is 5055 and the median is 68, and skewness and kurtosis are 8.5 and 166 respectively. After that I did binning (15 bins) and removed the outliers. Now the mean is 14006.9, the median is 14608.5, and skewness and kurtosis are 0.0013 and -1.2806 respectively. The record count dropped considerably, from 24000 to 600, and the test still gives p<0.001. I got a bell curve as well. Is my final sample (600 records) reasonably normal?

Then I replaced the actual value of balance1 (or amount) with its log value (via **balance1=log(balance1)**) and fitted a lognormal distribution, which produced *a bell curve alongside mean=median and skewness (0.2) is also similar to Kurtosis (-1.08).*

Following your advice, I tried to draw a Q-Q plot to check normality with the code below. I have attached the plot as well. It looks non-normal, as most of the data lie toward the left. But I'm not very sure.

ods graphics on;

options maxmemquery=6M;

proc univariate data=normality_new;

QQPLOT balance1 / lognormal (sigma=2); /*I don't know about sigma value here*/

run;

ods graphics off;
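One thing that may be worth checking in the step above: if balance1 already holds log values (after balance1=log(balance1)), then a normal reference distribution, rather than a lognormal one, would be the matching comparison. A minimal sketch under that assumption, letting the procedure estimate the parameters instead of fixing sigma=2:

```sas
/* Assumes balance1 has already been replaced by log(balance1) */
proc univariate data=normality_new;
    qqplot balance1 / normal(mu=est sigma=est);   /* parameters estimated from the data */
run;
```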

So may I request you to view my plot and share your thoughts (or possibly verdict)?

Thanks.