Babloo
Rhodochrosite | Level 12

I've used the UNIVARIATE procedure to check the normality of the continuous variable 'amount'. In the actual data the mean is 5055 and the median is 68. Similarly, skewness and kurtosis are 8.5 and 166, respectively.

proc univariate data=want;

var amount;

histogram /normal (color=red);

run;

As per the documentation, I understand that for a normal curve the mean and median should be almost the same (both values close to each other) and skewness and kurtosis should be close to 0. So do I need to remove the outliers to make my data normal? Or is there a better way to make the data normal?

What other procedures or techniques can be used in SAS to conduct a normality test?

Thanks for any help you offer.

Accepted Solutions
gergely_batho
SAS Employee

Hi Babloo,

We don't know exactly what the purpose of your study is.

If you just want to determine (statistically prove) that there is a difference between mean balances, then you can conduct a nonparametric ANOVA test, or maybe transform your variable (log?) and do an ANOVA. You don’t need to back-transform, because at that point you are finished. (But don’t forget to test the normality of the residuals, as Steve mentions.)

If you want to interpret the results, you again don’t need to back-transform. If you did a log transform, then your parameter estimates represent an approximate percentage change (instead of a change in the value on the original scale).

If you are interested only in predictions (“What will be the expected balance of a new client with age=30 and income=100?”), then you don’t need to worry about normality so much; just do an ANOVA (a regression with indicator variables in your case, but I would also consider treating age and income as continuous variables, i.e. not binning them).

If you are interested in predictions and prediction intervals (“In what range will the balance of a typical age=30, income=100 client fall?”), you again need to care about the distribution.

For predictions, a simple back-transform (if you used a log transform) is: prediction_median_on_original_scale = exp(prediction_mean_on_log_scale)
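A minimal sketch of that back-transform, assuming a hypothetical dataset balance_log that already holds a log-transformed response log_balance together with continuous age and income (none of these names come from the thread). Exponentiating the predicted mean on the log scale gives an estimate of the median, not the mean, on the original scale.

```sas
/* Fit the model on the log scale and save the predicted values.
   balance_log, log_balance, age and income are hypothetical names. */
proc glm data=balance_log;
   model log_balance = age income;
   output out=pred_log p=pred_log_balance;
run;
quit;

/* Back-transform: exp() of the predicted log-scale mean estimates
   the median balance on the original scale. */
data pred_back;
   set pred_log;
   pred_median_balance = exp(pred_log_balance);
run;
```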

29 REPLIES
ballardw
Super User

On the proc univariate statement add the option NORMAL. This will add some output that is the result of tests for normality. Removing outliers, if any, is a BIG topic in analysis.
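As a sketch of that suggestion, using the dataset and variable names from the original question: the NORMAL option goes on the PROC UNIVARIATE statement and adds a "Tests for Normality" table to the output. As far as I recall, the Shapiro-Wilk statistic is reported only when the sample has 2000 or fewer observations, so with a large dataset you may see only the empirical-distribution tests.

```sas
/* NORMAL requests formal tests for normality in addition to
   the usual moments and quantiles */
proc univariate data=want normal;
   var amount;
   histogram amount / normal(color=red);
run;
```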

SteveDenham
Jade | Level 19

I would expect that the variable 'amount' has a lognormal distribution.

Try doing a log transformation of your data, and then look at the various moments.

Steve Denham
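One way the log transformation might be tried is sketched below. The names want_log and log_amount are made up for illustration; want and amount are from the original question. The guard on amount > 0 is an assumption about how to handle zero or negative values: LOG() is undefined for them, so they are left missing here and drop out of the moments.

```sas
/* Create a log-transformed copy of the variable.
   want_log and log_amount are hypothetical names. */
data want_log;
   set want;
   /* LOG() is undefined for zero or negative values, so those rows
      keep a missing log_amount */
   if amount > 0 then log_amount = log(amount);
run;

/* Inspect the moments and a histogram on the log scale */
proc univariate data=want_log normal;
   var log_amount;
   histogram log_amount / normal;
run;
```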

Babloo
Rhodochrosite | Level 12

The minimum amount in my dataset is 0 and the maximum is 124346. I tried the code below.

ods graphics on;

options maxmemquery =6M;

proc univariate data=Anova_data_new normaltest plots;

var tot_balanc;

histogram /midpoints=1 to 124346 by 1000  /*How to find the divisor value*/

               lognormal;

run;

ods graphics off;

I encountered this error: ERROR: The smallest value of amount is less than or equal to the threshold parameter (THETA) for the lognormal fit. According to the documentation,

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= lognormal-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify ζ and σ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.

But I'm not sure how to compute the theta value for my data. Could you please help me with this? Also, please let me know why we are using the lognormal distribution instead of the normal distribution.

Babloo
Rhodochrosite | Level 12

Any suggestions?

gergely_batho
SAS Employee

Hi,

It is true that for a normal distribution "mean and median should be almost the same (both values should be close to each other) and skewness and kurtosis should be close to 0". But there are also formal statistical tests of normality, which are available in proc univariate.

Obviously your data is not normally distributed. This is why Steve suggested testing the data for lognormality.

Lognormally distributed data cannot contain zeros. But maybe it doesn’t hurt to add 1 to those 0 values. Alternatively, you can apply threshold=-1 (shifting the fitted distribution).
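The threshold idea could look roughly like this, reusing the dataset and variable from the code posted above. It assumes the minimum value really is 0; if the data also contain negative values, a fixed threshold of -1 would trigger the same error, and THETA=EST (or a different approach altogether) would be needed.

```sas
proc univariate data=Anova_data_new;
   var tot_balanc;
   /* Fixed threshold just below the assumed minimum of 0 */
   histogram tot_balanc / lognormal(theta=-1);
   /* Alternative: let the procedure estimate the threshold
      by maximum likelihood */
   histogram tot_balanc / lognormal(theta=est);
run;
```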

Or you could try other distributions as well.

Why is lognormal better than normal? Because (probably) it fits your data better.

But we don’t know the purpose of your study.

Why do you want to fit a distribution to your data? Why do you want to test normality (or lognormality or something else) of your data? Why do you want to make it normal? (By removing outliers maybe.)


Gergely

Babloo
Rhodochrosite | Level 12

Thanks for your valuable comments.

I need to conduct an ANOVA test on my samples. Hence I would expect my samples to be normal.

Would it be acceptable to remove the values that are less than or equal to 0 before applying the lognormal distribution? I also wonder why the various moments for my variable change (from non-normal to normal) when I switch from normal to lognormal. What is the significance of the lognormal distribution? Can it be applied to the variable 'balance/amount'?

gergely_batho
SAS Employee

You have different groups, and you want to determine whether mean balance is different between groups?

Yes: ANOVA requires that balance be normally distributed within each group. If it is not, you could transform your variable (log transform?), or apply some other test (proc genmod?).
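A rough sketch of the "transform, then ANOVA" route. The grouping variable group and the names balance_log and log_balance are hypothetical; Anova_data_new and tot_balanc are taken from the earlier post. Rows with zero or negative balances are left missing by the guard and are therefore excluded from the model, which is itself a modelling decision.

```sas
/* Log-transform the response; balance_log and log_balance are
   hypothetical names */
data balance_log;
   set Anova_data_new;
   if tot_balanc > 0 then log_balance = log(tot_balanc);
run;

/* One-way ANOVA on the log scale; 'group' is a hypothetical
   classification variable */
proc glm data=balance_log;
   class group;
   model log_balance = group;
   means group / hovtest=levene;   /* check homogeneity of variance */
run;
quit;
```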

If you remove some observations (zeros or outliers), your results (inference) will be valid only for a subpopulation. (For example, you may simply want to exclude accounts with a zero or very high balance because they do not represent the “real wealth of the account owners”.)

Or you need to have a very good explanation why you are removing them. (For example if 0 is a data error.)

Babloo
Rhodochrosite | Level 12

I need to determine whether mean balance is different between groups.


I agree with your point that 'If you remove some observations (zeros or outliers), your results (inference) will be valid only for a subpopulation. (For example, you may simply want to exclude accounts with a zero or very high balance because they do not represent the “real wealth of the account owners”.)'

Now I am wondering what values should replace the negative and zero values of my variable 'balance' before applying the lognormal distribution, because 25% of my data is less than or equal to zero.

SteveDenham
Jade | Level 19

This really looks to me like an example where a nonparametric approach will be useful.  There are zeroes and negative values and truly huge values in your data.  Rather than looking at whether the means are different, consider whether the medians are different.  A Mann-Whitney test, using PROC NPAR1WAY provides a distribution-free approach to this test.

Steve Denham
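A minimal sketch of the nonparametric comparison, assuming a hypothetical grouping variable named group in the dataset from the earlier post. The WILCOXON option produces the Wilcoxon rank-sum test for two groups (equivalent to Mann-Whitney) and the Kruskal-Wallis test when there are more than two; MEDIAN adds a median test.

```sas
/* Distribution-free comparison of balances across groups;
   'group' is a hypothetical classification variable */
proc npar1way data=Anova_data_new wilcoxon median;
   class group;
   var tot_balanc;
run;
```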

Babloo
Rhodochrosite | Level 12

Can we make inferences with a nonparametric test the way we do with a parametric test? For example, I can test 'model balance=income age' in ANOVA or GLM. Since my data is non-normal, how can I do a similar test with a nonparametric approach?

SteveDenham
Jade | Level 19

Time to read the documentation for PROC NPAR1WAY, paying particular attention to the examples there.

Steve Denham

Babloo
Rhodochrosite | Level 12

Thanks, I will have a look at the documentation for PROC NPAR1WAY.

With regard to normality, I got a bell curve with mean=median, and skewness (0.2) is also similar to kurtosis (-1.08), but the p-value is significant, e.g. p<0.001.

So can we assume that the data is normal in the scenario I described above, or do we still need to make it normal? I need to do ANOVA or GLM with that data.

Thanks!

SteveDenham
Jade | Level 19

Re: the statement:

With regard to normality, I got a bell curve with mean=median, and skewness (0.2) is also similar to kurtosis (-1.08), but the p-value is significant, e.g. p<0.001.

This is quite different from what was given above:

(With the actual data the mean is 5055 and the median is 68. Similarly, skewness and kurtosis are 8.5 and 166, respectively.)

So I assume this is after some sort of transform?

It still appears that the data are significantly different from being normally distributed, but that is not necessarily a stopping point.  Please state how you got such different results this time.

Also, it is not necessary that the raw data be distributed normally to meet the assumptions of analysis of variance.  The assumption is that the errors (residuals) be normally distributed.  This can be checked by fitting the model of interest, getting the residuals in an output dataset, and then checking them for normality.  For a Shapiro-Wilk test of normality, I would only reject the null hypothesis (of a normal distribution) if the P value were less than 0.001.

But much better than testing for normality would be looking at a QQ plot of the residuals.  If those basically fit the diagonal without anything unusual, I would trust that the data were such that the assumption is nearly met, and depend on the robustness of the method.

Now if you get some extreme bends anywhere in the QQ plot, the nonparametric approach is probably more powerful than standard ANOVA.

Steve Denham
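The residual check described above might be sketched as follows. The class variable group and the names resid_out and resid are hypothetical, and the one-way model is only a stand-in for whatever model is actually of interest.

```sas
/* Fit the model of interest and save the residuals;
   'group', resid_out and resid are hypothetical names */
proc glm data=Anova_data_new;
   class group;
   model tot_balanc = group;
   output out=resid_out r=resid;
run;
quit;

/* Test the residuals for normality and draw a Q-Q plot with a
   reference line based on the estimated mean and standard deviation */
proc univariate data=resid_out normal;
   var resid;
   qqplot resid / normal(mu=est sigma=est);
run;
```

Points that follow the reference line closely suggest the normality assumption is nearly met; pronounced bends point towards the nonparametric route.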

Babloo
Rhodochrosite | Level 12

With the actual raw data the mean is 5055 and the median is 68, and skewness and kurtosis are 8.5 and 166, respectively. I then binned the data (15 bins) and removed the outliers. Now the mean is 14006.9, the median is 14608.5, and skewness and kurtosis are 0.0013 and -1.2806, respectively. The record count dropped considerably, from 24000 to 600, and the normality test still gives p<0.001. I got a bell curve as well. Can my final sample (600 records) be considered reasonably normal?

Then I replaced the actual value of balance1 (or amount) with its log value (via balance1=log(balance1)) and fitted a lognormal distribution. That produced a bell curve with mean=median, and skewness (0.2) is also similar to kurtosis (-1.08).

Following your advice, I tried to draw a Q-Q plot to check normality with the code below. I've attached the plot as well. It looks non-normal, as most of the data lies towards the left, but I'm not sure.


ods graphics on;

options maxmemquery=6M;

proc univariate data=normality_new;

QQPLOT balance1 / lognormal (sigma=2); /*I don't know about sigma value here*/

run;

ods graphics off;
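If the sigma value is unknown, one option is to ask the procedure to estimate it rather than guess. The sketch below assumes balance1 still holds the untransformed, strictly positive values; if balance1 has already been replaced by its logarithm, a normal Q-Q plot is the appropriate comparison instead, as noted in the comment.

```sas
proc univariate data=normality_new;
   /* EST requests maximum likelihood estimates of the shape (sigma),
      threshold (theta) and scale (zeta) parameters, and adds a
      distribution reference line to the plot */
   qqplot balance1 / lognormal(sigma=est theta=est zeta=est);

   /* If balance1 is already on the log scale, compare against a
      normal distribution instead:
      qqplot balance1 / normal(mu=est sigma=est);                  */
run;
```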


So may I request you to view my plot and share your thoughts (or possibly verdict)?


Thanks.

Residual_QQ_Plot.png
