Reeza
Super User

Babloo wrote:

The record count is reduced considerably, from 24000 to 600, whereas p<0.001.

That would worry me. It seems like you've thrown away most of your data to obtain normality, which means any results are probably no longer representative of your true business.

Babloo
Rhodochrosite | Level 12

Then what should be the correct method?

When I ran the non-parametric method with the code below, it took hours to complete. After consulting the documentation, I understood that it works well for limited samples (say, fewer than 1000), whereas my data has 25000 observations.

proc npar1way wilcoxon data=want;

class income; /* categorized into three levels*/

var balance1;

exact;

run;

Any thoughts on how to obtain normality, or on the best-suited non-parametric procs?

Thanks.

gergely_batho
SAS Employee

I think you don't need the "exact" statement. With so many observations, your result will be "exact enough" with a simple (asymptotic) Kruskal-Wallis Test.

The EXACT statement triggers a very expensive algorithm; it will take forever with 25000 obs.
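If an exact-style p-value is still wanted, one option that may be worth trying is a Monte Carlo estimate of the exact test rather than the full enumeration. A minimal sketch, reusing the data set and variable names from the question; the number of samples and the seed are arbitrary values chosen for illustration only:

```sas
proc npar1way wilcoxon data=want;
  class income;
  var balance1;
  /* MC requests a Monte Carlo estimate of the exact p-value.        */
  /* N= (number of samples) and SEED= are illustrative assumptions.  */
  exact wilcoxon / mc n=10000 seed=12345;
run;
```

Leaving the EXACT statement out altogether gives the asymptotic test described above.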

Babloo
Rhodochrosite | Level 12

Many thanks for your reply.

The p-value is <0.0001 with the code below:

proc npar1way wilcoxon  data=want;

class new_age;

var balance;

run;

Do we need to interpret only the p-value from the output, or anything else?

Before I close this thread, I would like to clarify the points below.

1. From the documentation, there is no separate keyword for Kruskal-Wallis test. Am I right?

WILCOXON

requests an analysis of Wilcoxon scores. When there are two classification levels (samples), this option produces the Wilcoxon rank-sum test. For any number of classification levels, this option produces the Kruskal-Wallis test. See the section Wilcoxon Scores for more information.

2. If I wish to include 2 classification variables (age, income) in this test, which proc should I use?

3. With regard to normal and log-normal distributions, is there any difference when interpreting the outputs? I ask because we're replacing the actual value of an independent variable with its log.

Thanks again for your inputs!

Reeza
Super User

1. If the documentation says that :)

2. Depends on what you mean. If you're testing by 2 different variables, i.e. 2 separate tests, you should probably run two separate procs. If you're testing the interaction of age and income, create an age-income variable and test via the new variable. The same proc NPAR1WAY is appropriate.

3. If you're using non-parametric testing do you still need the transformation?

Reeza
Super User

You should also create box plots of your data for visual comparison, and/or histograms, if you haven't already.
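As a rough sketch of what those plots could look like, assuming the data set `want` with the analysis variable `balance` and the grouping variable `income` used earlier in the thread (swap in `new_age` or another class variable as needed):

```sas
/* Side-by-side box plots of balance, one box per income level */
proc sgplot data=want;
  vbox balance / category=income;
run;

/* One histogram of balance per income level */
proc sgpanel data=want;
  panelby income;
  histogram balance;
run;
```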

Babloo
Rhodochrosite | Level 12

Thanks for your reply.

My apologies for the follow up questions

2. Depends on what you mean. If you're testing by 2 different variables, i.e. 2 separate tests, you should probably run two separate procs. If you're testing the interaction of age and income, create an age-income variable and test via the new variable. The same proc NPAR1WAY is appropriate.

- When you say "create an age-income variable", do you want me to concatenate both variables? I ask because they are different indicators.

3. If you're using non-parametric testing do you still need the transformation?

- If I use parametric testing for other subjects, how should I interpret the log-normal transformation?

Reeza
Super User

2. Yes, concatenate. They are different indicators, but if you're testing the interaction that's what you need to do, i.e. 30-40 year olds making <30K is one category and 30-40 year olds making >30K is another category.
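A minimal sketch of that concatenation, assuming the variables `new_age`, `income` and `balance` from the earlier posts all live in the data set `want`; the names `want2` and `age_income` and the length of 40 are made up for illustration:

```sas
data want2;
  set want;
  length age_income $40;
  /* One level per age/income combination, e.g. "30-40_<30K" */
  age_income = catx('_', new_age, income);
run;

/* Kruskal-Wallis test across the combined categories */
proc npar1way wilcoxon data=want2;
  class age_income;
  var balance;
run;
```

Note that this tests whether any of the combined categories differ; it does not separate the age effect, the income effect and their interaction the way a two-way ANOVA would.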

3. Too generic a question, it depends on what testing you're using. Sometimes you can back transform, other times you can't.

Babloo
Rhodochrosite | Level 12

Many thanks for your response.

3. Too generic a question, it depends on what testing you're using. Sometimes you can back transform, other times you can't.

- Could you tell me whether we can back-transform for an ANOVA test (or possibly another parametric test) in my scenario listed above?

gergely_batho
SAS Employee

Hi Babloo,

We don't know exactly what the purpose of your study is.

If you just want to determine (statistically prove) that there is a difference between mean balances, then you can conduct a nonparametric ANOVA test, or maybe (log?) transform your variable and do an ANOVA. You don't need to back-transform, because at that point you are finished. (But don't forget to test the normality of the residuals, as Steve mentions.)

If you want to interpret results, you again don't need to back-transform. If you did a log transform, then your parameter estimates describe an (approximate) percentage change, instead of a change in the value on the original scale.
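To make the percentage reading concrete: a coefficient b on the log scale corresponds to multiplying the original-scale value by exp(b), which is close to a 100*b percent change only when b is small. A tiny sketch with a made-up coefficient (the value 0.05 is purely hypothetical, not taken from any output in this thread):

```sas
data _null_;
  b = 0.05;                          /* hypothetical estimate on the log scale */
  pct_change = 100 * (exp(b) - 1);   /* implied percent change on the original scale */
  put 'Percent change on the original scale: ' pct_change 8.3;
run;
```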

If you are interested only in predictions ("What will be the expected balance of a new client with age=30, income=100?"), then you don't need to worry so much about normality; just do an ANOVA (regression with indicator variables in your case, but I would also consider treating age and income as continuous variables, i.e. not binning them).

If you are interested in predictions and prediction intervals ("In what range will the balance of a typical age=30, income=100 client fall?"), you again need to care about the distribution.

3. A simple back-transform of the predictions (if you used a log transform) is: prediction_median_on_original_scale=exp(prediction_mean_on_log_scale)
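One way this workflow could look in code is sketched below. It assumes the variables `balance`, `new_age` and `income` from the earlier posts are in the data set `want` and that balances are strictly positive; the names `want_log`, `pred`, `pred_log`, `resid_log` and `pred_median` are invented for the example, and PROC GLM is just one possible choice of procedure.

```sas
/* Log-transform the response; non-positive balances are left missing */
data want_log;
  set want;
  if balance > 0 then log_balance = log(balance);
run;

/* ANOVA on the log scale, saving predictions and residuals */
proc glm data=want_log;
  class new_age income;
  model log_balance = new_age income;
  output out=pred p=pred_log r=resid_log;
run;
quit;

/* Back-transform: exp of the log-scale mean estimates the median */
data pred;
  set pred;
  pred_median = exp(pred_log);
run;

/* Check the normality of the residuals, not of the raw variable */
proc univariate data=pred normal;
  var resid_log;
  qqplot resid_log / normal(mu=est sigma=est);
run;
```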

Babloo
Rhodochrosite | Level 12

Any inputs on my questions?

Thanks.

Babloo
Rhodochrosite | Level 12

I don't see much difference in my output with the 'Normaltest' option. My code is below.

proc univariate data=normality normaltest ;

var balance;

histogram /normal (color=red);

run;

Any other suggestions for producing normally distributed data?

Thanks!

Frankdanny
Calcite | Level 5

You can also say that a variable is normally distributed when at least 95% of the observations fall in the range:

(average - 2*stddev) ,(average + 2*stddev)

So you can write:

%macro normality_test(lib, ds, var);

/* Summary statistics for &var.; the input data set is left unchanged */
proc means data=&lib..&ds. noprint;
  var &var.;
  output out=work._normstats mean=_mean std=_std n=_nobs cv=_cv;
run;

data _null_;
  /* Read the one-row statistics table once; its values are retained */
  if _n_ = 1 then set work._normstats;
  set &lib..&ds. end=_last;

  /* Count observations within mean +/- 2 standard deviations */
  if (_mean - 2*_std) le &var. le (_mean + 2*_std) then _sum_check + 1;

  /* Accumulate absolute deviations for the mean absolute deviation */
  if not missing(&var.) then _sum_abs + abs(&var. - _mean);

  if _last then do;
    _pct = 100 * _sum_check / _nobs;
    call symputx('pct', put(_pct, 8.2));
    call symputx('CV', _cv);
    call symputx('MAD', _sum_abs / _nobs);
    if _pct ge 95 then put '--------------->>>>> The variable is normally distributed';
    else put '--------------->>>>> The variable is NOT normally distributed';
  end;
run;

%put &pct.;

%mend normality_test;

Then, for example, call the macro:

%normality_test(work,test,x);

ballardw
Super User

I would disagree with that being a universal test for normality. If the variable only takes one value it will pass this test and is obviously not normal.

SteveDenham
Jade | Level 19

I agree. So will a bimodal distribution with very narrow standard errors, or a Cauchy distribution.

Steve Denham
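The objection is easy to check by simulation. The sketch below uses a uniform distribution, which is an example chosen here rather than one of the cases named above: for uniform(0,1) data the standard deviation is about 0.29, so mean +/- 2 SD covers the whole range and the 95% rule is satisfied even though the data are clearly not normal. The data set names, sample size and seed are arbitrary.

```sas
/* Simulate clearly non-normal (uniform) data */
data unif;
  call streaminit(2024);
  do i = 1 to 10000;
    x = rand('uniform');
    output;
  end;
run;

proc means data=unif noprint;
  var x;
  output out=stats mean=m std=s n=n;
run;

/* Share of observations inside mean +/- 2 SD */
data _null_;
  if _n_ = 1 then set stats;
  set unif end=last;
  if (m - 2*s) le x le (m + 2*s) then inside + 1;
  if last then do;
    pct = 100 * inside / n;
    put 'Percent within mean +/- 2 SD: ' pct 6.2;
  end;
run;

/* Formal tests and a histogram for comparison */
proc univariate data=unif normal;
  var x;
  histogram x / normal;
run;
```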
