SAS Procedures

docfermi · 2018-09-06.pdf

Using the Mann-Whitney U test (Wilcoxon rank sum test), I am comparing two groups to see whether they are statistically different. Because the two groups have almost the same median and mean, I expected a very high p-value. But the p-value was < 0.0001 (attached). Any ideas why the p-value is significant?

The p-value from the t-test was about 0.15.

SAS code:

proc npar1way data=nis.cdiselect wilcoxon;
   class primary;
   var age;
run;

Reeza · Posted 09-06-2018 10:51 AM

The expected values differ significantly, so review how expected values are calculated.

Your two groups are large and imbalanced in size, so even a very small difference can be statistically significant. That does not imply practical significance.
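One way to judge practical significance is to put the group sizes next to the usual descriptive statistics. The following is only a minimal sketch. It assumes the data set and variable names used elsewhere in this thread (nis.cdiselect, primary, age) and is unweighted, like the NPAR1WAY run:

```sas
/* Descriptive statistics by group: sample size, center, and spread. */
/* Data set and variable names are taken from the original post.     */
proc means data=nis.cdiselect n mean std median q1 q3 min max maxdec=2;
   class primary;
   var age;
run;
```

If the medians and quartiles are nearly identical while N is very large in both groups, a tiny p-value mostly reflects the sample size rather than a meaningful difference.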

PGStats · Posted 09-06-2018 12:22 PM

Can you show the code and results for the t-test?

PG

docfermi · 2018-09-06 (2).pdf

proc ttest data=nis.cdiselect;
   class primary;
   var age;
   weight trendwt;
run;

The result of the above is attached.

After adding the "weight trendwt" statement, the p-value was 0.7473.

According to the Central Limit Theorem, can I say that the sample is normally distributed because the sample size is >= 30 (the data set nis.cdiselect has a weighted frequency of 3.5 million)? If so, can I use the t-test?

Thanks.

Reeza · Posted 09-06-2018 01:32 PM

Under the CLT, the mean of the sample is approximately normally distributed, not the sample itself. You can check the sample distribution visually if you like.

What is the hypothesis? If it's that the means are the same, then yes, you can use the CLT to assume the sample means are normally distributed and use a t-test to test for a significant difference.

What happens if you look at the distribution curves, say as histograms?
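A minimal sketch of the histogram check, assuming the same data set and variables as in the posted code (nis.cdiselect, primary, age). It draws one panel per group so the two shapes can be compared directly:

```sas
/* One histogram of age per level of primary, stacked in a single column */
/* so the two distributions share the same horizontal axis.              */
proc sgpanel data=nis.cdiselect;
   panelby primary / columns=1;
   /* Percent scale makes groups of very different size comparable. */
   histogram age / scale=percent;
run;
```

The percent scale is a choice made here for illustration. With frequency counts, the larger group would dominate the picture.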

PGStats · Posted 09-06-2018 01:35 PM

You cannot assign weights to observations in the Wilcoxon rank sum test provided by NPAR1WAY. The weights that you are using might be designed expressly to balance the two samples.

PG

docfermi · Posted 09-06-2018 02:39 PM

I didn't use a weight statement for the Mann-Whitney test (SAS won't run it with a weight statement in any case).

Reeza · Posted 09-06-2018 02:47 PM

@docfermi wrote:

I didn't use weight statement for Mann-Whitney (sas won't run with weight statement in any case).

Exactly. This means you aren't using the same data in the two tests, which makes them inconsistent, so you cannot compare the results.

It's a weighted test versus an unweighted test.

docfermi · Posted 09-06-2018 03:21 PM

I ran the unweighted t-test as well (see my initial post). Its p-value was about 0.15.

Reeza · Posted 09-06-2018 03:28 PM

As I and others have mentioned, what do the distributions look like?

If you can see a difference, or the distributions are markedly different, that offers evidence in a particular direction that wouldn't be seen with a traditional box plot. You could have something similar to Anscombe's Quartet to some degree.

docfermi · Posted 09-06-2018 03:34 PM

It is a left-skewed distribution. Thanks.

ballardw · Posted 09-06-2018 01:37 PM

How many ties do you have in the ranked values?

From the documentation for NPAR1WAY:

The asymptotic tests might be less accurate when the distribution of the data is heavily tied. For such data, it might be appropriate to use the exact tests provided by PROC NPAR1WAY as described in the section Exact Tests.

Do you see a similar result if you take a random sample of say 10% of the records?

Or have you looked at any graphic representation of the data?

Does this graph imply equal or unequal medians to you:

proc sgplot data=nis.cdiselect;
   vbar age/ group=primary  
             groupdisplay=cluster
             stat=freq
   ;
run;
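The 10% random sample suggested above could be drawn with PROC SURVEYSELECT and the test rerun on it. This is a hedged sketch: the output data set name work.cdi_sample and the seed value are made up for illustration, and the sample is a simple random sample that ignores any survey weights:

```sas
/* Draw a 10% simple random sample without replacement.         */
/* work.cdi_sample and the seed are hypothetical choices.       */
proc surveyselect data=nis.cdiselect out=work.cdi_sample
                  method=srs samprate=0.10 seed=20180906;
run;

/* Rerun the Wilcoxon rank sum test on the sample only. */
proc npar1way data=work.cdi_sample wilcoxon;
   class primary;
   var age;
run;
```

If the p-value grows noticeably on the smaller sample while the group medians stay about the same, that points to sample size as the main driver of the original result.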

docfermi · 2018-09-06, R2.pdf

Do you mean adding "exact wilcoxon"? The result was the same.

I don't know how to check how many ties I have.

Also, I have not learned how to take a random sample of 10% of my records.

By equal or unequal medians, do you mean whether mean ~ median? Otherwise please explain. Mean is, median is 72 for both groups.

For your information, I added the results from the sas code that you provided. Thanks.

ballardw · Posted 09-06-2018 04:47 PM

If the AGE variable is, as seems likely, "age in years" stored as an integer, you can use proc freq to get exact counts. The graph results show 1) as many as 18,000 ties within just one of the groups and 2) age rounding to 5-year increments (those spikes). The larger spike at around 90 (hard to tell with the overlapping tick labels) is also interesting, perhaps indicating that some other rule was used for that age, possibly a group of people "at least 90". It is the third (blue) or second (red) largest count, in an area of the histogram where the surrounding ages show a declining count.

The graph indicates to me that there is a difference in medians, as the "blue" group seems to have more of its members towards the upper age range than the red group. Notice that around the 20s the blue is maybe not quite twice as tall as the red, but up around the 60s (?) the blue is well over twice as tall. If the medians were similar, the ratio of the bar heights would be more similar across a wide range of the data.
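A minimal sketch of the proc freq approach to counting ties, assuming AGE is an integer as discussed above. The output data set name work.age_counts is hypothetical. Any age whose COUNT is greater than 1 is a tied value:

```sas
/* Count how many records share each age value within each group. */
/* NOPRINT suppresses the large crosstab; counts go to OUT=.       */
proc freq data=nis.cdiselect noprint;
   tables primary*age / out=work.age_counts;
run;

/* Put the most heavily tied ages first. */
proc sort data=work.age_counts;
   by descending count;
run;

/* Show the 20 group-by-age cells with the most ties. */
proc print data=work.age_counts (obs=20);
   var primary age count;
run;
```

Note that the Wilcoxon test ranks the pooled data, so the same age appearing in both groups is also a tie. Replacing primary*age with age in the TABLES statement gives the pooled counts.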

docfermi · Posted 09-06-2018 05:23 PM

Thank you so much! That's quite helpful. So, besides the normality assumption, is the distinction that the two-sample t-test looks for a difference in the means, whereas the Mann-Whitney U test looks for a difference in the medians?
