Mann-Whitney U test

Posted 09-06-2018 07:14 AM
(12496 views)

Using the Mann-Whitney U test (Wilcoxon rank-sum test), I am comparing two groups to see whether they differ significantly. Because the two groups have almost the same median and mean, I expected the p-value to be very high. But the p-value was < 0.0001 (attached). Any ideas why the p-value is significant?

The p-value from the t-test was about 0.15.

__SAS code:__

proc NPAR1WAY data=nis.cdiselect wilcoxon;

class primary;

var age;

run;

17 REPLIES

The expected values differ significantly, so review how expected values are calculated.

Your two groups are large and imbalanced in size, so even a very small difference will be statistically significant. That does not imply practical significance.
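For reference, here is a sketch of the standard large-sample rank-sum calculation in its no-ties form. The symbols $n_1$, $n_2$, $N$ and $W_1$ (the rank sum of group 1) are notation introduced here, not taken from the attached output.

```latex
\[
E[W_1] = \frac{n_1 (N+1)}{2}, \qquad
\operatorname{Var}(W_1) = \frac{n_1 n_2 (N+1)}{12}, \qquad
z = \frac{W_1 - E[W_1]}{\sqrt{\operatorname{Var}(W_1)}}, \qquad
N = n_1 + n_2 .
\]
```

The expected rank sum scales with the group size, so two groups of unequal size have unequal expected sums by construction. For a fixed difference between the distributions, $z$ grows roughly like the square root of the sample size, which is why very large samples yield tiny p-values for small differences. When the data contain ties, the variance is adjusted downward, so the formula above is only the untied version.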

Can you show the code and results for the t-test?

PG

proc ttest data=nis.cdiselect;

class primary;

var age;

weight trendwt;

run;

The result of the above is attached.

After adding the WEIGHT trendwt statement, the p-value was 0.7473.

According to the Central Limit Theorem, can I say that the sample is normally distributed because the sample size is >= 30 (the data set nis.cdiselect has a weighted frequency of 3.5 million)? If so, can I use the t-test?

Thanks.

Under the CLT, the mean of the sample is approximately normally distributed, not the sample itself. You can check the sample distribution visually if you like.

What is the hypothesis? If it's that the means are the same, then yes, you can use the CLT to assume the means are normally distributed and a t-test to test for significant differences.

What happens if you look at the distribution curves, say as histograms?
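One way to look at the distributions is a comparative histogram, one panel per group. This is a minimal sketch that assumes the data set and variable names from the code posted earlier in the thread. It uses PROC UNIVARIATE, but other plotting procedures would work as well.

```sas
/* Sketch: one histogram of AGE per level of PRIMARY, stacked in rows. */
/* NOPRINT suppresses the summary tables; the histograms are still drawn. */
proc univariate data=nis.cdiselect noprint;
  class primary;              /* one panel per group */
  var age;
  histogram age / nrows=2;    /* vertical axis is percent by default */
run;
```

Because the default vertical scale is percent rather than count, the two panels should be comparable even though the groups differ greatly in size.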

You cannot assign weights to observations in the Wilcoxon rank sum test provided by NPAR1WAY. The weights that you are using might be designed expressly to balance the two samples.

PG

I didn't use a WEIGHT statement for the Mann-Whitney test (SAS won't run it with a WEIGHT statement in any case).

@docfermi wrote:

I didn't use a WEIGHT statement for the Mann-Whitney test (SAS won't run it with a WEIGHT statement in any case).

Exactly. This means you aren't using the same data in the two tests, which makes them inconsistent, so you cannot compare the results.

It's a weighted vs unweighted test.
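To compare like with like, both tests can be run on the same unweighted records. The sketch below reuses the code already posted in the thread, with the WEIGHT statement dropped. The MEDIAN option on PROC NPAR1WAY is an addition that was not suggested in the thread. It requests a median test alongside the Wilcoxon test.

```sas
/* Sketch: unweighted t-test on the same records as the rank-sum test. */
proc ttest data=nis.cdiselect;
  class primary;
  var age;                    /* no WEIGHT statement */
run;

/* Wilcoxon rank-sum test plus a median test (MEDIAN is an added option). */
proc npar1way data=nis.cdiselect wilcoxon median;
  class primary;
  var age;
run;
```

Even on identical data the two procedures can disagree, because the t-test compares means while the rank-sum test responds to any tendency for one group's values to be larger than the other's.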

I ran the unweighted t-test as well (see my initial post). It gave a p-value of about 0.15.

As I and others have mentioned, what do the distributions look like?

If you can see a difference, or the distributions are markedly different, that offers evidence in a particular direction that wouldn't be seen with a traditional box plot. You could have something similar to Anscombe's Quartet to some degree.

It is a left-skewed distribution. Thanks.

How many ties do you have in the ranked values?

From the documentation for NPAR1WAY:

The asymptotic tests might be less accurate when the distribution of the data is heavily tied. For such data, it might be appropriate to use the exact tests provided by PROC NPAR1WAY as described in the section Exact Tests.

Do you see a similar result if you take a random sample of say 10% of the records?

Or have you looked at any graphic representation of the data?

Does this graph imply equal or unequal medians to you:

proc sgplot data=nis.cdiselect;

vbar age / group=primary groupdisplay=cluster stat=freq;

run;
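The 10% random sample mentioned above could be drawn with PROC SURVEYSELECT. This is a sketch only: the output data set names and the seed are illustrative, and sampling within each level of PRIMARY (the STRATA statement) is a choice made here to keep the group proportions, not something specified in the thread.

```sas
/* STRATA requires the input to be sorted by the stratum variable. */
proc sort data=nis.cdiselect out=work.age_sorted;   /* illustrative name */
  by primary;
run;

/* Simple random sample of 10% of the records within each group. */
proc surveyselect data=work.age_sorted out=work.age_sample
                  method=srs samprate=0.10 seed=20180906;  /* arbitrary seed */
  strata primary;
run;

/* Rerun the rank-sum test on the smaller sample. */
proc npar1way data=work.age_sample wilcoxon;
  class primary;
  var age;
run;
```

If the p-value rises sharply on the subsample, that would be consistent with the original result being driven mainly by sample size.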

Do you mean adding "exact wilcoxon"? The result was the same.

I don't know how to check how many ties I have.

I also have not learned how to take a random sample of 10% of my records.

By equal or unequal medians, do you mean whether mean ~ median? If not, please explain. Mean is, median is 72 for both groups.

For your information, I added the results from the SAS code that you provided. Thanks.

@docfermi wrote:

Do you mean adding "exact wilcoxon"? The result was the same.

I don't know how to check how many ties I have.

I also have not learned how to take a random sample of 10% of my records.

By equal or unequal medians, do you mean whether mean ~ median? If not, please explain. Mean is, median is 72 for both groups.

For your information, I added the results from the SAS code that you provided. Thanks.

If the AGE variable is, as seems likely, "age in years" stored as an integer, you can use PROC FREQ to get exact counts. The graph shows 1) as many as 18,000 ties within just one of the groups and 2) age rounding to 5-year increments (those spikes). The larger spike at around 90 (hard to tell with the overlapping tick labels) is also interesting. It suggests that some other rule was used for that age, possibly a group of people "at least 90". It is the third-largest (blue) or second-largest (red) count, in an area of the histogram where the counts for the surrounding ages are declining.

The graph indicates to me that there is a difference in medians, as the "blue" group seems to have more of its members towards the upper age range than the red group. Notice that around the 20s the blue is maybe not quite twice as tall as the red, but up around the 60s (?) the blue is well over twice as tall. If the medians were similar, the ratio of the bar heights would be more similar across a wide range of the data.
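The PROC FREQ suggestion could look like the sketch below, assuming AGE is an integer. The output data set name WORK.AGE_COUNTS is illustrative. Each row of the output is one tie block: the number of records in a group that share the same age.

```sas
/* Count records per AGE value within each level of PRIMARY. */
proc freq data=nis.cdiselect;
  tables primary*age / noprint out=work.age_counts;  /* COUNT holds the tie size */
run;

/* Put the largest tie blocks first. */
proc sort data=work.age_counts;
  by descending count;
run;

/* Show the ten most heavily tied age values. */
proc print data=work.age_counts(obs=10);
run;
```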
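Reading medians off a bar chart is imprecise, so it may help to tabulate the quartiles per group directly. This is a minimal sketch using PROC MEANS with the thread's data set and variable names. The choice of statistics is an assumption about what is useful here.

```sas
/* Group sizes, means, and quartiles of AGE for each level of PRIMARY. */
proc means data=nis.cdiselect n mean q1 median q3 maxdec=1;
  class primary;
  var age;
run;
```

With integer ages, two groups can share the same median while their quartiles differ, which is a difference the rank-sum test would still pick up.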
