BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pywils
Fluorite | Level 6

Hello,

 

I am having difficulty understanding how SAS derives the p value for the Wilcoxon Signed Rank test for small n.

 

I understand how to get S, the test statistic for the Wilcoxon Signed Rank Test (see documentation here). However, I don't understand how to go from S to the pvalue. The documentation (link again) says, for small n, "the significance of S is computed from the exact distribution of S, where the distribution is a convolution of scaled binomial distributions." What exactly does that mean? 

 

Here is an example: 

 

 

data test3;
input v1
	  v2;
datalines;
2 1
1 0 
3 0 
1 0 
;

data test4;
	set test3;
	diffVar = v1 - v2;
run; 

proc univariate data = test4;
	var diffVar;
run; 

 

 

We see that S = 5, but why is Pr >= |S| = 0.1250? 

Thank you!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @pywils,

 

Interesting question. Looking into pp. 129 f. of the book by Lehmann and D’Abrera (1975) the documentation refers to, I think the argument (applied to your example) is as follows:

 

The ranks of the four diffVar values 1, 1, 3, 1 are 2, 2, 4, 2, respectively (where 2 is the average rank of the original ranks 1, 2 and 3, which are assigned to the three tied values diffVar=1). Under the null hypothesis (of symmetry about 0) the sign of each of the diffVar values could be positive or negative with probability 1/2. Due to independence, the 2**4=16 possible sign combinations are equally likely (probability 1/16).

 

The first term in the formula of the Wilcoxon signed-rank test statistic S is the sum of the ranks of positive differences:

Positive ranks None 2 4 2, 2 2, 4 2, 2, 2 2, 2, 4 2, 2, 2, 4
Probability 1/16 3/16 1/16 3/16 3/16 1/16 3/16 1/16

 

For example, 3 of the 16 possible sign combinations yield one positive rank 2 and one positive rank 4, hence the probability 3/16. Each of the four ranks 2, 2, 4, 2 contributes its value -- independently -- with probability 1/2 to the sum. So their distributions are Bernoulli distributions (binomial with parameters n=1 and p=1/2), but with a "scale" factor of 2 or 4, respectively. The distribution of a sum of independent random variables is called a convolution. A simulation of this "convolution of scaled binomial distributions" and then S (after subtracting the expectation 4*(4+1)/4=5) could look like this:

data sim;
call streaminit(314159);
array r[4] _temporary_;
do i=1 to 1e6;
  r[1]=2*rand('bern',0.5);
  r[2]=2*rand('bern',0.5);
  r[3]=4*rand('bern',0.5);
  r[4]=2*rand('bern',0.5);
  S=sum(of r[*])-5;
  output;
end;
run;

proc freq data=sim;
tables S;
run;

Your observed case "2, 2, 2, 4" yields S=5 and is the only one with S>=5, hence the two-sided p-value is twice the probability of obtaining S=5, i.e. 2*1/16=1/8=0.125.

View solution in original post

3 REPLIES 3
FreelanceReinh
Jade | Level 19

Hello @pywils,

 

Interesting question. Looking into pp. 129 f. of the book by Lehmann and D’Abrera (1975) the documentation refers to, I think the argument (applied to your example) is as follows:

 

The ranks of the four diffVar values 1, 1, 3, 1 are 2, 2, 4, 2, respectively (where 2 is the average rank of the original ranks 1, 2 and 3, which are assigned to the three tied values diffVar=1). Under the null hypothesis (of symmetry about 0) the sign of each of the diffVar values could be positive or negative with probability 1/2. Due to independence, the 2**4=16 possible sign combinations are equally likely (probability 1/16).

 

The first term in the formula of the Wilcoxon signed-rank test statistic S is the sum of the ranks of positive differences:

Positive ranks None 2 4 2, 2 2, 4 2, 2, 2 2, 2, 4 2, 2, 2, 4
Probability 1/16 3/16 1/16 3/16 3/16 1/16 3/16 1/16

 

For example, 3 of the 16 possible sign combinations yield one positive rank 2 and one positive rank 4, hence the probability 3/16. Each of the four ranks 2, 2, 4, 2 contributes its value -- independently -- with probability 1/2 to the sum. So their distributions are Bernoulli distributions (binomial with parameters n=1 and p=1/2), but with a "scale" factor of 2 or 4, respectively. The distribution of a sum of independent random variables is called a convolution. A simulation of this "convolution of scaled binomial distributions" and then S (after subtracting the expectation 4*(4+1)/4=5) could look like this:

data sim;
call streaminit(314159);
array r[4] _temporary_;
do i=1 to 1e6;
  r[1]=2*rand('bern',0.5);
  r[2]=2*rand('bern',0.5);
  r[3]=4*rand('bern',0.5);
  r[4]=2*rand('bern',0.5);
  S=sum(of r[*])-5;
  output;
end;
run;

proc freq data=sim;
tables S;
run;

Your observed case "2, 2, 2, 4" yields S=5 and is the only one with S>=5, hence the two-sided p-value is twice the probability of obtaining S=5, i.e. 2*1/16=1/8=0.125.

ballardw
Super User

When you get something like "the significance of S is computed from the exact distribution of S". That means that ALL permutations of the values are created. Then you count the numbers that have x or more of interest.

 

For a smaller sample (4) you can look at this:

data small;
   do i=0,1;
   do j=0,1;
   do k=0,1;
   do l=0,1;
      samp = cats(i,j,k,l);
num =sum(i,j,k,l);
output; end; end; end; end; run; proc freq data=small; tables samp num; run;

Each of the results has the same probability of occuring 0.0625 (6.25%). So you can look at the totals of the 1(the + in sign rank) and see that the probability of getting 3 or more, as an example, is the sum of ways you get 3 or 4. 25+6.25% or 31.25. or P(s >=3)=.3125 , P(s=4)=.0625.

 

Left as an exercise is to create the distribution for 5 elements and see what P(s=5) comes out as.

 

Yes there is more efficient coding for larger sample sizes. The above is very easy to follow for beginners.

 

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 726 views
  • 1 like
  • 3 in conversation