I am somewhat new to SAS programming and am attempting to code up a permutation test for a dependent sample test situation using the difference of the observations from two groups. I would like to simplify and generalize my code by developing a macro similar to the `allperm()` function but instead of using the difference values themselves, this macro would output permutations using a negative or positive version of each difference value.
For example, let's say the data look like this where four subjects are each measured twice (once in each of two conditions):
```
data one;
input sub y1 y2;
datalines;
1 244 225
2 255 247
3 253 249
4 253 254
run;
```
Because the permutation test need only use the difference of the observations, we can compute that as follows:
```
data two;
set two;
d = y1 - y2;
run;
```
Then the test could be performed with the following IML code:
```
proc iml;
use two;
read all var "d" into Y;
close;
N = 2**nrow(Y);
S = j(N, 1, .);
i = 1;
do a = 1 to 2;
do b = 1 to 2;
do c = 1 to 2;
do d = 1 to 2;
P = j(4, 1, .);
P[1] = 2 * (a - 1.5) * Y[1];
P[2] = 2 * (b - 1.5) * Y[2];
P[3] = 2 * (c - 1.5) * Y[3];
P[4] = 2 * (d - 1.5) * Y[4];
S[i] = abs(mean(P));
i = i + 1;
end;
end;
end;
end;
E = 0;
do i = 1 to N;
if S[i] >= mean(Y) then E = E + 1;
end;
print(E / N);
run;
```
As stated in the introduction, I would like to write a macro to replace the nested do loops similar to what the `allperm()` function does. This function should be generalized in the sense that it would work for a one-dimensional matrix of any length.
Here are some example input and desired outputs where the difference values are Y = {19 8 4 -1};
signperm(Y, 1) => {-19 -8 -4 1} /* first iteration produces the "negative" versions of the input */
signperm(Y, 2) => {-19 -8 -4 -1} /* because the later values change sign more rapidly, the second iteration produces the "positive" version of the last value */
signperm(Y, 4) => {-19 -8 4 -1}
signperm(Y, 😎 => {-19 8 4 -1}
signperm(Y, 16) => {19, 8, 4, -1} /* the last iteration produces all the "positive" versions of the input */
I suspect a recursive macro would be in order, but I keep getting tripped up when writing the macro syntax.
> 1. Do you then think it necessary to perform the repeated resampling as per examples, or can one just use each combination a single time? (This may be better asked at stackexchange)
Statistical tests that are computationally expensive come in two forms. The exact test will examine the exact sampling distribution of a statistic by using all possible combinations/permutations and obtain an exact p-value for the test. The Monte Carlo estimate of that test will consider a large number of random combinations/permutations to obtain an approximate sampling distribution and an exact p-value. For an example and discussion, see Monte Carlo simulation for contingency tables in SAS - The DO Loop
I can't advise you on how to proceed because I do not know what test you are running, or what the distribution of the statistic is. Reference?
2. Is there some way to call EXPANDGRID with a varying number of arguments?
Yes. Sure, anytime you want to "write SAS code that writes some SAS code," there are two options: CALL EXECUTE or SAS macro. Ian suggested CALL EXECUTE. Here is an example that uses the macro language:
data two;
set one nobs=n; /* get sample size */
d = y1 - y2;
call symputx("nobs", n); /* put it in a macro variable */
run;
/* create a comma separated list that repeats an argument n times */
%macro ListVal(v, nTimes);
%do i = 1 %to %eval(&nTimes-1);
&v,
%end;
&v
%mend;
proc iml;
s = "%ListVal({-1 1}, &nobs)"; /* view the result */
print s;
/* use it in ExpandGrid */
v = expandgrid( %ListVal({-1 1}, &nobs) );
print v;
> 3. Do you think there is any value here to trading memory efficiency for computational inefficiency?
An efficient IML program will vectorize operations (which means, turn them into linear algebra operations) and will avoid lots of loops over elements of vectors. So, in general, I would avoid the programming method that you are proposing. If memory is an issue, you can sometimes refactor a memory-intensive operation to use block computations. But I do not suggest that you start there. First, define your problem, explain the method you are trying to use, and write a program that works efficiently on small data.
Have moved this to SAS/IML board.
Koen
Thank you for moving this question to an appropriate board, and thank you for the references.
I had already come across the first two, and while they are helpful, they are building a reference distribution by resampling the permutations as opposed to calculating an exact reference distribution.
The third I had not seen before, and it looks promising in terms of its `perm()` function definition.
Do you have a reference for the statistical test you are trying to compute? This kind of permutation test is unfamiliar to me.
> This function should be generalized in the sense that it would work for a one-dimensional matrix of any length.
Your code includes the statement N = 2**nrow(Y), so you already understand that for a vector of length n, there are N = 2^n ways to vary the +/- signs of each entry. Thus, this algorithm won't work for arbitrarily large data. When the size of the data exceeds 20, your method is examining more than 2^20 > 1 million combinations.
A few suggestions:
1. You don't want to write a macro. You want to write a SAS IML module:
2. You don't want all permutations. What you want is vectors of length 4 that include all COMBINATIONS of +/-1. There are 4! = 24 permutations of 4 elements, but you want the 2^4 = 16 vectors that contain all combinations of +/-1. See Creating a matrix with all combinations of zeros and ones - The DO Loop (sas.com)
3. The easiest way to generate all 4-element vectors of +/-1 is to use the EXPANDGRID function, like this:
proc iml;
P = expandgrid( {-1 1}, {-1 1}, {-1 1}, {-1 1} );
print P;
4. You're four nested loops are not necessary. Instead, loop over the rows of the P matrix. If you are familiar with matrix-vector computations, you can actually use a matrix expression to compute all signed combinations.
Here's how you solve the problem for a data set that has 4 pairs. I'll wait until I get the algorithm reference/citation before I think about generalizing the problem:
proc iml;
use two;
read all var "d" into Y;
close;
V = expandgrid( {-1 1}, {-1 1}, {-1 1}, {-1 1} );
N = nrow(V);
P = V` # Y; /* elementwise multiplication of rows */
S = abs(mean(P)); /* mean of each column */
print S , (mean(Y))[L='mean(Y)'];;
E = sum(S >= mean(Y));
print E N (E/N)[L='ratio'];
I was working on updating some inherited presentation slides, but your request for a reference made me realize that this testing approach may not be entirely correct. The examples that I do find all perform repeated resampling of exchanged difference scores as opposed to only a single sample of each possible combination of exchanged difference scores. In other words, the reference distribution from the examples is made up of thousands of resamples as opposed to only one sample of each of the 16 combinations of interest in the case of four pairs of observations.
That said, I am still interested in this problem if only to learn more about programming in SAS.
Responses to suggestions:
1. Yes, I figured out that I am perhaps more interested in writing a SAS IML module than a macro after posting the question. Thank you for the references - especially your blog post as it presents much of the basic information I was after in a clear way.
2. You are correct, I am only interested in the 2^4 = 16 combinations for this example.
3. I considered EXPANDGRID, but note that its use is dependent on the size of the difference vector. For difference vectors of length 4 one can call as you did
proc iml;
P = expandgrid( {-1 1}, {-1 1}, {-1 1}, {-1 1} );
print P;
However, if the difference vector were instead of length 3, we would need to call the function with one less argument:
proc iml;
P = expandgrid( {-1 1}, {-1 1}, {-1 1} );
print P;
This would be a limiting factor to make the code more generalizable.
4. Thank you for the example solution. It is much cleaner and more compact than my work.
A note on large data:
One of my intents here was to trade memory efficiency for computational inefficiency. That is instead of computing the entire combination matrix in one go and holding it entirely in memory, I could instead compute each row of the matrix as needed and only hold the average differences in memory.
Followup questions:
1. Do you then think it necessary to perform the repeated resampling as per examples, or can one just use each combination a single time? (This may be better asked at stackexchange)
2. Is there some way to call EXPANDGRID with a varying number of arguments?
3. Do you think there is any value here to trading memory efficiency for computational inefficiency?
2. Is there some way to call EXPANDGRID with a varying number of arguments?
You could possibly build the the command that you want in a string and then use 'call execute'. But I note that what you want is essentially counting in binary, so you could create an IML module for this as follows :
proc iml;
start binmat(p);
a = 0:(2##p - 1);
b = cshape(putn(a, cat("binary",char(p),".")), 2##p, p, 1);
return( 2#(num(b) - 0.5) );
finish;
print (binmat(5)) [format=2.0];
quit;
> 1. Do you then think it necessary to perform the repeated resampling as per examples, or can one just use each combination a single time? (This may be better asked at stackexchange)
Statistical tests that are computationally expensive come in two forms. The exact test will examine the exact sampling distribution of a statistic by using all possible combinations/permutations and obtain an exact p-value for the test. The Monte Carlo estimate of that test will consider a large number of random combinations/permutations to obtain an approximate sampling distribution and an exact p-value. For an example and discussion, see Monte Carlo simulation for contingency tables in SAS - The DO Loop
I can't advise you on how to proceed because I do not know what test you are running, or what the distribution of the statistic is. Reference?
2. Is there some way to call EXPANDGRID with a varying number of arguments?
Yes. Sure, anytime you want to "write SAS code that writes some SAS code," there are two options: CALL EXECUTE or SAS macro. Ian suggested CALL EXECUTE. Here is an example that uses the macro language:
data two;
set one nobs=n; /* get sample size */
d = y1 - y2;
call symputx("nobs", n); /* put it in a macro variable */
run;
/* create a comma separated list that repeats an argument n times */
%macro ListVal(v, nTimes);
%do i = 1 %to %eval(&nTimes-1);
&v,
%end;
&v
%mend;
proc iml;
s = "%ListVal({-1 1}, &nobs)"; /* view the result */
print s;
/* use it in ExpandGrid */
v = expandgrid( %ListVal({-1 1}, &nobs) );
print v;
> 3. Do you think there is any value here to trading memory efficiency for computational inefficiency?
An efficient IML program will vectorize operations (which means, turn them into linear algebra operations) and will avoid lots of loops over elements of vectors. So, in general, I would avoid the programming method that you are proposing. If memory is an issue, you can sometimes refactor a memory-intensive operation to use block computations. But I do not suggest that you start there. First, define your problem, explain the method you are trying to use, and write a program that works efficiently on small data.
I don't see anything at anywhere in your question has something with permutations .
You just want change its sign +/- ?
data one;
input sub y1 y2;
datalines;
1 244 225
2 255 247
3 253 249
4 253 254
;
run;
data two;
set one;
d = y1 - y2;
run;
proc iml;
use two nobs n;
read all var {d} into Y;
close;
y=t(y);
t=j(n+1,n,-1);
t[loc((row(t)+col(t))>(n+1))]=1;
want=y#t;
print want;
quit;
In thinking about your problem, I was reminded of a paper I wrote in which I discuss the permutation test: 329-2010: Rediscovering SAS/IML® Software: Modern Data Analysis for the Practicing Statistician
If you look at the bottom of p. 12, you will find a random permutation test for paired data. In this test, the null hypothesis that there is no difference between the distribution of the X and the Y variable. Therefore, a bootstrap permutation test will randomly swap (with probability 0.5) the values of the X and Y variables. For the difference, this results in a change of the sign of the difference.
Perhaps this formulation is similar to the one that you are considering? If so, I think the program on p. 12 formulates the problem in a precise manner and solves it efficiently by using the bootstrap method.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.