I have a dataset containing two numeric variables. I want to find every possible combination of pairs of these variables.
Note that I am not asking for all possible combinations of the two variables (eg. I could find this by performing a cross join). I want all possible combinations of each pair. For example, if the two variables just had 3 values each (1,2,3), I'd want something like the following:
Combination1 = 1&1 2&2 3&3
Combination2 = 1&1 2&3 3&2
Combination3 = 1&2 2&3 3&1
etc
I tried using proc transpose to create a matrix but couldn't think of a way to systematically move through the matrix recording each combination. Feels like it's solvable using a 2 dimensional array but I can't quite come up with the solution!
Thanks for your help.
Hello @Joe_O and welcome to the SAS Support Communities!
Try this:
/* Create sample data for demonstration */
data have;
do x=1 to 4;
y=x; output;
end;
run;
/* Determine number of observations in HAVE */
data _null_;
call symputx('n',n);
stop;
set have nobs=n;
run;
/* Create the combinations of pairs */
data want(keep=comb x y);
length comb 8;
array _x[&n] _temporary_;
array _y[&n] _temporary_;
set have end=last;
_x[_n_]=x;
_y[_n_]=y;
if last;
do comb=1 to fact(&n);
_c=allperm(comb, of _y[*]);
do _i=1 to &n;
x=_x[_i]; y=_y[_i]; output;
end;
end;
run;
The result can be reported in a wide structure if needed:
data _null_;
length combpairs $25; /* adapt length appropriately */
do until(last.comb);
set want;
by comb;
combpairs=catx(' ',combpairs,cats('(',x,',',y,')'));
end;
put comb 3. +2 combpairs;
run;
Result:
1 (1,1) (2,2) (3,3) (4,4) 2 (1,1) (2,2) (3,4) (4,3) 3 (1,1) (2,4) (3,2) (4,3) 4 (1,4) (2,1) (3,2) (4,3) 5 (1,4) (2,1) (3,3) (4,2) ... 24 (1,2) (2,1) (3,3) (4,4)
Edit: If x1, ..., xn are the values of the first variable and y1, ..., yn are the values of the second variable, the code creates all n! bijections of the set {1, ..., n} to itself and writes the corresponding pairs (xi, yj) together with the sequence number 1, ..., n! of the combination (=bijection) to the output dataset. Note that duplicates will occur (but could be eliminated) if there are duplicate values in either variable.
I'm not quite sure what your desired result looks like.
Suppose your data looks like this
data have;
input x y;
datalines;
1 1
2 2
3 3
;
What does your desired result look like?
Thanks for the reply. Sorry for not being more specific. I'm not particular about the exact structure as long as it is usable but something like the below would be ideal. Where x and y are the input variables as you suggested and Combination is the variable created by the program.
Combination | x | y |
Combination1 | 1 | 1 |
Combination1 | 2 | 2 |
Combination1 | 3 | 3 |
Combination2 | 1 | 1 |
Combination2 | 2 | 3 |
Combination2 | 3 | 2 |
Combination3 | 1 | 2 |
Combination3 | 2 | 3 |
Combination3 | 3 | 1 |
Combination5 | 1 | 3 |
Combination5 | 2 | 2 |
Combination5 | 3 | 1 |
Combination6 | 1 | 3 |
Combination6 | 2 | 1 |
Combination6 | 3 | 2 |
Note that x and y are only used once per "combination" and each "combination" uses all values of x and y.
@Joe_O wrote:
Thanks for the reply. Sorry for not being more specific. I'm not particular about the exact structure as long as it is usable but something like the below would be ideal. Where x and y are the input variables as you suggested and Combination is the variable created by the program.
Combination x y Combination1 1 1 Combination1 2 2 Combination1 3 3 Combination2 1 1 Combination2 2 3 Combination2 3 2 Combination3 1 2 Combination3 2 3 Combination3 3 1 Combination5 1 3 Combination5 2 2 Combination5 3 1 Combination6 1 3 Combination6 2 1 Combination6 3 2
Note that x and y are only used once per "combination" and each "combination" uses all values of x and y.
I don't really understand this. Can you show us just the input data (following these instructions so we can have a data set to work from — I will not use data provided in other forms) and then explain how you derive the variable named COMBINATION?
@Joe_O wrote:
My data has hundreds of observations but the datalines statement you posted is fine as a smaller example.
The Combination variable is just a name given to each group of pairs. Once the program has identified a combination of pairs, it just needs to give them a name so that they can be grouped later on.
I'm afraid this is not an explanation that I can use to provide programming for. How? How do you assign a value to combination? Is it random? Is there some process? If there is a process, we need step-by-step details of that process in order to turn it into a program.
Here's a do-loop in a DATA step solution. I assume in this case the numbers are consecutive, starting with 1 and ending with ENDNUM (which is a variable you can set, in this case I have set it to 4);
data want;
length string $ 8;
endnum=4;
do i=1 to endnum;
do j=(i+1) to endnum;
string =cats(i,',',j);
output;
end;
end;
drop endnum;
run;
Naturally, this can be made more flexible if you need to do this for non-consecutive numbers , or if the values don't start with 1.
Hello @Joe_O and welcome to the SAS Support Communities!
Try this:
/* Create sample data for demonstration */
data have;
do x=1 to 4;
y=x; output;
end;
run;
/* Determine number of observations in HAVE */
data _null_;
call symputx('n',n);
stop;
set have nobs=n;
run;
/* Create the combinations of pairs */
data want(keep=comb x y);
length comb 8;
array _x[&n] _temporary_;
array _y[&n] _temporary_;
set have end=last;
_x[_n_]=x;
_y[_n_]=y;
if last;
do comb=1 to fact(&n);
_c=allperm(comb, of _y[*]);
do _i=1 to &n;
x=_x[_i]; y=_y[_i]; output;
end;
end;
run;
The result can be reported in a wide structure if needed:
data _null_;
length combpairs $25; /* adapt length appropriately */
do until(last.comb);
set want;
by comb;
combpairs=catx(' ',combpairs,cats('(',x,',',y,')'));
end;
put comb 3. +2 combpairs;
run;
Result:
1 (1,1) (2,2) (3,3) (4,4) 2 (1,1) (2,2) (3,4) (4,3) 3 (1,1) (2,4) (3,2) (4,3) 4 (1,4) (2,1) (3,2) (4,3) 5 (1,4) (2,1) (3,3) (4,2) ... 24 (1,2) (2,1) (3,3) (4,4)
Edit: If x1, ..., xn are the values of the first variable and y1, ..., yn are the values of the second variable, the code creates all n! bijections of the set {1, ..., n} to itself and writes the corresponding pairs (xi, yj) together with the sequence number 1, ..., n! of the combination (=bijection) to the output dataset. Note that duplicates will occur (but could be eliminated) if there are duplicate values in either variable.
@Joe_O wrote:
Thanks so much for this solution- it is precisely what I'm looking for except for one thing - the ALLPERM function has a limit of 18 variables and I will need ~100. I should have specified in my original post. Is there a method that does not use this function?
Given that 18!=6,402,373,705,728,000, I can understand why there is that limit. Are there so many duplicate values (available for elimination) among those ~100 that the number of 100!=9.3E157 bijections would be reduced substantially?
@Joe_O wrote:
I think I will be able to reduce the list of possible pairs down. For example, if my list goes up to 10, I could say
x=1 may only be paired with y= 3,5 or 7".
x=2 may only be paired with y= 1,5,8 or 9".
This sounds more and more like a combinatorial problem (maybe combinatorial optimization if a minimization or maximization of some objective function comes into play) requiring powerful algorithms such as those available in SAS/OR (which I don't have). See the item "solving optimal assignment problems" in the list of capabilities. With 100, let alone hundreds of x and y values the complexity of the problem will most likely exceed what is feasible with a simple DATA step or procedures in Base SAS or even SAS/STAT. If you have access to SAS/OR, I'd recommend that you start a new thread in the Mathematical Optimization, Discrete-Event Simulation, and OR subforum after a search in the existing posts there (see, for example, Constraint setup for skills/abiities/assignment in bin packing problem optmodel). The SAS employee answering most of the questions in that subforum is an outstanding expert (as far as I can tell). Good luck!
@Joe_O wrote:
Thanks so much for this solution- it is precisely what I'm looking for except for one thing - the ALLPERM function has a limit of 18 variables and I will need ~100. I should have specified in my original post. Is there a method that does not use this function?
Do you mean ~100 observations, or do you mean ~100 variables?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.