I have 4 separate datasets (A, B, C and D) with a different number of records in each.
A | B | C | D | |||
a1 | b1 | c1 | d1 | |||
a2 | b2 | c2 | d2 | |||
a3 | b3 | c3 | d3 | |||
a4 | b4 | d4 | ||||
a5 | b5 | |||||
a6 |
I need to produce a single dataset which would show all the possible distinct combinations, pulling 2 records at a time from the A dataset, and one at a time from the other datasets. Something like :
Part1 | Part2 | Part3 | Part4 | Part5 |
a1 | a2 | b1 | c1 | d1 |
a1 | a2 | b2 | c1 | d1 |
a1 | a2 | b1 | c2 | d1 |
a1 | a2 | b1 | c1 | d2 |
... | ... | ... | ... | ... |
Obviously this is a simplified version of the problem I have. I'm not familiar with SAS/IML but after exploring the possible solutions, IML articles keep showing up as a great option when dealing with combinations. Is it ? I've read a few of Rick Wicklin's articles on dealing with combinations, but they usually show a single list of items to work from. In my case, I have 4 list of items (A, B, C and D) and their lenght are all different, so they wouldn't form a typical matrix. Would a DATA step be a better solution ?
You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.
The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?
Your example does not specify.
If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.
The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D. A loop then writes the combinations to a data set.
data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;
proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;
/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;
AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,]; /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
append from result;
end;
close;
QUIT;
proc print data=AllComb(obs=30);
run;
If that is not what you want, then please explain further.
A Cartesian join in PROC SQL can do this, but ... caution ... it can result in very large data sets that take a very long time to create.
Here is example code using data set SASHELP.CLASS (since I don't have your data sets A B C D)
proc sql;
create table abc as select a1.name,a2.name as name2,b.sex,c.height
from sashelp.class as a1,sashelp.class as a2,sashelp.class as b,sashelp.class as c;
quit;
Working form a very small data set like SASHELP.CLASS which has 19 records, this Cartesian join produces a data set with 130,321 records.
You state: "Obviously this is a simplified version of the problem I have." I think your are going to wind up with a humongous data set that will take a very long time to create, and perhaps you need to re-consider doing this. Please explain why you need all possible combinations.
Let's assume that A through D are the names of the data sets, and Part1 through Part4 are the names of the fields within. (That's actually critical information for writing a program.) It seems like SQL should do the trick:
data A1 (rename=(part1=part1a))
A2 (rename=(part1=part1b));
set A;
run;
proc sql;
select * from A1, A2, B, C, D;
quit;
It also sounds like you would want to remove observations having part1a = part1b, as @PeterClemmensen suggests. (You beat me to it by 23 seconds!)
@Astounding nice.
Given the OPs posted want data set, you may want to limit the data so part1a ne part1b
data a;
input Part1 $;
datalines;
a1
a2
a3
a4
a5
a6
;
data b;
input Part2 $;
datalines;
b1
b2
b3
b4
b5
;
data c;
input Part3 $;
datalines;
c1
c2
c3
;
data d;
input Part4 $;
datalines;
d1
d2
d3
d4
;
data A1 (rename=(part1=part1a))
A2 (rename=(part1=part1b));
set A;
run;
proc sql;
select * from A1, A2, B, C, D
where part1a ne part1b;
quit;
You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.
The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?
Your example does not specify.
If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.
The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D. A loop then writes the combinations to a data set.
data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;
proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;
/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;
AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,]; /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
append from result;
end;
close;
QUIT;
proc print data=AllComb(obs=30);
run;
If that is not what you want, then please explain further.
> since SAS/IML keeps the results in memory, the risk is to run out of RAM, correct?
True. But as Paige pointed out, this entire method (Cartesian join) will become problematic long before you run out of RAM. His questions are important: Why do you want to do this? What problem are you trying to solve?
In real data, the number of possible combinations of values is usually far greater than the combinations that are realized. For example, you might have data for the countries
USA, Canada, UK
and the political parties
Democrat, Republican, Liberal, Conservative, New Democrat, Bloc Quebecois, Labour, Scottish National, ...
but most of the levels in a full Cartesian join of Country*Party are not going to appear in the data.
@zpc wrote:
Other variables in the datasets refer to dimension and cost. Those a1,b2,c4,etc values are ProductIDs which I am planning to then rejoin with the Size and Cost variables, trying to figure out all the possible combinations for shipment, and then pulling out the 10-15 optimal ones.
The cartesian join would actually be across more datasets, around 10 different products, e.g. A1,A2,B,C,D,E,F1,F2,F3,G. And the average dataset would have between 15 to 35 records (product IDs) each.
Even if there are just 15 records in each data set, and 10 data sets, that means you have 15**10 elements in a Cartesian join, which is over 576 billion records. Don't even attempt to do this.
There must be a better way. But this is really a weak explanation of what you are trying to do ... providing almost no detail on "... pulling out the 10-15 optimal ones". It seems to me that you should be selecting likely candidates (regardless of whether or not they are "optimal") rather than
all possible candidates, and determining which ones are optimal. But again, you have given us almost nothing to go on here, its really hard to propose any detailed way of doing things.
No need to create copies of table A. Simply name it twice in SQL. To eliminate duplicate combinations you can arbitrarly pick the one where the first part is greater than the second.
proc sql;
select * from A(rename=part1=part1a), A(rename=part1=part1b), B, C, D
where part1a gt part1b;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.