Solved: Re: Show all the possible distinct combinations from records in 4 diff...

zpc · Posted 08-12-2022 09:25 AM

I have 4 separate datasets (A, B, C and D) with a different number of records in each.

A	B	C	D
a1	b1	c1	d1
a2	b2	c2	d2
a3	b3	c3	d3
a4	b4		d4
a5	b5
a6

I need to produce a single dataset which would show all the possible distinct combinations, pulling 2 records at a time from the A dataset, and one at a time from the other datasets. Something like :

Part1	Part2	Part3	Part4	Part5
a1	a2	b1	c1	d1
a1	a2	b2	c1	d1
a1	a2	b1	c2	d1
a1	a2	b1	c1	d2
...	...	...	...	...

Obviously this is a simplified version of the problem I have. I'm not familiar with SAS/IML but after exploring the possible solutions, IML articles keep showing up as a great option when dealing with combinations. Is it ? I've read a few of Rick Wicklin's articles on dealing with combinations, but they usually show a single list of items to work from. In my case, I have 4 list of items (A, B, C and D) and their lenght are all different, so they wouldn't form a typical matrix. Would a DATA step be a better solution ?

Rick_SAS · Posted 08-12-2022 10:08 AM

You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.

The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?

Your example does not specify.

If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.

The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D. A loop then writes the combinations to a data set.

data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;

proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;

/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;

AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,];  /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
   result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
   append from result;
end;
close;
QUIT;

proc print data=AllComb(obs=30);
run;

If that is not what you want, then please explain further.

View solution in original post

PaigeMiller · Posted 08-12-2022 09:37 AM

A Cartesian join in PROC SQL can do this, but ... caution ... it can result in very large data sets that take a very long time to create.

Here is example code using data set SASHELP.CLASS (since I don't have your data sets A B C D)

proc sql;
    create table abc as select a1.name,a2.name as name2,b.sex,c.height
        from sashelp.class as a1,sashelp.class as a2,sashelp.class as b,sashelp.class as c;
quit;

Working form a very small data set like SASHELP.CLASS which has 19 records, this Cartesian join produces a data set with 130,321 records.

You state: "Obviously this is a simplified version of the problem I have." I think your are going to wind up with a humongous data set that will take a very long time to create, and perhaps you need to re-consider doing this. Please explain why you need all possible combinations.

--
Paige Miller

Astounding · Posted 08-12-2022 09:39 AM

Let's assume that A through D are the names of the data sets, and Part1 through Part4 are the names of the fields within. (That's actually critical information for writing a program.) It seems like SQL should do the trick:

data A1 (rename=(part1=part1a)) 
     A2 (rename=(part1=part1b));
   set A;
run;

proc sql;
   select * from A1, A2, B, C, D;
quit;

It also sounds like you would want to remove observations having part1a = part1b, as @PeterClemmensen suggests. (You beat me to it by 23 seconds!)

PeterClemmensen · Posted 08-12-2022 09:47 AM

@Astounding nice.

Given the OPs posted want data set, you may want to limit the data so part1a ne part1b

data a;
input Part1 $;
datalines;
a1
a2
a3
a4
a5
a6
;

data b;
input Part2 $;
datalines;
b1
b2
b3
b4
b5
;

data c;
input Part3 $;
datalines;
c1
c2
c3
;

data d;
input Part4 $;
datalines;
d1
d2
d3
d4
;

data A1 (rename=(part1=part1a)) 
     A2 (rename=(part1=part1b));
   set A;
run;

proc sql;
   select * from A1, A2, B, C, D
   where part1a ne part1b;
quit;

The DATA to DATA Step Macro
Blog: SASnrd

Rick_SAS · Posted 08-12-2022 10:08 AM

You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.

The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?

Your example does not specify.

If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.

The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D. A loop then writes the combinations to a data set.

data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;

proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;

/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;

AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,];  /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
   result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
   append from result;
end;
close;
QUIT;

proc print data=AllComb(obs=30);
run;

If that is not what you want, then please explain further.

zpc · Posted 08-12-2022 02:35 PM

Wow thank you Rick, yes do I need to avoid duplicating results, i.e. (a1,a2) means the same to me than (a2,a1). @PeterClemmensen 's solution is simpler but I do get duplicate combinations. Testing both solution, I get 1800 records with the proc sql vs 900 using your proc iml script.

From reading the documentation, since SAS/IML keeps the results in memory, the risk is to run out of RAM, correct ?

Rick_SAS · Posted 08-12-2022 03:06 PM

> since SAS/IML keeps the results in memory, the risk is to run out of RAM, correct?

True. But as Paige pointed out, this entire method (Cartesian join) will become problematic long before you run out of RAM. His questions are important: Why do you want to do this? What problem are you trying to solve?

In real data, the number of possible combinations of values is usually far greater than the combinations that are realized. For example, you might have data for the countries
USA, Canada, UK

and the political parties
Democrat, Republican, Liberal, Conservative, New Democrat, Bloc Quebecois, Labour, Scottish National, ...

but most of the levels in a full Cartesian join of Country*Party are not going to appear in the data.

zpc · Posted 08-12-2022 03:44 PM

Other variables in the datasets refer to dimension and cost. Those a1,b2,c4,etc values are ProductIDs which I am planning to then rejoin with the Size and Cost variables, trying to figure out all the possible combinations for shipment, and then pulling out the 10-15 optimal ones.

The cartesian join would actually be across more datasets, around 10 different products, e.g. A1,A2,B,C,D,E,F1,F2,F3,G. And the average dataset would have between 15 to 35 records (product IDs) each.

PaigeMiller · Posted 08-12-2022 04:15 PM

@zpc wrote:
Other variables in the datasets refer to dimension and cost. Those a1,b2,c4,etc values are ProductIDs which I am planning to then rejoin with the Size and Cost variables, trying to figure out all the possible combinations for shipment, and then pulling out the 10-15 optimal ones.

The cartesian join would actually be across more datasets, around 10 different products, e.g. A1,A2,B,C,D,E,F1,F2,F3,G. And the average dataset would have between 15 to 35 records (product IDs) each.

Even if there are just 15 records in each data set, and 10 data sets, that means you have 15**10 elements in a Cartesian join, which is over 576 billion records. Don't even attempt to do this.

There must be a better way. But this is really a weak explanation of what you are trying to do ... providing almost no detail on "... pulling out the 10-15 optimal ones". It seems to me that you should be selecting likely candidates (regardless of whether or not they are "optimal") rather than

all possible candidates, and determining which ones are optimal. But again, you have given us almost nothing to go on here, its really hard to propose any detailed way of doing things.

--
Paige Miller

zpc · Posted 08-12-2022 06:06 PM

I didn't realize how big this could grow. Like I mentioned I do need to take into account other variables as the size and cost which would limit the amount of possible combinations allowed. I've been searching through the forum and I think my question has a lot of similarities with the knapsack problem. Found this post that might be useful:

https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/Variation-of-Knapsack-Problem/td-p/569457

PGStats · Posted 08-12-2022 02:53 PM

No need to create copies of table A. Simply name it twice in SQL. To eliminate duplicate combinations you can arbitrarly pick the one where the first part is greater than the second.

proc sql;
   select * from A(rename=part1=part1a), A(rename=part1=part1b), B, C, D
   where part1a gt part1b;
quit;

PG

Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets

Re: Show all the possible distinct combinations from records in 4 different datasets