BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
zpc
Fluorite | Level 6 zpc
Fluorite | Level 6

I have 4 separate datasets (A, B, C and D) with a different number of records in each.

 

A B C D
a1 b1 c1 d1
a2 b2 c2 d2
a3 b3 c3 d3
a4 b4   d4
a5 b5    
a6      

 

 

I need to produce a single dataset which would show all the possible distinct combinations, pulling 2 records at a time from the A dataset, and one at a time from the other datasets.  Something like :

Part1Part2Part3Part4Part5
a1a2b1c1d1
a1a2b2c1d1
a1a2b1c2d1
a1a2b1c1d2
...............

 

Obviously this is a simplified version of the problem I have.  I'm not familiar with SAS/IML but after exploring the possible solutions, IML articles keep showing up as a great option when dealing with combinations.  Is it ?  I've read a few of Rick Wicklin's articles on dealing with combinations, but they usually show a single list of items to work from.  In my case, I have 4 list of items (A, B, C and D) and their lenght are all different, so they wouldn't form a typical matrix.  Would a DATA step be a better solution ?

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.

 

The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?

Your example does not specify.

 

If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.

 

The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D.  A loop then writes the combinations to a data set.

 

data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;

proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;

/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;

AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,];  /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
   result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
   append from result;
end;
close;
QUIT;

proc print data=AllComb(obs=30);
run;

If that is not what you want, then please explain further.

 

 

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

A Cartesian join in PROC SQL can do this, but ... caution ... it can result in very large data sets that take a very long time to create.

 

Here is example code using data set SASHELP.CLASS (since I don't have your data sets A B C D)

 

proc sql;
    create table abc as select a1.name,a2.name as name2,b.sex,c.height
        from sashelp.class as a1,sashelp.class as a2,sashelp.class as b,sashelp.class as c;
quit;

 

Working form a very small data set like SASHELP.CLASS which has 19 records, this Cartesian join produces a data set with 130,321 records.

 

You state: "Obviously this is a simplified version of the problem I have." I think your are going to wind up with a humongous data set that will take a very long time to create, and perhaps you need to re-consider doing this. Please explain why you need all possible combinations.

--
Paige Miller
Astounding
PROC Star

Let's assume that A through D are the names of the data sets, and Part1 through Part4 are the names of the fields within.  (That's actually critical information for writing a program.)  It seems like SQL should do the trick:

data A1 (rename=(part1=part1a)) 
A2 (rename=(part1=part1b)); set A; run; proc sql; select * from A1, A2, B, C, D; quit;

It also sounds like you would want to remove observations having part1a = part1b, as @PeterClemmensen suggests.  (You beat me to it by 23 seconds!) 

PeterClemmensen
Tourmaline | Level 20

@Astounding nice. 

 

Given the OPs posted want data set, you may want to limit the data so part1a ne part1b

 

data a;
input Part1 $;
datalines;
a1
a2
a3
a4
a5
a6
;

data b;
input Part2 $;
datalines;
b1
b2
b3
b4
b5
;

data c;
input Part3 $;
datalines;
c1
c2
c3
;

data d;
input Part4 $;
datalines;
d1
d2
d3
d4
;

data A1 (rename=(part1=part1a)) 
     A2 (rename=(part1=part1b));
   set A;
run;

proc sql;
   select * from A1, A2, B, C, D
   where part1a ne part1b;
quit;
Rick_SAS
SAS Super FREQ

You can certainly do it in SAS ML, but as others have pointed out, Cartesian joins are available by using several methods in SAS.

 

The tricky part is that you seem to want to treat the A variable differently than B, C, and D. When you say you want distinct pairwise combinations, do you mean that you want (a1,a2), but not (a2,a1)?

Your example does not specify.

 

If you want pairwise combinations of the A values, then take the Cartesian join of the A data and then form the Cartesian join of that data with the join of the B, C, and D variables.

 

The following PROC IML program uses the ALLCOMB function to get the pairwise combinations for A and uses the EXPANDGRID function to get the Cartesian join for B, C, and D.  A loop then writes the combinations to a data set.

 

data A;
length A $2;
input A @@;
datalines;
a1 a2 a3 a4 a5 a6
;
data B;
length B $2;
input B @@;
datalines;
b1 b2 b3 b4 b5
;
data C;
length C $2;
input C @@;
datalines;
c1 c2 c3
;
data D;
length D $2;
input D @@;
datalines;
d1 d2 d3 d4
;

proc iml;
use A; read all var "A"; close;
/* pairs of values in A */
Aindex = allcomb(nrow(A), 2);
A1 = A[Aindex[,1]];
A2 = A[Aindex[,2]];
*print AIndex A1 A2;

/* Cartesian product of the other variables */
use B; read all var "B"; close;
use C; read all var "C"; close;
use D; read all var "D"; close;

AllBCD = ExpandGrid(B,C,D);
/* for each pair of A values, write the Cartesian product for BCD */
result = A1[1] || A2[1] || AllBCD[1,];  /* tell IML it is a character matrix */
create AllComb from result[c={A1 A2 B C D}];
n = nrow(AllBCD);
do i = 1 to nrow(AIndex);
   result = j(n,1,A1[i]) || j(n,1,A2[i]) || AllBCD;
   append from result;
end;
close;
QUIT;

proc print data=AllComb(obs=30);
run;

If that is not what you want, then please explain further.

 

 

zpc
Fluorite | Level 6 zpc
Fluorite | Level 6
Wow thank you Rick, yes do I need to avoid duplicating results, i.e. (a1,a2) means the same to me than (a2,a1). @PeterClemmensen 's solution is simpler but I do get duplicate combinations. Testing both solution, I get 1800 records with the proc sql vs 900 using your proc iml script.

From reading the documentation, since SAS/IML keeps the results in memory, the risk is to run out of RAM, correct ?

Rick_SAS
SAS Super FREQ

> since SAS/IML keeps the results in memory, the risk is to run out of RAM, correct?

True. But as Paige pointed out, this entire method (Cartesian join) will become problematic long before you run out of RAM. His questions are important: Why do you want to do this? What problem are you trying to solve?

 

In real data, the number of possible combinations of values is usually far greater than the combinations that are realized. For example, you might have data for the countries
USA, Canada, UK

and the political parties
Democrat, Republican, Liberal, Conservative, New Democrat, Bloc Quebecois, Labour, Scottish National, ...

but most of the levels in a full Cartesian join of Country*Party are not going to appear in the data.

zpc
Fluorite | Level 6 zpc
Fluorite | Level 6
Other variables in the datasets refer to dimension and cost. Those a1,b2,c4,etc values are ProductIDs which I am planning to then rejoin with the Size and Cost variables, trying to figure out all the possible combinations for shipment, and then pulling out the 10-15 optimal ones.

The cartesian join would actually be across more datasets, around 10 different products, e.g. A1,A2,B,C,D,E,F1,F2,F3,G. And the average dataset would have between 15 to 35 records (product IDs) each.
PaigeMiller
Diamond | Level 26

@zpc wrote:
Other variables in the datasets refer to dimension and cost. Those a1,b2,c4,etc values are ProductIDs which I am planning to then rejoin with the Size and Cost variables, trying to figure out all the possible combinations for shipment, and then pulling out the 10-15 optimal ones.

The cartesian join would actually be across more datasets, around 10 different products, e.g. A1,A2,B,C,D,E,F1,F2,F3,G. And the average dataset would have between 15 to 35 records (product IDs) each.

Even if there are just 15 records in each data set, and 10 data sets, that means you have 15**10 elements in a Cartesian join, which is over 576 billion records. Don't even attempt to do this.

 

There must be a better way. But this is really a weak explanation of what you are trying to do ... providing almost no detail on "... pulling out the 10-15 optimal ones". It seems to me that you should be selecting likely candidates (regardless of whether or not they are "optimal") rather than

all possible candidates, and determining which ones are optimal. But again, you have given us almost nothing to go on here, its really hard to propose any detailed way of doing things.

 

--
Paige Miller
zpc
Fluorite | Level 6 zpc
Fluorite | Level 6
I didn't realize how big this could grow. Like I mentioned I do need to take into account other variables as the size and cost which would limit the amount of possible combinations allowed. I've been searching through the forum and I think my question has a lot of similarities with the knapsack problem. Found this post that might be useful:

https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/Variation-of-Knapsack-Problem/td-p/569457
PGStats
Opal | Level 21

No need to create copies of table A. Simply name it twice in SQL. To eliminate duplicate combinations you can arbitrarly pick the one where the first part is greater than the second.

 

proc sql;
   select * from A(rename=part1=part1a), A(rename=part1=part1b), B, C, D
   where part1a gt part1b;
quit;
PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 10 replies
  • 1084 views
  • 8 likes
  • 6 in conversation