Solved: Re: Using a hash merge for a cartesian product with no key

Ani7 · Posted 03-24-2022 10:08 AM

I currently have two tables, table_a and table_b with one variable each and 3 distinct values and have to merge them to create a table with every possible combination of the two variables across the two tables. I currently am using a proc sql to do so but was wondering if there was a way to do it using a hash merge (since the resultant merged table with my actual data ends up being a few billion rows long).

The code I currently use is as follows:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

proc sql;
	create table 	table_a_b
	as
	select distinct a.*, b.*
	from 			table_a as a
	left join 		table_b as b
	on 				1 = 1
;quit;

yabwon · Posted 03-24-2022 10:39 AM

For Cartesian you can do a loop:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

%let nobs = 3;
data want;
  array table_a[&nobs.] $ _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_b end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      output;
    end;
  end;
run;
proc print;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

Kurt_Bremser · Posted 03-24-2022 10:24 AM

How many observations are in those real datasets? And what is the length of the character variables?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ani7 · Posted 03-24-2022 11:04 AM

So in my actual code, table_a has ~110,000 observations and table_b has 140 observations. I then have to merge it with a table_c with 207 observations. This ends up with a table of ~3 billion rows.

yabwon · Posted 03-24-2022 11:17 AM

For your data it could be something like this:

data table_a;
	do numbers = 1 to 110000;
    output;
  end;
run;

data table_b;
  length letters1 $ 8;
  do _N_ = 1 to 140;
    letters1 = cats("A", _N_);
    output;
  end;
run;

data table_c;
  length letters2 $ 8;
  do _N_ = 1 to 207;
    letters2 = cats("B", _N_);
    output;
  end;
run;

proc sql;
  create table table_d as
  select * from
  table_b, table_c;
quit;

%let nobs = 110000;
data want;
  array table_a[&nobs.] _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_d end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      /*output;*/ /* <- commented out to prevent data b00m :-) */
    end;
  end;
run;

But the question is, what are you planing to do wit this result?

Maybe you can do this in "parts", e.g. splitting "tabel_d" (from mu code) into 10 smaller one, and do the process separately?

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser · Posted 03-24-2022 12:08 PM

Since you do not need the search capability of a hash, temporary arrays are the way to go:

proc sql noprint;
select nobs into :size_b
from dictionary.tables
where libname = "WORK" and memname = "B";
select nobs into :size_c
from dictionary.tables
where libname = "WORK" and memname = "C";
quit;

data want;
set a;
array ar_b {&size_b.} $;
array ar_c {&size_c.} $;
if _n_ = 1
then do;
  do b = 1 to &size_b.;
    set b;
    ar_b{b} = numbers; /* or whatever the variable is called in dataset b */
  end;
  do c = 1 to &size_c.;
    set c;
    ar_c{c} = letters;
  end;
end;
do b = 1 to size_b.;
  do c = 1 to size_c.;
    numbers = ar_b{b};
    letters = ar_c{c};
    output;
  end;
end;
drop b c;
run;

Temporary arrays are the fastest structures you can have in SAS. If you need to read more than one variable from b and/or c, add additional arrays.

The nice thing here is also that no sorting happens behind the scenes; everything is done sequentially.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 03-24-2022 03:57 PM

So you have three tables, not two. There are two very small (B and C) and one large (A). If this is the case, then DON'T convert table A into a hash. Instead

Read B and C into independent hashes
Create a third hash (BC) as the cartesian product of the first 2. It will still be a small object.
Then read A one obs at a time. Each time iterate through BC and output one obs per iteration.

The resulting data will be in the physical order of table A, but within each A, there is no guarantee of order.

Using this simulated data, with the szies you mention for tables A, B, and C:

data table_a;
  do numbers=1 to 110000; output; end;
run;
data table_b (keep=letters);
  length letters $2 ;
  do L1=rank('A') to rank('Z') until (nout=140);
    letters=byte(L1);
    do L2=rank('A') to rank('Z') until (nout=140);
      substr(Letters,2)=byte(L2);
      output;
      nout+1;
    end;
  end;
run;
data table_c;
  do other=-207 to -1; output; end;
run;

data template_b_c;
  set table_b table_c;
  stop;
run;

The program below creates a 3-way cartesian product and took 2:19 (139 seconds) on my machine. But notice it is a DATA _NULL_ step, so the output statement does not actually write to disk. All I am measuring is the time it takes to read in B, C, and A, and cross all their observations. The rest will depend on the speed of your output device and output channel. Whatever total time the program would take to create the dataset you want, will be almost solely driven by the time to write the data, not the time to do the cross-products.

data _null_ (drop=_:);
  set table_a;
  if _n_ =1 then do;
    if 0 then set table_b;
    declare hash b (dataset:'table_b');
      b.definekey('letters');
      b.definedata(all:'Y');
      b.definedone();
    declare hiter bi ('b');
    if 0 then set table_c;
    declare hash c (dataset:'table_c');
      c.definekey('other');
      c.definedata(all:'Y');
      c.definedone();
    declare hiter ci ('c');
    declare hash bc (dataset:'template_b_c'); 
      bc.definekey(all:'Y');
      bc.definedata(all:'Y');
      bc.definedone();
    declare hiter bci ('bc');
    do _rb=bi.first() by 0 until (bi.next()^=0);
      do _rc=ci.first() by 0 until (ci.next()^=0);
        bc.add();
      end;
    end;
  end;
  do _rc=bci.first() by 0 until (bci.next()^=0);
    output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Astounding · Posted 03-24-2022 10:26 AM

Why would it matter if there is another way? Wouldn't the end result be the same?

Ani7 · Posted 03-24-2022 11:06 AM

Yes but the actual tables in my code are much larger and the resultant merged table ends up being about 3 billion rows which takes 5-6 hours to run. A hash merge could potentially be quicker here.

yabwon · Posted 03-24-2022 10:39 AM

For Cartesian you can do a loop:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

%let nobs = 3;
data want;
  array table_a[&nobs.] $ _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_b end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      output;
    end;
  end;
run;
proc print;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ani7 · Posted 03-24-2022 11:07 AM

Would this be quicker than the proc sql merge? The main reason for a hash merge here is for speed/time constraints.

yabwon · Posted 03-24-2022 11:41 AM

Process of looping over your data (110000 * 140 * 207) without(!) output process took 27 seconds on my laptop.

Adding "output" overhead it will be longer. My result, before I run out of disk space, was:

ERROR: Insufficient space in file OUT.WANT.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 110000 observations read from the data set WORK.TABLE_A.
NOTE: There were 23606 observations read from the data set WORK.TABLE_D.
WARNING: The data set OUT.WANT may be incomplete.  When this step was stopped
         there were 2596514631 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14:39.18
      cpu time            4:59.76

So it was about 81% of total (and ~61GB) in less than 15 minutes.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ani7 · Posted 03-24-2022 12:08 PM

This might be what I'm looking for! What do you mean "without the output process" here?

yabwon · Posted 03-24-2022 12:14 PM

In my code when the output statement were commented (so no result data set was generated) out process was 27 seconds.

When I did output in my code (result data set was generated) it took ~15 minutes to output 81% of all data.

B.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ballardw · Posted 03-24-2022 11:24 AM

Have you tried something more like:

proc sql;
	create table 	table_a_b
	as
	select distinct a.*, b.*
	from 	table_a as a,table_b as b
;
quit;

I believe that a LEFT JOIN On some condition may take more time because of the evaluation of the ON clause for each record created. Or was this an attempt to get rid of the message that simple Cartesian join code generates :

NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.

Additionally if you don't want duplicates then remove them from source data sets before joining them.

mkeintz · Posted 03-24-2022 11:52 AM

By hash merge, I presume you mean a DATA step using hash objects:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;


data want (drop=_:);
  if 0 then set table_a table_b;
  declare hash a (dataset:'table_a');
    a.definekey('numbers');
    a.definedata(all:'Y');
    a.definedone();
  declare hiter ai ('a');
  a.output(dataset:'a');
  declare hash b (dataset:'table_b');
    b.definekey('letters');
    b.definedata(all:'Y');
    b.definedone();
  declare hiter bi ('b');
  b.output(dataset:'b');

  do _ra=ai.first() by 0 until (ai.next()^=0);
    do _rb=bi.first() by 0 until (bi.next()^=0);
      output;
    end;
  end;
run;

Note this does NOT preserve the order of the original observations, nor the sorted order generated by using the DISTINCT clause in SQL. (It does make distinct rows though, because the hash object, by default, does not keep multiple data items (think multiple rows) per key value.

But you can, I believe, replicate the order generated by DISTINCT by using the "ordered:a" option in the hash declarations, i.e. change

  declare hash a (dataset:'table_a');

to

  declare hash a (dataset:'table_a',ordered:'a');

and making the same change for hash object b.

This is wrong, has been sticken out: ~~Or you could replicate the order of the observations in table_a and table_b by using the "hashexp:0" option. I.e. change~~

  declare hash a (dataset:'table_a');

to

  declare hash a (dataset:'table_a',hashexp:0);

~~and the same for hash object b, which would produce the first obs from a paired with each obs from b in sequences, then the second obs from a, etc.~~

Ordinarily using hashexp:0 might slow down hash data retrieval, since it puts all the data into one (binary-searched) bucket. But this program does not use the hash key for locating data items. Instead it just steps through them, so I suspect that hashexp:0 won't exact a speed penalty in this case Haven't tested it though.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!