BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ani7
Obsidian | Level 7

I currently have two tables, table_a and table_b with one variable each and 3 distinct values and have to merge them to create a table with every possible combination of the two variables across the two tables. I currently am using a proc sql to do so but was wondering if there was a way to do it using a hash merge (since the resultant merged table with my actual data ends up being a few billion rows long).

 

The code I currently use is as follows:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

proc sql;
	create table 	table_a_b
	as
	select distinct a.*, b.*
	from 			table_a as a
	left join 		table_b as b
	on 				1 = 1
;quit;
1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

For Cartesian you can do a loop:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

%let nobs = 3;
data want;
  array table_a[&nobs.] $ _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_b end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      output;
    end;
  end;
run;
proc print;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

14 REPLIES 14
Ani7
Obsidian | Level 7
So in my actual code, table_a has ~110,000 observations and table_b has 140 observations. I then have to merge it with a table_c with 207 observations. This ends up with a table of ~3 billion rows.
yabwon
Onyx | Level 15

For your data it could be something like this:

data table_a;
	do numbers = 1 to 110000;
    output;
  end;
run;

data table_b;
  length letters1 $ 8;
  do _N_ = 1 to 140;
    letters1 = cats("A", _N_);
    output;
  end;
run;

data table_c;
  length letters2 $ 8;
  do _N_ = 1 to 207;
    letters2 = cats("B", _N_);
    output;
  end;
run;

proc sql;
  create table table_d as
  select * from
  table_b, table_c;
quit;

%let nobs = 110000;
data want;
  array table_a[&nobs.] _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_d end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      /*output;*/ /* <- commented out to prevent data b00m :-) */
    end;
  end;
run;

But the question is, what are you planing to do wit this result?

Maybe you can do this in "parts", e.g. splitting "tabel_d" (from mu code) into 10 smaller one, and do the process separately?

 

Bart

 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Kurt_Bremser
Super User

Since you do not need the search capability of a hash, temporary arrays are the way to go:

proc sql noprint;
select nobs into :size_b
from dictionary.tables
where libname = "WORK" and memname = "B";
select nobs into :size_c
from dictionary.tables
where libname = "WORK" and memname = "C";
quit;

data want;
set a;
array ar_b {&size_b.} $;
array ar_c {&size_c.} $;
if _n_ = 1
then do;
  do b = 1 to &size_b.;
    set b;
    ar_b{b} = numbers; /* or whatever the variable is called in dataset b */
  end;
  do c = 1 to &size_c.;
    set c;
    ar_c{c} = letters;
  end;
end;
do b = 1 to size_b.;
  do c = 1 to size_c.;
    numbers = ar_b{b};
    letters = ar_c{c};
    output;
  end;
end;
drop b c;
run;

Temporary arrays are the fastest structures you can have in SAS. If you need to read more than one variable from b and/or c, add additional arrays.

The nice thing here is also that no sorting happens behind the scenes; everything is done sequentially.

 

mkeintz
PROC Star

So you have three tables, not two.  There are two very small (B and C)  and one large (A).  If this is the case, then DON'T convert table A into a hash.  Instead

 

  1. Read B and C into independent hashes
  2. Create a third hash (BC) as the cartesian product of the first 2.  It will still be a small object.
  3. Then read A one obs at a time.  Each time iterate through BC and output one obs per iteration.

The resulting data will be in the physical order of table A, but within each A, there is no guarantee of order.

 

Using this simulated data, with the szies you mention for tables A, B, and C:

 

data table_a;
  do numbers=1 to 110000; output; end;
run;
data table_b (keep=letters);
  length letters $2 ;
  do L1=rank('A') to rank('Z') until (nout=140);
    letters=byte(L1);
    do L2=rank('A') to rank('Z') until (nout=140);
      substr(Letters,2)=byte(L2);
      output;
      nout+1;
    end;
  end;
run;
data table_c;
  do other=-207 to -1; output; end;
run;

data template_b_c;
  set table_b table_c;
  stop;
run;

 

The program below creates a 3-way cartesian product and took 2:19 (139 seconds) on my machine.  But notice it is a DATA _NULL_ step, so the output statement does not actually write to disk.  All I am measuring is the time it takes to read in B, C, and A, and cross all their observations.  The rest will depend on the speed of your output device and output channel.  Whatever total time the program would take to create the dataset you want, will be almost solely driven by the time to write the data, not the time to do the cross-products.

 

data _null_ (drop=_:);
  set table_a;
  if _n_ =1 then do;
    if 0 then set table_b;
    declare hash b (dataset:'table_b');
      b.definekey('letters');
      b.definedata(all:'Y');
      b.definedone();
    declare hiter bi ('b');
    if 0 then set table_c;
    declare hash c (dataset:'table_c');
      c.definekey('other');
      c.definedata(all:'Y');
      c.definedone();
    declare hiter ci ('c');
    declare hash bc (dataset:'template_b_c'); 
      bc.definekey(all:'Y');
      bc.definedata(all:'Y');
      bc.definedone();
    declare hiter bci ('bc');
    do _rb=bi.first() by 0 until (bi.next()^=0);
      do _rc=ci.first() by 0 until (ci.next()^=0);
        bc.add();
      end;
    end;
  end;
  do _rc=bci.first() by 0 until (bci.next()^=0);
    output;
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Astounding
PROC Star

Why would it matter if there is another way?  Wouldn't the end result be the same?

Ani7
Obsidian | Level 7

Yes but the actual tables in my code are much larger and the resultant merged table ends up being about 3 billion rows which takes 5-6 hours to run. A hash merge could potentially be quicker here.

yabwon
Onyx | Level 15

For Cartesian you can do a loop:

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;

%let nobs = 3;
data want;
  array table_a[&nobs.] $ _temporary_;
  do _N_ = 1 by 1 until(eof1);
    set table_a end=eof1;
    table_a[_N_] = numbers;
  end;

  do until(eof2);
    set table_b end=eof2;
    do _N_ = 1 to &nobs.;
      numbers = table_a[_N_];
      output;
    end;
  end;
run;
proc print;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ani7
Obsidian | Level 7
Would this be quicker than the proc sql merge? The main reason for a hash merge here is for speed/time constraints.
yabwon
Onyx | Level 15

Process of looping over your data (110000 * 140 * 207) without(!) output process took 27 seconds on my laptop.

Adding "output" overhead it will be longer. My result, before I run out of disk space, was:

ERROR: Insufficient space in file OUT.WANT.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 110000 observations read from the data set WORK.TABLE_A.
NOTE: There were 23606 observations read from the data set WORK.TABLE_D.
WARNING: The data set OUT.WANT may be incomplete.  When this step was stopped
         there were 2596514631 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14:39.18
      cpu time            4:59.76

So it was about 81% of total (and ~61GB) in less than 15 minutes.

 

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ani7
Obsidian | Level 7
This might be what I'm looking for! What do you mean "without the output process" here?
yabwon
Onyx | Level 15

In my code when the output statement were commented (so no result data set was generated) out process was 27 seconds.

 

When I did output in my code (result data set was generated) it took ~15 minutes to output 81% of all data. 

 

B.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ballardw
Super User

Have you tried something more like:

 

proc sql;
	create table 	table_a_b
	as
	select distinct a.*, b.*
	from 	table_a as a,table_b as b
;
quit;

 

I believe that a LEFT JOIN On some condition may take more time because of the evaluation of the ON clause for each record created. Or was this an attempt to get rid of the message that simple Cartesian join code generates :

NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.

 Additionally if you don't want duplicates then remove them from source data sets before joining them.

mkeintz
PROC Star

By hash merge, I presume you mean a DATA step using hash objects:

 

data table_a;
	input numbers $;
	cards;
1
2
3
;
run;

data table_b;
	input letters $;
	cards;
a
b
c
;
run;


data want (drop=_:);
  if 0 then set table_a table_b;
  declare hash a (dataset:'table_a');
    a.definekey('numbers');
    a.definedata(all:'Y');
    a.definedone();
  declare hiter ai ('a');
  a.output(dataset:'a');
  declare hash b (dataset:'table_b');
    b.definekey('letters');
    b.definedata(all:'Y');
    b.definedone();
  declare hiter bi ('b');
  b.output(dataset:'b');

  do _ra=ai.first() by 0 until (ai.next()^=0);
    do _rb=bi.first() by 0 until (bi.next()^=0);
      output;
    end;
  end;
run;

Note this does NOT preserve the order of the original observations, nor the sorted order generated by using the DISTINCT clause in SQL.  (It does make distinct rows though, because the hash object, by default, does not keep multiple data items (think multiple rows) per key value.

 

But you can, I believe, replicate the order generated by DISTINCT by using the "ordered:a" option in the hash declarations, i.e. change

 

  declare hash a (dataset:'table_a');

to 

  declare hash a (dataset:'table_a',ordered:'a');

and making the same change for hash object b.

 

This is wrong, has been sticken out:  Or you could replicate the order of the observations in table_a and table_b by using the "hashexp:0" option.  I.e. change

 

  declare hash a (dataset:'table_a');

to

  declare hash a (dataset:'table_a',hashexp:0);

and the same for hash object b, which would produce the first obs from a paired with each obs from b in sequences, then the second obs from a, etc.

 

Ordinarily using hashexp:0 might slow down hash data retrieval, since it puts all the data into one (binary-searched) bucket.  But this program does not use the hash key for locating data items.  Instead it just steps through them, so I suspect that hashexp:0 won't exact a speed penalty in this case   Haven't tested it though.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2176 views
  • 2 likes
  • 6 in conversation