I currently have two tables, table_a and table_b with one variable each and 3 distinct values and have to merge them to create a table with every possible combination of the two variables across the two tables. I currently am using a proc sql to do so but was wondering if there was a way to do it using a hash merge (since the resultant merged table with my actual data ends up being a few billion rows long).
The code I currently use is as follows:
data table_a;
input numbers $;
cards;
1
2
3
;
run;
data table_b;
input letters $;
cards;
a
b
c
;
run;
proc sql;
create table table_a_b
as
select distinct a.*, b.*
from table_a as a
left join table_b as b
on 1 = 1
;quit;
For Cartesian you can do a loop:
data table_a;
input numbers $;
cards;
1
2
3
;
run;
data table_b;
input letters $;
cards;
a
b
c
;
run;
%let nobs = 3;
data want;
array table_a[&nobs.] $ _temporary_;
do _N_ = 1 by 1 until(eof1);
set table_a end=eof1;
table_a[_N_] = numbers;
end;
do until(eof2);
set table_b end=eof2;
do _N_ = 1 to &nobs.;
numbers = table_a[_N_];
output;
end;
end;
run;
proc print;
run;
Bart
How many observations are in those real datasets? And what is the length of the character variables?
For your data it could be something like this:
data table_a;
do numbers = 1 to 110000;
output;
end;
run;
data table_b;
length letters1 $ 8;
do _N_ = 1 to 140;
letters1 = cats("A", _N_);
output;
end;
run;
data table_c;
length letters2 $ 8;
do _N_ = 1 to 207;
letters2 = cats("B", _N_);
output;
end;
run;
proc sql;
create table table_d as
select * from
table_b, table_c;
quit;
%let nobs = 110000;
data want;
array table_a[&nobs.] _temporary_;
do _N_ = 1 by 1 until(eof1);
set table_a end=eof1;
table_a[_N_] = numbers;
end;
do until(eof2);
set table_d end=eof2;
do _N_ = 1 to &nobs.;
numbers = table_a[_N_];
/*output;*/ /* <- commented out to prevent data b00m :-) */
end;
end;
run;
But the question is, what are you planing to do wit this result?
Maybe you can do this in "parts", e.g. splitting "tabel_d" (from mu code) into 10 smaller one, and do the process separately?
Bart
Since you do not need the search capability of a hash, temporary arrays are the way to go:
proc sql noprint;
select nobs into :size_b
from dictionary.tables
where libname = "WORK" and memname = "B";
select nobs into :size_c
from dictionary.tables
where libname = "WORK" and memname = "C";
quit;
data want;
set a;
array ar_b {&size_b.} $;
array ar_c {&size_c.} $;
if _n_ = 1
then do;
do b = 1 to &size_b.;
set b;
ar_b{b} = numbers; /* or whatever the variable is called in dataset b */
end;
do c = 1 to &size_c.;
set c;
ar_c{c} = letters;
end;
end;
do b = 1 to size_b.;
do c = 1 to size_c.;
numbers = ar_b{b};
letters = ar_c{c};
output;
end;
end;
drop b c;
run;
Temporary arrays are the fastest structures you can have in SAS. If you need to read more than one variable from b and/or c, add additional arrays.
The nice thing here is also that no sorting happens behind the scenes; everything is done sequentially.
So you have three tables, not two. There are two very small (B and C) and one large (A). If this is the case, then DON'T convert table A into a hash. Instead
The resulting data will be in the physical order of table A, but within each A, there is no guarantee of order.
Using this simulated data, with the szies you mention for tables A, B, and C:
data table_a;
do numbers=1 to 110000; output; end;
run;
data table_b (keep=letters);
length letters $2 ;
do L1=rank('A') to rank('Z') until (nout=140);
letters=byte(L1);
do L2=rank('A') to rank('Z') until (nout=140);
substr(Letters,2)=byte(L2);
output;
nout+1;
end;
end;
run;
data table_c;
do other=-207 to -1; output; end;
run;
data template_b_c;
set table_b table_c;
stop;
run;
The program below creates a 3-way cartesian product and took 2:19 (139 seconds) on my machine. But notice it is a DATA _NULL_ step, so the output statement does not actually write to disk. All I am measuring is the time it takes to read in B, C, and A, and cross all their observations. The rest will depend on the speed of your output device and output channel. Whatever total time the program would take to create the dataset you want, will be almost solely driven by the time to write the data, not the time to do the cross-products.
data _null_ (drop=_:);
set table_a;
if _n_ =1 then do;
if 0 then set table_b;
declare hash b (dataset:'table_b');
b.definekey('letters');
b.definedata(all:'Y');
b.definedone();
declare hiter bi ('b');
if 0 then set table_c;
declare hash c (dataset:'table_c');
c.definekey('other');
c.definedata(all:'Y');
c.definedone();
declare hiter ci ('c');
declare hash bc (dataset:'template_b_c');
bc.definekey(all:'Y');
bc.definedata(all:'Y');
bc.definedone();
declare hiter bci ('bc');
do _rb=bi.first() by 0 until (bi.next()^=0);
do _rc=ci.first() by 0 until (ci.next()^=0);
bc.add();
end;
end;
end;
do _rc=bci.first() by 0 until (bci.next()^=0);
output;
end;
run;
Why would it matter if there is another way? Wouldn't the end result be the same?
Yes but the actual tables in my code are much larger and the resultant merged table ends up being about 3 billion rows which takes 5-6 hours to run. A hash merge could potentially be quicker here.
For Cartesian you can do a loop:
data table_a;
input numbers $;
cards;
1
2
3
;
run;
data table_b;
input letters $;
cards;
a
b
c
;
run;
%let nobs = 3;
data want;
array table_a[&nobs.] $ _temporary_;
do _N_ = 1 by 1 until(eof1);
set table_a end=eof1;
table_a[_N_] = numbers;
end;
do until(eof2);
set table_b end=eof2;
do _N_ = 1 to &nobs.;
numbers = table_a[_N_];
output;
end;
end;
run;
proc print;
run;
Bart
Process of looping over your data (110000 * 140 * 207) without(!) output process took 27 seconds on my laptop.
Adding "output" overhead it will be longer. My result, before I run out of disk space, was:
ERROR: Insufficient space in file OUT.WANT.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 110000 observations read from the data set WORK.TABLE_A.
NOTE: There were 23606 observations read from the data set WORK.TABLE_D.
WARNING: The data set OUT.WANT may be incomplete. When this step was stopped
there were 2596514631 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 14:39.18
cpu time 4:59.76
So it was about 81% of total (and ~61GB) in less than 15 minutes.
Bart
In my code when the output statement were commented (so no result data set was generated) out process was 27 seconds.
When I did output in my code (result data set was generated) it took ~15 minutes to output 81% of all data.
B.
Have you tried something more like:
proc sql;
create table table_a_b
as
select distinct a.*, b.*
from table_a as a,table_b as b
;
quit;
I believe that a LEFT JOIN On some condition may take more time because of the evaluation of the ON clause for each record created. Or was this an attempt to get rid of the message that simple Cartesian join code generates :
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
Additionally if you don't want duplicates then remove them from source data sets before joining them.
By hash merge, I presume you mean a DATA step using hash objects:
data table_a;
input numbers $;
cards;
1
2
3
;
run;
data table_b;
input letters $;
cards;
a
b
c
;
run;
data want (drop=_:);
if 0 then set table_a table_b;
declare hash a (dataset:'table_a');
a.definekey('numbers');
a.definedata(all:'Y');
a.definedone();
declare hiter ai ('a');
a.output(dataset:'a');
declare hash b (dataset:'table_b');
b.definekey('letters');
b.definedata(all:'Y');
b.definedone();
declare hiter bi ('b');
b.output(dataset:'b');
do _ra=ai.first() by 0 until (ai.next()^=0);
do _rb=bi.first() by 0 until (bi.next()^=0);
output;
end;
end;
run;
Note this does NOT preserve the order of the original observations, nor the sorted order generated by using the DISTINCT clause in SQL. (It does make distinct rows though, because the hash object, by default, does not keep multiple data items (think multiple rows) per key value.
But you can, I believe, replicate the order generated by DISTINCT by using the "ordered:a" option in the hash declarations, i.e. change
declare hash a (dataset:'table_a');
to
declare hash a (dataset:'table_a',ordered:'a');
and making the same change for hash object b.
This is wrong, has been sticken out: Or you could replicate the order of the observations in table_a and table_b by using the "hashexp:0" option. I.e. change
declare hash a (dataset:'table_a');
to
declare hash a (dataset:'table_a',hashexp:0);
and the same for hash object b, which would produce the first obs from a paired with each obs from b in sequences, then the second obs from a, etc.
Ordinarily using hashexp:0 might slow down hash data retrieval, since it puts all the data into one (binary-searched) bucket. But this program does not use the hash key for locating data items. Instead it just steps through them, so I suspect that hashexp:0 won't exact a speed penalty in this case Haven't tested it though.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.