I have three tables which I want to join:
tab1 - about 55 millions rows (size 25 GB)
tab2 - 2,5 millions rows
tab3 - 1,5 millions rows
proc sql looks like:
proc sql;
create table tab4 as
select
t1.*
,t2.col_2
,coalesce(t1.col_3, t3.col_3) as col_3_new
,coalesce(t1.col_4, t3.col_4) as col_4_new
from tab1 t1
left join tab2 t2 on t1.id = t2.id
left join tab3 t3
on t1.id = t3.id
and t1.id_1 = t3.id_1
and t1.id_2 = t3.id_2
;
quit;
data tab4;
set tab4;
drop col_3 col_4;
rename col_3_new = col_3 col_4_new = col_4;
run;
I tried to use hash tables instead of proc sql, but I don't know if I do right and how to include coalesce():
data tab4;
if _n_=1 then
do;
/* map tables */
if 0 then set tab1 tab2 tab3;
/* define and load hashes */
dcl hash h_b(dataset:'tab2');
h_b.defineKey('id');
h_b.defineData('col_2');
h_b.defineDone();
dcl hash h_c(dataset:'tab3');
h_c.defineKey('id', 'id_1', 'id_2');
h_c.defineData('col_3', 'col_4');
h_c.defineDone();
end;
call missing(of _all_);
set tab1;
_rc=h_b.find(key: id);
_rc=h_c.find(key: id, key:id_1, key:id_2);
run;
Maybe is it better method to join these tables?
Hello @kinder_skipper and welcome to the SAS Support Communities!
Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.
The COALESCE part can be added as shown in the highlighted changes below:
data tab4(drop=_:); if _n_=1 then do; /* map tables */ if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4)); /* define and load hashes */ dcl hash h_b(dataset:'tab2'); h_b.defineKey('id'); h_b.defineData('col_2'); h_b.defineDone(); dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))'); h_c.defineKey('id', 'id_1', 'id_2'); h_c.defineData('_col_3', '_col_4'); h_c.defineDone(); end; call missing(of _all_); set tab1; _rc=h_b.find(key: id); _rc=h_c.find(key: id, key:id_1, key:id_2); col_3=coalesce(col_3,_col_3); /* or coalesceC in case */ col_4=coalesce(col_4,_col_4); /* of character variables */ run;
The "key:" argument tags are redundant:
_rc=h_b.find(); _rc=h_c.find();
Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:
create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))
Hello @kinder_skipper and welcome to the SAS Support Communities!
Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.
The COALESCE part can be added as shown in the highlighted changes below:
data tab4(drop=_:); if _n_=1 then do; /* map tables */ if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4)); /* define and load hashes */ dcl hash h_b(dataset:'tab2'); h_b.defineKey('id'); h_b.defineData('col_2'); h_b.defineDone(); dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))'); h_c.defineKey('id', 'id_1', 'id_2'); h_c.defineData('_col_3', '_col_4'); h_c.defineDone(); end; call missing(of _all_); set tab1; _rc=h_b.find(key: id); _rc=h_c.find(key: id, key:id_1, key:id_2); col_3=coalesce(col_3,_col_3); /* or coalesceC in case */ col_4=coalesce(col_4,_col_4); /* of character variables */ run;
The "key:" argument tags are redundant:
_rc=h_b.find(); _rc=h_c.find();
Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:
create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))
It works. Thanku you very much!
Since you have to copy the entire LARGE dataset anyway then I suspect the SQL is going to be just as efficient (if not more) than anything you write. Especially if the dataset are sorted (or indexed).
If the datasets are sorted you just use normal SAS data step(s) to merge them. (unless the goal is to blow-up the large dataset into something even larger).
data step1 / view=step1;
merge t1(in=in1) t2(keep=id col2);
by id1;
if in1;
run;
data want;
merge step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4) ;
by id1 id2 id3 ;
if in1;
run;
Which will use the values of COL_3 and COL_4 from T3 when there is a match.
Or perhaps the coalesce() function was intended to modify the value of COL_3 and COL_4 from T1 only when the value in T3 was non-missing? (That is not replace an existing value with a missing value.) In which case you should probably use an UPDATE instead of MERGE in the last step.
data want;
update step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4);
by id1 id2 id3 ;
if in1;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.