Solved: Re: Left join big tables by hash objects

kinder_skipper · Posted 09-02-2022 08:29 AM

I have three tables which I want to join:

tab1 - about 55 millions rows (size 25 GB)

tab2 - 2,5 millions rows

tab3 - 1,5 millions rows

proc sql looks like:

proc sql;
create table tab4 as
	select
		t1.*
		,t2.col_2
		,coalesce(t1.col_3, t3.col_3) as col_3_new
		,coalesce(t1.col_4, t3.col_4) as col_4_new
	from tab1 t1
		left join tab2 t2 on t1.id = t2.id
		left join tab3 t3
					on t1.id = t3.id
					and t1.id_1 = t3.id_1
					and t1.id_2 = t3.id_2
;
quit;

data tab4;
set tab4;
drop col_3 col_4;
rename col_3_new = col_3 col_4_new = col_4;
run;

I tried to use hash tables instead of proc sql, but I don't know if I do right and how to include coalesce():

data tab4;
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3;
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('col_3', 'col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
run;

Maybe is it better method to join these tables?

FreelanceReinh · Posted 09-02-2022 12:13 PM

Hello @kinder_skipper and welcome to the SAS Support Communities!

Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.

The COALESCE part can be added as shown in the highlighted changes below:

data tab4(drop=_:);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4));
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('_col_3', '_col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
  col_3=coalesce(col_3,_col_3); /* or coalesceC in case   */
  col_4=coalesce(col_4,_col_4); /* of character variables */
run;

The "key:" argument tags are redundant:

  _rc=h_b.find();
  _rc=h_c.find();

Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:

create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))

View solution in original post

FreelanceReinh · Posted 09-02-2022 12:13 PM

Hello @kinder_skipper and welcome to the SAS Support Communities!

Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.

The COALESCE part can be added as shown in the highlighted changes below:

data tab4(drop=_:);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4));
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('_col_3', '_col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
  col_3=coalesce(col_3,_col_3); /* or coalesceC in case   */
  col_4=coalesce(col_4,_col_4); /* of character variables */
run;

The "key:" argument tags are redundant:

  _rc=h_b.find();
  _rc=h_c.find();

Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:

create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))

kinder_skipper · Posted 09-02-2022 03:42 PM

It works. Thanku you very much!

Tom · Posted 09-02-2022 09:53 PM

Since you have to copy the entire LARGE dataset anyway then I suspect the SQL is going to be just as efficient (if not more) than anything you write. Especially if the dataset are sorted (or indexed).

If the datasets are sorted you just use normal SAS data step(s) to merge them. (unless the goal is to blow-up the large dataset into something even larger).

data step1 / view=step1;
   merge t1(in=in1) t2(keep=id col2);
   by id1;
   if in1;
run;

data want;
   merge step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4) ;
   by id1 id2 id3 ;
   if in1;
run;

Which will use the values of COL_3 and COL_4 from T3 when there is a match.

Or perhaps the coalesce() function was intended to modify the value of COL_3 and COL_4 from T1 only when the value in T3 was non-missing? (That is not replace an existing value with a missing value.) In which case you should probably use an UPDATE instead of MERGE in the last step.

data want;
   update step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4);
   by id1 id2 id3 ;
   if in1;
run;

kinder_skipper · Posted 09-03-2022 12:07 AM

Thank you for your reply. These tables aren't sorted and indexed. Proc sql lasts three times longer than hash tables.

Left join big tables by hash objects

Re: Left join big tables by hash objects

Re: Left join big tables by hash objects

Re: Left join big tables by hash objects

Re: Left join big tables by hash objects

Re: Left join big tables by hash objects

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away