BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
kinder_skipper
Fluorite | Level 6

I have three tables which I want to join: 

tab1 - about 55 millions rows (size 25 GB)

tab2 - 2,5 millions rows 

tab3 - 1,5 millions rows

 

proc sql looks like:

proc sql;
create table tab4 as
	select
		t1.*
		,t2.col_2
		,coalesce(t1.col_3, t3.col_3) as col_3_new
		,coalesce(t1.col_4, t3.col_4) as col_4_new
	from tab1 t1
		left join tab2 t2 on t1.id = t2.id
		left join tab3 t3
					on t1.id = t3.id
					and t1.id_1 = t3.id_1
					and t1.id_2 = t3.id_2
;
quit;

data tab4;
set tab4;
drop col_3 col_4;
rename col_3_new = col_3 col_4_new = col_4;
run;

I tried to use hash tables instead of proc sql, but I don't know if I do right and how to include coalesce():

data tab4;
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3;
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('col_3', 'col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
run;

Maybe is it better method to join these tables?

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @kinder_skipper and welcome to the SAS Support Communities!

 

Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.

 

The COALESCE part can be added as shown in the highlighted changes below:

data tab4(drop=_:);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4));
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('_col_3', '_col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
  col_3=coalesce(col_3,_col_3); /* or coalesceC in case   */
  col_4=coalesce(col_4,_col_4); /* of character variables */
run;

 

The "key:" argument tags are redundant:

  _rc=h_b.find();
  _rc=h_c.find();

 

Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:

create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))

View solution in original post

4 REPLIES 4
FreelanceReinh
Jade | Level 19

Hello @kinder_skipper and welcome to the SAS Support Communities!

 

Basically, your code looks good, provided that ID is a unique key in TAB2 and that the combination of ID, ID_1 and ID_2 is unique in TAB3. The DATA step using the hash objects has the advantage of preserving the order of observations from TAB1. This is not guaranteed by PROC SQL (in the absence of an ORDER BY clause), in particular if ID were not unique in TAB1.

 

The COALESCE part can be added as shown in the highlighted changes below:

data tab4(drop=_:);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set tab1 tab2 tab3(rename=(col_3=_col_3 col_4=_col_4));
      /* define and load hashes */
      dcl hash h_b(dataset:'tab2');
      h_b.defineKey('id');
      h_b.defineData('col_2');
      h_b.defineDone();
      dcl hash h_c(dataset:'tab3(rename=(col_3=_col_3 col_4=_col_4))');
      h_c.defineKey('id', 'id_1', 'id_2');
      h_c.defineData('_col_3', '_col_4');
      h_c.defineDone();
    end;
  call missing(of _all_);
  set tab1;
  _rc=h_b.find(key: id);
  _rc=h_c.find(key: id, key:id_1, key:id_2);
  col_3=coalesce(col_3,_col_3); /* or coalesceC in case   */
  col_4=coalesce(col_4,_col_4); /* of character variables */
run;

 

The "key:" argument tags are redundant:

  _rc=h_b.find();
  _rc=h_c.find();

 

Also note that you could omit the entire DATA step following the PROC SQL step by using dataset options as follows:

create table tab4(drop=col_3 col_4 rename=(col_3_new = col_3 col_4_new = col_4))
kinder_skipper
Fluorite | Level 6

It works. Thanku you very much!

Tom
Super User Tom
Super User

Since you have to copy the entire LARGE dataset anyway then I suspect the SQL is going to be just as efficient (if not more) than anything you write.  Especially if the dataset are sorted (or indexed).

 

If the datasets are sorted you just use normal SAS data step(s) to merge them. (unless the goal is to blow-up the large dataset into something even larger).

data step1 / view=step1;
   merge t1(in=in1) t2(keep=id col2);
   by id1;
   if in1;
run;

data want;
   merge step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4) ;
   by id1 id2 id3 ;
   if in1;
run;

Which will use the values of COL_3 and COL_4 from T3 when there is a match.

 

Or perhaps the coalesce() function was intended to modify the value of COL_3 and COL_4 from T1 only when the value in T3 was non-missing? (That is not replace an existing value with a missing value.)  In which case you should probably use an UPDATE instead of MERGE in the last step.

data want;
   update step1(in=in1) t3(keep=id1 id2 id3 col_3 col_4);
   by id1 id2 id3 ;
   if in1;
run;

 

kinder_skipper
Fluorite | Level 6
Thank you for your reply. These tables aren't sorted and indexed. Proc sql lasts three times longer than hash tables.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 970 views
  • 1 like
  • 3 in conversation