how to use hash in data step to facilitate merge

Reply
Frequent Contributor
Posts: 122

how to use hash in data step to facilitate merge

I have two datasets to be merged. One is quite large with over 50m observations, while the other is small enough. When I use proc sql to merge them together it will take about 30 minutes. I wonder if data step with hash object will shorten the processing time siginificantly because it does not need sorting. Let me outline what datasets look like and the proc sql code I use to merge.

s1 (huge one)

id          date               c1-c10          n1-n10

001       10/31/2000         ...               ....

001       11/30/2000         ...               ....

...

005      05/31/2000     ....                    ...

005      04/30/2000     ....                    ...

...

002     08/31/1999     ....                    ....


stockname(small one)

id          stdt                    enddt                    c11          n11

001       03/01/1995        10/31/2010             aa            15

001       11/01/2010        12/31/2012             ab            19

....


Basically what I want to do is to link c1-c10 and n1-n10 values for each id-date combo in s1 to c11 and n11 in stockname(small dataset).


c1-c10 are character values

n1-n10 are numeric values


c11 is character and n11 is numeric


In s1 for each id and date c1-c10 and n1-n10 are 20 variables I need to use in the future.

In stockname dataset, stdt and enddt are effective start and end date for id to have the values c11 and n11


s1 and stockname may not be sorted accordingly yet. The final result I want to see is


Want

id          date               c1-c10          n1-n10               c11               n11

001       10/31/2000         ...               ....                    ...                    ..

001       11/30/2000         ...               ....                    ...                    ..

...

005      05/31/2000     ....                    ...

005      04/30/2000     ....                    ...

...

002     08/31/1999     ....                    ....

        

The code I used is as follows;

proc sql;

  create table want as

  select *

  from s1 a, stocknames b

  where (a.id = b.id) & (stdt<= a.date <=enddt);

quit;

But I would like to know whether it can be done and how to do it in data step with hash and see whether it will take less time to process. Thanks.

Super User
Posts: 9,671

Re: how to use hash in data step to facilitate merge

Here is . Another way is to use proc format .

data s1     ;
input id    $      date   : mmddyy10.           ( c1-c10  ) ($)        n1-n10 ;
format date    mmddyy10.  ;
cards;
001       10/31/2000   a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
001       11/30/2000    a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
005      05/31/2000     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
005      04/30/2000     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
002     08/31/1999     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
;
run;
data stockname;
input id   $    stdt    : mmddyy10.     enddt        : mmddyy10.      c11  $   n11 ;
format stdt  enddt mmddyy10.;
cards;
001       03/01/1995        10/31/2010             aa            15
001       11/01/2010        12/31/2012             ab            19
;
run;
data key(drop= stdt enddt);
 set stockname;
 do date=stdt to enddt;
  output;
 end;
 format date    mmddyy10.  ;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set key;
   declare hash h(dataset:'key',hashexp:20);
    h.definekey('id','date');
     h.definedata('c11','n11');
     h.definedone();
 end;
set s1;
call missing(c11,n11);
rc=h.find();
run;

Xia Keshan

Frequent Contributor
Posts: 122

Re: how to use hash in data step to facilitate merge

Thank you for your help. I am going to test it soon. In the meanwhile, can you show to how to use proc format to attemp this problem? Which one should be more efficient? I am new to hash and proc format.

Super User
Posts: 9,671

Re: how to use hash in data step to facilitate merge

For your scenario , I think Hash Table is better , of course if you have enough memory to handle the small table .

Frequent Contributor
Posts: 122

Re: how to use hash in data step to facilitate merge

Thanks a lot. It works great. I do have two questions regarding this. The intermediate dataset key is too large for me (two observations in stockname extend to 6516 observations). This will explode exponentially with more observations in stockname. How can I get dataset "key" in the intervals of month rather than day?

The second question is that how I can keep all observations (similar to full join in proc sql)?

Super User
Posts: 9,671

Re: how to use hash in data step to facilitate merge

That is the reason why I refer to proc format . proc format can make it as range not sparse ----  of course,  Hash also can do that as well. But that may not be as fast as Hash .

For both of your questions  .

 
data s1     ;
input id    $      date   : mmddyy10.           ( c1-c10  ) ($)        n1-n10 ;
format date    mmddyy10.  ;
cards;
001       10/31/2000   a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
001       11/30/2000    a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
005      05/31/2000     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
005      04/30/2000     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
002     08/31/1999     a a a a a a a a a a    1 1 1 1 1 1 1 1 1 1
;
run;
data stockname;
input id   $    stdt    : mmddyy10.     enddt        : mmddyy10.      c11  $   n11 ;
format stdt  enddt mmddyy10.;
cards;
001       03/01/1995        10/31/2010             aa            15
001       11/01/2010        12/31/2012             ab            19
004       11/01/2010        12/31/2012             ab            19
;
run;
data temp;
 set  stockname;
 retain found 0;
run;
data want(drop=stdt enddt rc found);
 if _n_ eq 1 then do;
  if 0 then set temp;
   declare hash h(dataset:'temp',multidata:'y');
   declare hiter hi('h');
    h.definekey('id');
     h.definedata(all:'y');
     h.definedone();
 end;
set s1 end=last;
call missing(c11,n11);
rc=h.find();
do while(rc=0);
 if  stdt le  date le  enddt then do;found=1;h.replacedup(); leave; end;
 call missing(c11,n11);
 rc=h.find_next();
end;
output;
if last then do;
 call missing(of _all_);
 do while(hi.next()=0);
   if not found then output;
 end;
end;
run;


Xia Keshan

Message was edited by: xia keshan

Message was edited by: xia keshan

Ask a Question
Discussion stats
  • 5 replies
  • 315 views
  • 2 likes
  • 2 in conversation