Solved: Hash and Merge

Saranya_Sub · Posted 04-04-2022 01:41 PM

Hi ,

Am confused with how the Hash and Merge works for this below table . For me the output of the Merge is correct . but Merge requires the data set to be sorted before the Merge to happen and its time consuming .Is there any option i can have the same output of Merge in hash join as well?

data product;
input ID $ File_Nm $ PRODUCT $;

datalines;
1001  File1 1234
1001  File1 4567
1001  File1 7890
1001  File1 0123
1001  File1 3456
1001  File1 6789
;

data YEAR2021_WW01;
input ID $ File_Nm $ Amount $;
 
datalines;
1001  File1 100
1001  File1 200
1001  File1 300
;
  
data want_hash;
 if 0 then set product ;
 
 
  declare hash p (dataset:"product");
  p.definekey("ID","File_Nm");
  p.definedata(all:'Y'); 
  p.definedone();
   do until(eof);
   set YEAR2021_WW01 end=eof;
 
if p.find() = 0 then output;
end;
stop;
run;

/*Merge Join*/
DATA WORK.want_merge;

MERGE WORK.product(IN=A) YEAR2021_WW01(IN=B);
BY ID File_Nm;
IF A;
RUN;

mkeintz · Posted 04-04-2022 10:27 PM

The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.

When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.

BTW, your merge suggests that the hash object should be for year2021_ww01, not product.

But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.

So you need to

accommodate more than one obs per key combination - use the multidata:"Y" option.
Match the first obs of the PRODUCT dataset with the first matching dataitem in the hash object, match the second with the second, etc., This is done by usnig H.FIND() method to find the first, then delete that dataitem (h.removeddup), so that the next dataitem will become the first to satisfy the next h.find().
Carry forth the last matching dataitem in the hash object to match any "extra" obs in PRODUCT. So don't h.removedup the last dataitem for the duplicate key

data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;

A couple of NOTES:

The above does not need either dataset to be sorted. Which, as far as I can tell, is the only real advantage of using hash vs MERGE in this particular application.
If there were key combinations in YEAR2021_WW01 that have a longer series than in PRODUCTS, then the above will NOT replicate the merge statement - because the merge statement would repeat the last matching obs in PRODUCTS to meet the longer series in YEAR2021_WW01, while this program would ignore extra obs in the YEAR2021_WW01 object. That would take a bit more code.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PeterClemmensen · Posted 04-04-2022 02:35 PM

Seems like you switch the two data sets around.

Try this

data want_hash;
   if 0 then set YEAR2021_WW01 ;
   declare hash p (dataset : "YEAR2021_WW01");
   p.definekey("ID", "File_Nm");
   p.definedata(all:'Y'); 
   p.definedone();

   do until(eof);
      set product end=eof;
      if p.find() = 0 then output;
   end;
   stop;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Tom · Posted 04-04-2022 02:36 PM

What is the question?

Are you asking if you can get sorted results from the HASH method without sorting the YEAR2021_WW01 dataset?

If so then the answer is no.

ChrisNZ · Posted 04-04-2022 06:41 PM

> Merge requires the data set to be sorted before the Merge to happen and its time consuming .

But then the merge takes no time at all. And the output is sorted, ready for the next merge (be sure to set the sorted flag as SAS is too lazy to do that).

Bottom line: If you want a sorted output, you need to sort at some point.

You can use the hash table for this if you want, but it is unlikely that the hash table is going to be the silver bullet you seem to think it is.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 04-04-2022 10:27 PM

The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.

When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.

BTW, your merge suggests that the hash object should be for year2021_ww01, not product.

But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.

So you need to

accommodate more than one obs per key combination - use the multidata:"Y" option.
Match the first obs of the PRODUCT dataset with the first matching dataitem in the hash object, match the second with the second, etc., This is done by usnig H.FIND() method to find the first, then delete that dataitem (h.removeddup), so that the next dataitem will become the first to satisfy the next h.find().
Carry forth the last matching dataitem in the hash object to match any "extra" obs in PRODUCT. So don't h.removedup the last dataitem for the duplicate key

data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;

A couple of NOTES:

The above does not need either dataset to be sorted. Which, as far as I can tell, is the only real advantage of using hash vs MERGE in this particular application.
If there were key combinations in YEAR2021_WW01 that have a longer series than in PRODUCTS, then the above will NOT replicate the merge statement - because the merge statement would repeat the last matching obs in PRODUCTS to meet the longer series in YEAR2021_WW01, while this program would ignore extra obs in the YEAR2021_WW01 object. That would take a bit more code.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Re: Hash and Merge

Click image to register for webinar

Classroom Training Available!