BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Saranya_Sub
Obsidian | Level 7

Hi ,

  Am confused with how the Hash and Merge works for this below table . For me the output of the Merge is correct . but  Merge requires the data set to be sorted before the Merge to happen and its time consuming .Is there any option i can have the same output of Merge in hash join as well?

 

 

data product;
input ID $ File_Nm $ PRODUCT $;

datalines;
1001  File1 1234
1001  File1 4567
1001  File1 7890
1001  File1 0123
1001  File1 3456
1001  File1 6789
;

data YEAR2021_WW01;
input ID $ File_Nm $ Amount $;
 
datalines;
1001  File1 100
1001  File1 200
1001  File1 300
;
  
data want_hash;
 if 0 then set product ;
 
 
  declare hash p (dataset:"product");
  p.definekey("ID","File_Nm");
  p.definedata(all:'Y'); 
  p.definedone();
   do until(eof);
   set YEAR2021_WW01 end=eof;
 
if p.find() = 0 then output;
end;
stop;
run;

/*Merge Join*/
DATA WORK.want_merge;

MERGE WORK.product(IN=A) YEAR2021_WW01(IN=B);
BY ID File_Nm;
IF A;
RUN;

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process.   When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.

 

When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.

 

BTW, your merge suggests that the hash object should be for year2021_ww01, not product.

 

But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.

 

So you need to

  1. accommodate more than one obs per key combination  - use the multidata:"Y" option.
  2. Match the first obs of the PRODUCT dataset with the first matching dataitem in the hash object, match the second with the second, etc., This is done by usnig H.FIND() method to find the first, then delete that dataitem (h.removeddup), so that the next dataitem will become the first to satisfy the next h.find().
  3. Carry forth the last matching dataitem in the hash object to match any "extra" obs in PRODUCT.  So don't h.removedup the last dataitem for the duplicate key

 

data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;

A couple of NOTES:

 

  1. The above does not need either dataset to be sorted.  Which, as far as I can tell, is the only real advantage of using hash vs MERGE in this particular application.

  2. If there were key combinations in YEAR2021_WW01 that have a longer series than in PRODUCTS, then the above will NOT replicate the merge statement - because the merge statement would repeat the last matching obs in PRODUCTS to meet the longer series in YEAR2021_WW01, while this program would ignore extra obs in the YEAR2021_WW01 object.  That would take a bit more code.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
PeterClemmensen
Tourmaline | Level 20

Seems like you switch the two data sets around.

 

Try this

 

data want_hash;
   if 0 then set YEAR2021_WW01 ;
   declare hash p (dataset : "YEAR2021_WW01");
   p.definekey("ID", "File_Nm");
   p.definedata(all:'Y'); 
   p.definedone();

   do until(eof);
      set product end=eof;
      if p.find() = 0 then output;
   end;
   stop;
run;
Tom
Super User Tom
Super User

What is the question?

Are you asking if you can get sorted results from the HASH method without sorting the YEAR2021_WW01 dataset?

If so then the answer is no.

ChrisNZ
Tourmaline | Level 20

>  Merge requires the data set to be sorted before the Merge to happen and its time consuming .

 

But then the merge takes no time at all. And the output is sorted, ready for the next merge (be sure to set the sorted flag as SAS is too lazy to do that).

 

Bottom line: If you want a sorted output, you need to sort at some point.

You can use the hash table for this if you want, but it is unlikely that the hash table is going to be the silver bullet you seem to think it is. 

mkeintz
PROC Star

The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process.   When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.

 

When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.

 

BTW, your merge suggests that the hash object should be for year2021_ww01, not product.

 

But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.

 

So you need to

  1. accommodate more than one obs per key combination  - use the multidata:"Y" option.
  2. Match the first obs of the PRODUCT dataset with the first matching dataitem in the hash object, match the second with the second, etc., This is done by usnig H.FIND() method to find the first, then delete that dataitem (h.removeddup), so that the next dataitem will become the first to satisfy the next h.find().
  3. Carry forth the last matching dataitem in the hash object to match any "extra" obs in PRODUCT.  So don't h.removedup the last dataitem for the duplicate key

 

data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;

A couple of NOTES:

 

  1. The above does not need either dataset to be sorted.  Which, as far as I can tell, is the only real advantage of using hash vs MERGE in this particular application.

  2. If there were key combinations in YEAR2021_WW01 that have a longer series than in PRODUCTS, then the above will NOT replicate the merge statement - because the merge statement would repeat the last matching obs in PRODUCTS to meet the longer series in YEAR2021_WW01, while this program would ignore extra obs in the YEAR2021_WW01 object.  That would take a bit more code.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1451 views
  • 2 likes
  • 5 in conversation