Hi ,
Am confused with how the Hash and Merge works for this below table . For me the output of the Merge is correct . but Merge requires the data set to be sorted before the Merge to happen and its time consuming .Is there any option i can have the same output of Merge in hash join as well?
data product;
input ID $ File_Nm $ PRODUCT $;
datalines;
1001  File1 1234
1001  File1 4567
1001  File1 7890
1001  File1 0123
1001  File1 3456
1001  File1 6789
;
data YEAR2021_WW01;
input ID $ File_Nm $ Amount $;
 
datalines;
1001  File1 100
1001  File1 200
1001  File1 300
;
  
data want_hash;
 if 0 then set product ;
 
 
  declare hash p (dataset:"product");
  p.definekey("ID","File_Nm");
  p.definedata(all:'Y'); 
  p.definedone();
   do until(eof);
   set YEAR2021_WW01 end=eof;
 
if p.find() = 0 then output;
end;
stop;
run;
/*Merge Join*/
DATA WORK.want_merge;
MERGE WORK.product(IN=A) YEAR2021_WW01(IN=B);
BY ID File_Nm;
IF A;
RUN;
The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.
When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.
BTW, your merge suggests that the hash object should be for year2021_ww01, not product.
But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.
So you need to
data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;
A couple of NOTES:
Seems like you switch the two data sets around.
Try this
data want_hash;
   if 0 then set YEAR2021_WW01 ;
   declare hash p (dataset : "YEAR2021_WW01");
   p.definekey("ID", "File_Nm");
   p.definedata(all:'Y'); 
   p.definedone();
   do until(eof);
      set product end=eof;
      if p.find() = 0 then output;
   end;
   stop;
run;What is the question?
Are you asking if you can get sorted results from the HASH method without sorting the YEAR2021_WW01 dataset?
If so then the answer is no.
> Merge requires the data set to be sorted before the Merge to happen and its time consuming .
But then the merge takes no time at all. And the output is sorted, ready for the next merge (be sure to set the sorted flag as SAS is too lazy to do that).
Bottom line: If you want a sorted output, you need to sort at some point.
You can use the hash table for this if you want, but it is unlikely that the hash table is going to be the silver bullet you seem to think it is.
The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.
When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.
BTW, your merge suggests that the hash object should be for year2021_ww01, not product.
But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.
So you need to
data want (drop=_:);
  set product year2021_ww01 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'year2021_ww01',multidata:'Y');
      h.definekey('id','file_nm');
      h.definedata(all:'Y');
      h.definedone();
  end;
  if h.find()=0 then do;
    h.has_next(result:_duplicate_available);
    if _duplicate_available then h.removedup();
  end;
run;
A couple of NOTES:
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
