Hi ,
Am confused with how the Hash and Merge works for this below table . For me the output of the Merge is correct . but Merge requires the data set to be sorted before the Merge to happen and its time consuming .Is there any option i can have the same output of Merge in hash join as well?
data product; input ID $ File_Nm $ PRODUCT $; datalines; 1001 File1 1234 1001 File1 4567 1001 File1 7890 1001 File1 0123 1001 File1 3456 1001 File1 6789 ; data YEAR2021_WW01; input ID $ File_Nm $ Amount $; datalines; 1001 File1 100 1001 File1 200 1001 File1 300 ; data want_hash; if 0 then set product ; declare hash p (dataset:"product"); p.definekey("ID","File_Nm"); p.definedata(all:'Y'); p.definedone(); do until(eof); set YEAR2021_WW01 end=eof; if p.find() = 0 then output; end; stop; run;
/*Merge Join*/
DATA WORK.want_merge;
MERGE WORK.product(IN=A) YEAR2021_WW01(IN=B);
BY ID File_Nm;
IF A;
RUN;
The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.
When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.
BTW, your merge suggests that the hash object should be for year2021_ww01, not product.
But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.
So you need to
data want (drop=_:);
set product year2021_ww01 (obs=0);
if _n_=1 then do;
declare hash h (dataset:'year2021_ww01',multidata:'Y');
h.definekey('id','file_nm');
h.definedata(all:'Y');
h.definedone();
end;
if h.find()=0 then do;
h.has_next(result:_duplicate_available);
if _duplicate_available then h.removedup();
end;
run;
A couple of NOTES:
Seems like you switch the two data sets around.
Try this
data want_hash;
if 0 then set YEAR2021_WW01 ;
declare hash p (dataset : "YEAR2021_WW01");
p.definekey("ID", "File_Nm");
p.definedata(all:'Y');
p.definedone();
do until(eof);
set product end=eof;
if p.find() = 0 then output;
end;
stop;
run;
What is the question?
Are you asking if you can get sorted results from the HASH method without sorting the YEAR2021_WW01 dataset?
If so then the answer is no.
> Merge requires the data set to be sorted before the Merge to happen and its time consuming .
But then the merge takes no time at all. And the output is sorted, ready for the next merge (be sure to set the sorted flag as SAS is too lazy to do that).
Bottom line: If you want a sorted output, you need to sort at some point.
You can use the hash table for this if you want, but it is unlikely that the hash table is going to be the silver bullet you seem to think it is.
The reason you are not successful in using hash object to replicate the merge is primarily due to the fact that MERGE is a "match_merge" process. When you have a many-to-many match as you do in your example, then merge combines the first match from each dataset, then the second match from each dataset, etc.
When one of the datasets is shorter (as year2021_ww01 is here), then the last observation of that shorter sequence is matched with every "excess" obs in the longer sequence.
BTW, your merge suggests that the hash object should be for year2021_ww01, not product.
But even then your code would fail, because the hash object defaults to storing one dataitem (i.e. one "row") per key combination, so you don't even keep the entire sequence of obs from year2021_ww01 for any repeated ID/FILE_NM.
So you need to
data want (drop=_:);
set product year2021_ww01 (obs=0);
if _n_=1 then do;
declare hash h (dataset:'year2021_ww01',multidata:'Y');
h.definekey('id','file_nm');
h.definedata(all:'Y');
h.definedone();
end;
if h.find()=0 then do;
h.has_next(result:_duplicate_available);
if _duplicate_available then h.removedup();
end;
run;
A couple of NOTES:
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.