I'm testing out how to use hash objects in SAS 9.4 M6 to do fuzzy joins since PROC SQL just runs for hours on my larger dataset. I created some sample datasets (below) and what I want is for the merge to pull in exact matches on the "name" fields AND any matches that have a COMPLEV score less than 10. Right now, this code still only pulls in the exact matches. I'm very new to hash objects so I'm sure it's a simple fix but I've tried am in need of help.
data A; infile datalines missover; length nameA $50; input nameA $ ; datalines; MICKEYMOUSE2000-01-02 DAFDUCK1990-09-23 GOOFYMAN1993-05-11 ; run; *second dataset with one exact match and two that differ slightly from those in dataset A; data B; infile datalines missover; length nameB $50; input nameB $ VDAY :ddmmyy10.; format VDAY ddmmyy10.; datalines; MICKEYMOUSE2000-01-01 07/08/2021 DAFFYDUCK1990-09-23 05/11/2021 GOOFYMAN1993-05-11 08/11/2021 ; run; *only pulling in exact matches, want it to pull in other fuzzy matches; data simplemerge ; if 0 then set work.B ; *load var properties into hash table; if _n_ = 1 then do; dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset; B.definekey('nameB');*identify var in B to use as key; B.definedata('nameB','vday');*identify columns of data to bring in from B dataset; B.definedone();*complete hash table definition; end; set work.A; *bring in A data; if B.Find(KEY: nameA) ne 0 then do; if complev(nameA, nameB) < 10 then do; B.ref(key : nameB,data : nameB, data : vday); end; end; RUN;
Hash tables only perform equijoins.
To speed up your fuzzy SQL match, split it. For example:
1- Match on same value
2- Take the unmatched records and match on same uppercase values
3- Take the unmatched records and match on same first letter and same compbl() value
4- Take the unmatched records and match on same first letter and same length and complev()
5- Take the unmatched records and match on same first letter and complev()
6- Take the unmatched records and match on complev()
Adapt the steps to your data. The goal is to increase match complexity as you lower volume.
The code above is just an example: I'd hope that you have standardised things such case, punctuation or spacing.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.