How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.
I prefer using SPEDIS() function which gives a score on the likelihood of two variables.
SPEDIS(First_name,First_name_match)<5;
COMPGED is also similar function.
This is a non-trivial problem.
One factor is the number of comparisons required, because you're now required to compare each record to all other records you're trying to match to, so it becomes a N*M comparison.
You can look at the following functions for starter:
COMPGED
SOUNDEX/SOUNDS LIKE
@Doug____ wrote:
How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.
I prefer using SPEDIS() function which gives a score on the likelihood of two variables.
SPEDIS(First_name,First_name_match)<5;
COMPGED is also similar function.
How many records are you looking at?
Are there other variables in the data that help identify individuals (date of birth, address, other fields?)
I found COMPGED gives more usable results than SPEDIS.
Also, a good way to limit the size of the cartesian products, if applicable, it to match on similar lengths and on identical starting letter(s).
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.