How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.
I prefer using SPEDIS() function which gives a score on the likelihood of two variables.
SPEDIS(First_name,First_name_match)<5;
COMPGED is also similar function.
This is a non-trivial problem.
One factor is the number of comparisons required, because you're now required to compare each record to all other records you're trying to match to, so it becomes a N*M comparison.
You can look at the following functions for starter:
COMPGED
SOUNDEX/SOUNDS LIKE
@Doug____ wrote:
How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.
I prefer using SPEDIS() function which gives a score on the likelihood of two variables.
SPEDIS(First_name,First_name_match)<5;
COMPGED is also similar function.
How many records are you looking at?
Are there other variables in the data that help identify individuals (date of birth, address, other fields?)
I found COMPGED gives more usable results than SPEDIS.
Also, a good way to limit the size of the cartesian products, if applicable, it to match on similar lengths and on identical starting letter(s).
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.