Solved: Fuzzy logic to locate duplicates?

Doug____ · Posted 02-12-2018 12:56 PM

How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) because some variables combine free text and standard text, so the standard approaches (NODUPKEY, or sorting and using FIRST. or LAST. in a DATA step) do not work correctly.

SuryaKiran · Posted 02-12-2018 05:29 PM

I prefer the SPEDIS() function, which returns a spelling-distance score for two values: 0 for an exact match, and higher scores for less similar strings.

SPEDIS(First_name,First_name_match)<5;

COMPGED is a similar function.
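
To show how a condition like the one above might be applied within a single dataset, here is a minimal sketch using a PROC SQL self-join. The dataset name (people), the variables (id, first_name), the sample rows, and the threshold of 20 are all assumptions for illustration; the cutoff in particular needs tuning on real data.

```sas
/* Hypothetical sample data: dataset and variable names are assumptions */
data people;
   input id first_name :$20.;
   datalines;
1 Jonathan
2 Jonathon
3 Maria
4 Marie
5 Robert
;
run;

/* Self-join: a.id < b.id keeps each pair once and skips self-matches */
proc sql;
   create table possible_dups as
   select a.id         as id1,
          a.first_name as name1,
          b.id         as id2,
          b.first_name as name2,
          spedis(a.first_name, b.first_name) as score
   from people as a, people as b
   where a.id < b.id
     and calculated score < 20   /* assumed threshold, tune on your data */
   order by score;
quit;
```

SPEDIS is asymmetric, so swapping the two arguments can change the score; scoring each pair in both directions and keeping the smaller value is one possible way to handle that.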

Thanks,
Suryakiran

Reeza · Posted 02-12-2018 01:26 PM

This is a non-trivial problem.

One factor is the number of comparisons required: each record has to be compared to every other record it could match, so it becomes an N*M comparison. Within a single dataset of N records, that is roughly N*(N-1)/2 pairs.

You can look at the following functions for starters:

COMPGED

SOUNDEX/SOUNDS LIKE
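
As a rough illustration of the phonetic route, the sketch below pairs up names that sound alike using the sounds-like operator (=*) in PROC SQL and shows the SOUNDEX codes alongside. The dataset name (names), the variable (name), and the sample values are assumptions for illustration only.

```sas
/* Hypothetical sample data: dataset and variable names are assumptions */
data names;
   input name :$20.;
   datalines;
Smith
Smyth
Smythe
Jones
;
run;

/* Pair up names that sound alike; a.name < b.name keeps each pair once */
proc sql;
   select a.name          as name1,
          b.name          as name2,
          soundex(a.name) as code1,
          soundex(b.name) as code2
   from names as a, names as b
   where a.name < b.name
     and a.name =* b.name;   /* sounds-like operator */
quit;
```

SOUNDEX was designed around English pronunciation, so it may be less useful for names from other languages or for free text that is not a name.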

ballardw · Posted 02-12-2018 07:25 PM

How many records are you looking at?

Are there other variables in the data that help identify individuals (date of birth, address, other fields?)

ChrisNZ · Posted 02-12-2018 09:22 PM

I found COMPGED gives more usable results than SPEDIS.

Also, a good way to limit the size of the Cartesian product, if applicable, is to match only on similar lengths and on identical starting letter(s).
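
The idea of restricting the candidate pairs before scoring them could be sketched as follows. The dataset (people), the variables (id, first_name), the sample rows, the length tolerance of 2, and the COMPGED cutoff of 200 are all assumptions for illustration, not values recommended anywhere in this thread.

```sas
/* Hypothetical sample data: dataset and variable names are assumptions */
data people;
   input id first_name :$20.;
   datalines;
1 Jonathan
2 Jonathon
3 Maria
4 Marie
5 Robert
6 Roberta
;
run;

proc sql;
   create table candidate_pairs as
   select a.id         as id1,
          b.id         as id2,
          a.first_name as name1,
          b.first_name as name2,
          /* upcase both sides so case differences are not counted */
          compged(upcase(a.first_name), upcase(b.first_name)) as ged
   from people as a, people as b
   where a.id < b.id
     /* same first letter, ignoring case */
     and upcase(substr(a.first_name, 1, 1)) = upcase(substr(b.first_name, 1, 1))
     /* lengths differ by at most 2 characters (assumed tolerance) */
     and abs(length(a.first_name) - length(b.first_name)) <= 2
     /* assumed cutoff, tune on your data */
     and calculated ged <= 200
   order by ged;
quit;
```

The trade-off is that a typo in the first letter (for example "Jon" versus "Ion") would never be scored, so this kind of filter suits data where the leading characters are usually reliable.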
