Solved
Contributor
Posts: 37

# Fuzzy logic to locate duplicates?

How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.

Accepted Solutions
Solution
‎03-01-2018 03:43 PM
PROC Star
Posts: 622

## Re: Fuzzy logic to locate duplicates?

I prefer using SPEDIS() function which gives a score on the likelihood of two variables.

SPEDIS(First_name,First_name_match)<5;

COMPGED is also similar function.

Thanks,
Suryakiran

All Replies
Super User
Posts: 23,963

## Re: Fuzzy logic to locate duplicates?

This is a non-trivial problem.

One factor is the number of comparisons required, because you're now required to compare each record to all other records you're trying to match to, so it becomes a N*M comparison.

You can look at the following functions for starter:

COMPGED

SOUNDEX/SOUNDS LIKE

Doug____ wrote:

How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.

Solution
‎03-01-2018 03:43 PM
PROC Star
Posts: 622

## Re: Fuzzy logic to locate duplicates?

I prefer using SPEDIS() function which gives a score on the likelihood of two variables.

SPEDIS(First_name,First_name_match)<5;

COMPGED is also similar function.

Thanks,
Suryakiran
Super User
Posts: 13,898

## Re: Fuzzy logic to locate duplicates?

How many records are you looking at?

Are there other variables in the data that help identify individuals (date of birth, address, other fields?)

Super User
Posts: 2,499

## Re: Fuzzy logic to locate duplicates?

I found COMPGED gives more usable results than SPEDIS.

Also, a good way to limit the size of the cartesian products, if applicable, it to match on similar lengths and on identical starting letter(s).

☑ This topic is solved.