DATA Step, Macro, Functions and more

Fuzzy logic to locate duplicates?

Accepted Solution Solved
Reply
Contributor
Posts: 31
Accepted Solution

Fuzzy logic to locate duplicates?

How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.


Accepted Solutions
Solution
‎03-01-2018 03:43 PM
Super Contributor
Posts: 472

Re: Fuzzy logic to locate duplicates?

I prefer using SPEDIS() function which gives a score on the likelihood of two variables. 

 

SPEDIS(First_name,First_name_match)<5;

 

COMPGED is also similar function. 

Thanks,
Suryakiran

View solution in original post


All Replies
Super User
Posts: 22,850

Re: Fuzzy logic to locate duplicates?

This is a non-trivial problem.  

 

One factor is the number of comparisons required, because you're now required to compare each record to all other records you're trying to match to, so it becomes a N*M comparison. 

 

You can look at the following functions for starter:

 

COMPGED

SOUNDEX/SOUNDS LIKE

 


Doug____ wrote:

How can fuzzy logic be used to locate duplicate records in a single dataset? I have records where spellings can vary (or be slightly incorrect) based on a combination of free text and standard text in some variables, so the standard NODUPKEY or sort with FIRST or LAST in data steps do not work correctly.


 

 

 

Solution
‎03-01-2018 03:43 PM
Super Contributor
Posts: 472

Re: Fuzzy logic to locate duplicates?

I prefer using SPEDIS() function which gives a score on the likelihood of two variables. 

 

SPEDIS(First_name,First_name_match)<5;

 

COMPGED is also similar function. 

Thanks,
Suryakiran
Super User
Posts: 13,046

Re: Fuzzy logic to locate duplicates?

How many records are you looking at?

Are there other variables in the data that help identify individuals (date of birth, address, other fields?)

 

 

PROC Star
Posts: 2,230

Re: Fuzzy logic to locate duplicates?

I found COMPGED gives more usable results than SPEDIS. 

Also, a good way to limit the size of the cartesian products, if applicable, it to match on similar lengths and on identical starting letter(s).

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 109 views
  • 3 likes
  • 5 in conversation