Solved: Looking for fuzzy duplicates in one variable

claremc · Posted 12-09-2020 04:42 PM

Hi,

I have one variable that is a list of schools. I need to look for duplicates that are similar but potentially misspelled or missing just one word or a space or something. Everything I've found with compged, soundex, etc. is for fuzzy matching two different variables or fuzzy matching two different datasets.

Any idea how I could do this?

Thanks,

Clare

ballardw · Posted 12-09-2020 05:29 PM

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

View solution in original post

ballardw · Posted 12-09-2020 05:29 PM

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

ChrisNZ · Posted 12-09-2020 07:33 PM

To reduce the size of the Cartesian join, it's probably add another criterion such as

where a.WORD ne b.WORD
  and first(a.WORD) = first(b.WORD)
  and 0 < calculated COMPVAL le 300

High-Performance SAS Coding - Third Edition

claremc · Posted 12-23-2020 01:53 PM

Amazing thank you!

Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

Re: Looking for fuzzy duplicates in one variable

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!