BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
claremc
Obsidian | Level 7

Hi, 

 

I have one variable that is a list of schools. I need to look for duplicates that are similar but potentially misspelled or missing just one word or a space or something. Everything I've found with compged, soundex, etc. is for fuzzy matching two different variables or fuzzy matching two different datasets.

 

Any idea how I could do this? 

 

Thanks, 

Clare

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

 

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;
        

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

 

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

 

 

View solution in original post

3 REPLIES 3
ballardw
Super User

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

 

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;
        

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

 

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

 

 

ChrisNZ
Tourmaline | Level 20

To reduce the size of the Cartesian join, it's probably add another criterion such as

where a.WORD ne b.WORD
and first(a.WORD) = first(b.WORD) and 0 < calculated COMPVAL le 300

 

claremc
Obsidian | Level 7

Amazing thank you!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1369 views
  • 0 likes
  • 3 in conversation