BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
claremc
Obsidian | Level 7

Hi, 

 

I have one variable that is a list of schools. I need to look for duplicates that are similar but potentially misspelled or missing just one word or a space or something. Everything I've found with compged, soundex, etc. is for fuzzy matching two different variables or fuzzy matching two different datasets.

 

Any idea how I could do this? 

 

Thanks, 

Clare

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

 

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;
        

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

 

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

 

 

View solution in original post

3 REPLIES 3
ballardw
Super User

You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?

 

Here is an example of looking at the same variable at different positions in the data.

Data have;
  infile datalines truncover;
  input line word :$20.;
datalines;
1 Johnson
2 Hohnson
3 Marysville
4 Johnston
5 Martyville
6 Marysville
7 Johnson
;

proc sql;
  create table compare as
  select a.line, a.word, b.line as Lineb, b.word as wordb,
         compged(a.word, b.word) as compval 
  from have as a, have as b
  where a.word ne b.word
  and 0 < calculated compval le 300
  ;
quit;
        

The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.

 

If you have a clean list it would be be better to compare that but position.

The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.

If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.

 

 

ChrisNZ
Tourmaline | Level 20

To reduce the size of the Cartesian join, it's probably add another criterion such as

where a.WORD ne b.WORD
and first(a.WORD) = first(b.WORD) and 0 < calculated COMPVAL le 300

 

claremc
Obsidian | Level 7

Amazing thank you!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 889 views
  • 0 likes
  • 3 in conversation