Hi,
I have one variable that is a list of schools. I need to look for duplicates that are similar but potentially misspelled or missing just one word or a space or something. Everything I've found with compged, soundex, etc. is for fuzzy matching two different variables or fuzzy matching two different datasets.
Any idea how I could do this?
Thanks,
Clare
You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?
Here is an example of looking at the same variable at different positions in the data.
Data have; infile datalines truncover; input line word :$20.; datalines; 1 Johnson 2 Hohnson 3 Marysville 4 Johnston 5 Martyville 6 Marysville 7 Johnson ; proc sql; create table compare as select a.line, a.word, b.line as Lineb, b.word as wordb, compged(a.word, b.word) as compval from have as a, have as b where a.word ne b.word and 0 < calculated compval le 300 ; quit;
The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.
If you have a clean list it would be be better to compare that but position.
The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.
If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.
You can look at other values within a single data set or do you have a set of "clean" values that you know are correct?
Here is an example of looking at the same variable at different positions in the data.
Data have; infile datalines truncover; input line word :$20.; datalines; 1 Johnson 2 Hohnson 3 Marysville 4 Johnston 5 Martyville 6 Marysville 7 Johnson ; proc sql; create table compare as select a.line, a.word, b.line as Lineb, b.word as wordb, compged(a.word, b.word) as compval from have as a, have as b where a.word ne b.word and 0 < calculated compval le 300 ; quit;
The upper value on the 0 < calculated compval le 300 looks for pretty close spellings. You would likely want to use a larger value. That might involve picking something large, like 1500 instead of 300 and if that shows too many "not close" comparisons reduce the value.
If you have a clean list it would be be better to compare that but position.
The Line variable I included basically to just show you which values might be compared. Other filters based on your actual data might be appropriate such as a.word < b.word which may reduce pairs with a and b values switched but school names are pretty ugly for this sort of comparison and if you don't have something else to restrict data on such as School District or City gets real ugly because of common school names like Washington, Lincoln, Jefferson, Roosevelt an similar, plus those named after more local personages.
If you have district or city information you would want to add something like " and a.city = b.city" or "and a.schooldistrict =b.schooldistrict" so you are comparing more similar items to find the differences.
To reduce the size of the Cartesian join, it's probably add another criterion such as
where a.WORD ne b.WORD
and first(a.WORD) = first(b.WORD) and 0 < calculated COMPVAL le 300
Amazing thank you!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.