hi all,
please see my below data set.
i am dealing with the worst data set that i have ever seen.
many data come with wrong spelling and missing word.
but i have to report the counting number of each team.
how can i recognize them with such similarity in wordings.
can i do this in program way? or i have to do it by eyeball check?
would you help to suggest some solutions? thanks a lot
Name Organization
Manchester united Football
Manchester uit Football
Manchester unite Football
arsenal Football
arsen Football
Manchester city Football
Manchester cit Football
laker Basketball
lake Basketball
liverpool Football
liverpoo n Football
Regards,
Harry
data have;
infile datalines dlm="09"x;
input Name :$80. Organization :$80.;
datalines;
Manchester united Football
Manchester uit Football
Manchester unite Football
arsenal Football
arsen Football
Manchester city Football
Manchester cit Football
laker Basketball
lake Basketball
liverpool Football
liverpoo n Football
;
run;
%macro compare (text);
data have;
set have;
tmp1=soundex(Name);
tmp2=soundex("&text.");
dif=compged(tmp1, tmp2);
if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
drop dif tmp1 tmp2;
run;
%mend;
%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)
title "Automatic correction: potential matches";
proc print noobs;
run;
title;
Output with Dif <= 50
Output with Dif <= 100
I think the easiest way would be to clean the data prior evaluation
data have;
length Name Organization $60;
infile datalines delimiter=',';
input Name $ Organization $;
datalines;
Manchester united,Football
Manchester uit,Football
Manchester unite,Football
arsenal,Football
arsen,Football
Manchester city,Football
Manchester cit,Football
laker,Basketball
lake,Basketball
liverpool,Football
liverpoo n,Football
;
RUN;
PROC SQL;
select distinct "else if strip(upcase(name)) eq '"||strip(upcase(name))||"' then name=propcase('xxx');"
from have
;
QUIT;
data want;
set have;
if strip(upcase(name)) eq 'ARSEN' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'ARSENAL' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'LAKE' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LAKER' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LIVERPOO N' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'LIVERPOOL' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'MANCHESTER CIT' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER CITY' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER UIT' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITE' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITED' then name=propcase('MANCHESTER UNITED');
run;
- Cheers -
Hi @harrylui
Here is an interesting article that could be relevant in your case:
HTTPS://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
Hope this helps.
best,
data have;
infile datalines dlm="09"x;
input Name :$80. Organization :$80.;
datalines;
Manchester united Football
Manchester uit Football
Manchester unite Football
arsenal Football
arsen Football
Manchester city Football
Manchester cit Football
laker Basketball
lake Basketball
liverpool Football
liverpoo n Football
;
run;
%macro compare (text);
data have;
set have;
tmp1=soundex(Name);
tmp2=soundex("&text.");
dif=compged(tmp1, tmp2);
if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
drop dif tmp1 tmp2;
run;
%mend;
%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)
title "Automatic correction: potential matches";
proc print noobs;
run;
title;
Output with Dif <= 50
Output with Dif <= 100
@harrylui wrote:
hi all,
please see my below data set.
i am dealing with the worst data set that i have ever seen.
many data come with wrong spelling and missing word.
but i have to report the counting number of each team.
how can i recognize them with such similarity in wordings.
can i do this in program way? or i have to do it by eyeball check?
would you help to suggest some solutions? thanks a lot
Name Organization
Manchester united FootballManchester uit Football
Manchester unite Football
arsenal Football
arsen Football
Manchester city Football
Manchester cit Football
laker Basketball
lake Basketball
liverpool Football
liverpoo n Football
Regards,
Harry
Not even close to worst and with only one on the problematic side relatively minor just tedious.
I have dealt with address fields that values imbedded like "See the grandmother on Fridays"
Or individuals names that made no attempt what so ever to have any given name order. One file with :
Last name, first name, Middle name
First name , last name , middle name
First name (indicators like Junior II or II) then last name, middle name
last name, First name (indicators like Junior II or II) then, middle name
last name, First name then, middle name (indicators like Junior II or II)
without delimiters and roughly 10 percent with both parents last names (again in particular order).
and more
The only real difficulty I see would be if Manchester exists in two countries/ provinces states with the same sport played at the same level.
Consider " I am a Bronco's Football fan". I know without even resorting to google that there is one professional football team and two college teams in the US. I suspect there's more at different levels.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.