Solved: Re: dealing with the worst data set

harrylui · Posted 02-14-2020 03:04 AM

hi all,

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

but i have to report the counting number of each team.

how can i recognize them with such similarity in wordings.

can i do this in program way? or i have to do it by eyeball check?

would you help to suggest some solutions? thanks a lot

Name Organization
Manchester united Football

Manchester uit Football

Manchester unite Football

arsenal Football

arsen Football

Manchester city Football

Manchester cit Football

laker Basketball

lake Basketball

liverpool Football

liverpoo n Football

Regards,

Harry

ed_sas_member · Posted 02-14-2020 05:38 AM

data have;
	infile datalines dlm="09"x;
	input Name :$80. Organization :$80.;
	datalines;
Manchester united	Football   
Manchester uit	Football  
Manchester unite	Football
arsenal	Football
arsen	Football
Manchester city	Football
Manchester cit	Football
laker	Basketball
lake	Basketball
liverpool	Football
liverpoo n	Football
;
run;

%macro compare (text);
	data have;
		set have;
		tmp1=soundex(Name);
		tmp2=soundex("&text.");
		dif=compged(tmp1, tmp2);
		if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
		drop dif tmp1 tmp2;
	run;
%mend;

%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)

title "Automatic correction: potential matches";
	proc print noobs;
	run;
title;

Output with Dif <= 50Output with Dif <= 100

View solution in original post

Oligolas · Posted 02-14-2020 03:22 AM

I think the easiest way would be to clean the data prior evaluation

data have;
length Name Organization $60;
infile datalines delimiter=','; 
input Name $ Organization $;
datalines;
Manchester united,Football 
Manchester uit,Football 
Manchester unite,Football
arsenal,Football
arsen,Football
Manchester city,Football
Manchester cit,Football
laker,Basketball
lake,Basketball
liverpool,Football
liverpoo n,Football
;
RUN;

PROC SQL;
select distinct "else if strip(upcase(name)) eq '"||strip(upcase(name))||"' then name=propcase('xxx');"
from have
;
QUIT;

data want;
   set have;
if strip(upcase(name)) eq 'ARSEN' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'ARSENAL' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'LAKE' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LAKER' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LIVERPOO N' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'LIVERPOOL' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'MANCHESTER CIT' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER CITY' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER UIT' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITE' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITED' then name=propcase('MANCHESTER UNITED');
run;

________________________
- Cheers -

ed_sas_member · Posted 02-14-2020 03:24 AM

Hi @harrylui

Here is an interesting article that could be relevant in your case:

HTTPS://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

Hope this helps.

best,

ed_sas_member · Posted 02-14-2020 05:38 AM

data have;
	infile datalines dlm="09"x;
	input Name :$80. Organization :$80.;
	datalines;
Manchester united	Football   
Manchester uit	Football  
Manchester unite	Football
arsenal	Football
arsen	Football
Manchester city	Football
Manchester cit	Football
laker	Basketball
lake	Basketball
liverpool	Football
liverpoo n	Football
;
run;

%macro compare (text);
	data have;
		set have;
		tmp1=soundex(Name);
		tmp2=soundex("&text.");
		dif=compged(tmp1, tmp2);
		if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
		drop dif tmp1 tmp2;
	run;
%mend;

%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)

title "Automatic correction: potential matches";
	proc print noobs;
	run;
title;

Output with Dif <= 50Output with Dif <= 100

ballardw · Posted 02-14-2020 11:03 AM

@harrylui wrote:

hi all,

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

but i have to report the counting number of each team.

how can i recognize them with such similarity in wordings.

can i do this in program way? or i have to do it by eyeball check?

would you help to suggest some solutions? thanks a lot

Name Organization
Manchester united Football

Manchester uit Football

Manchester unite Football

arsenal Football

arsen Football

Manchester city Football

Manchester cit Football

laker Basketball

lake Basketball

liverpool Football

liverpoo n Football

Regards,

Harry

Not even close to worst and with only one on the problematic side relatively minor just tedious.

I have dealt with address fields that values imbedded like "See the grandmother on Fridays"

Or individuals names that made no attempt what so ever to have any given name order. One file with :

Last name, first name, Middle name

First name , last name , middle name

First name (indicators like Junior II or II) then last name, middle name

last name, First name (indicators like Junior II or II) then, middle name

last name, First name then, middle name (indicators like Junior II or II)

without delimiters and roughly 10 percent with both parents last names (again in particular order).

and more

The only real difficulty I see would be if Manchester exists in two countries/ provinces states with the same sport played at the same level.

Consider " I am a Bronco's Football fan". I know without even resorting to google that there is one professional football team and two college teams in the US. I suspect there's more at different levels.

dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Registration is open

SAS Training: Just a Click Away