BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
harrylui
Obsidian | Level 7

hi all,

 

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

but i have to report the counting number of each team. 

how can i recognize them with such similarity in wordings.

can i do this in program way?  or i have to do it by eyeball check?

would you help to suggest some solutions? thanks a lot

 

     Name                              Organization
     Manchester united          Football   

     Manchester uit                Football  

     Manchester unite            Football

     arsenal                            Football

     arsen                               Football

     Manchester city               Football

     Manchester cit                 Football

     laker                                Basketball

     lake                                 Basketball

     liverpool                          Football

     liverpoo n                        Football

 

 

 

Regards,

Harry

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

 

data have;
	infile datalines dlm="09"x;
	input Name :$80. Organization :$80.;
	datalines;
Manchester united	Football   
Manchester uit	Football  
Manchester unite	Football
arsenal	Football
arsen	Football
Manchester city	Football
Manchester cit	Football
laker	Basketball
lake	Basketball
liverpool	Football
liverpoo n	Football
;
run;

%macro compare (text);
	data have;
		set have;
		tmp1=soundex(Name);
		tmp2=soundex("&text.");
		dif=compged(tmp1, tmp2);
		if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
		drop dif tmp1 tmp2;
	run;
%mend;

%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)

title "Automatic correction: potential matches";
	proc print noobs;
	run;
title;

Output with Dif <= 50Output with Dif <= 50Output with Dif <= 100Output with Dif <= 100

 

View solution in original post

4 REPLIES 4
Oligolas
Barite | Level 11

I think the easiest way would be to clean the data prior evaluation

data have;
length Name Organization $60;
infile datalines delimiter=','; 
input Name $ Organization $;
datalines;
Manchester united,Football 
Manchester uit,Football 
Manchester unite,Football
arsenal,Football
arsen,Football
Manchester city,Football
Manchester cit,Football
laker,Basketball
lake,Basketball
liverpool,Football
liverpoo n,Football
;
RUN;

PROC SQL;
select distinct "else if strip(upcase(name)) eq '"||strip(upcase(name))||"' then name=propcase('xxx');"
from have
;
QUIT;

data want;
   set have;
if strip(upcase(name)) eq 'ARSEN' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'ARSENAL' then name=propcase('ARSENAL');
else if strip(upcase(name)) eq 'LAKE' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LAKER' then name=propcase('LAKER');
else if strip(upcase(name)) eq 'LIVERPOO N' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'LIVERPOOL' then name=propcase('LIVERPOOL');
else if strip(upcase(name)) eq 'MANCHESTER CIT' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER CITY' then name=propcase('MANCHESTER CITY');
else if strip(upcase(name)) eq 'MANCHESTER UIT' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITE' then name=propcase('MANCHESTER UNITED');
else if strip(upcase(name)) eq 'MANCHESTER UNITED' then name=propcase('MANCHESTER UNITED');
run;
________________________

- Cheers -

ed_sas_member
Meteorite | Level 14

Hi @harrylui 

 

Here is an interesting article that could be relevant in your case:

 

HTTPS://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/


Hope this helps.

 

best,

ed_sas_member
Meteorite | Level 14

 

data have;
	infile datalines dlm="09"x;
	input Name :$80. Organization :$80.;
	datalines;
Manchester united	Football   
Manchester uit	Football  
Manchester unite	Football
arsenal	Football
arsen	Football
Manchester city	Football
Manchester cit	Football
laker	Basketball
lake	Basketball
liverpool	Football
liverpoo n	Football
;
run;

%macro compare (text);
	data have;
		set have;
		tmp1=soundex(Name);
		tmp2=soundex("&text.");
		dif=compged(tmp1, tmp2);
		if dif<=90 then match="&text."; /* choose an acceptable cut-off 50? 90? 100? */
		drop dif tmp1 tmp2;
	run;
%mend;

%compare(Manchester United)
%compare(Arsenal)
%compare(Liverpool)
%compare(Lakers)
%compare(Manchester City)

title "Automatic correction: potential matches";
	proc print noobs;
	run;
title;

Output with Dif <= 50Output with Dif <= 50Output with Dif <= 100Output with Dif <= 100

 

ballardw
Super User

@harrylui wrote:

hi all,

 

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

but i have to report the counting number of each team. 

how can i recognize them with such similarity in wordings.

can i do this in program way?  or i have to do it by eyeball check?

would you help to suggest some solutions? thanks a lot

 

     Name                              Organization
     Manchester united          Football   

     Manchester uit                Football  

     Manchester unite            Football

     arsenal                            Football

     arsen                               Football

     Manchester city               Football

     Manchester cit                 Football

     laker                                Basketball

     lake                                 Basketball

     liverpool                          Football

     liverpoo n                        Football

 

 

 

Regards,

Harry


Not even close to worst and with only one on the problematic side relatively minor just tedious.

I have dealt with address fields that values imbedded like "See the grandmother on Fridays"

Or individuals names that made no attempt what so ever to have any given name order. One file with :

Last name, first name, Middle name

First name , last name , middle name

First name (indicators like Junior II or II) then last name, middle name

last name, First name (indicators like Junior II or II) then, middle name

last name, First name then, middle name  (indicators like Junior II or II)

 

without delimiters and roughly 10 percent with both parents last names (again in particular order).

and more

 

The only real difficulty I see would be if Manchester exists in two countries/ provinces states with the same sport played at the same level.

 

Consider " I am a Bronco's Football fan". I know without even resorting to google that there is one professional football team and two college teams in the US. I suspect there's more at different levels.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1141 views
  • 0 likes
  • 4 in conversation