BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
harrylui
Obsidian | Level 7

hi all,

 

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

can someone help?

 

            Name                  Club no.
1    Manchester united     1234   

2    Manchester uit           1234

3    Manchester unite       1234

4    arsenal                       3214

5    arsen                          3214

6    Tottenham Hotspur     3214

7    Tottenham Hotspu      3214

8    Manchester city          4321

9    Manchester cit            4321

10    laker                         7890

11    lake                           7890

12  liverpoo ncc                 3333

13  liverpool                      3333

 

what i want is can i write a program with below logical in do loop ?

 

if row1 Club No. = row2 Club No. 

and dif=compged(Name1,Name2)

if dif >= 70 then do;

name1 = name2;

end;

 

what i expect to get is 

            Name                  Club no.
1    Manchester united     1234   

2    Manchester united     1234

3    Manchester united     1234

4    arsenal                       3214

5    arsenal                       3214

6    Tottenham Hotspur     3214

7    Tottenham Hotspur     3214

8    Manchester city          4321

9    Manchester city          4321

10  laker                          7890

11  laker                          7890

12  liverpoo ncc              3333

13  liverpoo ncc              3333

 

and by sorting the data with no duplicate 

i can get a unique name list and that is my purpose

expecting some spelling error.

the hardest part is some club share with same club no. and there are millions of data and might be there are thousands of unique club,  so i can not correct the spelling by typing the right name 

have to rely on the percentage of compged(Name1,Name2)

and the Club no.

 

thanks to all

 

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

Hi @harrylui 

 

I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).

Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).

But to answer your question, I am not sure you can avoid a manual step ...

The first thing is really to have a better knowledge of your data so as to be able to recode them.

Best,

proc sort data=have;
	by Club_no;
run;

proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

proc sort data=have_freq;
	by Club_no descending count;
run;

proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;

 

View solution in original post

2 REPLIES 2
ed_sas_member
Meteorite | Level 14

Hi @harrylui 

 

I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).

Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).

But to answer your question, I am not sure you can avoid a manual step ...

The first thing is really to have a better knowledge of your data so as to be able to recode them.

Best,

proc sort data=have;
	by Club_no;
run;

proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

proc sort data=have_freq;
	by Club_no descending count;
run;

proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;

 

PGStats
Opal | Level 21

What if you considered the first 2 letters of every word? Would that simplify the problem?

PG

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 801 views
  • 0 likes
  • 3 in conversation