Solved: dealing with the worst data set

harrylui · Posted 02-15-2020 09:46 AM

hi all,

please see my below data set.

i am dealing with the worst data set that i have ever seen.

many data come with wrong spelling and missing word.

can someone help?

Name Club no.
1 Manchester united 1234

2 Manchester uit 1234

3 Manchester unite 1234

4 arsenal 3214

5 arsen 3214

6 Tottenham Hotspur 3214

7 Tottenham Hotspu 3214

8 Manchester city 4321

9 Manchester cit 4321

10 laker 7890

11 lake 7890

12 liverpoo ncc 3333

13 liverpool 3333

what i want is can i write a program with below logical in do loop ?

if row1 Club No. = row2 Club No.

and dif=compged(Name1,Name2)

if dif >= 70 then do;

name1 = name2;

end;

what i expect to get is

Name Club no.
1 Manchester united 1234

2 Manchester united 1234

3 Manchester united 1234

4 arsenal 3214

5 arsenal 3214

6 Tottenham Hotspur 3214

7 Tottenham Hotspur 3214

8 Manchester city 4321

9 Manchester city 4321

10 laker 7890

11 laker 7890

12 liverpoo ncc 3333

13 liverpoo ncc 3333

and by sorting the data with no duplicate

i can get a unique name list and that is my purpose

expecting some spelling error.

the hardest part is some club share with same club no. and there are millions of data and might be there are thousands of unique club, so i can not correct the spelling by typing the right name

have to rely on the percentage of compged(Name1,Name2)

and the Club no.

thanks to all

ed_sas_member · Posted 02-15-2020 01:41 PM

Hi @harrylui

I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).

Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).

But to answer your question, I am not sure you can avoid a manual step ...

The first thing is really to have a better knowledge of your data so as to be able to recode them.

Best,

proc sort data=have;
	by Club_no;
run;

proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

proc sort data=have_freq;
	by Club_no descending count;
run;

proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;

View solution in original post

ed_sas_member · Posted 02-15-2020 01:41 PM

Hi @harrylui

I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).

Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).

But to answer your question, I am not sure you can avoid a manual step ...

The first thing is really to have a better knowledge of your data so as to be able to recode them.

Best,

proc sort data=have;
	by Club_no;
run;

proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

proc sort data=have_freq;
	by Club_no descending count;
run;

proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;

PGStats · Posted 02-15-2020 02:32 PM

What if you considered the first 2 letters of every word? Would that simplify the problem?

PG

dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Re: dealing with the worst data set

Click image to register for webinar

Classroom Training Available!