hi all,
please see my below data set.
i am dealing with the worst data set that i have ever seen.
many data come with wrong spelling and missing word.
can someone help?
Name Club no.
1 Manchester united 1234
2 Manchester uit 1234
3 Manchester unite 1234
4 arsenal 3214
5 arsen 3214
6 Tottenham Hotspur 3214
7 Tottenham Hotspu 3214
8 Manchester city 4321
9 Manchester cit 4321
10 laker 7890
11 lake 7890
12 liverpoo ncc 3333
13 liverpool 3333
what i want is can i write a program with below logical in do loop ?
if row1 Club No. = row2 Club No.
and dif=compged(Name1,Name2)
if dif >= 70 then do;
name1 = name2;
end;
what i expect to get is
Name Club no.
1 Manchester united 1234
2 Manchester united 1234
3 Manchester united 1234
4 arsenal 3214
5 arsenal 3214
6 Tottenham Hotspur 3214
7 Tottenham Hotspur 3214
8 Manchester city 4321
9 Manchester city 4321
10 laker 7890
11 laker 7890
12 liverpoo ncc 3333
13 liverpoo ncc 3333
and by sorting the data with no duplicate
i can get a unique name list and that is my purpose
expecting some spelling error.
the hardest part is some club share with same club no. and there are millions of data and might be there are thousands of unique club, so i can not correct the spelling by typing the right name
have to rely on the percentage of compged(Name1,Name2)
and the Club no.
thanks to all
Hi @harrylui
I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).
Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).
But to answer your question, I am not sure you can avoid a manual step ...
The first thing is really to have a better knowledge of your data so as to be able to recode them.
Best,
proc sort data=have;
by Club_no;
run;
proc freq data=have noprint;
table name / out=have_freq (drop=percent);
by Club_no;
run;
proc sort data=have_freq;
by Club_no descending count;
run;
proc transpose data=have_freq out=want (drop=_:) prefix=name;
var name;
by Club_no;
run;
Hi @harrylui
I am not sure you can do this easily without a kind of dictionary that specifies what are the 'true' values (like 'Manchester United).
Maybe a simple PROC FREQ and then a PROC TRANSPOSE by Club_no could help identify these reference names. In the below code, the first column (name1) presents the occurence that has the higher frequency, so maybe the higher probability that it is the good spelling).
But to answer your question, I am not sure you can avoid a manual step ...
The first thing is really to have a better knowledge of your data so as to be able to recode them.
Best,
proc sort data=have;
by Club_no;
run;
proc freq data=have noprint;
table name / out=have_freq (drop=percent);
by Club_no;
run;
proc sort data=have_freq;
by Club_no descending count;
run;
proc transpose data=have_freq out=want (drop=_:) prefix=name;
var name;
by Club_no;
run;
What if you considered the first 2 letters of every word? Would that simplify the problem?
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.