Dear SAS community members,
I am struggling with the following, hopefully someone can help.
I have a fairly large table (6 million rows) with a column with names. Unfortunately, spelling is not always consistent. So, for example, I might have ‘Mel Gibson’ and ‘Mel Gibbon’ as two different spellings of the same name. Now, what I would like is to correct this by replacing all these different spellings with a single one (e.g. the most frequently occuring). Also, since it is a fairly large number of rows I am looking for something efficient.
My idea was to first make a list of indices I of all names that need to be processed. Initially this list runs from 1 to num_rows (i.e, all entries). Then start a loop, pick the first name and calculate a criterion (e.g. Levenshtein distance < 3) for all remaining names. For all names where the criterion is met replace the spelling with the most frequently occuring spelling. Then update I and do the same for the next name in line. Keep doing this until length of I is zero. The problem is that I don’t know how to accomplish this. Help would be greatly appreciated!
Below is a small dummy dataset I have been playing around with.
Many thanks in advance for any help. Best regards,
Ruud
/*make dummy dataset:*/
data work.dummy_data;
input index names $20. corrected_names $20.;
format names $20. corrected_names $20.;
datalines;
1 Tom Hanks
2 John Trovolta
3 Mel Gibson
4 Sylvester Stalone
5 Mel Gibbon
6 Tim Honks
7 Johnny Travolta
8 Sylvester Stallone
9 Mel Gibson
10 Sylvester Stallone
11 Tom Hawks
12 John Travolta
13 John Travolta
14 Tommy Hanks
15 Tom Hanks
16 Mel Gibson
;
run;