Correct inconsistent spelling (using Levenshtein distances)

RuudB83 · Posted 12-10-2018 09:55 AM

Dear SAS community members,

I am struggling with the following, hopefully someone can help.

I have a fairly large table (6 million rows) with a column with names. Unfortunately, spelling is not always consistent. So, for example, I might have ‘Mel Gibson’ and ‘Mel Gibbon’ as two different spellings of the same name. Now, what I would like is to correct this by replacing all these different spellings with a single one (e.g. the most frequently occuring). Also, since it is a fairly large number of rows I am looking for something efficient.

My idea was to first make a list of indices I of all names that need to be processed. Initially this list runs from 1 to num_rows (i.e, all entries). Then start a loop, pick the first name and calculate a criterion (e.g. Levenshtein distance < 3) for all remaining names. For all names where the criterion is met replace the spelling with the most frequently occuring spelling. Then update I and do the same for the next name in line. Keep doing this until length of I is zero. The problem is that I don’t know how to accomplish this. Help would be greatly appreciated!

Below is a small dummy dataset I have been playing around with.

Many thanks in advance for any help. Best regards,

Ruud

/*make dummy dataset:*/
data work.dummy_data;
input index names $20. corrected_names $20.;
format names $20. corrected_names $20.; 
datalines;
1 Tom Hanks
2 John Trovolta
3 Mel Gibson
4 Sylvester Stalone
5 Mel Gibbon
6 Tim Honks
7 Johnny Travolta
8 Sylvester Stallone
9 Mel Gibson
10 Sylvester Stallone
11 Tom Hawks
12 John Travolta
13 John Travolta
14 Tommy Hanks
15 Tom Hanks
16 Mel Gibson
;
run;

RW9 · Posted 12-10-2018 10:21 AM

I don't have time to code this for you, but some guidance:

Soundex function can be used to encode text:

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245948.htm

This may get you some similarity.

You can use proc freq to to caculate frequencies of names.

RuudB83 · Posted 12-14-2018 09:38 AM

Probably not the most elegant solutions, but it works.

/*make dummy dataset:*/
data work.dummy_data;
input index names $20. corrected_names $20.;
format names $20. corrected_names $20.; 
datalines;
1 Tom Hanks
2 John Trovolta
3 Mel Gibson
4 Sylvester Stalone
5 Mel Gibbon
6 Tim Honks
7 Johnny Travolta
8 Sylvester Stallone
9 Mel Gibson
10 Sylvester Stallone
11 Tom Hawks
12 John Travolta
13 John Travolta
14 Tommy Hanks
15 Tom Hanks
16 Mel Gibson
;
run;

%macro clean_names;

/*macro variable with rows that still need to be processed (to-do list):*/
proc sql noprint;
select index into :to_do separated by ', '
from dummy_data
;
quit;

/*%put &to_do;*/

/*start loop, go on until there are no more rows to process):*/
%do %until(%sysfunc(countw("&to_do", ",")) = 1);

/*select rows that need to be processed:*/
proc sql;
create table work.temp1 as
select *
from dummy_data
where index in (&to_do)
;
quit;

/*take value from first row:*/
proc sql noprint;
select names into :name
from temp1
;
quit;

/*%put &name;*/

/*calculate Levenshtein distance relative to values in other rows:*/
proc sql;
create table work.temp2 as
select *,
(COMPLEV("&name", names)) as distance
from temp1
;
quit;

/*select all rows that meet criterion:*/
proc sql;
create table work.temp3 as
select *
from temp2
where distance <= 4
;
quit;

/*determine index, these rows will be corrected in this iteration:*/
proc sql noprint;
select index into :to_fill separated by ', '
from temp3
;
quit;

/*%put &to_fill;*/

/*make frequency table with various spellings:*/
proc freq data=work.temp3 order=freq noprint;
tables names / nocol norow nopercent out = work.temp_freq1;
run;

/*add index (for next step):*/
proc sql;
create table work.temp_freq2 as
select *,
(monotonic()) as row_number	
from temp_freq1
;
quit;

/*macro variable with most frequently occuring spelling (=pick):*/
proc sql noprint;
select names into :pick
from temp_freq2
where row_number = 1
;
quit;

/*%put &pick;*/

/*replace various spellings with most frequent spelling:*/
proc sql;
update dummy_data
set corrected_names = 
case 
when index in (&to_fill) then "&pick" 
else corrected_names
end;
quit;

%let to_do = 0; /* silly bug fix.*/

/*update to-do list:*/
proc sql noprint;
select index into :to_do separated by ', '
from dummy_data
where corrected_names is missing
;
quit;

%end;
%mend clean_names;

%clean_names

Correct inconsistent spelling (using Levenshtein distances)

Re: Correct inconsistent spelling (using Levenshtein distances)

Re: Correct inconsistent spelling (using Levenshtein distances)

Catch up on SAS Innovate 2026