BookmarkSubscribeRSS Feed
RuudB83
Calcite | Level 5

Dear SAS community members,

 

I am struggling with the following, hopefully someone can help.

 

I have a fairly large table (6 million rows) with a column with names. Unfortunately, spelling is not always consistent. So, for example, I might have ‘Mel Gibson’ and ‘Mel Gibbon’ as two different spellings of the same name. Now, what I would like is to correct this by replacing all these different spellings with a single one (e.g. the most frequently occuring). Also, since it is a fairly large number of rows I am looking for something efficient.

 

My idea was to first make a list of indices I of all names that need to be processed. Initially this list runs from 1 to num_rows (i.e, all entries). Then start a loop, pick the first name and calculate a criterion (e.g. Levenshtein distance < 3) for all remaining names. For all names where the criterion is met replace the spelling with the most frequently occuring spelling. Then update I and do the same for the next name in line. Keep doing this until length of I is zero. The problem is that I don’t know how to accomplish this. Help would be greatly appreciated!

 

Below is a small dummy dataset I have been playing around with.

 

Many thanks in advance for any help. Best regards,

Ruud

 

/*make dummy dataset:*/
data work.dummy_data;
input index names $20. corrected_names $20.;
format names $20. corrected_names $20.; 
datalines;
1 Tom Hanks
2 John Trovolta
3 Mel Gibson
4 Sylvester Stalone
5 Mel Gibbon
6 Tim Honks
7 Johnny Travolta
8 Sylvester Stallone
9 Mel Gibson
10 Sylvester Stallone
11 Tom Hawks
12 John Travolta
13 John Travolta
14 Tommy Hanks
15 Tom Hanks
16 Mel Gibson
;
run; 
2 REPLIES 2
RW9
Diamond | Level 26 RW9
Diamond | Level 26

I don't have time to code this for you, but some guidance:

Soundex function can be used to encode text:

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245948.htm

This may get you some similarity.

 

You can use proc freq to to caculate frequencies of names.

RuudB83
Calcite | Level 5

Probably not the most elegant solutions, but it works.

 

/*make dummy dataset:*/
data work.dummy_data;
input index names $20. corrected_names $20.;
format names $20. corrected_names $20.; 
datalines;
1 Tom Hanks
2 John Trovolta
3 Mel Gibson
4 Sylvester Stalone
5 Mel Gibbon
6 Tim Honks
7 Johnny Travolta
8 Sylvester Stallone
9 Mel Gibson
10 Sylvester Stallone
11 Tom Hawks
12 John Travolta
13 John Travolta
14 Tommy Hanks
15 Tom Hanks
16 Mel Gibson
;
run;

%macro clean_names;

/*macro variable with rows that still need to be processed (to-do list):*/
proc sql noprint;
select index into :to_do separated by ', '
from dummy_data
;
quit;

/*%put &to_do;*/

/*start loop, go on until there are no more rows to process):*/
%do %until(%sysfunc(countw("&to_do", ",")) = 1);

/*select rows that need to be processed:*/
proc sql;
create table work.temp1 as
select *
from dummy_data
where index in (&to_do)
;
quit;

/*take value from first row:*/
proc sql noprint;
select names into :name
from temp1
;
quit;

/*%put &name;*/

/*calculate Levenshtein distance relative to values in other rows:*/
proc sql;
create table work.temp2 as
select *,
(COMPLEV("&name", names)) as distance
from temp1
;
quit;

/*select all rows that meet criterion:*/
proc sql;
create table work.temp3 as
select *
from temp2
where distance <= 4
;
quit;

/*determine index, these rows will be corrected in this iteration:*/
proc sql noprint;
select index into :to_fill separated by ', '
from temp3
;
quit;

/*%put &to_fill;*/

/*make frequency table with various spellings:*/
proc freq data=work.temp3 order=freq noprint;
tables names / nocol norow nopercent out = work.temp_freq1;
run;

/*add index (for next step):*/
proc sql;
create table work.temp_freq2 as
select *,
(monotonic()) as row_number	
from temp_freq1
;
quit;

/*macro variable with most frequently occuring spelling (=pick):*/
proc sql noprint;
select names into :pick
from temp_freq2
where row_number = 1
;
quit;

/*%put &pick;*/

/*replace various spellings with most frequent spelling:*/
proc sql;
update dummy_data
set corrected_names = 
case 
when index in (&to_fill) then "&pick" 
else corrected_names
end;
quit;

%let to_do = 0; /* silly bug fix.*/

/*update to-do list:*/
proc sql noprint;
select index into :to_do separated by ', '
from dummy_data
where corrected_names is missing
;
quit;

%end;
%mend clean_names;

%clean_names

 

Catch up on SAS Innovate 2026

Dive into keynotes, announcements and breakthroughs on demand.

Explore Now →
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 1274 views
  • 0 likes
  • 2 in conversation