Re: Multiple Spellings of a Name

deleted_user · Posted 04-20-2009 11:10 AM

Hello All,

I am reading survey data into a SAS Data Set from a Microsoft Excel spreadsheet. In this survey, users were allowed to manually enter their supervisors name, so naturally there are 'several' spellings of each name and also a variety in case/punctuation when entering the names.

Example:

Michael Smith
MICHAEL SMITH
Micheal L. Smithi

My question is: Is there a way for me to instruct SAS to recognize these multiple entries as one person as I import these values into the Data Set? Obviously, I would like to both standardize the data and combine all 'input' from multiple employees into one observation.

Thanks!

sbb · Posted 04-20-2009 11:18 AM

Why would one assume that these are in fact the same individual at all? I know a Michael Smith and also a Michel Smith -- different individuals. Hopefully you have some other "key" variable value to associate with one person, right?

Scott Barry
SBBWorks, Inc.

deleted_user · Posted 04-20-2009 11:23 AM

Yes, I should have included that information in my original post.

This was a pretty small survey and I *can* say, without a doubt, that these are the same individual. In fact, there aren't even any names of supervisors that are even 'close' to being spelled in a similar way.

thanks Scott!

data_null__ · Posted 04-20-2009 11:46 AM

Look at functions SPEDIS SOUNDEX COMPGED and COMPLEV.

Also, "sounds like" operator. Which I think uses SOUNDEX.

Hopefully one of these will be adequate.

If you don't have a lot you could code them with a format or informat.

deleted_user · Posted 04-20-2009 03:00 PM

Hello takheeljeff,

I think that "data _null_" gave you a good advice.

Here is a little example that shows you how to implement the matching:

data T01_input;
length name $20;
infile cards delimiter=',';
input id name;
cards;
1,spiderman
2,superman
3,wonder woman
4,wonder woman
5,felix the cat
6,felix le chat
7,mickey mouse
8,samsam
;
run;

proc sql;
create table T02_matches as
select A.id as id1, A.name as name1, B.id as id2, B.name as name2,spedis(A.name,B.name) as distance
from T01_input A, T01_input B
where spedis(A.name,B.name) le 50 and A.id ne B.id;
quit;

It will give the following result:
id1=1 name1=spiderman id2=2 name2=superman distance=27
id1=2 name1=superman id2=1 name2=spiderman distance=25
id1=3 name1=wonder woman id2=4 name2=wonder woman distance=0
id1=4 name1=wonder woman id2=3 name2=wonder woman distance=0
id1=5 name1=felix the cat id2=6 name2=felix le chat distance=19
id1=6 name1=felix le chat id2=5 name2=felix the cat distance=19

Pay attention to the following remarks:
- in order to make the comparisons, we are using a "cartesian product". Take a look at the where clause: it will evaluate all possible pairs of records and will consider as valid matches those that have a spedis distance less than 50. Watch out: if your sample is big, this can imply a lot of comparisons. For example, for a sample containing 1000 records, you will make 1000*1000 comparisons, ie, 1Million comparisons. Tha'ts a lot. In order to reduce the number of comparisons, it is common to "add something" to the where clause. For example, you could decide to restrict comparisons for records whose addresses have the same zip codes. This is called "blocking".
- it is up to you to put a threshold on the distance (I took 50 ... but you could use something else). For very large datasets, it makes sense to do "some learning", ie, you take a subset of comparisons and you decide whether each of these comparisons are correct or false matches; based on that experience, you set the threshold. You could even have two threshold: 1 for "I'm sure", 1 for "To be checked" and 1 for "incorrect match". There are algorithms for the learning part ... but that's another story.
- watch out for false matches. You'll get some. For example: spiderman is not the same as superman.

You could base your matching on words instead of strings.

I hope this will be helpful to you,

Yoba Message was edited by: yoba

R_Win · Posted 04-21-2009 01:30 AM

u can check it in SAS Data flux u can do the name standardization so that i hope it will help u

Multiple Spellings of a Name