Help using Base SAS procedures

Multiple Spellings of a Name

Reply
N/A
Posts: 0

Multiple Spellings of a Name

Hello All,

I am reading survey data into a SAS Data Set from a Microsoft Excel spreadsheet. In this survey, users were allowed to manually enter their supervisors name, so naturally there are 'several' spellings of each name and also a variety in case/punctuation when entering the names.

Example:

Michael Smith
MICHAEL SMITH
Micheal L. Smithi

My question is: Is there a way for me to instruct SAS to recognize these multiple entries as one person as I import these values into the Data Set? Obviously, I would like to both standardize the data and combine all 'input' from multiple employees into one observation.

Thanks!
Super Contributor
Super Contributor
Posts: 3,174

Re: Multiple Spellings of a Name

Posted in reply to deleted_user
Why would one assume that these are in fact the same individual at all? I know a Michael Smith and also a Michel Smith -- different individuals. Hopefully you have some other "key" variable value to associate with one person, right?

Scott Barry
SBBWorks, Inc.
N/A
Posts: 0

Re: Multiple Spellings of a Name

Yes, I should have included that information in my original post.

This was a pretty small survey and I *can* say, without a doubt, that these are the same individual. In fact, there aren't even any names of supervisors that are even 'close' to being spelled in a similar way.

thanks Scott!
Respected Advisor
Posts: 3,799

Re: Multiple Spellings of a Name

Posted in reply to deleted_user
Look at functions SPEDIS SOUNDEX COMPGED and COMPLEV.

Also, "sounds like" operator. Which I think uses SOUNDEX.

Hopefully one of these will be adequate.

If you don't have a lot you could code them with a format or informat.
N/A
Posts: 0

Re: Multiple Spellings of a Name

Posted in reply to deleted_user
Hello takheeljeff,

I think that "data _null_" gave you a good advice.

Here is a little example that shows you how to implement the matching:

data T01_input;
length name $20;
infile cards delimiter=',';
input id name;
cards;
1,spiderman
2,superman
3,wonder woman
4,wonder woman
5,felix the cat
6,felix le chat
7,mickey mouse
8,samsam
;
run;

proc sql;
create table T02_matches as
select A.id as id1, A.name as name1, B.id as id2, B.name as name2,spedis(A.name,B.name) as distance
from T01_input A, T01_input B
where spedis(A.name,B.name) le 50 and A.id ne B.id;
quit;

It will give the following result:
id1=1 name1=spiderman id2=2 name2=superman distance=27
id1=2 name1=superman id2=1 name2=spiderman distance=25
id1=3 name1=wonder woman id2=4 name2=wonder woman distance=0
id1=4 name1=wonder woman id2=3 name2=wonder woman distance=0
id1=5 name1=felix the cat id2=6 name2=felix le chat distance=19
id1=6 name1=felix le chat id2=5 name2=felix the cat distance=19

Pay attention to the following remarks:
- in order to make the comparisons, we are using a "cartesian product". Take a look at the where clause: it will evaluate all possible pairs of records and will consider as valid matches those that have a spedis distance less than 50. Watch out: if your sample is big, this can imply a lot of comparisons. For example, for a sample containing 1000 records, you will make 1000*1000 comparisons, ie, 1Million comparisons. Tha'ts a lot. In order to reduce the number of comparisons, it is common to "add something" to the where clause. For example, you could decide to restrict comparisons for records whose addresses have the same zip codes. This is called "blocking".
- it is up to you to put a threshold on the distance (I took 50 ... but you could use something else). For very large datasets, it makes sense to do "some learning", ie, you take a subset of comparisons and you decide whether each of these comparisons are correct or false matches; based on that experience, you set the threshold. You could even have two threshold: 1 for "I'm sure", 1 for "To be checked" and 1 for "incorrect match". There are algorithms for the learning part ... but that's another story.
- watch out for false matches. You'll get some. For example: spiderman is not the same as superman.

You could base your matching on words instead of strings.

I hope this will be helpful to you,

Yoba Message was edited by: yoba
Regular Contributor
Posts: 229

Re: Multiple Spellings of a Name

Posted in reply to deleted_user
u can check it in SAS Data flux u can do the name standardization so that i hope it will help u
Ask a Question
Discussion stats
  • 5 replies
  • 149 views
  • 0 likes
  • 4 in conversation