I am reading survey data from a Microsoft Excel spreadsheet into a SAS data set. In this survey, users were allowed to type their supervisor's name manually, so naturally there are several spellings of each name, along with variations in case and punctuation. For example:
Micheal L. Smithi
My question is: is there a way to instruct SAS to recognize these multiple entries as one person as I import the values into the data set? I would like both to standardize the data and to combine all input about one supervisor from multiple employees into one observation.
Why would one assume that these are in fact the same individual at all? I know a Michael Smith and also a Michel Smith -- different individuals. Hopefully you have some other "key" variable value to associate with one person, right?
Yes, I should have included that information in my original post.
This was a pretty small survey and I *can* say, without a doubt, that these are the same individual. In fact, no two supervisors' names are even close to being spelled similarly.
I think that "data _null_" gave you good advice.
Here is a little example that shows you how to implement the matching:
data T01_input;
  length name $20;
  infile cards delimiter=',';
  input id name;
  cards;
1,spiderman
2,superman
3,wonder woman
4,wonder woman
5,felix the cat
6,felix le chat
;
run;
proc sql;
  create table T02_matches as
  select A.id as id1, A.name as name1,
         B.id as id2, B.name as name2,
         spedis(A.name, B.name) as distance
  from T01_input A, T01_input B
  where spedis(A.name, B.name) le 50 and A.id ne B.id;
quit;
It will give the following result:
id1=1 name1=spiderman id2=2 name2=superman distance=27
id1=2 name1=superman id2=1 name2=spiderman distance=25
id1=3 name1=wonder woman id2=4 name2=wonder woman distance=0
id1=4 name1=wonder woman id2=3 name2=wonder woman distance=0
id1=5 name1=felix the cat id2=6 name2=felix le chat distance=19
id1=6 name1=felix le chat id2=5 name2=felix the cat distance=19
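By the way, since SPEDIS compares the characters literally ("Smith" and "smith" are not identical to it), it helps to standardize case and punctuation before comparing. A little sketch of one way to do it (the variable name clean_name is just an example):

```sas
data T01_clean;
  set T01_input;
  /* upcase the name and keep only letters and blanks, so that
     "Micheal L. Smithi" and "micheal l smithi" compare equal */
  clean_name = upcase(compress(name, , 'kas'));
run;
```

You would then run the SPEDIS comparison on clean_name instead of name.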
Pay attention to the following remarks:
- in order to make the comparisons, we are using a "cartesian product". Take a look at the WHERE clause: it evaluates all possible pairs of records and keeps as candidate matches those whose SPEDIS distance is at most 50. Watch out: if your sample is big, this implies a lot of comparisons. For a sample of 1000 records, you will make 1000*1000 = 1 million comparisons. That's a lot. To reduce the number of comparisons, it is common to add a restriction to the WHERE clause. For example, you could compare only records whose addresses have the same zip code. This is called "blocking".
- it is up to you to choose a threshold on the distance (I took 50, but you could use something else). For very large datasets, it makes sense to do some learning: take a subset of pairs, decide by hand whether each one is a correct or a false match, and set the threshold based on that experience. You could even use two thresholds, giving three zones: "sure match", "to be checked", and "non-match". There are algorithms for the learning part, but that's another story.
- watch out for false matches. You'll get some: for example, spiderman is not the same as superman.
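To illustrate the "blocking" idea: if your data contained, say, a department code, you could restrict comparisons to pairs within the same department. A sketch, where the variable dept is purely hypothetical (your survey would need some such variable):

```sas
proc sql;
  create table T02_matches as
  select A.id as id1, A.name as name1,
         B.id as id2, B.name as name2,
         spedis(A.name, B.name) as distance
  from T01_input A, T01_input B
  where A.dept = B.dept                 /* blocking: only compare within a department */
    and spedis(A.name, B.name) le 50
    and A.id ne B.id;
quit;
```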
You could base your matching on words instead of strings.
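For instance, you could pull individual words out of each name with SCAN and compare those. A rough sketch that compares only the last word of each name (a crude "surname" key; the threshold 25 is just an illustration):

```sas
data T03_words;
  set T01_input;
  /* scan with a negative position counts words from the right,
     so this picks up the last word of the name */
  last_word = scan(name, -1, ' ');
run;

proc sql;
  create table T04_word_matches as
  select A.id as id1, A.name as name1,
         B.id as id2, B.name as name2,
         spedis(A.last_word, B.last_word) as distance
  from T03_words A, T03_words B
  where spedis(A.last_word, B.last_word) le 25 and A.id ne B.id;
quit;
```

This is more forgiving when people add or drop middle initials, since those no longer affect the comparison.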