BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Hello All,

I am reading survey data into a SAS Data Set from a Microsoft Excel spreadsheet. In this survey, users were allowed to manually enter their supervisors name, so naturally there are 'several' spellings of each name and also a variety in case/punctuation when entering the names.

Example:

Michael Smith
MICHAEL SMITH
Micheal L. Smithi

My question is: Is there a way for me to instruct SAS to recognize these multiple entries as one person as I import these values into the Data Set? Obviously, I would like to both standardize the data and combine all 'input' from multiple employees into one observation.

Thanks!
5 REPLIES 5
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Why would one assume that these are in fact the same individual at all? I know a Michael Smith and also a Michel Smith -- different individuals. Hopefully you have some other "key" variable value to associate with one person, right?

Scott Barry
SBBWorks, Inc.
deleted_user
Not applicable
Yes, I should have included that information in my original post.

This was a pretty small survey and I *can* say, without a doubt, that these are the same individual. In fact, there aren't even any names of supervisors that are even 'close' to being spelled in a similar way.

thanks Scott!
data_null__
Jade | Level 19
Look at functions SPEDIS SOUNDEX COMPGED and COMPLEV.

Also, "sounds like" operator. Which I think uses SOUNDEX.

Hopefully one of these will be adequate.

If you don't have a lot you could code them with a format or informat.
deleted_user
Not applicable
Hello takheeljeff,

I think that "data _null_" gave you a good advice.

Here is a little example that shows you how to implement the matching:

data T01_input;
length name $20;
infile cards delimiter=',';
input id name;
cards;
1,spiderman
2,superman
3,wonder woman
4,wonder woman
5,felix the cat
6,felix le chat
7,mickey mouse
8,samsam
;
run;

proc sql;
create table T02_matches as
select A.id as id1, A.name as name1, B.id as id2, B.name as name2,spedis(A.name,B.name) as distance
from T01_input A, T01_input B
where spedis(A.name,B.name) le 50 and A.id ne B.id;
quit;

It will give the following result:
id1=1 name1=spiderman id2=2 name2=superman distance=27
id1=2 name1=superman id2=1 name2=spiderman distance=25
id1=3 name1=wonder woman id2=4 name2=wonder woman distance=0
id1=4 name1=wonder woman id2=3 name2=wonder woman distance=0
id1=5 name1=felix the cat id2=6 name2=felix le chat distance=19
id1=6 name1=felix le chat id2=5 name2=felix the cat distance=19

Pay attention to the following remarks:
- in order to make the comparisons, we are using a "cartesian product". Take a look at the where clause: it will evaluate all possible pairs of records and will consider as valid matches those that have a spedis distance less than 50. Watch out: if your sample is big, this can imply a lot of comparisons. For example, for a sample containing 1000 records, you will make 1000*1000 comparisons, ie, 1Million comparisons. Tha'ts a lot. In order to reduce the number of comparisons, it is common to "add something" to the where clause. For example, you could decide to restrict comparisons for records whose addresses have the same zip codes. This is called "blocking".
- it is up to you to put a threshold on the distance (I took 50 ... but you could use something else). For very large datasets, it makes sense to do "some learning", ie, you take a subset of comparisons and you decide whether each of these comparisons are correct or false matches; based on that experience, you set the threshold. You could even have two threshold: 1 for "I'm sure", 1 for "To be checked" and 1 for "incorrect match". There are algorithms for the learning part ... but that's another story.
- watch out for false matches. You'll get some. For example: spiderman is not the same as superman.

You could base your matching on words instead of strings.

I hope this will be helpful to you,

Yoba Message was edited by: yoba
R_Win
Calcite | Level 5
u can check it in SAS Data flux u can do the name standardization so that i hope it will help u

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 772 views
  • 0 likes
  • 4 in conversation