BookmarkSubscribeRSS Feed
bikashten
Fluorite | Level 6

Hello,

I have this dataset: I am presenting this table with fname and lname, but both variables should be perfect match if we are going to match it using fuzzy matching because they are from same people with different version.

Obs

fname

lname

1

Morales De Rodriguez

Morales-Rodriguez

2

Morales De Rodriguez

Morales – Rodriguez

3

Morales De Rodriguez

Morales Rodriguez

4

Morales De Rodriguez

MoralesRodriguez

5

Morales De Rodriguez

Morales – De – Rodriguez

6

Morales-Rodriguez

Morales De Rodriguez

7

Morales Rodriguez

MoralesDeRodriguez

 

I am using this code to match it.

data final2;
	set work.final;
	delims = ' ,.!–-';
	fname2= compress(fname, delims);
	lname2 =compress(lname, delims);
	score_compged=compged(fname2, lname2, 'INL');
	score2_complev=complev(fname2, lname2, 'INL');
run;

proc print data=final2;
run;

ods rtf close;

Do you have any better code than this one?

 

Thanks,

Bikash

6 REPLIES 6
Reeza
Super User

but both variables should be perfect match

How are you defining a perfect match? A perfect match does not use fuzzy matching at all or COMPGED.

bikashten
Fluorite | Level 6
Hi Reeza,
fname is child last and lname is parent last name, and I am trying to match both of them in order to find whether the child from same parent or not. In the data base, they are stored in some variations as in the above table, but they should be same. I am just trying to match such a way that they are same. I have more than 10000 records like this. Please, let me know if it does make sense or not.
Reeza
Super User
If you have 10,000 records I'd be doing the matching manually. There's no easy way to solve this. the general way is first do exact matches, remove those. Then do variations on fuzzy matching, first do compged and take highest scores. Remove those matched.Repeat with a different function - SOUNDEX() or COMPLEV() and try again. And rinse and repeat. It's not fun or easy but doable. There's a SAS tool called Link King that may help - it's free.
bikashten
Fluorite | Level 6
That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash
ballardw
Super User

I'll see your thousand records and raise you 12,000. Smiley Wink

 

Having to see if any of roughly 13,000 in one data set may have been in another data system where names are stored in very different forms. One had the first, last, middle names, things like Junior or II in a single field without any fixed order. And some had two last names related to parents.

 

 

Reeza
Super User

@bikashten wrote:
That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash

Agreed, but you can't always program your way out of bad data and it's better to fix this at the sources somehow,using a number to identify companies instead of names is a starter, having a cleaned data base, having a verification step as people enter data. 

 

Trying to clean up the mess afterwards is always more work. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 596 views
  • 0 likes
  • 3 in conversation