BookmarkSubscribeRSS Feed
bikashten
Fluorite | Level 6

Hello,

I have this dataset: I am presenting this table with fname and lname, but both variables should be perfect match if we are going to match it using fuzzy matching because they are from same people with different version.

Obs

fname

lname

1

Morales De Rodriguez

Morales-Rodriguez

2

Morales De Rodriguez

Morales – Rodriguez

3

Morales De Rodriguez

Morales Rodriguez

4

Morales De Rodriguez

MoralesRodriguez

5

Morales De Rodriguez

Morales – De – Rodriguez

6

Morales-Rodriguez

Morales De Rodriguez

7

Morales Rodriguez

MoralesDeRodriguez

 

I am using this code to match it.

data final2;
	set work.final;
	delims = ' ,.!–-';
	fname2= compress(fname, delims);
	lname2 =compress(lname, delims);
	score_compged=compged(fname2, lname2, 'INL');
	score2_complev=complev(fname2, lname2, 'INL');
run;

proc print data=final2;
run;

ods rtf close;

Do you have any better code than this one?

 

Thanks,

Bikash

6 REPLIES 6
Reeza
Super User

but both variables should be perfect match

How are you defining a perfect match? A perfect match does not use fuzzy matching at all or COMPGED.

bikashten
Fluorite | Level 6
Hi Reeza,
fname is child last and lname is parent last name, and I am trying to match both of them in order to find whether the child from same parent or not. In the data base, they are stored in some variations as in the above table, but they should be same. I am just trying to match such a way that they are same. I have more than 10000 records like this. Please, let me know if it does make sense or not.
Reeza
Super User
If you have 10,000 records I'd be doing the matching manually. There's no easy way to solve this. the general way is first do exact matches, remove those. Then do variations on fuzzy matching, first do compged and take highest scores. Remove those matched.Repeat with a different function - SOUNDEX() or COMPLEV() and try again. And rinse and repeat. It's not fun or easy but doable. There's a SAS tool called Link King that may help - it's free.
bikashten
Fluorite | Level 6
That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash
ballardw
Super User

I'll see your thousand records and raise you 12,000. Smiley Wink

 

Having to see if any of roughly 13,000 in one data set may have been in another data system where names are stored in very different forms. One had the first, last, middle names, things like Junior or II in a single field without any fixed order. And some had two last names related to parents.

 

 

Reeza
Super User

@bikashten wrote:
That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash

Agreed, but you can't always program your way out of bad data and it's better to fix this at the sources somehow,using a number to identify companies instead of names is a starter, having a cleaned data base, having a verification step as people enter data. 

 

Trying to clean up the mess afterwards is always more work. 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1497 views
  • 0 likes
  • 3 in conversation