topic Re: fuzzy matching in SAS Programming

fuzzy matching

bikashten — Thu, 13 Jun 2019 13:44:27 GMT

Hello,

I have this dataset: I am presenting this table with fname and lname, but both variables should be perfect match if we are going to match it using fuzzy matching because they are from same people with different version.

Obs	fname	lname
1	Morales De Rodriguez	Morales-Rodriguez
2	Morales De Rodriguez	Morales – Rodriguez
3	Morales De Rodriguez	Morales Rodriguez
4	Morales De Rodriguez	MoralesRodriguez
5	Morales De Rodriguez	Morales – De – Rodriguez
6	Morales-Rodriguez	Morales De Rodriguez
7	Morales Rodriguez	MoralesDeRodriguez

I am using this code to match it.

data final2;
	set work.final;
	delims = ' ,.!–-';
	fname2= compress(fname, delims);
	lname2 =compress(lname, delims);
	score_compged=compged(fname2, lname2, 'INL');
	score2_complev=complev(fname2, lname2, 'INL');
run;

proc print data=final2;
run;

ods rtf close;

Do you have any better code than this one?

Thanks,

Bikash

Re: fuzzy matching

Reeza — Thu, 13 Jun 2019 15:09:18 GMT

but both variables should be perfect match

How are you defining a perfect match? A perfect match does not use fuzzy matching at all or COMPGED.

Re: fuzzy matching

bikashten — Thu, 13 Jun 2019 16:23:42 GMT

Hi Reeza,
fname is child last and lname is parent last name, and I am trying to match both of them in order to find whether the child from same parent or not. In the data base, they are stored in some variations as in the above table, but they should be same. I am just trying to match such a way that they are same. I have more than 10000 records like this. Please, let me know if it does make sense or not.

Re: fuzzy matching

Reeza — Thu, 13 Jun 2019 16:40:11 GMT

If you have 10,000 records I'd be doing the matching manually. There's no easy way to solve this. the general way is first do exact matches, remove those. Then do variations on fuzzy matching, first do compged and take highest scores. Remove those matched.Repeat with a different function - SOUNDEX() or COMPLEV() and try again. And rinse and repeat. It's not fun or easy but doable. There's a SAS tool called Link King that may help - it's free.

Re: fuzzy matching

bikashten — Thu, 13 Jun 2019 17:25:12 GMT

That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash

Re: fuzzy matching

ballardw — Fri, 14 Jun 2019 22:11:37 GMT

I'll see your thousand records and raise you 12,000.

Having to see if any of roughly 13,000 in one data set may have been in another data system where names are stored in very different forms. One had the first, last, middle names, things like Junior or II in a single field without any fixed order. And some had two last names related to parents.

Re: fuzzy matching

Reeza — Fri, 14 Jun 2019 23:00:14 GMT

@bikashten wrote:
That's I am doing right now, but I just posted it if there are any alternative better way to do it. It's not fun to do it manually over thousand records. Thanks, Bikash

Agreed, but you can't always program your way out of bad data and it's better to fix this at the sources somehow,using a number to identify companies instead of names is a starter, having a cleaned data base, having a verification step as people enter data.

Trying to clean up the mess afterwards is always more work.