Re: Find closet match of string

Ronein · Posted 09-02-2020 12:12 PM

Hello

I have a complicated problem and I need your advice please.

In one data set (date set A) I have one column with names of companies from my sources.

In another data set (data set B) I have two columns :name of company and VAT number.

Source of data set B is external source from central bank information.

The target is to add VAT number from data set B to each company in data set A .

In normal situation a quick left join will give me the solution.

Unfortunately we are not living in ideal world and the names of companies in data set B are not matching to the names of companies in data set A.

The difference is due to the fact that same companies names are written little differently by users.

The target is to find the closet matching.

It means that the target is to find for each company name in data set A ,the best matching name in data set B and then give the VAT number of this company.

I thought about some things to do in order to do it:

For example:

1-Perform Cartesian merge of data set A and B.

2-Check which letters appear in company name from data set A.

Then for each company from data set B ,check how many letters exists in name from data set B

(The order of letters in this check is not matter)

3-Check how many full words from company in data set A matched to words in data set B

4-Check how many letters (in order) from company name in data set A are matched to company name in data set B.

What will be your attitude to this problem please?

I also would like to see some code ideas to deal with this problem.

Thanks

Koe

Data A;
Input Name;
cards;
The German Hi tec intc
El Bar inc
Diamon Ibraim llt
;
run;

Data B;
Input Name VAT;
cards;
El Bar 666
German Hitecintc 999
The Diamon Ibraiim LLT 777
;
run;

Reeza · Posted 09-02-2020 01:13 PM

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780
This post has some great suggestions and code examples.

Ronein · Posted 09-02-2020 04:29 PM

and maybe we can discuss a bit about the methodology and the code?

Send a link is nice but maybe it will be more interesting and useful to explain it and understand it...

I see that there is using of compged function?

I see that this function determine the dissimilarity between two strings.

Can you explain please how is it working?

should we do Cartesian merge before?

In my case both data sets has around 15000 rows.

What will be the outcome data set after Cartesian merge of data set A and data set B?

ballardw · Posted 09-02-2020 04:53 PM

The documentation of COMPGED and SPEDIS, another difference in spelling functions, describe the rules used to calculate the distance.

A cartesian join is often used with these things because you never know which actual values you need to compare.

Suggestion: find the exact matches, if any, and remove them from the problem first.

Depending on which rule/function you use only keep values of the compged/ complev/ spedis function below a cutoff value. What the value might be depends on just how much your data is mangled.

If there are other values that are related you might want to use them. You really haven't provide much detail. Such as if you have something like an address value such as City or postal code, only consider exact matches on the other variable(s).

I have one of these where I need to match children's names from several different data systems. I start with match on UPCASEd versions of first and last name, gender and date of birth. If there is a match that gets marked and a reduced set is used for the next versions where I use the spelling distance on first and last names if gender and date of birth match, mark identified matches manually where the compged value is less than a value. Then match on gender only and spelling distance of first and last name mark and remove, then match on dob only with spelling distance of the names, after that then just the spelling distance of the names. Why multiple levels: because people can't enter things correctly. The extra variables make each level of comparison a smaller set so there are fewer to manually review. And why manually review? People are one of the best pattern recognizers around.

Or may the SAS Text mining can help if you have a license.

Find closet match of string