
Match datasets based on the likelihood of strings

somebody — Sun, 21 Jun 2020 01:04:39 GMT

I have two datasets, A and B, containing company names. B contains the correct names, whereas A contains slightly wrong names. How can I ask SAS to match obs in A with similar obs in B? An example would be:

- match "AGL ENRGY LTD" in A to "AGL ENERGY LTD" in B; or

- match "AMER CAP" in A to "AMERICAN CAPITAL" in B; or

- match "EMPIRE CO" in A to "EMPIRE COMPANY" in B

I have been manually finding abbreviations and changing them to the full form, such as CO to COMPANY or CORP to CORPORATION, but there are still obs with letters missing from the name. One way I can think of is to match all obs in B to each obs in A, then use COMPGED or COMPLEV to get a similarity score and keep the pair with the best score (for these functions, the lowest distance). However, this would create a very large dataset. And how do I match all obs in B to each obs in A?

Re: Match datasets based on the likelihood of strings

ChrisNZ — Sun, 21 Jun 2020 01:50:35 GMT

Functions such as COMPGED answer your needs perfectly, but they are expensive to run.

You are right to clean your data before using them: LTD/LIMITED, etc.

This is an iterative process.

First, try to also match on something else. For example:

where first(NAME1)=first(NAME2) and compged(NAME1,NAME2) < some small value

As you match more and more, you can loosen the criteria on the reduced volume of unmatched names.
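
A minimal sketch of that kind of restricted fuzzy join, assuming the datasets are called A and B and hold the names in variables NAME1 and NAME2. The cutoff of 300 is an arbitrary placeholder for "some small value" and would need tuning on the real data:

```sas
/* Sketch only: A, B, NAME1, NAME2 and the cutoff are assumed, not from the original data. */
proc sql;
  create table CANDIDATES as
  select a.NAME1,
         b.NAME2,
         compged(a.NAME1, b.NAME2) as GED_SCORE   /* lower score = more similar */
  from A as a, B as b
  where first(a.NAME1) = first(b.NAME2)           /* cheap filter: same first letter */
    and calculated GED_SCORE < 300                /* placeholder cutoff, tune as needed */
  order by a.NAME1, GED_SCORE;
quit;

/* Keep only the closest candidate for each name in A. */
data BEST;
  set CANDIDATES;
  by NAME1;
  if first.NAME1;
run;
```

The condition on the first letter is what keeps the volume down; it will miss names whose first character is itself wrong, so those would have to be picked up in a later, looser pass.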

Re: Match datasets based on the likelihood of strings

somebody — Sun, 21 Jun 2020 01:58:51 GMT

I have been matching using the first 3 words in the names, and then 2 and then 1. But if there are some errors in the first word then they don't match.

Do you know how to match all obs in B to each obs in A?

Re: Match datasets based on the likelihood of strings

smantha — Sun, 21 Jun 2020 02:17:53 GMT

You can use COMPGED as Chris suggested, or one of the other functions of this kind, such as SPEDIS. You can create a separate scoring dataset to do the mapping.

Re: Match datasets based on the likelihood of strings

Patrick — Sun, 21 Jun 2020 08:47:49 GMT

On top of what others suggested: do you have SAS Data Quality Server licensed? If so, this would allow you to standardize company names and then join on the standardized names.
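
One way to see which products a site has licensed is PROC SETINIT, which writes the list of licensed products and their expiry dates to the log. The exact wording of the data quality entry in that list may differ between releases, so treat this as a starting point:

```sas
/* Writes the site's licensed products and expiry dates to the SAS log. */
proc setinit;
run;
```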

Re: Match datasets based on the likelihood of strings

ChrisNZ — Sun, 21 Jun 2020 08:44:15 GMT

I am unsure what's unclear in my reply. Sorry.

> Do you know how to match all obs in B to each obs in A?
Unsure what this means either.

Re: Match datasets based on the likelihood of strings

somebody — Mon, 22 Jun 2020 01:08:32 GMT

Do you know how to check if my SAS has the licence?

Re: Match datasets based on the likelihood of strings

somebody — Mon, 22 Jun 2020 01:10:12 GMT

I would like to create a new dataset that has all observations in B for every observation in A. For example, if A has 5 observations and B has 10 observations, then the new merged dataset would have 50 observations. How can I perform this merge?

Re: Match datasets based on the likelihood of strings

ChrisNZ — Mon, 22 Jun 2020 06:20:50 GMT

> if A has 5 observation and B has 10 obs, then the new merged dataset would have 50 observations

In PROC SQL, use: from TABLE1, TABLE2 without a WHERE clause to create such a join.

That's called a Cartesian join. Why would you do that?
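
For reference, a minimal sketch of such a Cartesian join, assuming the datasets are A and B and each holds the company name in a variable called NAME (the variable and output names are placeholders):

```sas
/* Sketch only: NAME, NAME_A, NAME_B and ALL_PAIRS are assumed names. */
proc sql;
  create table ALL_PAIRS as
  select a.NAME as NAME_A,
         b.NAME as NAME_B
  from A as a, B as b;   /* no WHERE clause: every row of A is paired with every row of B */
quit;
```

With 5 observations in A and 10 in B this gives 50 rows, but the row count is the product of the two table sizes, so it grows very quickly on real data.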

Your current method is correct:

1. Standardise the data

2. Join on increasingly looser criteria. Only try to match the unmatched data.
- Straight equality

- Almost equal (this can be many steps)

- Not quite the same (this can be many steps)

- Quite different (this can be many steps)

Keep track of what criterion was used when you achieve a match.
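
The first two passes of that approach could look roughly like this. It is a sketch under assumptions: A and B hold already standardised names in NAME1 and NAME2, and the COMPGED limit of 100 is a placeholder for "almost equal":

```sas
/* Sketch only: dataset names, variable names and the cutoff are assumed. */
proc sql;
  /* Pass 1: straight equality. */
  create table MATCHED as
  select a.NAME1, b.NAME2, 'exact' as CRITERION length=12
  from A as a inner join B as b
    on a.NAME1 = b.NAME2;

  /* Names in A that pass 1 did not match. */
  create table UNMATCHED as
  select NAME1
  from A
  where NAME1 not in (select NAME1 from MATCHED);

  /* Pass 2: almost equal, tried only on the unmatched names. */
  create table PASS2 as
  select u.NAME1, b.NAME2, 'ged<=100' as CRITERION length=12
  from UNMATCHED as u, B as b
  where first(u.NAME1) = first(b.NAME2)
    and compged(u.NAME1, b.NAME2) <= 100;
quit;

/* Add the pass 2 matches, keeping the criterion that produced each one. */
proc append base=MATCHED data=PASS2;
run;
```

Later passes would repeat the UNMATCHED and PASS2 steps with looser conditions. A pass can return more than one candidate per name, so the results still need reviewing or reducing to the closest candidate.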