Solved: Combine Datasets using Inexact Character Variables in SAS

mspak · Posted 04-26-2012 08:37 AM

Dear all,

I downloaded data from 2 different databases with different their identification codes. As such, it is impossible to match cases by an ID. Therefore, the only way is to match the cases by using their company names which is an inexact character variable.

I have 2 different datasets:

A) uw_match, with the following variables:

- underwriters_names;

- holding_company;

- ipo_date;

- others.....

B) maluw, with the following variables;

- name (label as bank name)

- bs_id_number (bankscope identification code)

- closdate (Company Fiscal Year End)

- others....

I wish to match-merge these two dataset by the following creteria:

STEP 1: the underwriters (either underwriters_names or their holding_company) in dataset A compared to bank name in dataset B;

note: I can either match-merge the underwriters_names (in A) with the bank name (in B) or holding_company (in A) with bank name (in B), the pairs with the higher matching accuracy level will be output/used; AND

STEP 2: the closest closdate (in B) with ipo_date (in A).

In short, I wish to match all the variables in A and B, by

1. the bank names (in B) = IPO underwriters (in A; either underwriters_names or their holding_company;which can provide the highest precision level)

2. in the similar period (ie. the closest fiscal year end of the banks with the IPO date).

I read an article (see the pdf attached), and I understand that it is possible with SAS. But I feel little knowledge on how to apply the examples into my context here.

Any comment and advise is much appreciated.

Thank you.

Regards,

mspak

MikeZdeb · Posted 04-26-2012 10:59 AM

hi ... before you start any of the above, since step #1 relies on matching by literals, have you looked at the names in both files and determined if there are thigs you should do before you even start ... for example ...

#1 in holding_company, I see ...

O.S.K. HOLDINGS BERHAD

OSK HOLDINGS BERHAD

are they the same company and should you get rid of those periods

#2 there's a mix of lower and upper case letters ... should you convert to all uppercase

#3 most (90+%) of all the name variables you cite have "BHD" or "BERHAD" as part of the name ... if you are going to look for similarity in names you don't want the fact that the "BHD' or "BERHAD" part of the match contributing anything to a score given to a name comparison

#4 sometimes a location is in parentheses (MALAYSIA) and sometimes it's not MALAYSIA

just using PROC FREQ on the various name variables would give you some idea as to how to fix up the names before you even try to match names

for example, clean up the names and make some new variables to hold those names ...

data new_maluw;
set z.maluw;

* add a record number for later use;

mnrec+1;

* convert to uppercase, only keep numbers/letters/spaces, convert multiple spaces to one space;
nm = compbl(compress(upcase(name),,'kdas'));

* get rid of BHD and BERHAD;
nm = tranwrd(nm,' BHD','');
nm = tranwrd(nm,' BERHAD','');
run;

data new_uw_match;
set z.uw_match;

unrec+1;
nmh = compbl(compress(upcase(holding_company),,'kdas'));
nmh = tranwrd(nmh,' BHD','');
nmh = tranwrd(nmh,' BERHAD','');

nmu = compbl(compress(upcase(underwriters_names),,'kdas'));
nmu = tranwrd(nmu,' BHD','');
nmu = tranwrd(nmu,' BERHAD','');
run;

then run PROC FREQ again on the new variables (nm, nmh, and nmu) and see if there are any other things you should do before you start to match the nm in one file to nmh and nmu in another

once you have done the above, here's a suggestion for a start ... haven't used COMPGED much (maybe other folk know about a "good score" level)

I usually do this stuf in stages, evaluating the success of each step (e.g. the name match) before I move onto the next ...

* use SQL to match the files by a comparison of names, use the COMPGED function to compare names;

* you don't have to use all the data since you have pointers (mnrec and unrec);

* nm_nmh and nm_nmu are matching scores;

proc sql;
create table both as
select mnrec, unrec, compged(nm, nmh) as nm_nmh, compged(nm, nmu) as nm_nmu
from new_maluw, new_uw_match
having nm_nmh lt 50 or nm_nmu lt 50;
quit;

* reconstruct the data using the pointers;

* maybe you only add the dates and other vars you need for more work at this point;

data both;
set both;
p1=mnrec;
p2=unrec;
set new_maluw (keep=nm closdate) point=p1;
set new_uw_match (keep=nmh nmu ipo_date) point=p2;
run;

etc ...

View solution in original post

MikeZdeb · Posted 04-26-2012 10:59 AM

hi ... before you start any of the above, since step #1 relies on matching by literals, have you looked at the names in both files and determined if there are thigs you should do before you even start ... for example ...

#1 in holding_company, I see ...

O.S.K. HOLDINGS BERHAD

OSK HOLDINGS BERHAD

are they the same company and should you get rid of those periods

#2 there's a mix of lower and upper case letters ... should you convert to all uppercase

#3 most (90+%) of all the name variables you cite have "BHD" or "BERHAD" as part of the name ... if you are going to look for similarity in names you don't want the fact that the "BHD' or "BERHAD" part of the match contributing anything to a score given to a name comparison

#4 sometimes a location is in parentheses (MALAYSIA) and sometimes it's not MALAYSIA

just using PROC FREQ on the various name variables would give you some idea as to how to fix up the names before you even try to match names

for example, clean up the names and make some new variables to hold those names ...

data new_maluw;
set z.maluw;

* add a record number for later use;

mnrec+1;

* convert to uppercase, only keep numbers/letters/spaces, convert multiple spaces to one space;
nm = compbl(compress(upcase(name),,'kdas'));

* get rid of BHD and BERHAD;
nm = tranwrd(nm,' BHD','');
nm = tranwrd(nm,' BERHAD','');
run;

data new_uw_match;
set z.uw_match;

unrec+1;
nmh = compbl(compress(upcase(holding_company),,'kdas'));
nmh = tranwrd(nmh,' BHD','');
nmh = tranwrd(nmh,' BERHAD','');

nmu = compbl(compress(upcase(underwriters_names),,'kdas'));
nmu = tranwrd(nmu,' BHD','');
nmu = tranwrd(nmu,' BERHAD','');
run;

then run PROC FREQ again on the new variables (nm, nmh, and nmu) and see if there are any other things you should do before you start to match the nm in one file to nmh and nmu in another

once you have done the above, here's a suggestion for a start ... haven't used COMPGED much (maybe other folk know about a "good score" level)

I usually do this stuf in stages, evaluating the success of each step (e.g. the name match) before I move onto the next ...

* use SQL to match the files by a comparison of names, use the COMPGED function to compare names;

* you don't have to use all the data since you have pointers (mnrec and unrec);

* nm_nmh and nm_nmu are matching scores;

proc sql;
create table both as
select mnrec, unrec, compged(nm, nmh) as nm_nmh, compged(nm, nmu) as nm_nmu
from new_maluw, new_uw_match
having nm_nmh lt 50 or nm_nmu lt 50;
quit;

* reconstruct the data using the pointers;

* maybe you only add the dates and other vars you need for more work at this point;

data both;
set both;
p1=mnrec;
p2=unrec;
set new_maluw (keep=nm closdate) point=p1;
set new_uw_match (keep=nmh nmu ipo_date) point=p2;
run;

etc ...

mspak · Posted 04-27-2012 03:07 AM

Hi MikeZdeb,

Your codes are extremely useful for me. I applied the codes suggested, I can say the result is excellent.

In BTW, I found a good article on COMPGED Function (see the Pdf).

Thank you very much for your helps.

Regards,

mspak

PGStats · Posted 04-26-2012 11:09 AM

Hello MSPak, take a look at SAS string distance functions : SPEDIS, COMPGED and COMPLEV. They seem far more sophisticated than the function described in the joined article.

PG

MikeZdeb · Posted 04-26-2012 11:25 AM

hi .. I had the same reaction and wondered how there could be an SGF 2012 paper on inexact matching that did not even reference the functions you mentioned (so someone like MSPak could read a new paper and not even be made aware of the functions)

GTickner · Posted 04-26-2012 12:04 PM

For your case (with 1,560*940=1,466,400 comparisons) the match functions would be fine, although it's still a good idea to translate common abbreviations (CO, CORP, LTD, INC, DIST, DIV, states such as BHD) and eliminate punctuation before doing an outer join. When I match company names from much larger databases, an outer join is not practical. In that case, you need to transform both sides and join on the transformed fields. I have a rather old program that does a modified soundex transformation to each name from both datasets before matching. Not particularly sophisticated but, having developed the transformations over a period of time, it performs well.

Ksharp · Posted 04-26-2012 11:23 PM

For your situation, it is very very hard .

SAS has a product named DATA FLUX can solve this problem.

As my opinion, I will maintain a table which contain all of these similar words to identify them .

And of course these spell distance function mentioned by PGStat .

The following code just for fun.

libname x v9 'D:\Software';
data uw_match;
 set x.uw_match;
 _name=holding_company;output;
 _name=underwriters_names;output;
 keep _name;
run;
proc sql;
create table x as
 select name,_name
  from x.maluw,uw_match
   where name =* _name ;quit;

Ksharp

mspak · Posted 04-27-2012 03:01 AM

Hi Ksharp,

What is the meaning of this code "where name =* _name" ?

Thanks.

Regards,

mspak

Ksharp · Posted 04-27-2012 03:45 AM

It is SOUND LIKE .

mspak · Posted 04-27-2012 04:02 AM

Thank you very much. Today, I learned a lot from all of you.

"There is no royal road to learning, learning SAS with this discussion forum is extremely useful and fun".

Hope everyone enjoy your weekend :smileylaugh:

Regards,

mspak

Patrick · Posted 04-27-2012 05:22 AM

Just to add to Ksharp's mentioning of DataFlux:

If you have the SAS Data Quality Server licensed at your site then you have have DataFlux components available which you can use within a data step.

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1dczp3khf9...

Interesting for you would be to create machcodes. Similar strings will collaps into a single matchcode which you then can use to join your tables.

The way DataFlux creates these matchcodes is far superior over functions like soundex() as it doesn't only use a rule-based approach but also lists (eg. I would assume all major companies together with a vast list of variants inclusive typos are in such a list - so "someone" does clusteranalysis here and there are regular updates to the "database").

mspak · Posted 04-27-2012 09:00 AM

Hi Patrick and Ksharp,

Thank you for letting me know about this special features of DataFlux. I am not sure whether the University has subscribed for the SAS Data Quality Server. I will check with the local SAS representative on this matter. Perhaps, I should submit a proposal for the subscription of DataFlux if it is not accessible by staff members here.

Regards,

mspak

stevennair · Posted 11-20-2013 04:12 AM

Hi Patrick,

I am using SAS Dataflux.

I have to agree that Dataflux is indeed a great software.

However, as you mentioned, Dataflux collapses similar strings into a single matchcode which I then can use it to join my table.

Not sure you are referring to Match Codes under the Entity Resolution node.

If yes, then I have a question.

As you will find in my attachment, the final output (All Matches and All_Non_Match) files have lesser total number of firms than my original input files (Text file input 1 and Text file input 2).

In other words, if the total number of firms in original input files (Text file input 1 and Text file input 2) consist of 1000 firms, then logically, the final output (All Matches and All_Non_Match) files should also consist of 1000 right?

In my case, I lost about 148 firms and upon checking manually, I realize that companies such ABC 1995, ABC 1996, ABC 1997 (which may refer to the similar or not similar companies) have been merged into one, which is ABC.

Can I customize the Match Codes function so that ABC 1995, ABC 1996, ABC 1997 are considered as different companies?

My sincere advance thank you.

Warm regards,

Steven

Patrick · Posted 11-20-2013 04:47 AM

Hi Steven

You should normally start a new post in a situation like here and then eventually reference the old post which is related to your question.

To answer what you asked for:

It's quite a while that I had my DataFlux training and I didn't have a lot of opportunity to actually use it :-). I was using match codes in the past quite a bit though (using dq... SAS functions).

I had a look at your flow and can't give you the full answer but only a few thoughts:

- The Data Joining node follows the same logic than a SQL join so in order to no loose data you would need a OUTER JOIN

- If you're already standardizing your company names then you don't have to create match codes and use the match codes for joining. Use the standardized company names instead.

- " ABC 1995, ABC 1996, ABC 1997": That such variations of the same names get clustered and resolve to the same standardized name is exactly what you want. You can play with the sensitivity which will affect what will get clustered together (the same is also true for match codes).

- Yes, you can modify the clustering. That's what the role "Data Stewart" is supposed to do. Data Quality is an ongoing process. In the heart of things are the QKB's (Quality Knowledge Base). There is - I believe - a quarterly update and the QKB's are also local/language specific.

HTH

Patrick

stevennair · Posted 11-20-2013 05:40 AM

Hi Patrick,

I am sorry for that since I am a newbie here. learnt something now :smileyblush:

Thank you for the reply, I really appreciate it and find it VERY useful.

Just to double check with you:

-When you mentioned that "The Data Joining node follows the same logic than a SQL join so in order to no loose data you would need a OUTER JOIN", you are just referring that i need the data joining node to join data from 2 different files right?

-If I have already standardized the company names, then I could join them even without match code OR I can join the company names by only using match code without standardization?

By the way, in your opinion, do you think using SAS function (dq...) is better than Dataflux?

Once again, my sincere thank you.

Warm regards,

Steven

Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Re: Combine Datasets using Inexact Character Variables in SAS

Registration is open

SAS Training: Just a Click Away