BookmarkSubscribeRSS Feed
France
Quartz | Level 8

dear all,

 

how to merge two tables based on the first character of the 'company name' variables.

bu using the 2 below tables as an example,

table 1

TypeNAMECOMPANY NAMECOMPANY NAME_1COMPANY NAME - SHORTPREVIOUS NAME
917574ABBOT GROUP DEAD - 07/03/08ABBOT GROUP PLCABBOT GROUP PLC.ABBOT GROUP PLCUNIGROUP
991144ABBOTT LABS.GBP (LON) DEAD - DEADABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABORATORIESABOTT LABS.GBP (LON)
671863ABBOTT LABS.USD (LON) DEAD - 21/10/16ABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABS. (XSQ)
926020ABERCOM GROUP (LON)ABERCOM GROUP LTD.ABERCOM GROUPABERCOM GROUP LTD 
911147ABERDEEN STEAK HOUSES DEAD - 03/01/03 ABDN.STEAK HSES.GP.PLC.  
902737ABERDEEN TRUST ABERDEEN ASSET MAN.PLC.  
507161ABDN.TST.WTS.A EXPIRY 30/09/99 ABERDEEN ASSET MAN.PLC.  
961337ABERFORTH SMALLER COS.C ABERFORTH SMCOS.IT.PLC.  
928988ABERFORTH SMCOS.C ABERFORTH SMCOS.IT.PLC.  
960130ABERFORTH SPL.C SHARE DELISTED 27/07/94 ABERFORTH  
15295CABERTIS (IRS)ABERTIS INFRAESTRUCTURASABERTIS INFSTS.SAABERTIS INFRAESTACESA INTRAESTRUCTU(IRS)
29253EABERTIS (IRS) BONUS RIGHTS 03/08/04ABERTIS INFRAESTRUCTURASABERTIS INFSTS.SAABERTIS INFRAESTACESA INTRAESTRUCTU(IRS)
256273ABGENIX (IRS) DEAD - 12/11/07ABGENIX, INC.ABGENIX INCO.ABGENIX, INC. 

 

 

table 2

codepsn_name
1ABBOTT CCCC
2ABERDEEN ww
3ACESA des
4APPLE 
5ABERFORTH ftrr
6ABDN dggg

logic:

1splite each observation based on the 'blank' ',' '.' etc. in both tables.

2merge 'psn_name' in table 2 with each company name variable in table 1(i.e., NAME,  COMPANY_NAME, COMPANY_NAME_1,COMPANY_NAME_SHORT,PREVIOUS_NAME) if 1)both of the characters is the first character of the 'company name' variable,

and 2) both of the characters are same.

if the psn_name can be matched with any of the 'company name' variables, it can be recorded with the entire row.

3 record the matched table,  non_matched table 1 and non_matched table 2 separately.

 

 

so, for the 

codepsn_name
1ABBOTT CCCC

 

the first character of psn_name variable is 'ABBOTT'. 

so it can match with 

991144ABBOTT LABS.GBP (LON) DEAD - DEADABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABORATORIESABOTT LABS.GBP (LON)
671863ABBOTT LABS.USD (LON) DEAD - 21/10/16ABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABORATORIESABBOTT LABS. (XSQ)

 

as the first character of the variable is 'ABBOTT' as well

 

the second example,

6ABDN dggg

so it can match with 

911147ABERDEEN STEAK HOUSES DEAD - 03/01/03 ABDN.STEAK HSES.GP.PLC.  

 as the 'ABDN' is as same as the 'ABDN' in 'ABDN.STEAK HSES.GP.PLC.'

 

 

Could you please give me some suggestions about this? the sample has been added to the attachment.

thanks in advance.

 

7 REPLIES 7
Reeza
Super User
See FriedEggs solution here. I’ve used that a few times, usually get a 70% match rate.
https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780
France
Quartz | Level 8
thanks Reeza, I will try it later(as my SAS is running the other program) besides, could you give me some suggestion if I really expect to match company name based on the first character of the company name? As my supervisor is asking me to match part of data handly. and I hope to improve the process by matching the first character through code
Reeza
Super User

It'll reduce the number of errors. I think I listed out the steps previously you need to do when doing a comparison like this:

 

1. Remove all exact matches

2. Do a fuzzy match - remove those

3. etc......

 

 

If you're at a University though you may qualify for a cheaper license or educational license for SAS Data Management Studio or Dataflux which can do the matching in a more robust manner. I didn't have great luck, but it depends on your data.

France
Quartz | Level 8

Dear Reeza,

 

thanks for your message.

is it the advice? 

This is a non-trivial problem because you end up doing N*M calculations which can be quite big. 

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

COMPGED works best in my experience

https://communities.sas.com/t5/General-SAS-Programming/Matching-Company-Names/td-p/439674

 

however, I confuse about several steps.

1.'Change all & or other symbols such as @ from text'  how to change the symbols? should I remove all the symbols? or replace them with the same symbol? 

2.'Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited' how to standardize these words? Is there any code to do this step? particularly, will you remove the 'Co and/or LTD' words in a dataset, if few 'Co and/or LTD' words are recorded in the database?

ballardw
Super User

@France wrote:

dear all,

 

how to merge two tables based on the first character of the 'company name' variables.

bu using the 2 below tables as an example,

 

as the first character of the variable is 'ABBOTT' as well

 

the second example,

6 ABDN dggg

so it can match with 

911147 ABERDEEN STEAK HOUSES DEAD - 03/01/03   ABDN.STEAK HSES.GP.PLC.    

 as the 'ABDN' is as same as the 'ABDN' in 'ABDN.STEAK HSES.GP.PLC.'

 

 

Could you please give me some suggestions about this? the sample has been added to the attachment.

thanks in advance.

 


You mean first word not character.

 

SCAN function will identify the first word of a string.

 

Do you have UNIGROUP in table 2? If so what would the output look like?

 

Is the program supposed to work for "ABOTT LABS.GBP (LON)" when it appears the match you are requesting would not work because ABOTT does not match the obvious ABBOTT in table 2?

 

Is the a preferred order of comparison between the 5 different name variables?

Oligolas
Barite | Level 11

Hi,

 

you could use the LIKE SQL Operator but if you're in a professional environment or what you do can have a financial impact, I would rather recommend you to select distinctively the company names from table1 and 2, match them by hand in a separate dataset and afterwards to link all 3 tables together. From my point of view it would be much faster, reliable and less challenging than trying to identify afterwards which matches were inaccurate and why.

________________________

- Cheers -

France
Quartz | Level 8
Dear ballardw,

thank you for your advice. what is the meaning of 'UNIGROUP' could you please explain it for me?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1369 views
  • 4 likes
  • 4 in conversation