BookmarkSubscribeRSS Feed
UniversitySas
Quartz | Level 8

I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.

 

My primary identifier in both tables is under "Company_Name".

 

The issues I am having can be seen if we place the tables side by side.

Row#Table 1 CompaniesTable 2 Companies
11ST SOURCE CORPORATION1ST SOURCE CORP
23D SYSTEMS3D SYSTEMS CORP
33M COMPANY3M CO
4A.H. BELOA. H. BELO CORP
5A.O. SMITHSMITH (A.O.)
6ACADIA HEALTHCARE COACADIA HEALTHCARE
7ACADIA PHARMACEUTICALS  

If I use the code:

 

PROC SQL;
	CREATE TABLE WANT AS
	SELECT l.*, r.*
	FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
	ON	(l.company_name = r.company_name) and l.year = r.year;
QUIT;

Then the above values will not find a match, even though they should be matched.

 

Since these 5 rows are the most common examples of a match failing in my data, I was wondering if 

there were a way to join tables contingent on matching substrings match? 

 

In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order

In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)

In Row 5, Table 1 is substring of Table 2, but the values are out of order.

 

In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.

 

I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started. 

 

Thanks!

 

3 REPLIES 3
Shmuel
Garnet | Level 18

Next code may give you a preparation to the join :

data table1;
   input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS 
;;;;
run;
data table2;
   input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;

%macro short(text);
   &TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
   &text = tranwrd(&text,'CORPORATION','CORP');
   &text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;

DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;

Check tables TABLE1A TABLE2A;

UniversitySas
Quartz | Level 8
Ah, I see what you are doing.

Could I confirm what the line:
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');

accomplishes?
Shmuel
Garnet | Level 18

Function COMPBL compress the text to a single space between words

Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').

Function Compress with delimiters '()'  - i/e/ both parenthesis - will remove them from the text.

 

The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).

SAS INNOVATE 2024

Innovate_SAS_Blue.png

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Get the $99 certification deal.jpg

 

 

Back in the Classroom!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 832 views
  • 1 like
  • 2 in conversation