BookmarkSubscribeRSS Feed
UniversitySas
Quartz | Level 8

I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.

 

My primary identifier in both tables is under "Company_Name".

 

The issues I am having can be seen if we place the tables side by side.

Row#Table 1 CompaniesTable 2 Companies
11ST SOURCE CORPORATION1ST SOURCE CORP
23D SYSTEMS3D SYSTEMS CORP
33M COMPANY3M CO
4A.H. BELOA. H. BELO CORP
5A.O. SMITHSMITH (A.O.)
6ACADIA HEALTHCARE COACADIA HEALTHCARE
7ACADIA PHARMACEUTICALS  

If I use the code:

 

PROC SQL;
	CREATE TABLE WANT AS
	SELECT l.*, r.*
	FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
	ON	(l.company_name = r.company_name) and l.year = r.year;
QUIT;

Then the above values will not find a match, even though they should be matched.

 

Since these 5 rows are the most common examples of a match failing in my data, I was wondering if 

there were a way to join tables contingent on matching substrings match? 

 

In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order

In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)

In Row 5, Table 1 is substring of Table 2, but the values are out of order.

 

In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.

 

I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started. 

 

Thanks!

 

3 REPLIES 3
Shmuel
Garnet | Level 18

Next code may give you a preparation to the join :

data table1;
   input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS 
;;;;
run;
data table2;
   input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;

%macro short(text);
   &TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
   &text = tranwrd(&text,'CORPORATION','CORP');
   &text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;

DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;

Check tables TABLE1A TABLE2A;

UniversitySas
Quartz | Level 8
Ah, I see what you are doing.

Could I confirm what the line:
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');

accomplishes?
Shmuel
Garnet | Level 18

Function COMPBL compress the text to a single space between words

Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').

Function Compress with delimiters '()'  - i/e/ both parenthesis - will remove them from the text.

 

The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1427 views
  • 1 like
  • 2 in conversation