BookmarkSubscribeRSS Feed
UniversitySas
Quartz | Level 8

I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.

 

My primary identifier in both tables is under "Company_Name".

 

The issues I am having can be seen if we place the tables side by side.

Row#Table 1 CompaniesTable 2 Companies
11ST SOURCE CORPORATION1ST SOURCE CORP
23D SYSTEMS3D SYSTEMS CORP
33M COMPANY3M CO
4A.H. BELOA. H. BELO CORP
5A.O. SMITHSMITH (A.O.)
6ACADIA HEALTHCARE COACADIA HEALTHCARE
7ACADIA PHARMACEUTICALS  

If I use the code:

 

PROC SQL;
	CREATE TABLE WANT AS
	SELECT l.*, r.*
	FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
	ON	(l.company_name = r.company_name) and l.year = r.year;
QUIT;

Then the above values will not find a match, even though they should be matched.

 

Since these 5 rows are the most common examples of a match failing in my data, I was wondering if 

there were a way to join tables contingent on matching substrings match? 

 

In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order

In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)

In Row 5, Table 1 is substring of Table 2, but the values are out of order.

 

In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.

 

I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started. 

 

Thanks!

 

3 REPLIES 3
Shmuel
Garnet | Level 18

Next code may give you a preparation to the join :

data table1;
   input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS 
;;;;
run;
data table2;
   input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;

%macro short(text);
   &TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
   &text = tranwrd(&text,'CORPORATION','CORP');
   &text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;

DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;

Check tables TABLE1A TABLE2A;

UniversitySas
Quartz | Level 8
Ah, I see what you are doing.

Could I confirm what the line:
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');

accomplishes?
Shmuel
Garnet | Level 18

Function COMPBL compress the text to a single space between words

Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').

Function Compress with delimiters '()'  - i/e/ both parenthesis - will remove them from the text.

 

The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1216 views
  • 1 like
  • 2 in conversation