I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.
My primary identifier in both tables is under "Company_Name".
The issues I am having can be seen if we place the tables side by side.
Row# | Table 1 Companies | Table 2 Companies |
1 | 1ST SOURCE CORPORATION | 1ST SOURCE CORP |
2 | 3D SYSTEMS | 3D SYSTEMS CORP |
3 | 3M COMPANY | 3M CO |
4 | A.H. BELO | A. H. BELO CORP |
5 | A.O. SMITH | SMITH (A.O.) |
6 | ACADIA HEALTHCARE CO | ACADIA HEALTHCARE |
7 | ACADIA PHARMACEUTICALS |
If I use the code:
PROC SQL;
CREATE TABLE WANT AS
SELECT l.*, r.*
FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
ON (l.company_name = r.company_name) and l.year = r.year;
QUIT;
Then the above values will not find a match, even though they should be matched.
Since these 5 rows are the most common examples of a match failing in my data, I was wondering if
there were a way to join tables contingent on matching substrings match?
In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order
In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)
In Row 5, Table 1 is substring of Table 2, but the values are out of order.
In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.
I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started.
Thanks!
Next code may give you a preparation to the join :
data table1;
input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS
;;;;
run;
data table2;
input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;
%macro short(text);
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
&text = tranwrd(&text,'CORPORATION','CORP');
&text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;
DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;
Check tables TABLE1A TABLE2A;
Function COMPBL compress the text to a single space between words
Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').
Function Compress with delimiters '()' - i/e/ both parenthesis - will remove them from the text.
The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.