I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.
My primary identifier in both tables is under "Company_Name".
The issues I am having can be seen if we place the tables side by side.
Row# | Table 1 Companies | Table 2 Companies |
1 | 1ST SOURCE CORPORATION | 1ST SOURCE CORP |
2 | 3D SYSTEMS | 3D SYSTEMS CORP |
3 | 3M COMPANY | 3M CO |
4 | A.H. BELO | A. H. BELO CORP |
5 | A.O. SMITH | SMITH (A.O.) |
6 | ACADIA HEALTHCARE CO | ACADIA HEALTHCARE |
7 | ACADIA PHARMACEUTICALS |
If I use the code:
PROC SQL;
CREATE TABLE WANT AS
SELECT l.*, r.*
FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
ON (l.company_name = r.company_name) and l.year = r.year;
QUIT;
Then the above values will not find a match, even though they should be matched.
Since these 5 rows are the most common examples of a match failing in my data, I was wondering if
there were a way to join tables contingent on matching substrings match?
In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order
In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)
In Row 5, Table 1 is substring of Table 2, but the values are out of order.
In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.
I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started.
Thanks!
Next code may give you a preparation to the join :
data table1;
input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS
;;;;
run;
data table2;
input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;
%macro short(text);
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
&text = tranwrd(&text,'CORPORATION','CORP');
&text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;
DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;
Check tables TABLE1A TABLE2A;
Function COMPBL compress the text to a single space between words
Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').
Function Compress with delimiters '()' - i/e/ both parenthesis - will remove them from the text.
The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.