I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.
My primary identifier in both tables is under "Company_Name".
The issues I am having can be seen if we place the tables side by side.
|Row#||Table 1 Companies||Table 2 Companies|
|1||1ST SOURCE CORPORATION||1ST SOURCE CORP|
|2||3D SYSTEMS||3D SYSTEMS CORP|
|3||3M COMPANY||3M CO|
|4||A.H. BELO||A. H. BELO CORP|
|5||A.O. SMITH||SMITH (A.O.)|
|6||ACADIA HEALTHCARE CO||ACADIA HEALTHCARE|
If I use the code:
PROC SQL; CREATE TABLE WANT AS SELECT l.*, r.* FROM TABLE1 AS l LEFT JOIN TABLE2 AS r ON (l.company_name = r.company_name) and l.year = r.year; QUIT;
Then the above values will not find a match, even though they should be matched.
Since these 5 rows are the most common examples of a match failing in my data, I was wondering if
there were a way to join tables contingent on matching substrings match?
In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order
In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)
In Row 5, Table 1 is substring of Table 2, but the values are out of order.
In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.
I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started.
Next code may give you a preparation to the join :
data table1; input company $CHAR50.; cards4; 1ST SOURCE CORPORATION 3D SYSTEMS 3M COMPANY A.H. BELO A.O. SMITH ACADIA HEALTHCARE CO ACADIA PHARMACEUTICALS ;;;; run; data table2; input company $CHAR50.; cards4; 1ST SOURCE CORP 3D SYSTEMS CORP 3M CO A. H. BELO CORP SMITH (A.O.) ACADIA HEALTHCARE ;;;; run; %macro short(text); &TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()'); &text = tranwrd(&text,'CORPORATION','CORP'); &text = tranwrd(&text,'COMPANY','CO'); %MEND SHORT; DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN; DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;
Check tables TABLE1A TABLE2A;
Function COMPBL compress the text to a single space between words
Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').
Function Compress with delimiters '()' - i/e/ both parenthesis - will remove them from the text.
The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).
Find more tutorials on the SAS Users YouTube channel.