Re: PROC SQL Joining on Substrings?

UniversitySas · Posted 09-07-2020 07:22 AM

I am trying to merge two datasets, but the problem is that my data is not in such a great format across either table.

My primary identifier in both tables is under "Company_Name".

The issues I am having can be seen if we place the tables side by side.

Row#	Table 1 Companies	Table 2 Companies
1	1ST SOURCE CORPORATION	1ST SOURCE CORP
2	3D SYSTEMS	3D SYSTEMS CORP
3	3M COMPANY	3M CO
4	A.H. BELO	A. H. BELO CORP
5	A.O. SMITH	SMITH (A.O.)
6	ACADIA HEALTHCARE CO	ACADIA HEALTHCARE
7	ACADIA PHARMACEUTICALS

If I use the code:

PROC SQL;
	CREATE TABLE WANT AS
	SELECT l.*, r.*
	FROM TABLE1 AS l LEFT JOIN TABLE2 AS r
	ON	(l.company_name = r.company_name) and l.year = r.year;
QUIT;

Then the above values will not find a match, even though they should be matched.

Since these 5 rows are the most common examples of a match failing in my data, I was wondering if

there were a way to join tables contingent on matching substrings match?

In Row 1 & 3, Table 2 is a substring of Table 1, in perfect order

In Row 2 & 4, Table 1 is a substring of Table 2, in perfect order (although in Row 4 there is a space after the first period, in table 2)

In Row 5, Table 1 is substring of Table 2, but the values are out of order.

In Row 6 and 7 - we have two companies with similar names - "Acadia" is present in both. In that case, I want to create the match between the entries which have more matching substrings. In this case, Row 6 has two matching words, whereas row 7 would only have had one.

I understand this will probably take a bit of guess work and fiddling around, as I can imagine that matching on substrings may lead to incorrect matches. In any case, I would really appreciate some help to get started.

Thanks!

Shmuel · Posted 09-07-2020 08:03 AM

Next code may give you a preparation to the join :

data table1;
   input company $CHAR50.;
cards4;
1ST SOURCE CORPORATION
3D SYSTEMS
3M COMPANY
A.H. BELO
A.O. SMITH
ACADIA HEALTHCARE CO
ACADIA PHARMACEUTICALS 
;;;;
run;
data table2;
   input company $CHAR50.;
cards4;
1ST SOURCE CORP
3D SYSTEMS CORP
3M CO
A. H. BELO CORP
SMITH (A.O.)
ACADIA HEALTHCARE
;;;;
run;

%macro short(text);
   &TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');
   &text = tranwrd(&text,'CORPORATION','CORP');
   &text = tranwrd(&text,'COMPANY','CO');
%MEND SHORT;

DATA TABLE1A; SET TABLE1; %SHORT(COMPANY);RUN;
DATA TABLE2A; SET TABLE2; %SHORT(COMPANY);RUN;

Check tables TABLE1A TABLE2A;

UniversitySas · Posted 09-07-2020 08:12 AM

Ah, I see what you are doing.

Could I confirm what the line:
&TEXT = COMPRESS(UPCASE(COMPBL(&TEXT)),'()');

accomplishes?

Shmuel · Posted 09-07-2020 08:56 AM

Function COMPBL compress the text to a single space between words

Function UPCASE eliminates case sensitivity ("A" in not equal to 'a').

Function Compress with delimiters '()' - i/e/ both parenthesis - will remove them from the text.

The only issue that I did not deal with is row # 5 - order of sub-strings (a.o. smith).

PROC SQL Joining on Substrings?