Data example:
1. Company ABCDE || ABCDE is the organisation
2. MyBusiness || MyBusiness Ltd.
3. Governmental agency FGHIJ || G. A. FGHIJ etc.
4. Made Up Company Name || "Made Up" Name
From these four records, I want an indication of the longest substring that the two values have in common. This can be a number (e.g. 5 for the first record, 10 for the second), but preferably I would also create a new variable containing that substring (e.g. ABCDE or MyBusiness) so that I can visually check the match.
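data have;
  input x $80.;
  cards;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;

data want;
  set have;
  /* split each record on the | delimiter into the two values to compare */
  x1 = scan(x, 1, '|');
  x2 = scan(x, -1, '|');
  len = length(x1);
  /* try the longest candidates first, so the first hit is the longest match */
  do length = len to 1 by -1;
    do start = 1 to len - length + 1;
      substr = substr(x1, start, length);
      /* skip candidates that begin or end with a blank: STRIP would shorten */
      /* them before the lookup and LENGTH would then overstate the match    */
      if substr ne: ' ' and char(x1, start + length - 1) ne ' '
         and find(x2, strip(substr)) then do;
        yes = 1;
        leave;
      end;
    end;
    if yes then leave;
  end;
  /* no common substring at all: LENGTH is 0 here, so blank out SUBSTR too */
  if not yes then call missing(substr);
  drop len start yes x1 x2;
run;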
When you say "substring" do you mean any characters, or words only? For example, is there any match at all here:
Business || MyBusiness Ltd.
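To make the distinction concrete, here is a minimal sketch comparing a character-level lookup (FIND) with a word-level lookup (FINDW) on that example. The variable names char_pos and word_pos are only illustrative, and FINDW is used with its default delimiters.

```sas
data _null_;
  /* character-level: 'Business' occurs inside 'MyBusiness' */
  char_pos = find('MyBusiness Ltd.', 'Business');
  /* word-level: 'Business' is not a whole word in the second value */
  word_pos = findw('MyBusiness Ltd.', 'Business');
  put char_pos= word_pos=;   /* writes char_pos=3 word_pos=0 to the log */
run;
```

So whether this record counts as a match depends entirely on which of the two definitions is intended.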
That's very expensive to calculate. If your goal is to match strings, have you considered using the spelling distance functions COMPGED, COMPLEV or SPEDIS?
Yes, I have already determined many matches using the COMPGED and FIND functions. However, for my third example record none of these methods work, since many manipulations are needed to get from one string to the other, and neither value is found exactly within the other. The cost is not a big problem, though, since I have already greatly reduced the set of non-matched records.
Maybe a less expensive alternative would be a loop where every possible 4-character substring of var1 is looked up in var2 through the FIND function, and vice versa? Any ideas on how to write this sort of code?
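For reference, a minimal sketch of how the three distance functions could be computed side by side on the sample records, to see how far apart the two values are in each row. The data set name distances and the variables s1, s2, ged, lev and spd are made up for this illustration, and no cutoff or modifier arguments are used.

```sas
data distances;
  input x $80.;
  length s1 s2 $80;
  /* split on the | delimiter and remove surrounding blanks */
  s1 = strip(scan(x, 1, '|'));
  s2 = strip(scan(x, -1, '|'));
  ged = compged(trim(s1), trim(s2));  /* generalized edit distance */
  lev = complev(trim(s1), trim(s2));  /* Levenshtein edit distance */
  spd = spedis(trim(s1), trim(s2));   /* asymmetric spelling distance */
  drop x;
  cards;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;
```

All three score the whole string, so a short shared token such as FGHIJ surrounded by unrelated text will still produce a large distance.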
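A minimal sketch of that fixed-window idea, assuming a window of 4 characters, a case-sensitive comparison, and that windows containing a blank should be skipped (that last rule is my assumption, not something stated above). The names want4, chunk and hit are made up for this illustration.

```sas
data have;
  input x $80.;
  cards;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;

data want4;
  set have;
  length x1 x2 $80 chunk $4;
  x1 = strip(scan(x, 1, '|'));
  x2 = strip(scan(x, -1, '|'));
  hit = 0;
  /* slide a 4-character window over x1 and stop at the first window found in x2 */
  do start = 1 to length(x1) - 3 while (hit = 0);
    chunk = substr(x1, start, 4);
    /* assumption: ignore windows that contain a blank */
    if index(chunk, ' ') = 0 and find(x2, chunk) > 0 then hit = 1;
  end;
  /* no window matched: clear the last window that was tried */
  if hit = 0 then chunk = ' ';
  drop start x1 x2;
run;
```

The reverse lookup should not be needed: any 4-character window of var1 that occurs in var2 is also a window of var2 that occurs in var1, so one direction covers both. This only flags that some 4-character piece is shared; it does not return the longest common substring.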