Solved: Re: Check the longest substring two strings have in common

SarahDew · Posted 11-05-2018 10:27 AM

Data example:

1. Company ABCDE || ABCDE is the organisation

2. MyBusiness || MyBusiness Ltd.

3. Governmental agency FGHIJ || G. A. FGHIJ etc.

4. Made Up Company Name || "Made Up" Name

For each of these four records, I would like an indication of the longest substring that the two values have in common. The result can be a number (e.g. 5 for the first record, 10 for the second), but preferably I would also create a new variable holding the substring itself (e.g. ABCDE or MyBusiness) so that I can check the match visually.

Astounding · Posted 11-05-2018 10:42 AM

When you say "substring" do you mean any characters, or words only? For example, is there any match at all here:

Business || MyBusiness Ltd.

SarahDew · Posted 11-05-2018 10:49 AM

I mean any characters. In your example "Business" would be the matched substring.

PGStats · Posted 11-05-2018 04:23 PM

That's very expensive to calculate. If your goal is to match strings, have you considered using the spelling distance functions COMPGED, COMPLEV or SPEDIS?

PG
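
For readers who have not used these functions, here is a minimal sketch of how they might be called on the third example record. The dataset name distances and the variable names a, b, ged, lev and spe are made up for illustration. No expected values are shown, because the scores depend on each function's cost rules.

```sas
/* Sketch only: dataset and variable names are illustrative assumptions. */
data distances;
 length a b $40;
 a='Governmental agency FGHIJ';
 b='G. A. FGHIJ etc.';
 ged=compged(a,b);  /* generalized edit distance */
 lev=complev(a,b);  /* Levenshtein edit distance */
 spe=spedis(a,b);   /* asymmetric spelling distance: query, then keyword */
 put ged= lev= spe=;
run;
```

A lower score means the two strings are closer. SPEDIS is not symmetric, so swapping its arguments can change the result.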

SarahDew · Posted 11-06-2018 03:32 AM

Yes, I have already determined many matches using the COMPGED and FIND functions. However, for my third example record neither method works: many edits are needed to get from one string to the other, and neither value is found verbatim in the other. The expense is not much of a problem, since I have already greatly reduced the set of non-matched records.

Maybe a less expensive alternative would be a loop in which every possible 4-character substring of var1 is looked up in var2 through the FIND function, and vice versa? Any ideas on how to write this sort of code?
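
The fixed-width idea in this post could be sketched roughly as follows. It is an illustration under assumptions, not code from the thread: the names var1 and var2 come from the wording above, while four_char_check, piece, hit and k are hypothetical. Windows that begin or end with a blank are skipped so that blank padding cannot produce a match.

```sas
/* Sketch only: names and the window width k=4 are assumptions. */
data four_char_check;
 length var1 var2 $80 piece hit $4;
 var1='Governmental agency FGHIJ';
 var2='G. A. FGHIJ etc.';
 k=4;
 /* Slide a k-character window over var1; stop at the first hit */
 do start=1 to length(var1)-k+1 until(hit ne ' ');
  piece=substr(var1,start,k);
  /* Skip windows that start or end with a blank */
  if substr(piece,1,1)=' ' or substr(piece,k,1)=' ' then continue;
  if find(var2,piece) then hit=piece;
 end;
 put hit=;  /* writes hit=FGHI for this pair */
 drop piece start k;
run;
```

For the "vice versa" direction, the same loop can be run with var1 and var2 swapped. A blank hit means that no 4-character window of var1 occurs in var2.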

Ksharp · Posted 11-06-2018 08:51 AM

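/* Sample data: each record holds two values separated by '||' */
data have;
input x $80.;
cards;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;

data want;
 set have;
 length match $80;
 /* Split on the delimiter and remove leading/trailing blanks */
 x1=strip(scan(x,1,'|'));
 x2=strip(scan(x,-1,'|'));
 len=length(x1);
 match_len=0;
 /* Try the longest candidates first, so the first hit is the longest match */
 do n=len to 1 by -1 until(match_len>0);
  do start=1 to len-n+1 until(match_len>0);
   /* Skip candidates that begin or end with a blank; otherwise
      "Made Up " would count as 8 characters instead of 7 */
   if substr(x1,start,1)=' ' or substr(x1,start+n-1,1)=' ' then continue;
   if find(x2,substr(x1,start,n)) then do;
    match=substr(x1,start,n);
    match_len=n;
   end;
  end;
 end;
 /* match stays blank and match_len stays 0 when nothing is shared */
 drop len n start x1 x2;
run;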

SarahDew · Posted 11-07-2018 07:13 AM

Thanks @Ksharp! Works like a charm.
