This topic is solved and locked.
SarahDew
Obsidian | Level 7

Data example:

1. Company ABCDE || ABCDE is the organisation
2. MyBusiness || MyBusiness Ltd.
3. Governmental agency FGHIJ || G. A. FGHIJ etc.
4. Made Up Company Name || "Made Up" Name

For each of these four records, I want to find the longest substring that the two values (left and right of the ||) have in common. The length alone would help (e.g. 5 for the first record, 10 for the second), but preferably I would also get a new variable holding the substring itself (e.g. ABCDE or MyBusiness) so that I can check the match visually.

ACCEPTED SOLUTION
Ksharp
Super User
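/* Sample data: the two values to compare are separated by '||'. */
data have;
 input x $80.;
 cards;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;

/* Brute force: try every substring of x1, longest first, */
/* and stop at the first one that is found in x2.         */
data want;
 set have;
 x1=scan(x,1,'|'); x2=scan(x,-1,'|');
 len=length(x1);
 do length=len to 1 by -1;
  do start=1 to len-length+1;
   /* Skip candidates that begin or end with a blank, so that   */
   /* STRIP() below never shortens them and LENGTH is the true  */
   /* number of matched characters.                             */
   if char(x1,start)=' ' or char(x1,start+length-1)=' ' then continue;
   substr=substr(x1,start,length);
   if find(x2,strip(substr)) then do; yes=1; leave; end;
  end;
  if yes then leave;
 end;
 /* No common substring: LENGTH ends at 0, so blank out SUBSTR too. */
 if not yes then call missing(substr);
 drop len start yes x1 x2;
run;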


6 REPLIES
Astounding
PROC Star

When you say "substring" do you mean any characters, or words only?  For example, is there any match at all here:

 

Business || MyBusiness Ltd.

SarahDew
Obsidian | Level 7
I mean any characters. In your example "Business" would be the matched substring.
PGStats
Opal | Level 21

That's very expensive to calculate. If your goal is to match strings, have you considered using the spelling distance functions COMPGED, COMPLEV or SPEDIS?

PG
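
For reference, a minimal sketch of how the three distance functions mentioned above could be called on the sample records. The data set name `distances` and the variable names `ged`, `lev` and `spd` are illustrative assumptions, not something from this thread; the records are split on the `|` character in the same way as elsewhere in the discussion.

```sas
/* Illustrative only: compute the three spelling distances per record. */
data distances;
  input x $80.;
  x1 = strip(scan(x, 1, '|'));    /* value left of the '||'  */
  x2 = strip(scan(x, -1, '|'));   /* value right of the '||' */
  ged = compged(x1, x2);          /* generalized edit distance */
  lev = complev(x1, x2);          /* Levenshtein edit distance */
  spd = spedis(x1, x2);           /* asymmetric spelling distance */
  drop x;
  datalines;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;

proc print data=distances;
run;
```

Lower values mean the two strings are closer. These scores rate whole-string similarity, so they do not by themselves return a shared substring.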
SarahDew
Obsidian | Level 7

Yes, I have already found many matches using the COMPGED and FIND functions. However, neither method works for my third example record: too many edits are needed to get from one string to the other, and neither value occurs in full inside the other. The cost is not a big concern, since I have already greatly reduced the set of unmatched records.

 

Maybe a less expensive alternative would be a loop in which every possible 4-character substring of var1 is looked up in var2 through the FIND function, and vice versa? Any ideas on how to write this sort of code?
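
One way the 4-character idea might be written, as a sketch only. The names `four_char_hits`, `hit` and `found` are hypothetical, and `x1`/`x2` stand in for var1/var2. Only one direction is searched here: any 4-character window of `x1` that occurs in `x2` is also a 4-character window of `x2` that occurs in `x1`, so the reverse pass would find nothing new.

```sas
/* Illustrative only: flag records whose two values share at least */
/* one run of 4 consecutive characters (blanks count as characters). */
data four_char_hits;
  input x $80.;
  length x1 x2 $80 hit $4;
  x1 = strip(scan(x, 1, '|'));    /* value left of the '||'  */
  x2 = strip(scan(x, -1, '|'));   /* value right of the '||' */
  found = 0;
  /* Slide a 4-character window over x1 and look each window up in x2. */
  /* The loop does not run at all when x1 is shorter than 4 characters. */
  do start = 1 to length(x1) - 3;
    hit = substr(x1, start, 4);
    if find(x2, hit) > 0 then do;
      found = 1;
      leave;                      /* first shared window is enough */
    end;
  end;
  if not found then hit = ' ';    /* no shared window: leave HIT blank */
  drop x start;
  datalines;
Company ABCDE || ABCDE is the organisation
MyBusiness || MyBusiness Ltd.
Governmental agency FGHIJ || G. A. FGHIJ etc.
Made Up Company Name || "Made Up" Name
;
run;
```

This does one FIND call per window rather than one per possible substring, at the price of reporting only whether some 4-character overlap exists, not the longest one.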

SarahDew
Obsidian | Level 7
Thanks @Ksharp! Works like a charm.
