BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
GuyTreepwood
Obsidian | Level 7

Hello, 

 

I have the following two text columns, and I would like to concatenate them using overlapping text values found at the end of String_1 and the beginning of String_2 as join 'keys' (for lack of a better term). I would like to be able to do a dynamic search that can find the overlapping text values which can have different string lengths for each record. For example, the solution should find that 'GEAR DRIV' is the overlapping text in observation 1, and '3 5 HP 1560 RPM 29' in observation 2.

 

String_1 String_2 Want
GEAR  MO 298374298 19  GEAR DRIV GEAR DRIVE HOUSING GEAR  MO 298374298 19 GEAR DRIVE HOUSING
ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 29 3 5 HP  1560 RPM 295TC DRAIN ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 295TC DRAIN

 

Thanks in advance. 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You could just keep trying to match shorter and shorter versions of STRING_1 to the beginning of STRING_2 until you find a match.  You can add a little wrinkle to make sure a space gets added when there is no match.

data have;
  infile datalines dsd truncover ;
  input String_1 :$100. String_2 :$100. Want :$150.;
datalines;
MOTOR LATCH 854A LATCH DRYER FOR EC,LATCH DRYER FOR ECCENTRIC MOTOR  OUT,MOTOR LATCH 854A LATCH DRYER FOR ECCENTRIC MOTOR  OUT
ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 29,3 5 HP  1560 RPM 295TC DRAIN,ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 295TC DRAIN
aaa,,aaa
,bbb,bbb
aaa,bbb,aaa bbb
;

data want;
  set have;
  length combo $150;
  do lp=1 to lengthn(string_1) until(substrn(string_1,lp,lengthn(string_1)-lp+1)=:string_2); end;
  if lp > length(string_1) then lp+1;
  combo=substrn(string_1||' ',1,lp-1) || string_2;
  match=combo=want;
run;

proc print;
run;

Results:

Tom_0-1713454944336.png

 

View solution in original post

5 REPLIES 5
ballardw
Super User

Can you provide example data in the form of a working data step with values of your String_1 and String_2? I'm way too lazy to want build such.

 

Also, what if there are differences in case: "GEAR DRIV" and "Gear drive"?

What if there are no overlaps? What is to be the result (I might guess but guessing is poor code design)

 

With one example:

data example;
   string_1='GEAR  MO 298374298 19  GEAR DRIV';
   string_2='GEAR DRIVE HOUSING';
   /* find overlap*/
   length a b $ 40. want $ 60;
   do i = length(string_1) to 1 by -1;
      a=substr(string_1,i);
      b=substr(string_2,1,length(string_1)-i+1);
      if substr(string_1,i) = substr(string_2,1,length(string_1)-i+1) then do;
         want = cats(string_1,substr(string_2,length(string_1)-i+2));
         leave;
      end;;
   end;
   drop i;
run; 

The temporary variables A and B are so you can see what is going on and should be long enough to hold string_1 and string_2. Want should be defined to hold the longest expected string.

No implementation of a "no overlap rule" as not provided.

 

Basically this parses one string from back to front and pulls the same number of characters from the second string. When they match then do the concatenation and output.

The LEAVE instruction means "exit the do loop". If code is working correctly then only the match gets set.

Drop variables A and B when you are comfortable with results.

Note that the above code does not consider case of letters in any fashion.

GuyTreepwood
Obsidian | Level 7

*Updated*

 

Hello,

 

Here is the code to create the sample dataset using a data step:

 

data have;
input String_1 :$100. String_2 :$100. Want :$150.;
infile datalines dlm=',';
datalines;
GEAR MO 298374298 19 GEAR DRIV, GEAR DRIVE HOUSING, GEAR MO 298374298 19 GEAR DRIVE HOUSING
ROTOR DC LEM9984T 3 5 HP 1560 RPM 29,3 5 HP 1560 RPM 295TC DRAIN,ROTOR DC LEM9984T 3 5 HP 1560 RPM 295TC DRAIN
;

Regarding the differences in case, I take care of that before doing this concatenation, so that should not be an issue. 

 

Regarding what to do if there are no matches, then in that case, it should just concatenate the two strings as they are using a space as the separator. 

ballardw
Super User

If the i from the Do loop is 0 then a matching string was not found. So after the do loop end adding:

   if i=0 then want=catx(' ',string_1,string_2);

should take care of the no overlap case.

 

Note that i=1 would mean that the strings are the same. IF you have a lot of data and this might happen frequently then perhaps on name of efficiency test for equality and skip the parse if so.

 

There is a potential logic issue with the above suggestion: If you have one string end in multiples of the same character and start with multiples of the same character the match will get the last/first character over lap but not all of the characters. Example: ABCDD and DDEF will yield ABCDDDEF instead of likely desired ABCDDEF.  If this is a likely case then more logic is needed (and possibly convoluted to handle cases where entire substrings like BED are in this:   ABEDBED, BEDBEDBEDBED

Tom
Super User Tom
Super User

You could just keep trying to match shorter and shorter versions of STRING_1 to the beginning of STRING_2 until you find a match.  You can add a little wrinkle to make sure a space gets added when there is no match.

data have;
  infile datalines dsd truncover ;
  input String_1 :$100. String_2 :$100. Want :$150.;
datalines;
MOTOR LATCH 854A LATCH DRYER FOR EC,LATCH DRYER FOR ECCENTRIC MOTOR  OUT,MOTOR LATCH 854A LATCH DRYER FOR ECCENTRIC MOTOR  OUT
ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 29,3 5 HP  1560 RPM 295TC DRAIN,ROTOR  DC  LEM9984T  3 5 HP  1560 RPM 295TC DRAIN
aaa,,aaa
,bbb,bbb
aaa,bbb,aaa bbb
;

data want;
  set have;
  length combo $150;
  do lp=1 to lengthn(string_1) until(substrn(string_1,lp,lengthn(string_1)-lp+1)=:string_2); end;
  if lp > length(string_1) then lp+1;
  combo=substrn(string_1||' ',1,lp-1) || string_2;
  match=combo=want;
run;

proc print;
run;

Results:

Tom_0-1713454944336.png

 

FreelanceReinh
Jade | Level 19

Hello @GuyTreepwood,

 

You could also try the PRXCHANGE function:

data want;
set have;
want=translate(prxchange('s/(.+)§\1/\1/',1,catx('§',string_1,string_2)),' ','§');
run;

(where the delimiter "§" is assumed not to occur in STRING_1 or STRING_2).

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 331 views
  • 3 likes
  • 4 in conversation