Solved: Re: How to extract text to RIGHT of character

RandoDando · Posted 07-28-2022 12:06 PM

I have some data which was collected using some decision tree method. The end result is a long text field which has data like this:

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

I want to extract the LAST string of text for each record to the RIGHT of the LAST "->" characters.

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

Given each string between those characters is of different lengths, and each record has a different number of sets, I am lost on how to approach this.

maguiremq · Posted 07-28-2022 12:24 PM

data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

View solution in original post

Cynthia_sas · Posted 07-28-2022 12:20 PM

Hi:

Is there actually a delimiter of -> in the data or was that just in your example? The existence of a delimiter will make a difference. It is possible to use the SCAN function to get the "back end" of a string, if there is a delimiter. The key is telling SCAN to start at the end of the string for scanning and extraction. Example 1 below uses -> and space as the the delimiters and Example 2 uses the | as the delimiter. The way that SCAN works is that contiguous delimiters are treated as 1 delimiter for purposes of scanning. the -1 tells SCAN to start at the right side of the string and bring back the first chunk on the end.

Cynthia

data example;
  length longstring $150 lastchunk $50 nextchunkback $50;
  infile datalines dlm=',' dsd;
  input examp_num longstring $;
  if examp_num = 1 then do;
     lastchunk = scan(longstring,-1,'-> ');
	 nextchunkback = scan(longstring,-2,'-> ');
  end;
  if examp_num = 2 then do;
     lastchunk = scan(longstring,-1,'|');
	 nextchunkback = scan(longstring,-2,'|');
  end;
  return;
  datalines;
  1,"aaa -> bbbbbbbbb-> cc -> dddddddddddddd"
  2,"xxxxxxxx|yyy|zzzzzzzzz|1z2z3z4z|5y6y7y8y9y|xx10xx11xx12xx"
  ;
  run;
  
  proc print data=example;
  var examp_num longstring lastchunk nextchunkback;
  run;

RandoDando · Posted 07-28-2022 12:23 PM

The text field I am looking at actually does have "->" as the delimiter between sections of the string.

maguiremq · Posted 07-28-2022 12:24 PM

data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

RandoDando · Posted 07-28-2022 12:30 PM

That got it. Thanks. I knew it would be that function, I am just more accustomed to starting at the beginning.

Tom · Posted 07-28-2022 12:38 PM

That is treating both - and > as delimited, independently.

So strings like

xxx ->  aaa-bbb
xxx ->  aaa>bbb

Will mistakenly return "bbb".

mkeintz · Posted 07-28-2022 12:31 PM

The SCAN function can extract "words" from a multiword expression. You have such an expression, with word dividers of '-' and or '>'. Use a -1 to tell scan to find the rightmost word (+1 would be the leftmost word).

data _null_;
  longtext='aaaaaaa->bbbb->->ccccc->dddddddd->eeee->fffff->gggggg->hhhhhhh';
  lastword=scan(longtext,-1,'->');
  put (_all_) (= /);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 07-28-2022 12:37 PM

If you had a one byte delimiter, say just >, then you could use scan.

last_word = scan(string,-1,'>');

Note that if the string has no > then the result is the whole string.

If you need to use that three byte sequence, '-> ', then you will have to work harder. You will need find the location of the last '-> ' and then use SUBSTRN() function.

data test;
  input string $80.;
  loc = find(string,'-> ',-vlength(string));
  if loc then want = substrn(string,loc+3);
cards;
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
none

;

Results:

Obs    string                                                        loc    want

 1     xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx           40    xxxxxxxxxx
 2     xxx -> xxxxxxx -> xxxxxxxx                                     16    xxxxxxxx
 3     xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx     50    xxxxxx
 4     none                                                            0
 5                                                                     0

maguiremq · Posted 07-28-2022 12:43 PM

Good catch! Thanks @Tom. Slipped by me. @willisva please look at some other solutions and mark them as the correct answer.

RandoDando · Posted 07-28-2022 12:58 PM

I just tried this and so far don't see a big difference in the output. However, given what you have said I may just go with this.

Tom · Posted 07-28-2022 01:30 PM

As long as none of the characters after the last -> sequence contain > then just using scan() with > as the delimiter will work fine.

Classroom Training Available!