I have some data which was collected using some decision tree method. The end result is a long text field which has data like this:
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
...
etc...
I want to extract the LAST string of text for each record to the RIGHT of the LAST "->" characters.
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
...
etc...
Given each string between those characters is of different lengths, and each record has a different number of sets, I am lost on how to approach this.
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;
data want;
set have;
want = strip(scan(var, -1, '->'));
run;
You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.
Hi:
Is there actually a delimiter of -> in the data or was that just in your example? The existence of a delimiter will make a difference. It is possible to use the SCAN function to get the "back end" of a string, if there is a delimiter. The key is telling SCAN to start at the end of the string for scanning and extraction. Example 1 below uses -> and space as the the delimiters and Example 2 uses the | as the delimiter. The way that SCAN works is that contiguous delimiters are treated as 1 delimiter for purposes of scanning. the -1 tells SCAN to start at the right side of the string and bring back the first chunk on the end.
Cynthia
data example;
length longstring $150 lastchunk $50 nextchunkback $50;
infile datalines dlm=',' dsd;
input examp_num longstring $;
if examp_num = 1 then do;
lastchunk = scan(longstring,-1,'-> ');
nextchunkback = scan(longstring,-2,'-> ');
end;
if examp_num = 2 then do;
lastchunk = scan(longstring,-1,'|');
nextchunkback = scan(longstring,-2,'|');
end;
return;
datalines;
1,"aaa -> bbbbbbbbb-> cc -> dddddddddddddd"
2,"xxxxxxxx|yyy|zzzzzzzzz|1z2z3z4z|5y6y7y8y9y|xx10xx11xx12xx"
;
run;
proc print data=example;
var examp_num longstring lastchunk nextchunkback;
run;
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;
data want;
set have;
want = strip(scan(var, -1, '->'));
run;
You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.
That got it. Thanks. I knew it would be that function, I am just more accustomed to starting at the beginning.
That is treating both - and > as delimited, independently.
So strings like
xxx -> aaa-bbb
xxx -> aaa>bbb
Will mistakenly return "bbb".
The SCAN function can extract "words" from a multiword expression. You have such an expression, with word dividers of '-' and or '>'. Use a -1 to tell scan to find the rightmost word (+1 would be the leftmost word).
data _null_;
longtext='aaaaaaa->bbbb->->ccccc->dddddddd->eeee->fffff->gggggg->hhhhhhh';
lastword=scan(longtext,-1,'->');
put (_all_) (= /);
run;
If you had a one byte delimiter, say just >, then you could use scan.
last_word = scan(string,-1,'>');
Note that if the string has no > then the result is the whole string.
If you need to use that three byte sequence, '-> ', then you will have to work harder. You will need find the location of the last '-> ' and then use SUBSTRN() function.
data test;
input string $80.;
loc = find(string,'-> ',-vlength(string));
if loc then want = substrn(string,loc+3);
cards;
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
none
;
Results:
Obs string loc want 1 xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx 40 xxxxxxxxxx 2 xxx -> xxxxxxx -> xxxxxxxx 16 xxxxxxxx 3 xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx 50 xxxxxx 4 none 0 5 0
As long as none of the characters after the last -> sequence contain > then just using scan() with > as the delimiter will work fine.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.