I have some data which was collected using some decision tree method. The end result is a long text field which has data like this:
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
...
etc...
I want to extract the LAST string of text for each record to the RIGHT of the LAST "->" characters.
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
...
etc...
Given each string between those characters is of different lengths, and each record has a different number of sets, I am lost on how to approach this.
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;
data want;
set have;
want = strip(scan(var, -1, '->'));
run;
You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.
Hi:
Is there actually a delimiter of -> in the data or was that just in your example? The existence of a delimiter will make a difference. It is possible to use the SCAN function to get the "back end" of a string, if there is a delimiter. The key is telling SCAN to start at the end of the string for scanning and extraction. Example 1 below uses -> and space as the the delimiters and Example 2 uses the | as the delimiter. The way that SCAN works is that contiguous delimiters are treated as 1 delimiter for purposes of scanning. the -1 tells SCAN to start at the right side of the string and bring back the first chunk on the end.
Cynthia
data example;
length longstring $150 lastchunk $50 nextchunkback $50;
infile datalines dlm=',' dsd;
input examp_num longstring $;
if examp_num = 1 then do;
lastchunk = scan(longstring,-1,'-> ');
nextchunkback = scan(longstring,-2,'-> ');
end;
if examp_num = 2 then do;
lastchunk = scan(longstring,-1,'|');
nextchunkback = scan(longstring,-2,'|');
end;
return;
datalines;
1,"aaa -> bbbbbbbbb-> cc -> dddddddddddddd"
2,"xxxxxxxx|yyy|zzzzzzzzz|1z2z3z4z|5y6y7y8y9y|xx10xx11xx12xx"
;
run;
proc print data=example;
var examp_num longstring lastchunk nextchunkback;
run;
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;
data want;
set have;
want = strip(scan(var, -1, '->'));
run;
You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.
That got it. Thanks. I knew it would be that function, I am just more accustomed to starting at the beginning.
That is treating both - and > as delimited, independently.
So strings like
xxx -> aaa-bbb
xxx -> aaa>bbb
Will mistakenly return "bbb".
The SCAN function can extract "words" from a multiword expression. You have such an expression, with word dividers of '-' and or '>'. Use a -1 to tell scan to find the rightmost word (+1 would be the leftmost word).
data _null_;
longtext='aaaaaaa->bbbb->->ccccc->dddddddd->eeee->fffff->gggggg->hhhhhhh';
lastword=scan(longtext,-1,'->');
put (_all_) (= /);
run;
If you had a one byte delimiter, say just >, then you could use scan.
last_word = scan(string,-1,'>');
Note that if the string has no > then the result is the whole string.
If you need to use that three byte sequence, '-> ', then you will have to work harder. You will need find the location of the last '-> ' and then use SUBSTRN() function.
data test;
input string $80.;
loc = find(string,'-> ',-vlength(string));
if loc then want = substrn(string,loc+3);
cards;
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
none
;
Results:
Obs string loc want 1 xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx 40 xxxxxxxxxx 2 xxx -> xxxxxxx -> xxxxxxxx 16 xxxxxxxx 3 xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx 50 xxxxxx 4 none 0 5 0
As long as none of the characters after the last -> sequence contain > then just using scan() with > as the delimiter will work fine.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.