SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
RandoDando
Pyrite | Level 9

I have some data which was collected using some decision tree method.  The end result is a long text field which has data like this:

 

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

 

I want to extract the LAST string of text for each record to the RIGHT of the LAST "->" characters. 

 

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

 

Given each string between those characters is of different lengths, and each record has a different number of sets, I am lost on how to approach this. 

1 ACCEPTED SOLUTION

Accepted Solutions
maguiremq
SAS Super FREQ
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

maguiremq_0-1659025489213.png

 

 

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

 

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

View solution in original post

10 REPLIES 10
Cynthia_sas
SAS Super FREQ

Hi:

  Is there actually a delimiter of -> in the data or was that just in your example? The existence of a delimiter will make a difference. It is possible to use the SCAN function to get the "back end" of a string, if there is a delimiter. The key is telling SCAN to start at the end of the string for scanning and extraction. Example 1 below uses -> and space as the the delimiters and Example 2 uses the | as the delimiter. The way that SCAN works is that contiguous delimiters are treated as 1 delimiter for purposes of scanning. the -1 tells SCAN to start at the right side of the string and bring back the first chunk on the end.

Cynthia

data example;
  length longstring $150 lastchunk $50 nextchunkback $50;
  infile datalines dlm=',' dsd;
  input examp_num longstring $;
  if examp_num = 1 then do;
     lastchunk = scan(longstring,-1,'-> ');
	 nextchunkback = scan(longstring,-2,'-> ');
  end;
  if examp_num = 2 then do;
     lastchunk = scan(longstring,-1,'|');
	 nextchunkback = scan(longstring,-2,'|');
  end;
  return;
  datalines;
  1,"aaa -> bbbbbbbbb-> cc -> dddddddddddddd"
  2,"xxxxxxxx|yyy|zzzzzzzzz|1z2z3z4z|5y6y7y8y9y|xx10xx11xx12xx"
  ;
  run;
  
  proc print data=example;
  var examp_num longstring lastchunk nextchunkback;
  run;

 

RandoDando
Pyrite | Level 9
The text field I am looking at actually does have "->" as the delimiter between sections of the string.
maguiremq
SAS Super FREQ
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

maguiremq_0-1659025489213.png

 

 

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

 

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

RandoDando
Pyrite | Level 9

That got it. Thanks. I knew it would be that function, I am just more accustomed to starting at the beginning.

Tom
Super User Tom
Super User

That is treating both - and > as delimited, independently.

So strings like 

xxx ->  aaa-bbb
xxx ->  aaa>bbb

Will mistakenly return "bbb".

mkeintz
PROC Star

The SCAN function can extract "words" from a multiword expression.  You have such an expression, with word dividers of '-' and or '>'.  Use a -1 to tell scan to find the rightmost word  (+1 would be the leftmost word).

 

data _null_;
  longtext='aaaaaaa->bbbb->->ccccc->dddddddd->eeee->fffff->gggggg->hhhhhhh';
  lastword=scan(longtext,-1,'->');
  put (_all_) (= /);
run; 
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

If you had a one byte delimiter, say just >, then you could use scan.

last_word = scan(string,-1,'>');

Note that if the string has no > then the result is the whole string.

 

If you need to use that three byte sequence, '-> ', then you will have to work harder.  You will need find the location of the last '-> ' and then use SUBSTRN() function.

data test;
  input string $80.;
  loc = find(string,'-> ',-vlength(string));
  if loc then want = substrn(string,loc+3);
cards;
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
none

;

Results:

Obs    string                                                        loc    want

 1     xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx           40    xxxxxxxxxx
 2     xxx -> xxxxxxx -> xxxxxxxx                                     16    xxxxxxxx
 3     xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx     50    xxxxxx
 4     none                                                            0
 5                                                                     0

 

 

maguiremq
SAS Super FREQ
Good catch! Thanks @Tom. Slipped by me. @willisva please look at some other solutions and mark them as the correct answer.


RandoDando
Pyrite | Level 9
I just tried this and so far don't see a big difference in the output. However, given what you have said I may just go with this.
Tom
Super User Tom
Super User

As long as none of the characters after the last -> sequence contain > then just using scan() with > as the delimiter will work fine.

sas-innovate-white.png

Join us for our biggest event of the year!

Four days of inspiring keynotes, product reveals, hands-on learning opportunities, deep-dive demos, and peer-led breakouts. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 5589 views
  • 3 likes
  • 5 in conversation