BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
RandoDando
Pyrite | Level 9

I have some data which was collected using some decision tree method.  The end result is a long text field which has data like this:

 

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

 

I want to extract the LAST string of text for each record to the RIGHT of the LAST "->" characters. 

 

xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx

xxx -> xxxxxxx -> xxxxxxxx

xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx

...

etc...

 

Given each string between those characters is of different lengths, and each record has a different number of sets, I am lost on how to approach this. 

1 ACCEPTED SOLUTION

Accepted Solutions
maguiremq
SAS Super FREQ
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

maguiremq_0-1659025489213.png

 

 

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

 

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

View solution in original post

10 REPLIES 10
Cynthia_sas
SAS Super FREQ

Hi:

  Is there actually a delimiter of -> in the data or was that just in your example? The existence of a delimiter will make a difference. It is possible to use the SCAN function to get the "back end" of a string, if there is a delimiter. The key is telling SCAN to start at the end of the string for scanning and extraction. Example 1 below uses -> and space as the the delimiters and Example 2 uses the | as the delimiter. The way that SCAN works is that contiguous delimiters are treated as 1 delimiter for purposes of scanning. the -1 tells SCAN to start at the right side of the string and bring back the first chunk on the end.

Cynthia

data example;
  length longstring $150 lastchunk $50 nextchunkback $50;
  infile datalines dlm=',' dsd;
  input examp_num longstring $;
  if examp_num = 1 then do;
     lastchunk = scan(longstring,-1,'-> ');
	 nextchunkback = scan(longstring,-2,'-> ');
  end;
  if examp_num = 2 then do;
     lastchunk = scan(longstring,-1,'|');
	 nextchunkback = scan(longstring,-2,'|');
  end;
  return;
  datalines;
  1,"aaa -> bbbbbbbbb-> cc -> dddddddddddddd"
  2,"xxxxxxxx|yyy|zzzzzzzzz|1z2z3z4z|5y6y7y8y9y|xx10xx11xx12xx"
  ;
  run;
  
  proc print data=example;
  var examp_num longstring lastchunk nextchunkback;
  run;

 

RandoDando
Pyrite | Level 9
The text field I am looking at actually does have "->" as the delimiter between sections of the string.
maguiremq
SAS Super FREQ
data have;
infile datalines dsd missover;
input var :$100.;
datalines;
xxxxx -> xxxx -> x
xx -> xxxx -> xxxxxxxx
xxxx -> xxx -> xxxxxxxxxxxxxxxxx
;
run;

data want;
	set have;
	want = strip(scan(var, -1, '->'));
run;

maguiremq_0-1659025489213.png

 

 

You may not need `STRIP`. -1 gets the last in the string before a given delimiter. I specified the -> as the delimiter. There are a whole host of options in the documentation - very useful function.

 

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0jshdjy2z9zdzn1h7k90u99lyq6.h...

RandoDando
Pyrite | Level 9

That got it. Thanks. I knew it would be that function, I am just more accustomed to starting at the beginning.

Tom
Super User Tom
Super User

That is treating both - and > as delimited, independently.

So strings like 

xxx ->  aaa-bbb
xxx ->  aaa>bbb

Will mistakenly return "bbb".

mkeintz
PROC Star

The SCAN function can extract "words" from a multiword expression.  You have such an expression, with word dividers of '-' and or '>'.  Use a -1 to tell scan to find the rightmost word  (+1 would be the leftmost word).

 

data _null_;
  longtext='aaaaaaa->bbbb->->ccccc->dddddddd->eeee->fffff->gggggg->hhhhhhh';
  lastword=scan(longtext,-1,'->');
  put (_all_) (= /);
run; 
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

If you had a one byte delimiter, say just >, then you could use scan.

last_word = scan(string,-1,'>');

Note that if the string has no > then the result is the whole string.

 

If you need to use that three byte sequence, '-> ', then you will have to work harder.  You will need find the location of the last '-> ' and then use SUBSTRN() function.

data test;
  input string $80.;
  loc = find(string,'-> ',-vlength(string));
  if loc then want = substrn(string,loc+3);
cards;
xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx
xxx -> xxxxxxx -> xxxxxxxx
xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx
none

;

Results:

Obs    string                                                        loc    want

 1     xxxxxxxx -> xxxxxxxx -> xxxxxxxxxxxxxx -> xxxxxxxxxx           40    xxxxxxxxxx
 2     xxx -> xxxxxxx -> xxxxxxxx                                     16    xxxxxxxx
 3     xxxxxxx -> xxxxxxxx -> xxxxxxxxxxxx -> xxxxxxxxx -> xxxxxx     50    xxxxxx
 4     none                                                            0
 5                                                                     0

 

 

maguiremq
SAS Super FREQ
Good catch! Thanks @Tom. Slipped by me. @willisva please look at some other solutions and mark them as the correct answer.


RandoDando
Pyrite | Level 9
I just tried this and so far don't see a big difference in the output. However, given what you have said I may just go with this.
Tom
Super User Tom
Super User

As long as none of the characters after the last -> sequence contain > then just using scan() with > as the delimiter will work fine.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 3124 views
  • 3 likes
  • 5 in conversation