BookmarkSubscribeRSS Feed
Ap01
Calcite | Level 5

Hello, 

 

I’m trying to define a proc format with regular expressions where if the conditions specified aren’t met, then the output is the portion of the input string after the 2nd full stop. 

Example inputs:

Pineapple.colour.yellow

Pinenuts.size.small

Apple.fruit.red

Grape.shape.round

Orange.fresh.juice

Mango.raw.pickled

 

Proc Format;

Invalue $tester (notsorted) 
“s/(.*)(pine)(.*)/pine/i” (regexpe) = _same_
“s/(.*)(apple)(.*)/apple/i” (regexpe) = _same_
“s/(.*)(gra)(.*)/gra/i” (regexpe) = _same_
/*Other= <not sure what goes here>*/
;
Quit;

I want the outputs for the input strings above to be: 

pine

pine

apple

gra

juice

pickled


 If I just define a final regex condition like

“s/.*\..*\.(.*)/\1/“ (i.e. replace any string with 2 full stops with the portion after the 2nd full stop), then it takes a really long time to run when I use the format in a data step.
My dataset is huge with a lot of long strings.

 

Is there any way I can avoid doing this matching step and just use a regex in the ‘other’ step? 

Thanks! 

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hello @Ap01 and welcome to the SAS Support Communities!

 

Regular expressions are a powerful tool, but simpler functions are much faster, so better use them if they are sufficient.

 

In your example the FIND function (and SCAN for the "other" case) is sufficient:

data have;
input c $30.;
cards;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;

data want;
set have;
length d $8;
if find(c,'pine','i') then d='pine';
else if find(c,'apple','i') then d='apple';
else if find(c,'gra','i') then d='gra';
else d=scan(c,3,'.');
run;

Using a 6-million-observation input dataset created by stacking one million copies of the above HAVE dataset, the data step creating WANT was about nine times faster on my workstation than an equivalent step using your $TESTER. informat with the fourth regex added (and curly quotes replaced by straight quotes):

data want;
set have;
length d $8;
d=input(c,$tester.);
run;

 

Reading those 6 million observations from a text file was still 4 - 6 times faster with FIND and SCAN, e.g., applied to the _INFILE_ variable.

Oligolas
Barite | Level 11

Hi,

I think you could define your format like this:

DATA test;
input string: $50.;
datalines;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;
RUN;

PROC FORMAT;
   Invalue $tester (notsorted) 
   's/(.*?)(pine|apple)(.*)/$2/i' (regexpe) = _same_ /*non greedy to match pine first*/
   's/(.*)(gra)(.*)/$2/i' (regexpe) = _same_
   's/[^\.]*\.[^\.]*\.(.*)/$1/i' (regexpe) = _same_
   Other=_error_
   ;
run;

DATA test2;
   SET test;
   string2=input(string,$tester.);
RUN;
________________________

- Cheers -

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 495 views
  • 0 likes
  • 3 in conversation