BookmarkSubscribeRSS Feed
Ap01
Calcite | Level 5

Hello, 

 

I’m trying to define a proc format with regular expressions where if the conditions specified aren’t met, then the output is the portion of the input string after the 2nd full stop. 

Example inputs:

Pineapple.colour.yellow

Pinenuts.size.small

Apple.fruit.red

Grape.shape.round

Orange.fresh.juice

Mango.raw.pickled

 

Proc Format;

Invalue $tester (notsorted) 
“s/(.*)(pine)(.*)/pine/i” (regexpe) = _same_
“s/(.*)(apple)(.*)/apple/i” (regexpe) = _same_
“s/(.*)(gra)(.*)/gra/i” (regexpe) = _same_
/*Other= <not sure what goes here>*/
;
Quit;

I want the outputs for the input strings above to be: 

pine

pine

apple

gra

juice

pickled


 If I just define a final regex condition like

“s/.*\..*\.(.*)/\1/“ (i.e. replace any string with 2 full stops with the portion after the 2nd full stop), then it takes a really long time to run when I use the format in a data step.
My dataset is huge with a lot of long strings.

 

Is there any way I can avoid doing this matching step and just use a regex in the ‘other’ step? 

Thanks! 

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hello @Ap01 and welcome to the SAS Support Communities!

 

Regular expressions are a powerful tool, but simpler functions are much faster, so better use them if they are sufficient.

 

In your example the FIND function (and SCAN for the "other" case) is sufficient:

data have;
input c $30.;
cards;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;

data want;
set have;
length d $8;
if find(c,'pine','i') then d='pine';
else if find(c,'apple','i') then d='apple';
else if find(c,'gra','i') then d='gra';
else d=scan(c,3,'.');
run;

Using a 6-million-observation input dataset created by stacking one million copies of the above HAVE dataset, the data step creating WANT was about nine times faster on my workstation than an equivalent step using your $TESTER. informat with the fourth regex added (and curly quotes replaced by straight quotes):

data want;
set have;
length d $8;
d=input(c,$tester.);
run;

 

Reading those 6 million observations from a text file was still 4 - 6 times faster with FIND and SCAN, e.g., applied to the _INFILE_ variable.

Oligolas
Barite | Level 11

Hi,

I think you could define your format like this:

DATA test;
input string: $50.;
datalines;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;
RUN;

PROC FORMAT;
   Invalue $tester (notsorted) 
   's/(.*?)(pine|apple)(.*)/$2/i' (regexpe) = _same_ /*non greedy to match pine first*/
   's/(.*)(gra)(.*)/$2/i' (regexpe) = _same_
   's/[^\.]*\.[^\.]*\.(.*)/$1/i' (regexpe) = _same_
   Other=_error_
   ;
run;

DATA test2;
   SET test;
   string2=input(string,$tester.);
RUN;
________________________

- Cheers -

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 881 views
  • 0 likes
  • 3 in conversation