Re: Proc format regex output if input doesnt match above conditions

Ap01 · Posted 05-07-2023 06:28 AM

Hello,

I’m trying to define a proc format with regular expressions where if the conditions specified aren’t met, then the output is the portion of the input string after the 2nd full stop.

Example inputs:

Pineapple.colour.yellow

Pinenuts.size.small

Apple.fruit.red

Grape.shape.round

Orange.fresh.juice

Mango.raw.pickled

Proc Format;

Invalue $tester (notsorted) 
“s/(.*)(pine)(.*)/pine/i” (regexpe) = _same_
“s/(.*)(apple)(.*)/apple/i” (regexpe) = _same_
“s/(.*)(gra)(.*)/gra/i” (regexpe) = _same_
/*Other= <not sure what goes here>*/
;
Quit;

I want the outputs for the input strings above to be:

pine

apple

gra

juice

pickled

If I just define a final regex condition like

“s/.*\..*\.(.*)/\1/“ (i.e. replace any string with 2 full stops with the portion after the 2nd full stop), then it takes a really long time to run when I use the format in a data step.
My dataset is huge with a lot of long strings.

Is there any way I can avoid doing this matching step and just use a regex in the ‘other’ step?

Thanks!

FreelanceReinh · Posted 05-07-2023 07:29 AM

Hello @Ap01 and welcome to the SAS Support Communities!

Regular expressions are a powerful tool, but simpler functions are much faster, so better use them if they are sufficient.

In your example the FIND function (and SCAN for the "other" case) is sufficient:

data have;
input c $30.;
cards;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;

data want;
set have;
length d $8;
if find(c,'pine','i') then d='pine';
else if find(c,'apple','i') then d='apple';
else if find(c,'gra','i') then d='gra';
else d=scan(c,3,'.');
run;

Using a 6-million-observation input dataset created by stacking one million copies of the above HAVE dataset, the data step creating WANT was about nine times faster on my workstation than an equivalent step using your $TESTER. informat with the fourth regex added (and curly quotes replaced by straight quotes):

data want;
set have;
length d $8;
d=input(c,$tester.);
run;

Reading those 6 million observations from a text file was still 4 - 6 times faster with FIND and SCAN, e.g., applied to the _INFILE_ variable.

Oligolas · Posted 05-10-2023 05:14 AM

Hi,

I think you could define your format like this:

DATA test;
input string: $50.;
datalines;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;
RUN;

PROC FORMAT;
   Invalue $tester (notsorted) 
   's/(.*?)(pine|apple)(.*)/$2/i' (regexpe) = _same_ /*non greedy to match pine first*/
   's/(.*)(gra)(.*)/$2/i' (regexpe) = _same_
   's/[^\.]*\.[^\.]*\.(.*)/$1/i' (regexpe) = _same_
   Other=_error_
   ;
run;

DATA test2;
   SET test;
   string2=input(string,$tester.);
RUN;

________________________
- Cheers -

Proc format regex output if input doesnt match above conditions

Re: Proc format regex output if input doesnt match above conditions

Re: Proc format regex output if input doesnt match above conditions

Click image to register for webinar

Classroom Training Available!