Hello,
I’m trying to define a proc format with regular expressions where if the conditions specified aren’t met, then the output is the portion of the input string after the 2nd full stop.
Example inputs:
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
Proc Format;
Invalue $tester (notsorted)
“s/(.*)(pine)(.*)/pine/i” (regexpe) = _same_
“s/(.*)(apple)(.*)/apple/i” (regexpe) = _same_
“s/(.*)(gra)(.*)/gra/i” (regexpe) = _same_
/*Other= <not sure what goes here>*/
;
Quit;
I want the outputs for the input strings above to be:
pine
pine
apple
gra
juice
pickled
If I just define a final regex condition like
“s/.*\..*\.(.*)/\1/“ (i.e. replace any string with 2 full stops with the portion after the 2nd full stop), then it takes a really long time to run when I use the format in a data step.
My dataset is huge with a lot of long strings.
Is there any way I can avoid doing this matching step and just use a regex in the ‘other’ step?
Thanks!
Hello @Ap01 and welcome to the SAS Support Communities!
Regular expressions are a powerful tool, but simpler functions are much faster, so better use them if they are sufficient.
In your example the FIND function (and SCAN for the "other" case) is sufficient:
data have;
input c $30.;
cards;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;
data want;
set have;
length d $8;
if find(c,'pine','i') then d='pine';
else if find(c,'apple','i') then d='apple';
else if find(c,'gra','i') then d='gra';
else d=scan(c,3,'.');
run;
Using a 6-million-observation input dataset created by stacking one million copies of the above HAVE dataset, the data step creating WANT was about nine times faster on my workstation than an equivalent step using your $TESTER. informat with the fourth regex added (and curly quotes replaced by straight quotes):
data want;
set have;
length d $8;
d=input(c,$tester.);
run;
Reading those 6 million observations from a text file was still 4 - 6 times faster with FIND and SCAN, e.g., applied to the _INFILE_ variable.
Hi,
I think you could define your format like this:
DATA test;
input string: $50.;
datalines;
Pineapple.colour.yellow
Pinenuts.size.small
Apple.fruit.red
Grape.shape.round
Orange.fresh.juice
Mango.raw.pickled
;
RUN;
PROC FORMAT;
Invalue $tester (notsorted)
's/(.*?)(pine|apple)(.*)/$2/i' (regexpe) = _same_ /*non greedy to match pine first*/
's/(.*)(gra)(.*)/$2/i' (regexpe) = _same_
's/[^\.]*\.[^\.]*\.(.*)/$1/i' (regexpe) = _same_
Other=_error_
;
run;
DATA test2;
SET test;
string2=input(string,$tester.);
RUN;
- Cheers -
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.