I’m trying to define a proc format with regular expressions where if the conditions specified aren’t met, then the output is the portion of the input string after the 2nd full stop.
Proc Format; Invalue $tester (notsorted) “s/(.*)(pine)(.*)/pine/i” (regexpe) = _same_ “s/(.*)(apple)(.*)/apple/i” (regexpe) = _same_ “s/(.*)(gra)(.*)/gra/i” (regexpe) = _same_ /*Other= <not sure what goes here>*/ ; Quit;
I want the outputs for the input strings above to be:
If I just define a final regex condition like
“s/.*\..*\.(.*)/\1/“ (i.e. replace any string with 2 full stops with the portion after the 2nd full stop), then it takes a really long time to run when I use the format in a data step.
My dataset is huge with a lot of long strings.
Is there any way I can avoid doing this matching step and just use a regex in the ‘other’ step?
Hello @Ap01 and welcome to the SAS Support Communities!
Regular expressions are a powerful tool, but simpler functions are much faster, so better use them if they are sufficient.
data have; input c $30.; cards; Pineapple.colour.yellow Pinenuts.size.small Apple.fruit.red Grape.shape.round Orange.fresh.juice Mango.raw.pickled ; data want; set have; length d $8; if find(c,'pine','i') then d='pine'; else if find(c,'apple','i') then d='apple'; else if find(c,'gra','i') then d='gra'; else d=scan(c,3,'.'); run;
Using a 6-million-observation input dataset created by stacking one million copies of the above HAVE dataset, the data step creating WANT was about nine times faster on my workstation than an equivalent step using your $TESTER. informat with the fourth regex added (and curly quotes replaced by straight quotes):
data want; set have; length d $8; d=input(c,$tester.); run;
Reading those 6 million observations from a text file was still 4 - 6 times faster with FIND and SCAN, e.g., applied to the _INFILE_ variable.
I think you could define your format like this:
DATA test; input string: $50.; datalines; Pineapple.colour.yellow Pinenuts.size.small Apple.fruit.red Grape.shape.round Orange.fresh.juice Mango.raw.pickled ; RUN; PROC FORMAT; Invalue $tester (notsorted) 's/(.*?)(pine|apple)(.*)/$2/i' (regexpe) = _same_ /*non greedy to match pine first*/ 's/(.*)(gra)(.*)/$2/i' (regexpe) = _same_ 's/[^\.]*\.[^\.]*\.(.*)/$1/i' (regexpe) = _same_ Other=_error_ ; run; DATA test2; SET test; string2=input(string,$tester.); RUN;
- Cheers -
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.