In addition to having a long list of patterns (over 50) to check using regex, I need to check these patterns against more than 700,000 observations.
Does anyone have any advice for improving efficiency?
Here's the macro I'm using to accomplish this task:
%macro prx(pattern,serial);
  b=prxparse("&pattern");
  if prxmatch(b,serial_number)>0 then do;
    check=1;
    serial=&serial;
    if (length(serial) = length(serial_number)) then check=2;
  end;
%mend;
Thank you.
The first things that come to mind, without knowing more:
- Can you use functions like index() or similar? They are a lot cheaper than RegEx.
- Can you use else if to avoid searching further once a pattern is matched?
This may possibly be cheaper too:
if prxmatch("&pattern",serial_number)>0 then do;
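The two suggestions above could look roughly like the sketch below. It is only an illustration: the data set name WORK.SERIALS, the literal 'ABC' and the two regex patterns are made up, and only serial_number and check come from the original post.

```sas
/* Hypothetical input: WORK.SERIALS with a character variable SERIAL_NUMBER */
data checked;
  set serials;
  check = 0;
  /* Plain substring test: no regex engine involved */
  if index(serial_number, 'ABC') > 0 then check = 1;
  /* Each ELSE IF is evaluated only when every earlier test failed */
  else if prxmatch('/^\d{4}-[A-Z]{2}/', serial_number) > 0 then check = 1;
  else if prxmatch('/X\d+Z/', serial_number) > 0 then check = 1;
run;
```

Putting the cheapest tests and the most frequently matching patterns first means most observations leave the chain early.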
Make sure your pattern uses the "o" suffix, as in "/abc[a-c]+/o", as it signals to the compiler that the pattern is a constant that only needs to be compiled once.
My understanding was that SAS used the o suffix by default in recent (9.4 ?) versions of SAS if the RegEx string was a constant.
I can't find a source, though, so maybe I am mistaken.
Update: I did a quick test; this runs the same with and without the o.
data _null_;
  do I=1 to 1e7;
    R=prxmatch('/\d\w\d/o',cat(I));
  end;
run;
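A different way to make sure a pattern is compiled only once, whatever the o suffix does, is to call prxparse on the first iteration and retain the returned ID. This is a minimal sketch, not something taken from the posts above; WORK.SERIALS is a hypothetical data set and the pattern is the one from the test.

```sas
/* Hypothetical input: WORK.SERIALS with a character variable SERIAL_NUMBER */
data checked;
  /* Keep the compiled pattern ID across observations */
  retain re;
  /* Compile once, on the first iteration of the data step */
  if _n_ = 1 then re = prxparse('/\d\w\d/');
  set serials;
  if prxmatch(re, serial_number) > 0 then check = 1;
  else check = 0;
  drop re;
run;
```

With 50 patterns this would need one retained ID per pattern, so it is worth timing against the constant-pattern form before adopting it.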
As others already wrote: Certainly use ELSE and use functions like find() or index() where possible.
If leading and trailing blanks are not important then use STRIP() as well: prxmatch(<regex>,strip(<variable>))
And last but not least: tweak your RegEx, especially the ones applied to long strings, for example by choosing between greedy and lazy quantifiers.
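To illustrate the greedy vs. lazy point, here is a small sketch with a made-up string and made-up patterns. The greedy form scans to the end of the string and backtracks to the last '>', while the lazy form stops at the first '>'.

```sas
data _null_;
  /* Illustrative value only, not from the original data */
  text = 'id=<A1> other=<B2>';
  greedy_re = prxparse('/<.*>/');   /* .* takes as much as it can  */
  lazy_re   = prxparse('/<.*?>/');  /* .*? takes as little as it can */
  /* CALL PRXSUBSTR returns the start position and length of the match */
  call prxsubstr(greedy_re, text, gpos, glen);
  call prxsubstr(lazy_re, text, lpos, llen);
  put gpos= glen= lpos= llen=;
run;
```

Both matches start at the first '<', but the greedy match is longer. On long strings the extra scanning and backtracking can add up over 700,000 observations.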