In addition to having a long list of patterns (over 50) to check using regex, I need to check these patterns against more than 700,000 observations.
Does anyone have any advice for improving efficiency?
Here's the macro I'm using to accomplish this task:
%macro prx(pattern,serial);
  b=prxparse("&pattern");
  if prxmatch(b,serial_number)>0 then do;
    check=1;
    serial=&serial;
    if (length(serial) = length(serial_number)) then check=2;
  end;
%mend;
Thank you.
The first things that come to mind, without knowing more:
- Can you use functions like index() or similar? They are a lot cheaper than RegEx.
- Can you use else if to avoid searching further once a pattern is matched?
This may possibly be cheaper too:
if prxmatch("&pattern",serial_number)>0 then do;
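A minimal sketch of the two suggestions above, assuming some of the checks can be expressed as fixed substrings. The data set names (serials, checked), the sample values, the variable matched, and both patterns are made up for illustration and are not from the original post:

```sas
/* Hypothetical sample data; serial_number values are invented */
data serials;
  length serial_number $20;
  input serial_number $;
  datalines;
AB12345
XY-0001
ZZ99
;
run;

data checked;
  set serials;
  length matched $10;
  /* Cheap fixed-substring test first: find() needs no regex engine */
  if find(serial_number, 'XY-') = 1 then matched = 'prefix';
  /* ELSE IF: the regex only runs when the cheaper test did not match */
  else if prxmatch('/^[A-Z]{2}\d{5}$/', strip(serial_number)) > 0 then matched = 'regex';
  else matched = 'none';
run;
```

Putting the patterns that match most often at the top of the chain means most observations leave it after one or two tests.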
Make sure your pattern uses the "o" suffix, as in "/abc[a-c]+/o", as it signals to the compiler that the pattern is a constant that only needs to be compiled once.
My understanding was that recent versions of SAS (9.4?) apply the o suffix by default when the RegEx string is a constant.
I can't find a source though, so maybe I am mistaken.
Update: I did a quick test; this runs the same with and without the o.
data _null_; do I=1 to 1e7; R=prxmatch('/\d\w\d/o',cat(I)); end; run;
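Another common way to make "compile once" explicit, independent of the o suffix, is to call prxparse only on the first observation and retain the returned id. This is a sketch, not a measured improvement; the data set names, sample values, and pattern are hypothetical:

```sas
/* Hypothetical sample data; serial_number values are invented */
data serials;
  length serial_number $20;
  input serial_number $;
  datalines;
AB12345
ZZ99
;
run;

data checked;
  set serials;
  /* Keep the compiled pattern id across observations */
  retain re;
  /* Compile only once, on the first observation */
  if _n_ = 1 then re = prxparse('/^[A-Z]{2}\d{5}$/');
  /* check is 1 when the pattern matches, 0 otherwise */
  check = (prxmatch(re, strip(serial_number)) > 0);
  drop re;
run;
```

With 50+ patterns, the same idea extends to one retained id per pattern, so each row only pays for the prxmatch calls and not for repeated prxparse calls.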
As others already wrote: Certainly use ELSE and use functions like find() or index() where possible.
If leading and trailing blanks are not important then use STRIP() as well: prxmatch(<regex>,strip(<variable>))
And last but not least: tweak your RegEx, especially the ones applied to long strings, for example by choosing between greedy and lazy quantifiers.
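To illustrate the greedy vs. lazy point, here is a small sketch with a made-up string and made-up patterns. The greedy .* first runs to the end of the string and then backtracks, while the lazy .*? stops at the first possible end of the match:

```sas
data _null_;
  length s g_match l_match $60;
  s = 'SN<123>-<456>';               /* invented example value */
  greedy = prxparse('/<.*>/');       /* greedy: matches as much as possible */
  lazy   = prxparse('/<.*?>/');      /* lazy: matches as little as possible */

  /* prxsubstr returns the position and length of the match */
  call prxsubstr(greedy, s, pos, len);
  g_match = substr(s, pos, len);
  put 'greedy: ' g_match;            /* writes <123>-<456> to the log */

  call prxsubstr(lazy, s, pos, len);
  l_match = substr(s, pos, len);
  put 'lazy:   ' l_match;            /* writes <123> to the log */
run;
```

On long strings, a negated character class such as <[^>]*> is often a good alternative to either form, since it cannot run past the closing delimiter in the first place.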