05-09-2018 04:04 PM
In addition to having a long list of patterns (over 50) to check using regex, I need to check these patterns against more than 700,000 observations.
Does anyone have any advice for improving efficiency?
Here's the macro I'm using to accomplish this task:
if prxmatch(b,serial_number)>0 then do;
if (length(serial) = length(serial_number)) then check=2;
05-09-2018 06:16 PM - edited 05-09-2018 06:17 PM
The first things that come to mind, without knowing more:
- Can you use functions like index() or similar? They are a lot cheaper than RegEx.
- Can you use else if to avoid searching once a pattern is matched?
This may be cheaper too:
if prxmatch("&pattern",serial_number)>0 then do;
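To illustrate the two suggestions above, here is a minimal sketch. The dataset SERIALS, the literal 'ABC', the second pattern, and the CHECK codes are made up for the example; only SERIAL_NUMBER comes from the original post.

```sas
/* Hypothetical sample data standing in for the real 700,000-row table. */
data serials;
  length serial_number $20;
  input serial_number $;
  datalines;
ABC12345
123-XY-9
zzz
;
run;

data checked;
  set serials;
  /* Cheap literal test first: FIND returns the match position, or 0 if absent. */
  if find(serial_number, 'ABC') > 0 then check = 1;
  /* Evaluated only when the test above failed, so matched rows skip the regex. */
  else if prxmatch('/^\d{3}-[A-Z]{2}/', serial_number) > 0 then check = 2;
  else check = 0;
run;
```

Putting the most frequently matched and the cheapest tests first in the chain should reduce the average number of checks per observation.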
05-09-2018 10:59 PM
Make sure your pattern uses the "o" suffix, as in "/abc[a-c]+/o", as it signals to the compiler that the pattern is a constant that only needs to be compiled once.
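A related idiom, shown here only as a sketch, is to compile the pattern once with PRXPARSE and reuse the returned identifier for every row. The dataset SERIALS and the variables RX and HIT are assumptions for the example; the pattern is the one quoted above.

```sas
/* Hypothetical sample data. */
data serials;
  length serial_number $20;
  input serial_number $;
  datalines;
abcabc
xyz
;
run;

data matched;
  set serials;
  retain rx;                                   /* keep the pattern id across rows */
  if _n_ = 1 then rx = prxparse('/abc[a-c]+/o'); /* compile on the first row only */
  hit = (prxmatch(rx, serial_number) > 0);     /* 1 if the pattern is found, else 0 */
  drop rx;
run;
```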
05-10-2018 01:51 AM - edited 05-10-2018 01:59 AM
My understanding was that recent (9.4?) versions of SAS behave as if the o suffix were present whenever the RegEx string is a constant.
I can't find a source though, so maybe I am mistaken.
Update: I did a quick test, and this runs the same with and without the o.
data _null_; do I=1 to 1e7; R=prxmatch('/\d\w\d/o',cat(I)); end; run;
05-10-2018 12:13 AM
As others already wrote: Certainly use ELSE and use functions like find() or index() where possible.
If leading and trailing blanks are not important then use STRIP() as well: prxmatch(<regex>,strip(<variable>))
And last but not least: Tweak your RegEx, especially the ones applied to long strings, e.g. greedy vs. lazy quantifiers.
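As a small sketch of the greedy vs. lazy point combined with STRIP(), the step below runs both forms against the same value. The sample string and all variable names are invented for the illustration.

```sas
data _null_;
  length s $40;
  s = 'SN<123>X<456>';               /* stored padded with trailing blanks to length 40 */
  greedy = prxparse('/<.*>/');       /* .* takes as much as possible, then backtracks */
  lazy   = prxparse('/<.*?>/');      /* .*? stops at the first possible closing > */
  /* STRIP removes the padding so the regex engine scans a shorter string. */
  call prxsubstr(greedy, strip(s), g_pos, g_len);
  call prxsubstr(lazy,   strip(s), l_pos, l_len);
  put g_pos= g_len= l_pos= l_len=;   /* writes the match positions and lengths to the log */
run;
```

The greedy pattern matches from the first < to the last >, while the lazy one stops at the first >. Which form is faster depends on the data, so it is worth timing both on a sample of the real values.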