topic Re: improving the efficiency of regex with a long list of patterns and observations in SAS Programming

improving the efficiency of regex with a long list of patterns and observations

gzr2mz39 — Wed, 09 May 2018 20:04:20 GMT

In addition to having a long list of patterns (over 50) to check using regex, I need to check these patterns against more than 700,000 observations.

Does anyone have any advice for improving efficiency?

Here's the macro I'm using to accomplish this task:

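%macro prx(pattern,serial);
  /* Compile the pattern and test it against the current observation */
  b=prxparse("&pattern");
  if prxmatch(b,serial_number)>0 then do;
    check=1;
    serial=&serial;
    /* check=2 flags a match whose serial has the same length as serial_number */
    if (length(serial) = length(serial_number)) then check=2;
  end;
%mend;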

Thank you.

Re: improving the efficiency of regex with a long list of patterns and observations

ChrisNZ — Wed, 09 May 2018 22:17:32 GMT

The first things that come to mind, without knowing more:

- Can you use functions like index() or similar? They are a lot cheaper than RegEx.

- Can you use else if to avoid searching once a pattern is matched?

This may possibly be cheaper too:

if prxmatch("&pattern",serial_number)>0 then do;
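To illustrate both suggestions together, here is a minimal sketch. The data set names (have, want), the sample serial numbers and the three patterns are made up for illustration; they are not the original poster's. Literal tests use the =: (begins with) operator or index(), and the ELSE IF chain stops testing as soon as one branch matches.

```sas
/* Hypothetical sample data, for illustration only */
data have;
  length serial_number $20;
  input serial_number $;
  datalines;
AB12345
XY-0042
ZZ999
;
run;

data want;
  set have;
  length serial $20;
  check = 0;
  /* Literal prefix: the =: operator avoids regex entirely */
  if serial_number =: 'AB' then do;
    check = 1;
    serial = 'AB';
  end;
  /* Literal substring: index() is cheaper than a regex search */
  else if index(serial_number, 'XY-') > 0 then do;
    check = 1;
    serial = 'XY';
  end;
  /* Only a real pattern needs prxmatch; it runs only if the tests above failed */
  else if prxmatch('/^ZZ\d{3}\s*$/', serial_number) > 0 then do;
    check = 1;
    serial = 'ZZ';
  end;
run;
```

Ordering the branches so that the most frequent matches come first should reduce the average number of tests per observation, though that depends on the data.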

Re: improving the efficiency of regex with a long list of patterns and observations

PGStats — Thu, 10 May 2018 02:59:04 GMT

Make sure your pattern uses the "o" suffix, as in "/abc[a-c]+/o", as it signals to the compiler that the pattern is a constant that only needs to be compiled once.
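If you would rather not rely on the suffix, another common idiom is to compile the pattern explicitly on the first iteration and keep the pattern ID with RETAIN. A minimal sketch, assuming a hypothetical data set have with a character variable serial_number (both invented here):

```sas
/* Hypothetical sample data, for illustration only */
data have;
  length serial_number $20;
  input serial_number $;
  datalines;
xxabcab
nomatch
;
run;

data want;
  set have;
  retain re;                      /* keep the pattern ID across observations */
  if _n_ = 1 then re = prxparse('/abc[a-c]+/');   /* compile once */
  matched = (prxmatch(re, serial_number) > 0);    /* 1 if found, else 0 */
  drop re;
run;
```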

Re: improving the efficiency of regex with a long list of patterns and observations

Patrick — Thu, 10 May 2018 04:13:21 GMT

As others already wrote: Certainly use ELSE and use functions like find() or index() where possible.

If leading and trailing blanks are not important then use STRIP() as well: prxmatch(<regex>,strip(<variable>))

And last but not least: tweak your RegEx, especially the ones applied to long strings, i.e. greedy vs. lazy quantifiers.
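A small sketch of the last two points, using a made-up value and made-up patterns. strip() removes the padding blanks before the engine sees the string, and the lazy quantifier .+? stops at the first closing bracket instead of running to the end of the string and backtracking. Whether either change is measurable depends on the patterns and the string lengths, so treat this as something to test rather than a guaranteed gain.

```sas
data _null_;
  length serial_number $40;
  serial_number = 'SN<A1>-<B2>';          /* hypothetical value, padded to 40 */
  /* Greedy: .+ first consumes the rest of the string, then backtracks */
  greedy = prxmatch('/<.+>/', strip(serial_number));
  /* Lazy: .+? expands one character at a time until the first > */
  lazy   = prxmatch('/<.+?>/', strip(serial_number));
  put greedy= lazy=;
run;
```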

Re: improving the efficiency of regex with a long list of patterns and observations

ChrisNZ — Thu, 10 May 2018 05:59:10 GMT

@PGStats

My understanding was that SAS used the o suffix by default in recent (9.4 ?) versions of SAS if the RegEx string was a constant.

I can't find a source though, so maybe I am mistaken.

Update: I did a quick test, and this runs in the same time with and without the o.

data _null_;
 do I=1 to 1e7; 
   R=prxmatch('/\d\w\d/o',cat(I));
 end;
run;
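To compare the two variants in a single run, the same loop can be timed with datetime(). This is only a rough sketch: the variable names are mine, the iteration count is reduced, and wall-clock timings will vary with machine load.

```sas
data _null_;
  t0 = datetime();
  do i = 1 to 1e6;
    r = prxmatch('/\d\w\d/o', cat(i));   /* with the o suffix */
  end;
  t1 = datetime();
  do i = 1 to 1e6;
    r = prxmatch('/\d\w\d/', cat(i));    /* without the o suffix */
  end;
  t2 = datetime();
  with_o    = t1 - t0;                   /* elapsed seconds, first loop */
  without_o = t2 - t1;                   /* elapsed seconds, second loop */
  put with_o= without_o=;
run;
```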