
improving the efficiency of regex with a long list of patterns and observations

Regular Contributor
Posts: 204


In addition to having a long list of patterns (over 50) to check using regex, I need to check these patterns against more than 700,000 observations.

Does anyone have any advice for improving efficiency?

Here's the macro I'm using to accomplish this task:

%macro prx(pattern,serial);
  b=prxparse("&pattern");
  if prxmatch(b,serial_number)>0 then do;
    check=1;
    serial=&serial;
    if (length(serial) = length(serial_number)) then check=2;
  end;
%mend;

Thank you.

PROC Star
Posts: 2,318

Re: improving the efficiency of regex with a long list of patterns and observations


The first things that come to mind, without knowing more:

- Can you use functions like index() or similar? They are a lot cheaper to use than RegEx.

- Can you use else if to avoid searching once a pattern is matched?

 

This may be cheaper too:

if prxmatch("&pattern",serial_number)>0 then do;
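As a rough sketch of the two suggestions above (cheap string functions first, then ELSE IF so that testing stops at the first hit), something like the following could work. The data set name SERIALS, the output variable MATCHED_RULE, and all three patterns are made up for illustration; only SERIAL_NUMBER comes from the original post.

```sas
/* Hypothetical input: WORK.SERIALS with a character variable SERIAL_NUMBER. */
data checked;
  set serials;
  length matched_rule $8;
  /* Literal prefix: the =: operator compares only the leading characters. */
  if serial_number =: 'ABC' then matched_rule = 'rule1';
  /* Literal substring: FIND() avoids the regex engine entirely. */
  else if find(serial_number, '-X9-') > 0 then matched_rule = 'rule2';
  /* Genuine pattern: reached only when the cheaper tests above failed. */
  else if prxmatch('/^\d{4}[A-Z]{2}\d+\s*$/', serial_number) > 0 then matched_rule = 'rule3';
  else matched_rule = 'none';
run;
```

Note that with ELSE IF the first matching rule wins, whereas a series of independent IF blocks lets a later pattern overwrite an earlier result. If that difference is acceptable, putting the most frequently matching rules first should reduce the average number of tests per observation.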

Esteemed Advisor
Posts: 5,482

Re: improving the efficiency of regex with a long list of patterns and observations

Make sure your pattern uses the "o" suffix, as in "/abc[a-c]+/o", as it signals to the compiler that the pattern is a constant that only needs to be compiled once.

PG
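One way to make the compile-once behavior explicit, without relying on the o suffix, is to call PRXPARSE only on the first iteration of the DATA step and RETAIN the returned ids. This is a minimal sketch; the data set SERIALS and both patterns are hypothetical.

```sas
/* Hypothetical input: WORK.SERIALS with a character variable SERIAL_NUMBER. */
data checked;
  set serials;
  retain re1 re2;                        /* keep the regex ids across observations */
  if _n_ = 1 then do;
    re1 = prxparse('/^\d{4}[A-Z]{2}/');  /* compiled once, on the first observation */
    re2 = prxparse('/^[A-Z]{3}-\d+/');
  end;
  check = 0;
  if prxmatch(re1, serial_number) > 0 then check = 1;
  else if prxmatch(re2, serial_number) > 0 then check = 1;
  drop re1 re2;                          /* the ids are not useful in the output */
run;
```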
PROC Star
Posts: 2,318

Re: improving the efficiency of regex with a long list of patterns and observations


@PGStats 

My understanding was that recent versions of SAS (9.4?) apply the o behavior by default when the RegEx string is a constant.

I can't find a source though, so maybe I am mistaken.

 

Update: I did a quick test, and this runs the same with and without the o.

data _null_;
 do I=1 to 1e7; 
   R=prxmatch('/\d\w\d/o',cat(I));
 end;
run;
Respected Advisor
Posts: 4,679

Re: improving the efficiency of regex with a long list of patterns and observations

As others already wrote: certainly use ELSE, and use functions like find() or index() where possible.

If leading and trailing blanks are not important, then use STRIP() as well: prxmatch(<regex>,strip(<variable>))

And last but not least: tweak your RegEx, especially the ones applied to long strings, e.g. greedy vs. lazy quantifiers.
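To illustrate the greedy vs. lazy point and the STRIP() tip, here is a small sketch. The sample value and the three patterns are invented for the demonstration and are not taken from the original question.

```sas
data _null_;
  length text $60;
  text = 'SN<123>-mid-<456>';          /* made-up sample value */
  re_greedy  = prxparse('/<.*>/');     /* .* grabs as much as possible, then backtracks */
  re_lazy    = prxparse('/<.*?>/');    /* .*? stops at the first closing bracket */
  re_negated = prxparse('/<[^>]*>/');  /* negated class cannot run past the first '>' */
  call prxsubstr(re_greedy,  text, pos_g, len_g);
  call prxsubstr(re_lazy,    text, pos_l, len_l);
  call prxsubstr(re_negated, text, pos_n, len_n);
  /* Greedy spans '<123>-mid-<456>'; lazy and negated stop at '<123>'. */
  put pos_g= len_g= pos_l= len_l= pos_n= len_n=;

  /* STRIP() removes the trailing blanks that pad TEXT to 60 characters, */
  /* so the regex engine has less text to scan.                          */
  hit = prxmatch(re_lazy, strip(text));
  put hit=;
run;
```

On long values the greedy form first consumes the rest of the string, trailing blanks included, before backtracking, so a lazy quantifier or a negated character class is often the cheaper choice when the intended match is short.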
