Hi,
I am currently doing some address parsing/token creation using regular expressions.
I am looking for ways to improve my speed in the validation code. I think nesting the CASE statements or an effective DO loop could improve things.
Below is the PROC SQL and the alternate data step, however I think there might be faster ways. I have some attempting sub setting, however I need something tighter / dynamic or easy to edit.
Once a match is found, I would like to exclude those observations from the next PRXMATCH, and so on and so on until I can identify the matches by match type, and also the non matches. The parsing code post validation runs smoothly and quickly, so I am bottlenecked at this stage.
Any assistance would be greatly appreciated.
The SQL is like so -
PROC SQL;
CREATE TABLE ADDRESS_VALIDATE AS
SELECT DISTINCT
ADDRESS,
(CASE WHEN PRXMATCH ("/&AD_PATTERN_1/",ADDRESS)
THEN 'Y' ELSE '' END) AS PATTERN_A1,
(CASE WHEN (CALCULATED PATTERN_A1) IS MISSING AND
PRXMATCH ("/&AD_PATTERN_2/",ADDRESS)
THEN 'Y' ELSE '' END) AS PATTERN_A2
FROM ADDRESS
;quit;
I believe it runs slowly due to the continuous passes over the variable when we already have a match. I have done some ranking and the first PRXMATCH is the highest result.
I tried to use the data step
data ADDRESS_VALIDATE;
set ADDRESS;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
MATCH = 'N';
if PRXMATCH (ADPATTERN_A1,ADDRESS) then do;
PATTERN_A1 = 'Y';
MATCH = 'Y';
end;
call prxfree (ADPATTERN_A1);
if MATCH = 'N' then do;
if PRXMATCH (ADPATTERN_A2,ADDRESS) then do;
PATTERN_A2 = 'Y';
MATCH = 'Y';
end;
end;
call prxfree (ADPATTERN_A2);
DROP ADPATTERN_A1-ADPATTERN_A2;
run;
but not much difference in run time.
You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.
Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.
data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1 ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';
Looking at your data step code you could also implement using IF...THEN...ELSE
data ADDRESS_VALIDATE;
set ADDRESS;
MATCH = 'N';
if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
do;
PATTERN_A1 = 'Y';
MATCH = 'Y';
end;
else
do;
if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
do;
PATTERN_A2 = 'Y';
MATCH = 'Y';
end;
end;
run;
I think that your final line is the key:
"but not much difference in run time"
Usually, function operations dosn't require much system resources, even though I haven't played with regexp much. Sort/join are the ones usually in focus for optimization.
If you run your code without the regexp, do you seen any sigificant change in execution time? (check both CPU and real time).
Can you post some test data, in a datastep, and required output? It would be easier to run some scenarios.
You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.
Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.
data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1 ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';
Looking at your data step code you could also implement using IF...THEN...ELSE
data ADDRESS_VALIDATE;
set ADDRESS;
MATCH = 'N';
if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
do;
PATTERN_A1 = 'Y';
MATCH = 'Y';
end;
else
do;
if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
do;
PATTERN_A2 = 'Y';
MATCH = 'Y';
end;
end;
run;
RW9 – Thanks for the reponse, I had a look around for a open source address list to use as data, hard to find, and due to the size (2 million +) hard to replace with a dummy set.
Patrick - Awesome job mate, thank you very much. When I converted the data step RegEx substing process from an old job I didn't realise that I had missed the compile. The inital SQL in testing perfromed well with one PRXMATCH.
I managed to solve the issue with the help of Patrick's answer, I needed to complie the RegEx first.
I have a requirment to use SQL in the first instance as much as possible, due to the popularity of the flavor amoung peers.
Brought the process from 20 minutes + (job stopped) to 7 mins in PROC SQL.
Data step version - I replicated that first, I had to stop that as well, was taking too long.
PROC SQL;
CREATE TABLE ADDRESS_VALIDATE AS SELECT DISTINCT
ADDRESS,
prxparse("/&PAT1") AS PAT1,
prxparse("/&PAT2") AS PAT2,
prxparse("/&PAT3/") AS PAT3,
prxparse("/&PAT4/") AS PAT4,
prxparse("/&PAT5/") AS PAT5,
prxparse("/&PAT6/") AS PAT6,
prxparse("/&PAT7/") AS PAT7,
prxparse("/&PAT8/") AS PAT8,
prxparse("/&PAT9/") AS PAT9,
prxparse("/&PAT10/") AS PAT10, /* ECT ECT */
(CASE WHEN PRXMATCH (CALCULATED PAT1,ADDRESS) THEN 'Y' ELSE '' END) AS P1,
(CASE WHEN (CALCULATED P1) IS MISSING AND
PRXMATCH (CALCULATED PAT2,ADDRESS) THEN 'Y' ELSE '' END) AS P2,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
PRXMATCH (CALCULATED PAT3,ADDRESS) THEN 'Y' ELSE '' END) AS P3,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
PRXMATCH (CALCULATED PAT4,ADDRESS) THEN 'Y' ELSE '' END) AS P4,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
PRXMATCH (CALCULATED PAT5,ADDRESS) THEN 'Y' ELSE '' END) AS P5,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
(CALCULATED P5) IS MISSING AND
PRXMATCH (CALCULATED PAT6,ADDRESS) THEN 'Y' ELSE '' END) AS P6,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
(CALCULATED P5) IS MISSING AND
(CALCULATED P6) IS MISSING AND
PRXMATCH (CALCULATED PAT7,ADDRESS) THEN 'Y' ELSE '' END) AS P7,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
(CALCULATED P5) IS MISSING AND
(CALCULATED P6) IS MISSING AND
(CALCULATED P7) IS MISSING AND
PRXMATCH (CALCULATED PAT8,ADDRESS) THEN 'Y' ELSE '' END) AS P8,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
(CALCULATED P5) IS MISSING AND
(CALCULATED P6) IS MISSING AND
(CALCULATED P7) IS MISSING AND
(CALCULATED P8) IS MISSING AND
PRXMATCH (CALCULATED PAT9,ADDRESS) THEN 'Y' ELSE '' END) AS P9,
(CASE WHEN (CALCULATED P1) IS MISSING AND
(CALCULATED P2) IS MISSING AND
(CALCULATED P3) IS MISSING AND
(CALCULATED P4) IS MISSING AND
(CALCULATED P5) IS MISSING AND
(CALCULATED P6) IS MISSING AND
(CALCULATED P7) IS MISSING AND
(CALCULATED P8) IS MISSING AND
(CALCULATED P9) IS MISSING AND
PRXMATCH (CALCULATED PAT10,ADDRESS) THEN 'Y' ELSE '' END) AS P10
FROM ADDRESS;
;quit;
I believe your interpretation of what I wrote is not really what I meant. I would code your SQL as follows:
options fullstimer;
data address(drop=_:);
length address $26;
do _rows=1 to 2*10**6;
call missing(address);
do _i=1 to 26;
address=strip(address)||byte(ceil(ranuni(0)*26)+64);
end;
output;
end;
run;
%let pat1=AB;
%let pat2=CD;
%let pat3=EFG;
%let pat4=X(Y|Z);
PROC SQL _method;
CREATE TABLE ADDRESS_VALIDATE AS
SELECT DISTINCT
ADDRESS,
(CASE WHEN
PRXMATCH ("/&pat1/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P1,
(CASE WHEN (CALCULATED P1) IS MISSING AND
PRXMATCH ("/&pat2/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P2,
(CASE WHEN (CALCULATED P2) IS MISSING AND
PRXMATCH ("/&pat3/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P3,
(CASE WHEN (CALCULATED P3) IS MISSING AND
PRXMATCH ("/&pat4/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P4
FROM ADDRESS
;
quit;
And just for the fun of it here a quite different approach using some hidden treasures.
https://support.sas.com/resources/papers/proceedings12/245-2012.pdf
proc format;
invalue $p1_ other='P1';
invalue $p2_ other='P2';
invalue $p3_ other='P3';
invalue $p4_ other='P4';
invalue $addr_validate (default=2)
'/AB/oi' (regexp) = [$p1_.]
'/CD/oi' (regexp) = [$p2_.]
'/EFG/oi' (regexp) = [$p3_.]
'/X(Y|Z)/oi' (regexp) = [$p4_.]
other = ' '
;
run;
PROC SQL _method;
CREATE TABLE ADDRESS_VALIDATE AS
SELECT DISTINCT
ADDRESS,
input(address,$addr_validate.) as match_case length=2
FROM ADDRESS
;
quit;
Thank Patrick, using the LET shaved even more time off the process.
The PROC FORMAT method deserves more investigaiton, I do fancy the esoteric methods, sometimes they surprise you.
That's what I've tried first but I couldn't make it work. Happy for you/someone else to show me how this would need to be done.
Using regulare expressions in Proc Format is still not in the documentation. @PGStats complained about this which made me raise a Tech Support track asking if this is an "unsupported feature" or a "documentation defect". The answer was "documentation defect".
I'm waiting now for the next maintenance release and should this still not be in the documentation then I'll raise another Tech Support track - also adding a "feature request" for using just constants as labels (unless someone can show me that this already works now).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.