BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Tucky
Obsidian | Level 7

Hi,

 

I am currently doing some address parsing/token creation using regular expressions.

I am looking for ways to improve my speed in the validation code. I think nesting the CASE statements or an effective DO loop could improve things.

Below is the PROC SQL and the alternate data step, however I think there might be faster ways. I have some attempting sub setting, however I need something tighter / dynamic or easy to edit.

Once a match is found, I would like to exclude those observations from the next PRXMATCH, and so on and so on until I can identify the matches by match type, and also the non matches. The parsing code post validation runs smoothly and quickly, so I am bottlenecked at this stage.

Any assistance would be greatly appreciated.

 

The SQL is like so -

 

PROC SQL;

CREATE TABLE ADDRESS_VALIDATE AS

SELECT DISTINCT

ADDRESS,

(CASE WHEN PRXMATCH ("/&AD_PATTERN_1/",ADDRESS)

    THEN 'Y' ELSE '' END) AS PATTERN_A1,

 

(CASE WHEN (CALCULATED PATTERN_A1) IS MISSING AND

    PRXMATCH ("/&AD_PATTERN_2/",ADDRESS)

      THEN 'Y' ELSE '' END) AS PATTERN_A2

 

FROM ADDRESS

;quit;

 

I believe it runs slowly due to the continuous passes over the variable when we already have a match. I have done some ranking and the first PRXMATCH is the highest result.

 

I tried to use the data step

 

data ADDRESS_VALIDATE;

set ADDRESS;

ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");

ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");

MATCH = 'N';

 

if PRXMATCH (ADPATTERN_A1,ADDRESS) then do;

  PATTERN_A1 = 'Y';

  MATCH = 'Y';

end;

 

call prxfree (ADPATTERN_A1);

 

if MATCH = 'N' then do;

 if PRXMATCH (ADPATTERN_A2,ADDRESS) then do;

   PATTERN_A2 = 'Y';

   MATCH = 'Y';

 end;

end;

 

call prxfree (ADPATTERN_A2);

DROP ADPATTERN_A1-ADPATTERN_A2;

run;

 

 

but not much difference in run time.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Tucky

You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.

Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.

 

data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1  ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';

 

Looking at your data step code you could also implement using IF...THEN...ELSE

data ADDRESS_VALIDATE;
  set ADDRESS;
  MATCH = 'N';
  if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
    do;
      PATTERN_A1 = 'Y';
      MATCH = 'Y';
    end;
  else
    do;
      if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
        do;
          PATTERN_A2 = 'Y';
          MATCH = 'Y';
        end;
    end;
run;

 

 

View solution in original post

9 REPLIES 9
LinusH
Tourmaline | Level 20

I think that your final line is the key:

"but not much difference in run time"

Usually, function operations dosn't require much system resources, even though I haven't played with regexp much. Sort/join are the ones usually in focus for optimization.

If you run your code without the regexp, do you seen any sigificant change in execution time? (check both CPU and real time).

Data never sleeps
Tucky
Obsidian | Level 7
The intent or spirit of that line was it takes just as long, which I believe through more elegant or efficient coding we can speed things up.

From memory, without the Regex the Proc is quicker, I'll check at work tomorrow.
Thanks for your speedy reply Linus.
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Can you post some test data, in a datastep, and required output?  It would be easier to run some scenarios.

Patrick
Opal | Level 21

@Tucky

You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.

Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.

 

data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1  ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';

 

Looking at your data step code you could also implement using IF...THEN...ELSE

data ADDRESS_VALIDATE;
  set ADDRESS;
  MATCH = 'N';
  if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
    do;
      PATTERN_A1 = 'Y';
      MATCH = 'Y';
    end;
  else
    do;
      if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
        do;
          PATTERN_A2 = 'Y';
          MATCH = 'Y';
        end;
    end;
run;

 

 

Tucky
Obsidian | Level 7

RW9  – Thanks for the reponse, I had a look around for a open source address list to use as data, hard to find, and due to the size (2 million +) hard to replace with a dummy set.

 

Patrick - Awesome job mate, thank you very much. When I converted the data step RegEx substing process from an old job I didn't realise that I had missed the compile. The inital SQL in testing perfromed well with one PRXMATCH.

 

I managed to solve the issue with the help of Patrick's answer, I needed to complie the RegEx first.

I have a requirment to use SQL in the first instance as much as possible, due to the popularity of the flavor amoung peers.

 

Brought the process from 20 minutes + (job stopped) to 7 mins in PROC SQL.

Data step version - I replicated that first, I had to stop that as well, was taking too long.

 

 

PROC SQL;

CREATE TABLE ADDRESS_VALIDATE AS SELECT DISTINCT

ADDRESS,

prxparse("/&PAT1") AS PAT1,

prxparse("/&PAT2") AS PAT2,

prxparse("/&PAT3/") AS PAT3,

prxparse("/&PAT4/") AS PAT4,

prxparse("/&PAT5/") AS PAT5,

prxparse("/&PAT6/") AS PAT6,

prxparse("/&PAT7/") AS PAT7,

prxparse("/&PAT8/") AS PAT8,

prxparse("/&PAT9/") AS PAT9,

prxparse("/&PAT10/") AS PAT10, /* ECT ECT */

(CASE WHEN PRXMATCH (CALCULATED PAT1,ADDRESS) THEN 'Y' ELSE '' END) AS P1,

(CASE WHEN (CALCULATED P1) IS MISSING AND

PRXMATCH (CALCULATED PAT2,ADDRESS) THEN 'Y' ELSE '' END) AS P2,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

PRXMATCH (CALCULATED PAT3,ADDRESS) THEN 'Y' ELSE '' END) AS P3,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

PRXMATCH (CALCULATED PAT4,ADDRESS) THEN 'Y' ELSE '' END) AS P4,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

PRXMATCH (CALCULATED PAT5,ADDRESS) THEN 'Y' ELSE '' END) AS P5,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

PRXMATCH (CALCULATED PAT6,ADDRESS) THEN 'Y' ELSE '' END) AS P6,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

PRXMATCH (CALCULATED PAT7,ADDRESS) THEN 'Y' ELSE '' END) AS P7,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

PRXMATCH (CALCULATED PAT8,ADDRESS) THEN 'Y' ELSE '' END) AS P8,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

(CALCULATED P8) IS MISSING AND

PRXMATCH (CALCULATED PAT9,ADDRESS) THEN 'Y' ELSE '' END) AS P9,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

(CALCULATED P8) IS MISSING AND

(CALCULATED P9) IS MISSING AND

PRXMATCH (CALCULATED PAT10,ADDRESS) THEN 'Y' ELSE '' END) AS P10

FROM ADDRESS;

 

;quit;

Patrick
Opal | Level 21

@Tucky

I believe your interpretation of what I wrote is not really what I meant. I would code your SQL as follows:

options fullstimer;
data address(drop=_:);
  length address $26;
  do _rows=1 to 2*10**6;
    call missing(address);
    do _i=1 to 26;
      address=strip(address)||byte(ceil(ranuni(0)*26)+64);
    end;
    output;
  end;
run;

%let pat1=AB;
%let pat2=CD;
%let pat3=EFG;
%let pat4=X(Y|Z);

PROC SQL _method;
  CREATE TABLE ADDRESS_VALIDATE AS 
    SELECT DISTINCT
      ADDRESS,
      (CASE WHEN 
        PRXMATCH ("/&pat1/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P1,
      (CASE WHEN  (CALCULATED P1) IS MISSING AND
        PRXMATCH ("/&pat2/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P2,
      (CASE WHEN  (CALCULATED P2) IS MISSING AND
        PRXMATCH ("/&pat3/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P3,
      (CASE WHEN  (CALCULATED P3) IS MISSING AND
        PRXMATCH ("/&pat4/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P4
    FROM ADDRESS
  ;
quit;

 

 

And just for the fun of it here a quite different approach using some hidden treasures.

https://support.sas.com/resources/papers/proceedings12/245-2012.pdf 

proc format;
  invalue $p1_ other='P1';
  invalue $p2_ other='P2';
  invalue $p3_ other='P3';
  invalue $p4_ other='P4';
  invalue $addr_validate (default=2)
    '/AB/oi'     (regexp) = [$p1_.]
    '/CD/oi'     (regexp) = [$p2_.]
    '/EFG/oi'    (regexp) = [$p3_.]
    '/X(Y|Z)/oi' (regexp) = [$p4_.]
    other    = ' '
  ;
run;

PROC SQL _method;
  CREATE TABLE ADDRESS_VALIDATE AS 
    SELECT DISTINCT
      ADDRESS,
      input(address,$addr_validate.) as match_case length=2
    FROM ADDRESS
  ;
quit;

 

 

 

Tucky
Obsidian | Level 7

Thank Patrick, using the LET shaved even more time off the process.

 

The PROC FORMAT method deserves more investigaiton, I do fancy the esoteric methods, sometimes they surprise you.

 

 

 

 

 

Peter_C
Rhodochrosite | Level 12
Patrick
Would it work with just the P1 to P4 constants on the right of the $address_validate. informat? ( rather than embedding a format in the label)
Patrick
Opal | Level 21

@Peter_C

That's what I've tried first but I couldn't make it work. Happy for you/someone else to show me how this would need to be done.

 

Using regulare expressions in Proc Format is still not in the documentation. @PGStats complained about this which made me raise a Tech Support track asking if this is an "unsupported feature" or a "documentation defect". The answer was "documentation defect".

 

I'm waiting now for the next maintenance release and should this still not be in the documentation then I'll raise another Tech Support track - also adding a "feature request" for using just constants as labels (unless someone can show me that this already works now).

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 6930 views
  • 6 likes
  • 5 in conversation