Solved: Re: PROC SQL Nested CASE or DATA STEP DO LOOP

Tucky · Posted 10-08-2015 03:51 AM

Hi,

I am currently doing some address parsing/token creation using regular expressions.

I am looking for ways to improve my speed in the validation code. I think nesting the CASE statements or an effective DO loop could improve things.

Below is the PROC SQL and the alternate data step, however I think there might be faster ways. I have some attempting sub setting, however I need something tighter / dynamic or easy to edit.

Once a match is found, I would like to exclude those observations from the next PRXMATCH, and so on and so on until I can identify the matches by match type, and also the non matches. The parsing code post validation runs smoothly and quickly, so I am bottlenecked at this stage.

Any assistance would be greatly appreciated.

The SQL is like so -

PROC SQL;

CREATE TABLE ADDRESS_VALIDATE AS

SELECT DISTINCT

ADDRESS,

(CASE WHEN PRXMATCH ("/&AD_PATTERN_1/",ADDRESS)

THEN 'Y' ELSE '' END) AS PATTERN_A1,

(CASE WHEN (CALCULATED PATTERN_A1) IS MISSING AND

PRXMATCH ("/&AD_PATTERN_2/",ADDRESS)

THEN 'Y' ELSE '' END) AS PATTERN_A2

FROM ADDRESS

;quit;

I believe it runs slowly due to the continuous passes over the variable when we already have a match. I have done some ranking and the first PRXMATCH is the highest result.

I tried to use the data step

data ADDRESS_VALIDATE;

set ADDRESS;

ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");

ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");

MATCH = 'N';

if PRXMATCH (ADPATTERN_A1,ADDRESS) then do;

PATTERN_A1 = 'Y';

MATCH = 'Y';

end;

call prxfree (ADPATTERN_A1);

if MATCH = 'N' then do;

if PRXMATCH (ADPATTERN_A2,ADDRESS) then do;

PATTERN_A2 = 'Y';

MATCH = 'Y';

end;

call prxfree (ADPATTERN_A2);

DROP ADPATTERN_A1-ADPATTERN_A2;

run;

but not much difference in run time.

Patrick · Posted 10-08-2015 06:57 PM

@Tucky

You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.

Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.

data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1  ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';

Looking at your data step code you could also implement using IF...THEN...ELSE

data ADDRESS_VALIDATE;
  set ADDRESS;
  MATCH = 'N';
  if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
    do;
      PATTERN_A1 = 'Y';
      MATCH = 'Y';
    end;
  else
    do;
      if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
        do;
          PATTERN_A2 = 'Y';
          MATCH = 'Y';
        end;
    end;
run;

View solution in original post

LinusH · Posted 10-08-2015 04:01 AM

I think that your final line is the key:

"but not much difference in run time"

Usually, function operations dosn't require much system resources, even though I haven't played with regexp much. Sort/join are the ones usually in focus for optimization.

If you run your code without the regexp, do you seen any sigificant change in execution time? (check both CPU and real time).

Data never sleeps

Tucky · Posted 10-08-2015 04:12 AM

The intent or spirit of that line was it takes just as long, which I believe through more elegant or efficient coding we can speed things up.

From memory, without the Regex the Proc is quicker, I'll check at work tomorrow.
Thanks for your speedy reply Linus.

RW9 · Posted 10-08-2015 04:35 AM

Can you post some test data, in a datastep, and required output? It would be easier to run some scenarios.

Patrick · Posted 10-08-2015 06:57 PM

@Tucky

You're compiling the RegEx for every single iteration in your data step. That's why it's slower than the SQL.

Use code as below instead or have the RegEx directly inside the PRXMATCH() function - may be also adding the "o" switch to the RegEx definition. In doing so you also won't need the PRXFREE() calls anymore.

data ADDRESS_VALIDATE;
set ADDRESS;
retain ADPATTERN_A1  ADPATTERN_A2 ;
if _n_=1 then do;
ADPATTERN_A1 = prxparse("/&AD_PATTERN_1/");
ADPATTERN_A2 = prxparse("/&AD_PATTERN_2/");
end;
MATCH = 'N';

Looking at your data step code you could also implement using IF...THEN...ELSE

data ADDRESS_VALIDATE;
  set ADDRESS;
  MATCH = 'N';
  if PRXMATCH ("/&AD_PATTERN_1/o",ADDRESS) then
    do;
      PATTERN_A1 = 'Y';
      MATCH = 'Y';
    end;
  else
    do;
      if PRXMATCH ("/&AD_PATTERN_2/o",ADDRESS) then
        do;
          PATTERN_A2 = 'Y';
          MATCH = 'Y';
        end;
    end;
run;

Tucky · Posted 10-09-2015 12:14 AM

RW9 – Thanks for the reponse, I had a look around for a open source address list to use as data, hard to find, and due to the size (2 million +) hard to replace with a dummy set.

Patrick - Awesome job mate, thank you very much. When I converted the data step RegEx substing process from an old job I didn't realise that I had missed the compile. The inital SQL in testing perfromed well with one PRXMATCH.

I managed to solve the issue with the help of Patrick's answer, I needed to complie the RegEx first.

I have a requirment to use SQL in the first instance as much as possible, due to the popularity of the flavor amoung peers.

Brought the process from 20 minutes + (job stopped) to 7 mins in PROC SQL.

Data step version - I replicated that first, I had to stop that as well, was taking too long.

PROC SQL;

CREATE TABLE ADDRESS_VALIDATE AS SELECT DISTINCT

ADDRESS,

prxparse("/&PAT1") AS PAT1,

prxparse("/&PAT2") AS PAT2,

prxparse("/&PAT3/") AS PAT3,

prxparse("/&PAT4/") AS PAT4,

prxparse("/&PAT5/") AS PAT5,

prxparse("/&PAT6/") AS PAT6,

prxparse("/&PAT7/") AS PAT7,

prxparse("/&PAT8/") AS PAT8,

prxparse("/&PAT9/") AS PAT9,

prxparse("/&PAT10/") AS PAT10, /* ECT ECT */

(CASE WHEN PRXMATCH (CALCULATED PAT1,ADDRESS) THEN 'Y' ELSE '' END) AS P1,

(CASE WHEN (CALCULATED P1) IS MISSING AND

PRXMATCH (CALCULATED PAT2,ADDRESS) THEN 'Y' ELSE '' END) AS P2,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

PRXMATCH (CALCULATED PAT3,ADDRESS) THEN 'Y' ELSE '' END) AS P3,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

PRXMATCH (CALCULATED PAT4,ADDRESS) THEN 'Y' ELSE '' END) AS P4,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

PRXMATCH (CALCULATED PAT5,ADDRESS) THEN 'Y' ELSE '' END) AS P5,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

PRXMATCH (CALCULATED PAT6,ADDRESS) THEN 'Y' ELSE '' END) AS P6,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

PRXMATCH (CALCULATED PAT7,ADDRESS) THEN 'Y' ELSE '' END) AS P7,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

PRXMATCH (CALCULATED PAT8,ADDRESS) THEN 'Y' ELSE '' END) AS P8,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

(CALCULATED P8) IS MISSING AND

PRXMATCH (CALCULATED PAT9,ADDRESS) THEN 'Y' ELSE '' END) AS P9,

(CASE WHEN (CALCULATED P1) IS MISSING AND

(CALCULATED P2) IS MISSING AND

(CALCULATED P3) IS MISSING AND

(CALCULATED P4) IS MISSING AND

(CALCULATED P5) IS MISSING AND

(CALCULATED P6) IS MISSING AND

(CALCULATED P7) IS MISSING AND

(CALCULATED P8) IS MISSING AND

(CALCULATED P9) IS MISSING AND

PRXMATCH (CALCULATED PAT10,ADDRESS) THEN 'Y' ELSE '' END) AS P10

FROM ADDRESS;

;quit;

Patrick · Posted 10-09-2015 04:27 AM

@Tucky

I believe your interpretation of what I wrote is not really what I meant. I would code your SQL as follows:

options fullstimer;
data address(drop=_:);
  length address $26;
  do _rows=1 to 2*10**6;
    call missing(address);
    do _i=1 to 26;
      address=strip(address)||byte(ceil(ranuni(0)*26)+64);
    end;
    output;
  end;
run;

%let pat1=AB;
%let pat2=CD;
%let pat3=EFG;
%let pat4=X(Y|Z);

PROC SQL _method;
  CREATE TABLE ADDRESS_VALIDATE AS 
    SELECT DISTINCT
      ADDRESS,
      (CASE WHEN 
        PRXMATCH ("/&pat1/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P1,
      (CASE WHEN  (CALCULATED P1) IS MISSING AND
        PRXMATCH ("/&pat2/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P2,
      (CASE WHEN  (CALCULATED P2) IS MISSING AND
        PRXMATCH ("/&pat3/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P3,
      (CASE WHEN  (CALCULATED P3) IS MISSING AND
        PRXMATCH ("/&pat4/oi",ADDRESS) THEN 'Y' ELSE '' END) AS P4
    FROM ADDRESS
  ;
quit;

And just for the fun of it here a quite different approach using some hidden treasures.

https://support.sas.com/resources/papers/proceedings12/245-2012.pdf

proc format;
  invalue $p1_ other='P1';
  invalue $p2_ other='P2';
  invalue $p3_ other='P3';
  invalue $p4_ other='P4';
  invalue $addr_validate (default=2)
    '/AB/oi'     (regexp) = [$p1_.]
    '/CD/oi'     (regexp) = [$p2_.]
    '/EFG/oi'    (regexp) = [$p3_.]
    '/X(Y|Z)/oi' (regexp) = [$p4_.]
    other    = ' '
  ;
run;

PROC SQL _method;
  CREATE TABLE ADDRESS_VALIDATE AS 
    SELECT DISTINCT
      ADDRESS,
      input(address,$addr_validate.) as match_case length=2
    FROM ADDRESS
  ;
quit;

Tucky · Posted 10-15-2015 03:27 AM

Thank Patrick, using the LET shaved even more time off the process.

The PROC FORMAT method deserves more investigaiton, I do fancy the esoteric methods, sometimes they surprise you.

Peter_C · Posted 10-10-2015 10:12 AM

Patrick
Would it work with just the P1 to P4 constants on the right of the $address_validate. informat? ( rather than embedding a format in the label)

Patrick · Posted 10-10-2015 10:30 AM

@Peter_C

That's what I've tried first but I couldn't make it work. Happy for you/someone else to show me how this would need to be done.

Using regulare expressions in Proc Format is still not in the documentation. @PGStats complained about this which made me raise a Tech Support track asking if this is an "unsupported feature" or a "documentation defect". The answer was "documentation defect".

I'm waiting now for the next maintenance release and should this still not be in the documentation then I'll raise another Tech Support track - also adding a "feature request" for using just constants as labels (unless someone can show me that this already works now).

Registration is open

SAS Training: Just a Click Away