Solved: Re: Processing speed regular expressions in FCMP

gabonzo · Posted 05-26-2020 09:35 AM

Hi,

I have a set of 3 regular expressions that I must apply to 2 different text variables. To be able to reuse the code, I decided to store the regular expressions and perform the matching inside an FCMP block:

proc fcmp outlib=myfuncz.turnout43ge.pdprx;
	subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
		outargs pd_num, pd_num_sfx, split_sfx;
		re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
		re2 = prxparse("/^(\d+)-(\d)$/");
		re3 = prxparse("/^(\d+)([A-Z]+)$/");
		pd_num=.;
		pd_num_sfx="";
		split_sfx="";
		if prxmatch(re, ps_code) then do;
			wc = prxparen(re);
			select (wc);
			when (1) pd_num=input(ps_code, best32.);
			when (2) 
				do;
					dummy = prxmatch(re2, ps_code);
					call prxposn(re2, 1, pos, len);
					pd_num=input(substr(ps_code, pos, len), best32.);
					call prxposn(re2, 2, pos, len);
					pd_num_sfx=substr(ps_code, pos, len);
				end;
			when (3) 
				do;
					dummy = prxmatch(re3, trim(ps_code));
					call prxposn(re3, 1, pos, len);
					pd_num=input(substr(ps_code, pos, len), best32.);
					call prxposn(re3, 2, pos, len);
					split_sfx=substr(ps_code, pos, len);
				end;
			otherwise;
			end;
		end;
	endsub;
quit;

Unfortunately, this increases processing time unacceptably: it now takes 10 minutes to process 75,000 records, instead of barely a minute if the processing is done in the DATA step.

I suppose that what is happening is that the patterns are being recompiled each time the FCMP function is called, which explains the slowdown.

My question is the following: is there a way to cache the compiled patterns, so that I can still use the FCMP method? I could always write a macro to apply to the DATA step, but I see it as an inferior solution.

Thanks!

PeterClemmensen · Posted 05-26-2020 10:07 AM

Can you post a sample of your data? How long did the processing take before?

The DATA to DATA Step Macro
Blog: SASnrd

gabonzo · Posted 05-26-2020 10:28 AM

Unfortunately not, it's protected data.

To give you an idea, I am running the match against two string codes, which can have three different formats:

1. A one-to-three digit number, e.g. 5, 17, 202

2. A one-to-three digit number, followed by a single character, e.g. 5A, 17B, 202E

3. A one-to-three digit number, followed by a dash, then a single digit number, e.g. 5-1, 17-0, 202-2

So the strings are pretty short, the match should be immediate.

Actually, I have to revise my previous statement: if I use FCMP, it takes 9 minutes; if I code directly in the DATA step, it takes 0.2 seconds!

Tom · Posted 05-26-2020 11:31 AM

Welcome to the world of performance tuning. In-lining subroutines (which is what a macro would do) is a common method for improving performance.

You might also check whether you need to use REGEX at all. If you can use normal string functions like SUBSTR, SCAN, VERIFY, COUNTC, INDEXC, etc., they usually work much faster than REGEX.
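
As a rough illustration of that suggestion, here is a minimal sketch that splits the three code formats with VERIFY, SUBSTR, and SCAN instead of regular expressions. The data set name parsed and the inline test values are made up for the example, and the checks are looser than the original patterns (for instance, the suffix after the digits is not validated), so treat it as a starting point rather than a drop-in replacement.

```sas
data parsed;
   length code $5 pd_num_sfx split_sfx $5;
   input code $;
   /* position of the first non-digit character; 0 when the code is all digits */
   p = verify(trim(code), "0123456789");
   if p = 0 then pd_num = input(code, best32.);             /* e.g. 202   */
   else if p > 1 then do;
      pd_num = input(substr(code, 1, p - 1), best32.);      /* leading digits */
      if char(code, p) = "-" then pd_num_sfx = scan(code, 2, "-");  /* e.g. 202-2 */
      else split_sfx = substr(code, p);                     /* e.g. 17B   */
   end;
   drop p;
   datalines;
5
17B
202-2
;
run;
```

A code that starts with a non-digit (p = 1) is left unparsed, which mirrors the original subroutine returning missing values when nothing matches.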

BillM_SAS · Posted 05-27-2020 09:59 PM

Using your supplied description of the data, I created a DATA step to randomly create test data. I then used the code in your FCMP subroutine to also produce a DATA step that runs the same code outside of FCMP. I then ran the FCMP subroutine and the DATA step against various sizes of the data. The timing numbers are from the SAS log and cover only the DATA steps that actually check the codes. The test was run on SAS 9.4 maintenance 6 and showed the following:
Observations FCMP subroutine               DATA step
count        Real time      CPU time       Real time      CPU time
     1,000    0.11 seconds   0.09 seconds   0.05 seconds   0.03 seconds
    10,000    0.14 seconds   0.07 seconds   0.07 seconds   0.06 seconds
   100,000    0.55 seconds   0.48 seconds   0.35 seconds   0.32 seconds
 1,000,000    4.70 seconds   4.67 seconds   3.15 seconds   3.15 seconds
10,000,000   44.27 seconds  44.25 seconds  31.06 seconds  31.04 seconds

While the FCMP subroutine implementation did run slightly slower than the DATA step, I'm not seeing the vast time discrepancies that you are seeing. Does my test faithfully recreate what you are doing? Note that since I use a seed value in the STREAMINIT call, you should be able to run the same code to produce the same data and compare your timings to what I'm seeing.

data RandomData;
   length code $5;
   call streaminit(12345);
   do i = 1 to 1000;
      codeType = rand("Integer", 1, 3);  /* requires SAS 9.4M5 or later */
      if (codeType EQ 1) then do; /* up to 3 digit number */
         /* CATS left-aligns the digits; a plain numeric assignment uses BEST12., */
         /* which is right-aligned and would be truncated to blanks in a $5 variable */
         code = cats(rand("Integer", 0, 999));
      end;
      if (codeType EQ 2) then do; /* up to 3 digit number followed by a letter */
         number = rand("Integer", 0, 999);
         letter = byte(int(rand("Integer", 65, 90)));
         code   = cats(number, letter);
      end;
      if (codeType EQ 3) then do; /* up to 3 digit number, a "-", and a single digit number */
         number3digits = rand("Integer", 0, 999);
         number1digit  = rand("Integer", 0, 9);
         code          = cats(number3digits, '-', number1digit);
      end;
      output;
   end;
run;

LIBNAME myfuncz 'C:\temp';

proc fcmp outlib=myfuncz.turnout43ge.pdprx;
        subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
               outargs pd_num, pd_num_sfx, split_sfx;
               re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
               re2 = prxparse("/^(\d+)-(\d)$/");
               re3 = prxparse("/^(\d+)([A-Z]+)$/");
               pd_num=.;
               pd_num_sfx="";
               split_sfx="";
               if prxmatch(re, ps_code) then do;
                       wc = prxparen(re);
                       /* put ps_code wc; */
                       select (wc);
                       when (1) pd_num=input(ps_code, best32.);
                       when (2) 
                               do;
                                      dummy = prxmatch(re2, ps_code);
                                      call prxposn(re2, 1, pos, len);
                                      pd_num=input(substr(ps_code, pos, len), best32.);
                                      call prxposn(re2, 2, pos, len);
                                      pd_num_sfx=substr(ps_code, pos, len);
                               end;
                       when (3) 
                               do;
                                      dummy = prxmatch(re3, trim(ps_code));
                                      call prxposn(re3, 1, pos, len);
                                      pd_num=input(substr(ps_code, pos, len), best32.);
                                      call prxposn(re3, 2, pos, len);
                                      split_sfx=substr(ps_code, pos, len);
                               end;
                       otherwise;
                       end;
               end;
        endsub;
quit;

OPTION CMPLIB=myfuncz.turnout43ge;

data codesOutFcmp(keep=code pd_num pd_num_sfx split_sfx);
  length pd_num_sfx $1. split_sfx $1.;
  set RandomData;
  call pdprx(compress(code), pd_num, pd_num_sfx, split_sfx);
  output;
run;

data codesOutDS(keep=code pd_num pd_num_sfx split_sfx);
  length pd_num_sfx $1. split_sfx $1.;
  set RandomData;
/* Same code as in FCMP function */
               re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
               re2 = prxparse("/^(\d+)-(\d)$/");
               re3 = prxparse("/^(\d+)([A-Z]+)$/");
               pd_num=.;
               pd_num_sfx="";
               split_sfx="";
               if prxmatch(re, code) then do;
                       wc = prxparen(re);
                       /* put code wc; */
                       select (wc);
                       when (1) pd_num=input(code, best32.);
                       when (2) 
                               do;
                                      dummy = prxmatch(re2, code);
                                      call prxposn(re2, 1, pos, len);
                                      pd_num=input(substr(code, pos, len), best32.);
                                      call prxposn(re2, 2, pos, len);
                                      pd_num_sfx=substr(code, pos, len);
                               end;
                       when (3) 
                               do;
                                      dummy = prxmatch(re3, trim(code));
                                      call prxposn(re3, 1, pos, len);
                                      pd_num=input(substr(code, pos, len), best32.);
                                      call prxposn(re3, 2, pos, len);
                                      split_sfx=substr(code, pos, len);
                               end;
                       otherwise;
                       end;
               end;
  output;
run;

DavePrinsloo · Posted 05-26-2020 11:21 AM

Each FCMP call launches a new SAS sub-session. That's why it is so slow. It's great for doing things in macros with %sysfunc.
It's not my area of expertise, but I think it is possible to configure it to use a second SAS session that is available as a service and is therefore quicker.
I have had the same issue and I ended up using macros to generate the code.
If you call the macro 10 times, you are generating a lot more code, but that is compiled only once. The code executed is the same, but without launching a new SAS session multiple times per FCMP function call.

gabonzo · Posted 05-26-2020 11:29 AM

Ouch. Well I think I will use macros then. Thanks.

ballardw · Posted 05-26-2020 11:49 AM

@gabonzo wrote:

Ouch. Well I think I will use macros then. Thanks.

You didn't provide any example of actual use of the code.

Since you mentioned doing the same thing to two variables (or more???) that always points me toward an Array solution if used in a data step to reduce code.
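
A minimal sketch of the array idea, under invented names: the data sets have and want, the variables code_a, code_b, num_a, num_b, and the simplified pattern are all hypothetical, since the original variable names were not posted. It only pulls the leading digits out of each code, to show the loop rather than the full three-format logic.

```sas
data have;                               /* hypothetical test data */
   input code_a :$5. code_b :$5.;
   datalines;
5 17B
202-2 9
;
run;

data want;
   set have;
   array codes{2} code_a code_b;         /* existing character variables */
   array nums{2}  num_a  num_b;          /* new numeric results */
   retain re;
   if _n_ = 1 then re = prxparse("/^(\d+)/");   /* compile the pattern once */
   do i = 1 to dim(codes);
      /* same matching code runs for every variable in the array */
      if prxmatch(re, codes{i}) then
         nums{i} = input(prxposn(re, 1, codes{i}), best32.);
   end;
   drop i re;
run;
```

Adding a third variable then only means extending the two ARRAY statements, not copying the matching code.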

PeterClemmensen · Posted 05-27-2020 02:30 AM

@gabonzo, if you do choose to go with FCMP, here are a few things that may speed up execution:

Use the STATIC statement in PROC FCMP to initialize your pattern IDs once. There is no need to initialize them at each call.
Since you have 75,000+ observations and rather simple patterns, there are bound to be duplicates. Since you mention a cache yourself, you can use a hash object to cache values already encountered in a previous call. Retrieving a value from a cache is much quicker than recalculating it. Read about the technique in the great article Hashing in PROC FCMP to Enhance Your Productivity.
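
The first point might look roughly like the sketch below. It reuses the subroutine body from the question and only adds a STATIC statement plus a guard around the PRXPARSE calls. The library location work.funcs.prx and the name pdprx_static are placeholders, and the sketch assumes that STATIC numeric variables start out missing and keep their values between calls; check the PROC FCMP documentation for the exact STATIC syntax in your release.

```sas
proc fcmp outlib=work.funcs.prx;         /* hypothetical library location */
   subroutine pdprx_static(ps_code $, pd_num, pd_num_sfx $, split_sfx $);
      outargs pd_num, pd_num_sfx, split_sfx;
      static re re2 re3;                 /* assumed to persist across calls */
      if re = . then do;                 /* compile the patterns on the first call only */
         re  = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
         re2 = prxparse("/^(\d+)-(\d)$/");
         re3 = prxparse("/^(\d+)([A-Z]+)$/");
      end;
      pd_num = .;
      pd_num_sfx = "";
      split_sfx = "";
      if prxmatch(re, trim(ps_code)) then do;
         wc = prxparen(re);              /* which alternative matched */
         if wc = 1 then pd_num = input(ps_code, best32.);
         else if wc = 2 then do;
            dummy = prxmatch(re2, trim(ps_code));
            call prxposn(re2, 1, pos, len);
            pd_num = input(substr(ps_code, pos, len), best32.);
            call prxposn(re2, 2, pos, len);
            pd_num_sfx = substr(ps_code, pos, len);
         end;
         else if wc = 3 then do;
            dummy = prxmatch(re3, trim(ps_code));
            call prxposn(re3, 1, pos, len);
            pd_num = input(substr(ps_code, pos, len), best32.);
            call prxposn(re3, 2, pos, len);
            split_sfx = substr(ps_code, pos, len);
         end;
      end;
   endsub;
quit;
```

It would be called like the original, after pointing CMPLIB at the library, e.g. options cmplib=work.funcs; followed by call pdprx_static(code, pd_num, pd_num_sfx, split_sfx); in the DATA step.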

Regards

gabonzo · Posted 05-27-2020 09:01 AM

Oh that's interesting!
If a compiled regexp can be set up as static, that should do the trick.

I am aware of hashmaps, in fact I use them later in the code, but thank you for pointing that out!

Cheers

BillM_SAS · Posted 05-28-2020 12:39 PM

@DavePrinsloo, as a member of the FCMP development team, I would like to clear up something mentioned in your post. When the DATA step encounters an FCMP function or subroutine, the FCMP code is located in the specified FCMP function library (OPTION CMPLIB), loaded into memory, and compiled. The DATA step then calls the FCMP code as it would any other SAS function. The compiled FCMP function code remains in memory until the end of the DATA step. No new SAS sub-sessions are created when using FCMP.

DavePrinsloo · Posted 05-28-2020 12:51 PM

Thanks for the heads up! A long time ago, when FCMP was initially released, I used it a lot more, but now I use it in a more measured manner when processing large SAS tables (millions of rows) because of the performance hit.
