gabonzo
Quartz | Level 8

Hi,

 

I have a set of 3 regular expressions that I must apply to 2 different text variables. To make the code reusable, I decided to store the regular expressions and perform the matching inside an FCMP subroutine:

 

proc fcmp outlib=myfuncz.turnout43ge.pdprx;
	subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
		outargs pd_num, pd_num_sfx, split_sfx;
		re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
		re2 = prxparse("/^(\d+)-(\d)$/");
		re3 = prxparse("/^(\d+)([A-Z]+)$/");
		pd_num=.;
		pd_num_sfx="";
		split_sfx="";
		if prxmatch(re, ps_code) then do;
			wc = prxparen(re);
			select (wc);
			when (1) pd_num=input(ps_code, best32.);
			when (2) 
				do;
					dummy = prxmatch(re2, ps_code);
					call prxposn(re2, 1, pos, len);
					pd_num=input(substr(ps_code, pos, len), best32.);
					call prxposn(re2, 2, pos, len);
					pd_num_sfx=substr(ps_code, pos, len);
				end;
			when (3) 
				do;
					dummy = prxmatch(re3, trim(ps_code));
					call prxposn(re3, 1, pos, len);
					pd_num=input(substr(ps_code, pos, len), best32.);
					call prxposn(re3, 2, pos, len);
					split_sfx=substr(ps_code, pos, len);
				end;
			otherwise;
			end;
		end;
	endsub;
quit;

 

Unfortunately, this increases processing time to an unacceptable degree: it now takes 10 minutes to process 75,000 records, instead of barely a minute when the processing is done directly in the DATA step.

 

I suppose that what is happening is that the patterns are being recompiled each time the FCMP function is called, which explains the slowdown.

 

My question is the following: is there a way to cache the compiled patterns, so that I can still use the FCMP method? I could always write a macro to apply to the DATA step, but I see it as an inferior solution.

 

Thanks!

PeterClemmensen
Tourmaline | Level 20

Can you post a sample of your data? How long did the processing take before?

gabonzo
Quartz | Level 8

Unfortunately not, it's protected data.

To give you an idea, I am running the match against two string codes, which can have three different formats:

 

1. A one-to-three digit number, e.g. 5, 17, 202

2. A one-to-three digit number, followed by a single character, e.g. 5A, 17B, 202E

3. A one-to-three digit number, followed by a dash, then a single digit number, e.g. 5-1, 17-0, 202-2

 

So the strings are pretty short, the match should be immediate.

 

Actually, I have to revise my previous statement: with FCMP it takes 9 minutes, but if I code it directly in the DATA step it takes 0.2 seconds!

Tom
Tom
Super User

Welcome to the world of performance tuning.  In-lining subroutines (which is what a macro would do) is a common method for improving performance.

You might also check whether you need to use REGEX at all. If you can use normal string functions like SUBSTR, SCAN, VERIFY, COUNTC, INDEXC, etc., they usually work much faster than REGEX.
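
As a rough illustration of the string-function route, the sketch below parses the three code formats described earlier in the thread with VERIFY and SUBSTR instead of regular expressions. The dataset name `parsed`, the sample values, the variable lengths, and the helper variables `n`, `head`, and `tail` are assumptions made for illustration; it has not been timed against the original data.

```sas
/* Sketch only: dataset name, lengths and sample values are illustrative assumptions */
data parsed(drop=n head tail);
   length ps_code head tail $8 pd_num 8 pd_num_sfx $1 split_sfx $8;
   input ps_code $;
   /* Position of the first non-digit character; 0 means digits only */
   n = verify(trim(ps_code), '0123456789');
   if n = 0 then pd_num = input(ps_code, best32.);
   else if n > 1 then do;
      head = substr(ps_code, 1, n - 1);   /* leading digits */
      tail = substr(ps_code, n);          /* everything after them */
      /* Format "digits + letters", e.g. 17B */
      if verify(trim(tail), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then do;
         pd_num = input(head, best32.);
         split_sfx = tail;
      end;
      /* Format "digits - single digit", e.g. 202-2 */
      else if length(tail) = 2 and substr(tail, 1, 1) = '-'
              and verify(substr(tail, 2, 1), '0123456789') = 0 then do;
         pd_num = input(head, best32.);
         pd_num_sfx = substr(tail, 2, 1);
      end;
   end;
   datalines;
5
17B
202-2
X9
;
```

Values that fit none of the three formats (such as `X9` above) leave all three output variables missing, which mirrors the behaviour of the original subroutine when the first pattern does not match.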

BillM_SAS
SAS Employee

Using your supplied description of the data, I created a DATA step to randomly create test data. I then used the code in your FCMP subroutine to also produce a DATA step that will run the same code outside of FCMP. I then ran the FCMP subroutine and the DATA step against various sizes of the data. The timing numbers are from SAS log of only the DATA steps actually checking the codes. It was run at SAS 9.4 maintenance 6 and showed the following:
Observations       FCMP Subroutine                    DATA step
  count        Real time      CPU Time        Real Time      CPU Time
    1,000      0.11 seconds   0.09 seconds    0.05 seconds   0.03 seconds
    10,000     0.14 seconds   0.07 seconds    0.07 seconds   0.06 seconds
   100,000     0.55 seconds   0.48 seconds    0.35 seconds   0.32 seconds
 1,000,000     4.70 seconds   4.67 seconds    3.15 seconds   3.15 seconds
10,000,000    44.27 seconds  44.25 seconds   31.06 seconds  31.04 seconds


While the FCMP subroutine implementation did run slightly slower than the DATA step, I'm not seeing the vast time discrepancies that you are seeing. Does my test faithfully recreate what you are doing? Note that since I use a seed value in the STREAMINIT call, you should be able to run the same code to produce the same data and compare your timings to what I'm seeing.

 

data RandomData;
   length code $5;
   call streaminit(12345);
   do i = 1 to 1000;
      codeType = rand("Integer", 1, 3);  /* requires SAS 9.4M5 or later */
      if (codeType EQ 1) /* up to 3 digit number */
         then code = rand("Integer", 0, 999);
      if (codeType EQ 2) /* up to 3 digit number followed by a letter */
         then do;
            number = rand("Integer", 0, 999);
            letter = byte(int(rand("Integer", 65, 90)));
            code   = CATS(number, letter);
         end;
      if (codeType EQ 3) /* 3 digit number, a "-", and a single digit number */
         then do;
            number3digits = rand("Integer", 0, 999);
            number1digit  = rand("Integer", 0, 9);
            code          = CATS(number3digits, '-', number1digit);
         end;
      output;
   end;
run;

LIBNAME myfuncz 'C:\temp';

proc fcmp outlib=myfuncz.turnout43ge.pdprx;
        subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
               outargs pd_num, pd_num_sfx, split_sfx;
               re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
               re2 = prxparse("/^(\d+)-(\d)$/");
               re3 = prxparse("/^(\d+)([A-Z]+)$/");
               pd_num=.;
               pd_num_sfx="";
               split_sfx="";
               if prxmatch(re, ps_code) then do;
                       wc = prxparen(re);
					   /* put ps_code wc; */
                       select (wc);
                       when (1) pd_num=input(ps_code, best32.);
                       when (2) 
                               do;
                                      dummy = prxmatch(re2, ps_code);
                                      call prxposn(re2, 1, pos, len);
                                      pd_num=input(substr(ps_code, pos, len), best32.);
                                      call prxposn(re2, 2, pos, len);
                                      pd_num_sfx=substr(ps_code, pos, len);
                               end;
                       when (3) 
                               do;
                                      dummy = prxmatch(re3, trim(ps_code));
                                      call prxposn(re3, 1, pos, len);
                                      pd_num=input(substr(ps_code, pos, len), best32.);
                                      call prxposn(re3, 2, pos, len);
                                      split_sfx=substr(ps_code, pos, len);
                               end;
                       otherwise;
                       end;
               end;
        endsub;
quit;

OPTION CMPLIB=myfuncz.turnout43ge;

data codesOutFcmp(keep=code pd_num pd_num_sfx split_sfx);
  length pd_num_sfx $1. split_sfx $1.;
  set RandomData;
  call pdprx(compress(code), pd_num, pd_num_sfx, split_sfx);
  output;
run;

data codesOutDS(keep=code pd_num pd_num_sfx split_sfx);
  length pd_num_sfx $1. split_sfx $1.;
  set RandomData;
/* Same code as in FCMP function */
               re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
               re2 = prxparse("/^(\d+)-(\d)$/");
               re3 = prxparse("/^(\d+)([A-Z]+)$/");
               pd_num=.;
               pd_num_sfx="";
               split_sfx="";
               if prxmatch(re, code) then do;
                       wc = prxparen(re);
					   /* put ps_code wc; */
                       select (wc);
                       when (1) pd_num=input(code, best32.);
                       when (2) 
                               do;
                                      dummy = prxmatch(re2, code);
                                      call prxposn(re2, 1, pos, len);
                                      pd_num=input(substr(code, pos, len), best32.);
                                      call prxposn(re2, 2, pos, len);
                                      pd_num_sfx=substr(code, pos, len);
                               end;
                       when (3) 
                               do;
                                      dummy = prxmatch(re3, trim(code));
                                      call prxposn(re3, 1, pos, len);
                                      pd_num=input(substr(code, pos, len), best32.);
                                      call prxposn(re3, 2, pos, len);
                                      split_sfx=substr(code, pos, len);
                               end;
                       otherwise;
                       end;
               end;
  output;
run;
DavePrinsloo
Pyrite | Level 9 | Accepted solution
Each FCMP call launches a new SAS sub-session. That's why it is so slow. It's great for doing things in macros with %sysfunc.
It's not my area of expertise, but I think it is possible to configure it to use a second SAS session that is available as a service and is therefore quicker.
I have had the same issue and I ended up using macros to generate the code.
If you call the macro 10 times, you are generating a lot more code, but that is compiled only once. The code executed is the same, but without launching a new SAS session multiple times per FCMP function call.
gabonzo
Quartz | Level 8

Ouch. Well I think I will use macros then. Thanks.

ballardw
Super User

@gabonzo wrote:

Ouch. Well I think I will use macros then. Thanks.


You didn't provide any example of actual use of the code.

Since you mentioned doing the same thing to two variables (or more???) that always points me toward an Array solution if used in a data step to reduce code.
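
A minimal sketch of what an array-based DATA step could look like, assuming two hypothetical input variables `code_a` and `code_b`. The dataset names, variable names, and the single combined pattern are illustrative assumptions, not the original poster's code. The pattern is compiled once on the first iteration and its id is retained.

```sas
/* Hypothetical sample data: two code variables per record */
data codes;
   input code_a $ code_b $;
   datalines;
5 17B
202-2 9
;

data parsed(drop=i re);
   set codes;
   array src{2}   code_a code_b;          /* inputs to parse */
   array num{2}   num_a num_b;            /* leading number */
   array dash{2}  $1 dash_a dash_b;       /* digit after the dash, if any */
   array split{2} $8 split_a split_b;     /* trailing letters, if any */
   retain re;
   /* Compile once; group 2 is the dash digit, group 3 the letters */
   if _n_ = 1 then re = prxparse('/^(\d+)(?:-(\d)|([A-Z]+))?$/');
   do i = 1 to dim(src);
      num{i} = .;
      dash{i} = '';
      split{i} = '';
      /* strip() removes the padding blanks so the $ anchor can match */
      if prxmatch(re, strip(src{i})) then do;
         num{i}   = input(prxposn(re, 1, strip(src{i})), best32.);
         dash{i}  = prxposn(re, 2, strip(src{i}));
         split{i} = prxposn(re, 3, strip(src{i}));
      end;
   end;
run;
```

The loop body is written once and applies to however many variables are listed in the arrays, so adding a third code variable only means extending the four ARRAY statements.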

PeterClemmensen
Tourmaline | Level 20

@gabonzo, if you do choose to go with the FCMP, here are a few things that may speed up the executions:

 

  • Use the STATIC statement in PROC FCMP to initialize your pattern ids once. There is no need to initialize them at each call.
  • Since you have more than 75,000 observations and fairly simple patterns, there are bound to be duplicate values. Since you mention a cache yourself, you can use a hash object to cache values already encountered in a previous call. Retrieving a value from a cache is much quicker than recalculating it. Read about the technique in the great article Hashing in PROC FCMP to Enhance Your Productivity.
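
The first suggestion might look roughly like the sketch below. It is untested and rests on assumptions: that a variable declared with STATIC and no initial value is missing on the first call and keeps its value afterwards, and that the library `work.funcs.prx` and the name `pdprx_static` (both hypothetical) are acceptable. The matching logic is copied from the original subroutine.

```sas
proc fcmp outlib=work.funcs.prx;   /* hypothetical output library */
   subroutine pdprx_static(ps_code $, pd_num, pd_num_sfx $, split_sfx $);
      outargs pd_num, pd_num_sfx, split_sfx;
      /* Assumption: STATIC variables keep their values between calls */
      static re;
      static re2;
      static re3;
      /* Compile the patterns only on the first call */
      if missing(re) then do;
         re  = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
         re2 = prxparse("/^(\d+)-(\d)$/");
         re3 = prxparse("/^(\d+)([A-Z]+)$/");
      end;
      pd_num = .;
      pd_num_sfx = "";
      split_sfx = "";
      if prxmatch(re, ps_code) then do;
         wc = prxparen(re);
         select (wc);
            when (1) pd_num = input(ps_code, best32.);
            when (2) do;
               dummy = prxmatch(re2, ps_code);
               call prxposn(re2, 1, pos, len);
               pd_num = input(substr(ps_code, pos, len), best32.);
               call prxposn(re2, 2, pos, len);
               pd_num_sfx = substr(ps_code, pos, len);
            end;
            when (3) do;
               dummy = prxmatch(re3, trim(ps_code));
               call prxposn(re3, 1, pos, len);
               pd_num = input(substr(ps_code, pos, len), best32.);
               call prxposn(re3, 2, pos, len);
               split_sfx = substr(ps_code, pos, len);
            end;
            otherwise;
         end;
      end;
   endsub;
quit;

options cmplib=work.funcs;

/* Usage: output arguments must exist with suitable lengths before the call */
data _null_;
   length pd_num_sfx $1 split_sfx $8;
   pd_num = .;
   pd_num_sfx = '';
   split_sfx = '';
   call pdprx_static('17B', pd_num, pd_num_sfx, split_sfx);
   put pd_num= pd_num_sfx= split_sfx=;
run;
```

Whether this removes the slowdown depends on whether repeated PRXPARSE calls are in fact the bottleneck, so it is worth timing both versions on the real data.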

Regards

gabonzo
Quartz | Level 8
Oh that's interesting!
If a compiled regexp can be set up as static, that should do the trick.

I am aware of hashmaps, in fact I use them later in the code, but thank you for pointing that out!

Cheers
BillM_SAS
SAS Employee

@DavePrinsloo, as a member of the FCMP development team, I would like to clear up something mentioned in your post. Once the DATA step encounters an FCMP function or subroutine, the FCMP code is located in the specified FCMP function library (OPTION CMPLIB), loaded into memory, and compiled. The DATA step then calls the FCMP code as it would any other SAS function. The compiled FCMP function code remains in memory until the end of the DATA step. No new SAS sub-sessions are created when using FCMP.

DavePrinsloo
Pyrite | Level 9
Thanks for the heads up! A long time ago, when FCMP was initially released, I used it a lot more, but now I use it in a more measured manner when processing large SAS tables (millions of rows) because of the performance hit.
