Hi,
I have a set of 3 regular expressions that I must apply to 2 different text variables. To be able to reuse the code, I decided to store the regular expressions and perform the matching inside an FCMP subroutine:
proc fcmp outlib=myfuncz.turnout43ge.pdprx;
    subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
        outargs pd_num, pd_num_sfx, split_sfx;
        re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
        re2 = prxparse("/^(\d+)-(\d)$/");
        re3 = prxparse("/^(\d+)([A-Z]+)$/");
        pd_num=.;
        pd_num_sfx="";
        split_sfx="";
        if prxmatch(re, ps_code) then do;
            wc = prxparen(re);
            select (wc);
                when (1) pd_num=input(ps_code, best32.);
                when (2)
                    do;
                        dummy = prxmatch(re2, ps_code);
                        call prxposn(re2, 1, pos, len);
                        pd_num=input(substr(ps_code, pos, len), best32.);
                        call prxposn(re2, 2, pos, len);
                        pd_num_sfx=substr(ps_code, pos, len);
                    end;
                when (3)
                    do;
                        dummy = prxmatch(re3, trim(ps_code));
                        call prxposn(re3, 1, pos, len);
                        pd_num=input(substr(ps_code, pos, len), best32.);
                        call prxposn(re3, 2, pos, len);
                        split_sfx=substr(ps_code, pos, len);
                    end;
                otherwise;
            end;
        end;
    endsub;
quit;
Unfortunately, this increases processing time in an unacceptable way: it now takes 10 minutes to process 75,000 records, instead of barely a minute if the processing is done in the DATA step.
I suppose that what is happening is that the patterns are being recompiled each time the FCMP function is called, which explains the slowdown.
My question is the following: is there a way to cache the compiled patterns, so that I can still use the FCMP method? I could always write a macro to apply to the DATA step, but I see it as an inferior solution.
Thanks!
Can you post a sample of your data? How long did the processing take before?
Unfortunately not, it's protected data.
To give you an idea, I am running the match against two string codes, which can have three different formats:
1. A one-to-three digit number, e.g. 5, 17, 202
2. A one-to-three digit number, followed by a single character, e.g. 5A, 17B, 202E
3. A one-to-three digit number, followed by a dash, then a single digit number, e.g. 5-1, 17-0, 202-2
So the strings are pretty short, the match should be immediate.
Actually, I have to revise my previous statement: if I use FCMP, it takes 9 minutes, if I code directly in the DATA step, it takes 0.2 seconds!
Welcome to the world of performance tuning. In-lining subroutines (which is what a macro would do) is a common method for improving performance.
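To make the in-lining idea concrete, here is a minimal sketch of a macro that expands to plain DATA step statements. The macro name `pdprx_inline`, its parameters, the `sample` data set and the `_re_` helper variables are all hypothetical. It also uses a simplified two-pattern form with the PRXPOSN function rather than the exact three-pattern dispatch from the original subroutine.

```sas
/* Hypothetical sample input: one character code per row. */
data sample;
    length code $5;
    input code $;
    datalines;
202
17-0
5A
;
run;

/* Hypothetical macro: expands to inline DATA step statements for one variable. */
%macro pdprx_inline(var=, num=, num_sfx=, split_sfx=);
    /* Constant patterns: the DATA step compiles each one only once. */
    _re_dash  = prxparse("/^(\d+)-(\d)$/");
    _re_alpha = prxparse("/^(\d+)([A-Z]+)$/");
    &num       = .;
    &num_sfx   = "";
    &split_sfx = "";
    /* STRIP removes blank padding so that the $ anchor can match. */
    if prxmatch(_re_dash, strip(&var)) then do;
        &num     = input(prxposn(_re_dash, 1, strip(&var)), best32.);
        &num_sfx = prxposn(_re_dash, 2, strip(&var));
    end;
    else if prxmatch(_re_alpha, strip(&var)) then do;
        &num       = input(prxposn(_re_alpha, 1, strip(&var)), best32.);
        &split_sfx = prxposn(_re_alpha, 2, strip(&var));
    end;
    else if prxmatch("/^\d+$/", strip(&var)) then
        &num = input(&var, best32.);
%mend pdprx_inline;

data sample_out;
    set sample;
    /* Declare the output lengths before the macro assigns to them. */
    length pd_num_sfx split_sfx $1;
    %pdprx_inline(var=code, num=pd_num, num_sfx=pd_num_sfx, split_sfx=split_sfx)
    drop _re_dash _re_alpha;
run;
```

Calling the macro once per input variable, with different output variable names, keeps the logic in one place while the generated code still runs directly in the DATA step.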
You might also check whether you need regular expressions at all. If you can use normal string functions like SUBSTR, SCAN, VERIFY, COUNTC, INDEXC, etc., they usually work much faster than regular expressions.
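As a rough illustration of the string-function route, the sketch below parses the three code formats described earlier in the thread with NOTDIGIT, ANYDIGIT, NOTUPPER and SUBSTR instead of regular expressions. The `sample` data set and the helper variables `c`, `p`, `head` and `tail` are hypothetical, and the code assumes values of at most 5 characters. It has not been timed against the regex version.

```sas
/* Hypothetical sample input: one character code per row. */
data sample;
    length code $5;
    input code $;
    datalines;
202
17-0
5A
;
run;

data sample_str(keep=code pd_num pd_num_sfx split_sfx);
    set sample;
    length c head tail $5 pd_num_sfx split_sfx $1;
    pd_num = .;
    pd_num_sfx = "";
    split_sfx = "";
    c = strip(code);
    /* Position of the first non-digit character; 0 if the code is all digits. */
    p = notdigit(trim(c));
    if p = 0 then pd_num = input(c, best32.);       /* e.g. 202 */
    else if p > 1 then do;
        head = substr(c, 1, p - 1);                 /* leading digits */
        tail = substr(c, p);                        /* remainder of the code */
        if substr(tail, 1, 1) = '-' then do;
            /* Dash form: exactly one digit must follow, e.g. 17-0. */
            if length(tail) = 2 and anydigit(tail) = 2 then do;
                pd_num = input(head, best32.);
                pd_num_sfx = substr(tail, 2, 1);
            end;
        end;
        else if notupper(trim(tail)) = 0 then do;
            /* Letter form, e.g. 5A; only the first letter fits in $1. */
            pd_num = input(head, best32.);
            split_sfx = tail;
        end;
    end;
run;
```

Codes that fit none of the three formats leave all three output variables missing, which mirrors the behavior of the regex version when the first pattern fails to match.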
Using your supplied description of the data, I created a DATA step to randomly create test data. I then used the code in your FCMP subroutine to also produce a DATA step that runs the same code outside of FCMP. I then ran the FCMP subroutine and the DATA step against various sizes of the data. The timing numbers are from the SAS log and cover only the DATA steps that actually check the codes. The tests were run on SAS 9.4 Maintenance 6 and showed the following:
Observations   FCMP subroutine                  DATA step
count          Real time       CPU time         Real time       CPU time
1,000          0.11 seconds    0.09 seconds     0.05 seconds    0.03 seconds
10,000         0.14 seconds    0.07 seconds     0.07 seconds    0.06 seconds
100,000        0.55 seconds    0.48 seconds     0.35 seconds    0.32 seconds
1,000,000      4.70 seconds    4.67 seconds     3.15 seconds    3.15 seconds
10,000,000     44.27 seconds   44.25 seconds    31.06 seconds   31.04 seconds
While the FCMP subroutine implementation did run slightly slower than the DATA step, I'm not seeing the vast time discrepancies that you are seeing. Does my test faithfully recreate what you are doing? Note that since I use a seed value in the STREAMINIT call, you should be able to run the same code to produce the same data and compare your timings to what I'm seeing.
data RandomData;
    length code $5;
    call streaminit(12345);
    do i = 1 to 1000;
        codeType = rand("Integer", 1, 3); /* requires SAS 9.4M5 or later */
        if (codeType EQ 1) /* up to 3 digit number */
            /* CATS avoids the implicit numeric-to-character conversion */
            then code = CATS(rand("Integer", 0, 999));
        if (codeType EQ 2) /* up to 3 digit number followed by a letter */
            then do;
                number = rand("Integer", 0, 999);
                letter = byte(int(rand("Integer", 65, 90)));
                code = CATS(number,letter);
            end;
        if (codeType EQ 3) /* up to 3 digit number, a "-", and a single digit number */
            then do;
                number3digits = rand("Integer", 0, 999);
                number1digit = rand("Integer", 0, 9);
                code = CATS(number3digits, '-', number1digit);
            end;
        output;
    end;
run;
LIBNAME myfuncz 'C:\temp';
proc fcmp outlib=myfuncz.turnout43ge.pdprx;
    subroutine pdprx(ps_code$, pd_num, pd_num_sfx$, split_sfx$);
        outargs pd_num, pd_num_sfx, split_sfx;
        re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
        re2 = prxparse("/^(\d+)-(\d)$/");
        re3 = prxparse("/^(\d+)([A-Z]+)$/");
        pd_num=.;
        pd_num_sfx="";
        split_sfx="";
        if prxmatch(re, ps_code) then do;
            wc = prxparen(re);
            /* put ps_code wc; */
            select (wc);
                when (1) pd_num=input(ps_code, best32.);
                when (2)
                    do;
                        dummy = prxmatch(re2, ps_code);
                        call prxposn(re2, 1, pos, len);
                        pd_num=input(substr(ps_code, pos, len), best32.);
                        call prxposn(re2, 2, pos, len);
                        pd_num_sfx=substr(ps_code, pos, len);
                    end;
                when (3)
                    do;
                        dummy = prxmatch(re3, trim(ps_code));
                        call prxposn(re3, 1, pos, len);
                        pd_num=input(substr(ps_code, pos, len), best32.);
                        call prxposn(re3, 2, pos, len);
                        split_sfx=substr(ps_code, pos, len);
                    end;
                otherwise;
            end;
        end;
    endsub;
quit;
OPTION CMPLIB=myfuncz.turnout43ge;
data codesOutFcmp(keep=code pd_num pd_num_sfx split_sfx);
    length pd_num_sfx $1 split_sfx $1;
    set RandomData;
    call pdprx(compress(code), pd_num, pd_num_sfx, split_sfx);
    output;
run;
data codesOutDS(keep=code pd_num pd_num_sfx split_sfx);
    length pd_num_sfx $1 split_sfx $1;
    set RandomData;
    /* Same code as in FCMP function */
    re = prxparse("/^(\d+)$|^(\d+-\d)$|^(\d+[A-Z]+)$/");
    re2 = prxparse("/^(\d+)-(\d)$/");
    re3 = prxparse("/^(\d+)([A-Z]+)$/");
    pd_num=.;
    pd_num_sfx="";
    split_sfx="";
    /* STRIP removes the blank padding of the $5 variable so that the $ anchor */
    /* can match, mirroring the COMPRESS(code) argument passed to the subroutine */
    if prxmatch(re, strip(code)) then do;
        wc = prxparen(re);
        /* put code wc; */
        select (wc);
            when (1) pd_num=input(code, best32.);
            when (2)
                do;
                    dummy = prxmatch(re2, strip(code));
                    call prxposn(re2, 1, pos, len);
                    pd_num=input(substr(code, pos, len), best32.);
                    call prxposn(re2, 2, pos, len);
                    pd_num_sfx=substr(code, pos, len);
                end;
            when (3)
                do;
                    dummy = prxmatch(re3, strip(code));
                    call prxposn(re3, 1, pos, len);
                    pd_num=input(substr(code, pos, len), best32.);
                    call prxposn(re3, 2, pos, len);
                    split_sfx=substr(code, pos, len);
                end;
            otherwise;
        end;
    end;
    output;
run;
Ouch. Well I think I will use macros then. Thanks.
You didn't provide any example of actual use of the code.
Since you mentioned doing the same thing to two variables (or more?), that points me toward an array solution, which reduces the amount of code needed in a DATA step.
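A minimal sketch of what such an array solution might look like, assuming a hypothetical data set `TwoCodes` with two character variables `code_a` and `code_b`. All data set, array and output variable names here are illustrative, and the matching uses a simplified two-pattern form rather than the exact code from the original subroutine.

```sas
/* Hypothetical input with two code variables per row. */
data TwoCodes;
    length code_a code_b $5;
    input code_a $ code_b $;
    datalines;
5 17-0
202E 5-1
;
run;

data TwoCodesOut;
    set TwoCodes;
    array codes  {2} $  code_a code_b;
    array nums   {2}    pd_num_a pd_num_b;
    array numsfx {2} $1 pd_num_sfx_a pd_num_sfx_b;
    array splsfx {2} $1 split_sfx_a split_sfx_b;
    /* Constant patterns: the DATA step compiles each one only once. */
    re_dash  = prxparse("/^(\d+)-(\d)$/");
    re_alpha = prxparse("/^(\d+)([A-Z]+)$/");
    do k = 1 to dim(codes);
        nums{k}   = .;
        numsfx{k} = "";
        splsfx{k} = "";
        /* STRIP removes blank padding so that the $ anchor can match. */
        if prxmatch(re_dash, strip(codes{k})) then do;
            nums{k}   = input(prxposn(re_dash, 1, strip(codes{k})), best32.);
            numsfx{k} = prxposn(re_dash, 2, strip(codes{k}));
        end;
        else if prxmatch(re_alpha, strip(codes{k})) then do;
            nums{k}   = input(prxposn(re_alpha, 1, strip(codes{k})), best32.);
            splsfx{k} = prxposn(re_alpha, 2, strip(codes{k}));
        end;
        else if prxmatch("/^\d+$/", strip(codes{k})) then
            nums{k} = input(codes{k}, best32.);
    end;
    drop k re_dash re_alpha;
run;
```

Adding a third code variable then only means extending the four ARRAY statements; the loop body stays unchanged.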
@gabonzo, if you do choose to go with the FCMP, here are a few things that may speed up the executions:
Regards
@DavePrinsloo, speaking as a member of the FCMP development team, I would like to clear up something mentioned in your post. Once the DATA step encounters an FCMP function or subroutine, the FCMP code is located in the specified FCMP function library (OPTION CMPLIB), and the needed FCMP code is loaded into memory and compiled. The DATA step then calls the FCMP code as it would any other SAS function. The compiled FCMP function code remains in memory until the end of the DATA step. No new SAS sub-sessions are created when using FCMP.