I am wondering if there are any rules about how SAS handles constants and if having the macro language resolve macro variables before they reach the data-step can improve processing time.
Take the following toy example. (I know there are more efficient ways to achieve the same outcome in the code below, this is purely for illustrative purposes.)
%macro macro_vs_datastep();
%let macro_date = 01JAN1980;
%let quarters_to_process = 10;
data data_step;
set sashelp.gnp;
%do h = 1 %to &quarters_to_process;
if intnx("QUARTER", Date, 0, "E") = intnx("QUARTER", "¯o_date"d, &h, "E") then output;
%end;
run;
data macro;
set sashelp.gnp;
%do h = 1 %to &quarters_to_process;
if intnx("QUARTER", Date, 0, "E") = %sysfunc(intnx(QUARTER, "¯o_date"d, &h, E)) then output;
%end;
run;
%mend;
%macro_vs_datastep();
My intuition is that that resolving the macro variables 'macro_date' and 'h' inside of %sysfunc() should reduce I/O time as in the 2nd data-step, as SAS only has to resolve intnx() with each loop of 'h'. Compare that to the first data-step where intnx() must be resolved with each loop of 'h' along with each record in the dataset.
Are there any rules-of-thumb around best practices in these sorts of situations?
I don't think there is a good general rule. Personally I would go with option 2:
%sysfunc(intnx(QUARTER, "¯o_date"d, &h, E))
I would do that for logical coherence, rather than run-time efficiency. The expression doesn't have any DATA step variables, so it makes sense to me to use the macro language.
The first option:
intnx("QUARTER", "¯o_date"d, &h, "E")
looks weird to me, because you're using the DATA step INTNX function on constants. Which isn't a bad thing or a thing I've never done, but feels odd.
Efficiency-wise, I think it's going to come down to how many macro loops you have, and how many data step loops you have.
In option 1 you have two DATA step invocations of INTNX. I'm assuming the DATA step compiler is not smart enough to recognize that the second one is a constant (but I could be wrong, which would destroy my argument here).
In option 2 you have one DATA step invocation of INTNX, and one macro language invocation of INTNX.
So if the number of records in your data is N, I think the difference in time between the two is N*10*(speed of DATA step to execute INTNX one time) - 10*speed of %sysfunc to execute INTNX once.
The macro language is generally slow, for for small N it could lose, but I guess for big N the macro language might win. Essentially in the macro language the expression:
%sysfunc(intnx(QUARTER, "¯o_date"d, &h, E))
Executes in constant time, independent of N. I think that is O(1) in big O notation.
You should try testing it out. Fun question.
You were right! It seems that the Data-Step Compiler does not recognize:
intnx("QUARTER", "¯o_date", &h, "E")
as a constant.
%macro macro_vs_datastep();
options fullstimer;
data input_dta;
length Date t 8;
do i = 1 to 100;
Date = intnx("MONTH", "01JAN1980"d, i, "E");
do t = 1 to 10000;
output;
end;
end;
run;
%let macro_date = 01JAN1980;
%let months_to_process = 100;
data data_step;
set input_dta;
%do h = 1 %to &months_to_process;
if Date = intnx("MONTH", "¯o_date"d, &h, "E") then output;
%end;
run;
data macro;
set input_dta;
%do h = 1 %to &months_to_process;
if Date = %sysfunc(intnx(MONTH, "¯o_date"d, &h, E)) then output;
%end;
run;
%mend;
%macro_vs_datastep();
Here's the log:
NOTE: There were 1000000 observations read from the data set WORK.INPUT_DTA.
NOTE: The data set WORK.DATA_STEP has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 18.73 seconds
user cpu time 18.70 seconds
system cpu time 0.03 seconds
memory 1497.43k
OS Memory 21272.00k
Timestamp 04/29/2024 05:05:39 PM
Step Count 59 Switch Count 9
NOTE: There were 1000000 observations read from the data set WORK.INPUT_DTA.
NOTE: The data set WORK.MACRO has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.23 seconds
user cpu time 0.21 seconds
system cpu time 0.01 seconds
memory 1195.31k
OS Memory 21088.00k
Timestamp 04/29/2024 05:05:40 PM
Step Count 60 Switch Count 5
This could be specific to intnx(), other functions might be optimized to recognize this behavior, but good to know!
Seems like you don't need to ask 100 questions but just one with end date of 100 months from start.
48 %let macro_date = 01JAN1980;
49 %let months_to_process = 100;
50
51 %let x = %sysfunc(intnx(MONTH,"¯o_date"d,100,E),date9);
52 %put &=x;
X=31MAY1988
@sjb1 I believe the difference in runtime you observe is solely due to the SAS datastep code your macro generates.
To easily see the macro generated code the SAS compiler will get:
filename mprint temp;
options fullstimer mprint mfile;
%macro_vs_datastep();
data _null_;
infile mprint;
input;
put _infile_;
run;
filename mprint clear;
With 5 loops (%let months_to_process = 5;) the code the macro generates is as below:
data input_dta;
length Date t 8;
do i = 1 to 100;
Date = intnx("MONTH", "01JAN1980"d, i, "E");
do t = 1 to 10000;
output;
end;
end;
run;
data data_step;
set input_dta;
if Date = intnx("MONTH", "01JAN1980"d, 1, "E") then output;
if Date = intnx("MONTH", "01JAN1980"d, 2, "E") then output;
if Date = intnx("MONTH", "01JAN1980"d, 3, "E") then output;
if Date = intnx("MONTH", "01JAN1980"d, 4, "E") then output;
if Date = intnx("MONTH", "01JAN1980"d, 5, "E") then output;
run;
data macro;
set input_dta;
if Date = 7364 then output;
if Date = 7395 then output;
if Date = 7425 then output;
if Date = 7456 then output;
if Date = 7486 then output;
run;
The step "data_step" needs to execute the intnx() function once per loop while step "macro" only needs to do a simple comparison per loop. I believe that's the reason for the performance difference you observe.
As you mentioned already it does look like the intnx() function needs to execute for every single iteration of the data step and doesn't already resolve to a SAS date value during compilation time.
I guess if you would generate below data step code then the intnx() function would also only execute once.
if Date = %sysfunc(intnx(MONTH, "01JAN1980"d, 1, E)) then output;
...and of course both code versions would profit from change for a real implementation.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.