I am able to export a Gradient Boosted Machine (GBM) model from R into SAS if-then-else logic. An example of a typical tree is included below. GBMs often comprise thousands of trees, which translates to hundreds of thousands of lines of SAS code. What takes R seconds to score new data takes SAS upwards of twenty minutes on my i5 laptop. I understand this is not a fair comparison; however, I only have access to Base SAS and would like to tune my logic to make it quicker.
Is there anything I can do to the SAS code below to increase the speed? For clarification, each tree updates final_score as well as any variables prefixed with an '_' which represent their running contribution to the score. Both elements are required.
/* Tree: 1 */
final_score + 0.0000063690;
if .z < VAR1 <= 79.5 then do;
  _VAR1 + (0.0001102467 - 0.0001102467);
  final_score + 0.0001102467;
  if .z < VAR2 <= 3.5 then do;
    _VAR2 + (0.0002477234 - 0.0002477234);
    final_score + 0.0002477234;
  end; else
  if .z < VAR2 > 3.5 then do;
    _VAR2 + (0.0002477234 - -0.0001439436);
    final_score + -0.0001439436;
  end; else
  if missing(VAR2) then do;
    _VAR2 + (0.0002477234 - 0.0000000000);
    final_score + 0.0000000000;
  end;
end; else
if .z < VAR1 > 79.5 then do;
  _VAR1 + (0.0001102467 - -0.0002722800);
  final_score + -0.0002722800;
end; else
if missing(VAR1) then do;
  _VAR1 + (0.0001102467 - 0.0000000000);
  final_score + 0.0000000000;
end;
if .z < VAR2 > 3.5
What do you expect from this?
This:
data _NULL_;
  array TMP {3} _TEMPORARY_ (., 5, 2);
  do i = 1 to dim(TMP);
    if .z < TMP[i] > 3.5 then put TMP[i] "TRUE";
    else put TMP[i] "FALSE";
  end;
run;
Log:
. FALSE
5 TRUE
2 FALSE
Some basics, not sure how much they'll help.
First,
if .z < VAR2 > 3.5
is the same as VAR2 > 3.5, because a missing value never compares greater than a number. The simpler form runs slightly faster (about 15%).
Second, look at your if conditions and order them from most to least frequent. For example, if you expect a lot of missing values, put that check first.
Third, unless you're evaluating floating-point accuracy, consider pre-calculating expressions such as 0.0001102467 - 0.0000000000.
Since you do have a specific action for missing values: if you put that first, you can then drop all comparisons to missing values (.z). That may speed up the code, but it surely makes it easier to read.
As Reeza said, put calculated, but basically fixed values into one constant, and leave the original code in a comment for clarification. I'm not sure if the SAS data step compiler is able to do such optimizations on its own.
If you already checked for VAR2 <= 3.5, and there is no other breakoff value after that, the comparison VAR2 > 3.5 is unnecessary and will probably eat CPU cycles. A simple "else" is enough.
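The suggestions so far (missing check first, a plain else for the last branch, pre-calculated constants) could be combined roughly as follows for Tree 1. This is only a sketch: the data set names `have` and `scored` and the test rows are made up for illustration, and the per-row reset of `final_score`, `_VAR1` and `_VAR2` is an assumption about how the surrounding step is written. Statements that add exactly zero are dropped, and the folded constants may differ from the run-time subtraction in the last floating-point bits.

```sas
/* Hypothetical test data; VAR1/VAR2 follow the names in the question. */
data have;
  input VAR1 VAR2;
  datalines;
50 2
50 5
50 .
90 1
. 1
;
run;

data scored;
  set have;
  /* Sum statements retain their values across rows, so reset per row
     (assumption: the original step does something equivalent). */
  final_score = 0; _VAR1 = 0; _VAR2 = 0;

  /* Tree: 1 */
  final_score + 0.0000063690;
  if missing(VAR1) then do;
    _VAR1 + 0.0001102467;          /* 0.0001102467 - 0.0000000000 */
  end;
  else if VAR1 <= 79.5 then do;
    /* _VAR1 + 0 dropped:             0.0001102467 - 0.0001102467 */
    final_score + 0.0001102467;
    if missing(VAR2) then do;
      _VAR2 + 0.0002477234;        /* 0.0002477234 - 0.0000000000 */
    end;
    else if VAR2 <= 3.5 then do;
      /* _VAR2 + 0 dropped:           0.0002477234 - 0.0002477234 */
      final_score + 0.0002477234;
    end;
    else do;                       /* VAR2 > 3.5, no test needed  */
      _VAR2 + 0.0003916670;        /* 0.0002477234 - -0.0001439436 */
      final_score + -0.0001439436;
    end;
  end;
  else do;                         /* VAR1 > 79.5, no test needed */
    _VAR1 + 0.0003825267;          /* 0.0001102467 - -0.0002722800 */
    final_score + -0.0002722800;
  end;
run;
```

Because missing values are handled first, the `.z <` comparisons disappear entirely. Whether missing really is the most frequent case depends on the data, so the order of the first two branches may need to be swapped.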
You are talking about performance and tuning running code. That requires a good understanding of what is happening and where the bottlenecks are.
At first I was thinking about coding issues, where you often have a choice. Reeza mentioned one; I was thinking of SELECT-WHEN constructs with the most likely conditions placed first, as described at: http://www2.sas.com/proceedings/sugi30/002-30.pdf.
Then I reread your question. You have "scoring code" as the result of a mining model. It could be run on a few new observations or on many.
You are trying to run it on many observations read from a dataset. Reading the dataset involves I/O (input/output), and unless the data has been copied from disk into an in-memory image (the R approach), this part is commonly the most time-consuming. So I have these questions:
- What is the structure of the data you are inputting to SAS?
- How is your SAS I/O tuned?
What do the SAS processing steps tell you about resource usage (in the SAS log)?
I don't think I/O is the bottleneck here: "...which translates to hundreds of thousands of lines of SAS code".
If I understand it correctly, this is some kind of standard transformation from R to Base SAS. Are there any parameters that can be adjusted that will affect the generated code?
If not, I can't see that it's feasible to manually optimize thousands of lines of SAS code each time you import R code.
So, perhaps doing your scoring in R is your best alternative, and then do the rest in SAS?
The problem with R is that it works only in memory (unless you use special extensions), where one usually runs into limits quickly when operating with large real-world datasets. Scoring is usually done on the complete customerbase/whatever. That's where SAS's ability to scale (almost) indefinitely comes into play.
If there are literally "hundreds of thousands" of identically structured lines of code, then an approach where the code is transformed automatically according to some predefined rules may be of great help. This could be done in a sophisticated data step or with standard text-manipulation tools like awk.
If the R source tree always looks the same and just the factors change, a customized filter to do the optimizations could lead to a completely automated process.
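As a rough illustration of such an automated filter written as a data step, the sketch below rewrites every `.z < VARn >` test to a plain `VARn >` test. This is safe because a missing value is never greater than a number. The filerefs `raw` and `fast` and the three sample lines are hypothetical stand-ins for the real generated file, and the regular expression assumes the exported code always has exactly the shape shown in the question.

```sas
filename raw  temp;   /* hypothetical: stands in for the exported code  */
filename fast temp;   /* hypothetical: receives the rewritten code      */

/* Write a few sample lines so the sketch is self-contained. */
data _null_;
  file raw;
  put "if .z < VAR2 > 3.5 then do;";
  put "_VAR2 + (0.0002477234 - -0.0001439436);";
  put "end;";
run;

/* Read the generated code line by line and simplify the comparisons. */
data _null_;
  infile raw truncover;
  file fast;
  length line $256;
  input;                                  /* load the line into _INFILE_ */
  /* ".z < VARn >" becomes "VARn >"; "<=" tests are left untouched,
     because those still need the missing-value guard.                 */
  line = prxchange('s/\.z\s*<\s*(\w+)\s*>/$1 >/', -1, _infile_);
  put line;
run;

/* Echo the rewritten code to the log for inspection. */
data _null_;
  infile fast;
  input;
  put _infile_;
run;
```

The same pattern could be extended with further rules (for example, folding constant subtractions), but each rule should be checked against the scores from the unmodified code.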
Linus, that is indeed something to think about. Knowing Enterprise Miner, you can run a simple scoring calculation that exports to an RDBMS, but also one that first does a complete transformation of all kinds of data in several steps. It would be nice if there were an R-to-SAS language conversion, but I have my doubts about that.
You have tens of thousands of lines of SAS code from the R application. I expect a good portion of the 25 minutes is consumed compiling the data step. You can verify this by creating a stored compiled data step program. Run this with options FULLSTIMER;
If my suspicion is true and your data steps are needed more than once, you could gain efficiency by compiling the program first. Of course, if the giant program is only needed once, then nothing is gained.
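A minimal sketch of that check, assuming a tiny made-up input table `have` and a stored program name `work.gbm_score` chosen for illustration. The first data step only compiles and stores the program, so its log timings should reflect compilation; the second step executes the stored program without recompiling it.

```sas
options fullstimer;   /* detailed resource statistics in the log */

/* Hypothetical input; variable names follow the question. */
data have;
  input VAR1 VAR2;
  datalines;
50 2
90 1
. 5
;
run;

/* Step A: compile and store only; the program is not executed here. */
data scored / pgm=work.gbm_score;
  set have;
  final_score = 0;
  /* ...the generated tree logic would go here... */
  final_score + 0.0000063690;
  if VAR1 > 79.5 then final_score + -0.0002722800;
run;

/* Step B: execute the stored program; no recompilation takes place. */
data pgm=work.gbm_score;
run;
```

Comparing the two log entries should show how the total run time splits between compiling and scoring. For repeated production use the program would be stored in a permanent library rather than WORK.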
Also, can you show an example of the input data? Is it VAR1-VAR(tens of thousands)? Very wide data is also time-consuming to read and may be an issue depending on where it is stored.
You might also investigate running this giant program using DS2, but I would look at compile time first.
Responding to as many points as possible:
In step 2.4, why not use the data set created in Step 2.2 that has 150 vars instead of the original data that has 1200 vars?
Thanks, that's a fairly obvious place for speed improvement. In production, I will have to apply both steps to the 1200 variable dataset. I suppose for testing I can use the much skinnier version.
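One possible way to get the benefit of a skinnier table even when the source stays wide is the KEEP= data set option on the SET statement, so that only the variables the trees use are read into the step. The sketch below uses a made-up four-column table `wide` in place of the real 1200-variable data set, and assumes the model needs only VAR1 and VAR2.

```sas
/* Hypothetical wide table; OTHER1/OTHER2 stand in for unused columns. */
data wide;
  input VAR1 VAR2 OTHER1 OTHER2;
  datalines;
50 2 0 0
90 1 0 0
;
run;

data scored;
  /* Only the variables named in KEEP= are brought into the step. */
  set wide(keep=VAR1 VAR2);
  final_score = 0;
  /* ...the generated tree logic would go here... */
  final_score + 0.0000063690;
run;
```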
It is a pity you have just Base SAS and not Enterprise Miner. The whole process could be an Enterprise Miner project.
Getting Started with SAS(R) Enterprise Miner(TM) 7.1 (the latest release, EM 13.1, is not very different).