I am able to export a Gradient Boosted Machine (GBM) model from R into SAS if-then-else logic. An example of a typical tree is included below. GBMs often comprise thousands of trees, which translates to hundreds of thousands of lines of SAS code. What takes R seconds to score new data takes SAS upwards of twenty minutes on my i5 laptop. I understand this is not a fair comparison; however, I only have access to Base SAS and would like to tune my logic to make it quicker.
Is there anything I can do to the SAS code below to increase the speed? For clarification, each tree updates final_score as well as any variable prefixed with an underscore ('_'), which tracks that variable's running contribution to the score. Both elements are required.
/* Tree: 1 */
final_score + 0.0000063690;
if .z < VAR1 <= 79.5 then do;
    _VAR1 + (0.0001102467 - 0.0001102467);
    final_score + 0.0001102467;
    if .z < VAR2 <= 3.5 then do;
        _VAR2 + (0.0002477234 - 0.0002477234);
        final_score + 0.0002477234;
    end; else
    if .z < VAR2 > 3.5 then do;
        _VAR2 + (0.0002477234 - -0.0001439436);
        final_score + -0.0001439436;
    end; else
    if missing(VAR2) then do;
        _VAR2 + (0.0002477234 - 0.0000000000);
        final_score + 0.0000000000;
    end;
end; else
if .z < VAR1 > 79.5 then do;
    _VAR1 + (0.0001102467 - -0.0002722800);
    final_score + -0.0002722800;
end; else
if missing(VAR1) then do;
    _VAR1 + (0.0001102467 - 0.0000000000);
    final_score + 0.0000000000;
end;
if .z < VAR2 > 3.5
What do you expect from this?
This:

data _NULL_;
    array TMP {3} _TEMPORARY_ (., 5, 2);
    do i = 1 to dim(TMP);
        if .z < TMP{i} > 3.5 then put TMP{i} "TRUE"; else
        put TMP{i} "FALSE";
    end;
run;
Log:
. FALSE
5 TRUE
2 FALSE
Some basics, not sure how much they'll help.
First:
if .z < VAR2 > 3.5
is the same as VAR2 > 3.5 (anything greater than 3.5 is already greater than .z) and operates slightly faster (about 15%).
Second, look at your if conditions and order them from most frequent to least frequent. For example, if you expect a lot of missing values, put that condition first.
Third, unless you're evaluating floating-point accuracy, consider pre-calculating constant expressions such as 0.0001102467 - 0.0000000000.
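For example, the running-contribution updates in Tree 1 could be folded by hand, keeping the original arithmetic in a comment (a sketch; the constants are taken from the tree above):

/* before */
_VAR2 + (0.0002477234 - -0.0001439436);

/* after */
_VAR2 + 0.0003916670; /* was (0.0002477234 - -0.0001439436) */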
Since you do have a specific action for missing values: if you test for them first, you can then drop all the comparisons against .z. That may speed up the code, and it certainly makes it easier to read.
As Reeza said, collapse calculated but essentially fixed values into a single constant, and leave the original expression in a comment for clarification. I'm not sure whether the SAS data step compiler is able to do such optimizations on its own.
If you already checked for VAR2 <= 3.5, and there is no other cutoff value after that, the comparison VAR2 > 3.5 is unnecessary and will probably eat CPU cycles. A simple "else" is enough.
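Putting these suggestions together, Tree 1 might be rewritten along these lines (a sketch only, not validated against the full generated code; the zero-increment sum statements are dropped because _VAR1 and _VAR2 are still retained by the sum statements in the other branches):

/* Tree: 1, restructured: missing() tested first, plain ELSE in
   place of the redundant range tests, constants pre-folded */
final_score + 0.0000063690;
if missing(VAR1) then
    _VAR1 + 0.0001102467;          /* was (0.0001102467 - 0) */
else if VAR1 <= 79.5 then do;
    final_score + 0.0001102467;
    if missing(VAR2) then
        _VAR2 + 0.0002477234;      /* was (0.0002477234 - 0) */
    else if VAR2 <= 3.5 then
        final_score + 0.0002477234;
    else do;                       /* VAR2 > 3.5 */
        _VAR2 + 0.0003916670;      /* was (0.0002477234 - -0.0001439436) */
        final_score + -0.0001439436;
    end;
end;
else do;                           /* VAR1 > 79.5 */
    _VAR1 + 0.0003825267;          /* was (0.0001102467 - -0.0002722800) */
    final_score + -0.0002722800;
end;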
You are talking about performance and tuning running code. That requires a good understanding of what is happening and where the bottlenecks are.
At first I was thinking of code issues; you often have choices there. Reeza covered one; I was thinking of SELECT/WHEN constructs with the most likely conditions first in the series, as described at: http://www2.sas.com/proceedings/sugi30/002-30.pdf.
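For illustration, the outer split of Tree 1 as SELECT/WHEN might look like this (a sketch; the missing() test must stay ahead of the <= test because missing values compare lower than any number in SAS, but the remaining WHEN clauses can be ordered by frequency):

select;
    when (missing(VAR1)) _VAR1 + 0.0001102467;
    when (VAR1 <= 79.5) do;
        final_score + 0.0001102467;
        /* ... nested VAR2 split ... */
    end;
    otherwise do;                  /* VAR1 > 79.5 */
        _VAR1 + 0.0003825267;
        final_score + -0.0002722800;
    end;
end;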
Then I reread your question. You have "scoring code" generated as the result of a mining model; it could be run on a few new observations or on many.
You are trying to do the latter, reading a dataset. Reading the dataset involves I/O (input/output); unless you have your DASD copied to an in-memory image (the R approach), this part is commonly the most time-consuming. So I have these questions:
- what is the structure of the data you are inputting to SAS?
- how is your SAS I/O tuned?
- what do the steps of SAS processing tell you about resource usage (in the SAS log)?
I don't think I/O is the bottleneck here: "...which translates to hundreds of thousands of lines of SAS code".
If I understand it correctly, this is some kind of standard transformation from R to Base SAS. Are there any parameters that can be adjusted that would have an impact on the generated code?
If not, I can't see how it's feasible to manually optimize thousands of lines of SAS code each time you import R code.
So perhaps doing your scoring in R is your best alternative, and then doing the rest in SAS?
The problem with R is that it works only in memory (unless you use special extensions), where one usually runs into limits quickly when operating on large real-world datasets. Scoring is usually done on the complete customer base (or whatever the population is). That's where SAS's ability to scale (almost) indefinitely comes into play.
If there are literally "hundreds of thousands" of equally structured lines of code, then an approach where the code is manipulated automatically along some predefined rules may be of great help. This could be done in a sophisticated data step or by employing standard text-manipulating tools like awk.
If the R source tree always looks the same and just the factors change, a customized filter to do the optimizations could lead to a completely automated process.
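A minimal sketch of such a filter as a Base SAS data step (the file names are made up, and it assumes the missing() branches have already been moved first so the .z lower bounds really are redundant):

data _null_;
    infile 'gbm_score.sas' truncover;
    file 'gbm_score_tuned.sas';
    input;
    /* TRANSTRN with TRIMN('') deletes the matched text outright */
    _infile_ = transtrn(_infile_, '.z < ', trimn(''));
    put _infile_;
run;

More rewrite rules can be added as further assignment statements in the same step.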
Linus, that is indeed something to think about. Knowing Eminer, you can run a simple scoring calculation that exports to an RDBMS, but also one that first does a complete transformation of all kinds of data in several steps. It would be nice if there were an R-to-SAS language conversion, but I have my doubts about that.
You have tens of thousands of lines of SAS code from the R application. I expect a good portion of the 25 minutes is consumed compiling the data step. You can verify this by creating a stored compiled data step program. Run this with OPTIONS FULLSTIMER;
If my suspicion is true, and if your data step is needed more than once, you could gain efficiency by compiling the program first. Of course, if the giant program is only needed once, then nothing is gained.
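A sketch of that test, with made-up names (a permanent library would be needed for the stored program to survive across sessions; REDIRECT can point the compiled step at a different input data set later):

options fullstimer;

/* compile once and store the program */
data scored / pgm=mylib.gbmscore;
    set indata;
    /* ...the hundreds of thousands of generated lines... */
run;

/* later: execute the stored program without recompiling */
data pgm=mylib.gbmscore;
    redirect input indata=newdata;  /* optional: score a different data set */
run;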
Also, can you show an example of the input data? Is it VAR1-VAR(tens of thousands)? Very wide data is also time-consuming to read and may be an issue depending on where it is stored.
You might also investigate running this giant program using DS2, but I would look at compile time first.
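For reference, the rough DS2 shape would be something like this (an untested sketch with placeholder names):

proc ds2;
data work.scored / overwrite=yes;
    method run();
        set work.indata;
        /* generated scoring logic goes here */
    end;
enddata;
run;
quit;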
Responding to as many points as possible:
In step 2.4, why not use the data set created in Step 2.2 that has 150 vars instead of the original data that has 1200 vars?
Thanks, that's a fairly obvious place for a speed improvement. In production, I will have to apply both steps to the 1200-variable dataset. I suppose for testing I can use the much skinnier version.
It is a pity you have just Base SAS and not Enterprise Miner. The whole process could be an Enterprise Miner project.
Getting Started with SAS(R) Enterprise Miner(TM) 7.1 (the latest release of EM is 13.1, not very different).