Zelazny7
Fluorite | Level 6

I am able to export a Gradient Boosted Machine (GBM) model from R into SAS if-then-else logic; an example of a typical tree is included below. GBMs often comprise thousands of trees, which translates to hundreds of thousands of lines of SAS code. What takes R seconds to score new data takes SAS upwards of twenty minutes on my i5 laptop. I understand this is not a fair comparison; however, I only have access to Base SAS and would like to tune my logic to make it quicker.

Is there anything I can do to the SAS code below to increase the speed? For clarification, each tree updates final_score as well as the variables prefixed with an underscore, which track each variable's running contribution to the score. Both elements are required.

/* Tree: 1 */
final_score + 0.0000063690;

if .z < VAR1 <= 79.5 then do;
    _VAR1 + (0.0001102467 - 0.0001102467);
    final_score + 0.0001102467;
    if .z < VAR2 <= 3.5 then do;
        _VAR2 + (0.0002477234 - 0.0002477234);
        final_score + 0.0002477234;
    end;
    else if .z < VAR2 > 3.5 then do;
        _VAR2 + (0.0002477234 - -0.0001439436);
        final_score + -0.0001439436;
    end;
    else if missing(VAR2) then do;
        _VAR2 + (0.0002477234 - 0.0000000000);
        final_score + 0.0000000000;
    end;
end;
else if .z < VAR1 > 79.5 then do;
    _VAR1 + (0.0001102467 - -0.0002722800);
    final_score + -0.0002722800;
end;
else if missing(VAR1) then do;
    _VAR1 + (0.0001102467 - 0.0000000000);
    final_score + 0.0000000000;
end;

13 REPLIES
Reeza
Super User

if .z < VAR2 > 3.5


What do you expect from this?

Zelazny7
Fluorite | Level 6

This:

data _null_;
    array TMP {3} _temporary_ (., 5, 2);
    do i = 1 to dim(TMP);
        if .z < TMP{i} > 3.5 then put TMP{i} "TRUE";
        else put TMP{i} "FALSE";
    end;
run;

Log:

. FALSE

5 TRUE

2 FALSE

Reeza
Super User

Some basics, not sure how much they'll help.

if .z < VAR2 > 3.5

is the same as VAR2 > 3.5 and runs slightly faster (about 15%): missing values, including the special missings up to .z, already sort below any number, so the extra lower bound adds nothing.

Second, look at your IF conditions and order them from most frequent to least frequent. For example, if you expect a lot of missing values, put that condition first.

Third, unless you're evaluating floating point accuracy, consider pre-calculating expressions such as (0.0001102467 - 0.0000000000) into single constants.
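
For instance, the rewrite could look like this (a sketch; the folded value is just the difference computed out by hand):

/* before */ _VAR2 + (0.0002477234 - -0.0001439436);
/* after  */ _VAR2 + 0.0003916670;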

Kurt_Bremser
Super User

Since you do have a specific action for missing values: if you put that check first, you can then drop all the comparisons against missing values (.z). That may speed up the code, but it surely makes it easier to read.

As Reeza said, fold calculated but essentially fixed values into one constant, and leave the original expression in a comment for clarification. I'm not sure the SAS data step compiler is able to do such optimizations on its own.

If you have already checked for VAR2 <= 3.5, and there is no other cutoff value after that, the comparison VAR2 > 3.5 is unnecessary and will just eat CPU cycles. A simple ELSE is enough.
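
Putting these suggestions together, Tree 1 might look like the sketch below. The folded constants are just the original differences computed out, and the zero increments are dropped on the assumption that each _VAR variable is also initialized or summed elsewhere.

/* Tree: 1 -- missing checks first, constants pre-computed,
   redundant comparisons replaced by plain ELSE */
final_score + 0.0000063690;

if missing(VAR1) then do;
    _VAR1 + 0.0001102467;            /* was (0.0001102467 - 0.0000000000) */
end;
else if VAR1 <= 79.5 then do;
    /* _VAR1 increment was zero here: dropped */
    final_score + 0.0001102467;
    if missing(VAR2) then do;
        _VAR2 + 0.0002477234;        /* was (0.0002477234 - 0.0000000000) */
    end;
    else if VAR2 <= 3.5 then do;
        /* _VAR2 increment was zero here: dropped */
        final_score + 0.0002477234;
    end;
    else do;                         /* VAR2 > 3.5 */
        _VAR2 + 0.0003916670;        /* was (0.0002477234 - -0.0001439436) */
        final_score + (-0.0001439436);
    end;
end;
else do;                             /* VAR1 > 79.5 */
    _VAR1 + 0.0003825267;            /* was (0.0001102467 - -0.0002722800) */
    final_score + (-0.0002722800);
end;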

jakarman
Barite | Level 11

You are talking about performance and tuning running code. That requires a good understanding of what is happening and where the bottlenecks are.
At first I was thinking of code issues. You can often choose among those. Reeza covered one; I was thinking of SELECT/WHEN constructs with the most likely condition at the top of the series, as described at http://www2.sas.com/proceedings/sugi30/002-30.pdf (a quick sketch follows).
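
For the posted tree, that idea could look like this (a sketch; whether VAR1 <= 79.5 really is the most frequent case is an assumption, and the _VAR1 updates are omitted for brevity):

select;
    when (.z < VAR1 <= 79.5) do;    /* assumed most frequent case first */
        final_score + 0.0001102467;
        /* ... VAR2 splits as in the original tree ... */
    end;
    when (VAR1 > 79.5) do;
        final_score + (-0.0002722800);
    end;
    otherwise do;                   /* VAR1 is missing */
        final_score + 0.0000000000;
    end;
end;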

Then, rereading your question: you have "scoring code" as the output of a mining model. It could be run on a few new observations or on many.
You are running it on many, reading a dataset. Reading the dataset involves I/O (input/output), and unless you have your DASD copied to an in-memory image (the R approach), that part is commonly the most time consuming. So I have these questions:

- What is the structure of the data you are inputting to SAS?

- How is your SAS I/O tuned?

- What do the individual SAS steps tell you about resource usage (check the SAS log)?

---->-- ja karman --<-----
LinusH
Tourmaline | Level 20

I don't think I/O is the bottleneck here: "...which translates to hundreds of thousands of lines of SAS code".

If I understand it correctly, this is some kind of standard transformation from R to Base SAS. Are there any parameters that can be adjusted that would have an impact on the generated code?

If not, I can't see how it's doable to manually optimize thousands of lines of SAS code each time you import R code.

So perhaps doing your scoring in R is your best alternative, and then doing the rest in SAS?

Data never sleeps
Kurt_Bremser
Super User

The problem with R is that it works only in memory (unless you use special extensions), where one usually runs into limits quickly when operating on large real-world datasets. Scoring is usually done on the complete customer base (or whatever the full population is). That's where SAS's ability to scale (almost) indefinitely comes into play.

If there are literally "hundreds of thousands" of equally structured lines of code, then an approach where the code is manipulated automatically along some predefined rules may be of great help. This could be done in a sophisticated data step or by employing standard text-manipulating tools like awk.

If the R source tree always looks the same and just the factors change, a customized filter to do the optimizations could lead to a completely automated process.
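
A minimal sketch of such a filter as a data step, assuming the generated file puts each increment on its own line in exactly the form shown in the original post (the file names are placeholders):

/* Hypothetical post-processor: fold the "(a - b)" increments in the
   generated scoring code into single constants, keeping the original
   expression as a trailing comment. */
data _null_;
    infile 'gbm_score.sas' truncover;
    file 'gbm_score_opt.sas';
    length line $256 expr $64 vname $32 cfold $32;
    input line $char256.;
    if index(line, '+ (') > 0 and index(line, ');') > 0 then do;
        expr  = scan(line, 2, '()');                /* "a - b"         */
        a     = input(scan(expr, 1, ' '), best32.);
        b     = input(scan(expr, 3, ' '), best32.);
        cfold = strip(put(a - b, best16.));         /* folded constant */
        vname = scan(line, 1, ' ');                 /* e.g. _VAR1      */
        put '     ' vname '+ ' cfold '; /* was ' expr '*/';
    end;
    else put line $char256.;                        /* pass through    */
run;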

jakarman
Barite | Level 11

Linus, that is indeed something to think about. Knowing Eminer, you can run a simple scoring calculation that exports to a RDBMS but also one that is first doing a complete transformation of all kind of data in several steps. Would be nice if there was a R language to SAS conversion but I have doubt on that.

---->-- ja karman --<-----
data_null__
Jade | Level 19

You have tens of thousands of lines of SAS code from the R application. I expect a good portion of the 25 minutes is consumed compiling the data step. You can verify this by creating a stored compiled data step program. Run this with OPTIONS FULLSTIMER;

If my suspicion is true, and if your data step is needed more than once, you could gain efficiency by compiling the program first. Of course, if the giant program is only needed once, nothing is gained.

Also, can you show an example of the input data? Is it VAR1-VAR(tens of thousands)? Very wide data is also time consuming to read and may be an issue depending on where it is stored.

You might also investigate running this giant program with DS2, but I would look at compile time first.
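
A sketch of the stored compiled program approach (the library, dataset, and file names are placeholders):

/* Compile once and store the compiled step; with PGM= the step is
   compiled but not executed. "mylib" is a placeholder permanent library. */
data scored / pgm=mylib.score_pgm;
    set skinny;                    /* the skinnied modeling dataset */
    %include 'gbm_score.sas';      /* the generated tree code       */
run;

/* Later runs: execute the stored program without recompiling.
   REDIRECT INPUT/OUTPUT statements could point it at other datasets. */
options fullstimer;                /* detailed timings in the log   */
data pgm=mylib.score_pgm;
run;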

Zelazny7
Fluorite | Level 6

Responding to as many points as possible:

  1. I wrote the R translation function myself, so I can easily tweak what is written to the .sas file. I already plan to incorporate many of Reeza's suggestions.
  2. My workflow looks like this:
    1. Start with a very wide SAS dataset (250K obs by 1200 columns).
    2. Apply a SAS data step that transforms this data into roughly 150 all-numeric columns for easier use in R.
    3. Build the model in R on this skinnied dataset and translate the model into SAS code.
    4. Using the data step from step 2.2 on the original wide dataset: repeat the transformation AND %include the tree code (see the sketch after this list).
  3. The trees I am running currently are mostly for testing purposes, but once the final model is chosen, it will need to be run many times. Therefore, the compilation step could be very helpful.
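
As a sketch, step 2.4 looks roughly like this (dataset and file names are placeholders):

/* Score: re-apply the 1200 -> ~150 column transformation to the wide
   dataset, then pull in the generated tree code. */
data scored;
    set wide1200;                 /* 250K obs x 1200 columns            */
    /* ... transformation logic from step 2.2 ... */
    %include 'gbm_trees.sas';     /* the exported GBM if/then/else code */
run;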
Rick_SAS
SAS Super FREQ

In step 2.4, why not use the data set created in Step 2.2 that has 150 vars instead of the original data that has 1200 vars?

Zelazny7
Fluorite | Level 6

Thanks, that's a fairly obvious place for speed improvement. In production, I will have to apply both steps to the 1200 variable dataset. I suppose for testing I can use the much skinnier version.

jakarman
Barite | Level 11

It is a pity you have just Base SAS and not Miner. The whole process could be an Enterprise Miner project.

Getting Started with SAS(R) Enterprise Miner(TM) 7.1 (the latest EM release, 13.1, is not very different).

---->-- ja karman --<-----
