Tip: How Can I Apply HP Forest Score Code to Distributed Data with SAS® Enterprise Miner™ 13.2?

by SAS Super FREQ on ‎09-19-2014 10:37 AM - edited on ‎10-06-2015 01:35 PM by Community Manager

The HP Forest node in SAS® Enterprise Miner™ creates a forest predictive model -- an ensemble of many, potentially hundreds of, decision trees -- using the HPFOREST procedure. To overcome the challenges associated with capturing the decision rules from so many trees in SAS DATA step code, the HP4SCORE procedure is used to score new data with the forest model that was trained by PROC HPFOREST.  This non-DATA-step score code, however, presents its own complications when aggregated with the code for an entire process flow diagram in order to score new data stored in a distributed environment such as a database.  SAS Enterprise Miner 13.2 supports an approach that circumvents these issues and enables a forest model to be applied to distributed data to generate predictions.

 

When a process flow diagram in SAS Enterprise Miner includes a High-Performance Data Mining (HPDM) modeling node such as HP GLM, HP Neural, HP Regression, and/or HP Tree, applying the model’s score code to data that resides in a database is supported by SAS® Scoring Accelerator, either directly or via SAS® Model Manager.  There are several ways you can use a Score node to accomplish this when you attach it to the end of a flow that contains one or more of these HPDM modeling nodes:

 

  • Create a model package from the Score node and register it to the SAS Metadata Repository to be integrated into SAS Model Manager, which then publishes models to the database with SAS Scoring Accelerator.

 

  • Connect the Score node to a Register Model node (available in releases 13.1 and later) to register the model to the SAS Metadata Repository to be integrated into SAS Model Manager, which then publishes models to the database with SAS Scoring Accelerator.

 

  • Connect the Score node to a Score Code Export node to export to a user-specified directory the score files that SAS Scoring Accelerator can use directly to score data in the database.

 

See the SAS Global Forum 2013 paper "Time is Precious, So are Your Models: SAS® Provides Solutions to Streamline Deployment" (Wexler et al.) for more information.

 

While a model from the HP Forest node can be registered to the SAS Metadata Repository, deploying that model to data in a database with SAS Scoring Accelerator is not supported due to its non-DATA-step score code.  With version 13.2 of SAS Enterprise Miner, an alternative approach is available so a forest model can be used to score distributed data.  The score code can include the code from other HPDM nodes in the process flow diagram as well, for example an HP Impute node that occurs in the flow before the HP Forest node.

 

Here are the steps to score new data in a database or other distributed environment with score code from a process flow diagram that includes the HP Forest node:

 

1.    To define the macro variables and librefs that are needed for both training and scoring distributed data, run the following lines of code in your Project Start Code, replacing the placeholder values (such as your-input-libref, your-engine, your-server, XXXXX, and YYYYY) with your own information. Note that you may not have write access to the folder in the database where the data to be scored resides.  In that case, you can define a second library in the database (emwrklib below) to store the final scored data and the intermediate, temporary files that are created when scoring.

 

/********************************************************************/
/* Code to define the libref and options for training data in a    */
/* database                                                        */
/********************************************************************/

/* Library assignment for location of input training and score data */
libname your-input-libref your-engine
   server   = 'your-server'
   user     = XXXXX
   schema   = your-schema-name
   password = YYYYY
   database = your-db;

/* HPA Gridoptions */
option set = GRIDHOST       = "your-gridhost";
option set = GRIDINSTALLLOC = "your-gridinstallloc";

/********************************************************************/
/* Code to add for scoring data in a database                      */
/********************************************************************/

/* Library assignment for location of final scored data            */
/*  - if different from above library folder                       */
libname emwrklib your-engine
   server   = 'your-server'
   user     = XXXXX
   schema   = your-schema-name
   password = YYYYY
   database = your-db;

/* Macro variables to set for scoring distributed data */
%let _MM_InputDS   = your-input-libref.your-data-to-score;
%let _MM_OutputDS  = emwrklib.name-for-scored-data;
%let _MM_OutputLib = emwrklib;

 

2.    To create HP Forest score code that can be applied to data in a distributed environment, you must train the forest model with data in a distributed environment.  Create a process flow diagram for modeling your data that includes: an Input Data node representing the (distributed) training data, other HPDM nodes for partitioning, modifying, and/or exploring your data (optional), and the HP Forest node.  Run the flow, assess the model, and make changes and re-run as needed to obtain a satisfactory forest model for your data.

 

3.    Connect the HP Forest node to a Score node and run the Score node.  Note that the Score node only performs scoring on a local sample of the data.

 

4.    Find a file named HPDMSCORECODE.sas located in the folder corresponding to the Score node in the EMWS workspace directory for the diagram containing your flow.  The path to this folder is in the following form:

 

SAS-Server-Directory/Project-Name/Workspaces/EMWSx/Score-node-ID/

 

 

5.    Copy the contents of the HPDMSCORECODE.sas file into the SAS Enterprise Miner Program Editor, and run the code.  The code from this file is in the correct form to score data in the database once the above librefs and macro variables have been defined.
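
Before running the copied code, it can help to confirm that the librefs and macro variables from step 1 resolved as expected. The following is a minimal sketch, not part of the score code that SAS Enterprise Miner generates. It assumes the Project Start Code above has already run in the same session, and it uses only the macro variable names defined there.

```sas
/* Pre-flight check (illustrative; not produced by SAS Enterprise Miner). */
/* Assumes _MM_InputDS, _MM_OutputDS, and _MM_OutputLib are already set   */
/* by the Project Start Code.                                             */

/* Echo the resolved values to the log */
%put NOTE: _MM_InputDS   = &_MM_InputDS;
%put NOTE: _MM_OutputDS  = &_MM_OutputDS;
%put NOTE: _MM_OutputLib = &_MM_OutputLib;

/* LIBREF returns 0 when the libref is currently assigned */
%put NOTE: LIBREF return code for output library: %sysfunc(libref(&_MM_OutputLib));

/* EXIST returns 1 when the table to be scored can be found */
%put NOTE: Input table found (1 means yes): %sysfunc(exist(&_MM_InputDS));
```

If the LIBREF return code is nonzero or the input table is not found, recheck the libname statements and the table name in the Project Start Code before running the score code.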

 

After completing these steps, the data set name-for-scored-data is created in the emwrklib library in the database and contains your original data along with new columns created by the scoring code, namely the predicted values from the forest model for the target of interest.  Hooray!
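
One way to confirm the result is to inspect the scored table directly. This is a sketch under the same assumptions as above: the macro variable _MM_OutputDS still points to the scored table, and your database user can read it. The names of the prediction columns depend on your target variable, so they are not assumed here.

```sas
/* Post-scoring check (illustrative). */

/* List the columns in creation order; the columns added by the */
/* score code should appear alongside the original columns.     */
proc contents data=&_MM_OutputDS varnum;
run;

/* Count the scored rows to compare against the input table */
proc sql;
   select count(*) as n_scored_rows
   from &_MM_OutputDS;
quit;
```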

 

Though differing from the conventional way of scoring distributed data with a model from an HPDM node using SAS Scoring Accelerator, these steps make it possible to deploy your forest model to a database.  And by taking advantage of distributed computing without having to move your data, the time it takes to obtain the results you need is minimized.
