This is the second tip on scoring; the earlier tip outlined the types of score code generated by high-performance data mining (HPDM) nodes in SAS Enterprise Miner and how to deploy them into production. This tip introduces a score code file type called Analytic Store, or ASTORE, that is new in SAS 9.4M3 and SAS Enterprise Miner 14.1. This file type supports the scoring of complex models such as Forest and SVM.
The ASTORE is a binary file that contains the state of an analytic object. It is transportable: the file can be produced on one host and consumed on others without a traditional SAS export or import. This universal and compact file format was developed to overcome the shortcomings of encoding a large number of rules from hundreds of models (typical in ensembles) into SAS DATA step code. It can also be used to score models in a distributed or in-database environment with SAS Scoring Accelerator for Hadoop, Teradata, and SAP HANA.
ASTORE score file can be generated by:
In SAS Enterprise Miner 14.1, these files are created in the project directory when HP SVM and HP Forest nodes run in SMP (Symmetric Multiprocessing) or MPP (Massively Parallel Processing) mode.
This section describes the steps involved in publishing and scoring a data set in the distributed Hadoop environment using the ASTORE files. For this exercise, the files from SAS Enterprise Miner are used. Note that the ASTORE files contain the score code of a single model, whether it be from HP SVM or HP FOREST node -- in other words, they do not include score code from all preceding nodes in a SAS Enterprise Miner process flow diagram.
The ASTORE files score.sasast and score.sas created by SAS Enterprise Miner are located in the <EM_Project>\Workspaces\EMWSx\HPDMForestx folder for the HP Forest node and in the <EM_Project>\Workspaces\EMWSx\HPSVMx folder for the HP SVM node.
Before proceeding, make sure that the SAS server that acts as the client has licenses for SAS Scoring Accelerator for Hadoop and SAS/ACCESS for Hadoop, and that the In-Database deployment package is installed on the Hadoop environment where the data to be scored resides. If the client SAS server is not the one that SAS Enterprise Miner uses, move the ASTORE files to the client accordingly. This example was tested with the client SAS server on Windows and the Hadoop environment on Linux.
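Because the ASTORE files may have to be copied between servers, a quick pre-flight check on the client can catch a missing file before publishing. The following is a minimal sketch, not part of SAS Scoring Accelerator: the macro name check_astore is hypothetical, and the directory c:\astore_dir is an assumed location that you would replace with your own.

```sas
/* Assumed location of the copied ASTORE files; adjust to your setup */
%let astorepath=%str(c:\astore_dir);

/* Hypothetical helper: report whether both files written by */
/* SAS Enterprise Miner are present in the given directory   */
%macro check_astore(dir);
  %if %sysfunc(fileexist(&dir\score.sasast)) %then
    %put NOTE: Found &dir\score.sasast;
  %else
    %put ERROR: Missing &dir\score.sasast;

  %if %sysfunc(fileexist(&dir\score.sas)) %then
    %put NOTE: Found &dir\score.sas;
  %else
    %put ERROR: Missing &dir\score.sas;
%mend check_astore;

%check_astore(&astorepath);
```

The results appear in the SAS log; an ERROR line indicates that the corresponding file still needs to be copied to the client.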
Step 1: Load data to score
The first step is to load a local data set (C:\scoredata\DONOR_SCORE_DATA.sas7bdat) for scoring onto Hadoop using SAS/ACCESS as shown below. The HDFS_TEMPDIR, HDFS_DATADIR, and HDFS_METADIR options on the HADOOP LIBNAME statement are necessary to create the metadata file that is needed in the next step. Before executing the following code on the client SAS server, create the tempdir, datadir, and metadir directories on the Hadoop name node, and update the respective macro variables along with SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH.
/* Setup macro variables */
%let tempdir=%str(/user/xxxx/temp);
%let datadir=%str(/user/xxxx/data);
%let metadir=%str(/user/xxxx/meta);

options set=SAS_HADOOP_JAR_PATH="\\hdp\u\admin\hadoopjars\cdh53d";
options set=SAS_HADOOP_CONFIG_PATH="\\hdp\u\admin\clusters\cdh53d1";

/* Hadoop libname */
libname myhdlib HADOOP
  server="server1"
  user=xxxx
  pw=xxxx
  HDFS_TEMPDIR="&tempdir"
  HDFS_DATADIR="&datadir"
  HDFS_METADIR="&metadir"
  DBCREATE_TABLE_EXTERNAL=NO
;

/* Libname for accessing the score data set */
libname eminput "C:\scoredata";

/* Create the input scoring data set in HDFS */
data myhdlib.donor_score_data;
  set eminput.donor_score_data;
run;
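Before moving on to publishing, it can help to confirm that the table actually landed on Hadoop. The sketch below assumes that the myhdlib and eminput librefs from the code above are still assigned; it uses only standard procedures and is an optional check rather than a required step.

```sas
/* List the columns of the Hadoop table created in Step 1 */
proc contents data=myhdlib.donor_score_data;
run;

/* Compare row counts between the local and the Hadoop copies */
proc sql;
  select count(*) as local_rows  from eminput.donor_score_data;
  select count(*) as hadoop_rows from myhdlib.donor_score_data;
quit;
```

If the two counts differ, or if PROC CONTENTS reports that the table does not exist, review the LIBNAME options and the HDFS directories before publishing the model.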
Step 2: Publish and Score
The macros indhd_publish_model and indhd_run_model are part of SAS Scoring Accelerator and are used to publish the model and score data on Hadoop. Use the modelname and astorepath macro variables to specify a model label (a string without spaces) and the directory of the ASTORE files, respectively. Finally, update the macro variables ds2dir, scoredata, outdata, and INDCONN according to your data and Hadoop environment setup, and run the code.
NOTE: For additional documentation about these macros and their options, refer to "Overview of Running Scoring Models in Hadoop" in the SAS Scoring Accelerator for Hadoop documentation.
/* Setup macro variables */
%let modelname=forest1;
%let astorepath=%str(c:\astore_dir);
%let ds2dir=%str(/user/xxxx/ds2);

/* Input and output scoring dataset names */
%let scoredata=donor_score_data;
%let outdata=donor_score_out;

/* Publish and run in Hadoop */

/* Setup the indb macro catalog for Hadoop */
%indhdpm;

/* Hadoop options */
%let INDCONN=%str(USER=xxxx);

/* Publish the model */
%indhd_publish_model(
  dir=&astorepath,
  modeldir=&ds2dir,
  modelname=&modelname,
  action=replace,
  trace=no
);

/* Run the model */
%indhd_run_model(
  inmetaname=&metadir./&scoredata..sashdmd,
  outdatadir=&datadir./&outdata,
  outmetadir=&metadir./&outdata..sashdmd,
  store=&ds2dir./&modelname./&modelname..is,
  forceoverwrite=true,
  trace=no
);

/* Display results */
proc sql outobs=10;
  select * from myhdlib.&outdata;
quit;
After the above code runs successfully, the scored data set DONOR_SCORE_OUT on Hadoop contains the model predictions. The PROC SQL step at the end prints the first 10 observations from DONOR_SCORE_OUT.
Scoring can similarly be done on Teradata and SAP HANA; refer to the respective SAS Scoring Accelerator documentation for details. In conclusion, the ASTORE files extend the functionality of SAS Scoring Accelerator by efficiently scoring complex models like Forest, SVM and any new ensemble models added in the future.
Earlier tips in this series are available at: