Running Python scripts within SAS Enterprise Miner enables you to use open source packages alongside the statistical, data mining, and machine learning methodologies available in SAS. The underlying technique that is used to implement this tip is explained in detail in the "Open Source Integration Using the Base SAS Java Object" paper. This tip takes the idea in the paper a step further by incorporating it into a SAS Enterprise Miner flow, where models from a Python package can easily be assessed and compared with those from SAS Enterprise Miner or other data mining packages.
To begin, download the necessary files from https://github.com/sassoftware/enlighten-integration/tree/master/SAS_Base_OpenSrcIntegration and follow the steps outlined in the “Compiling the Provided Java Classes” and “Setting the Java Classpath” sections provided with the paper to compile the Java classes and set up the Java classpath in the SAS environment. The working directory where the files are downloaded is referred to as WORK_DIR.
This tip uses the Java class SASJavaExec.java and the digitsdata_17_train.csv file from the ZIP file, and the em_digitsdata_forest.py Python script that is attached to this post. The training data has 785 columns: the first column is the label (dependent variable), which contains the digit 1 or 7, and the remaining 784 columns are predictors. The code in the Python script uses the Random Forest ensemble from the scikit-learn package to model the binary target in the training data. Make sure to copy the Python script em_digitsdata_forest.py to WORK_DIR.
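The attached script is not reproduced in this post. As a rough idea of the contract it has to satisfy, here is a minimal sketch. It assumes the script receives WORK_DIR as its first argument, reads digitsdata_17_train.csv (header row, label in the first column), and writes predict_train_py_forest_prob.csv with no header and one probability column per class. The function names, the number of trees, and the seed are hypothetical choices for illustration; the actual em_digitsdata_forest.py may differ.

```python
import csv
import os
import sys


def load_training_csv(path):
    """Read the training CSV: header row, label in column 0, predictors after."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = [[float(value) for value in row] for row in reader if row]
    labels = [int(row[0]) for row in rows]
    predictors = [row[1:] for row in rows]
    return labels, predictors


def fit_and_score(labels, predictors, n_trees=100, seed=0):
    """Fit a random forest and return [P(label=1), P(label=7)] per observation."""
    # Imported here so the CSV helpers work even without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(predictors, labels)
    # predict_proba orders its columns by model.classes_, i.e. sorted labels (1, 7).
    return model.predict_proba(predictors).tolist()


def write_probabilities(path, probabilities):
    """Write the probabilities without a header row, one observation per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(probabilities)


def main(work_dir):
    train_path = os.path.join(work_dir, "digitsdata_17_train.csv")
    out_path = os.path.join(work_dir, "predict_train_py_forest_prob.csv")
    labels, predictors = load_training_csv(train_path)
    write_probabilities(out_path, fit_and_score(labels, predictors))


# Entry point when invoked as: python <script> <WORK_DIR>
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because the output has no header, the column order is the only thing that tells SAS which probability belongs to which digit, so it is worth keeping the order fixed (1 first, then 7).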
Follow these 5 steps to execute the Python model and display its fit statistics in SAS Enterprise Miner:
STEP 1: SETUP
Create a new project in SAS Enterprise Miner and copy the start-up code below into the Project Start Code window. Update WORK_DIR (the working directory where the downloaded files are located) and PYTHON_EXEC_COMMAND (the location of the Python executable) for your system, then click the Run Now button.
*** WORKING DIRECTORY (----- USER UPDATE NEEDED -----);
%let WORK_DIR = C:\SGF2015\OpenSrcIntegration;
*** SYSTEM PYTHON LOCATION (----- USER UPDATE NEEDED -----);
%let PYTHON_EXEC_COMMAND = C:\Anaconda\python.exe;
*** JAVA LIBRARIES/CLASS FILES LOCATION;
%let JAVA_BIN_DIR = &WORK_DIR.\bin;
options linesize = MAX;
STEP 2: A SIMPLE DIAGRAM
Create a new diagram and connect a SAS Code node to a Metadata node, followed by a Model Import node.
STEP 3: SAS CODE
Copy the following SAS code into the SAS Code node and run it. The code first validates the Java classpath; if the classpath is not specified correctly, an error message is written to the log. Correct the problem before proceeding, and refer to the paper for details on setting the Java classpath. The code then runs the Python script through the Java object and loads the resulting CSV files into SAS data sets.
*** VALIDATE JAVA CLASSPATH;
data _null_;
    length _x1 $32767;
    _x1 = sysget('CLASSPATH');
    _x2 = index(upcase(trim(_x1)), %upcase("&JAVA_BIN_DIR"));
    if _x2 = 0 then put "ERROR: Invalid Java Classpath.";
run;
/*** Part I: Python ***/
data _null_;
    length rtn_val 8;
    *** Python program takes working directory as first argument;
    python_pgm = "&WORK_DIR.\em_digitsdata_forest.py";
    python_arg1 = "&WORK_DIR";
    python_call = cat('"', trim(python_pgm), '" "', trim(python_arg1), '"');
    declare javaobj j("dev.SASJavaExec", "&PYTHON_EXEC_COMMAND", python_call);
    *** Run the command (executeProcess is assumed from the paper - verify it against SASJavaExec.java);
    j.callIntMethod("executeProcess", rtn_val);
run;
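*** Part III: Load CSV files into SAS datasets ****************;
proc import
    out = predict_py
    datafile = "&WORK_DIR.\predict_train_py_forest_prob.csv"
    dbms = csv replace;
    getnames = no;
run;
proc import
    out = digitsdata_17_train
    datafile = "&WORK_DIR.\digitsdata_17_train.csv"
    dbms = csv replace;
    getnames = yes;
run;
/* Combine actual and predicted values row by row. The output data set is an
   assumption: EM_EXPORT_TRAIN is the train export of the SAS Code node. */
data &EM_EXPORT_TRAIN;
    set digitsdata_17_train;
    set predict_py (rename=(var1=p_label1 var2=p_label7));
run;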
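The prediction file has no key column, so its rows can only be matched to the training rows by position. A quick check outside SAS can catch a misaligned or malformed file before it reaches the diagram. The sketch below is one way to do that; check_alignment and its tolerance are hypothetical, and it assumes the training CSV has a header row while the prediction CSV does not.

```python
import csv


def check_alignment(train_csv, prob_csv, tolerance=1e-6):
    """Return a list of problems found; an empty list means the files line up."""
    with open(train_csv, newline="") as f:
        # Count non-empty rows and subtract the header row.
        n_train = sum(1 for row in csv.reader(f) if row) - 1
    with open(prob_csv, newline="") as f:
        prob_rows = [[float(v) for v in row] for row in csv.reader(f) if row]

    problems = []
    if n_train != len(prob_rows):
        problems.append(
            f"row count mismatch: {n_train} training rows, "
            f"{len(prob_rows)} prediction rows"
        )
    for i, row in enumerate(prob_rows, start=1):
        # Each row should hold two class probabilities that sum to 1.
        if len(row) != 2 or abs(sum(row) - 1.0) > tolerance:
            problems.append(f"prediction row {i} is not a valid probability pair")
    return problems
```

Calling check_alignment with the two paths under WORK_DIR and printing the result is enough; anything other than an empty list is worth investigating before running the flow.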
STEP 4: UPDATE METADATA
Select the Metadata node and click the button next to the Train property under the Variables tab. For the variable label, change New Role to Target and New Level to Binary, as shown in the figure below. The purpose of this node is to add metadata to the output data set generated by the Python script.
STEP 5: FIT STATISTICS
Lastly, select the Model Import node, click the button next to Mapping Editor under the Predicted Variables tab, and make sure the mapping is similar to the figure below. Run all nodes and view the fit statistics in the Results window of the Model Import node.
The Model Import node can also be connected to a Model Comparison node to compare the model in the Python script with other models built in SAS Enterprise Miner.
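As a rough cross-check on what the Results window reports, two common fit statistics can be computed directly from the labels and the predicted probabilities. This is only a sketch: the function names are hypothetical, the column order (1, then 7) is assumed from the renamed variables p_label1 and p_label7, and Enterprise Miner's own definitions may differ in detail, for example in the divisor it uses for average squared error.

```python
def misclassification_rate(labels, probabilities, classes=(1, 7)):
    """Share of observations whose most probable class is not the actual label."""
    wrong = 0
    for label, probs in zip(labels, probabilities):
        predicted = classes[probs.index(max(probs))]
        if predicted != label:
            wrong += 1
    return wrong / len(labels)


def average_squared_error(labels, probabilities, classes=(1, 7)):
    """Mean squared difference between 0/1 class indicators and probabilities.

    The sum runs over every observation and every class, and is divided by
    (number of observations * number of classes).
    """
    total = 0.0
    for label, probs in zip(labels, probabilities):
        for cls, p in zip(classes, probs):
            indicator = 1.0 if label == cls else 0.0
            total += (indicator - p) ** 2
    return total / (len(labels) * len(classes))


# Tiny illustrative input (made-up values, not results from the digits data).
example_labels = [1, 7, 7, 1]
example_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [1.0, 0.0]]
example_misc = misclassification_rate(example_labels, example_probs)
example_ase = average_squared_error(example_labels, example_probs)
```

If these numbers are far from what the Model Import node shows, the mapping of predicted variables is the first thing to re-check.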
The input and output files exchanged between the Python script and SAS Enterprise Miner are in standard CSV format, which keeps the solution flexible and easy to use. This methodology is also not limited to Python scripts: it extends to any valid executable command and its necessary command-line arguments.
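Since the approach amounts to launching an executable with command-line arguments, it can help to try a new command on its own before wiring it into the SAS Code node. The sketch below shows one way to do that from Python. run_command is a hypothetical helper, and the example launches the current Python interpreter only because it is guaranteed to be available; substitute the executable and arguments you intend to use.

```python
import subprocess
import sys


def run_command(exec_command, *args):
    """Launch an executable with arguments; return (exit code, stdout, stderr)."""
    completed = subprocess.run(
        [exec_command, *args],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as text
    )
    return completed.returncode, completed.stdout, completed.stderr


# Example: the child process echoes back the working directory it was given,
# mirroring how the script in this tip receives WORK_DIR as its first argument.
exit_code, stdout, stderr = run_command(
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1])",
    r"C:\SGF2015\OpenSrcIntegration",
)
```

A nonzero exit code or anything in stderr here usually points to a path or quoting problem that would be harder to diagnose from inside the diagram.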