Assessing Models by using k-fold Cross Validation in SAS® Enterprise Miner™

by SAS Employee Funda_SAS on 05-11-2017 11:54 AM - edited on 05-15-2017 02:32 PM by Community Manager

My previous tip on cross validation shows how to compare three trained models (regression, random forest, and gradient boosting) based on their 5-fold cross validation training errors in SAS Enterprise Miner. This tip is the second installment about using cross validation in SAS Enterprise Miner and builds on the diagram that is used in the first tip.

 

In addition to comparing models based on their 5-fold cross validation training errors, this tip also shows how to obtain a 5-fold cross validation testing error, so it provides a more complete SAS Enterprise Miner flow (shown below).

[Image: tip2_screen.PNG, the SAS Enterprise Miner process flow]

 

First, a quick note about how the k-fold cross validation training and testing errors are calculated:

  • The k-fold cross validation training error is calculated from predictions on the whole training set. These predictions are obtained by combining the k sets of cross validation holdout predictions, so each training row is scored by the model that did not see it.
  • The k-fold cross validation testing error is calculated from the average of the k sets of test predictions that come from the k trained models of cross validation.
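
One way to write the two definitions above as formulas, offered as a sketch under assumed notation that is not part of the original tip: the training set has n rows (x_i, y_i), f(i) is the fold that contains row i, \hat{p}^{(-j)} is the model trained with fold j omitted, the test set has m rows (x'_t, y'_t), and \ell is an error measure that is averaged over rows (for example, misclassification or squared error).

```latex
\mathrm{Err}^{\mathrm{CV}}_{\mathrm{train}}
  = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\ \hat{p}^{(-f(i))}(x_i)\bigr)

\bar{p}(x'_t) = \frac{1}{k}\sum_{j=1}^{k} \hat{p}^{(-j)}(x'_t),
\qquad
\mathrm{Err}^{\mathrm{CV}}_{\mathrm{test}}
  = \frac{1}{m}\sum_{t=1}^{m} \ell\bigl(y'_t,\ \bar{p}(x'_t)\bigr)
```

In this tip k = 5, and \bar{p} corresponds to the averaged P_BAD1 that is computed for each test row ID later in the flow.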

The following is a step-by-step explanation of the preceding Enterprise Miner flow. You can run this process flow by using the attached XML file.

  1. The Data Source node HMEQ_TRAIN includes about two-thirds of the HMEQ data, and its Role property is specified as Train. The HMEQ_TEST node includes the remaining one-third of the data, and its Role is specified as Test. In both nodes, the target variable is BAD, whose level is binary.
  2. As in the first tip, some simple tricks are needed to make sure that looping through the Start Groups and End Groups nodes performs k-fold cross validation properly on both the training set and the test set. For the training set, this involves using the Transform Variables node as explained in step 3. For the test set, you need to create five replications of the test set. In the first copy, the _fold_ variable takes the value 1, in the second copy it takes the value 2, and so on. The following statements in the SAS Code node (Test Data X 5) create these five test sets. Note that the code first defines a unique ID for each row in the original test set. The IDs are used in the later SAS Code nodes (those that come after the End Groups nodes) to average the five test set predictions per ID.
    /* Assign a unique ID to each row of the original test set */
    data temptest;
       set &EM_IMPORT_TEST;
       ID = _N_;
    run;
     
    /* Stack five copies of the test set; copy j gets _fold_ = j */
    data &EM_EXPORT_TEST;
       set temptest(in=in1) temptest(in=in2) temptest(in=in3)
           temptest(in=in4) temptest(in=in5);
       if in1 then _fold_ = 1;
       else if in2 then _fold_ = 2;
       else if in3 then _fold_ = 3;
       else if in4 then _fold_ = 4;
       else if in5 then _fold_ = 5;
    run;
  3. The Transform Variables node (which is connected to the training set) creates a k-fold cross validation indicator as a new input variable, _fold_, which randomly divides the training set into k folds, and saves this new indicator as a segment variable. More information about this node can be found in the first tip.
  4. The Control Point node establishes a control point within the process flow diagram.
  5. The next three nodes (a Start Groups node, a modeling node, and an End Groups node) train a model on each of the five training sets of 5-fold cross validation (each set omits one fold) and obtain predictions on the holdout (omitted) fold. They also obtain predictions on the test set from each of the five trained models.
  6. For each modeling algorithm, the SAS Code 2 node averages the test scores that come from the 5 trained models of 5-fold cross validation by executing the following statements:
    /* Sort the stacked test predictions by ID so the per-fold tables can be merged */
    proc sort data=&EM_IMPORT_TEST out=test_sorted;
       by ID;
    run;
     
    /* Pass the training data through unchanged */
    data &EM_EXPORT_TRAIN;
       set &EM_IMPORT_DATA;
    run;
     
    /* Split the sorted test predictions into one table per fold */
    data test1 test2 test3 test4 test5;
       set test_sorted;
       if _fold_ = 1 then output test1;
       else if _fold_ = 2 then output test2;
       else if _fold_ = 3 then output test3;
       else if _fold_ = 4 then output test4;
       else if _fold_ = 5 then output test5;
    run;
     
    /* Merge the five predictions by ID and average them into P_BAD1 */
    data &EM_EXPORT_TEST;
       merge test1(rename=(P_BAD1=P_BAD_1)) test2(rename=(P_BAD1=P_BAD_2))
             test3(rename=(P_BAD1=P_BAD_3)) test4(rename=(P_BAD1=P_BAD_4))
             test5(rename=(P_BAD1=P_BAD_5));
       by ID;
       P_BAD1 = (P_BAD_1 + P_BAD_2 + P_BAD_3 + P_BAD_4 + P_BAD_5)/5;
    run;
    
  7. The Metadata node assigns the Prediction role to P_BAD1, the average prediction for the test set. The Model Comparison node uses this role to calculate performance metrics. The Metadata node also rejects the other test set prediction variables that have the Prediction role (P_BAD_1, …, P_BAD_5), so that the Model Comparison node knows exactly which variable to use for calculating test set errors.
  8. The last node, Model Comparison, compares the gradient boosting, regression, and decision tree models based on their 5-fold cross validation training and testing errors. It provides the following output table, in which the reported training and testing errors are the 5-fold cross validation training and testing errors.
     [Images: capture3.png, Capture4.png]
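
The averaging code in step 6 hard-codes five folds. As a hedged alternative sketch (not the author's code), the same per-ID average could be computed for any number of folds with PROC SQL. The table name avg_pred is hypothetical, and the sketch assumes that the imported test table contains the columns ID, _fold_, BAD, and P_BAD1, as in the flow described above. The macro variable &EM_IMPORT_TEST resolves only inside a SAS Code node.

```sas
/* Sketch: average the per-fold test predictions for each test row (any k). */
/* avg_pred is a hypothetical work table, not part of the original flow.    */
proc sql;
   create table avg_pred as
   select ID,
          max(BAD)    as BAD,      /* target is identical across the copies of a row */
          avg(P_BAD1) as P_BAD1,   /* average of the per-fold predictions            */
          count(*)    as n_folds   /* should equal k for every ID                    */
   from &EM_IMPORT_TEST
   group by ID
   order by ID;
quit;
```

This sketch keeps only the listed columns. If later nodes need the input variables as well, they would have to be joined back by ID or added to the select list. Unlike the sum in step 6, avg() ignores missing predictions, so the n_folds column is worth checking.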

Note that you can obtain the cross validated predictions of the test set by saving the exported data (&EM_EXPORT_TEST) of the SAS Code 2 node.
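
As one possible way to do this, the following sketch copies the exported table to a permanent library. It assumes that the statements are appended to the end of the SAS Code 2 node, after &EM_EXPORT_TEST has been created. The libref cvout, the folder path, and the table name hmeq_test_cv_pred are hypothetical placeholders.

```sas
/* Hypothetical library path; replace it with a folder that you can write to */
libname cvout "/path/to/output";

/* Copy the averaged test predictions to a permanent table */
data cvout.hmeq_test_cv_pred;
   set &EM_EXPORT_TEST;
run;
```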

 

If you run this flow diagram or replicate this analysis for your own data, make sure that you run each Start Groups/End Groups block separately, because multiple looping actions do not work at the same time.

 

Thanks a lot to Ralph Abbey for his help in putting this together.
