
Scoring Series Part 2: SAS® Enterprise Miner™ Scoring Output Variables

Started ‎05-19-2015 by
Modified ‎10-06-2015 by

Have you ever wondered what some of the output variables generated by Enterprise Miner score code represent?

 

The meaning of some of the variables generated by Enterprise Miner (EM) score code may be obvious, while others are less so. The score code published for a model in EM is often a combination of code generated by procedures and code generated by the nodes. The following summary lists the variables that may be created, with a brief description of each for orientation.

 

From Procedures

 

Many of the names of the computed variables (outputs, residuals, etc.) are created by concatenating a prefix with the name of the corresponding target variable or decision variable. The table below lists most of the possible prefixes for variables calculated in EM procedure score code. If you got really wild with the procedures you might be able to generate some of the more esoteric variables.

 

List of many of the possible prefixes used for variable names in the EM procedures’ OUT= data sets:

| Prefix | Label * | Description |
|---|---|---|
| AOV16_ | AOV16:VNM | Interval variables binned into 16 equally spaced groups |
| BL_ | Best Loss:VNM | Best possible loss of any of the decisions |
| BP_ | Best Profit:VNM | Best possible profit of any of the decisions |
| CL_ | Computed Loss:VNM | Loss computed from the target value |
| CP_ | Computed Profit:VNM | Profit computed from the target value |
| D_ | Decision:VNM | Decision chosen by the model |
| EL_ | Expected Loss:VNM | Expected loss of the decision chosen by the model |
| EP_ | Expected Profit:VNM | Expected profit of the decision chosen by the model |
| E_ | Error Function:VNM | Error function |
| F_ | From:VNM | Normalized target value of the category that the case comes from |
| GRP_ | Grouped:VNM | Grouping based on variable characteristics |
| G_ | Grouped:VNM | Grouping based on the relationship to the target |
| IC_ | Investment Cost:VNM | Investment cost |
| IM_ | Imputed:VNM | Variable with any missing values replaced |
| I_ | Into:VNM | Normalized category that the case is classified into |
| M_ | Missing:VNM | Missingness indicator dummy variable |
| P_ | Predicted:VNM | Outputs (i.e., predicted values and posterior probabilities) |
| Q_ | Unadjusted P:VNM | Old posteriors, prior to adjustment for priors |
| RAT_ | […] Anscombe Res.:VNM | Studentized Anscombe residuals |
| RA_ | Anscombe Residual:VNM | Anscombe residuals |
| RAS_ | […] Anscombe Res.:VNM | Standardized Anscombe residuals |
| RD_ | […] Residual:VNM | Deviance residuals |
| RDS_ | […] Dev. Res.:VNM | Standardized deviance residuals |
| RDT_ | […] Dev. Res.:VNM | Studentized deviance residuals |
| ROI_ | Return on Investment: | Return on investment |
| RPT_ | Pearson Residual:VNM | Studentized Pearson residuals |
| RP_ | Pearson Residual:VNM | Pearson residuals |
| RPS_ | […] Pearson Res.:VNM | Standardized Pearson residuals |
| RS_ | […] Residual:VNM | Standardized residuals |
| RT_ | […] Residual:VNM | Studentized residuals |
| R_ | Residual:VNM | Plain residuals: target minus output |
| S_ | Standard:VNM | Standardized variable |
| T_ | Transform:VNM | Transformed variable |
| U_ | Unnormalized Into:VNM | Un-normalized category that the case is classified into |
| V_ | Validated:VNM | Same as P_, but based on validation data. Tree only. |
| WOE_ | Weight of Evidence:VNM | Relative risk of an attribute or group level |

 

* For non-categorical targets, the "VNM" above indicates the name of the target variable. For categorical targets, "VNM" represents the name of the target variable followed by an equal sign and the un-normalized category value.
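To make the prefix-plus-name convention concrete, here is a minimal Python sketch (not EM score code) that splits a scored column name into its prefix and the remainder. The function name `split_prefix` and the sample column names are hypothetical; the prefix list is copied from the table above. Note that an ordinary input variable whose name happens to start with one of these prefixes would also match, so treat the result as a hint rather than proof of where a column came from.

```python
# Prefixes taken from the table above. Each one ends in "_", so no prefix
# is itself the start of another and the lookup order does not matter.
PREFIXES = (
    "AOV16_", "BL_", "BP_", "CL_", "CP_", "D_", "EL_", "EP_", "E_", "F_",
    "GRP_", "G_", "IC_", "IM_", "I_", "M_", "P_", "Q_", "RAT_", "RA_",
    "RAS_", "RD_", "RDS_", "RDT_", "ROI_", "RPT_", "RP_", "RPS_", "RS_",
    "RT_", "R_", "S_", "T_", "U_", "V_", "WOE_",
)


def split_prefix(name):
    """Return (prefix, remainder) for a column name, or (None, name)."""
    upper = name.upper()  # SAS variable names are not case sensitive
    for prefix in PREFIXES:
        if upper.startswith(prefix):
            return prefix, name[len(prefix):]
    return None, name


# Hypothetical column names from a scored data set.
for column in ("EP_PURCHASE", "P_BAD1", "age"):
    print(column, split_prefix(column))
```

For a categorical target the remainder typically carries the target level as well (for example `BAD1`), so this sketch does not try to separate the target name from the level.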

 

The generated score code almost always computes the P_* variable(s), and for a categorical target, the I_* and U_* variable(s). But some modeling engines may allow other ways of fitting categorical targets. For example, Regression (proc DMREG) fits an ordinal target by linear least squares using the index of the category as the actual target value, and hence does not produce posterior probabilities.

 

Only the decision tree outputs a V_ output variable, which is similar to a corresponding P_ output variable except it is computed using validation data instead of training data.

 

One of the more ubiquitous variables is the global variable _WARN_. It is used to indicate problems that may occur computing predicted values or making decisions. The _WARN_ variable has 4 columns and each can be set to a specific code.

| Column | Code | Description |
|---|---|---|
| 1 | M | Missing input |
| 2 | U | Unrecognized input category |
| 3 | P | Invalid posterior probability |
| 4 | C | Missing cost variable |
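If you export scored data and want readable messages, the codes above can be decoded with a few lines of Python. This is a minimal sketch under the assumption that `_WARN_` arrives as a short character string containing zero or more of the four codes; the function name `decode_warn` is hypothetical.

```python
# Code-to-description mapping copied from the _WARN_ table above.
WARN_CODES = {
    "M": "Missing input",
    "U": "Unrecognized input category",
    "P": "Invalid posterior probability",
    "C": "Missing cost variable",
}


def decode_warn(warn):
    """List the problems flagged in a _WARN_ value such as 'M' or 'MU'."""
    if warn is None:
        return []
    flags = warn.upper()
    # Check for each code anywhere in the string, so blank columns are ignored.
    return [text for code, text in WARN_CODES.items() if code in flags]


print(decode_warn("MU"))
```

An empty list means no warning was raised for that row.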

 

By default, the EM score code contains no reference to the target variable. Only in the flow score code is the RESIDUAL option specified. So only the EM flow score code can calculate values that depend on the target variable. If the RESIDUAL option is specified in the CODE statement of the modeling procedure, the code should compute the R_* variable(s), and for a categorical target, the F_* variable(s). Other kinds of residuals may be computed if that is feasible, for example CL_*, CP_*, BL_*, BP_*, or ROI_*. Plain residuals are not multiplied by error weights or by frequencies. Plain residuals will always be the actual target value minus the predicted value.
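The plain residual described above is simple enough to sketch in Python (this is an illustration, not EM score code, and the function names are hypothetical). For a categorical target, the sketch assumes the "actual target value" for each level is a 0/1 indicator of whether the case belongs to that level, so that each level's residual is the indicator minus the posterior probability.

```python
def plain_residual(actual, predicted):
    """R_ for an interval target: actual target value minus predicted value."""
    return actual - predicted


def class_residuals(actual_level, posteriors):
    """R_ per level of a categorical target.

    Assumption: the actual value for a level is 1 if the case belongs to
    that level and 0 otherwise, so the residual is indicator minus posterior.
    """
    return {
        level: (1.0 if level == actual_level else 0.0) - p
        for level, p in posteriors.items()
    }


print(plain_residual(12.0, 10.5))
print(class_residuals("1", {"1": 0.75, "0": 0.25}))
```

No error weights or frequencies enter the calculation, which matches the description of plain residuals above.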

 

Variables with prefixes like D_*, EL_*, or EP_* are calculated only if decision processing is specified. The formula for D_targetname varies with the data mining model. There are too many formulas to list here, but they should be identifiable in the score code.
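Because the exact formulas vary by model, the following Python sketch shows only one common formulation and is not taken from EM score code: each decision's expected profit is the posterior-weighted sum of a profit matrix, and the decision with the largest expected profit is chosen. The function names, the profit matrix, and the decision labels are all hypothetical.

```python
def expected_profits(posteriors, profit):
    """Expected profit per decision: sum over levels of P(level) * profit."""
    decisions = next(iter(profit.values())).keys()
    return {
        d: sum(posteriors[level] * profit[level][d] for level in posteriors)
        for d in decisions
    }


def choose_decision(posteriors, profit):
    """Return (decision, expected profit), analogous to D_ and EP_ variables."""
    ep = expected_profits(posteriors, profit)
    best = max(ep, key=ep.get)
    return best, ep[best]


# Hypothetical profit matrix: profit[target level][decision].
profit = {
    "1": {"mail": 10.0, "skip": 0.0},
    "0": {"mail": -1.0, "skip": 0.0},
}
posteriors = {"1": 0.25, "0": 0.75}
print(choose_decision(posteriors, profit))
```

With a loss matrix the same idea applies with the minimum expected loss in place of the maximum expected profit.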

 

From Nodes

 

In some cases it is desirable for an output variable to have the same name regardless of the target name. By default, the EM Score node provides fixed names for a variety of output variables.

| Fixed Output Name | Label | Description |
|---|---|---|
| EM_PREDICTION | Prediction for vnm | The prediction variable for an interval target. |
| EM_PROBABILITY | Probability of Classification | Posterior probability associated with the predicted classification. That is, it corresponds to the maximum of the posterior probabilities, max(P1, P2, ..., Pk). |
| EM_EVENTPROBABILITY | Probability for level n of vnm | Posterior probability associated with the target event. |
| EM_DECISION | Recommended Decision for vnm | Maps to D_targetname variables. |
| EM_PROFIT | Expected Profit for vnm | Expected profit predicted for a target variable, set from EP_targetname. |
| EM_LOSS | Expected Loss for vnm | Expected loss predicted for a target variable, set from EL_targetname. |
| EM_CLASSIFICATION | Prediction for vnm | I_variable, the prediction variable for a class target. |
| EM_SEGMENT | Node or Segment Variable | Segment identifier derived from Decision Tree Leaf, Cluster number, or SOM cell ID. |
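As a rough illustration of how the classification-related fixed names relate to the posterior probabilities, here is a minimal Python sketch (not EM score code). It assumes no decision matrix is involved, so the predicted class is simply the level with the largest posterior; the function name and the sample levels are hypothetical.

```python
def fixed_name_outputs(posteriors, event_level):
    """Derive fixed-name outputs from a dict of level -> posterior probability."""
    # Assumption: the predicted class is the level with the largest posterior.
    best_level = max(posteriors, key=posteriors.get)
    return {
        "EM_CLASSIFICATION": best_level,
        "EM_PROBABILITY": posteriors[best_level],        # max(P1, ..., Pk)
        "EM_EVENTPROBABILITY": posteriors[event_level],  # P(target event)
    }


print(fixed_name_outputs({"1": 0.3, "0": 0.7}, event_level="1"))
```

Note that EM_PROBABILITY and EM_EVENTPROBABILITY differ whenever the predicted class is not the target event, as in this example.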

 

If a situation calls for a different value under a fixed name, you can use the Rules Builder node to assign just about anything to the EM_Outcome variable.

 

The Cutoff node will provide the calculated decision point as EM_CUTOFF.

 

In TwoStage models you also get a variable prefixed with EV_ and labeled “Expected Value:vnm” where vnm = interval target name. It is derived from the predicted value and any specified bias.

 

Neural Network models can also produce variables named H<number> for the value of the hidden units.

 

The Link Analysis node provides the item-cluster detection information as the _segment_ variable.

 

Decision Tree Score code creates variables named _NODE_ and _LEAF_. They are labeled 'Node' and 'Leaf'. They identify each final node or leaf by both leaf number and node number.

 

The Variable Clustering node can replace a large set of variables with a smaller set of cluster components with little loss of information. The cluster components are named Clusn where "n" is a number.

 

Some of the Modify nodes in EM can produce output variables with identifying prefixes on the variable name.

| Prefix | Node | Description |
|---|---|---|
| IMP_ | Impute | Original variable's value or, if missing, an imputed value |
| GRP_ | Interactive Binning | Group number based on the original variable's value |
| REP_ | Replacement | Replacement values for the variable's class and interval levels |

 

The Principal Component node by default produces variables named PC_n, where "n" is a number, but the "PC" can be changed in the node properties. Each variable's value is an uncorrelated linear combination of the original input variables.

 

The Transform Variables node's formula builder defaults to a name like TRANS_n, where "n" is a number. However, that name can be modified. The node does generate names for several predefined transformations available from the Train properties, Variables dialog.

 

The new variables created are named with the selected variable's name and a prefix to identify the specific transformation.

| Prefix | Method | Description |
|---|---|---|
| BIN_ | Bucket | Bin based on the difference between the maximum and minimum values |
| CNTR_ | Centering | Grand-mean-centered value |
| EXP_ | Exponential | Exponential of the variable |
| INV_ | Inverse | Inverse of the variable |
| LOG_ | Log | Natural log of the original variable |
| LG10_ | Log 10 | Base 10 logarithm of the original variable |
| OPT_ | Optimal Binning | Binned in order to maximize the relationship to the target |
| PWR_ | Optimal Max. Equalize | Best power transformation to equalize target level spread |
| PCTL_ | Quantile | Values grouped so that each group has the same frequency |
| RANGE_ | Range Standardization | Scaled value of the variable |
| SQR_ | Square | Square of the variable |
| SQRT_ | Square Root | Square root of the variable |
| STD_ | Standardize | Produced by subtracting the mean and dividing by the standard deviation |
| TI_ | Dummy Indicator | Dummy variables for a categorical variable, from highest class value to lowest class value |
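To show how the prefix and the variable name combine, here is a minimal Python sketch of a few of the simpler transformations, using the textbook definitions from the table. It is not what the node emits: it assumes positive, non-constant inputs, uses the sample standard deviation, scales RANGE_ to the interval [0, 1], and ignores any offsets or missing-value handling the Transform Variables node may apply. The function name and the sample variable `INCOME` are hypothetical.

```python
import math
import statistics


def transform_column(name, values):
    """Return a dict of new columns named <prefix><name> for simple transforms."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    lo, hi = min(values), max(values)
    return {
        f"CNTR_{name}": [v - mean for v in values],
        f"STD_{name}": [(v - mean) / std for v in values],
        f"RANGE_{name}": [(v - lo) / (hi - lo) for v in values],  # assumed [0, 1]
        f"LOG_{name}": [math.log(v) for v in values],
        f"LG10_{name}": [math.log10(v) for v in values],
        f"SQRT_{name}": [math.sqrt(v) for v in values],
        f"SQR_{name}": [v * v for v in values],
        f"INV_{name}": [1.0 / v for v in values],
        f"EXP_{name}": [math.exp(v) for v in values],
    }


columns = transform_column("INCOME", [1.0, 4.0, 100.0])
print(sorted(columns))
print(columns["SQRT_INCOME"])
```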

 

If you are into unsupervised learning you have probably already experimented with the SOM/Kohonen node.

| Variable | Label | Description |
|---|---|---|
| SOM_Segment | SOM Segment ID | Integer identifying the cluster |
| SOM_ID | SOM ID | Contains the row and column in the SOM |
| Distance | Distance | The distance from each case to the cluster seed |
| SOM_DIMENSION1 | SOM Dimension1 | Identifies rows or columns in the SOM |
| SOM_DIMENSION2 | SOM Dimension2 | Identifies rows or columns in the SOM |

 

The Incremental Response node provides several variables that could be used to optimize customer targeting.

| Variable | Description |
|---|---|
| EM_P_CONTROL_RESPONSE | Predicted response probability from the control group |
| EM_P_CONTROL_NONRESPONSE | 1 - EM_P_CONTROL_RESPONSE |
| EM_P_ADJ_INCREMENT_RESPONSE | Incremental predicted response rate, adjusted to be positive |
| EM_P_ADJ_INCREMENT_NONRESPONSE | 1 - EM_P_ADJ_INCREMENT_RESPONSE |
| EM_P_ABS_INCREMENT_RESPONSE | Absolute value of the incremental predicted response rate (available when an outcome model is used) |
| EM_P_ABS_INCREMENT_NONRESPONSE | 1 - EM_P_ABS_INCREMENT_RESPONSE |
| EM_P_TREATMENT_RESPONSE | Predicted response probability from the treatment group |
| EM_P_TREATMENT_NONRESPONSE | 1 - EM_P_TREATMENT_RESPONSE |
| EM_P_INCREMENT_RESPONSE | EM_P_TREATMENT_RESPONSE - EM_P_CONTROL_RESPONSE |
| EM_P_INCREMENT_NONRESPONSE | EM_P_TREATMENT_NONRESPONSE - EM_P_CONTROL_NONRESPONSE |
| EM_P_CONTROL_OUTCOME | Predicted value of the outcome variable from the control group |
| EM_P_TREATMENT_OUTCOME | Predicted value of the outcome variable from the treatment group |
| EM_P_INCREMENT_OUTCOME | EM_P_TREATMENT_OUTCOME - EM_P_CONTROL_OUTCOME |
| EM_REV_TREATMENT | Estimated revenue for the treatment group: EM_P_TREATMENT_RESPONSE * EM_P_TREATMENT_OUTCOME - Cost or, if Constant Revenue is set, EM_P_CONTROL_RESPONSE * Revenue_Per_Response - Cost |
| EM_REV_CONTROL | Estimated revenue for the control group: EM_P_CONTROL_RESPONSE * EM_P_CONTROL_OUTCOME |
| EM_REV_INCREMENT | Estimated incremental revenue: EM_REV_TREATMENT - EM_REV_CONTROL |
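The arithmetic relationships among these variables can be sketched in Python as follows. This is only an illustration of the formulas listed above for the case where an outcome model supplies the outcome values (not the Constant Revenue case), and it omits the EM_P_ADJ_* variables because the adjustment method is not described here. The function name and the input values are hypothetical.

```python
def incremental_response(p_treatment, p_control,
                         outcome_treatment, outcome_control, cost):
    """Derive incremental response variables from the two groups' predictions."""
    out = {
        "EM_P_TREATMENT_RESPONSE": p_treatment,
        "EM_P_TREATMENT_NONRESPONSE": 1 - p_treatment,
        "EM_P_CONTROL_RESPONSE": p_control,
        "EM_P_CONTROL_NONRESPONSE": 1 - p_control,
    }
    out["EM_P_INCREMENT_RESPONSE"] = p_treatment - p_control
    out["EM_P_INCREMENT_NONRESPONSE"] = (
        out["EM_P_TREATMENT_NONRESPONSE"] - out["EM_P_CONTROL_NONRESPONSE"]
    )
    out["EM_P_ABS_INCREMENT_RESPONSE"] = abs(out["EM_P_INCREMENT_RESPONSE"])
    out["EM_P_INCREMENT_OUTCOME"] = outcome_treatment - outcome_control
    # Revenue estimates: only the treatment group carries the contact cost.
    out["EM_REV_TREATMENT"] = p_treatment * outcome_treatment - cost
    out["EM_REV_CONTROL"] = p_control * outcome_control
    out["EM_REV_INCREMENT"] = out["EM_REV_TREATMENT"] - out["EM_REV_CONTROL"]
    return out


scores = incremental_response(0.5, 0.25, 100.0, 80.0, cost=5.0)
print(scores["EM_P_INCREMENT_RESPONSE"], scores["EM_REV_INCREMENT"])
```

Sorting customers by the incremental response or incremental revenue value is one way such output could feed a targeting list.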

 

If you can think of other variables generated by EM score code or a better description, please add them to the comments. I left one out just to prove to someone that nobody ever used it. :)

    

"Risk comes from not knowing what you're doing."

  ~ Warren Edward Buffett (born August 30, 1930) an American business magnate, investor, and philanthropist
