Have you ever wondered what some of the output variables generated by Enterprise Miner score code represent?
The meaning of some of the variables generated by Enterprise Miner (EM) score code is obvious; others are less so. The score code published for a model in EM is often a combination of code generated by procedures and code generated by the nodes. The following is a summary of the variables potentially created, with a brief description of each to provide some orientation.
From Procedures
Many of the names of the computed variables (outputs, residuals, etc.) are created by concatenating a prefix with the name of the corresponding target variable or decision variable. The table below lists most of the possible prefixes for variables calculated in EM procedure score code. If you got really wild with the procedures, you might be able to generate some of the more esoteric variables.
List of many of the possible prefixes used for variable names in the EM procedures’ OUT= data sets:
Prefix | Label * | Description
AOV16_ | AOV16:VNM | Interval variables binned into 16 equally spaced groups
BL_ | Best Loss:VNM | Best possible loss of any of the decisions
BP_ | Best Profit:VNM | Best possible profit of any of the decisions
CL_ | Computed Loss:VNM | Loss computed from the target value
CP_ | Computed Profit:VNM | Profit computed from the target value
D_ | Decision:VNM | Decision chosen by the model
EL_ | Expected Loss:VNM | Expected loss of the decision chosen by the model
EP_ | Expected Profit:VNM | Expected profit of the decision chosen by the model
E_ | Error Function:VNM | Error function
F_ | From:VNM | Normalized target value of the category that the case comes from
GRP_ | Grouped:VNM | Based on variable characteristics
G_ | Grouped:VNM | Based on the relationship to the target
IC_ | Investment Cost:VNM | Investment cost
IM_ | Imputed:VNM | Variable with any missing values replaced
I_ | Into:VNM | Normalized category that the case is classified into
M_ | Missing:VNM | Missingness indicator dummy variable
P_ | Predicted: VNM | Outputs (i.e. predicted values and posterior probabilities)
Q_ | Unadjusted P:VNM | Old posteriors, prior to adjustment for priors
RAT_ | | Studentized Anscombe residuals
RA_ | Anscombe Residual: VNM | Anscombe residuals
RAS_ | | Standardized Anscombe residuals
RD_ | | Deviance residuals
RDS_ | | Standardized deviance residuals
RDT_ | | Studentized deviance residuals
ROI_ | Return on Investment: | Return on investment
RPT_ | Pearson Residual: VNM | Studentized Pearson residuals
RP_ | Pearson Residual:VNM | Pearson residuals
RPS_ | | Standardized Pearson residuals
RS_ | | Standardized residuals
RT_ | | Studentized residuals
R_ | Residual: VNM | Plain residuals: target minus output
S_ | Standard:VNM | Standardized variable
T_ | Transform:VNM | Transformed variable
U_ | Unnormalized Into:VNM | Unnormalized category that the case is classified into
V_ | Validated:VNM | Same as P_ only based on validation data. Tree only.
WOE_ | Weight of Evidence:VNM | Relative risk of an attribute or group level
* For noncategorical targets, the "VNM" above indicates the name of the target variable. For categorical targets, "VNM" represents the name of the target variable followed by an equal sign and the unnormalized category value.
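As a rough illustration of the naming convention, here is a small Python sketch (not EM score code) that splits a scored variable name into its prefix and the remaining name. The function name `split_prefix` and the sample variable names are made up for this example, and a match is only a hint, since a raw input variable could happen to start with the same characters.

```python
# Prefixes copied from the procedure table; everything else here is illustrative.
PROCEDURE_PREFIXES = {
    "AOV16_", "BL_", "BP_", "CL_", "CP_", "D_", "EL_", "EP_", "E_", "F_",
    "GRP_", "G_", "IC_", "IM_", "I_", "M_", "P_", "Q_", "RAT_", "RA_",
    "RAS_", "RD_", "RDS_", "RDT_", "ROI_", "RPT_", "RP_", "RPS_", "RS_",
    "RT_", "R_", "S_", "T_", "U_", "V_", "WOE_",
}


def split_prefix(name):
    """Return (prefix, remainder), or (None, name) when no known prefix matches."""
    # Every listed prefix ends at the first underscore, so split there.
    cut = name.find("_")
    if cut > 0:
        candidate = name[: cut + 1].upper()
        if candidate in PROCEDURE_PREFIXES:
            return candidate, name[cut + 1:]
    return None, name


print(split_prefix("P_BAD1"))      # posterior probability for a hypothetical target BAD, level 1
print(split_prefix("RPT_amount"))  # studentized Pearson residual for a hypothetical target
print(split_prefix("income"))      # no prefix: treated as an ordinary variable
```

For a categorical target, the remainder is the target name followed by the category value, as described in the footnote above.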
The generated score code almost always computes the P_* variable(s), and for a categorical target, the I_* and U_* variable(s). But some modeling engines may allow other ways of fitting categorical targets. For example, Regression (proc DMREG) fits an ordinal target by linear least squares using the index of the category as the actual target value, and hence does not produce posterior probabilities.
Only the decision tree outputs a V_ output variable, which is similar to a corresponding P_ output variable except it is computed using validation data instead of training data.
One of the more ubiquitous variables is the global variable _WARN_. It is used to indicate problems that may occur computing predicted values or making decisions. The _WARN_ variable has 4 columns and each can be set to a specific code.
Column | Code | Description
1 | M | Missing input
2 | U | Unrecognized input category
3 | P | Invalid posterior probability
4 | C | Missing cost variable
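If scored data is inspected outside of SAS, the codes can be translated back into their descriptions. The Python sketch below assumes that _WARN_ arrives as a string of up to four characters, with a blank in each column that has no warning; that layout is an assumption made for illustration, and `decode_warn` is a hypothetical helper name.

```python
# Column position (0-based) -> (code, description), taken from the table above.
WARN_CODES = {
    0: ("M", "Missing input"),
    1: ("U", "Unrecognized input category"),
    2: ("P", "Invalid posterior probability"),
    3: ("C", "Missing cost variable"),
}


def decode_warn(value):
    """Return the descriptions of the warnings set in a _WARN_ value."""
    # Assumption: unused columns are blank, so pad short values to four characters.
    padded = (value or "").ljust(4)
    messages = []
    for position, (code, description) in WARN_CODES.items():
        if padded[position].upper() == code:
            messages.append(description)
    return messages


print(decode_warn("M"))     # warning in column 1 only
print(decode_warn("M  C"))  # warnings in columns 1 and 4
print(decode_warn(""))      # no warnings
```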
By default, the EM score code contains no reference to the target variable. Only in the flow score code is the RESIDUAL option specified. So only the EM flow score code can calculate values that depend on the target variable. If the RESIDUAL option is specified in the CODE statement of the modeling procedure, the code should compute the R_* variable(s), and for a categorical target, the F_* variable(s). Other kinds of residuals may be computed if that is feasible, for example CL_*, CP_*, BL_*, BP_*, or ROI_*. Plain residuals are not multiplied by error weights or by frequencies. Plain residuals will always be the actual target value minus the predicted value.
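The plain residual rule is simple enough to sketch in a few lines of Python. The interval case follows the sentence above directly. The categorical case assumes the actual value is coded 1 for the observed level and 0 for every other level, which is an assumption of this sketch and not something stated here about the generated code. Both function names are hypothetical.

```python
def plain_residual(actual, predicted):
    """R_ for an interval target: actual target value minus predicted value."""
    if actual is None or predicted is None:
        return None  # no residual without both values
    return actual - predicted


def class_residuals(actual_level, posteriors):
    """One residual per target level for a categorical target.

    Assumption: the actual value is 1 for the observed level and 0 otherwise.
    """
    return {
        level: (1.0 if level == actual_level else 0.0) - probability
        for level, probability in posteriors.items()
    }


print(plain_residual(10.0, 7.5))
print(class_residuals("1", {"1": 0.75, "0": 0.25}))
```

Note that no error weights or frequencies enter the calculation, in line with the description of plain residuals.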
Variables with prefixes like D_, EL_, or EP_ are calculated only if decision processing is specified. The formula for D_targetname varies with the data mining model. There are too many formulas to list here, but they should be identifiable in the score code.
From Nodes
In some cases it is desirable for an output variable to have the same name regardless of the target name. By default, the EM Score node provides fixed names for a variety of output variables.
Fixed Output Name | Label | Description
EM_PREDICTION | Prediction for vnm | The prediction variable for an interval target.
EM_PROBABILITY | Probability of Classification | Posterior probability associated with the predicted classification. That is, it corresponds to the maximum of the posterior probabilities, max(P1, P2, ..., Pk).
EM_EVENTPROBABILITY | Probability for level n of vnm | Posterior probability associated with target event.
EM_DECISION | Recommended Decision for vnm | Maps to D_targetname variables.
EM_PROFIT | Expected Profit for vnm | Expected profit predicted for a target variable set from EP_targetname
EM_LOSS | Expected Loss for vnm | Expected loss predicted for a target variable set from EL_targetname
EM_CLASSIFICATION | Prediction for vnm | I_variable, the prediction variable for a class target.
EM_SEGMENT | Node or Segment Variable | Segment identifier derived from Decision Tree Leaf, Cluster number, or SOM cell ID
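To make the relationship between the posterior probabilities and the fixed names concrete, here is a minimal Python sketch. It assumes the classification is simply the level with the largest posterior probability (ties go to the first level encountered, and decision processing is ignored); the function name and the sample probabilities are hypothetical.

```python
def fixed_name_outputs(posteriors, event_level):
    """Derive three of the fixed-name outputs from a dict of level -> posterior."""
    # Assumption: the predicted class is the level with the largest posterior.
    best_level = max(posteriors, key=posteriors.get)
    return {
        "EM_CLASSIFICATION": best_level,
        "EM_PROBABILITY": posteriors[best_level],       # max(P1, P2, ..., Pk)
        "EM_EVENTPROBABILITY": posteriors[event_level],  # posterior of the target event
    }


scores = fixed_name_outputs({"1": 0.3, "0": 0.7}, event_level="1")
print(scores)
```

In this example EM_PROBABILITY and EM_EVENTPROBABILITY differ because the predicted class is not the event level.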
If there is a situation where a different value in a fixed name is required you can use the Rules Builder node to assign just about anything to the EM_Outcome variable.
The Cutoff node will provide the calculated decision point as EM_CUTOFF.
In TwoStage models you also get a variable prefixed with EV_ and labeled “Expected Value:vnm” where vnm = interval target name. It is derived from the predicted value and any specified bias.
Neural Network models can also produce variables named H<number> for the value of the hidden units.
The Link Analysis node provides the item-cluster detection information as the _segment_ variable.
Decision Tree Score code creates variables named _NODE_ and _LEAF_. They are labeled 'Node' and 'Leaf'. They identify each final node or leaf by both leaf number and node number.
The Variable Clustering node can replace a large set of variables with a smaller set of cluster components with little loss of information. The cluster components are named Clusn where "n" is a number.
Some of the Modify nodes in EM can produce output variables with identifying prefixes on the variable name.
Prefix | Node | Description
IMP_ | Impute | Original variable’s value or, if missing, an imputed value
GRP_ | Interactive Binning | Group number based on the original variable's value
REP_ | Replacement | Replacement values for the variable’s class and interval levels
The Principal Component node by default produces variables named PC_n, where "n" is a number, but the "PC" can be changed in the node properties. Each value is an uncorrelated linear combination of the original input variables.
The Transform Variables node's formula builder defaults to a name like TRANS_n, where "n" is a number, although that name can be modified. The node does generate names for several predefined transformations available from the Train properties, Variables dialog.
The new variables created are named with the selected variable's name and a prefix to identify the specific transformation.
Prefix | Method | Description
BIN_ | Bucket | The bin based on the difference between the maximum and minimum values
CNTR_ | Centering | The grand mean centered value
EXP_ | Exponential | The exponential of the variable
INV_ | Inverse | The inverse of the variable
LOG_ | Log | The natural log of the original variable
LG10_ | Log 10 | The base 10 logarithm of the original variable
OPT_ | Optimal Binning | Binned in order to maximize the relationship to the target
PWR_ | Optimal Max. Equalize | Best power transformation to equalize target level spread
PCTL_ | Quantile | Values grouped so that each group has the same frequency
RANGE_ | Range Standardization | Scaled value of the variable
SQR_ | Square | Square of the variable
SQRT_ | Square Root | Square root of the variable
STD_ | Standardize | Produced by subtracting the mean and dividing by the standard deviation
TI_ | Dummy Indicator | Creates dummy variables for categorical variables from highest class value to lowest class value
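For orientation, the simpler transformations can be mimicked in a few lines of Python. This is a sketch of the textbook formulas only, not the code the node generates: it assumes strictly positive, non-missing values with at least two distinct values, uses the sample standard deviation for STD_, reads RANGE_ as scaling to the interval 0 to 1, and applies no offsets. The function name `transform_column` is hypothetical.

```python
import math
import statistics


def transform_column(name, values):
    """Return prefix-named copies of one interval variable (textbook formulas)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    low, high = min(values), max(values)
    return {
        "CNTR_" + name: [v - mean for v in values],
        "STD_" + name: [(v - mean) / std for v in values],
        "RANGE_" + name: [(v - low) / (high - low) for v in values],  # assumed 0-1 scaling
        "SQR_" + name: [v ** 2 for v in values],
        "SQRT_" + name: [math.sqrt(v) for v in values],
        "LOG_" + name: [math.log(v) for v in values],
        "LG10_" + name: [math.log10(v) for v in values],
        "EXP_" + name: [math.exp(v) for v in values],
        "INV_" + name: [1 / v for v in values],
    }


columns = transform_column("income", [1.0, 4.0, 9.0])
for column_name, column_values in columns.items():
    print(column_name, column_values)
```

The binning methods (BIN_, OPT_, PCTL_) and PWR_ depend on node settings and on the target, so they are left out of the sketch.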
If you are into unsupervised learning, you have probably already experimented with the SOM/Kohonen node.
Variable | Label | Description
SOM_Segment | SOM Segment ID | Integer identifying the cluster
SOM_ID | SOM ID | Contains the row and column in the SOM
Distance | Distance | The distance from each case to the cluster seed
SOM_DIMENSION1 | SOM Dimension1 | Identifies rows or columns in the SOM
SOM_DIMENSION2 | SOM Dimension2 | Identifies rows or columns in the SOM
The Incremental Response node provides several variables that could be used to optimize customer targeting.
Variable | Description
EM_P_CONTROL_RESPONSE | Predicted response probability from the control group
EM_P_CONTROL_NONRESPONSE | 1 - EM_P_CONTROL_RESPONSE
EM_P_ADJ_INCREMENT_RESPONSE | Adjusted to be positive incremental predicted response rate
EM_P_ADJ_INCREMENT_NONRESPONSE | 1 - EM_P_ADJ_INCREMENT_RESPONSE
EM_P_ABS_INCREMENT_RESPONSE | Absolute value of the incremental predicted response rate (available when an outcome model is used)
EM_P_ABS_INCREMENT_NONRESPONSE | 1 - EM_P_ABS_INCREMENT_RESPONSE
EM_P_TREATMENT_RESPONSE | Predicted response probability from the treatment group
EM_P_TREATMENT_NONRESPONSE | 1 - EM_P_TREATMENT_RESPONSE
EM_P_INCREMENT_RESPONSE | EM_P_TREATMENT_RESPONSE - EM_P_CONTROL_RESPONSE
EM_P_INCREMENT_NONRESPONSE | EM_P_TREATMENT_NONRESPONSE - EM_P_CONTROL_NONRESPONSE
EM_P_CONTROL_OUTCOME | Predicted value of the outcome variable from the control group
EM_P_TREATMENT_OUTCOME | Predicted value of the outcome variable from the treatment group
EM_P_INCREMENT_OUTCOME | EM_P_TREATMENT_OUTCOME - EM_P_CONTROL_OUTCOME
EM_REV_TREATMENT | Estimated revenue for the treatment group: EM_P_TREATMENT_RESPONSE * EM_P_TREATMENT_OUTCOME - Cost or, if Constant Revenue is set, EM_P_CONTROL_RESPONSE * Revenue_Per_Response - Cost
EM_REV_CONTROL | Estimated revenue for the control group: EM_P_CONTROL_RESPONSE * EM_P_CONTROL_OUTCOME
EM_REV_INCREMENT | Estimated incremental revenue: EM_REV_TREATMENT - EM_REV_CONTROL
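The derived variables are simple arithmetic on the four predicted values, which the following Python sketch reproduces. It reads each derived row as a complement (1 minus the response probability) or a difference (treatment minus control), and it covers only the case where an outcome model supplies the revenue; the adjusted increment and the Constant Revenue variant are left out because their exact rules are not spelled out here. The function name and the sample numbers are hypothetical.

```python
def incremental_response(p_treatment, p_control, outcome_treatment, outcome_control, cost=0.0):
    """Compute the derived Incremental Response variables for one case."""
    increment = p_treatment - p_control
    rev_treatment = p_treatment * outcome_treatment - cost
    rev_control = p_control * outcome_control
    return {
        "EM_P_TREATMENT_NONRESPONSE": 1 - p_treatment,
        "EM_P_CONTROL_NONRESPONSE": 1 - p_control,
        "EM_P_INCREMENT_RESPONSE": increment,
        "EM_P_INCREMENT_NONRESPONSE": (1 - p_treatment) - (1 - p_control),
        "EM_P_ABS_INCREMENT_RESPONSE": abs(increment),
        "EM_P_ABS_INCREMENT_NONRESPONSE": 1 - abs(increment),
        "EM_P_INCREMENT_OUTCOME": outcome_treatment - outcome_control,
        "EM_REV_TREATMENT": rev_treatment,
        "EM_REV_CONTROL": rev_control,
        "EM_REV_INCREMENT": rev_treatment - rev_control,
    }


case = incremental_response(
    p_treatment=0.5, p_control=0.25,
    outcome_treatment=100.0, outcome_control=80.0, cost=10.0,
)
for variable, value in case.items():
    print(variable, value)
```

Ranking customers by EM_P_INCREMENT_RESPONSE or EM_REV_INCREMENT is one way such values could feed a targeting decision.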
If you can think of other variables generated by EM score code or a better description, please add them to the comments. I left one out just to prove to someone that nobody ever used it.