Have you ever wondered what some of the output variables generated by Enterprise Miner score code represent?
The meaning of some of the variables generated by Enterprise Miner (EM) score code is obvious; others are less so. The score code published for a model in EM is often a combination of code generated by procedures and code generated by the nodes. The following is a summary of the variables potentially created, with a brief description of each to provide some orientation.
From Procedures
Many of the names of the computed variables (outputs, residuals, etc.) are created by concatenating a prefix with the name of the corresponding target variable or decision variable. The table below lists most of the possible prefixes for variables calculated in EM procedure score code. Some of the more esoteric variables appear only if you use the less common procedure options.
List of many of the possible prefixes used for variable names in the EM procedures’ OUT= data sets:
Prefix | Label * | Description
AOV16_ | AOV16:VNM | Interval variables binned into 16 equally-spaced groups
BL_ | Best Loss:VNM | Best possible loss of any of the decisions
BP_ | Best Profit:VNM | Best possible profit of any of the decisions
CL_ | Computed Loss:VNM | Loss computed from the target value
CP_ | Computed Profit:VNM | Profit computed from the target value
D_ | Decision:VNM | Decision chosen by the model
EL_ | Expected Loss:VNM | Expected loss of the decision chosen by the model
EP_ | Expected Profit:VNM | Expected profit of the decision chosen by the model
E_ | Error Function:VNM | Error function
F_ | From:VNM | Normalized target value of the category that the case comes from
GRP_ | Grouped:VNM | Grouped based on variable characteristics
G_ | Grouped:VNM | Grouped based on the relationship to the target
IC_ | Investment Cost:VNM | Investment cost
IM_ | Imputed:VNM | Variable with any missing values replaced
I_ | Into:VNM | Normalized category that the case is classified into
M_ | Missing:VNM | Missingness indicator dummy variable
P_ | Predicted:VNM | Outputs (i.e. predicted values and posterior probabilities)
Q_ | Unadjusted P:VNM | Old posteriors, prior to adjustment for priors
RAT_ | | Studentized Anscombe residuals
RA_ | Anscombe Residual:VNM | Anscombe residuals
RAS_ | | Standardized Anscombe residuals
RD_ | | Deviance residuals
RDS_ | | Standardized deviance residuals
RDT_ | | Studentized deviance residuals
ROI_ | Return on Investment: | Return on investment
RPT_ | Pearson Residual:VNM | Studentized Pearson residuals
RP_ | Pearson Residual:VNM | Pearson residuals
RPS_ | | Standardized Pearson residuals
RS_ | | Standardized residuals
RT_ | | Studentized residuals
R_ | Residual:VNM | Plain residuals: target minus output
S_ | Standard:VNM | Standardized variable
T_ | Transform:VNM | Transformed variable
U_ | Unnormalized Into:VNM | Un-normalized category that the case is classified into
V_ | Validated:VNM | Same as P_ only based on validation data. Tree only.
WOE_ | Weight of Evidence:VNM | Relative risk of an attribute or group level
* For non-categorical targets, the "VNM" above indicates the name of the target variable. For categorical targets, "VNM" represents the name of the target variable followed by an equal sign and the un-normalized category value.
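As a rough illustration of the prefix-plus-name convention, here is a minimal Python sketch that splits a scored column name into its prefix and the remainder. The helper name `split_prefix` and the example column `P_BAD1` (a hypothetical target `BAD` with level `1`) are assumptions for illustration only; the dictionary holds just a subset of the table above.

```python
# Subset of the prefix table above; descriptions are taken from the article.
PREFIXES = {
    "P_": "Outputs (predicted values and posterior probabilities)",
    "I_": "Normalized category that the case is classified into",
    "U_": "Un-normalized category that the case is classified into",
    "R_": "Plain residuals: target minus output",
    "D_": "Decision chosen by the model",
    "EP_": "Expected profit of the decision chosen by the model",
    "EL_": "Expected loss of the decision chosen by the model",
    "IM_": "Variable with any missing values replaced",
    "WOE_": "Relative risk of an attribute or group level",
}


def split_prefix(name):
    """Return (prefix, remainder, description), or None if no known prefix matches.

    Hypothetical helper: splits at the first underscore and looks the
    leading part up in PREFIXES (case-insensitive).
    """
    head, sep, rest = name.partition("_")
    prefix = head.upper() + sep
    if sep and rest and prefix in PREFIXES:
        return prefix, rest, PREFIXES[prefix]
    return None


if __name__ == "__main__":
    for column in ("P_BAD1", "I_BAD", "EP_BAD", "INCOME"):
        print(column, "->", split_prefix(column))
```

Splitting at the first underscore is enough here because every prefix in the table ends with its first underscore; a target whose own name happens to start with one of these prefixes would be misread, so treat the result as a hint rather than a guarantee.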
The generated score code almost always computes the P_* variable(s), and for a categorical target, the I_* and U_* variable(s). But some modeling engines may allow other ways of fitting categorical targets. For example, Regression (proc DMREG) fits an ordinal target by linear least squares using the index of the category as the actual target value, and hence does not produce posterior probabilities.
Only the decision tree outputs a V_ output variable, which is similar to a corresponding P_ output variable except it is computed using validation data instead of training data.
One of the more ubiquitous variables is the global variable _WARN_. It is used to indicate problems that may occur computing predicted values or making decisions. The _WARN_ variable has 4 columns and each can be set to a specific code.
Column | Code | Description
1 | M | Missing input
2 | U | Unrecognized input category
3 | P | Invalid posterior probability
4 | C | Missing cost variable
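If you export scored data and want to inspect _WARN_ outside of SAS, a small decoder along these lines may help. This is only a sketch: it assumes the value arrives as a string of up to four characters, with a blank in any column that has no warning, and the function name `decode_warn` is made up for the example.

```python
# Position (0-based) -> (code letter, description), per the table above.
WARN_CODES = {
    0: ("M", "Missing input"),
    1: ("U", "Unrecognized input category"),
    2: ("P", "Invalid posterior probability"),
    3: ("C", "Missing cost variable"),
}


def decode_warn(warn):
    """Return the list of problem descriptions flagged in a _WARN_ value.

    Assumption: an unset column is a blank, and short or missing values
    are treated as if padded with blanks on the right.
    """
    padded = (warn or "").ljust(4)[:4]
    problems = []
    for position, (code, description) in WARN_CODES.items():
        if padded[position].upper() == code:
            problems.append(description)
    return problems


if __name__ == "__main__":
    print(decode_warn("M   "))
    print(decode_warn("MU"))
    print(decode_warn("   C"))
```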
By default, the EM score code contains no reference to the target variable. Only in the flow score code is the RESIDUAL option specified. So only the EM flow score code can calculate values that depend on the target variable. If the RESIDUAL option is specified in the CODE statement of the modeling procedure, the code should compute the R_* variable(s), and for a categorical target, the F_* variable(s). Other kinds of residuals may be computed if that is feasible, for example CL_*, CP_*, BL_*, BP_*, or ROI_*. Plain residuals are not multiplied by error weights or by frequencies. Plain residuals will always be the actual target value minus the predicted value.
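The plain residual described above is simple enough to write out. The sketch below shows it for an interval target and, as an assumption on my part, for a categorical target where each level's residual is taken as the 0/1 indicator of the actual level minus that level's posterior probability. The function names are hypothetical and this is not the generated score code.

```python
def plain_residual(actual, predicted):
    """R_ for an interval target: actual minus predicted.

    No error weights or frequencies are applied, matching the
    description of plain residuals above.
    """
    if actual is None or predicted is None:
        return None  # the residual is undefined when either value is missing
    return actual - predicted


def class_residuals(actual_level, posteriors):
    """Per-level residuals for a categorical target (assumed form).

    Each level's 'actual' value is 1.0 if the case comes from that level
    and 0.0 otherwise; the 'predicted' value is the posterior probability.
    """
    return {
        level: (1.0 if level == actual_level else 0.0) - probability
        for level, probability in posteriors.items()
    }


if __name__ == "__main__":
    print(plain_residual(10.0, 7.5))
    print(class_residuals("1", {"1": 0.25, "0": 0.75}))
```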
Variables with prefixes like D_, EL_, or EP_ are calculated only if decision processing is specified. The formula for D_targetname varies with the data mining model. There are too many formulas to list here, but they should be identifiable in the score code.
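To make the decision variables a little more concrete, here is a sketch of one common way to pick a decision from a profit matrix: compute the expected profit of each decision from the posterior probabilities and choose the largest. This is an assumed textbook form, not the formula EM uses for any particular model, and the names `score_decisions`, `solicit`, and `ignore` are invented for the example.

```python
def score_decisions(posteriors, profit, actual=None):
    """Sketch of D_, EP_, CP_ and BP_ under a simple profit matrix.

    posteriors: {target level: posterior probability}
    profit:     {target level: {decision: profit}}
    actual:     the true target level, if known (needed for CP_ and BP_)
    """
    decisions = list(next(iter(profit.values())).keys())
    # Expected profit of each decision, averaged over the target levels.
    expected = {
        decision: sum(posteriors[level] * profit[level][decision] for level in posteriors)
        for decision in decisions
    }
    chosen = max(decisions, key=expected.get)
    result = {"D_": chosen, "EP_": expected[chosen]}
    if actual is not None:
        # Profit computed from the target value for the chosen decision.
        result["CP_"] = profit[actual][chosen]
        # Best possible profit of any of the decisions for this target value.
        result["BP_"] = max(profit[actual].values())
    return result


if __name__ == "__main__":
    matrix = {
        "1": {"solicit": 10.0, "ignore": 0.0},
        "0": {"solicit": -1.0, "ignore": 0.0},
    }
    print(score_decisions({"1": 0.2, "0": 0.8}, matrix, actual="0"))
```

The loss-based variables (EL_, CL_, BL_) would follow the same pattern with a loss matrix and a minimum instead of a maximum, again as an assumption rather than a statement about the generated code.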
From Nodes
In some cases it is desirable to have an output variable have the same name regardless of the target name. The EM Score node by default provides variables with fixed names for a variety of output variables.
Fixed Output Name | Label | Description
EM_PREDICTION | Prediction for vnm | The prediction variable for an interval target.
EM_PROBABILITY | Probability of Classification | Posterior probability associated with the predicted classification. That is, it corresponds to the maximum of the posterior probabilities, max(P1, P2, ..., Pk).
EM_EVENTPROBABILITY | Probability for level n of vnm | Posterior probability associated with the target event.
EM_DECISION | Recommended Decision for vnm | Maps to D_targetname variables.
EM_PROFIT | Expected Profit for vnm | Expected profit predicted for a target variable, set from EP_targetname
EM_LOSS | Expected Loss for vnm | Expected loss predicted for a target variable, set from EL_targetname
EM_CLASSIFICATION | Prediction for vnm | I_variable, the prediction variable for a class target.
EM_SEGMENT | Node or Segment Variable | Segment identifier derived from Decision Tree Leaf, Cluster number, or SOM cell ID
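The relationship between the posterior probabilities and the fixed-name classification outputs can be sketched as follows. This assumes the case is classified into the level with the highest posterior and that no decision processing changes the outcome; the function name and the level values are placeholders.

```python
def fixed_name_outputs(posteriors, event_level):
    """Derive the fixed-name classification outputs from posteriors.

    posteriors:  {target level: posterior probability} (the P_ variables)
    event_level: the level treated as the target event
    """
    # Assumption: the predicted class is the level with the largest posterior.
    predicted_level = max(posteriors, key=posteriors.get)
    return {
        "EM_CLASSIFICATION": predicted_level,
        # max(P1, P2, ..., Pk), as described in the table above
        "EM_PROBABILITY": posteriors[predicted_level],
        "EM_EVENTPROBABILITY": posteriors[event_level],
    }


if __name__ == "__main__":
    print(fixed_name_outputs({"1": 0.3, "0": 0.7}, event_level="1"))
```

Note that EM_PROBABILITY and EM_EVENTPROBABILITY differ whenever the predicted class is not the event level, as in the example call.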
If you need a different value under a fixed name, you can use the Rules Builder node to assign just about anything to the EM_Outcome variable.
The Cutoff node will provide the calculated decision point as EM_CUTOFF.
In TwoStage models you also get a variable prefixed with EV_ and labeled “Expected Value:vnm” where vnm = interval target name. It is derived from the predicted value and any specified bias.
Neural Network models can also produce variables named H<number> for the value of the hidden units.
The Link Analysis node provides the item-cluster detection information as the _segment_ variable.
Decision Tree Score code creates variables named _NODE_ and _LEAF_. They are labeled 'Node' and 'Leaf'. They identify each final node or leaf by both leaf number and node number.
The Variable Clustering node can replace a large set of variables with a smaller set of cluster components with little loss of information. The cluster components are named Clusn where "n" is a number.
Some of the Modify nodes in EM can produce output variables with identifying prefixes on the variable name.
Prefix | Node | Description
IMP_ | Impute | Original variable’s value or, if missing, an imputed value
GRP_ | Interactive Binning | Group number based on the original variable's value
REP_ | Replacement | Replacement values for the variable’s class and interval levels
The Principal Component node by default produces variables named PC_n where "n" is a number, but the "PC" can be changed in the node properties. Each variable's value is an uncorrelated linear combination of the original input variables.
The Transform Variables node's formula builder defaults to a name like TRANS_n where "n" is a number. However, that name can be modified. The node also generates names for several pre-defined transformations available from the Train properties, Variables dialog.
The new variables created are named with the selected variable's name and a prefix to identify the specific transformation.
Prefix | Method | Description
BIN_ | Bucket | the bin based on the difference between the maximum and minimum values
CNTR_ | Centering | the grand mean centered value
EXP_ | Exponential | the exponential of the variable
INV_ | Inverse | the inverse of the variable
LOG_ | Log | the natural log of the original variable
LG10_ | Log 10 | the base 10 logarithm of the original variable
OPT_ | Optimal Binning | binned in order to maximize the relationship to the target
PWR_ | Optimal Max. Equalize | best power transformation to equalize target level spread
PCTL_ | Quantile | values grouped so that each group has the same frequency
RANGE_ | Range Standardization | scaled value of the variable
SQR_ | Square | square of the variable
SQRT_ | Square Root | square root of the variable
STD_ | Standardize | produced by subtracting the mean and dividing by the standard deviation
TI_ | Dummy Indicator | creates dummy variables for categorical variables, from highest class value to lowest class value
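For orientation, several of the simpler transformations in the table can be sketched in a few lines of Python. This is only an approximation under stated assumptions: inputs are strictly positive and non-missing, the sample standard deviation is used for STD_, and range standardization is taken to be (x - min) / (max - min). The node's generated code may apply offsets or handle edge cases differently, and `transform_column` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def transform_column(name, values):
    """Return {new variable name: transformed values} for one input column.

    Assumes every value is a positive, non-missing number so that the log,
    inverse and square root are all defined.
    """
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption)
    low, high = min(values), max(values)
    return {
        f"CNTR_{name}": [v - mu for v in values],
        f"STD_{name}": [(v - mu) / sd for v in values],
        f"RANGE_{name}": [(v - low) / (high - low) for v in values],
        f"SQR_{name}": [v * v for v in values],
        f"SQRT_{name}": [math.sqrt(v) for v in values],
        f"EXP_{name}": [math.exp(v) for v in values],
        f"INV_{name}": [1.0 / v for v in values],
        f"LOG_{name}": [math.log(v) for v in values],
        f"LG10_{name}": [math.log10(v) for v in values],
    }


if __name__ == "__main__":
    for new_name, column in transform_column("INCOME", [1.0, 4.0, 7.0]).items():
        print(new_name, column)
```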
If you are into unsupervised learning, you have probably already experimented with the SOM/Kohonen node. It provides the following variables.
Variable | Label | Description
SOM_Segment | SOM Segment ID | integer identifying the cluster
SOM_ID | SOM ID | contains the row and column in the SOM
Distance | Distance | the distance from each case to the cluster seed
SOM_DIMENSION1 | SOM Dimension1 | identifies rows or columns in the SOM
SOM_DIMENSION2 | SOM Dimension2 | identifies rows or columns in the SOM
The Incremental Response node provides several variables that could be used to optimize customer targeting.
Variable | Description
EM_P_CONTROL_RESPONSE | Predicted response probability from the control group
EM_P_CONTROL_NONRESPONSE | 1 - EM_P_CONTROL_RESPONSE
EM_P_ADJ_INCREMENT_RESPONSE | Incremental predicted response rate, adjusted to be positive
EM_P_ADJ_INCREMENT_NONRESPONSE | 1 - EM_P_ADJ_INCREMENT_RESPONSE
EM_P_ABS_INCREMENT_RESPONSE | Absolute value of the incremental predicted response rate (available when an outcome model is used)
EM_P_ABS_INCREMENT_NONRESPONSE | 1 - EM_P_ABS_INCREMENT_RESPONSE
EM_P_TREATMENT_RESPONSE | Predicted response probability from the treatment group
EM_P_TREATMENT_NONRESPONSE | 1 - EM_P_TREATMENT_RESPONSE
EM_P_INCREMENT_RESPONSE | EM_P_TREATMENT_RESPONSE - EM_P_CONTROL_RESPONSE
EM_P_INCREMENT_NONRESPONSE | EM_P_TREATMENT_NONRESPONSE - EM_P_CONTROL_NONRESPONSE
EM_P_CONTROL_OUTCOME | Predicted value of the outcome variable from the control group
EM_P_TREATMENT_OUTCOME | Predicted value of the outcome variable from the treatment group
EM_P_INCREMENT_OUTCOME | EM_P_TREATMENT_OUTCOME - EM_P_CONTROL_OUTCOME
EM_REV_TREATMENT | Estimated revenue for the treatment group: EM_P_TREATMENT_RESPONSE * EM_P_TREATMENT_OUTCOME – Cost or, if Constant Revenue is set, EM_P_CONTROL_RESPONSE * Revenue_Per_Response – Cost
EM_REV_CONTROL | Estimated revenue for the control group: EM_P_CONTROL_RESPONSE * EM_P_CONTROL_OUTCOME
EM_REV_INCREMENT | Estimated incremental revenue: EM_REV_TREATMENT - EM_REV_CONTROL
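The arithmetic in the table above can be checked with a short sketch. It covers only the rows whose formulas are spelled out and uses the outcome-model form of the revenue calculation (not the Constant Revenue variant); the function name and the sample numbers are invented for illustration.

```python
def incremental_response(p_treatment, p_control, out_treatment, out_control, cost=0.0):
    """Compute the derived Incremental Response variables from the table above.

    p_treatment / p_control:     predicted response probabilities
    out_treatment / out_control: predicted outcome values
    cost:                        cost subtracted from treatment revenue
    """
    scores = {
        "EM_P_TREATMENT_RESPONSE": p_treatment,
        "EM_P_TREATMENT_NONRESPONSE": 1 - p_treatment,
        "EM_P_CONTROL_RESPONSE": p_control,
        "EM_P_CONTROL_NONRESPONSE": 1 - p_control,
        "EM_P_TREATMENT_OUTCOME": out_treatment,
        "EM_P_CONTROL_OUTCOME": out_control,
    }
    scores["EM_P_INCREMENT_RESPONSE"] = p_treatment - p_control
    scores["EM_P_INCREMENT_NONRESPONSE"] = (
        scores["EM_P_TREATMENT_NONRESPONSE"] - scores["EM_P_CONTROL_NONRESPONSE"]
    )
    scores["EM_P_INCREMENT_OUTCOME"] = out_treatment - out_control
    # Outcome-model form of the revenue estimates.
    scores["EM_REV_TREATMENT"] = p_treatment * out_treatment - cost
    scores["EM_REV_CONTROL"] = p_control * out_control
    scores["EM_REV_INCREMENT"] = scores["EM_REV_TREATMENT"] - scores["EM_REV_CONTROL"]
    return scores


if __name__ == "__main__":
    result = incremental_response(0.5, 0.25, 100.0, 80.0, cost=10.0)
    for variable, value in result.items():
        print(variable, value)
```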
If you can think of other variables generated by EM score code or a better description, please add them to the comments. I left one out just to prove to someone that nobody ever used it.
"Risk comes from not knowing what you're doing."
~ Warren Edward Buffett (born August 30, 1930) an American business magnate, investor, and philanthropist