05-29-2014 01:08 PM
Could someone explain the three columns that are generated in the exported data set of any classification modeling node: the from column, the into column, and the decision column?
For example, if my target variable is binary and named tgt, what do from_tgt, into_tgt, and decision_tgt mean?
Also, which of the following two options is the correct way to construct the confusion matrix?
a. Cross tab between from_tgt and into_tgt columns.
b. Cross tab between actual tgt column and decision_tgt column.
These two approaches give me different matrices. The first has low precision and high recall, and the second has exactly the opposite. It also seems that the minimum predicted probability for tgt = Y, among rows where from = Y and into = Y, is above 0.5.
Any help is highly appreciated.
05-29-2014 01:34 PM
You can find more information in the SAS Enterprise Miner Reference Help (Link to doc: SAS Enterprise Miner). In the Predictive Modeling chapter look for the "Scored Data Sets" section.
Your interpretation is correct. F_ is the original target in your data, I_ is the target predicted by the model, and D_ is the predicted target after accounting for your decision matrix (maximize profit or minimize loss). You are also correct that the cutoff for a predicted probability to be considered a predicted event (I_) is 0.5.
Below is a brief excerpt from the reference help. Although it addresses categorical variables in general, you can interpret it in terms of event or nonevent:
"The I_ variable is the category that the case is classified into--also a formatted value. The I_ value is the category with the highest posterior probability. If a decision matrix is used, the D_ value is the decision with the largest estimated profit or smallest estimated loss. The D_ value might differ from the I_ value for two reasons:
• The decision alternatives do not necessarily correspond to the target categories, and
• The I_ depends directly on the posterior probabilities, not on estimated profit or loss.
However, the I_ value can depend indirectly on the decision matrix when the decision matrix is used in model estimation or selection."
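To make the excerpt concrete, here is a small sketch in Python (not Enterprise Miner code) of how the two values could be derived for a binary target. The profit matrix, its numbers, and the function names are hypothetical and chosen only for illustration; the real D_ values depend on the decision matrix you defined for your target.

```python
# Illustrative only: mimics the I_ and D_ rules quoted above, outside of SAS.
# PROFIT is a hypothetical decision matrix indexed as PROFIT[actual][decision].
PROFIT = {
    "Y": {"Y": 5.0, "N": 0.0},   # actual event: profit of deciding Y or N
    "N": {"Y": -1.0, "N": 0.0},  # actual nonevent: deciding Y costs 1
}

def into_value(p_event):
    """I_: the category with the highest posterior probability."""
    posterior = {"Y": p_event, "N": 1.0 - p_event}
    return max(posterior, key=posterior.get)

def decision_value(p_event, profit=PROFIT):
    """D_: the decision with the largest expected profit."""
    posterior = {"Y": p_event, "N": 1.0 - p_event}
    expected = {
        d: sum(posterior[a] * profit[a][d] for a in posterior)
        for d in ("Y", "N")
    }
    return max(expected, key=expected.get)

# Print the event probability, the I_ value, and the D_ value.
for p in (0.10, 0.30, 0.60):
    print(p, into_value(p), decision_value(p))
```

With this made-up matrix, a case with an event probability of 0.30 gets I_ = N but D_ = Y, because the expected profit of deciding Y turns positive once the probability passes 1/6, while I_ only switches at 0.5. That is why a cross tab that uses D_ can differ from one that uses I_.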
The Model Comparison node calculates the true positive, true negative, false positive, and false negative counts for your models. If you prefer to build a confusion matrix yourself, you can also try the code below in a SAS Code node:
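title1 "Training data TP FP TN FN";
proc tabulate data=&EM_IMPORT_DATA;
  class f_%EM_TARGET i_%EM_TARGET;
  /* Rows: actual (F_); columns: predicted (I_); cells: counts */
  table f_%EM_TARGET, i_%EM_TARGET*n;
run;
/*Add the below if you have validation data*/
title1 "Validation data TP FP TN FN";
proc tabulate data=&EM_IMPORT_VALIDATE;
  class f_%EM_TARGET i_%EM_TARGET;
  table f_%EM_TARGET, i_%EM_TARGET*n;
run;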
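If you want to check the counts outside Enterprise Miner, the following is a minimal Python sketch of the same kind of tally as the PROC TABULATE code above, with precision and recall derived from it. It assumes the scored data has been exported and the columns are available as lists of "Y"/"N" labels. The example values are made up and the function names are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted, event="Y"):
    """Tally TP, FP, TN, FN for one actual/predicted column pair."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == event:
            counts["TP" if a == event else "FP"] += 1
        else:
            counts["FN" if a == event else "TN"] += 1
    return {k: counts[k] for k in ("TP", "FP", "TN", "FN")}

def precision_recall(c):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN); None if undefined."""
    predicted_pos = c["TP"] + c["FP"]
    actual_pos = c["TP"] + c["FN"]
    precision = c["TP"] / predicted_pos if predicted_pos else None
    recall = c["TP"] / actual_pos if actual_pos else None
    return precision, recall

# Made-up columns standing in for F_tgt, I_tgt and D_tgt.
f_tgt = ["Y", "Y", "N", "N", "N", "N"]
i_tgt = ["Y", "N", "N", "N", "N", "N"]
d_tgt = ["Y", "Y", "Y", "N", "N", "N"]

for name, pred in (("F_ by I_", i_tgt), ("F_ by D_", d_tgt)):
    c = confusion_counts(f_tgt, pred)
    print(name, c, precision_recall(c))
```

Swapping the I_ column for the D_ column is the only change needed to tally the decision-based matrix instead of the from/into one.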
I hope it helps.
05-29-2014 10:33 PM
Thanks for your reply. I should mention that I am using a cost matrix for my imbalanced classification problem. I see that the decision_target column is populated with Y or N, as are the from and into columns and the actual target flag. The code at the end of your reply seems to construct the misclassification matrix from the from and into columns, not from the decision column and the actual target flag. Please let me know if that is correct.
05-30-2014 10:14 AM
Correct, the code I sent builds the from/into classification matrix. If you want the decision variable, use d_%EM_TARGET.
The %EM_TARGET macro resolves to the name of your target.
The actual target is built by the procedure that your EM modeling node uses. Sometimes it is a dummy variable. Use its variable name in code similar to the PROC TABULATE step in this thread and you should be good to go.