Solved: Re: EM: Setting cutoff for predictive model

fann_man · Posted 03-22-2015 11:18 AM

I am building a logistic regression model in EM 13.2 to predict a binary ownership variable. I would like to predict the 20% of cases with the highest regression scores as being positive for ownership.

After building the model, I attached a "Cutoff" node, which allowed me to set the regression score above which cases are predicted to be positive. However, I want to know whether there is a more flexible way that identifies the score at the 80th percentile, rather than setting the value manually. The model needs to be applied to other datasets, which will not necessarily have the same regression score at the 80th percentile.

Is there an option within "Cutoff" which can achieve this? Or possibly an alternative method which could achieve the same results?

Thanks

DougWielenga · Posted 12-05-2017 11:49 AM

Depending on the scenario, you might want to pick a fixed cutoff probability, knowing that it will likely cut off a different proportion of each data set it scores. If your goal is instead to predict the top 20% of the scored observations as responders regardless of their actual predicted probabilities, the easiest way is to use the RANK procedure with the GROUPS= option to create 5 bins based on the predicted probability. The first or last group (depending on sort order) corresponds to the top 20%. If you scored a data set with ROLE=SCORE using a Score node in SAS Enterprise Miner, you could connect a subsequent SAS Code node and use the following code, assuming a categorical target:

/*** BEGIN SAS CODE ***/

libname mylib 'C:\data'; * define path to where you will write out your data;

proc rank data=&EM_IMPORT_SCORE out=myranks groups=5 descending; * identify 5 groups based on predicted probability;
var EM_EVENTPROBABILITY;
ranks MyRankVar;
run;

proc freq data=myranks; * crosstabs of rank variable by actual target level;
tables MyRankVar * %EM_TARGET / nocol nopercent;
run;

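data mylib.MyScores; * flag those in the top 20%;
set myranks;
Top20=.; * stays missing when the rank (and so the predicted probability) is missing;
if MyRankVar eq 0 then Top20=1; * group 0 holds the highest probabilities because of DESCENDING;
else if MyRankVar gt 0 then Top20=0;
run;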

proc freq data=mylib.MyScores; * verify you have flagged the right group;
tables MyRankVar*Top20 / norow nocol nopercent;
run;

proc means data=mylib.MyScores; * calculate statistics on predicted probability grouped by Top20;
var EM_EVENTPROBABILITY;
class Top20;
run;

/*** END SAS CODE ***/

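The DESCENDING option in the RANK procedure makes group 0 the group with the highest predicted probabilities, which is what the Top20 flag assumes. If you leave it out, the top 20% ends up in group 4 instead, so the flag logic must change to match. Also, the EM_EVENTPROBABILITY variable is added by the Score node, so if you do not score using the Score node you will need to modify the code to name the variable that contains the predicted probability of the target event.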
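
If you also need the score value at the 80th percentile itself (as the original question asks), one option is to have the UNIVARIATE procedure compute it for each data set you score, then compare every observation against it. This is only a minimal sketch under the same assumptions as the code above (a SAS Code node following a Score node). The names pctl, P80, and flagged are illustrative and not produced by Enterprise Miner.

/*** BEGIN SAS CODE ***/

proc univariate data=&EM_IMPORT_SCORE noprint; * compute the 80th percentile of the predicted probability;
var EM_EVENTPROBABILITY;
output out=pctl pctlpts=80 pctlpre=P; * writes one row with the variable P80;
run;

proc print data=pctl; * show the cutoff value found for this data set;
run;

data flagged; * flag observations scoring above the 80th percentile;
if _n_ = 1 then set pctl; * P80 is read once and retained on every row;
set &EM_IMPORT_SCORE;
Top20 = (EM_EVENTPROBABILITY gt P80); * 1 if above the cutoff and 0 otherwise, including missing scores;
run;

/*** END SAS CODE ***/

Because the percentile is recomputed each time, the cutoff adapts to whichever data set is scored. When many observations share the score at the cutoff, the flagged share can differ somewhat from 20%.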

Hope this helps!

Doug


M_Maldonado · Posted 03-24-2015 09:32 AM

A popular automatic cutoff selection method is Event Precision Equal Recall. This method selects the cutoff at which the event precision rate and the true positive rate (recall) intersect. I posted a brief example here: https://communities.sas.com/docs/DOC-6050
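
To illustrate the idea outside the Cutoff node: precision is TP/(TP+FP) and recall is TP/(TP+FN), so the two are equal when the number of cases predicted positive matches the number of actual events. A rough sketch of that logic follows. It is not necessarily how the Cutoff node computes it, and it assumes a labeled data set named labeled with a 0/1 target owner (1 = event, at least one event present) and a predicted probability p_owner. All of those names are hypothetical.

/*** BEGIN SAS CODE ***/

proc sql noprint; * count the actual events in the labeled data;
select sum(owner = 1) into :n_events trimmed from labeled;
quit;

proc sort data=labeled out=sorted; * highest predicted probabilities first;
by descending p_owner;
run;

data pr_cutoff; * keep the score of the row whose position equals the event count;
set sorted (firstobs=&n_events obs=&n_events);
keep p_owner;
run;

proc print data=pr_cutoff; * scores at or above this value are predicted positive;
run;

/*** END SAS CODE ***/

Unlike the top-20% approach, this needs data where the actual target is known, and tied scores at the cutoff can make precision and recall only approximately equal.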

I hope it helps,

Miguel
