fann_man
Calcite | Level 5

I am building a logistic regression model in EM 13.2 to predict a binary ownership variable. I would like to predict the 20% of cases with the highest regression scores as being positive for ownership.

After building the model I attached a "Cutoff" node, which allowed me to set the regression score above which cases are predicted to be positive. However, I want to know if there is a more flexible way that identifies the score at the 80th percentile, rather than setting the value manually. The model needs to be applied to other datasets, which will not necessarily have the same regression score at the 80th percentile.

Is there an option within "Cutoff" which can achieve this? Or possibly an alternative method which could achieve the same results?

Thanks

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

Depending on the scenario, you might prefer to fix a cutoff probability of interest, knowing that it will (likely) cut off a different proportion of each data set it scores.  If your goal is instead to predict the top 20% of the scored observations as responders regardless of their actual predicted probabilities, the easiest way is to use the RANK procedure with the GROUPS= option to create 5 bins based on the predicted probability.  The first or last group (depending on sort order) corresponds to the top 20%.  If you scored a data set with ROLE=SCORE using a Score node in SAS Enterprise Miner, you could connect a subsequent SAS Code node and use the following code, assuming a categorical target:

 

/*** BEGIN SAS CODE ***/

libname mylib 'C:\data';  * define path to where you will write out your data;

proc rank data=&EM_IMPORT_SCORE out=myranks groups=5 descending;  * identify 5 groups based on predicted probability;
   var EM_EVENTPROBABILITY;
   ranks MyRankVar;
run;

proc freq data=myranks;   * crosstabs of rank variable by actual target level;
   tables MyRankVar * %EM_TARGET / nocol nopercent;
run;

data mylib.MyScores;  * flag those in the top 20%;
   set myranks;
   if missing(MyRankVar) then Top20=.;  * no rank is assigned when the predicted probability is missing;
   else if MyRankVar gt 0 then Top20=0;
   else Top20=1;  * with DESCENDING the highest probabilities are in group 0;
run;

proc freq data=mylib.MyScores;   * verify you have flagged the right group;
   tables MyRankVar*Top20 / norow nocol nopercent;
run;

proc means data=mylib.MyScores;  * calculate statistics on predicted probability grouped by Top20;
   var EM_EVENTPROBABILITY;
   class Top20;
run;

/*** END SAS CODE ***/

 

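Whether you need the DESCENDING option in the RANK procedure depends on which group you want to flag: with DESCENDING, the highest predicted probabilities fall in group 0; without it, they fall in group 4, and the Top20 test must change accordingly.  Also, the EM_EVENTPROBABILITY variable is added by the Score node, so if you do not score using the Score node you will need to modify the code to name the variable containing the predicted probability for the target event.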
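
If you would rather see the actual 80th-percentile score for each data set (closer to what the Cutoff node asks for), a rough alternative sketch is to compute it with PROC UNIVARIATE and compare each probability against it. The data set names work.p80 and work.flagged and the P_ prefix are only illustrative assumptions; &EM_IMPORT_SCORE and EM_EVENTPROBABILITY are used as in the code above.

```sas
/* Sketch only: work.p80, work.flagged and the P_ prefix are illustrative names. */

/* Compute the 80th percentile of the predicted probability for this data set. */
proc univariate data=&EM_IMPORT_SCORE noprint;
   var EM_EVENTPROBABILITY;
   output out=work.p80 pctlpts=80 pctlpre=P_;   /* creates the variable P_80 */
run;

/* Flag observations at or above that percentile. */
data work.flagged;
   if _n_ = 1 then set work.p80;                /* P_80 is retained on every row */
   set &EM_IMPORT_SCORE;
   if missing(EM_EVENTPROBABILITY) then Top20 = .;
   else Top20 = (EM_EVENTPROBABILITY ge P_80);
run;
```

Because the percentile is recomputed from whatever data set is being scored, the cutoff moves with the data. Note that tied probabilities at the cutoff can push the flagged share slightly above 20%.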

 

Hope this helps!

Doug


2 REPLIES 2
M_Maldonado
Barite | Level 11

A popular automatic cutoff selection method is Event Precision Equal Recall. This method selects the cutoff at which the event precision rate and the true positive rate (recall) intersect. I posted a brief example here: https://communities.sas.com/docs/DOC-6050
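
To illustrate the idea outside the Cutoff node: precision is TP / (predicted events) and recall is TP / (actual events), so the two are equal when the number of predicted events matches the number of actual events. The sketch below uses that reading and is not necessarily how the Cutoff node computes it. It assumes a scored data set with known outcomes called work.scored_valid and a numeric 0/1 target called Owner; both names are hypothetical.

```sas
/* Sketch only: work.scored_valid and Owner are hypothetical names. */

/* Count the actual events in the labelled data. */
proc sql noprint;
   select sum(Owner = 1) into :n_events trimmed
   from work.scored_valid;
quit;

/* Order observations from highest to lowest predicted probability. */
proc sort data=work.scored_valid out=work.sorted;
   by descending EM_EVENTPROBABILITY;
run;

/* Predict as events exactly as many observations as there are actual events. */
data work.pr_flagged;
   set work.sorted;
   PredictedEvent = (_n_ le &n_events);
run;
```

The probability of the last flagged observation is then the approximate cutoff; ties at that boundary are broken arbitrarily by the sort order.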

I hope it helps,

Miguel
