Hi,
I am training a binary classification model using Proc Logistic. The classes are imbalanced at about 10% for the event 1 and 90% for the non-event 0. I balanced the training set to about 50:50 using sampling before training. The code used is
proc logistic data=work.train_stdize outmodel=mydata.Model_1. namelen=32;
  class &class_var. / param=ref;
  model responder(event='1') = &class_var. &num_var.
        / stb lackfit ctable pprob=(0.0 to 1.0 by 0.1) /* pevent=0.1 */;
  weight weight;
  score data=work.train_stdize fitstat out=mydata.train_scr outroc=mydata.troc;
run;
I included the CTABLE option to generate the classification table at each probability cutoff from 0 to 1 in steps of 0.1.
ctable pprob=(0.0 to 1.0 by 0.1)
Do I need to include the PEVENT= option? Yes or no, and why?
pevent=0.1
Thanks and much appreciated,
Lobbie
Yes, PEVENT= produces the results as if the data set were still 10% in one category and 90% in the other category, even though you create a model based on 50% in each category.
Hmmm
@Ksharp wrote:
Yes. PEVENT= would affect the probability of being Good or Bad.
By default, P above 50% is Good; PEVENT= would adjust that 50% according to its value.
I think you are right, @Ksharp, and my response above was not correct. My response above was about the (similarly named) PRIOREVENT= option, not the PEVENT= option (I think).
So, where @Lobbie says "The classes are imbalanced at about 10% for the event 1 and 90% for the non-event 0. I balanced the training set to about 50:50 using sampling before training", I think he really wants to use the PRIOREVENT= option and not the PEVENT= option, to get proper predictions from a dataset that is 50:50, when the original data is 10:90. But again, I add I THINK, because I really need to read the documentation a few more times.
Or maybe someone else can jump in and straighten all this out, saving me some reading and thinking 😉
Ok, thanks @Ksharp , but I'm still going to take some time and read the documentation carefully.
Hi @PaigeMiller and @Ksharp ,
According to the documentation, PEVENT= specifies prior event probabilities and applies only in the MODEL statement when the CTABLE option is also specified. My question was whether I should add PEVENT= when generating the classification table, because I trained the model on balanced classes while the true proportion of my classes is 10:90. I think the answer is "Yes, I should add PEVENT=0.1 alongside the CTABLE option"?
PRIOREVENT=, on the other hand, is used in the SCORE statement according to the documentation. I found that if I fit the model using the offset method, I need to add PRIOREVENT=0.1 in the SCORE statement when scoring, so that the predicted probabilities are adjusted for the prior.
If I fit the model using the weight method, I do not need to use PRIOREVENT= when scoring. The reason is that the adjustments are already reflected in the intercept and coefficients (@Rick_SAS mentioned this in one of his replies; sorry, I can't seem to find the thread now).
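To make the two methods concrete, here is a minimal sketch of the weight and offset values for this 10:90 case. It is written in Python rather than SAS, purely to show the arithmetic. It assumes the formulas commonly given for oversampling adjustments: weight = p1/r1 for events and (1-p1)/(1-r1) for non-events, and offset = log[r1(1-p1) / ((1-r1)p1)], where p1 is the population event rate and r1 is the event rate in the balanced sample. The function names are hypothetical.

```python
import math

def oversampling_weights(pop_event_rate, sample_event_rate):
    """Weights that rescale a resampled data set back to the population mix (assumed formulas)."""
    event_w = pop_event_rate / sample_event_rate
    nonevent_w = (1 - pop_event_rate) / (1 - sample_event_rate)
    return event_w, nonevent_w

def oversampling_offset(pop_event_rate, sample_event_rate):
    """Offset on the logit scale for the offset method (assumed formula)."""
    return math.log(
        sample_event_rate * (1 - pop_event_rate)
        / ((1 - sample_event_rate) * pop_event_rate)
    )

# 10% events in the population, 50% events after balancing
event_w, nonevent_w = oversampling_weights(0.10, 0.50)
offset = oversampling_offset(0.10, 0.50)

print(f"event weight     = {event_w:.3f}")     # 0.1 / 0.5 = 0.2
print(f"non-event weight = {nonevent_w:.3f}")  # 0.9 / 0.5 = 1.8
print(f"offset           = {offset:.4f}")      # log(9), about 2.1972
```

Under these assumptions, each event counts for 0.2 of an observation and each non-event for 1.8, so the weighted sample has the original 10:90 mix again.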
Well, that's good to know, and as I said, I still have some thinking to do! Thanks!
No worries @PaigeMiller, and please do let me know what your thoughts are later. Really appreciate it.
My hunch is that both you and @Ksharp were right in your previous answers; otherwise, why would SAS have a PEVENT= option to work with the CTABLE option? CTABLE does not make prior adjustments by default.
Hello @Lobbie and @Ksharp. I don't see how PRIOREVENT= and PEVENT= produce the same results.
/* Make up some data, with 10% value of 0 and 90% value of 1 */
data a;
  do i=1 to 1000;
    if i<=100 then y=0;
    else y=1;
    x1=rand('normal');
    x2=rand('normal');
    output;
  end;
run;

/* Perform Logistic Regression */
proc logistic data=a;
  model y(event='1')=x1 x2;
  output out=preds predicted=pred;
run;

/* Oversample to 50-50, PEVENT=0.9 */
proc logistic data=a(where=(i<=200));
  model y(event='1')=x1 x2 / pevent=0.9 ctable;
  output out=preds2 predicted=pred2;
run;

/* Oversample to 50-50, PRIOREVENT=0.9 */
proc logistic data=a(where=(i<=200));
  model y(event='1')=x1 x2;
  score out=preds3 priorevent=0.9;
run;
PRIOREVENT does what I think should be done given my understanding of the original problem. It's not clear to me how PEVENT applies here.
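As a rough illustration of the kind of adjustment involved, here is a minimal sketch of the standard Bayes prior correction for a probability estimated on a resampled data set. It is Python, not SAS, and it is an assumption (not a confirmed fact) that this matches what PRIOREVENT= computes in the SCORE statement. The function name `adjust_for_prior` is hypothetical.

```python
def adjust_for_prior(p_sample, sample_event_rate, prior_event_rate):
    """Rescale a predicted event probability from the sample mix to the prior mix.

    Standard Bayes prior-shift correction; assumed, not confirmed, to be
    the same adjustment PRIOREVENT= applies.
    """
    event_term = p_sample * prior_event_rate / sample_event_rate
    nonevent_term = (1 - p_sample) * (1 - prior_event_rate) / (1 - sample_event_rate)
    return event_term / (event_term + nonevent_term)

# Model fitted on a 50:50 sample, event rate in the original data is 90%
p_adjusted = adjust_for_prior(p_sample=0.5, sample_event_rate=0.5, prior_event_rate=0.9)
print(f"{p_adjusted:.3f}")  # 0.900

# When the prior equals the sample rate, nothing changes
p_unchanged = adjust_for_prior(p_sample=0.3, sample_event_rate=0.5, prior_event_rate=0.5)
print(f"{p_unchanged:.3f}")  # 0.300
```

A prediction of 0.5 from the balanced model moves to 0.9 once the 90% prior is applied, which is the behavior one would expect from a correction for oversampling.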
Paige,
Very interesting. I also want to know.
It seems that PEVENT= has nothing to do with the predicted probability. It is just for the classification table (CTABLE).
According to documentation:
PEVENT=value| (list)
specifies one prior probability or a list of prior probabilities for the event of interest. The false positive and false negative rates are then computed as posterior probabilities by Bayes’ theorem.
That means the OP does not need PEVENT=. Only PRIOREVENT=0.9 could adjust the predicted probability.
I misled the OP; I was wrong.
@StatDave @Rick_SAS @SteveDenham, could you take a look?
Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement.
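The Bayes' theorem calculation referred to here can be sketched as follows. This is Python, not SAS, and it uses the textbook formulas for PPV and NPV from sensitivity, specificity, and a prior event probability; that they correspond exactly to the formulas in the documentation section is an assumption. The sensitivity and specificity values are made up for illustration.

```python
def ppv_npv(sensitivity, specificity, prior_event):
    """Positive and negative predictive values at a given prior event probability."""
    true_pos = sensitivity * prior_event
    false_pos = (1 - specificity) * (1 - prior_event)
    true_neg = specificity * (1 - prior_event)
    false_neg = (1 - sensitivity) * prior_event
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Hypothetical sensitivity and specificity at one cutoff
sens, spec = 0.8, 0.7

ppv_balanced, npv_balanced = ppv_npv(sens, spec, prior_event=0.5)  # balanced sample
ppv_prior, npv_prior = ppv_npv(sens, spec, prior_event=0.1)        # 10% population rate

print(f"prior 0.5: PPV={ppv_balanced:.4f}  NPV={npv_balanced:.4f}")
print(f"prior 0.1: PPV={ppv_prior:.4f}  NPV={npv_prior:.4f}")
```

Sensitivity and specificity do not depend on the prior, so only PPV, NPV, and the overall correct classification rate move when the prior changes. The predicted probabilities themselves are untouched by this calculation.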
@StatDave wrote:
Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement.
I think the red highlighted text clears things up for me. Thanks!
@StatDave, yes, I am following the 22601 note and am using the WEIGHT statement when fitting my model. The "Details: Classification Table" section of the LOGISTIC documentation states that PEVENT= should be used, because I fitted the model on a balanced training set when the original proportion was 10:90.
However, when I did not specify the PEVENT= option while creating the CTABLE during model training and then manually calculated PPV and NPV from the scored data, the results matched.
The only explanation I can think of for why I do not need to specify the PEVENT= option, contrary to the recommendation in the documentation, is that I fitted the model using the WEIGHT statement. All parameters are adjusted accordingly and are used to compute the CTABLE and the P_1 probabilities in the scored data set. This is also the reason why I do not need to specify PRIOREVENT= in the SCORE statement when scoring.
Am I correct? Thanks.
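One way to see why such numbers could agree: PPV and NPV computed from weighted counts are arithmetically identical to the Bayes-adjusted values at the population event rate, provided the weights are p1/r1 for events and (1-p1)/(1-r1) for non-events. The sketch below is Python with made-up confusion-matrix counts. It checks only that arithmetic and does not show what PROC LOGISTIC does internally.

```python
# Made-up counts at one cutoff in a balanced sample (100 events, 100 non-events)
tp, fn, fp, tn = 80, 20, 30, 70

pop_event_rate = 0.10
sample_event_rate = (tp + fn) / (tp + fn + fp + tn)  # 0.5

# Assumed weight formulas for the weight method
w_event = pop_event_rate / sample_event_rate
w_nonevent = (1 - pop_event_rate) / (1 - sample_event_rate)

# PPV and NPV from weighted counts
ppv_weighted = (w_event * tp) / (w_event * tp + w_nonevent * fp)
npv_weighted = (w_nonevent * tn) / (w_nonevent * tn + w_event * fn)

# PPV and NPV from unweighted sensitivity/specificity plus Bayes' theorem
sens = tp / (tp + fn)
spec = tn / (tn + fp)
ppv_bayes = sens * pop_event_rate / (
    sens * pop_event_rate + (1 - spec) * (1 - pop_event_rate)
)
npv_bayes = spec * (1 - pop_event_rate) / (
    spec * (1 - pop_event_rate) + (1 - sens) * pop_event_rate
)

print(f"weighted: PPV={ppv_weighted:.4f}  NPV={npv_weighted:.4f}")
print(f"Bayes:    PPV={ppv_bayes:.4f}  NPV={npv_bayes:.4f}")
```

Both routes give the same PPV and NPV, so if the classification table is built from weighted frequencies, applying a prior on top of it would adjust for the oversampling a second time.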
Yes, I believe that is correct. The PEVENT= option is suggested in the documentation to enable getting these statistics when you aren't using a separate method to adjust for oversampling. So, when you use weights to adjust for the oversampling, PEVENT= wouldn't be needed in addition.