Lobbie
Obsidian | Level 7

Hi,

I am training a binary classification model using PROC LOGISTIC. The classes are imbalanced at about 10% for the event (1) and 90% for the non-event (0). I balanced the training set to about 50:50 using sampling before training. The code I used is:

/* Fit on the balanced training set; WEIGHT adjusts for the oversampling */
proc logistic
	data=work.train_stdize
	outmodel=mydata.Model_1
	namelen=32;
	class &class_var. / param=ref;
	/* CTABLE with PPROB= requests the classification table at each cutoff */
	model responder(event='1') = &class_var. &num_var. / stb lackfit ctable pprob=(0.0 to 1.0 by 0.1) /* pevent=0.1 */;
	weight weight;
	score data=work.train_stdize fitstat out=mydata.train_scr outroc=mydata.troc;
run;

I included the CTABLE option to generate the classification table at each probability cutoff from 0.0 to 1.0 in steps of 0.1.

ctable pprob=(0.0 to 1.0 by 0.1)

Do I need to include the pevent option?  Yes or No and Why?

pevent=0.1

Thanks and much appreciated,

Lobbie

 

1 ACCEPTED SOLUTION

StatDave
SAS Super FREQ

Yes, I believe that is correct. The PEVENT= option is suggested in the documentation to enable getting these statistics when you aren't using a separate method to adjust for oversampling. So, when you use weights to adjust for the oversampling, PEVENT= wouldn't be needed in addition.

17 REPLIES
PaigeMiller
Diamond | Level 26

Yes, PEVENT= produces the results as if the data set were still 10% in one category and 90% in the other, even though you created a model based on 50% in each category.

--
Paige Miller
Ksharp
Super User
Yes. PEVENT= would affect the probability of being classified Good or Bad.
By default, a predicted probability above 50% is classified as Good; PEVENT= would adjust that 50% according to its value.
PaigeMiller
Diamond | Level 26

Hmmm

 


@Ksharp wrote:
Yes. PEVENT= would affect the probability of being classified Good or Bad.
By default, a predicted probability above 50% is classified as Good; PEVENT= would adjust that 50% according to its value.

I think you are right, @Ksharp, and my response above was not correct. My response above was about the (similarly named) PRIOREVENT= option, not the PEVENT= option (I think).

 

So, where @Lobbie says "The classes are imbalanced at about 10% for the event 1 and 90% for the non-event 0. I balanced the training set to about 50:50 using sampling before training", I think he really wants to use the PRIOREVENT= option and not the PEVENT= option, to get proper predictions from a dataset that is 50:50, when the original data is 10:90. But again, I add I THINK, because I really need to read the documentation a few more times.

 

Or maybe someone else can jump in and straighten all this out, saving me some reading and thinking 😉

--
Paige Miller
Ksharp
Super User
Paige,
I think both are the same thing. You didn't make a mistake.
PEVENT= specifies prior event probabilities.

PaigeMiller
Diamond | Level 26

Ok, thanks @Ksharp , but I'm still going to take some time and read the documentation carefully.

--
Paige Miller
Lobbie
Obsidian | Level 7

Hi @PaigeMiller and @Ksharp ,

 

According to the documentation, PEVENT= specifies prior event probabilities and applies only in the MODEL statement, and only when the CTABLE option is specified. My question was whether I should add PEVENT= when generating the classification table, because I trained the model on balanced classes when the true proportion of my classes is 10:90. I think the answer is "Yes, I should add PEVENT=0.1 along with the CTABLE option"?

 

PRIOREVENT=, on the other hand, is used in the SCORE statement according to the documentation. I found that if I fit the model using the offset method, I need to add PRIOREVENT=0.1 in the SCORE statement when scoring, so that the predicted probabilities are adjusted for the prior.

 

If I fit the model using the weight method, I do not need PRIOREVENT= when scoring. The reason is that the adjustments are already reflected in the intercept and coefficients (@Rick_SAS mentioned this in one of his replies; sorry, I can't seem to find the thread now).
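
Editor's sketch: the prior adjustment described above can be illustrated outside SAS with the usual prior-correction formula, which rescales a probability from a model fit on a 50:50 sample back to a 10:90 population. This is only an arithmetic illustration in Python; the function names are hypothetical, and it is an assumption (not confirmed in this thread) that PRIOREVENT= applies exactly this rescaling.

```python
import math


def adjust_for_prior(p_sample, sample_event_rate, true_event_rate):
    """Rescale a probability from a model fit on resampled data to the true prior.

    Assumption: the standard prior-correction formula; hypothetical helper name.
    """
    event_ratio = true_event_rate / sample_event_rate
    nonevent_ratio = (1.0 - true_event_rate) / (1.0 - sample_event_rate)
    numerator = p_sample * event_ratio
    return numerator / (numerator + (1.0 - p_sample) * nonevent_ratio)


def offset_for_prior(sample_event_rate, true_event_rate):
    """Log-odds shift that is equivalent to the rescaling above."""
    sample_odds = sample_event_rate / (1.0 - sample_event_rate)
    true_odds = true_event_rate / (1.0 - true_event_rate)
    return math.log(sample_odds / true_odds)


# A score of 0.5 on the balanced sample maps back to the 10% base rate.
print(round(adjust_for_prior(0.5, 0.5, 0.1), 4))
```

Subtracting `offset_for_prior(0.5, 0.1)` from the log-odds of a balanced-sample probability gives the same adjusted probability, which is one way to see why the offset and prior-based corrections are interchangeable.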

PaigeMiller
Diamond | Level 26

Well, that's good to know, and as I said, I still have some thinking to do! Thanks!

--
Paige Miller
Lobbie
Obsidian | Level 7

No worries @PaigeMiller, and please do let me know what your thoughts are later. Really appreciate it.

 

My hunch is both you and @Ksharp are right in your previous answers, else why would SAS have a PEVENT option to work with the CTABLE option?  Because CTABLE does not do prior adjustments by default.

PaigeMiller
Diamond | Level 26

Hello @Lobbie and @Ksharp. I don't see how PRIOREVENT= and PEVENT= produce the same results.

 

/* Make up some data, with 10% value of 0 and 90% value of 1 */
data a;
    do i=1 to 1000;
        if i<=100 then y=0;
        else y=1;
        x1=rand('normal');
        x2=rand('normal');
        output;
    end;
run;
/* Perform Logistic Regression */
proc logistic data=a;
	model y(event='1')=x1 x2;
	output out=preds predicted=pred;
run;

/* Oversample to 50-50, PEVENT=0.9 */
proc logistic data=a(where=(i<=200));
    model y(event='1')=x1 x2/pevent=0.9 ctable;
    output out=preds2 predicted=pred2;
run;

/* Oversample to 50-50, PRIOREVENT=0.9 */
proc logistic data=a(where=(i<=200));
    model y(event='1')=x1 x2;
    score out=preds3 priorevent=0.9;
run;

PRIOREVENT does what I think should be done given my understanding of the original problem. It's not clear to me how PEVENT applies here.

--
Paige Miller
Ksharp
Super User

Paige,
Very interesting. I would also like to know.
It seems that PEVENT= has nothing to do with the predicted probability. It is used only for the classification table (CTABLE).
According to documentation:

PEVENT=value| (list)
specifies one prior probability or a list of prior probabilities for the event of interest. The false positive
and false negative rates are then computed as posterior probabilities by Bayes’ theorem.

That means the OP does not need PEVENT=. Only PRIOREVENT=0.9 adjusts the predicted probability.
I misled the OP; I was wrong.

 

@StatDave  @Rick_SAS  @SteveDenham  could you take a look?

StatDave
SAS Super FREQ

Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement. 
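
Editor's sketch: the Bayes' theorem calculation referred to above can be written out in a few lines. The Python below is an illustration of the arithmetic only, not the procedure's code; the function name and the sensitivity and specificity values (0.8 and 0.7) are made up for the example, and the "correct" rate formula is an assumption based on the description in this thread.

```python
def bayes_classification_rates(sensitivity, specificity, prior_event):
    """PPV, NPV and correct-classification rate for a given event prior.

    Hypothetical helper: applies Bayes' theorem to sensitivity and specificity.
    """
    prior_nonevent = 1.0 - prior_event
    true_pos = sensitivity * prior_event
    false_pos = (1.0 - specificity) * prior_nonevent
    true_neg = specificity * prior_nonevent
    false_neg = (1.0 - sensitivity) * prior_event
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "correct": true_pos + true_neg,
    }


# Same classifier, evaluated under the balanced prior and the 10% prior.
balanced = bayes_classification_rates(0.8, 0.7, 0.5)
population = bayes_classification_rates(0.8, 0.7, 0.1)
print(balanced["ppv"], population["ppv"])
```

With these made-up inputs the PPV falls from about 0.73 under the 50:50 prior to about 0.23 under the 10:90 prior, while sensitivity and specificity stay the same. The predicted probabilities themselves are not touched by this calculation.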

PaigeMiller
Diamond | Level 26

@StatDave wrote:

Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement. 


I think the red highlighted text clears things up for me. Thanks!

--
Paige Miller
Lobbie
Obsidian | Level 7

@StatDave, yes, I am following note 22601 and am using the WEIGHT statement when fitting my model. The "Details: Classification Table" section of the LOGISTIC documentation states that PEVENT= should be used, because I fitted the model on a balanced training set when the original proportions were 10:90.

 

However, when I did not specify the PEVENT= option while creating the CTABLE during model training, and I manually calculated PPV and NPV from the scored data, the results matched.

[Image: ctable.jpg]

 

The only explanation I can think of for why I do not need the PEVENT= option, contrary to the recommendation in the documentation, is that I fitted the model using the WEIGHT statement. All parameters are adjusted accordingly and are used to compute both the CTABLE and the P_1 probabilities in the scored data set. This is also why I do not need to specify PRIOREVENT= in the SCORE statement when scoring.

 

Am I correct?  Thanks.

 

StatDave
SAS Super FREQ

Yes, I believe that is correct. The PEVENT= option is suggested in the documentation to enable getting these statistics when you aren't using a separate method to adjust for oversampling. So, when you use weights to adjust for the oversampling, PEVENT= wouldn't be needed in addition.
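
Editor's sketch: one way to see why weighting and a prior-based correction can agree is to reweight a balanced confusion matrix back to the population proportions and compare the result with the Bayes calculation. The Python below is a hedged illustration; the counts are invented, the function names are hypothetical, and the weights (population proportion divided by sample proportion) are an assumption about the weighting scheme, not a statement of how PROC LOGISTIC computes its table.

```python
def weighted_ppv(tp, fp, fn, tn, true_event_rate):
    """PPV from a resampled confusion matrix after reweighting to the true prior."""
    sample_event_rate = (tp + fn) / (tp + fp + fn + tn)
    # Assumed weights: population proportion / sample proportion for each class.
    w_event = true_event_rate / sample_event_rate
    w_nonevent = (1.0 - true_event_rate) / (1.0 - sample_event_rate)
    return (w_event * tp) / (w_event * tp + w_nonevent * fp)


def bayes_ppv(tp, fp, fn, tn, true_event_rate):
    """PPV from sensitivity, specificity and the true prior via Bayes' theorem."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    numerator = sensitivity * true_event_rate
    return numerator / (numerator + (1.0 - specificity) * (1.0 - true_event_rate))


# Invented counts from a 50:50 sample of 1000 observations.
counts = dict(tp=400, fp=150, fn=100, tn=350)
print(weighted_ppv(**counts, true_event_rate=0.1))
print(bayes_ppv(**counts, true_event_rate=0.1))
```

Both calls give the same PPV for these counts, because the class weights cancel into the prior terms of the Bayes formula. With a 50:50 sample and a 10% event rate the assumed weights are 0.2 for events and 1.8 for non-events.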
