Solved: Re: The use of PEVENT= in Proc Logistic

Lobbie · Posted 09-12-2020 10:51 PM

Hi,

I am training a binary classification model using Proc Logistic. The classes are imbalanced at about 10% for the event 1 and 90% for the non-event 0. I balanced the training set to about 50:50 using sampling before training. The code used is

proc logistic 
	Data = work.train_stdize 
	outmodel=mydata.Model_1
	namelen=32;
	class &class_var. / param=ref;
	model responder(event='1') = &class_var. &num_var. / stb lackfit ctable pprob=(0.0 to 1.0 by 0.1) /* pevent=0.1 */;
	weight weight;
	score data=work.train_stdize 	fitstat out=mydata.train_scr	outroc=mydata.troc;

run;

I included the ctable option to generate the classification table for each decile.

ctable pprob=(0.0 to 1.0 by 0.1)

Do I need to include the pevent option? Yes or No and Why?

pevent=0.1

Thanks and much appreciated,

Lobbie

PaigeMiller · Posted 09-13-2020 04:37 AM

Yes, PEVENT= produces the results as if the data set were still 10% in one category and 90% in the other category, even though you created a model based on 50% in each category.

--
Paige Miller

Ksharp · Posted 09-13-2020 07:02 AM

Yes. PEVENT= would affect the probability of being Good or Bad.
By default, P above 50% is Good; PEVENT= would adjust that 50% according to its value.

PaigeMiller · Posted 09-13-2020 07:46 AM

Hmmm

@Ksharp wrote:
Yes. PEVENT= would affect the probability of being Good or Bad.
By default, P above 50% is Good; PEVENT= would adjust that 50% according to its value.

I think you are right, @Ksharp, and my response above was not correct. My response above was about the (similarly named) PRIOREVENT= option, not the PEVENT= option (I think).

So, where @Lobbie says "The classes are imbalanced at about 10% for the event 1 and 90% for the non-event 0. I balanced the training set to about 50:50 using sampling before training", I think he really wants to use the PRIOREVENT= option and not the PEVENT= option, to get proper predictions from a dataset that is 50:50, when the original data is 10:90. But again, I add I THINK, because I really need to read the documentation a few more times.

Or maybe someone else can jump in and straighten all this out, saving me some reading and thinking 😉

--
Paige Miller

Ksharp · Posted 09-13-2020 07:59 AM

Paige,
I think both are the same thing. You didn't make a mistake.
PEVENT= specifies prior event probabilities.

PaigeMiller · Posted 09-13-2020 08:09 AM

Ok, thanks @Ksharp , but I'm still going to take some time and read the documentation carefully.

--
Paige Miller

Lobbie · Posted 09-13-2020 08:43 AM

Hi @PaigeMiller and @Ksharp ,

According to the documentation, PEVENT= specifies prior event probabilities and is only applicable in the MODEL statement when the CTABLE option is specified. My question was whether I should add PEVENT= when generating the classification table, because I trained the model on balanced classes while the true proportion of my classes is 10:90. I think the answer is "Yes, I should add PEVENT=0.1 along with the CTABLE option"?

PRIOREVENT=, on the other hand, is used in the SCORE statement according to the documentation. I found that if I fit the model using the Offset method, I need to add PRIOREVENT=0.1 in the SCORE statement when scoring, so that the predicted probabilities are adjusted for the prior.

If I fit the model using the Weight method, I do not need to use PRIOREVENT= when scoring. The reason is that the adjustments are already reflected in the intercept and coefficients (@Rick_SAS mentioned this in one of his replies, sorry I can't seem to find the thread now).
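
For readers following along, the Weight and Offset methods mentioned above each come down to constants derived from two rates: the event rate in the population and the event rate in the balanced training sample. The following is only a sketch of the usual case-control style correction, assuming a 10% population rate and a 50% sample rate as in this thread. The names pi1, rho1, w_event, w_nonevent and off are illustrative and do not come from the original code.

```sas
/* Assumed rates (illustrative names):                        */
/*   pi1  = event rate in the population                      */
/*   rho1 = event rate in the balanced training sample        */
%let pi1  = 0.10;
%let rho1 = 0.50;

data _null_;
    /* Weight method: weights that restore the population mix */
    w_event    = &pi1 / &rho1;               /* for responder = 1 */
    w_nonevent = (1 - &pi1) / (1 - &rho1);   /* for responder = 0 */

    /* Offset method: constant shift on the logit scale */
    off = log((&rho1 * (1 - &pi1)) / (&pi1 * (1 - &rho1)));

    /* Write the three constants to the log */
    put w_event= w_nonevent= off=;
run;
```

With these rates the weights are 0.2 for events and 1.8 for non-events, and the offset is log(9), roughly 2.2. Under the Weight method the weight variable would go in the WEIGHT statement; under the Offset method the offset variable would go in the OFFSET= option of the MODEL statement.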

PaigeMiller · Posted 09-13-2020 08:50 AM

Well, that's good to know, and as I said, I still have some thinking to do! Thanks!

--
Paige Miller

Lobbie · Posted 09-13-2020 09:17 AM

No worries @PaigeMiller, and please do let me know what your thoughts are later. Really appreciate it.

My hunch is both you and @Ksharp are right in your previous answers, else why would SAS have a PEVENT option to work with the CTABLE option? Because CTABLE does not do prior adjustments by default.

PaigeMiller · Posted 09-13-2020 07:42 PM

Hello @Lobbie and @Ksharp. I don't see how PRIOREVENT= and PEVENT= produce the same results.

/* Make up some data, with 10% value of 0 and 90% value of 1 */
data a;
    do i=1 to 1000;
        if i<=100 then y=0;
        else y=1;
        x1=rand('normal');
        x2=rand('normal');
        output;
    end;
run;
/* Perform Logistic Regression */
proc logistic data=a;
    model y(event='1')=x1 x2;
    output out=preds predicted=pred;
run;

/* Oversample to 50-50, PEVENT=0.9 */
proc logistic data=a(where=(i<=200));
    model y(event='1')=x1 x2/pevent=0.9 ctable;
    output out=preds2 predicted=pred2;
run;

/* Oversample to 50-50, PRIOREVENT=0.9 */
proc logistic data=a(where=(i<=200));
    model y(event='1')=x1 x2;
    score out=preds3 priorevent=0.9;
run;

PRIOREVENT does what I think should be done given my understanding of the original problem. It's not clear to me how PEVENT applies here.

--
Paige Miller
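
To make the PRIOREVENT= step concrete: as far as the documentation describes it, the SCORE statement rescales each predicted probability from the training-sample event rate to the stated prior using Bayes' theorem. The hand calculation below is a sketch of that adjustment, not output from the code above. The input probability of 0.60 and all variable names are made up for illustration; the 50% sample rate and the 0.9 prior match the example in this post.

```sas
data _null_;
    p_sample = 0.60;   /* hypothetical probability from the model fit on the 50-50 sample */
    rho1     = 0.50;   /* event rate in the training sample */
    pi1      = 0.90;   /* prior (population) event rate, as in PRIOREVENT=0.9 */

    /* Reweight the event and non-event parts by prior/sample ratios, then renormalize */
    num   = p_sample * (pi1 / rho1);
    den   = num + (1 - p_sample) * ((1 - pi1) / (1 - rho1));
    p_adj = num / den;

    /* Write the unadjusted and adjusted probabilities to the log */
    put p_sample= p_adj=;
run;
```

Here a sample-based probability of 0.60 becomes 1.08 / 1.16, about 0.93, once the 0.9 prior is applied. That shift is the kind of change to look for when comparing pred2 with the scored probabilities in preds3.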

Ksharp · Posted 09-14-2020 07:49 AM

Paige,
Very interesting. I also want to know.
It seems that PEVENT= has nothing to do with the predicted probability. It is just for the classification table (CTABLE).
According to documentation:

PEVENT=value| (list)
specifies one prior probability or a list of prior probabilities for the event of interest. The false positive
and false negative rates are then computed as posterior probabilities by Bayes’ theorem.

That means the OP does not need PEVENT=. Only PRIOREVENT=0.9 can adjust the predicted probability.
I misled the OP; I was wrong.

@StatDave @Rick_SAS @SteveDenham, could you take a look?

StatDave · Posted 09-14-2020 10:20 AM

Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement.
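
As an illustration of the Bayes' theorem step that PEVENT= applies to the classification table, the sketch below computes PPV, NPV, and the correct classification rate by hand from a sensitivity, a specificity, and a prior event probability. The sensitivity and specificity values and the variable names are hypothetical, chosen only to show the arithmetic; the exact formulas PROC LOGISTIC uses are the ones in the "Details: Classification Table" section.

```sas
data _null_;
    sens   = 0.80;   /* hypothetical sensitivity at some cutpoint */
    spec   = 0.70;   /* hypothetical specificity at the same cutpoint */
    pevent = 0.10;   /* prior event probability, as in PEVENT=0.1 */

    /* Posterior probabilities by Bayes' theorem */
    ppv = sens * pevent / (sens * pevent + (1 - spec) * (1 - pevent));
    npv = spec * (1 - pevent) / (spec * (1 - pevent) + (1 - sens) * pevent);

    /* Overall correct classification rate under the prior */
    correct = sens * pevent + spec * (1 - pevent);

    /* Write the three statistics to the log */
    put ppv= npv= correct=;
run;
```

Only these table statistics depend on the prior in this calculation. The predicted probability for each observation never enters it, which is consistent with PEVENT= leaving the OUTPUT and SCORE probabilities unchanged.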

PaigeMiller · Posted 09-14-2020 11:02 AM

@StatDave wrote:

Note that the OP's reference to the offset and weight methods suggests that they are using this note on oversampling. The PEVENT= option in the MODEL statement just applies the formulas shown in the "Details: Classification Table" section of the LOGISTIC documentation to compute the PPV, NPV, and correct classification rate. It has no effect on the predicted probabilities provided by the OUTPUT or SCORE statements. To do that and obtain posterior probabilities you need to use the PRIOR= or PRIOREVENT= option in the SCORE statement.

I think the red highlighted text clears things up for me. Thanks!

--
Paige Miller

Lobbie · Posted 09-14-2020 09:49 PM

@StatDave , yes I am following the 22601 note and am using Weight when fitting my model. The "Details: Classification Table" section of the LOGISTIC documentation stated that PEVENT should be used because I have fitted the model using a balanced training set when it was 10:90 in the beginning.

However when I did not specify the PEVENT option during the creation of CTABLE during model training, and I manually calculate PPV & NPV using the scored data, the results matched.

The only explanation I can think of for why I do not need to specify the PEVENT= option, contrary to the recommendation in the documentation, is that I fitted the model using the WEIGHT statement. All parameters are adjusted accordingly and are used to compute the CTABLE and the P_1 probabilities in the scored dataset. This is also the reason why I do not need to specify PRIOREVENT= in the SCORE statement when scoring.

Am I correct? Thanks.

StatDave · Posted 09-15-2020 02:08 PM

Yes, I believe that is correct. The PEVENT= option is suggested in the documentation to enable getting these statistics when you aren't using a separate method to adjust for oversampling. So, when you use weights to adjust for the oversampling, PEVENT= wouldn't be needed in addition.
