## How to predict PD with logistic regression?

Hi guys,

I have looked for an existing topic that could help me out, but without success.

Let me start by telling about my dataset:
I have a real-world application dataset. It covers people who have applied for a loan, with information such as income, age, children, marital status, LTV, etc., almost 200 variables in total. My response variable is their default status: whether or not they defaulted in the first year.

My dataset includes 40,000 observations and 220 defaults (default value=1).

I cleaned the dataset as follows: variables with more than 5% missing values were removed, and for variables with less than 5% missing, the affected rows were removed.

Now I am down to approx. 50 variables. I then split the dataset into a training set and a test set (70% training, 30% test).

To investigate which variables to carry forward, I do the following:

```
/* Stepwise selection over the ~50 remaining candidate variables */
proc logistic data=TRAINING_DATA;
  class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
  model default(event='1') = VAR1--VAR50 / selection=stepwise;
run;
```

This gives 6 significant variables, a c-value of 0.701, Somers' D: 0.42, AIC: 2340.40.

I'm not very happy with the c-value, but I can live with it.

My next step is to calculate the probability of default from these 6 variables, using the following:

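```
/* Refit with the six selected variables and save the predicted probabilities */
proc logistic data=TRAINING_DATA descending;
  class CATEGORY_VAR1(param=ref ref='FIRST');
  /* link=probit fits a probit model; drop it for the default logit link */
  model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
        / link=probit ctable pprob=(0.05 to 1 by 0.05);
  output out=PREDICTED_PROB predicted=PD_probit;
run;
```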

I then check how many of these predictions are actually correct hits, with the following:

```
/* Keep only the observed defaults that the model scores above 0.5 */
data CHECK;
  set PREDICTED_PROB;
  where PD_probit > 0.5 and default=1;
run;
```

I got 0 hits! This indicates that my model cannot predict anything...

What am I doing wrong? How should I approach it?
My wish would be to check what percentage of cases the model classifies correctly (hopefully a lot), and then try the model out on the test dataset.

Sorry if the post is too long; let me know if there is something I should add/remove. 🙂

Best regards.


## Re: How to predict PD with logistic regression?

Only about 0.5% of your data has Y=1, so it's not surprising that this particular logistic regression doesn't predict that any observation will default. The huge mass of data driving the regression did not default. Try something called oversampling. Go to your favorite internet search engine and type in

logistic regression oversampling in sas
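
To see why a 0.5 cutoff finds nothing when the event is this rare, it can help to count the captured defaults at several cutoffs rather than at just one. A minimal Python sketch of that count, assuming a hypothetical handful of scored accounts (the PD values are invented for illustration and are not from the poster's data; `hits_at_cutoff` is a made-up helper name):

```python
# Hypothetical (predicted PD, observed default) pairs; all values are made up.
scored = [
    (0.002, 0), (0.004, 0), (0.006, 0), (0.010, 0), (0.015, 1),
    (0.003, 0), (0.030, 1), (0.008, 0), (0.050, 0), (0.120, 1),
]


def hits_at_cutoff(records, cutoff):
    """Count observed defaults whose predicted PD exceeds the cutoff."""
    return sum(1 for pd, default in records if default == 1 and pd > cutoff)


# Event rate quoted in the original post: 220 defaults in 40,000 observations.
base_rate = 220 / 40000

for cutoff in (0.5, 0.05, 0.01, base_rate):
    print(f"cutoff={cutoff:.4f}  defaults flagged={hits_at_cutoff(scored, cutoff)}")
```

With predicted PDs that cluster near the base rate, nothing clears 0.5, while cutoffs closer to the base rate start to separate the accounts.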

--
Paige Miller

## Re: How to predict PD with logistic regression?

When I oversample, should it be done before or after splitting the data into training and validation? I am still concerned about using this kind of trick.

Another question: am I totally wrong to want a model that predicts a PD from which I can conclude "If PD > 0.5 (50%) then default=1"? That leads me back to the topic: I want to be able to identify likely defaulters at the time of application.

## Re: How to predict PD with logistic regression?

@Norbit wrote:

> When I oversample, should it be done before or after splitting the data into training and validation? I am still concerned about using this kind of trick.
>
> Another question: am I totally wrong to want a model that predicts a PD from which I can conclude "If PD > 0.5 (50%) then default=1"? That leads me back to the topic: I want to be able to identify likely defaulters at the time of application.

You oversample, then randomly split the resulting data into training and validation data sets.

Regarding your 2nd paragraph, you are not wrong.

--
Paige Miller

## Re: How to predict PD with logistic regression?

If you look at the number of defaults in your data, the chance of randomly selecting an account that will default is 0.55%. I suggest you compare that with the average PD of your defaulting accounts. If the average PD is significantly greater than 0.55%, then your model has some predictive power, as it is doing better than random selection.

## Re: How to predict PD with logistic regression?

Posted by Ksharp (Super User)

As Paige said, your event probability is very small: 220/40000 = 0.0055.

PROC LOGISTIC does not work well for an event with such a small probability.

I advise oversampling to raise this proportion, for example good:bad = 1000:220, and using those 1,220 observations to build the model.

Or try the PEVENT= option to adjust for the population probability of bad.

```
/* Same stepwise fit, with the population event probability supplied as the prior */
proc logistic data=TRAINING_DATA;
  class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
  model default(event='1') = VAR1--VAR50 / selection=stepwise pevent=0.0055;
run;
```

## Re: How to predict PD with logistic regression?

Thank you guys for the suggestions. I will try them out!

I will let you know how it goes. 🙂

@Ksharp : Could you please elaborate a bit more, from a theoretical point of view if you can, on the PEVENT= option?

Again, thank you for your time! 🙂

## Re: How to predict PD with logistic regression?

Posted by Ksharp (Super User)

Sorry, my bad. Since your data is the whole population, not a sample from it, you don't need PEVENT=.

I still suggest oversampling your data to raise the bad:good ratio, and then using PEVENT= to adjust the model's event probability.

## Re: How to predict PD with logistic regression?

And again... As you guys said I should try with oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT there is another question in my head after reading this quote:

"Oversampling the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly under-sampling the majority class may under-fit your algorithm if the minority class is very small."

Is there any way I can use some kind of penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

Best regards

## Re: How to predict PD with logistic regression?

@Norbit wrote:

> And again... As you guys said I should try with oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT there is another question in my head after reading this quote:
>
> "Oversampling the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly under-sampling the majority class may under-fit your algorithm if the minority class is very small."
>
> Is there any way I can use some kind of penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

Did you actually try oversampling, and then compare the results on the training and validation data sets?

--
Paige Miller

## Re: How to predict PD with logistic regression?

Hi again,

I tried oversampling (or rather, under-sampling the non-defaults) using the following code:

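```
/* Keep every default and a random 1-in-20 sample of the non-defaults */
data OVERSAMPLING;
  set TRAINING_DATASET;
  if y=1 then output;
  if y=0 then do;
    if ranuni(10000) < 1/20 then output;
  end;
run;

/* Check the resulting class frequencies */
proc freq data=OVERSAMPLING;
  tables y;
run;
```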

I actually followed this LINK, and the default rate went up to 10.5% while the non-default rate fell to 89.5%. The frequencies are now (approx.) 1904 zeros and 200 ones. I also tried the offset calculation; it gave me almost the same intercept as before oversampling.
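
For reference, one common form of that offset calculation can be sketched in Python. The sketch assumes a logit link (it does not carry over to the probit fit used earlier), that every default was kept (sampling fraction 1), and that non-defaults were kept with probability 1/20, as in the data step above. The function names and the example intercept of -2.0 are hypothetical.

```python
import math


def corrected_intercept(sample_intercept, frac_events, frac_nonevents):
    """Shift a logit intercept fitted on the resampled data back to the population scale."""
    return sample_intercept - math.log(frac_events / frac_nonevents)


def population_pd(sample_pd, frac_events, frac_nonevents):
    """Convert a PD from the resampled data to the population scale via the odds."""
    sample_odds = sample_pd / (1.0 - sample_pd)
    pop_odds = sample_odds * frac_nonevents / frac_events
    return pop_odds / (1.0 + pop_odds)


# Sampling fractions implied by the data step: all events, 1 in 20 non-events.
frac_events, frac_nonevents = 1.0, 1.0 / 20.0

offset = math.log(frac_events / frac_nonevents)  # ln(20)
print("offset:", offset)
print("corrected intercept:", corrected_intercept(-2.0, frac_events, frac_nonevents))
print("population PD for a sample PD of 0.5:", population_pd(0.5, frac_events, frac_nonevents))
```

Under these assumptions a PD of 0.5 on the resampled data corresponds to odds of 1/20 in the population, so a 0.5 cutoff means something quite different before and after the correction.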

When counting defaults with PD > 0.025 (2.5%), I previously got hits on 9.42% of them. Now I get hits on 96.85%; only 3.15% have a PD under 2.5%.

...but!

With oversampling I get a high sensitivity (97.4%) and a low specificity (10.7%), with false positives at 88.9%, false negatives at 2.7%, and 19.6% correct. How should I interpret these now?

## Re: How to predict PD with logistic regression?

Posted by Ksharp (Super User)

Yes. That is why you need the option PEVENT=0.005 to adjust the predicted probability (0.005 is the event probability in the population).
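
A rough illustration of what a prior does to the classification-table rates, written in Python rather than SAS. It assumes that the false positive rate is the share of predicted events that are really non-events, that the false negative rate is the share of predicted non-events that are really events, and that both are recomputed from the prior with Bayes' theorem. That is an assumption about how PEVENT= works together with CTABLE and should be checked against the PROC LOGISTIC documentation. The sensitivity and specificity are the figures quoted in the previous post; `adjusted_rates` is a hypothetical helper.

```python
def adjusted_rates(sensitivity, specificity, prior_event):
    """Return (false_positive_rate, false_negative_rate) for a given event prior.

    false positive rate: share of predicted events that are really non-events
    false negative rate: share of predicted non-events that are really events
    """
    prior_nonevent = 1.0 - prior_event
    pred_event_true = prior_event * sensitivity
    pred_event_false = prior_nonevent * (1.0 - specificity)
    pred_nonevent_true = prior_nonevent * specificity
    pred_nonevent_false = prior_event * (1.0 - sensitivity)
    false_pos = pred_event_false / (pred_event_true + pred_event_false)
    false_neg = pred_nonevent_false / (pred_nonevent_true + pred_nonevent_false)
    return false_pos, false_neg


sens, spec = 0.974, 0.107  # figures reported for the oversampled fit

# 0.105 is the event share after resampling, 0.005 the population prior.
for prior in (0.105, 0.005):
    fp, fn = adjusted_rates(sens, spec, prior)
    print(f"prior={prior:.3f}  false positive={fp:.3f}  false negative={fn:.3f}")
```

With the resampled event share as the prior, the rates land near the 88.9% and 2.7% reported above; with the population prior, nearly every predicted default is a false positive, which is what a specificity of about 10% implies for such a rare event.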

"Is there any way where I can use some Penelized Logistic regression? "

PROC LOGISTIC has such a method (the FIRTH option):

model good_bad(event='good')= &varlist /outroc=roc lackfit scale=none aggregate rsquare firth corrb ;

Or, if your data set is small, you can also try exact logistic regression via the EXACT statement.

## Re: How to predict PD with logistic regression?

I am still confused about how to use it correctly.

As in my reply above, I tried to oversample my data.

But how do I use this on the validation data to predict the defaults correctly?
I still don't have a model where I can say: "If PD > 0.5, then this applicant is counted as defaulting within one year".

## Re: How to predict PD with logistic regression?

Posted by Ksharp (Super User)

> I still don't have a model where I can say: "If PD >0.5 then this applicant would be counted as going default in one year".

Use the PEVENT= option (set to the event probability in the population) and the FIRTH option in the MODEL statement, and score the test data rather than the validation data, which avoids the overfitting problem.

There are many ways to score, such as the CODE statement, the SCORE statement, or PROC PLM. Also calling @StatDave

https://blogs.sas.com/content/iml/2019/02/11/proc-plm-regression-models-sas.html
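
Whichever scoring route is used, the scored test set can then be summarised at a chosen cutoff. A small Python sketch of that tally, with invented PD values and a hypothetical `classification_summary` helper (it is not a SAS feature):

```python
def classification_summary(records, cutoff):
    """Tally a 2x2 table at the cutoff and derive sensitivity, specificity and accuracy."""
    tp = fp = tn = fn = 0
    for pd, default in records:
        predicted = 1 if pd > cutoff else 0
        if predicted == 1 and default == 1:
            tp += 1
        elif predicted == 1 and default == 0:
            fp += 1
        elif predicted == 0 and default == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "correct": (tp + tn) / len(records),
    }


# Hypothetical scored test accounts: (predicted PD, observed default)
test_scored = [(0.01, 0), (0.02, 0), (0.04, 1), (0.03, 0), (0.20, 1), (0.005, 0)]

for cutoff in (0.5, 0.025):
    print(cutoff, classification_summary(test_scored, cutoff))
```

Running the same tally at several cutoffs makes the trade-off explicit: lowering the cutoff raises sensitivity and lowers specificity, so the cutoff is a business choice rather than something the model fixes at 0.5.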