
How to predict PD with logistic regression?


Posted 11-14-2019 07:02 PM (3233 views)

Hi guys,

I have tried to find another topic that could help me out, but with no success so far.

Let me start by telling about my dataset:

I have a real-world loan application dataset: a collection of people who have applied to borrow money, with information about them such as income, age, number of children, marital status, LTV, etc., almost 200 variables. My response variable is their default status: whether or not they defaulted in the first year.

My dataset has 40,000 observations and 220 defaults (default=1).

I cleaned the dataset as follows: if a variable had more than 5% missing values I removed the variable; if it had less than 5% missing I removed the affected rows.
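For reference, that 5% rule could be sketched along these lines (the dataset and variable names here are assumptions, not from the post):

```
/* Count missing values per numeric variable */
proc means data=RAW_DATA nmiss noprint;
    output out=MISS_COUNTS nmiss= / autoname;
run;

/* After inspecting MISS_COUNTS: drop variables over the 5% threshold,
   then delete rows that still contain missing numeric values */
data CLEANED_DATA;
    set RAW_DATA(drop=VAR_MOSTLY_MISSING);   /* hypothetical variable name */
    if cmiss(of _numeric_) = 0;
run;
```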

That brings me down to approx. 50 variables. I then divided the original dataset into a training and a test dataset (70% training, 30% test).
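A 70/30 split like that can be sketched with PROC SURVEYSELECT (dataset names and the seed are assumptions):

```
proc surveyselect data=CLEANED_DATA out=SPLIT outall
                  samprate=0.7 seed=12345;
run;

data TRAINING_DATA TEST_DATA;
    set SPLIT;
    if Selected then output TRAINING_DATA;   /* 70% of observations  */
    else output TEST_DATA;                   /* remaining 30%        */
run;
```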

To investigate which variables I should work further with, I do the following:

```
proc logistic data=TRAINING_DATA;
    class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
    model default(event='1') = VAR1--VAR50 / selection=stepwise;
run;
```

This gives 6 significant variables, a c-value of 0.701, Somers' D of 0.42, and an AIC of 2340.40.

I'm not very happy with the c-value, but I can live with it.

My next step is to calculate the probability of default given these 6 variables, using the following:

```
proc logistic data=TRAINING_DATA descending;
    class CATEGORY_VAR1(param=ref ref='FIRST');
    model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
          / link=probit ctable pprob=(0.05 to 1 by 0.05);
    output out=PREDICTED_PROB predicted=PD_probit;
run;
```

(also tried with link=logit).

When I then test these predictions to see how many of them are actually correct hits, with the following:

```
data CHECK;
set PREDICTED_PROB;
where PD_probit > 0.5 and default=1;
run;
```

I got 0 hits! This indicates that my model cannot predict anything...
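Rather than counting a single cell, a full 2x2 classification table shows where all the predictions land (a sketch building on the PREDICTED_PROB dataset above; the cutoff of 0.5 is the one from the post):

```
data CLASSIFIED;
    set PREDICTED_PROB;
    predicted_default = (PD_probit > 0.5);   /* 1 if predicted PD exceeds the cutoff */
run;

proc freq data=CLASSIFIED;
    tables default*predicted_default / norow nocol nopercent;
run;
```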

What am I doing wrong? How should I approach it?

My wish would be: check what percentage the model got correct (hopefully a lot), and then try the model out on the test dataset.

Sorry if the post is too long; let me know if there is something I should add/remove. 🙂

Best regards.

14 REPLIES


About 0.5% of your data has Y=1. It's not surprising that this particular logistic regression doesn't predict any observations will default: the huge mass of data driving the regression did not default. Try something called oversampling. Go to your favorite internet search engine and type in

logistic regression oversampling in sas

--

Paige Miller



When I oversample, should it be done before or after splitting the data into training and validation? I am still a bit wary of doing this kind of trick.

Another question: am I totally wrong in saying that I want a model which can predict a PD such that I can conclude, "if PD > 0.5 (50%), then default=1"? Which leads me back to the topic: I want to be able to identify, at the time of application, which applicants can be separated out as possible defaulters.


@Norbit wrote:

When I oversample, should it be done before or after splitting the data into training and validation? I am still a bit wary of doing this kind of trick.

Another question: am I totally wrong in saying that I want a model which can predict a PD such that I can conclude, "if PD > 0.5 (50%), then default=1"? Which leads me back to the topic: I want to be able to identify, at the time of application, which applicants can be separated out as possible defaulters.

You oversample, then randomly split the resulting data into training and validation data sets.

Regarding your 2nd paragraph, you are not wrong.

--

Paige Miller



As Paige said, your predicted probability is too small: 220/40000 = 0.0055.

PROC LOGISTIC has trouble with such a rare event.

I advise oversampling to increase this proportion, e.g. good:bad = 1000:220, and using those 1,220 observations to build the model.

Or try the PEVENT= option to adjust for the population's probability of bad:

```
proc logistic data=TRAINING_DATA;
    class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
    model default(event='1') = VAR1--VAR50 / selection=stepwise pevent=0.0055;
run;
```


Thank you guys for the suggestions. I will try them out!

I will let you know if it went well/bad. 🙂

@Ksharp: Could you please elaborate a bit more, theoretically if you can, on the PEVENT= option?

Again, thank you for your time! 🙂


Sorry, my bad. Since your data is the entire population, not a sample from it, you don't need PEVENT=.

Still, I suggest you oversample your data to increase the bad:good ratio, and then use PEVENT= to adjust the predicted probability of the model event.


And again... As you guys said I should try oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT there is another question in my head after reading this quote:

"**Oversampling** the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly **under-sampling** the majority class may under-fit your algorithm if the minority class is very small."

Is there any way I can use some penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

Best regards


@Norbit wrote:

And again... As you guys said I should try oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT there is another question in my head after reading this quote:

"**Oversampling** the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly **under-sampling** the majority class may under-fit your algorithm if the minority class is very small."

Is there any way I can use some penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

Did you even try oversampling and comparing the training and validation data sets on your data?

--

Paige Miller



Hi again,

I tried the oversampling (or call it under-sampling), using the following code:

```
data OVERSAMPLING;
    set TRAINING_DATASET;
    if y=1 then output;
    if y=0 then do;
        if ranuni(10000) < 1/20 then output;
    end;
run;

proc freq data=OVERSAMPLING;
    tables y;
run;
```

I actually followed this LINK, and the default rate went up to 10.5% while non-defaults fell to 89.5%. The frequencies are now (approx.) 1,904 zeros and 200 ones. I also tried the offset calculation; it gave me almost the same intercept as before oversampling.
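For reference, the standard intercept (offset) correction after oversampling can be sketched like this, using the rates from the post (this is a sketch, not the exact code used):

```
data _null_;
    rho1 = 0.105;          /* event rate in the oversampled sample */
    p1   = 220/40000;      /* event rate in the full data          */
    /* subtract this offset from the intercept fitted on the oversample */
    offset = log( (rho1*(1 - p1)) / (p1*(1 - rho1)) );
    put "Intercept correction: " offset;
run;
```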

When counting defaults with PD > 0.025 (2.5%), I previously had hits on 9.42% of them. Now I hit 96.85%; only 3.15% of the defaults have a PD under 2.5%.

.. but!

With oversampling I do have high sensitivity (97.4%), but low specificity (10.7%), a false positive rate of 88.9%, a false negative rate of 2.7%, and 19.6% correct overall. How should I interpret these now?


Yes. That is why you need the option PEVENT=0.0055 to adjust the predicted probability (0.0055 is the event probability in the population).

"Is there any way I can use some penalized logistic regression?"

PROC LOGISTIC has such a method (the FIRTH option):

```
model good_bad(event='good') = &varlist / outroc=roc lackfit scale=none aggregate rsquare firth corrb;
```

Or, if your data size is small, you can also try exact logistic regression via the EXACT statement.
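A sketch of how FIRTH and PEVENT= might be combined on the poster's model (variable and dataset names follow the earlier posts; the cutoff list is illustrative):

```
proc logistic data=OVERSAMPLING;
    class CATEGORY_VAR1(param=ref ref='FIRST');
    model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
          / firth ctable pevent=0.0055 pprob=(0.005 to 0.05 by 0.005);
    output out=PRED_FIRTH predicted=PD_firth;
run;
```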


I am still confused about how to use it correctly.

As in my reply above, I tried to oversample my data.

But how do I use this on the validation data and predict the correct defaults?

I still don't have a model where I can say: "if PD > 0.5, then this applicant would be counted as defaulting within one year".


I still don't have a model where I can say: "if PD > 0.5, then this applicant would be counted as defaulting within one year".

Use the options PEVENT= (the event probability in the population) and FIRTH in the MODEL statement, and score the test data, not the validation data (which avoids the overfitting problem).

There are many ways to score, like the CODE statement, the SCORE statement, and PROC PLM. Also calling @StatDave

https://blogs.sas.com/content/iml/2019/02/11/proc-plm-regression-models-sas.html
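As one example of the scoring routes mentioned, the SCORE statement can apply the trained model directly to the held-out test data (a sketch; the dataset and variable names are assumptions carried over from the thread):

```
proc logistic data=TRAINING_DATA;
    class CATEGORY_VAR1(param=ref ref='FIRST');
    model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6;
    score data=TEST_DATA out=SCORED_TEST;   /* adds P_1, the predicted PD */
run;
```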
