Norbit
Fluorite | Level 6

Hi guys,

I have tried to find an existing topic that could help me out, but with no success so far.

 

Let me start by describing my dataset:
I have a real-world loan application dataset. It covers people who have applied to borrow money, with information about them such as income, age, children, marital status, LTV, etc. (almost 200 variables in total). My response variable is their default status: whether or not they defaulted in the first year.

My dataset includes 40,000 observations, of which 220 are defaults (default=1).

 

I have cleaned the dataset as follows: if a variable is missing in more than 5% of rows, I remove the variable; if it is missing in less than 5%, I remove the affected rows.

Now I am down to approx. 50 variables. Furthermore, I split the cleaned dataset into a training and a test dataset (70% training, 30% test).
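
For reference, a split like that can be done with PROC SURVEYSELECT; a minimal sketch, where the dataset name CLEANED_DATA and the seed are placeholders:

proc surveyselect data=CLEANED_DATA out=SPLIT outall
                  samprate=0.7 seed=12345;   /* flag 70% for training */
run;

data TRAINING_DATA TEST_DATA;
   set SPLIT;
   if Selected then output TRAINING_DATA;    /* Selected=1 -> training */
   else output TEST_DATA;                    /* Selected=0 -> test     */
run;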

 

To investigate which variables I should work further with I do the following:

proc logistic data=TRAINING_DATA;
   class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
   model default(event='1') = VAR1--VAR50 / selection=stepwise;
run;

 

This gives 6 significant variables, a c-statistic of 0.701, Somers' D of 0.42, and an AIC of 2340.40.

I'm not very happy with the c-statistic, but I can live with it.


My next step is to calculate the probability of default given these 6 variables, using the following:

 

proc logistic data=TRAINING_DATA descending;
   class CATEGORY_VAR1(param=ref ref='FIRST');
   model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 / link=probit ctable
         pprob=(0.05 to 1 by 0.05);
   output out=PREDICTED_PROB predicted=PD_probit;
run;

(I also tried link=logit.)

 

I then test these predictions to see how many of them are actually correct hits, using the following:

 

data CHECK;
   set PREDICTED_PROB;
   where PD_probit > 0.5 and default=1;
run;

 

I got 0 hits! This indicates that my model cannot predict anything...

 

What am I doing wrong? How should I approach this?
What I want is to check what percentage of cases the model gets right (hopefully a lot), and then try the model out on the test dataset.
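
Something like this rough sketch is what I have in mind for the check (the 0.5 cutoff is just a placeholder):

data SCORED;
   set PREDICTED_PROB;
   predicted_default = (PD_probit > 0.5);   /* classify at the cutoff */
run;

proc freq data=SCORED;
   tables predicted_default*default;        /* hit rates in percent   */
run;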

 

Sorry if the post is too long; let me know if there is something I should add/remove. 🙂

Best regards.

14 REPLIES
PaigeMiller
Diamond | Level 26

About 0.5% of your data has Y=1. It's not surprising that this particular logistic regression doesn't predict that any observations will default: the huge mass of data driving the regression did not default. Try something called oversampling. Go to your favorite internet search engine and type in

 

logistic regression oversampling in sas

--
Paige Miller
Norbit
Fluorite | Level 6

When I oversample, should it be done before or after splitting the data into training and validation sets? I am still a bit uneasy about this kind of trick.

 

Another question: am I totally wrong in saying that I want a model that predicts a PD from which I can conclude "if PD > 0.5 (50%), then default=1"? Which leads me back to the topic: I want to be able to identify, at the time of application, which applicants are likely defaulters.

 

PaigeMiller
Diamond | Level 26

@Norbit wrote:

When I oversample, should it be done before or after splitting the data into training and validation sets? I am still a bit uneasy about this kind of trick.

 

Another question: am I totally wrong in saying that I want a model that predicts a PD from which I can conclude "if PD > 0.5 (50%), then default=1"? Which leads me back to the topic: I want to be able to identify, at the time of application, which applicants are likely defaulters.

 


You oversample, then randomly split the resulting data into training and validation data sets.

 

Regarding your 2nd paragraph, you are not wrong.

--
Paige Miller
SASKiwi
PROC Star

If you look at the number of defaults in your data, the chance of randomly selecting an account that will default is 0.55%. I suggest you compare that with the average PD of your defaulting accounts. If the average PD is significantly greater than 0.55%, then I'd say your model has some predictive power, as it is doing better than random selection.
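
For example, a sketch using the output data set from your PROC LOGISTIC run:

proc means data=PREDICTED_PROB mean n;
   class default;          /* compare defaulters vs. non-defaulters */
   var PD_probit;          /* mean predicted PD per group           */
run;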

Ksharp
Super User

As Paige said, your event probability is too small: 220/40000 = 0.0055.

PROC LOGISTIC struggles with such a rare event.

I advise oversampling to raise this probability, e.g. good:bad = 1000:220, and using those 1220 observations to build the model.

 

Or try the PEVENT= option to adjust for the population's probability of bad:

proc logistic data=TRAINING_DATA;
   class CATEGORY_VAR1(param=ref ref='FIRST') CATEGORY_VAR2(param=ref ref='FIRST');
   model default(event='1') = VAR1--VAR50 / selection=stepwise ctable pevent=0.0055;
run;

 

Norbit
Fluorite | Level 6

@Ksharp @SASKiwi @PaigeMiller 

Thank you guys for the suggestions. I will try them out!

I will let you know if it went well/bad. 🙂 

@Ksharp : Could you please elaborate a bit more, theoretically if you can, on the PEVENT= option?

Again, thank you for your time! 🙂

Ksharp
Super User

Sorry, my bad. Since your data is the whole population, NOT a sample from it, you don't need PEVENT=.

Still, I suggest you over-sample your data to raise the bad:good ratio, and then use PEVENT= to adjust the model's event probability back to the population rate.

Norbit
Fluorite | Level 6

And again... Since you guys said I should try oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT another question came to mind after reading this quote:

 

"Oversampling the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly under-sampling the majority class may under-fit your algorithm if the minority class is very small."

 

Is there any way I can use some penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

Best regards

 

PaigeMiller
Diamond | Level 26

@Norbit wrote:

And again... Since you guys said I should try oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT another question came to mind after reading this quote:

 

"Oversampling the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly under-sampling the majority class may under-fit your algorithm if the minority class is very small."

 

Is there any way I can use some penalized logistic regression, where the regression knows that I have oversampled the minority class of the data?

 


Did you actually try oversampling and compare the results on the training and validation data sets for your data?

--
Paige Miller
Norbit
Fluorite | Level 6

Hi again, 

 

I tried oversampling (or rather, under-sampling the majority class), using the following code:

 

data OVERSAMPLING;
   set TRAINING_DATASET;
   if y=1 then output;                      /* keep every default           */
   if y=0 then do;
      if ranuni(10000) < 1/20 then output;  /* keep ~5% of the non-defaults */
   end;
run;

proc freq data=OVERSAMPLING;
   tables y;
run;

I actually followed this LINK, and the default rate went up to 10.5% while the non-default rate fell to 89.5%. The frequencies are now (approx.) 0's: 1904 and 1's: 200. I also tried the offset calculation; it gave me almost the same intercept as before oversampling.
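
For reference, a sketch of that intercept (prior) correction, using the rates quoted above (rho is the sampled default rate, pi the population one):

data _null_;
   rho = 0.105;                 /* default rate in the sampled data    */
   pi  = 220/40000;             /* default rate in the full population */
   offset = log( (rho*(1 - pi)) / ((1 - rho)*pi) );
   put 'Subtract this from the fitted intercept: ' offset=;
run;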

 

Before, when counting defaults with PD > 0.025 (2.5%), I got hits on only 9.42% of them. Now I hit 96.85%; only 3.15% of the defaults have a PD under 2.5%.

 

.. but! 

With oversampling I get high sensitivity (97.4%), low specificity (10.7%), a false positive rate of 88.9%, a false negative rate of 2.7%, and percent correct of 19.6%. How should I interpret these now?

 

Ksharp
Super User

Yes. That is why you need the option PEVENT=0.0055 to adjust the predicted probability (0.0055 is the event probability in the population).

 

"Is there any way where I can use some Penelized Logistic regression? "

PROC LOGISTIC has such a method (the FIRTH option):

 

model good_bad(event='good') = &varlist / outroc=roc lackfit scale=none aggregate rsquare firth corrb;

 

Or, if your data size is small, you can also try exact logistic regression via the EXACT statement.
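
A minimal sketch of both options (dataset and variable names are placeholders):

/* Firth penalized maximum likelihood */
proc logistic data=TRAINING_DATA;
   model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 / firth;
run;

/* Exact logistic regression; only feasible for small data sets */
proc logistic data=SMALL_DATA;
   model default(event='1') = VAR1 VAR2;
   exact VAR1 VAR2 / estimate=both;
run;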

Norbit
Fluorite | Level 6

I am still confused about how to use this correctly.

As in my reply above, I tried oversampling my data.

 

But how do I use this on the validation data to predict the actual defaults?
I still don't have a model where I can say: "If PD > 0.5, then this applicant is predicted to default within one year."

Ksharp
Super User

I still don't have a model where I can say: "If PD > 0.5, then this applicant is predicted to default within one year."

 

Use the options PEVENT= (the event probability in the population) and FIRTH in the MODEL statement, and score the test data, not the validation data (which avoids the overfitting problem).

There are many ways to score, like the CODE statement, the SCORE statement, or PROC PLM (see the sketch below the links). And calling @StatDave

 

https://blogs.sas.com/content/iml/2019/02/11/proc-plm-regression-models-sas.html

https://blogs.sas.com/content/iml/2019/11/20/predicted-values-generalized-linear-models-ilink-sas.ht...
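
For example, a sketch with the SCORE statement (variable names are placeholders; PRIOREVENT= maps the posterior probabilities back to the population event rate):

proc logistic data=OVERSAMPLING;
   model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 / firth;
   score data=TEST_DATA out=TEST_SCORED priorevent=0.0055;  /* P_1 = adjusted PD */
run;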

StatDave
SAS Super FREQ

Going back to your original post, use the PREDPROBS=INDIVIDUAL option in the OUTPUT statement rather than the PREDICTED= option. The resulting data set will contain a variable holding the predicted response category, _INTO_. These predicted response categories are determined using a maximum predicted probability rule, meaning that whichever predicted probability is larger - event or nonevent - determines the predicted response category. With such a rare event, it is unlikely that any predicted event probability will exceed 0.5, but some will likely exceed the observed event rate that you say is 220/40000=.0055. Oversampling is probably not necessary. See the description of the PREDPROBS= option and the description of the resulting data set in the "Input and Output Data Sets" Details section of the PROC LOGISTIC documentation.
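
A minimal sketch of that suggestion (variable names are placeholders):

proc logistic data=TRAINING_DATA;
   model default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6;
   output out=PRED predprobs=(individual);
run;

/* _FROM_ = observed category, _INTO_ = predicted category,
   IP_1 / IP_0 = individual predicted probabilities */
proc freq data=PRED;
   tables _FROM_*_INTO_;
run;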

