About Norbit

Norbit · ‎11-20-2019

I am still confused about how to use it correctly. As in my reply above, I tried to oversample my data. But how do I use this on the validation data and predict the defaults correctly? I still don't have a model where I can say: "If PD > 0.5 then this applicant would be counted as going into default within one year".
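One possible way to apply an oversampled model to validation data is sketched below. It uses the SCORE statement of PROC LOGISTIC with its PRIOREVENT= option, which rescales the predicted probabilities to a stated population event rate. The dataset name VALIDATION_DATASET, the output names SCORED and FLAGGED, the variable predicted_default, and the six predictors are assumptions for illustration. The prior of 0.0055 is simply 220/40000 from the original post, and the 0.025 cutoff is the one mentioned in the thread.

```sas
/* Sketch only: VALIDATION_DATASET, SCORED, FLAGGED and the VAR1-VAR6 list
   are assumed names; the 0.0055 prior is 220/40000 from the original post. */
proc logistic data=OVERSAMPLING;
   model y(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6;
   /* PRIOREVENT= adjusts the scored probabilities to the population event rate */
   score data=VALIDATION_DATASET out=SCORED priorevent=0.0055;
run;

data FLAGGED;
   set SCORED;
   /* P_1 is the predicted probability of y=1 written by the SCORE statement */
   predicted_default = (P_1 > 0.025);
run;

/* Cross-tabulate actual against predicted defaults on the validation data */
proc freq data=FLAGGED;
   tables y*predicted_default;
run;
```

The cross-tabulation gives the hit counts on data the model has not seen, which is probably a fairer check than counting hits on the training data.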

Norbit · ‎11-20-2019

Hi again,

I tried oversampling (or rather, under-sampling the non-defaults) with the following code:

    data OVERSAMPLING;
      set TRAINING_DATASET;
      if y=1 then output;
      if y=0 then do;
        if ranuni(10000)<1/20 then output;
      end;
    run;

    proc freq data=OVERSAMPLING;
      tables y;
    run;

I followed this LINK, and the default rate went up to 10.5% while the non-default rate fell to 89.5%. The frequencies are now approximately 1904 0's and 200 1's. I also tried the offset calculation, and it gave me almost the same intercept as before oversampling.

Before, when counting defaults for PD > 0.025 (2.5%), I had hits on 9.42%. Now I get hits on 96.85%; only 3.15% have a PD under 2.5%. But with oversampling I get a high sensitivity (97.4), low specificity (10.7), false positive (88.9), false negative (2.7), correct (19.6). How should I interpret these now?
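One way to read these numbers, offered as a sketch and not a definitive answer: the data step keeps every y=1 row and each y=0 row with probability 1/20, so the odds of default in the sampled data are about 20 times the odds in the training data. The symbols p_s (probability on the sampled scale) and p (probability on the original scale) are notation introduced here, not taken from the thread.

```latex
% Effect of keeping all events and 1/20 of the non-events
\operatorname{logit}(p_s) = \operatorname{logit}(p) + \ln\frac{1}{1/20}
                          = \operatorname{logit}(p) + \ln 20
                          \approx \operatorname{logit}(p) + 3.00

% A 2.5% cutoff on the original scale, moved to the sampled scale
\frac{0.025}{0.975} \times 20 \approx 0.513,
\qquad
p_s \approx \frac{0.513}{1 + 0.513} \approx 0.339
```

Under these assumptions, a 2.5% cutoff on the original scale corresponds to roughly 34% on the oversampled scale. Applying 2.5% directly to uncorrected probabilities would therefore flag nearly every applicant, which is consistent with a very high sensitivity and a very low specificity.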

Norbit · ‎11-19-2019

And again... Since you suggested I try oversampling, I started reading more and more about it. It seems to be the "correct" way to handle the data, BUT another question came to mind after reading this quote: "Oversampling the minority class using SMOTE or other algorithms has the disadvantage that it suffers from over-fitting. That is, you may perform well on the training set but on the test set your performance may suffer badly. Similarly under-sampling the majority class may under-fit your algorithm if the minority class is very small." Is there any way I can use some kind of penalized logistic regression, where the regression knows that I have oversampled the minority class of the data? Best regards

Norbit · ‎11-17-2019

When I oversample, should it be done before or after splitting the data into training and validation? I am still concerned about doing this kind of trick. Another question: am I totally wrong in saying that I want a model which can predict a PD where I can conclude: "If PD > 0.5 (50%) then default=1"? Which leads me back to the topic: I want to be able to identify, at the time of application, which applicants are possible defaulters.

Norbit · ‎11-16-2019

@Ksharp @SASKiwi @PaigeMiller Thank you guys for the suggestions. I will try them out and let you know whether it went well or badly. 🙂 @Ksharp : Could you please elaborate a bit more, in a theoretical way if you can, on the "pevent=" option? Again, thank you for your time! 🙂

Norbit · ‎11-14-2019

Hi guys,

I have tried to find another topic that could help me out, but without success. Let me start by describing my dataset: I have a real-world application dataset. It covers people who have applied for a loan, with information such as income, age, children, marital status, LTV, etc., almost 200 variables in total. My response variable is their default status: whether or not they defaulted in the first year. The dataset includes 40,000 observations and 220 defaults (default value=1).

I cleaned the dataset as follows: if a variable had more than 5% missing values, I removed the variable; if it had less than 5% missing, I removed the affected rows. Now I am down to approx. 50 variables. Furthermore, I split the original dataset into a training and a test dataset (70% training, 30% test).

To investigate which variables I should work further with, I do the following:

    Proc logistic data=TRAINING_DATA;
      class CATEGORY_VAR1(PARAM=REF REF='FIRST') CATEGORY_CAR2(PARAM=ref ref='FIRST');
      model default(event='1')= VAR1--VAR50/selection=stepwise;
    run;

This gives 6 significant variables, a c-value of 0.701, Somers' D: 0.42, AIC: 2340.40. I'm not very happy with the c-value, but I can live with it.

My next step is to calculate the probability of default given these 6 variables, using the following:

    PROC LOGISTIC DATA = TRAINING_DATA descending;
      class CATEGORY_VAR1(PARAM=REF REF='FIRST');
      MODEL default(event='1') = VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 / link = probit ctable pprob=(0.05 to 1 by 0.05);
      output out=PREDICTED_PROB predicted=PD_probit;
    RUN;

(I also tried link=logit.) I then check how many of these predictions are actual hits with the following:

    data CHECK;
      set PREDICTED_PROB;
      where PD_probit > 0.5 and default=1;
    run;

I got 0 hits! This suggests that my model cannot predict anything... What am I doing wrong? How should I approach it?

My wish would be: check what percentage the model gets correct (hopefully a lot), and then try the model on the test dataset. Sorry if the post is too long; let me know if there is something I should add/remove. 🙂 Best regards.
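A short piece of arithmetic may help explain the 0 hits. It assumes the training data has roughly the same default rate as the full dataset; the cleaning and the 70/30 split will shift the exact figure. For a logit model with an intercept fitted by maximum likelihood, the fitted probabilities average to the observed event rate in the fitting data, and a probit fit behaves similarly in practice. The symbols \bar{p}, \hat{p}_i and N are notation introduced here.

```latex
% Average fitted PD equals the observed default rate (logit link with intercept)
\bar{p} = \frac{1}{N}\sum_{i=1}^{N}\hat{p}_i
        = \frac{\#\{\text{default}=1\}}{N}
        \approx \frac{220}{40\,000} = 0.0055

% How far the 0.5 cutoff sits above that average
\frac{0.5}{0.0055} \approx 91
```

On these assumptions, a cutoff of 0.5 sits about 91 times above the average fitted PD, so very few applicants, if any, would be expected to exceed it when the c-value is around 0.70. A cutoff closer to the base rate is likely to be more informative for counting hits.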


Re: How to predict PD with logistic regression?
