Hello
I want to ask a question about credit score models.
Let's say that there are 100,000 customers of a bank in December 2019 (December 2019 is called the "base month").
For these 100,000 customers I have data from January 2020 till December 2020 (this period is called the "following period").
During the following period I have information on Fail/Not fail (the response variable) and also information on different explanatory variables.
For each customer I also have a forecast probability of failure, obtained using the coefficients from the current model that is used in the bank.
Let's say that the Gini coefficient for these 100,000 customers is 92%.
We want to build a new regression model in order to improve the model's ability.
For this task we divided the population (100,000 customers) into an in-sample (training set) and an out-sample (test set).
The in-sample is 70% of the population (70,000 customers).
The out-sample is 30% of the population (30,000 customers).
Then from the in-sample we build a new regression model.
My questions:
1- The task is to compare the Gini coefficient between the old model and the new model.
Should the Gini coefficient calculated based on the new model be calculated on the in-sample, the out-sample, or the whole population?
Should the Gini coefficient based on the old model (current model) be calculated on the in-sample, the out-sample, or the whole population?
Note: as mentioned before, we calculated the Gini coefficient based on the old (current) model on all 100,000 customers.
2- In order to run the new regression model I have seen 2 approaches:
a- Technically run it on all 100,000 customers, but for the 30,000 out-sample customers set the fail/no fail response variable to missing
b- Technically run it on the 70,000 customers (in-sample) only
Could you show the SAS code for these 2 approaches? Which approach is better?
3- Could you show the SAS code to calculate the Gini coefficient? (Based on the answer to question 1: should it be calculated on the in-sample, the out-sample, or all data?)
4- I want to keep the regression coefficients of the new regression model in order to predict the probability of failure for a customer list from another period.
What is the way to do it, please?
thank you
Erik
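For reference, questions 2 and 4 can be sketched in SAS roughly as follows. This is only a sketch: the dataset and variable names (work.customers, fail, insample, x1-x3, work.newperiod) are assumptions, not taken from the thread.

```sas
/* Approach a (question 2): one run on all 100,000 rows. The response is
   set to missing for the 30,000 out-sample rows, so PROC LOGISTIC fits
   only on the 70,000 complete cases but scores every row. */
data work.all;
    set work.customers;
    if insample = 0 then fail = .;   /* mask the out-sample response */
run;

proc logistic data=work.all;
    model fail(event='1') = x1 x2 x3;
    output out=work.scored p=p_new;  /* predictions for all 100,000 rows */
run;

/* Approach b (question 2): fit on the in-sample only, and keep the
   fitted model in an item store with the STORE statement. */
proc logistic data=work.customers(where=(insample=1));
    model fail(event='1') = x1 x2 x3;
    store work.new_model;            /* saves the fitted model */
run;

/* Question 4: score a customer list from another period with PROC PLM,
   using the item store saved above. */
proc plm restore=work.new_model;
    score data=work.newperiod out=work.newperiod_scored
          predicted / ilink;         /* ILINK -> probability scale */
run;
```

An alternative to STORE / PROC PLM is the OUTMODEL= option on PROC LOGISTIC together with INMODEL= in a later scoring run; both approaches keep the coefficients so new data can be scored without refitting.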
@Ronein wrote:
Hello
I want to ask a question about credit score models.
Let's say that there are 100,000 customers of a bank in December 2019 (December 2019 is called the "base month").
For these 100,000 customers I have data from January 2020 till December 2020 (this period is called the "following period").
During the following period I have information on Fail/Not fail (the response variable) and also information on different explanatory variables.
For each customer I also have a forecast probability of failure, obtained using the coefficients from the current model that is used in the bank.
Let's say that the Gini coefficient for these 100,000 customers is 92%.
The Gini coefficient of what? What are you using to rank the population? And what is the measure whose concentration over that population you are using?
Hello,
I can answer your questions, but I do not have the time right now. In 5 minutes I need to shut down my PC.
Let me start with a piece of code to calculate the Gini coefficient for your model.
It can be done with PROC LOGISTIC (SAS/Stat), using your target as dependent variable and using the predicted probabilities as an independent variable. Make absolutely sure you are modelling the same target event as in your existing model!!
ods select Association;
ods output Association=work.Association;

proc logistic data=libname.datasetname;
    id customer_number;
    model binary_target(event='1') = predicted_value_probability;
run;
/* end of program */
The association table (Association of Predicted Probabilities and Observed Responses) contains the Percent Concordant, Percent Tied, and Percent Discordant statistics, as well as the c statistic.
Then use these formulas in a DATA step:
ROC index = Area Under the ROC curve = AUC_ROC
ROC index = (Percent_Concordant + 0.5 * Percent_Tied) / 100
Accuracy Ratio (AR) = Gini coefficient
Gini = 2 * ROC index - 1 = (ROC index - 0.5) / 0.5
Geometrically, this means:
AUC_ROC: divide the area under the ROC curve by the area of the unit square.
Gini: ignore the lower triangular part of the ROC chart, then divide the remaining area under the ROC curve by the area of the upper triangle.
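The formulas above can be applied to the ODS Association table in a short DATA step. A sketch, assuming the Association dataset was captured by the PROC LOGISTIC run shown earlier (the c statistic row of that table is the AUC):

```sas
/* Compute AUC and Gini from the Association table captured with:
   ods output Association=work.Association;
   The table stores the c statistic under Label2='c' / nValue2. */
data work.gini;
    set work.Association;
    where Label2 = 'c';          /* the c statistic = AUC_ROC */
    AUC  = nValue2;
    Gini = 2 * AUC - 1;          /* Accuracy Ratio / Gini coefficient */
    keep AUC Gini;
run;

proc print data=work.gini;
run;
```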
Good luck,
Koen
Thank you.
Should Gini be calculated on in-sample or out-sample or all data?
May anyone explain please?
ods select Association;
ods output Association=work.Association;
proc logistic data=libname.datasetname;
id customer_number;
model binary_target(event='1') = predicted_value_probability;
run;
Hello,
Is the old (regression) model built on / with the same 100000 customers?
In other words, the old model's Gini coefficient of 92% that you are reporting, is it the result of scoring the 100000 customers with a model built on still other (earlier) observations or is it the result of scoring the 100000 customers with the model built on that same 100000 observations? In the first case, the 92% is a test-Gini and in the latter case it's a training Gini.
So answering this question delivers you an answer to the question whether you should use out-of-sample (test) or use in-sample (training) observations to compare your old and new regression model.
But if the same 100000 observations were also used to build the old model, were all 100000 used for training (learning)? I can imagine there was also a training / (validation) / test data split? In that case, why don't you consider the same split?
Finally I think you should also create a validation set. Without validation set (used while model building) the model will be vastly overfit and will lose a lot of its performance when applied to unseen test data.
Consider the use of a validation data set. It's good practice in this type of modelling.
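A training / validation / test split like the one suggested here could be drawn with PROC SURVEYSELECT. A sketch only: the dataset name, seed, and 60/20/20 proportions are assumptions.

```sas
/* Randomly assign each customer to one of three groups:
   GroupID=1 -> training (60%), GroupID=2 -> validation (20%),
   GroupID=3 -> test (20%). */
proc surveyselect data=work.customers out=work.split
        seed=27513 groups=(60 20 20);
run;
```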
Cheers,
Koen
Yes, that's clear! Thank you.
In that case, the 92% on the 100 000 observations is an "honest" Gini on out-of-sample data because these 100 000 observations were never "seen" by the model before.
Hence you should compare this Gini of 92% with the Gini on the out-of-sample data set (30 000 obs.) in your new modelling exercise. The Gini on in-sample + out-of-sample would be flattered (artificially high) as the in-sample data were used for the new model.
But again, consider the use of a validation set in your new modelling exercise to avoid severe overfitting! Validation data can also be considered in-sample.
Also, a 92% Gini is very high, especially for an application score card. Maybe you're dealing with a behavioral scorecard but even then 92% is still high. It will not be easy to outperform this.
Are you sure you are not mixing up AUC_ROC and Gini?
Hope this helps,
Koen
Reading along, I think these are really great responses from @sbxkoenk. Well done! 👍 👍