Ronein
Meteorite | Level 14

Hello

I want to ask a question about credit score models.

Let's say that there are 100,000 customers of a bank in December 2019 (December 2019 is called the "base month").

For these 100,000 customers I have data from January 2020 through December 2020 (this period is called the "following period").

During the following period I have Fail/Not fail information (the response variable) and also information on different explanatory variables.

For each customer I also have a forecast probability of failure, calculated using the coefficients of the current model used in the bank.

Let's say that the Gini coefficient for these 100,000 customers is 92%.

We want to build a new regression model in order to improve the model's predictive ability.

For this task we divided the population (100,000 customers) into an in-sample (training set) and an out-sample (test set).

The in-sample is 70% of the population (70,000 customers).

The out-sample is 30% of the population (30,000 customers).

Then from the in-sample we build a new regression model.

My questions:

1-The task is to compare the Gini coefficient between the old model and the new model.

Should the Gini coefficient based on the new model be calculated on the in-sample, the out-sample, or the whole population?

Should the Gini coefficient based on the old model (current model) be calculated on the in-sample, the out-sample, or the whole population?

Note: As mentioned above, we calculated the Gini coefficient based on the old model (current model) on all 100,000 customers.

2-To run the new regression model I have seen 2 approaches:

a-Technically run it on all 100,000 customers, but set the fail/no fail response variable to null for the 30,000 out-sample customers

b-Technically run it on the 70,000 customers (in-sample) only

Could you show the SAS code for these 2 approaches? Which approach is better?

3-Could you show the SAS code to calculate the Gini coefficient (based on the answer to question 1, i.e. whether it should be calculated on the in-sample, the out-sample, or the whole sample)?

4-I want to keep the regression coefficients of the new regression model in order to predict the probability of failure for a customer list from another period.

What is the way to do it, please?

 

thank you

Erik

 

mkeintz
PROC Star

@Ronein wrote:

Let's say that the Gini coefficient for these 100,000 customers is 92%.

The Gini coefficient of what?   What are you using to rank the population?  And what is the measure whose concentration over that population you are using?


Ronein
Meteorite | Level 14
I didn't understand your comment.
For the new model, Gini is calculated based on the probability of failure predicted with the new model's coefficients.
For the old model, Gini is calculated based on the probability of failure predicted with the old model's coefficients.
sbxkoenk
SAS Super FREQ

Hello,

 

I can answer your questions, but I do not have the time right now. In 5 minutes I need to shut-down my PC.

Let me start with a piece of code to calculate the Gini coefficient for your model.

It can be done with PROC LOGISTIC (SAS/Stat), using your target as dependent variable and using the predicted probabilities as an independent variable. Make absolutely sure you are modelling the same target event as in your existing model!!

 

ods select Association;
ods output Association=work.Association;
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;
/* end of program */

The association table (Association of Predicted Probabilities and Observed Responses) contains the: 

  • Percent Concordant
  • Percent Tied

Then use these formulas in a data step:

ROC index = Area Under ROC curve = AUC_ROC

ROC index = (Percent_Concordant + 0.5 * Percent_Tied) / 100     (the table reports both values as percentages)

Accuracy Ratio (AR) = Gini coefficient

Gini = 2 * ROC index - 1 = ( ( ROC index - 0.5 ) / 0.5 )
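
The formulas above could be applied in a data step along the following lines. This is only a sketch: it assumes the work.Association data set created by the ODS OUTPUT statement above has the label column Label1 and the numeric column nValue1 (worth confirming with PROC CONTENTS or PROC PRINT on your installation). The output data set name work.gini and the variable names pct_concordant, pct_tied, roc_index and gini are made up for illustration.

```sas
/* Sketch: derive ROC index and Gini from the Association table.   */
/* Assumption: Label1 holds the row labels and nValue1 the numeric */
/* values, expressed as percentages (0-100).                       */
data work.gini;
  set work.Association end=last;
  retain pct_concordant pct_tied;

  if Label1 = 'Percent Concordant' then pct_concordant = nValue1;
  else if Label1 = 'Percent Tied'  then pct_tied       = nValue1;

  /* After the last row, apply the two formulas and write one record. */
  if last then do;
    roc_index = (pct_concordant + 0.5 * pct_tied) / 100;
    gini      = 2 * roc_index - 1;
    output;
  end;

  keep pct_concordant pct_tied roc_index gini;
run;

proc print data=work.gini noobs;
run;
```

The same table also reports the statistics "c" and "Somers' D", which should agree with the ROC index and the Gini computed here, so they can serve as a cross-check.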

 

Geometrically, this means :

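Area Under ROC curve : divide the area under the ROC curve by the area of the whole (unit) square

Gini : ignore the triangle below the diagonal of the ROC chart, then divide the remaining area under the ROC curve (the part between the curve and the diagonal) by the area of the triangle above the diagonal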

 

Good luck,

Koen

 

Ronein
Meteorite | Level 14

Thank you.
Should Gini be calculated on in-sample or out-sample or all data?

May anyone explain please?

 

 

ods select Association;
ods output Association=work.Association;
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;

 

sbxkoenk
SAS Super FREQ

Hello,

 

Is the old (regression) model built on / with the same 100000 customers?

In other words, the old model's Gini coefficient of 92% that you are reporting, is it the result of scoring the 100000 customers with a model built on still other (earlier) observations or is it the result of scoring the 100000 customers with the model built on that same 100000 observations? In the first case, the 92% is a test-Gini and in the latter case it's a training Gini.

So answering this question delivers you an answer to the question whether you should use out-of-sample (test) or use in-sample (training) observations to compare your old and new regression model.

But if the same 100000 observations were also used to build the old model, were all 100000 used for training (learning)? I can imagine there was also a training / (validation) / test data split?? In that case, why don't you consider the same split?

 

Finally, I think you should also create a validation set. Without a validation set (used while building the model), the model is likely to be overfit and to lose a good part of its performance when applied to unseen test data.

Consider the use of a validation data set. It's good practice in this type of modelling.

 

Cheers,

Koen

Ronein
Meteorite | Level 14
The old model was built a few years ago on another list of customers.
I am using the "old model" regression coefficients to calculate the predicted probability of default for each of the 100,000 customers, and then I calculate Gini for these 100,000 customers and get 92%.
Is it clear??
sbxkoenk
SAS Super FREQ

Yes, that's clear! Thank you.

 

In that case, the 92% on the 100 000 observations is an "honest" Gini on out-of-sample data because these 100 000 observations were never "seen" by the model before.

Hence you should compare this Gini of 92% with the Gini on the out-of-sample data set (30 000 obs.) in your new modelling exercise. The Gini on in-sample + out-of-sample would be flattered (artificially high) as the in-sample data were used for the new model.
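
As a rough sketch of that comparison (not a definitive implementation), the split, the fit on the in-sample and the Gini on the out-of-sample data could look like this. Every name here is an assumption for illustration: work.customers, binary_target (1 = fail), the explanatory variables x1 x2 x3, and the output data sets work.split, work.new_model, work.test_scored and work.assoc_new_test. P_1 is the name the SCORE statement normally gives to the predicted probability of level '1'; check the scored data set if your target is coded differently.

```sas
/* 1. Flag a 70% simple random sample as in-sample (Selected = 1). */
/*    OUTALL keeps all rows and adds the 0/1 variable Selected.    */
proc surveyselect data=work.customers out=work.split
                  method=srs samprate=0.7 seed=12345 outall;
run;

/* 2. Fit the new model on the in-sample rows only, store it with  */
/*    OUTMODEL=, and score the held-out 30% (Selected = 0).        */
proc logistic data=work.split(where=(Selected = 1))
              outmodel=work.new_model;
  model binary_target(event='1') = x1 x2 x3;   /* hypothetical predictors */
  score data=work.split(where=(Selected = 0)) out=work.test_scored;
run;

/* 3. Out-of-sample Gini of the new model: use the predicted       */
/*    probability as the only predictor and read the Association   */
/*    table, as in the earlier program.                            */
ods select Association;
ods output Association=work.assoc_new_test;
proc logistic data=work.test_scored;
  model binary_target(event='1') = P_1;
run;
```

The model stored with OUTMODEL= can later be applied to a customer list from another period with something like `proc logistic inmodel=work.new_model; score data=<other list> out=<scored list>; run;`, which avoids copying coefficients by hand.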

 

But again, consider the use of a validation set in your new modelling exercise to avoid severe overfitting! Validation data can also be considered in-sample.

 

Also, a 92% Gini is very high, especially for an application scorecard. Maybe you're dealing with a behavioral scorecard, but even then 92% is still high. It will not be easy to outperform this.

Are you sure you are not mixing up AUC_ROC and Gini??

 

Hope this helps,

Koen

PaigeMiller
Diamond | Level 26

Reading along, I think these are really great responses from @sbxkoenk. Well done! 👍 👍

--
Paige Miller
