SAS Programming

Ronein · Posted 05-15-2021 10:54 AM

Hello

I want to ask a question about credit score models.

Let's say that there are 100,000 customers of bank in December 2019.(December 2019 is called "base month").

For these 100,000 customers I have data from Janaury 2020 till december 2020 (This period is called "Following period").

During the following period I have information of Fail/Not fail (response varaible)and also information of different explanatory variables.

For each customer I have also forecast probablity of failure by using coeffients from current model that is used in the bank.

let's say that the Gini coefficent for these 100,000 customer is 92%.

We want to build a new regression model in order to improve the model ability.

For this task we devided the population (100,000 customer) into in-smaple (training set) and out-sample (test set) .

The in-sample is 70% of the population ( 70,000 customers).

The out-sample is 70% of population (30,000 customers).

Then from the in-sample we build a new regression model.

My questions:

1-The task is to compare Gini coefficient between old model and new model.

The Gini coefficient that is calculated based on the new model should be calculated on in-smaple or out-sample or all population?

The Gini coefficient based on old model (current model) should be calculated on in-smaple or out-sample or all population?

Note: I have mentioned before that we calculated Gini coefficient based on old model (current model) on 100,000 customers

2-In order to run the new regression model I saw 2 approaches:

a- Technically run it on 100,000 customers but for 30,000 customer put null value in fail/no fail response varaible

b-Technically run it on 70,000 customers (in-sample) only

May you show the SAS code of these 2 aprocahes? Which approcah is better?

3-May you show the SAS code to calculate the Gini coefficent?(base on the answer to question1 if it should be calculated on in-sample/outsample/all sample

4-I want to keep the regression coefffients of the new regression model in order to predict probablity of failure on a customer list from another period.

What is the way to it please?

thank you

Erik

mkeintz · Posted 05-15-2021 12:32 PM

@Ronein wrote:

Hello

I want to ask a question about credit score models.

Let's say that there are 100,000 customers of bank in December 2019.(December 2019 is called "base month").

For these 100,000 customers I have data from Janaury 2020 till december 2020 (This period is called "Following period").

During the following period I have information of Fail/Not fail (response varaible)and also information of different explanatory variables.

For each customer I have also forecast probablity of failure by using coeffients from current model that is used in the bank.

let's say that the Gini coefficent for these 100,000 customer is 92%.

The Gini coefficient of what? What are you using to rank the population? And what is the measure whose concentration over that population you are using?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 05-15-2021 12:56 PM

I didn't understand your comment.
For the new model Gini is calculated based on prediced probabilty of failure by the new model coefficients.
For the old model Gini is calculated based on predicted probability of failure by the old model coefficients

sbxkoenk · Posted 05-15-2021 03:07 PM

Hello,

I can answer your questions, but I do not have the time right now. In 5 minutes I need to shut-down my PC.

Let me start with a piece of code to calculate the Gini coefficient for your model.

It can be done with PROC LOGISTIC (SAS/Stat), using your target as dependent variable and using the predicted probabilities as an independent variable. Make absolutely sure you are modelling the same target event as in your existing model!!

ods select Association;ods output Association=work.Association;
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;
/* end of program */

The association table (Association of Predicted Probabilities and Observed Responses) contains the:

Percent Concordant
Percent Tied

Then use these formula's in a data step:

ROC index = Area Under ROC curve = AUC_ROC

ROC index = (Percent_Concordant + 0.5 * Percent_Tied)

Accuracy Ratio (AR) = Gini coefficient

Gini = 2 * ROC index - 1 = ( ( ROC index - 0.5 ) / 0.5 )

Geometrically, this means :

Area Under ROC curve : you divide area under ROC curve by the square

Gini : Do not consider lower triangular part of ROC chart, then divide remaining area under ROC curve by the upper triangle

Good luck,

Koen

Ronein · Posted 05-15-2021 04:29 PM

Thank you.
Should Gini be calculated on in-sample or out-sample or all data?

May anyone explain please?

ods select Association;ods output Association=work.Association;
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;

sbxkoenk · Posted 05-16-2021 06:51 AM

Hello,

Is the old (regression) model built on / with the same 100000 customers?

In other words, the old model's Gini coefficient of 92% that you are reporting, is it the result of scoring the 100000 customers with a model built on still other (earlier) observations or is it the result of scoring the 100000 customers with the model built on that same 100000 observations? In the first case, the 92% is a test-Gini and in the latter case it's a training Gini.

So answering this question delivers you an answer to the question whether you should use out-of-sample (test) or use in-sample (training) observations to compare your old and new regression model.

But if the same 100000 observations were also used to build the old model, were they used all 100000 for training (learning)? I can imagine there was also a training / (validation) / test data split?? In that case, why don't you consider the same split?

Finally I think you should also create a validation set. Without validation set (used while model building) the model will be vastly overfit and will lose a lot of its performance when applied to unseen test data.

Consider the use of a validation data set. It's good practice in this type of modelling.

Cheers,

Koen

Ronein · Posted 05-16-2021 08:21 AM

The old model was built a few years ago on another list of customers.
I am using the "old model " regression coefficients to calculate perficted probability of default for each customer in 100,000 custimers and then calculate Gini for these 100,000 customers and get 92%.
Is it clear??

sbxkoenk · Posted 05-16-2021 08:38 AM

Yes, that's clear! Thank you.

In that case, the 92% on the 100 000 observations is an "honest" Gini on out-of-sample data because these 100 000 observations were never "seen" by the model before.

Hence you should compare this Gini of 92% with the Gini on the out-of-sample data set (30 000 obs.) in your new modelling exercise. The Gini on in-sample + out-of-sample would be flattered (artificially high) as the in-sample data were used for the new model.

But again, consider the use of a validation set in your new modelling exercise to avoid severe overfitting! Validation data can also be considered in-sample.

Also, a 92% Gini is very high, especially for an application score card. Maybe you're dealing with a behavioral scorecard but even then 92% is still high. It will not be easy to outperform this.

Are you sure you are not mixing up between AUC_ROC and Gini??

Hope this helps,

Koen

PaigeMiller · Posted 05-16-2021 09:11 AM

Reading along, I think these are really great responses from @sbxkoenk. Well done! 👍 👍

--
Paige Miller

Ronein · Posted 12-31-2024 06:11 AM

population in development data : 100,000 customers

In-sample data:(70% of population): 70,000 customers

Out-sample data:(30% of population): 30,000 customers

Model was developed on in-sample data

PD was calculated on in_sample data and out-sample data and also out of time data and then calculate GINI:

Gini for in-sample data: 79.9%

Gini for Out-sample data: 79.3%

Gini for Out-Of-Time data: 81.8%

Ronein · Posted 12-31-2024 06:04 AM

The Gini coefficient of what? Credit score Model

What are you using to rank the population? PD values

And what is the measure whose concentration over that population you are using? what do you mean??

mkeintz · Posted 12-31-2024 11:21 AM

@Ronein wrote:

The Gini coefficient of what? Credit score Model

What are you using to rank the population? PD values

And what is the measure whose concentration over that population you are using? what do you mean??

My questions cited above were posted in 2021. Upon re-reading this thread, here is what I recall by my question "what is the measure whose concentration over that population you are using?"

The Lorenz curve that I presume would display the underlying distribution for the Gini coefficient needs a horizontal measure the (cumulative % of population) and a vertical (cum % of "measure whose concentration ..."). So I take the latter to be PD values. And I presume your horizontal is just based on a count of observations in your dataset.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Programming

in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Re: in sample,out sample,Gini

Gini OOT VS IN SAMPLE

out of sample range and holdout sample

Out-of-sample range vs holdout sample

Out of sample predictions with PROC GLM

(Iterable) Out of Sample Forecasting with Probit

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...