<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: in sample,out sample,Gini in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741711#M231885</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is the old (regression) model built on / with the same 100000 customers?&lt;/P&gt;
&lt;P&gt;In other words, the old model's Gini coefficient of 92% that you are reporting, is it the result of scoring the 100000 customers with a model built on still other (earlier) observations or is it the result of scoring the 100000 customers with the model built on that same 100000 observations? In the first case, the 92% is a test-Gini and in the latter case it's a training Gini.&lt;/P&gt;
&lt;P&gt;So answering this question delivers you an answer to the question whether you should use out-of-sample (test) or use in-sample (training) observations to compare your old and new regression model.&lt;/P&gt;
&lt;P&gt;But if the same 100000 observations were also used to build the old model, were all 100000 used for training (learning)? I can imagine there was also a training / (validation) / test data split. In that case, why don't you consider the same split?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, I think you should also create a validation set. Without a validation set (used during model building), the model is likely to be overfit and to lose much of its performance when applied to unseen test data.&lt;/P&gt;
&lt;P&gt;Consider the use of a validation data set. It's good practice in this type of modelling.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
    <pubDate>Sun, 16 May 2021 10:51:18 GMT</pubDate>
    <dc:creator>sbxkoenk</dc:creator>
    <dc:date>2021-05-16T10:51:18Z</dc:date>
    <item>
      <title>in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741647#M231856</link>
      <description>&lt;P&gt;Hello&lt;/P&gt;
&lt;P&gt;I want to ask a question about credit score models.&lt;/P&gt;
&lt;P&gt;Let's say that there are 100,000 customers of a bank in December 2019 (December 2019 is called the "base month").&lt;/P&gt;
&lt;P&gt;For these 100,000 customers I have data from January 2020 to December 2020 (this period is called the "following period").&lt;/P&gt;
&lt;P&gt;For the following period I have the Fail/Not fail outcome (the response variable) and also several explanatory variables.&lt;/P&gt;
&lt;P&gt;For each customer I also have a forecast probability of failure, calculated with the coefficients of the current model used in the bank.&lt;/P&gt;
&lt;P&gt;Let's say that the Gini coefficient for these 100,000 customers is 92%.&lt;/P&gt;
&lt;P&gt;We want to build a new regression model in order to improve on the current model's performance.&lt;/P&gt;
&lt;P&gt;For this task we divided the population (100,000 customers) into in-sample (training set) and out-sample (test set).&lt;/P&gt;
&lt;P&gt;The in-sample is 70% of the population (70,000 customers).&lt;/P&gt;
&lt;P&gt;The out-sample is 30% of the population (30,000 customers).&lt;/P&gt;
&lt;P&gt;Then from the in-sample we build a new regression model.&lt;/P&gt;
&lt;P&gt;My questions:&lt;/P&gt;
&lt;P&gt;1-The task is to compare the Gini coefficient of the old model with that of the new model.&lt;/P&gt;
&lt;P&gt;Should the Gini coefficient of the new model be calculated on the in-sample, the out-sample, or the whole population?&lt;/P&gt;
&lt;P&gt;Should the Gini coefficient of the old model (current model) be calculated on the in-sample, the out-sample, or the whole population?&lt;/P&gt;
&lt;P&gt;Note: as mentioned above, we calculated the Gini coefficient of the old model (current model) on all 100,000 customers.&lt;/P&gt;
&lt;P&gt;2-To run the new regression model I have seen 2 approaches:&lt;/P&gt;
&lt;P&gt;a- Technically run it on all 100,000 customers, but set the fail/no fail response variable to null for the 30,000 out-sample customers&lt;/P&gt;
&lt;P&gt;b- Technically run it on the 70,000 in-sample customers only&lt;/P&gt;
&lt;P&gt;Could you show the SAS code for these 2 approaches? Which approach is better?&lt;/P&gt;
&lt;P&gt;3-Could you show the SAS code to calculate the Gini coefficient (on the in-sample, the out-sample, or the whole sample, depending on the answer to question 1)?&lt;/P&gt;
&lt;P&gt;4-I want to keep the regression coefficients of the new regression model in order to predict the probability of failure for a customer list from another period.&lt;/P&gt;
&lt;P&gt;What is the way to do it, please?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;thank you&lt;/P&gt;
&lt;P&gt;Erik&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 15 May 2021 14:54:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741647#M231856</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2021-05-15T14:54:25Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741654#M231860</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hello&lt;/P&gt;
&lt;P&gt;I want to ask a question about credit score models.&lt;/P&gt;
&lt;P&gt;Let's say that there are 100,000 customers of a bank in December 2019 (December 2019 is called the "base month").&lt;/P&gt;
&lt;P&gt;For these 100,000 customers I have data from January 2020 to December 2020 (this period is called the "following period").&lt;/P&gt;
&lt;P&gt;For the following period I have the Fail/Not fail outcome (the response variable) and also several explanatory variables.&lt;/P&gt;
&lt;P&gt;For each customer I also have a forecast probability of failure, calculated with the coefficients of the current model used in the bank.&lt;/P&gt;
&lt;P&gt;Let's say that the Gini coefficient for these 100,000 customers is 92%.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The Gini coefficient of what?&amp;nbsp; &amp;nbsp;What are you using to rank the population?&amp;nbsp; And what is the measure whose concentration over that population you are using?&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Sat, 15 May 2021 16:32:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741654#M231860</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-05-15T16:32:04Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741655#M231861</link>
      <description>I didn't understand your comment.&lt;BR /&gt;For the new model, Gini is calculated from the probability of failure predicted with the new model's coefficients.&lt;BR /&gt;For the old model, Gini is calculated from the probability of failure predicted with the old model's coefficients.</description>
      <pubDate>Sat, 15 May 2021 16:58:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741655#M231861</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2021-05-15T16:58:42Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741662#M231863</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can answer your questions, but I do not have the time right now. In 5 minutes I need to shut down my PC.&lt;/P&gt;
&lt;P&gt;Let me start with a piece of code to calculate the Gini coefficient for your model.&lt;/P&gt;
&lt;P&gt;It can be done with PROC LOGISTIC (SAS/STAT), using your target as the dependent variable and the predicted probabilities as the only independent variable. Make absolutely sure you are modelling the same target event as in your existing model!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Show only the association table and also save it as a data set */
ods select Association;
ods output Association=work.Association;
/* Regress the observed target on the predicted probability */
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;
/* end of program */&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The association table (Association of Predicted Probabilities and Observed Responses) contains the:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Percent Concordant&lt;/LI&gt;
&lt;LI&gt;Percent Tied&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Then use these formulas in a data step:&lt;/P&gt;
&lt;P&gt;ROC index = Area Under ROC curve = AUC_ROC&lt;/P&gt;
&lt;P&gt;ROC index = (Percent_Concordant + 0.5 * Percent_Tied) / 100 (the table reports percentages, so divide by 100 to get a value between 0 and 1)&lt;/P&gt;
&lt;P&gt;Accuracy Ratio (AR) = Gini coefficient&lt;/P&gt;
&lt;P&gt;Gini = 2 * ROC index - 1 = (ROC index - 0.5) / 0.5&lt;/P&gt;
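&lt;P&gt;A minimal sketch of such a data step follows. It is only an illustration under assumptions: the toy data set work.scored is made up so the code runs on its own, and I assume the Association data set has its usual layout, with the ROC index stored in nValue2 on the row where Label2 is 'c'. Please check that layout with PROC PRINT on your own SAS release. The Somers' D value in the same table should agree with the Gini computed here.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Toy data (hypothetical): replace with your own scored customers */
data work.scored;
  call streaminit(2021);
  do customer_number = 1 to 1000;
    predicted_value_probability = rand('uniform');
    binary_target = rand('bernoulli', predicted_value_probability);
    output;
  end;
run;

/* Save the association table as a data set */
ods output Association=work.Association;
proc logistic data=work.scored;
  model binary_target(event='1') = predicted_value_probability;
run;

/* Assumed layout: the row with Label2 = 'c' holds the ROC index (0-1 scale) */
data work.gini;
  set work.Association;
  if strip(Label2) = 'c' then do;
    auc_roc = nValue2;
    gini    = 2 * auc_roc - 1;
    output;
  end;
  keep auc_roc gini;
run;

proc print data=work.gini noobs;
run;&lt;/CODE&gt;&lt;/PRE&gt;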
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Geometrically, this means :&lt;/P&gt;
&lt;P&gt;Area Under ROC curve: the area under the ROC curve divided by the area of the whole (unit) square&lt;/P&gt;
&lt;P&gt;Gini: ignore the lower triangle of the ROC chart (the part below the diagonal), then divide the remaining area under the ROC curve by the area of the upper triangle&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good luck,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 15 May 2021 19:07:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741662#M231863</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2021-05-15T19:07:06Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741667#M231864</link>
      <description>&lt;P&gt;Thank you.&lt;BR /&gt;Should Gini be calculated on in-sample or out-sample or all data?&lt;/P&gt;
&lt;P&gt;Could anyone explain, please?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE class="language-sas"&gt;&lt;CODE&gt;ods select Association;
ods output Association=work.Association;
proc logistic data=libname.datasetname;
 id    customer_number;
 model binary_target(event='1') = predicted_value_probability;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 15 May 2021 21:08:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741667#M231864</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2021-05-15T21:08:08Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741711#M231885</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is the old (regression) model built on / with the same 100000 customers?&lt;/P&gt;
&lt;P&gt;In other words, the old model's Gini coefficient of 92% that you are reporting, is it the result of scoring the 100000 customers with a model built on still other (earlier) observations or is it the result of scoring the 100000 customers with the model built on that same 100000 observations? In the first case, the 92% is a test-Gini and in the latter case it's a training Gini.&lt;/P&gt;
&lt;P&gt;So answering this question delivers you an answer to the question whether you should use out-of-sample (test) or use in-sample (training) observations to compare your old and new regression model.&lt;/P&gt;
&lt;P&gt;But if the same 100000 observations were also used to build the old model, were all 100000 used for training (learning)? I can imagine there was also a training / (validation) / test data split. In that case, why don't you consider the same split?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, I think you should also create a validation set. Without a validation set (used during model building), the model is likely to be overfit and to lose much of its performance when applied to unseen test data.&lt;/P&gt;
&lt;P&gt;Consider the use of a validation data set. It's good practice in this type of modelling.&lt;/P&gt;
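&lt;P&gt;As a rough illustration of a training / validation / test split, here is a sketch using a plain data step. Everything in it is an assumption for the example: the toy table work.customers, the variables x1 and fail, the seed, and the 50% / 20% / 30% proportions, which you should adapt to your own situation.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Toy stand-in (hypothetical) for the 100,000 customers */
data work.customers;
  call streaminit(2021);
  do customer_number = 1 to 100000;
    x1   = rand('normal');
    fail = rand('bernoulli', 1 / (1 + exp(3 - x1)));
    output;
  end;
run;

/* Give each customer exactly one role: 50% train, 20% validate, 30% test */
data work.partitioned;
  set work.customers;
  length role $8;
  if _n_ = 1 then call streaminit(12345);
  u = rand('uniform');
  if u lt 0.5 then role = 'TRAIN';
  else if u lt 0.7 then role = 'VALIDATE';
  else role = 'TEST';
  drop u;
run;

/* Check the size and the failure rate of each partition */
proc freq data=work.partitioned;
  tables role * fail / nocol nopercent;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The idea is that the model is fitted on TRAIN, tuned (for example, variable selection) on VALIDATE, and only the final model is evaluated once on TEST.&lt;/P&gt;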
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
      <pubDate>Sun, 16 May 2021 10:51:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741711#M231885</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2021-05-16T10:51:18Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741719#M231892</link>
      <description>The old model was built a few years ago on another list of customers.&lt;BR /&gt;I am using the "old model" regression coefficients to calculate the predicted probability of default for each of the 100,000 customers, and then I calculate Gini for these 100,000 customers and get 92%.&lt;BR /&gt;Is it clear?</description>
      <pubDate>Sun, 16 May 2021 12:21:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741719#M231892</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2021-05-16T12:21:31Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741723#M231896</link>
      <description>&lt;P&gt;Yes, that's clear! Thank you.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In that case, the 92% on the 100 000 observations is an "honest" Gini on out-of-sample data because these 100 000 observations were never "seen" by the model before.&lt;/P&gt;
&lt;P&gt;Hence you should compare this Gini of 92% with the Gini on the out-of-sample data set (30 000 obs.) in your new modelling exercise. The Gini on in-sample + out-of-sample would be flattered (artificially high) as the in-sample data were used for the new model.&lt;/P&gt;
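&lt;P&gt;A sketch of how this could look in code, under stated assumptions: the toy table work.customers and the predictors x1 and x2 are made up, the 70/30 split uses PROC SURVEYSELECT with OUTALL (which adds a Selected flag), and I assume the scored probability of the event is named P_1, which is what the SCORE statement normally produces for a target coded 0/1. The model is fitted on the in-sample rows only, stored with OUTMODEL=, and then used to score the out-of-sample rows. The same INMODEL= step could score a customer list from another period.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Toy stand-in (hypothetical) for the 100,000 customers */
data work.customers;
  call streaminit(2021);
  do customer_number = 1 to 100000;
    x1   = rand('normal');
    x2   = rand('normal');
    fail = rand('bernoulli', 1 / (1 + exp(3 - x1 - 0.5 * x2)));
    output;
  end;
run;

/* Flag a 70% simple random sample as in-sample (Selected = 1) */
proc surveyselect data=work.customers out=work.split
                  method=srs samprate=0.7 seed=12345 outall;
run;

/* Fit on the in-sample rows only and store the fitted model */
proc logistic data=work.split(where=(Selected = 1)) outmodel=work.new_model;
  model fail(event='1') = x1 x2;
run;

/* Score the out-of-sample rows with the stored model */
proc logistic inmodel=work.new_model;
  score data=work.split(where=(Selected = 0)) out=work.test_scored;
run;

/* Out-of-sample Gini: Somers' D in the association table, or 2 * c - 1 */
ods select Association;
ods output Association=work.assoc_test;
proc logistic data=work.test_scored;
  model fail(event='1') = P_1;
run;&lt;/CODE&gt;&lt;/PRE&gt;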
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But again, consider the use of a validation set in your new modelling exercise to avoid severe overfitting! Validation data can also be considered in-sample.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, a 92% Gini is very high, especially for an application score card. Maybe you're dealing with a behavioral scorecard but even then 92% is still high. It will not be easy to outperform this.&lt;/P&gt;
&lt;P&gt;Are you sure you are not mixing up AUC_ROC and Gini? An AUC_ROC of 92% corresponds to a Gini of only 2 * 0.92 - 1 = 84%.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
      <pubDate>Sun, 16 May 2021 12:40:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741723#M231896</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2021-05-16T12:40:15Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741726#M231899</link>
      <description>&lt;P&gt;Reading along, I think these are really great responses from &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/60547"&gt;@sbxkoenk&lt;/a&gt;. Well done!&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt; &lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 16 May 2021 13:12:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/741726#M231899</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2021-05-16T13:12:04Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954810#M372894</link>
      <description>&lt;P&gt;&lt;SPAN&gt;The Gini coefficient of what? Credit score Model&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;What are you using to rank the population? PD values&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt; And what is the measure whose concentration over that population you are using? what do you mean??&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 11:04:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954810#M372894</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2024-12-31T11:04:44Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954811#M372895</link>
      <description>&lt;P&gt;population in development data : 100,000 customers&lt;/P&gt;
&lt;P&gt;In-sample data:(70% of population): 70,000 customers&lt;/P&gt;
&lt;P&gt;Out-sample data:(30% of population): 30,000 customers&lt;/P&gt;
&lt;P&gt;Model was developed on in-sample data&amp;nbsp;&lt;/P&gt;
&lt;P&gt;PD was calculated on the in-sample data, the out-sample data, and also out-of-time data, and then Gini was calculated:&lt;/P&gt;
&lt;P&gt;Gini for in-sample data: 79.9%&lt;/P&gt;
&lt;P&gt;Gini for Out-sample data: 79.3%&lt;/P&gt;
&lt;P&gt;Gini for Out-Of-Time data: 81.8%&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 11:11:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954811#M372895</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2024-12-31T11:11:20Z</dc:date>
    </item>
    <item>
      <title>Re: in sample,out sample,Gini</title>
      <link>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954821#M372905</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;&lt;SPAN&gt;The Gini coefficient of what? Credit score Model&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;What are you using to rank the population? PD values&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt; And what is the measure whose concentration over that population you are using? what do you mean??&lt;/SPAN&gt;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;My questions cited above were posted in 2021.&amp;nbsp; Upon re-reading this thread, here is what I recall meaning by my question "&lt;SPAN&gt;what is the measure whose concentration over that population you are using?"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The Lorenz curve, which I presume would display the distribution underlying the Gini coefficient, needs a horizontal measure (the cumulative % of population) and a vertical measure (the cumulative % of the "measure whose concentration ...").&amp;nbsp; So I take the latter to be PD values.&amp;nbsp; And I presume your horizontal is just based on a count of observations in your dataset.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 16:21:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/in-sample-out-sample-Gini/m-p/954821#M372905</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2024-12-31T16:21:47Z</dc:date>
    </item>
  </channel>
</rss>

