turcay
Lapis Lazuli | Level 10

Hello everyone,

 

I am trying to determine whether two data sets (one of them the model data set) are consistent with each other. To that end, I first ran PSI (Population Stability Index), SSI (Stability Statistic Index) and default rate analyses. As far as I know, the Gini value should also be examined for this; however, the two data sets share only 69% of the model variables.

 

Let's call these data sets populations and go into more detail.

I have two populations: "A" (the model data set) and "B". I have scoring code for population A, and population B has only 69% of population A's model variables. I ran population A's scoring code over population B and then performed a logistic regression on the results in Enterprise Guide. Although the PSI, SSI and default rate analyses all indicate inconsistency, Gini (Somers' D) comes out at 0.800 and c (ROC) at 0.900.

 

Here are my questions about this case:

  • How can the other analyses show inconsistency between these populations while Gini and ROC come out so high?
  • Is it right to run the model data set's (population A) scoring code over the new data set (population B) to assess the consistency between the two data sets and obtain the Gini value?
  • What other methods could I use to decide whether the data sets are consistent with each other?
  • Are there other ways to obtain the Gini value, or other statistics that indicate whether the data sets are consistent?

Things I have:

A population          B population
Model Data Set        New Data Set
Scoring Code          No Scoring Code
Model Variables       69% of A population's Model Variables

 

Thank you,

7 REPLIES
Reeza
Super User

Have you compared the variables individually to each other?

I.e., check continuous variables using t-tests or KS tests.

Check categorical variables via chi-square tests.
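
A minimal sketch of those three checks, not a definitive recipe. It borrows the data set names ModelDataSet and NewDataSet that appear later in this thread, and it assumes (purely for illustration) that ModelVariable1 is continuous and ModelVariable2 is categorical; swap in the real variables.

```sas
/* Stack both populations and tag each row with its source.            */
/* Population is a helper variable introduced for this sketch only.    */
data Combined;
    set ModelDataSet(in=inA) NewDataSet;
    length Population $1;
    if inA then Population = "A";
    else Population = "B";
run;

/* Two-sample t-test: do the means of a continuous variable differ? */
proc ttest data=Combined;
    class Population;
    var ModelVariable1;
run;

/* Kolmogorov-Smirnov test: do the whole distributions differ? */
proc npar1way data=Combined edf;
    class Population;
    var ModelVariable1;
run;

/* Chi-square test: does a categorical variable's distribution */
/* depend on the population?                                    */
proc freq data=Combined;
    tables Population*ModelVariable2 / chisq;
run;
```

With roughly 3,000 versus 90,000 observations, even small differences are likely to come out statistically significant, so it is worth looking at the size of the difference as well as the p-value.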

turcay
Lapis Lazuli | Level 10

Hello,

 

I compared the variables using customized PSI and SSI analyses.
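
For readers who do not know PSI: it is usually computed as the sum over bins of (share in A - share in B) * ln(share in A / share in B). The customized analysis itself is not shown in this thread, so the following is only a rough sketch of the usual calculation. It assumes that Predicted is a probability between 0 and 1 in both data sets, uses ten equal-width bins, and takes the names ModelDataSet and NewDataSet from the example later in the thread; Population, Bin, nA and nB are helper names made up for the sketch.

```sas
/* Stack both populations and bin the score into ten equal-width bins. */
data Combined;
    set ModelDataSet(in=inA) NewDataSet;
    length Population $1;
    if inA then Population = "A";
    else Population = "B";
    if not missing(Predicted);              /* drop rows without a score */
    Bin = min(floor(Predicted * 10), 9);    /* bins 0..9                 */
run;

proc sql;
    /* Number of observations per bin in each population. */
    create table Counts as
    select Bin,
           sum(Population = "A") as nA,
           sum(Population = "B") as nB
    from Combined
    group by Bin;

    /* PSI = sum over bins of (shareA - shareB) * ln(shareA / shareB). */
    /* A bin that is empty in either population gives a missing term,  */
    /* which the sum skips; such bins should be merged beforehand.     */
    select sum((nA / totA - nB / totB) * log((nA / totA) / (nB / totB))) as PSI
    from Counts,
         (select sum(nA) as totA, sum(nB) as totB from Counts) as t;
quit;
```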

 

By KS, I assume you mean the Kolmogorov-Smirnov test. How can I use it? Could you also show me short samples for the t-test and chi-square analyses, please?

 

Further, may I have your opinion on the following questions, especially the second one?

  • How can the other analyses show inconsistency between these populations while Gini and ROC come out so high?
  • Is it right to run the model data set's (population A) scoring code over the new data set (population B) to assess the consistency between the two data sets and obtain the Gini value?
  • What other methods could I use to decide whether the data sets are consistent with each other?
  • Are there other ways to obtain the Gini value, or other statistics that indicate whether the data sets are consistent?

Thank you

Reeza
Super User

"Is it right to perform Model data set's (Population A) scoring code over the new data set (Population B) to understand the consistency between these data sets and learn the GINI value?"

 

Scoring a model won't tell you anything about the consistency between the data sets. Scoring doesn't even require the actual outcome values; it derives predicted values from the model that was developed. The Gini coefficient tells you how good a model is, not how similar data sets are. If one data set were a subset of the other, for example, it would still score well but would not be representative of the actual population at all.

I don't know what PSI and SSI are... perhaps they're subject-specific terms.

 

Ksharp
Super User
I don't understand your question. What do you mean by "New Data Set Fit with Model Data Set"?
The ROC should not be so high. I suspect you have some problem with your data.

If you want to check the goodness-of-fit statistics:

score data=datasetB fitstat ;

Or use the H-L test:
model .........../lackfit ;


turcay
Lapis Lazuli | Level 10

Hello again,

 

Let me explain my question with an example.

Suppose I have a model set (population "A") as below. It includes 10 model variables plus the Target and Predicted variables, and it has scoring code. The model set has approximately 3,000 observations.

The following code is just an example; I tried to create a sample view of the model data set.

 

Data ModelDataSet;
Length CustomerID 8 YearMonth $ 10 ModelVariable1 8 ModelVariable2 8 ModelVariable3 8 ModelVariable4 8 ModelVariable5 8 ModelVariable6 8 ModelVariable7 8 ModelVariable8 8 ModelVariable9 8 ModelVariable10 8 Predicted 8 Target 8;
Infile Datalines Missover;
Input CustomerID YearMonth ModelVariable1 ModelVariable2 ModelVariable3 ModelVariable4 ModelVariable5 ModelVariable6 ModelVariable7 ModelVariable8 ModelVariable9 ModelVariable10 Predicted Target;
Format ;
Datalines;
;
Run;

 

The following one is the new data set (population "B"). It has no scoring code and contains 7 of the model variables that population "A" has. The new data set has approximately 90,000 observations.

The following code is just an example; I tried to create a sample view of the new data set.

 

Data NewDataSet;
Length CustomerID 8 YearMonth $ 10 ModelVariable1 8 ModelVariable2 8 ModelVariable3 8 ModelVariable4 8 ModelVariable5 8 ModelVariable6 8 ModelVariable7 8 Predicted 8 Target 8;
Infile Datalines Missover;
Input CustomerID YearMonth ModelVariable1 ModelVariable2 ModelVariable3 ModelVariable4 ModelVariable5 ModelVariable6 ModelVariable7 Predicted Target;
Format ;
Datalines;
;
Run;

 

 

Then I ran population "A"'s scoring code over population "B" and used PROC LOGISTIC as below:

 

Ods Graphics On;
PROC LOGISTIC DATA=ScoredNewDataSet /*PLOTS(ONLY)=ROC*/ PLOTS(MAXPOINTS=NONE);
/*Ods Select ROCCurve ;*/
MODEL Target (Event = "1")=Predicted/SELECTION=NONE LINK=LOGIT;
RUN;
QUIT;
Ods Graphics Off;

 

In the results, Somers' D, which is equal to Gini, and c, which is equal to the area under the ROC curve, come out very high: approximately 0.800 and 0.900. These look like near-perfect results. Yet when I check with the customized analyses, the variables are not consistent between the two data sets.
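
For reference, these two figures are not independent pieces of evidence. In the PROC LOGISTIC association table, Somers' D and c are tied together by a fixed identity, so a c of 0.900 always goes with a D of 0.800:

```latex
D_{\text{Somers}} = 2c - 1
\quad\Longrightarrow\quad
2(0.900) - 1 = 0.800
```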

 

Did I make myself clear? 0.900 is too high for the ROC and Gini values. Either this method is wrong or I'm doing something wrong. What do you think?

 

Thank you

Reeza
Super User

Scoring a model creates a predicted value that, in your case, can be compared to the observed results. AUC and Gini test the accuracy of a model, not the similarity between data sets.

 

AFAIK this seems to be an incorrect analysis method for your question of interest - how similar are your datasets. 

 

If you have a reference otherwise, please post it. 

 

You can ask statistical methodology questions at Cross Validated. 

 

Edit: Your 'scoring' code appears to be a basic regression; no scoring is going on. I'm not sure how you're modelling the predicted vs. target variables either; it seems like a weird model. You don't indicate whether you've combined the data in any way, so how is each data set factored in?

Ksharp
Super User
Where is your ScoredNewDataSet coming from?
And how did you "perform population A's scoring code over population B"?


Your code should look like:

proc logistic data=ModelDataSet;
model ........;
score data=NewDataSet out=want fitstat;
run;
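
A filled-in version of that pattern might look like the sketch below. The model specification is not given in this thread, so the variable list is only an assumption: it uses the seven variables that both example data sets share (ModelVariable1 to ModelVariable7) and reuses the output name ScoredNewDataSet from the earlier post.

```sas
/* Fit on population A, then score population B in the same step. */
proc logistic data=ModelDataSet;
    /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test on A. */
    model Target(event="1") = ModelVariable1-ModelVariable7 / lackfit;
    /* FITSTAT reports fit statistics for the scored data; it needs   */
    /* the observed Target to be present in NewDataSet.               */
    score data=NewDataSet out=ScoredNewDataSet fitstat;
run;
```

This measures how well a model built on A predicts B. As noted above, it does not by itself show that the two data sets are similar.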

