12-24-2016 12:21 PM
I am currently trying to analyze whether two data sets (one of them is the model data set) are consistent with each other. To that end, I first ran PSI (Population Stability Index), SSI (Stability Statistic Index) and default rate analyses. As is known, to understand this efficiently we should examine the Gini value; however, the two data sets share only 69% of the model variables.
Let's call these data sets populations and give more detail.
I have two populations: population "A" (the model data set) and population "B". I have scoring code for population A, and population B has only 69 percent of population A's model variables. I ran population A's scoring code over population B and then ran a logistic regression on the results in Enterprise Guide. Even though the PSI, SSI and default rate analyses all give inconsistent results, Gini (Somers' D) comes out at 0.800 and c (ROC) at 0.900.
Here are some of my questions about this case:
Things I have:
A population        B population
Model Data Set      New Data Set
Scoring Code        No Scoring Code
Model Variables     69% of A population's Model Variables
12-24-2016 02:54 PM
Have you compared the variables individually to each other?
I.e., check continuous variables using t-tests or KS tests,
and check categorical variables via chi-square tests.
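The checks suggested above could be run along these lines. This is only a minimal sketch: it reuses the data set names ModelDataSet and NewDataSet that appear later in the thread, the Combined data set and Source variable are made-up names, and treating ModelVariable1 as continuous and ModelVariable2 as categorical is purely an assumption for illustration.

```sas
/* Stack the two populations and tag each row with its source.
   Combined and Source are hypothetical names. */
data Combined;
    length Source $ 1;
    set ModelDataSet(in=inA) NewDataSet;
    if inA then Source = 'A';
    else Source = 'B';
run;

/* Continuous variable (assumed): two-sample t-test on the means */
proc ttest data=Combined;
    class Source;
    var ModelVariable1;
run;

/* Continuous variable (assumed): the EDF option requests the
   two-sample Kolmogorov-Smirnov test */
proc npar1way data=Combined edf;
    class Source;
    var ModelVariable1;
run;

/* Categorical variable (assumed): chi-square test of Source by category */
proc freq data=Combined;
    tables Source*ModelVariable2 / chisq;
run;
```

With roughly 3,000 and 90,000 observations, even small differences are likely to come out statistically significant, so the size of the difference is worth reading alongside the p-values.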
12-24-2016 06:00 PM
I compared the variables by customized PSI and SSI analysis.
On the other hand, I think you mean the Kolmogorov-Smirnov test. How can I use this test? Could you also show me short samples of the t-test and chi-square analyses, please?
Further, may I have your opinion on the following questions, especially the second one?
Here are some of my questions about this case:
12-25-2016 01:28 AM
Is it right to run the model data set's (Population A) scoring code over the new data set (Population B) to understand the consistency between these data sets and learn the Gini value?
Scoring a model won't tell you anything about the consistency between the datasets. Scoring doesn't even require that you have the actual values; it derives predicted values from the model that was developed. The Gini coefficient tells you how good a model is, not how similar data sets are. If one data set were a subset of the other, for example, it would still score well but would not be representative of the actual population at all.
I don't know what PSI and SSI are... perhaps they're subject-specific terms.
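For context, PSI is commonly computed by binning a variable (or a score), taking the share of observations per bin in each population, and summing (B share - A share) * ln(B share / A share) over the bins. The sketch below illustrates that common definition only; it is not the customized analysis mentioned in the thread, and ScoreBand is a hypothetical, already-binned variable assumed to exist in both data sets.

```sas
/* Share of observations per band in each population.
   ScoreBand is a hypothetical pre-binned variable. */
proc freq data=ModelDataSet noprint;
    tables ScoreBand / out=FreqA(drop=count rename=(percent=PctA));
run;

proc freq data=NewDataSet noprint;
    tables ScoreBand / out=FreqB(drop=count rename=(percent=PctB));
run;

/* PSI = sum over bands of (B share - A share) * ln(B share / A share).
   Bands that are empty in either population are skipped in this sketch. */
data PSI;
    merge FreqA FreqB end=last;
    by ScoreBand;
    if PctA > 0 and PctB > 0 then
        PSI + (PctB/100 - PctA/100) * log(PctB / PctA);
    if last then output;
    keep PSI;
run;
```

A value near zero means the two band distributions are close; larger values mean the population has shifted across bands.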
12-24-2016 10:45 PM
I don't understand your question. What do you mean by "New Data Set Fit with Model Data Set"? The ROC should not be so high; I suspect you have some problem with the data. If you want to check goodness-of-fit statistics, use: score data=datasetB fitstat; Or use the H-L test: model .........../lackfit;
12-25-2016 09:46 AM
Let me explain my question with an example.
Let's pretend I have a model set (Population "A") as below. This model set includes 10 model variables plus the Target and Predicted variables, and it also has scoring code. The model set has approximately 3,000 observations.
The following code is just an example; I tried to create a sample view of the model data set.
Data ModelDataSet;
    Length CustomerID 8 YearMonth $ 10
           ModelVariable1 8 ModelVariable2 8 ModelVariable3 8 ModelVariable4 8 ModelVariable5 8
           ModelVariable6 8 ModelVariable7 8 ModelVariable8 8 ModelVariable9 8 ModelVariable10 8
           Predicted 8 Target 8;
    Infile Datalines Missover;
    Input CustomerID YearMonth
          ModelVariable1 ModelVariable2 ModelVariable3 ModelVariable4 ModelVariable5
          ModelVariable6 ModelVariable7 ModelVariable8 ModelVariable9 ModelVariable10
          Predicted Target;
    Format ;
    Datalines;
;
Run;
The following is the new data set (Population "B"). This data set has no scoring code and has 7 of the model variables that Population "A" has. The new data set has approximately 90,000 observations.
The following code is just an example; I tried to create a sample view of the new data set.
Data NewDataSet;
    Length CustomerID 8 YearMonth $ 10
           ModelVariable1 8 ModelVariable2 8 ModelVariable3 8 ModelVariable4 8 ModelVariable5 8
           ModelVariable6 8 ModelVariable7 8
           Predicted 8 Target 8;
    Infile Datalines Missover;
    Input CustomerID YearMonth
          ModelVariable1 ModelVariable2 ModelVariable3 ModelVariable4 ModelVariable5
          ModelVariable6 ModelVariable7
          Predicted Target;
    Format ;
    Datalines;
;
Run;
Then I ran Population "A"'s scoring code over Population "B" and used PROC LOGISTIC as below:
Ods Graphics On;
PROC LOGISTIC DATA=ScoredNewDataSet /*PLOTS(ONLY)=ROC*/ PLOTS(MAXPOINTS=NONE);
    /*Ods Select ROCCurve ;*/
    MODEL Target (Event = "1")=Predicted/SELECTION=NONE LINK=LOGIT;
RUN;
QUIT;
Ods Graphics Off;
In the results, Somers' D, which is equal to Gini, and c, which is the area under the ROC curve, come out high: approximately 0.800 and 0.900. These look like perfect results. However, when I check the customized analyses, the variables are not consistent between the two data sets.
Did I make myself clear? 0.900 is too high for the ROC and Gini values. Either this method is wrong or I'm doing something wrong. What do you think?
12-25-2016 09:58 AM
Scoring a model creates a predicted value that, in your case, can be compared to the observed results, and AUC/Gini test the accuracy of the model, not the similarity between datasets.
AFAIK this seems to be an incorrect analysis method for your question of interest - how similar your datasets are.
If you have a reference otherwise, please post it.
You can ask statistical methodology questions at Cross Validated.
Edit: Your 'scoring' code appears to be a basic regression - no scoring is going on. I'm not sure how you're modelling the predicted vs. target variables either; it seems like a weird model. You don't indicate whether you've combined the data in any way, so how is each dataset factored in?
12-25-2016 10:55 PM
Where is your ScoredNewDataSet coming from? And how did you "perform Population "A"'s scoring code over Population "B""? Your code should look like:
proc logistic data=ModelDataSet;
    model ........;
    score data=NewDataSet out=want fitstat;
run;
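A filled-in version of that pattern might look like the sketch below. It assumes the model is refit on Population A using only the seven predictors that also exist in Population B (ModelVariable1 to ModelVariable7 in the example data sets above), which is not the same as applying the original scoring code; the predictor list is an assumption, not the poster's actual model.

```sas
/* Fit on population A with the predictors shared by both populations
   (assumed to be ModelVariable1-ModelVariable7), then score population B. */
proc logistic data=ModelDataSet;
    /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
    model Target(event='1') = ModelVariable1-ModelVariable7 / lackfit;
    /* FITSTAT reports fit statistics for the scored data;
       it requires Target to be present in NewDataSet */
    score data=NewDataSet out=ScoredNewDataSet fitstat;
run;
```

The fit statistics printed for NewDataSet describe how well the refitted model predicts Target in Population B. As noted earlier in the thread, that measures model accuracy, not how similar the two data sets are.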