How to - Understand Whether The New Data Set Match...

12-24-2016 12:21 PM

Hello everyone,

I am trying to determine whether two data sets (one of which is the model data set) are consistent with each other. To that end, I first ran PSI (Population Stability Index), SSI (Stability Statistic Index), and default rate analyses. As far as I know, the Gini value should also be examined to judge this properly; however, the two data sets share only 69% of the model variables.

Let's call these data sets populations. In more detail:

I have two populations: population "A" (the model data set) and population "B". I have a scoring code for population A, and population B has only 69 percent of population A's model variables. I applied population A's scoring code to population B and then ran a logistic regression on the results in Enterprise Guide. Even though the other analyses (PSI, SSI, and default rate) give inconsistent results, the Gini (Somers' D) comes out at 0.800 and c (ROC) at 0.900.

Here are my questions about this case:

- How can the other analyses come out inconsistent for these populations while the Gini and ROC values are so high? How is that possible?
- Is it right to run the model data set's (Population A) scoring code on the new data set (Population B) to understand the consistency between these data sets and obtain the Gini value?
- What other methods could I use to reach my aim? How can I decide whether the data sets are consistent with each other?
- Are there other ways to find Gini values, or other measures for checking whether the data sets are consistent?

__Things I have__

| A population | B population |
|---|---|
| Model data set | New data set |
| Scoring code | No scoring code |
| Model variables | 69% of A population's model variables |

Thank you,

12-24-2016 02:54 PM

Have you compared the variables individually to each other?

I.e., check continuous variables using t-tests or KS tests, and check categorical variables via chi-square tests.

12-24-2016 06:00 PM

Hello,

I compared the variables using customized PSI and SSI analyses.

I think you mean Kolmogorov-Smirnov tests. How can I use this test? Could you also show me short samples for the t-test and chi-square analyses, please?

Also, may I ask your opinion on the following questions, especially the second one?

Here are my questions about this case:

- How can the other analyses come out inconsistent for these populations while the Gini and ROC values are so high? How is that possible?
- **Is it right to run the model data set's (Population A) scoring code on the new data set (Population B) to understand the consistency between these data sets and obtain the Gini value?**
- What other methods could I use to reach my aim? How can I decide whether the data sets are consistent with each other?
- Are there other ways to find Gini values, or other measures for checking whether the data sets are consistent?

Thank you

12-25-2016 01:28 AM

**Is it right to run the model data set's (Population A) scoring code on the new data set (Population B) to understand the consistency between these data sets and obtain the Gini value?**

Scoring a model won't tell you anything about the consistency between the data sets. Scoring doesn't even require that you have the actual (observed) values; it derives predicted values from the model that was developed. The Gini coefficient tells you how good a model is, not how similar two data sets are. For example, if one data set were a subset of the other, the model would still score well on it, but the subset would not be representative of the actual population at all.

I don't know what PSI and SSI are; perhaps they're subject-specific terms.
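
For the per-variable checks suggested earlier in the thread (t-test, Kolmogorov-Smirnov, chi-square), a minimal sketch might look like the following. The data set names `ModelDataSet` and `NewDataSet`, the continuous variable `ModelVariable1`, and the categorical variable `Segment` are placeholders assumed for illustration; substitute your own.

```sas
/* Stack the two populations and tag each row with its source. */
/* All data set and variable names here are assumed placeholders. */
data Combined;
    set ModelDataSet(in=inA) NewDataSet(in=inB);
    length Population $ 1;
    if inA then Population = "A";
    else Population = "B";
run;

/* Continuous variable: two-sample t-test on the means. */
proc ttest data=Combined;
    class Population;
    var ModelVariable1;
run;

/* Continuous variable: Kolmogorov-Smirnov test on the whole distribution. */
proc npar1way data=Combined edf;
    class Population;
    var ModelVariable1;
run;

/* Categorical variable (hypothetical Segment): chi-square test of homogeneity. */
proc freq data=Combined;
    tables Population*Segment / chisq;
run;
```

With samples this large, even tiny differences can come out statistically significant, so it is probably worth looking at the size of the difference (for example the KS D statistic) and not only the p-value.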

12-24-2016 10:45 PM

I don't understand your question. What do you mean by "New Data Set Fit with Model Data Set"? The ROC value should not be so high; I suspect you have some problem with your data. If you want to check goodness-of-fit statistics, use `score data=datasetB fitstat;`, or use the Hosmer-Lemeshow test: `model .........../lackfit;`.
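
Put together, the two options above might look roughly like this. It is only a sketch: `datasetA`, `ScoredB`, the response `Target`, and the predictor list `ModelVariable1-ModelVariable7` are names assumed for illustration, not taken from the original poster's actual model.

```sas
proc logistic data=datasetA;
    /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test on datasetA. */
    /* The predictor list is an assumed placeholder for the real model. */
    model Target(event="1") = ModelVariable1-ModelVariable7 / lackfit;

    /* FITSTAT prints fit statistics (including AUC) for the scored data set. */
    /* datasetB must contain the observed Target for these to be computed. */
    score data=datasetB out=ScoredB fitstat;
run;
```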

12-25-2016 09:46 AM

Hello again,

Let me explain my question with an example.

Suppose I have a model set (Population "A") as below. This model set includes 10 model variables plus the Target and Predicted variables, and it also has a scoring code. The model set has approximately 3,000 observations.

The following code is just an example; I tried to create a sample view of the model data set.

```
/* Skeleton of the model data set (Population A); data rows are omitted. */
data ModelDataSet;
    length CustomerID 8 YearMonth $ 10
           ModelVariable1-ModelVariable10 8
           Predicted 8 Target 8;
    infile datalines missover;
    input CustomerID YearMonth ModelVariable1-ModelVariable10 Predicted Target;
    datalines;
;
run;
```

The following is the new data set (Population "B"). It has no scoring code and contains 7 of the model variables that Population "A" has. The new data set has approximately 90,000 observations.

The following code is just an example; I tried to create a sample view of the new data set.

```
/* Skeleton of the new data set (Population B); data rows are omitted. */
data NewDataSet;
    length CustomerID 8 YearMonth $ 10
           ModelVariable1-ModelVariable7 8
           Predicted 8 Target 8;
    infile datalines missover;
    input CustomerID YearMonth ModelVariable1-ModelVariable7 Predicted Target;
    datalines;
;
run;
```

Then I ran Population "A"'s scoring code on Population "B" and used PROC LOGISTIC as below:

```
ods graphics on;

/* Regress the observed Target on the Predicted value from the scoring code. */
proc logistic data=ScoredNewDataSet /*plots(only)=roc*/ plots(maxpoints=none);
    /*ods select ROCCurve;*/
    model Target(event="1") = Predicted / selection=none link=logit;
run;
quit;

ods graphics off;
```

In the results, Somers' D (which is equal to Gini) and c (the area under the ROC curve) come out too high: approximately 0.800 and 0.900. These look like perfect results. Yet when I check with the customized analyses, the variables are not consistent between the two data sets.

Did I make myself clear? 0.900 is too high for the ROC and Gini values. Either this method is wrong or I'm doing something wrong. What do you think?

Thank you

12-25-2016 09:58 AM

Scoring a model creates a predicted value that, in your case, can be compared to the observed results. AUC and Gini test the accuracy of a model, not the similarity between data sets.

AFAIK this seems to be an incorrect analysis method for your question of interest - how similar are your datasets.

If you have a reference otherwise, please post it.

You can ask statistical methodology questions at Cross Validated.

Edit: Your 'scoring' code appears to be a basic regression; no scoring is going on. I'm also not sure how you're modelling the predicted vs. target variables; it seems like a weird model. You don't indicate whether you've combined the data in any way, so how is each data set factored in?

12-25-2016 10:55 PM

Where is your ScoredNewDataSet coming from? And how did you "perform Population A's scoring code over population B"? Your code should look like: `proc logistic data=ModelDataSet; model ........; score data=NewDataSet out=want fitstat; run;`
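
If the model is fitted once and scored later, one possible variation on the pattern above is to store it with `OUTMODEL=` and score with `INMODEL=`. This is only a sketch under assumptions: `StoredModel` is a hypothetical name, and the predictor list is a placeholder restricted to the seven variables both data sets share, since a model that uses variables missing from `NewDataSet` may not be scored cleanly there.

```sas
/* Step 1: fit on the model data set and store the fitted model. */
/* The predictor list is an assumed placeholder, not the real model. */
proc logistic data=ModelDataSet outmodel=StoredModel;
    model Target(event="1") = ModelVariable1-ModelVariable7;
run;

/* Step 2: score the new data set from the stored model. */
/* FITSTAT reports fit statistics (including AUC) on NewDataSet, */
/* which requires the observed Target to be present there. */
proc logistic inmodel=StoredModel;
    score data=NewDataSet out=want fitstat;
run;
```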