Hello,
We have a question about the value of the Kolmogorov-Smirnov (KS) statistic for one of our models. The model is a scorecard developed using SAS Enterprise Miner, and recently the KS statistic has been taking very high values, up to 0.65. We would like to know why this is happening and, if possible, how to fix it.
Thank you very much for your attention.
Your AUC statistic for the logistic model must be near 0.9?
That would indicate that your model is overfit.
Do you have a lot of independent variables, or a very large data set?
OR
Perhaps one or two variables are extremely significant in your model.
You could use PROC FREQ to check:
proc freq data=have;
table good_bad*X1 ;
run;
OR
Check the WOE of each X variable and see whether any value is very large, like 500, or very small, like -500.
OR
Calling @Rick_SAS
You need to explain how the KS statistic is being used. What are you modeling and what hypothesis test is being run? The KS statistic is used for many things, including the modeling of a distribution or the normality of residuals.
There is a picture in the article "What is Kolmogorov's D statistic?", which shows the geometric meaning of the KS statistic. The value represents the maximum deviation between an empirical CDF and the CDF of a reference distribution (often the normal distribution). The situation you describe indicates that the empirical distribution of the data is very different from the reference distribution, such as the artificial example I've created below. One possible reason is that you are specifying a parameter for the reference distribution (for example, a threshold parameter) that is very different from the best choice for that parameter.
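As a rough illustration of that geometric meaning, here is a minimal sketch with made-up data (it is not the example from the article). The data set name sim, the variable x, the seed, and the sample size are all hypothetical. Skewed data are compared against a fitted normal distribution, so the empirical CDF and the reference CDF should separate visibly.

```sas
/* Hypothetical data: 200 draws from an exponential distribution,
   which is strongly skewed and therefore far from normal */
data sim;
   call streaminit(12345);
   do i = 1 to 200;
      x = rand("Exponential");
      output;
   end;
run;

/* The NORMAL option requests tests for normality, which include
   the Kolmogorov-Smirnov D statistic */
proc univariate data=sim normal;
   var x;
   /* Empirical CDF overlaid with the CDF of a fitted normal distribution;
      D is the largest vertical gap between the two curves */
   cdfplot x / normal;
run;
```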
If I am right, the missing-value level of each X variable must have a very large or very small WOE, like 400 or -400.
And if I am also right, every X variable must have a very large IV, like 0.8 or 0.6.
That is to say, your data sampling method (or data quality) does not look right.
You should keep missing values out of all your X variables.
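To check WOE and IV by hand for one variable, something like the following sketch could be used. It assumes a data set have with a target good_bad coded 'good'/'bad' and an already-binned predictor X1; the output names woe_x1, pct_good, pct_bad, woe, and iv_part are made up for illustration. It uses one common convention, WOE = ln(% of goods / % of bads) on the natural-log scale, so the numbers may be on a different scale from the ones your tool reports.

```sas
proc sql noprint;
   /* Total number of goods and bads in the sample */
   select sum(good_bad = 'good'), sum(good_bad = 'bad')
      into :tot_good trimmed, :tot_bad trimmed
   from have;

   /* Share of all goods and of all bads that falls in each level of X1.
      Missing values of X1 form their own group, so the missing level
      appears as a separate row. */
   create table woe_x1 as
   select X1,
          sum(good_bad = 'good') / &tot_good as pct_good,
          sum(good_bad = 'bad')  / &tot_bad  as pct_bad
   from have
   group by X1;
quit;

data woe_x1;
   set woe_x1;
   /* WOE is undefined when a level has no goods or no bads */
   if pct_good > 0 and pct_bad > 0 then do;
      woe     = log(pct_good / pct_bad);
      iv_part = (pct_good - pct_bad) * woe;
   end;
run;

/* The IV of X1 is the sum of the per-level contributions */
proc means data=woe_x1 sum;
   var iv_part;
run;

proc print data=woe_x1;
run;
```

Running this on both the development sample and the recent data makes it easy to see whether a single level, such as the missing level, has started to dominate.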
Rick,
The OP's code should look like this:
data final_total_score;
input good_bad $ total_score;
cards;
good 600
good 620
bad 520
bad 440
..........
;
title "KS test";
proc npar1way data=final_total_score edf plots=edfplot;
class good_bad;
var total_score;
run;
But the OP got this KS value from SAS/EM, a GUI component of SAS, like SAS/EG.
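For reference, the two-sample KS statistic that this kind of analysis reports for a scorecard can be written as follows, where F_good and F_bad are my own notation for the empirical CDFs of total_score within each class:

```latex
D = \max_{s} \left| F_{\text{good}}(s) - F_{\text{bad}}(s) \right|
```

Read this way, a KS of 0.65 means there is some score cutoff at which the cumulative share of one class at or below the cutoff exceeds the cumulative share of the other class by 65 percentage points.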
The problem is that this model has been working well for years, and this issue has only been going on for two months; we had never had this problem before. When the model was developed, the WOE and IV values for each variable seemed correct.
Does any variable have an IV greater than 0.5 or 0.6? If so, your model cannot be trusted, and you should drop those high-IV variables.
@Adrián_cyc wrote:
The problem is that this model has been working well for years, and this issue has only been going on for two months; we had never had this problem before. When the model was developed, the WOE and IV values for each variable seemed correct.
Quite often, when something has been working reasonably well and then stops, you might want to investigate something other than the model code.
Did the data collection methods change?
Did any variables change meaning but use the same values?
Did precision of an instrument change?
Did the number of records involved for any sort of grouping variable(s) change? Or change for just some grouping variable values?
Does anyone examine the logs of the step that brings the data into SAS? Are there warnings that weren't there before? Data conversion notes?
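Several of these questions can be checked directly by splitting the data by period. The following is only a sketch under assumed names: scored is a hypothetical data set of scored records with known outcomes, score_month is a hypothetical period variable, and X1 stands in for any numeric scorecard input.

```sas
/* BY-group processing needs the data sorted by the period variable */
proc sort data=scored out=scored_sorted;
   by score_month;
run;

/* KS statistic per month, to see exactly when the value jumped */
proc npar1way data=scored_sorted edf noprint;
   by score_month;
   class good_bad;
   var total_score;
   output out=ks_by_month edf;
run;

proc print data=ks_by_month;
run;

/* Record counts, missing counts, and ranges per month, to spot changes
   in data volume, missing-value rates, or units and precision */
proc means data=scored_sorted n nmiss mean min max;
   class score_month;
   var total_score X1;
run;

/* Good/bad mix per month, to spot a change in the sampled population */
proc freq data=scored_sorted;
   tables score_month*good_bad / nocol nopercent;
run;
```

If the KS jump lines up with a shift in counts, missing rates, or ranges for a particular input, that input's data feed is the first place to look.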
It may help to actually post an example of the model code you are using. Someone familiar with the proc may be able to point out places