I built a credit risk model on 100,000 customers. I split the data into 70% train and 30% test and built the model on the train data. The results are a Gini of 79.5% on train and 78.5% on test. My question: is this difference of 1 percentage point okay, or does it indicate a problem?
Accepted Solutions
1) You could use a nonparametric method (the Wilcoxon test or the Kolmogorov-Smirnov test) to check whether the scores from TRAIN and TEST follow the same distribution. A demo using sashelp.heart:
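/* Demo data: split sashelp.heart into two groups that stand in for TRAIN and TEST */
data train test;
   set sashelp.heart(keep=status ageatstart);
   rename ageatstart=score;
   if status='Alive' then output train;
   else output test;
run;

/* Stack both data sets; dsn records which data set each row came from */
data all;
   set train test indsname=indsname;
   dsn=indsname;
run;

/* WILCOXON requests the rank-sum test; EDF requests the Kolmogorov-Smirnov test */
proc npar1way data=all wilcoxon edf;
   class dsn;
   var score;
run;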
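In this demo, D is the KS statistic, which is > 0.3, with a p-value < .0001.
That means the difference is significant, i.e. the score distributions in TRAIN and TEST differ. Run the same test on your own scores to check whether that also holds for the model behind the Gini of 79.5% on train and 78.5% on test.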
2) You can also use ANOVA if the scores from TRAIN and TEST are both normally distributed.
proc glm data=all;
   class dsn;
   model score=dsn / solution;
quit;
3) You could also compare the two ROC curves with a chi-square test:
https://support.sas.com/kb/45/339.html
4) Calling @StatDave
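For option 3, here is a minimal sketch of how the comparison could look, not a definitive implementation. It assumes two hypothetical data sets, SCORED_TRAIN and SCORED_TEST, each holding the binary target BAD (1 = default) and the model's predicted probability P_BAD; those names are placeholders, not something from the question. It also assumes each ROCAssociation table comes back with a single row, and it treats the two samples as independent, which is reasonable because train and test contain different customers. Gini is computed as 2*AUC - 1.

```sas
/* Hypothetical inputs: SCORED_TRAIN and SCORED_TEST with BAD and P_BAD */

/* AUC and its standard error on the train data, without refitting the model */
proc logistic data=scored_train;
   model bad(event='1') = / nofit;
   roc 'Train' pred=p_bad;
   ods output ROCAssociation=auc_train;
run;

/* Same for the test data */
proc logistic data=scored_test;
   model bad(event='1') = / nofit;
   roc 'Test' pred=p_bad;
   ods output ROCAssociation=auc_test;
run;

/* Chi-square test (1 df) for the difference between two independent AUCs */
data compare;
   merge auc_train(keep=Area StdErr rename=(Area=auc_train StdErr=se_train))
         auc_test (keep=Area StdErr rename=(Area=auc_test  StdErr=se_test));
   gini_train = 2*auc_train - 1;
   gini_test  = 2*auc_test  - 1;
   chisq   = (auc_train - auc_test)**2 / (se_train**2 + se_test**2);
   p_value = 1 - probchi(chisq, 1);
run;

proc print data=compare;
run;
```

A large p_value would suggest that the gap between the two Gini values is within sampling noise. Keep in mind that the train AUC is usually somewhat optimistic, since the model was fitted on that data.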