Ronein
Onyx | Level 15
Hello
I built a credit risk model on 100,000 customers. I split the data into train (70%) and test (30%) and built the model on the train data. The results are Gini 79.5% on train and 78.5% on test. My question: is this difference of 1 percentage point okay, or does it indicate a problem?
Accepted Solutions
Ksharp
Super User

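1) You could run a nonparametric two-sample test to check whether the scores from TRAIN and TEST follow the same distribution. The EDF option in the PROC NPAR1WAY step requests empirical distribution function tests such as Kolmogorov-Smirnov; the WILCOXON option would request the Wilcoxon rank-sum test instead.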

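/* Demo data: split sashelp.heart into two groups that stand in for TRAIN and TEST */
data train test;
 set sashelp.heart(keep=status ageatstart);
 if status='Alive' then output train;
  else output test;
 rename ageatstart=score; /* AgeAtStart plays the role of the model score */
run;

/* Stack both tables and store the source table name in DSN */
data all;
 set train test indsname=indsname;
 dsn=indsname;
run;

/* Compare the score distributions of the two groups with EDF tests */
proc npar1way data=all edf;
 class dsn;
 var score;
run;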

[Image: PROC NPAR1WAY output, Ksharp_0-1734833628057.png]

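In this demo, D is the KS statistic, which is > 0.3, and the p-value is < .0001.

That means the difference is significant, i.e. the score distribution differs between the two groups. If you see the same kind of result on your own TRAIN and TEST scores, the Gini of 79.5% on train and 78.5% on test comes from score distributions that are different from each other.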

 

2) You can also use ANOVA if the scores from TRAIN and TEST are both normally distributed.


/* One-way ANOVA: does the mean score differ between the two groups? */
proc glm data=all;
 class dsn;
 model score=dsn/solution;
run;
quit;

[Image: PROC GLM output, Ksharp_1-1734833946024.png]

 

3) You could also compare the two ROC curves (TRAIN vs. TEST) with a chi-square test.

https://support.sas.com/kb/45/339.html
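
To give a rough idea of what such a comparison looks like, here is a minimal sketch of the usual large-sample chi-square test for two AUCs estimated on independent samples. It is not copied from the linked note. The Gini values are the ones from the question, converted with AUC = (Gini + 1) / 2. The two standard errors are made-up placeholders; you would replace them with the standard errors of the AUC estimated on your own TRAIN and TEST data (for example from the ROC association statistics that PROC LOGISTIC reports when a ROC statement is used). The dataset and variable names are hypothetical.

```sas
/* Sketch only: chi-square test for two AUCs from independent samples. */
/* se_train and se_test are ASSUMED placeholder values, not results    */
/* from this thread. Replace them with your own estimates.             */
data auc_compare;
 gini_train = 0.795;               /* from the question */
 gini_test  = 0.785;               /* from the question */
 auc_train  = (gini_train + 1) / 2; /* Gini = 2*AUC - 1 */
 auc_test   = (gini_test + 1) / 2;
 se_train   = 0.0020;              /* assumed standard error of auc_train */
 se_test    = 0.0031;              /* assumed standard error of auc_test  */
 /* Squared difference divided by the sum of the variances, 1 df */
 chisq   = (auc_train - auc_test)**2 / (se_train**2 + se_test**2);
 p_value = 1 - probchi(chisq, 1);
run;

proc print data=auc_compare;
 var auc_train auc_test chisq p_value;
run;
```

A small p_value would suggest that the gap between the two AUCs (and therefore the two Gini values) is larger than sampling noise alone would explain. The conclusion depends entirely on the standard errors you plug in.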

 

4) Calling @StatDave 
