Ronein
Onyx | Level 15
Hello,
I built a credit risk model on 100,000 customers. I split the data into 70% train and 30% test and built the model on the train data. The results are a Gini of 79.5% on train and 78.5% on test. My question: is this difference of 1 percentage point acceptable, or does it indicate a problem?
Accepted Solutions
Ksharp
Super User

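1) You could run a nonparametric two-sample test, such as the Wilcoxon or Kolmogorov-Smirnov test, to check whether the scores from TRAIN and TEST follow the same distribution. The demo below uses SASHELP.HEART as stand-in data and the EDF option of PROC NPAR1WAY, which reports the Kolmogorov-Smirnov statistic.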

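data train test;
   /* Demo only: SASHELP.HEART stands in for the scored TRAIN and TEST data */
   set sashelp.heart(keep=status ageatstart);
   if status='Alive' then output train;
   else output test;
   rename ageatstart=score;   /* treat AgeAtStart as the model score */
run;

/* Stack both data sets and record each row's source data set in DSN */
data all;
   set train test indsname=indsname;
   dsn=indsname;
run;

/* EDF requests the empirical distribution function tests, including Kolmogorov-Smirnov */
proc npar1way data=all edf;
   class dsn;
   var score;
run;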

[Image: PROC NPAR1WAY output showing the Kolmogorov-Smirnov test]

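Here D is the KS statistic, which is > 0.3, with a p-value < .0001.

That means the difference is significant: in this demo, the score distribution differs between TRAIN and TEST. On your own data, a significant result would mean the scores differ between TRAIN and TEST, i.e. the Gini of 79.5% on train and 78.5% on test come from different score distributions.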
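Point 1 mentions the Wilcoxon test, but when EDF is the only analysis option, PROC NPAR1WAY reports just the empirical distribution function statistics. A minimal sketch, assuming the ALL data set built above, that asks for the Wilcoxon rank-sum test as well:

```sas
/* Request both the Wilcoxon rank-sum test and the EDF (Kolmogorov-Smirnov) statistics */
proc npar1way data=all wilcoxon edf;
   class dsn;    /* WORK.TRAIN vs WORK.TEST */
   var score;
run;
```

With samples as large as 70,000 and 30,000 rows, even a very small shift in the scores can come out as significant, so it is worth reading the size of D alongside the p-value.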

 

2) You can also run an ANOVA if the scores from TRAIN and TEST are both normally distributed.


proc glm data=all;
   class dsn;
   model score=dsn / solution;
run;
quit;

[Image: PROC GLM output]
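Since the ANOVA in point 2 depends on the scores being roughly normal in both data sets, it may help to check that first. A minimal sketch, again assuming the ALL data set with the variables DSN and SCORE from the demo above:

```sas
/* Normality tests and plots of SCORE, separately for TRAIN and TEST */
proc univariate data=all normal;
   class dsn;
   var score;
   histogram score / normal;                   /* histogram with fitted normal curve */
   qqplot score / normal(mu=est sigma=est);    /* Q-Q plot against the fitted normal */
run;
```

On large data sets the formal tests tend to reject normality for minor departures, so the histogram and Q-Q plot are often the more useful part of the output.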

 

3) You could also compare the two ROC curves with a chi-square test.

https://support.sas.com/kb/45/339.html
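As a rough sketch of what such a comparison could look like for two independent samples: Gini = 2*AUC - 1, so the Gini values in the question correspond to AUCs of 0.8975 and 0.8925, and the squared difference divided by the sum of the squared standard errors is referred to a chi-square distribution with 1 degree of freedom. The standard errors below are placeholders, not results; they would have to come from your own ROC analysis, and this is not necessarily the exact method used in the linked note.

```sas
data auc_compare;
   /* Gini values from the question; Gini = 2*AUC - 1 */
   gini_train = 0.795;
   gini_test  = 0.785;
   auc_train  = (gini_train + 1) / 2;   /* 0.8975 */
   auc_test   = (gini_test  + 1) / 2;   /* 0.8925 */

   /* PLACEHOLDER standard errors (assumed): replace with your own estimates */
   se_train = 0.002;
   se_test  = 0.003;

   /* TRAIN and TEST are independent, so the variances of the two AUCs add */
   chisq   = (auc_train - auc_test)**2 / (se_train**2 + se_test**2);
   p_value = 1 - probchi(chisq, 1);
run;

proc print data=auc_compare noobs;
   var auc_train auc_test chisq p_value;
run;
```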

 

4) Calling @StatDave 
