SAS Data Science

Ronein
Meteorite | Level 14
Hello,
I built a credit risk model on 100,000 customers. I split the data into 70% train and 30% test and built the model on the training data. The results are Gini 79.5% on train and 78.5% on test. My question: is this difference of 1% okay, or does it indicate a problem?
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

1) You could run a nonparametric two-sample test (for example, the Wilcoxon test or the Kolmogorov-Smirnov test) to check whether the scores from TRAIN and TEST follow the same distribution.

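/* Demo data only: split sashelp.heart into two groups that stand in for TRAIN and TEST */
data train test ;
 set sashelp.heart(keep=status ageatstart);
 if status='Alive' then output train;
  else output test;
 rename ageatstart=score; /* treat AgeAtStart as the model score */
run;

/* Stack both data sets and record the source data set name in DSN */
data all;
 set train test indsname=indsname;
 dsn=indsname;
run;
/* WILCOXON requests the Wilcoxon rank-sum test; EDF requests the EDF tests, including Kolmogorov-Smirnov */
proc npar1way data=all wilcoxon edf;
class dsn;
var score;
run;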

[Screenshot: PROC NPAR1WAY output]

Here D is the KS statistic, which is > 0.3, with p-value < .0001.

That means the difference is significant (that is, the score distribution differs between TRAIN and TEST; in your case, the analogous conclusion would be that Gini 79.5% on train and 78.5% on test differ from each other).

 

2) You can also run an ANOVA if the scores from TRAIN and TEST are both normally distributed.


/* One-way ANOVA: does the mean score differ between TRAIN and TEST? */
proc glm data=all ;
class dsn;
model score=dsn/solution;
run;
quit;

[Screenshot: PROC GLM output]
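
Before relying on the ANOVA, the normality assumption can be checked within each group. A minimal sketch, assuming the same ALL data set with the DSN and SCORE variables created in the DATA steps above:

```sas
/* Sketch: normality tests and histograms of SCORE within each data set (DSN) */
proc univariate data=all normal;
  class dsn;
  var score;
  histogram score / normal;  /* overlay a fitted normal curve on each histogram */
run;
```

With samples as large as 100,000 rows, formal normality tests tend to reject even for small departures, so the histograms may be the more useful part of the output.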

 

3) You could also compare the two ROC curves with a chi-square test.

https://support.sas.com/kb/45/339.html
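
One common form of such a test can be sketched in a short DATA step. This is a minimal sketch and not necessarily the method used in the linked note. It assumes the two samples are independent, converts the Gini values from the question into AUCs through the relationship Gini = 2*AUC - 1, and uses HYPOTHETICAL standard errors (se_train and se_test) that you would replace with estimates from your own data. The data set and variable names are also illustrative.

```sas
/* Sketch only: chi-square test (1 df) for the difference between two independent AUCs */
data auc_compare;
  /* AUC = (Gini + 1) / 2, using the Gini values from the question */
  auc_train = (0.795 + 1) / 2;
  auc_test  = (0.785 + 1) / 2;

  /* HYPOTHETICAL standard errors: replace with the estimates for your own samples */
  se_train = 0.004;
  se_test  = 0.006;

  diff   = auc_train - auc_test;
  chisq  = diff**2 / (se_train**2 + se_test**2);  /* variances add for independent samples */
  pvalue = 1 - probchi(chisq, 1);                 /* upper-tail probability, 1 df */
run;

proc print data=auc_compare noobs;
run;
```

A large p-value would suggest that the gap between the train and test AUCs (and so the Gini values) is within sampling noise; the conclusion depends entirely on the standard errors you plug in.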

 

4) Calling @StatDave 

