whs278
Quartz | Level 8

 

I'm trying to calculate the AUC on a holdout test (or validation) data set.  My model has a pretty good AUC on the training data (0.87), but I would like to see whether it performs well out of sample.

 

Let's say the original dataset contains three variables: Y, X1, and X2.  I split this dataset into two smaller datasets, XTRAIN and XTEST.
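(For reference, a minimal sketch of one way to create that split, assuming a simple 70/30 random partition; the dataset name ORIGINAL, the split ratio, and the seed are placeholders, not part of the original post.)

/* Hypothetical split: ORIGINAL, the 70/30 ratio, and the seed are placeholders */
proc surveyselect data=ORIGINAL out=SPLIT samprate=0.7 seed=12345 outall;
run;

data XTRAIN XTEST;
    set SPLIT;
    if Selected = 1 then output XTRAIN;   /* OUTALL adds the Selected flag */
    else output XTEST;
    drop Selected;
run;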

 

These are the steps I have taken.

 

First, I train my model on the training dataset XTRAIN:

proc logistic data=XTRAIN outmodel=MODEL1;
    model Y(event='1') = X1 X2;
run;
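(As a side note, a sketch of how the training AUC itself can be written to a dataset instead of being read off the listing; the ODS table name Association and the output name TRAIN_ASSOC are assumptions worth confirming with ODS TRACE ON.)

/* Same fit as above, with the c statistic (training AUC) captured as a data set.
   Table and dataset names are assumptions; verify with ODS TRACE ON. */
proc logistic data=XTRAIN outmodel=MODEL1;
    model Y(event='1') = X1 X2;
    ods output Association=TRAIN_ASSOC;   /* the row labeled "c" is the AUC */
run;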

 

Next, I use the model to make predictions on the test dataset:

 

proc logistic inmodel=MODEL1;
    score data=XTEST out=YPRED_test (rename=(P_1=YPRED));
run;
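(A quick, optional way to sanity-check the scored dataset before computing the AUC; PROC MEANS here is purely illustrative.)

/* Optional check: YPRED (the renamed P_1 column) should be a probability in [0, 1] */
proc means data=YPRED_test n nmiss min mean max;
    var YPRED;
run;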

Next, I use these predictions to plot the ROC curve and calculate my test AUC:

 

proc logistic data=YPRED_test;
    model Y(event='1') = ;
    roc pred=YPRED;
    ods select ROCOVERLAY;
run;
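(If the test AUC should end up in a dataset rather than just the listing, something like the following should work; the NOFIT pattern and the ROCAssociation ODS table name are assumptions to confirm with ODS TRACE ON.)

/* Sketch: write the holdout AUC to a data set. NOFIT suppresses refitting, so
   only the ROC for the stored predictions is computed. Names are assumptions. */
proc logistic data=YPRED_test;
    model Y(event='1') = YPRED / nofit;
    roc 'holdout' pred=YPRED;
    ods output ROCAssociation=TEST_AUC;   /* contains the Area (AUC) column */
run;

proc print data=TEST_AUC noobs;
run;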

I just wanted to check whether these steps are correct.  In general, these are the steps for out-of-sample model validation that I have used when programming in R and Python.

 

5 REPLIES
whs278
Quartz | Level 8
Thanks, I actually read that but was confused on two points.

First, it seems that having INMODEL as a separate step is unnecessary, because I can just add a SCORE statement to the first PROC LOGISTIC step. I just wanted to confirm that this is correct.

proc logistic data=XTRAIN outmodel=MODEL1;
    model Y(event='1') = X1 X2;
    score data=XTEST out=YTEST (rename=(P_1=YPRED));
run;

Second, I don't understand why you need the MODEL statement in the second PROC LOGISTIC step, since the model was already fit in the first step.

proc logistic data=YTEST;
    model Y(event='1') = ;
    roc pred=YPRED;
    ods select ROCOVERLAY;
run;
PaigeMiller
Diamond | Level 26

That link is pretty old, and maybe there are newer versions of the code that do the same thing (or maybe there have always been two ways to do this).

 

You can try it both ways and see if the results are the same. You can also try the second piece of code without the MODEL statement and see what happens.

--
Paige Miller
whs278
Quartz | Level 8 (Accepted Solution)

Okay, I re-read it and now realize where I was getting confused.  I just had a hard time understanding that you can fit the model and calculate prediction scores for different datasets in the same PROC LOGISTIC step.

 

This is ultimately the fastest way to compare training/test AUC.

 

proc logistic data=train;
    model y(event='1') = x1 x2;
    score data=train fitstat;
    score data=valid fitstat;
run;
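(For the record, the FITSTAT option prints a "Fit Statistics for SCORE Data" table with one row per scored dataset, and that table includes the AUC.  To capture it as a dataset, the ODS table name should be ScoreFitStat, but that name is an assumption worth confirming with ODS TRACE ON.)

/* Sketch: capture the training and validation fit statistics (including AUC)
   in one data set. The ODS table name ScoreFitStat is an assumption; check it
   with ODS TRACE ON if nothing is written out. */
proc logistic data=train;
    model y(event='1') = x1 x2;
    score data=train fitstat;
    score data=valid fitstat;
    ods output ScoreFitStat=auc_compare;
run;

proc print data=auc_compare noobs;
run;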

Ksharp
Super User
/* Compare ROC curves for individual predictors without fitting a model:
   NOFIT skips the model fit, and each ROC statement evaluates one variable */
proc logistic data=sashelp.heart;
    model status(event='Dead') = weight height / nofit;
    roc 'weight' pred=weight;
    roc 'height' pred=height;
run;

