I'm trying to calculate the AUC on a holdout test (or validation) data set. My model has a pretty good AUC on the training data (.87), but I would like to see if it performs well out of sample.
Let's say the original dataset contains three variables: Y, X1, and X2. I split this dataset into two smaller datasets: XTRAIN and XTEST.
These are the steps I have taken.
First, I trained my model on the training dataset XTRAIN.
proc logistic data = XTRAIN outmodel= MODEL1 ;
model Y (EVENT = '1')= X1 X2 ;
run;
Next I use my model to make predictions on the test dataset.
proc logistic inmodel = MODEL1 ;
score data = XTEST out = YPRED_test (rename = (P_1 = YPRED));
run;
Next, I use these predictions to plot the ROC curve and calculate my test AUC.
proc logistic data= YPRED_test;
model Y(event="1")=;
roc pred =YPRED;
ods select ROCOVERLAY;
run;
I just wanted to check if these steps were correct. In general, these are the steps for out-of-sample model validation I have used when programming in R and Python.
Accepted Solutions
Okay, I re-read this and now realize where I was getting confused. I just had a hard time understanding that you could fit the model and calculate prediction scores for different datasets in the same PROC LOGISTIC step.
This is ultimately the fastest way to compare training and test AUC.
proc logistic data=train;
model y(event="1") = x1 x2;
score data=train fitstat;
score data=valid fitstat;
run;
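To compare the two AUC values side by side, the fit statistics can also be captured in a data set. The following is a minimal sketch, not a definitive recipe: it assumes the FITSTAT output is exposed under the ODS table name ScoreFitStat, and the output data set name work.auc_compare is a hypothetical name chosen for illustration. TRAIN and VALID are the data sets from the code above.

```sas
/* Sketch: fit on TRAIN, score both data sets, keep the fit statistics.   */
/* Assumes TRAIN and VALID each contain Y, X1, and X2.                     */
/* work.auc_compare is a hypothetical name; ScoreFitStat is assumed to be  */
/* the ODS table written by the FITSTAT option of the SCORE statement.     */
ods output ScoreFitStat=work.auc_compare;

proc logistic data=train;
   model y(event="1") = x1 x2;
   score data=train fitstat;   /* in-sample fit statistics, including AUC */
   score data=valid fitstat;   /* out-of-sample fit statistics            */
run;

/* Print the captured statistics to compare the two scored data sets. */
proc print data=work.auc_compare noobs;
run;
```

If the printed data set holds the statistics for both SCORE statements, a large gap between the training and validation AUC would point to overfitting.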
https://support.sas.com/kb/39/724.html
Paige Miller
First, it seems that a separate INMODEL step is unnecessary, because I can just add a SCORE statement to the first PROC LOGISTIC step. I just wanted to confirm that this is correct.
proc logistic data = XTRAIN outmodel= MODEL1 ;
model Y (EVENT = '1')= X1 X2 ;
score data = XTEST out =YTEST (rename = (P_1 = YPRED));
run;
Second, I don't understand why you need the MODEL statement in the second PROC LOGISTIC step, since the model was already fit in the first step.
proc logistic data= YTEST;
model Y(event="1")=;
roc pred =YPRED;
ods select ROCOVERLAY;
run;
That link is pretty old, and there may be newer ways to write code that does the same thing (or maybe there were always two ways to do this).
You can try it both ways and see if the results are the same. You can also try the second piece of code without the MODEL statement and see what happens.
Paige Miller
proc logistic data=sashelp.heart;
model status(event='Dead')=weight height/nofit;
roc 'weight' pred=weight;
roc 'height' pred=height;
run;