DvdM
Calcite | Level 5

How can I partition my dataset into a training set and a test set, where the training set can be used for k-fold cross validation for hyperparameter tuning in Model Studio?

I want to use cross validation to find the optimal hyperparameters for my gradient boosting model, but I also want a separate test set to evaluate the model's performance. 

In the documentation for the autotuning validation method it says that "if your data is partitioned, then that partition is used and Validation method, Validation data proportion, and Cross validation number of folds are all ignored". However, if I do not create a partition variable in the project settings, it seems that the model is also scored on the training data, which results in an AUC of 1. How can I hold out a separate test set, but still apply cross validation?

Thank you in advance!  

WendyCzika
SAS Employee

Actually that doc isn't completely correct. If you create only training and test partitions (set the validation percentage to 0) and then select K-fold cross validation for the Validation method, cross validation will actually be used. So give that a try, and hope that helps!

DvdM
Calcite | Level 5

Thank you for your reply! I tried this on data with only a train and test partition, but noticed that the application of cross-validation does not show up in the autotuning part of the training code. See below:

 

  partition rolevar='_PartInd_'n (TRAIN='1' TEST='2');
  autotune useparameters=CUSTOM tuningparameters=(
     lasso(LB=0 UB=10 INIT=0)
     learningrate(LB=0.01 UB=1 INIT=0.1)
     ntrees(LB=20 UB=150 INIT=100)
     ridge(LB=0 UB=10 INIT=1)
     samplingrate(LB=0.1 UB=1 INIT=0.5)
     vars_to_try(LB=1 UB=100 INIT=100)
     )
     searchmethod=GA objective=AUC maxtime=3600
     maxevals=50 maxiters=5 popsize=10
     targetevent='1'
  ;

 

With data that is not partitioned, cross validation does appear in the training code, together with the number of folds:

  autotune useparameters=CUSTOM tuningparameters=(
     lasso(LB=0 UB=10 INIT=0)
     learningrate(LB=0.01 UB=1 INIT=0.1)
     ntrees(LB=20 UB=150 INIT=100)
     ridge(LB=0 UB=10 INIT=1)
     samplingrate(LB=0.1 UB=1 INIT=0.5)
     vars_to_try(LB=1 UB=100 INIT=100)
     )
     kfold=5
     searchmethod=GA objective=AUC maxtime=3600
     maxevals=50 maxiters=5 popsize=10
     targetevent='1'
  ;

Does it still work, even though it isn't mentioned in the training code? Or does it mean that cross validation is still not applied if I only have a train and test partition?

WendyCzika
SAS Employee

Do you know what version of SAS Viya you are on?  It should be working in Viya 3.5 and on.

 

DvdM
Calcite | Level 5
I am on V.03.04, so I guess I have to wait until my environment is updated to version 3.5 then. Or is there another workaround for version 3.4? 
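
One possible workaround on 3.4, sketched here under assumptions and not confirmed anywhere in this thread, is to do the split outside the autotuner. The idea is to pass only the training rows to PROC GRADBOOST without a PARTITION statement (so that kfold= appears in the autotune call, as in the unpartitioned code above), save the tuned model, and then score the held-out test rows separately. The table names (casuser.mydata, casuser.train, casuser.test, casuser.gb_store, casuser.test_scored), the target y, and the inputs x1-x3 are hypothetical placeholders, and only two tuning parameters are shown for brevity.

```sas
/* Sketch only. Assumes an active CAS session, a CAS libref named casuser,   */
/* and a table casuser.mydata with target y, inputs x1-x3, and the partition */
/* indicator _PartInd_ (1 = train, 2 = test). All names are hypothetical.    */

/* 1. Split the data by the existing partition indicator */
data casuser.train casuser.test;
   set casuser.mydata;
   if _PartInd_ = 1 then output casuser.train;
   else if _PartInd_ = 2 then output casuser.test;
run;

/* 2. Tune on the training rows only. There is no PARTITION statement, */
/*    so the autotuner is asked for 5-fold cross validation.           */
proc gradboost data=casuser.train;
   target y / level=nominal;
   input x1 x2 x3 / level=interval;
   autotune useparameters=CUSTOM tuningparameters=(
      learningrate(LB=0.01 UB=1 INIT=0.1)
      ntrees(LB=20 UB=150 INIT=100)
      )
      kfold=5
      searchmethod=GA objective=AUC maxtime=3600
      maxevals=50 maxiters=5 popsize=10
      targetevent='1'
   ;
   /* Save the model trained with the best configuration found */
   savestate rstore=casuser.gb_store;
run;

/* 3. Score the held-out test rows, which the tuner never saw */
proc astore;
   score data=casuser.test rstore=casuser.gb_store
         out=casuser.test_scored copyvars=(y);
run;
```

The scored test table could then be assessed separately (for example with PROC ASSESS) to get a test AUC. This would most likely have to run in a SAS Code node or in SAS Studio instead of through the Gradient Boosting node's autotuning settings, and whether it behaves exactly like the 3.5 node is an assumption.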
