mh2t
Obsidian | Level 7

I split my dataset into three separate SAS data sets: train, validate, and test.

I want to build a GBM model on the train set, check it on the validate set, and predict on the test set. How can I use the validate and test sets explicitly in my code for model checking and prediction?

 

proc gradboost data=mylib.train outmodel=mylib.savedModel seed=12345;
   input &myVars / level = nominal;
   target Y/ level = nominal;
   ods output FitStatistics=fitstats;
run;

Your help would be greatly appreciated!

Accepted Solution
BrianGaines
SAS Employee

Hi @mh2t,

 

Are you using SAS Studio to develop your code? If so, then I suggest that you take a look at the tasks (specifically, the Partitioning, Gradient Boosting, and Assess tasks) because they can expedite your code development.  

 

As @StatDave mentioned, a convenient way to organize your data is to have one data table with an indicator variable that denotes which partition an observation belongs to. One benefit to this approach is that, when you estimate your model and use the PARTITION statement, some performance metrics for the validation and test partitions are automatically calculated so you don't need to calculate them separately as an additional step.   

 

For example, the following code creates a CAS session, loads SASHELP.CARS as an in-memory table, and partitions that table into three sets (the PROC PARTITION code is from the Partitioning task):

/* Connect to CAS */
cas;
libname mylib cas caslib="casuser";

/* Load data into memory */
data mylib.cars; 
   set sashelp.cars; 
run;

/* Partition data set */
proc partition data=mylib.cars partind samppct=30 samppct2=10;
	output out=mylib.cars;
run;

The data table MYLIB.CARS now has a new _PartInd_ column: 0 marks the training set (the rows not sampled, roughly 60%), 1 marks the validation set (SAMPPCT=30), and 2 marks the test set (SAMPPCT2=10).
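To confirm that the split came out roughly as requested, a frequency count of the indicator column is one option. This is only a sketch: it assumes the MYLIB libref and the _PartInd_ column created above, and PROC FREQ reads the CAS table through the libref, which should be fine for a table this small.

```sas
/* Count observations in each partition (0=train, 1=validate, 2=test) */
proc freq data=mylib.cars;
   tables _PartInd_;
run;
```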

 

You can then use this data table with the PARTITION statement in PROC GRADBOOST, as is done with the following code (generated by the Gradient Boosting task):

proc gradboost data=MYLIB.CARS outmodel=mylib.savedModel;
	partition role=_PartInd_ (validate='1' test='2' train='0');
	target Origin / level=nominal;
	input MSRP EngineSize / level=interval;
	input DriveTrain / level=nominal;
	ods output FitStatistics=work.Gradboost_fit;
	score out=mylib.scored copyvars=(Origin MSRP EngineSize DriveTrain _PartInd_);
run;

You can see in the results that the procedure automatically calculates fit statistics for all three partitions:

[Screenshot: gradboostResults.PNG, fit statistics for the training, validation, and test partitions]
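The same numbers can also be read from the table that the ODS OUTPUT statement saved. A minimal sketch, assuming the PROC GRADBOOST step above ran and created WORK.Gradboost_fit:

```sas
/* Print the fit statistics captured by ODS OUTPUT */
proc print data=work.Gradboost_fit noobs;
run;
```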

 

You could also use the saved model (mylib.savedModel) and PROC GRADBOOST to score the validation set, like in the following code:

proc gradboost data=MYLIB.CARS(where=(_partind_=1)) inmodel=mylib.savedModel;
	output out=mylib.valscored copyvars=(_all_);
run;

And you can see that the fit statistics match those produced by PROC GRADBOOST for the validation set when you estimated the model (compare with the previous results):

[Screenshot: gradboostValResultsInmodel.PNG, fit statistics from scoring the validation set with the saved model]
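By the same pattern, the held-out test partition could be scored with the saved model. This is a sketch that mirrors the validation step above and only changes the filter to _PartInd_=2; the output table name mylib.testscored is a made-up name for illustration.

```sas
/* Score only the test partition (_PartInd_=2) with the saved model */
proc gradboost data=mylib.cars(where=(_PartInd_=2)) inmodel=mylib.savedModel;
   output out=mylib.testscored copyvars=(_all_);  /* testscored: hypothetical name */
run;
```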

 

But again, by organizing your data partitions into the same table and by using the PARTITION statement, SAS automatically calculates these fit statistics when you estimate your model.  You can also use the scored data table (mylib.scored) with the Assess task for additional model assessment.
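For a quick check outside the Assess task, one option is to cross-tabulate the actual and predicted classes within each partition. This sketch assumes that mylib.scored holds the predicted class in a column named I_Origin (the usual I_ prefix plus the target name); check the column names in your scored table before running it.

```sas
/* One Origin-by-I_Origin table for each _PartInd_ value */
/* I_Origin is an assumed column name for the predicted class */
proc freq data=mylib.scored;
   tables _PartInd_*Origin*I_Origin / norow nocol nopercent;
run;
```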

 

Does this help?

 

-Brian

StatDave
SAS Super FREQ

Concatenate your separate data sets into one data set with an added variable that has a distinct value for the training, validation, and testing sets of observations. Then add a PARTITION statement in your PROC GRADBOOST step. For example, if the added variable is named ObsType with values "trn", "val", and "tst":

 

partition role=ObsType(train='trn' validate='val' test='tst');

 

See the documentation for details on this statement.
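Put together with the code from the original question, the approach might look like the sketch below. It assumes the three tables are named mylib.train, mylib.validate, and mylib.test (only mylib.train appears in the question), and the combined table name mylib.alldata is made up for illustration. &myVars and Y are taken from the question as-is.

```sas
/* Stack the three tables and tag each row with its source */
data mylib.alldata;                      /* alldata: hypothetical name */
   length ObsType $3;
   set mylib.train(in=inTrn)
       mylib.validate(in=inVal)          /* assumed table name */
       mylib.test(in=inTst);             /* assumed table name */
   if inTrn then ObsType='trn';
   else if inVal then ObsType='val';
   else if inTst then ObsType='tst';
run;

/* Train on 'trn'; fit statistics are also reported for 'val' and 'tst' */
proc gradboost data=mylib.alldata outmodel=mylib.savedModel seed=12345;
   partition role=ObsType(train='trn' validate='val' test='tst');
   input &myVars / level=nominal;
   target Y / level=nominal;
   ods output FitStatistics=fitstats;
run;
```

The quoted values in the PARTITION statement have to match the ObsType values exactly, so it is worth running a frequency count on ObsType if any partition comes back empty.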
