Solved: How to use validate and test datasets manually in PROC GRADBOOST?

mh2t · Posted 07-09-2020 11:43 AM

I split my dataset into three separate SAS datasets: train, validate, and test.

I wanted to build a GBM model on the train set, check on the validate set and predict on the test set. How can I use the validate and test sets in my code explicitly for model checking and prediction?

proc gradboost data=mylib.train outmodel=mylib.savedModel seed=12345;
   input &myVars / level = nominal;
   target Y/ level = nominal;
   ods output FitStatistics=fitstats;
run;

Your help would be greatly appreciated!

BrianGaines · Posted 07-09-2020 04:28 PM

Hi @mh2t,

Are you using SAS Studio to develop your code? If so, then I suggest that you take a look at the tasks (specifically, the Partitioning, Gradient Boosting, and Assess tasks) because they can expedite your code development.

As @StatDave mentioned, a convenient way to organize your data is to have one data table with an indicator variable that denotes which partition an observation belongs to. One benefit to this approach is that, when you estimate your model and use the PARTITION statement, some performance metrics for the validation and test partitions are automatically calculated so you don't need to calculate them separately as an additional step.

For example, the following code creates a CAS session, loads SASHELP.CARS as an in-memory table, and partitions that table into three sets (the PROC PARTITION code is from the Partitioning task):

/* Connect to CAS */
cas;
libname mylib cas caslib="casuser";

/* Load data into memory */
data mylib.cars; 
   set sashelp.cars; 
run;

/* Partition data set */
proc partition data=mylib.cars partind samppct=30 samppct2=10;
	output out=mylib.cars;
run;

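Now the data table MYLIB.CARS has a new _PartInd_ column, where 0 denotes the training set, 1 the validation set (the 30% requested by SAMPPCT=), and 2 the test set (the 10% requested by SAMPPCT2=). The remaining 60% of the observations are used for training.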
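Before fitting the model, it can be worth confirming that the partition sizes look as expected. A minimal sketch, assuming the MYLIB.CARS table created above (the percentages will only approximate 60/30/10 because the sampling is random):

```sas
/* Count observations in each partition: 0=train, 1=validate, 2=test */
proc freq data=mylib.cars;
   tables _PartInd_;
run;
```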

You can then use this data table with the PARTITION statement in PROC GRADBOOST, as is done with the following code (generated by the Gradient Boosting task):

proc gradboost data=MYLIB.CARS outmodel=mylib.savedModel;
	partition role=_PartInd_ (validate='1' test='2' train='0');
	target Origin / level=nominal;
	input MSRP EngineSize / level=interval;
	input DriveTrain / level=nominal;
	ods output FitStatistics=work.Gradboost_fit;
	score out=mylib.scored copyvars=(Origin MSRP EngineSize DriveTrain _PartInd_);
run;

In the results, you can see that the procedure automatically calculates fit statistics for all three partitions.

You could also use the saved model (mylib.savedModel) and PROC GRADBOOST to score the validation set, like in the following code:

proc gradboost data=MYLIB.CARS(where=(_partind_=1)) inmodel=mylib.savedModel;
	output out=mylib.valscored copyvars=(_all_);
run;

The fit statistics from this step match those that PROC GRADBOOST produced for the validation set when you estimated the model.

But again, by organizing your data partitions into the same table and by using the PARTITION statement, SAS automatically calculates these fit statistics when you estimate your model. You can also use the scored data table (mylib.scored) with the Assess task for additional model assessment.
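The same INMODEL= pattern should work for predicting on the test set, which was the other half of the original question. The following is only a sketch: the output table name mylib.testscored is made up for illustration, and the predicted-class column name I_Origin in the second step is an assumption about how the scored table is laid out, so check it with PROC CONTENTS before relying on it.

```sas
/* Score only the test partition (_PartInd_ = 2) with the saved model */
proc gradboost data=mylib.cars(where=(_partind_=2)) inmodel=mylib.savedModel;
   output out=mylib.testscored copyvars=(_all_);
run;

/* Cross-tabulate actual vs. predicted class.                       */
/* I_Origin is the ASSUMED name of the predicted-class column.      */
proc freq data=mylib.testscored;
   tables Origin*I_Origin / norow nocol nopercent;
run;
```

If the test data lives in its own table rather than in a partitioned table, the DATA= option would presumably just point at that table instead, with no WHERE= filter.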

Does this help?

-Brian


StatDave · Posted 07-09-2020 11:51 AM

Concatenate your separate data sets into one data set with an added variable that has a distinct value for the training, validation, and testing sets of observations. Then add a PARTITION statement in your PROC GRADBOOST step. For example, if the added variable is named ObsType with values "trn", "val", and "tst":

partition role=ObsType(train='trn' validate='val' test='tst');

See the documentation for details on this statement.
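As a rough sketch of that suggestion applied to the original question: the table names mylib.validate, mylib.test, and mylib.combined are assumptions (only mylib.train appears in the question), and &myVars and Y are taken from the original code.

```sas
/* Stack the three tables and tag each row with its role */
data mylib.combined;
   length ObsType $3;
   set mylib.train    (in=inTrain)
       mylib.validate (in=inValidate)
       mylib.test     (in=inTest);
   if inTrain then ObsType = 'trn';
   else if inValidate then ObsType = 'val';
   else ObsType = 'tst';
run;

/* Fit on the training rows; validation and test rows are used for assessment */
proc gradboost data=mylib.combined outmodel=mylib.savedModel seed=12345;
   partition role=ObsType (train='trn' validate='val' test='tst');
   input &myVars / level=nominal;
   target Y / level=nominal;
   ods output FitStatistics=fitstats;
run;
```

The quoted values in the PARTITION statement must match the ObsType values exactly, including case.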
