Programming the statistical procedures from SAS

Seeming Problem with HPSPLIT

Without going into too much detail, I've encountered what seems to be a problem with PROC HPSPLIT.  I first ran the procedure on a dataset that was divided, by the variable "divide", into a training subsample (divide = 1) and a validation subsample (divide = 0).  I included the statement:

     partition rolevar=divide(TRAIN='1' VALIDATE='0');

which is supposed to tell SAS to use the training data to estimate a classification tree and the validation data to validate it.


To check the results, I created a new dataset containing only the training data.  I did this by using

     if divide = 1;

to subset the original, larger dataset.


When I ran HPSPLIT on the training data alone (without the "partition" statement), I got a different tree.
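For concreteness, the two runs can be sketched roughly as follows. This is only an outline: the dataset names (work.full, work.train_only), the target y, and the inputs x1 x2 x3 are hypothetical placeholders, and all other options are left at their defaults.

```sas
/* Hypothetical names throughout: work.full, work.train_only, y, x1 x2 x3 */

/* Run 1: full dataset, roles assigned by the variable "divide" */
proc hpsplit data=work.full;
   class y;                 /* categorical target, so a classification tree */
   model y = x1 x2 x3;
   partition rolevar=divide(train='1' validate='0');
run;

/* Keep only the training observations */
data work.train_only;
   set work.full;
   if divide = 1;           /* subsetting IF */
run;

/* Run 2: training observations only, no PARTITION statement */
proc hpsplit data=work.train_only;
   class y;
   model y = x1 x2 x3;
run;
```

The only intended differences between the two runs are the input dataset and the presence of the PARTITION statement.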


Why should the absence of the validation data in my second run of HPSPLIT affect the results?  It does not seem right.  I expected to get the same tree both ways.




Re: Seeming Problem with HPSPLIT

As I understand it, the training data is used to fit the model, the validation data is used for model selection (for a tree, choosing how far to prune it), and separate test data is used to measure predictive ability. A model fit to the training data alone can overfit, which is why the data is usually split three ways: training, validation, and test. Because the validation data drives model selection, changing or removing it may change the model that is selected.
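One way to check whether the validation data is what changes the tree is to take model selection out of the picture and compare the unpruned trees. The sketch below assumes that your release accepts PRUNE OFF (that is how I recall the SAS/STAT documentation spelling it; older releases may differ, so please check), and the dataset and variable names are hypothetical placeholders.

```sas
/* Hypothetical names throughout: work.full, work.train_only, y, x1 x2 x3 */
/* Assumption: PRUNE OFF is available in your release of PROC HPSPLIT     */

/* Run 1: partitioned data, pruning turned off */
proc hpsplit data=work.full;
   class y;
   model y = x1 x2 x3;
   partition rolevar=divide(train='1' validate='0');
   prune off;               /* keep the full grown tree, no subtree selection */
run;

/* Run 2: training observations only, pruning turned off */
proc hpsplit data=work.train_only;
   class y;
   model y = x1 x2 x3;
   prune off;
run;
```

If the two unpruned trees match, the difference in the original runs most likely comes from the pruning step, which has validation data available in one run and not in the other.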


