2 weeks ago
Without going into too much detail, I want to say that I've encountered what seems to be a problem with HPSPLIT. I first ran this procedure using a dataset that was divided (using variable "divide") into a training subsample (divide = 1) and a validation subsample (divide = 0). I included the statement:
partition rolevar=divide(TRAIN='1' VALIDATE='0');
which is supposed to tell SAS to use the training data to estimate a classification tree and the validation data to validate it.
To check the results I got, I created a new dataset containing only the training data. I did this by using
if divide = 1;
to subsample the original large data.
When I ran HPSPLIT on the training data alone (without the "partition" statement), I got a different tree.
Why should the absence of the validation data in my second run of HPSPLIT affect the results? It does not seem right. I expected to get the same tree both ways.
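For anyone who wants to reproduce the comparison, here is a minimal sketch of the two runs. Only the partition statement and the subsetting IF come from my actual code; the dataset names (full, train_only), the target variable (target), and the inputs (x1, x2) are hypothetical placeholders, and the class/model statements assume a categorical target with interval inputs.

```sas
/* Run 1: full data, with roles taken from the variable "divide". */
/* Dataset and variable names other than "divide" are placeholders. */
proc hpsplit data=full;
   class target;
   model target = x1 x2;
   partition rolevar=divide(TRAIN='1' VALIDATE='0');
run;

/* Keep only the training rows, as described above. */
data train_only;
   set full;
   if divide = 1;
run;

/* Run 2: training rows only, no partition statement. */
proc hpsplit data=train_only;
   class target;
   model target = x1 x2;
run;
```

Both runs grow the tree from the same training rows, so any difference in the final tree would have to come from a step that can see the validation rows in the first run but not in the second.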
2 weeks ago
I thought training data was used to fit the model, validation data was used for model selection, and TEST data was used to determine predictive ability. Fitting on training data alone can lead to overfitting, which is why the data is usually split three ways: training, validation, and test. Because the validation data is used for model selection (for a tree, that means deciding how far to prune), changing or removing it may change the model that is selected.
But you'd probably want to wait for a SAS rep to answer your question; my experience with EM is limited.