oakHILLS68
Fluorite | Level 6

Without going into too much detail, I want to say that I've encountered what seems to be a problem with HPSPLIT.  I first ran this procedure using a dataset that was divided (using variable "divide") into a training subsample (divide = 1) and a validation subsample (divide = 0).  I included the statement:

     partition rolevar=divide(TRAIN='1' VALIDATE='0');

which is supposed to tell SAS to use the training data to estimate a classification tree and the validation data to validate it.
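
For readers following along, a call of this kind might look like the sketch below. Only the partition statement comes from the post; the dataset name work.full, the target y, and the inputs x1, x2, x3 and c1 are hypothetical placeholders, not the actual variables.

```sas
/* Sketch only: work.full, y, x1-x3 and c1 are hypothetical names. */
proc hpsplit data=work.full;
   class y c1;                /* categorical target and a categorical input */
   model y = x1 x2 x3 c1;     /* classification tree for y */
   /* Rows with divide = 1 are used for training, rows with divide = 0 for validation */
   partition rolevar=divide(train='1' validate='0');
run;
```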

 

To check the results, I created a new dataset containing only the training data.  I did this by using

     if divide = 1;

to subsample the original large data.
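
In a complete DATA step, that subsetting IF would sit roughly as follows. The dataset names work.full and work.train_only are assumptions made for illustration only.

```sas
/* Sketch only: work.full and work.train_only are hypothetical dataset names. */
data work.train_only;
   set work.full;
   if divide = 1;   /* subsetting IF: keep only the training rows */
run;
```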

 

When I ran HPSPLIT on the training data alone (without the "partition" statement), I got a different tree.

 

Why should the absence of the validation data in my second run of HPSPLIT affect the results?  It does not seem right.  I expected to get the same tree both ways.

 

Thanks.

 

Dennis H.

1 REPLY
Reeza
Super User

As I understand it, the training data is used to fit the model, the validation data is used for model selection, and test data is used to assess predictive ability. Fitting on the training data alone can overfit, which is why the data is usually split three ways: training, validation, and test. Because the validation data drives model selection, changing or removing it may change the model selected.
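
One way to test this explanation, offered as a sketch rather than a confirmed diagnosis, is to switch pruning off in both runs and compare the fully grown trees. If the validation data only influences which subtree is selected, the two unpruned trees would be expected to match. The names work.full, work.train_only, y, x1, x2, x3 and c1 are hypothetical, and the use of "prune off" reflects my reading of the HPSPLIT documentation, so check it against your SAS/STAT release.

```sas
/* Sketch only: dataset and variable names are hypothetical. */

/* Run 1: full data with the partition statement, pruning turned off */
proc hpsplit data=work.full;
   class y c1;
   model y = x1 x2 x3 c1;
   prune off;   /* assumption: OFF disables pruning in this release */
   partition rolevar=divide(train='1' validate='0');
run;

/* Run 2: training rows only, no partition statement, pruning turned off */
proc hpsplit data=work.train_only;
   class y c1;
   model y = x1 x2 x3 c1;
   prune off;
run;
```

If the unpruned trees agree but the default runs do not, the difference comes from the pruning step, which has validation data available in one run and not in the other.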

 

But you'd probably want to wait for a SAS rep to answer your question; my experience with EM is limited 😉
