Let us assume I have n = 50,000 records in my training dataset. Then I also have 25,000 records in a unique & new dataset.
I would like to submit all of it to EM Decision Trees so that 100% of the data in my training dataset is used for estimation, while the 25,000 records serve entirely (100%) as my validation, or test, dataset.
In reviewing the literature it looks like I will have to do something via stratification – but then it needs the %s for each of the levels. So I am a little confused there.
Maybe this is not done with a Data Partition node? The following is another scenario that would be ideal:
Training = 80% of 50,000
Validation = 20% of 50,000
Test = 100% of 25,000
How can I set this up exactly?
Thank you very much in advance,
Zach Feinstein, Statistical Data Modeler
P (952) 838-4289 C(612) 590-4813 F (952) 838-2010
SFM Mutual Insurance Company
3500 American Blvd. W,
Suite 700, Bloomington, MN 55431
Hi Zach,
You would use the Data Partition node to get stratified samples (training, validation, or testing) from one data set.
In your example you want to use one data set twice, and another data set once. You can specify the role for your data set using the Role property.
After you create your data source, and add it to your diagram, set the Role property to Train, Validate, or Test.
In the screenshot below I have set the same data source three times. I specified the role as train/validate/test for each data source node as an example similar to your question.

Is this what you needed?
Good luck,
Miguel
That seems like a very reasonable way to do it. Thank you so much!
I think what I put together is the equivalent of what you did [picture below].
Where may I find the output that compares or contrasts the scored nodes between the training and the test?
As long as your data partition node has test set to 0%, yep, I'd have done it the exact same way.
The results of your model node (e.g. Decision Tree) include fit statistics for all your partitions. For more stats like ROC, lift, gain, and response, add a Model Comparison node and view its results.
I hope it helps,
M
That, again, is some great help.
I suppose beggars cannot be choosers, but is it possible to see the same kind of tree output, displaying Training versus Test instead of Training versus Validation, for the same statistically significant nodes as before?
man, with EM you can always choose... or come up with a workaround.
what do you have in mind? just the tree plot with stats for train & test on the boxes? or something else?
Basic tree plot with the stats for train & test within the two columns of boxes would be ideal.
But I think the only difficult part would be to ensure that the nodes are precisely the same as what was generated by default, or interactively, within the initial Training runs.
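For anyone who wants to check the idea outside EM, here is a minimal Python sketch of that comparison: the tree's split rules are frozen after training and then applied unchanged to both partitions, so the nodes are identical by construction. The tree in `leaf_for`, the variable names (`age`, `prior_claims`, `claim`), and the toy rows are all hypothetical; this is not EM's output or API.

```python
from collections import defaultdict

def leaf_for(row):
    """Hypothetical frozen tree: two splits giving three leaves."""
    if row["age"] < 30:
        return "age < 30"
    if row["prior_claims"] >= 2:
        return "age >= 30, prior_claims >= 2"
    return "age >= 30, prior_claims < 2"

def leaf_stats(rows, target="claim"):
    """Return {leaf: (row count, event rate)} for the frozen tree."""
    n = defaultdict(int)
    events = defaultdict(int)
    for row in rows:
        leaf = leaf_for(row)
        n[leaf] += 1
        events[leaf] += int(row[target])
    return {leaf: (n[leaf], events[leaf] / n[leaf]) for leaf in n}

# Toy rows standing in for the training and test partitions.
train = [
    {"age": 25, "prior_claims": 0, "claim": 1},
    {"age": 28, "prior_claims": 1, "claim": 0},
    {"age": 45, "prior_claims": 3, "claim": 1},
    {"age": 52, "prior_claims": 2, "claim": 1},
    {"age": 40, "prior_claims": 0, "claim": 0},
    {"age": 61, "prior_claims": 1, "claim": 0},
]
test = [
    {"age": 22, "prior_claims": 0, "claim": 0},
    {"age": 47, "prior_claims": 2, "claim": 1},
    {"age": 35, "prior_claims": 0, "claim": 0},
    {"age": 58, "prior_claims": 1, "claim": 1},
]

train_stats = leaf_stats(train)
test_stats = leaf_stats(test)

# Side-by-side table: one line per leaf, train columns then test columns.
for leaf in train_stats:
    tr_n, tr_rate = train_stats[leaf]
    te_n, te_rate = test_stats.get(leaf, (0, float("nan")))
    print(f"{leaf:30s} train n={tr_n} rate={tr_rate:.2f} | test n={te_n} rate={te_rate:.2f}")
```

Because the same `leaf_for` function scores both partitions, any difference between the two columns reflects the data rather than a different tree.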
I have a couple ideas, will be in touch later today.
what EM version do you have?
Thanks a bunch. EM 6.1.
Hi,
Since you have two datasets and want to use one for model development and the other for validation, how about using the user-defined method within the partition node?
Naeem
Hi Zachary. In the words of Spiderman, my spider sense is tingling.
I hope there are no differences between data set A (n=50,000) and data set B (n=25,000). I've been in this situation before: I was told data set B was collected the same way as data set A, but I found out later there was a slight methodological change in how B was collected. Yes, the variables in both data sets were the same, but the underlying values rested on different assumptions.
You wouldn't want to validate a model using nuanced data.
Maybe there is a way to do both approaches and compare them. If you had a third data set for scoring, you could compare the results of the models.
You raise an excellent point. Thank you.
Actually - both datasets come from precisely the same pool. So perhaps that will aid in the discussion, configuration, and methodology behind seeing how the Training lines up with the Test within a Decision Tree.
Do you have any suggestions in how to best compare the results after the scoring? I have a full breadth of experience with the Training data and the Validation - just not with the Test data.
I almost wish there were a way for me to use the Data Partition node so that 80% of the first dataset is used for Training, 20% of that dataset is used for Validation, and 100% of the "other data" becomes the Test. But the trick would be to have the Training/Validation/Test all in one dataset.
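To make the "all in one dataset" idea concrete, here is a rough Python sketch (not EM code) of the layout described above. The first dataset is split 80/20 within each level of the target, so no per-level percentages need to be supplied by hand, and every row of the second dataset is tagged as test. The function name `assign_roles`, the `role` column, the `claim` target, and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def assign_roles(dev_rows, holdout_rows, target, train_frac=0.8, seed=42):
    """Return one combined list of rows, each tagged with a 'role' key."""
    rng = random.Random(seed)

    # Group the development rows by target level so the split is stratified.
    by_level = defaultdict(list)
    for row in dev_rows:
        by_level[row[target]].append(row)

    combined = []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        cut = round(train_frac * len(shuffled))
        for i, row in enumerate(shuffled):
            role = "TRAIN" if i < cut else "VALIDATE"
            combined.append({**row, "role": role})

    # Every row of the second dataset becomes test data.
    combined.extend({**row, "role": "TEST"} for row in holdout_rows)
    return combined

# Toy data standing in for the 50,000 and 25,000 record datasets.
dev = [{"id": i, "claim": i % 5 == 0} for i in range(50_000)]
holdout = [{"id": 50_000 + i, "claim": i % 5 == 0} for i in range(25_000)]

combined = assign_roles(dev, holdout, target="claim")

counts = defaultdict(int)
for row in combined:
    counts[row["role"]] += 1
print(dict(counts))
```

A single table with a role column like this is the kind of input a user-defined partition would need; how EM 6.1 consumes such a column is not covered by this sketch.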