## EM Decision Trees - Stratification or Not - Validation or Test?

Solved
Frequent Contributor
Posts: 115

Let us assume I have n = 50,000 records in my training dataset. Then I also have 25,000 records in a unique & new dataset.

I would like to submit all of it to EM Decision Trees so that 100% of the data in my training dataset is used for estimation, while the 25,000 records serve entirely (100%) as my validation, or test, dataset.

In reviewing the literature it looks like I will have to do something via stratification – but then it needs the %s for each of the levels. So I am a little confused there.

Maybe this is not done with a Data Partition node? The following is another scenario that would be ideal:

Training = 80% of 50,000

Validation = 20% of 50,000

Test = 100% of 25,000

How to make this happen perfectly?

Thank you very much in advance,

Zach Feinstein, Statistical Data Modeler

Accepted Solutions
Solution
‎11-19-2014 10:08 AM
Super Contributor
Posts: 337

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

Hi Zach,

You would use the Data Partition node to get stratified samples (training, validation, or testing) from one data set.
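For readers who want to see what a stratified split does outside EM, here is a minimal Python sketch. It is not EM's implementation; the `claim` target, the 10% event rate, and the function name `stratified_split` are hypothetical. The point it illustrates is that the per-level percentages are derived from the data itself, so each level keeps roughly the same share in both partitions.

```python
import random
from collections import defaultdict


def stratified_split(records, target_key, train_fraction=0.8, seed=42):
    """Split records into train/validate while preserving target-level shares."""
    rng = random.Random(seed)

    # Group the records by the level of the stratification variable.
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[target_key]].append(rec)

    train, validate = [], []
    for group in by_level.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Apply the same fraction inside every level.
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validate.extend(shuffled[cut:])
    return train, validate


# Hypothetical data: 50,000 records with a 10% event rate.
records = [{"id": i, "claim": 1 if i % 10 == 0 else 0} for i in range(50_000)]
train, validate = stratified_split(records, "claim")

train_rate = sum(r["claim"] for r in train) / len(train)
validate_rate = sum(r["claim"] for r in validate) / len(validate)
print(len(train), len(validate), train_rate, validate_rate)
```

Both partitions end up with the same event rate as the full dataset, which is the property stratification is meant to guarantee.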

In your example you want to use one data set twice, and another data set once. You can specify the role for your data set using the Role property.

After you create your data source, and add it to your diagram, set the Role property to Train, Validate, or Test.

In the screenshot below I have set the same data source three times. I specified the role as train/validate/test for each data source node as an example similar to your question.

Is this what you needed?

Good luck,

Miguel

All Replies
Frequent Contributor
Posts: 115

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

That seems like a very reasonable way to do it. Thank you so much!

I think what I put together is the equivalent of what you did [picture below].

Where may I find the output that compares or contrasts the scored nodes between the training and the test?

Super Contributor
Posts: 337

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

As long as your data partition node has test set to 0%, yep, I'd have done it the exact same way.

The results of your model node (e.g. Decision Tree) include fit statistics for all your partitions. For more statistics such as ROC, lift, gain, and response, add a Model Comparison node and view its results.
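As a rough illustration of what "fit statistics for all your partitions" means, the sketch below computes one such statistic, the misclassification rate, separately for each partition. It is plain Python rather than EM output; the `partition`, `actual`, and `predicted` field names and the tiny scored table are made up for the example.

```python
from collections import defaultdict


def misclassification_by_partition(scored_rows):
    """Return {partition: misclassification rate} for a list of scored rows."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for row in scored_rows:
        totals[row["partition"]] += 1
        if row["actual"] != row["predicted"]:
            errors[row["partition"]] += 1
    return {part: errors[part] / totals[part] for part in totals}


# Hypothetical scored rows; a real table would come from the scored datasets.
scored_rows = [
    {"partition": "train", "actual": 1, "predicted": 1},
    {"partition": "train", "actual": 0, "predicted": 0},
    {"partition": "train", "actual": 0, "predicted": 1},
    {"partition": "train", "actual": 1, "predicted": 1},
    {"partition": "validate", "actual": 1, "predicted": 0},
    {"partition": "validate", "actual": 0, "predicted": 0},
    {"partition": "test", "actual": 1, "predicted": 1},
    {"partition": "test", "actual": 0, "predicted": 1},
    {"partition": "test", "actual": 1, "predicted": 0},
    {"partition": "test", "actual": 0, "predicted": 0},
]

rates = misclassification_by_partition(scored_rows)
print(rates)
```

A large gap between the train rate and the test rate is the usual sign that the tree has overfit the training data.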

I hope it helps,

M

Frequent Contributor
Posts: 115

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

That, again, is some great help.

I suppose beggars cannot be choosers, but is it possible to get the same kind of tree output showing Training versus Test, instead of Training versus Validation, for the statistically significant nodes from before?

Super Contributor
Posts: 337

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

man, with EM you can always choose... or come up with a workaround.

what do you have in mind? just the tree plot with stats for train & test on the boxes? or something else?

Frequent Contributor
Posts: 115

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

Basic tree plot with the stats for train & test within the two columns of boxes would be ideal.

But I think the only difficult part would be to ensure that the nodes are precisely the same as what was generated by default, or interactively, within the initial Training runs.
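One way to think about this requirement, sketched in Python rather than EM: freeze the split rules found on the training data, push both datasets through those same rules, and tabulate the count and event rate per leaf side by side. The split rules, variable names (`age`, `prior_claims`, `target`), and records below are entirely hypothetical; in practice the rules would be taken from the fitted tree.

```python
from collections import defaultdict


def assign_leaf(record):
    """Hypothetical frozen tree; the split rules are illustrative only."""
    if record["age"] < 30:
        return "leaf_1" if record["prior_claims"] == 0 else "leaf_2"
    return "leaf_3"


def leaf_summary(records):
    """Return {leaf: (count, event rate)} using the frozen tree."""
    counts = defaultdict(int)
    events = defaultdict(int)
    for rec in records:
        leaf = assign_leaf(rec)
        counts[leaf] += 1
        events[leaf] += rec["target"]
    return {leaf: (counts[leaf], events[leaf] / counts[leaf]) for leaf in counts}


def compare_partitions(train, test):
    """Line up train and test statistics for every leaf of the same tree."""
    train_stats = leaf_summary(train)
    test_stats = leaf_summary(test)
    leaves = sorted(set(train_stats) | set(test_stats))
    # A leaf that receives no records in one partition is reported as None.
    return [(leaf, train_stats.get(leaf), test_stats.get(leaf)) for leaf in leaves]


train = [
    {"age": 25, "prior_claims": 0, "target": 0},
    {"age": 25, "prior_claims": 0, "target": 1},
    {"age": 28, "prior_claims": 2, "target": 1},
    {"age": 45, "prior_claims": 1, "target": 0},
]
test = [
    {"age": 22, "prior_claims": 0, "target": 0},
    {"age": 50, "prior_claims": 0, "target": 1},
    {"age": 60, "prior_claims": 3, "target": 0},
]

rows = compare_partitions(train, test)
for leaf, train_stat, test_stat in rows:
    print(leaf, train_stat, test_stat)
```

Because `assign_leaf` never changes between the two calls, the nodes are guaranteed to be identical for both partitions, which addresses the concern about keeping the original nodes.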

Super Contributor
Posts: 337

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

I have a couple ideas, will be in touch later today.

what EM version do you have?

Frequent Contributor
Posts: 115

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

Thanks a bunch. EM 6.1.

Posts: 1,231

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

Hi,

Since you have two datasets and want to use one for model development and the other for validation, how about using the user-defined method within the partition node?

Naeem

Contributor
Posts: 71

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

Hi Zachary.  In the words of Spiderman, my spider sense is tingling.

I hope there are no differences between data set A (n=50,000) and data set B (n=25,000).  I've been in this situation before: I was told data set B was collected the same way as data set A, but I found out later that there had been a slight methodological change in how B was collected.  Yes, the variables in both data sets were the same, but the underlying values rested on different assumptions.

You wouldn't want to validate a model using nuanced data.
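A quick numerical check for this kind of drift is to compare how each variable is distributed in the two data sets. The sketch below uses the population stability index (PSI) on a categorical variable; PSI is a technique I am bringing in for illustration, not something mentioned in this thread, and the `region` values and the small `floor` used to avoid division by zero are assumptions.

```python
import math
from collections import Counter


def level_shares(values):
    """Return the share of each distinct level in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {level: n / total for level, n in counts.items()}


def population_stability_index(expected, actual, floor=1e-6):
    """PSI between two samples of a categorical variable (0 means identical)."""
    exp_share = level_shares(expected)
    act_share = level_shares(actual)
    psi = 0.0
    for level in set(exp_share) | set(act_share):
        # Floor the shares so a level missing from one sample does not divide by zero.
        e = max(exp_share.get(level, 0.0), floor)
        a = max(act_share.get(level, 0.0), floor)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical "region" variable in data set A and two versions of data set B.
region_a = ["north"] * 600 + ["south"] * 400
region_b_same = ["north"] * 300 + ["south"] * 200
region_b_shifted = ["north"] * 200 + ["south"] * 300

psi_same = population_stability_index(region_a, region_b_same)
psi_shifted = population_stability_index(region_a, region_b_shifted)
print(psi_same, psi_shifted)
```

A value near zero suggests the variable looks the same in both data sets; a clearly positive value is a prompt to ask how B was collected before trusting it as a test set.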

Maybe there is a way to do both approaches and compare them.  If you had a third data set for scoring, you could compare the results of the models.

Frequent Contributor
Posts: 115

## Re: EM Decision Trees - Stratification or Not - Validation or Test?

You raise an excellent point. Thank you.

Actually - both datasets come from precisely the same pool. So perhaps that will aid in the discussion, configuration, and methodology behind seeing how the Training lines up with the Test within a Decision Tree.

Do you have any suggestions on how best to compare the results after the scoring? I have a full breadth of experience with the Training data and the Validation - just not with the Test data.

I almost wish there was a way for me to use the Data Partition node so that 80% of the first dataset is used for Training, 20% of that dataset is used for Validation, and 100% of the "other data" becomes the Test. But the trick would be to have the Training/Validation/Test all in one dataset.
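The "all in one dataset" idea can be sketched as a single table with a role column, built before the data ever reaches the modeling tool. This is a plain-Python illustration, not an EM feature; the `role` column name, its `TRAIN`/`VALIDATE`/`TEST` values, and the function name `assign_roles` are hypothetical.

```python
import random
from collections import Counter


def assign_roles(development, holdout, train_fraction=0.8, seed=2014):
    """Combine two datasets into one, tagging every record with a partition role."""
    rng = random.Random(seed)
    shuffled = development[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)

    combined = []
    # Simple random 80/20 split of the first dataset.
    for i, rec in enumerate(shuffled):
        role = "TRAIN" if i < cut else "VALIDATE"
        combined.append({**rec, "role": role})
    # Every record of the second dataset becomes test data.
    for rec in holdout:
        combined.append({**rec, "role": "TEST"})
    return combined


# Hypothetical stand-ins for the 50,000-record and 25,000-record datasets.
development = [{"id": i} for i in range(50_000)]
holdout = [{"id": 50_000 + i} for i in range(25_000)]

combined = assign_roles(development, holdout)
role_counts = Counter(rec["role"] for rec in combined)
print(role_counts)
```

The split here is simple random rather than stratified; the two ideas can be combined by applying the 80/20 cut within each target level.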
