Zachary
Obsidian | Level 7

Let us assume I have n = 50,000 records in my training dataset. Then I also have 25,000 records in a unique & new dataset.

  

I would like to submit all of it to EM Decision Trees so that 100% of the data in my training dataset is used for estimation, while the 25,000 records serve entirely (100%) as my validation, or test, dataset.

  

In reviewing the literature, it looks like I would have to do this via stratification, but that requires percentages for each of the levels, so I am a little confused there.

  

Maybe this is not done with a Data Partition node? The following is another scenario that would be ideal:

  

Training = 80% of 50,000

Validation = 20% of 50,000

Test = 100% of 25,000

How can I make this happen cleanly?

Thank you very much in advance,

Zach Feinstein, Statistical Data Modeler

P (952) 838-4289 C(612) 590-4813  F (952) 838-2010

SFM Mutual Insurance Company

3500 American Blvd. W,
Suite 700, Bloomington, MN 55431

www.sfmic.com          

Accepted Solution
M_Maldonado
Barite | Level 11

Hi Zach,

You would use the Data Partition node to get stratified samples (training, validation, or testing) from one data set.

In your example you want to use one data set twice, and another data set once. You can specify the role for your data set using the Role property.

After you create your data source and add it to your diagram, set its Role property to Train, Validate, or Test.

In the screenshot below I have set the same data source three times. I specified the role as train/validate/test for each data source node as an example similar to your question.

[Screenshot: roles.png — the same data source added three times, with roles Train, Validate, and Test]

Is this what you needed?

Good luck,

Miguel


11 REPLIES 11

Zachary
Obsidian | Level 7

That seems like a very reasonable way to do it. Thank you so much!

I think what I put together is the equivalent of what you did [picture below].

Where may I find the output that compares or contrasts the scored nodes between the training and the test?

[Screenshot: Capture.JPG]

M_Maldonado
Barite | Level 11

As long as your data partition node has test set to 0%, yep, I'd have done it the exact same way.

The results of your model node (e.g. decision tree) have fit statistics for all your partitions. For more stats like ROC, lift, gain, response, add a Model comparison node and see the results.

I hope it helps,

M

Zachary
Obsidian | Level 7

That, again, is some great help.

I suppose beggars cannot be choosers, but is it possible to get the same quality of tree output, except displaying Training versus Test (instead of Training versus Validation) for the statistically significant nodes from before?

M_Maldonado
Barite | Level 11

man, with EM you can always choose... or come up with a workaround.

what do you have in mind? just the tree plot with stats for train & test on the boxes? or something else?

Zachary
Obsidian | Level 7

Basic tree plot with the stats for train & test within the two columns of boxes would be ideal.

But I think the only difficult part would be to ensure that the nodes are precisely the same as what was generated by default, or interactively, within the initial Training runs.

M_Maldonado
Barite | Level 11

I have a couple ideas, will be in touch later today.

what EM version do you have?

Zachary
Obsidian | Level 7

Thanks a bunch. EM 6.1.

stat_sas
Ammonite | Level 13

Hi,

Since you have two datasets and want to use one for model development and the other for validation, how about using the user-defined method within the Data Partition node?
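The user-defined method expects a variable in the input data that identifies each record's partition. A minimal DATA step sketch of building that flag (the data set names, the variable name part_flag, and its values are placeholders, not anything specified in this thread):

```
/* Sketch only: work.model_data, work.holdout, and part_flag are
   placeholder names. Stack both data sets and flag each record
   with the partition it belongs to. */
data work.em_input;
   set work.model_data(in=in_train)  /* 50,000-record modeling pool */
       work.holdout;                 /* 25,000-record new data      */
   length part_flag $ 8;
   if in_train then part_flag = "TRAIN";
   else             part_flag = "TEST";
run;
```

You would then point the Data Partition node's user-defined partitioning properties at part_flag; the exact property names and accepted values depend on your EM version, so check the node's documentation.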

Naeem

jaredp
Quartz | Level 8

Hi Zachary.  In the words of Spiderman, my spider sense is tingling. 

I hope there are no differences between data set A (n=50,000) and data set B (n=25,000). I've been in this situation before: I was told data set B was collected the same way as data set A, but found out later there was a slight methodological change in how B was collected. Yes, the variables in both data sets are the same, but the underlying values had different assumptions.

You wouldn't want to validate a model using nuanced data.

Maybe there is a way to do both approaches and compare them.  If you had a third data set for scoring, you could compare the results of the models.

Zachary
Obsidian | Level 7

You raise an excellent point. Thank you.

Actually - both datasets come from precisely the same pool. So perhaps that will aid in the discussion, configuration, and methodology behind seeing how the Training lines up with the Test within a Decision Tree.

Do you have any suggestions in how to best compare the results after the scoring? I have a full breadth of experience with the Training data and the Validation - just not with the Test data.

I almost wish there were a way for me to use the Data Partition node where 80% of the first dataset is used for Training, 20% of that dataset is used for Validation, and then 100% of the "other data" becomes the Test set. But the trick would be to have the Training/Validation/Test all in one dataset.
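The "all in one dataset" idea above can be sketched with a DATA step that builds the partition flag directly: an 80/20 random split of the first data set, with the second data set reserved entirely as the test partition. (Data set names, the variable part_flag, and the seed are placeholders; the flag would then drive a user-defined partition rather than the Data Partition node's percentage settings.)

```
/* Sketch only: names and seed are placeholders. */
data work.em_input;
   set work.model_data(in=in_pool)   /* 50,000 records */
       work.holdout;                 /* 25,000 records */
   length part_flag $ 8;
   if in_pool then do;
      /* ~80/20 random split of the modeling pool */
      if ranuni(20140101) < 0.8 then part_flag = "TRAIN";
      else                           part_flag = "VALIDATE";
   end;
   else part_flag = "TEST";          /* 100% of the holdout */
run;
```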

