SlutskyFan
Obsidian | Level 7
I have 2 questions:

1) Cross-validated decision trees: In the cross validation panel, if you select 'yes' and set number of subsets = '10' and number of repeats = '10', are the results equivalent to 10-fold cross validation?

Cross-validated regression: When you choose 'cross validation misclassification' as the selection criterion for the logistic regression node, it seems similar to n-fold cross validation where n = the total number of observations in your data set. Is that correct?

2) With cross validation techniques, do you still partition your data into training and validation subsets? Based on the SAS help documentation, cross validation is primarily used when a data set is too small to partition, so I'm thinking you generally wouldn't combine a cross validation technique with partitioned data.
1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee
In SAS decision trees, '10 repeats' means 10-fold cross-validation repeated 10 times, for a total of 101 trees: 100 cross-validation trees plus the original tree grown on all of the data.
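
To make that arithmetic concrete, here is a rough sketch in scikit-learn (an illustration of the idea only, not Enterprise Miner itself; the data set and classifier are placeholders): 10 subsets with 10 repeats gives 100 cross-validation trees, and the reported model is still the single tree grown on all of the data.

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

# 10 subsets ("folds") repeated 10 times = 100 train/assess splits
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
n_cv_trees = 0
for train_idx, assess_idx in cv.split(X):
    DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # one tree per split
    n_cv_trees += 1
print(n_cv_trees)  # 100

# ...plus the original tree grown on all of the data: 101 trees in total
original_tree = DecisionTreeClassifier().fit(X, y)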

'Leave-one-out' cross-validation has been available in the EM Regression Node. In leave-one-out CV, n equals the total number of observations in your data set, as you describe.
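
For the regression question, a minimal leave-one-out sketch (scikit-learn again as an illustration, not the EM Regression Node; the data set is a placeholder) shows that the number of folds equals the number of observations, and averaging the per-fold errors gives the cross validation misclassification rate:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder data set
loo = LeaveOneOut()
print(loo.get_n_splits(X))                   # equals len(X): one fold per observation

# Each fold fits on n-1 observations and scores the single held-out case
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=loo)
print(1 - scores.mean())                     # cross validation misclassification rate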

Re: Using CV, do you still partition your data into training and validation subsets?
Not for a single EM modeling node. However, partitioning into data-available-for-CV versus a test hold-out is still useful, and if you are comparing models from several EM modeling nodes, scoring them against a single validation data set can be useful. It's up to the analyst.
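
As a rough sketch of that layout (scikit-learn once more, and the variable names are mine, not Enterprise Miner's): hold out a test set, cross-validate the candidate models on the remaining data, and keep the hold-out untouched for the final comparison.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)  # placeholder data
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

candidates = {"tree": DecisionTreeClassifier(max_depth=3),
              "logistic": LogisticRegression(max_iter=1000)}

# Select among models using 10-fold CV on the data available for CV only
cv_scores = {name: cross_val_score(model, X_cv, y_cv, cv=10).mean()
             for name, model in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# The untouched hold-out then gives an honest accuracy estimate for the chosen model
best_model = candidates[best_name].fit(X_cv, y_cv)
print(best_name, best_model.score(X_test, y_test))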

Re: primarily used when small data sets are not large enough for partitioning
That is my belief. Partitioning applies hold-out data directly to the model being deployed, providing a transparently unbiased estimate of accuracy. CV instead validates the model construction process. People disagree as to whether leave-one-out cross-validation provides unbiased or overly optimistic estimates of prediction.

However, many people prefer to cross-validate everything, regardless of data set size.

2 REPLIES
SatishG
Calcite | Level 5
I'm not sure about cross validation in regression, but I agree with you on the decision tree method and on your second point.