SlutskyFan
Obsidian | Level 7
I have 2 questions:

1) Cross-validated decision trees: In the cross validation panel, if you select 'yes' and set number of subsets = '10' and number of repeats = '10', are the results equivalent to 10-fold cross validation?

Cross-validated regression: When you choose 'cross validation misclassification' as the selection criterion for the logistic regression node, it seems similar to n-fold cross validation where n = the total number of observations in your data set. Is that correct?

2) With cross validation techniques, do you still partition your data into training and validation subsets? Based on the SAS help documentation, cross validation is primarily used when a data set is too small to partition, so I'm thinking you generally wouldn't combine a cross validation technique with partitioned data.
1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee
In SAS decision trees, '10 repeats' means 10-fold cross-validation repeated 10 times, for a total of 101 trees: 100 cross-validation trees plus the original tree grown on all of the data.
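
To make that arithmetic concrete, here is a rough sketch in scikit-learn (an illustration of the idea only, not Enterprise Miner itself; the data set and classifier are placeholders): 10 subsets with 10 repeats gives 100 cross-validation trees, and the reported model is still the single tree grown on all of the data.

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

# 10 subsets ("folds") repeated 10 times = 100 train/assess splits
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
n_cv_trees = 0
for train_idx, assess_idx in cv.split(X):
    DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # one tree per split
    n_cv_trees += 1
print(n_cv_trees)  # 100

# ...plus the original tree grown on all of the data: 101 trees in total
original_tree = DecisionTreeClassifier().fit(X, y)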

'Leave-one-out' cross-validation has been available in the EM Regression Node. In leave-one-out CV, n equals the total number of observations in your data set, as you describe.
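
For the regression question, a minimal leave-one-out sketch (scikit-learn again as an illustration, not the EM Regression Node; the data set is a placeholder) shows that the number of folds equals the number of observations, and averaging the per-fold errors gives the cross validation misclassification rate:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder data set
loo = LeaveOneOut()
print(loo.get_n_splits(X))                   # equals len(X): one fold per observation

# Each fold fits on n-1 observations and scores the single held-out case
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=loo)
print(1 - scores.mean())                     # cross validation misclassification rate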

Re: Using CV, do you still partition your data into training and validation subsets?
Not for a single EM modeling node. However, partitioning into data-available-for-CV versus a test hold-out is still useful, and if you are comparing models from several EM modeling nodes, scoring them against a single validation data set can be useful. It's up to the analyst.
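
As a rough sketch of that layout (scikit-learn once more, and the variable names are mine, not Enterprise Miner's): hold out a test set, cross-validate the candidate models on the remaining data, and keep the hold-out untouched for the final comparison.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)  # placeholder data
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

candidates = {"tree": DecisionTreeClassifier(max_depth=3),
              "logistic": LogisticRegression(max_iter=1000)}

# Select among models using 10-fold CV on the data available for CV only
cv_scores = {name: cross_val_score(model, X_cv, y_cv, cv=10).mean()
             for name, model in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# The untouched hold-out then gives an honest accuracy estimate for the chosen model
best_model = candidates[best_name].fit(X_cv, y_cv)
print(best_name, best_model.score(X_test, y_test))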

Re: primarily used when small data sets are not large enough for partitioning
That is my belief. Partitioning applies hold-out data directly to the model being deployed, providing a transparently unbiased estimate of accuracy. CV instead validates the model construction process. People disagree as to whether leave-one-out cross-validation provides unbiased or overly optimistic estimates of prediction.

However, many people prefer to cross-validate everything, regardless of data set size.

2 REPLIES
SatishG
Calcite | Level 5
I'm not sure about cross validation in regression, but I agree with you on the decision tree method and on your second point.