<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cross Validation in regression and decision trees in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14155#M56</link>
    <description>In SAS decision trees, '10 subsets' with '10 repeats' means 10-fold cross-validation repeated 10 times, for a total of 101 trees, including the original tree.&lt;BR /&gt;
&lt;BR /&gt;
'Leave-one-out' cross-validation has been available in the EM Regression Node. In leave-one-out CV, n = the total # of observations in your data set.&lt;BR /&gt;
&lt;BR /&gt;
Re: Using CV, do you still partition your data into training and validation subsets?&lt;BR /&gt;
Not for a single EM modeling node.  However, partitioning into data-available-for-CV vs. test-hold-out is still useful, and if you are comparing models from several EM modeling nodes, scoring them all against a single validation data set may be useful.  It's up to the analyst.&lt;BR /&gt;
&lt;BR /&gt;
Re: primarily used when small data sets are not large enough for partitioning&lt;BR /&gt;
That is my belief.  Partitioning applies hold-out data directly to the model being deployed, providing a transparently unbiased estimate of accuracy.  CV validates the model construction process.  People disagree as to whether leave-one-out cross-validation provides unbiased or overly optimistic estimates of prediction error.&lt;BR /&gt;
&lt;BR /&gt;
However, many people prefer to cross-validate everything, regardless of data set size.</description>
    <pubDate>Tue, 31 May 2011 17:26:58 GMT</pubDate>
    <dc:creator>PadraicGNeville</dc:creator>
    <dc:date>2011-05-31T17:26:58Z</dc:date>
    <item>
      <title>Cross Validation in regression and decision trees</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14153#M54</link>
      <description>I have 2 questions:&lt;BR /&gt;
&lt;BR /&gt;
1)  Cross-validated decision trees:  Under the panel for cross validation, if you select 'yes' and set number of subsets = '10' and number of repeats = '10', are these results equivalent to 10-fold cross validation?&lt;BR /&gt;
&lt;BR /&gt;
Cross-validated regression: When you choose 'cross validation misclassification' as your selection criterion for the logistic regression node, it seems this is similar to an n-fold cross validation where n = the total # of observations in your data set. Is that correct?&lt;BR /&gt;
&lt;BR /&gt;
2) With cross-validation techniques, do you still partition your data into training and validation subsets? I'm thinking, based on the SAS Help documentation, that since cross validation is primarily used when data sets are too small for partitioning, you wouldn't generally use a cross-validation technique with partitioned data.</description>
      <pubDate>Mon, 21 Feb 2011 17:01:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14153#M54</guid>
      <dc:creator>SlutskyFan</dc:creator>
      <dc:date>2011-02-21T17:01:20Z</dc:date>
    </item>
    <item>
      <title>Re: Cross Validation in regression and decision trees</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14154#M55</link>
      <description>I'm not sure about the cross validation in regression. I agree with your description of the decision tree method, and with your second point.</description>
      <pubDate>Mon, 21 Mar 2011 05:30:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14154#M55</guid>
      <dc:creator>SatishG</dc:creator>
      <dc:date>2011-03-21T05:30:43Z</dc:date>
    </item>
    <item>
      <title>Re: Cross Validation in regression and decision trees</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14155#M56</link>
      <description>In SAS decision trees, '10 subsets' with '10 repeats' means 10-fold cross-validation repeated 10 times, for a total of 101 trees, including the original tree.&lt;BR /&gt;
&lt;BR /&gt;
'Leave-one-out' cross-validation has been available in the EM Regression Node. In leave-one-out CV, n = the total # of observations in your data set.&lt;BR /&gt;
&lt;BR /&gt;
Re: Using CV, do you still partition your data into training and validation subsets?&lt;BR /&gt;
Not for a single EM modeling node.  However, partitioning into data-available-for-CV vs. test-hold-out is still useful, and if you are comparing models from several EM modeling nodes, scoring them all against a single validation data set may be useful.  It's up to the analyst.&lt;BR /&gt;
&lt;BR /&gt;
Re: primarily used when small data sets are not large enough for partitioning&lt;BR /&gt;
That is my belief.  Partitioning applies hold-out data directly to the model being deployed, providing a transparently unbiased estimate of accuracy.  CV validates the model construction process.  People disagree as to whether leave-one-out cross-validation provides unbiased or overly optimistic estimates of prediction error.&lt;BR /&gt;
&lt;BR /&gt;
However, many people prefer to cross-validate everything, regardless of data set size.</description>
      <pubDate>Tue, 31 May 2011 17:26:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Cross-Validation-in-regression-and-decision-trees/m-p/14155#M56</guid>
      <dc:creator>PadraicGNeville</dc:creator>
      <dc:date>2011-05-31T17:26:58Z</dc:date>
    </item>
  </channel>
</rss>

