Cross Validation in regression and decision trees

02-21-2011 12:01 PM

I have 2 questions:

1) Cross-validated decision trees: In the cross validation panel, if you select 'yes' and set number of subsets = 10 and number of repeats = 10, are the results equivalent to 10-fold cross validation?

Cross-validated regression: When you choose 'cross validation misclassification' as the selection criterion for the logistic regression node, it seems to be similar to n-fold cross validation where n = the total number of observations in your data set. Is that correct?

2) With cross validation techniques, do you still partition your data into training and validation subsets? Based on the SAS help documentation, cross validation is primarily used when a data set is too small to partition, so I'm thinking you wouldn't generally use it with partitioned data.

03-21-2011 01:30 AM

I'm not sure about the cross validation in regression. I agree with your reading of the decision tree method and with your second point.

05-31-2011 01:26 PM

In SAS decision trees, '10 repeats' means 10-fold cross-validation 10 times, for a total of 101 trees, including the original tree.
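To make the 101 count concrete, here is a minimal Python sketch of repeated k-fold splitting. It is illustrative only: `repeated_kfold_indices` is a hypothetical helper, the 50-observation data set is made up, and this is not how Enterprise Miner generates its subsets internally.

```python
import random

def repeated_kfold_indices(n_obs, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration; not the SAS implementation.
    """
    rng = random.Random(seed)
    for r in range(n_repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)  # a fresh random partition for each repeat
        for f in range(n_folds):
            test = idx[f::n_folds]  # every n_folds-th shuffled index
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield r, f, train, test

splits = list(repeated_kfold_indices(n_obs=50))
cv_fits = len(splits)      # one tree per (repeat, fold) pair
total_fits = cv_fits + 1   # plus the original tree grown on all the data
print(cv_fits, total_fits)  # prints: 100 101
```

So the settings are not a single 10-fold run: each repeat is its own 10-fold cross-validation on a different random partition.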

'Leave-one-out' cross-validation has been available in the EM Regression Node. In leave-one-out CV, the number of folds n equals the total number of observations in your data set.
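As a rough sketch of why leave-one-out is just k-fold with k = n, the Python below builds the folds for a made-up data set of 8 observations. `kfold_indices` is a hypothetical helper, not anything from the Regression Node.

```python
def kfold_indices(n_obs, n_folds):
    """Yield (train_idx, test_idx) pairs for plain k-fold CV (illustrative)."""
    idx = list(range(n_obs))
    for f in range(n_folds):
        test = idx[f::n_folds]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

n_obs = 8  # assumed toy size
loo = list(kfold_indices(n_obs, n_folds=n_obs))  # k = n gives leave-one-out

for train, test in loo:
    # each fit trains on n - 1 observations and scores the one left out
    print(len(train), test)
```

Every observation is held out exactly once, so the model is refit n times.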

Re: Using CV, do you still partition your data into training and validation subsets?

Not for a single EM modeling node. However, partitioning into data available for CV versus a test hold-out is still useful, and when comparing models from several EM modeling nodes, a single shared validation data set may be useful for the comparison. It's up to the analyst.
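One way to picture that arrangement, sketched in Python under made-up assumptions (100 observations, a 20% hold-out, 10 folds; `holdout_split` is a hypothetical helper and not an EM node):

```python
import random

def holdout_split(n_obs, test_fraction=0.2, seed=0):
    """Return (cv_pool, test_holdout) index lists (illustrative only)."""
    rng = random.Random(seed)
    idx = list(range(n_obs))
    rng.shuffle(idx)
    n_test = int(round(n_obs * test_fraction))
    return idx[n_test:], idx[:n_test]

cv_pool, test_holdout = holdout_split(100)

# Cross-validation only ever sees cv_pool; test_holdout is set aside
# and scored once, on the final model.
n_folds = 10
folds = [cv_pool[f::n_folds] for f in range(n_folds)]
print(len(cv_pool), len(test_holdout), [len(f) for f in folds])
```

The hold-out never enters any fold, so it can still give a direct check on whichever model is finally deployed.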

Re: primarily used when small data sets are not large enough for partitioning

That is my belief. Partitioning applies hold-out data directly to the model being deployed, providing a transparently unbiased estimate of accuracy. CV validates the model construction process. People disagree as to whether leave-one-out cross-validation provides unbiased or overly optimistic estimates of prediction accuracy.

However, many people prefer to cross-validate everything, regardless of data set size.
