rbettinger
Lapis Lazuli | Level 10

I am writing a SAS/IML module that involves cross-validation of an algorithm. The algorithm requires the computation of the class mean for a variable number of classes present in the data used as input to the algorithm. My question is:

  • Do I compute the class means for all of the data and use this single set of class means for every cross-validation sample, or
  • Do I compute a separate set of class means for each cross-validation sample and use those means only with that sample?

If there is 1-fold cross-validation, i.e., no cross-validation sampling of the data, the question is moot, since the class means of the entire data sample will be computed and compared to all of the data. But if I am performing n-fold cross-validation for n > 1 with the means computed from the entire data sample, then I will be comparing each sample of the data to means computed from the entire sample. Alternatively, if I compute the class means of each cross-validation sample separately and compare each sample's data to its own class means, then the comparisons will be tied more closely to the data used in computing the class means than if the class means of the entire data set were used.

Hence, using the all-data class means will give a uniform standard of comparison, while using n sets of class means derived from the cross-validation samples will produce "local" results, presumably with lower variance than the all-data class means but with less representational applicability to the data as a whole. So there is a trade-off between "global" and "local" results.

I do not know which approach is best, so I ask the Community's help.

Ross

P.S. Thank you for reading this wordy question. I set my verbose flag to 1 to ensure that I explained my question properly.

ACCEPTED SOLUTION
Rick_SAS
SAS Super FREQ

This doesn't look like an IML question. It seems to be a question about cross-validation. If so, you might get better results from a community such as StackOverflow. The SAS documentation includes several examples of CV, including https://documentation.sas.com/doc/en/statug/latest/statug_glmselect_details27.htm 

Regarding your question, I would state the answer as:

  • Compute a set of class means by using n-1 of the n cross-validation samples (the training folds), and use these means when scoring the one remaining, held-out sample.

For definiteness, let's consider 5-fold cross-validation. This means you randomly split the data into 5 subsets, called "folds." Using 4 of the folds, you train the model (e.g., fit the parameters). You then score the model on the "hold-out sample," which is the 5th subset. You use this scoring to compute some goodness-of-fit (GoF) statistic(s). You repeat this process 5 times, holding out a different fold each time, and report an average GoF statistic over the folds.
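
The procedure above can be sketched in a few lines. This is only an illustration in Python rather than SAS/IML, and it makes some assumptions that are not in the original question: a single numeric variable, a hypothetical "nearest class mean" rule standing in for the actual algorithm, and hold-out accuracy as the GoF statistic. The function names (`class_means`, `kfold_indices`, `cross_validate`) are made up for this sketch. The point it shows is that the class means are recomputed from the 4 training folds on every pass and never see the held-out fold.

```python
import random
from collections import defaultdict

def class_means(xs, labels):
    """Mean of x for each class label, using only the rows passed in."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, c in zip(xs, labels):
        sums[c] += x
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def kfold_indices(n, k, seed=0):
    """Randomly split the row indices 0..n-1 into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, labels, k=5, seed=0):
    """Per-fold hold-out accuracy of a nearest-class-mean rule (assumed model)."""
    scores = []
    for held_out in kfold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # "Training": class means come from the k-1 training folds only.
        means = class_means([xs[i] for i in train], [labels[i] for i in train])
        # "Scoring": assign each held-out row to the class with the closest mean.
        correct = 0
        for i in held_out:
            predicted = min(means, key=lambda c: abs(xs[i] - means[c]))
            correct += predicted == labels[i]
        scores.append(correct / len(held_out))
    return scores

# Toy data (invented for the example): two well-separated classes.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
labels = ["a"] * 5 + ["b"] * 5
scores = cross_validate(xs, labels, k=5)
mean_score = sum(scores) / len(scores)  # average GoF statistic over the folds
print(scores, mean_score)
```

If the class means were instead computed once from all of the data, each held-out row would have contributed to the means it is scored against, so the reported statistic would tend to look better than it should.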