I am writing a SAS/IML module that involves cross-validation of an algorithm. The algorithm requires the computation of the class mean for a variable number of classes present in the data used as input to the algorithm. My question is:
- Do I compute the class means from all of the data and use this single set of class means for every cross-validation sample, or
- Do I compute a separate set of class means for each cross-validation sample and use each sample's own means when processing that sample?
With 1-fold cross-validation, i.e., no cross-validation sampling of the data, the question is moot: the class means are computed from the entire data sample and compared against all of the data. With n-fold cross-validation for n > 1, however, using means computed from the entire data sample means that each cross-validation sample is compared to class means derived from all of the data, not just from that sample. Alternatively, if I compute the class means of each cross-validation sample separately and compare each sample's data with that sample's own class means, then the comparisons will be tied more closely to the data used in computing the class means than if the all-data class means were used.
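To make the two options concrete, here is a minimal sketch in plain Python rather than SAS/IML. It is only an illustration of what I mean, not my actual module: the toy data, the round-robin fold assignment, and the helper names `class_means` and `split_folds` are all made up for this example.

```python
from collections import defaultdict


def class_means(values, labels):
    """Return {label: mean of the values carrying that label}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, labels):
        sums[c] += v
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def split_folds(n_obs, n_folds):
    """Assign observation indices to folds round-robin (hypothetical scheme)."""
    return [list(range(k, n_obs, n_folds)) for k in range(n_folds)]


# Made-up toy data: one variable, two classes.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Option 1: a single set of class means computed from all of the data.
global_means = class_means(values, labels)

# Option 2: a separate set of class means for each cross-validation sample.
folds = split_folds(len(values), n_folds=2)
local_means = [
    class_means([values[i] for i in idx], [labels[i] for i in idx])
    for idx in folds
]

print(global_means)  # {'a': 2.5, 'b': 11.5}
print(local_means)   # [{'a': 2.0, 'b': 11.0}, {'a': 3.0, 'b': 12.0}]
```

Under option 1, both samples would be compared against `global_means`; under option 2, sample k would be compared against `local_means[k]` only.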
Hence, using the all-data class means gives a uniform standard of comparison, while using n sets of class means derived from the cross-validation samples gives "local" results, presumably with lower variance than those based on the all-data class means but also presumably less representative of the data as a whole. So there is a trade-off between "global" and "local" results.
I do not know which of the two approaches is better, so I ask the Community's help.
Ross
P.S. Thank you for reading this wordy question. I set my verbose flag to 1 to ensure that I explained my question properly.