David Wolpert

Paper #: 95-01-003

This paper uses off-training-set (OTS) error to investigate the assumption-free relationship between learning algorithms. It is shown, loosely speaking, that for any two algorithms A and B, there are as many targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is “anti-cross-validation” (choose the generalizer with the largest cross-validation error). On the other hand, for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, any algorithm is equivalent on average to its “randomized” version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, it is shown that (for example) cross-validation may have better minimax properties than anti-cross-validation, even for zero-one loss. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if the empirical misclassification rate is low, the VC dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. Other implications for “membership queries” algorithms and “punting” algorithms are also discussed.
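The central claim for zero-one loss can be made concrete with a toy enumeration (a hypothetical sketch, not taken from the paper): over *all* Boolean targets on a small finite input space, a “majority-label” rule and its deliberately perverse “minority-label” counterpart achieve identical average OTS error, because the off-training-set labels are uniformly distributed independently of anything the training set reveals. The input space, the two rules, and the 3/2 train/OTS split below are illustrative choices, not the paper's.

```python
from itertools import product

X = range(5)            # toy input space of 5 points
train_idx = [0, 1, 2]   # inputs seen during training
ots_idx = [3, 4]        # off-training-set inputs (never seen in training)

def majority_rule(train_labels):
    # Predict the majority training label at every OTS point.
    return int(sum(train_labels) * 2 >= len(train_labels))

def minority_rule(train_labels):
    # The "anti" algorithm: predict the minority training label.
    return 1 - majority_rule(train_labels)

def avg_ots_error(alg):
    # Average zero-one OTS error, uniformly over all 2^5 Boolean targets.
    errs = []
    for target in product([0, 1], repeat=len(X)):
        train_labels = [target[i] for i in train_idx]
        pred = alg(train_labels)
        err = sum(pred != target[i] for i in ots_idx) / len(ots_idx)
        errs.append(err)
    return sum(errs) / len(errs)

print(avg_ots_error(majority_rule), avg_ots_error(minority_rule))  # → 0.5 0.5
```

Both rules average exactly 0.5, mirroring the abstract's claim that, summed over targets, no algorithm outperforms any other under zero-one loss on off-training-set points.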