Cross-Validation & Introduction to the Bootstrap

Course: Computational Statistics (401-3632-00L), ETH Zürich
Week 7 (2026-04-03)

This week continues Chapter 5: Model Assessment and Selection, diving deeply into cross-validation — the most widely used method in machine learning for estimating prediction error and selecting tuning parameters. We cover the $K$-fold cross-validation algorithm, the bias–variance tradeoff in the choice of $K$, the one-standard-error rule, and computational shortcuts for linear smoothers. The lecture concludes by introducing the bootstrap (Chapter 6), one of the most influential ideas in modern statistics.

Recap from Week 6

Last week wrapped up Chapter 4 (non-parametric regression) and began Chapter 5. The key ideas carried forward are recalled in the sections below.

1. Training, Validation, and Test Sets

Before discussing cross-validation, it is useful to recall the simplest data-splitting strategy. We have two distinct goals when building models: model selection (choosing the best model among a set of candidates) and model assessment (estimating the prediction error of the chosen model).

The hold-out method addresses both goals by splitting the dataset into three disjoint parts:

  1. A training set used to fit all candidate models.
  2. A validation set used to select the best model from the candidates.
  3. A test set used to estimate the prediction error of the final model.
Issues with the Hold-Out Approach

There is no theory for optimal split proportions (a common rule of thumb is 50%/25%/25%). The results depend on the random split, and the training set is smaller than the full dataset, hurting predictive performance. Cross-validation addresses some of these drawbacks by eliminating the need for a separate validation set.

2. Cross-Validation

Cross-validation is a resampling technique that estimates the expected extra-sample (generalization) prediction error of a model $\hat{f}$, without requiring a separate validation set.

2.1 The $K$-Fold Cross-Validation Algorithm

Definition — $K$-Fold Cross-Validation

Given data $\{(x_i, y_i)\}_{i=1}^n$ and a loss function $L$:

  1. Partition the index set $\{1, \ldots, n\}$ into $K$ approximately equal-sized folds $B_1, \ldots, B_K$.
  2. For each $k = 1, \ldots, K$, fit the model on all data except fold $B_k$. Call the resulting estimator $\hat{f}^{(-B_k)}$.
  3. Evaluate on the held-out fold $B_k$.

The CV estimate of prediction error is:

$$\text{CV}(\hat{f}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{i \in B_k} L\!\left(y_i,\; \hat{f}^{(-B_k)}(x_i)\right).$$

The key intuition is that each $\hat{f}^{(-B_k)}$ is built without the data in $B_k$, so the loss evaluated on $B_k$ approximates out-of-sample error. The special case $K = n$ is called leave-one-out cross-validation (LOOCV), where each observation serves as its own test set.
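
As a minimal sketch of the algorithm (not from the lecture), the following NumPy code computes the squared-error CV estimate with ordinary least squares standing in for an arbitrary fitting procedure; the function name `kfold_cv` and the choice of OLS are illustrative assumptions.

```python
# Minimal K-fold CV sketch. OLS is used as an illustrative fitting procedure;
# any estimator with a fit/predict step could be substituted.
import numpy as np

def kfold_cv(X, y, K=10, seed=0):
    """Estimate the mean squared prediction error by K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)        # folds B_1, ..., B_K
    fold_errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        # Fit on all data except the held-out fold (here: ordinary least squares).
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        fold_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))   # CV_k
    return np.mean(fold_errors)                           # CV(f-hat)
```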

2.2 Cross-Validation for Model Selection

Suppose we have a collection of models $\{\hat{f}_\lambda\}_{\lambda \in \Lambda}$ indexed by a tuning parameter $\lambda$ (e.g., number of features in subset selection, bandwidth in kernel regression, or penalty in smoothing splines). Cross-validation selects the best $\lambda$ as follows:

  1. For each $\lambda \in \Lambda$, compute $\text{CV}(\hat{f}_\lambda)$ using $K$-fold CV.
  2. Select $\hat{\lambda} = \arg\min_{\lambda \in \Lambda} \text{CV}(\hat{f}_\lambda)$.
  3. The final model is $\hat{f}_{\hat{\lambda}}$, refitted on all training data.
Important

After selecting $\hat{\lambda}$ via CV, you must refit the model on the entire dataset to get the final model. Cross-validation is only used for selection; the final model should exploit all available training data. A separate, independent test set is still needed to estimate the error of this final model.
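
As an illustration of the select-then-refit recipe, the sketch below uses scikit-learn with ridge regression as the tunable model family; the library, the penalty grid, and the helper name `select_and_refit` are illustrative choices, not part of the lecture.

```python
# Sketch: choose the ridge penalty by 10-fold CV, then refit on the full data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

def select_and_refit(X, y, alphas=np.logspace(-3, 3, 25)):
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    # cross_val_score returns negative MSE per fold; average and flip the sign.
    cv_err = [-cross_val_score(Ridge(alpha=a), X, y,
                               scoring="neg_mean_squared_error", cv=cv).mean()
              for a in alphas]
    best = alphas[int(np.argmin(cv_err))]
    final_model = Ridge(alpha=best).fit(X, y)   # refit on all training data
    return best, final_model
```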

3. Choosing $K$: The Bias–Variance Tradeoff

The number of folds $K$ involves yet another bias–variance tradeoff — this time in the quality of the CV error estimate itself:

Bias–Variance Tradeoff in $K$
  • Small $K$ (e.g., $K = 2$): Each model $\hat{f}^{(-B_k)}$ is trained on only $n(1 - 1/K)$ observations — substantially fewer than $n$. Because these models use less data, they tend to have higher prediction error. As a result, $\text{CV}(\hat{f})$ overestimates the true error (positive bias).
  • Large $K$ (e.g., $K = n$, i.e., LOOCV): The training sets overlap heavily — each pair of training sets shares $n - 2$ observations. This makes the fold-level error estimates $\text{CV}_k$ highly correlated, which increases the variance of $\text{CV}(\hat{f})$.

There is no universally optimal choice of $K$. In practice, $K = 5$ or $K = 10$ are the most commonly used values, balancing bias and variance. Even though CV will typically overestimate the error in absolute terms, the shape of the $\text{CV}(\hat{f}_\lambda)$ curve as a function of $\lambda$ tends to be similar to the shape of the true error curve, so CV still picks a good $\hat\lambda$.
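
This behavior can be examined empirically. The simulation sketch below (illustrative; all settings are arbitrary) repeats K-fold CV with OLS on many simulated datasets and reports the mean and standard deviation of the resulting CV estimates, so the pessimistic bias for small $K$ and the spread of the estimate can be compared across values of $K$.

```python
# Illustrative simulation: for a given K, repeat "simulate a dataset, run
# K-fold CV with OLS" many times and summarize the resulting CV estimates.
import numpy as np

def simulate_cv(K, n=60, p=5, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ np.ones(p) + rng.normal(size=n)        # linear model, noise sd = 1
        folds = np.array_split(rng.permutation(n), K)
        fold_errs = []
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            fold_errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
        estimates.append(np.mean(fold_errs))
    return np.mean(estimates), np.std(estimates, ddof=1)

# e.g. compare simulate_cv(K=2), simulate_cv(K=10), simulate_cv(K=60)  # K = n is LOOCV here
```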

Computational Cost

$K$-fold CV requires fitting $K$ models. For LOOCV ($K = n$), this means $n$ model fits — which can be very expensive unless computational shortcuts are available (see Section 5).

4. The One-Standard-Error Rule

Cross-validation curves are often relatively flat near their minimum. This means many models with different complexity levels have nearly the same estimated error. To favor simpler, more interpretable models, we use the one-standard-error rule.

One-Standard-Error Rule

Select the most parsimonious (simplest) model whose CV error is no more than one standard error above the minimum CV error.

Estimating the Standard Error of CV

Assuming equal-sized folds $|B_k| = n/K$, define the per-fold error estimate:

$$\text{CV}_k = \frac{K}{n} \sum_{i \in B_k} L\!\left(y_i,\;\hat{f}^{(-B_k)}(x_i)\right),$$

so that $\text{CV}(\hat{f}) = \frac{1}{K}(\text{CV}_1 + \cdots + \text{CV}_K)$. The standard error is estimated by:

$$\widehat{\text{sd}} = \frac{1}{\sqrt{K}} \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left(\text{CV}_k - \text{CV}(\hat{f})\right)^2}.$$
Caveat

This is only a heuristic standard error. The fold-level estimates $\text{CV}_k$ are not independent (the training sets overlap), so this formula only gives an approximate standard deviation. No unbiased estimator of the variance of $K$-fold CV is known (Bengio & Grandvalet, 2004).
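
Given the per-fold errors, the rule is mechanical to apply. The sketch below (illustrative; the variable names `cv_k` and `complexities` are assumptions) computes the heuristic standard error from the formula above and returns the most parsimonious candidate within one standard error of the minimizer.

```python
# Illustrative sketch: cv_k[j, k] holds the per-fold error CV_k of candidate
# model j, whose complexity (e.g., subset size) is complexities[j].
import numpy as np

def one_se_rule(cv_k, complexities):
    complexities = np.asarray(complexities)
    K = cv_k.shape[1]
    cv_mean = cv_k.mean(axis=1)                       # CV(f-hat) for each model
    cv_se = cv_k.std(axis=1, ddof=1) / np.sqrt(K)     # heuristic SE from the formula above
    j_min = int(np.argmin(cv_mean))
    threshold = cv_mean[j_min] + cv_se[j_min]
    # Most parsimonious model whose CV error is within one SE of the minimum.
    eligible = cv_mean <= threshold
    return complexities[eligible].min()
```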

Example — Body Fat, Best Subset Selection

Using 10-fold CV for best subset selection on the body fat dataset (247 observations, 13 predictors), the CV error curve is plotted as a function of the number of features $k$. The minimum tends to occur around $k = 7$ or $k = 8$, but applying the one-standard-error rule selects $\hat{k} = 3$, giving a much simpler model using only weight, abdomen, and wrist.

To show the instability of the argmin approach: repeating 10-fold CV 100 times with different random folds shows the argmin selects values of $k$ ranging widely from 2 to 12. In contrast, the one-standard-error rule concentrates on $k = 2$ (83 times) and $k = 3$ (15 times) — a much more stable selection.

Number of times each $k$ is selected over 100 repetitions of 10-fold CV:

| $k$          | 1 | 2  | 3  | 4 | 5  | 6 | 7  | 8  | 9  | 10 | 11 | 12 | 13 |
|--------------|---|----|----|---|----|---|----|----|----|----|----|----|----|
| argmin of CV | 0 | 5  | 8  | 1 | 10 | 1 | 11 | 20 | 11 | 12 | 18 | 3  | 0  |
| 1-SE rule    | 0 | 83 | 15 | 2 | 0  | 0 | 0  | 0  | 0  | 0  | 0  | 0  | 0  |

After selecting $\hat{k} = 3$, the final model $\hat{f}_3$ is fitted on all 247 observations. Note that standard linear model inferences (standard errors, confidence intervals) are not valid after model selection — the selection step has been "data-snooping." To estimate the prediction error of the final model, an independent test set is needed.

5. Computational Shortcuts for Linear Smoothers

For estimators whose fitted values are linear in the response — i.e., $\hat{\mathbf{y}} = S\mathbf{Y}$ where $S$ is the smoother (hat) matrix — LOOCV admits a remarkable closed-form identity that avoids refitting $n$ models.

LOOCV Shortcut for Linear Smoothers

For squared loss, the leave-one-out CV score can be computed as:

$$\text{CV}(\hat{f}) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{Y_i - \hat{f}(X_i)}{1 - S_{ii}}\right)^2,$$

where $S_{ii}$ is the $i$-th diagonal element of the smoother matrix $S$. This requires fitting $\hat{f}$ only once on the full dataset.

The intuition is elegant: the residual $Y_i - \hat{f}(X_i)$ from the full fit is "inflated" by the factor $1/(1 - S_{ii})$ to account for the fact that observation $i$ influenced its own prediction. Observations with high leverage $S_{ii}$ get a larger correction.
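
Since ordinary least squares is itself a linear smoother, with $S = X(X^\top X)^{-1}X^\top$, the identity is easy to verify numerically. The sketch below (illustrative, not from the lecture) computes LOOCV both via the shortcut and by explicit refitting; the two agree up to rounding error.

```python
# Sketch: LOOCV for OLS via the hat-matrix identity vs. explicit refitting.
import numpy as np

def loocv_shortcut(X, y):
    H = X @ np.linalg.solve(X.T @ X, X.T)        # smoother (hat) matrix S
    resid = y - H @ y                            # residuals from the single full fit
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def loocv_explicit(X, y):
    n = len(y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errs)

# rng = np.random.default_rng(0); X = rng.normal(size=(50, 3)); y = rng.normal(size=50)
# abs(loocv_shortcut(X, y) - loocv_explicit(X, y))   # differs only by rounding error
```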

5.1 Generalized Cross-Validation (GCV)

Computing all diagonal elements $S_{ii}$ can itself be expensive. Historically, this motivated Generalized Cross-Validation (GCV), which replaces individual leverages with their average:

$$\text{GCV} = \frac{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{f}(X_i))^2}{\left(1 - \frac{1}{n}\text{tr}(S)\right)^2} = \frac{\text{RSS}/n}{\left(1 - \text{df}/n\right)^2},$$

where $\text{df} = \text{tr}(S)$ is the effective degrees of freedom. GCV is closely related to the degrees-of-freedom concept from non-parametric regression and performs comparably to LOOCV in practice.
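
A corresponding sketch for GCV, assuming the smoother matrix $S$ has been formed explicitly (the helper name `gcv` is illustrative):

```python
# Sketch: GCV for a linear smoother given its smoother matrix S and response y.
import numpy as np

def gcv(S, y):
    n = len(y)
    rss = np.sum((y - S @ y) ** 2)       # residual sum of squares of the full fit
    df = np.trace(S)                     # effective degrees of freedom tr(S)
    return (rss / n) / (1.0 - df / n) ** 2
```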

6. Practical Comments on CV, $C_p$, AIC, and BIC

7. Introduction to the Bootstrap

The lecture concludes by motivating Chapter 6. The bootstrap, introduced by Bradley Efron in 1979, is one of the most influential ideas in modern statistics. It provides a general computational approach to problems where analytical solutions are unavailable.

7.1 The Problem

Suppose we observe i.i.d. data $Z_1, \ldots, Z_n \sim P$, where $P$ is unknown, and we compute a statistic:

$$\hat{\theta}_n = g(Z_1, \ldots, Z_n).$$

Our goal is to approximate the distribution of $\hat{\theta}_n$ (or of some transformation like a studentized version). This distribution tells us about the uncertainty of our estimate and allows us to build confidence intervals and perform hypothesis tests.

7.2 Why Do We Need the Bootstrap?

In simple cases, exact or asymptotic distributions are available. For instance, if $\hat{\theta}_n = \bar{Z}_n$ is the sample mean, the Central Limit Theorem gives $\sqrt{n}(\bar{Z}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. In the Gaussian linear model, we have exact pivots for regression coefficients via $t$-distributions.

But in many practical situations: exact distributions are unavailable, asymptotic variances are hard to derive, asymptotic approximations may be poor for moderate $n$, or no useful asymptotic theory exists at all. The bootstrap addresses all of these scenarios.

7.3 The Core Idea

Bootstrap Principle

If the true distribution $P$ were known, we could simulate from it and approximate the distribution of $\hat{\theta}_n$ directly by Monte Carlo. Since $P$ is unknown, the bootstrap replaces $P$ with an estimate $\hat{P}$ (typically the empirical distribution of the observed data) and simulates from $\hat{P}$ instead. This is the essence of the "plug-in" principle: substitute the unknown with an estimate and proceed as if you were in the ideal world.
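
A minimal sketch of this plug-in idea for the nonparametric bootstrap (illustrative; the statistic, the number of resamples $B$, and all names are arbitrary choices): draw $B$ samples of size $n$ with replacement from the observed data, i.e., from the empirical distribution $\hat{P}$, and recompute the statistic on each resample.

```python
# Nonparametric bootstrap sketch: approximate the sampling distribution of a
# statistic by resampling from the empirical distribution of the data.
import numpy as np

def bootstrap_distribution(z, statistic=np.median, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(z)
    # Each bootstrap sample is drawn with replacement from the observed data.
    return np.array([statistic(rng.choice(z, size=n, replace=True))
                     for _ in range(B)])

# Example use: estimate the standard error of the sample median.
# z = np.random.default_rng(1).exponential(size=100)
# se_hat = bootstrap_distribution(z).std(ddof=1)
```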

The bootstrap is useful for constructing confidence intervals, hypothesis testing, and estimating prediction error — topics that will be developed in detail in the coming weeks.

Key Takeaways