This week continues Chapter 5: Model Assessment and Selection, diving deeply into cross-validation — the most widely used method in machine learning for estimating prediction error and selecting tuning parameters. We cover the $K$-fold cross-validation algorithm, the bias–variance tradeoff in the choice of $K$, the one-standard-error rule, and computational shortcuts for linear smoothers. The lecture concludes by introducing the bootstrap (Chapter 6), one of the most influential ideas in modern statistics.
Recap from Week 6
Last week wrapped up Chapter 4 (non-parametric regression) and began Chapter 5. The key ideas carried forward are:
- The four paradigms of non-parametric regression are now complete: local averaging, local modelling, global modelling, and penalized modelling.
- We defined different types of error for a fitted model $\hat{f}$: training error, conditional (extra-sample) error, generalization error, and in-sample error.
- The optimism of the training error was derived, leading to Mallows' $C_p$ and AIC as bias-corrected alternatives to raw training error for model comparison.
1. Training, Validation, and Test Sets
Before discussing cross-validation, it is useful to recall the simplest data-splitting strategy. We have two distinct goals when building models:
- Model selection: find the best model among a set of candidates.
- Model assessment: estimate the prediction error of the final chosen model.
The hold-out method addresses both goals by splitting the dataset into three disjoint parts:
- A training set used to fit all candidate models.
- A validation set used to select the best model from the candidates.
- A test set used to estimate the prediction error of the final model.
There is no theory for optimal split proportions (a common rule of thumb is 50%/25%/25%). The hold-out method has two main drawbacks: the results depend on the particular random split, and each model is trained on only part of the data, which hurts predictive performance. Cross-validation addresses some of these drawbacks by eliminating the need for a separate validation set.
2. Cross-Validation
Cross-validation is a resampling technique that estimates the expected extra-sample (generalization) prediction error of a model $\hat{f}$, without requiring a separate validation set.
2.1 The $K$-Fold Cross-Validation Algorithm
Given data $\{(x_i, y_i)\}_{i=1}^n$ and a loss function $L$:
- Partition the index set $\{1, \ldots, n\}$ into $K$ approximately equal-sized folds $B_1, \ldots, B_K$.
- For each $k = 1, \ldots, K$, fit the model on all data except fold $B_k$. Call the resulting estimator $\hat{f}^{(-B_k)}$.
- Evaluate on the held-out fold $B_k$.
The CV estimate of prediction error is:
$$\text{CV}(\hat{f}) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in B_k} L\big(y_i, \hat{f}^{(-B_k)}(x_i)\big).$$
The key intuition is that each $\hat{f}^{(-B_k)}$ is built without the data in $B_k$, so the loss evaluated on $B_k$ approximates out-of-sample error. The special case $K = n$ is called leave-one-out cross-validation (LOOCV), where each observation serves as its own test set.
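The algorithm above can be sketched in a few lines of Python (an illustrative implementation, not from the lecture; the helper names `fit`, `predict`, and `loss` are placeholders for any model class and loss function):

```python
import numpy as np

def kfold_cv(X, y, fit, predict, loss, K=10, seed=0):
    """Estimate prediction error by K-fold cross-validation.

    fit(X, y) -> model, predict(model, X) -> predictions,
    loss(y, yhat) -> per-observation losses.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)      # B_1, ..., B_K
    errors = []
    for B_k in folds:
        train = np.setdiff1d(np.arange(n), B_k)        # all indices except B_k
        model = fit(X[train], y[train])                # f^(-B_k)
        errors.append(loss(y[B_k], predict(model, X[B_k])))
    return np.mean(np.concatenate(errors))             # average loss over all i

# Toy example: least-squares line fit under squared loss
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + rng.normal(size=100)
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(y)), X], y, rcond=None)[0]
predict = lambda beta, X: np.c_[np.ones(len(X)), X] @ beta
cv_err = kfold_cv(X, y, fit, predict, lambda y, p: (y - p) ** 2, K=10)
```

Since the noise variance in the toy example is 1, `cv_err` should come out close to 1.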
2.2 Cross-Validation for Model Selection
Suppose we have a collection of models $\{\hat{f}_\lambda\}_{\lambda \in \Lambda}$ indexed by a tuning parameter $\lambda$ (e.g., number of features in subset selection, bandwidth in kernel regression, or penalty in smoothing splines). Cross-validation selects the best $\lambda$ as follows:
- For each $\lambda \in \Lambda$, compute $\text{CV}(\hat{f}_\lambda)$ using $K$-fold CV.
- Select $\hat{\lambda} = \arg\min_{\lambda \in \Lambda} \text{CV}(\hat{f}_\lambda)$.
- The final model is $\hat{f}_{\hat{\lambda}}$, refitted on all training data.
After selecting $\hat{\lambda}$ via CV, you must refit the model on the entire dataset to get the final model. Cross-validation is only used for selection; the final model should exploit all available training data. A separate, independent test set is still needed to estimate the error of this final model.
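As a concrete sketch (illustrative, not from the lecture), here is the selection loop for ridge regression, where the tuning parameter $\lambda$ is the penalty; the grid and the helper `cv_error` are made up for the example:

```python
import numpy as np

def cv_error(X, y, lam, K=5, seed=0):
    """K-fold CV error of ridge regression with penalty lam (hypothetical helper)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for B_k in folds:
        tr = np.setdiff1d(np.arange(n), B_k)
        # ridge fit on the training folds only
        beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                               X[tr].T @ y[tr])
        errs.append((y[B_k] - X[B_k] @ beta) ** 2)
    return np.mean(np.concatenate(errs))

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X @ np.r_[1.5, np.zeros(9)] + rng.normal(size=80)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_hat = min(grid, key=lambda lam: cv_error(X, y, lam))   # argmin over the grid
# refit on ALL data with the selected penalty
beta_final = np.linalg.solve(X.T @ X + lam_hat * np.eye(10), X.T @ y)
```

Note that the same fold split (same `seed`) is reused for every $\lambda$, so the candidate models are compared on identical data partitions.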
3. Choosing $K$: The Bias–Variance Tradeoff
The number of folds $K$ involves yet another bias–variance tradeoff — this time in the quality of the CV error estimate itself:
- Small $K$ (e.g., $K = 2$): Each model $\hat{f}^{(-B_k)}$ is trained on only $n(1 - 1/K)$ observations — substantially fewer than $n$. Because these models use less data, they tend to have higher prediction error. As a result, $\text{CV}(\hat{f})$ overestimates the true error (positive bias).
- Large $K$ (e.g., $K = n$, i.e., LOOCV): The training sets overlap heavily — each pair of training sets shares $n - 2$ observations. This makes the fold-level error estimates $\text{CV}_k$ highly correlated, which increases the variance of $\text{CV}(\hat{f})$.
There is no universally optimal choice of $K$. In practice, $K = 5$ or $K = 10$ are the most commonly used values, balancing bias and variance. Even though CV will typically overestimate the error in absolute terms, the shape of the $\text{CV}(\hat{f}_\lambda)$ curve as a function of $\lambda$ tends to be similar to the shape of the true error curve, so CV still picks a good $\hat\lambda$.
$K$-fold CV requires fitting $K$ models. For LOOCV ($K = n$), this means $n$ model fits — which can be very expensive unless computational shortcuts are available (see Section 5).
4. The One-Standard-Error Rule
Cross-validation curves are often relatively flat near their minimum. This means many models with different complexity levels have nearly the same estimated error. To favor simpler, more interpretable models, we use the one-standard-error rule.
Select the most parsimonious (simplest) model whose CV error is no more than one standard error above the minimum CV error.
Estimating the Standard Error of CV
Assuming equal-sized folds $|B_k| = n/K$, define the per-fold error estimate:
$$\text{CV}_k = \frac{K}{n} \sum_{i \in B_k} L\big(y_i, \hat{f}^{(-B_k)}(x_i)\big),$$
so that $\text{CV}(\hat{f}) = \frac{1}{K}(\text{CV}_1 + \cdots + \text{CV}_K)$. The standard error is estimated by:
$$\widehat{\text{SE}}\big(\text{CV}(\hat{f})\big) = \sqrt{\frac{1}{K(K-1)} \sum_{k=1}^{K} \big(\text{CV}_k - \text{CV}(\hat{f})\big)^2}.$$
This is only a heuristic standard error. The fold-level estimates $\text{CV}_k$ are not independent (the training sets overlap), so this formula only gives an approximate standard deviation. No unbiased estimator of the variance of $K$-fold CV is known (Bengio & Grandvalet, 2004).
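The rule is straightforward to implement once the per-fold errors $\text{CV}_k$ are stored. A minimal sketch (illustrative; the function name and the data layout are my own):

```python
import numpy as np

def one_se_rule(cv_by_fold):
    """Select a model by the one-standard-error rule.

    cv_by_fold: dict mapping complexity k -> list of K fold errors CV_k,
    with smaller k meaning a simpler model.
    """
    ks = sorted(cv_by_fold)
    means = {k: np.mean(cv_by_fold[k]) for k in ks}
    ses = {k: np.std(cv_by_fold[k], ddof=1) / np.sqrt(len(cv_by_fold[k]))
           for k in ks}
    k_min = min(ks, key=means.get)                 # argmin of the CV curve
    threshold = means[k_min] + ses[k_min]          # min CV error + one SE
    # simplest model whose CV error is within one SE of the minimum
    return next(k for k in ks if means[k] <= threshold)

# Example with K = 3 folds and three candidate complexities
cv_by_fold = {1: [5.0, 5.2, 4.8], 2: [1.0, 1.1, 0.9], 3: [0.95, 1.00, 1.05]}
k_hat = one_se_rule(cv_by_fold)    # picks 2: simplest model within one SE
```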
Using 10-fold CV for best subset selection on the body fat dataset (247 observations, 13 predictors), the CV error curve is plotted as a function of the number of features $k$. The minimum tends to occur around $k = 7$ or $k = 8$, but applying the one-standard-error rule selects $\hat{k} = 3$, giving a much simpler model using only weight, abdomen, and wrist.
To show the instability of the argmin approach: repeating 10-fold CV 100 times with different random folds shows the argmin selects values of $k$ ranging widely from 2 to 12. In contrast, the one-standard-error rule concentrates on $k = 2$ (83 times) and $k = 3$ (15 times) — a much more stable selection.
| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| argmin of CV | 0 | 5 | 8 | 11 | 0 | 1 | 11 | 20 | 11 | 12 | 18 | 3 | 0 |
| 1-SE rule | 0 | 83 | 15 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
After selecting $\hat{k} = 3$, the final model $\hat{f}_3$ is fitted on all 247 observations. Note that standard linear model inferences (standard errors, confidence intervals) are not valid after model selection — the selection step has been "data-snooping." To estimate the prediction error of the final model, an independent test set is needed.
5. Computational Shortcuts for Linear Smoothers
For estimators whose fitted values are linear in the response — i.e., $\hat{\mathbf{Y}} = S\mathbf{Y}$ where $S$ is the smoother (hat) matrix — LOOCV admits a remarkable closed-form identity that avoids refitting $n$ models.
For squared loss, the leave-one-out CV score can be computed as:
$$\text{CV}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{f}(X_i)}{1 - S_{ii}} \right)^2,$$
where $S_{ii}$ is the $i$-th diagonal element of the smoother matrix $S$. This requires fitting $\hat{f}$ only once on the full dataset.
The intuition is elegant: the residual $Y_i - \hat{f}(X_i)$ from the full fit is "inflated" by the factor $1/(1 - S_{ii})$ to account for the fact that observation $i$ influenced its own prediction. Observations with high leverage $S_{ii}$ get a larger correction.
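Because the identity is exact for linear smoothers, it can be checked numerically. The sketch below (illustrative, assuming squared loss and an OLS fit, whose hat matrix is $S = X(X^\top X)^{-1}X^\top$) compares the shortcut against brute-force refitting:

```python
import numpy as np

# Verify the LOOCV identity for ordinary least squares, a linear smoother.
rng = np.random.default_rng(3)
n = 50
X = np.c_[np.ones(n), rng.normal(size=(n, 2))]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

S = X @ np.linalg.solve(X.T @ X, X.T)            # smoother (hat) matrix
yhat = S @ y
# closed-form LOOCV: inflate each residual by 1 / (1 - S_ii)
shortcut = np.mean(((y - yhat) / (1 - np.diag(S))) ** 2)

# brute force: refit n times, each time leaving out one observation
brute = 0.0
for i in range(n):
    idx = np.r_[0:i, i + 1:n]
    beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    brute += (y[i] - X[i] @ beta) ** 2
brute /= n
# the two quantities agree up to floating-point rounding
```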
5.1 Generalized Cross-Validation (GCV)
Computing all diagonal elements $S_{ii}$ can itself be expensive. Historically, this motivated Generalized Cross-Validation (GCV), which replaces each leverage $S_{ii}$ with the average $\text{tr}(S)/n$:
$$\text{GCV}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{f}(X_i)}{1 - \text{tr}(S)/n} \right)^2,$$
where $\text{df} = \text{tr}(S)$ is the effective degrees of freedom. GCV is closely related to the degrees-of-freedom concept from non-parametric regression and performs comparably to LOOCV in practice.
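A small sketch of the GCV computation for a ridge-type smoother (illustrative; the design matrix and penalty are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.c_[np.ones(n), rng.normal(size=(n, 4))]
y = rng.normal(size=n)
lam = 1.0

# ridge smoother matrix: yhat = S y with S = X (X'X + lam I)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
resid = y - S @ y
df = np.trace(S)                                  # effective degrees of freedom
gcv = np.mean((resid / (1 - df / n)) ** 2)        # one common denominator
loocv = np.mean((resid / (1 - np.diag(S))) ** 2)  # exact shortcut, for comparison
```

GCV needs only $\text{tr}(S)$, which is often available cheaply (e.g., from eigenvalues), while the exact shortcut needs every $S_{ii}$.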
6. Practical Comments on CV, $C_p$, AIC, and BIC
- $C_p$, AIC, and BIC are easy to compute when the degrees of freedom $\text{df}$ is known or easy to estimate (e.g., for linear smoothers), but require specific model assumptions (e.g., AIC/BIC need a likelihood).
- Cross-validation is the most widely used method in machine learning because it makes no distributional assumptions and works for any model class.
- CV must mimic the full modelling process: never touch the test fold $B_k$ when constructing $\hat{f}^{(-B_k)}$. Any preprocessing, feature selection, or standardisation must be done inside each training fold.
- Random fold splits are not always appropriate — for example, in time-series forecasting you should use sequential or rolling splits.
- All of these methods ($C_p$, AIC, BIC, CV) avoid the need for a dedicated validation set, but a test set is still required to estimate the error of the final selected model.
- Many aspects of CV are based on rules of thumb rather than rigorous theory. The one-standard-error rule is probably the best practical heuristic for improving stability.
7. Introduction to the Bootstrap
The lecture concludes by motivating Chapter 6. The bootstrap, introduced by Bradley Efron in 1979, is one of the most influential ideas in modern statistics. It provides a general computational approach to problems where analytical solutions are unavailable.
7.1 The Problem
Suppose we observe i.i.d. data $Z_1, \ldots, Z_n \sim P$, where $P$ is unknown, and we compute a statistic:
$$\hat{\theta}_n = g(Z_1, \ldots, Z_n).$$
Our goal is to approximate the distribution of $\hat{\theta}_n$ (or of some transformation like a studentized version). This distribution tells us about the uncertainty of our estimate and allows us to build confidence intervals and perform hypothesis tests.
7.2 Why Do We Need the Bootstrap?
In simple cases, exact or asymptotic distributions are available. For instance, if $\hat{\theta}_n = \bar{Z}_n$, the Central Limit Theorem gives $\sqrt{n}(\bar{Z}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. In the Gaussian linear model, we have exact pivots for regression coefficients via $t$-distributions.
But in many practical situations: exact distributions are unavailable, asymptotic variances are hard to derive, asymptotic approximations may be poor for moderate $n$, or no useful asymptotic theory exists at all. The bootstrap addresses all of these scenarios.
7.3 The Core Idea
If the true distribution $P$ were known, we could simulate from it and approximate the distribution of $\hat{\theta}_n$ directly by Monte Carlo. Since $P$ is unknown, the bootstrap replaces $P$ with an estimate $\hat{P}$ (typically the empirical distribution of the observed data) and simulates from $\hat{P}$ instead. This is the essence of the "plug-in" principle: substitute the unknown with an estimate and proceed as if you were in the ideal world.
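A minimal illustration of the plug-in principle (not from the lecture): bootstrap the sampling distribution of the median, a statistic with no simple variance formula, by resampling from the empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.exponential(scale=2.0, size=100)   # observed sample from unknown P
theta_hat = np.median(z)                   # the statistic of interest

# draw B bootstrap samples from P-hat (the empirical distribution of z),
# i.e., resample z with replacement, and recompute the statistic each time
B = 2000
boot = np.array([np.median(rng.choice(z, size=len(z), replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)                 # bootstrap standard error
ci = np.percentile(boot, [2.5, 97.5])      # simple percentile interval
```

The array `boot` approximates the sampling distribution of $\hat{\theta}_n$; its spread quantifies the uncertainty of the estimate without any analytical derivation.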
The bootstrap is useful for constructing confidence intervals, hypothesis testing, and estimating prediction error — topics that will be developed in detail in the coming weeks.
Key Takeaways
- $K$-fold CV partitions data into $K$ folds, trains on $K-1$ folds, evaluates on the held-out fold, and averages the errors. It is the most broadly applicable method for model selection.
- Choosing $K$ involves a bias–variance tradeoff: small $K$ overestimates error (bias), large $K$ increases variance due to correlated folds. Use $K = 5$ or $K = 10$ in practice.
- The one-standard-error rule improves stability by selecting the simplest model within one SE of the minimum — it consistently picks simpler models than the argmin.
- For linear smoothers, LOOCV has a closed-form shortcut using the hat matrix diagonals $S_{ii}$, and GCV approximates this using only $\text{tr}(S)$.
- The bootstrap replaces the unknown distribution $P$ with an empirical estimate $\hat{P}$ and simulates from it — a powerful, general-purpose tool for uncertainty quantification when analytical methods are intractable.