Model Selection & Regularization

Course: Introduction to Machine Learning, Lecture 6. Fanny Yang, ETH Zürich, Spring 2026.

This lecture wraps up the bias-variance tradeoff from Lecture 5 and then tackles two central practical questions: how do we choose which model to use, and how do we control model complexity to avoid overfitting? The lecture is organized in three parts: finishing the bias-variance decomposition, data-driven model evaluation and selection (train/test split, cross-validation), and controlling complexity via regularization (ridge, LASSO, and their comparison).

1. Finishing the Bias-Variance Tradeoff

1.1 Recap: Bias and Variance of a Model

Remember that we are studying a learning method $\mathcal{M}$ (not a single trained model). If we were to draw many independent training sets $\mathcal{D}_1, \dots, \mathcal{D}_J$ from the same distribution and train a model on each, we would get different fitted functions $\hat{f}_1, \dots, \hat{f}_J$. The bias and variance capture two fundamentally different sources of error in these estimators.

Definition — Bias and Variance

Given a joint distribution $\mathbb{P}_{X,Y}$ and a learning method $\mathcal{M}$ that produces $\hat{f}_{\mathcal{D}} = \mathcal{M}(\mathcal{D})$:

Squared Bias:

$$\text{Bias}_{\mathcal{D}}^2(\hat{f}_{\mathcal{D}}) := \mathbb{E}_X\!\left[\left(\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(X)] - f^{\star}(X)\right)^2\right]$$

This measures how far the average model (averaged over all possible training sets) is from the true function $f^\star$. It is the systematic error that remains no matter how much data you collect.

Variance:

$$\text{Var}_{\mathcal{D}}(\hat{f}_{\mathcal{D}}) := \mathbb{E}_X\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(\hat{f}_{\mathcal{D}}(X) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(X)]\right)^2\right]\right]$$

This measures how much individual models scatter around the average model. It captures sensitivity to the particular training set drawn.

Both quantities are properties of the method and the data distribution — not of any single model trained on a specific dataset $\mathcal{D}$.

In practice, we can approximate both quantities by sampling $J$ datasets and computing the sample average model $\bar{f}(x) = \frac{1}{J}\sum_{j=1}^{J}\hat{f}_j(x)$. The squared bias is then estimated by $\mathbb{E}_X\!\left[(\bar{f}(X) - f^\star(X))^2\right]$ and the variance by $\mathbb{E}_X\!\left[\frac{1}{J-1}\sum_{j=1}^{J}(\hat{f}_j(X) - \bar{f}(X))^2\right]$, where the outer expectation $\mathbb{E}_X$ is itself approximated by averaging over a grid or sample of $x$-values.
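
These estimators can be simulated directly. The sketch below is a toy setup (the choice $f^\star(x) = \sin(2\pi x)$, degree-3 polynomial fitting as the method $\mathcal{M}$, and all constants are illustrative assumptions, not part of the lecture): it draws $J$ training sets and computes the estimates above on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # true function (toy choice)

J, n, degree, noise = 200, 30, 3, 0.3
grid = np.linspace(0, 1, 200)              # grid of x-values approximating E_X[...]

# Train the method on J independently drawn datasets
preds = np.empty((J, grid.size))
for j in range(J):
    x = rng.uniform(0, 1, n)
    y = f_star(x) + noise * rng.normal(size=n)
    coef = np.polynomial.polynomial.polyfit(x, y, degree)
    preds[j] = np.polynomial.polynomial.polyval(grid, coef)

f_bar = preds.mean(axis=0)                        # average model f_bar(x)
bias2 = np.mean((f_bar - f_star(grid)) ** 2)      # E_X[(f_bar - f*)^2]
var = np.mean(preds.var(axis=0, ddof=1))          # E_X[sample variance over j]
```

Increasing the polynomial degree in this simulation typically decreases `bias2` and increases `var`, which previews the tradeoff discussed next.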

1.2 The Bias-Variance Decomposition

The key result is that the expected generalization error (under squared loss) splits neatly into three terms:

Theorem — Bias-Variance Decomposition
$$\mathbb{E}_{\mathcal{D}}\!\left[L(\hat{f}_{\mathcal{D}};\,\mathbb{P}_{X,Y})\right] = \text{Var}_{\mathcal{D}}(\hat{f}_{\mathcal{D}}) + \text{Bias}_{\mathcal{D}}^2(\hat{f}_{\mathcal{D}}) + \sigma^2$$

where $\sigma^2 = \text{Var}(\varepsilon)$ is the irreducible noise variance.

The intuition: the squared error between the model output $\hat{f}_{\mathcal{D}}(x)$ and the noisy label $y$ can be decomposed by inserting the "average model" $\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]$ and the true function $f^\star(x)$ as intermediate waypoints. The cross-term vanishes because the deviation of the actual model from the average model is independent of the deviation of the average model from the ground truth.

1.3 The Tradeoff

This decomposition makes the bias-variance tradeoff precise:

  • Simple models (e.g. low-degree polynomials) have high bias but low variance: they cannot represent the true function well, but they are stable across different training sets.
  • Complex models (e.g. high-degree polynomials) have low bias but high variance: they can represent $f^\star$, but they are highly sensitive to the particular training set drawn.

The optimal model complexity is the "sweet spot" that minimizes the sum of squared bias and variance (since $\sigma^2$ is constant and cannot be reduced).

Bonus: Double Descent

Classically, the generalization error follows a U-shape as model complexity increases. However, modern research has shown that in overparameterized models (where the number of parameters far exceeds the number of data points), the generalization error can decrease again after the interpolation threshold — a phenomenon called double descent. At the interpolation threshold (where training error first hits zero), the model is maximally "strained" to fit the data, and the error peaks. Beyond that, additional capacity allows the model to find smoother interpolating solutions. This has been observed empirically in neural networks (e.g. on CIFAR-10) and can be proven rigorously for linear regression.

2. Model Evaluation and Selection Using Data

2.1 Why Training Error Is Not Enough

Training error decreases monotonically as model complexity increases (or at least does not increase). A very complex model can achieve zero training error by interpolating the data, but its generalization error may be terrible. Therefore, training error is not a reliable estimate of generalization error for most models, and we need held-out data to assess performance honestly.

2.2 Simple Train–Test Split

The simplest approach: split your full dataset $\mathcal{D}_{\text{full}}$ into a training set $\mathcal{D}$ and a test set $\mathcal{D}_{\text{test}}$. Train on $\mathcal{D}$, evaluate on $\mathcal{D}_{\text{test}}$. This gives an unbiased estimate of generalization error because the test data was not involved in training.

The split ratio involves a tradeoff: a larger training set produces a better model, while a larger test set gives a more precise error estimate. Typical splits are 80/20 or 70/30.
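
With scikit-learn, the split is a single call. A minimal sketch on synthetic data (the arrays and 80/20 ratio are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).normal(size=100)

# 80% training, 20% test; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```

One then fits only on `(X_train, y_train)` and reports the error on `(X_test, y_test)`.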

2.3 The Need for a Validation Set

If you use the test set to choose among several methods (e.g. picking the one with lowest test error), then the test error of the selected model is no longer an unbiased estimate — you've implicitly fitted to the test set by selecting based on it. The solution is a three-way split:

  1. Training set $\mathcal{D}$: used to fit the model.
  2. Validation set $\mathcal{D}_{\text{val}}$: used to choose the best method/hyperparameters.
  3. Test set $\mathcal{D}_{\text{test}}$: used only once at the end to estimate the generalization error of the chosen model.
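
A three-way split can be built from two consecutive calls to `train_test_split`. A 60/20/20 sketch (note the second call takes 25% of the remaining 80% to obtain 20% of the original data; the sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).normal(size=100)

# First carve off the test set (20%), then split the rest into train/validation
X_use, X_test, y_use, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_use, y_use, test_size=0.25, random_state=0)  # 0.25 * 80% = 20% overall
```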

2.4 K-Fold Cross-Validation

The three-way split wastes data — especially problematic when the dataset is small. K-fold cross-validation lets you use all of $\mathcal{D}_{\text{use}} = \mathcal{D}_{\text{full}} \setminus \mathcal{D}_{\text{test}}$ for both training and validation.

Procedure — K-Fold Cross-Validation

Given a method $\mathcal{M}$ (or a set of candidate methods $\mathcal{M}_1, \dots, \mathcal{M}_r$) and $K$ folds:

  1. Partition $\mathcal{D}_{\text{use}}$ into $K$ disjoint subsets $D_1', D_2', \dots, D_K'$ of approximately equal size.
  2. For each fold $k = 1, \dots, K$:
    • Train: fit $\hat{f}_k = \mathcal{M}(\mathcal{D}_{\text{use}} \setminus D_k')$.
    • Validate: compute $L_k = \frac{1}{|D_k'|}\sum_{(x,y) \in D_k'} \ell(\hat{f}_k(x), y)$.
  3. The cross-validation error is $\text{CV}_K(\mathcal{M}) = \frac{1}{K}\sum_{k=1}^{K} L_k$.
  4. Select the method $\mathcal{M}^* = \arg\min_i \text{CV}_K(\mathcal{M}_i)$.
  5. Retrain the selected method on the full $\mathcal{D}_{\text{use}}$: $\hat{f} = \mathcal{M}^*(\mathcal{D}_{\text{use}})$.
  6. Estimate generalization error on $\mathcal{D}_{\text{test}}$.
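
The procedure above maps directly onto scikit-learn's `KFold`. A sketch using ridge regression as the candidate method on synthetic data (the method, data, and $\alpha$ value are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

K = 5
fold_losses = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    f_hat = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])    # step 2: train
    fold_losses.append(mean_squared_error(y[val_idx], f_hat.predict(X[val_idx])))

cv_error = float(np.mean(fold_losses))    # step 3: CV_K(M)
final_model = Ridge(alpha=1.0).fit(X, y)  # step 5: retrain on all of D_use
```

With several candidate methods, one would repeat the loop per method and keep the one with the smallest `cv_error` (step 4), before evaluating once on the held-out test set (step 6).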

Choice of K

Increasing $K$ means each fold's training set is closer in size to $\mathcal{D}_{\text{use}}$, so each sub-model better approximates the full model (good). But each validation set becomes smaller, making the per-fold error estimate noisier (bad). In practice, $K = 5$ or $K = 10$ are popular choices. The extreme case $K = |\mathcal{D}_{\text{use}}|$ is called Leave-One-Out Cross-Validation (LOOCV), which has minimal bias but can be computationally expensive and high-variance.

3. Controlling Model Complexity via Regularization

So far we've seen two "knobs" for controlling complexity: changing the feature map (e.g. polynomial degree) and splitting data for evaluation. Regularization gives us a third, more fine-grained knob: adding a penalty term to the loss function that discourages overly complex solutions.

3.1 Ridge Regression ($\ell_2$ Penalty)

Definition — Ridge Regression

Ridge regression adds a squared $\ell_2$-norm penalty to the ordinary least squares objective:

$$\hat{w}_\lambda = \arg\min_{w \in \mathbb{R}^d} \left\{ \|y - \Phi w\|_2^2 + \lambda \|w\|_2^2 \right\}$$

This is equivalent to the constrained form $\min \|y - \Phi w\|_2^2$ subject to $\|w\|_2 \leq R$ for some radius $R$ that depends on $\lambda$.

A major advantage of ridge regression is that it has a closed-form solution:

$$\hat{w}_\lambda = \left(\Phi^\top \Phi + \lambda I_d\right)^{-1} \Phi^\top y$$

Notice that even when $\Phi^\top \Phi$ is singular (e.g. when $d > n$), adding $\lambda I_d$ makes the matrix invertible. This is one of the practical benefits of ridge: it stabilizes the solution in ill-conditioned or overparameterized settings.
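
The closed form is two lines of NumPy. The sketch below checks it against `sklearn.linear_model.Ridge`, which minimizes the same objective when `fit_intercept=False`, in an overparameterized $d > n$ setting (the data is synthetic and illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.5            # d > n: X^T X is singular, but ridge still works
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# (X^T X + lambda I)^{-1} X^T y, via a linear solve rather than an explicit inverse
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
```

Using `np.linalg.solve` instead of forming the inverse is both faster and numerically more stable.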

Effect on Bias and Variance

Using the SVD $\Phi = U\Sigma V^\top$ with singular values $s_1, \dots, s_d$ (written $s_i$ to avoid a clash with the noise variance $\sigma^2$), one can derive exact expressions for the bias and variance of the ridge estimator. Assuming $\mathbb{E}[X] = 0$ and $\text{Cov}(X) = I_d$:

$$\text{Bias}^2 = \left\| V \,\text{diag}\!\left(\frac{\lambda}{s_i^2 + \lambda}\right) V^\top w^* \right\|_2^2, \qquad \text{Var} = \sigma^2 \sum_{i=1}^{d} \frac{s_i^2}{(s_i^2 + \lambda)^2}$$

The key takeaway: as $\lambda$ increases, the bias grows (we're pulling $\hat{w}$ toward zero, away from the true $w^*$), while the variance shrinks (we reduce sensitivity to noise). At $\lambda = 0$ we recover ordinary least squares (zero bias from the model, but potentially huge variance). As $\lambda \to \infty$, the estimator collapses to the zero vector (maximum bias, zero variance).
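
A quick numerical check of these expressions (plugging a random design matrix and true $w^*$ into the formulas; all variable names and constants are illustrative) confirms the monotone behavior in $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma_noise = 100, 10, 1.0
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s holds the singular values

def bias2(lam):
    shrink = lam / (s**2 + lam)                    # diag(lambda / (s_i^2 + lambda))
    return float(np.sum((Vt.T @ (shrink * (Vt @ w_star))) ** 2))

def var(lam):
    return float(sigma_noise**2 * np.sum(s**2 / (s**2 + lam)**2))

lams = [0.0, 1.0, 10.0, 100.0]
biases = [bias2(l) for l in lams]       # nondecreasing in lambda (0 at lambda = 0)
variances = [var(l) for l in lams]      # nonincreasing in lambda
```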

3.2 LASSO Regression ($\ell_1$ Penalty)

Definition — LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) uses an $\ell_1$-norm penalty:

$$\hat{w}_\lambda = \arg\min_{w \in \mathbb{R}^d} \left\{ \|y - \Phi w\|_2^2 + \lambda \|w\|_1 \right\}$$

where $\|w\|_1 = \sum_{j=1}^d |w_j|$.

Like ridge, LASSO also controls model complexity by penalizing large weights. However, LASSO has a crucial additional property: it induces sparsity — many coefficients are driven to exactly zero. This means LASSO performs automatic feature selection.
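
Sparsity is easy to observe empirically. A sketch with `sklearn.linear_model.Lasso` on synthetic data where only 3 of 20 features matter (the setup and `alpha` value are illustrative; note that sklearn's Lasso scales the squared loss by $1/(2n)$, so its `alpha` is not on exactly the same scale as the lecture's $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]           # only the first 3 features are relevant
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))   # far fewer than d coefficients survive
```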

3.3 Why Does LASSO Produce Sparse Solutions?

There are two complementary intuitions for why the $\ell_1$ penalty, but not the $\ell_2$ penalty, drives coefficients to exactly zero.

Intuition 1: Via the Closed-Form (Orthonormal Case)

In the simplified case where $X^\top X = I$ (orthonormal features), the solutions have elegant closed forms in terms of the ordinary least squares solution $\hat{w}^{\text{OLS}} = X^\top y$. For ridge, each coefficient is simply scaled: $\hat{w}_j^{\text{ridge}} = \frac{1}{1+\lambda} \hat{w}_j^{\text{OLS}}$. Every coefficient shrinks, but none reaches zero. For LASSO, the solution is the soft-thresholding operator:

$$\hat{w}_j^{\text{lasso}} = \text{sign}(\hat{w}_j^{\text{OLS}}) \cdot \max(0,\; |\hat{w}_j^{\text{OLS}}| - \lambda)$$

This explicitly sets coefficient $j$ to zero whenever the magnitude of $\hat{w}_j^{\text{OLS}}$ falls below $\lambda$. Ridge merely shrinks towards zero with a multiplicative factor; LASSO performs a "hard cut" with an additive threshold.
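
The soft-thresholding operator is one line of NumPy. A sketch contrasting it with ridge's multiplicative shrinkage (the example vector is arbitrary):

```python
import numpy as np

def soft_threshold(w, lam):
    # sign(w) * max(0, |w| - lam): shrink by lam, cut to exactly zero below it
    return np.sign(w) * np.maximum(0.0, np.abs(w) - lam)

w = np.array([3.0, -0.5, 1.2, 0.05, -2.0])
lasso_like = soft_threshold(w, 1.0)    # [2.0, 0.0, 0.2, 0.0, -1.0]: exact zeros
ridge_like = w / (1.0 + 1.0)           # every entry shrinks, none becomes zero
```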

Intuition 2: Geometric View

Think of the constrained optimization: we want to minimize $\|y - Xw\|_2^2$ subject to $\|w\|_1 \leq R$ (LASSO) or $\|w\|_2 \leq R$ (ridge). The constraint sets have very different shapes: the $\ell_1$-ball is a diamond (a cross-polytope whose corners lie on the coordinate axes), while the $\ell_2$-ball is perfectly round.

The solution is where the elliptical contour lines of the loss function first touch the constraint set. For the $\ell_1$-ball, this contact typically happens at one of the corners, which correspond to sparse vectors (many coordinates equal to zero). For the smooth $\ell_2$-ball, the contact point generally has all coordinates nonzero.

Intuition 3: Via Norm Comparison

Consider two vectors in $\mathbb{R}^d$: a dense vector $w_{\text{dense}} = \frac{1}{\sqrt{d}}(1,1,\dots,1)$ and a sparse vector $w_{\text{sparse}} = (1,0,\dots,0)$. Both have the same $\ell_2$-norm ($= 1$), so ridge cannot distinguish between them. But their $\ell_1$-norms differ dramatically: $\|w_{\text{dense}}\|_1 = \sqrt{d}$ while $\|w_{\text{sparse}}\|_1 = 1$. So LASSO strongly prefers the sparse solution.
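
Checking the numbers for, say, $d = 100$ (a trivial sketch):

```python
import numpy as np

d = 100
w_dense = np.ones(d) / np.sqrt(d)          # l2-norm 1, mass spread over all coordinates
w_sparse = np.zeros(d); w_sparse[0] = 1.0  # l2-norm 1, all mass on one coordinate

l2_dense, l2_sparse = np.linalg.norm(w_dense), np.linalg.norm(w_sparse)    # both 1.0
l1_dense, l1_sparse = np.linalg.norm(w_dense, 1), np.linalg.norm(w_sparse, 1)
# l1_dense = sqrt(d) = 10.0 vs. l1_sparse = 1.0: the l1 penalty is 10x larger
```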

Example — Polynomial Regression with Regularization

Consider $f^\star(x) = -x + 3x^2 + x^5$ with $n = 35$ noisy samples, and we fit polynomials of degree $m = 12$ (i.e., 13 parameters). Without regularization, the model overfits. With LASSO ($\lambda = 0.005$), the fitted polynomial effectively becomes degree 5 — the coefficients for degrees 6–12 are driven to exactly zero, matching the true polynomial's structure. Ridge with the same $\lambda$ shrinks all 13 coefficients but keeps all of them nonzero, yielding a full degree-12 polynomial (that nonetheless fits well because all coefficients are small).
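
A sketch reconstructing this experiment (the seed, noise level, sampling interval, and exact sparsity pattern are assumptions and will vary; sklearn's Lasso scales the squared loss by $1/(2n)$, so its `alpha` is not on exactly the same scale as the lecture's $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n, degree = 35, 12
x = rng.uniform(-1, 1, size=(n, 1))
y = (-x + 3 * x**2 + x**5).ravel() + 0.1 * rng.normal(size=n)

Phi = PolynomialFeatures(degree=degree).fit_transform(x)    # 13 monomial features
lasso = Lasso(alpha=0.005, fit_intercept=False, max_iter=100_000).fit(Phi, y)
high_degree = lasso.coef_[6:]   # coefficients of x^6 ... x^12, mostly driven to zero
```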

3.4 Comparing Ridge and LASSO

|  | Ridge ($\ell_2$) | LASSO ($\ell_1$) |
| --- | --- | --- |
| Penalty | $\lambda \Vert w \Vert_2^2$ | $\lambda \Vert w \Vert_1$ |
| Sparsity | No: shrinks all coefficients toward zero | Yes: drives many coefficients to exactly zero |
| Feature selection | No | Yes (implicit) |
| Closed-form solution | Yes: $(X^\top X + \lambda I)^{-1} X^\top y$ | No (requires iterative optimization) |
| Differentiable | Yes | No (at $w_j = 0$) |
| Effect of $\lambda \uparrow$ | Bias $\uparrow$, Variance $\downarrow$ | Bias $\uparrow$, Variance $\downarrow$ |

3.5 Choosing the Regularization Parameter $\lambda$

The regularization strength $\lambda$ is a hyperparameter — it is not learned from the training data directly. We choose it using cross-validation: for each candidate value of $\lambda$ (typically on a logarithmic grid, e.g. $\{10^{-3}, 10^{-2}, \dots, 10^2\}$), run $K$-fold CV and compute the cross-validation error. Then select the $\lambda$ with the lowest CV error.

Algorithm — CV for Hyperparameter Tuning

Given a grid $\Lambda$ of candidate $\lambda$ values and $K$ folds:

  1. For each $\lambda \in \Lambda$:
    • For each fold $k$: train $\hat{w}_k^\lambda$ on $\mathcal{D}_{\text{use}} \setminus D_k'$ (with regularization $\lambda$), compute validation loss $L_k(\lambda)$ on $D_k'$.
    • Compute $\text{CV}_K(\lambda) = \frac{1}{K}\sum_{k=1}^K L_k(\lambda)$.
  2. Select $\lambda^* = \arg\min_{\lambda \in \Lambda} \text{CV}_K(\lambda)$.
  3. Retrain on all of $\mathcal{D}_{\text{use}}$ with $\lambda^*$.

In Python with scikit-learn:

import sklearn.linear_model
from sklearn.model_selection import cross_val_score

cv_error = {}
for lambda_value in lambda_grid:
    ridge_reg = sklearn.linear_model.Ridge(alpha=lambda_value)
    scores = cross_val_score(ridge_reg, X, Y,
                             scoring="neg_mean_squared_error", cv=K)
    cv_error[lambda_value] = -scores.mean()

Note that cross_val_score returns negated mean squared errors (higher is better), so the sign is flipped to obtain the CV error for each $\lambda$; the best value is then the one minimizing cv_error.

Key Takeaways