Lecture 2: Theory for the Lasso — Prediction

High-Dimensional Statistics, Lecture 2. Based on Bühlmann & van de Geer (2011), Ch. 6.

This lecture develops the theoretical foundations for understanding why the Lasso works well for prediction in high-dimensional linear models where $p \gg n$. We start from a simple but powerful algebraic trick — the Basic Inequality — and progressively build up to the main result: an oracle inequality showing that the Lasso's prediction error is nearly as good as if we had known the relevant variables in advance. Along the way, we encounter the key ideas of controlling the "noise term" via the set $\mathcal{T}$, and the compatibility condition, which captures what the design matrix $X$ needs to satisfy.

1. The Setup: High-Dimensional Linear Model & the Lasso

We work with the standard linear model in matrix form:

$$Y = X\beta^0 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$$

Here $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix (fixed), and $\beta^0 \in \mathbb{R}^p$ is the true (unknown) parameter vector. Crucially, we allow $p \gg n$: many more variables than observations.

When $p > n$, the ordinary least squares (OLS) estimator is not unique and will heavily overfit. The Lasso regularises by adding an $\ell_1$-penalty:

Definition — The Lasso Estimator

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \frac{\|Y - X\beta\|_2^2}{n} + \lambda \|\beta\|_1 \right\}$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ and $\lambda \geq 0$ is the regularisation (tuning) parameter.

The $\ell_1$-penalty does two things simultaneously: it shrinks coefficients (pulling them toward zero) and selects variables (setting some coefficients to exactly zero). This is the origin of the name: Least Absolute Shrinkage and Selection Operator.
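As a concrete illustration (not from the lecture), the estimator can be run with scikit-learn's `Lasso`. Note that scikit-learn minimises $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so $\alpha = \lambda/2$ in our scaling; the dimensions and signal strength below are arbitrary demo choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s0 = 100, 500, 5                     # p >> n, sparse truth
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 3.0                           # five active variables
y = X @ beta0 + rng.standard_normal(n)

lam = 2 * np.sqrt(2 * np.log(p) / n)       # the theoretical scale sqrt(log(p)/n)
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)

print("non-zero coefficients:", np.sum(fit.coef_ != 0), "out of", p)
```

Most of the 500 estimated coefficients come out exactly zero: shrinkage and selection at once.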

1.1 Why $\ell_1$ gives sparsity — geometric intuition

The Lasso is equivalent to the constrained problem: minimise $\|Y - X\beta\|_2^2/n$ subject to $\|\beta\|_1 \leq R$ (with a data-dependent mapping between $\lambda$ and $R$). Geometrically, the $\ell_1$-ball in $\mathbb{R}^p$ has corners that lie on the coordinate axes. The contour lines of the squared loss will typically first touch the $\ell_1$-ball at one of these corners, which corresponds to some coefficients being exactly zero.

Compare this with Ridge regression, which uses the $\ell_2$-penalty $\|\beta\|_2^2$. The $\ell_2$-ball is smooth (a sphere) — its contours have no corners — so Ridge shrinks coefficients toward zero but never sets them exactly to zero.

Example — Orthonormal Design (Soft-Thresholding)

When the design is orthonormal ($X^TX/n = I$), the Lasso has an explicit closed-form solution: $$\hat{\beta}_j(\lambda) = \text{sign}(Z_j)(|Z_j| - \lambda/2)_+, \quad Z_j = (X^TY)_j/n = \hat{\beta}_{OLS,j}$$ This is the soft-thresholding operator $g_{\lambda/2}(z) = \text{sign}(z)(|z| - \lambda/2)_+$ applied to the OLS estimates: it shrinks them toward zero and sets small ones exactly to zero.

Intuition: if the OLS estimate for a variable is too small (below the threshold $\lambda/2$), the Lasso concludes the signal is indistinguishable from noise and kills it. Coefficients above the threshold survive but are shrunk — this introduces bias.
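This closed form is easy to verify numerically. A sketch (my own construction: the design is made orthonormal via a QR decomposition, and `fit_intercept=False` keeps scikit-learn's objective aligned with ours via $\alpha = \lambda/2$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                         # orthonormal design: X^T X / n = I
beta0 = np.array([2.0, -1.5, 0.8, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta0 + 0.5 * rng.standard_normal(n)

lam = 0.6
Z = X.T @ y / n                            # OLS estimates under orthonormality
beta_soft = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2, 0.0)

fit = Lasso(alpha=lam / 2, fit_intercept=False, tol=1e-8).fit(X, y)
print(np.max(np.abs(beta_soft - fit.coef_)))   # agreement up to solver tolerance
```

The small coefficient 0.8 survives (the threshold is $\lambda/2 = 0.3$) but is shrunk; the seven zero coefficients stay exactly zero.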

1.2 Key notation and the active set

We define the active set $S_0 = \{j : \beta^0_j \neq 0\}$ with sparsity index $s_0 = |S_0|$. For a parameter vector $\beta$, we write $\beta_{S_0}$ for the components in $S_0$ and $\beta_{S_0^c}$ for those outside it. The Gram matrix is $\hat{\Sigma} = X^TX/n$.

2. The Basic Inequality — Starting Point for Everything

The entire prediction theory for the Lasso rests on a single, remarkably simple observation. Because $\hat{\beta}$ minimises the Lasso objective, its objective value must be at most that of any competitor — in particular, the true $\beta^0$:

$$\frac{\|Y - X\hat{\beta}\|_2^2}{n} + \lambda\|\hat{\beta}\|_1 \;\leq\; \frac{\|Y - X\beta^0\|_2^2}{n} + \lambda\|\beta^0\|_1$$

Substituting $Y = X\beta^0 + \varepsilon$ and rearranging, we get:

Lemma 6.1 — Basic Inequality

$$\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{n} + \lambda\|\hat{\beta}\|_1 \;\leq\; \frac{2\varepsilon^T X(\hat{\beta} - \beta^0)}{n} + \lambda\|\beta^0\|_1$$

What does this say intuitively? The left side has two "good" terms: the prediction error $\|X(\hat{\beta} - \beta^0)\|_2^2/n$ and the sparsity of $\hat{\beta}$ (via $\|\hat{\beta}\|_1$). The right side has a "bad" random term $2\varepsilon^T X(\hat{\beta} - \beta^0)/n$ (the noise leaking into the estimate) and the sparsity of the truth $\|\beta^0\|_1$. The strategy is: use the penalty $\lambda$ to dominate the random term, so the prediction error can be controlled.

2.1 Bounding the empirical process (random) term

The random part can be bounded using Hölder's inequality: $$\left|\frac{2\varepsilon^T X(\hat{\beta} - \beta^0)}{n}\right| \leq \underbrace{\max_{1 \leq j \leq p} \frac{2|\varepsilon^T X^{(j)}|}{n}}_{\text{maximum correlation of noise with any predictor}} \cdot \|\hat{\beta} - \beta^0\|_1$$

This motivates defining the "good event" $\mathcal{T}$ where the noise is well-behaved:

Definition — The Event $\mathcal{T}$

$$\mathcal{T} = \left\{ \max_{1 \leq j \leq p} \frac{2|\varepsilon^T X^{(j)}|}{n} \leq \lambda_0 \right\}$$

On $\mathcal{T}$, the maximum correlation between the noise $\varepsilon$ and any column of $X$ is bounded by $\lambda_0$. If we pick $\lambda \geq 2\lambda_0$, the penalty "overrules" the noise.

How large is $\lambda_0$? With Gaussian errors and normalised columns ($\hat{\sigma}^2_j = 1$ for all $j$):

Lemma 6.2 — Probability of $\mathcal{T}$

For $\lambda_0 = 2\sigma\sqrt{(t^2 + 2\log p)/n}$, we have $\mathbb{P}(\mathcal{T}) \geq 1 - 2\exp(-t^2/2)$.

Intuition: Each $\varepsilon^T X^{(j)}/n$ is a sum of $n$ random terms, and by the maximum over $p$ variables, we pick up a $\sqrt{\log p}$ factor (a union bound / Gaussian maximum argument). So the natural scale for $\lambda$ is $\lambda \asymp \sqrt{\log(p)/n}$.
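A quick simulation (dimensions chosen arbitrarily) illustrates Lemma 6.2: with $t = 2$ the guaranteed coverage is $1 - 2e^{-2} \approx 0.73$, and in practice the event holds far more often, since the union bound is loose.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 200, 5000, 1.0
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))        # normalise columns: sigma_hat_j = 1

t = 2.0
lam0 = 2 * sigma * np.sqrt((t ** 2 + 2 * np.log(p)) / n)

reps, hits = 200, 0
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    hits += np.max(2 * np.abs(eps @ X) / n) <= lam0

print(f"lambda_0 = {lam0:.3f}, empirical P(T) = {hits / reps:.2f}")
```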

3. The Slow Rate — No Assumptions on $X$

Working on $\mathcal{T}$ and applying the basic inequality with the bound on the random term, one can derive a first result with no conditions whatsoever on the design matrix $X$:

Corollary 6.1 — Slow Rate (No Design Conditions)

With $\lambda = 4\hat{\sigma}\sqrt{(t^2 + 2\log p)/n}$ and probability at least $1 - 2\exp(-t^2/2) - \mathbb{P}(\hat{\sigma} < \sigma)$: $$\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{n} \leq \frac{3}{2}\lambda\|\beta^0\|_1$$

Taking $t^2 \asymp \log(p)$, this gives a prediction error of order:

$$\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{n} = O_P\!\left(\|\beta^0\|_1 \sqrt{\frac{\log p}{n}}\right)$$
Why "slow"?

Even if $\beta^0$ is very sparse (say $s_0 = 3$), the bound only uses $\|\beta^0\|_1$, not $s_0$. The convergence rate $\sqrt{\log(p)/n}$ is slow compared to the $s_0 \log(p)/n$ we would hope for. The remarkable point is that no conditions on $X$ are needed — the columns could even be perfectly correlated! This is our baseline; we will improve it next.

Example — Concrete Numbers

Suppose $n = 200$, $p = 5000$, $s_0 = 5$, and the non-zero coefficients are all 1, so $\|\beta^0\|_1 = 5$. The slow rate bound is of order $5 \cdot \sqrt{\log(5000)/200} \approx 5 \cdot 0.21 \approx 1.03$. The fast rate (to come) would give something of order $5 \cdot \log(5000)/200 \approx 0.21$, much tighter.
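The orders in this example can be checked directly (constants in the bounds are omitted):

```python
import numpy as np

n, p, s0 = 200, 5000, 5
l1_norm = 5.0                              # ||beta^0||_1: five unit coefficients

slow = l1_norm * np.sqrt(np.log(p) / n)    # order of the slow-rate bound
fast = s0 * np.log(p) / n                  # order of the fast-rate bound
print(f"slow ~ {slow:.2f}, fast ~ {fast:.2f}")   # slow ~ 1.03, fast ~ 0.21
```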

4. The Compatibility Condition — Unlocking Fast Rates

To get a faster rate, we need to exploit sparsity more aggressively. The key step comes from a decomposition of the $\ell_1$-norm:

4.1 How the Lasso "concentrates" on the active set

Using the Basic Inequality on $\mathcal{T}$ with $\lambda \geq 2\lambda_0$, one can show (Lemma 6.3 in the book) that: $$\|\hat{\beta}_{S_0^c}\|_1 \leq 3\|\hat{\beta}_{S_0} - \beta^0_{S_0}\|_1$$

This is a profound result: the error on the inactive variables is dominated by the error on the active variables. In other words, the Lasso doesn't "waste" too much of its estimation budget on irrelevant variables. The error vector $\hat{\beta} - \beta^0$ lives mostly in a cone aligned with the active set.

4.2 The compatibility condition

With the cone constraint above, we need the design matrix $X$ to behave reasonably on vectors inside this cone. This is exactly what the compatibility condition captures:

Definition — Compatibility Condition

The compatibility condition holds for $S_0$ with constant $\varphi_0^2 > 0$ if, for all $\beta$ satisfying $\|\beta_{S_0^c}\|_1 \leq 3\|\beta_{S_0}\|_1$: $$\|\beta_{S_0}\|_1^2 \;\leq\; \frac{s_0}{\varphi_0^2} \cdot \beta^T \hat{\Sigma} \beta$$

where $\hat{\Sigma} = X^TX/n$ is the (scaled) Gram matrix.

What does it mean intuitively? The compatibility condition says: "On the cone of vectors that are roughly aligned with $S_0$ (i.e., $\|\beta_{S_0^c}\|_1 \leq 3\|\beta_{S_0}\|_1$), the design matrix doesn't collapse things too much." Specifically, $\beta^T \hat{\Sigma}\beta = \|X\beta\|_2^2/n$, so this is a lower bound on the prediction error in terms of the $\ell_1$-norm of $\beta_{S_0}$.

Compatibility vs. Eigenvalue Conditions

If we replaced $\|\beta_{S_0}\|_1^2$ by its upper bound $s_0\|\beta_{S_0}\|_2^2$, this would become a condition on the smallest eigenvalue of a submatrix of $\hat{\Sigma}$. But the cone restriction $\|\beta_{S_0^c}\|_1 \leq 3\|\beta_{S_0}\|_1$ makes compatibility weaker than a minimal eigenvalue condition — it only needs to hold for "sparse-like" directions, not all directions. This is why it's sufficient for Lasso theory even in settings where the minimal eigenvalue of $\hat{\Sigma}$ is zero (which always happens when $p > n$).

Example — When Does Compatibility Hold?

Orthonormal design ($\hat{\Sigma} = I$): $\beta^T \hat{\Sigma}\beta = \|\beta\|_2^2 \geq \|\beta_{S_0}\|_2^2 \geq \|\beta_{S_0}\|_1^2/s_0$, so $\varphi_0^2 = 1$.

Toeplitz correlation ($\Sigma_{j,k} = \rho^{|j-k|}$): With $|\rho| < 1$, compatibility holds. The correlation structure is "mild" enough that the design remains well-conditioned on sparse cones.

Highly correlated design: If two variables are nearly identical ($X^{(1)} \approx X^{(2)}$) and both are active, $\varphi_0^2$ may be very small, making the bound weak. This reflects a genuine statistical difficulty: the Lasso cannot tell which of the two correlated variables is "responsible."
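Since $\varphi_0^2$ is a minimum over the cone, sampling random vectors inside the cone yields a numerical upper bound on it. A rough sketch (the sampling scheme, dimensions, and $\rho = 0.5$ are my own choices for the demo):

```python
import numpy as np

def compat_upper_bound(Sigma, S0, n_samples=5000, seed=3):
    """Monte-Carlo upper bound on phi_0^2: sample vectors b in the cone
    ||b_{S0^c}||_1 <= 3 ||b_{S0}||_1 and minimise s0 * b'Sigma b / ||b_{S0}||_1^2."""
    rng = np.random.default_rng(seed)
    p, s0 = Sigma.shape[0], len(S0)
    Sc = np.setdiff1d(np.arange(p), S0)
    best = np.inf
    for _ in range(n_samples):
        b = np.zeros(p)
        b[S0] = rng.standard_normal(s0)
        raw = rng.standard_normal(len(Sc))
        frac = rng.uniform()               # stay inside the cone, not just on its edge
        b[Sc] = raw * frac * 3 * np.abs(b[S0]).sum() / np.abs(raw).sum()
        best = min(best, s0 * (b @ Sigma @ b) / np.abs(b[S0]).sum() ** 2)
    return best

p, s0 = 50, 3
S0 = np.arange(s0)
rho = 0.5
toeplitz = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

id_bound = compat_upper_bound(np.eye(p), S0)   # identity design: phi_0^2 = 1
toe_bound = compat_upper_bound(toeplitz, S0)
print(f"identity: {id_bound:.2f}, Toeplitz(0.5): {toe_bound:.2f}")
```

For the identity every sampled value is at least 1 (by Cauchy-Schwarz), matching $\varphi_0^2 = 1$; the Toeplitz design gives a smaller but still positive value.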

5. The Oracle Inequality — The Main Result

Combining the Basic Inequality, the event $\mathcal{T}$, and the compatibility condition gives the central theorem of this lecture:

Theorem 6.1 — Oracle Inequality for the Lasso

Suppose the compatibility condition holds for $S_0$ with constant $\varphi_0^2 > 0$. Then on $\mathcal{T}$, for $\lambda \geq 2\lambda_0$: $$\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{n} + \lambda\|\hat{\beta} - \beta^0\|_1 \;\leq\; \frac{4\lambda^2 s_0}{\varphi_0^2}$$

This theorem gives two bounds at once:

  1. Prediction error: $\|X(\hat{\beta} - \beta^0)\|_2^2/n \leq 4\lambda^2 s_0 / \varphi_0^2$
  2. $\ell_1$-estimation error: $\|\hat{\beta} - \beta^0\|_1 \leq 4\lambda s_0 / \varphi_0^2$
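The theorem can be checked numerically in a case where $\varphi_0^2$ is known: an orthonormal design has $\varphi_0^2 = 1$, and choosing $\lambda = 2\lambda_0$ with $\lambda_0$ the realised noise correlation puts us on $\mathcal{T}$ by construction. A sketch (the square design with $n = p$ and the signal values are purely for the demo):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n = p = 200
s0 = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = np.sqrt(n) * Q                         # orthonormal design: phi_0^2 = 1
beta0 = np.zeros(p)
beta0[:s0] = 2.0
eps = rng.standard_normal(n)
y = X @ beta0 + eps

lam0 = np.max(2 * np.abs(eps @ X) / n)     # realised noise level: T holds
lam = 2 * lam0

fit = Lasso(alpha=lam / 2, fit_intercept=False, tol=1e-10).fit(X, y)
pred_err = np.sum((X @ (fit.coef_ - beta0)) ** 2) / n
l1_err = np.abs(fit.coef_ - beta0).sum()
print(pred_err + lam * l1_err, "<=", 4 * lam ** 2 * s0)
```

The left side lands well below the bound $4\lambda^2 s_0$, as the theorem guarantees.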

5.1 Proof sketch

The proof is elegant and short. Adding $\lambda\|\hat{\beta}_{S_0} - \beta^0_{S_0}\|_1$ to both sides of Lemma 6.3 and using $\|\hat{\beta} - \beta^0\|_1 = \|\hat{\beta}_{S_0} - \beta^0_{S_0}\|_1 + \|\hat{\beta}_{S_0^c}\|_1$, we obtain on $\mathcal{T}$ (for $\lambda \geq 2\lambda_0$):

$$\frac{2\|X(\hat{\beta} - \beta^0)\|_2^2}{n} + \lambda\|\hat{\beta} - \beta^0\|_1 \leq 4\lambda\|\hat{\beta}_{S_0} - \beta^0_{S_0}\|_1$$

Now apply the compatibility condition to the right side:

$$4\lambda\|\hat{\beta}_{S_0} - \beta^0_{S_0}\|_1 \leq 4\lambda \frac{\sqrt{s_0}}{\varphi_0}\|X(\hat{\beta} - \beta^0)\|_2/\sqrt{n}$$

Finally, use the inequality $4uv \leq u^2 + 4v^2$ (a consequence of $(u - 2v)^2 \geq 0$, with $u = \|X(\hat{\beta}-\beta^0)\|_2/\sqrt{n}$ and $v = \lambda\sqrt{s_0}/\varphi_0$) to absorb $u^2$ into the left side.
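Spelling out the absorption: with $u = \|X(\hat{\beta}-\beta^0)\|_2/\sqrt{n}$ and $v = \lambda\sqrt{s_0}/\varphi_0$, the two displays combine to

$$2u^2 + \lambda\|\hat{\beta} - \beta^0\|_1 \;\leq\; 4uv \;\leq\; u^2 + 4v^2,$$

and subtracting $u^2$ from both sides yields $u^2 + \lambda\|\hat{\beta} - \beta^0\|_1 \leq 4v^2 = 4\lambda^2 s_0/\varphi_0^2$, which is exactly Theorem 6.1.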

5.2 Asymptotic implications — the fast rate

Since $\lambda \asymp \sqrt{\log(p)/n}$, the oracle inequality implies:

$$\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{n} = O_P\!\left(\frac{s_0 \log p}{n \cdot \varphi_0^2}\right)$$

Compare this to the OLS oracle (where you know $S_0$ and run OLS on only those $s_0$ variables): the oracle achieves $O_P(s_0/n)$. The Lasso pays an extra factor of $\log(p)$ — this is the price for not knowing the active set — and a factor $1/\varphi_0^2$ from the design.

Key Insight — The $\log(p)$ Price

The factor $\log(p)$ is remarkably small. Even if $p$ grows exponentially in $n$ (say $p = e^{n^{0.9}}$), we only pay $n^{0.9}$ in the numerator, not $p$ itself. This is the mathematical justification for why the Lasso can handle very high-dimensional problems: you don't need to "pay" for all $p$ variables, only logarithmically.

6. Slow Rate vs. Fast Rate — A Comparison

| Property | Slow rate | Fast rate (oracle inequality) |
|---|---|---|
| Prediction error | $O_P\big(\Vert\beta^0\Vert_1\sqrt{\log(p)/n}\big)$ | $O_P\big(s_0 \log(p)/(n\varphi_0^2)\big)$ |
| Design condition | None | Compatibility condition ($\varphi_0^2 > 0$) |
| Sparsity assumption | $\Vert\beta^0\Vert_1 = o(\sqrt{n/\log p})$ | $s_0 = o(n/\log p)$ |
| Rate type | $n^{-1/2}$ (parametric-like, but slow) | $n^{-1}$ (nearly optimal) |

The jump from the slow rate to the fast rate is entirely due to the compatibility condition. Without it, we can only bound the random term using $\|\hat{\beta} - \beta^0\|_1$, which leads to a rate involving $\|\beta^0\|_1$. With compatibility, we can convert the $\ell_1$-norm bound into a tighter bound via the sparsity $s_0$.

7. Practical Aspects

7.1 Choosing $\lambda$ — Cross-Validation

In practice, $\lambda$ is typically chosen by $K$-fold cross-validation (e.g., $K = 10$). The data is split into $K$ folds; for each candidate $\lambda$, the model is trained on $K-1$ folds and the prediction error is evaluated on the held-out fold. The $\lambda$ minimising the average CV error is selected. This tends to produce good prediction but may select too many variables (more on this in later lectures on variable selection).
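A sketch with scikit-learn's `LassoCV` (the data-generating choices are arbitrary); as noted above, the CV-tuned Lasso predicts well but tends to keep extra variables:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p, s0 = 100, 300, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 2.0
y = X @ beta0 + rng.standard_normal(n)

cv = LassoCV(cv=10, fit_intercept=False).fit(X, y)   # 10-fold CV over a grid
print("chosen alpha:", round(cv.alpha_, 4))
print("selected variables:", np.sum(cv.coef_ != 0), "(truth:", s0, ")")
```

Typically all active variables are recovered, along with a number of spurious ones.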

7.2 The regularisation path

In practice, one computes $\hat{\beta}(\lambda)$ over a grid of $\lambda$-values. The true solution path is piecewise linear in $\lambda$, which is exploited by efficient algorithms like LARS.

For $\lambda \geq \lambda_{\max} = \max_j 2|(X^TY)_j|/n$, the KKT conditions force $\hat{\beta}(\lambda) = 0$. As $\lambda$ decreases below $\lambda_{\max}$, variables enter the model one by one.
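The $\lambda_{\max}$ formula is easy to check numerically. A sketch (the data are arbitrary; a tiny margin above $\lambda_{\max}$ avoids floating-point ties at the boundary, and scikit-learn's $\alpha$ equals $\lambda/2$ in our scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta0 = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta0 + rng.standard_normal(n)

lam_max = np.max(2 * np.abs(X.T @ y) / n)  # in our scaling of the objective
fit_hi = Lasso(alpha=1.001 * lam_max / 2, fit_intercept=False).fit(X, y)
fit_lo = Lasso(alpha=0.5 * lam_max / 2, fit_intercept=False).fit(X, y)

print("above lambda_max, all zero:", np.all(fit_hi.coef_ == 0))
print("at lambda_max/2, non-zeros:", np.sum(fit_lo.coef_ != 0))
```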

7.3 The Lasso as a prediction machine

Even when the true model is not exactly sparse, the Lasso can be a powerful prediction method. Empirically, it often outperforms forward selection methods and is competitive with more complex machine learning approaches, especially when $p \gg n$ and the true signal is approximately sparse.

8. Summary of Lasso Properties and Required Conditions

| Property | Design condition | Signal size condition |
|---|---|---|
| Slow prediction rate | None | None |
| Fast prediction rate | Compatibility | None |
| $\ell_1$-estimation error | Compatibility | None |
| Variable screening ($\hat{S} \supseteq S_0$) | Compatibility / restricted eigenvalue | Beta-min condition |
| Variable selection ($\hat{S} = S_0$) | Irrepresentable condition | Beta-min condition |

This table (Table 2.2 in the book) shows that prediction requires the weakest conditions, while exact variable selection requires the strongest. Each step up in ambition demands more from the design matrix and/or the signal strength.

Key Takeaways