Lecture 3: Lasso Theory — From Slow to Fast Rates & Beyond

High-Dimensional Statistics, Lecture 3. Based on Bühlmann & van de Geer (2011), Ch. 6.

This lecture deepens the theoretical analysis of the Lasso estimator for the high-dimensional linear model $Y = X\beta^0 + \varepsilon$ where $p \gg n$. Starting from the recap of the Lasso's behavior under orthonormal design (soft-thresholding) and the "slow rate" prediction bound from Corollary 6.1, the lecture develops the main theoretical machinery: the compatibility condition on the design matrix $X$, the resulting oracle inequality (Theorem 6.1), the much faster prediction and estimation rates that follow, and connections to the restricted eigenvalue condition. The lecture also covers practical aspects such as the regularization path, the Adaptive Lasso, and even a connection to deep neural networks.

1. Recap: Lasso Under Orthonormal Design

To build intuition, recall what happens in the simplest possible setting: when the columns of the design matrix $X$ are orthonormal, meaning $X^T X / n = I_{p \times p}$ (which requires $p \le n$). In this case, the Lasso has a beautiful closed-form solution — it simply applies soft-thresholding to the ordinary least squares (OLS) estimates.

Proposition 1 — Lasso for Orthonormal Design

Assume $X^T X / n = I$. Then the $j$-th component of the Lasso estimator is:

$$\hat{\beta}_j(\lambda) = g_{\lambda/2}(Z_j), \quad Z_j = (X^T Y)_j / n = \hat{\beta}_{\text{OLS},j}$$

where the soft-thresholding function is:

$$g_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$$

What does this mean intuitively? Each OLS coefficient $Z_j$ gets "shrunk" toward zero. If $|Z_j|$ is smaller than the threshold $\lambda/2$, the Lasso sets that coefficient exactly to zero. If $|Z_j|$ exceeds the threshold, the estimate is pulled toward zero by the fixed amount $\lambda/2$. This is what creates sparsity: small, likely-noisy coefficients get killed off entirely.

Example — Soft-Thresholding in Action

Suppose $\lambda/2 = 1$. If $Z_j = 0.8$ (small, probably noise), then $\hat\beta_j = 0$. If $Z_j = 2.5$ (large, likely signal), then $\hat\beta_j = \text{sign}(2.5)(2.5 - 1)_+ = 1.5$. The Lasso keeps the signal but shrinks it by $1$.
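
The soft-thresholding rule is a one-liner; here is a minimal sketch in Python (the function name `soft_threshold` is ours) reproducing the example above:

```python
import math

def soft_threshold(z, lam):
    """Soft-thresholding g_lam(z) = sign(z) * (|z| - lam)_+ ."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

# The example above, with threshold lam = lambda/2 = 1:
print(soft_threshold(0.8, 1.0))  # 0.0: below the threshold, killed off
print(soft_threshold(2.5, 1.0))  # 1.5: kept, but shrunk by 1
```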

1.1 The Bias Problem of Soft-Thresholding

A key observation from the threshold functions plot: the Lasso (soft-thresholding, dotted line) introduces bias even for large coefficients. When $|Z_j|$ is big (so there's strong evidence the coefficient is non-zero), the Lasso still shrinks the estimate by $\lambda/2$. In contrast, hard-thresholding (dashed line) keeps the OLS value for large $|Z_j|$ untouched — it either keeps or kills, with no shrinkage in between. The Adaptive Lasso (solid line) achieves a nice compromise, which we discuss later.

1.2 Hard-Thresholding and the $\ell_0$-Penalty

Proposition 2 — Hard-Thresholding for Orthonormal Design

Assume $X^T X / n = I$. The $\ell_0$-penalized regression estimator:

$$\hat{\beta}_{\ell_0}(\lambda) = \arg\min_\beta \left( \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_0 \right)$$

equals hard-thresholding with threshold value $\sqrt{\lambda}$:

$$\hat{\beta}_{\ell_0;j}(\lambda) = Z_j \cdot \mathbb{1}(|Z_j| > \sqrt{\lambda}), \quad Z = X^T Y / n$$

Hard-thresholding is less biased than soft-thresholding: it is still biased ($\mathbb{E}[\hat\beta_{\ell_0}] \neq \beta^0$ because of the truncation), but the coefficients it keeps are not shrunk. However, the $\ell_0$ penalty makes the optimization problem non-convex and computationally hard (NP-hard in general). Classical criteria like AIC and BIC are actually $\ell_0$-penalized regression problems. Recent progress on mixed-integer programming can handle $p \le 500$, but for truly high-dimensional problems we need the convex Lasso.
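
A matching sketch of the estimator in Proposition 2 (function name ours), which keeps or kills but never shrinks:

```python
import math

def hard_threshold(z, lam):
    """l0-penalized solution per coordinate under orthonormal design:
    keep Z_j untouched if |Z_j| > sqrt(lam), else set it to zero."""
    return z if abs(z) > math.sqrt(lam) else 0.0

# Contrast with soft-thresholding: a kept coefficient is returned unchanged.
print(hard_threshold(2.5, 1.0))  # 2.5, not 1.5
print(hard_threshold(0.8, 1.0))  # 0.0
```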

2. The Slow Rate: No Conditions on $X$ Needed

The first major theoretical result for the Lasso requires essentially no assumptions on the design matrix $X$ at all.

Corollary 6.1 (Bühlmann & van de Geer, 2011) — Slow Rate Bound

Assume $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I)$ and that the columns of $X$ are scaled so that $\hat\sigma_j^2 \equiv 1$ for all $j$. For the choice:

$$\lambda = 4\hat\sigma \sqrt{\frac{t^2 + 2\log(p)}{n}}$$

where $\hat\sigma$ is an estimator of $\sigma$, we have with probability at least $1 - \alpha$ (where $\alpha = 2\exp(-t^2/2) + P[\hat\sigma < \sigma]$):

$$\|X(\hat\beta - \beta^0)\|_2^2 / n \le \frac{3}{2}\lambda \|\beta^0\|_1$$
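
The tuning parameter in Corollary 6.1 is fully explicit; a small helper (our naming) evaluating it:

```python
import math

def slow_rate_lambda(sigma_hat, t, p, n):
    """lambda = 4 * sigma_hat * sqrt((t^2 + 2*log p) / n) from Corollary 6.1."""
    return 4.0 * sigma_hat * math.sqrt((t * t + 2.0 * math.log(p)) / n)

# e.g. sigma_hat = 1, t^2 = log(p), p = 10_000, n = 1000:
lam = slow_rate_lambda(1.0, math.sqrt(math.log(10_000)), 10_000, 1000)  # ~ 0.66
```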

2.1 What Does This Tell Us?

Taking $\lambda \asymp \sqrt{\log(p)/n}$ (e.g., by choosing $t^2 \asymp \log(p)$), the bound becomes:

$$\|X(\hat\beta - \beta^0)\|_2^2 / n = O_P\!\left(\sqrt{\frac{\log(p)}{n}} \cdot \|\beta^0\|_1\right)$$

This is a "slow rate" because the prediction error decreases like $n^{-1/2}$ rather than $n^{-1}$. Even in the very sparse case $\|\beta^0\|_1 = O(1)$, the rate is $O_P(\sqrt{\log(p)/n})$, which is much worse than the oracle benchmark.

Key Insight: No Conditions on $X$ Needed for Slow Rates

The slow rate holds with no assumptions on the design $X$ — the columns could even be perfectly correlated. This is remarkable: the Lasso "does something reasonable" no matter what the design looks like. The price is a suboptimal convergence rate.

2.2 Benchmark: The OLS Oracle

If an oracle told us exactly which variables are truly non-zero (the support set $S_0 = \{j : \beta_j^0 \ne 0\}$ with $s_0 = |S_0|$), we could run OLS on just those variables and achieve:

$$\|X(\hat\beta_{\text{OLS-oracle}} - \beta^0)\|_2^2 / n = O_P(s_0 / n)$$

This is the "gold standard" — it depends on $s_0$ (number of relevant variables) rather than $\|\beta^0\|_1$, and it decays at rate $1/n$. To close this gap, we need the compatibility condition.

3. The Compatibility Condition

To go beyond the slow rate, we need some assumption on the design matrix $X$. The issue is that if columns of $X$ are too correlated, it becomes impossible to distinguish which variables are truly active. The compatibility condition is the key assumption that makes fast rates possible.

3.1 Why Do We Need a Condition on $X$?

Consider an extreme example: if $X^{(2)} = -X^{(1)}$ (two columns perfectly anti-correlated), then the models $\beta_1 = 1, \beta_2 = 0$ and $\beta_1 = 0, \beta_2 = -1$ produce identical predictions $X\beta$. We cannot tell them apart — the parameter $\beta^0$ is not identifiable. The compatibility condition rules out such pathological situations, but it does so in a subtle and minimal way.

Definition — Compatibility Condition

We say the compatibility condition holds for the set $S_0$ if there exists $\varphi_0 > 0$ such that for all $\beta$ satisfying $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$:

$$\|\beta_{S_0}\|_1^2 \le \frac{\beta^T \hat\Sigma \beta \cdot s_0}{\varphi_0^2}$$

where $\hat\Sigma = X^T X / n$ is the empirical Gram matrix and $\varphi_0^2$ is called the compatibility constant.

3.2 Intuition Behind the Compatibility Condition

The condition says: for vectors $\beta$ that are "mostly supported on $S_0$" (the cone constraint $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$), the Gram matrix $\hat\Sigma$ doesn't annihilate them. In other words, the prediction $X\beta$ stays large enough relative to the coefficient on $S_0$. Think of it as a restricted version of saying "the smallest eigenvalue of $\hat\Sigma$ is positive" — but only tested on a special cone of vectors rather than all of $\mathbb{R}^p$.

Why the Cone Constraint?

The constraint $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$ arises naturally from the analysis of the Lasso. It can be shown that on the "good event" $\mathcal{T}$, the Lasso estimation error $\hat\beta - \beta^0$ automatically satisfies this cone constraint. So the compatibility condition only needs to hold on vectors the Lasso error can actually take — not on all of $\mathbb{R}^p$. This makes it much weaker than requiring all eigenvalues of $\hat\Sigma$ to be positive (which would be impossible when $p > n$).

3.3 Relation to the Restricted Eigenvalue Condition

A related but stronger condition is the restricted eigenvalue condition. Define:

$$\kappa^2(s_0, 3) = \min_{\substack{\beta \ne 0 \\ \|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1}} \frac{\beta^T \hat\Sigma \beta}{\|\beta_{S_0}\|_2^2}$$

The restricted eigenvalue looks at the ratio involving $\|\beta_{S_0}\|_2^2$, whereas compatibility uses $\|\beta_{S_0}\|_1^2 / s_0$. By Cauchy-Schwarz, $\|\beta_{S_0}\|_1^2 \le s_0 \|\beta_{S_0}\|_2^2$, so $\kappa^2(s_0, 3) > 0$ implies $\varphi_0^2 > 0$ but not vice versa. The compatibility condition is weaker and sufficient for the oracle inequality.
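
The compatibility constant is a minimum over a cone, so it is hard to compute exactly. A Monte Carlo sketch (all names ours) that samples vectors, rescales them into the cone, and tracks the smallest observed ratio; since the true $\varphi_0^2$ minimizes over the *whole* cone, this yields an upper bound on $\varphi_0^2$:

```python
import random

def compatibility_upper(Sigma, S0, n_draws=2000, seed=0):
    """Monte Carlo sketch: sample vectors in the cone
    ||b_{S0^c}||_1 <= 3 ||b_{S0}||_1 and track the smallest value of
    s0 * (b^T Sigma b) / ||b_{S0}||_1^2.  The true phi_0^2 is the minimum
    over the whole cone, so this is an UPPER bound on phi_0^2."""
    rng = random.Random(seed)
    p, s0 = len(Sigma), len(S0)
    best = float("inf")
    for _ in range(n_draws):
        b = [rng.gauss(0.0, 1.0) for _ in range(p)]
        l1_S = sum(abs(b[j]) for j in S0)
        l1_Sc = sum(abs(b[j]) for j in range(p) if j not in S0)
        if l1_Sc > 3.0 * l1_S:  # rescale the off-support part into the cone
            c = 3.0 * l1_S / l1_Sc
            b = [b[j] if j in S0 else c * b[j] for j in range(p)]
        quad = sum(b[i] * Sigma[i][j] * b[j] for i in range(p) for j in range(p))
        best = min(best, s0 * quad / l1_S ** 2)
    return best
```

Sanity check on the sketch: for $\hat\Sigma = I$ the ratio is always at least $1$ by Cauchy-Schwarz, so the estimate never drops below $1$.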

4. The Oracle Inequality — The Fast Rate

With the compatibility condition in hand, we arrive at the central result of this lecture.

Theorem 6.1 (Bühlmann & van de Geer, 2011) — Oracle Inequality for the Lasso

Assume the compatibility condition holds with constant $\varphi_0^2 > 0$. Define the event $\mathcal{T} = \{2\max_{j=1,\ldots,p} |\varepsilon^T X^{(j)} / n| \le \lambda_0\}$. Then on $\mathcal{T}$, for $\lambda \ge 2\lambda_0$:

$$\|X(\hat\beta - \beta^0)\|_2^2 / n + \lambda\|\hat\beta - \beta^0\|_1 \le 4\lambda^2 s_0 / \varphi_0^2$$

4.1 Unpacking the Oracle Inequality

This single inequality gives us two results in one:

  1. Prediction bound: $\|X(\hat\beta - \beta^0)\|_2^2 / n \le 4\lambda^2 s_0 / \varphi_0^2$
  2. $\ell_1$-estimation bound: $\|\hat\beta - \beta^0\|_1 \le 4\lambda s_0 / \varphi_0^2$
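
Both bounds share the same right-hand side, split by dividing out one factor of $\lambda$; as a quick sketch (names ours):

```python
def oracle_bounds(lam, s0, phi0_sq):
    """Bounds implied by Theorem 6.1 on the event T (for lambda >= 2*lambda_0):
    prediction error <= 4*lam^2*s0/phi0^2, l1 error <= 4*lam*s0/phi0^2."""
    pred = 4.0 * lam ** 2 * s0 / phi0_sq
    ell1 = 4.0 * lam * s0 / phi0_sq
    return pred, ell1
```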

4.2 The "Good Event" $\mathcal{T}$

The event $\mathcal{T}$ controls the maximum absolute correlation between the noise $\varepsilon$ and any column of $X$. Think of $\varepsilon^T X^{(j)} / n$ as how much column $j$ "aligns with the noise." We need all of these to be small. For Gaussian or sub-Gaussian errors:

$$\lambda_0 \asymp \sqrt{\log(p)/n} \implies P[\mathcal{T}] \text{ is close to } 1$$

The factor $\log(p)$ appears because we're taking a maximum over $p$ terms (union bound flavor). This is why $\log(p)$ shows up in the rate — it's the price for not knowing which of the $p$ variables are relevant.

4.3 Asymptotic Implications — The Fast Rate

Setting $\lambda = 2\lambda_0 \asymp \sqrt{\log(p)/n}$ and assuming $\varphi_0^2 \ge L > 0$:

$$\|X(\hat\beta - \beta^0)\|_2^2 / n = O_P\!\left(s_0 \frac{\log(p)}{n}\right)$$
$$\|\hat\beta - \beta^0\|_1 = O_P\!\left(s_0 \sqrt{\frac{\log(p)}{n}}\right)$$

Example — Comparing Slow vs. Fast Rates

Suppose $n = 1000$, $p = 10{,}000$, $s_0 = 5$ (5 truly relevant variables), $\|\beta^0\|_1 = 5$.

Slow rate (no condition on $X$): prediction error $\sim \sqrt{\log(10{,}000)/1000} \cdot 5 \approx 0.48$

Fast rate (with compatibility): prediction error $\sim 5 \cdot \log(10{,}000) / 1000 \approx 0.046$

Oracle OLS: prediction error $\sim 5 / 1000 = 0.005$

The fast rate is dramatically better than the slow rate. The only price vs. the oracle is the $\log(p)$ factor — remarkably small given that we didn't know which 5 out of 10,000 variables matter!
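
The three numbers in the example can be reproduced directly (constants suppressed, exactly as in the rates above):

```python
import math

n, p, s0, beta_l1 = 1000, 10_000, 5, 5.0

slow = math.sqrt(math.log(p) / n) * beta_l1   # slow rate, ~ 0.48
fast = s0 * math.log(p) / n                   # fast rate, ~ 0.046
oracle = s0 / n                               # OLS oracle, = 0.005
```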

5. Summary of Lasso Properties & Required Conditions

The following table (version of Table 2.2 from the book) gives a bird's-eye view of what the Lasso can achieve and what assumptions are needed:

| Property | Design condition | Size of non-zero coefficients |
|---|---|---|
| Slow prediction convergence rate | No requirement | No requirement |
| Fast prediction convergence rate | Compatibility | No requirement |
| $\ell_1$-estimation error bound | Compatibility | No requirement |
| Variable screening ($\hat{S} \supseteq S_0$) | Compatibility / restricted eigenvalue | Beta-min condition |
| Variable selection ($\hat{S} = S_0$) | Neighborhood stability (irrepresentable condition) | Beta-min condition |

The table reveals a hierarchy of difficulty: prediction is easiest (weakest assumptions), estimation requires more, variable screening is harder, and exact variable selection is the hardest task requiring the strongest conditions on both the design and the signal strength.

6. The Lasso Regularization Path

In practice, we don't pick a single $\lambda$ — we compute the Lasso solution for a whole range of $\lambda$ values. This gives the regularization path $\{\hat\beta(\lambda) : \lambda \ge 0\}$.

6.1 Key Properties

The Lasso regularization path has a remarkable property: it is piecewise linear in $\lambda$. This means we can compute the entire path efficiently. In practice, we compute it over a grid of $\lambda$-values and interpolate.

There is a maximal value $\lambda_{\max} = \max_j |2(X^T Y)_j / n|$ above which the Lasso solution is exactly zero: $\hat\beta(\lambda) = 0$ for all $\lambda \ge \lambda_{\max}$. This follows from the KKT conditions. As $\lambda$ decreases from $\lambda_{\max}$, variables enter the model one at a time at the kinks of the piecewise-linear path (and can occasionally leave again).
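
The value $\lambda_{\max}$ is easy to compute from the data; a minimal sketch (for the objective $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$ used in this lecture, with `X` as a list of rows):

```python
def lambda_max(X, Y):
    """Smallest lambda at which the Lasso solution is exactly zero:
    lambda_max = max_j |2 (X^T Y)_j / n|, from the KKT conditions."""
    n, p = len(X), len(X[0])
    return max(abs(2.0 * sum(X[i][j] * Y[i] for i in range(n)) / n)
               for j in range(p))

# Tiny example: X^T Y = (1, 2), n = 2, so lambda_max = max(|1|, |2|) = 2.
print(lambda_max([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0]))  # 2.0
```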

6.2 Choosing $\lambda$ in Practice

The most common approach is cross-validation (e.g., 10-fold CV). We evaluate the Lasso over a grid of $\lambda$ values (typically equi-spaced on a log-scale from $\lambda_{\max}$ down to a small value) and choose the $\lambda$ that minimizes the cross-validated prediction error. The theory tells us $\lambda$ should scale like $\sqrt{\log(p)/n}$, which is a useful sanity check for the cross-validated choice.
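
The log-scale grid mentioned above can be built as follows (a sketch; `lam_min_ratio` is our name for the conventional smallest-to-largest ratio):

```python
import math

def lambda_grid(lam_max, lam_min_ratio=1e-3, num=100):
    """Equi-spaced grid on a log scale, from lam_max down to
    lam_max * lam_min_ratio, in decreasing order."""
    lo, hi = math.log(lam_max * lam_min_ratio), math.log(lam_max)
    return [math.exp(hi - (hi - lo) * k / (num - 1)) for k in range(num)]
```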

7. Addressing Lasso's Bias: Adaptive Lasso & Hard-Thresholding

The Lasso's soft-thresholding introduces systematic bias: even large coefficients are shrunk toward zero. Two approaches to address this:

7.1 The Adaptive Lasso

The Adaptive Lasso (Zou, 2006) is a two-stage procedure that re-weights the $\ell_1$-penalty using an initial Lasso fit:

$$\hat\beta_{\text{adapt}}(\lambda) = \arg\min_\beta \left( \|Y - X\beta\|_2^2 / n + \lambda \sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta_{\text{init},j}|} \right)$$

The key idea: if the initial estimator $|\hat\beta_{\text{init},j}|$ is large (strong signal), the penalty weight $1/|\hat\beta_{\text{init},j}|$ is small, so that coefficient gets penalized less — meaning less bias. If the initial estimate is zero, then the Adaptive Lasso also sets it to zero (the penalty is $\infty$).

For orthonormal design with initial OLS estimates, the Adaptive Lasso becomes:

$$\hat\beta_{\text{adapt},j} = \text{sign}(Z_j)\left(|Z_j| - \frac{\lambda}{2|Z_j|}\right)_+$$

The threshold $\lambda/(2|Z_j|)$ decreases for large $|Z_j|$, so large signals are barely touched — much less bias than the constant shift of soft-thresholding.
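
Under orthonormal design this adaptive threshold is again a one-liner; a sketch (function name ours):

```python
import math

def adaptive_threshold(z, lam):
    """Adaptive Lasso under orthonormal design with OLS initial estimates:
    sign(z) * (|z| - lam / (2|z|))_+ ; z == 0 stays 0 (infinite penalty)."""
    if z == 0.0:
        return 0.0
    return math.copysign(max(abs(z) - lam / (2.0 * abs(z)), 0.0), z)

# With lam = 2: Z_j = 2.5 is shrunk only by 2/(2*2.5) = 0.4, to 2.1
# (soft-thresholding at lam/2 = 1 would give 1.5).
print(adaptive_threshold(2.5, 2.0))  # 2.1
```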

Practical Recipe for the Adaptive Lasso

1. Run Lasso with $\lambda$ selected by 10-fold CV to get $\hat\beta_{\text{init}}$.

2. Run the Adaptive Lasso (using the formula above) with $\lambda$ again selected by 10-fold CV.

This sequential tuning is computationally cheap (two 1D optimizations) and typically yields a much sparser model with less bias than the plain Lasso.

7.2 Comparing the Three Threshold Functions

Under orthonormal design, the three estimators correspond to three thresholding functions applied to the OLS estimates $Z_j$: soft-thresholding (constant shift $\lambda/2$; the Lasso), hard-thresholding (keep-or-kill with no shrinkage; the $\ell_0$ penalty), and the adaptive threshold (shrinkage that vanishes as $|Z_j|$ grows; the Adaptive Lasso).

8. Connection to Deep Neural Networks

An important modern connection: in deep neural networks, the prediction from the last layer to the response $Y$ is often based on a regularized linear model. Specifically, if $\varphi(X_i) \in \mathbb{R}^d$ are learned features from the last hidden layer, then the final prediction uses:

$$\hat\beta(\lambda) = \arg\min_\beta \left( \|Y - (\varphi(X_1), \ldots, \varphi(X_n))^T \beta\|_2^2 / n + \lambda\|\beta\|_1 \right)$$

and the prediction is $\hat{f}(x) = \hat\beta(\lambda)^T \varphi(x)$. So even in the deep learning world, the Lasso theory developed in this lecture applies to the final linear layer. The features $\varphi(\cdot)$ are learned by the network, but the last step is still penalized linear regression.

9. Key Takeaways