Lecture 4: Lasso — Prediction, Computation, Screening & the Adaptive Lasso

Statistics for High-Dimensional Data, Lecture 4. Based on Bühlmann & van de Geer (2011), Ch. 2 & 6.

This lecture builds on the oracle inequality introduced in Lecture 3 and explores its practical consequences: what does it tell us about prediction and estimation? We then turn to how the Lasso is actually computed via coordinate descent, how the regularization path is constructed, and how $\lambda$ is chosen in practice. Finally, we address the Lasso's bias problem and discuss variable screening, variable selection, and the Adaptive Lasso as a powerful two-stage refinement.

1. Recap: The Lasso and Soft-Thresholding

Recall the Lasso estimator for the high-dimensional linear model $Y = X\beta^0 + \varepsilon$ with $p \gg n$:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_1 \right)$$

The Lasso is a sparse estimator: some components $\hat{\beta}_j(\lambda)$ are set to exactly zero. This is different from the sparsity assumption on the true model — the Lasso produces sparsity in its estimate regardless of whether the truth is sparse.

1.1 Orthonormal Design: Explicit Solution

When the design is orthonormal ($X^TX/n = I_{p \times p}$, which requires $p \le n$), the Lasso has a beautiful closed-form solution: it is simply soft-thresholding of the ordinary least squares (OLS) estimates.

Proposition 1 — Lasso with Orthonormal Design

If $X^TX/n = I$, then for each coordinate $j$:

$$\hat{\beta}_j(\lambda) = g_{\lambda/2}(Z_j), \quad Z_j = (X^TY)_j / n = \hat{\beta}_{\text{OLS},j}$$

where the soft-thresholding function is:

$$g_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$$

Intuition: Soft-thresholding works like a "dead zone" filter. If the OLS estimate $Z_j$ is small in magnitude (below the threshold $\lambda/2$), it gets set to zero — the Lasso decides that variable is not worth keeping. If $|Z_j|$ is large, the estimate is kept but shrunk toward zero by the constant amount $\lambda/2$. This shrinkage is the source of the Lasso's bias: even large, truly non-zero coefficients get pulled toward zero.

Example — Soft-Thresholding in Action

Suppose $\lambda = 2$ (so threshold = $\lambda/2 = 1$) and the OLS estimates are $Z_1 = 3$, $Z_2 = 0.5$, $Z_3 = -2.5$. Then:

  • $\hat{\beta}_1 = \text{sign}(3)(|3| - 1)_+ = +2$ — kept, but shrunk from 3 to 2.
  • $\hat{\beta}_2 = \text{sign}(0.5)(|0.5| - 1)_+ = 0$ — thresholded to zero (too small).
  • $\hat{\beta}_3 = \text{sign}(-2.5)(|2.5| - 1)_+ = -1.5$ — kept, but shrunk from $-2.5$ to $-1.5$.
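The worked example above can be reproduced with a few lines of NumPy; the function below is a direct transcription of the soft-thresholding formula $g_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$:

```python
import numpy as np

def soft_threshold(z, thr):
    """Soft-thresholding: g_thr(z) = sign(z) * (|z| - thr)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

lam = 2.0
Z = np.array([3.0, 0.5, -2.5])          # OLS estimates from the example
beta_hat = soft_threshold(Z, lam / 2)   # threshold at lambda/2 = 1
print(beta_hat)                         # [ 2.   0.  -1.5]
```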

1.2 Why Lasso Is Sparse but Ridge Is Not

The key geometric insight is that the $\ell_1$-ball (the Lasso constraint set) has corners that sit on the coordinate axes. When the elliptical contours of the squared loss first touch the $\ell_1$-ball, they are very likely to hit a corner — which means one or more coordinates are exactly zero. The $\ell_2$-ball (used by Ridge regression) is round, so contour lines almost never touch it at a point where a coordinate is zero.

2. Oracle Inequality and Its Implications

2.1 The Oracle Inequality (Theorem 6.1)

This is the central theoretical result for the Lasso. Under the compatibility condition, the Lasso achieves near-oracle performance for both prediction and $\ell_1$-estimation.

Theorem 6.1 — Oracle Inequality for the Lasso (Bühlmann & van de Geer, 2011)

Assume the compatibility condition holds for $S_0$ with constant $\varphi_0 > 0$. Then, on the event $\mathcal{T}$ and for $\lambda \ge 2\lambda_0$:

$$\|X(\hat{\beta} - \beta^0)\|_2^2 / n \;+\; \lambda\|\hat{\beta} - \beta^0\|_1 \;\le\; 4\lambda^2 s_0 / \varphi_0^2$$

where $\mathcal{T} = \{2\max_{j=1,\ldots,p} |\varepsilon^T X^{(j)}/n| \le \lambda_0\}$ is a "good event" with $\mathbb{P}[\mathcal{T}]$ large when $\lambda_0 \asymp \sqrt{\log(p)/n}$ and errors are (sub-)Gaussian.

What does this bound contain? The left-hand side has two terms: the prediction error $\|X(\hat{\beta} - \beta^0)\|_2^2/n$ and the $\ell_1$-estimation error $\lambda\|\hat{\beta} - \beta^0\|_1$. Both are bounded simultaneously by the right-hand side, which depends on the sparsity $s_0 = |S_0|$ (number of truly non-zero coefficients), the regularization parameter $\lambda$, and the compatibility constant $\varphi_0$.

2.2 What Is the Compatibility Condition?

Definition — Compatibility Condition

The compatibility condition holds for the set $S_0$ if for some $\varphi_0 > 0$ and for all $\beta$ satisfying $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$:

$$\|\beta_{S_0}\|_1^2 \;\le\; \left(\beta^T \hat{\Sigma} \beta\right) \cdot s_0 / \varphi_0^2$$

where $\hat{\Sigma} = X^TX/n$ is the empirical Gram matrix.

Intuition: This condition says that even though $p \gg n$ and the Gram matrix $\hat{\Sigma}$ is singular, it still behaves "nicely enough" in the directions that matter. Specifically, vectors that are concentrated on the active set $S_0$ (those satisfying the cone constraint $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$) cannot be in the null space of $\hat{\Sigma}$. It is a much weaker condition than requiring all eigenvalues of $\hat{\Sigma}$ to be positive — it only needs the matrix to be well-behaved in a restricted cone of directions.

Note that the condition cannot be checked in practice since $S_0$ is unknown, but it is a reasonable structural assumption that holds for many common design matrices (e.g. when rows of $X$ are drawn i.i.d. from a sub-Gaussian distribution).

2.3 Implications of the Oracle Inequality

Setting $\lambda \asymp \sqrt{\log(p)/n}$ (which ensures $\mathbb{P}[\mathcal{T}]$ is large) and assuming $\varphi_0^2 \ge L > 0$:

Prediction and Estimation Rates

1. Prediction error (fast rate):

$$\|X(\hat{\beta} - \beta^0)\|_2^2 / n = O_P\!\left(s_0 \log(p) / n\right)$$

2. $\ell_1$-estimation error:

$$\|\hat{\beta} - \beta^0\|_1 = O_P\!\left(s_0 \sqrt{\log(p)/n}\right)$$

Why "oracle"? If we magically knew the active set $S_0$ and ran OLS only on those $s_0$ variables, the prediction error would be $O_P(s_0/n)$. The Lasso achieves $O_P(s_0 \log(p)/n)$ — the extra $\log(p)$ factor is the price for not knowing which variables are active. This is remarkably small: searching through $p$ variables (which could be millions) only costs a logarithmic factor.

Important: Conditions on the Design Matrix

If we want to analyse $\hat{\beta} - \beta^0$ in any norm, we need conditions on $X$. Consider the extreme case $X^{(2)} = -X^{(1)}$: then $\beta_1 = 1, \beta_2 = 0$ and $\beta_1 = 0, \beta_2 = -1$ produce identical predictions, making $\beta^0$ non-identifiable. The compatibility condition rules out such pathologies (in a soft, averaged sense).

3. Computational Algorithm for the Lasso

The Lasso optimization problem is convex but not differentiable (because of the $|\beta_j|$ terms in the $\ell_1$-penalty). The standard approach is coordinate descent: optimize over one coordinate $\beta_j$ at a time while holding all others fixed, cycling through all coordinates repeatedly.

3.1 KKT Conditions

The algorithm is motivated by the Karush-Kuhn-Tucker (KKT) conditions. Taking the sub-differential of the Lasso objective with respect to $\beta_j$:

$$\frac{\partial}{\partial \beta_j}\left(\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1\right) = G_j(\beta) + \lambda e_j$$

where $G(\beta) = -2X^T(Y - X\beta)/n$ is the gradient of the squared loss, and $e_j = \text{sign}(\beta_j)$ if $\beta_j \ne 0$, or $e_j \in [-1, 1]$ if $\beta_j = 0$ (because $|\cdot|$ is not differentiable at zero).

KKT Conditions for the Lasso (Lemma 2.1)

A vector $\hat{\beta}$ is a solution of the Lasso if and only if:

$$G_j(\hat{\beta}) = -\text{sign}(\hat{\beta}_j)\,\lambda \quad \text{if } \hat{\beta}_j \ne 0$$

$$|G_j(\hat{\beta})| \le \lambda \quad \text{if } \hat{\beta}_j = 0$$

Intuition: For active variables ($\hat{\beta}_j \ne 0$), the gradient of the loss must exactly balance the penalty derivative. For inactive variables ($\hat{\beta}_j = 0$), the gradient must be small enough (in absolute value $\le \lambda$) that it's not worth "turning on" that variable. This is where sparsity comes from: the $\ell_1$-penalty has a kink at zero, so there's a range of gradient values for which $\beta_j = 0$ is optimal.
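The KKT conditions also give a practical way to verify that a candidate vector solves the Lasso: compute $G(\beta) = -2X^T(Y - X\beta)/n$ and check both cases. A minimal sketch (the function name `kkt_violation` is ours, not from the text):

```python
import numpy as np

def kkt_violation(X, Y, beta, lam):
    """Largest violation of the Lasso KKT conditions at beta.

    G_j = -2 X_j^T (Y - X beta) / n. Optimality requires
    G_j = -sign(beta_j) * lam on the active set, |G_j| <= lam elsewhere.
    """
    n = X.shape[0]
    G = -2.0 * X.T @ (Y - X @ beta) / n
    active = beta != 0
    v_active = np.abs(G[active] + np.sign(beta[active]) * lam)
    v_inactive = np.maximum(np.abs(G[~active]) - lam, 0.0)
    return max(v_active.max(initial=0.0), v_inactive.max(initial=0.0))

# Orthonormal design (X^T X / n = I) with the Section 1.1 example:
X = np.sqrt(3) * np.eye(3)
Y = np.sqrt(3) * np.array([3.0, 0.5, -2.5])
beta_hat = np.array([2.0, 0.0, -1.5])   # soft-thresholded solution, lam = 2
print(kkt_violation(X, Y, beta_hat, 2.0))  # ~ 0: KKT holds
```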

3.2 Coordinate Descent Algorithm

The update rule for the $j$-th coordinate is an explicit soft-thresholding step:

$$\beta_j^{[m]} = \frac{\text{sign}(Z_j)(|Z_j| - \lambda/2)_+}{\hat{\Sigma}_{j,j}}$$

where $Z_j = X_j^T(Y - X\beta_{-j}^{[m-1]})/n$ is the partial residual projected onto the $j$-th variable, and $\hat{\Sigma}_{j,j}$ normalises by the empirical variance of $X^{(j)}$. We cycle through $j = 1, 2, \ldots, p, 1, 2, \ldots$ until convergence. The algorithm converges to a global minimum because the objective is convex and its non-differentiable part (the $\ell_1$-penalty) is separable across coordinates (Tseng, 2001).

Example — One Coordinate Descent Step

Suppose $\lambda = 0.4$, $\hat{\Sigma}_{j,j} = 1$, and the partial residual gives $Z_j = 0.15$. Since $|0.15| < \lambda/2 = 0.2$, we set $\beta_j = 0$. The gradient is too small to overcome the penalty — this variable stays inactive.

If instead $Z_j = 0.8$, then $\beta_j = \text{sign}(0.8)(0.8 - 0.2)_+ / 1 = 0.6$. The variable enters the model but is shrunk from 0.8 to 0.6.
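A minimal cyclic coordinate descent implementation of the update rule above (a teaching sketch, not a production solver; glmnet-style implementations add covariance updates and active-set tricks):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent.

    Minimises ||Y - X beta||_2^2 / n + lam * ||beta||_1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    d = np.sum(X**2, axis=0) / n            # hat{Sigma}_{j,j}
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove variable j's current contribution
            r = Y - X @ beta + X[:, j] * beta[j]
            Z_j = X[:, j] @ r / n
            # soft-threshold at lam/2, then normalise
            beta[j] = np.sign(Z_j) * max(abs(Z_j) - lam / 2, 0.0) / d[j]
    return beta

# Sanity check on the orthonormal design of Section 1.1:
X = np.sqrt(3) * np.eye(3)
Y = np.sqrt(3) * np.array([3.0, 0.5, -2.5])
print(lasso_cd(X, Y, lam=2.0))  # [ 2.   0.  -1.5]
```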

4. The Regularization Path

In practice, we want to compute the Lasso solution $\hat{\beta}(\lambda)$ for many values of $\lambda$ to understand how the model evolves with different regularization strengths. The collection $\{\hat{\beta}(\lambda); \lambda \in \mathbb{R}_+\}$ is called the regularization path.

4.1 Key Properties

The Lasso regularization path is piecewise linear in $\lambda$. In practice, we compute it on a grid of $\lambda$-values and interpolate linearly. The grid is typically chosen equi-spaced on the log scale.

An important boundary: there is a maximal value $\lambda_{\max}$ above which all coefficients are zero:

$$\lambda_{\max} = \max_j |2(X^TY)_j / n| \quad \Longrightarrow \quad \hat{\beta}(\lambda_{\max}) = 0$$

This follows directly from the KKT conditions: if $\lambda \ge \lambda_{\max}$, then $|G_j(0)| \le \lambda$ for all $j$, so the all-zeros vector satisfies the optimality conditions. The regularization path starts at $\lambda_{\max}$ (null model) and as $\lambda$ decreases, variables enter the model one by one.
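Both $\lambda_{\max}$ and the log-spaced grid are one-liners; a sketch (the helper names are ours):

```python
import numpy as np

def lambda_max(X, Y):
    """Smallest lam at which the Lasso solution is exactly zero."""
    n = X.shape[0]
    return np.max(np.abs(2.0 * X.T @ Y / n))

def lambda_grid(X, Y, n_lambda=100, eps=1e-3):
    """Grid from lambda_max down to eps*lambda_max, equi-spaced on log scale."""
    lmax = lambda_max(X, Y)
    return np.logspace(np.log10(lmax), np.log10(eps * lmax), n_lambda)

# With the orthonormal example (Z = [3, 0.5, -2.5]), lambda_max = 2 * 3 = 6:
X = np.sqrt(3) * np.eye(3)
Y = np.sqrt(3) * np.array([3.0, 0.5, -2.5])
print(lambda_max(X, Y))  # 6.0
```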

4.2 Non-Monotonicity of the Path

Important Caveat

The regularization path is not necessarily monotone in the non-zeros. It can happen that $\hat{\beta}_j(\lambda) \ne 0$ but $\hat{\beta}_j(\lambda') = 0$ for some $\lambda' < \lambda$. That is, a variable can enter the model, leave again, and potentially re-enter as $\lambda$ decreases. This is an important practical observation: simply looking at which variables appear "early" in the path does not guarantee they remain important.

5. Prediction with the Lasso

The goal is to estimate the regression function $f(x) = \mathbb{E}[Y|X=x] = (\beta^0)^T x$ using $\hat{f}(x) = \hat{\beta}(\lambda)^T x$.

5.1 Choosing $\lambda$ via Cross-Validation

In practice, we choose $\lambda$ via $K$-fold cross-validation (typically $K = 10$). The data is split into $K$ folds; for each candidate $\lambda$, we fit the Lasso on $K-1$ folds and measure prediction error on the held-out fold. We select $\hat{\lambda}_{\text{CV}}$ that minimises the average prediction error across folds.
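The CV loop can be sketched as follows, reusing a plain coordinate-descent solver (a self-contained illustration; packaged implementations such as `cv.glmnet` in R or scikit-learn's `LassoCV` handle the grid, warm starts, and standardisation automatically):

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def lasso_cd(X, Y, lam, n_iter=100):
    """Cyclic coordinate descent for ||Y - X b||^2/n + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    d = np.sum(X**2, axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = Y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r / n, lam / 2) / d[j]
    return beta

def cv_lasso(X, Y, lambdas, K=10, seed=0):
    """K-fold CV: pick the lam minimising average held-out MSE."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % K  # balanced folds
    cv_err = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for k in range(K):
            tr, te = folds != k, folds == k
            beta = lasso_cd(X[tr], Y[tr], lam)
            cv_err[i] += np.mean((Y[te] - X[te] @ beta) ** 2) / K
    return lambdas[np.argmin(cv_err)], cv_err
```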

The Lasso with cross-validated $\lambda$ is a strong prediction machine. The oracle inequality guarantees that the prediction error scales as $s_0 \log(p)/n$, which is nearly as good as if an oracle had told us which variables to use. In practice, the Lasso is competitive with many more complex methods, including random forests, boosting, and neural networks — particularly in high-dimensional, relatively low-noise settings.

Prediction-Optimal $\lambda$ vs. Selection-Optimal $\lambda$

Choosing $\lambda$ via cross-validation is great for prediction but often leads to selecting too many variables. The prediction-optimal $\lambda$ tends to be too small for reliable variable selection, because including a few extra noise variables doesn't hurt prediction much but does hurt sparsity recovery. This is a fundamental tension between prediction and variable selection.

6. Variable Screening and Variable Selection

6.1 Active Set and the Beta-Min Condition

The active set is $S_0 = \{j : \beta_j^0 \ne 0\}$ with $s_0 = |S_0|$. The estimated active set is $\hat{S} = \{j : \hat{\beta}_j \ne 0\}$. A natural question: can the Lasso recover $S_0$?

From the oracle inequality, we know $\|\hat{\beta} - \beta^0\|_1 \le 4\lambda s_0/\varphi_0^2$ on $\mathcal{T}$. If the smallest true coefficient is larger than this bound, then every active variable must have $\hat{\beta}_j \ne 0$.

Variable Screening via the Beta-Min Condition

If the beta-min condition holds:

$$\min_{j \in S_0} |\beta_j^0| > 4\lambda s_0 / \varphi_0^2$$

then $\mathbb{P}[\hat{S} \supseteq S_0] \ge \mathbb{P}[\mathcal{T}]$ which is large. In words: the Lasso selects a superset of the true active set — it does not miss any important variable!

Intuition: The beta-min condition says that all true signals are large enough to "survive" the shrinkage and thresholding. If a true coefficient is tiny (close to zero), the Lasso might zero it out, and we can't distinguish it from noise. The condition quantifies the minimum detectable signal strength.
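Screening is easy to observe in simulation. The sketch below (our own toy setup: $n = 200$, $p = 50$, three strong signals well above the beta-min bound, $\lambda \asymp \sqrt{\log(p)/n}$) checks that no active variable is missed; the Lasso typically keeps some extra noise variables, which is exactly the screening-versus-selection gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s0 = 200, 50, 3
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 2.0                       # strong signals, well above the bound
Y = X @ beta0 + 0.5 * rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)       # lam ~ sqrt(log(p)/n)

# plain cyclic coordinate descent, as in Section 3
beta = np.zeros(p)
d = np.sum(X**2, axis=0) / n
for _ in range(200):
    for j in range(p):
        r = Y - X @ beta + X[:, j] * beta[j]
        z = X[:, j] @ r / n
        beta[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / d[j]

S_hat = set(np.flatnonzero(beta))
print(set(range(s0)) <= S_hat)         # True: screening, no signal missed
```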

6.2 Variable Selection: A Harder Problem

Screening ($\hat{S} \supseteq S_0$) is the easier task. Exact variable selection ($\hat{S} = S_0$, no false positives) requires much stronger conditions.

Conditions for Consistent Variable Selection

Under the irrepresentable condition (or equivalently, the neighbourhood stability condition) on the design $X$, and assuming the beta-min condition $\min_{j \in S_0}|\beta_j^0| \gg \sqrt{s_0 \log(p)/n}$:

$$\mathbb{P}[\hat{S} = S_0] \to 1 \quad (n \to \infty)$$

The irrepresentable condition is sufficient and essentially necessary for consistent variable selection with the Lasso. However, it is quite restrictive and often not fulfilled in practice. This, combined with the difficulty of choosing the correct $\lambda$ for selection (rather than prediction), leads to a pragmatic conclusion:

Practical Takeaway

Variable screening is realistic; variable selection (in one step) is not very realistic with the plain Lasso. A humorous but insightful re-interpretation:

LASSO = Least Absolute Shrinkage and Screening Operator

(originally: Least Absolute Shrinkage and Selection Operator, Tibshirani, 1996)

| Property | Design condition | Signal condition |
|---|---|---|
| Prediction (fast rate) | Compatibility condition | None required |
| Variable screening ($\hat{S} \supseteq S_0$) | Restricted eigenvalue condition | Beta-min condition |
| Variable selection ($\hat{S} = S_0$) | Irrepresentable condition | Beta-min condition |

7. The Adaptive Lasso

The standard Lasso has a well-known bias problem: because soft-thresholding shrinks all coefficients by the same amount $\lambda/2$, even truly large coefficients get biased toward zero. The Adaptive Lasso (Zou, 2006) addresses this by using a data-driven, variable-specific penalty.

7.1 Definition

Definition — Adaptive Lasso

The Adaptive Lasso is a two-stage procedure:

  1. Stage 1: Compute an initial estimator $\hat{\beta}_{\text{init}}$ (e.g., the Lasso with $\lambda$ chosen by CV).
  2. Stage 2: Solve the re-weighted Lasso:
    $$\hat{\beta}_{\text{adapt}}(\lambda) = \arg\min_{\beta}\left(\|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_{\text{init},j}|}\right)$$

Intuition: The penalty weight for variable $j$ is $1/|\hat{\beta}_{\text{init},j}|$. If the initial estimate is large (strong signal), the penalty is small — so we barely shrink that variable. If the initial estimate is small (likely noise), the penalty is huge — strongly encouraging that variable to be zero. This way the Adaptive Lasso "adapts" the penalty to the signal strength of each variable.

7.2 Key Properties

The Adaptive Lasso has two important properties: it mitigates the Lasso's shrinkage bias, because variables with large initial estimates receive only a small penalty weight; and it has stronger guarantees for variable screening and selection than the one-stage Lasso, typically yielding sparser models with fewer false positives.

7.3 Threshold Functions Compared

For orthonormal design with $\hat{\beta}_{\text{init}} = \hat{\beta}_{\text{OLS}}$, the Adaptive Lasso has the explicit solution:

$$\hat{\beta}_{\text{adapt},j} = \text{sign}(Z_j)\left(|Z_j| - \frac{\lambda}{2|Z_j|}\right)_+$$

This defines a different thresholding function compared to the standard Lasso. While the Lasso (soft-thresholding) always shrinks by the constant $\lambda/2$, the Adaptive Lasso shrinks by $\lambda/(2|Z_j|)$ — a decreasing amount as $|Z_j|$ grows. For large OLS estimates, the Adaptive Lasso closely approximates hard-thresholding (keeping or discarding, but barely shrinking), which is less biased.
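The two threshold functions are easy to compare numerically; with $\lambda = 2$, a large OLS estimate $Z_j = 5$ is shrunk to $4$ by the Lasso but only to $4.8$ by the Adaptive Lasso:

```python
import numpy as np

def soft(z, lam):
    """Lasso: constant shrinkage by lam/2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2, 0.0)

def adaptive(z, lam):
    """Adaptive Lasso (orthonormal design): shrinkage lam/(2|z|) vanishes
    as |z| grows, approaching hard-thresholding for large z."""
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * np.abs(z)), 0.0)
    return np.where(z == 0, 0.0, out)

Z = np.array([0.3, 1.0, 5.0])
print(soft(Z, 2.0))      # [0.  0.  4. ]  constant shrinkage by 1
print(adaptive(Z, 2.0))  # [0.  0.  4.8]  large Z barely shrunk
```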

7.4 Practical Usage

In practice, the Adaptive Lasso is computed as follows:

  1. Fit a standard Lasso with $\lambda_{\text{init}}$ chosen by 10-fold CV.
  2. Construct a modified design matrix: multiply the $j$-th column of $X$ by $|\hat{\beta}_{\text{init},j}|$ (and drop columns where $\hat{\beta}_{\text{init},j} = 0$).
  3. Fit a standard Lasso on this modified design, choosing $\lambda$ by a second round of CV.

This sequential tuning (two separate CV steps instead of a joint optimization over two parameters) is computationally much cheaper.
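The column-rescaling trick in the recipe above can be sketched directly (a minimal illustration with fixed $\lambda$ values; in practice both would come from CV as described):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for ||Y - X b||^2/n + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    d = np.sum(X**2, axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = Y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            beta[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / d[j]
    return beta

def adaptive_lasso(X, Y, lam_init, lam_adapt):
    """Two-stage Adaptive Lasso via column rescaling."""
    p = X.shape[1]
    beta_init = lasso_cd(X, Y, lam_init)        # stage 1
    keep = np.flatnonzero(beta_init)            # drop variables zeroed out
    w = np.abs(beta_init[keep])
    X_tilde = X[:, keep] * w                    # rescale surviving columns
    beta_tilde = lasso_cd(X_tilde, Y, lam_adapt)  # stage 2: plain Lasso
    beta_adapt = np.zeros(p)
    beta_adapt[keep] = w * beta_tilde           # map back to original scale
    return beta_adapt
```

Rescaling column $j$ by $|\hat{\beta}_{\text{init},j}|$ turns the weighted penalty $\lambda\sum_j |\beta_j|/|\hat{\beta}_{\text{init},j}|$ into an ordinary $\ell_1$-penalty on the reparametrised coefficients, so any standard Lasso solver can be reused unchanged.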

Example — Lasso vs. Adaptive Lasso (Motif Regression)

In a motif regression problem with $p = 195$ and $n = 143$, the Lasso with CV selected $|\hat{S}| = 26$ variables. The Adaptive Lasso (with the Lasso as initial estimator) selected only $|\hat{S}_{\text{adapt}}| = 16$ variables — a substantially sparser model while retaining the key predictive variables.

Should We Always Use the Adaptive Lasso?

The Adaptive Lasso is computationally only slightly more expensive (two Lasso fits instead of one). It often produces a somewhat better, sparser model, and has stronger theoretical properties for variable screening and selection. However, in large-scale data the differences may not always be dramatic. The general recommendation from the lectures: "Yes, often the Adaptive Lasso is perhaps a bit better."

Key Takeaways