Lecture 1: Introduction to High-Dimensional Statistics & The Lasso

High-Dimensional Statistics, Lecture 1. Based on Bühlmann & van de Geer (2011)

This opening lecture sets the stage for the entire course: what happens when you have far more variables (features) than observations? In classical statistics you might have $n = 1000$ patients and $p = 5$ measurements each. In modern applications the situation is reversed — you may have $n = 100$ samples but $p = 4000$ gene expressions. This is the high-dimensional regime where $p \gg n$, and standard tools like ordinary least squares (OLS) simply break down. The course introduces the Lasso as the central tool for tackling this challenge, along with its extensions and the theoretical machinery needed to understand when and why it works.

1. What Is High-Dimensional Data?

1.1 General Framework

We observe $n$ independent and identically distributed (i.i.d.) data points $Z_1, \ldots, Z_n$, where the dimension of each $Z_i$ is much larger than $n$. The most common setting is regression, where $Z_i = (X_i, Y_i)$ with a high-dimensional covariate vector $X_i \in \mathbb{R}^p$ and a response $Y_i \in \mathbb{R}$; classification, with a discrete response, is treated analogously.

Example — Riboflavin Production with Bacillus subtilis

A collaboration with DSM (Switzerland) aimed to improve the riboflavin (vitamin B2) production rate of the bacterium Bacillus subtilis via genetic engineering. The data had a response variable $Y$ representing the log-production rate, and covariates $X \in \mathbb{R}^{4088}$ — the expression levels of 4088 genes. With only $n = 115$ samples, we have $p \gg n$. The question: which genes are most relevant for riboflavin production?

Such data arises everywhere today: in biology (genomics, proteomics), medical imaging, economics, and environmental sciences. The key challenge is always the same — far more unknowns than equations.

1.2 Why Classical Methods Fail

In a standard linear model $Y = X\beta^0 + \varepsilon$, the ordinary least squares (OLS) estimator minimises $\|Y - X\beta\|_2^2$. When $p > n$, the design matrix $X_{n \times p}$ has more columns than rows, so the system is underdetermined — there are infinitely many solutions that fit the training data perfectly. OLS is not unique, and any OLS solution will massively overfit: it memorises noise and performs terribly on new data.
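To make this concrete, here is a minimal NumPy sketch (simulated data; all dimensions and names are illustrative): with $p > n$, least squares interpolates the training data essentially exactly, yet the fitted coefficients transfer poorly to fresh samples from the same model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                                  # more unknowns than equations
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]                   # sparse truth: only 3 active coefficients
y = X @ beta0 + 0.5 * rng.standard_normal(n)

# lstsq returns the minimum-norm solution; with p > n it fits y exactly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((y - X @ beta_hat) ** 2)

# On fresh data from the same model, the interpolating fit does far worse
X_new = rng.standard_normal((1000, p))
test_mse = np.mean((X_new @ beta_hat - X_new @ beta0) ** 2)
print(train_mse, test_mse)                     # near-zero training error, large test error
```

The near-zero training error is exactly the overfitting described above: the minimum-norm solution memorises the noise in $y$.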

Key Insight

When $p > n$, we must impose additional structure on $\beta^0$ to make estimation possible. The most popular assumption is sparsity — only a few of the $p$ coefficients are non-zero. Regularisation methods exploit this structure.

2. The Sparsity Assumption

Sparsity means that the true parameter vector $\beta^0 = (\beta_1^0, \ldots, \beta_p^0)^T$ has most of its entries equal to zero. We define the active set as $S_0 = \{j : \beta_j^0 \neq 0\}$ and the sparsity as $s_0 = |S_0|$, the number of truly relevant variables. We need $s_0 \ll p$ (and ideally $s_0 \ll n$) for meaningful estimation.

Definition — Sparsity Measures

Sparsity can be quantified by $\ell_q$-"norms":

  • $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$ — counts non-zero entries (not a true norm).
  • $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ — the $\ell_1$-norm, convex and computationally friendly.
  • $\|\beta\|_q^q = \sum_{j=1}^{p} |\beta_j|^q$ for $0 < q < \infty$.

Roughly, high-dimensional inference is possible when $\log(p) \cdot \text{sparsity}(\beta) \ll n$.
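These measures are straightforward to compute; a small sketch (the vector values are illustrative):

```python
import numpy as np

beta = np.array([3.0, 0.0, -1.5, 0.0, 0.5])

l0 = np.count_nonzero(beta)            # ||beta||_0: counts non-zeros (not a true norm)
l1 = np.sum(np.abs(beta))              # ||beta||_1: the convex surrogate used by the Lasso
q = 0.5
lq_q = np.sum(np.abs(beta) ** q)       # ||beta||_q^q for q = 1/2
print(l0, l1, lq_q)
```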

Intuition — Why Sparsity Helps

Think of predicting a person's health from 10,000 biomarkers. In reality, only perhaps 20 of those biomarkers are truly related to the outcome. If we could somehow identify those 20, we would have a small, manageable estimation problem ($s_0 = 20$ unknowns, $n$ observations). Sparsity-based methods try to find this "needle in the haystack" automatically.

3. The Linear Model Setup

We work with the standard linear model:

$$Y = X_{n \times p}\, \beta^0 + \varepsilon$$

where $Y \in \mathbb{R}^n$ is the response vector, $X$ is the $n \times p$ design matrix, $\beta^0 \in \mathbb{R}^p$ is the unknown parameter, and $\varepsilon \in \mathbb{R}^n$ contains i.i.d. errors with $\mathbb{E}[\varepsilon_i] = 0$. For simplicity, we usually assume zero intercept and that all covariates are centred and standardised (each column of $X$ has mean zero and empirical variance one).
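These conventions can be sketched in a few lines (simulated data, illustrative dimensions): centre and standardise each column of the design, then generate the response from a sparse $\beta^0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s0 = 100, 500, 5
X = rng.standard_normal((n, p))
# centre and standardise: each column gets mean zero, empirical variance one
X = (X - X.mean(axis=0)) / X.std(axis=0)

beta0 = np.zeros(p)
beta0[:s0] = 1.0                       # sparse truth with s0 active variables
eps = rng.standard_normal(n)           # i.i.d. errors with mean zero
Y = X @ beta0 + eps
```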

In the high-dimensional setting, the crucial feature is $p \gg n$. Combined with the sparsity assumption on $\beta^0$, this defines the problem landscape of the course.

4. The Lasso Estimator

4.1 Definition

Definition — The Lasso (Tibshirani, 1996)

The Lasso (Least Absolute Shrinkage and Selection Operator) is defined as:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_1 \right)$$

where $\lambda \geq 0$ is the regularisation parameter (tuning parameter) controlling the trade-off between fitting the data (first term) and keeping the model simple (second term).

The first term, $\|Y - X\beta\|_2^2/n$, is the familiar residual sum of squares (divided by $n$) — it measures how well $\beta$ fits the training data. The second term, $\lambda\|\beta\|_1 = \lambda \sum_j |\beta_j|$, is the $\ell_1$-penalty, which pushes coefficients toward zero. The larger $\lambda$ is, the more coefficients get set exactly to zero — giving a sparser model.
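One way to see the two terms interact is a bare-bones coordinate-descent solver for exactly this objective (a didactic sketch, not a production implementation; the data and function names are illustrative). Each coordinate update shrinks toward zero, which is where the exact zeros come from.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/n) * ||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # (X_j' X_j) / n for each column
    r = y.astype(float).copy()                 # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * b[j]             # partial residual: add j's part back
            z = X[:, j] @ r / n
            # shrinkage update: this is where coefficients become exactly zero
            b[j] = np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / col_sq[j]
            r = r - X[:, j] * b[j]
    return b

rng = np.random.default_rng(2)
n, p = 50, 100
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [3.0, -2.0, 1.5]
y = X @ beta0 + 0.5 * rng.standard_normal(n)

b_small = lasso_cd(X, y, lam=0.5)
b_large = lasso_cd(X, y, lam=2.0)
print(np.count_nonzero(b_small), np.count_nonzero(b_large))
```

The larger penalty zeroes out more coordinates, illustrating how $\lambda$ controls sparsity.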

4.2 Why the $\ell_1$-Penalty Produces Sparsity

There is an equivalent "constrained" formulation of the Lasso (the so-called primal form):

$$\hat{\beta}_{\text{primal}}(R) = \arg\min_{\beta:\, \|\beta\|_1 \leq R} \|Y - X\beta\|_2^2 / n$$

There is a one-to-one correspondence between $\lambda$ and $R$ (depending on the data). The geometric picture makes the sparsity property clear:

[Figure: two panels in the $(\beta_1, \beta_2)$ plane, comparing the Lasso $\ell_1$ constraint (a diamond, with $\hat{\beta}_{\text{Lasso}}$ at a corner on an axis) and the Ridge $\ell_2$ constraint (a circle, with $\hat{\beta}_{\text{Ridge}}$ off the axes); in both panels $\hat{\beta}_{\text{OLS}}$ lies outside the constraint region.]
Figure 1: The ℓ₁-ball (diamond) has corners on the axes, so the expanding contour ellipses of the residual sum of squares tend to first touch the ball at a corner — setting one coefficient to exactly zero. The ℓ₂-ball (circle) has no corners, so the solution rarely lands exactly on an axis.

The key geometric intuition: the $\ell_1$-ball is a diamond shape with sharp corners that lie on the coordinate axes. When you expand the elliptical contours of the least-squares loss outward from the OLS solution, they will typically first touch the diamond at one of these corners — which corresponds to one or more $\hat{\beta}_j$ being exactly zero. By contrast, the $\ell_2$-ball (used in Ridge regression) is round, with no corners, so the contours will almost never hit the ball exactly on an axis. This is why Lasso does variable selection (produces exact zeros) while Ridge regression does not.

4.3 The Lasso as Shrinkage + Selection

As its name suggests, the Lasso simultaneously does two things: it shrinks all coefficient estimates toward zero (reducing variance), and it selects variables by setting some coefficients exactly to zero. This dual property makes it extremely useful in high-dimensional settings.

5. Lasso for Orthonormal Design: A Special Case

To build intuition, consider the simplest possible case: the design is orthonormal, meaning $X^T X / n = I_{p \times p}$ (which requires $p \leq n$). In this special case, the Lasso solution has a beautiful closed form.

Proposition 1 — Soft-Thresholding

Assume orthonormal design $X^T X/n = I$. Then the Lasso estimator is:

$$\hat{\beta}_j(\lambda) = g_{\lambda/2}(Z_j), \qquad Z_j = (X^T Y)_j / n = \hat{\beta}_{\text{OLS},j}$$

where $g_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$ is the soft-thresholding function.

In words: start from the OLS estimate $Z_j$ for each coefficient, shrink its absolute value by $\lambda/2$, and set the coefficient to exactly zero if this would make the absolute value negative. This is the simplest possible "keep the big signals, kill the small ones" rule.
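The rule is one line of code; a quick sketch (the input values are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """g_lam(z) = sign(z) * (|z| - lam)_+ : shrink toward zero, kill small values."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-3.0, -0.4, 0.0, 0.4, 3.0])
# entries with |z| <= 1 are zeroed; the rest move toward zero by 1
print(soft_threshold(z, 1.0))
```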

[Figure: the soft-thresholding function $g_\lambda(z)$ plotted against $z$, equal to zero on $[-\lambda, \lambda]$ and shifted toward zero by $\lambda$ outside it, alongside the hard-thresholding function, which is zero on the same interval and equal to the identity outside it.]
Figure 2: Soft-thresholding shrinks all values toward zero and kills those below λ. Hard-thresholding keeps values above the threshold unchanged (used later in other methods).
Why Is This Important?

The orthonormal case gives us clean intuition: the Lasso acts as a "noise filter" that removes small estimated coefficients (likely noise) while keeping large ones (likely signal) — but it also introduces bias by shrinking the large coefficients toward zero. This bias-variance trade-off is a recurring theme throughout the course.

6. The Lasso as a Prediction Machine

6.1 Using Cross-Validation for $\lambda$

In practice, the key question is: how do we choose $\lambda$? The most common approach is $k$-fold cross-validation (typically $k = 10$). The idea is:

  1. Split the data randomly into $k$ roughly equal parts (folds).
  2. For each candidate $\lambda$, hold out one fold, train the Lasso on the remaining $k-1$ folds, and measure the prediction error on the held-out fold.
  3. Repeat for each fold, average the errors, and pick the $\lambda$ that minimises the average prediction error.

This is denoted $\hat{\lambda}_{\text{CV}}$. To honestly assess overall performance, one should use a double cross-validation: an outer loop evaluates the whole procedure (including inner CV for $\lambda$-selection) on held-out test data.
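In code, the whole CV procedure is a few lines with scikit-learn's `LassoCV` (a sketch on simulated data; note that scikit-learn parametrises the penalty as $\alpha$ with the squared-error term scaled by $1/(2n)$, so its $\alpha$ plays the role of $\lambda/2$ in the notation above):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 2.0                              # 5 truly active variables
y = X @ beta0 + rng.standard_normal(n)

# 10-fold CV over an automatically chosen grid of penalty values
fit = LassoCV(cv=10, random_state=0).fit(X, y)
print(fit.alpha_)                            # the selected penalty
print(np.count_nonzero(fit.coef_))           # typically at least 5: CV tends to overselect
```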

6.2 Prediction Performance

For prediction, the Lasso is remarkably effective and competitive with far more complex methods. A practical demonstration from a financial application (MSc thesis, Jiawen Le, 2019) showed the Lasso performing among the top methods when constructing machine-learning-based long-short investment portfolios, competing with random forests, gradient boosting, and neural networks — while being far simpler and more interpretable.

Practical Tip

The Lasso tuned via cross-validation for prediction often selects too many variables (more than the true $s_0$). This is because it tends to overestimate the active set: $\hat{S}(\hat{\lambda}_{\text{CV}}) \supseteq S_0$ with high probability. Good for prediction (it rarely misses a relevant variable), but it means some selected variables may be noise. This overselection is addressed by methods like the adaptive Lasso later in the course.

7. Variable Selection and Screening

7.1 Active Set and Screening

Beyond prediction, we often want to know which variables are truly important. The set of selected variables is $\hat{S}(\lambda) = \{j : \hat{\beta}_j(\lambda) \neq 0\}$. The Lasso has a fundamental screening property: with the right choice of $\lambda$, the selected set contains the true active set with high probability: $\hat{S}(\lambda) \supseteq S_0$.

Moreover, the cardinality of the Lasso's selected set is bounded: $|\hat{S}| \leq \min(n, p)$. When $p \gg n$, this means the Lasso reduces dimensionality from $p$ down to at most $n$ — a potentially enormous reduction.
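This dimensionality reduction is easy to check numerically (scikit-learn sketch on simulated data; scikit-learn's $\alpha$ corresponds to $\lambda/2$ in the notation above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 40, 1000                              # p >> n
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 2.0                              # 3 truly active variables
y = X @ beta0 + rng.standard_normal(n)

S_hat = np.flatnonzero(Lasso(alpha=0.5).fit(X, y).coef_)
print(len(S_hat))                            # bounded by min(n, p) = 40
```

Despite $p = 1000$ candidates, the selected set stays below $n$ and, here, contains the three truly active variables (the screening property).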

7.2 Exact Variable Selection

Going beyond screening to exact variable selection — recovering $\hat{S} = S_0$ — is much harder and requires stronger conditions on the design matrix $X$ (the "irrepresentable condition" or "neighbourhood stability condition"). These are discussed in later lectures. The take-away for now: the Lasso is an excellent screening tool, but for exact selection one typically needs two-stage procedures.

8. The Adaptive Lasso: A Two-Stage Refinement

Definition — Adaptive Lasso (Zou, 2006)

The adaptive Lasso uses re-weighted $\ell_1$-penalties:

$$\hat{\beta}_{\text{adapt}}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_{\text{init},j}|} \right)$$

where $\hat{\beta}_{\text{init}}$ is an initial estimator (typically the Lasso from a first stage).

The idea is elegant: in the first stage, use the Lasso to get initial estimates. In the second stage, penalise each coefficient in inverse proportion to its first-stage estimate. This means: variables with large initial estimates receive only a light penalty, while variables with small initial estimates are penalised heavily; variables whose initial estimate is exactly zero receive an infinite penalty and are excluded altogether. The net effect is less bias on the large coefficients and fewer false positives.
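The two stages can be sketched via a standard column-rescaling trick (scikit-learn, simulated data; dividing coefficient $j$'s penalty by $|\hat{\beta}_{\text{init},j}|$ is equivalent to multiplying column $j$ of $X$ by it):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:4] = [3.0, -2.0, 2.0, 1.5]
y = X @ beta0 + rng.standard_normal(n)

# Stage 1: plain Lasso supplies the initial estimates (the weights)
b_init = Lasso(alpha=0.1).fit(X, y).coef_
keep = np.flatnonzero(b_init)                # zero weight = infinite penalty: dropped

# Stage 2: rescale surviving columns by |b_init| and refit;
# large first-stage coefficients therefore feel a small effective penalty
X2 = X[:, keep] * np.abs(b_init[keep])
c = Lasso(alpha=0.1).fit(X2, y).coef_
b_adapt = np.zeros(p)
b_adapt[keep] = c * np.abs(b_init[keep])     # map back to the original parametrisation

print(np.count_nonzero(b_init), np.count_nonzero(b_adapt))
```

The second stage can only shrink the selected set, mirroring the reduction seen in the motif-regression example below.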

Example — Motif Regression (DNA Binding Sites)

In the HIF1α transcription factor binding site problem ($p = 195$ candidate motifs, $n = 143$), the Lasso with cross-validation selected 26 variables. The adaptive Lasso, starting from this Lasso fit and using a second round of cross-validation, reduced the selection to 16 variables — a meaningful reduction in false positives while retaining the important signals.

9. Ridge Regression vs. Lasso — Comparison

Property-by-property comparison:

  • Penalty: Ridge uses $\lambda \|\beta\|_2^2 = \lambda \sum_j \beta_j^2$; the Lasso uses $\lambda \|\beta\|_1 = \lambda \sum_j |\beta_j|$.
  • Sparsity: Ridge, no — all coefficients remain non-zero; Lasso, yes — many coefficients are set exactly to zero.
  • Variable selection: Ridge, no; Lasso, yes.
  • Constraint geometry: Ridge, a sphere (smooth, no corners); Lasso, a diamond (corners on the axes).
  • Bias on large coefficients: Ridge, moderate (proportional shrinkage); Lasso, higher (constant shrinkage by $\lambda/2$ in the orthonormal case).
  • When $p \gg n$: Ridge has a unique but dense solution; the Lasso has a unique solution under mild conditions, and it is sparse.
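The sparsity contrast is easy to verify empirically (scikit-learn sketch on simulated data; penalty levels are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n, p = 50, 100
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 2.0
y = X @ beta0 + rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print(np.count_nonzero(ridge.coef_))   # Ridge: essentially every coefficient non-zero
print(np.count_nonzero(lasso.coef_))   # Lasso: exact zeros, i.e. variable selection
```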

10. Course Roadmap — Topics Ahead

Lecture 1 introduced the landscape. The course will cover the following major topics:

  1. Lasso and modifications for high-dimensional linear and generalised linear models — including oracle inequalities, convergence rates, and theoretical guarantees.
  2. Group Lasso for settings where variables come in natural groups (e.g., all dummy variables encoding a categorical factor).
  3. Additive models with many smooth univariate functions — combining sparsity with smoothness.
  4. De-biasing / de-sparsified Lasso — constructing confidence intervals and $p$-values in high-dimensional settings.
  5. Stability selection — a method to control false positives by subsampling.
  6. Multiple sample splitting — another approach to valid inference after selection.
  7. Hidden confounding and deconfounding — dealing with unobserved common causes.
  8. Undirected graphical modeling — estimating conditional independence structures when $p \gg n$.

Key Takeaways