Lecture 7: Uncertainty Quantification — P-values & Confidence Intervals in High Dimensions

High-Dimensional Statistics, Lecture 7. Based on Ch. 10–11 of Bühlmann & van de Geer (2011) and the lecture slides.

Overview

Up until now in the course, we have focused on estimation and prediction in high-dimensional settings — building sparse models via the Lasso, Group Lasso, and related methods. But in scientific applications, simply obtaining a point estimate is often not enough. We want to know: how confident are we in these estimates? Which variables are truly relevant, and can we attach a measure of statistical significance (a p-value) or a confidence interval to each coefficient?

This lecture tackles exactly this question. It introduces two main approaches to frequentist uncertainty quantification for high-dimensional linear models: the de-sparsified (de-biased) Lasso and multi sample splitting. These are classical statistical concepts — p-values and confidence intervals — adapted to the very challenging setting where $p \gg n$.

1. Why Is Uncertainty Quantification Hard in High Dimensions?

1.1 The Problem with the Lasso

Consider the standard high-dimensional linear model $Y = X\beta^0 + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The Lasso gives us a sparse estimate $\hat\beta(\lambda)$, which is great for prediction. But the Lasso is biased — it systematically shrinks coefficients towards zero — and this bias makes it unsuitable for direct inference (hypothesis tests, confidence intervals).

In the classical low-dimensional world ($p < n$), OLS gives us unbiased estimates with known Gaussian distributions, and we can read off standard errors and p-values from any regression output. When $p > n$, the OLS estimator is no longer unique (the normal equations are underdetermined), so we cannot simply fall back on classical theory.

Motivating Example: Motif Regression

In a biological application with $p = 195$ candidate motifs and $n = 143$ observations, the Lasso selects 26 variables. But which of these are truly relevant? The Lasso gives us coefficient estimates, but no p-values or confidence intervals. We need a way to quantify how certain we are about each selection.

1.2 What We Want

For each coefficient $\beta_j^0$ ($j = 1, \dots, p$), we want to test:

$$H_{0,j}: \beta_j^0 = 0 \quad \text{vs.} \quad H_{A,j}: \beta_j^0 \neq 0$$

and construct confidence intervals of the form $\hat b_j \pm c_\alpha \cdot \widehat{\text{se}}_j$. We want these to have valid frequentist coverage properties, even when $p \gg n$.

2. The De-sparsified (De-biased) Lasso

2.1 Starting Point: OLS in Low Dimensions

To understand the de-sparsified Lasso, let's first recall how things work when $p < n$ and $\text{rank}(X) = p$. The OLS estimator for a single coefficient $\beta_j^0$ can be written as a partial regression:

$$\hat\beta_{\text{OLS},j} = \frac{Y^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}}$$

where $Z^{(j)}$ are the residuals from regressing the $j$-th covariate $X^{(j)}$ on all other covariates $X^{(-j)} = \{X^{(k)}; k \neq j\}$:

$$Z^{(j)} = X^{(j)} - X^{(-j)}\hat\gamma^{(j)}, \qquad \hat\gamma^{(j)} = \arg\min_\gamma \|X^{(j)} - X^{(-j)}\gamma\|_2^2$$

Intuition: What Are These Residuals $Z^{(j)}$?

Think of $Z^{(j)}$ as the "part of $X^{(j)}$ that cannot be explained by the other variables". By projecting $Y$ onto this residual, we isolate the unique contribution of variable $j$. This is the classical Frisch–Waugh–Lovell idea: to understand the effect of one variable, first "partial out" all the others.
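The Frisch–Waugh–Lovell identity is easy to verify numerically in the low-dimensional case. The sketch below (simulated data, all names illustrative) checks that the partial-regression ratio reproduces the full OLS coefficient:

```python
# Verify Frisch-Waugh-Lovell numerically: the OLS coefficient of
# variable j equals Y^T Z / (X_j^T Z), where Z is the residual of
# X_j regressed on the other columns. Simulated data; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p, j = 100, 5, 2
X = rng.standard_normal((n, p))
beta = np.array([1.0, 0.0, 2.0, 0.0, -1.0])
Y = X @ beta + 0.5 * rng.standard_normal(n)

# Full OLS fit of Y on all of X.
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

# Partial out the other covariates: Z is the part of X_j they cannot explain.
X_mj = np.delete(X, j, axis=1)
gamma_hat = np.linalg.lstsq(X_mj, X[:, j], rcond=None)[0]
Z = X[:, j] - X_mj @ gamma_hat

# The partial-regression ratio recovers the same coefficient.
b_partial = (Y @ Z) / (X[:, j] @ Z)
print(np.isclose(beta_ols[j], b_partial))  # True
```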

2.2 The Key Idea: Use Lasso for the Residuals

In high dimensions ($p > n$), we cannot compute the OLS residuals $Z^{(j)}$ because the regression of $X^{(j)}$ on $X^{(-j)}$ is itself a high-dimensional problem. The solution: use the Lasso to compute approximate residuals.

Definition — De-sparsified Lasso (Nodewise Regression)

For each $j = 1, \dots, p$, compute Lasso residuals:

$$Z^{(j)} = X^{(j)} - X^{(-j)}\hat\gamma^{(j)}, \qquad \hat\gamma^{(j)} = \arg\min_\gamma \left(\|X^{(j)} - X^{(-j)}\gamma\|_2^2 + \lambda_j \|\gamma\|_1\right)$$

Then build the projection of $Y$ onto $Z^{(j)}$:

$$\frac{Y^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} = \beta_j^0 + \underbrace{\sum_{k \neq j} \frac{(X^{(k)})^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} \beta_k^0}_{\text{bias term}} + \underbrace{\frac{\varepsilon^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}}}_{\text{noise term}}$$

Notice there is a bias term: because the Lasso residuals $Z^{(j)}$ are only approximately orthogonal to the other columns of $X$, the projection doesn't perfectly isolate $\beta_j^0$. In classical OLS, these cross-terms vanish exactly; here they don't.
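This approximate orthogonality is easy to see numerically. The sketch below uses scikit-learn's `Lasso` (whose `alpha` plays the role of $\lambda_j$; the value is illustrative, not tuned) to compute $Z^{(j)}$ and inspects the cross products $(X^{(k)})^T Z^{(j)}$:

```python
# Nodewise regression with the Lasso in a p >> n setting: the residual
# Z^(j) is only approximately orthogonal to the other columns, which is
# the source of the bias term above. Illustrative tuning value.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, j = 50, 100, 0
X = rng.standard_normal((n, p))

# Lasso of X_j on the remaining columns (sklearn's alpha ~ lambda_j).
X_mj = np.delete(X, j, axis=1)
gamma_hat = Lasso(alpha=0.1, max_iter=10000).fit(X_mj, X[:, j]).coef_
Z = X[:, j] - X_mj @ gamma_hat

# Cross products (X_k)^T Z / n: small, but not exactly zero as in OLS.
cross = X_mj.T @ Z / n
print(np.max(np.abs(cross)))
```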

2.3 Estimating and Subtracting the Bias

The key step is to estimate the bias and subtract it. We estimate the bias as:

$$\widehat{\text{bias}} = \sum_{k \neq j} \frac{(X^{(k)})^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} \hat\beta_k$$

where $\hat\beta_k$ comes from the standard Lasso of $Y$ on $X$. This gives us the de-sparsified Lasso estimator:

Definition — De-sparsified Lasso Estimator
$$\hat b_j = \frac{Y^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} - \sum_{k \neq j} \frac{(X^{(k)})^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} \hat\beta_k \qquad (j = 1, \dots, p)$$

Equivalently, this can be written as:

$$\hat b_j = \hat\beta_j + \frac{(Y - X\hat\beta)^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}}$$

This second form reveals the name "de-biased Lasso": we take the standard Lasso estimate $\hat\beta_j$ and add a correction term based on the Lasso residuals projected onto $Z^{(j)}$.

Important: Not Sparse!

Unlike the Lasso, the de-sparsified estimator $\hat b_j$ is never exactly zero for all $j$. It is not designed for variable selection or sparse estimation — it is designed for inference (p-values and confidence intervals). Think of it as trading sparsity for unbiasedness.

2.4 Asymptotic Distribution and Confidence Intervals

Assuming fixed design $X$ and Gaussian errors $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I)$, the fluctuation (noise) term has a known distribution:

$$\sqrt{n} \cdot \frac{\varepsilon^T Z^{(j)}}{(X^{(j)})^T Z^{(j)}} \sim \mathcal{N}\!\left(0,\; \frac{\sigma^2 \|Z^{(j)}\|_2^2 / n}{|(X^{(j)})^T Z^{(j)}/n|^2}\right)$$

More precisely, after accounting for the bias correction, one can show:

$$\sigma^{-1}\sqrt{n}\;\frac{(X^{(j)})^T Z^{(j)}/n}{\|Z^{(j)}\|_2/\sqrt{n}} \;(\hat b_j - \beta_j^0) = W_j + \Delta_j$$

where $(W_1, \dots, W_p)^T \sim \mathcal{N}_p(0, \Omega)$ with $\Omega_{jj} = 1$ for all $j$ (the $\sigma^{-1}$ factor on the left standardizes the noise), and $\max_j |\Delta_j| = o_P(1)$ (the remaining bias vanishes asymptotically).

Confidence Intervals for $\beta_j^0$

A $(1-\alpha)$-confidence interval for $\beta_j^0$ is:

$$\hat b_j \;\pm\; \hat\sigma \cdot n^{-1/2} \cdot \frac{\|Z^{(j)}\|_2/\sqrt{n}}{|(X^{(j)})^T Z^{(j)}/n|} \cdot \Phi^{-1}(1 - \alpha/2)$$

where $\hat\sigma^2$ can be estimated as $\|Y - X\hat\beta\|_2^2 / n$ or $\|Y - X\hat\beta\|_2^2 / (n - \|\hat\beta\|_0)$.
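Putting the pieces together, the estimator and the interval above can be sketched end-to-end (simulated data; scikit-learn's `Lasso` with fixed rather than cross-validated tuning values; a minimal illustration, not the `hdi` implementation):

```python
# De-sparsified Lasso for one coordinate j, with a 95% confidence
# interval from the formula above. Simulated data; illustrative tuning.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, j = 100, 200, 0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [1.0, 0.5, -1.0]
Y = X @ beta0 + 0.5 * rng.standard_normal(n)

# Main Lasso of Y on X (lambda would normally come from CV).
beta_hat = Lasso(alpha=0.1, max_iter=10000).fit(X, Y).coef_

# Nodewise Lasso residual Z^(j).
X_mj = np.delete(X, j, axis=1)
gamma_hat = Lasso(alpha=0.1, max_iter=10000).fit(X_mj, X[:, j]).coef_
Z = X[:, j] - X_mj @ gamma_hat

# De-sparsified estimate: Lasso coefficient plus bias correction.
b_j = beta_hat[j] + (Y - X @ beta_hat) @ Z / (X[:, j] @ Z)

# Plug-in noise level and the CI half-width from the display above.
sigma_hat = np.sqrt(np.sum((Y - X @ beta_hat) ** 2) / n)
half = (sigma_hat / np.sqrt(n)
        * (np.linalg.norm(Z) / np.sqrt(n)) / abs(X[:, j] @ Z / n)
        * norm.ppf(0.975))
print(f"b_{j} = {b_j:.3f}, CI = [{b_j - half:.3f}, {b_j + half:.3f}]")
```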

2.5 When Does It Work? Conditions and Efficiency

For the de-biased Lasso to give valid inference, we need:

  1. Sparsity of the regression vector: $s_0 = o(\sqrt{n}/\log p)$ active variables.
  2. The compatibility condition on the design, so that the main Lasso estimates $\beta^0$ fast enough.

For the estimator to achieve the best possible asymptotic variance (the Cramér–Rao lower bound), we additionally need:

  1. Sparse rows of the precision matrix $\Theta = \Sigma_X^{-1}$, so that each nodewise regression is itself a sparse problem and $Z^{(j)}$ is accurate.

Under these conditions, the de-biased Lasso achieves (Theorem 10.2 in the book):

$$\sqrt{n}(\hat b_j - \beta_j^0) \Longrightarrow \mathcal{N}(0, \sigma^2 \Theta_{jj})$$

where $\Theta = \Sigma_X^{-1} = \text{Cov}(X)^{-1}$. This is exactly the same asymptotic variance as OLS in low dimensions — a remarkable result showing that, under sparsity, high-dimensional inference can be as efficient as if we knew which variables mattered.

2.6 Choice of Tuning Parameters

The de-sparsified Lasso involves two types of tuning parameters: $\lambda$ for the main Lasso $\hat\beta = \hat\beta(\hat\lambda_{CV})$ (chosen by cross-validation as usual), and $\lambda_j$ for each nodewise regression computing $Z^{(j)}$.

There is a bias–variance trade-off in the choice of $\lambda_j$:

  1. A large $\lambda_j$ keeps $Z^{(j)}$ close to $X^{(j)}$, which yields a small variance but leaves large cross-terms $(X^{(k)})^T Z^{(j)}$, i.e. a large bias.
  2. A small $\lambda_j$ makes $Z^{(j)}$ more nearly orthogonal to the other columns, shrinking the bias at the price of a larger variance.

The practical recommendation is to inflate the variance a bit (use a slightly smaller $\lambda_j$) to ensure the bias correction works well. This controls the Type I error at the cost of slightly decreased power.

Empirical Performance (R package hdi)

The de-sparsified Lasso is implemented in the R package hdi (by Meier et al.). Simulation studies show that the 95% confidence intervals achieve close to nominal coverage in a wide range of settings — typically between 86% and 96% coverage across individual coefficients. Robust or bootstrap variants can improve the coverage slightly.

3. Sample Splitting for P-values

3.1 The Idea

An alternative, conceptually simpler approach to inference in high dimensions is sample splitting, proposed by Wasserman and Roeder (2009). The idea is beautifully simple:

  1. Split the data randomly into two halves: $I_1$ and $I_2$ (each of size $\approx n/2$).
  2. Screen: Use the first half $I_1$ to select important variables using the Lasso (or any other method), obtaining a set $\hat S(I_1)$.
  3. Test: Use the second half $I_2$ to perform classical OLS-based t-tests on the selected variables only.

Because the two halves are independent, the p-values from step 3 are valid (conditional on step 2 correctly including all truly active variables). We adjust for multiplicity using a Bonferroni correction:

$$\tilde P_{\text{corr},j} = \min(\tilde P_j \cdot |\hat S(I_1)|, \; 1) \qquad (j = 1, \dots, p)$$

where $\tilde P_j$ is the raw two-sided t-test p-value for variable $j$ based on the second half $I_2$, and variables not in $\hat S(I_1)$ get $\tilde P_j = 1$.
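A single split can be sketched as follows (simulated data; screening via scikit-learn's `Lasso` and hand-rolled OLS t-tests on the second half; names and tuning values are illustrative):

```python
# Single sample splitting: Lasso screening on I1, Bonferroni-corrected
# OLS t-tests on I2. Simulated data; illustrative tuning values.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 120, 200
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:2] = [2.0, -2.0]
Y = X @ beta0 + rng.standard_normal(n)

# 1. Random split into two halves.
perm = rng.permutation(n)
I1, I2 = perm[: n // 2], perm[n // 2:]

# 2. Screen on I1 with the Lasso.
S_hat = np.flatnonzero(
    Lasso(alpha=0.2, max_iter=10000).fit(X[I1], Y[I1]).coef_)

# 3. OLS t-tests on I2 for the screened variables only; unselected
#    variables receive p-value 1.
P = np.ones(p)
if 0 < len(S_hat) < len(I2):
    Xs, Ys = X[I2][:, S_hat], Y[I2]
    coef = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
    df = len(I2) - len(S_hat)
    s2 = np.sum((Ys - Xs @ coef) ** 2) / df
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    praw = 2 * stats.t.sf(np.abs(coef / se), df)
    P[S_hat] = np.minimum(praw * len(S_hat), 1.0)  # Bonferroni by |S_hat|
print(P[:4])
```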

The "P-value Lottery" Problem

Single sample splitting has a serious drawback: the results depend heavily on the random split. Different splits can yield dramatically different p-values for the same variable. This makes results non-reproducible — a phenomenon called the "p-value lottery".

3.2 Multi Sample Splitting

The solution, proposed by Meinshausen, Meier, and Bühlmann (2009), is to repeat the splitting many times and aggregate the resulting p-values.

Algorithm — Multi Sample Splitting

For $b = 1, \dots, B$ (e.g., $B = 50$ or $100$):

  1. Randomly split the data into two disjoint groups $I_1^{[b]}$ and $I_2^{[b]}$ of approximately equal size.
  2. Using only $I_1^{[b]}$, run the Lasso to estimate the active set $\hat S^{[b]}$.
  3. Using only $I_2^{[b]}$, compute adjusted p-values $\tilde P_{\text{corr},j}^{[b]} = \min(\tilde P_j^{[b]} \cdot |\hat S^{[b]}|, \; 1)$ for each $j = 1, \dots, p$.

Then aggregate over the $B$ p-values using the quantile method described below.

3.3 Aggregating P-values

For each variable $j$, we now have $B$ p-values $\tilde P_{\text{corr},j}^{[1]}, \dots, \tilde P_{\text{corr},j}^{[B]}$. We aggregate them using quantiles. For a fixed $\gamma \in (0,1)$:

$$Q_j(\gamma) = \min\!\Big(\text{quantile}_\gamma\!\big(\tilde P_{\text{corr},j}^{[b]} / \gamma;\; b = 1, \dots, B\big),\; 1\Big)$$

In practice, we use an adaptive version that searches over $\gamma$:

$$P_j = \min\!\Big((1 - \log \gamma_{\min}) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma),\; 1\Big)$$

with $\gamma_{\min} = 0.05$ as a default. The resulting $P_j$ is a valid p-value for $H_{0,j}$.
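The aggregation rule can be written in a few lines of numpy (the function name is ours, not from the `hdi` package; $\gamma$ is searched over a finite grid):

```python
# Adaptive quantile aggregation of B multi-split p-values, following
# the displays above (gamma searched over a grid; function name ours).
import numpy as np

def aggregate_pvalues(pvals, gamma_min=0.05, grid=100):
    """Aggregate per-split p-values into a single valid p-value."""
    pvals = np.asarray(pvals, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, grid)
    # Q(gamma) = min( gamma-quantile of {p_b / gamma}, 1 )
    Q = np.array([min(np.quantile(pvals / g, g), 1.0) for g in gammas])
    # The adaptive search over gamma pays a (1 - log gamma_min) penalty.
    return min((1.0 - np.log(gamma_min)) * Q.min(), 1.0)

print(aggregate_pvalues(np.full(50, 0.001)))
```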

3.4 Error Control

Familywise Error Rate (FWER)

For FWER control at level $\alpha$, simply select all variables with $P_j \leq \alpha$:

$$\hat S_{\text{FWER}}(\alpha) = \{j : P_j \leq \alpha\}$$

Under standard conditions (screening property and Gaussian errors), this controls the FWER asymptotically: $\limsup_{n \to \infty} \mathbb{P}[V > 0] \leq \alpha$, where $V$ is the number of false positives.

False Discovery Rate (FDR)

For FDR control, since the multi-split method already produces multiplicity-corrected p-values, the procedure simplifies: order $P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(p)}$ and select:

$$h = \max\{i : P_{(i)} \leq i \cdot q\}, \qquad q(\alpha) = \frac{\alpha}{\sum_{i=1}^p i^{-1}}$$

The selected set consists of the variables with the $h$ smallest p-values, $P_{(1)}, \dots, P_{(h)}$; if no such $i$ exists, nothing is selected.

This achieves $\limsup_{n\to\infty} \mathbb{E}[V/\max(1,R)] \leq \alpha$.
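Because the $P_j$ are already multiplicity-corrected, this selection rule reduces to a short pass over the order statistics (a sketch; the function name is ours):

```python
# FDR selection over already-corrected multi-split p-values, following
# the rule above: reject the h smallest p-values. Function name ours.
import numpy as np

def fdr_select(P, alpha=0.05):
    """Indices selected by h = max{ i : P_(i) <= i * q(alpha) }."""
    P = np.asarray(P, dtype=float)
    p = len(P)
    q = alpha / np.sum(1.0 / np.arange(1, p + 1))
    order = np.argsort(P)
    ok = np.flatnonzero(P[order] <= np.arange(1, p + 1) * q)
    if ok.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[: ok.max() + 1])

print(fdr_select([0.0001, 0.5, 0.9, 1.0, 1.0]))  # [0]
```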

3.5 Required Conditions

The multi sample splitting approach needs two main conditions:

  1. Screening property: $\mathbb{P}[\hat S^{\lfloor n/2 \rfloor} \supseteq S_0] \to 1$ — the Lasso (or whatever selection method) on half the data must include all truly active variables with high probability.
  2. Sparsity: $\mathbb{P}[|\hat S^{\lfloor n/2 \rfloor}| < n/2] \to 1$ — the selected set must be small enough to allow valid OLS-based tests on the second half.

Advantage Over Classical Methods

Multi sample splitting is competitive even in low-dimensional settings ($p < n$). When correlations among variables are high, it can actually have better detection power than the standard FDR procedure of Benjamini–Hochberg. This is because, by trimming variables in step 2, it reduces the variance inflation that correlated covariates cause in OLS.

4. Comparison of the Two Approaches

| Aspect | De-sparsified Lasso | Multi Sample Splitting |
|---|---|---|
| Core idea | Correct Lasso bias via nodewise regression | Split data; screen, then test on independent half |
| Output | Confidence intervals + p-values | P-values (FWER or FDR) |
| Key conditions | Sparsity of $\beta^0$ and of precision matrix rows | Screening property of selection method |
| Sparsity requirement | $s_0 = o(\sqrt{n}/\log p)$ | Weaker (just need screening to work) |
| Design conditions | Compatibility + sparse nodewise regressions | Much weaker (no irrepresentable condition) |
| Reproducibility | Deterministic (given data) | Approximately reproducible (for large $B$) |
| Achieves Cramér–Rao? | Yes (under conditions) | Not claimed |
| Implementation | R: hdi package | R: hdi package |

5. Recap: Group Lasso

To close, the lecture briefly revisits the Group Lasso and introduces the Sparse Group Lasso, extending ideas from previous lectures.

Recall — Group Lasso (Yuan and Lin, 2006)

Given groups $G_1, \dots, G_q$ forming a partition of $\{1, \dots, p\}$, the parameter vector is written as $\beta = (\beta_{G_1}, \beta_{G_2}, \dots, \beta_{G_q})^T$. The goal is group-sparse estimation: for each group $j$, either the entire block $\hat\beta_{G_j} \equiv 0$, or all components $(\hat\beta_{G_j})_r \neq 0$ for every $r \in G_j$.

Sparse Group Lasso (Simon, Friedman, Hastie & Tibshirani, 2013)
$$\hat\beta(\lambda, \alpha) = \arg\min_\beta \left(\|Y - X\beta\|_2^2/n + (1-\alpha)\lambda \sum_{j=1}^q m_j \|\beta_{G_j}\|_2 + \alpha\lambda\|\beta\|_1\right)$$

This is a convex combination of the Group Lasso penalty and the standard Lasso ($\ell_1$) penalty. For $\alpha > 0$, it can induce sparsity within groups as well — some individual coefficients within an active group can be set to zero. This is useful when we expect that not all variables in a relevant group are important.
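For concreteness, the penalty itself is straightforward to evaluate (numpy sketch; the group weights $m_j$ are taken as $\sqrt{|G_j|}$, a common choice, and the function name is ours):

```python
# Evaluate the Sparse Group Lasso penalty from the display above,
# with m_j = sqrt(|G_j|) as the group weights (a common choice).
import numpy as np

def sparse_group_penalty(beta, groups, lam, alpha):
    """(1 - alpha)*lam * sum_j m_j ||beta_Gj||_2 + alpha*lam * ||beta||_1."""
    group_part = sum(np.sqrt(len(G)) * np.linalg.norm(beta[G]) for G in groups)
    return (1 - alpha) * lam * group_part + alpha * lam * np.abs(beta).sum()

beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(sparse_group_penalty(beta, groups, lam=1.0, alpha=0.0))  # pure Group Lasso
print(sparse_group_penalty(beta, groups, lam=1.0, alpha=1.0))  # pure Lasso: 6.0
```

Setting $\alpha = 0$ recovers the Group Lasso penalty, $\alpha = 1$ the plain $\ell_1$ penalty; intermediate values mix the two.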

6. Key Takeaways