Overview
Up until now in the course, we have focused on estimation and prediction in high-dimensional settings — building sparse models via the Lasso, Group Lasso, and related methods. But in scientific applications, simply obtaining a point estimate is often not enough. We want to know: how confident are we in these estimates? Which variables are truly relevant, and can we attach a measure of statistical significance (a p-value) or a confidence interval to each coefficient?
This lecture tackles exactly this question. It introduces two main approaches to frequentist uncertainty quantification for high-dimensional linear models: the de-sparsified (de-biased) Lasso and multi sample splitting. These are classical statistical concepts — p-values and confidence intervals — adapted to the very challenging setting where $p \gg n$.
1. Why Is Uncertainty Quantification Hard in High Dimensions?
1.1 The Problem with the Lasso
Consider the standard high-dimensional linear model $Y = X\beta^0 + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The Lasso gives us a sparse estimate $\hat\beta(\lambda)$, which is great for prediction. But the Lasso is biased — it systematically shrinks coefficients towards zero — and this bias makes it unsuitable for direct inference (hypothesis tests, confidence intervals).
In the classical low-dimensional world ($p < n$), OLS gives us unbiased estimates with known Gaussian distributions, and we can read off standard errors and p-values from any regression output. When $p > n$, OLS doesn't even exist (the system is underdetermined), so we cannot simply fall back on classical theory.
In a biological application with $p = 195$ candidate motifs and $n = 143$ observations, the Lasso selects 26 variables. But which of these are truly relevant? The Lasso gives us coefficient estimates, but no p-values or confidence intervals. We need a way to quantify how certain we are about each selection.
1.2 What We Want
For each coefficient $\beta_j^0$ ($j = 1, \dots, p$), we want to test:

$$H_{0,j}: \beta_j^0 = 0 \quad \text{versus} \quad H_{A,j}: \beta_j^0 \neq 0,$$
and construct confidence intervals of the form $\hat b_j \pm c_\alpha \cdot \widehat{\text{se}}_j$. We want these to have valid frequentist coverage properties, even when $p \gg n$.
2. The De-sparsified (De-biased) Lasso
2.1 Starting Point: OLS in Low Dimensions
To understand the de-sparsified Lasso, let's first recall how things work when $p < n$ and $\text{rank}(X) = p$. The OLS estimator for a single coefficient $\beta_j^0$ can be written as a partial regression:

$$\hat\beta_{\text{OLS};j} = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}},$$

where $Z^{(j)}$ are the residuals from the OLS regression of the $j$-th covariate $X^{(j)}$ on all other covariates $X^{(-j)} = \{X^{(k)}; k \neq j\}$:

$$Z^{(j)} = X^{(j)} - X^{(-j)}\hat\gamma^{(j)}, \qquad \hat\gamma^{(j)} = \arg\min_{\gamma \in \mathbb{R}^{p-1}} \|X^{(j)} - X^{(-j)}\gamma\|_2^2.$$
Think of $Z^{(j)}$ as the "part of $X^{(j)}$ that cannot be explained by the other variables". By projecting $Y$ onto this residual, we isolate the unique contribution of variable $j$. This is the classical Frisch–Waugh–Lovell idea: to understand the effect of one variable, first "partial out" all the others.
2.2 The Key Idea: Use Lasso for the Residuals
In high dimensions ($p > n$), we cannot compute the OLS residuals $Z^{(j)}$ because the regression of $X^{(j)}$ on $X^{(-j)}$ is itself a high-dimensional problem. The solution: use the Lasso to compute approximate residuals.
For each $j = 1, \dots, p$, run the Lasso of $X^{(j)}$ on $X^{(-j)}$ (a "nodewise" regression, with penalty $\lambda_j$) and take the residuals:

$$Z^{(j)} = X^{(j)} - X^{(-j)}\hat\gamma^{(j)}(\lambda_j).$$

Then build the projection of $Y$ onto $Z^{(j)}$; substituting $Y = X\beta^0 + \varepsilon$, it decomposes as:

$$\frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}} = \beta_j^0 + \underbrace{\sum_{k \neq j} \frac{(Z^{(j)})^T X^{(k)}}{(Z^{(j)})^T X^{(j)}}\,\beta_k^0}_{\text{bias}} + \underbrace{\frac{(Z^{(j)})^T \varepsilon}{(Z^{(j)})^T X^{(j)}}}_{\text{noise}}.$$
Notice there is a bias term: because the Lasso residuals $Z^{(j)}$ are only approximately orthogonal to the other columns of $X$, the projection doesn't perfectly isolate $\beta_j^0$. In classical OLS, these cross-terms vanish exactly; here they don't.
2.3 Estimating and Subtracting the Bias
The key step is to estimate the bias and subtract it. We estimate the bias as:

$$\widehat{\text{bias}}_j = \sum_{k \neq j} \frac{(Z^{(j)})^T X^{(k)}}{(Z^{(j)})^T X^{(j)}}\,\hat\beta_k,$$

where $\hat\beta_k$ comes from the standard Lasso of $Y$ on $X$. This gives us the de-sparsified Lasso estimator:

$$\hat b_j = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}} - \sum_{k \neq j} \frac{(Z^{(j)})^T X^{(k)}}{(Z^{(j)})^T X^{(j)}}\,\hat\beta_k, \qquad j = 1, \dots, p.$$

Equivalently, this can be written as:

$$\hat b_j = \hat\beta_j + \frac{(Z^{(j)})^T (Y - X\hat\beta)}{(Z^{(j)})^T X^{(j)}}.$$
This second form reveals the name "de-biased Lasso": we take the standard Lasso estimate $\hat\beta_j$ and add a correction term based on the Lasso residuals projected onto $Z^{(j)}$.
Unlike the Lasso, the de-sparsified estimator $\hat b_j$ is never exactly zero for all $j$. It is not designed for variable selection or sparse estimation — it is designed for inference (p-values and confidence intervals). Think of it as trading sparsity for unbiasedness.
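To make the construction concrete, here is a minimal numerical sketch in Python (the lecture points to the R package hdi for a real implementation; this NumPy/scikit-learn version with fixed, ad hoc penalty levels is purely illustrative). It computes $\hat b_j$ for one coordinate via nodewise Lasso residuals and checks that the bias-subtraction form and the "de-biased" form coincide:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# main Lasso of Y on X (penalty fixed here for brevity; use cross-validation in practice)
beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_

def desparsify(X, y, beta_hat, j, lam_j=0.1):
    """De-sparsified estimate b_j: project Y onto nodewise residuals, subtract estimated bias."""
    minus_j = np.delete(np.arange(X.shape[1]), j)
    gamma = Lasso(alpha=lam_j, fit_intercept=False).fit(X[:, minus_j], X[:, j]).coef_
    Z = X[:, j] - X[:, minus_j] @ gamma            # approximate residuals Z^(j)
    denom = Z @ X[:, j]
    proj = (Z @ y) / denom                         # projection of Y onto Z^(j)
    bias = (Z @ X[:, minus_j]) @ beta_hat[minus_j] / denom
    return proj - bias, Z

b0, Z0 = desparsify(X, y, beta_hat, j=0)
# equivalent "de-biased" form: Lasso estimate plus residual-based correction
b0_alt = beta_hat[0] + Z0 @ (y - X @ beta_hat) / (Z0 @ X[:, 0])
```

The agreement of `b0` and `b0_alt` is an algebraic identity, so it holds for any choice of the two penalty levels.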
2.4 Asymptotic Distribution and Confidence Intervals
Assuming fixed design $X$ and Gaussian errors $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I)$, the fluctuation (noise) term has a known distribution:

$$\frac{(Z^{(j)})^T \varepsilon}{(Z^{(j)})^T X^{(j)}} \sim \mathcal{N}\!\left(0,\; \sigma^2\,\frac{\|Z^{(j)}\|_2^2}{\big((Z^{(j)})^T X^{(j)}\big)^2}\right).$$

More precisely, after accounting for the bias correction, one can show:

$$\sqrt{n}\,\big(\hat b_j - \beta_j^0\big) = W_j + \Delta_j,$$

where $(W_1, \dots, W_p)^T \sim \mathcal{N}_p(0, \sigma^2 \Omega)$ with $\Omega_{jk} = n\,\frac{(Z^{(j)})^T Z^{(k)}}{[(Z^{(j)})^T X^{(j)}][(Z^{(k)})^T X^{(k)}]}$, and $\max_j |\Delta_j| = o_P(1)$ (the remaining bias vanishes asymptotically).

A $(1-\alpha)$-confidence interval for $\beta_j^0$ is:

$$\hat b_j \;\pm\; \Phi^{-1}(1 - \alpha/2)\,\hat\sigma\,\frac{\|Z^{(j)}\|_2}{\big|(Z^{(j)})^T X^{(j)}\big|},$$
where $\hat\sigma^2$ can be estimated as $\|Y - X\hat\beta\|_2^2 / n$ or $\|Y - X\hat\beta\|_2^2 / (n - \|\hat\beta\|_0)$.
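A minimal sketch of the interval $\hat b_j \pm \Phi^{-1}(1-\alpha/2)\,\hat\sigma\,\|Z^{(j)}\|_2 / |(Z^{(j)})^T X^{(j)}|$ in Python (the function name is my own; the critical value $1.96 \approx \Phi^{-1}(0.975)$ is hard-coded for $\alpha = 0.05$):

```python
import numpy as np

def ci_debiased(b_j, Z_j, X_j, sigma_hat, z_crit=1.96):
    """CI for beta_j: b_j +/- z_crit * sigma_hat * ||Z^(j)||_2 / |(Z^(j))^T X^(j)|."""
    se = sigma_hat * np.linalg.norm(Z_j) / abs(Z_j @ X_j)
    return b_j - z_crit * se, b_j + z_crit * se

# toy check with Z^(j) = X^(j) (as in an orthogonal design): se = sigma_hat / ||Z||_2
lo, hi = ci_debiased(0.0, np.ones(4), np.ones(4), 1.0)
```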
2.5 When Does It Work? Conditions and Efficiency
For the de-biased Lasso to give valid inference, we need:
- Sparsity of $\beta^0$: the true model must be sparse, with $s_0 = o(\sqrt{n}/\log p)$. This requirement is essentially unavoidable: in a minimax sense, no procedure can deliver valid confidence intervals under substantially weaker sparsity.
- Compatibility condition for $X$ (same as for Lasso prediction bounds).
For the estimator to achieve the best possible asymptotic variance (the Cramér–Rao lower bound), we additionally need:
- Sparsity of the nodewise regressions: $X^{(j)}$ versus $X^{(-j)}$ must be sparse, i.e., $s_j \ll n/\log p$.
Under these conditions, the de-biased Lasso achieves (Theorem 10.2 in the book):

$$\sqrt{n}\,\big(\hat b_j - \beta_j^0\big) \;\Rightarrow\; \mathcal{N}\big(0, \sigma^2\,\Theta_{jj}\big),$$
where $\Theta = \Sigma_X^{-1} = \text{Cov}(X)^{-1}$. This is exactly the same asymptotic variance as OLS in low dimensions — a remarkable result showing that, under sparsity, high-dimensional inference can be as efficient as if we knew which variables mattered.
2.6 Choice of Tuning Parameters
The de-sparsified Lasso involves two types of tuning parameters: $\lambda$ for the main Lasso $\hat\beta = \hat\beta(\hat\lambda_{CV})$ (chosen by cross-validation as usual), and $\lambda_j$ for each nodewise regression computing $Z^{(j)}$.
There is a bias–variance trade-off in the choice of $\lambda_j$:
- If $\lambda_j$ is smaller, the residuals $Z^{(j)}$ have smaller norm $\|Z^{(j)}\|_2^2$, leading to higher variance but lower bias in the correction term.
- If $\lambda_j$ is larger, $\|Z^{(j)}\|_2^2$ is larger, giving lower variance but higher bias.
The practical recommendation is to inflate the variance a bit (use a slightly smaller $\lambda_j$) to ensure the bias correction works well. This controls the Type I error at the cost of slightly decreased power.
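A quick empirical look at this trade-off (Python sketch with an arbitrary correlated design; `nodewise_residual_norm` is a hypothetical helper, not part of hdi). Shrinking $\lambda_j$ lets the nodewise Lasso explain more of $X^{(j)}$, so $\|Z^{(j)}\|_2^2$ drops, which inflates the variance factor in the CI above while reducing the bias:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
X[:, 0] = X[:, 1] + 0.5 * rng.standard_normal(n)   # make column 0 correlated with column 1

def nodewise_residual_norm(X, j, lam):
    """||Z^(j)||_2^2 for the nodewise Lasso of X^(j) on X^(-j) with penalty lam."""
    minus_j = np.delete(np.arange(X.shape[1]), j)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X[:, minus_j], X[:, j])
    Z = X[:, j] - X[:, minus_j] @ fit.coef_
    return np.sum(Z ** 2)

small_lam = nodewise_residual_norm(X, 0, 0.01)  # better nodewise fit: smaller ||Z||^2, higher variance
large_lam = nodewise_residual_norm(X, 0, 1.0)   # worse nodewise fit: larger ||Z||^2, higher bias
```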
2.7 Software and Performance (R Package hdi)
The de-sparsified Lasso is implemented in the R package hdi (by Meier et al.). Simulation studies show that the 95% confidence intervals achieve close to nominal coverage in a wide range of settings — typically between 86% and 96% coverage across individual coefficients. Robust or bootstrap variants can improve the coverage slightly.
3. Sample Splitting for P-values
3.1 The Idea
An alternative, conceptually simpler approach to inference in high dimensions is sample splitting, proposed by Wasserman and Roeder (2009). The idea is beautifully simple:
1. Split the data randomly into two halves $I_1$ and $I_2$ (each of size $\approx n/2$).
2. Screen: use the first half $I_1$ to select important variables with the Lasso (or any other method), obtaining a set $\hat S(I_1)$.
3. Test: use the second half $I_2$ to perform classical OLS-based t-tests on the selected variables only.
Because the two halves are independent, the p-values from step 3 are valid (conditional on step 2 correctly including all truly active variables). We adjust for multiplicity using a Bonferroni correction:

$$\tilde P_{\text{corr},j} = \min\!\big(\tilde P_j \cdot |\hat S(I_1)|,\; 1\big),$$

where $\tilde P_j$ is the raw two-sided t-test p-value for variable $j$ based on the second half $I_2$, and variables not in $\hat S(I_1)$ get $\tilde P_{\text{corr},j} = 1$.
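The three steps can be sketched as follows (Python stand-in for illustration only; `split_pvalues`, its fixed penalty `lam`, and the simulated data are my own choices, not the hdi interface):

```python
import numpy as np
import scipy.stats as st
from sklearn.linear_model import Lasso

def split_pvalues(X, y, lam=0.1, seed=0):
    """Single-split p-values: screen with the Lasso on I1, t-test on I2, Bonferroni-correct."""
    n, p = X.shape
    perm = np.random.default_rng(seed).permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2:]
    # step 2: screening on the first half
    S = np.flatnonzero(Lasso(alpha=lam, fit_intercept=False).fit(X[I1], y[I1]).coef_)
    P = np.ones(p)                                   # unselected variables get p-value 1
    if 0 < S.size < I2.size:
        # step 3: classical OLS t-tests on the second half, selected variables only
        Xs, ys = X[I2][:, S], y[I2]
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        df = I2.size - S.size
        sigma2 = np.sum((ys - Xs @ beta) ** 2) / df
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
        p_raw = 2 * st.t.sf(np.abs(beta / se), df)
        P[S] = np.minimum(p_raw * S.size, 1.0)       # Bonferroni by |S_hat(I1)|
    return P

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
beta_true = np.zeros(50)
beta_true[0] = 3.0
y = X @ beta_true + rng.standard_normal(200)
P = split_pvalues(X, y)   # strong signal: P[0] should be tiny
```

Rerunning with a different `seed` illustrates the "p-value lottery" discussed next: the selected set, and hence the p-values, change with the split.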
Single sample splitting has a serious drawback: the results depend heavily on the random split. Different splits can yield dramatically different p-values for the same variable. This makes results non-reproducible — a phenomenon called the "p-value lottery".
3.2 Multi Sample Splitting
The solution, proposed by Meinshausen, Meier, and Bühlmann (2009), is to repeat the splitting many times and aggregate the resulting p-values.
For $b = 1, \dots, B$ (e.g., $B = 50$ or $100$):
- Randomly split the data into two disjoint groups $I_1^{[b]}$ and $I_2^{[b]}$ of approximately equal size.
- Using only $I_1^{[b]}$, run the Lasso to estimate the active set $\hat S^{[b]}$.
- Using only $I_2^{[b]}$, compute adjusted p-values $\tilde P_{\text{corr},j}^{[b]} = \min(\tilde P_j^{[b]} \cdot |\hat S^{[b]}|, \; 1)$ for each $j = 1, \dots, p$.
Then aggregate over the $B$ p-values using the quantile method described below.
3.3 Aggregating P-values
For each variable $j$, we now have $B$ p-values $\tilde P_{\text{corr},j}^{[1]}, \dots, \tilde P_{\text{corr},j}^{[B]}$. We aggregate them using quantiles. For a fixed $\gamma \in (0,1)$, define

$$Q_j(\gamma) = \min\!\left\{ q_\gamma\!\left( \left\{ \tilde P_{\text{corr},j}^{[b]} / \gamma :\; b = 1, \dots, B \right\} \right),\; 1 \right\},$$

where $q_\gamma(\cdot)$ denotes the empirical $\gamma$-quantile. In practice, we use an adaptive version that searches over $\gamma$:

$$P_j = \min\!\left\{ \big(1 - \log \gamma_{\min}\big) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma),\; 1 \right\},$$
with $\gamma_{\min} = 0.05$ as a default. The resulting $P_j$ is a valid p-value for $H_{0,j}$.
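The quantile aggregation is short to write down in Python (a sketch: `Pmat` is assumed to hold the $B \times p$ per-split corrected p-values from any multi-split run, and a grid over $\gamma$ approximates the infimum):

```python
import numpy as np

def aggregate_pvalues(Pmat, gamma_min=0.05):
    """Quantile aggregation: P_j = min{(1 - log(gamma_min)) * inf_gamma Q_j(gamma), 1}."""
    gammas = np.linspace(gamma_min, 1.0, 100)        # grid approximating the infimum
    # Q_j(gamma) = min{ empirical gamma-quantile of Pmat[:, j] / gamma, 1 }
    Q = np.stack([np.minimum(np.quantile(Pmat, g, axis=0) / g, 1.0) for g in gammas])
    return np.minimum((1.0 - np.log(gamma_min)) * Q.min(axis=0), 1.0)
```

Note the penalty factor $(1 - \log \gamma_{\min})$: it is the price paid for searching over $\gamma$ rather than fixing it in advance.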
3.4 Error Control
Familywise Error Rate (FWER)
For FWER control at level $\alpha$, simply select all variables with $P_j \leq \alpha$:

$$\hat S_{\text{FWER}} = \{ j : P_j \leq \alpha \}.$$
Under standard conditions (screening property and Gaussian errors), this controls the FWER asymptotically: $\limsup_{n \to \infty} \mathbb{P}[V > 0] \leq \alpha$, where $V$ is the number of false positives.
False Discovery Rate (FDR)
For FDR control, since the multi-split method already produces multiplicity-corrected p-values, the procedure simplifies: order $P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(p)}$ and select:

$$\hat S_{\text{FDR}} = \{ (1), \dots, (h) \}, \qquad h = \max\{ i : P_{(i)} \leq i\,\alpha \},$$

a step-up rule applied directly to the corrected p-values.
This achieves $\limsup_{n\to\infty} \mathbb{E}[V/\max(1,R)] \leq \alpha$.
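Both selection rules are easy to express given the aggregated p-values (Python sketch; the FWER rule keeps all $P_j \leq \alpha$, while the step-up threshold $P_{(i)} \leq i\,\alpha$ used for FDR here is one natural choice for already multiplicity-corrected p-values and is an assumption of this sketch):

```python
import numpy as np

def select_fwer(P, alpha=0.05):
    """FWER control: all j with aggregated p-value P_j <= alpha."""
    return np.flatnonzero(P <= alpha)

def select_fdr(P, alpha=0.05):
    """Step-up rule on ordered, already multiplicity-corrected p-values."""
    order = np.argsort(P)
    passed = np.flatnonzero(P[order] <= alpha * np.arange(1, P.size + 1))
    if passed.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[: passed[-1] + 1])

P_demo = np.array([0.01, 0.30, 0.80, 0.04])  # toy aggregated p-values
```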
3.5 Required Conditions
The multi sample splitting approach needs two main conditions:
- Screening property: $\mathbb{P}[\hat S^{\lfloor n/2 \rfloor} \supseteq S_0] \to 1$ — the Lasso (or whatever selection method) on half the data must include all truly active variables with high probability.
- Sparsity: $\mathbb{P}[|\hat S^{\lfloor n/2 \rfloor}| < n/2] \to 1$ — the selected set must be small enough to allow valid OLS-based tests on the second half.
Multi sample splitting is competitive even in low-dimensional settings ($p < n$). When correlations among variables are high, it can actually have better detection power than the standard FDR procedure of Benjamini–Hochberg. This is because, by trimming variables in step 2, it reduces the variance inflation that correlated covariates cause in OLS.
4. Comparison of the Two Approaches
| Aspect | De-sparsified Lasso | Multi Sample Splitting |
|---|---|---|
| Core idea | Correct Lasso bias via nodewise regression | Split data; screen then test on independent half |
| Output | Confidence intervals + p-values | P-values (FWER or FDR) |
| Key conditions | Sparsity of $\beta^0$ and of precision matrix rows | Screening property of selection method |
| Sparsity requirement | $s_0 = o(\sqrt{n}/\log p)$ | Weaker (just need screening to work) |
| Design conditions | Compatibility + sparse nodewise regressions | Much weaker (no irrepresentable condition) |
| Reproducibility | Deterministic (given data) | Approximately reproducible (for large $B$) |
| Achieves Cramér–Rao? | Yes (under conditions) | Not claimed |
| Implementation | R: hdi package | R: hdi package |
5. Recap: Group Lasso
The lecture recap also briefly revisits the Group Lasso and introduces the Sparse Group Lasso, extending ideas from previous lectures.
Given groups $G_1, \dots, G_q$ forming a partition of $\{1, \dots, p\}$, the parameter vector is written as $\beta = (\beta_{G_1}, \beta_{G_2}, \dots, \beta_{G_q})^T$. The goal is group-sparse estimation: for each group $j$, either the entire block $\hat\beta_{G_j} \equiv 0$, or all components $(\hat\beta_{G_j})_r \neq 0$ for every $r \in G_j$.
The Sparse Group Lasso uses the penalty

$$\lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \sum_{j=1}^{q} \sqrt{|G_j|}\; \|\beta_{G_j}\|_2 \right), \qquad \alpha \in [0, 1].$$

This is a convex combination of the Group Lasso penalty and the standard Lasso ($\ell_1$) penalty. For $\alpha > 0$, it can induce sparsity within groups as well — some individual coefficients within an active group can be set to zero. This is useful when we expect that not all variables in a relevant group are important.
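The penalty itself is straightforward to evaluate; a minimal Python sketch (the function name and the group encoding are my own, and the $\sqrt{|G_j|}$ group weights follow the common convention):

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """lam * ( alpha*||beta||_1 + (1-alpha) * sum_j sqrt(|G_j|) * ||beta_{G_j}||_2 )."""
    l1_part = np.sum(np.abs(beta))
    group_part = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return lam * (alpha * l1_part + (1.0 - alpha) * group_part)

beta = np.array([3.0, 4.0, 0.0])
groups = [np.array([0, 1]), np.array([2])]
# alpha = 1 recovers the Lasso penalty; alpha = 0 recovers the Group Lasso penalty
```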
6. Key Takeaways
- The Lasso alone cannot do inference — it is biased and does not provide valid p-values or confidence intervals. We need dedicated methods for uncertainty quantification.
- The de-sparsified (de-biased) Lasso corrects the Lasso's bias by projecting $Y$ onto Lasso-computed residuals $Z^{(j)}$, then subtracting an estimated bias term. Under sparsity conditions, it achieves asymptotic normality with the optimal (Cramér–Rao) variance $\sigma^2\Theta_{jj}$.
- Multi sample splitting provides a conceptually simple, widely applicable approach: screen on one half, test on the other, repeat and aggregate. It controls FWER and FDR under weaker conditions than the de-sparsified Lasso.
- Both methods are available in the R package hdi and provide practical tools for high-dimensional inference.
- The Sparse Group Lasso extends the Group Lasso by allowing sparsity within groups, combining $\ell_1$ and group-$\ell_2$ penalties.