This lecture extends the Lasso framework in two major directions. First, we introduce the Group Lasso, which handles predictors that naturally come in groups (e.g., factor variables, interaction terms, basis expansions). Instead of selecting individual coefficients, the Group Lasso selects or removes entire groups at once. Second, we move beyond linear models into high-dimensional additive models, where each covariate contributes a smooth nonlinear function. The key idea is a combined sparsity-smoothness penalty that keeps the model flexible yet tractable.
0. Recap: KKT Conditions for the Lasso
Before diving into new material, we recall the Karush-Kuhn-Tucker (KKT) conditions that characterize the Lasso solution. These conditions are both necessary and sufficient because the Lasso objective is convex.
For the Lasso estimator $\hat{\beta}(\lambda) = \arg\min_\beta \left( \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1 \right)$, define the gradient of the squared-error loss, $G(\beta) = -2X^T(Y - X\beta)/n$. Then the KKT conditions state:

$$G_j(\hat{\beta}) = -\mathrm{sign}(\hat{\beta}_j)\,\lambda \;\; \text{if } \hat{\beta}_j \neq 0, \qquad |G_j(\hat{\beta})| \leq \lambda \;\; \text{if } \hat{\beta}_j = 0.$$
Intuition: Think of $G_j(\hat{\beta})$ as the "gradient pressure" pushing coefficient $j$ away from zero. The Lasso solution says: if the data's gradient signal for variable $j$ is weaker than the penalty $\lambda$, that coefficient is killed to zero. If it's strong enough, the coefficient is non-zero but shrunk so that the gradient exactly balances the penalty. Sparsity is induced at the points of non-differentiability of the $\ell_1$-penalty — exactly where $\beta_j = 0$.
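These conditions are easy to verify numerically. A minimal sketch (numpy; the orthonormal design and simulated data are illustrative assumptions, chosen so the Lasso solution has the closed soft-thresholding form $\hat{\beta}_j = \mathrm{sign}(z_j)(|z_j| - \lambda/2)_+$ with $z = X^TY/n$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.5

# Orthonormal design: X^T X / n = I (QR factor scaled by sqrt(n))
X = np.linalg.qr(rng.standard_normal((n, p)))[0] * np.sqrt(n)
beta0 = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
Y = X @ beta0 + 0.1 * rng.standard_normal(n)

# Closed-form Lasso solution for orthonormal design:
# soft-threshold the groupwise OLS estimate z = X^T Y / n at lambda/2.
z = X.T @ Y / n
beta_hat = np.sign(z) * np.maximum(np.abs(z) - lam / 2, 0.0)

# KKT check with G(beta) = -2 X^T (Y - X beta) / n
G = -2 * X.T @ (Y - X @ beta_hat) / n
active = beta_hat != 0
assert np.allclose(G[active], -lam * np.sign(beta_hat[active]))  # gradient balances penalty
assert np.all(np.abs(G[~active]) <= lam + 1e-12)                 # weak signal -> coefficient killed
print("KKT conditions hold")
```

The two assertions are exactly the two KKT cases: on the active set the gradient equals $-\lambda\,\mathrm{sign}(\hat{\beta}_j)$, and on the zero set its magnitude stays below $\lambda$.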
1. The Group Lasso
1.1 Motivation: Why Groups?
In many applications, the parameter vector $\beta$ has a natural group structure. The indices $\{1, \ldots, p\}$ are partitioned into $q$ non-overlapping groups $G_1, \ldots, G_q$ so that:

$$\{1, \ldots, p\} = \bigcup_{j=1}^q G_j, \qquad G_j \cap G_k = \emptyset \;\; (j \neq k), \qquad \beta = (\beta_{G_1}, \ldots, \beta_{G_q}).$$
The most common situations where groups arise naturally are:
- Categorical (factor) variables: A factor with $d$ levels requires $d - 1$ dummy parameters (e.g., using sum contrasts). All $d-1$ parameters form one group. It makes no sense to select only some of these dummy variables — either the whole factor matters, or it doesn't.
- Interaction terms: A first-order interaction between two factors with 4 levels each requires $3 \times 3 = 9$ parameters, forming another group.
- Basis expansions: In additive models, each smooth function $f_j(\cdot)$ is represented by $K$ basis coefficients — these form a group.
Consider $p = 2$ categorical variables, each with 4 levels $\{0, 1, 2, 3\}$. Using sum contrasts, each main effect needs 3 parameters. With main effects only, $\beta_{G_1} = (\beta_1, \beta_2, \beta_3)$ for the first factor and $\beta_{G_2} = (\beta_4, \beta_5, \beta_6)$ for the second. If we also include a first-order interaction, that requires $3 \times 3 = 9$ additional parameters in a third group $G_3$.
A standard Lasso might zero out individual dummy coefficients (e.g., set $\beta_2 = 0$ but keep $\beta_1, \beta_3 \neq 0$), which breaks the coherent interpretation of the factor. The Group Lasso avoids this by forcing $\beta_{G_j} \equiv 0$ (entire group out) or $\beta_{G_j} \not\equiv 0$ (entire group in).
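To make the grouped design concrete, here is a small sketch (numpy; the helper name `sum_contrasts` and the simulated factor levels are illustrative, not from the lecture) that builds the $3 + 3 + 9 = 15$ columns of the two-factor example above:

```python
import numpy as np

def sum_contrasts(levels, d):
    """Map a factor with d levels to d-1 sum-contrast columns:
    level k < d-1 -> e_k, last level -> (-1, ..., -1)."""
    C = np.vstack([np.eye(d - 1), -np.ones(d - 1)])  # d x (d-1) contrast matrix
    return C[levels]                                  # one row per observation

rng = np.random.default_rng(1)
n, d = 50, 4
f1 = rng.integers(0, d, size=n)   # first factor, levels 0..3
f2 = rng.integers(0, d, size=n)   # second factor, levels 0..3

H1 = sum_contrasts(f1, d)         # main effect 1: group of 3 columns
H2 = sum_contrasts(f2, d)         # main effect 2: group of 3 columns
# Interaction: all pairwise products of main-effect columns -> 3 x 3 = 9 columns
H12 = np.einsum("ij,ik->ijk", H1, H2).reshape(n, -1)

X = np.hstack([H1, H2, H12])
groups = [range(0, 3), range(3, 6), range(6, 15)]
print(X.shape)  # (50, 15)
```

The Group Lasso would then penalize `groups[0]`, `groups[1]`, and `groups[2]` as units, so each factor and the interaction enters or leaves the model as a whole.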
1.2 The Group Lasso Penalty
The Group Lasso estimator for a linear model is:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^q m_j \|\beta_{G_j}\|_2 \right),$$
where $\|\beta_{G_j}\|_2 = \sqrt{\sum_{r \in G_j} \beta_r^2}$ is the Euclidean ($\ell_2$) norm of the sub-vector for group $j$, and $m_j$ are weights (often $m_j = \sqrt{|G_j|}$ to account for different group sizes).
Key insight — the geometry of group sparsity: The standard Lasso uses the $\ell_1$-norm $\|\beta\|_1 = \sum_j |\beta_j|$, which has "corners" at the axes, producing individual sparsity. The Group Lasso uses a mixed $\ell_1/\ell_2$-norm: it applies an $\ell_2$-norm within each group and sums these with an $\ell_1$-style structure across groups. The $\ell_2$-norm within a group is smooth everywhere except at the origin $\beta_{G_j} = 0$. This single non-differentiable point is what drives entire groups to zero — just like the kink at zero in the absolute value drives individual Lasso coefficients to zero.
Think of it this way: within each group, the penalty is "round" (no corners, so no individual sparsity within the group), but across groups the penalty has $\ell_1$-type corners (driving whole groups to zero).
If $|G_j| = 1$ for all $j$, then $\|\beta_{G_j}\|_2 = |\beta_j|$ and the Group Lasso reduces exactly to the standard Lasso. So the Lasso is a special case of the Group Lasso.
1.3 The Generalized Group Lasso Penalty
In practice, we often want a more flexible penalty that accounts for the internal structure of each group. The generalized Group Lasso penalty replaces the standard Euclidean norm with a weighted quadratic form:

$$\lambda \sum_{j=1}^q m_j \sqrt{\beta_{G_j}^T A_j\, \beta_{G_j}} = \lambda \sum_{j=1}^q m_j \|A_j^{1/2}\beta_{G_j}\|_2,$$
where each $A_j$ is a positive definite matrix. This is computationally convenient because we can transform to the standard Group Lasso via $\tilde{\beta}_{G_j} = A_j^{1/2} \beta_{G_j}$, which turns the penalty into $\lambda \sum_j m_j \|\tilde{\beta}_{G_j}\|_2$.
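The transformation is a plain linear change of variables, which a few lines of numpy can confirm (the matrix `A` below is a random positive definite stand-in for some $A_j$; the symmetric square root is computed by eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)       # positive definite stand-in for A_j

# Symmetric square root A^{1/2} via eigendecomposition
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T

beta = rng.standard_normal(d)
beta_tilde = A_half @ beta        # transformed group coefficients

# Generalized penalty equals the standard l2 norm after the transform
assert np.isclose(np.sqrt(beta @ A @ beta), np.linalg.norm(beta_tilde))
print("penalty equality verified")
```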
Groupwise Prediction Penalty
A particularly important special case sets $A_j = X_{G_j}^T X_{G_j}$:

$$\lambda \sum_{j=1}^q m_j \sqrt{\beta_{G_j}^T X_{G_j}^T X_{G_j}\, \beta_{G_j}} = \lambda \sum_{j=1}^q m_j \|X_{G_j}\beta_{G_j}\|_2.$$
This penalty is invariant under arbitrary reparametrizations within every group. This is very important for categorical variables: no matter which contrast coding you use (sum contrasts, treatment contrasts, Helmert, etc.), the penalty gives the same result. When using an orthogonal parameterization with $X_{G_j}^T X_{G_j} = I$, this simplifies back to the standard Group Lasso.
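The invariance is easy to check numerically: reparametrizing a group by any invertible matrix $C$ (so $X_{G_j} \mapsto X_{G_j}C$ and $\beta_{G_j} \mapsto C^{-1}\beta_{G_j}$) leaves the fitted values, and hence $\|X_{G_j}\beta_{G_j}\|_2$, unchanged. A sketch with numpy (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 3
Xg = rng.standard_normal((n, d))      # design columns of one group
beta_g = rng.standard_normal(d)

C = rng.standard_normal((d, d)) + 3 * np.eye(d)   # invertible reparametrization
Xg_new = Xg @ C                       # e.g., a different contrast coding
beta_new = np.linalg.solve(C, beta_g)

# Fitted values are identical, so the penalty ||X_g beta_g||_2 is too
assert np.allclose(Xg @ beta_g, Xg_new @ beta_new)
assert np.isclose(np.linalg.norm(Xg @ beta_g), np.linalg.norm(Xg_new @ beta_new))
print("groupwise prediction penalty is reparametrization invariant")
```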
1.4 Algorithm: Block Coordinate Descent
The Group Lasso is solved by a block coordinate descent algorithm — a generalization of the coordinatewise descent used for the standard Lasso. Instead of optimizing over one coefficient at a time, we optimize over one group at a time while keeping all other groups fixed.
Initialize: $\beta^{[0]} \in \mathbb{R}^p$, set $m = 0$.
Repeat until convergence:
- Increment $m$, cycle through groups $j = 1, \ldots, q$.
- Check: Write $\beta^{[m-1]}_{-G_j}$ for the current parameter vector with the entries of group $j$ set to zero. If $\|(-\nabla\rho(\beta^{[m-1]}_{-G_j}))_{G_j}\|_2 \leq \lambda m_j$, set $\beta^{[m]}_{G_j} = 0$ (kill the group).
- Otherwise: set $\beta^{[m]}_{G_j} = \arg\min_{\beta_{G_j}} Q_\lambda(\beta^{[m-1]}_{+G_j})$, i.e., minimize the penalized objective over group $j$ while all other groups are held fixed.
For squared error loss with orthonormalized design within groups ($n^{-1}X_{G_j}^T X_{G_j} = I$), the block update has a beautiful closed form — a group-level soft-thresholding:

$$\beta^{[m]}_{G_j} = \frac{1}{2}\left(1 - \frac{\lambda m_j}{\|U_{G_j}\|_2}\right)_{+} U_{G_j}, \qquad U_{G_j} = \frac{2}{n}\, X_{G_j}^T\Big(Y - \sum_{k \neq j} X_{G_k}\beta^{[m-1]}_{G_k}\Big).$$

Here $(x)_+ = \max(x, 0)$, and $U_{G_j}$ is the negative gradient of the loss with respect to $\beta_{G_j}$, evaluated with group $j$ set to zero. This is the multi-dimensional analog of the Lasso's scalar soft-thresholding: the entire group vector $U_{G_j}$ is shrunk uniformly toward zero, and set exactly to zero when the gradient norm $\|U_{G_j}\|_2$ is below the threshold $\lambda m_j$.
For sparse problems (many groups, few active), an active set strategy dramatically speeds up computation. The algorithm focuses on currently active groups and only revisits inactive groups occasionally (e.g., every 10th iteration). This easily scales to $p \approx 10^4$ – $10^6$.
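The whole procedure fits in a few dozen lines. A minimal numpy sketch under the orthonormalized-group assumption ($n^{-1}X_{G_j}^T X_{G_j} = I$), using the group-level soft-thresholding update on $U_{G_j}$, the negative gradient with group $j$ zeroed out; the three-group simulated data set is an illustrative assumption, not from the lecture:

```python
import numpy as np

def group_lasso_bcd(X, Y, groups, lam, m=None, n_iter=200):
    """Block coordinate descent for the Group Lasso, assuming each
    group block is orthonormalized: X_Gj^T X_Gj / n = I."""
    n, p = X.shape
    if m is None:
        m = [np.sqrt(len(g)) for g in groups]   # default weights m_j = sqrt(|G_j|)
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j, g in enumerate(groups):
            g = np.asarray(g)
            # partial residual with group j removed
            R = Y - X @ beta + X[:, g] @ beta[g]
            U = 2 * X[:, g].T @ R / n           # negative gradient at beta_Gj = 0
            norm_U = np.linalg.norm(U)
            if norm_U <= lam * m[j]:
                beta[g] = 0.0                   # kill the whole group
            else:
                beta[g] = 0.5 * (1 - lam * m[j] / norm_U) * U  # shrink uniformly
    return beta

rng = np.random.default_rng(4)
n = 200
groups = [range(0, 3), range(3, 6), range(6, 9)]
# Orthonormalize each group block so the closed-form update is exact
blocks = [np.linalg.qr(rng.standard_normal((n, 3)))[0] * np.sqrt(n) for _ in groups]
X = np.hstack(blocks)
beta0 = np.r_[1.0, -2.0, 1.5, np.zeros(6)]      # only the first group is active
Y = X @ beta0 + 0.5 * rng.standard_normal(n)

beta_hat = group_lasso_bcd(X, Y, groups, lam=0.3)
print([np.linalg.norm(beta_hat[np.asarray(g)]) for g in groups])
```

With singleton groups this reduces to ordinary coordinate descent for the Lasso; an active set strategy would simply skip the inner update for groups that stayed at zero except on occasional full sweeps.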
1.5 Theoretical Guarantees (Sketch)
The theoretical analysis of the Group Lasso follows a similar spirit to the standard Lasso, but with more involved arguments. The key results are:
- Prediction consistency: Under appropriate conditions (compatibility condition for groups, mild distributional assumptions), the Group Lasso achieves $(\hat{\beta}(\lambda) - \beta^0)^T \Sigma_X (\hat{\beta}(\lambda) - \beta^0) = o_P(1)$ even when $p \gg n$.
- Variable screening at group level: With suitable $\lambda$, the estimated active group set contains all truly active groups with high probability: $\hat{S}_{\text{group}}(\lambda) \supseteq S^0_{\text{group}}$.
- Prediction gain from grouping: When the group structure is real (group sizes $T_j > 1$) and the groups are large enough relative to their number ($T_j > \log(q)$), the Group Lasso achieves better rates than a standard Lasso that ignores the group structure.
1.6 Application: DNA Splice Site Prediction
A compelling application uses the Group Lasso with logistic regression for predicting splice sites in DNA. The model uses 7 sequence positions, each with 4 nucleotide levels, and considers main effects plus 2-way and 3-way interactions. The Group Lasso naturally selects which main effects and interaction terms matter, producing an interpretable model. The results show that mainly main effects are selected, plus a few interactions — a finding that has been debated in computational biology.
2. High-Dimensional Additive Models
2.1 From Linear to Additive: Why Go Nonlinear?
A linear model $Y_i = \sum_{j=1}^p \beta_j X_i^{(j)} + \varepsilon_i$ assumes each covariate has a strictly linear effect. In reality, effects may be nonlinear: a drug's effect might plateau, or a temperature effect might be U-shaped. Additive models are the most natural first step beyond linearity:

$$Y_i = \mu + \sum_{j=1}^p f_j(X_i^{(j)}) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $\mu$ is an intercept, each $f_j$ is a smooth univariate function, $\varepsilon_i$ are i.i.d. with $E[\varepsilon_i] = 0$, and we assume centering: $\sum_{i=1}^n f_j(X_i^{(j)}) = 0$ for identification.
The model is "additive" because the total effect is a sum of individual contributions — one function per covariate, with no interactions. Each $f_j$ is allowed to be an arbitrary smooth function, not just a linear one. This gives us a lot of flexibility while keeping the model interpretable: we can plot each $f_j$ to understand its individual contribution.
At first glance, fitting additive models with $p \gg n$ seems hopeless — we have infinitely many parameters (entire functions!) for each of $p$ covariates. The trick is to combine sparsity (most $f_j$ should be zero) with smoothness (the non-zero $f_j$ should be smooth).
2.2 Basis Expansion: Making the Problem Finite
Each function $f_j$ is expanded in a set of $K$ basis functions (e.g., B-splines):

$$f_j(x) = \sum_{k=1}^K \beta_{j,k}\, h_{j,k}(x), \qquad j = 1, \ldots, p.$$
This converts the infinite-dimensional problem into a finite one with $pK$ parameters. We construct an $n \times pK$ design matrix $H = [H_1 | H_2 | \ldots | H_p]$ where $(H_j)_{i,k} = h_{j,k}(X_i^{(j)})$. A typical choice is $K \approx \sqrt{n}$ basis functions per covariate.
Notice that the parameters for each covariate $j$ naturally form a group $\beta_j = (\beta_{j,1}, \ldots, \beta_{j,K})^T$. This is the bridge connecting additive models to the Group Lasso!
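As a concrete sketch of this construction (numpy; a truncated-power cubic basis stands in for the B-splines one would normally use, and the knot placement is an illustrative choice):

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated-power cubic basis (intercept dropped):
    K = 3 + len(knots) columns: x, x^2, x^3, (x - t)_+^3."""
    cols = [x, x**2, x**3] + [np.maximum(x - t, 0.0) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
n, p = 150, 10
X = rng.uniform(0, 1, size=(n, p))

K = int(np.sqrt(n))                       # ~ sqrt(n) basis functions per covariate
knots = np.linspace(0.1, 0.9, K - 3)
H = np.hstack([spline_basis(X[:, j], knots) for j in range(p)])   # n x (p*K)
groups = [range(j * K, (j + 1) * K) for j in range(p)]            # one group per covariate
print(H.shape)   # (150, 120)
```

Each block of `K` columns in `H` is the group for one covariate — exactly the structure the Group Lasso penalizes as a unit.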
2.3 The Sparsity-Smoothness Penalty (SSP)
To estimate $f_1, \ldots, f_p$ well in high dimensions, we need a penalty that does two things simultaneously: (1) drives entire functions $f_j$ to zero (sparsity at the function level), and (2) keeps the surviving functions smooth (avoiding overfitting within each function). This leads to the sparsity-smoothness penalty:

$$\mathrm{pen}(f_1, \ldots, f_p) = \lambda_1 \sum_{j=1}^p \|f_j\|_n + \lambda_2 \sum_{j=1}^p I(f_j),$$
where $\|f_j\|_n = \sqrt{n^{-1}\sum_{i=1}^n f_j(X_i^{(j)})^2}$ is the empirical $\ell_2$-norm of $f_j$, and $I(f_j) = \sqrt{\int |f_j''(x)|^2 dx}$ is the Sobolev semi-norm measuring the roughness of $f_j$.
Intuition: The two tuning parameters $\lambda_1$ and $\lambda_2$ control different aspects. The sparsity part ($\lambda_1$) acts like the Group Lasso — it penalizes the "size" of each function, pushing entire functions to zero. The smoothness part ($\lambda_2$) acts like a roughness penalty — it penalizes the curvature of each function, preventing wild oscillations. You need both: sparsity alone doesn't prevent overfitting within active functions, and smoothness alone doesn't reduce the number of active functions.
2.4 Natural Cubic Splines: An Elegant Reduction
A remarkable result shows that we don't have to search over all possible smooth functions. When we optimize over the Sobolev space (continuously differentiable functions with square-integrable second derivatives), the solution is guaranteed to be a natural cubic spline:
Let $\mathcal{F}$ be the Sobolev space of continuously differentiable functions on $[a,b]$ with square-integrable second derivatives. Then the minimizers $\hat{f}_j$ of the SSP problem are natural cubic splines with knots at the data points $X_i^{(j)}$, $i = 1, \ldots, n$.
Why this matters: This result reduces an infinite-dimensional optimization (searching over all smooth functions) to a finite-dimensional one. Natural cubic splines with $n$ knots have $n$ degrees of freedom, so the total parameter dimension is $\approx np$. With B-spline basis functions and a manageable number of knots ($K \approx \sqrt{n}$), the effective dimension becomes $pK \approx p\sqrt{n}$, which is large but tractable.
In parametric form, we can write the empirical norm and smoothness penalty as:

$$\|f_j\|_n^2 = \beta_j^T\, \frac{H_j^T H_j}{n}\, \beta_j, \qquad I^2(f_j) = \beta_j^T W_j\, \beta_j,$$
where $W_j$ is the matrix of inner products of second derivatives of the basis functions: $(W_j)_{k,\ell} = \int h_{j,k}''(x) h_{j,\ell}''(x)\,dx$.
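$W_j$ can be computed by quadrature over the basis functions' second derivatives. A small sketch (numpy; the monomial basis $x, x^2, x^3$ is an illustrative stand-in) that checks $\beta_j^T W_j \beta_j = I^2(f_j)$ for $f(x) = x^2$ on $[0, 1]$, where $I^2(f) = \int_0^1 (2)^2\,dx = 4$:

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal quadrature of samples y on grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Basis h_1(x)=x, h_2(x)=x^2, h_3(x)=x^3 with second derivatives 0, 2, 6x
grid = np.linspace(0.0, 1.0, 10001)
d2 = np.vstack([np.zeros_like(grid), 2 * np.ones_like(grid), 6 * grid])

# (W)_{k,l} = integral of h_k''(x) h_l''(x) dx
W = np.array([[trapz(d2[k] * d2[l], grid) for l in range(3)] for k in range(3)])

# f(x) = x^2  =>  beta = (0, 1, 0),  I(f)^2 = 4
beta = np.array([0.0, 1.0, 0.0])
roughness = beta @ W @ beta
assert np.isclose(roughness, 4.0)
print(W)
```

For B-splines the second derivatives are piecewise linear, so the entries of $W_j$ can even be computed exactly; the quadrature above is the generic fallback.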
2.5 SSP of Group Lasso Type
The original SSP penalty $\lambda_1 \sum_j \|f_j\|_n + \lambda_2 \sum_j I(f_j)$ has better theoretical properties, but it is harder to compute because the two norms appear as separate terms. For easier computation, an alternative SSP of Group Lasso type combines both norms under a single square root:

$$\lambda_1 \sum_{j=1}^p \sqrt{\|f_j\|_n^2 + \lambda_2^2\, I^2(f_j)}.$$

In parametric form:

$$\lambda_1 \sum_{j=1}^p \sqrt{\beta_j^T \left( \frac{H_j^T H_j}{n} + \lambda_2^2 W_j \right) \beta_j}.$$
The key observation: For every fixed $\lambda_2$, this is exactly a generalized Group Lasso penalty with positive definite matrices $A_j = H_j^T H_j/n + \lambda_2^2 W_j$. This means we can use the same block coordinate descent algorithm from the Group Lasso section! We just need to transform to $\tilde{\beta}_{G_j} = A_j^{1/2}\beta_j$, solve a standard Group Lasso, and transform back via $\hat{\beta}_j = A_j^{-1/2}\hat{\tilde{\beta}}_j$.
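For a single group, this reduction can be sketched in a few lines (numpy; `Hj` and `Wj` below are random stand-ins, with `Wj` positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, lam2 = 100, 6, 0.1
Hj = rng.standard_normal((n, K))          # basis design block for covariate j
B = rng.standard_normal((K, K))
Wj = B @ B.T                              # roughness matrix (PSD stand-in)

Aj = Hj.T @ Hj / n + lam2**2 * Wj         # positive definite for lam2 > 0
w, V = np.linalg.eigh(Aj)
Aj_half = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root A_j^{1/2}
Aj_half_inv = V @ np.diag(1 / np.sqrt(w)) @ V.T

beta = rng.standard_normal(K)
# SSP-of-Group-Lasso-type penalty equals a standard group penalty after the transform
ssp = np.sqrt(beta @ Hj.T @ Hj @ beta / n + lam2**2 * beta @ Wj @ beta)
assert np.isclose(ssp, np.linalg.norm(Aj_half @ beta))
# transforming back recovers the original coefficients
assert np.allclose(Aj_half_inv @ (Aj_half @ beta), beta)
print("SSP reduces to a standard Group Lasso penalty")
```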
Simulations show that $\lambda_2$ (the smoothness parameter) is often not very sensitive — a few candidate values suffice. The sparsity parameter $\lambda_1$ is more critical and is typically selected via cross-validation. One can simply run the algorithm for a small grid of $\lambda_2$ values and cross-validate over the full $(\lambda_1, \lambda_2)$ grid.
2.6 Numerical Example: Recovering Nonlinear Functions
The lecture illustrates the method with a simulated example: $n = 150$ observations, $p = 200$ covariates, but only $s_0 = 4$ truly active functions — a sinusoidal $f_1$, a quadratic $f_2$, a linear $f_3$, and an exponential $f_4$; the remaining functions $f_5, \ldots, f_{200}$ are identically zero. This is a challenging high-dimensional setting ($p > n$), yet the additive model estimator with the SSP penalty successfully recovers all four functions, and the inactive functions $f_5, f_6, \ldots$ are correctly estimated as (near-)zero.
In the simulated example, the SSP penalty successfully recovers the curved shapes that a linear model would miss entirely. The linear Lasso would approximate $f_1(x) = -\sin(2x)$ with just a straight line, losing the oscillatory structure. The additive model captures the curvature because it uses flexible basis expansions while controlling smoothness.
On real data (e.g., motif regression for DNA binding sites), the additive model can yield an improvement in cross-validated prediction error of about 20% compared to a Lasso-estimated linear model.
2.7 Prediction and Variable Selection Properties
The theoretical analysis (Chapter 8 of the book) establishes oracle-type results. Under sparsity and a compatibility condition for the design, the squared prediction error $\|\hat{f} - f^0\|_n^2$ achieves the optimal rate $O(s_0 n^{-4/5})$ up to a $\sqrt{\log(p)}$ factor — the "price" of not knowing in advance which functions are active. The rate $n^{-4/5}$ (rather than $n^{-1}$ as in the linear case) is the known minimax-optimal rate for estimating twice-differentiable functions, reflecting the inherent difficulty of nonparametric estimation.
For variable screening, under a "beta-min" type condition (the non-zero functions must be large enough to be detectable), the method identifies all truly active functions with high probability: $\hat{S}(\lambda) \supseteq S_0 = \{j : f_j \neq 0\}$.
3. The Big Picture: How It All Connects
Here is a conceptual summary of the hierarchy of methods covered so far:
| Method | Penalty | Sparsity Level | When to Use |
|---|---|---|---|
| Lasso | $\lambda \sum_j \vert\beta_j\vert$ | Individual coefficients | Linear model, independent predictors |
| Group Lasso | $\lambda \sum_j m_j \Vert\beta_{G_j}\Vert_2$ | Entire groups | Factors, interactions, basis expansions |
| Additive Model + SSP | $\lambda_1 \sum_j \Vert f_j\Vert_n + \lambda_2 \sum_j I(f_j)$ | Entire functions + smoothness | Nonlinear effects, smooth covariates |
The unifying theme is that structure in the parameters should be reflected in the structure of the penalty. The Lasso treats each coefficient independently; the Group Lasso respects group structure; the additive model penalty further incorporates smoothness. Each level adds flexibility and better adaptation to the true data-generating process, at the cost of additional tuning parameters and complexity.
Key Takeaways
- The Group Lasso extends the Lasso to handle grouped parameters by using an $\ell_1/\ell_2$ mixed norm: $\ell_2$ within groups (no individual sparsity) and $\ell_1$-type across groups (driving entire groups to zero). It reduces to the standard Lasso when all group sizes are 1.
- The generalized Group Lasso and the groupwise prediction penalty ensure invariance under reparametrization within groups — critical for factor variables where the choice of contrast coding should not affect the result.
- High-dimensional additive models extend linear models to allow each covariate to have a nonlinear (smooth) effect. The sparsity-smoothness penalty simultaneously selects which covariates are active and controls the roughness of the estimated functions.
- A key theoretical result (Proposition 5.1) shows that the optimal functions are natural cubic splines, reducing the infinite-dimensional problem to a tractable parametric one.
- The SSP of Group Lasso type turns the additive model estimation into a standard Group Lasso problem (for each fixed $\lambda_2$), making computation efficient via block coordinate descent.