Lecture 5: GLMs, Group Lasso & Additive Models

High-Dimensional Statistics, Lecture 5. Based on Chapters 3, 4, and 5 of Bühlmann & van de Geer (2011).

This lecture extends the Lasso framework in three major directions. First, we move beyond squared error loss to generalized linear models (GLMs), covering logistic regression, Poisson regression, and contingency tables. Second, we introduce the Group Lasso, which handles covariates that naturally come in groups (e.g. factor variables with multiple dummy-coded columns). Third, we discuss additive models — a nonparametric extension of linear models — where sparsity and smoothness must be controlled simultaneously. Together, these tools make the Lasso framework applicable to a much wider range of real-world problems.

I. Quick Recap: The Lasso & Adaptive Lasso

Before diving into new material, recall the two workhorses from previous lectures.

The Lasso (Recap)

Given a linear model $Y_i = \sum_{j=1}^p \beta_j^0 X_i^{(j)} + \varepsilon_i$, the Lasso estimator is:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_1 \right)$$

The $\ell_1$-penalty $\|\beta\|_1 = \sum_j |\beta_j|$ encourages sparsity: many estimated coefficients are set exactly to zero. The tuning parameter $\lambda$ is typically chosen via cross-validation.

The Lasso is excellent for prediction, but it has a known bias problem: it shrinks all coefficients toward zero, including the truly large ones. This motivates the Adaptive Lasso, a two-stage procedure.

The Adaptive Lasso

First, compute an initial Lasso estimate $\hat{\beta}_{\mathrm{init}}$. Then solve:

$$\hat{\beta}_{\mathrm{adapt}}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / n + \lambda \sum_{j=1}^p \frac{|\beta_j|}{|\hat{\beta}_{\mathrm{init},j}|} \right)$$

The idea: variables with large initial estimates get small penalties (less shrinkage), while variables with small or zero initial estimates get large penalties (more shrinkage). This produces sparser and less biased estimates.
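As a concrete sketch (not code from the book), the two-stage procedure can be implemented with scikit-learn by rescaling columns: after the substitution $\gamma_j = \beta_j / |\hat{\beta}_{\mathrm{init},j}|$, the weighted $\ell_1$ penalty becomes a plain Lasso penalty. The data below are simulated.

```python
import numpy as np
from sklearn.linear_model import LassoCV, Lasso

rng = np.random.default_rng(0)
n, p, s0 = 100, 50, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 3.0                      # 5 truly active variables
y = X @ beta0 + rng.standard_normal(n)

# Stage 1: initial Lasso with lambda chosen by cross-validation.
init = LassoCV(cv=5).fit(X, y)
w = np.abs(init.coef_)

# Stage 2: adaptive Lasso = weighted Lasso. Rescaling column j by w_j
# turns the penalty sum_j |beta_j| / w_j into an ordinary L1 penalty.
active = w > 0                        # variables with w_j = 0 get an infinite penalty
Xw = X[:, active] * w[active]
adapt = Lasso(alpha=init.alpha_).fit(Xw, y)

beta_adapt = np.zeros(p)
beta_adapt[active] = adapt.coef_ * w[active]
print(np.flatnonzero(beta_adapt))     # adaptive support is a subset of the stage-1 support
```

By construction the adaptive estimate can only keep variables that survived stage 1, which is why it tends to be sparser.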

Example — Motif Regression

In a DNA motif regression dataset with $p = 195$ predictors and $n = 143$ samples, the Lasso (with $\lambda$ from CV) selects 26 variables. The Adaptive Lasso further refines this to 16 variables, producing a sparser model that often better identifies the truly important predictors while maintaining predictive accuracy.

II. Generalized Linear Models (GLMs) and the Lasso

So far the Lasso used squared error loss, appropriate when $Y$ is continuous. But what if $Y$ is binary (e.g. disease yes/no), a count (e.g. number of events), or categorical? Generalized Linear Models provide a unified framework for all these cases.

2.1 The GLM Framework

Definition — Generalized Linear Model

Given observations $Y_1, \ldots, Y_n$ (independent) and covariates $X_i \in \mathbb{R}^p$, a GLM specifies:

$$g\big(\mathbb{E}[Y_i | X_i = x]\big) = \underbrace{\mu + \sum_{j=1}^p \beta_j x^{(j)}}_{= f_{\mu,\beta}(x)}$$

where $g(\cdot)$ is a known link function and $f_{\mu,\beta}(x)$ is the linear predictor. The intercept $\mu$ is important and should not be penalized.

The intuition is simple: instead of directly modelling $\mathbb{E}[Y|X]$ as a linear function of $X$, we apply a transformation $g(\cdot)$ first. A standard linear model is just the special case where $g$ is the identity function.

2.2 Lasso for GLMs

To apply the Lasso in a GLM, we replace the squared error loss with the negative log-likelihood as the loss function:

Lasso for GLMs
$$(\hat{\mu}(\lambda), \hat{\beta}(\lambda)) = \arg\min_{\mu, \beta} \left( \underbrace{-\ell(\mu, \beta)}_{\text{neg. log-likelihood}} + \lambda \|\beta\|_1 \right)$$

The intercept $\mu$ is not penalized. Software: glmnet in R handles this efficiently for many GLM families.

A key property: for exponential family distributions, the negative log-likelihood is convex in $(\mu, \beta)$. This means all the nice computational and theoretical properties of the Lasso carry over. Specifically, high-dimensional consistency, oracle inequalities, and variable screening properties all hold under analogous conditions.
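A minimal illustration with scikit-learn on simulated binary data (`glmnet` in R, mentioned above, is the reference implementation; the simulation settings here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 2.0                                  # 3 truly active variables
prob = 1 / (1 + np.exp(-(X @ beta0)))
y = rng.binomial(1, prob)

# L1-penalized logistic regression: negative log-likelihood + lambda*||beta||_1.
# C is the inverse penalty strength; the saga solver leaves the intercept
# unpenalized, matching the convention in the text.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                         max_iter=5000).fit(X, y)
print(np.count_nonzero(clf.coef_))               # most coefficients are exactly zero
```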

2.3 Important GLM Examples

Logistic Regression (Binary Response)

When $Y_i \in \{0, 1\}$ follows a Bernoulli distribution with $Y_i | X_i \sim \text{Bernoulli}(\pi(X_i))$, we use the logit link:

$$\log\frac{\pi(x)}{1 - \pi(x)} = \mu + \sum_{j=1}^p \beta_j x^{(j)}$$

The loss function (negative log-likelihood per observation) is:

$$\rho_{\mu,\beta}(x, y) = -y \cdot f_{\mu,\beta}(x) + \log\big(1 + \exp(f_{\mu,\beta}(x))\big)$$

This is convex in $(\mu, \beta)$ because the first term is linear and the second has a positive second derivative. An equivalent "margin" formulation writes the loss as $\log(1 + e^{-\tilde{y}f})$ where $\tilde{y} = 2y - 1 \in \{-1, +1\}$, showing that it upper-bounds the misclassification error.
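A quick numerical check of the two equivalent forms of the loss, and of the upper bound on the (scaled) misclassification indicator (pure numpy, arbitrary grid of values):

```python
import numpy as np

def logistic_loss(f, y):
    # negative log-likelihood per observation, y in {0, 1}
    return -y * f + np.log1p(np.exp(f))

def margin_loss(f, y):
    # equivalent "margin" form with y_tilde = 2y - 1 in {-1, +1}
    return np.log1p(np.exp(-(2 * y - 1) * f))

f = np.linspace(-5, 5, 101)
for y in (0.0, 1.0):
    assert np.allclose(logistic_loss(f, y), margin_loss(f, y))
    # loss >= log(2) * 1{y_tilde * f < 0}: an upper bound on scaled 0-1 error
    assert np.all(margin_loss(f, y) >= np.log(2) * ((2 * y - 1) * f < 0))
```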

Poisson Regression (Count Data)

For count data $Y_i \in \{0, 1, 2, \ldots\}$ with $Y_i | X_i \sim \text{Poisson}(\lambda(X_i))$, the link function is $\log(\lambda(x)) = f_{\mu,\beta}(x)$. The loss is:

$$\rho_{\mu,\beta}(x, y) = -y \cdot f_{\mu,\beta}(x) + \exp(f_{\mu,\beta}(x))$$

Again convex, for the same reasons as logistic regression.
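As a sanity check, the stated loss agrees with the full Poisson negative log-likelihood from scipy up to the constant $\log(y!)$, which does not depend on $(\mu, \beta)$:

```python
import math
import numpy as np
from scipy.stats import poisson

def poisson_loss(f, y):
    # negative log-likelihood per observation, up to the constant log(y!)
    return -y * f + np.exp(f)

f = np.linspace(-2.0, 2.0, 9)   # grid of linear-predictor values
y = 3
ref = -poisson.logpmf(y, np.exp(f))   # full negative log-likelihood
assert np.allclose(poisson_loss(f, y), ref - math.log(math.factorial(y)))
# convexity: the second derivative in f is exp(f) > 0 everywhere
```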

Multinomial / Contingency Tables

For categorical $Y \in \{0, 1, \ldots, k-1\}$, the multinomial distribution uses the softmax inverse link $\pi_r = \exp(f_r) / \sum_s \exp(f_s)$. This extends to contingency tables (log-linear models) with factor variables. A key practical challenge: for contingency tables the number of categories equals the number of cells, $k = |I|$, which grows exponentially in the number of factors, so the design matrix can become enormous and computation expensive.

Key Insight

The Lasso for GLMs works because the negative log-likelihood remains convex. This means the same algorithmic tools (coordinate descent, etc.) and theoretical guarantees (oracle inequalities, variable screening) apply — we just swap out the loss function.

III. The Group Lasso

3.1 Motivation: Why Groups?

In many real problems, predictors come with a natural group structure. The standard Lasso treats each coefficient individually: it might set $\beta_3 = 0$ but $\beta_4 \neq 0$ even if both belong to the same categorical variable. This doesn't make practical sense — if a factor variable is irrelevant, all its dummy-coded columns should be zero simultaneously.

Example — Factor Variables

Suppose you have a categorical predictor "City" with 4 levels: Zurich, Basel, Bern, Geneva. To include it in a regression, you create 3 dummy variables (with sum contrasts). These 3 parameters form a natural group: either "City" matters (all 3 parameters active) or it doesn't (all 3 parameters zero). The standard Lasso might zero out only some of them, giving nonsensical results like "the Basel-vs-average effect matters but the Zurich-vs-average effect doesn't."

Groups also arise naturally in basis expansions for nonparametric functions (Chapter 5), interaction terms, and multitask learning.

3.2 Group Lasso: Definition

Let the parameter vector be partitioned into $q$ disjoint groups: $\beta = (\beta_{G_1}, \ldots, \beta_{G_q})$ where $G_1, \ldots, G_q$ partition $\{1, \ldots, p\}$.

Definition — Group Lasso Estimator (Yuan & Lin, 2006)
$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{q} m_j \|\beta_{G_j}\|_2 \right)$$

where $\|\beta_{G_j}\|_2 = \sqrt{\sum_{r \in G_j} \beta_r^2}$ is the $\ell_2$-norm of the sub-vector, and $m_j = \sqrt{|G_j|}$ is a weighting factor that accounts for differing group sizes.

Intuition — Why $\ell_2$-norms per group? The standard Lasso uses $\|\beta\|_1 = \sum_j |\beta_j|$, which penalizes each coefficient individually and promotes element-wise sparsity. The Group Lasso replaces this with a sum of $\ell_2$-norms over groups: $\sum_j \|\beta_{G_j}\|_2$. The $\ell_2$-norm is not differentiable at the zero vector (it has a "kink" there, similar to how $|\cdot|$ has a kink at 0). This kink forces entire groups to be either all zero or all non-zero. Think of it as "$\ell_1$ across groups, $\ell_2$ within groups."

Figure 1: The Lasso's diamond-shaped $\ell_1$ ball has corners on the axes (element-wise sparsity), whereas the Ridge $\ell_2$ ball is smooth (no sparsity). The Group Lasso applies this "corner" logic at the group level while using smooth $\ell_2$ geometry within each group.

3.3 Group Sparsity in Action

The Group Lasso produces group-sparse estimates: within each group, the coefficients are either all zero or (generically) all non-zero, so variable selection happens at the level of whole groups rather than individual coefficients.

3.4 Parameterization and the Groupwise Prediction Penalty

An important subtlety: the Group Lasso depends on the parameterization within each group. If you represent a 4-level factor with treatment contrasts vs. sum contrasts, you get different groups and potentially different results. This is undesirable.

Groupwise Prediction Penalty

A special but important variant uses:

$$\text{pen}(\beta) = \lambda \sum_{j=1}^{q} m_j \|X_{G_j} \beta_{G_j}\|_2 = \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T X_{G_j}^T X_{G_j} \beta_{G_j}}$$

This penalty is invariant under reparameterization within each group $G_j$, because it penalizes the predicted values $X_{G_j}\beta_{G_j}$, not the raw coefficients. When using an orthogonal parameterization where $X_{G_j}^T X_{G_j} = I$, this reduces to the standard Group Lasso.

In practice, with categorical variables you should either use a groupwise orthogonalized design or the groupwise prediction penalty directly.
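A sketch of the groupwise orthogonalization via QR, using a hypothetical random block standing in for the dummy-coded columns of a factor:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
# a hypothetical group: 3 dummy-type columns for a 4-level factor
Xg = rng.standard_normal((n, 3))

# QR-orthonormalize the group so that Q^T Q = I.
Q, R = np.linalg.qr(Xg, mode="reduced")
assert np.allclose(Q.T @ Q, np.eye(3))

# Same column space -> same fitted values. On the orthonormalized scale,
# ||Q b||_2 = ||b||_2, so the groupwise prediction penalty ||Xg beta||_2
# reduces to the plain group-lasso penalty ||b||_2.
beta = rng.standard_normal(3)
b = R @ beta                          # reparameterized coefficients
assert np.allclose(Q @ b, Xg @ beta)
assert np.isclose(np.linalg.norm(Q @ b), np.linalg.norm(b))
```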

3.5 Algorithm: Block Coordinate Descent

The Group Lasso is solved via block coordinate descent. Instead of updating one coefficient at a time (as in coordinate descent for the standard Lasso), we update one group at a time. For squared error loss, this works as follows:

  1. Initialize $\beta^{[0]}$ (e.g. to zero). Set $m = 0$.
  2. Cycle through groups $j = 1, \ldots, q$:
    • Compute the negative gradient for group $j$, holding all other groups fixed.
    • If the gradient norm is below a threshold $\lambda m_j$: set $\hat{\beta}_{G_j} = 0$ (the entire group is zeroed out).
    • Otherwise: solve a smaller optimization for $\beta_{G_j}$ only.
  3. Repeat until convergence.
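The steps above can be sketched in numpy. This minimal implementation assumes each group has been orthonormalized so that $X_{G_j}^T X_{G_j} = n I$, in which case the blockwise minimizer has closed form (a groupwise soft-threshold); the simulated demo is illustrative, not from the book:

```python
import numpy as np

def group_lasso_bcd(X, y, groups, lam, n_iter=200):
    """Block coordinate descent for (1/n)||y - X b||^2 + lam * sum_j m_j ||b_Gj||_2.

    Assumes each group is orthonormalized: X_Gj^T X_Gj = n * I, so each
    block update is a closed-form group soft-threshold.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for G in groups:
            mj = np.sqrt(len(G))
            r = y - X @ beta + X[:, G] @ beta[G]      # partial residual
            S = X[:, G].T @ r / n
            norm = np.linalg.norm(S)
            # zero the whole group if the gradient norm is below the threshold
            beta[G] = 0.0 if 2 * norm <= lam * mj else (1 - lam * mj / (2 * norm)) * S
    return beta

rng = np.random.default_rng(3)
n, K, q = 100, 3, 8
blocks = []
for _ in range(q):
    Q, _ = np.linalg.qr(rng.standard_normal((n, K)))
    blocks.append(np.sqrt(n) * Q)                     # X_Gj^T X_Gj = n * I
X = np.hstack(blocks)
groups = [list(range(j * K, (j + 1) * K)) for j in range(q)]
beta0 = np.zeros(q * K)
beta0[:K] = [2.0, -1.0, 1.5]                          # only group 0 is active
y = X @ beta0 + rng.standard_normal(n)

bhat = group_lasso_bcd(X, y, groups, lam=0.8)
print([int(np.any(bhat[G] != 0)) for G in groups])    # group-level support pattern
```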

For non-squared losses (e.g. in logistic regression), a block coordinate gradient descent variant is used, which replaces the exact within-block minimization with a quadratic approximation (similar to IRLS/Newton steps).

Practical Tip

An active set strategy dramatically speeds up computation: maintain a list of groups that are currently non-zero, and only loop over those plus occasional checks on the zero groups. Most groups stay at zero throughout.

3.6 Application: DNA Splice Site Prediction

A real-world application from computational biology: predicting splice sites in DNA using 7 factor variables (each with 4 nucleotide levels: A, C, G, T). A logistic group Lasso model was fit with all 3-way and lower interactions, totaling 64 terms and $p = 1{,}156$ parameters. The Group Lasso achieved a maximal correlation $\rho_{\max} = 0.6593$ on the test set, competitive with specialized maximum entropy approaches, while also providing interpretable selection of main effects and interactions.

3.7 Theoretical Guarantees (Overview)

The theoretical properties of the Group Lasso parallel those of the standard Lasso, though the arguments are more involved. With a suitable choice of $\lambda$ and under a group compatibility condition (the group-level analogue of the compatibility/restricted eigenvalue condition), one obtains oracle inequalities for prediction and estimation error, together with group-level variable screening guarantees.

When the group sizes $|G_j|$ are large relative to $\log(q)$, exploiting the group structure yields a prediction gain over the standard Lasso.

IV. Additive Models & Sparsity-Smoothness Penalty

4.1 From Linear to Additive Models

A linear model assumes $\mathbb{E}[Y|X] = \sum_j \beta_j X^{(j)}$ — a weighted sum of the raw covariates. This is restrictive: the true relationship between $Y$ and $X^{(j)}$ might be nonlinear. An additive model replaces each linear term with a smooth function:

Definition — High-Dimensional Additive Model
$$Y_i = \mu + \sum_{j=1}^{p} f_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \ldots, n$$

where each $f_j$ is a smooth univariate function and we require centering $\sum_i f_j(X_i^{(j)}) = 0$ for identifiability. The goal is to estimate all $p$ functions, where $p$ can be much larger than $n$.

Example — Linear vs. Additive

A linear model says the effect of "age" on blood pressure is $\beta \cdot \text{age}$ — a straight line. An additive model allows the effect to be any smooth curve $f(\text{age})$, capturing, say, a quadratic or exponential relationship. Crucially, in high dimensions, most of these functions should be zero (sparsity at the function level).

4.2 Basis Expansion: Turning Functions into Vectors

To make the problem computational, each function $f_j$ is expanded in a basis of $K$ functions (e.g. B-splines):

$$f_j(x) = \sum_{k=1}^{K} \beta_{j,k} \, h_{j,k}(x)$$

This converts the infinite-dimensional problem (estimating functions) into a finite-dimensional one (estimating parameter vectors $\beta_j \in \mathbb{R}^K$). The design matrix becomes $H = [H_1 | H_2 | \cdots | H_p]$ where each $H_j$ is $n \times K$. The total number of parameters is $pK$, which can be very large.

Now, each function $f_j$ corresponds to a group $\beta_j$ of $K$ parameters — and the Group Lasso structure appears naturally!
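A minimal sketch of the construction, using a cubic truncated-power basis as a simple stand-in for B-splines (the knot placement and the value of $K$ are illustrative choices):

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power basis (a simple stand-in for B-splines)."""
    cols = [x, x**2, x**3] + [np.clip(x - t, 0, None) ** 3 for t in knots]
    H = np.column_stack(cols)
    return H - H.mean(axis=0)   # center each column: sum_i f_j(X_i) = 0

rng = np.random.default_rng(4)
n, p = 150, 5
X = rng.uniform(-1, 1, size=(n, p))
knots = np.quantile(X, [0.25, 0.5, 0.75])   # shared interior knots, for simplicity
blocks = [spline_basis(X[:, j], knots) for j in range(p)]
H = np.hstack(blocks)                       # n x (p*K) design with K = 6
print(H.shape)                              # (150, 30)
```

Each block $H_j$ then plays the role of one group in the Group Lasso, with $\beta_j \in \mathbb{R}^K$ encoding the function $f_j$.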

4.3 The Sparsity-Smoothness Penalty

For additive models, we need to control two things simultaneously: (1) sparsity — most functions should be identically zero, and (2) smoothness — the non-zero functions shouldn't wiggle wildly. This leads to a combined penalty.

Definition — Sparsity-Smoothness Penalty (SSP)
$$\text{pen}_{\lambda_1, \lambda_2}(\beta) = \lambda_1 \sum_{j=1}^{p} \|f_j\|_n + \lambda_2 \sum_{j=1}^{p} I(f_j)$$

where $\|f_j\|_n = \|H_j \beta_j\|_2 / \sqrt{n}$ is the empirical norm (controlling sparsity) and $I(f_j) = \sqrt{\int |f_j''(x)|^2 dx} = \sqrt{\beta_j^T W_j \beta_j}$ is a Sobolev smoothness semi-norm (controlling smoothness). Two tuning parameters $\lambda_1$ and $\lambda_2$ are needed because sparsity and smoothness operate on very different scales.

Intuition: the $\|f_j\|_n$ term has a kink at $f_j \equiv 0$ (like the group-lasso penalty), so it sets entire functions exactly to zero; the $I(f_j)$ term penalizes curvature, preventing the selected non-zero functions from wiggling through the data.

SSP of Group Lasso Type (Computational Variant)

For more efficient computation, an alternative puts both terms under a single square root:

$$\text{pen}_{\lambda_1, \lambda_2}(\beta) = \lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_n^2 + \lambda_2^2 \, I^2(f_j)}$$

In parameterized form this becomes $\lambda_1 \sum_j \sqrt{\beta_j^T (H_j^T H_j / n + \lambda_2^2 W_j) \beta_j}$, which is exactly a generalized Group Lasso penalty with matrix $A_j = H_j^T H_j/n + \lambda_2^2 W_j$. So we can reuse the Group Lasso algorithms!
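A small numerical check of this reparameterization: writing $A_j = R_j^T R_j$ via a Cholesky factorization, the SSP term $\sqrt{\beta_j^T A_j \beta_j}$ equals $\|R_j \beta_j\|_2$, an ordinary group-lasso penalty on transformed coefficients (the smoothness matrix below is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 6
Hj = rng.standard_normal((n, K))        # basis-expanded block for one function
Wj = np.eye(K)                          # stand-in smoothness matrix (assumed)
lam2 = 0.1

Aj = Hj.T @ Hj / n + lam2**2 * Wj       # generalized group-lasso matrix
R = np.linalg.cholesky(Aj).T            # Aj = R^T R, R upper triangular

beta = rng.standard_normal(K)
# sqrt(beta^T Aj beta) = ||R beta||_2: an ordinary group-lasso penalty
# on the reparameterized coefficients b = R beta.
assert np.isclose(np.sqrt(beta @ Aj @ beta), np.linalg.norm(R @ beta))

# the transformed design Hj R^{-1} pairs with b in the fitting step
Hj_t = Hj @ np.linalg.inv(R)
assert np.allclose(Hj_t @ (R @ beta), Hj @ beta)
```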

4.4 Natural Cubic Splines: The Optimal Basis

A beautiful theoretical result: when minimizing a criterion with the curvature penalty $\int |f''|^2$ over the Sobolev space of twice-differentiable functions with square-integrable second derivative, the minimizer is always a natural cubic spline with knots at the data points. This means we lose nothing by restricting to finite-dimensional spline bases — the infinite-dimensional optimization automatically selects splines.

In practice, one uses B-spline bases with $K \approx \sqrt{n}$ interior knots placed at empirical quantiles.

4.5 Numerical Example

In a simulated setting with $n = 150$, $p = 200$, and only $s_0 = 4$ active functions (including $\sin$, quadratic, linear, and exponential shapes), the sparsity-smoothness estimator correctly identifies the active functions and recovers their shapes, while setting the remaining 196 functions to zero.

V. Extensions: GLMs, Varying Coefficients & Multitask Learning

5.1 Group Lasso for GLMs

Everything from the Group Lasso section extends to generalized linear models. Replace the squared error loss with the negative log-likelihood:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( -\ell(\beta; Y_1, \ldots, Y_n) + \lambda \sum_{j=1}^{q} m_j \|\beta_{G_j}\|_2 \right), \qquad m_j = \sqrt{|G_j|}$$

The block coordinate gradient descent algorithm handles this. Applications include group-sparse logistic regression (e.g. the DNA splice site example) and group-sparse Poisson regression.

5.2 Generalized Additive Models

Additive models also extend to GLMs. The model becomes $g(\mathbb{E}[Y_i | X_i]) = \mu + \sum_j f_j(X_i^{(j)})$ and estimation uses penalized negative log-likelihood with the sparsity-smoothness penalty. This allows, for instance, modelling a binary response with smooth nonlinear effects of high-dimensional covariates.

5.3 Varying Coefficient Models

When data has a "time" component, coefficients may change smoothly over time:

$$Y_i(t_r) = \mu + \sum_{j=1}^{p} \beta_j(t_r) X_i^{(j)}(t_r) + \varepsilon_i(t_r)$$

Each $\beta_j(\cdot)$ is a smooth function of time. The estimation again uses the sparsity-smoothness penalty, driving some $\hat{\beta}_j(\cdot) \equiv 0$ across all time points. This is closely related to the additive model framework.

5.4 Multitask Learning

When we have $T$ related regression problems (e.g. predicting gene expression at $T$ different time points), we can either run $T$ separate Lassos or use the Group Lasso simultaneously. The group structure groups coefficients for the same covariate across tasks: $\beta_{G_j} = (\beta_j(1), \ldots, \beta_j(T))^T$.

The Group Lasso approach enforces shared sparsity: if covariate $j$ is irrelevant, it is set to zero across all $T$ tasks simultaneously. Theory shows a prediction gain by a factor of $\log(p)$ when $T$ is large, compared to running $T$ separate Lassos.
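scikit-learn's `MultiTaskLasso` implements exactly this shared-sparsity penalty ($\ell_2$ across tasks, $\ell_1$ across covariates). A small simulated example:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(6)
n, p, T = 100, 30, 4
X = rng.standard_normal((n, p))
B0 = np.zeros((p, T))
B0[:3, :] = rng.standard_normal((3, T)) + 2.0   # 3 covariates relevant in all tasks
Y = X @ B0 + rng.standard_normal((n, T))

# Group Lasso across tasks: the coefficients for covariate j form one group
# over all T responses, so a covariate is dropped for all tasks at once.
mt = MultiTaskLasso(alpha=0.5).fit(X, Y)
row_support = np.flatnonzero(np.linalg.norm(mt.coef_, axis=0))  # coef_ is (T, p)
print(row_support)
```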

5.5 Further Extensions

The GLM + Lasso / Group Lasso framework also supports high-dimensional Poisson regression, high-dimensional contingency tables (multinomial distribution), and high-dimensional Cox regression (with pseudo-partial likelihood) for survival analysis.

VI. Summary: Comparing the Methods

  • Lasso. Penalty: $\lambda \sum_j |\beta_j|$. Sparsity: individual coefficients. Use for continuous covariates with no group structure.
  • Group Lasso. Penalty: $\lambda \sum_j m_j \|\beta_{G_j}\|_2$. Sparsity: entire groups. Use for factor variables, interactions, basis expansions.
  • Additive model (SSP). Penalty: $\lambda_1 \sum_j \|f_j\|_n + \lambda_2 \sum_j I(f_j)$. Sparsity: entire functions, plus smoothness. Use for nonlinear effects in high-dimensional nonparametric regression.
  • Sparse Group Lasso. Penalty: $(1-\alpha)\lambda\sum_j m_j\|\beta_{G_j}\|_2 + \alpha\lambda\|\beta\|_1$. Sparsity: groups and within groups. Use when both group-level and element-wise sparsity are wanted.

Key Takeaways