Lecture 3: Model Diagnostics, Selection & Introduction to Density Estimation

Course: Computational Statistics (401-3632-00L) · Week 3 · Dr. Jonathan Koh · ETH Zürich · FS2026 · Last updated: March 7, 2026

This lecture wraps up the linear regression chapter and opens the door to nonparametric methods. We begin with a recap of inference and optimality results from previous weeks, then cover three main topics: graphical diagnostics for checking linear model assumptions, model selection (including penalized regression and stepwise procedures), and an introduction to nonparametric density estimation via histograms and kernel estimators. The unifying thread is the bias–variance tradeoff: how model complexity must be balanced to achieve good predictive performance.

1. Recap of Previous Weeks

Week 3 begins by reviewing key results established so far. From the group game in Week 2, the important takeaways were that the estimator $\hat{\boldsymbol{\beta}}$ is random (it changes with each dataset), and that model selection, hypothesis testing, and the distinction between simple and multiple regression all matter in practice.

1.1 Properties of the Least Squares Estimator

Under the Gauss–Markov (GM) assumptions — the linear model $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\mathrm{E}(\boldsymbol{\varepsilon}) = \boldsymbol{0}$ and $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \boldsymbol{I}$ (uncorrelated, homoscedastic errors) — the OLS estimator $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1}\boldsymbol{X}^\top \boldsymbol{Y}$ is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators, it has the smallest variance. If we additionally assume Gaussian errors, then $\hat{\boldsymbol{\beta}}$ is also the maximum likelihood estimator and is Uniformly Minimum Variance Unbiased (UMVU).

1.2 Inference in the Gaussian Linear Model

Under Gaussian errors, the distributional results that enable classical inference are:

$$\hat{\boldsymbol{\beta}} \sim N_p\!\left(\boldsymbol{\beta},\; (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\sigma^2\right), \qquad \hat{\sigma}^2 \sim \frac{\sigma^2}{n-p}\,\chi^2_{n-p}.$$

These are used for $t$-tests of individual coefficients ($H_{0,j}: \beta_j = 0$), for ANOVA (global null test), and for constructing confidence and prediction intervals. When Gaussianity doesn't hold, the Central Limit Theorem justifies using these results approximately for large $n$.

Confidence Interval for $\beta_j$

A two-sided $(1-\alpha)$ confidence interval for $\beta_j$ is:

$$\hat{\beta}_j \;\pm\; \sqrt{\hat{\sigma}^2 (\boldsymbol{X}^\top\boldsymbol{X})^{-1}_{jj}} \;\cdot\; t_{n-p;\,1-\alpha/2}$$

where $t_{n-p;\,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the $t$-distribution with $n-p$ degrees of freedom.
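The interval above can be computed by hand from the OLS fit. Below is a minimal sketch on simulated data (the dataset, seed, and dimensions are illustrative, not from the lecture):

```python
# Two-sided 95% CIs for each coefficient, computed from the formula
# beta_hat_j +/- sqrt(sigma2_hat * (X^T X)^{-1}_jj) * t_{n-p; 0.975}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3                        # n observations, p columns incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)           # unbiased estimate of sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # standard errors of beta_hat_j
tq = stats.t.ppf(0.975, df=n - p)              # t_{n-p; 1-alpha/2}
ci = np.column_stack([beta_hat - tq * se, beta_hat + tq * se])
print(ci)
```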

Caveat: Individual vs. Joint Tests

An individual $t$-test for $\beta_j$ quantifies the effect of the $j$-th predictor after having subtracted the linear effect of all other predictors. It is possible for all individual tests to fail to reject while the global $F$-test does reject — this can happen when predictors are correlated.

2. Evaluating Model Assumptions Graphically

The residuals $r_i = y_i - \hat{y}_i$ approximate the unobservable errors $\varepsilon_i$ and are the primary tool for checking whether the linear model is appropriate. Three key diagnostic plots were covered:

2.1 The Tukey–Anscombe Plot

This plots residuals $r_i$ (y-axis) against fitted values $\hat{y}_i$ (x-axis). A useful property: for any model that includes an intercept, the sample correlation between $r_i$ and $\hat{y}_i$ is exactly zero by construction, so any visible structure in the plot reflects a violated assumption rather than an artifact. In the ideal scenario, the points fluctuate randomly around the horizontal line at zero — no patterns, no "funnels."

Common violations you can spot include: a fan-shaped spread (heteroscedasticity — the variance of $\varepsilon_i$ depends on $\hat{y}_i$), a curved trend (suggesting a missing nonlinear term, e.g., a quadratic), or distinct clusters with different spreads (group-specific variances).
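The zero-correlation property can be verified numerically. A small sketch on simulated data (all names and values are illustrative):

```python
# With an intercept in the model, residuals are orthogonal to the column
# space of X, so their sample correlation with the fitted values is zero
# (up to floating-point error).
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

corr = np.corrcoef(resid, fitted)[0, 1]
print(corr)   # ~ 0
```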

Practical Remedies

If the standard deviation of the residuals grows linearly with fitted values, a log-transform $y \mapsto \log(y)$ can stabilize the variance. If it grows as the square root, then $y \mapsto \sqrt{y}$ may help. Alternatively, one can use weighted least squares (see §1.6.4 in the lecture notes).

2.2 The Normal QQ Plot

The QQ (quantile–quantile) plot checks whether the residuals come from a normal distribution. You plot the empirical quantiles of the residuals (y-axis) against the theoretical quantiles of a $N(0,1)$ (x-axis). If the residuals are truly normal with mean $\mu$ and variance $\sigma^2$, the points should fall along a straight line with intercept $\mu$ and slope $\sigma$.

Deviations to watch for: "S-shaped" curves indicate heavy tails (more extreme values than a Gaussian), a bend to one side signals skewness, and isolated points far from the line are potential outliers.
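The QQ-plot coordinates are easy to compute by hand: sort the residuals and pair them with standard-normal quantiles. A sketch without plotting, using simulated "residuals" (names, seed, and plotting positions are illustrative):

```python
# QQ-plot coordinates: sorted residuals (empirical quantiles, y-axis)
# against N(0,1) quantiles at plotting positions (i - 0.5)/n (x-axis).
# For truly Gaussian residuals the points lie near a line with
# intercept mu and slope sigma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
r = rng.normal(loc=0.0, scale=2.0, size=500)    # stand-in for residuals

order = np.sort(r)                               # empirical quantiles
probs = (np.arange(1, len(r) + 1) - 0.5) / len(r)
theo = stats.norm.ppf(probs)                     # theoretical N(0,1) quantiles

slope, intercept = np.polyfit(theo, order, 1)    # fitted reference line
print(slope, intercept)    # slope ~ 2 (= sigma), intercept ~ 0 (= mu)
```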

2.3 Residual Correlation Plot

To check for independence of errors, plot the residuals $r_i$ against the observation index $i$ (or the time $t_i$ if data are collected sequentially). If neighbouring residuals look similar — i.e., you see smooth "waves" — the independence assumption for the errors is likely violated (serial correlation).

3. Model Selection in Linear Models

Suppose we have $p$ predictor variables, but only a subset may actually be relevant for prediction. Including unnecessary predictors increases model flexibility but also increases estimation variability. The goal is not to identify the "true" model, but to find a model with good predictive performance.

3.1 The Bias–Variance Tradeoff in Submodel Selection

Let $S = \{j_1, \ldots, j_q\} \subseteq \{1, \ldots, p\}$ be an index set of $q$ selected predictors, and $\hat{f}_S(x) = x_S^\top \hat{\boldsymbol{\beta}}_S$ the fitted values from the submodel. The average mean squared prediction error at the design points decomposes as:

Prediction Risk Decomposition
$$R(S) = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\Bigl(\mathrm{E}[\hat{f}_S(X_i)] - f^*(X_i)\Bigr)^2}_{\text{squared bias}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}\!\bigl(\hat{f}_S(X_i)\bigr)}_{\text{variance}}$$

Under the fixed-design linear model with homoscedastic errors, the variance term equals $\frac{q}{n}\sigma^2$ — it grows linearly with $q$. Meanwhile, the bias tends to decrease as $q$ increases (the model becomes more flexible). This is the fundamental bias–variance tradeoff.
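The identity for the variance term follows from $\sum_i \mathrm{Var}(\hat{f}_S(X_i)) = \sigma^2\,\mathrm{tr}(\boldsymbol{H}) = \sigma^2 q$, where $\boldsymbol{H}$ is the hat matrix of the submodel. A Monte Carlo sketch checking it (design, seed, and replication count are illustrative):

```python
# Under a fixed design with q predictors, the average variance of the
# fitted values across the design points equals q * sigma^2 / n.
import numpy as np

rng = np.random.default_rng(3)
n, q, sigma = 40, 4, 1.5
X = rng.normal(size=(n, q))             # fixed design, held across replications
H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix

fits = []
for _ in range(5000):
    y = X @ np.ones(q) + rng.normal(scale=sigma, size=n)  # arbitrary true beta
    fits.append(H @ y)                  # fitted values for this replication

avg_var = np.mean(np.var(np.array(fits), axis=0, ddof=1))
print(avg_var, q * sigma**2 / n)        # both ~ 0.225
```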

3.2 Why Not Just Use RSS?

The training error (RSS) always decreases as you add more predictors — for nested models, $\mathrm{RSS}(M_k) \geq \mathrm{RSS}(M_{k+1})$. So RSS alone is a poor criterion for model selection because it will always favour the most complex model. We need criteria that penalize model complexity.

3.3 Penalized Regression

Instead of minimizing $\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|^2$ alone, we add a penalty for the number of nonzero coefficients:

$$\hat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_0,$$

where $\|\boldsymbol{\beta}\|_0 = |\{k : \beta_k \neq 0\}|$ counts the number of active predictors. The regularization parameter $\lambda$ controls the tradeoff: larger $\lambda$ means fewer variables; $\lambda = 0$ gives ordinary least squares.

Two popular choices for $\lambda$ yield well-known information criteria:

| Criterion | Choice of $\lambda$ | Tendency |
|---|---|---|
| AIC (Akaike) / Mallows $C_p$ | $\lambda = 2\hat{\sigma}^2$ | May select slightly too-complex models |
| BIC (Bayesian) | $\lambda = \log(n)\,\hat{\sigma}^2$ | Tends towards simpler models |

In practice, $\lambda$ is often chosen via cross-validation (covered later in the course).
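A minimal sketch of the penalized criterion $\mathrm{RSS} + \lambda q$ for two nested models, with the AIC-type and BIC-type choices of $\lambda$ from the table above. The data, model sizes, and the convention of estimating $\hat{\sigma}^2$ from the full model are illustrative:

```python
# Compare the true 2-predictor submodel against the full model using
# RSS + lambda * q with lambda = 2*sigma2_hat (AIC / C_p) and
# lambda = log(n)*sigma2_hat (BIC).
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # 2 active predictors

def rss(cols):
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ b
    return r @ r

sigma2_hat = rss(range(p)) / (n - p)        # sigma^2 estimated from the full model
crit = {}
for cols in [range(3), range(p)]:           # true submodel vs full model
    q = len(cols)
    crit[q] = (rss(cols) + 2 * sigma2_hat * q,           # AIC-type
               rss(cols) + np.log(n) * sigma2_hat * q)   # BIC-type
    print(q, crit[q])
```

Note that for $n = 200$, $\log(n) \approx 5.3 > 2$, so the BIC penalty is always the harsher of the two.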

3.4 Best Subset Selection

For each model size $k = 1, \ldots, p$, fit all $\binom{p}{k}$ models with exactly $k$ predictors (plus the intercept), and identify the one with the smallest RSS — call it $M_k$. The RSS will always decrease over the sequence $M_1, M_2, \ldots, M_p$, so you then need to apply a criterion (AIC, BIC, cross-validation) to pick the best $k$.

Computational Limitation

Best subset selection requires fitting $2^p$ models, which is only feasible for small $p$ (roughly $p \leq 15$ or so, since $2^{15} = 32{,}768$). For larger $p$, we need greedy alternatives.
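The exhaustive search is straightforward to sketch with `itertools.combinations` when $p$ is small. Synthetic data; all names are illustrative:

```python
# Best subset selection: for each size k, fit all C(p, k) submodels
# (intercept always included) and keep the one with smallest RSS (M_k).
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 6
Z = rng.normal(size=(n, p))                   # predictors (without intercept)
y = 1.0 + Z[:, 0] - 2.0 * Z[:, 1] + rng.normal(size=n)

def rss(subset):
    Xs = np.column_stack([np.ones(n)] + [Z[:, j] for j in subset])
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ b
    return r @ r

best = {}                                     # k -> (RSS of M_k, chosen subset)
for k in range(1, p + 1):
    best[k] = min((rss(s), s) for s in itertools.combinations(range(p), k))
    print(k, round(best[k][0], 2), best[k][1])
```

As stated above, the RSS of $M_1, M_2, \ldots, M_p$ is nonincreasing, so a penalized criterion is still needed to choose $k$.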

3.5 Stepwise Selection

Stepwise methods greedily build a sequence of models, avoiding the exhaustive search:

Forward Selection

Start with the null model $M_0$ (intercept only). At each step, add the predictor that reduces RSS the most. Continue until all $p$ predictors are included. This produces a nested sequence $M_0 \subseteq M_1 \subseteq \cdots \subseteq M_p$, from which you pick the best model using a penalized criterion.
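The greedy loop can be sketched as follows (synthetic data; names and seed are illustrative):

```python
# Forward selection: start from the intercept-only model and repeatedly
# add the predictor that reduces RSS the most, yielding a nested sequence.
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
Z = rng.normal(size=(n, p))
y = 2.0 * Z[:, 0] - Z[:, 2] + rng.normal(size=n)

def rss(subset):
    Xs = np.column_stack([np.ones(n)] + [Z[:, j] for j in subset])
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ b
    return r @ r

selected, remaining, path = [], list(range(p)), []
while remaining:
    # candidate whose addition to `selected` gives the smallest RSS
    j_best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(j_best)
    remaining.remove(j_best)
    path.append((list(selected), rss(selected)))
    print(selected, round(path[-1][1], 2))
```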

Backward Selection

Start with the full model $M_p$ (all predictors). At each step, remove the predictor whose removal increases RSS the least. Continue until only the intercept remains. Backward selection is often slightly better than forward, but it requires $p < n$ (otherwise the full model cannot be fitted).

Important: Post-Selection Inference

After performing any model selection procedure (best subset, forward, backward), the usual inference tools ($t$-tests, confidence intervals) are no longer valid. The data have already been used heavily to choose the model, so p-values and standard errors computed from the selected model are overly optimistic. This is the problem of post-selection inference.

Example: Body Fat Data

Using the body fat dataset ($n = 247$, $p = 13$ quantitative predictors), best subset selection was applied. The centred $R^2$ increases from about 0.66 (1 predictor) to 0.74 (all 13), but most of the gain is captured by the first 3–4 predictors. Forward and backward selection produce slightly different orderings — for instance, forward selection adds abdomen first, while backward removes knee first — but the final selected models are often similar.

4. Introduction to Nonparametric Density Estimation

We now step away from the regression setting and consider a simpler data structure: univariate i.i.d. observations with no response variable. The goal is to estimate the distribution of the data without assuming a parametric form (like Gaussian). This is a prototype for estimation in infinite-dimensional function spaces — a principle that will recur throughout the course.

Setting

We observe $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} F$, where $F$ is an unknown CDF. Goal: estimate the density $f = F'$ (assuming it exists), without assuming a parametric model (e.g., Gaussian with unknown mean and variance). A density $f$ is an infinite-dimensional object.

4.1 The Histogram

The histogram is the oldest and most widely used density estimator. It requires two choices: an origin $x_0$ and a class width (bin width) $h > 0$. These define the bins:

$$I_j = (x_0 + jh,\; x_0 + (j+1)h], \qquad j \in \mathbb{Z}.$$

The histogram counts observations in each bin and scales the bar heights so that the total area equals 1 (making it a proper density estimate). Two important points about the tuning parameters: the origin $x_0$ is highly arbitrary and can change the appearance significantly, while the bin width $h$ controls the smoothness — small $h$ gives a wiggly, noisy estimate; large $h$ gives a smooth but potentially over-smoothed estimate.
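The scaling to total area 1 is easy to check numerically. A sketch with `np.histogram` and explicit bin edges $x_0 + jh$ (data, origin, and bin width are illustrative):

```python
# Histogram as a density estimate: density=True scales bar heights so the
# bars integrate to one over the binned range.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1000)

x0, h = -4.0, 0.5                        # origin and bin (class) width
edges = x0 + h * np.arange(17)           # bin edges x0 + j*h covering [-4, 4]
heights, _ = np.histogram(x, bins=edges, density=True)

area = np.sum(heights * h)               # total area under the bars
print(area)                              # ~ 1.0
```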

4.2 The Kernel Density Estimator (KDE)

The kernel density estimator improves on the histogram by placing a smooth "bump" centered at each observation, then averaging. It avoids the need for an origin $x_0$ and produces a smooth estimate.

Definition — Kernel Density Estimator
$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$

where $h > 0$ is the bandwidth (smoothing parameter) and $K$ is a kernel function satisfying:

$$K(x) \geq 0, \qquad \int K(x)\,dx = 1, \qquad K(x) = K(-x).$$

Intuition: Each observation $X_i$ contributes a "bump" of width $h$ shaped like $K$. Summing and normalizing these bumps produces a smooth density estimate. Common choices for $K$ include the Gaussian, Epanechnikov, and uniform (box) kernels.

The KDE can be understood starting from the naive density estimator. The density at a point $x$ can be written as $f(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X \leq x + h)$. Replacing probabilities with relative frequencies and not taking the limit gives the naive estimator — which is piecewise constant. The KDE simply replaces the box indicator with a smooth kernel $K$, yielding a continuous estimate.
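The defining formula translates directly into code. A from-scratch sketch with a Gaussian kernel (data, grid, and bandwidth are illustrative):

```python
# KDE from the definition: f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h),
# here with the Gaussian kernel K(u) = exp(-u^2/2)/sqrt(2*pi).
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    # one row per evaluation point, one column per observation
    u = (x_grid[:, None] - data[None, :]) / h
    return gauss_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(8)
data = rng.normal(size=400)
grid = np.linspace(-5, 5, 1001)
f_hat = kde(grid, data, h=0.4)

# Riemann-sum check: a valid density estimate integrates to ~1
area = np.sum(f_hat) * (grid[1] - grid[0])
print(area)
```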

The Role of Bandwidth $h$

The bandwidth $h$ plays the same role as the bin width in a histogram — it controls the bias–variance tradeoff. A small $h$ gives low bias but high variance (wiggly, noisy curve). A large $h$ gives low variance but high bias (oversmoothed, missing features). Choosing $h$ optimally is a central problem in nonparametric statistics, covered in more detail in Week 4.

Key Takeaways