Splines, Smoothing & Model Assessment

Course: Computational Statistics (401-3632-00L) Lecture 6 Date: 2026-03-25

This lecture completes the nonparametric regression chapter by covering global modelling (regression splines) and penalized modelling (smoothing splines) — the last two of the four nonparametric paradigms. It then transitions into model assessment and selection, introducing the key ideas of training error, test error, optimism, Mallows' $C_p$, and AIC.

1. Recap: Local Polynomial Regression

Before moving to global methods, the lecture briefly revisits local polynomial regression. Recall that the Nadaraya–Watson estimator fits a local constant at each target point $x$. We can do better by fitting a local polynomial of degree $p-1$:

$$\hat{\boldsymbol\beta}(x) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \left(Y_i - \beta_1 - \beta_2(x_i - x) - \cdots - \beta_p(x_i - x)^{p-1}\right)^2$$

The function estimate is the fitted local intercept: $\hat{f}(x) = \hat{\beta}_1(x)$. Common choices are $p = 2$ (local linear) or $p = 4$ (local cubic). A major advantage of the local polynomial estimator is that it handles boundaries much better than the Nadaraya–Watson estimator, which tends to suffer from design bias near the edges of the data. The idea is that by fitting a line (or higher polynomial) rather than a constant, the estimator can "tilt" to follow the true function even when the kernel window extends beyond the data range.
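To make the weighted least-squares mechanics concrete, here is a minimal sketch of the $p = 2$ (local linear) case in Python; the Gaussian kernel, bandwidth, and synthetic data are illustrative assumptions, not from the lecture:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Minimize sum_i K((x0 - x_i)/h) (y_i - b1 - b2 (x_i - x0))^2
    and return the local intercept b1 = f_hat(x0)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # centred local design
    WX = w[:, None] * X
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)      # weighted normal equations
    return beta[0]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
fhat = np.array([local_linear(x0, x, y, h=0.08) for x0 in x])
```

Because the local design includes the linear term $x_i - x_0$, the fit can tilt near the data boundaries, which is exactly the design-bias advantage described above.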

2. From Local to Global: Regression Splines

All the methods seen so far (kNN, kernel regression, local polynomials) are local fitting methods. Regression splines take a fundamentally different approach: they build a global model from piecewise polynomial basis functions. The key idea is to partition the domain of $x$ using knots $\xi_1 < \xi_2 < \cdots < \xi_K$ and fit a polynomial on each interval — but crucially, these pieces are connected by smoothness constraints.

2.1 Piecewise Constant & Piecewise Linear Models

The simplest approach is a piecewise constant model (essentially a histogram for regression — usually too rough). A step up is the piecewise linear model, which fits a separate line on each interval defined by the knots. This model has $2(K+1)$ parameters and is typically discontinuous at the knots, which is undesirable.

2.2 Continuous Piecewise Linear Splines

We can enforce continuity at the knots by imposing matching constraints. This reduces the number of free parameters and yields a continuous piecewise linear model. A convenient basis representation is:

$$f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} \beta_{k+1}(x - \xi_k)_+$$

where $(a)_+ = \max(a, 0)$ is the "positive part" function. This is still a linear model in the parameters $\boldsymbol\beta$, so we can estimate it with ordinary least squares. The trick is that each basis function $(x - \xi_k)_+$ is zero to the left of knot $\xi_k$ and "turns on" linearly to the right — this automatically ensures the resulting function is continuous everywhere.
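Since the model is linear in $\boldsymbol\beta$, it can be fit by plain OLS. A small sketch with invented data and illustrative knot locations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
x = np.sort(rng.uniform(0, 1, n))
y = np.abs(x - 0.5) + 0.05 * rng.normal(size=n)   # true function has a kink at 0.5

knots = [0.25, 0.5, 0.75]
# Truncated-power basis 1, x, (x - xi_k)_+ : continuous piecewise linear model.
X = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0.0) for k in knots])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares
fitted = X @ beta
```

Note that the true function $|x - 0.5|$ lies exactly in the span of this basis (the kink sits on a knot), so the OLS fit recovers it up to noise.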

2.3 Cubic Splines

Piecewise linear fits are still visually rough (you can see the kinks). The natural idea is to increase the polynomial degree to gain flexibility between knots while also demanding more smoothness at the knots.

Definition — Cubic Spline

A cubic spline with knots $\xi_1 < \cdots < \xi_K$ is a function that is a cubic polynomial on each interval, and is twice continuously differentiable everywhere (i.e., $f \in C^2$). This means the function itself, its first derivative, and its second derivative all match at every knot.

A cubic spline can be written as a linear basis function model with $K + 4$ parameters:

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{l=1}^{K} \gamma_l (x - \xi_l)^3_+$$

The first four terms give a global cubic polynomial, and each additional basis function $(x - \xi_l)^3_+$ "activates" at knot $\xi_l$ and adds a correction. Because we use cubed positive parts, the resulting function automatically satisfies the $C^2$ condition. This is still a linear model in the parameters, so again OLS applies.
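The $K + 4$-column design matrix can be built directly from this basis and fit by OLS; the simulated data and knot locations below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

knots = np.quantile(x, [0.2, 0.4, 0.6, 0.8])      # K = 4 interior knots
# Truncated-power basis 1, x, x^2, x^3, (x - xi_l)_+^3 : K + 4 = 8 columns.
X = np.column_stack(
    [np.ones(n), x, x**2, x**3] + [np.maximum(x - k, 0.0) ** 3 for k in knots]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS on the spline basis
fitted = X @ coef
```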

Why cubic and not quadratic or quartic?

Cubic splines are the lowest-degree splines whose remaining discontinuity at the knots (it survives only in the third derivative) is not visible to the human eye. This makes them the standard choice in practice. Compared to fitting a single high-degree global polynomial, cubic splines use many low-degree pieces and are generally much more numerically stable.

Choosing knots

The number and placement of knots control the bias–variance trade-off: more knots means more flexibility (lower bias, higher variance), fewer knots means a smoother fit (higher bias, lower variance). In practice, knots are often placed at quantiles of the observed $x_i$ values, or the number of knots is chosen by cross-validation. B-splines provide a computationally better basis than the truncated power basis shown above, though they span the same function space.
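Both ideas (quantile knots and the B-spline basis) are available off the shelf; a sketch assuming SciPy is installed, where `splrep` with explicit interior knots `t` performs a least-squares cubic B-spline fit:

```python
import numpy as np
from scipy.interpolate import splev, splrep

rng = np.random.default_rng(3)
n = 300
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

# Interior knots placed at quantiles of the observed x values.
knots = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
# Passing t= switches splrep to a least-squares B-spline fit with these
# knots -- same function space as the truncated-power basis, but the
# B-spline basis is much better conditioned numerically.
tck = splrep(x, y, t=knots, k=3)
fitted = splev(x, tck)
```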

2.4 Natural Cubic Splines

Ordinary cubic splines can behave erratically beyond the boundary knots $\xi_1$ and $\xi_K$: the cubic polynomials in those tails can swing wildly, causing high variance in extrapolation. Natural cubic splines fix this by adding the constraint that $f$ must be linear beyond the boundary knots.

This linearity constraint frees up 4 degrees of freedom (two constraints at each boundary), so a natural cubic spline with $K$ knots has $\text{df} = K$ rather than $K + 4$. The trade-off is slightly more bias near the boundaries but significantly less variance — which is usually a good deal.

3. Smoothing Splines

Regression splines require you to choose the number and location of knots — a modelling decision that can be tricky. Smoothing splines offer an elegant alternative: instead of choosing knots, you let the data decide the smoothness through a roughness penalty.

3.1 The Penalized Least Squares Criterion

Among all twice-differentiable functions $f$, find the one minimizing:

$$\min_f \;\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int f''(z)^2\, dz$$

The first term is the usual residual sum of squares — it measures how well $f$ fits the data. The second term is a roughness penalty: the integral of the squared second derivative measures the total curvature of $f$. The parameter $\lambda \geq 0$ controls the trade-off between these two objectives.

Intuition — Extreme Cases of $\lambda$

If $\lambda = 0$: no penalty on curvature, so $f$ can interpolate every data point (maximum flexibility, likely overfitting). If $\lambda \to \infty$: the roughness penalty dominates, and since a straight line has $f'' \equiv 0$, the solution approaches ordinary linear regression. So $\lambda$ acts as a smooth dial between interpolation and a straight line.

3.2 The Remarkable Solution

The optimization above is over an infinite-dimensional function space (all $C^2$ functions). Yet, remarkably, the solution is finite-dimensional: the minimizer is always a natural cubic spline with knots at every unique observed value $x_1, \ldots, x_n$. So an infinite-dimensional problem reduces to standard linear algebra!

3.3 Basis Representation & Ridge-Type Solution

We represent the smoothing spline using natural spline basis functions $B_1, \ldots, B_n$:

$$f_\lambda(x) = \sum_{j=1}^{n} \beta_j B_j(x)$$

Substituting into the penalized criterion, it becomes:

$$\|\mathbf{Y} - \mathbf{B}\boldsymbol\beta\|^2 + \lambda\, \boldsymbol\beta^\top \boldsymbol\Omega\, \boldsymbol\beta$$

where $\mathbf{B}$ is the $n \times n$ design matrix of basis evaluations and $\Omega_{jk} = \int B_j''(z)\, B_k''(z)\, dz$ is the penalty matrix. The solution is:

Smoothing Spline Solution

$$\hat{\boldsymbol\beta} = (\mathbf{B}^\top \mathbf{B} + \lambda\, \boldsymbol\Omega)^{-1} \mathbf{B}^\top \mathbf{Y}$$

This has exactly the same form as a Ridge regression estimator. The penalty matrix $\lambda\boldsymbol\Omega$ shrinks the coefficients toward zero, preventing overfitting despite having $n$ parameters.

3.4 Smoothing Splines as Linear Smoothers

The fitted values can be written as $\hat{\mathbf{Y}} = \mathbf{S}_\lambda \mathbf{Y}$, where $\mathbf{S}_\lambda = \mathbf{B}(\mathbf{B}^\top \mathbf{B} + \lambda\boldsymbol\Omega)^{-1}\mathbf{B}^\top$ is a symmetric smoother (hat) matrix. The effective degrees of freedom are $\text{df} = \text{tr}(\mathbf{S}_\lambda)$, providing an intuitive measure of model complexity. For example, a smoothing spline with $\text{df} = 5$ is roughly as complex as a degree-4 polynomial.
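The ridge-type algebra of Sections 3.3–3.4 can be sketched in a few lines. For simplicity the sketch uses a plain cubic truncated-power basis with knots at every design point and a penalty matrix $\boldsymbol\Omega$ approximated numerically on a grid, rather than the exact natural-spline basis; the formulas for $\mathbf{B}$, $\boldsymbol\Omega$, and $\mathbf{S}_\lambda$ are the ones in the text:

```python
import numpy as np

n = 25
x = np.linspace(0.02, 0.98, n)                   # well-separated design points
rng = np.random.default_rng(4)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

# Basis: global cubic plus a truncated cube at every observation.
B = np.column_stack(
    [np.ones(n), x, x**2, x**3] + [np.maximum(x - k, 0.0) ** 3 for k in x]
)

# Omega_jk = int B_j''(z) B_k''(z) dz, via second derivatives on a fine grid.
grid = np.linspace(0.0, 1.0, 4001)
d2 = np.column_stack(
    [np.zeros_like(grid), np.zeros_like(grid),
     2.0 * np.ones_like(grid), 6.0 * grid]
    + [6.0 * np.maximum(grid - k, 0.0) for k in x]
)
Omega = d2.T @ d2 * (grid[1] - grid[0])

def smoother(lam):
    """S_lambda = B (B'B + lam * Omega)^{-1} B'."""
    return B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)

S1, S2 = smoother(1e-3), smoother(10.0)
df1, df2 = np.trace(S1), np.trace(S2)            # effective degrees of freedom
fitted = S1 @ y
```

Increasing $\lambda$ shrinks the trace of $\mathbf{S}_\lambda$ toward 2 (the unpenalized linear part), matching the intuition about the extreme cases of $\lambda$ above.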

3.5 Equivalent Kernel Interpretation

A smoothing spline can be approximated by a kernel estimator with a local bandwidth:

$$h(x) \propto \lambda^{1/4}\, n^{-1/4}\, f_X(x)^{-1/4}$$

This means the smoothing spline automatically adapts: where the data is sparse (low $f_X$), the effective bandwidth is larger (more smoothing), and where data is dense, it uses less smoothing to capture fine detail. This adaptive behaviour is a key advantage over fixed-bandwidth kernel methods.

3.6 Comparison: Kernel Regression vs. Smoothing Splines

| Property | Nadaraya–Watson (kernel) | Smoothing spline |
| --- | --- | --- |
| Fitting approach | Local constant | Global smooth function |
| Tuning parameter | Bandwidth $h$ | Smoothing parameter $\lambda$ (or df) |
| Adaptivity | Fixed bandwidth everywhere | Adapts via equivalent local bandwidth |
| Knots | Not applicable | Automatically at all $x_i$ |
| Smoother matrix | Yes (linear smoother) | Yes (symmetric linear smoother) |

3.7 Extensions & Outlook

Smoothing splines generalise beyond squared-error loss. For any loss function $L$, the minimizer of $\frac{1}{n}\sum_{i} L\{y_i, f(x_i)\} + \lambda \int f''(t)^2\, dt$ is still a natural cubic spline — though the coefficients may no longer have a closed-form solution. In multiple dimensions, fully nonparametric methods become inefficient due to the curse of dimensionality. Generalized additive models (GAMs) address this by assuming an additive structure: $E(Y \mid X_1, \ldots, X_p) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$, where each $f_j$ can be estimated with a univariate smoothing spline.
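The additive structure is typically fit by backfitting: cycle over the coordinates, smoothing the partial residuals of each. A toy sketch, using a crude Nadaraya–Watson smoother as a stand-in for a univariate smoothing spline (all names and data are invented for the example):

```python
import numpy as np

def smooth(x, y, h=0.1):
    """Crude Gaussian-kernel smoother, a stand-in for a smoothing spline."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

def backfit(X, y, n_iter=20):
    """Backfitting for E(Y | X) = alpha + f_1(X_1) + ... + f_p(X_p)."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - f.sum(axis=1) + f[:, j]  # partial residuals
            f[:, j] = smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()                      # identifiability
    return alpha, f

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = 1.0 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
alpha, f = backfit(X, y)
```

Each component $f_j$ is centred after smoothing so that the intercept $\alpha$ absorbs the overall level; without this the decomposition is not identifiable.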

4. Model Assessment & Selection

With so many nonparametric methods available — each with tuning parameters like $k$, $h$, $\lambda$, or the number of knots — the natural question is: how do we choose? This leads to Chapter 5 of the course: model assessment and selection.

4.1 Training Error vs. Test Error

The training error measures how well a fitted model $\hat{f}$ predicts the data it was trained on:

$$\text{err}_{\hat{f}} = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{f}(X_i))^2$$

This is always an overly optimistic estimate of how well the model will perform on new data, because the same observations were used for both fitting and evaluation. A complex enough model can drive the training error to zero while performing terribly on unseen data.

If we had an independent test set $\{(X_{\text{new},i}, Y_{\text{new},i})\}_{i=1}^m$, we could estimate the conditional (extra-sample) error:

$$R(\hat{f} \mid \mathcal{D}_n) = E\left[(Y_{\text{new}} - \hat{f}(X_{\text{new}}))^2 \mid \mathcal{D}_n\right]$$

Another important quantity is the generalization error $E[(Y_{\text{new}} - \hat{f}(X_{\text{new}}))^2]$, which averages over both the test point and the training data. This does not decrease monotonically with complexity — it typically exhibits a U-shape, first decreasing (underfitting regime) then increasing (overfitting regime).
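The contrast between the two curves is easy to reproduce; the polynomial family and the simulated data below are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
x_test = rng.uniform(0, 1, 500)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=500)

train_err, test_err = [], []
for deg in range(1, 11):
    c = np.polyfit(x, y, deg)                  # global polynomial of degree deg
    train_err.append(np.mean((y - np.polyval(c, x)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(c, x_test)) ** 2))
# Training error only decreases with degree; test error is minimized at an
# intermediate degree, reflecting the underfitting/overfitting trade-off.
```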

4.2 The Optimism of the Training Error

To understand why the training error is optimistic, the lecture introduces the in-sample error: the prediction error evaluated at the same input points $X_i$ but with fresh responses $\tilde{Y}_i$ drawn independently from $P(Y \mid X_i)$:

$$\text{Err}_{\text{in}} = \frac{1}{n}\sum_{i=1}^{n} E_{\tilde{Y}_i}\!\left[(\tilde{Y}_i - \hat{f}(X_i))^2 \mid \mathcal{D}_n\right]$$

The optimism is defined as $\text{op} = \text{Err}_{\text{in}} - \text{err}_{\hat{f}}$. Taking expectations yields the elegant identity:

Theorem — Expected Optimism

$$\omega = E(\text{op}) = \frac{2}{n}\sum_{i=1}^{n} \text{Cov}(\hat{Y}_i, Y_i)$$

where $\hat{Y}_i = \hat{f}(X_i)$. The optimism is governed by how much the fitted values depend on the training responses. The more the model adapts to the specific training outcomes (i.e., the more complex it is), the larger this covariance, and the more optimistic the training error.

Intuition for the Proof

Write out $E[(\tilde{Y}_i - \hat{Y}_i)^2]$ and $E[(Y_i - \hat{Y}_i)^2]$ using the bias-variance decomposition. Most terms cancel between the two expressions. The only surviving term is $-2\text{Cov}(Y_i, \hat{Y}_i)$, since $\tilde{Y}_i$ is independent of $\hat{Y}_i$ (so $\text{Cov}(\tilde{Y}_i, \hat{Y}_i) = 0$) but $Y_i$ is not (because $\hat{Y}_i$ was constructed using $Y_i$).
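The identity can be checked by Monte Carlo in the linear-model case, where $\sum_i \text{Cov}(\hat{Y}_i, Y_i) = \sigma^2\,\text{tr}(\mathbf{H}) = \sigma^2 p$; the simulation setup below is invented for the check:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 30, 3, 1.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix, trace = p
mu = X @ np.array([1.0, -2.0, 0.5])            # true mean vector

ops = []
for _ in range(4000):
    y = mu + sigma * rng.normal(size=n)
    yhat = H @ y
    err = np.mean((y - yhat) ** 2)             # training error
    # In-sample error has a closed form here:
    # Err_in = sigma^2 + ||mu - yhat||^2 / n
    err_in = sigma**2 + np.mean((mu - yhat) ** 2)
    ops.append(err_in - err)

omega_mc = float(np.mean(ops))                 # Monte Carlo estimate of E(op)
omega_theory = 2 * sigma**2 * p / n            # (2/n) sum_i Cov(yhat_i, y_i)
```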

4.3 Effective Degrees of Freedom & Mallows' $C_p$

Under the additive error model $Y_i = \mu_i + \varepsilon_i$ with $\text{Var}(\varepsilon_i) = \sigma^2$ and uncorrelated errors, we can write $\sum_{i} \text{Cov}(Y_i, \hat{Y}_i) = \sigma^2\, \text{df}(\hat{\mathbf{Y}})$, connecting the optimism directly to the effective degrees of freedom $\text{df} = \text{tr}(\mathbf{S})$ that we defined for linear smoothers last week. This gives:

$$E(\text{Err}_{\text{in}}) = E(\text{err}_{\hat{f}}) + \frac{2\sigma^2}{n}\,\text{df}(\hat{\mathbf{Y}})$$

Definition — Mallows' $C_p$

$$C_p = \frac{1}{n}\left\{\text{RSS} + 2\hat\sigma^2\, \text{df}(\hat{\mathbf{Y}})\right\}$$

where $\hat\sigma^2$ is estimated from a low-bias model. $C_p$ is an unbiased estimator of $E(\text{Err}_{\text{in}})$: among a collection of models, select the one with the smallest $C_p$. It corrects the training error for its optimistic bias by adding a penalty proportional to the model's complexity.
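As a sketch of $C_p$ in action (the polynomial family, data, and the choice of the largest model for estimating $\hat\sigma^2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = np.sort(rng.uniform(-1, 1, n))
y = 1 + 2 * x - x**3 + 0.3 * rng.normal(size=n)   # true model is cubic

degrees = list(range(1, 9))
# sigma^2 estimated from the most flexible (low-bias) candidate model
resid_big = y - np.polyval(np.polyfit(x, y, max(degrees)), x)
sigma2_hat = np.sum(resid_big**2) / (n - (max(degrees) + 1))

cp = []
for d in degrees:
    rss = np.sum((y - np.polyval(np.polyfit(x, y, d), x)) ** 2)
    df = d + 1                                    # parameters of degree-d fit
    cp.append((rss + 2 * sigma2_hat * df) / n)    # Mallows' C_p

best = degrees[int(np.argmin(cp))]                # model with smallest C_p
```

Degrees below 3 cannot represent the $-x^3$ term, so their RSS inflation dwarfs the complexity penalty and $C_p$ selects a degree of at least 3.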

4.4 AIC and BIC

Likelihood-based model selection follows the same principle of penalizing complexity. Information criteria take the form:

$$\text{*IC} = (\text{goodness of fit}) + (\text{model complexity penalty})$$

The two most popular choices are the Akaike Information Criterion (AIC) $= -2\ell(\hat\theta) + 2p$ and the Bayesian Information Criterion (BIC) $= -2\ell(\hat\theta) + \log(n)\, p$, where $\hat\theta$ is the MLE and $p$ is the effective number of parameters. BIC penalizes complexity more heavily for large $n$, and tends to favour simpler models. In Gaussian linear models with known $\sigma^2$, $C_p$ and AIC are equivalent (up to a multiplicative constant).
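A sketch comparing AIC and BIC on nested Gaussian polynomial fits, using $-2\ell(\hat\theta) = n\log(\text{RSS}/n)$ up to an additive constant (with $\sigma^2$ profiled out); the setup is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
x = np.sort(rng.uniform(-1, 1, n))
y = 1 + 2 * x + 0.5 * rng.normal(size=n)          # true model is linear

def ic(rss, p, penalty):
    # Gaussian profile likelihood: -2 l(theta_hat) = n log(RSS/n) + const
    return n * np.log(rss / n) + penalty * p

aic, bic = [], []
for d in range(1, 9):
    rss = np.sum((y - np.polyval(np.polyfit(x, y, d), x)) ** 2)
    p = d + 1
    aic.append(ic(rss, p, 2.0))                   # AIC penalty: 2 per parameter
    bic.append(ic(rss, p, np.log(n)))             # BIC penalty: log(n) per parameter

best_aic = 1 + int(np.argmin(aic))
best_bic = 1 + int(np.argmin(bic))
```

Since $\log(200) > 2$, the BIC-selected degree can never exceed the AIC-selected one on the same RSS sequence, illustrating BIC's preference for simpler models.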

Key Takeaways