This lecture completes the nonparametric regression chapter by covering global modelling (regression splines) and penalized modelling (smoothing splines) — the last two of the four nonparametric paradigms. It then transitions into model assessment and selection, introducing the key ideas of training error, test error, optimism, Mallows' $C_p$, and AIC.
1. Recap: Local Polynomial Regression
Before moving to global methods, the lecture briefly revisits local polynomial regression. Recall that the Nadaraya–Watson estimator fits a local constant at each target point $x$. We can do better by fitting a local polynomial of degree $p-1$:

$$\hat{\boldsymbol\beta}(x) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \sum_{i=1}^{n} K_h(X_i - x)\left\{Y_i - \sum_{j=1}^{p} \beta_j\,(X_i - x)^{j-1}\right\}^2$$
The function estimate is the fitted local intercept: $\hat{f}(x) = \hat{\beta}_1(x)$. Common choices are $p = 2$ (local linear) or $p = 4$ (local cubic). A major advantage of the local polynomial estimator is that it handles boundaries much better than the Nadaraya–Watson estimator, which suffers from design bias near the edges of the data. By fitting a line (or higher-degree polynomial) rather than a constant, the estimator can "tilt" to follow the true function even when the kernel window extends beyond the data range.
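A minimal numpy sketch of the local linear case ($p = 2$): solve a kernel-weighted least squares problem at each target point and keep the fitted intercept. The Gaussian kernel, bandwidth, and simulated data below are illustrative choices, not taken from the lecture.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: kernel-weighted least squares on [1, X - x0]."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)          # Gaussian kernel weights
    D = np.column_stack([np.ones_like(X), X - x0])  # local design matrix
    W = np.diag(w)
    beta = np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)
    return beta[0]                                   # intercept = f_hat(x0)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, 200)
fit = np.array([local_linear(x0, X, Y, h=0.05) for x0 in X])
```

Because the local line can tilt, the estimate near the endpoints tracks the truth instead of flattening out the way a local constant fit would.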
2. From Local to Global: Regression Splines
All the methods seen so far (kNN, kernel regression, local polynomials) are local fitting methods. Regression splines take a fundamentally different approach: they build a global model from piecewise polynomial basis functions. The key idea is to partition the domain of $x$ using knots $\xi_1 < \xi_2 < \cdots < \xi_K$ and fit a polynomial on each interval — but crucially, these pieces are connected by smoothness constraints.
2.1 Piecewise Constant & Piecewise Linear Models
The simplest approach is a piecewise constant model (essentially a histogram for regression — usually too rough). A step up is the piecewise linear model, which fits a separate line on each interval defined by the knots. This model has $2(K+1)$ parameters and is typically discontinuous at the knots, which is undesirable.
2.2 Continuous Piecewise Linear Splines
We can enforce continuity at the knots by imposing matching constraints. This reduces the number of free parameters and yields a continuous piecewise linear model. A convenient basis representation is:

$$f(x) = \beta_1 + \beta_2 x + \sum_{k=1}^{K} \beta_{k+2}\,(x - \xi_k)_+$$
where $(a)_+ = \max(a, 0)$ is the "positive part" function. This is still a linear model in the parameters $\boldsymbol\beta$, so we can estimate it with ordinary least squares. The trick is that each basis function $(x - \xi_k)_+$ is zero to the left of knot $\xi_k$ and "turns on" linearly to the right — this automatically ensures the resulting function is continuous everywhere.
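Since the model is linear in $\boldsymbol\beta$, a few lines of numpy suffice: build the truncated power basis and run OLS. The knot locations and simulated data below are illustrative.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns [1, x, (x - xi_1)_+, ..., (x - xi_K)_+]."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - xi, 0.0) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 300))
y = np.abs(x - 0.5) + rng.normal(0, 0.05, 300)  # true function has a kink at 0.5
B = linear_spline_basis(x, knots=[0.25, 0.5, 0.75])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)    # ordinary least squares
fit = B @ beta
```

The fit is continuous by construction: each $(x - \xi_k)_+$ column contributes a slope change at its knot, never a jump.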
2.3 Cubic Splines
Piecewise linear fits are still visually rough (you can see the kinks). The natural idea is to increase the polynomial degree to gain flexibility between knots while also demanding more smoothness at the knots.
A cubic spline with knots $\xi_1 < \cdots < \xi_K$ is a function that is a cubic polynomial on each interval, and is twice continuously differentiable everywhere (i.e., $f \in C^2$). This means the function itself, its first derivative, and its second derivative all match at every knot.
A cubic spline can be written as a linear basis function model with $K + 4$ parameters:

$$f(x) = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \sum_{l=1}^{K} \beta_{l+4}\,(x - \xi_l)_+^3$$
The first four terms give a global cubic polynomial, and each additional basis function $(x - \xi_l)^3_+$ "activates" at knot $\xi_l$ and adds a correction. Because $(x - \xi_l)^3_+$ and its first two derivatives all vanish at $\xi_l$, each basis function is itself $C^2$, so the resulting function automatically satisfies the $C^2$ condition. This is still a linear model in the parameters, so again OLS applies.
Cubic splines are the lowest-degree splines for which the discontinuity at the knots (which sits in the third derivative) is not visible to the human eye. This makes them the standard choice in practice. Compared to fitting a single high-degree global polynomial, cubic splines use many low-degree pieces and are generally much more numerically stable.
Choosing knots
The number and placement of knots control the bias–variance trade-off: more knots means more flexibility (lower bias, higher variance), fewer knots means a smoother fit (higher bias, lower variance). In practice, knots are often placed at quantiles of the observed $x_i$ values, or the number of knots is chosen by cross-validation. B-splines provide a numerically more stable basis than the truncated power basis shown above, though they span the same function space.
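A sketch combining the pieces above: the truncated power basis with $K + 4$ columns, knots at quantiles of $x$, and OLS. All settings are illustrative; production code would use a B-spline basis instead, for the numerical reasons just mentioned.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: [1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3]."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 300)
knots = np.quantile(x, [0.2, 0.4, 0.6, 0.8])    # K = 4 knots at quantiles of x
B = cubic_spline_basis(x, knots)                # 300 x (K + 4) = 300 x 8 design
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fit = B @ beta
```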
2.4 Natural Cubic Splines
Ordinary cubic splines can behave erratically beyond the boundary knots $\xi_1$ and $\xi_K$: the cubic polynomials in those tails can swing wildly, causing high variance in extrapolation. Natural cubic splines fix this by adding the constraint that $f$ must be linear beyond the boundary knots.
This linearity constraint frees up 4 degrees of freedom (two constraints at each boundary), so a natural cubic spline with $K$ knots has $\text{df} = K$ rather than $K + 4$. The trade-off is slightly more bias near the boundaries but significantly less variance — which is usually a good deal.
3. Smoothing Splines
Regression splines require you to choose the number and location of knots — a modelling decision that can be tricky. Smoothing splines offer an elegant alternative: instead of choosing knots, you let the data decide the smoothness through a roughness penalty.
3.1 The Penalized Least Squares Criterion
Among all twice-differentiable functions $f$, find the one minimizing:

$$\sum_{i=1}^{n} \{Y_i - f(X_i)\}^2 + \lambda \int f''(t)^2\, dt$$
The first term is the usual residual sum of squares — it measures how well $f$ fits the data. The second term is a roughness penalty: the integral of the squared second derivative measures the total curvature of $f$. The parameter $\lambda \geq 0$ controls the trade-off between these two objectives.
If $\lambda = 0$: no penalty on curvature, so $f$ can interpolate every data point (maximum flexibility, likely overfitting). If $\lambda \to \infty$: the roughness penalty dominates, and since a straight line has $f'' \equiv 0$, the solution approaches ordinary linear regression. So $\lambda$ acts as a smooth dial between interpolation and a straight line.
3.2 The Remarkable Solution
The optimization above is over an infinite-dimensional function space (all $C^2$ functions). Yet, remarkably, the solution is finite-dimensional: the minimizer is always a natural cubic spline with knots at every unique observed value $x_1, \ldots, x_n$. So an infinite-dimensional problem reduces to standard linear algebra!
3.3 Basis Representation & Ridge-Type Solution
We represent the smoothing spline using natural spline basis functions $B_1, \ldots, B_n$:

$$f(x) = \sum_{j=1}^{n} \beta_j B_j(x)$$
Substituting into the penalized criterion, it becomes:

$$(\mathbf{Y} - \mathbf{B}\boldsymbol\beta)^\top(\mathbf{Y} - \mathbf{B}\boldsymbol\beta) + \lambda\, \boldsymbol\beta^\top \boldsymbol\Omega\, \boldsymbol\beta$$
where $\mathbf{B}$ is the $n \times n$ design matrix of basis evaluations and $\Omega_{jk} = \int B_j''(z)\, B_k''(z)\, dz$ is the penalty matrix. The solution is:
$$\hat{\boldsymbol\beta} = (\mathbf{B}^\top \mathbf{B} + \lambda\, \boldsymbol\Omega)^{-1} \mathbf{B}^\top \mathbf{Y}$$
This has exactly the same form as a Ridge regression estimator. The penalty matrix $\lambda\boldsymbol\Omega$ shrinks the coefficients toward zero, preventing overfitting despite having $n$ parameters.
3.4 Smoothing Splines as Linear Smoothers
The fitted values can be written as $\hat{\mathbf{Y}} = \mathbf{S}_\lambda \mathbf{Y}$, where $\mathbf{S}_\lambda = \mathbf{B}(\mathbf{B}^\top \mathbf{B} + \lambda\boldsymbol\Omega)^{-1}\mathbf{B}^\top$ is a symmetric smoother (hat) matrix. The effective degrees of freedom are $\text{df} = \text{tr}(\mathbf{S}_\lambda)$, providing an intuitive measure of model complexity. For example, a smoothing spline with $\text{df} = 5$ is roughly as complex as a degree-4 polynomial.
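The Ridge form and the df formula can be illustrated without constructing the natural-spline basis: on an evenly spaced grid, a common shortcut replaces $\int f''(t)^2\,dt$ by squared second differences (a Whittaker-type smoother — an approximation, not the lecture's exact construction). With one coefficient per grid point, $\mathbf{B} = \mathbf{I}$ and $\mathbf{S}_\lambda = (\mathbf{I} + \lambda\boldsymbol\Omega)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix
Omega = D.T @ D                       # discrete penalty matrix

for lam in [1e-4, 1e-1, 1e2]:
    S = np.linalg.solve(np.eye(n) + lam * Omega, np.eye(n))  # smoother matrix S_lambda
    fit = S @ y
    print(f"lambda = {lam:.0e}: df = trace(S) = {np.trace(S):.2f}")
```

As $\lambda$ grows, df shrinks towards 2: constants and straight lines have zero second differences and are unpenalized, matching the $\lambda \to \infty$ limit of ordinary linear regression.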
3.5 Equivalent Kernel Interpretation
A smoothing spline can be approximated by a kernel estimator with a local bandwidth:

$$h(x) \propto \left\{\frac{\lambda}{f_X(x)}\right\}^{1/4}$$

where $f_X$ is the density of the design points.
This means the smoothing spline automatically adapts: where the data is sparse (low $f_X$), the effective bandwidth is larger (more smoothing), and where data is dense, it uses less smoothing to capture fine detail. This adaptive behaviour is a key advantage over fixed-bandwidth kernel methods.
3.6 Comparison: Kernel Regression vs. Smoothing Splines
| Property | Nadaraya–Watson (Kernel) | Smoothing Spline |
|---|---|---|
| Fitting approach | Local constant | Global smooth function |
| Tuning parameter | Bandwidth $h$ | Smoothing parameter $\lambda$ (or df) |
| Adaptivity | Fixed bandwidth everywhere | Adapts via equivalent local bandwidth |
| Knots | Not applicable | Automatically at all $x_i$ |
| Smoother matrix | Yes (linear smoother) | Yes (symmetric linear smoother) |
3.7 Extensions & Outlook
Smoothing splines generalise beyond squared-error loss. For any loss function $L$, the minimizer of $\frac{1}{n}\sum_{i} L\{y_i, f(x_i)\} + \lambda \int f''(t)^2\, dt$ is still a natural cubic spline — though the coefficients may no longer have a closed-form solution. In multiple dimensions, fully nonparametric methods become inefficient due to the curse of dimensionality. Generalized additive models (GAMs) address this by assuming an additive structure: $E(Y \mid X_1, \ldots, X_p) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$, where each $f_j$ can be estimated with a univariate smoothing spline.
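A toy backfitting sketch for a two-covariate GAM. For brevity it smooths each partial residual with a Nadaraya–Watson kernel smoother rather than a smoothing spline; the data, bandwidth, and number of sweeps are illustrative.

```python
import numpy as np

def smooth1d(x, r, h=0.15):
    """Nadaraya-Watson smooth of residuals r against one covariate, at the x_i."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ r) / w.sum(axis=1)

rng = np.random.default_rng(8)
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, n)

alpha = y.mean()
f = np.zeros((n, 2))                       # current estimates f_j(X_ij)
for _ in range(20):                        # backfitting sweeps
    for j in range(2):
        partial = y - alpha - f[:, 1 - j]  # remove the other component
        f[:, j] = smooth1d(X[:, j], partial)
        f[:, j] -= f[:, j].mean()          # centre each f_j for identifiability
fit = alpha + f.sum(axis=1)
```

Each sweep fits one univariate smoother to the partial residuals of the other component, so only one-dimensional smoothing is ever needed — the point of the additive structure.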
4. Model Assessment & Selection
With so many nonparametric methods available — each with tuning parameters like $k$, $h$, $\lambda$, or the number of knots — the natural question is: how do we choose? This leads to Chapter 5 of the course: model assessment and selection.
4.1 Training Error vs. Test Error
The training error measures how well a fitted model $\hat{f}$ predicts the data it was trained on:

$$\text{err}_{\hat{f}} = \frac{1}{n}\sum_{i=1}^{n} \{Y_i - \hat{f}(X_i)\}^2$$
This is always an overly optimistic estimate of how well the model will perform on new data, because the same observations were used for both fitting and evaluation. A complex enough model can drive the training error to zero while performing terribly on unseen data.
If we had an independent test set $\{(X_{\text{new},i}, Y_{\text{new},i})\}_{i=1}^m$, we could estimate the conditional (extra-sample) error:

$$\widehat{\text{Err}} = \frac{1}{m}\sum_{i=1}^{m} \{Y_{\text{new},i} - \hat{f}(X_{\text{new},i})\}^2$$
Another important quantity is the generalization error $E[(Y_{\text{new}} - \hat{f}(X_{\text{new}}))^2]$, which averages over both the test point and the training data. This does not decrease monotonically with complexity — it typically exhibits a U-shape, first decreasing (underfitting regime) then increasing (overfitting regime).
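A quick simulation of this U-shape (all settings illustrative): the training error can only decrease as the polynomial degree grows, while the error on fresh data eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)
x_new = rng.uniform(-1, 1, 1000)                  # large fresh test set
y_new = np.sin(3 * x_new) + rng.normal(0, 0.3, 1000)

train_err, test_err = [], []
for deg in range(1, 16):
    coef = np.polyfit(x, y, deg)                  # fit polynomial of this degree
    train_err.append(np.mean((y - np.polyval(coef, x)) ** 2))
    test_err.append(np.mean((y_new - np.polyval(coef, x_new)) ** 2))
```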
4.2 The Optimism of the Training Error
To understand why the training error is optimistic, the lecture introduces the in-sample error: the prediction error evaluated at the same input points $X_i$ but with fresh responses $\tilde{Y}_i$ drawn independently from $P(Y \mid X_i)$:

$$\text{Err}_{\text{in}} = \frac{1}{n}\sum_{i=1}^{n} E_{\tilde{Y}}\!\left[\{\tilde{Y}_i - \hat{f}(X_i)\}^2 \,\middle|\, \text{training data}\right]$$
The optimism is defined as $\text{op} = \text{Err}_{\text{in}} - \text{err}_{\hat{f}}$. Taking expectations yields the elegant identity:
$$\omega = E(\text{op}) = \frac{2}{n}\sum_{i=1}^{n} \text{Cov}(\hat{Y}_i, Y_i)$$
where $\hat{Y}_i = \hat{f}(X_i)$. The optimism is governed by how much the fitted values depend on the training responses. The more the model adapts to the specific training outcomes (i.e., the more complex it is), the larger this covariance, and the more optimistic the training error.
To see why, expand both $E[(\tilde{Y}_i - \hat{Y}_i)^2]$ and $E[(Y_i - \hat{Y}_i)^2]$. Most terms cancel between the two expressions; the only surviving difference is $-2\,\text{Cov}(Y_i, \hat{Y}_i)$, since $\tilde{Y}_i$ is independent of $\hat{Y}_i$ (so $\text{Cov}(\tilde{Y}_i, \hat{Y}_i) = 0$) while $Y_i$ is not (because $\hat{Y}_i$ was constructed using $Y_i$).
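The identity is easy to check by Monte Carlo. For OLS with hat matrix $\mathbf{H}$, $\sum_i \text{Cov}(Y_i, \hat{Y}_i) = \sigma^2\,\text{tr}(\mathbf{H}) = \sigma^2 p$, so $\omega = 2\sigma^2 p / n$. The design, dimensions, and seed below are illustrative; the true mean is taken as zero since the covariance does not depend on it.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, reps = 40, 4, 1.0, 20000
X = rng.normal(size=(n, p))                  # fixed design, kept across replications
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix of OLS

Ys = sigma * rng.normal(size=(reps, n))      # fresh response vectors (true mean 0)
Yhats = Ys @ H                               # OLS fitted values (H is symmetric)
cov_sum = np.sum(np.mean(Ys * Yhats, axis=0)
                 - np.mean(Ys, axis=0) * np.mean(Yhats, axis=0))
omega_mc = 2 * cov_sum / n                   # Monte Carlo estimate of E(op)
omega_theory = 2 * sigma ** 2 * p / n        # = 0.2 in this setup
```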
4.3 Effective Degrees of Freedom & Mallows' $C_p$
Under the additive error model $Y_i = \mu_i + \varepsilon_i$ with $\text{Var}(\varepsilon_i) = \sigma^2$ and uncorrelated errors, we can write $\sum_{i} \text{Cov}(Y_i, \hat{Y}_i) = \sigma^2\, \text{df}(\hat{\mathbf{Y}})$, connecting the optimism directly to the effective degrees of freedom $\text{df} = \text{tr}(\mathbf{S})$ that we defined for linear smoothers last week. This gives:
$$C_p = \frac{1}{n}\left\{\text{RSS} + 2\hat\sigma^2\, \text{df}(\hat{\mathbf{Y}})\right\}$$
where $\hat\sigma^2$ is estimated from a low-bias model. $C_p$ is an unbiased estimator of $E(\text{Err}_{\text{in}})$: among a collection of models, select the one with the smallest $C_p$. It corrects the training error for its optimistic bias by adding a penalty proportional to the model's complexity.
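A sketch of $C_p$ in action for choosing a polynomial degree (an illustrative stand-in for choosing $\lambda$ or the number of knots); here $\sigma^2$ is taken as known rather than estimated from a low-bias model.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)
sigma2 = 0.3 ** 2                        # treated as known for this illustration

cp = {}
for deg in range(1, 11):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    df = deg + 1                         # df of a degree-deg polynomial fit
    cp[deg] = (rss + 2 * sigma2 * df) / n
best = min(cp, key=cp.get)               # model with the smallest C_p
```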
4.4 AIC and BIC
Likelihood-based model selection follows the same principle of penalizing complexity. Information criteria take the form:

$$\text{IC} = -2\,\ell(\hat\theta) + \text{penalty for model complexity}$$
The two most popular choices are the Akaike Information Criterion (AIC) $= -2\ell(\hat\theta) + 2p$ and the Bayesian Information Criterion (BIC) $= -2\ell(\hat\theta) + \log(n)\, p$, where $\hat\theta$ is the MLE and $p$ is the effective number of parameters. BIC penalizes complexity more heavily for large $n$, and tends to favour simpler models. In Gaussian linear models with known $\sigma^2$, $C_p$ and AIC are equivalent up to additive and multiplicative constants, so they select the same model.
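For a Gaussian linear model with $\sigma^2$ profiled out at its MLE, $-2\ell(\hat\theta) = n\log(\text{RSS}/n)$ plus constants, which gives the computable forms below (polynomial-degree selection again serving as an illustrative example).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)

aic, bic = {}, {}
for deg in range(1, 11):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    p = deg + 1                               # number of regression coefficients
    aic[deg] = n * np.log(rss / n) + 2 * p    # additive constants dropped
    bic[deg] = n * np.log(rss / n) + np.log(n) * p
deg_aic = min(aic, key=aic.get)
deg_bic = min(bic, key=bic.get)
```

Because $\log(200) \approx 5.3 > 2$, BIC pays more per parameter and never selects a higher degree than AIC on the same fits.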
Key Takeaways
- Regression splines are linear basis function models: piecewise polynomials joined at knots with smoothness constraints. Cubic splines ($C^2$ continuity) are the standard, and natural cubic splines add linearity beyond the boundaries for better extrapolation.
- Smoothing splines solve a penalized least squares problem (data fit + roughness penalty). The solution is always a natural cubic spline — a remarkable reduction from infinite to finite dimensions — and takes the form of a Ridge-type estimator.
- Smoothing splines are adaptive: their equivalent kernel has a local bandwidth that increases in sparse regions and decreases in dense regions, unlike fixed-bandwidth kernel estimators.
- The training error is optimistic: its expected optimism equals $\frac{2}{n}\sum_i \text{Cov}(\hat{Y}_i, Y_i)$, which under standard assumptions equals $\frac{2\sigma^2}{n}\text{df}$. This motivates Mallows' $C_p$ as a bias-corrected model comparison criterion.
- AIC and BIC are the likelihood-based counterparts of $C_p$; all three share the principle of balancing goodness of fit against model complexity.