This week wraps up the theory behind non-parametric density estimation (Chapter 2 of the lecture notes) — focusing on the role of the bandwidth, the bias–variance tradeoff for the MSE, the curse of dimensionality, and smoothness-based minimax rates. The second half of the lecture introduces regression theory (Chapter 3): loss functions, the regression function as the risk minimizer, the error decomposition into reducible and irreducible parts, and the concept of consistency.
1. Non-Parametric Density Estimation (continued)
Recall from Week 3: given i.i.d. observations $X_1, \ldots, X_n$ from an unknown density $f$, the kernel density estimator (KDE) is
$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),$$ where $h > 0$ is the bandwidth (smoothing parameter) and $K$ is a kernel satisfying $K(x) \geq 0$, $\int K(x)\,dx = 1$, and $K(x) = K(-x)$.
Intuition: each observation contributes a "bump" of width $h$ centred at $X_i$. Summing and renormalising these bumps yields a smooth density estimate.
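The definition translates directly into code. A minimal sketch in Python (the function names `gaussian_kernel` and `kde` are ours, for illustration only):

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel: K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """KDE at point x: (1 / (n*h)) * sum_i K((x - X_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Each observation contributes a bump of width h centred at X_i;
# the sum of bumps is the density estimate.
sample = [-1.0, 0.0, 0.5, 2.0]
print(kde(0.0, sample, h=0.5))
```

With a single observation at 0 and $h = 1$, the estimate at 0 is exactly $K(0) = (2\pi)^{-1/2}$, which is a quick way to check the normalisation.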
1.1 Kernel Choice vs. Bandwidth Choice
Two popular kernels are the Gaussian kernel $K(x) = \varphi(x) = (2\pi)^{-1/2}e^{-x^2/2}$ and the Epanechnikov kernel $K(x) = \tfrac{3}{4}(1 - x^2)\,\mathbf{1}\{|x| \leq 1\}$, which minimises the asymptotic MISE among non-negative kernels. However, the key practical message is:
In practice, the choice of bandwidth $h$ matters far more than the choice of kernel shape. A large $h$ oversmooths (high bias, low variance); a small $h$ undersmooths (low bias, high variance, wiggly estimate).
1.2 The Role of the Bandwidth
The bandwidth $h$ is the smoothing parameter. As $h \to 0$, the estimate $\hat{f}_h$ approaches a sum of delta-spikes at each observation. As $h$ grows, $\hat{f}_h$ becomes increasingly smooth. For consistency of the estimator (meaning $\hat{f}_h \to f$ in some sense), we need the asymptotic regime: $$h \to 0 \quad \text{and} \quad nh \to \infty \quad \text{as } n \to \infty.$$
The first condition ensures the bias vanishes (the window shrinks), and the second ensures the variance also vanishes (each window still contains enough data points).
1.3 The Bias–Variance Tradeoff and MSE
The mean squared error at a point $x$ decomposes as: $$\text{MSE}(x) = \mathbb{E}\big[(\hat{f}_h(x) - f(x))^2\big] = \big(\mathbb{E}[\hat{f}_h(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{f}_h(x)\big).$$
Instead of optimising at a single point, we can consider the mean integrated squared error (MISE): $$\text{MISE}(h) = \mathbb{E}\left[\int \big(\hat{f}_h(x) - f(x)\big)^2\,dx\right] = \int \text{MSE}(x)\,dx.$$
Under smoothness assumptions on $f$ (twice continuously differentiable) and standard kernel conditions: $$\text{Bias}\big(\hat{f}_h(x)\big) \approx \frac{h^2}{2}\,\sigma_K^2\, f''(x), \qquad \mathrm{Var}\big(\hat{f}_h(x)\big) \approx \frac{R(K)\,f(x)}{nh},$$ where $\sigma_K^2 = \int z^2 K(z)\,dz$ and $R(K) = \int K(z)^2\,dz$.
Reading these formulas: The bias grows like $h^2$ — a larger bandwidth means more smoothing and more bias (proportional to the curvature $f''(x)$). The variance shrinks like $1/(nh)$ — more data and wider windows both reduce variance. This is the fundamental tradeoff: you cannot reduce both simultaneously with $h$ alone.
Optimal Bandwidth
Balancing these two leading terms (setting the derivative of the asymptotic MISE to zero) gives the optimal global bandwidth order and the resulting optimal MISE rate: $$h_{\text{opt}} \asymp n^{-1/5}, \qquad \text{MISE}(h_{\text{opt}}) \asymp n^{-4/5}.$$
The notation $\asymp$ means "of the same order as." Both the squared-bias and variance terms end up being $O(n^{-4/5})$ at the optimum — neither dominates. This $n^{-4/5}$ rate is the best achievable rate for density estimation in one dimension (for twice-differentiable densities).
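This balancing act can be checked numerically in a case where every constant is known. For $f$ standard normal and $K$ Gaussian we have $R(K) = 1/(2\sqrt{\pi})$, $\sigma_K^2 = 1$, and $R(f'') = 3/(8\sqrt{\pi})$, so the asymptotic MISE $\frac{R(K)}{nh} + \frac{h^4}{4}\sigma_K^4 R(f'')$ is minimised at $h_{\text{opt}} = (4/(3n))^{1/5} \approx 1.06\,n^{-1/5}$ (the constant behind Silverman's rule of thumb). A small sketch (our own code, not from the notes):

```python
import math

def amise(h, n):
    """Asymptotic MISE for a Gaussian kernel and standard normal density."""
    RK = 1 / (2 * math.sqrt(math.pi))    # R(K) = int K(z)^2 dz
    Rf2 = 3 / (8 * math.sqrt(math.pi))   # R(f'') = int f''(x)^2 dx
    sigma_K2 = 1.0                       # int z^2 K(z) dz
    return RK / (n * h) + (h ** 4) * (sigma_K2 ** 2) * Rf2 / 4

n = 1000
# AMISE is U-shaped in h: grid search should land near the closed form.
grid = [0.01 * k for k in range(1, 100)]
h_best = min(grid, key=lambda h: amise(h, n))
h_theory = (4 / (3 * n)) ** 0.2          # closed-form minimiser
print(h_best, h_theory)
```

The grid minimiser and the closed-form $h_{\text{opt}}$ agree to within the grid spacing, and both bandwidth extremes (too small, too large) give visibly larger AMISE.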
Bandwidth Selection in Practice
The exact optimal $h$ depends on unknown quantities like $R(f'') = \int (f''(x))^2\,dx$. In practice, plug-in methods estimate these unknown features using a pilot bandwidth. A widely used implementation is the Sheather–Jones selector, available in R as density(x, bw = "SJ").
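On the Python side, `scipy.stats.gaussian_kde` ships the rule-of-thumb selectors `"scott"` and `"silverman"` (Sheather–Jones itself is not built in). A minimal usage sketch:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)

# bw_method selects the bandwidth rule: "scott", "silverman",
# a scalar factor, or a callable.
kde = gaussian_kde(data, bw_method="silverman")

xs = np.linspace(-5, 5, 401)
density = kde(xs)                         # evaluate the estimate on a grid
print(density.sum() * (xs[1] - xs[0]))    # Riemann sum: should be close to 1
```

The Riemann-sum check confirms the estimate is (numerically) a valid density on the plotted range.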
If $\int z^2 K(z)\,dz = 0$ (a "bias-reducing" or higher-order kernel), the leading bias term vanishes. So why not always use these? The catch is that we still need the underlying density to be sufficiently smooth, and we generally don't know this in advance. Also, higher-order kernels can take negative values, which means $\hat{f}_h$ may go negative — not a valid density.
1.4 Higher Dimensions and the Curse of Dimensionality
For $d$-dimensional data $X_i \in \mathbb{R}^d$, the multivariate KDE becomes: $$\hat{f}_h(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - X_i}{h}\right),$$
where $K$ is now a $d$-dimensional kernel (often a product of univariate kernels: $K(\mathbf{u}) = \prod_{j=1}^d K_{\text{univ}}(u_j)$). The crucial issue is that in high dimensions, local neighbourhoods become empty unless $n$ is enormous.
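The product-kernel version is a short extension of the univariate sketch (again our own illustrative code, assuming a common bandwidth $h$ in every coordinate):

```python
import math

def product_kernel(u):
    """d-dimensional product of univariate Gaussian kernels."""
    return math.prod(math.exp(-0.5 * uj * uj) / math.sqrt(2 * math.pi)
                     for uj in u)

def kde_multi(x, data, h):
    """Multivariate KDE: (1 / (n * h^d)) * sum_i K((x - X_i) / h)."""
    n, d = len(data), len(x)
    total = sum(product_kernel([(xj - xij) / h for xj, xij in zip(x, xi)])
                for xi in data)
    return total / (n * h ** d)

points = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
print(kde_multi((0.0, 0.0), points, h=0.8))
```

With one observation at the origin and $h = 1$, the estimate at the origin is $K(\mathbf{0}) = (2\pi)^{-1}$ in $d = 2$, mirroring the univariate sanity check.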
For twice-differentiable densities in $d$ dimensions, the best possible MSE rate for any nonparametric density estimator is: $$n^{-4/(4+d)}.$$ As $d$ increases, this rate deteriorates dramatically — convergence becomes painfully slow.
The following table illustrates just how severe this effect is:
| $n^{-4/(4+d)}$ | $d=1$ | $d=2$ | $d=3$ | $d=5$ | $d=10$ |
|---|---|---|---|---|---|
| $n=100$ | 0.025 | 0.046 | 0.072 | 0.129 | 0.268 |
| $n=1{,}000$ | 0.004 | 0.010 | 0.019 | 0.046 | 0.139 |
| $n=100{,}000$ | $1.0 \times 10^{-4}$ | $4.6 \times 10^{-4}$ | $1.4 \times 10^{-3}$ | 0.006 | 0.037 |
To improve the MSE by a factor of 10 in $d = 1$, you need roughly 17× more data. In $d = 10$, you need about 3,160× more data. The practical takeaway: standard kernel density estimation is essentially limited to $d \leq 2$ or $3$ without additional structural assumptions.
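The arithmetic behind these factors: shrinking an MSE of order $n^{-4/(4+d)}$ by a factor of 10 requires multiplying $n$ by $10^{(4+d)/4}$. A one-function check:

```python
def data_factor_for_10x_mse(d):
    """Factor by which n must grow to cut the rate n^(-4/(4+d)) by 10."""
    return 10 ** ((4 + d) / 4)

print(data_factor_for_10x_mse(1))    # ~17.8 in d = 1
print(data_factor_for_10x_mse(10))   # ~3162 in d = 10
```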
1.5 Smoothness Assumptions and the General Minimax Rate
To derive convergence rates at all, we must impose smoothness conditions on $f$. Without them, uniform rates are impossible.
For $\alpha > 0$ and $L > 0$, the Hölder class $\Sigma(\alpha, L)$ consists of functions whose derivatives up to order $\lfloor \alpha \rfloor$ exist, and whose highest-order derivatives satisfy a Hölder condition of order $\alpha - \lfloor \alpha \rfloor$. Equivalently, writing $f_{x,\alpha}$ for the degree-$\lfloor \alpha \rfloor$ Taylor polynomial of $f$ at $x$, functions in $\Sigma(\alpha, L)$ have a Taylor remainder bounded by: $$|f(u) - f_{x,\alpha}(u)| \leq L \|u - x\|^\alpha.$$
In plain terms: the parameter $\alpha$ controls how "smooth" the function is. Higher $\alpha$ means smoother, which makes estimation easier.
Under these smoothness assumptions, the MISE decomposes as: $$\text{MISE}(h) \asymp \underbrace{h^{2\alpha}}_{\text{bias}^2} + \underbrace{\frac{1}{nh^d}}_{\text{variance}}.$$
Balancing these gives $h \asymp n^{-1/(2\alpha + d)}$, and the minimax rate: $$\inf_{\hat{f}}\,\sup_{f \in \Sigma(\alpha, L)} \mathbb{E}\left[\int \big(\hat{f}(x) - f(x)\big)^2\,dx\right] \asymp n^{-2\alpha/(2\alpha + d)}.$$
This is the best rate achievable by any estimator, uniformly over the class $\Sigma(\alpha, L)$. Setting $\alpha = 2$ (twice-differentiable densities) recovers the rate $n^{-4/(4+d)}$ from the curse of dimensionality discussion above.
The rate $n^{-2\alpha/(2\alpha+d)}$ reveals two factors that make estimation harder: (1) higher dimension $d$ — more dimensions mean sparser data and slower rates, and (2) lower smoothness $\alpha$ — rougher functions are harder to estimate. This generalises the curse of dimensionality beyond just the $\alpha = 2$ case.
2. Regression Theory (Introduction)
The second part of the lecture transitions back to the supervised learning setting where we have both predictors $X$ and responses $Y$. The goal: formalise what it means to make a "good" prediction and identify the optimal prediction function.
2.1 Setting and Loss Functions
We observe i.i.d. training data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown joint distribution $\Pr(X, Y)$. From this data, we construct a predictive model $\hat{f}: \mathbb{R}^p \to \mathbb{R}$.
To judge the quality of any prediction function $f$, we use a loss function $L: (y, \hat{y}) \mapsto L(y, \hat{y}) \in [0, \infty)$.
- Squared error loss (regression): $L(y, \hat{y}) = (y - \hat{y})^2$, where $y, \hat{y} \in \mathbb{R}$.
- 0–1 loss (classification): $L(y, \hat{y}) = \mathbf{1}(y \neq \hat{y})$, where $y, \hat{y} \in \mathcal{G}$ (a set of class labels).
2.2 Expected Squared Prediction Error (Risk)
Focusing on squared error loss, the expected squared prediction error (also called the risk) of a function $f$ is: $$\text{Err}_f = \mathbb{E}\big[(Y - f(X))^2\big],$$
where the expectation is over $(X, Y) \sim \Pr(X, Y)$. The key question is: what function $f^*$ minimises this risk?
Note that we are searching over all functions — this is a fundamentally infinite-dimensional optimisation problem (unlike parametric regression where we only optimise over a finite-dimensional $\beta$).
2.3 The Regression Function as Risk Minimizer
The function that minimises $\text{Err}_f = \mathbb{E}[(Y - f(X))^2]$ is the regression function: $$f^*(x) = \mathbb{E}(Y \mid X = x).$$
Proof Sketch
Using the tower property (law of iterated expectations): $$\mathbb{E}\big[(Y - f(X))^2\big] = \mathbb{E}_X\Big[\mathbb{E}\big[(Y - f(X))^2 \mid X\big]\Big].$$
Since the inner conditional expectation is non-negative for every $x$, it suffices to minimise pointwise. For each fixed $x$, we want to find $c = c(x)$ minimising $\mathbb{E}[(Y - c)^2 \mid X = x]$. Expanding: $$\mathbb{E}\big[(Y - c)^2 \mid X = x\big] = \mathbb{E}\big[Y^2 \mid X = x\big] - 2c\,\mathbb{E}\big[Y \mid X = x\big] + c^2.$$
This is a quadratic in $c$, minimised by taking the derivative and setting it to zero: $c^* = \mathbb{E}(Y \mid X = x)$. This is exactly the conditional expectation — the regression function.
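The pointwise argument is easy to sanity-check numerically: over a grid of candidate constants $c$, the sample mean attains the smallest average squared error (a toy check with made-up numbers standing in for a conditional sample of $Y$ given $X = x$):

```python
def avg_sq_error(c, ys):
    """Empirical analogue of E[(Y - c)^2 | X = x]."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 2.5, 4.0, 5.5]   # toy conditional sample; mean = 3.0
mean_y = sum(ys) / len(ys)

grid = [0.01 * k for k in range(-200, 801)]
c_best = min(grid, key=lambda c: avg_sq_error(c, ys))
print(c_best, mean_y)
```

The grid minimiser coincides with the mean up to the grid spacing, as the quadratic expansion predicts.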
2.4 The Error Decomposition
A fundamental decomposition connects any prediction function $f$ to the optimal $f^*$: $$\mathbb{E}\big[(Y - f(X))^2\big] = \underbrace{\mathbb{E}\big[(f(X) - f^*(X))^2\big]}_{\text{reducible}} + \underbrace{\mathbb{E}\big[(Y - f^*(X))^2\big]}_{\text{irreducible}}.$$
The first term measures how far your chosen $f$ is from the ideal $f^*$. This is the part you can improve by choosing a better model. The second term is the irreducible prediction error — it only vanishes if $Y$ is a deterministic function of $X$ (no noise). No model, no matter how sophisticated, can eliminate this noise floor.
Let $(X, Y)$ be bivariate normal with zero means, unit variances, and correlation $\rho$. Then $Y \mid X = x \sim \mathcal{N}(\rho x, 1 - \rho^2)$, so $f^*(x) = \rho x$ (a linear function!) and the irreducible error is $1 - \rho^2$. This is a case where linear regression is actually the optimal nonparametric solution.
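This example is easy to verify by simulation: regress simulated $Y$ on $X$ and check that the fitted slope approaches $\rho$ and the residual variance approaches $1 - \rho^2$ (a sketch with numpy; the construction $Y = \rho X + \sqrt{1-\rho^2}\,Z$ generates the required bivariate normal):

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.6
n = 200_000

# Bivariate normal with zero means, unit variances, correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

slope = np.cov(x, y)[0, 1] / np.var(x)   # least-squares slope, ~ rho
resid_var = np.var(y - slope * x)        # ~ 1 - rho^2 = 0.64
print(slope, resid_var)
```

With $n$ this large, both estimates land within about 0.01 of the population values $\rho = 0.6$ and $1 - \rho^2 = 0.64$.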
If we replace the squared loss with the absolute loss $L(y, \hat{y}) = |y - \hat{y}|$, the risk minimizer becomes the conditional median of $Y$ given $X$, rather than the conditional mean. This highlights that what $f^*$ is depends on your choice of loss function.
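The same grid-search check with absolute loss picks out the median instead of the mean (again a toy conditional sample, chosen skewed so that mean and median differ):

```python
import statistics

def avg_abs_error(c, ys):
    """Empirical analogue of E[|Y - c| | X = x]."""
    return sum(abs(y - c) for y in ys) / len(ys)

ys = [1.0, 2.0, 2.5, 4.0, 9.0]   # skewed: mean = 3.7, median = 2.5
grid = [0.01 * k for k in range(0, 1001)]
c_best = min(grid, key=lambda c: avg_abs_error(c, ys))
print(c_best, statistics.median(ys))
```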
2.5 Consistency of Predictive Methods
Given a sequence of estimators $\hat{f}_n$ (one for each sample size $n$), we want $\hat{f}_n$ to converge to $f^*$ as $n \to \infty$.
A sequence $\{\hat{f}_n\}$ is weakly consistent for a distribution $\Pr(X, Y)$ if: $$\lim_{n \to \infty} \mathbb{E}\!\left[\int_{\mathbb{R}^p} \{\hat{f}_n(x) - f^*(x)\}^2\, P_X(dx)\right] = 0.$$
In words: the expected $L^2$ distance between $\hat{f}_n$ and $f^*$ goes to zero.
Without structural assumptions on $\Pr(X, Y)$, convergence rates can be arbitrarily slow. There is no single model that works best for every problem. To obtain meaningful rates, you must assume something about the target — for example, that $f^*$ is smooth. This idea connects directly to the Hölder smoothness classes we saw in density estimation.
3. Key Takeaways
- Bandwidth is king: For KDE, the bandwidth $h$ controls the bias–variance tradeoff. The optimal rate $h_{\text{opt}} \asymp n^{-1/5}$ yields $\text{MISE} \asymp n^{-4/5}$ in 1D. The kernel shape matters much less.
- Curse of dimensionality: In $d$ dimensions, the best MSE rate degrades to $n^{-4/(4+d)}$ (for $\alpha=2$), and more generally $n^{-2\alpha/(2\alpha+d)}$. Practical KDE is limited to very low dimensions without structural assumptions.
- Smoothness buys speed: Smoother functions (higher $\alpha$) are easier to estimate. Convergence rates improve with $\alpha$ in the minimax rate $n^{-2\alpha/(2\alpha+d)}$.
- The regression function $f^*(x) = \mathbb{E}(Y \mid X = x)$ is the unique minimizer of the expected squared prediction error. Any other function incurs additional reducible error on top of the irreducible noise floor.
- No Free Lunch: Without assumptions on the data-generating process (like smoothness of $f^*$), no estimator can guarantee fast convergence. Structural assumptions are the price we pay for provable rates.