Density Estimation Theory & Introduction to Regression Theory

Course: Computational Statistics (401-3632-00L) · Week 4 · 2026-03-13 · Dr. Jonathan Koh, ETH Zürich

This week wraps up the theory behind non-parametric density estimation (Chapter 2 of the lecture notes) — focusing on the role of the bandwidth, the bias–variance tradeoff for the MSE, the curse of dimensionality, and smoothness-based minimax rates. The second half of the lecture introduces regression theory (Chapter 3): loss functions, the regression function as the risk minimizer, the error decomposition into reducible and irreducible parts, and the concept of consistency.

1. Non-Parametric Density Estimation (continued)

Recall from Week 3: given i.i.d. observations $X_1, \ldots, X_n$ from an unknown density $f$, the kernel density estimator (KDE) is

Definition — Kernel Density Estimator

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),$$ where $h > 0$ is the bandwidth (smoothing parameter) and $K$ is a kernel satisfying $K(x) \geq 0$, $\int K(x)\,dx = 1$, and $K(x) = K(-x)$.

Intuition: each observation contributes a "bump" of width $h$ centred at $X_i$. Summing and renormalising these bumps yields a smooth density estimate.
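The definition translates directly into code. A minimal sketch in pure Python (the data and bandwidth below are illustrative, not from the lecture):

```python
import math

def gaussian_kernel(u):
    """Standard normal density: K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate f_hat_h(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Each observation contributes a "bump" of width h centred at X_i.
data = [-1.2, -0.4, 0.1, 0.3, 1.5]
print(kde(0.0, data, h=0.5))
```

Evaluating `kde` on a grid of $x$ values and plotting would reproduce the familiar sum-of-bumps picture.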

1.1 Kernel Choice vs. Bandwidth Choice

Two popular kernels are the Gaussian kernel $K(x) = \varphi(x) = (2\pi)^{-1/2}e^{-x^2/2}$ and the Epanechnikov kernel $K(x) = \tfrac{3}{4}(1 - x^2)\,\mathbf{1}\{|x| \leq 1\}$, which minimises the asymptotic MISE among non-negative kernels. However, the key practical message is:

Key Insight

In practice, the choice of bandwidth $h$ matters far more than the choice of kernel shape. A large $h$ oversmooths (high bias, low variance); a small $h$ undersmooths (low bias, high variance, wiggly estimate).

1.2 The Role of the Bandwidth

The bandwidth $h$ is the smoothing parameter. As $h \to 0$, the estimate $\hat{f}_h$ approaches a sum of delta-spikes at each observation. As $h$ grows, $\hat{f}_h$ becomes increasingly smooth. For consistency of the estimator (meaning $\hat{f}_h \to f$ in some sense), we need the asymptotic regime:

$$h = h_n \to 0 \quad \text{and} \quad n h_n \to \infty \quad \text{as } n \to \infty.$$

The first condition ensures the bias vanishes (the window shrinks), and the second ensures the variance also vanishes (each window still contains enough data points).

1.3 The Bias–Variance Tradeoff and MSE

The mean squared error at a point $x$ decomposes as:

$$\text{MSE}(x) = \mathbb{E}\!\left[\left(\hat{f}_h(x) - f(x)\right)^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}_h(x)] - f(x)\right)^2}_{\text{Bias}^2(x)} + \underbrace{\text{Var}\!\left(\hat{f}_h(x)\right)}_{\text{Variance}}.$$

Instead of optimising at a single point, we can consider the mean integrated squared error (MISE):

$$\text{MISE} = \mathbb{E}\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right].$$
Theorem — Asymptotic Bias and Variance of the KDE

Under smoothness assumptions on $f$ and standard kernel conditions:

$$\text{Bias}(x) = \frac{1}{2}h^2 f''(x)\int z^2 K(z)\,dz + o(h^2),$$
$$\text{Var}\!\left(\hat{f}_h(x)\right) = \frac{1}{nh}\,f(x)\int K^2(z)\,dz + o\!\left((nh)^{-1}\right).$$

Reading these formulas: The bias grows like $h^2$ — a larger bandwidth means more smoothing and more bias (proportional to the curvature $f''(x)$). The variance shrinks like $1/(nh)$ — more data and wider windows both reduce variance. This is the fundamental tradeoff: you cannot reduce both simultaneously with $h$ alone.

Optimal Bandwidth

Balancing these two leading terms (setting the derivative of the asymptotic MISE to zero) gives the optimal global bandwidth order and the resulting optimal MISE rate:

$$h_{\text{opt}} \asymp n^{-1/5}, \qquad \text{MISE} \asymp n^{-4/5}.$$

The notation $\asymp$ means "of the same order as." Both the squared-bias and variance terms end up being $O(n^{-4/5})$ at the optimum — neither dominates. This $n^{-4/5}$ rate is the best achievable rate for density estimation in one dimension (for twice-differentiable densities).
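The balancing step can be made explicit. Write $\mu_2(K) = \int z^2 K(z)\,dz$ and $R(g) = \int g(z)^2\,dz$. Squaring the leading bias term, integrating over $x$, and adding the integrated variance gives the asymptotic MISE

$$\text{AMISE}(h) = \frac{h^4}{4}\,\mu_2(K)^2\,R(f'') + \frac{R(K)}{nh}.$$

Setting the derivative to zero, $h^3\,\mu_2(K)^2\,R(f'') - \dfrac{R(K)}{nh^2} = 0$, yields

$$h_{\text{opt}} = \left(\frac{R(K)}{\mu_2(K)^2\,R(f'')}\right)^{1/5} n^{-1/5},$$

and substituting back shows that both terms of the AMISE are of order $n^{-4/5}$.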

Bandwidth Selection in Practice

The exact optimal $h$ depends on unknown quantities like $R(f'') = \int (f''(x))^2 dx$. In practice, plug-in methods estimate these unknown features using a pilot bandwidth. A widely-used implementation is the Sheather–Jones selector, available in R as density(x, bw = "SJ").
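Sheather–Jones itself requires a pilot estimate of $R(f'')$ and is more involved; as a simpler self-contained illustration, here is Silverman's rule of thumb (R's default, `bw.nrd0`), sketched in Python. It plugs in the bandwidth that would be optimal if $f$ were Gaussian:

```python
import statistics

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5).
    Plugs in the AMISE-optimal bandwidth under a Gaussian reference density."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # sample quartiles
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)
```

Note the $n^{-1/5}$ factor: any reasonable plug-in selector inherits the optimal order derived above, and differs only in how it estimates the unknown constant.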

Remark — Higher-Order Kernels

If $\int z^2 K(z)\,dz = 0$ (a "bias-reducing" or higher-order kernel), the leading bias term vanishes. So why not always use these? First, the bias reduction only pays off if $f$ has more than two derivatives, and we generally don't know its smoothness in advance. Second, such kernels necessarily take negative values somewhere, which means $\hat{f}_h$ may go negative and is then not a valid density.

1.4 Higher Dimensions and the Curse of Dimensionality

For $d$-dimensional data $X_i \in \mathbb{R}^d$, the multivariate KDE becomes:

$$\hat{f}_h(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - X_i}{h}\right),$$

where $K$ is now a $d$-dimensional kernel (often a product of univariate kernels: $K(\mathbf{u}) = \prod_{j=1}^d K_{\text{univ}}(u_j)$). The crucial issue is that in high dimensions, local neighbourhoods become empty unless $n$ is enormous.
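The product-kernel construction is a small extension of the univariate estimator. A sketch with a Gaussian product kernel and a single bandwidth $h$ for all coordinates (names and data are illustrative):

```python
import math

def gauss(u):
    """Univariate standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde_multivariate(x, data, h):
    """d-dimensional KDE with a product Gaussian kernel:
    f_hat(x) = 1/(n * h^d) * sum_i prod_j K((x_j - X_ij) / h)."""
    n, d = len(data), len(x)
    total = 0.0
    for xi in data:
        prod = 1.0
        for j in range(d):
            prod *= gauss((x[j] - xi[j]) / h)
        total += prod
    return total / (n * h ** d)
```

The $h^d$ normalisation is exactly what makes high dimensions painful: for fixed $h$, the effective volume of each kernel window shrinks geometrically in $d$.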

The Curse of Dimensionality

For twice-differentiable densities in $d$ dimensions, the best possible MSE rate for any nonparametric density estimator is: $$n^{-4/(4+d)}.$$ As $d$ increases, this rate deteriorates dramatically — convergence becomes painfully slow.

The following table illustrates just how severe this effect is:

| $n^{-4/(4+d)}$ | $d=1$ | $d=2$ | $d=3$ | $d=5$ | $d=10$ |
|---|---|---|---|---|---|
| $n=100$ | 0.025 | 0.046 | 0.072 | 0.129 | 0.268 |
| $n=1{,}000$ | 0.004 | 0.010 | 0.019 | 0.046 | 0.139 |
| $n=100{,}000$ | $1.0 \times 10^{-4}$ | $4.6 \times 10^{-4}$ | $1.39 \times 10^{-3}$ | 0.006 | 0.037 |

To improve the MSE by a factor of 10 in $d = 1$, you need roughly $10^{5/4} \approx 18\times$ more data. In $d = 10$, you need about $10^{7/2} \approx 3{,}162\times$ more data. The practical takeaway: standard kernel density estimation is essentially limited to $d \leq 2$ or $3$ without additional structural assumptions.
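The arithmetic behind these factors: shrinking the rate $n^{-4/(4+d)}$ by a factor of 10 requires multiplying $n$ by $10^{(4+d)/4}$. A quick check:

```python
def extra_data_factor(d, improvement=10.0):
    """Multiplier on n needed to shrink the rate n^(-4/(4+d))
    by the given factor: improvement^((4+d)/4)."""
    return improvement ** ((4 + d) / 4)

print(extra_data_factor(1))   # 10^(5/4) ~ 17.8, i.e. roughly 18x in d = 1
print(extra_data_factor(10))  # 10^(7/2) ~ 3162x in d = 10
```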

1.5 Smoothness Assumptions and the General Minimax Rate

To derive convergence rates at all, we must impose smoothness conditions on $f$. Without them, uniform rates are impossible.

Definition — Hölder Smoothness Class $\Sigma(\alpha, L)$

For $\alpha > 0$ and $L > 0$, the Hölder class $\Sigma(\alpha, L)$ consists of functions whose derivatives up to order $\lfloor \alpha \rfloor$ exist, and whose highest-order derivatives satisfy a Hölder condition of exponent $\alpha - \lfloor \alpha \rfloor$. Equivalently, functions in $\Sigma(\alpha, L)$ have a Taylor remainder bounded by $$|f(u) - f_{x,\alpha}(u)| \leq L \|u - x\|^\alpha,$$ where $f_{x,\alpha}$ denotes the Taylor polynomial of $f$ of degree $\lfloor \alpha \rfloor$ centred at $x$.

In plain terms: the parameter $\alpha$ controls how "smooth" the function is. Higher $\alpha$ means smoother, which makes estimation easier.

Under these smoothness assumptions, the MISE decomposes as:

$$\text{MISE}(h) = \underbrace{O(h^{2\alpha})}_{\text{squared bias}} + \underbrace{O\!\left(\frac{1}{nh^d}\right)}_{\text{variance}}.$$

Balancing these gives $h \asymp n^{-1/(2\alpha + d)}$, and the minimax rate:

$$\sup_{f \in \Sigma(\alpha, L)} \mathbb{E}\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right] \leq C\, n^{-2\alpha/(2\alpha + d)}.$$

The kernel estimator with the balanced bandwidth attains this bound, and a matching lower bound shows that no estimator can converge faster uniformly over $\Sigma(\alpha, L)$; hence $n^{-2\alpha/(2\alpha+d)}$ is the minimax rate. Setting $\alpha = 2$ (twice-differentiable densities) recovers the rate $n^{-4/(4+d)}$ from the curse of dimensionality discussion above.

The Two Enemies of Convergence

The rate $n^{-2\alpha/(2\alpha+d)}$ reveals two factors that make estimation harder: (1) higher dimension $d$ — more dimensions mean sparser data and slower rates, and (2) lower smoothness $\alpha$ — rougher functions are harder to estimate. This generalises the curse of dimensionality beyond just the $\alpha = 2$ case.
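Both effects are easy to see numerically. A tiny helper (illustrative) evaluating $n^{-2\alpha/(2\alpha+d)}$:

```python
def minimax_rate(n, alpha, d):
    """Minimax MISE rate n^(-2*alpha/(2*alpha + d)) over Sigma(alpha, L)."""
    return n ** (-2 * alpha / (2 * alpha + d))

n = 10_000
# Enemy 1: higher dimension d slows convergence (larger value = worse rate).
print(minimax_rate(n, 2, 1), minimax_rate(n, 2, 10))
# Enemy 2: lower smoothness alpha slows convergence.
print(minimax_rate(n, 2, 1), minimax_rate(n, 0.5, 1))
```

Setting `alpha=2, d=1` reproduces the $n^{-4/5}$ rate from the one-dimensional discussion.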

2. Regression Theory (Introduction)

The second part of the lecture transitions back to the supervised learning setting where we have both predictors $X$ and responses $Y$. The goal: formalise what it means to make a "good" prediction and identify the optimal prediction function.

2.1 Setting and Loss Functions

We observe i.i.d. training data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown joint distribution $\Pr(X, Y)$. From this data, we construct a predictive model $\hat{f}: \mathbb{R}^p \to \mathbb{R}$.

To judge the quality of any prediction function $f$, we use a loss function $L: (y, \hat{y}) \mapsto L(y, \hat{y}) \in [0, \infty)$.

Definition — Common Loss Functions
  • Squared error loss (regression): $L(y, \hat{y}) = (y - \hat{y})^2$, where $y, \hat{y} \in \mathbb{R}$.
  • 0–1 loss (classification): $L(y, \hat{y}) = \mathbf{1}(y \neq \hat{y})$, where $y, \hat{y} \in \mathcal{G}$ (a set of class labels).

2.2 Expected Squared Prediction Error (Risk)

Focusing on squared error loss, the expected squared prediction error (also called the risk) of a function $f$ is:

$$\text{Err}_f = \mathbb{E}\!\left[\{Y - f(X)\}^2\right],$$

where the expectation is over $(X, Y) \sim \Pr(X, Y)$. The key question is: what function $f^*$ minimises this risk?

$$f^* = \arg\min_f \text{Err}_f.$$

Note that we are searching over all functions — this is a fundamentally infinite-dimensional optimisation problem (unlike parametric regression where we only optimise over a finite-dimensional $\beta$).

2.3 The Regression Function as Risk Minimizer

Theorem — The Regression Function Minimises the Squared Error Risk

The function that minimises $\text{Err}_f = \mathbb{E}[(Y - f(X))^2]$ is the regression function: $$f^*(x) = \mathbb{E}(Y \mid X = x).$$

Proof Sketch

Using the tower property (law of iterated expectations):

$$\mathbb{E}[\{Y - f(X)\}^2] = \mathbb{E}\!\left(\mathbb{E}[\{Y - f(X)\}^2 \mid X]\right) = \int_{\mathbb{R}^p} \mathbb{E}[\{Y - f(x)\}^2 \mid X = x]\, P_X(dx).$$

Since the value $f(x)$ can be chosen freely and independently for each $x$, it suffices to minimise the inner conditional expectation pointwise. For each fixed $x$, we want to find $c = c(x)$ minimising $\mathbb{E}[(Y - c)^2 \mid X = x]$. Expanding:

$$\mathbb{E}[(Y - c)^2 \mid X = x] = \mathbb{E}[Y^2 \mid X = x] - 2c\,\mathbb{E}[Y \mid X = x] + c^2.$$

This is a quadratic in $c$; setting the derivative $-2\,\mathbb{E}[Y \mid X = x] + 2c$ to zero gives $c^* = \mathbb{E}(Y \mid X = x)$. This is exactly the conditional expectation, i.e. the regression function.
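The pointwise argument can be checked numerically. Below, a toy conditional distribution of $Y$ given $X = x$ (values and probabilities are made up for illustration); scanning a grid of constants $c$ recovers the conditional mean as the risk minimiser:

```python
# Toy conditional distribution: Y | X = x takes these values with these probabilities.
values = [0.0, 1.0, 2.0, 5.0]
probs  = [0.1, 0.4, 0.3, 0.2]

cond_mean = sum(v * p for v, p in zip(values, probs))  # E[Y | X = x] = 2.0

def risk(c):
    """Conditional risk E[(Y - c)^2 | X = x] for a constant prediction c."""
    return sum(p * (v - c) ** 2 for v, p in zip(values, probs))

# Grid-search the minimiser; it coincides with the conditional mean.
grid = [i / 1000 for i in range(-2000, 6001)]
best = min(grid, key=risk)
print(cond_mean, best)  # both 2.0
```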

2.4 The Error Decomposition

A fundamental decomposition connects any prediction function $f$ to the optimal $f^*$:

Theorem — Reducible + Irreducible Error Decomposition
$$\mathbb{E}[\{Y - f(X)\}^2] = \underbrace{\mathbb{E}[\{f(X) - f^*(X)\}^2]}_{\text{reducible error (}L^2\text{ error of }f\text{)}} + \underbrace{\mathbb{E}[\{Y - f^*(X)\}^2]}_{\text{irreducible error } = \text{Err}_{f^*}}.$$

The first term measures how far your chosen $f$ is from the ideal $f^*$; this is the part you can improve by choosing a better model. The second term is the irreducible prediction error: it only vanishes if $Y$ is a deterministic function of $X$ (no noise). No model, no matter how sophisticated, can eliminate this noise floor. The decomposition holds because the cross term vanishes: $\mathbb{E}[\{f(X) - f^*(X)\}\{Y - f^*(X)\}] = 0$ by the tower property, since $\mathbb{E}[Y - f^*(X) \mid X] = 0$.

Example — Bivariate Normal

Let $(X, Y)$ be bivariate normal with zero means, unit variances, and correlation $\rho$. Then $Y \mid X = x \sim \mathcal{N}(\rho x, 1 - \rho^2)$, so $f^*(x) = \rho x$ (a linear function!) and the irreducible error is $1 - \rho^2$. This is a case where linear regression is actually the optimal nonparametric solution.
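A quick Monte Carlo check of this example (seed and sample size chosen for reproducibility), using the representation $Y = \rho X + \sqrt{1-\rho^2}\,Z$ with $Z \sim \mathcal{N}(0,1)$ independent of $X$:

```python
import math
import random

random.seed(42)
rho = 0.8
n = 200_000

# Average squared residual from the optimal predictor f*(x) = rho * x.
sq = math.sqrt(1 - rho ** 2)
irreducible = 0.0
for _ in range(n):
    x = random.gauss(0, 1)
    y = rho * x + sq * random.gauss(0, 1)
    irreducible += (y - rho * x) ** 2
irreducible /= n
print(irreducible)  # close to 1 - rho^2 = 0.36
```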

Remark — Absolute Loss

If we replace the squared loss with the absolute loss $L(y, \hat{y}) = |y - \hat{y}|$, the risk minimizer becomes the conditional median of $Y$ given $X$, rather than the conditional mean. This highlights that what $f^*$ is depends on your choice of loss function.
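The contrast between the two losses shows up even for a toy discrete distribution (values and probabilities below are made up): grid-searching the risk minimiser recovers the median under absolute loss and the mean under squared loss.

```python
# Skewed toy distribution: median = 1.0, mean = 1.9.
values = [0.0, 1.0, 2.0, 10.0]
probs  = [0.3, 0.3, 0.3, 0.1]

def abs_risk(c):
    """Expected absolute loss E|Y - c| under the toy distribution."""
    return sum(p * abs(v - c) for v, p in zip(values, probs))

def sq_risk(c):
    """Expected squared loss E[(Y - c)^2] under the toy distribution."""
    return sum(p * (v - c) ** 2 for v, p in zip(values, probs))

grid = [i / 100 for i in range(0, 1101)]
print(min(grid, key=abs_risk))  # 1.0, the median
print(min(grid, key=sq_risk))   # 1.9, the mean
```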

2.5 Consistency of Predictive Methods

Given a sequence of estimators $\hat{f}_n$ (one for each sample size $n$), we want $\hat{f}_n$ to converge to $f^*$ as $n \to \infty$.

Definition — Weak Consistency

A sequence $\{\hat{f}_n\}$ is weakly consistent for a distribution $\Pr(X, Y)$ if: $$\lim_{n \to \infty} \mathbb{E}\!\left[\int_{\mathbb{R}^p} \{\hat{f}_n(x) - f^*(x)\}^2\, P_X(dx)\right] = 0.$$

In words: the expected $L^2$ distance between $\hat{f}_n$ and $f^*$ goes to zero.

"No Free Lunch"

Without structural assumptions on $\Pr(X, Y)$, convergence rates can be arbitrarily slow. There is no single model that works best for every problem. To obtain meaningful rates, you must assume something about the target — for example, that $f^*$ is smooth. This idea connects directly to the Hölder smoothness classes we saw in density estimation.

3. Key Takeaways