Regression Theory & Introduction to Nonparametric Regression

Course: Computational Statistics, FS2026 · Lecture 5 · Date: 2026-03-20 · Dr. Jonathan Koh, ETH Zürich

This lecture wraps up the theoretical foundations of regression (§3) and introduces nonparametric regression (§4). We revisit the regression function as the optimal predictor under squared loss, discuss consistency and the "No Free Lunch" principle, and decompose prediction error into bias and variance. We then motivate nonparametric methods — starting with k-nearest neighbors, moving to kernel regression (Nadaraya–Watson), and ending with local polynomial regression — showing how each method handles the bias–variance tradeoff differently.

1. Regression Theory Recap (§3)

1.1 Prediction and the Squared Error Risk

Suppose we have i.i.d. training data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown joint distribution $\text{Pr}(X,Y)$. From this data we build a predictive model $\hat{f}: \mathbb{R}^p \to \mathbb{R}$. To measure how good our predictions are, we use the expected squared prediction error (also called the risk):

Definition — Expected Squared Prediction Error
$$\text{Err}_f = \mathbb{E}\!\big[\big(Y - f(X)\big)^2\big],$$

where the expectation is over a new draw $(X,Y) \sim \text{Pr}(X,Y)$. The ideal predictor is $f^* = \arg\min_f \text{Err}_f$.

Notice that we are optimizing over functions, not just a finite-dimensional parameter vector like $\beta$. This is the key conceptual shift when moving beyond parametric models.

1.2 The Regression Function as Risk Minimizer

The function that minimizes $\text{Err}_f$ turns out to be the conditional mean, commonly called the regression function:

Theorem — Optimal Predictor Under Squared Loss
$$f^*(x) = \mathbb{E}(Y \mid X = x).$$

The proof idea is elegant and worth understanding. Using the tower property (law of iterated expectations), we can write:

$$\mathbb{E}\!\big[(Y - f(X))^2\big] = \mathbb{E}\!\Big[\mathbb{E}\!\big[(Y - f(X))^2 \mid X\big]\Big].$$

Since the inner expectation is non-negative for every $x$, it suffices to minimize it pointwise. For a fixed $x$, the problem reduces to finding the constant $c$ that minimizes $\mathbb{E}[(Y - c)^2 \mid X = x]$. Expanding and taking the derivative with respect to $c$ gives $c = \mathbb{E}(Y \mid X = x)$.

Furthermore, we get a useful decomposition of the total prediction error:

$$\mathbb{E}\!\big[(Y - f(X))^2\big] = \underbrace{\mathbb{E}\!\big[(f(X) - f^*(X))^2\big]}_{\text{reducible error}} + \underbrace{\mathbb{E}\!\big[(Y - f^*(X))^2\big]}_{\text{irreducible error}}.$$

The first term measures how far our model $f$ is from the true regression function $f^*$, and this is what we try to shrink. The second term is noise in the data — we can never eliminate it.

Example — Bivariate Normal

Let $(X,Y)$ be bivariate normal with zero means, unit variances, and correlation $\rho$. Then $Y \mid X = x \sim N(\rho x, 1-\rho^2)$, so $f^*(x) = \rho x$ and the irreducible error is $1 - \rho^2$. In this setting, the regression function happens to be linear — so linear regression is the correct model!
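This example is easy to verify by simulation. A minimal sketch (the value $\rho = 0.6$, the seed, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6
n = 200_000

# Draw (X, Y) bivariate normal with zero means, unit variances, correlation rho.
X = rng.standard_normal(n)
Y = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Optimal predictor f*(x) = rho * x; its risk is the irreducible error 1 - rho^2.
risk_opt = np.mean((Y - rho * X) ** 2)

# Any other predictor, e.g. f(x) = x, must incur a larger risk:
# E[(Y - X)^2] = Var(Y) + Var(X) - 2 Cov(X, Y) = 2 - 2*rho = 0.8.
risk_identity = np.mean((Y - X) ** 2)

print(risk_opt)       # close to 1 - rho^2 = 0.64
print(risk_identity)  # close to 0.8, i.e. larger than risk_opt
```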

Example — Bernoulli (Exercise)

Let $X \in \{0, 1\}$ with $\Pr(X=1)=p$, and $Y \mid X=x \sim \text{Bernoulli}(\theta_x)$. The optimal predictor is simply $f^*(x) = \theta_x$. This works for discrete $X$ too — the theory is completely general.

2. Consistency, Optimality, and the No Free Lunch

2.1 When Is an Estimator "Good Enough"?

Given the decomposition above, $\hat{f}$ is nearly optimal if its $L^2(P_X)$ distance to $f^*$ is small. In practice we hope that as we collect more data, our estimator gets closer to the truth. This motivates the notion of weak consistency:

Definition — Weak Consistency

A sequence of estimators $\{\hat{f}_n\}$ is weakly consistent for a distribution $\text{Pr}(X,Y)$ if

$$\lim_{n \to \infty} \mathbb{E}\!\left[\int_{\mathbb{R}^p} \big(\hat{f}_n(x) - f^*(x)\big)^2 \, P_X(dx)\right] = 0.$$

In words: the expected integrated squared error goes to zero as the sample size grows. An important subtlety is that a method might be consistent for some distributions but not others.

2.2 No Free Lunch

Important — No Free Lunch Principle

For any estimation method, there exists a distribution for which the convergence rate in the consistency definition above is arbitrarily slow. There is no single model that works best for every problem.

To obtain meaningful convergence rates, we need structural assumptions — most commonly, smoothness of $f^*$. This is a recurring theme: more assumptions = faster rates, but wrong assumptions = large bias.

3. Bias–Variance Tradeoff (Revisited)

We already encountered bias–variance in the linear model context (§1.7). Here it appears in a more general form. For any random estimator $\hat{f}$ (depending on $\mathcal{D}_n$) and any fixed point $x_0$:

Bias–Variance Decomposition
$$\mathbb{E}\!\big[(\hat{f}(x_0) - f^*(x_0))^2\big] = \underbrace{\big[\mathbb{E}[\hat{f}(x_0)] - f^*(x_0)\big]^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\!\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\text{Variance}}.$$

A similar decomposition holds for the integrated $L^2$ error. The practical upshot: for good predictions, we need to find a sweet spot between bias and variance. This tradeoff is controlled by tuning parameters in our models — $k$ in kNN, bandwidth $h$ in kernel regression, $\lambda$ in smoothing splines, and so on. The classic picture shows that as flexibility increases (e.g., smaller $h$ or smaller $k$), bias decreases but variance increases; there is an optimal point where their sum (the prediction error) is minimized.
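The decomposition at a point can be checked by Monte Carlo. A sketch with a deliberately biased estimator, the shrunken sample mean $c\,\bar{Y}$ (all constants here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, c = 2.0, 1.0, 20, 0.8   # true mean, noise sd, sample size, shrinkage
reps = 200_000

# Repeatedly draw a sample of size n and compute the (biased) estimator c * Ybar.
samples = rng.normal(mu, sigma, size=(reps, n))
est = c * samples.mean(axis=1)

mse = np.mean((est - mu) ** 2)        # Monte Carlo estimate of E[(est - mu)^2]
bias_sq = (est.mean() - mu) ** 2      # squared bias; theory: ((c - 1) * mu)^2
variance = est.var()                  # variance; theory: c^2 * sigma^2 / n

# The decomposition mse = bias_sq + variance is an algebraic identity
# for the simulated sample, so the two sides match to rounding error.
print(mse, bias_sq + variance)
```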

3.1 Model Bias vs. Estimation Bias in Linear Regression

For linear regression $\hat{f}(x) = x^\top \hat{\beta}$, the total squared bias can be decomposed further. Let $\beta^* = \arg\min_\beta \mathbb{E}[(X^\top\beta - f^*(X))^2]$ be the "best linear approximation." Then:

$$\underbrace{\mathbb{E}_X\!\big[\big(X^\top\beta^* - f^*(X)\big)^2\big]}_{\text{model bias}} + \underbrace{\mathbb{E}_X\!\big[\big(X^\top(\beta^* - \mathbb{E}[\hat{\beta}])\big)^2\big]}_{\text{estimation bias}}.$$

Model bias is the error from restricting $f$ to be linear — it's large when the true relationship is nonlinear. Estimation bias arises when the estimator's expected value does not equal $\beta^*$. For OLS, the estimation bias is zero (OLS is unbiased for $\beta^*$). Methods like ridge and lasso intentionally introduce some estimation bias in exchange for lower variance.

This motivates the move to nonparametric methods: if the true regression function is highly nonlinear, the model bias of linear regression can be overwhelming, and no amount of data will fix it.

4. Introduction to Nonparametric Regression (§4)

4.1 Motivation: Why Go Nonparametric?

Nonparametric regression aims to estimate $f^*(x) = \mathbb{E}(Y \mid X = x)$ without assuming a specific parametric form. Instead of saying "$f^*$ is linear" or "$f^*$ is quadratic," we only assume $f^*$ is "smooth" in some sense. The advantage is much lower model bias; the price is higher estimation variance and the need for more data.

The lecture organizes nonparametric regression methods around four paradigms:

- local averaging (e.g., kNN, kernel regression),
- local modelling (e.g., local polynomial regression),
- global modelling (e.g., regression splines),
- penalized modelling (e.g., smoothing splines).

In this lecture we focus on one predictor variable ($p = 1$) and cover methods from the first two paradigms: kNN, Nadaraya–Watson kernel regression, and local polynomial regression.

5. k-Nearest Neighbors Regression (§4.2)

5.1 The Estimator

Definition — kNN Regression Estimator

Given training data $\mathcal{D}_n$ and a new input $x_+$, let $N_k(x_+)$ be the set of indices of the $k$ closest training points to $x_+$ (typically by Euclidean distance). The kNN estimator is:

$$\hat{f}_k(x_+) = \frac{1}{k}\sum_{i \in N_k(x_+)} Y_i.$$

In plain terms: to predict at a new point, find the $k$ training points closest to it and average their responses. The core assumption is that nearby inputs have similar responses, so $f^*$ is well-approximated by a locally constant function.
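In code, the estimator is only a few lines. A minimal sketch for $p = 1$ (the function name `knn_regress` and the toy data are my own):

```python
import numpy as np

def knn_regress(x_train, y_train, x_new, k):
    """kNN regression: average the responses of the k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x0 in np.atleast_1d(x_new):
        idx = np.argsort(np.abs(x_train - x0))[:k]  # indices of the k nearest neighbors
        preds.append(y_train[idx].mean())           # locally constant fit
    return np.array(preds)

# Tiny example: noiseless data on f*(x) = x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2
# The 3 nearest neighbors of 2.1 are x = 2, 3, 1, so the prediction
# is (4 + 9 + 1) / 3 = 14/3.
print(knn_regress(x, y, 2.1, k=3))
```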

5.2 The Role of $k$

The parameter $k$ controls the bias–variance tradeoff:

- Small $k$: very local averages, so the locally constant approximation is good (low bias), but few observations are averaged (high variance) — a wiggly fit.
- Large $k$: averages over wide neighborhoods, so the variance is low but the bias is high — an oversmoothed fit. In the extreme $k = n$, the estimate is the global mean $\bar{Y}$ everywhere.

So $k$ acts as a smoothing parameter, just like bandwidth $h$ in kernel density estimation.

5.3 Parametric vs. Nonparametric

Linear regression is parametric: it assumes $f^*$ is globally linear, giving low variance but potentially large model bias. kNN is nonparametric: it assumes $f^*$ is locally constant, providing great flexibility but requiring enough data and a well-chosen $k$. kNN will outperform linear regression when the true relationship is very nonlinear — but it suffers badly from the curse of dimensionality when $p$ is large.

5.4 Consistency of kNN

Theorem — Weak Consistency of kNN

If $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$, then the kNN estimator is weakly consistent for all distributions with $\mathbb{E}(Y^2) < \infty$ and where ties occur with probability zero.

The two conditions have a clear intuition: $k_n \to \infty$ ensures we average over more and more observations (variance shrinks), while $k_n/n \to 0$ ensures the neighborhood radius shrinks to zero (bias shrinks). Under additional smoothness, kNN achieves a rate of $O(n^{-2/(2+p)})$, which degrades with dimension $p$ — the curse of dimensionality again.
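The interplay of the two conditions can be seen in a small simulation. A sketch using $k_n = \sqrt{n}$, which satisfies both conditions (the target $f^*(x) = \sin(x)$, the noise level, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_fit(x_train, y_train, x_eval, k):
    # Locally constant fit: average the k nearest responses at each eval point.
    d = np.abs(x_train[None, :] - x_eval[:, None])   # pairwise distances
    idx = np.argsort(d, axis=1)[:, :k]               # k nearest per eval point
    return y_train[idx].mean(axis=1)

f_star = np.sin                                      # true regression function
x_eval = np.linspace(0, 3, 200)

errs = {}
for n in (100, 1000, 10_000):
    k = max(1, int(np.sqrt(n)))                      # k_n -> inf  and  k_n / n -> 0
    x = rng.uniform(0, 3, n)
    y = f_star(x) + 0.3 * rng.standard_normal(n)
    errs[n] = np.mean((knn_fit(x, y, x_eval, k) - f_star(x_eval)) ** 2)
    print(n, k, errs[n])   # the integrated squared error shrinks as n grows
```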

6. Kernel Regression (§4.3)

6.1 Deriving the Nadaraya–Watson Estimator

The regression function can be written as $f^*(x) = \int y \, f_{Y|X}(y|x)\,dy$. Using the identity $f_{Y|X}(y|x) = f_{X,Y}(x,y)/f_X(x)$ and plugging in kernel density estimates for both the numerator and denominator, we arrive at:

Definition — Nadaraya–Watson Kernel Regression Estimator
$$\hat{f}(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)} = \sum_{i=1}^n w_i(x)\,Y_i,$$

where $w_i(x) = K((x-x_i)/h)\big/\sum_j K((x-x_j)/h)$ are normalized weights that sum to 1.

So $\hat{f}(x)$ is a weighted average of the $Y_i$'s, where observations with $x_i$ closer to $x$ get larger weights (as determined by the kernel function $K$). This is why it falls under "local averaging."
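A minimal sketch of the estimator with a Gaussian kernel (the function name and the test data are illustrative):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_new, h):
    """Nadaraya-Watson estimate with a Gaussian kernel and bandwidth h."""
    u = (np.atleast_1d(x_new)[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u**2)                # Gaussian kernel (normalizing constant cancels)
    w = K / K.sum(axis=1, keepdims=True)   # weights w_i(x); each row sums to 1
    return w @ y_train

rng = np.random.default_rng(3)
x = rng.uniform(0, 3, 300)
y = np.sin(x) + 0.2 * rng.standard_normal(300)

print(nadaraya_watson(x, y, np.array([1.5]), h=0.2))  # close to sin(1.5)
```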

6.2 Local Constant Interpretation

An equivalent and very instructive way to view the Nadaraya–Watson estimator is as a locally weighted least squares problem:

$$\hat{f}(x) = \arg\min_{m_x \in \mathbb{R}} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)(Y_i - m_x)^2.$$

For each query point $x$, we find the best constant $m_x$ that minimizes the kernel-weighted sum of squares. The kernel determines what "local" means — observations near $x$ matter most. This is why the Nadaraya–Watson estimator is called a locally constant estimator.
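Setting the derivative of the weighted criterion with respect to $m_x$ to zero makes the equivalence explicit:

$$0 = -2\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)\big(Y_i - \hat{m}_x\big) \quad\Longrightarrow\quad \hat{m}_x = \frac{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)} = \hat{f}(x).$$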

6.3 The Role of the Bandwidth $h$

Just like $k$ in kNN, the bandwidth $h$ controls the bias–variance tradeoff: a large $h$ gives a smooth curve with high bias and low variance; a small $h$ gives a wiggly curve with low bias and high variance. An important difference from density estimation is that in regression, the bias and variance also depend on the design density $f_X(x)$, not just on derivatives of $f^*$. In particular, where $f_X(x)$ is small (data are sparse), the pointwise bias tends to be large.

6.4 kNN as a Special Case of Kernel Regression

kNN can be rewritten as kernel regression with a uniform (box) kernel and an adaptive bandwidth. Letting $r_k(x_0)$ denote the distance to the $k$-th nearest neighbor:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^n K\!\left(\frac{|x_i - x_0|}{r_k(x_0)}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{|x_i - x_0|}{r_k(x_0)}\right)}, \quad K(u) = \begin{cases} 1 & u \le 1 \\ 0 & u > 1\end{cases}.$$

The key insight: kNN uses a box kernel whose bandwidth $h = r_k(x_0)$ varies with the location — wider where data are sparse, narrower where data are dense. This adaptive bandwidth is a conceptual advantage over fixed-bandwidth kernel regression.
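This equivalence is easy to verify numerically. A sketch on toy data (assuming no distance ties, which occur with probability zero for continuous designs):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 50)
y = rng.standard_normal(50)
x0, k = 0.4, 7

# Direct kNN: average the responses of the k nearest points.
nearest = np.argsort(np.abs(x - x0))[:k]
knn_pred = y[nearest].mean()

# Box-kernel form with adaptive bandwidth r_k(x0) = distance to the k-th neighbor.
r_k = np.sort(np.abs(x - x0))[k - 1]
K = (np.abs(x - x0) <= r_k).astype(float)   # box kernel weights: 1 inside, 0 outside
box_pred = (K @ y) / K.sum()

print(knn_pred, box_pred)  # identical (no ties in this sample)
```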

7. Linear Smoothers, Hat Matrix & Degrees of Freedom

7.1 Kernel Regression as a Linear Smoother

When we evaluate the kernel regression estimator at all the observed design points, we get the vector of fitted values $\hat{\mathbf{Y}} = (\hat{f}(x_1), \ldots, \hat{f}(x_n))^\top$. Since each $\hat{f}(x_r) = \sum_s w_s(x_r) Y_s$, we can write this as a matrix multiplication:

$$\hat{\mathbf{Y}} = S\mathbf{Y},$$

where the smoother matrix (or hat matrix) $S$ has entries $S_{rs} = w_s(x_r)$. Any estimator that can be written this way — as a linear function of the response vector — is called a linear smoother. This is a powerful unifying framework: OLS, kernel regression, smoothing splines, and many other methods are all linear smoothers.
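A sketch constructing $S$ for the Nadaraya–Watson estimator and checking the linear-smoother properties (the bandwidth and the toy data are arbitrary choices):

```python
import numpy as np

def smoother_matrix(x, h):
    """Nadaraya-Watson smoother (hat) matrix at the design points: S_rs = w_s(x_r)."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # Gaussian kernel
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 3, 100))
y = np.sin(x) + 0.2 * rng.standard_normal(100)

S = smoother_matrix(x, h=0.2)
fitted = S @ y                              # hat Y = S Y: a linear smoother

# The weights in each row are normalized, so every row of S sums to 1,
# and the fit is linear in Y: doubling Y doubles the fitted values.
print(np.allclose(S.sum(axis=1), 1.0))
print(np.allclose(S @ (2 * y), 2 * fitted))
```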

7.2 Inference Using the Hat Matrix

The hat matrix gives us the covariance structure of the fitted values directly: $\text{Cov}(\hat{\mathbf{Y}}) = \sigma_\varepsilon^2 SS^\top$. This allows us to compute pointwise standard errors and approximate confidence intervals. Under regularity conditions, $\hat{f}(x_i)$ is approximately normal:

$$\hat{f}(x_i) \pm 1.96 \cdot \hat{\sigma}_\varepsilon \sqrt{(SS^\top)_{ii}}$$

gives approximate 95% pointwise confidence intervals for $\mathbb{E}[\hat{f}(x_i)]$. Note: these intervals cover $\mathbb{E}[\hat{f}(x_i)]$, not $f^*(x_i)$ directly — the distinction matters because of bias.

7.3 Degrees of Freedom

Definition — Degrees of Freedom

For a model with fitted values $\hat{Y}_i$ and $Y_i = \mu_i + \varepsilon_i$ with $\text{Var}(\varepsilon_i) = \sigma^2$, the degrees of freedom are:

$$\text{df}(\hat{\mathbf{Y}}) = \frac{1}{\sigma^2}\sum_{i=1}^n \text{Cov}(\hat{Y}_i, Y_i).$$

Think of df as the effective number of parameters in your model — a single number that captures model complexity.

For linear smoothers where $\hat{\mathbf{Y}} = S\mathbf{Y}$, this simplifies beautifully to $\text{df} = \text{tr}(S)$. For OLS with $p$ parameters, $S = P = X(X^\top X)^{-1}X^\top$ is a projection matrix and $\text{tr}(P) = p$, recovering the familiar answer. Two extreme cases illustrate the concept: the identity estimator $\hat{Y}_i = Y_i$ has $\text{df} = n$ (maximally complex — just memorize the data), while the sample mean $\hat{Y}_i = \bar{Y}$ has $\text{df} = 1$ (simplest possible model).
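The three cases can be checked directly via $\text{df} = \text{tr}(S)$. A sketch (the design matrix and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # design with intercept

# OLS hat matrix P = X (X'X)^{-1} X' is a projection, so tr(P) = p.
P = X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(P))          # = p = 3

# Sample mean as a smoother: every entry is 1/n, so tr = 1.
S_mean = np.full((n, n), 1.0 / n)
print(np.trace(S_mean))     # = 1

# Identity smoother ("memorize the data"): tr = n.
print(np.trace(np.eye(n)))  # = n = 50
```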

8. Local Polynomial Regression (§4.4)

The Nadaraya–Watson estimator fits a local constant. A natural extension is to fit a local polynomial at each query point $x$. We solve:

Definition — Local Polynomial Regression
$$\hat{\beta}(x) = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)\big(Y_i - \beta_1 - \beta_2(x_i - x) - \cdots - \beta_p(x_i - x)^{p-1}\big)^2.$$

The function estimate at $x$ is the local intercept: $\hat{f}(x) = \hat{\beta}_1(x)$.

Common choices are $p = 2$ (local linear) or $p = 4$ (local cubic). An important practical advantage of local polynomial regression is that it handles boundary effects better than the locally constant Nadaraya–Watson estimator. Because the polynomial can adapt its slope near the edges of the data, the bias at boundaries is reduced.

A bonus: the local polynomial fit also gives estimates of the derivatives of the regression function, via $\hat{f}^{(r)}(x) = r!\,\hat{\beta}_{r+1}(x)$.
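A minimal sketch of a local linear fit ($p = 2$) at a single query point, which also yields a derivative estimate via the formula above with $r = 1$ (the function name and the test data are illustrative):

```python
import numpy as np

def local_poly(x_train, y_train, x0, h, degree=1):
    """Local polynomial fit at x0: kernel-weighted least squares in (x_i - x0)."""
    d = x_train - x0
    K = np.exp(-0.5 * (d / h) ** 2)                # Gaussian kernel weights
    Z = np.vander(d, degree + 1, increasing=True)  # columns 1, d, d^2, ...
    W = np.diag(K)
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y_train)
    return beta  # beta[0] estimates f(x0), beta[1] estimates f'(x0), ...

rng = np.random.default_rng(7)
x = rng.uniform(0, 3, 400)
y = np.sin(x) + 0.2 * rng.standard_normal(400)

beta = local_poly(x, y, x0=1.0, h=0.25, degree=1)
print(beta[0], beta[1])   # close to sin(1) and cos(1)
```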

9. Summary of Methods Covered

| Method | Paradigm | Tuning parameter | Key idea |
|---|---|---|---|
| kNN | Local averaging | $k$ (number of neighbors) | Average the $k$ nearest responses; adaptive bandwidth |
| Nadaraya–Watson | Local averaging / modeling | $h$ (bandwidth) | Kernel-weighted average; locally constant fit |
| Local polynomial | Local modeling | $h$ (bandwidth), $p$ (degree) | Kernel-weighted local polynomial; better at boundaries |

Key Takeaways