The Kernel Trick

Introduction to Machine Learning Lecture 9 Fanny Yang – ETH Zürich

This lecture introduces kernels, a powerful technique for learning non-linear functions without explicitly computing (potentially huge or infinite-dimensional) feature maps. The core idea — the kernel trick — lets us replace expensive inner products in feature space with a single, cheap function evaluation. We motivate the trick from the computational limitations of polynomial features, walk through the three-step kernelization recipe, and conclude with prominent kernel families (polynomial and RBF kernels).

1. Motivation — Why We Need Something Beyond Feature Maps

1.1 From Linear to Non-Linear Models via Featurization

Recall the standard supervised-learning objective: given data $\{(x_i, y_i)\}_{i=1}^{n}$ we fit a linear function $f(x) = \langle w, x \rangle$ by minimising a loss over all training points. To capture non-linear relationships we introduced featurization: a map $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ that transforms the raw input into a richer feature space, and we then minimise

$$\min_{w \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \ell\!\big(y_i,\; \langle w, \varphi(x_i) \rangle\big).$$

This is very flexible — but what happens when $p$ is enormous?

1.2 The Computational Explosion of Polynomial Features

Consider polynomial features of degree $m$ for a $d$-dimensional input. The number of distinct monomials (and hence the feature dimension) is $p = \binom{d+m}{m} = O(d^m)$. Storing the featurized training data requires $O(n \, d^m)$ memory.

Concrete Example — Biological Data

For genomic data with $d \sim 10^5$ features and $n \sim 10^2$ samples, even choosing a modest polynomial degree $m = 3$ gives $O(n \, d^m) \sim 10^{17}$ — far too large to store or compute with. We need a smarter approach.
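To get a feel for these magnitudes, the monomial count $\binom{d+m}{m}$ is easy to compute directly. A quick sketch (the helper name is ours):

```python
from math import comb

def num_poly_features(d: int, m: int) -> int:
    """Number of distinct monomials of degree <= m in d variables: C(d+m, m)."""
    return comb(d + m, m)

# Sanity check: d = 2, m = 2 gives the 6 features 1, x1, x2, x1^2, x2^2, x1*x2.
assert num_poly_features(2, 2) == 6

# Genomic-scale numbers from the example: d ~ 1e5, m = 3, n ~ 1e2.
d, m, n = 10**5, 3, 10**2
p = num_poly_features(d, m)     # ~ 1.7e14 monomials
print(f"p = {p:.3e}, stored entries n*p = {n * p:.3e}")
```

Even though $\binom{d+m}{m}$ is smaller than $d^m$ by roughly a factor of $m!$, the featurized data is still far too large to store.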

1.3 Three Desiderata

We want a method that simultaneously gives us: (1) expressivity — the ability to represent rich non-linear functions; (2) computational efficiency — avoiding the $O(d^m)$ blowup; and (3) convexity — retaining a convex optimisation problem when the loss $\ell$ is convex. The kernel trick achieves all three.

2. The Kernel Trick — Kernelization in Three Steps

The key insight is that we never need to touch $\varphi(x)$ directly — all that matters are inner products between feature vectors. Below is the three-step recipe from the lecture notes.

Step I — Reparametrize & Identify Inner Products

Instead of optimising over $w \in \mathbb{R}^p$, we write $w = \Phi^\top \alpha$ with $\alpha \in \mathbb{R}^n$, where $\Phi \in \mathbb{R}^{n \times p}$ is the feature matrix with rows $\varphi(x_i)^\top$. This constrains $w$ to lie in $\operatorname{span}\{\varphi(x_1), \dots, \varphi(x_n)\}$, a subspace of dimension at most $n$ inside $\mathbb{R}^p$.

Plugging into the model we get:

$$f(x) = \langle w, \varphi(x) \rangle = \langle \Phi^\top \alpha, \varphi(x) \rangle = \sum_{i=1}^{n} \alpha_i \, \langle \varphi(x_i), \varphi(x) \rangle.$$

Notice that the data now appears only through inner products $\langle \varphi(x_i), \varphi(x_j) \rangle$.
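This identity is easy to verify numerically. A minimal sketch with a random feature matrix (the data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8
Phi = rng.standard_normal((n, p))      # rows are phi(x_i)^T
alpha = rng.standard_normal(n)
phi_x = rng.standard_normal(p)         # phi(x) for a test point

w = Phi.T @ alpha                      # reparametrization w = Phi^T alpha
lhs = w @ phi_x                        # <w, phi(x)>
rhs = sum(alpha[i] * (Phi[i] @ phi_x) for i in range(n))  # sum_i alpha_i <phi(x_i), phi(x)>
assert np.isclose(lhs, rhs)
```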

Step II — Replace Inner Products by a Kernel Function

Define a bivariate function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ and the kernel matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$. The objective becomes:

$$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\sum_{i=1}^{n} \ell\!\Big(y_i,\; \sum_{j=1}^{n} \alpha_j\, k(x_i, x_j)\Big) \;=\; \min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\sum_{i=1}^{n} \ell\!\big(y_i,\; (K\alpha)_i\big).$$

Memory drops from $O(np)$ to $O(n^2)$ — a huge saving when $p \gg n$.

Step III — Solve the Kernelized Training Loss & Predict

We solve for $\hat\alpha \in \mathbb{R}^n$ and predict on a new point $x$ via

$$\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i \, k(x_i,\, x).$$

Summary of the Three Steps

I. Reparametrize $w = \Phi^\top \alpha$ in the training loss to obtain $\frac{1}{n}\sum_{i=1}^{n}\ell(y_i, \alpha^\top \Phi\,\varphi(x_i))$.
II. Replace all inner products $\langle \varphi(x_i), \varphi(x_j)\rangle$ with $k(x_i, x_j)$.
III. Solve $\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^{n}\ell(y_i, (K\alpha)_i)$ and predict $\hat f(x) = \sum_{i=1}^{n}\hat\alpha_i\, k(x_i, x)$.
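As a concrete instantiation of the recipe: the lecture leaves the loss $\ell$ generic, but if we assume the squared loss with a ridge penalty (kernel ridge regression), Step III has the closed form $\hat\alpha = (K + n\lambda I)^{-1} y$. A minimal sketch under that assumption:

```python
import numpy as np

def rbf_kernel(X1, X2, tau=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / tau)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / tau)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-3
K = rbf_kernel(X, X)                               # Step II: n x n kernel matrix
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)  # Step III (ridge)

X_test = np.linspace(-3, 3, 100)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha             # f(x) = sum_i alpha_i k(x_i, x)
err = np.abs(y_pred - np.sin(X_test[:, 0])).mean()
print(err)
```

Note that only the $n \times n$ matrix $K$ is ever formed; the feature map never appears.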

Why Is the Reparametrization Valid?

It may seem like restricting $w$ to $\operatorname{span}\{\varphi(x_1),\dots,\varphi(x_n)\}$ loses something, since $\mathbb{R}^p$ is a much bigger space. But it turns out that at least one minimiser of the original objective already lives in this subspace.

Intuition via gradient descent: If we run gradient descent initialised at $w_0 = 0$, every iterate $w_t$ is a linear combination of $\varphi(x_1),\dots,\varphi(x_n)$ (because the gradient of $\ell(y_i, \langle w, \varphi(x_i)\rangle)$ is proportional to $\varphi(x_i)$). So GD never leaves the span.
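This invariant can be checked numerically: run gradient descent on the squared loss from $w_0 = 0$ and project each iterate back onto the row space of $\Phi$. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 20                       # many more features than samples
Phi = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w = np.zeros(p)                    # initialise at w_0 = 0
for _ in range(200):               # GD on the squared loss
    grad = Phi.T @ (Phi @ w - y) / n   # a linear combination of the rows phi(x_i)
    w -= 0.05 * grad

# w should lie in span{phi(x_1), ..., phi(x_n)} = row space of Phi:
# solve Phi^T c = w in the least-squares sense and check the residual is ~0.
coef, *_ = np.linalg.lstsq(Phi.T, w, rcond=None)
residual = np.linalg.norm(Phi.T @ coef - w)
assert residual < 1e-6
```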

More generally, the Representer Theorem guarantees this for any objective of the form $\frac{1}{n}\sum \ell(\cdot) + g(\|w\|)$ with $g$ non-decreasing — the component of $w$ orthogonal to $\operatorname{span}\{\varphi(x_i)\}$ can only increase $g$ without changing the data-fit term, so a minimiser exists without that orthogonal component.

3. What Makes a Valid Kernel?

The beauty of the kernel trick is that we can work with functions $k$ directly, without ever constructing the feature map $\varphi$. But not every function $k$ implicitly corresponds to some inner product in some feature space. The following definition characterises exactly which functions do.

Definition — Kernel (positive semi-definite function)

A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if:

  1. Symmetry: $k(x, x') = k(x', x)$ for all $x, x'$.
  2. Positive semi-definiteness: For every $n \in \mathbb{N}$ and every set $x_1, \dots, x_n \in \mathbb{R}^d$, the kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semi-definite.

Theorem (Mercer / Feature-map Correspondence)

A function $k$ is a kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a map $\varphi : \mathbb{R}^d \to \mathcal{H}$ such that $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$ for all $x, x'$.

You can think of $\mathcal{H}$ as simply $\mathbb{R}^p$ for some (possibly very large) $p$, or in the infinite-dimensional case, the space $\ell^2$ of square-summable sequences.
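The definition can be probed numerically for a given kernel: draw a sample, build $K$, and check symmetry and the eigenvalue sign (a sanity check on one random sample, not a proof):

```python
import numpy as np

def poly_kernel(x, xp, m=3):
    """Degree-m polynomial kernel (1 + <x, x'>)^m."""
    return (1.0 + x @ xp) ** m

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)               # symmetry
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-6             # PSD up to floating-point error
```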

Building New Kernels from Old Ones

Given valid kernels $k_1, k_2$, the following constructions also produce valid kernels:

  1. Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$.
  2. Non-negative scaling: $k(x, x') = c\, k_1(x, x')$ for any $c \ge 0$.
  3. Product: $k(x, x') = k_1(x, x')\, k_2(x, x')$.
  4. Rescaling by a function: $k(x, x') = f(x)\, k_1(x, x')\, f(x')$ for any $f : \mathbb{R}^d \to \mathbb{R}$.

These composition rules let you flexibly build complex kernels from simpler building blocks.
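As a numerical sanity check of the closure rules, build two PSD kernel matrices as Gram matrices (an illustrative construction) and verify that their sum, a non-negative multiple, and their elementwise (Schur) product stay PSD:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
G1 = rng.standard_normal((n, 5))
G2 = rng.standard_normal((n, 5))
K1 = G1 @ G1.T                     # PSD by construction
K2 = G2 @ G2.T

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

assert min_eig(K1 + K2) > -1e-8    # sum of kernels
assert min_eig(3.0 * K1) > -1e-8   # non-negative scaling
assert min_eig(K1 * K2) > -1e-8    # elementwise product (Schur product theorem)
```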

4. Examples of Kernels

4.1 Polynomial Kernels

The polynomial kernel of degree $m$ is defined as

$$k(x, x') = (1 + \langle x, x' \rangle)^m.$$

This implicitly represents all polynomial features up to degree $m$. The feature map $\varphi$ consists of all monomials (with appropriate scaling), so it is equivalent to explicit polynomial featurization — but much cheaper to compute.

Example — Computational Savings with Polynomial Kernels

For $d = 2$, the degree-2 polynomial kernel $k(x, x') = (1 + \langle x, x'\rangle)^2$ corresponds to the feature map $\varphi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)^\top \in \mathbb{R}^6$.

Computing $k$ directly needs $O(d + \log m)$ operations (an inner product plus exponentiation), whereas explicitly computing and dotting the feature vectors takes $O(d^m)$. For $n = 5$, $d = 5$, $m = 4$: naïve approach does $\sim$31 500 operations; the kernel shortcut does only 225.
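The degree-2 correspondence above is easy to confirm numerically: the kernel shortcut and the explicit 6-dimensional feature map agree exactly.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for d = 2, as in the example above."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(4)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

lhs = (1.0 + x @ xp) ** 2    # kernel shortcut: O(d) work
rhs = phi(x) @ phi(xp)       # explicit features: O(d^m) work
assert np.isclose(lhs, rhs)
```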

4.2 Radial Basis Function (RBF) Kernels

RBF kernels depend only on the distance between two points:

$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|_p^{\alpha}}{\tau}\right)$$

where $\tau > 0$ is the bandwidth parameter. Two important special cases:

  1. Gaussian kernel ($p = 2$, $\alpha = 2$): $k(x, x') = \exp\!\big(-\|x - x'\|_2^2 / \tau\big)$.
  2. Laplacian kernel ($p = 1$, $\alpha = 1$): $k(x, x') = \exp\!\big(-\|x - x'\|_1 / \tau\big)$.

Bandwidth Intuition

The bandwidth $\tau$ controls how quickly the kernel decays with distance. A small $\tau$ makes the kernel very "peaky" — only very nearby points have high similarity. A large $\tau$ makes the kernel flatter — even distant points are still considered similar. Think of $\tau$ as controlling the "range of influence" of each training point.
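A two-line numerical illustration of this intuition: at distance 1, a small bandwidth makes the similarity essentially zero, while a large one keeps it near 1.

```python
import numpy as np

def gaussian_k(x, xp, tau):
    """Gaussian RBF kernel exp(-||x - x'||^2 / tau)."""
    return np.exp(-np.sum((x - xp) ** 2) / tau)

x, xp = np.array([0.0]), np.array([1.0])   # two points at distance 1
k_small = gaussian_k(x, xp, tau=0.1)       # peaky kernel: exp(-10) ~ 5e-5
k_large = gaussian_k(x, xp, tau=10.0)      # flat kernel:  exp(-0.1) ~ 0.90
assert k_small < 1e-3 < k_large
```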

RBF Kernels Have Infinite-Dimensional Feature Maps

Unlike polynomial kernels, the corresponding feature map of an RBF kernel is infinite-dimensional. For the Gaussian kernel with $d=1$, one can show (via Taylor-expanding the exponential) that the feature map sends $x$ to the sequence

$$\varphi(x) = \left(e^{-x^2/\tau}\,\sqrt{\frac{(2/\tau)^k}{k!}}\; x^k \right)_{k \in \mathbb{N}} \;\in\; \ell^2.$$

You would never be able to compute or store this explicitly — but with the kernel trick, you don't have to! You just evaluate the closed-form expression $e^{-(x - x')^2/\tau}$.
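You can, however, approximate the infinite feature map by truncating the Taylor series; already a few dozen coordinates reproduce the closed-form kernel to machine precision. A sketch for $d = 1$:

```python
import numpy as np
from math import factorial

tau = 1.0

def phi_truncated(x, K=30):
    """First K coordinates of the infinite-dimensional feature map above (d = 1)."""
    ks = np.arange(K)
    facts = np.array([factorial(k) for k in ks], dtype=float)
    return np.exp(-x**2 / tau) * np.sqrt((2.0 / tau) ** ks / facts) * x ** ks

x, xp = 0.5, 1.0
approx = phi_truncated(x) @ phi_truncated(xp)   # truncated inner product
exact = np.exp(-(x - xp) ** 2 / tau)            # closed-form Gaussian kernel
assert abs(approx - exact) < 1e-10
```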

5. Does Kernelization Meet Our Desiderata?

| Desideratum | Satisfied? | Explanation |
|---|---|---|
| Expressivity | Yes | Although we optimise over only $n$ parameters $\alpha$, the function class $\mathcal{F}_k = \{f_\alpha(x) = \sum_i \alpha_i k(x_i, x)\}$ grows richer as $n$ increases. "Universal" kernels (like the Gaussian) are dense in the continuous functions. |
| Efficiency | Yes | Memory $O(n^2)$ instead of $O(np)$; each kernel evaluation can cost $O(d)$ instead of $O(d^m)$. |
| Convexity | Yes | If $\ell$ is convex in its second argument, then $\frac{1}{n}\sum_i \ell(y_i, (K\alpha)_i)$ is convex in $\alpha$ (composition of a convex function with an affine map). |

6. Complexity Comparison at a Glance

| Approach | Parameters | Memory | Compute (kernel matrix) |
|---|---|---|---|
| Explicit poly features | $p = O(d^m)$ | $O(n \, d^m)$ | $O(n^2 \, d^m)$ |
| Polynomial kernel | $n$ | $O(n^2)$ | $O(n^2(d + m))$ |
| RBF kernel | $n$ | $O(n^2)$ | $O(n^2 d)$ |

7. Key Takeaways