This lecture introduces kernels, a powerful technique for learning non-linear functions without explicitly computing (potentially huge or infinite-dimensional) feature maps. The core idea — the kernel trick — lets us replace expensive inner products in feature space with a single, cheap function evaluation. We motivate the trick from the computational limitations of polynomial features, walk through the three-step kernelization recipe, and conclude with prominent kernel families (polynomial and RBF kernels).
1. Motivation — Why We Need Something Beyond Feature Maps
1.1 From Linear to Non-Linear Models via Featurization
Recall the standard supervised-learning objective: given data $\{(x_i, y_i)\}_{i=1}^{n}$ we fit a linear function $f(x) = \langle w, x \rangle$ by minimising a loss over all training points. To capture non-linear relationships we introduced featurization: a map $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ that transforms the raw input into a richer feature space, and we then minimise
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, \langle w, \varphi(x_i)\rangle\big).$$
This is very flexible — but what happens when $p$ is enormous?
1.2 The Computational Explosion of Polynomial Features
Consider polynomial features of degree $m$ for a $d$-dimensional input. The number of distinct monomials (and hence the feature dimension) is $p = \binom{d+m}{m} = O(d^m)$. Storing the featurized training data requires $O(n \, d^m)$ memory.
For genomic data with $d \sim 10^5$ features and $n \sim 10^2$ samples, even choosing a modest polynomial degree $m = 3$ gives $O(n \, d^m) \sim 10^{17}$ — far too large to store or compute with. We need a smarter approach.
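To get a feel for the blowup, the feature count $\binom{d+m}{m}$ can be evaluated directly; a small sketch (the function name is ours, the numbers follow the genomics example above):

```python
from math import comb

def poly_feature_dim(d: int, m: int) -> int:
    # Number of monomials of degree <= m in d variables: C(d + m, m).
    return comb(d + m, m)

# Degree-2 features of a 2-dimensional input: 1, x1, x2, x1^2, x2^2, x1*x2.
print(poly_feature_dim(2, 2))        # -> 6

# Genomic-scale example: d ~ 1e5 with m = 3 is already astronomically large
# per sample, before multiplying by n.
print(poly_feature_dim(10**5, 3))
```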
1.3 Three Desiderata
We want a method that simultaneously gives us: (1) expressivity — the ability to represent rich non-linear functions; (2) computational efficiency — avoiding the $O(d^m)$ blowup; and (3) convexity — retaining a convex optimisation problem when the loss $\ell$ is convex. The kernel trick achieves all three.
2. The Kernel Trick — Kernelization in Three Steps
The key insight is that we never need to touch $\varphi(x)$ directly — all that matters are inner products between feature vectors. Below is the three-step recipe from the lecture notes.
Step I — Reparametrize & Identify Inner Products
Instead of optimising over $w \in \mathbb{R}^p$, we write $w = \Phi^\top \alpha$ with $\alpha \in \mathbb{R}^n$, where $\Phi \in \mathbb{R}^{n \times p}$ is the feature matrix with rows $\varphi(x_i)^\top$. This constrains $w$ to lie in $\operatorname{span}\{\varphi(x_1), \dots, \varphi(x_n)\}$, a subspace of dimension at most $n$ inside $\mathbb{R}^p$.
Plugging into the model we get
$$f(x) = \langle \Phi^\top \alpha, \varphi(x) \rangle = \sum_{i=1}^{n} \alpha_i \langle \varphi(x_i), \varphi(x) \rangle,$$
so on a training point $x_j$ the prediction is $\sum_{i=1}^{n} \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle$. Notice that the data now appears only through inner products $\langle \varphi(x_i), \varphi(x_j) \rangle$.
Step II — Replace Inner Products by a Kernel Function
Define a bivariate function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ and the kernel matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$. The objective becomes:
$$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, (K\alpha)_i\big).$$
Memory drops from $O(np)$ to $O(n^2)$ — a huge saving when $p \gg n$.
Step III — Solve the Kernelized Training Loss & Predict
We solve for $\hat\alpha \in \mathbb{R}^n$ and predict on a new point $x$ via
$$\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i\, k(x_i, x).$$
In summary, the full recipe is:
I. Reparametrize $w = \Phi^\top \alpha$ in the training loss to obtain $\frac{1}{n}\sum_{i=1}^{n}\ell(y_i, \alpha^\top \Phi\,\varphi(x_i))$.
II. Replace all inner products $\langle \varphi(x_i), \varphi(x_j)\rangle$ with $k(x_i, x_j)$.
III. Solve $\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^{n}\ell(y_i, (K\alpha)_i)$ and predict $\hat f(x) = \sum_{i=1}^{n}\hat\alpha_i\, k(x_i, x)$.
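The three steps above can be sketched concretely. This is a minimal toy implementation assuming squared loss with a small ridge penalty (an assumption on our part; the notes keep the loss $\ell$ generic), for which Step III has a closed-form solution:

```python
import numpy as np

def fit_kernel_ridge(K, y, lam=1e-3):
    # Step III for squared loss + ridge penalty: the minimiser has the
    # closed form alpha = (K + n*lam*I)^{-1} y.
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(alpha, k_train_new):
    # f_hat(x) = sum_i alpha_i k(x_i, x), evaluated for each new point.
    return k_train_new.T @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X[:, 0] ** 2 + X[:, 1]                # a non-linear target
K = (1 + X @ X.T) ** 2                    # Step II: n x n polynomial kernel matrix
alpha = fit_kernel_ridge(K, y)
k_new = (1 + X @ X[:1].T) ** 2            # kernel values between training set and one new point
print(predict(alpha, k_new))
```

Note that neither fitting nor prediction ever touches $\varphi(x)$ explicitly; only kernel evaluations appear.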
It may seem like restricting $w$ to $\operatorname{span}\{\varphi(x_1),\dots,\varphi(x_n)\}$ loses something, since $\mathbb{R}^p$ is a much bigger space. But it turns out that at least one minimiser of the original objective already lives in this subspace.
Intuition via gradient descent: If we run gradient descent initialised at $w_0 = 0$, every iterate $w_t$ is a linear combination of $\varphi(x_1),\dots,\varphi(x_n)$ (because the gradient of $\ell(y_i, \langle w, \varphi(x_i)\rangle)$ is proportional to $\varphi(x_i)$). So GD never leaves the span.
More generally, the Representer Theorem guarantees this for any objective of the form $\frac{1}{n}\sum \ell(\cdot) + g(\|w\|)$ with $g$ non-decreasing — the component of $w$ orthogonal to $\operatorname{span}\{\varphi(x_i)\}$ can only increase $g$ without changing the data-fit term, so a minimiser exists without that orthogonal component.
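The gradient-descent intuition is easy to check numerically; a sketch with a random feature matrix and squared loss (the loss choice is an assumption for concreteness):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 12
Phi = rng.normal(size=(n, p))      # feature matrix, rows phi(x_i)^T
y = rng.normal(size=n)

# Orthogonal projector onto span{phi(x_1), ..., phi(x_n)} (the row space of Phi).
P = Phi.T @ np.linalg.pinv(Phi @ Phi.T) @ Phi

w = np.zeros(p)                    # w_0 = 0 starts inside the span
for _ in range(50):
    grad = Phi.T @ (Phi @ w - y) / n   # each gradient is a combination of the phi(x_i)
    w -= 0.1 * grad

print(np.linalg.norm(w - P @ w))   # numerically zero: GD never left the span
```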
3. What Makes a Valid Kernel?
The beauty of the kernel trick is that we can work with functions $k$ directly, without ever constructing the feature map $\varphi$. But not every function $k$ implicitly corresponds to some inner product in some feature space. The following definition characterises exactly which functions do.
A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if:
- Symmetry: $k(x, x') = k(x', x)$ for all $x, x'$.
- Positive semi-definiteness: For every $n \in \mathbb{N}$ and every set $x_1, \dots, x_n \in \mathbb{R}^d$, the kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
A function $k$ is a kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a map $\varphi : \mathbb{R}^d \to \mathcal{H}$ such that $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$ for all $x, x'$.
You can think of $\mathcal{H}$ as simply $\mathbb{R}^p$ for some (possibly very large) $p$, or in the infinite-dimensional case, the space $\ell^2$ of square-summable sequences.
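The two defining conditions can be spot-checked empirically on random points (a sanity check, not a proof, since positive semi-definiteness must hold for every point set):

```python
import numpy as np

def gaussian_kernel_matrix(X, tau=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / tau)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / tau)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
K = gaussian_kernel_matrix(X)

print(np.allclose(K, K.T))          # symmetry
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue: non-negative up to round-off
```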
Building New Kernels from Old Ones
Given valid kernels $k_1, k_2$, the following constructions also produce valid kernels:
- Addition: $k(x, x') = k_1(x, x') + k_2(x, x')$
- Multiplication: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
- Feature composition: If $\psi$ is any map, then $k(x,x') = k_1(\psi(x), \psi(x'))$ is valid.
These composition rules let you flexibly build complex kernels from simpler building blocks.
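The addition and multiplication rules can likewise be spot-checked on random data: the kernel matrices of the sum and the elementwise (Schur) product stay positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
K1 = (1 + X @ X.T) ** 2                                            # polynomial kernel matrix
K2 = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # Gaussian kernel, tau = 1

for K in (K1 + K2, K1 * K2):                  # addition; elementwise product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # stays PSD up to round-off
```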
4. Examples of Kernels
4.1 Polynomial Kernels
The polynomial kernel of degree $m$ is defined as
$$k(x, x') = (1 + \langle x, x' \rangle)^m.$$
This implicitly represents all polynomial features up to degree $m$. The feature map $\varphi$ consists of all monomials (with appropriate scaling), so it is equivalent to explicit polynomial featurization — but much cheaper to compute.
For $d = 2$, the degree-2 polynomial kernel $k(x, x') = (1 + \langle x, x'\rangle)^2$ corresponds to the feature map $\varphi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)^\top \in \mathbb{R}^6$.
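This correspondence is easy to verify numerically; a short check of the $d = 2$, degree-2 case:

```python
import numpy as np

def phi(x):
    # The 6-dimensional feature map quoted above for d = 2, m = 2.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(3)
x, xp = rng.normal(size=2), rng.normal(size=2)
# (1 + <x, x'>)^2 equals the inner product of the explicit feature vectors.
print(np.isclose((1 + x @ xp) ** 2, phi(x) @ phi(xp)))   # -> True
```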
Computing $k$ directly needs only $O(d + \log m)$ operations (an inner product plus fast exponentiation), whereas explicitly computing and dotting the feature vectors takes $O(d^m)$. For $n = 5$, $d = 5$, $m = 4$, the naïve approach performs roughly 31 500 operations, while the kernel shortcut needs only about 225.
4.2 Radial Basis Function (RBF) Kernels
RBF kernels depend only on the distance between two points:
$$k(x, x') = \exp\!\big(-\|x - x'\|_p^{\alpha} / \tau\big),$$
where $\tau > 0$ is the bandwidth parameter. Two important special cases:
- Gaussian kernel ($\alpha = 2$, $p = 2$): $k(x, x') = \exp\!\big(-\|x - x'\|_2^2 / \tau\big)$
- Laplacian kernel ($\alpha = 1$, $p = 2$): $k(x, x') = \exp\!\big(-\|x - x'\|_2 / \tau\big)$
The bandwidth $\tau$ controls how quickly the kernel decays with distance. A small $\tau$ makes the kernel very "peaky" — only very nearby points have high similarity. A large $\tau$ makes the kernel flatter — even distant points are still considered similar. Think of $\tau$ as controlling the "range of influence" of each training point.
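A tiny numerical illustration of this effect for a fixed pair of points (the specific values are ours, chosen for illustration):

```python
import math

def gaussian_k(x, xp, tau):
    # Gaussian kernel in one dimension: exp(-(x - x')^2 / tau).
    return math.exp(-(x - xp) ** 2 / tau)

x, xp = 0.0, 2.0                       # squared distance 4
for tau in (0.5, 4.0, 100.0):
    print(tau, round(gaussian_k(x, xp, tau), 4))
# tau = 0.5 -> ~0.0003 (peaky: the points barely "see" each other)
# tau = 100 -> ~0.96   (flat: the same points are still considered similar)
```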
RBF Kernels Have Infinite-Dimensional Feature Maps
Unlike polynomial kernels, the feature map corresponding to an RBF kernel is infinite-dimensional. For the Gaussian kernel with $d = 1$, one can show (via Taylor-expanding the exponential) that the feature map sends $x$ to the sequence
$$\varphi(x) = e^{-x^2/\tau}\left(1,\; \sqrt{\tfrac{(2/\tau)^1}{1!}}\,x,\; \sqrt{\tfrac{(2/\tau)^2}{2!}}\,x^2,\; \sqrt{\tfrac{(2/\tau)^3}{3!}}\,x^3,\; \dots\right).$$
You would never be able to compute or store this explicitly — but with the kernel trick, you don't have to! You just evaluate the closed-form expression $e^{-(x - x')^2/\tau}$.
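One can still approximate this feature map by truncating the sequence, which makes the claim checkable: with coordinates $\varphi_k(x) = e^{-x^2/\tau}\sqrt{(2/\tau)^k / k!}\;x^k$ (the form obtained from the Taylor expansion, written out here as an assumption for the sketch), partial inner products converge rapidly to the closed-form kernel value.

```python
import math

def truncated_phi(x, tau, K):
    # First K coordinates of phi_k(x) = exp(-x^2/tau) * sqrt((2/tau)^k / k!) * x^k.
    return [math.exp(-x**2 / tau) * math.sqrt((2 / tau) ** k / math.factorial(k)) * x**k
            for k in range(K)]

tau, x, xp = 1.0, 0.7, -0.3
exact = math.exp(-(x - xp) ** 2 / tau)
approx = sum(a * b for a, b in zip(truncated_phi(x, tau, 25), truncated_phi(xp, tau, 25)))
print(abs(exact - approx))             # essentially machine precision
```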
5. Does Kernelization Meet Our Desiderata?
| Desideratum | Satisfied? | Explanation |
|---|---|---|
| Expressivity | Yes | Although we optimise over only $n$ parameters $\alpha$, the function class $\mathcal{F}_k = \{f_\alpha(x) = \sum_i \alpha_i k(x_i, x)\}$ grows richer as $n$ increases. "Universal" kernels (like Gaussian) are dense in continuous functions. |
| Efficiency | Yes | Memory $O(n^2)$ instead of $O(np)$; computation per kernel evaluation can be $O(d)$ instead of $O(d^m)$. |
| Convexity | Yes | If $\ell$ is convex in its second argument, then $\frac{1}{n}\sum \ell(y_i, (K\alpha)_i)$ is convex in $\alpha$ (composition of convex with affine). |
6. Complexity Comparison at a Glance
| Approach | Parameters | Memory | Compute (kernel matrix) |
|---|---|---|---|
| Explicit poly features | $p = O(d^m)$ | $O(n \, d^m)$ | $O(n^2 \, d^m)$ |
| Polynomial kernel | $n$ | $O(n^2)$ | $O(n^2(d + m))$ |
| RBF kernel | $n$ | $O(n^2)$ | $O(n^2 d)$ |
7. Key Takeaways
- The kernel trick lets us work with non-linear feature maps implicitly by replacing all inner products $\langle \varphi(x), \varphi(x')\rangle$ with a kernel function $k(x, x')$, avoiding the explicit (and often prohibitively expensive) computation of $\varphi$.
- Kernelization follows three steps: reparametrize $w = \Phi^\top\alpha$, express everything via inner products, then replace those with $k$.
- This reparametrization is justified because at least one minimiser lies in $\operatorname{span}\{\varphi(x_1),\dots,\varphi(x_n)\}$ — intuitively shown via gradient descent, and formally by the Representer Theorem.
- A valid kernel must be symmetric and positive semi-definite; this is equivalent to the existence of a feature map into some Hilbert space.
- Polynomial kernels $k(x,x') = (1+\langle x,x'\rangle)^m$ give dramatic computational savings over explicit polynomial features; RBF kernels implicitly use infinite-dimensional feature maps and are "universal" approximators.