This lecture continues and completes the treatment of kernel methods begun in Lecture 9. The first part shows how to kernelize training losses — that is, how to take a loss function written for linear models and rewrite it entirely in terms of a kernel matrix so that we never need to compute feature vectors explicitly. Kernel ridge regression is developed as the main worked example. The second part revisits properties of valid kernels and the composition rules that let you build new kernels from old ones. The final part introduces two popular non-parametric classifiers — k-Nearest Neighbors and Decision Trees / Random Forests — as a bridge toward neural networks.
1. Kernelized Losses
Recall the big picture of the ML pipeline: we choose a function class $\mathcal{F}$, a training loss, and an optimization method. Until now our function classes have been linear (possibly on hand-designed features). Kernel methods give us a systematic way to work with implicitly very high- (even infinite-) dimensional feature spaces while keeping the computation manageable.
1.1 The Two-Step Recipe
Turning any linear loss into a kernelized loss always follows the same two steps:
- Step A — Reparameterize: Replace the weight vector $w \in \mathbb{R}^d$ with $w = X^\top \alpha$, where $\alpha \in \mathbb{R}^n$. This is justified by the Representer Theorem, which guarantees that a minimizer of the training loss lives in the span of the (featurized) data points.
- Step B — Kernelize: In the reparameterized objective every appearance of the data takes the form of inner products $\langle x_i, x_j \rangle$ collected in the Gram matrix $XX^\top$. Replace this Gram matrix with a general kernel matrix $K$ whose entries are $K_{ij} = k(x_i, x_j)$.
If the training loss depends on $w$ only through the inner products $\langle w, x_i \rangle$ (and possibly through $\|w\|$), then there exists a minimizer of the form $w^* = \sum_{i=1}^n \alpha_i x_i$. This means optimizing over $\alpha \in \mathbb{R}^n$ rather than $w \in \mathbb{R}^p$ is sufficient — and it enables the kernel trick.
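The two steps can be sanity-checked numerically for the simplest case, the identity feature map, where the Gram matrix is just $K = XX^\top$. The sketch below (random data, not from the lecture) verifies that predictions written as $Xw$ with $w = X^\top\alpha$ coincide with $K\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # n = 5 points in d = 3
alpha = rng.normal(size=5)

w = X.T @ alpha                  # Step A: w lies in the span of the data points
K = X @ X.T                      # Step B: Gram matrix of the linear kernel

# Predictions via the weight vector and via kernel coefficients agree.
assert np.allclose(X @ w, K @ alpha)
```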
1.2 Kernelized Regression (Squared Loss)
Starting from the standard least-squares objective and applying the two-step recipe gives us kernelized regression:

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\|y - K\alpha\|^2,$$

where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix. The trained model predicts via:

$$\hat{y}(x) = \sum_{i=1}^n \hat{\alpha}_i \, k(x_i, x) = k_x^\top \hat{\alpha},$$

where $k_x = (k(x_1, x), \ldots, k(x_n, x))^\top$ is the vector of kernel evaluations between the training points and the new test point.
In linear regression we optimize over $w \in \mathbb{R}^d$; in kernelized regression we optimize over $\alpha \in \mathbb{R}^n$. This is good when the implicit feature dimension $p \gg n$, but becomes expensive when the number of training points $n$ is very large, since $K$ is $n \times n$.
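A minimal numpy sketch of the unregularized fit; the Gaussian kernel, the tiny synthetic dataset, and the pseudo-inverse solve are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def gauss_kernel(A, B, tau=0.5):
    """Gaussian kernel matrix: K[i, j] = exp(-||A_i - B_j||^2 / tau)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / tau)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(8, 1))
y = np.cos(3 * x[:, 0])

K = gauss_kernel(x, x)
alpha = np.linalg.pinv(K) @ y        # minimizes (1/n) ||y - K alpha||^2

x_new = np.array([[0.3]])
k_x = gauss_kernel(x, x_new)[:, 0]   # kernel evaluations against training points
y_hat = k_x @ alpha                  # prediction at the new point
```

Note that the fit only ever touches the data through `gauss_kernel`; no explicit feature vectors appear.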
1.3 Kernel Ridge Regression
Just like in the linear setting, unregularized kernel regression can overfit. The natural fix is to add a ridge penalty. Following the same two-step recipe on the ridge objective $\frac{1}{n}\|y - Xw\|^2 + \lambda\|w\|^2$ yields:

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\|y - K\alpha\|^2 + \lambda \, \alpha^\top K \alpha.$$
Let's walk through why the regularizer takes this form. After substituting $w = X^\top\alpha$, the penalty $\lambda\|w\|^2$ becomes $\lambda\|X^\top\alpha\|^2 = \lambda \, \alpha^\top XX^\top \alpha$. Replacing $XX^\top$ with $K$ gives $\lambda \, \alpha^\top K \alpha$.
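The objective has a closed-form minimizer: setting the gradient $\frac{2}{n}K(K\alpha - y) + 2\lambda K\alpha$ to zero is solved by $\alpha = (K + n\lambda I)^{-1} y$. A short numpy check (synthetic data and hyperparameters are illustrative):

```python
import numpy as np

def gauss_kernel(A, B, tau=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / tau)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(30, 1))
y = np.cos(3 * x[:, 0]) + 0.1 * rng.normal(size=30)

n, lam = len(x), 0.1
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # closed-form minimizer

# Gradient of (1/n)||y - K a||^2 + lam * a^T K a vanishes at alpha:
grad = (2 / n) * K @ (K @ alpha - y) + 2 * lam * K @ alpha
assert np.abs(grad).max() < 1e-8
```

The ridge term also fixes the numerics: $K + n\lambda I$ is well conditioned even when $K$ itself is nearly singular.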
Using $k(x, z) = (1 + \langle x, z \rangle)^{14}$, vanilla kernelized regression ($\lambda = 0$) wildly overfits: the fitted curve oscillates through every data point. Adding a small ridge penalty ($\lambda = 0.0002$) produces a smooth fit that tracks the true function closely. The effect mirrors what we saw with feature-based ridge regression.
With the Gaussian kernel ($\tau = 0.01$) and $\lambda = 0.4$, or the Laplacian kernel ($\tau = 1$) and $\lambda = 0.1$, the ridge penalty successfully recovers the true cosine function from noisy samples. Without regularization both kernels overfit, though the shape of the overfitting differs: the Gaussian produces smooth wiggles while the Laplacian produces sharp, peaked oscillations — reflecting the different smoothness of functions each kernel induces.
1.4 Classification with Kernels
The same kernelization idea applies to classification. For example, starting from the logistic loss $\ell(y_i, \hat{y}_i) = \log(1 + \exp(-y_i \langle w, x_i \rangle))$, we replace $\langle w, x_i \rangle$ with $\sum_j \alpha_j k(x_i, x_j)$ to get:

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \log\Bigl(1 + \exp\Bigl(-y_i \sum_{j=1}^n \alpha_j k(x_i, x_j)\Bigr)\Bigr).$$
After solving for $\hat{\alpha}$, new points are classified as $\hat{y} = \text{sign}\!\bigl(\sum_j \hat{\alpha}_j k(x, x_j)\bigr)$. This can model highly nonlinear decision boundaries while the optimization machinery (gradient descent, etc.) stays the same.
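A sketch of that unchanged machinery: plain gradient descent on the kernel logistic loss, using an illustrative Gaussian kernel and a synthetic "circle" dataset (all choices here are assumptions for the demo, not from the lecture):

```python
import numpy as np

def gauss_kernel(A, B, tau=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / tau)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1.0, -1.0)   # labels from a circle

K = gauss_kernel(X, X)
alpha = np.zeros(len(X))

def loss(a):
    return np.mean(np.log1p(np.exp(-y * (K @ a))))

for _ in range(500):                         # plain gradient descent
    margins = K @ alpha
    sig = 1.0 / (1.0 + np.exp(y * margins))  # sigmoid(-y_i * margin_i)
    alpha -= 0.1 * (-(K @ (y * sig)) / len(y))

pred = np.sign(K @ alpha)                    # in-sample decisions
```

The decision boundary is a circle, which no linear classifier on the raw inputs could represent, yet the update rule is ordinary gradient descent.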
2. Properties of Valid Kernels & More Examples
2.1 What Makes a Kernel "Valid"?
Not every bivariate function $k(x, x')$ corresponds to some feature map. The function $k$ implicitly represents a feature mapping $\phi$ (i.e. $k(x, x') = \langle \phi(x), \phi(x') \rangle$) if and only if it satisfies two properties:
A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if:
- Symmetry: $k(x, x') = k(x', x)$ for all $x, x'$.
- Positive semi-definiteness: For any $n$ and any points $x_1, \ldots, x_n$, the $n \times n$ kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
Intuitively, positive semi-definiteness means that the matrix of pairwise similarities "behaves like" a matrix of inner products. The deep result (Theorem 9.3 in the notes) is that every such function can be written as an inner product in some (possibly infinite-dimensional) Hilbert space.
2.2 Building New Kernels — Composition Rules
One of the most practical aspects of kernel theory is that you can build complex kernels from simple building blocks. Given valid kernels $k_1$ and $k_2$, the following are also valid kernels:
- Feature map composition: If $\phi : \mathbb{R}^d \to \mathbb{R}^p$ and $\psi : \mathbb{R}^p \to \mathbb{R}^q$ are feature maps, then $k(x, x') = \langle \psi(\phi(x)), \psi(\phi(x')) \rangle$ is a valid kernel.
- Sum of kernels (same input): $k(x, x') = k_1(x, x') + k_2(x, x')$ is valid.
- Sum of kernels (concatenated input): $k\bigl((x,y),(x',y')\bigr) = k_1(x, x') + k_2(y, y')$ is valid.
- Product of kernels (same input): $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ is valid.
- Product of kernels (concatenated input): $k\bigl((x,y),(x',y')\bigr) = k_1(x, x') \cdot k_2(y, y')$ is valid.
These rules let you, for example, combine a polynomial kernel for one group of features with an RBF kernel for another, and then add or multiply them — always knowing the result is a valid kernel.
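The closure rules can be spot-checked numerically: the sum and the elementwise (Hadamard) product of two kernel matrices should stay positive semi-definite. A small sketch with a linear and a Gaussian kernel on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

K1 = X @ X.T                                  # linear kernel matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-d2)                              # Gaussian kernel matrix, tau = 1

# Sums and elementwise products of kernel matrices remain PSD
# (the product case is the Schur product theorem).
for K in (K1 + K2, K1 * K2):
    assert np.linalg.eigvalsh(K).min() > -1e-9
```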
2.3 Inner Product Kernels
If $h : \mathbb{R} \to \mathbb{R}$ has a Taylor series with only non-negative coefficients, then $k(x, x') = h(\langle x, x' \rangle)$ is a valid kernel. A prominent example is the polynomial kernel, obtained from $h(t) = (1 + t)^m$:

$$k(x, x') = (1 + \langle x, x' \rangle)^m.$$
This implicitly computes the inner product of all polynomial features up to degree $m$. For instance, with $d=2$ and $m=2$, the implicit feature map is $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2)^\top \in \mathbb{R}^6$.
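This equivalence can be verified directly for the $d=2$, $m=2$ case: the inner product of the six explicit features equals the kernel value.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for d = 2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# <phi(x), phi(z)> equals (1 + <x, z>)^2 for every pair of inputs.
assert np.allclose(phi(x) @ phi(z), (1 + x @ z) ** 2)
```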
2.4 Radial Basis Function (RBF) Kernels
RBF kernels depend only on the distance between two points: $k(x, x') = h(\|x - x'\|)$. The general $\alpha$-exponential kernel is:

$$k(x, x') = \exp\Bigl(-\frac{\|x - x'\|_p^{\alpha}}{\tau}\Bigr),$$

where $\tau > 0$ is the bandwidth parameter, $\alpha$ is the exponent, and $\|\cdot\|_p$ is the $\ell_p$ norm.
| Name | $\alpha$ | $p$ | Shape |
|---|---|---|---|
| Gaussian (often called "RBF") | 2 | 2 | Smooth bell curve |
| Laplacian | 1 (or 2) | 2 (or 1) | Peaked, non-smooth at center |
The bandwidth $\tau$ controls how "local" the kernel is. A small $\tau$ makes the kernel sharply peaked — only very nearby points are similar. A large $\tau$ makes the kernel flat — even distant points are considered similar. Choosing $\tau$ is critical and is typically done via cross-validation.
RBF kernels correspond to an infinite-dimensional feature map. For the 1D Gaussian kernel, the feature map sends each scalar $x$ to the infinite sequence $\phi(x) = \bigl(e^{-x^2/\tau}\sqrt{(2/\tau)^k / k!}\; x^k\bigr)_{k \in \mathbb{N}}$. You could never compute these features explicitly, but the kernel $k(x,x') = e^{-(x-x')^2/\tau}$ evaluates this inner product in constant time.
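A quick check that the series really sums to the kernel: truncating the infinite feature map at a modest number of terms already matches $e^{-(x-x')^2/\tau}$ to machine precision (the specific inputs are arbitrary).

```python
from math import exp, factorial, sqrt

tau = 1.0
x, xp = 0.7, -0.4

def phi_k(x, k):
    """k-th coordinate of the 1D Gaussian-kernel feature map."""
    return exp(-x ** 2 / tau) * sqrt((2 / tau) ** k / factorial(k)) * x ** k

# Partial inner product over the first 30 features vs. the closed-form kernel.
partial = sum(phi_k(x, k) * phi_k(xp, k) for k in range(30))
exact = exp(-(x - xp) ** 2 / tau)

assert abs(partial - exact) < 1e-12
```

The factorial in the denominator is what makes the tail of the series negligible so quickly.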
3. Practical Implementation
In scikit-learn the key classes are `KernelRidge` and `SVC`. The important hyperparameters are:
- `kernel`: one of `'linear'`, `'polynomial'`, `'rbf'`, `'laplacian'`, `'cosine'`, etc.
- `alpha` (our $\lambda$): regularization strength.
- `gamma`: bandwidth of the kernel (inversely related to $\tau$).
Both hyperparameters (regularization and bandwidth) should be tuned via cross-validation. The lecture demo Kernel_prediction.ipynb illustrates these trade-offs interactively.
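Assuming scikit-learn is installed, a minimal sketch of tuning both hyperparameters jointly with `GridSearchCV` (the dataset and the candidate grids are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.cos(3 * X[:, 0]) + 0.1 * rng.normal(size=100)

# Cross-validate over regularization strength (alpha) and bandwidth (gamma).
grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": [1e-3, 1e-2, 1e-1], "gamma": [0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
y_hat = grid.predict(np.array([[0.25]]))   # predict at a new point
```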
4. Other Non-Linear Methods
Before moving to neural networks, the lecture introduces two non-parametric methods. Unlike kernel methods, these are not obtained by minimizing a parametric training loss over a function class. They can be used for both classification and regression.
4.1 k-Nearest Neighbors (kNN)
kNN requires no training phase. At test time, for a new point $x$:
- Choose $k$ and a distance metric (e.g. Euclidean).
- Find the $k$ training points closest to $x$: $x_{i_1}, \ldots, x_{i_k}$.
- Output the majority vote of their labels $y_{i_1}, \ldots, y_{i_k}$.
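The three steps above can be sketched in a few lines of numpy (the toy two-cluster dataset is illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

X_train = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3)   # -> 0
knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)   # -> 1
```

Note there is no `fit` step: all the work happens at query time, which is exactly why prediction cost scales with $n$.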
kNN is conceptually simple but comes with important caveats:
- It needs a large training set to work well, and prediction cost is $O(nd)$ per query (though approximate methods can reduce this to sub-linear).
- It is sensitive to $k$ — too small means noisy boundaries; too large means overly smooth. Choose $k$ via cross-validation.
- It becomes unreliable in high dimensions because all points tend to be roughly equidistant (the "curse of dimensionality").
4.2 Decision Trees
A decision tree partitions the input space $\mathcal{X}$ into disjoint regions aligned with the coordinate axes. Each internal node applies a splitting rule of the form "is $x_{[j]} > t_j$?" and each leaf node predicts the majority class among the training points that fall into that region.
Growing a Decision Tree
Trees are built top-down using a greedy algorithm: at each step, pick the feature index $j$ and threshold $t$ that most reduce an impurity measure over the resulting child nodes. A common choice is the Gini impurity:

$$G(S) = \sum_{c} p_c (1 - p_c) = 1 - \sum_{c} p_c^2,$$

where $p_c$ is the fraction of points in $S$ with label $c$.
For binary classification this simplifies to $2p(1-p)$, which is small when one class dominates the set — exactly what we want in a leaf.
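The greedy split search on a single feature can be sketched as follows: score every candidate threshold by the size-weighted Gini impurity of the two children and keep the best (the helper names and the toy data are illustrative).

```python
import numpy as np

def gini(labels):
    """Gini impurity: sum_c p_c (1 - p_c) over class frequencies."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return float(np.sum(p * (1 - p)))

def best_split(x, y):
    """Greedy search over thresholds on one feature, minimizing weighted Gini."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
y = np.array([0, 0, 0, 1, 1, 1])
best_split(x, y)    # -> (0.3, 0.0): splitting at 0.3 gives two pure children
```

A full tree would apply this search recursively over all features within each child region.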
Caveats of Decision Trees
- Overfitting: Leaf nodes contain few points and can easily memorize noise. The tree depth must be chosen carefully (another hyperparameter).
- Greedy suboptimality: A bad split early on propagates errors to all descendants.
4.3 Random Forests
The remedy for the instability and overfitting of individual trees is to ensemble many trees. A Random Forest trains multiple decision trees, each on a random subset of features, and takes a majority vote across all trees. This variance-reduction technique makes random forests one of the most popular methods for tabular data with interpretable features and moderate dimensionality.
5. Comparing Classification Methods
The lecture concludes by comparing kernel SVMs, kNN, decision trees/random forests, and neural-network-based features on real datasets. A notable benchmark is CIFAR-10 (32×32 pixel images of 10 classes): linear models on compositional kernel features can reach roughly 90% test accuracy — surprisingly competitive, though the latest neural networks push this to about 98%.
A land-cover classification study on Sentinel-2 satellite imagery further demonstrates that as the training set grows, kernel SVMs, kNN, and random forests each have different scaling behaviors — with kernel SVMs (using a Gaussian kernel) often performing well even at moderate sample sizes.
Key Takeaways
- Kernelizing a loss is a two-step process: reparameterize $w = X^\top\alpha$, then replace the Gram matrix $XX^\top$ with the kernel matrix $K$. This transforms an optimization in $\mathbb{R}^d$ (or $\mathbb{R}^p$) into one in $\mathbb{R}^n$.
- Kernel ridge regression objective is $\frac{1}{n}\|y - K\alpha\|^2 + \lambda \alpha^\top K\alpha$. The ridge term prevents overfitting and is essential in practice.
- Valid kernels can be composed (added, multiplied, chained through feature maps) to build expressive similarity measures while guaranteeing a feature-map interpretation.
- RBF kernels (Gaussian, Laplacian) implicitly use infinite-dimensional features and are controlled by the bandwidth $\tau$.
- kNN is training-free but expensive at test time and struggles in high dimensions. Decision trees are interpretable but prone to overfitting; random forests mitigate this through ensembling.