Overview
This lecture introduces the classification problem — the task of predicting a discrete label $y$ from an input $x$, as opposed to a continuous value in regression. The lecture walks through the full machine learning pipeline for binary classification: choosing a function class (linear classifiers), selecting an appropriate training loss (surrogate losses for the 0–1 loss), and optimizing it (gradient descent on the logistic loss). It concludes by connecting gradient descent on logistic loss to the maximum-margin classifier and support vector machines (SVMs).
1. What Is Classification?
In supervised learning, we have a training set of pairs $(x_i, y_i)$ and want to learn a mapping $\hat{f}: \mathcal{X} \to \mathcal{Y}$. The key difference from regression is the nature of the output space $\mathcal{Y}$:
Regression: $\mathcal{Y} = \mathbb{R}$ — the label is a continuous real value.
Classification: $\mathcal{Y}$ is a finite, discrete set, e.g. $\{-1, +1\}$ (binary) or $\{1, 2, \ldots, K\}$ (multiclass).
- Email spam detection: input = email text, output $\in$ {spam, not spam}.
- Medical diagnosis: input = patient symptoms & measurements, output $\in$ {at risk, safe}.
- Image classification: input = image pixels, output $\in$ {cat, dog, ship, …}.
- Sentiment analysis: input = text message, output = sentiment class.
Step 0 — Converting to Numbers
Before we can do any math, we need to convert the problem into numerical form. The input attributes are typically already numerical (e.g. blood measurements), but the class labels need to be assigned numbers. For binary classification, the standard convention is $y \in \{-1, +1\}$.
Prediction via the Sign Function
Just like in regression, we train a model $\hat{f}: \mathbb{R}^d \to \mathbb{R}$ that outputs a real-valued score. To turn this into a class prediction, we apply the sign function:
If $\hat{f}(x) > 0$ we predict $+1$; if $\hat{f}(x) < 0$ we predict $-1$. In other words, $\hat{y} = \text{sign}(\hat{f}(x))$, with a score of exactly $0$ resolved by convention (e.g. predicting $+1$).
The 0–1 Loss
The natural evaluation metric for classification is the zero-one loss, which simply checks whether the prediction is correct:
$$ \ell_{0\text{-}1}(\hat{y},\, y) \;=\; \mathbb{1}_{\{\hat{y} \neq y\}} \;=\; \begin{cases} 1 & \text{if } \hat{y} \neq y \\ 0 & \text{otherwise} \end{cases} $$
Since $\hat{y}, y \in \{-1, +1\}$, this can also be written as $\mathbb{1}_{\{y \cdot f(x) < 0\}}$, which depends on $y$ and $f(x)$ only through the product $z = y \cdot f(x)$.
When we say a classifier has "78% accuracy", that means $\ell_{0\text{-}1} = 0$ for 78% of data points. Accuracy $= 1 -$ (average 0–1 loss).
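The accuracy computation above can be sketched in a few lines (the labels and predictions here are illustrative values, not from any dataset):

```python
import numpy as np

# Hypothetical labels and predictions in {-1, +1}
y_true = np.array([+1, -1, +1, +1, -1])
y_pred = np.array([+1, -1, -1, +1, +1])

zero_one = (y_pred != y_true).astype(float)  # 1 on each mistake, 0 otherwise
avg_loss = zero_one.mean()                   # average 0-1 loss = error rate
accuracy = 1.0 - avg_loss                    # accuracy = 1 - average 0-1 loss
```

Here two of the five predictions are wrong, so the average 0–1 loss is $0.4$ and the accuracy is $0.6$.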
2. Linear Classifiers & Decision Boundaries
A linear classifier uses a function of the form $f(x) = w^\top x$ (or $w^\top x + w_0$ with a bias term). The prediction rule is $\hat{y} = \text{sign}(w^\top x)$.
The decision boundary is the set $\{x : w^\top x = 0\}$, which is a hyperplane in $\mathbb{R}^d$. Points on one side are classified as $+1$, and points on the other side as $-1$. Geometrically, $w$ is the normal vector to this hyperplane: points that form an acute angle with $w$ get classified as $+1$ and those that form an obtuse angle as $-1$.
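A minimal sketch of this prediction rule and its geometry (the weight vector and points below are made up for illustration):

```python
import numpy as np

def predict(w, X):
    """Linear classifier: predict the sign of the score w^T x for each row of X."""
    scores = X @ w
    return np.where(scores >= 0, 1, -1)  # tie at exactly 0 broken toward +1 (a convention)

w = np.array([1.0, -1.0])        # normal vector of the hyperplane {x : w^T x = 0}
X = np.array([[2.0, 1.0],        # acute angle with w  -> score > 0 -> predicted +1
              [1.0, 2.0]])       # obtuse angle with w -> score < 0 -> predicted -1
```

The first point has score $2 - 1 = 1 > 0$ and the second $1 - 2 = -1 < 0$, matching the acute/obtuse intuition.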
When Do Linear Classifiers Work?
Linear classifiers work well when the data is linearly separable — meaning there exists a hyperplane that perfectly separates the two classes. However, for more complex data distributions (e.g. classes arranged in concentric circles, or in alternating quadrants), a straight line simply cannot separate them.
Non-linear Classifiers via Feature Maps
The key insight is that we can make linear classifiers far more powerful by first applying a non-linear feature map $\phi: \mathbb{R}^d \to \mathbb{R}^p$ and then fitting a linear model $f(x) = w^\top \phi(x)$ in the transformed feature space.
The effective function class becomes $\mathcal{F}_\phi = \{ x \mapsto w^\top \phi(x) : w \in \mathbb{R}^p \}$. Although $\phi$ is non-linear, the classifier is still linear in the features $\phi(x)$, so all the same training machinery applies.
Suppose "+" points are inside a circle of radius $r$ and "−" points are outside. The data is not linearly separable in $(x_1, x_2)$. But with the feature map $\phi(x_1, x_2) = (1,\; x_1^2,\; x_2^2)$, the function $f(x) = r^2 - x_1^2 - x_2^2$ is linear in $\phi(x)$ and perfectly separates the classes. The decision boundary $\{x : x_1^2 + x_2^2 = r^2\}$ is a circle!
If "+" points lie in the 2nd and 4th quadrants and "−" points in the 1st and 3rd, then $\phi(x_1, x_2) = x_1 x_2$ makes the data linearly separable with decision boundary along the coordinate axes.
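A quick numerical check of the circle example (the radius and test points are illustrative choices):

```python
import numpy as np

def phi(X):
    """Feature map (1, x1^2, x2^2) from the circle example."""
    return np.column_stack([np.ones(len(X)), X[:, 0]**2, X[:, 1]**2])

r = 1.0
w = np.array([r**2, -1.0, -1.0])   # f(x) = r^2 - x1^2 - x2^2, linear in phi(x)

X = np.array([[0.5, 0.0],          # inside the circle  -> f(x) > 0 -> +1
              [2.0, 0.0]])         # outside the circle -> f(x) < 0 -> -1
preds = np.where(phi(X) @ w > 0, 1, -1)
```

Even though the decision boundary is a circle in the original coordinates, the classifier is an ordinary linear one in $\phi(x)$.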
Neural networks can be viewed as linear classifiers on learned features: instead of hand-picking $\phi$, the network learns the best $\phi$ from data. The last layer is essentially a linear classifier.
3. Surrogate Loss Functions for Training
A natural idea is to train the classifier by directly minimizing the 0–1 loss on the training set. However, the 0–1 loss is discontinuous, non-convex, and piecewise constant (its gradient is zero almost everywhere), which makes it computationally infeasible to optimize with gradient-based methods. We therefore introduce surrogate losses — smooth, convex substitutes that we minimize during training instead.
Surrogate losses are used only for training. For evaluation, we still report accuracy (i.e. the 0–1 loss).
What Makes a Good Surrogate?
Since the 0–1 loss depends on $y$ and $f(x)$ only through $z = y \cdot f(x)$, we look for surrogate functions $g(z)$ that satisfy:
- Decreasing in $z$: wrong predictions ($z < 0$) should incur higher loss than correct ones ($z > 0$).
- Convex & differentiable: so gradient-based optimization works reliably.
- No extra reward for correct predictions: the loss should be bounded below (ideally by zero), so the optimizer doesn't waste effort pushing already-correct points further away from the boundary.
- Stable gradients: no exploding or vanishing gradients for misclassified points.
- Robust to noisy outliers: a single mislabeled point shouldn't dominate the loss.
Candidate Surrogate Losses
| Loss | Formula $g(z)$ | Verdict |
|---|---|---|
| Linear | $-z$ | Decreasing & convex, but unbounded below — rewards large margins too aggressively, can fail on imbalanced data. |
| Exponential | $e^{-z}$ | Decreasing, convex, bounded below by 0. But gradients explode as $z \to -\infty$, making it sensitive to outliers. |
| Logistic | $\log(1 + e^{-z})$ | Best candidate. Decreasing, convex, differentiable, bounded gradients for misclassified points, robust to outliers. |
$$ \ell_{\log}(f(x),\, y) \;=\; \log\!\big(1 + e^{-y \cdot f(x)}\big) $$
Its derivative with respect to $z = y \cdot f(x)$ is $\displaystyle\frac{d}{dz}\ell_{\log}(z) = -\frac{1}{1 + e^{z}}$, which is bounded between $-1$ and $-1/2$ for misclassified points ($z < 0$). This prevents exploding gradients and makes training stable.
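A small sketch of the logistic loss and its derivative; `np.logaddexp` is used so the loss does not overflow for very negative $z$ (a standard numerical trick, not something specific to this lecture):

```python
import numpy as np

def logistic_loss(z):
    """log(1 + e^{-z}), computed stably as logaddexp(0, -z)."""
    return np.logaddexp(0.0, -z)

def logistic_loss_grad(z):
    """d/dz log(1 + e^{-z}) = -1 / (1 + e^z), bounded in (-1, 0)."""
    return -1.0 / (1.0 + np.exp(z))
```

At $z = 0$ the loss is $\log 2$ and the derivative is exactly $-1/2$; for badly misclassified points ($z \to -\infty$) the derivative approaches $-1$ but never exceeds it in magnitude.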
Why Not the Exponential Loss?
The exponential loss $e^{-z}$ has derivative $-e^{-z}$, which blows up as $z \to -\infty$ (badly misclassified points). This makes the optimizer hypersensitive to outliers or noisy labels. The logistic loss avoids this problem by having bounded gradients.
Why Not the Linear Loss?
The linear loss $g(z) = -z$ decreases without bound, which means it keeps rewarding points that are already classified correctly with large margin. On imbalanced datasets, this can pull the decision boundary away from the majority class, hurting minority class accuracy.
4. Logistic Regression
Logistic regression fits a linear function $f_w(x) = w^\top x$ by minimizing the average logistic loss on the training data:
$$ L(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i\, w^\top x_i}\Big) $$
Despite the name "regression", this is a classification method. The name comes from its connection to probabilistic modeling (the logistic/sigmoid function), which is covered later in the course.
Properties of the Logistic Loss Objective
The training loss $L(w)$ is always convex. However, whether it has a unique minimizer depends on the data:
- Data that is not linearly separable: $L(w)$ has a finite minimizer (unique when the $x_i$ span $\mathbb{R}^d$, in which case $L$ is strictly convex). Gradient descent converges to it.
- Linearly separable data: the loss can be driven arbitrarily close to zero by scaling $\|w\|$ to infinity. There is no finite minimizer — the iterates of GD diverge in norm, but converge in direction.
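The non-separable case can be sketched with gradient descent on a toy 1-D dataset (all values below are hypothetical; with these labels no single sign of $w x$ classifies every point, so a finite minimizer exists and GD settles where the gradient vanishes):

```python
import numpy as np

def grad_L(w, X, y):
    """Gradient of L(w) = (1/n) sum_i log(1 + e^{-y_i w^T x_i}):
    -(1/n) sum_i y_i x_i / (1 + e^{y_i w^T x_i})."""
    z = y * (X @ w)
    return (X * (-y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)

# Non-separable 1-D toy data: no w gets all four points right
X = np.array([[1.0], [2.0], [-1.0], [0.5]])
y = np.array([1.0, -1.0, -1.0, 1.0])

w = np.zeros(1)
for _ in range(500):
    w -= 0.1 * grad_L(w, X, y)  # plain gradient descent, fixed step size
```

After a few hundred steps the iterate stops moving: the gradient at the final $w$ is essentially zero, i.e. a finite minimizer has been reached.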
5. Maximum-Margin Classifiers & SVMs
When data is linearly separable, many decision boundaries can perfectly classify all training points. Which one should we prefer?
The Margin
For a unit-norm weight vector $\|w\|_2 = 1$, the distance from a point $x_i$ to the decision boundary $\{x: w^\top x = 0\}$ is $|w^\top x_i|$. For correctly classified points, this equals $y_i \cdot w^\top x_i$. The margin of a classifier $w$ is the minimum such distance across all training points:
The max-margin classifier finds the unit-norm $w$ that maximizes the margin: $$ w_{\text{MM}} = \arg\max_{\|w\|_2 = 1} \;\min_{i=1,\ldots,n}\; y_i \langle w, x_i\rangle $$
Intuitively, this is the decision boundary that is as far away as possible from the nearest training points in both classes.
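The margin definition can be computed directly; the toy points below are chosen (for illustration) so that every point lies at the same distance from the boundary of $w = (1, 1)$:

```python
import numpy as np

def margin(w, X, y):
    """Margin of classifier w: min_i y_i <w, x_i> after normalizing w to unit norm.
    Positive iff w separates the training data."""
    w_unit = w / np.linalg.norm(w)
    return float(np.min(y * (X @ w_unit)))

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = margin(np.array([1.0, 1.0]), X, y)  # every point is at distance sqrt(2)
```

Each of the four points has signed distance $\sqrt{2}$ from the hyperplane $\{x : x_1 + x_2 = 0\}$, so the margin is $\sqrt{2}$.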
Support Vector Machines (Hard-Margin)
The max-margin problem has an equivalent (and more tractable) formulation: instead of fixing the norm and maximizing the margin, we fix the margin to be at least 1 and minimize the norm: $$ w_{\text{SVM}} = \arg\min_{w} \;\frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \;\; y_i \langle w, x_i\rangle \geq 1 \;\;\forall\, i $$
These two formulations yield the same decision boundary (same direction): $w_{\text{SVM}} / \|w_{\text{SVM}}\|_2 = w_{\text{MM}}$. The training points that lie exactly on the margin boundary ($y_i \langle w, x_i \rangle = 1$) are called support vectors.
Implicit Bias of Gradient Descent
For linearly separable data, gradient descent (initialized at 0) on the logistic loss converges in direction to the maximum-margin solution: $$ \lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} = w_{\text{MM}} $$
Even though the logistic loss objective says nothing about margins or norms, gradient descent naturally finds the widest-margin classifier. This is called the implicit bias (or inductive bias) of gradient descent.
This is a remarkable result: you don't need to explicitly solve the SVM optimization problem. Simply running gradient descent on the logistic loss will, for separable data, give you the same decision boundary as the maximum-margin/SVM solution.
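A small numerical sketch of the implicit bias (toy separable data, hypothetical values). For the points below, the binding constraints are the $(\pm 1, \pm 1)$ points, and one can check that the max-margin direction is $(1, 1)/\sqrt{2}$; running GD on the logistic loss, $\|w_t\|$ keeps growing while the direction $w_t / \|w_t\|$ drifts toward that direction:

```python
import numpy as np

def grad_L(w, X, y):
    """Gradient of the average logistic loss (same formula as in Section 4)."""
    z = y * (X @ w)
    return (X * (-y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)

# Separable toy data; by direct computation the max-margin direction is (1,1)/sqrt(2)
X = np.array([[2.0, 0.0], [-2.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
w_mm = np.array([1.0, 1.0]) / np.sqrt(2.0)

w = np.zeros(2)
cos_sims, norms = [], []
for t in range(1, 20001):
    w -= 0.5 * grad_L(w, X, y)
    if t in (100, 20000):
        cos_sims.append(float(w @ w_mm / np.linalg.norm(w)))  # alignment with w_MM
        norms.append(float(np.linalg.norm(w)))
sep = float(np.min(y * (X @ w)))  # all margins positive: data perfectly classified
```

Between the two checkpoints the norm grows and the cosine similarity to $w_{\text{MM}}$ increases, consistent with divergence in norm but convergence in direction.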
Soft-Margin SVM (Non-Separable Data)
When data is not linearly separable, the hard-margin constraint $y_i \langle w, x_i\rangle \geq 1$ cannot be satisfied for all points. We relax it by introducing slack variables $\xi_i \geq 0$:
$$ \min_{w,\,\xi} \;\frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^{n} \xi_i \qquad \text{s.t.} \;\; y_i w^\top x_i \geq 1 - \xi_i,\;\; \xi_i \geq 0 \;\;\forall\, i $$
This is equivalent to the unconstrained formulation with the hinge loss:
$$ \min_{w} \;\frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^{n} \max(0,\; 1 - y_i w^\top x_i) $$
The parameter $\lambda > 0$ controls the trade-off between keeping a large margin (small $\|w\|$) and allowing violations (large $\xi_i$). For non-separable data, we can also use $\ell_2$-regularized logistic regression, which has a similar form. The two approaches give similar — sometimes identical — decision boundaries, though the SVM with hinge loss ignores correctly classified points beyond the margin, while logistic loss gives them a small but non-zero weight.
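The hinge objective and one valid subgradient can be sketched as follows; note how a point with margin $\geq 1$ contributes nothing to the subgradient, matching the remark above that the hinge loss ignores correctly classified points beyond the margin (the weights and points are illustrative):

```python
import numpy as np

def hinge_objective(w, X, y, lam):
    """Soft-margin objective: 0.5 ||w||^2 + lam * sum_i max(0, 1 - y_i w^T x_i)."""
    return 0.5 * float(w @ w) + lam * float(np.maximum(0.0, 1.0 - y * (X @ w)).sum())

def hinge_subgradient(w, X, y, lam):
    """One valid subgradient: only margin violations (y_i w^T x_i < 1) contribute."""
    viol = (y * (X @ w)) < 1.0
    return w - lam * (y[viol, None] * X[viol]).sum(axis=0)

w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0],   # margin 2   >= 1: no slack, ignored by the hinge term
              [0.5, 0.0]])  # margin 0.5 <  1: slack 0.5
y = np.array([1.0, 1.0])
obj = hinge_objective(w, X, y, lam=1.0)   # 0.5*||w||^2 + 0.5 = 1.0
g = hinge_subgradient(w, X, y, lam=1.0)   # only the second point contributes
```

Only the second point sits inside the margin, so the objective is $\frac{1}{2}\cdot 1 + 1 \cdot 0.5 = 1$ and the subgradient is $w - (0.5, 0) = (0.5, 0)$.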
6. The Classification Pipeline at a Glance
The practitioner cares about 0–1 loss (accuracy), but the ML scientist uses a surrogate (logistic loss) during training because it is smooth and convex. Evaluation is always done on the actual 0–1 loss / accuracy.
Key Takeaways
- Classification maps inputs to discrete labels. For binary classification, we use $y \in \{-1, +1\}$ and predict via $\hat{y} = \text{sign}(f(x))$.
- Linear classifiers ($f(x) = w^\top x$) separate classes with a hyperplane. They can be made much more powerful by using non-linear feature maps $\phi(x)$.
- We cannot directly optimize the 0–1 loss (it's non-convex and non-differentiable). Instead, we train with a surrogate loss — the logistic loss $\log(1 + e^{-yf(x)})$ is the best-behaved candidate (convex, smooth, bounded gradients, outlier-robust).
- Logistic regression = minimizing average logistic loss with a linear model. For non-separable data, GD converges to a unique minimizer. For separable data, the iterates diverge in norm but converge in direction.
- For separable data, GD on the logistic loss implicitly converges to the max-margin / SVM direction — no explicit margin maximization needed.
- For non-separable data, the soft-margin SVM (hinge loss + $\ell_2$ penalty) and regularized logistic regression give similar decision boundaries.