Lecture 8: Classification II

Introduction to Machine Learning, Fanny Yang, ETH Zürich

This lecture extends the binary classification framework from Lecture 7 to the multi-class setting, introduces the cross-entropy loss as a natural generalization of the logistic loss, and then shifts focus to how we evaluate classifiers beyond simple average accuracy. The key themes are: how to handle more than two classes, how generalization works for classifiers (including under distribution shift), and how to deal with asymmetric errors using tools like ROC curves, precision, recall, and the F1-score.

1. Recap: Binary Classification Pipeline

Before moving to multi-class, let's quickly recall the binary classification setup. We have training data $\{(x_i, y_i)\}$ with labels $y_i \in \{-1, +1\}$. We train a model $\hat{f}: \mathbb{R}^d \to \mathbb{R}$ and predict using the labeling function:

$$\hat{y}(x) = \text{sign}(\hat{f}(x)) = \begin{cases} +1 & \text{if } \hat{f}(x) > 0 \\ -1 & \text{if } \hat{f}(x) < 0 \end{cases}$$

The set $\{x : \hat{f}(x) = 0\}$ is the decision boundary. For linear models this is a hyperplane $\{x : w^\top x = 0\}$. We train $\hat{f}$ by minimizing a surrogate loss (e.g., logistic loss) on the training data using gradient descent.

2. Multi-Class Classification

2.1 Prediction with Multiple Models

When we have $K > 2$ classes (e.g., classifying images into cats, dogs, ships, etc.), we need $K$ separate functions $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_K$, one per class. The predicted class for an input $x$ is whichever function gives the highest value:

Multi-Class Prediction Rule

Given $K$ models $\hat{f}_1, \ldots, \hat{f}_K$, predict:

$$\hat{y}(x) = \arg\max_{k=1,\ldots,K} \hat{f}_k(x)$$

Think of it this way: each $\hat{f}_k(x)$ measures how "confident" we are that $x$ belongs to class $k$. We pick the class with the highest confidence score.

For linear models (where $\hat{f}_k(x) = w_k^\top x$), this partitions the feature space into $K$ regions. The decision boundary between classes $i$ and $j$ is the set of points where $\hat{f}_i(x) = \hat{f}_j(x)$, which for linear models is again a hyperplane.
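As a small sketch of the argmax prediction rule (the weight matrix `W` and the points below are made-up values for illustration):

```python
import numpy as np

# Hypothetical weights: one row per class (K = 3 classes, d = 2 features).
W = np.array([[ 1.0,  0.0],
              [-0.5,  1.0],
              [-0.5, -1.0]])

def predict(X, W):
    """argmax_k w_k^T x for each row x of X."""
    scores = X @ W.T               # shape (n, K): scores[i, k] = w_k^T x_i
    return np.argmax(scores, axis=1)

X = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
print(predict(X, W))               # [0 1 2]: each point lands in a different region
```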

Visualizing 3-Class Linear Classification

With $K = 3$ and 2D input $x$, each linear model $\hat{f}_k(x) = w_k^\top x$ defines a plane in 3D (the third axis being the function value). The feature space gets split into three colored regions. The solid boundaries where two planes cross determine which class is predicted.

2.2 Training: One-vs-Rest Approach

One straightforward way to train $K$ models is the one-vs-rest (OvR) strategy. For each class $k \in \{1, \ldots, K\}$:

  1. Relabel all data: points with label $y = k$ get new label $\tilde{y} = +1$, and all others get $\tilde{y} = -1$.
  2. Train a standard binary classifier on this relabeled dataset to obtain $\hat{f}_k$.

This is conceptually simple but requires solving $K$ separate binary classification problems, which can be slow when $K$ is large.

from sklearn.linear_model import LogisticRegression
method = LogisticRegression(multi_class='ovr')
method.fit(X, y)
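Under the hood, OvR is easy to sketch directly. The following is a minimal implementation of the two steps above, assuming made-up toy data from `make_blobs`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, K):
    """Train K binary classifiers; model k separates class k (+1) from the rest (-1)."""
    models = []
    for k in range(K):
        y_k = np.where(y == k, 1, -1)                    # step 1: relabel
        models.append(LogisticRegression().fit(X, y_k))  # step 2: binary fit
    return models

def predict_one_vs_rest(models, X):
    # decision_function returns the signed score f_k(x); pick the largest per point.
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)

X, y = make_blobs(n_samples=150, centers=3, random_state=0)  # toy 3-class data
models = train_one_vs_rest(X, y, K=3)
acc = (predict_one_vs_rest(models, X) == y).mean()
```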

2.3 Training: Cross-Entropy Loss (Simultaneous Approach)

A more elegant and widely-used approach trains all $K$ models simultaneously by minimizing a single loss function — the cross-entropy loss:

Definition — Cross-Entropy Loss

For a vector of model outputs $\hat{f}(x) = (\hat{f}_1(x), \ldots, \hat{f}_K(x))$ and true label $y \in \{1, \ldots, K\}$:

$$\ell_{\text{CE}}(\hat{f}(x),\; y) = -\log \frac{e^{\hat{f}_y(x)}}{\sum_{k=1}^{K} e^{\hat{f}_k(x)}}$$

Why does this make sense?

Consider a training point $(x, y)$ with true class $y = 1$. If $\hat{f}_1(x)$ is much larger than all other $\hat{f}_k(x)$, then the fraction $\frac{e^{\hat{f}_1(x)}}{\sum_k e^{\hat{f}_k(x)}}$ is close to 1, so the loss $-\log(\cdot)$ is close to 0 (which is good). Conversely, if the correct class gets a low score, that fraction is near 0 and the loss blows up. In short, the cross-entropy loss encourages $\hat{f}_y(x)$ to be the largest among all $K$ outputs.

The Softmax Connection

The expression inside the logarithm is the softmax function. Define the probability vector:

$$[p(x)]_k := \text{softmax}(\hat{f}(x))_k = \frac{e^{\hat{f}_k(x)}}{\sum_{j=1}^{K} e^{\hat{f}_j(x)}}$$

This vector satisfies $[p(x)]_k \geq 0$ and $\sum_k [p(x)]_k = 1$, so it's a valid probability distribution. The largest probability corresponds to the predicted class $\hat{y}(x) = \arg\max_k \hat{f}_k(x)$. Training with gradient descent on the average cross-entropy loss pushes these "probabilities" toward concentrating mass on the correct class.
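A minimal NumPy sketch of the softmax and the resulting loss (the scores below are arbitrary illustrative values):

```python
import numpy as np

def softmax(f):
    """Map scores to a probability vector; shifting by max(f) avoids overflow."""
    e = np.exp(f - np.max(f))
    return e / e.sum()

def cross_entropy(f, y):
    """l_CE(f(x), y) = -log softmax(f)_y for one example with true class y."""
    return float(-np.log(softmax(f)[y]))

f = np.array([4.0, 1.0, 0.0])      # class 0 has the highest score
print(cross_entropy(f, 0))          # small loss: the correct class dominates
print(cross_entropy(f, 2))          # large loss: the correct class is far behind
```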

Key Connection: Cross-Entropy Generalizes the Logistic Loss

For $K = 2$, the cross-entropy loss reduces exactly to the logistic loss from Lecture 7. The logistic/sigmoid function $\sigma(z) = \frac{e^z}{1 + e^z}$ is the $K = 2$ special case of the softmax. Training with $K = 2$ cross-entropy is sometimes called logistic regression, while the general $K$-class version is called multinomial logistic regression.
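To see the reduction concretely, write $f(x) := \hat{f}_{+1}(x) - \hat{f}_{-1}(x)$ for the score difference, which is all the softmax depends on. Then

$$\ell_{\text{CE}}(\hat{f}(x),\, +1) = -\log \frac{e^{\hat{f}_{+1}(x)}}{e^{\hat{f}_{+1}(x)} + e^{\hat{f}_{-1}(x)}} = -\log \frac{1}{1 + e^{-f(x)}} = \log\left(1 + e^{-f(x)}\right),$$

and the analogous computation for $y = -1$ gives $\log(1 + e^{f(x)})$, so both cases combine into the logistic loss $\log(1 + e^{-y f(x)})$.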

from sklearn.linear_model import LogisticRegression
method = LogisticRegression(multi_class='multinomial')
method.fit(X, y)

An important practical point: the one-vs-rest and cross-entropy approaches generally lead to different decision boundaries, even for linear models on the same data.

3. Generalization for Classification

3.1 The Noise Model for Classification

Just like in regression, real-world labels are noisy. We assume there exists a ground-truth labeling function $y^*(x)$ that gives the "true" class. However, the labels we actually observe are noisy versions modeled as:

$$(y \mid x) = y^*(x) \cdot \varepsilon$$

where $\varepsilon \in \{-1, +1\}$ is a random noise variable that can flip the label. This is a multiplicative noise model (in contrast to the additive noise $Y = f^*(X) + \varepsilon$ used in regression). Importantly, $\varepsilon$ can depend on $x$ — for instance, points near the decision boundary are more likely to have their labels flipped.

3.2 Generalization Error in Classification

A good classifier should have a low generalization error (also called the expected 0-1 error):

Generalization Error (0-1 Loss)
$$L(\hat{f};\; \mathbb{P}_{X,Y}) = \mathbb{E}_{X,Y}[\ell_{0\text{-}1}(\hat{y}, y)] = \mathbb{P}_{X,Y}(\hat{y} \neq y)$$

This is the probability that our classifier misclassifies a randomly drawn test point.

Since we don't know the true distribution $\mathbb{P}_{X,Y}$, we estimate this error on a held-out test set $D_{\text{test}}$:

$$\widehat{L} = \frac{1}{|D_{\text{test}}|} \sum_{(x,y) \in D_{\text{test}}} \mathbb{1}\{\hat{y}(x) \neq y\}$$

This assumes the standard i.i.d. setting: training and test data are drawn from the same distribution. But what if they're not?
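The plug-in estimate above is a one-liner (the labels below are made-up values for illustration):

```python
import numpy as np

def empirical_error(y_hat, y):
    """Plug-in estimate of P(y_hat != y): fraction of misclassified test points."""
    return float(np.mean(y_hat != y))

y_test = np.array([+1, -1, +1, +1, -1])
y_hat  = np.array([+1, +1, +1, -1, -1])   # two of five points misclassified
err = empirical_error(y_hat, y_test)
print(err)   # 0.4
```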

4. Robust Generalization: When Test ≠ Train

4.1 Distribution Shift

In many real-world scenarios, the test distribution $\tilde{\mathbb{P}}$ differs from the training distribution $\mathbb{P}$. This can happen in two ways: (i) only the input distribution shifts while the relationship $x \to y$ stays the same (covariate shift), or (ii) the relationship itself changes.

Example — Sentiment Analysis Across Domains

A model trained to detect sentiment in emails might perform poorly on tweets, even though the underlying task (positive vs. negative sentiment) is the same. The vocabulary, sentence length, and style differ enough that the input distribution has shifted.

Example — ImageNet Classifiers

Research by Recht et al. (2019) showed that ImageNet classifiers consistently lose accuracy when tested on a carefully re-collected version of the test set. Models that scored, say, 76% on the original test set only managed around 64% on the new one — a significant and systematic drop.

4.2 Worst-Group Accuracy

A related problem occurs when the dataset contains a majority group and a minority group drawn from different sub-distributions. A classifier that maximizes average accuracy may learn to get all majority-group points right while completely failing on the minority group.

Instead of minimizing the average error over all groups combined, we can minimize the worst-group error:

$$\max\left\{\frac{1}{|g_1|}\sum_{(x,y) \in g_1} \mathbb{1}\{\hat{y}(x) \neq y\},\;\; \frac{1}{|g_2|}\sum_{(x,y) \in g_2} \mathbb{1}\{\hat{y}(x) \neq y\}\right\}$$

This gives a classifier that is reasonably accurate on every sub-group, even at the cost of slightly lower average accuracy. This idea connects closely to fairness in ML.
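The worst-group criterion is simple to evaluate. Here is a sketch with made-up labels and group assignments:

```python
import numpy as np

def worst_group_error(y_hat, y, groups):
    """Largest per-group 0-1 error; `groups` gives each point's sub-group id."""
    return max(float(np.mean(y_hat[groups == g] != y[groups == g]))
               for g in np.unique(groups))

# Hypothetical data: group 0 is the majority, group 1 the minority.
y      = np.array([+1, +1, +1, -1, -1, -1])
y_hat  = np.array([+1, +1, +1, -1, -1, +1])   # only mistake is in the minority group
groups = np.array([ 0,  0,  0,  0,  1,  1])
wge = worst_group_error(y_hat, y, groups)
print(wge)   # 0.5, even though the average error is only 1/6
```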

4.3 Fairness (Bonus)

There are several formal notions of what it means for a classifier to be "fair" with respect to a sensitive group attribute $G$:

| Notion | Informal Meaning | Formal Condition |
|---|---|---|
| Equalized Odds | All groups have similar TPR and FPR | $\mathbb{P}(\hat{Y}=1 \mid Y=y,\, G=g)$ equal for all $g$, both $y \in \{-1,+1\}$ |
| Demographic Parity | Equal fraction predicted positive across groups | $\mathbb{P}(\hat{Y}=1 \mid G=g)$ equal for all $g$ |
| Equal Opportunity | Same TPR (equivalently, FNR) across groups | $\mathbb{P}(\hat{Y}=1 \mid Y=1,\, G=g)$ equal for all $g$ |
| Worst-Group Error | Largest per-group error is small | $\sup_g \frac{1}{2}\sum_y \mathbb{P}(\hat{Y}=-y \mid Y=y,\, G=g)$ small |

There is no universally "correct" fairness notion — the right choice depends heavily on the application context. There is often a tradeoff between accuracy and fairness.

5. Asymmetric / Cost-Sensitive Losses

5.1 Why Treat Errors Differently?

The standard 0-1 loss counts every misclassification equally. But in many applications, one type of error is much worse than the other. Consider a Covid test from a public health perspective: labeling a sick patient as healthy (a false negative, i.e., a Type II error in the hypothesis-testing sense) is far more dangerous than labeling a healthy patient as sick (a false positive).

To reflect this, we use a weighted (asymmetric) loss:

Asymmetric Loss
$$L_{\text{asym}}(\hat{y};\; \mathbb{P}) = c_{\text{FP}} \cdot \text{FPR}(\hat{y}) + c_{\text{FN}} \cdot \text{FNR}(\hat{y})$$

where $c_{\text{FP}}, c_{\text{FN}} \geq 0$ control how much we penalize each error type.

5.2 The Confusion Matrix and Error Rates

All the error metrics stem from the four cells of the confusion matrix:

|  | True $y = -1$ | True $y = +1$ |
|---|---|---|
| Predicted $\hat{y} = -1$ | TN (True Negative) | FN (False Negative), Type II error |
| Predicted $\hat{y} = +1$ | FP (False Positive), Type I error | TP (True Positive) |

From these counts, we derive the key rates:

| Metric | Formula | Goal |
|---|---|---|
| FPR (False Positive Rate) | $\frac{\#\text{FP}}{\#\{y = -1\}}$ | Want small |
| FNR (False Negative Rate) | $\frac{\#\text{FN}}{\#\{y = +1\}}$ | Want small |
| TPR / Recall / Power | $\frac{\#\text{TP}}{\#\{y = +1\}} = 1 - \text{FNR}$ | Want large |
| Precision | $\frac{\#\text{TP}}{\#\{\hat{y} = +1\}}$ | Want large |
| FDR (False Discovery Rate) | $\frac{\#\text{FP}}{\#\{\hat{y} = +1\}} = 1 - \text{Precision}$ | Want small |

Intuition Check

Precision answers: "Among everything I predicted as positive, how many are actually positive?"
Recall (TPR) answers: "Among all actual positives, how many did I catch?"
These two are often in tension — raising one tends to lower the other.
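All of these rates fall out of the four confusion-matrix counts. A sketch with made-up labels (the helper `rates` is ours, not a library function):

```python
import numpy as np

def rates(y_hat, y):
    """Confusion-matrix counts and derived rates for labels in {-1, +1}."""
    tp = int(np.sum((y_hat == +1) & (y == +1)))
    fp = int(np.sum((y_hat == +1) & (y == -1)))
    fn = int(np.sum((y_hat == -1) & (y == +1)))
    tn = int(np.sum((y_hat == -1) & (y == -1)))
    return {
        "FPR": fp / (fp + tn),        # share of true negatives flagged positive
        "FNR": fn / (fn + tp),        # share of true positives missed
        "TPR": tp / (tp + fn),        # recall: 1 - FNR
        "Precision": tp / (tp + fp),  # correct among predicted positives
    }

y     = np.array([+1, +1, +1, +1, -1, -1, -1, -1, -1, -1])
y_hat = np.array([+1, +1, +1, -1, +1, -1, -1, -1, -1, -1])
m = rates(y_hat, y)   # TP=3, FN=1, FP=1, TN=5
```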

5.3 The Threshold Trick

Rather than retraining the model with different cost weights, we can keep the same model $\hat{f}$ and simply change the decision threshold $\tau$:

$$\hat{y}_\tau(x) = \begin{cases} +1 & \text{if } \hat{f}(x) > \tau \\ -1 & \text{if } \hat{f}(x) < \tau \end{cases}$$

Setting $\tau > 0$ means we require higher confidence before predicting $+1$, which reduces false positives but increases false negatives. Setting $\tau < 0$ does the reverse. The original classifier corresponds to $\tau = 0$.
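The thresholded rule is a one-liner; the scores below are made-up values of $\hat{f}(x)$:

```python
import numpy as np

def predict_threshold(scores, tau):
    """Predict +1 iff f(x) > tau; raising tau makes +1 predictions more conservative."""
    return np.where(scores > tau, +1, -1)

scores = np.array([-1.2, -0.3, 0.1, 0.4, 2.0])
print(predict_threshold(scores, 0.0))   # [-1 -1  1  1  1]
print(predict_threshold(scores, 0.5))   # [-1 -1 -1 -1  1]  (fewer positives)
```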

5.4 ROC Curves

The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between TPR and FPR as we vary the threshold $\tau$. For each value of $\tau$, we compute FPR$(\hat{y}_\tau)$ and TPR$(\hat{y}_\tau)$ and plot TPR (y-axis) against FPR (x-axis).

Reading an ROC Curve
  • Ideal classifier: top-left corner (TPR = 1, FPR = 0) — it gets everything right.
  • Random guessing: the diagonal line TPR = FPR — the classifier doesn't use $x$ at all.
  • Good model: curve bows toward the top-left, far above the diagonal.
  • As $\tau$ decreases (moving right along the curve), we classify more points as positive, so both TPR and FPR increase.

To summarize an ROC curve in a single number, we use the Area Under the ROC Curve (AUROC). It ranges from 0 to 1, with 1 being perfect and 0.5 being random guessing. This is useful when we want to compare models without committing to a specific threshold.
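With scikit-learn, the whole curve and its area come from `roc_curve` and `roc_auc_score`; the scores below are made-up values of $\hat{f}(x)$:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y      = np.array([-1, -1, -1, +1, +1, +1])
scores = np.array([-2.0, -0.5, 0.3, -0.1, 0.8, 1.5])   # hypothetical f(x) values

# One (FPR, TPR) point per candidate threshold tau, sweeping over the scores.
fpr, tpr, thresholds = roc_curve(y, scores)
auroc = roc_auc_score(y, scores)
print(auroc)   # 8/9: exactly one negative outscores one positive
```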

5.5 Precision-Recall Trade-off and the F1-Score

Sometimes we care specifically about performance on the positive class. We then look at the Precision-Recall curve (varying $\tau$) and summarize it with the F1-Score:

Definition — F1-Score
$$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$

This is the harmonic mean of precision and recall. It forces both to be large — if either is near zero, the F1-score is also near zero. This is deliberately stricter than a simple average, which could be gamed by having one metric be very high and the other very low.
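The definition translates directly into code (the helper name is ours, not a library function):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; the edge case 0/0 is defined as 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score_from(0.9, 0.9))    # 0.9: both metrics high
print(f1_score_from(1.0, 0.01))   # ~0.02: dragged down by the tiny recall
```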

6. Key Takeaways