Lecture 4: Optimization II

Course: Introduction to Machine Learning (IML), Lecture 4 · ETH Zürich · Prof. Fanny Yang

This lecture continues the optimization story from Lecture 3, asking: how fast does gradient descent converge, and can we do better? We first analyse the convergence rate of gradient descent for linear regression, discovering that speed depends on the condition number of the data matrix. We then explore practical alternatives — momentum, adaptive methods, and stochastic gradient descent — that address the limitations of vanilla GD. Finally, we introduce convexity and strong convexity, two properties that let us make powerful guarantees about whether gradient descent actually finds the best solution.

1. Convergence of Gradient Descent for Linear Regression

1.1 The Convergence Theorem

Recall from Lecture 3 that the gradient descent update for the (modified) linear regression loss $L(w) = \tfrac{1}{2}\|y - Xw\|_2^2$ is $w^{t+1} = w^t - \eta\,\nabla L(w^t)$. The central question is: under what conditions does this sequence of iterates $w^0, w^1, w^2, \ldots$ actually converge to the optimal solution $\hat{w}$?

Theorem 5.7 — Convergence of GD for Linear Regression

Consider the problem $\min_{w \in \mathbb{R}^d} L(w) = \tfrac{1}{2}\|y - Xw\|_2^2$. If $X^\top X$ is full-rank and the learning rate satisfies $\eta < \frac{2}{\lambda_{\max}(X^\top X)}$, then gradient descent converges linearly to $\hat{w} = \arg\min_w L(w)$.

"Linear convergence" is a somewhat confusing name from the optimisation literature. It actually means the error shrinks exponentially in the number of iterations — each step multiplies the distance to the optimum by a fixed factor $\rho < 1$:

$$\|w^{t} - \hat{w}\|_2 \;\leq\; \rho^{\,t}\;\|w^0 - \hat{w}\|_2$$

where $\rho = \|I - \eta X^\top X\|_{\text{op}}$. The key assumption — $X^\top X$ full-rank — is equivalent to requiring $\lambda_{\min}(X^\top X) > 0$, which holds when $n > d$ and the data points are "different enough" (no perfect linear dependencies among columns of $X$).

Sketch of Proof Idea

Using the normal equations $X^\top X\hat{w} = X^\top y$, the update becomes:

$$w^{t+1} - \hat{w} = (I - \eta X^\top X)(w^t - \hat{w})$$

Applying this recursively and taking norms gives $\|w^{t+1} - \hat{w}\|_2 \leq \rho^{t+1}\|w^0 - \hat{w}\|_2$ with $\rho = \|I - \eta X^\top X\|_{\text{op}}$. The eigenvalues of $I - \eta X^\top X$ are $1 - \eta\lambda_i$, so $\rho = \max\{|1 - \eta\lambda_{\min}|,\; |1 - \eta\lambda_{\max}|\}$. If $\eta < 2/\lambda_{\max}$, then $\rho < 1$ and convergence is guaranteed.
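The theorem is easy to verify numerically. The following sketch (synthetic data; NumPy assumed) runs GD on a small least-squares problem and checks the bound $\|w^t - \hat{w}\|_2 \leq \rho^t\,\|w^0 - \hat{w}\|_2$ at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# closed-form least-squares solution via the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

lams = np.linalg.eigvalsh(X.T @ X)        # eigenvalues, ascending
eta = 1.0 / lams[-1]                      # safely below 2 / lambda_max
rho = np.abs(1 - eta * lams).max()        # contraction factor

w = np.zeros(d)                           # w^0
for t in range(1, 101):
    w = w - eta * (X.T @ (X @ w - y))     # gradient of 0.5 * ||y - Xw||^2
    # error never exceeds the theoretical bound rho^t * ||w^0 - w_hat||
    assert np.linalg.norm(w - w_hat) <= rho**t * np.linalg.norm(w_hat) + 1e-9
```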

1.2 Convergence Speed and the Condition Number

Having established that GD converges, we now ask how fast. The contraction factor $\rho$ controls everything — the smaller $\rho$ is, the fewer iterations we need. The goal is to pick the learning rate $\eta$ that minimises $\rho$.

Optimal Learning Rate and Condition Number

The optimal constant learning rate for linear regression GD is:

$$\eta_{\text{opt}} = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$

which yields the minimal contraction factor:

$$\rho_{\min} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa - 1}{\kappa + 1}$$

where $\kappa := \lambda_{\max}/\lambda_{\min}$ is the condition number of $X^\top X$.
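These formulas can be sanity-checked numerically on a toy design matrix (synthetic data; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
lams = np.linalg.eigvalsh(X.T @ X)        # eigenvalues, ascending
lam_min, lam_max = lams[0], lams[-1]

kappa = lam_max / lam_min                 # condition number of X^T X
eta_opt = 2.0 / (lam_max + lam_min)       # optimal constant learning rate
rho_min = (kappa - 1) / (kappa + 1)       # minimal contraction factor

# at eta_opt the two candidate contraction terms coincide
rho_at_opt = max(abs(1 - eta_opt * lam_min), abs(1 - eta_opt * lam_max))
assert np.isclose(rho_min, rho_at_opt)
```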

The condition number $\kappa$ has a beautiful geometric interpretation. The loss $L(w)$ is a quadratic function whose contour lines (level sets) are ellipsoids in parameter space. The condition number describes the "shape" of these ellipsoids:

Figure 1: Contour plots of the loss for a well-conditioned problem (left, κ ≈ 1: few steps, direct path from w⁰) and an ill-conditioned one (right, κ ≫ 1: many steps, zigzag path). The orange path shows the GD trajectory.
Practical Caveat

Computing $\lambda_{\max}$ and $\lambda_{\min}$ exactly is itself expensive — often comparable to solving the original problem. In practice, $\eta$ is treated as a hyperparameter that we tune (e.g., via cross-validation), rather than compute from eigenvalues.

1.3 The Three Learning Rate Regimes

The relationship between $\eta$ and the contraction factor $\rho$ is piecewise linear. There are three regimes to keep in mind:

| Learning rate range | Contraction $\rho$ | Behaviour |
|---|---|---|
| $0 < \eta \leq \frac{1}{\lambda_{\max}}$ | $1 - \eta\lambda_{\min}$ | Converges, but may be slow |
| $\frac{1}{\lambda_{\max}} < \eta < \frac{2}{\lambda_{\max}}$ | $\max\{1 - \eta\lambda_{\min},\; \eta\lambda_{\max} - 1\}$ | Converges; sweet spot at $\eta_{\text{opt}}$ |
| $\eta \geq \frac{2}{\lambda_{\max}}$ | $\rho \geq 1$ | No convergence guarantee; may diverge |

2. Variants of Gradient Descent

Vanilla GD struggles with ill-conditioned landscapes: a low learning rate makes almost no progress in flat directions, while a high learning rate causes oscillations across steep directions. Several families of methods have been developed to address this.

2.1 Momentum-Based Methods

Inspired by physics, momentum methods add a fraction of the previous step to the current update. Think of a ball rolling downhill — it accumulates speed in consistent directions and dampens oscillations. The update rule is:

$$w^{t+1} = w^t + \alpha(w^t - w^{t-1}) - \eta\,\nabla L(w^t)$$

where typically $\alpha = \beta \cdot \eta$ with $\beta \in [0, 1)$. By expanding the recurrence, you can see that the influence of past gradients decays exponentially. Setting $\beta = 0$ recovers plain GD. The effect is that momentum dampens the zigzag oscillations in ill-conditioned landscapes and speeds up progress along consistently descending directions.
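A minimal sketch of this heavy-ball update on a toy ill-conditioned quadratic (the quadratic, step size, and momentum value are illustrative choices, not from the lecture):

```python
import numpy as np

def gd_momentum(grad, w0, eta, beta, steps=200):
    """Heavy-ball update: w^{t+1} = w^t + alpha (w^t - w^{t-1}) - eta grad(w^t)."""
    w_prev, w = w0.copy(), w0.copy()
    alpha = beta * eta                    # the lecture's parameterisation
    for _ in range(steps):
        w, w_prev = w + alpha * (w - w_prev) - eta * grad(w), w
    return w

# toy ill-conditioned quadratic L(w) = 0.5 w^T diag(1, 25) w, minimum at 0
H = np.diag([1.0, 25.0])
w = gd_momentum(lambda w: H @ w, np.array([1.0, 1.0]), eta=0.05, beta=0.9)
```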

2.2 Adaptive Methods

Adaptive methods use per-parameter learning rates that adjust based on the history of gradient updates for each coordinate. The intuition: if a coordinate has already changed a lot, use a smaller step; if it has barely moved, use a larger step. A basic version updates coordinate $i$ as:

$$w_i^{t+1} = w_i^t - \frac{\eta}{\sqrt{\delta_i^t + \gamma}}\;\frac{\partial L}{\partial w_i}(w^t)$$

where $\delta_i^t = (w_i^t - w_i^{t-1})^2$ captures how much coordinate $i$ changed recently, and $\gamma > 0$ prevents division by zero. Popular refinements include Adam, AdaGrad, and RMSProp.
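A sketch of this per-coordinate rule (the helper name `adaptive_step` and the numbers are hypothetical; NumPy assumed):

```python
import numpy as np

def adaptive_step(w, w_prev, grad, eta, gamma=1e-2):
    """One update of the basic adaptive rule above: coordinates that moved
    a lot recently (large delta) get smaller steps, and vice versa."""
    delta = (w - w_prev) ** 2             # recent squared change, per coordinate
    return w - eta / np.sqrt(delta + gamma) * grad

# coordinate 0 moved recently, coordinate 1 did not
w_new = adaptive_step(np.array([1.0, 1.0]), np.array([0.9, 1.0]),
                      np.array([2.0, -3.0]), eta=0.01)
```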

2.3 Second-Order Methods

A third class uses the Hessian (matrix of second derivatives) to get a richer picture of the local geometry. The Hessian tells us about curvature, letting us take smarter steps. However, computing the full $d \times d$ Hessian is $O(d^2)$ in storage and expensive to invert, so these methods are rarely used directly in large-scale ML.

3. Stochastic Gradient Descent (SGD)

3.1 Motivation: Computational Cost

In ML, the training loss is an average over $n$ data points: $L(w) = \frac{1}{n}\sum_{i=1}^n \ell(f_w(x_i), y_i)$. Each GD step computes the full gradient over all $n$ points, requiring $O(nd)$ memory and $O(n \times \text{cost of } \nabla_w \ell)$ computation per iteration. When $n$ is large (millions of images, say), this becomes prohibitive.

3.2 The SGD Algorithm

The key idea: instead of using all data points to compute the gradient, use a random subset (mini-batch) $S \subset \{1, \ldots, n\}$:

$$w^{t+1} = w^t - \eta\,\nabla L_S(w^t), \qquad \text{where } \nabla L_S(w) = \frac{1}{|S|}\sum_{i \in S} \nabla_w \ell(f_w(x_i), y_i)$$

When $|S| = 1$ (a single random data point), this is "pure" SGD. In practice, we almost always use mini-batches with $|S| > 1$.

Lemma 5.8 — SGD is Unbiased

If we select $k$ data points independently and uniformly at random from $\{1, \ldots, n\}$ to form $S$, then the expected value of the stochastic gradient equals the full gradient:

$$\mathbb{E}_S[\nabla L_S(w)] = \nabla L(w)$$

In other words, the stochastic gradient is an unbiased estimator of the true gradient. On average, SGD points in the right direction.

Proof Sketch

Let $I_1, \ldots, I_k$ be the randomly sampled indices. By linearity of expectation and the fact that each $I_j$ is uniform over $\{1,\ldots,n\}$:

$$\mathbb{E}_S[\nabla L_S(w)] = \frac{1}{k}\sum_{j=1}^k \mathbb{E}_{I_j}[\nabla_w \ell(f_w(x_{I_j}), y_{I_j})] = \frac{1}{k}\sum_{j=1}^k \frac{1}{n}\sum_{i=1}^n \nabla_w \ell(f_w(x_i), y_i) = \nabla L(w)$$
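The lemma can also be checked empirically by Monte Carlo: averaging many mini-batch gradients should approach the full gradient. A sketch with synthetic data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 20, 3, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def batch_grad(S):
    # gradient of (1/|S|) sum_{i in S} 0.5 (y_i - x_i^T w)^2
    r = X[S] @ w - y[S]
    return (X[S] * r[:, None]).mean(axis=0)

full_grad = batch_grad(np.arange(n))

trials = 20000
est = np.zeros(d)
for _ in range(trials):
    est += batch_grad(rng.integers(0, n, size=k))   # k uniform draws
est /= trials
# est should be close to full_grad (Lemma 5.8: the estimator is unbiased)
```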

3.3 Batch Size and the Variance–Computation Trade-off

Since SGD only sees a subset of the data at each step, the update direction is noisy: the loss may increase at some iterations, and the iterates follow an erratic path, unlike the smooth trajectory of full GD. The batch size $|S|$ controls the resulting trade-off: larger batches give lower-variance gradient estimates but cost more computation per step, while smaller batches are cheap per step but noisy.

Variance Can Be Helpful!

Noise in SGD isn't always a bad thing. For non-convex loss functions (like neural network losses), GD can get stuck in saddle points or bad local minima. The randomness of SGD can help escape these traps by nudging iterates out of flat regions, thanks to the stochastic fluctuations in the update direction.

3.4 Practical Details: Epochs and Shuffling

In practice, rather than sampling with replacement at every step, we partition the training set into mini-batches $S_1, \ldots, S_b$ and cycle through them. One pass through the entire dataset is called an epoch. The data is typically reshuffled at the start of each epoch so that a fixed ordering of mini-batches does not bias or slow convergence.
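This epoch structure can be sketched as follows (toy data and illustrative hyperparameters; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, batch, eta = 32, 4, 8, 0.01
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = np.zeros(d)

for epoch in range(50):
    perm = rng.permutation(n)             # reshuffle at the start of each epoch
    for start in range(0, n, batch):
        S = perm[start:start + batch]     # disjoint mini-batches cover the data
        r = X[S] @ w - y[S]
        w -= eta * (X[S].T @ r) / len(S)  # mini-batch gradient step
```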

4. Convexity

Everything so far has been about how to optimise. We now turn to a fundamental question: even if GD converges to a stationary point (where $\nabla L = 0$), is that point actually a global minimum? For general functions, the answer is no — we could land on a saddle point or a local minimum. Convexity is the property that guarantees every stationary point is globally optimal.

4.1 Definition and Conditions for Convexity

Definition 5.9 — Convexity

A set $W \subseteq \mathbb{R}^d$ is convex if it contains the line segment between any two of its points. A function $f: W \to \mathbb{R}$ (defined on a convex set) is convex if for all $w, v \in W$ and all $\lambda \in [0,1]$:

$$f(\lambda w + (1-\lambda)v) \;\leq\; \lambda\, f(w) + (1-\lambda)\,f(v)$$

Geometrically: the line segment connecting any two points on the graph of $f$ lies above or on the graph.

This "zeroth-order" condition can be hard to verify directly. For differentiable functions, two equivalent characterisations are more practical:

First-order condition: $f$ is convex if and only if the tangent hyperplane (first-order Taylor approximation) at every point lies below the function. Formally, for all $w, v$:

$$f(v) \;\geq\; f(w) + \langle \nabla f(w),\; v - w\rangle$$

Second-order condition: $f$ is convex if and only if its Hessian is positive semi-definite everywhere: $D^2 f(w) \succeq 0$ for all $w$. Intuitively, the function has non-negative curvature in every direction.

Example: Linear Regression Loss is Convex

For $L(w) = \|y - Xw\|_2^2$, the Hessian is $\nabla^2 L(w) = 2X^\top X$, which is always positive semi-definite (since $v^\top X^\top X v = \|Xv\|_2^2 \geq 0$ for any $v$). Therefore the squared loss for linear regression is convex for any design matrix $X$.
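This computation is easy to confirm numerically for a random design matrix (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 6))
H = 2 * X.T @ X                           # Hessian of ||y - Xw||^2 (constant in w)

eigs = np.linalg.eigvalsh(H)
v = rng.standard_normal(6)
assert eigs.min() >= -1e-10               # all eigenvalues non-negative (PSD)
assert np.isclose(v @ H @ v, 2 * np.linalg.norm(X @ v) ** 2)
```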

4.2 Useful Rules for Proving Convexity

Rather than checking the definition from scratch every time, we can build convex functions from simpler ones using operations that preserve convexity. If $f$ and $g$ are convex, so are: non-negative weighted sums $\alpha f + \beta g$ (for $\alpha, \beta \geq 0$), compositions with affine maps $w \mapsto f(Aw + b)$, and pointwise maxima $\max\{f(w), g(w)\}$.

4.3 Optimality of Stationary Points

This is the payoff of convexity for optimisation:

Theorem 5.14 — Stationary Points of Convex Functions Are Global Minima

Let $f: W \to \mathbb{R}$ be convex and differentiable on a convex open set $W$. If $w^*$ is a stationary point ($\nabla f(w^*) = 0$), then $w^*$ is a global minimum of $f$.

Proof (short and elegant)

Since $\nabla f(w^*) = 0$, plug into the first-order convexity condition: for all $v \in W$,

$$f(v) \geq f(w^*) + \underbrace{\langle \nabla f(w^*), v - w^*\rangle}_{= 0} = f(w^*)$$

So $f(v) \geq f(w^*)$ for every $v$ — that's exactly the definition of a global minimum.

Convexity ≠ Uniqueness

Convexity guarantees that any stationary point is a global minimum, but there may be multiple global minima achieving the same value. Think of a flat-bottomed bowl: every point on the bottom is a global minimum, but there's no unique one.

5. Strong Convexity

To guarantee a unique global minimum, we need something stronger than convexity.

Definition 5.15 — Strong Convexity

A function $f: W \to \mathbb{R}$ is $m$-strongly convex (with $m > 0$) if for all $w, v \in W$ and $\lambda \in [0,1]$:

$$f(\lambda w + (1-\lambda)v) \leq \lambda f(w) + (1-\lambda)f(v) - \frac{m}{2}\lambda(1-\lambda)\|w - v\|_2^2$$

Equivalently (and more intuitively): $f$ is $m$-strongly convex if and only if $f(w) - \frac{m}{2}\|w\|^2$ is convex. In words: $f$ is at least as curved as a quadratic.

The equivalent conditions mirror those for regular convexity, with extra quadratic terms: the first-order condition becomes $f(v) \geq f(w) + \langle \nabla f(w), v - w\rangle + \frac{m}{2}\|v - w\|_2^2$, and the second-order condition becomes $D^2 f(w) \succeq mI$ for all $w$.

Theorem 5.16 — Uniqueness of Global Minimum

If $f$ is strongly convex and differentiable, and $w^*$ is a stationary point, then $w^*$ is the unique global minimizer.

Application to Linear Regression

For $L(w) = \|y - Xw\|_2^2$, the Hessian is $2X^\top X$. If $X^\top X$ is full rank (i.e., $\lambda_{\min}(X^\top X) > 0$), then $D^2 L(w) = 2X^\top X \succeq 2\lambda_{\min} I \succ 0$. So $L$ is strongly convex, and the least-squares solution $\hat{w}$ is the unique global minimum. This is the usual case when $n > d$ and the data are non-degenerate.
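A numerical sketch of this application, checking $\lambda_{\min} > 0$ and the strong-convexity lower bound at the least-squares solution (synthetic data; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 4                              # n > d: the non-degenerate case
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lam_min = np.linalg.eigvalsh(X.T @ X)[0]
assert lam_min > 0                        # X^T X full rank => strongly convex

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # the unique minimiser
L = lambda w: np.sum((y - X @ w) ** 2)

# with m = 2 lam_min: L(v) >= L(w*) + (m/2) ||v - w*||^2 for any v
v = w_star + rng.standard_normal(d)
assert L(v) >= L(w_star) + lam_min * np.linalg.norm(v - w_star) ** 2 - 1e-9
```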

5.1 The Big Picture: Stationary Points and Minima

Putting it all together, here's the hierarchy of guarantees for differentiable functions:

| Property of $f$ | Stationary point ($\nabla f = 0$) is… | Uniqueness? |
|---|---|---|
| General (no convexity) | Could be a local min, local max, or saddle point | No |
| Convex | Always a global minimum | Not guaranteed |
| Strongly convex | Always a global minimum | Yes (unique) |

Key Takeaways