This lecture addresses two central questions: how do we go beyond linear models to fit nonlinear patterns in data, and how do we find optimal parameters when closed-form solutions are unavailable or too expensive? The first half introduces nonlinear regression via feature maps, showing that seemingly complex nonlinear models can still be solved with the linear regression machinery from Lecture 2 by transforming the inputs. The second half develops gradient descent from scratch—starting with the intuitive 1D case and building up to the full multi-dimensional algorithm, including why the negative gradient is the direction of steepest descent.
1. Nonlinear Regression
1.1 From Linear to Nonlinear via Feature Maps
In Lecture 2 we fitted linear functions $f(x) = \langle w, x \rangle$ to data. But real-world relationships are often nonlinear—think of how house prices might depend on size in a curved, not straight-line, way. The key insight of this lecture is that we can capture nonlinear patterns while still using the same linear regression framework, simply by transforming the inputs first.
The idea is to define a feature map $\varphi: \mathbb{R}^d \to \mathbb{R}^p$ that takes the raw input $x$ and produces a new vector of features $\varphi(x) = (\varphi_1(x), \dots, \varphi_p(x))$. Each $\varphi_j$ can be any function of the full input vector. Our model then becomes:

$$f(x) = \langle w, \varphi(x) \rangle = \sum_{j=1}^{p} w_j \, \varphi_j(x).$$
This is linear in the parameters $w$ (which is what matters for optimization), but can be highly nonlinear in the input $x$. The entire class of functions we can represent this way is:
Given a feature map $\varphi: \mathbb{R}^d \to \mathbb{R}^p$, the induced function class is:

$$\mathcal{F}_\varphi = \left\{ f: \mathbb{R}^d \to \mathbb{R} \;\middle|\; f(x) = \langle w, \varphi(x) \rangle, \; w \in \mathbb{R}^p \right\}.$$
The functions $\varphi_1, \dots, \varphi_p$ are often called basis functions.
For a single input ($d = 1$), the polynomial feature map of degree $p$ is $\varphi(x) = (1, x, x^2, \dots, x^p)$, a vector with $p+1$ entries. Then:

$$f(x) = \langle w, \varphi(x) \rangle = w_0 + w_1 x + w_2 x^2 + \dots + w_p x^p.$$
This is a polynomial of degree $p$—clearly nonlinear in $x$—but it's still just a dot product $\langle w, \varphi(x) \rangle$, so it's linear in $w$.
Multi-dimensional polynomials ($d > 1$): features include all monomials up to some degree, e.g. $\varphi(x) = (1, x_{[1]}, x_{[2]}, x_{[1]}x_{[2]}, x_{[1]}^2, x_{[2]}^2, \dots)$.
Trigonometric (Fourier) features: $\varphi(x) = (1, \sin(2\pi x), \sin(4\pi x), \dots, \sin(2\pi p x))$. These are useful for periodic patterns.
The phrase "linear regression" refers to linearity in the parameters, not in the inputs. By choosing a rich enough feature map $\varphi$, we can model highly nonlinear relationships while still using the same closed-form solution and gradient-based methods from linear regression. Some ML models (like polynomial or kernel regression) use fixed feature maps that you choose ahead of time, while others (like neural networks) learn the feature map from data.
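As a concrete illustration, here is a minimal numpy sketch of the polynomial feature map and the resulting model (the helper name `poly_features` and the sample values are ours, chosen for illustration):

```python
import numpy as np

def poly_features(x, degree):
    """Polynomial feature map phi(x) = (1, x, x^2, ..., x^degree) for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(degree + 1)], axis=-1)

x = np.array([0.0, 1.0, 2.0])
Phi = poly_features(x, degree=3)      # rows are phi(x_i), shape (3, 4)
w = np.array([1.0, 0.0, -2.0, 0.5])   # some weight vector
preds = Phi @ w                       # f(x_i) = <w, phi(x_i)>: nonlinear in x, linear in w
```

Swapping `poly_features` for any other feature map (e.g. sines for the Fourier case) leaves the rest of the pipeline unchanged, which is exactly the point of the feature-map view.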
1.2 Training Loss in Matrix-Vector Notation
Just as we assembled the data matrix $X$ for linear regression, we now assemble a feature matrix $\Phi \in \mathbb{R}^{n \times p}$, where each row is the feature vector of one training sample:

$$\Phi = \begin{pmatrix} \varphi(x_1)^\top \\ \vdots \\ \varphi(x_n)^\top \end{pmatrix} = \begin{pmatrix} \varphi_1(x_1) & \cdots & \varphi_p(x_1) \\ \vdots & \ddots & \vdots \\ \varphi_1(x_n) & \cdots & \varphi_p(x_n) \end{pmatrix}.$$
The training loss (squared loss) can then be written compactly as:

$$L(w) = \frac{1}{n} \| y - \Phi w \|_2^2.$$
This is exactly the same form as the linear regression loss $\frac{1}{n}\|y - Xw\|_2^2$ but with $X$ replaced by $\Phi$. That means all the machinery from Lecture 2 carries over directly.
1.3 Closed-Form Solution for Nonlinear Regression
The minimizer of the squared loss over the function class $\mathcal{F}_\varphi$ is given by:

$$\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y,$$
and the resulting predictor is $\hat{f}(x) = \langle \hat{w}, \varphi(x) \rangle$.
This is obtained by simply substituting $\Phi$ for $X$ in the normal equations from Lecture 2. The derivation is identical: we set the gradient to zero and solve $\Phi^\top \Phi \, \hat{w} = \Phi^\top y$.
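A small numpy sketch of the closed-form fit, using an assumed sinusoidal target for illustration; `np.linalg.lstsq` solves the normal equations more stably than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)   # a nonlinear target with noise

degree = 5
Phi = np.stack([x**j for j in range(degree + 1)], axis=-1)  # feature matrix, one row per sample

# Solve Phi^T Phi w = Phi^T y via least squares
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
train_mse = np.mean((y - Phi @ w_hat) ** 2)
```

The fit is computed entirely with the linear regression machinery; only the construction of `Phi` knows about the nonlinearity.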
However, the closed-form solution has two important drawbacks:

- Computational cost: computing $(\Phi^\top \Phi)^{-1}$ requires $O(p^3 + np^2)$ operations. When the number of features $p$ is large (or the number of samples $n$ is large), this becomes prohibitively expensive.
- Lack of generality: the closed form only works for the squared loss. For other loss functions (e.g. absolute loss, Huber loss), there is typically no neat formula for $\hat{w}$.
This motivates the need for iterative optimization methods, which is the focus of the rest of this lecture.
2. Gradient Descent: The 1D Warm-Up
Before jumping to the general multi-dimensional case, the lecture builds intuition by considering a loss function $L(w)$ of a single variable $w \in \mathbb{R}$. The goal is to find the value of $w$ that minimizes $L$. Instead of finding a formula, we will iteratively walk towards the minimum.
2.1 Which Direction to Move?
At a current point $w^{\text{now}}$, we ask: should we move left or right to decrease $L$? The answer comes from the derivative $L'(w^{\text{now}})$:
- If $L'(w^{\text{now}}) > 0$, the function is increasing at that point, so we should move left (decrease $w$).
- If $L'(w^{\text{now}}) < 0$, the function is decreasing, so we should move right (increase $w$).
In both cases, the correct direction is opposite to the sign of the derivative. The update rule is:

$$w^{t+1} = w^t - \eta \, L'(w^t),$$
where $\eta > 0$ is the learning rate (or step size).
2.2 How Far to Move? The Step Size $\eta$
The learning rate $\eta$ controls how big each step is. Getting it right is critical:
- Too small: convergence is painfully slow—you inch towards the minimum but take forever to get there.
- Too large: the iterates "overshoot" the minimum and can bounce back and forth or even diverge.
The sweet spot is a step size that is small enough that the loss decreases at each step, but large enough that progress is made in a reasonable number of iterations.
2.3 When to Stop?
A natural stopping criterion is to check whether we are close to a stationary point (where $L'(w) = 0$). In practice, we stop when the change in the loss value becomes negligible:

$$|L(w^{t+1}) - L(w^t)| \leq \epsilon$$
for some small threshold $\epsilon > 0$. Alternatively, we can check if $|w^{t+1} - w^t|$ is small.
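Putting the update rule and stopping criterion together gives a complete 1D routine; a minimal sketch, with the function name, tolerance, and test problem chosen by us:

```python
def gradient_descent_1d(dL, w0, eta, tol=1e-8, max_iter=100_000):
    """Minimize a 1-D loss given its derivative dL, stopping when steps become tiny."""
    w = w0
    for _ in range(max_iter):
        w_next = w - eta * dL(w)
        if abs(w_next - w) <= tol:   # stopping criterion: |w^{t+1} - w^t| small
            return w_next
        w = w_next
    return w

# L(w) = (w - 3)^2 has derivative 2(w - 3) and its minimum at w = 3
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0, eta=0.1)
```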
3. Gradient Descent in Multiple Dimensions
3.1 The General Iterative Framework
For a loss $L(w)$ with $w \in \mathbb{R}^d$, we cannot visualize the landscape when $d > 2$, but the same principles apply. The generic template for iterative optimization is: start with an initial guess $w^0$, and repeatedly update $w^{t+1} = w^t + \tilde{\eta}_t \, v^t$ where $v^t$ is an update direction and $\tilde{\eta}_t$ is a step size, until some stopping criterion is met.
The central question becomes: which direction $v$ should we move in? Since we want to decrease $L$ as much as possible, we want the direction of steepest descent.
3.2 Why the Negative Gradient Is the Steepest Descent Direction
The lecture derives this key result in two complementary ways.
Argument 1: Orthogonality to Contour Lines
When $d = 2$, we can visualize the loss as a surface. The contour lines (level sets) are curves along which $L$ has a constant value—like elevation lines on a topographic map. The direction of steepest ascent must be perpendicular to these contour lines, because moving along a contour line doesn't change $L$ at all. The gradient $\nabla L(w)$ is exactly the vector perpendicular to the contour line at $w$, pointing uphill. Therefore, $-\nabla L(w)$ points in the direction of steepest descent.
Argument 2: Linear Approximation (Taylor Expansion)
More rigorously, using the first-order Taylor expansion, the loss at a nearby point $w^{\text{now}} + \tilde{\eta} \, v$ is approximately:

$$L(w^{\text{now}} + \tilde{\eta} \, v) \approx L(w^{\text{now}}) + \tilde{\eta} \, \langle \nabla L(w^{\text{now}}), v \rangle.$$
To decrease $L$ the most, we need to minimize $\langle \nabla L(w^{\text{now}}), v \rangle$ over all unit vectors $v$ (i.e., $\|v\|_2 = 1$). By the Cauchy-Schwarz inequality, $\langle \nabla L, v \rangle \geq -\|\nabla L\|_2 \|v\|_2 = -\|\nabla L\|_2$, with equality exactly when $v = -\nabla L / \|\nabla L\|_2$.
Let $L$ be a differentiable function. The negative gradient of $L$ is the direction of locally steepest descent:

$$-\frac{\nabla L(w)}{\|\nabla L(w)\|_2} = \operatorname*{arg\,min}_{v : \|v\|_2 = 1} \langle \nabla L(w), v \rangle.$$
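The Cauchy-Schwarz argument can be sanity-checked numerically: sample many random unit vectors and confirm that none achieves a smaller inner product with the gradient than the normalized negative gradient does (the stand-in gradient `g` here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(5)              # stand-in for a gradient vector
v_star = -g / np.linalg.norm(g)         # claimed steepest-descent direction

# Sample many random unit vectors and compare inner products with g
V = rng.standard_normal((10_000, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)
best_random = (V @ g).min()
# <g, v_star> equals -||g||_2, the Cauchy-Schwarz lower bound; no sample beats it
```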
3.3 The Gradient Descent Algorithm
Combining the steepest descent direction with a step size $\eta > 0$, we get the gradient descent update rule. The unnormalized version (which is standard) is:

$$w^{t+1} = w^t - \eta \, \nabla L(w^t).$$
This is convenient because near a stationary point (where $\nabla L \approx 0$), the step length $\eta \|\nabla L(w^t)\|_2$ automatically shrinks—the algorithm naturally takes smaller steps as it approaches a minimum, reducing the risk of overshooting.
Input: initial guess $w^0$, learning rate $\eta > 0$, tolerance $\epsilon > 0$
Repeat: $w^{t+1} = w^t - \eta \, \nabla L(w^t)$, incrementing $t$
Until: $\|w^t - w^{t-1}\|_2 \leq \epsilon$
Output: $\hat{w} = w^t$
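The algorithm box translates directly into code; a minimal sketch (the quadratic-bowl test problem and default tolerances are our choices):

```python
import numpy as np

def gradient_descent(grad, w0, eta, tol=1e-8, max_iter=100_000):
    """Plain gradient descent: w^{t+1} = w^t - eta * grad(w^t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_next = w - eta * grad(w)
        if np.linalg.norm(w_next - w) <= tol:   # stop when the step is tiny
            return w_next
        w = w_next
    return w

# Quadratic bowl L(w) = ||w - c||_2^2 with gradient 2(w - c), minimized at c
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2.0 * (w - c), w0=np.zeros(2), eta=0.1)
```

Only the gradient function changes from problem to problem; the loop itself is identical for any differentiable loss.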
3.4 Descent Direction Property
A crucial guarantee is that the gradient descent update actually decreases the loss at each step (for a small enough $\eta$). Using the Taylor expansion:

$$L(w^{t+1}) = L\big(w^t - \eta \nabla L(w^t)\big) \approx L(w^t) - \eta \, \|\nabla L(w^t)\|_2^2.$$
Since $\|\nabla L(w^t)\|_2^2 > 0$ (as long as we're not already at a stationary point) and $\eta > 0$, the loss strictly decreases. The approximation is valid for small enough $\eta$ where the higher-order remainder term from the Taylor expansion is negligible.
For any $w^0 \in \mathbb{R}^d$ and any number of iterations $T$, there exists a step size $\eta > 0$ such that gradient descent satisfies $L(w^{t+1}) < L(w^t)$ as long as $\nabla L(w^t) \neq 0$ and $t \leq T$.
The choice of $\eta$ is critical. Too large and the iterates diverge; too small and convergence takes an impractical number of steps. In practice, finding a good learning rate often requires experimentation or adaptive methods (covered in Lecture 4).
4. Applying Gradient Descent to Linear Regression
To make gradient descent concrete, the lecture applies it to the familiar linear regression loss. Consider the (rescaled) problem:

$$\min_{w \in \mathbb{R}^d} L(w), \qquad L(w) = \frac{1}{2} \| y - X w \|_2^2.$$
4.1 Computing the Gradient
Using the chain rule (or direct computation), the gradient of this loss is:

$$\nabla L(w) = -X^\top (y - Xw) = X^\top X w - X^\top y.$$
So the gradient descent update becomes:

$$w^{t+1} = w^t + \eta \, X^\top (y - X w^t).$$
Each step adds a correction proportional to $X^\top(y - Xw^t)$, which is the data matrix $X$ transposed times the residuals (the gap between the true $y$ and the current predictions $Xw^t$). Where the residuals are large, the correction is large.
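A runnable sketch of this update on synthetic data (the problem sizes, seed, and noise level are our illustrative choices); after enough iterations the iterate agrees with the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)

eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # safely inside the stable range (< 2/lambda_max)
w = np.zeros(d)
for _ in range(500):
    w = w + eta * X.T @ (y - X @ w)             # step along X^T times the residuals
```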
4.2 Convergence Guarantee
For linear regression, we can prove that gradient descent converges to the exact optimal solution—not just to a nearby point. The key insight is to track how far $w^t$ is from the optimum $\hat{w}$:

$$w^{t+1} - \hat{w} = (I - \eta X^\top X)(w^t - \hat{w}).$$
This means the error vector is multiplied by the matrix $(I - \eta X^\top X)$ at every step. The error shrinks if and only if this matrix has operator norm less than 1, which happens when $\eta$ is in the right range.
If $X^\top X$ is full-rank and the learning rate satisfies $\eta < 2 / \lambda_{\max}(X^\top X)$, then gradient descent converges linearly to the optimal solution:

$$\| w^t - \hat{w} \|_2 \leq \rho^t \, \| w^0 - \hat{w} \|_2,$$
where $\rho = \|I - \eta X^\top X\|_{\text{op}} < 1$.
Confusingly, "linear convergence" in optimization does not mean the error decreases linearly like $C - t$. It means the error is multiplied by a constant $\rho < 1$ at each step: $\|w^t - \hat{w}\| \leq C \rho^t$. Since $\rho^t$ decreases exponentially in $t$, the convergence is actually exponentially fast. The name comes from the fact that $\log(\text{error})$ decreases linearly with $t$.
The convergence factor is $\rho = \max\{|1 - \eta\lambda_{\min}|, \, |1 - \eta\lambda_{\max}|\}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of $X^\top X$. The optimal step size is $\eta_{\text{opt}} = \frac{2}{\lambda_{\max} + \lambda_{\min}}$, which yields $\rho_{\min} = \frac{\kappa - 1}{\kappa + 1}$ where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. When $\kappa \approx 1$ (well-conditioned), convergence is fast. When $\kappa \gg 1$ (ill-conditioned), convergence is slow—even with the best step size.
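These formulas can be verified on a toy diagonal surrogate for $X^\top X$ (our construction, chosen so the eigenvalues are explicit): with the optimal step size, the error norm contracts by exactly $\rho = \frac{\kappa-1}{\kappa+1}$ per step.

```python
import numpy as np

# Diagonal stand-in for X^T X with eigenvalues lmin and lmax
lmin, lmax = 1.0, 100.0
H = np.diag([lmin, lmax])
kappa = lmax / lmin
eta_opt = 2.0 / (lmax + lmin)
rho = (kappa - 1.0) / (kappa + 1.0)   # predicted contraction factor per step

w = np.array([1.0, 1.0])              # error vector w^t - w_hat; optimum is at 0
T = 50
for _ in range(T):
    w = (np.eye(2) - eta_opt * H) @ w  # the error-recursion matrix from the proof
# for this diagonal H, both entries of (I - eta_opt H) have magnitude exactly rho
```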
5. Geometric Intuition: Contour Lines and Conditioning
For the squared loss of linear regression (which is a quadratic function of $w$), the contour lines form ellipsoids. The shape of these ellipsoids is determined by the eigenvalues of $X^\top X$:
- Well-conditioned ($\kappa \approx 1$): the contour ellipsoids are nearly circular. The loss changes at roughly the same rate in all directions, so gradient descent can take large steps and converges quickly.
- Ill-conditioned ($\kappa \gg 1$): the contour ellipsoids are highly elongated. The loss is very steep in some directions but very flat in others. Gradient descent oscillates across the steep direction while making slow progress along the flat direction—a classic zig-zag pattern.
In the ill-conditioned case, a small step size means painfully slow progress in the flat direction, while a large step size causes wild oscillations across the steep direction. Even the optimal step size leads to slow convergence when the condition number is large.
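The zig-zag pattern is easy to reproduce on an assumed quadratic $L(w) = \frac{1}{2} w^\top H w$ with one steep and one flat direction (our toy example):

```python
import numpy as np

# Ill-conditioned quadratic: flat along w1 (lambda = 1), steep along w2 (lambda = 100)
H = np.diag([1.0, 100.0])
eta = 0.019                      # just below the stability limit 2/lambda_max = 0.02
w = np.array([1.0, 1.0])
path = [w.copy()]
for _ in range(5):
    w = w - eta * (H @ w)        # gradient of (1/2) w^T H w is H w
    path.append(w.copy())
path = np.array(path)
# the steep coordinate flips sign every step (zig-zag) while the flat one barely moves
```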
Key Takeaways
- Feature maps turn linear into nonlinear: By applying a fixed transformation $\varphi(x)$ to the inputs, we can fit nonlinear functions $f(x) = \langle w, \varphi(x) \rangle$ using the exact same linear regression machinery—just replace the data matrix $X$ with the feature matrix $\Phi$.
- Closed-form solutions don't always scale: While $\hat{w} = (\Phi^\top\Phi)^{-1}\Phi^\top y$ gives the exact answer, the $O(p^3)$ cost of matrix inversion makes it impractical for large feature dimensions, and it only works for squared loss. This motivates iterative methods.
- Gradient descent finds minima by following the steepest downhill direction: The negative gradient $-\nabla L(w)$ is the direction of locally steepest descent, derived either geometrically (perpendicular to contour lines) or analytically (via Taylor approximation and Cauchy-Schwarz).
- The learning rate $\eta$ is the most important hyperparameter: Too small means slow convergence; too large means divergence. For linear regression, convergence is guaranteed when $\eta < 2/\lambda_{\max}(X^\top X)$.
- Convergence speed depends on the condition number: $\kappa = \lambda_{\max}/\lambda_{\min}$ controls how elongated the loss contours are. Well-conditioned problems ($\kappa \approx 1$) converge fast; ill-conditioned ones ($\kappa \gg 1$) converge slowly. This geometric insight foreshadows the need for more advanced optimizers (Lecture 4).
- Gradient descent is a descent method: For small enough $\eta$, the loss strictly decreases at every iteration. This is proven via Taylor expansion: the decrease is approximately $\eta\|\nabla L\|^2$, which is positive whenever $\nabla L \neq 0$.