This lecture walks through the entire supervised learning pipeline using the motivating example of housing price prediction. The central method introduced is linear regression — the simplest machine learning method — which combines a linear function class with the squared loss and admits a closed-form solution. Starting with the one-dimensional case and building up to multiple dimensions, the lecture derives the optimal parameters via stationary point conditions and a geometric projection argument, and briefly surveys alternative loss functions for situations where the squared loss is not ideal.
1. The Supervised Learning Pipeline
The lecture uses a concrete scenario to motivate the entire ML workflow: you want to sell your house and need to estimate a fair average market price. Since the target (price) is a continuous real number and you have historical data of houses sold with their attributes and prices, this is a regression task within supervised learning.
The pipeline consists of three steps:
Step I: Collect Training Data
Gather $n$ historical examples. Each example $i$ consists of input attributes $x_i$ (e.g. house size, number of bathrooms) and a corresponding output $y_i$ (sale price). We can't feed physical houses into a computer, so we represent each house as a numerical vector of its attributes.
Step II: Learn a Model
Use an ML method to efficiently find a predictor $\hat{f}$ from a chosen function class $\mathcal{F}$ that "fits" the training data well. This step requires the ML scientist to make three design choices:
1. Function class $\mathcal{F}$ — a parameterised set of candidate functions (e.g. linear functions).
2. Training loss $L(f)$ — the average loss of a predictor $f \in \mathcal{F}$ over the training set:
$$L(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i),$$
where $\ell(f(x), y)$ is a pointwise loss measuring the gap between prediction and truth.
3. Optimisation method — a procedure to (approximately) minimise $L$ and output the final predictor.
Step III: Predict
Feed the attributes of your (new, unseen) house into the learned model $\hat{f}$ and obtain a predicted average price $\hat{f}(x_{\text{new}})$.
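The three steps above can be sketched in a few lines of Python. This is a minimal illustration only: it reuses the five houses that appear later in Section 3.3, and the grid search over $(w_0, w_1)$ is a crude stand-in for the closed-form optimiser derived below.

```python
# Step I: collect training data (size in m^2 -> price in millions CHF).
train = [(120, 1.5), (200, 3.0), (130, 1.0), (135, 0.7), (80, 0.7)]

def training_loss(w0, w1):
    """Average squared loss of the line f(x) = w0 + w1*x on the training set."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in train) / len(train)

# Step II: learn a model -- a crude grid search over (w0, w1), standing in
# for the optimisation method (linear regression has a closed form).
candidates = [(w0 / 10, w1 / 1000) for w0 in range(-20, 21) for w1 in range(0, 31)]
w0_hat, w1_hat = min(candidates, key=lambda p: training_loss(*p))

# Step III: predict the price of a new, unseen house.
x_new = 150
price_hat = w0_hat + w1_hat * x_new
```

The point is the shape of the pipeline, not the optimiser: later sections replace the grid search with an exact solution.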
2. Supervised Learning Terminology
Attributes / covariates: The raw inputs $x \in \mathbb{R}^d$ describing a single example (e.g. house size, location).
Features: A (possibly nonlinear) transformation $\phi(x) \in \mathbb{R}^p$ of the raw attributes. In plain linear regression, features and attributes are the same thing.
Outputs / targets / labels: The value $y \in \mathbb{R}$ we want to predict (e.g. price).
Training set: The collection of $(x_i, y_i)$ pairs we learn from.
Test set: A separate collection of $(x_i, y_i)$ pairs, disjoint from training, used to evaluate how well the model generalises.
Predictor / model: A function $f: \mathbb{R}^p \to \mathbb{R}$ mapping inputs to outputs.
3. Linear Regression in One Dimension
The simplest setting: each house is described by a single attribute $x_i \in \mathbb{R}$ (e.g. size in m²) and we want to predict its price $y_i$. We plot the training data as points in the $(x, y)$-plane and look for a "good" line through them.
3.1 The Linear Function Class
The set of all affine functions of a single input:
$$\mathcal{F} = \{ f : f(x) = w_0 + w_1 x,\ \ w_0, w_1 \in \mathbb{R} \}.$$
Here $w_1$ is the slope and $w_0$ is the intercept (bias).
Every choice of $(w_0, w_1)$ gives a different line. Since there are infinitely many candidates, we can't simply try them all — we need a principled way to pick the best one.
3.2 The Squared Loss
To measure how well a candidate function $f$ fits the data, we use the squared loss:
$$\ell(f(x), y) = (y - f(x))^2.$$
The training loss (average over all $n$ points) is then:
$$L(f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.$$
Visually, each term $(y_i - f(x_i))^2$ is the squared vertical distance from a data point to the line. Summing these up gives the total "cost" of a particular line — our goal is to find the $(w_0, w_1)$ that makes this sum as small as possible.
The squared loss is convenient because it is smooth (differentiable everywhere), which lets us use calculus to find the minimum. It also penalises large errors more than small ones because of the quadratic growth. However, this sensitivity to large errors means it can be strongly influenced by outliers — a single far-away point can drag the fitted line significantly.
3.3 Finding the Minimiser via Stationary Points
For the 1-D case, consider the simplified setting with $w_0 = 0$ (no intercept). The training loss becomes a function of a single variable $w_1$:
$$L(w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w_1 x_i)^2.$$
This is a quadratic function in $w_1$ (an "upward-opening parabola"), so it has exactly one minimum. We find it by setting the derivative to zero:
$$\frac{dL}{dw_1} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - w_1 x_i) = 0 \quad \Longrightarrow \quad \hat{w}_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
This is a closed-form solution — no iterative algorithm needed. We can compute the optimal slope directly from the data.
Suppose we have 5 houses with sizes (in m²) and prices (in millions CHF): $(120,\, 1.5)$, $(200,\, 3.0)$, $(130,\, 1.0)$, $(135,\, 0.7)$, $(80,\, 0.7)$. Plotting these as points and fitting the best line through them (reinstating the intercept, via the analogous stationary-point conditions for both $w_0$ and $w_1$) gives $\hat{f}(x) = \hat{w}_0 + \hat{w}_1 x$, the line that minimises the total squared vertical distance to all data points.
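A quick numerical check of this example, assuming NumPy is available. The first fit uses the no-intercept formula derived above; the second reinstates the intercept via `np.polyfit`, which solves the same least-squares problem directly.

```python
import numpy as np

# The five houses from the example above.
x = np.array([120.0, 200.0, 130.0, 135.0, 80.0])   # size in m^2
y = np.array([1.5, 3.0, 1.0, 0.7, 0.7])            # price in millions CHF

# No-intercept case (w0 = 0): the stationary-point condition gives
# w1_hat = sum(x_i * y_i) / sum(x_i^2).
w1_hat = (x @ y) / (x @ x)

# With an intercept, np.polyfit(deg=1) returns [slope, intercept] of the
# least-squares line.
w1_full, w0_full = np.polyfit(x, y, deg=1)
```

A sanity check on the intercept fit: with an intercept present, the residuals of the least-squares line always sum to zero.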
4. Multiple Linear Regression (General $d$)
In practice, houses have many attributes: size, number of bathrooms, location, age, etc. We now let $x_i \in \mathbb{R}^d$ be a $d$-dimensional vector of attributes for each house.
4.1 Function Class and Loss
The function class is the set of affine functions of $d$ inputs:
$$\mathcal{F} = \{ f : f(x) = w^\top x + w_0,\ \ w \in \mathbb{R}^d,\ w_0 \in \mathbb{R} \}.$$
Here $w$ is the weight vector (generalising the slope) and $w_0$ the intercept. For $d = 2$, the predictor is a plane in 3-D space; for general $d$, it is a hyperplane in $(d+1)$-dimensional space.
The training loss is the natural extension:
$$L(w_0, w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i - w_0)^2.$$
We can absorb the intercept into the weight vector by augmenting every input: replace $x_i$ with $\tilde{x}_i = (1, x_i) \in \mathbb{R}^{d+1}$ and define $\tilde{w} = (w_0, w) \in \mathbb{R}^{d+1}$. Then $f(\tilde{x}) = \tilde{w}^\top \tilde{x}$ and we only need to optimise over a single vector $\tilde{w}$. For the rest of this summary, we assume this has been done and write simply $f(x) = w^\top x$.
4.2 Matrix Notation
Stack all $n$ training inputs row-wise into the design matrix $X \in \mathbb{R}^{n \times d}$ and all targets into the vector $y \in \mathbb{R}^n$. Then:
$$L(w) = \frac{1}{n} \|y - Xw\|_2^2.$$
Since the factor $\frac{1}{n}$ doesn't affect the location of the minimum, minimising $L(w)$ is equivalent to minimising $\|y - Xw\|_2^2$.
4.3 Deriving the Closed-Form Solution
There are two complementary ways to derive the optimal $\hat{w}$:
Method 1: Stationary Point Condition
Any minimiser $\hat{w}$ must satisfy $\nabla_w L = 0$. Computing the gradient of $\|y - Xw\|_2^2$ and setting it to zero yields:
$$\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0 \quad \Longleftrightarrow \quad X^\top X \hat{w} = X^\top y.$$
These are the normal equations. Every minimiser of the squared loss must satisfy them.
Why is this truly a minimum? The Hessian of $\|y - Xw\|_2^2$ is $2X^\top X$, which is always positive semi-definite (psd). This means the loss function is convex, so every stationary point is automatically a global minimum.
When $X^\top X$ is invertible (which happens when $d \leq n$ and $X$ has full column rank), we get the unique solution:
$$\hat{w} = (X^\top X)^{-1} X^\top y.$$
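The normal equations are easy to verify numerically. The sketch below, on synthetic data, solves $X^\top X \hat{w} = X^\top y$ directly (solving the linear system rather than forming an explicit inverse) and cross-checks against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))            # design matrix; full column rank w.h.p.
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y.
# np.linalg.solve is preferred over explicitly inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same $\hat{w}$, and the residual $y - X\hat{w}$ satisfies $X^\top(y - X\hat{w}) = 0$, as the derivation requires.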
Method 2: Geometric (Projection) Argument
This viewpoint gives beautiful geometric insight. Consider the prediction vector $X w \in \mathbb{R}^n$ for a given $w$. As $w$ ranges over all of $\mathbb{R}^d$, the vector $Xw$ traces out $\text{span}(X) = \{Xw : w \in \mathbb{R}^d\}$, which is a subspace of $\mathbb{R}^n$ (spanned by the columns of $X$).
Minimising $\|y - Xw\|_2^2$ means finding the point in $\text{span}(X)$ that is closest to $y$. By linear algebra, this closest point is the orthogonal projection of $y$ onto $\text{span}(X)$:
$$X\hat{w} = \Pi_X\, y,$$
where $\Pi_X$ denotes the orthogonal projection onto $\text{span}(X)$.
The residual vector $y - X\hat{w}$ is perpendicular to every column of $X$. Writing this orthogonality condition out gives exactly the normal equations $X^\top(y - X\hat{w}) = 0$.
The number of minimisers $\hat{w}$ depends on $n$, $d$, and the rank of $X$. If $d \leq n$ and $X$ has full column rank, $X^\top X$ is invertible and there is a unique $\hat{w}$. In the overparameterised case ($d > n$), there are infinitely many $w$ that achieve zero training loss ($y = Xw$ has infinitely many solutions). However, the prediction vector $X\hat{w} = \Pi_X y$ is always unique — only the weight vector is non-unique. The standard convention is to pick the minimum-norm solution $\hat{w} = X^\top(XX^\top)^{-1}y$.
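The overparameterised case can also be checked numerically. This sketch (on synthetic data) computes the minimum-norm formula from above and confirms it interpolates the targets exactly; `np.linalg.lstsq`, which is SVD-based, returns the same minimum-norm solution in this regime.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                            # overparameterised: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-norm interpolating solution w = X^T (X X^T)^{-1} y,
# again solving the linear system instead of inverting X X^T.
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# lstsq returns the minimum-norm least-squares solution.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```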
5. Beyond the Squared Loss
The squared loss is the default for linear regression, but it isn't always the best choice. The lecture briefly introduces two alternatives:
5.1 Huber Loss (Robustness to Outliers)
Since the squared loss grows quadratically, a single outlier with a very large residual can dominate the sum and pull the fitted line far from the bulk of the data. The Huber loss blends the two regimes:
$$\ell_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \leq \delta, \\[2pt] \delta \left( |r| - \tfrac{\delta}{2} \right) & \text{if } |r| > \delta, \end{cases} \qquad r = y - f(x).$$
For small errors it behaves like the squared loss, but for large errors it grows only linearly, effectively down-weighting outliers. The parameter $\delta$ controls the transition point.
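A direct translation into code, using one common parameterisation of the Huber loss (the lecture's exact constants may differ); the function name `huber` is ours:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```

For a residual of 3 with $\delta = 1$, the squared loss charges $\tfrac{1}{2} \cdot 9 = 4.5$ while the Huber loss charges only $2.5$, illustrating the down-weighting of large errors.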
5.2 Asymmetric Losses
Sometimes over- and underestimation carry different costs. For instance, when selling a house, underestimating the price (leaving money on the table) may be worse than slightly overestimating it. Asymmetric loss functions penalise one direction more than the other. A prominent example is the quantile loss (pinball loss):
$$\ell_\tau(f(x), y) = \begin{cases} \tau\, (y - f(x)) & \text{if } y \geq f(x), \\[2pt] (1 - \tau)\, (f(x) - y) & \text{if } y < f(x), \end{cases} \qquad \tau \in (0, 1).$$
When $\tau < 0.5$, overestimation is penalised more and the model tends to underestimate; when $\tau > 0.5$, the reverse happens.
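The asymmetry is easy to see in code. A minimal sketch of the pinball loss (the function name `pinball` is ours): underestimation ($y > \hat{y}$) costs $\tau$ per unit of error, overestimation costs $1 - \tau$ per unit.

```python
import numpy as np

def pinball(y, y_hat, tau):
    """Quantile (pinball) loss: underestimation (y > y_hat) costs tau per
    unit of error, overestimation costs (1 - tau) per unit."""
    diff = y - y_hat
    return np.where(diff >= 0, tau * diff, (tau - 1) * diff)
```

With $\tau = 0.9$, underestimating by one unit costs $0.9$ while overestimating by one unit costs only $0.1$, so the fitted model drifts upward, matching the behaviour described above.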
The squared loss is a reasonable default, but always think about whether your application calls for robustness to outliers (→ Huber) or asymmetric error costs (→ quantile loss). The choice of loss implicitly encodes assumptions about your problem.
6. Preview: From Linear to Nonlinear via Feature Maps
A natural question: what if the relationship between input and output is not linear? The lecture notes and subsequent lectures introduce the idea of a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^p$ that transforms the raw inputs before feeding them into the linear regression machinery:
$$f_w(x) = w^\top \phi(x), \qquad w \in \mathbb{R}^p.$$
The function $f_w$ is linear in the features $\phi(x)$ but can be highly nonlinear in the original input $x$.
For a scalar input $x \in \mathbb{R}$, define $\phi(x) = (1,\, x,\, x^2,\, \dots,\, x^p)$. Then $f_w(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p$ is a polynomial of degree $p$ — a very flexible nonlinear function — yet we can solve for the optimal $w$ using exactly the same normal equations, just with the feature matrix $\Phi$ in place of $X$.
The closed-form solution carries over directly: replace $X$ by the feature matrix $\Phi \in \mathbb{R}^{n \times p}$ (whose $i$-th row is $\phi(x_i)^\top$) to obtain $\hat{w} = (\Phi^\top \Phi)^{-1}\Phi^\top y$. This idea is developed fully in Lecture 3 and later chapters.
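A small demonstration of this carry-over, fitting a quadratic on synthetic data: build the feature matrix $\Phi$ with rows $(1, x_i, x_i^2)$ and reuse the normal equations unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.01 * rng.normal(size=x.size)

# Feature map phi(x) = (1, x, x^2): np.vander with increasing=True puts
# phi(x_i) in the i-th row of Phi.
Phi = np.vander(x, N=3, increasing=True)

# Same normal equations as before, with Phi in place of X.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```

With this tiny noise level the recovered coefficients land close to the generating values $(1, -2, 3)$, even though the fitted function is nonlinear in $x$.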
Key Takeaways
- The ML pipeline has three steps: collect data, learn a model by minimising a training loss over a function class, and predict on new inputs. These three design choices — function class, loss, and optimisation — define any ML method.
- Linear regression = linear functions + squared loss + closed-form optimisation. It is the simplest and most fundamental supervised learning method, and understanding it deeply is the foundation for everything that follows.
- The normal equations $X^\top X \hat{w} = X^\top y$ characterise all minimisers of the squared loss. When $X^\top X$ is invertible ($d \leq n$, full rank), the unique solution is $\hat{w} = (X^\top X)^{-1}X^\top y$.
- Geometric interpretation: the optimal prediction vector $X\hat{w}$ is the orthogonal projection of $y$ onto the column space of $X$. The residual $y - X\hat{w}$ is perpendicular to this subspace.
- Loss function choice matters: the squared loss is smooth and convex but sensitive to outliers. Huber loss provides robustness; asymmetric losses let you penalise over- and underestimation differently.
- Feature maps unlock nonlinear modelling while keeping the same linear algebra machinery — a powerful idea that reappears throughout the course in polynomial regression, kernel methods, and neural networks.