Neural Networks 2: Training & Backpropagation

Introduction to Machine Learning, Lecture 12. Prof. Andreas Krause, ETH Zürich

This lecture moves from the architecture of neural networks (covered in Lecture 11) to the question of how to train them. The central topics are the forward propagation algorithm for computing predictions, the backpropagation algorithm for efficiently computing gradients, the non-convex nature of neural network optimisation, and practical techniques — including regularisation — that make training work in practice. The lecture also introduces autodifferentiation as used in modern frameworks like PyTorch.

1. Recap — Multi-Layer Neural Networks

Recall from Lecture 11 that an artificial neural network (ANN) is a nested composition of learnable linear transformations and fixed nonlinear activation functions. In compact form, a network with $L-1$ hidden layers computes:

$$f(\mathbf{x};\,\Theta) \;=\; W^{(L)}\,\varphi\!\Big(W^{(L-1)}\,\varphi\!\big(\cdots\varphi\!\big(W^{(1)}\mathbf{x}+\mathbf{b}^{(1)}\big)\cdots\big)+\mathbf{b}^{(L-1)}\Big)+\mathbf{b}^{(L)}$$

Here, $\Theta = \{W^{(1)},\ldots,W^{(L)},\,\mathbf{b}^{(1)},\ldots,\mathbf{b}^{(L)}\}$ collects all the weight matrices and bias vectors, and $\varphi$ is applied element-wise (e.g. sigmoid, ReLU, tanh). The number of hidden layers is the depth, and the number of units in a layer is its width.

Definition — Artificial Neural Network (ANN)

An ANN (also called a Multi-Layer Perceptron) is a nonlinear function built from nested compositions of learnable linear maps and fixed nonlinearities. In the simplest single-hidden-layer form: $f(\mathbf{x};\,w,\theta) = \sum_{j=1}^{p} w_j\,\varphi(\theta_j^T \mathbf{x})$.

1.1 A Simple Worked Example

Consider the smallest non-trivial ANN: one input $x$, one hidden unit, one output $f$, with a single weight per layer, $w_1$ and $w_2$. The network computes:

$$f(x;\,[w_1,w_2]) \;=\; w_2\,\varphi(w_1 x)$$

The input $x$ is first scaled by $w_1$, then passed through the activation $\varphi$, and finally scaled by $w_2$ to produce the output. This tiny network already illustrates the key pattern: linear → nonlinear → linear.

2. Forward Propagation

Forward propagation (or "forward pass") is the procedure a neural network uses to compute its output given an input $\mathbf{x}$ and a fixed set of parameters $\Theta$. It is essentially reading the nested formula from the inside out, one layer at a time.

Algorithm — Forward Propagation (L-layer network)
  1. Set $\mathbf{h}^{(0)} = \mathbf{x}$  (the input is treated as the "zeroth" hidden layer).
  2. For each hidden layer $l = 1, \ldots, L-1$:
    $$\mathbf{z}^{(l)} = W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{h}^{(l)} = \varphi\!\big(\mathbf{z}^{(l)}\big)$$
  3. Compute the output (no activation at the last layer):
    $$\mathbf{f} = W^{(L)}\,\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}$$
  4. Return $\mathbf{f}$.

The key observation is that each layer does the same two things: (1) compute a linear combination of the previous layer's outputs, and (2) apply the activation function. This repetitive structure is what makes the algorithm a simple loop — and is also what makes gradient computation tractable via backpropagation.
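The loop above can be sketched directly in plain Python (a minimal sketch with sigmoid activation; the helper names `matvec` and `forward` are illustrative, not from the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    # multiply a matrix (list of rows) by a vector
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def forward(x, weights, biases):
    # weights[l], biases[l] parameterise layer l+1; activation is applied
    # on every layer except the last, matching steps 2-3 of the algorithm
    h = x                                     # h^(0) = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [zi + bi for zi, bi in zip(matvec(W, h), b)]
        h = z if l == len(weights) - 1 else [sigmoid(zi) for zi in z]
    return h
```

For instance, with a 2-unit hidden layer and identity weights, `forward([0, 0], [[[1, 0], [0, 1]], [[1, 1]]], [[0, 0], [0]])` returns `[1.0]`, since both hidden sigmoids output 0.5 and the linear output layer sums them.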

Example — Forward pass with sigmoid activation

Consider a network with 3 inputs, two hidden layers of 2 units each, and 1 output, all using sigmoid $\varphi(z) = \frac{1}{1+e^{-z}}$.

Hidden layer 1: $h_i^{(1)} = \varphi\!\big(\sum_{k=1}^{3} w_{ki}^{(1)} x_k\big)$ for $i = 1,2$.

Hidden layer 2: $h_i^{(2)} = \varphi\!\big(\sum_{k=1}^{2} w_{ki}^{(2)} h_k^{(1)}\big)$ for $i = 1,2$.

Output: $f = w_1^{(3)} h_1^{(2)} + w_2^{(3)} h_2^{(2)}$  (no activation applied at the output layer for regression).

3. Training Neural Networks — The Optimisation Problem

Training a neural network means finding parameters $\Theta^*$ that minimise the empirical loss over a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$:

$$\Theta^* = \arg\min_{\Theta}\; \mathcal{L}(\Theta;\,\mathcal{D}) = \arg\min_{\Theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell\!\big(\Theta;\,\mathbf{x}_i, y_i\big)$$

where $\ell$ is a pointwise loss, such as the squared loss $(y - f(\mathbf{x};\Theta))^2$ for regression or the logistic loss $\log(1 + e^{-y \cdot f(\mathbf{x};\Theta)})$ for classification.
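As a concrete reference, the two pointwise losses just mentioned (with labels $y \in \{-1,+1\}$ for the logistic loss) can be written as:

```python
import math

def squared_loss(y, f):
    # regression: (y - f(x; Theta))^2
    return (y - f) ** 2

def logistic_loss(y, f):
    # binary classification with y in {-1, +1}: log(1 + exp(-y * f))
    return math.log(1.0 + math.exp(-y * f))
```

A confident correct prediction drives the logistic loss toward zero (e.g. `logistic_loss(1, 10)` is below 1e-4), while an uninformative prediction gives `logistic_loss(1, 0) = log 2`.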

Important — Non-Convexity

Unlike linear regression or logistic regression, the loss $\mathcal{L}(\Theta;\mathcal{D})$ of a neural network is in general not convex in $\Theta$. Gradient descent may therefore converge to a local minimum or a saddle point rather than the global minimum, depending on the initialisation. This is a fundamental difficulty of neural network training.

3.1 Why (Minibatch) SGD?

Despite non-convexity, we still use gradient-based methods, in particular minibatch stochastic gradient descent (SGD). There are two good reasons for this: first, computing the exact gradient over all $n$ data points is expensive for large datasets, whereas a small minibatch yields a cheap, unbiased estimate of it; second, the noise in the stochastic gradient helps the iterates escape saddle points and shallow local minima, which matters precisely because the objective is non-convex.
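A minimal sketch of minibatch SGD on a one-parameter least-squares model (the data, batch size, and step size are illustrative assumptions):

```python
import random

def sgd_step(theta, batch, lr):
    # gradient of the minibatch mean of (y - theta*x)^2 w.r.t. theta
    g = sum(-2.0 * (y - theta * x) * x for x, y in batch) / len(batch)
    return theta - lr * g

random.seed(0)
data = [(k / 100.0, 3.0 * k / 100.0) for k in range(100)]  # y = 3x
theta = 0.0
for epoch in range(200):
    random.shuffle(data)                    # fresh minibatches each epoch
    for i in range(0, len(data), 10):       # minibatch size 10
        theta = sgd_step(theta, data[i:i + 10], lr=0.1)
```

Each update touches only 10 of the 100 points, yet `theta` converges to the true slope 3.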

4. Backpropagation

The central algorithmic contribution that makes training deep networks feasible is backpropagation (short for "backward propagation of errors"). It is an efficient way to compute the gradient $\nabla_\Theta \ell$ by exploiting the layered structure of the network via the chain rule.

4.1 Intuition from the Simple Example

Return to our tiny network: $\hat{y} = w_2 \cdot \varphi(w_1 x)$, with intermediate values $z = w_1 x$ and $h = \varphi(z)$, and squared loss $\ell = (y - \hat{y})^2$. We need both partial derivatives:

$$\frac{\partial \ell}{\partial w_2} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2} = 2(\hat{y}-y) \cdot h$$
$$\frac{\partial \ell}{\partial w_1} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w_1} = 2(\hat{y}-y) \cdot w_2 \cdot \varphi'(z) \cdot x$$

Notice two crucial efficiency gains: (a) the factor $2(\hat{y}-y)$ is computed once and reused, and (b) the intermediate values $z$, $h$, $\hat{y}$ were already computed during the forward pass and can be reused. Instead of recomputing everything from scratch, we propagate the error signal backward through the layers.
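The two chain-rule formulas can be verified numerically with finite differences (a sketch; the particular values of $w_1$, $w_2$, $x$, $y$ are arbitrary, and $\varphi'(z) = h(1-h)$ for the sigmoid):

```python
import math

def phi(z):                        # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, y):
    return (y - w2 * phi(w1 * x)) ** 2

def grads(w1, w2, x, y):
    # analytic gradients from the chain rule above
    h = phi(w1 * x)
    shared = 2.0 * (w2 * h - y)            # factor 2*(y_hat - y), reused
    return shared * w2 * h * (1 - h) * x, shared * h   # (dl/dw1, dl/dw2)

w1, w2, x, y, eps = 0.7, -1.3, 2.0, 0.5, 1e-6
g1, g2 = grads(w1, w2, x, y)
n1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
```

The analytic values `g1`, `g2` agree with the central differences `n1`, `n2` to several decimal places.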

4.2 General Backpropagation

For an $L$-layer network, we want all gradients $\nabla_{W^{(l)}} \ell$ for $l = 1,\ldots,L$. Backpropagation computes them layer by layer, starting from the output and working backwards. At each step, only one new factor needs to be computed — everything else is reused from earlier steps or the forward pass.

Key Insight — Reuse of Computations

The gradient at layer $l$ has the chain-rule structure:

$$\nabla_{W^{(l)}} \ell = \frac{\partial \ell}{\partial f} \cdot \frac{\partial f}{\partial \mathbf{z}^{(L-1)}} \cdot \frac{\partial \mathbf{z}^{(L-1)}}{\partial \mathbf{z}^{(L-2)}} \cdots \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial W^{(l)}}$$

Each additional layer only adds one new factor (shown in red in the lecture slides). All preceding factors have already been computed when processing the layer above. This is why the algorithm is called backpropagation: we move backward through the layers, accumulating the gradient one factor at a time.

4.3 Backpropagation in Matrix Form

The algorithm can be expressed concisely using an "error signal" $\boldsymbol{\delta}^{(l)}$ at each layer:

Algorithm — Backpropagation (Matrix Form)

1. Output layer:

  • Compute error: $\boldsymbol{\delta}^{(L)} = \nabla_f \ell$
  • Gradient: $\nabla_{W^{(L)}} \ell = \boldsymbol{\delta}^{(L)}\, \mathbf{h}^{(L-1)\,T}$

2. For each hidden layer $l = L-1, \ldots, 1$:

  • Compute error: $\boldsymbol{\delta}^{(l)} = \varphi'(\mathbf{z}^{(l)}) \odot \big(W^{(l+1)\,T}\,\boldsymbol{\delta}^{(l+1)}\big)$
  • Gradient: $\nabla_{W^{(l)}} \ell = \boldsymbol{\delta}^{(l)}\, \mathbf{h}^{(l-1)\,T}$

Here $\odot$ denotes element-wise multiplication, $\mathbf{h}^{(l)}$ is the hidden representation at layer $l$ stored during the forward pass (with $\mathbf{h}^{(0)} = \mathbf{x}$), and $\varphi'(\mathbf{z}^{(l)})$ is the element-wise derivative of the activation. The bias gradients come for free: $\nabla_{\mathbf{b}^{(l)}} \ell = \boldsymbol{\delta}^{(l)}$.

The error signal $\boldsymbol{\delta}^{(l)}$ captures "how much the loss would change if the pre-activation $\mathbf{z}^{(l)}$ changed." It flows from the output layer backward, being multiplied at each step by the transpose of the weight matrix above and the local activation derivative.
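The recursion can be traced on a tiny 2-2-1 network in plain Python (the weights and target are made-up numbers; the finite-difference line at the end is only a sanity check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W1 = [[0.5, -0.2], [0.3, 0.8]]     # W^(1): 2 inputs -> 2 hidden units
W2 = [[1.0, -1.5]]                 # W^(2): 2 hidden -> 1 output
x, y = [1.0, 2.0], 0.5

def net_loss(W1_, W2_):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1_]
    f = sum(w * hi for w, hi in zip(W2_[0], h))
    return (f - y) ** 2

# forward pass, storing intermediate values for reuse
z1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
h1 = [sigmoid(z) for z in z1]
f = sum(w * h for w, h in zip(W2[0], h1))

# backward pass: delta^(2) -> grad W^(2) -> delta^(1) -> grad W^(1)
delta2 = 2.0 * (f - y)                               # nabla_f of squared loss
gW2 = [[delta2 * h for h in h1]]                     # delta^(2) h^(1)^T
delta1 = [h * (1 - h) * W2[0][j] * delta2            # phi'(z) ⊙ (W^T delta)
          for j, h in enumerate(h1)]
gW1 = [[d * xi for xi in x] for d in delta1]         # delta^(1) h^(0)^T

# sanity check one entry of grad W^(1) against a finite difference
eps = 1e-6
W1p = [row[:] for row in W1]; W1p[0][0] += eps
W1m = [row[:] for row in W1]; W1m[0][0] -= eps
numeric = (net_loss(W1p, W2) - net_loss(W1m, W2)) / (2 * eps)
```

Note that the backward pass reuses `h1` and `f` from the forward pass, exactly as described above.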

5. Autodifferentiation

In practice, nobody implements backpropagation by hand for every new architecture. Modern deep learning frameworks such as PyTorch and TensorFlow use autodifferentiation: you define the computational graph (i.e. the network architecture and loss), and the framework automatically computes all gradients for you.

Example — Training a small ANN in PyTorch
```python
import torch

# toy regression data (illustrative: 64 samples, 5 features each)
x = torch.randn(64, 5)
y = torch.randn(64)

model = torch.nn.Sequential(
    torch.nn.Linear(5, 3),   # 5 inputs → 3 hidden units
    torch.nn.ReLU(),
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)   # (64, 1) → (64,) to match y
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(1000):
    y_pred = model(x)           # forward pass
    loss = loss_fn(y_pred, y)   # compute loss
    optimizer.zero_grad()       # reset gradients
    loss.backward()             # backpropagation
    optimizer.step()            # update parameters
```

The call loss.backward() triggers the automatic computation of all gradients through the recorded computation graph. This is the autodifferentiation magic — the same backpropagation algorithm, but automated.

6. Regularisation in Neural Networks

Neural networks often have thousands or millions of parameters, which creates a significant risk of overfitting. Several techniques help mitigate this.

6.1 Weight Decay (L2 Regularisation)

Instead of minimising just the loss, we add a penalty that keeps the weights small:

$$\Theta^* = \arg\min_\Theta \;\sum_{i=1}^{n} \ell(\Theta;\,\mathbf{x}_i,y_i) + \lambda\,\|\Theta\|_2^2$$

This is the same idea as ridge regression, applied to all network weights. The hyperparameter $\lambda > 0$ controls the strength of regularisation. Larger $\lambda$ means simpler models with smaller weights.
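In gradient terms, the penalty simply adds $2\lambda\,\theta_j$ to each component of the gradient, which shrinks every weight slightly at each step (a sketch; the numbers are illustrative). In PyTorch, the same effect is available through the `weight_decay` argument of the optimizers.

```python
def decay_step(theta, data_grad, lr, lam):
    # gradient of  loss + lam * ||theta||^2  adds 2*lam*theta_j per weight
    return [t - lr * (g + 2.0 * lam * t) for t, g in zip(theta, data_grad)]

# with zero data gradient, each step multiplies weights by (1 - 2*lr*lam)
theta = [1.0, -2.0]
for _ in range(10):
    theta = decay_step(theta, [0.0, 0.0], lr=0.1, lam=0.5)
# theta is now roughly [0.349, -0.697], i.e. scaled by 0.9**10
```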

6.2 Early Stopping

During training, the training error keeps decreasing, but the validation error eventually starts to rise — this is the point where the network begins to overfit. Early stopping means halting training when the validation error stops decreasing, even if the normal convergence criterion has not been met. It is one of the simplest and most effective regularisation strategies.
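A minimal sketch of the stopping rule (the patience mechanism and the synthetic U-shaped validation curve are illustrative assumptions, not from the lecture):

```python
def train_with_early_stopping(train_step, val_error, max_epochs, patience):
    # stop once validation error has not improved for `patience` epochs
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)                   # one epoch of training
        err = val_error(epoch)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                           # validation error rising: stop
    return best_epoch, best_err

# synthetic validation curve that bottoms out at epoch 10, then rises
stop_epoch, stop_err = train_with_early_stopping(
    lambda e: None, lambda e: (e - 10) ** 2, max_epochs=100, patience=5)
```

With this curve, training halts shortly after epoch 10 and reports epoch 10 as the best model, well before the 100-epoch budget is spent.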

6.3 Dropout

Dropout (Srivastava et al., 2014) is a regularisation technique specific to neural networks. The idea is to randomly "drop out" (i.e. set to zero) each hidden unit with probability $1-p$ (equivalently, keep it with probability $p$) during each iteration of SGD. This forces the network not to rely too heavily on any single unit and encourages redundancy across units.

During training, a different random sub-network is used at each step. At test time, all units are active, but the weights are multiplied by $p$ to compensate for the fact that more units are now present than during any single training step.
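The two regimes can be sketched as follows (standard, non-inverted dropout as described above; the function name is illustrative):

```python
import random

def dropout(h, p, training, rng=random):
    # training: keep each unit with probability p (drop with prob. 1-p)
    # test time: keep all units but scale activations by p to compensate
    if training:
        return [hi if rng.random() < p else 0.0 for hi in h]
    return [p * hi for hi in h]

random.seed(0)
masked = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=True)   # random zeros
scaled = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False)  # [0.5, 1.0, 1.5, 2.0]
```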

Key Point — Dropout as Ensemble

You can think of dropout as implicitly training an exponential number of different sub-networks (one for each dropout pattern) and averaging their predictions at test time. This ensemble effect is what gives dropout its strong regularisation power.

7. Key Takeaways