Lecture 11: Introduction to Neural Networks

Course: Introduction to Machine Learning (IML), Lecture 11, Prof. Andreas Krause, ETH Zürich

This lecture introduces artificial neural networks (ANNs), the foundational architecture behind modern deep learning. The key idea is deceptively simple: instead of hand-designing feature transformations (as in kernel methods or polynomial features), neural networks learn the features directly from data. We cover the general architecture, the algorithms for making predictions (forward propagation) and training the network (backward propagation), and discuss the most common activation functions.

1. Motivation — From Fixed to Learned Features

In earlier lectures, we saw how non-linear feature maps $\phi(x)$ allow linear models to capture non-linear relationships. Given a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^p$, the model is $f(x) = w^T \phi(x)$, and we optimize over the weights $w$ while $\phi$ stays fixed.

The core problem: designing good features by hand is difficult. For something like handwritten digit recognition, it's unclear what features will generalize well — shifting a digit by a few pixels can completely change hand-crafted features even though the label hasn't changed.

Key Insight — Neural Networks Learn Features

Instead of fixing $\phi$ and only optimizing $w$, neural networks optimize both the feature map $\phi$ (parameterized by $\Theta$) and the output weights $w$ jointly:

$$\hat{w}, \hat{\Theta} = \arg\min_{w \in \mathbb{R}^m,\, \Theta \in \mathbb{R}^{m \times d}} \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(w^T \phi(x_i;\,\Theta),\; y_i\right)$$

This is the fundamental difference: the features themselves are learned from the training data.

2. Architecture of a Neural Network

2.1 The Single Hidden Layer Network

The simplest ANN uses parameterized feature maps of the form $\phi_j(x;\,\theta_j) = \varphi(\theta_j^T x)$, where $\varphi: \mathbb{R} \to \mathbb{R}$ is a fixed nonlinear activation function and $\theta_j \in \mathbb{R}^d$ are learned parameters. In matrix form: $\phi(x;\,\Theta) = \varphi(\Theta x)$, where $\Theta \in \mathbb{R}^{m \times d}$ and $\varphi$ is applied element-wise.

The network then outputs:

$$f(x;\, w, \Theta) = \sum_{j=1}^{m} w_j \,\varphi\!\left(\theta_j^T x\right)$$

Such nested functions — where learnable linear maps alternate with fixed nonlinearities — are called artificial neural networks (ANNs), also known as multi-layer perceptrons (MLPs).

2.2 How a Single-Layer Network Works

The computation inside a one-hidden-layer network proceeds in three steps:

  1. Input: The network receives a $d$-dimensional vector $x = (x_1, \ldots, x_d)$.
  2. Linear transform + activation: For each hidden unit $i = 1, \ldots, m$, compute the pre-activation $z_i = \sum_{j=1}^{d} \theta_{i,j}\, x_j$ (or $z = \Theta x$ in matrix form), then apply the activation: $h_i = \varphi(z_i)$. The number $m$ of hidden units is called the width of the layer.
  3. Output: Linearly combine the hidden units: $f = w^T h = \sum_{i=1}^{m} w_i\, h_i$.

Bias terms $b$ are typically included, so the linear transform becomes $z = \Theta x + b$. In diagrams, bias nodes are often omitted for clarity.
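The three steps above can be sketched directly in NumPy (a minimal illustration; the tanh activation and the random, untrained weights are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 3, 4                      # input dimension, hidden-layer width
Theta = rng.normal(size=(m, d))  # hidden-layer weights, one row per hidden unit
b = np.zeros(m)                  # hidden-layer bias
w = rng.normal(size=m)           # output weights

def tanh_net(x):
    """One-hidden-layer network: f(x) = w^T phi(Theta x + b)."""
    z = Theta @ x + b            # step 2a: pre-activations z_i
    h = np.tanh(z)               # step 2b: element-wise activation
    return w @ h                 # step 3: linear combination of hidden units

x = np.array([1.0, -0.5, 2.0])
print(tanh_net(x))               # a scalar prediction
```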

2.3 Going Deeper — Multiple Hidden Layers

While one hidden layer is theoretically sufficient (see the Universal Approximation Theorem below), in practice we often use deeper networks. We can increase complexity in several ways: increase the number of hidden layers (the depth), add multiple outputs (for tasks like multi-class classification), or use different activation functions across layers.

In a network with $L-1$ hidden layers, we denote the weight matrix and bias of the $l$-th layer by $W^{(l)}$ and $b^{(l)}$. The full parameter set is $\Theta = \left(W^{(1)}, \ldots, W^{(L)},\, b^{(1)}, \ldots, b^{(L)}\right)$. Such networks where every unit connects to all units in the adjacent layers are called fully connected networks.

3. Activation Functions

The activation function $\varphi$ introduces the necessary nonlinearity. Without it, stacking linear layers would just produce another linear function (a composition of linear maps is still linear). The most common choices are:

| Name | Formula | Range | Key Property |
|---|---|---|---|
| Identity | $\varphi(z) = z$ | $(-\infty, \infty)$ | Linear; used in the output layer for regression |
| Sigmoid | $\varphi(z) = \frac{1}{1+\exp(-z)}$ | $(0, 1)$ | Smooth, outputs can be read as probabilities; derivative can vanish for large $\|z\|$ |
| Tanh | $\varphi(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $(-1, 1)$ | Zero-centered; also suffers from vanishing gradients |
| ReLU | $\varphi(z) = \max(0, z)$ | $[0, \infty)$ | Simple, avoids vanishing gradients for $z > 0$; not differentiable at 0 |
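These activations and their derivatives (which backpropagation needs later) are easy to write down directly; a small reference sketch:

```python
import numpy as np

def identity(z): return z
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def tanh(z):     return np.tanh(z)
def relu(z):     return np.maximum(0.0, z)

# Derivatives, used later by backpropagation:
def d_identity(z): return np.ones_like(z)
def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                       # peaks at 0.25 when z = 0
def d_tanh(z):  return 1.0 - np.tanh(z) ** 2
def d_relu(z):  return (z > 0).astype(float)   # 0 for z < 0, 1 for z > 0
```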
Why Nonlinearity Matters

If we used the identity activation everywhere, the entire network would collapse to a single linear transformation regardless of depth: $W^{(L)} \cdots W^{(1)} x$ is still just some matrix times $x$. The nonlinear activation is what gives neural networks the ability to represent complex, non-linear decision boundaries.
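The collapse is easy to verify numerically: stacking linear layers without an activation is exactly one matrix product (a small demo with random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

# Three "layers" with identity activation...
deep = W3 @ (W2 @ (W1 @ x))
# ...equal a single linear map W = W3 W2 W1 applied once.
shallow = (W3 @ W2 @ W1) @ x

print(np.allclose(deep, shallow))  # True
```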

4. Universal Approximation Theorem

Theorem (Cybenko, 1989) — Universal Approximation

Let $f: [0,1]^d \to \mathbb{R}$ be any continuous function, and let $\varphi$ be any sigmoidal activation function (i.e., a continuous function with $\lim_{t \to \infty} \varphi(t) = 1$ and $\lim_{t \to -\infty} \varphi(t) = 0$). Then $f$ can be uniformly approximated by a single hidden layer network:

$$\hat{f}(x) = W^{(2)} \varphi\!\left(W^{(1)} x + b^{(1)}\right)$$

That is, for any $\varepsilon > 0$, there exist $W^{(1)} \in \mathbb{R}^{m \times d}$, $b^{(1)} \in \mathbb{R}^m$, and $W^{(2)} \in \mathbb{R}^{1 \times m}$ such that $\sup_{x \in [0,1]^d} \left|\hat{f}(x) - f(x)\right| \leq \varepsilon.$

Important Caveat

This theorem guarantees existence but says nothing about how many hidden units $m$ (the width) we need — it could be astronomically large. In practice, increasing depth (more layers) is often more efficient than making a single layer extremely wide, because deep networks can exploit compositional structure in the data.
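One way to see the theorem at work is the classical "bump" construction: the difference of two steep sigmoids approximates the indicator of a small interval, and a weighted sum of such bumps approximates any continuous target. A sketch (the target $\sin(2\pi x)$, the number of bumps $m$, and the steepness $k$ are illustrative choices, not part of the theorem):

```python
import numpy as np

def sigmoid(z):
    # Clipping avoids overflow warnings for very steep arguments.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def bump_approximator(f, m=100, k=2000.0):
    """Approximate f on [0,1] by a width-2m single-hidden-layer sigmoid net.

    Each bump sigma(k(x - a_j)) - sigma(k(x - a_{j+1})) is close to the
    indicator of [a_j, a_{j+1}]; we weight it by f at the interval midpoint.
    """
    a = np.linspace(0.0, 1.0, m + 1)      # interval edges
    mids = 0.5 * (a[:-1] + a[1:])         # interval midpoints
    def f_hat(x):
        x = np.atleast_1d(x)[:, None]
        bumps = sigmoid(k * (x - a[:-1])) - sigmoid(k * (x - a[1:]))
        return bumps @ f(mids)
    return f_hat

target = lambda x: np.sin(2 * np.pi * x)
f_hat = bump_approximator(target)
xs = np.linspace(0, 1, 2001)
print(np.max(np.abs(f_hat(xs) - target(xs))))  # small sup-norm error
```

Increasing $m$ shrinks the error further, at the cost of a wider hidden layer, which is exactly the caveat above.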

5. Forward Propagation

Given a network with $L-1$ hidden layers and known parameters $\Theta$, forward propagation (or "forward pass") is the algorithm that computes the output $f(x;\, \Theta)$ for a given input $x$. The idea is straightforward — unroll the nested function and iterate through the layers:

$$f(x;\,\Theta) = W^{(L)}\,\varphi\!\Big(W^{(L-1)}\,\varphi\!\big(\cdots\varphi\!\left(W^{(1)}x + b^{(1)}\right)\cdots\big) + b^{(L-1)}\Big) + b^{(L)}$$
Algorithm — Forward Propagation

For an $L$-layer neural network:

  1. Set $h^{(0)} = x$ (the input)
  2. For each hidden layer $l = 1, \ldots, L-1$:
    • Compute the pre-activation: $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$
    • Apply the activation: $h^{(l)} = \varphi(z^{(l)})$
  3. Compute the output: $f = W^{(L)} h^{(L-1)} + b^{(L)}$
  4. Return $f$

Notice that the activation function is not applied at the output layer — the final layer is simply a linear transformation. For regression, $f$ is the predicted value directly. For classification, we typically apply a decision rule like $\hat{y} = \text{sign}(f)$ or $\hat{y} = \arg\max_i f_i$.
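The algorithm translates directly into code; a generic sketch, where the layer sizes and the tanh activation are arbitrary choices:

```python
import numpy as np

def forward(x, Ws, bs, phi=np.tanh):
    """Forward propagation through an L-layer fully connected network.

    Ws, bs: lists of weight matrices W^(1)..W^(L) and biases b^(1)..b^(L).
    The activation phi is applied at every layer except the last.
    """
    h = x                                    # h^(0) = x
    for W, b in zip(Ws[:-1], bs[:-1]):       # hidden layers l = 1..L-1
        z = W @ h + b                        # pre-activation
        h = phi(z)                           # activation
    return Ws[-1] @ h + bs[-1]               # linear output layer, no activation

# A 3 -> 4 -> 4 -> 1 network with random (untrained) parameters:
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
Ws = [rng.normal(size=(m, d)) for d, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(forward(np.array([1.0, -0.5, 2.0]), Ws, bs))
```

With the identity activation and zero biases this reduces to the single matrix product $W^{(3)} W^{(2)} W^{(1)} x$, matching the linear-collapse argument from Section 3.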

Example — Two Hidden Layers with Sigmoid Activation

Consider a network with input dimension $d=3$, two hidden layers of width 2 each, and sigmoid activation $\varphi(z) = \frac{1}{1+\exp(-z)}$.

Layer 1: $h_i^{(1)} = \varphi\!\left(\sum_{k=1}^{3} w_{k,i}^{(1)} x_k\right)$ for $i = 1, 2$.

Layer 2: $h_i^{(2)} = \varphi\!\left(\sum_{k=1}^{2} w_{k,i}^{(2)} h_k^{(1)}\right)$ for $i = 1, 2$.

Output: $f = w_1^{(3)} h_1^{(2)} + w_2^{(3)} h_2^{(2)}$ — no activation function at the final step.

6. Backward Propagation (Backpropagation)

Forward propagation handles inference. But how do we train the network? As with previous models, training means finding parameters $\Theta^*$ that minimize the empirical loss:

$$\Theta^* = \arg\min_\Theta \; \mathcal{L}(\Theta;\,\mathcal{D}) = \arg\min_\Theta \; \frac{1}{n}\sum_{i=1}^{n} \ell(\Theta;\, x_i, y_i)$$

where $\ell$ is a loss function such as the squared loss $(y - f(x;\,\Theta))^2$ or the logistic loss $\log(1 + \exp\{-y \cdot f(x;\,\Theta)\})$.

Non-Convexity

Unlike linear regression or logistic regression, the loss surface of a neural network is generally not convex. Gradient descent may converge to a local minimum, a saddle point, or even get stuck at a stationary point depending on the initialization. Despite this, gradient-based methods — particularly variants of stochastic gradient descent (SGD) — work remarkably well in practice.

6.1 The Idea: Efficient Gradient Computation via the Chain Rule

Training requires computing the gradient $\nabla_\Theta \ell$ with respect to all weights in the network. For a network with many layers, this means computing partial derivatives like $\frac{\partial \ell}{\partial w_{i,j}^{(l)}}$ for every layer $l$. Doing this naively would be extremely expensive.

Backpropagation solves this by exploiting the chain structure of the network. The key insight is that through repeated application of the chain rule, gradient information can be propagated backwards through the layers, reusing intermediate results from the forward pass and gradients already computed for later layers.

6.2 Step-by-Step for a General Network

Starting from the output layer and moving backwards:

Step 1 — Last layer ($W^{(L)}$):

$$\nabla_{W^{(L)}} \ell = \frac{\partial \ell}{\partial f} \cdot \frac{\partial f}{\partial W^{(L)}}$$

Since $f = W^{(L)} h^{(L-1)} + b^{(L)}$, the second factor involves $h^{(L-1)}$ which was already computed during the forward pass.

Step 2 — Previous layer ($W^{(L-1)}$):

$$\nabla_{W^{(L-1)}} \ell = \frac{\partial \ell}{\partial f} \cdot \frac{\partial f}{\partial h^{(L-1)}} \cdot \frac{\partial h^{(L-1)}}{\partial z^{(L-1)}} \cdot \frac{\partial z^{(L-1)}}{\partial W^{(L-1)}}$$

Here $\frac{\partial h^{(L-1)}}{\partial z^{(L-1)}} = \text{diag}\!\left(\varphi'(z^{(L-1)})\right)$ — a diagonal matrix of the activation function's derivatives. The first two factors were already computed in Step 1 and can be reused.

General pattern — Layer $l$:

$$\nabla_{W^{(l)}} \ell = \frac{\partial \ell}{\partial f} \cdot \frac{\partial f}{\partial z^{(L-1)}} \cdot \frac{\partial z^{(L-1)}}{\partial z^{(L-2)}} \cdots \frac{\partial z^{(l+1)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

At each step, only one new factor needs to be computed — everything else is reused from the forward pass or from the previous backpropagation step. This is what makes the algorithm efficient.

Backpropagation — Matrix Form Summary

Define the "error signal" $\delta^{(l)}$ at layer $l$:

  1. Output layer: $\delta^{(L)} = \nabla_f \ell$
  2. Hidden layers (for $l = L-1, L-2, \ldots, 1$): $\delta^{(l)} = \varphi'(z^{(l)}) \odot \left({W^{(l+1)}}^T \delta^{(l+1)}\right)$ where $\odot$ denotes element-wise multiplication.

The gradient with respect to the weights at layer $l$ is then $\nabla_{W^{(l)}} \ell = \delta^{(l)} \, {h^{(l-1)}}^T$, and with respect to the bias, $\nabla_{b^{(l)}} \ell = \delta^{(l)}$.
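The $\delta$ recursion can be implemented in a few lines and checked against a finite-difference gradient. A sketch for squared loss $\ell = (f - y)^2$ and tanh activation (both are illustrative choices):

```python
import numpy as np

def phi(z):  return np.tanh(z)
def dphi(z): return 1.0 - np.tanh(z) ** 2

def forward_backward(x, y, Ws, bs):
    """Forward pass, then the delta recursion; squared loss l = (f - y)^2."""
    # Forward pass, caching pre-activations z^(l) and activations h^(l).
    hs, zs = [x], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        zs.append(W @ hs[-1] + b)
        hs.append(phi(zs[-1]))
    f = Ws[-1] @ hs[-1] + bs[-1]

    # Backward pass: delta^(L) = dl/df, then propagate layer by layer.
    delta = 2.0 * (f - y)                      # output-layer error signal
    grads_W = [np.outer(delta, hs[-1])]        # grad W^(L) = delta h^(L-1)^T
    for l in range(len(Ws) - 2, -1, -1):
        delta = dphi(zs[l]) * (Ws[l + 1].T @ delta)
        grads_W.insert(0, np.outer(delta, hs[l]))
    return float(np.sum((f - y) ** 2)), grads_W

# Sanity check against a finite difference for one weight entry:
rng = np.random.default_rng(0)
sizes = [3, 4, 1]
Ws = [rng.normal(size=(m, d)) for d, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

loss, grads = forward_backward(x, y, Ws, bs)
eps = 1e-6
Ws[0][1, 2] += eps
loss_plus, _ = forward_backward(x, y, Ws, bs)
print(abs((loss_plus - loss) / eps - grads[0][1, 2]))  # close to 0
```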

Intuition — Why "Backward"?

The name "backpropagation" comes from the fact that we start computing gradients at the output and propagate them backwards, layer by layer, towards the input. Each layer's gradient builds on the gradient of the layer above it. This is simply a systematic application of the multivariate chain rule.

7. Vanishing and Exploding Gradients

A critical practical concern is that the gradient factors $\text{diag}\!\left(\varphi'(z^{(l)})\right)$ and the weight matrices $W^{(l)}$ are multiplied together across layers. If these products shrink at each layer, the gradient signal becomes vanishingly small in the early layers (vanishing gradients). If they grow, the gradient can blow up (exploding gradients). In either case, optimization fails.

7.1 The Role of Activation Functions

The sigmoid activation has derivative $\varphi'(z) = \varphi(z)(1 - \varphi(z))$, which peaks at $0.25$ (when $z = 0$) and approaches zero rapidly for large $|z|$. This means that in deep networks with sigmoid activations, gradients tend to shrink exponentially with depth — a primary reason for the "vanishing gradient problem."

The ReLU activation $\varphi(z) = \max(0, z)$ has derivative $\varphi'(z) = 1$ for $z > 0$ and $\varphi'(z) = 0$ for $z < 0$. For positive pre-activations, the gradient passes through unchanged, which helps mitigate vanishing gradients. This is one of the main reasons ReLU became the default activation in deep networks.
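The exponential shrinkage is easy to see numerically: even in the best case ($z = 0$ at every unit), each sigmoid layer multiplies the gradient by at most $0.25$, while ReLU passes it through with factor $1$ for positive pre-activations (a small illustration):

```python
# Best-case per-layer gradient attenuation: sigmoid vs. ReLU (z > 0).
sigmoid_factor, relu_factor = 0.25, 1.0

for depth in (5, 10, 20):
    print(depth, sigmoid_factor ** depth, relu_factor ** depth)
# At depth 20, the sigmoid factor alone is below 1e-12.
```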

7.2 Weight Initialization

Even with good activation functions, poorly initialized weights can cause the hidden layer activations $h^{(l)}$ to grow or shrink as they propagate through layers, which also affects gradient magnitudes. The solution is to initialize weights randomly with carefully chosen variance.

For a one-hidden-layer network with ReLU activation and $d$ inputs with zero mean and unit variance: if the hidden-layer weights are drawn i.i.d. with zero mean and variance $\sigma^2$, then $\text{Var}(h_j) = \frac{d \,\sigma^2}{2}$ for each hidden unit $h_j$. To keep the variance of activations roughly constant across layers, we should set $\sigma^2 = \frac{2}{d}$. This principle generalizes to the well-known He initialization scheme used in practice.
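This can be checked by simulation: with $\sigma^2 = 2/d$, the mean squared activation stays near $1$; the factor $\tfrac{1}{2}$ appears because ReLU zeroes the (symmetric, zero-mean) pre-activation half the time. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 512, 100_000

# He initialization: weight entries with variance 2/d.
w = rng.normal(0.0, np.sqrt(2.0 / d), size=(n_samples, d))
x = rng.normal(size=(n_samples, d))   # inputs: zero mean, unit variance

z = np.sum(w * x, axis=1)             # pre-activations, Var(z) = d * (2/d) = 2
h = np.maximum(0.0, z)                # ReLU
print(np.mean(h ** 2))                # close to d * sigma^2 / 2 = 1
```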

8. Training in Practice

Since the loss surface is non-convex, gradient descent's convergence depends heavily on initialization. In practice, we use minibatch stochastic gradient descent (SGD): at each iteration, compute the gradient on a small random subset $S$ of training points rather than all $n$, making each step much cheaper:

$$\Theta_{t+1} = \Theta_t - \eta \cdot \frac{1}{|S|} \sum_{i \in S} \nabla_\Theta \,\ell(\Theta_t;\, x_i, y_i)$$

The stochasticity in SGD is actually beneficial — the noise in gradient estimates can help the optimizer escape saddle points and poor local minima. Modern deep learning frameworks like PyTorch and TensorFlow implement automatic differentiation, so in practice you specify the network architecture and the framework computes all gradients via backpropagation automatically.

Key Takeaways