Convolutional Neural Networks

Course: Introduction to Machine Learning, Lecture 14, Spring 2026, Prof. Andreas Krause

This lecture introduces Convolutional Neural Networks (CNNs), a specialized neural network architecture designed for data with spatial structure — most notably images. The core idea is deceptively simple: instead of connecting every input to every hidden unit, connect each unit only to a small local neighbourhood and share the same weights across all positions. This single design choice leads to massive parameter savings, natural translation equivariance, and the ability to learn hierarchical visual features. The lecture covers the motivation from image data, the mathematical definition of convolution in 1D and 2D, the role of padding and stride, and how full CNN architectures are assembled.

1. Motivation: Why Not Just Use Fully Connected Networks?

1.1 Images as Numerical Objects

Neural networks used in computer vision take images as input. The first step is always to turn an image into numbers. A grayscale image of size $m \times n$ is represented as an $m \times n$ matrix, where each entry is the pixel's intensity (typically a value in $[0,1]$, from black to white). An RGB image has three colour channels (red, green, blue), so it becomes an $m \times n \times 3$ tensor — think of it as three matrices stacked behind each other, where each matrix records the intensity of one colour channel for every pixel.

Example — A Tiny Tensor

A $5 \times 5 \times 3$ tensor represents a $5 \times 5$ RGB image. The upper-left pixel has three intensity values — one from the red channel, one from the green, and one from the blue. Together, these three numbers determine the pixel's colour.

1.2 The Parameter Explosion Problem

One naive approach is to flatten the image into a vector and feed it into a standard fully connected network. But this quickly becomes impractical. A modest $100 \times 100$ grayscale image has 10,000 pixel values. A single fully connected layer mapping this to 1,000 hidden units requires $10{,}000 \times 1{,}000 = 10{,}000{,}000$ parameters — and that's just one layer. In practice, images are much larger (e.g. $224 \times 224 \times 3$), making the parameter count astronomically high. This is both computationally wasteful and highly prone to overfitting.
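The arithmetic above is easy to check with a short helper. A minimal sketch, with biases omitted and the sizes taken from the text:

```python
def fc_weight_count(n_in, n_out):
    """Number of weights in one fully connected layer (biases excluded)."""
    return n_in * n_out

# 100x100 grayscale image flattened into 10,000 inputs, mapped to 1,000 hidden units:
print(fc_weight_count(100 * 100, 1_000))       # 10000000

# A realistic 224x224 RGB input makes the single layer far larger still:
print(fc_weight_count(224 * 224 * 3, 1_000))   # 150528000
```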

1.3 The Key Observation

The crucial insight that motivates CNNs is that the colour of each pixel is mostly related to the colour of nearby pixels, and much less related to distant ones. A cat's ear in the top-left corner of an image is defined by the local arrangement of pixels around it — not by what's happening in the bottom-right corner. Fully connected layers completely ignore this spatial locality; they treat every pair of input-output connections as equally important. This calls for a different architecture.

2. The CNN Architecture: Local Connectivity & Weight Sharing

CNNs address the problems above with two principles that are baked into the architecture itself:

  1. Local connectivity: each unit is connected only to a small spatial neighbourhood of the previous layer, not to every input.
  2. Weight sharing: the same set of weights (the filter) is applied at every spatial position.

These two properties have profound consequences. Weight sharing means the number of learnable parameters in a convolutional layer depends only on the filter size — not on the image size. A layer with a $3 \times 3$ filter has only 9 weights (plus a bias), regardless of whether the image is $32 \times 32$ or $1024 \times 1024$. Compare this to the millions of parameters in a fully connected layer doing the same job.

CNNs Are Not Purely Convolutional

A CNN does not necessarily consist only of convolutional layers. It is common for the later layers (closer to the output) to be fully connected layers, which combine the high-level features extracted by the convolutional layers into a final prediction (e.g. a class label).

3. One-Dimensional Convolution

Before tackling images (which are 2D or 3D), it's helpful to understand convolution in the simplest case: one-dimensional input. Recall that in a fully connected network, each hidden layer is related to the previous one via:

$$\mathbf{v}^{(l)} = \varphi\!\left(W^{(l)}\,\mathbf{v}^{(l-1)}\right)$$

where $W^{(l)}$ is a full weight matrix and $\varphi$ is a nonlinear activation function applied element-wise. In the CNN architecture, this full matrix is replaced by a structured matrix that encodes local connectivity and weight sharing. Consider a small example with input $\mathbf{x} = (x_1, x_2, x_3)^T$ and filter $\mathbf{w} = (w_1, w_2)^T$. The output is:

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} = \begin{pmatrix} w_1 & 0 & 0 \\ w_2 & w_1 & 0 \\ 0 & w_2 & w_1 \\ 0 & 0 & w_2 \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$

Notice the special structure: the weight matrix is a Toeplitz matrix — it has the same entries along each diagonal. Each output $z_i$ depends on at most two consecutive inputs, and the same weights $w_1, w_2$ are reused everywhere. This is exactly the convolution operation.

Definition — 1D Convolution

Given two vectors $\mathbf{w} \in \mathbb{R}^k$ and $\mathbf{x} \in \mathbb{R}^d$, their convolution is the vector $\mathbf{z} = \mathbf{w} * \mathbf{x} \in \mathbb{R}^{k+d-1}$ with entries:

$$z_i := \sum_{j=\max\{1,\, i-d+1\}}^{\min\{i,\, k\}} w_j \, x_{i-j+1}$$
Example — Computing a 1D Convolution

Let $\mathbf{w} = (w_1, w_2)^T$ and $\mathbf{x} = (x_1, x_2, x_3)^T$. Then:

$$\mathbf{w} * \mathbf{x} = \begin{pmatrix} w_1 \cdot x_1 \\ w_1 \cdot x_2 + w_2 \cdot x_1 \\ w_1 \cdot x_3 + w_2 \cdot x_2 \\ w_2 \cdot x_3 \end{pmatrix}$$

Each entry is a weighted sum of a small local window of the input, with the same weights $w_1, w_2$ applied everywhere.
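The definition translates directly into code. The sketch below rewrites the 1-indexed formula with 0-based Python indices and checks it on a small hypothetical input:

```python
def conv1d(w, x):
    """Full 1D convolution: z = w * x has length k + d - 1.
    Implements z_i = sum_{j=max(1,i-d+1)}^{min(i,k)} w_j x_{i-j+1},
    shifted to 0-based indexing."""
    k, d = len(w), len(x)
    z = []
    for i in range(1, k + d):                   # i = 1, ..., k + d - 1
        lo, hi = max(1, i - d + 1), min(i, k)   # summation bounds from the definition
        z.append(sum(w[j - 1] * x[i - j] for j in range(lo, hi + 1)))
    return z

# Hypothetical numbers: w = (2, 3), x = (1, 4, 5)
print(conv1d([2, 3], [1, 4, 5]))   # [2, 11, 22, 15]
```

The four entries match the symbolic example: $w_1 x_1$, $w_1 x_2 + w_2 x_1$, $w_1 x_3 + w_2 x_2$, $w_2 x_3$.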

Using this notation, the relationship between consecutive layers in a convolutional network is written compactly as:

$$\mathbf{v}^{(l)} = \varphi\!\left(\mathbf{w}^{(l)} * \mathbf{v}^{(l-1)}\right)$$

This is the defining equation of a convolutional neural network — hence the name.

4. Filters as Feature Detectors

Intuitively, a filter in a CNN acts as a pattern detector. As the filter slides across the input, it produces a high output value wherever the local patch of the input "matches" the pattern encoded by the filter, and a low value where it doesn't. The output of applying a filter across all positions is called a feature map.

Different filters detect different things. For example, a Gaussian filter has a smoothing effect — it blurs the image by averaging nearby pixel values. An edge-detection filter highlights sharp transitions between dark and light regions. In a trained CNN, the network learns the filter values from data, so the filters end up detecting whatever patterns are most useful for the task at hand (edges, textures, shapes, object parts, etc.).

5. Multidimensional Convolution

5.1 2D Convolution (Grayscale Images)

Images are 2D (or 3D for colour), so we need convolution to work on matrices and tensors, not just vectors. The idea is the same: slide a small filter over the input, compute a dot product at each position, and collect the results into an output matrix (the feature map).

Example — 2D Convolution Step by Step

Consider a $7 \times 7$ grayscale image $\mathbf{x}$ and a $3 \times 3$ filter $W$:

$$W = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$

To compute the $(1,1)$ entry of the output, place $W$ on top of the upper-left $3 \times 3$ submatrix of $\mathbf{x}$, treat both as 9-dimensional vectors, and compute their dot product. Then slide $W$ one step to the right and repeat to get the $(1,2)$ entry. Continue sliding through all valid positions. Since a $3 \times 3$ filter can be placed in $5 \times 5 = 25$ positions on a $7 \times 7$ image, the result is a $5 \times 5$ feature map.
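The sliding-window procedure above can be sketched in a few lines. The filter is the one from the example; the all-ones image is a hypothetical stand-in:

```python
def conv2d(image, W):
    """'Valid' 2D convolution: slide W over every position where it fits fully,
    taking the dot product of W with the overlapping f x f patch."""
    n, f = len(image), len(W)
    out = []
    for r in range(n - f + 1):
        row = []
        for c in range(n - f + 1):
            row.append(sum(W[i][j] * image[r + i][c + j]
                           for i in range(f) for j in range(f)))
        out.append(row)
    return out

W = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]   # the filter from the example
img = [[1] * 7 for _ in range(7)]       # hypothetical all-ones 7x7 image
fmap = conv2d(img, W)
print(len(fmap), len(fmap[0]))          # 5 5  -- a 5x5 feature map, as predicted
print(fmap[0][0])                       # 5    -- the filter has five nonzero entries
```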

5.2 3D Convolution (RGB Images)

For RGB images (tensors of shape $n \times n \times 3$), the filter is also three-dimensional — for example $3 \times 3 \times 3$. The operation works the same way: slide the filter across the two spatial dimensions, and at each position, flatten both the overlapping $3 \times 3 \times 3$ patch and the filter into 27-dimensional vectors and compute their dot product. The result of applying one 3D filter is still a single 2D feature map (a matrix), because the depth dimension is summed over.

In practice, we apply multiple filters $w_1, w_2, \ldots, w_m$ to the same input. Each filter produces its own feature map, and we stack these maps together to form an output tensor of depth $m$. This means a convolutional layer takes a tensor as input and outputs a tensor — possibly with different spatial dimensions and a different depth.
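A minimal sketch of the same idea in 3D, with hypothetical channel-selecting filters chosen to make the summation over depth visible; stacking one feature map per filter yields the output tensor:

```python
def conv3d(x, w):
    """One 3D filter (f x f x c) on an n x n x c input -> one 2D feature map.
    Indexing convention: x[row][col][channel]."""
    n, f, c = len(x), len(w), len(x[0][0])
    return [[sum(w[i][j][ch] * x[r + i][cc + j][ch]
                 for i in range(f) for j in range(f) for ch in range(c))
             for cc in range(n - f + 1)]
            for r in range(n - f + 1)]

# Hypothetical 5x5x3 all-ones input and two 3x3x3 filters:
x = [[[1, 1, 1] for _ in range(5)] for _ in range(5)]
filters = [[[[1, 0, 0]] * 3 for _ in range(3)],   # responds only to the red channel
           [[[0, 1, 0]] * 3 for _ in range(3)]]   # responds only to the green channel
maps = [conv3d(x, w) for w in filters]            # stacked: depth m = 2
print(len(maps), len(maps[0]), len(maps[0][0]))   # 2 3 3
print(maps[0][0][0])                              # 9: sum over a 3x3 red-channel window
```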

6. Padding and Stride

Two hyperparameters control the details of how the filter slides over the input:

6.1 Padding ($p$)

Without any special treatment, pixels at the border of the image participate in far fewer dot products than pixels in the centre. For instance, the top-left corner pixel is only covered when the filter is placed at the very first position, while a central pixel is covered by many filter positions. This means border information is underrepresented in the output.

Padding addresses this by adding a frame of zeros (rows and columns of zeros) around the image before applying the convolution. If the padding has width $p$, the effective image size becomes $(n + 2p) \times (n + 2p)$. A common choice is to set $p$ so that the output has the same spatial dimensions as the input (called "same" padding); for stride $1$ and an odd filter size $f$, this means $p = (f-1)/2$.

6.2 Stride ($s$)

So far, we've assumed the filter moves one pixel at a time (stride $s = 1$). But we can choose a larger step size: with stride $s = 2$, the filter jumps two pixels at each step, producing a smaller output. Larger strides reduce the output dimensions and can be used as an alternative to pooling for downsampling.

Output Dimension Formula

For an $n \times n$ image, $m$ filters of size $f \times f$, padding $p$, and stride $s$, the output tensor has dimensions:

$$\left\lfloor\frac{n + 2p - f}{s} + 1\right\rfloor \times \left\lfloor\frac{n + 2p - f}{s} + 1\right\rfloor \times m$$

The floor accounts for the case where $s$ does not divide $(n + 2p - f)$ evenly; if $s \mid (n + 2p - f)$, the filter covers the padded image completely with no leftover positions, and the floor has no effect.
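The formula reduces to a one-line helper; the checks below reuse the $7 \times 7$ and $32 \times 32$ settings from this section, plus a hypothetical stride-2 variant:

```python
def conv_output_size(n, f, p, s):
    """Spatial side length of the feature map: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, 0, 1))    # 5   -- 3x3 filter on a 7x7 image
print(conv_output_size(32, 5, 2, 1))   # 32  -- "same" padding preserves the size
print(conv_output_size(32, 5, 2, 2))   # 16  -- stride 2 halves the output
```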

Example — Applying the Formula

Consider a $32 \times 32$ RGB image. We apply 16 filters of size $5 \times 5$ with padding $p = 2$ and stride $s = 1$. The output dimensions are:

$$\frac{32 + 2(2) - 5}{1} + 1 = 32 \quad \Longrightarrow \quad 32 \times 32 \times 16$$

The spatial dimensions are preserved (thanks to the padding), while the depth has changed from 3 to 16 — one feature map per filter.

Counting Parameters in a Convolutional Layer

Each filter has size $f \times f \times c$ (where $c$ is the input depth / number of channels), plus one bias term. With $m$ filters, the total number of parameters is $m \times (f^2 \times c + 1)$. Notice this does not depend on the image size $n$ — a huge advantage over fully connected layers.
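A sketch of the count, using the $5 \times 5$ filters with $c = 3$ and $m = 16$ from the example above, set against a fully connected layer computing a map of the same shape:

```python
def conv_layer_params(f, c, m):
    """m filters of size f x f x c, each with one bias: m * (f^2 * c + 1)."""
    return m * (f * f * c + 1)

print(conv_layer_params(5, 3, 16))   # 1216 -- independent of the 32x32 image size

# A fully connected layer mapping 32x32x3 inputs to 32x32x16 outputs, for contrast:
print((32 * 32 * 3) * (32 * 32 * 16))   # 50331648 weights
```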

7. Putting It All Together: A Full CNN

A typical CNN is built by stacking several types of layers in sequence:

  1. Convolutional layer: Applies a set of learned filters to produce feature maps. Each convolutional layer is followed by a nonlinear activation (typically ReLU).
  2. Pooling layer (optional): Downsamples the feature maps spatially. A common choice is max pooling, which takes the maximum value in each small patch (e.g. $2 \times 2$). This reduces spatial dimensions, provides a degree of translation invariance, and further reduces computation.
  3. Fully connected layers: After several rounds of convolution and pooling, the resulting feature maps are flattened into a vector and fed through one or more fully connected layers to produce the final output (e.g. class probabilities for classification via softmax).
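The layer-by-layer shapes of such a stack can be traced with the output-size formula from Section 6. The architecture below is hypothetical, chosen only to illustrate how spatial size shrinks while depth grows:

```python
def out_size(n, f, p, s):
    """Spatial side length after a filter of size f with padding p and stride s."""
    return (n + 2 * p - f) // s + 1

n, depth = 32, 3   # hypothetical 32x32x3 input image
layers = [(5, 2, 1, 16, "conv"),      # 16 filters, "same" padding
          (2, 0, 2, 16, "maxpool"),   # 2x2 max pooling (depth unchanged)
          (3, 1, 1, 32, "conv"),      # 32 filters, "same" padding
          (2, 0, 2, 32, "maxpool")]   # 2x2 max pooling (depth unchanged)
for (f, p, s, m, kind) in layers:
    n = out_size(n, f, p, s)
    depth = m                         # for pooling, m is just the incoming depth
    print(kind, n, "x", n, "x", depth)
# Flatten the final tensor and feed it to the fully connected layers:
print("flattened:", n * n * depth)    # flattened: 2048
```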

The early convolutional layers tend to learn low-level features (edges, corners, colour blobs), while deeper layers combine these into progressively higher-level features (textures, object parts, entire objects). This hierarchical feature learning is one of the main reasons CNNs are so effective for vision tasks.

Translation Equivariance

Because the same filter is applied at every spatial position, convolution is equivariant to translation: if the input is shifted, the feature map shifts by the same amount. This means a CNN can detect a pattern regardless of where it appears in the image — a cat's ear in the top-left corner activates the same edge-detecting filter as the same ear in the bottom-right. Combined with pooling (which provides a degree of translation invariance), this makes CNNs naturally suited to visual recognition tasks.
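The equivariance claim is easy to verify numerically. The sketch below uses a hypothetical difference filter and a zero-padded shift; away from the image boundary, convolving a shifted input gives exactly the shifted output:

```python
def conv1d_valid(w, x):
    """'Valid' 1D convolution: sliding dot product at every full-overlap position."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

def shift(v, t):
    """Shift a sequence right by t positions, padding with zeros on the left."""
    return [0] * t + v[:len(v) - t]

w = [1, -1]                       # hypothetical edge-detecting (difference) filter
x = [0, 0, 1, 1, 0, 0, 0, 0]      # a small "pattern" surrounded by zeros

# Shifting the input, then convolving, equals convolving, then shifting:
print(conv1d_valid(w, shift(x, 2)))
print(shift(conv1d_valid(w, x), 2))
```

Both lines print the same feature map; the equality is exact here because the pattern stays away from the boundary after the shift.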

8. Fully Connected vs. Convolutional: A Comparison

| Property | Fully Connected Layer | Convolutional Layer |
|---|---|---|
| Connectivity | Every input connected to every output | Each output depends only on a local patch |
| Weight sharing | No — separate weight per connection | Yes — same filter reused at every position |
| Parameters (1 layer) | $n_{\text{in}} \times n_{\text{out}}$ — depends on input size | $m \times (f^2 \times c + 1)$ — independent of image size |
| Spatial structure | Ignored (input is flattened) | Exploited (local neighbourhoods preserved) |
| Translation equivariance | No | Yes — built into the architecture |

Key Takeaways