Neural Networks 3: Training Techniques & Regularization

Course: Introduction to Machine Learning · Lecture 13 · Spring 2026 · Prof. Andreas Krause

This lecture addresses the practical challenges of training deep neural networks. Building on the backpropagation and architecture foundations from Lectures 11–12, we now tackle three critical questions: how should we initialize weights, how should we tune the learning rate and optimizer, and how do we prevent overfitting in models with millions of parameters? The lecture also introduces convolutional neural networks (CNNs) as a specialized architecture for image data.

1. Weight Initialization

1.1 Why Initialization Matters

Because the NN loss function is non-convex, the starting point of optimization directly affects where gradient descent converges. If we initialize all weights to the same value (e.g., zero), every hidden unit computes the same thing and receives the same gradient — the network can never break this symmetry. Random initialization breaks this symmetry, but we still need to be careful about the scale of the random values.
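
The symmetry argument can be made concrete with a tiny NumPy sketch (the network size and values below are made up for illustration): when all weights are set to the same constant, both hidden units compute the same activation and receive the same gradient, so gradient descent can never tell them apart.

```python
import numpy as np

# Tiny net: 3 inputs -> 2 hidden (ReLU) -> 1 output, all values illustrative.
# Every weight gets the same constant, so the two hidden units are identical.
x = np.array([1.0, 2.0, 0.5])
W1 = np.full((2, 3), 0.1)    # both rows identical
w2 = np.full(2, 0.1)

z = W1 @ x                   # identical pre-activations
h = np.maximum(z, 0.0)       # identical activations
y_hat = w2 @ h

# Backprop for squared error with target y = 1:
delta = y_hat - 1.0          # d loss / d y_hat
dz = delta * w2 * (z > 0)    # through the output weights and the ReLU
grad_W1 = np.outer(dz, x)    # d loss / d W1

# Both rows of the gradient are identical: the units can never diverge.
print(np.allclose(grad_W1[0], grad_W1[1]))  # True
```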

1.2 The Vanishing & Exploding Gradients Problem

Recall from backpropagation that the gradient with respect to the weights at layer $l$ is:

$$\nabla_{W^{(l)}} \ell = \frac{\partial \ell}{\partial f} \cdot \frac{\partial f}{\partial h^{(l)}} \cdot \frac{\partial h^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

This is a product of many terms — one for each layer the signal has to pass through. If the individual factors are consistently greater than 1, the product grows exponentially (exploding gradients). If they are consistently less than 1, the product shrinks to zero (vanishing gradients). Either scenario makes training fail: exploding gradients cause divergence, while vanishing gradients cause the early layers to stop learning entirely.

Key Insight

The factors $\text{diag}(\varphi'(z^{(l)}))$ and the activations $h^{(l-1)}$ both appear in the gradient chain. If the norms of these vectors grow or shrink uncontrollably across layers, optimization breaks down. This is why both the activation function and the weight scale matter so much.

1.3 Activation Function Choice

The sigmoid activation $\varphi(z) = \frac{1}{1+e^{-z}}$ has a derivative $\varphi'(z) = \varphi(z)(1-\varphi(z))$ that is essentially zero outside a narrow band around $z=0$. This makes it prone to vanishing gradients in deep networks. The ReLU activation $\varphi(z) = \max(z, 0)$ has derivative 1 for positive inputs and 0 for negative ones — the gradient neither amplifies nor shrinks for active units, which generally helps training.
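
A quick numerical sketch of this point (plain NumPy, illustrative values): even in the best case, where every pre-activation sits at $z = 0$ and $\varphi'$ attains its maximum of $0.25$, a chain of sigmoid derivative factors shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25 (at z = 0) and decays quickly:
print(sigmoid_prime(0.0))   # 0.25
print(sigmoid_prime(5.0))   # ~0.0066

# In an L-layer chain, the gradient picks up one phi'(z) factor per layer.
# Even in the best case (all z = 0), the product shrinks like 0.25**L:
L = 10
print(0.25 ** L)            # ~9.5e-07: the gradient has all but vanished
# ReLU's derivative is exactly 1 for active units, so no such shrinkage.
```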

1.4 Initialization Schemes

The goal of smart initialization is to keep the variance of activations approximately constant across layers, so that signals and gradients neither explode nor vanish at the start of training.

Intuition: A Single-Layer Example

Consider one hidden unit with ReLU and $d$ inputs $x_1, \ldots, x_d$ (zero-mean, unit variance). If weights are drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$, then the pre-activation $Z = \sum_{i=1}^d w_i x_i$ has variance $d \cdot \sigma^2$. After ReLU, $\mathbb{E}[\max(Z,0)^2] = \frac{1}{2} d \sigma^2$. Setting $\sigma^2 = 2/d$ makes the output variance equal to 1 — matching the input variance.
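
This calculation can be checked by simulation. The NumPy sketch below (sizes and seed are arbitrary) draws He-scaled weights, $\sigma^2 = 2/d$, and confirms both variances empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
x = rng.standard_normal((5000, d))               # zero-mean, unit-variance inputs
W = rng.normal(0.0, np.sqrt(2.0 / d), (100, d))  # 100 units, sigma^2 = 2/d

Z = x @ W.T                     # pre-activations; Var(Z) ≈ d * sigma^2 = 2
H = np.maximum(Z, 0.0)          # ReLU output

print(Z.var())                  # ≈ 2.0
print(np.mean(H ** 2))          # ≈ 1.0: matches the input variance
```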

This reasoning extends to practical initialization schemes used in deep networks:

Xavier / Glorot Initialization (for tanh)

Draw each weight from $w_{i,j} \sim \mathcal{N}\!\left(0, \frac{1}{n_{\text{in}}}\right)$ or $w_{i,j} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of incoming and outgoing units of the layer.

He Initialization (for ReLU)

Draw each weight from $w_{i,j} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)$. The factor of 2 accounts for ReLU zeroing out roughly half of the inputs.
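
Both schemes are only a few lines of NumPy. The helper names below (`xavier_init`, `he_init`) are illustrative, not from any library.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot/Xavier: Var(w) = 2 / (n_in + n_out); suited to tanh layers."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    """He: Var(w) = 2 / n_in; the factor 2 compensates for ReLU zeroing
    out roughly half of the pre-activations."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(784, 256, rng)
print(W.shape, W.std())   # std ≈ sqrt(2/784) ≈ 0.0505
```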

2. Learning Rate & Momentum

2.1 Choosing the Learning Rate

The SGD update rule is $W \leftarrow W - \eta_t \nabla_W \ell(W; x, y)$. The learning rate $\eta_t$ is perhaps the single most important hyperparameter to tune.

Learning Rate Schedules

In practice, a common strategy is to use a decaying learning rate schedule. The intuition is straightforward: at the beginning of training, the parameters are far from any good solution, so we want large steps to make quick progress. As we get closer to a minimum, we want smaller steps to avoid overshooting. Popular choices include piecewise constant decay (drop the rate by a factor at set epochs) and linear or cosine decay.
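
As a sketch, the two decay schedules might look as follows (the epoch boundaries and decay factors are arbitrary example values):

```python
import math

def piecewise_constant(eta0, epoch, drops=((30, 0.1), (60, 0.01))):
    """Multiply the base rate by a fixed factor once preset epochs are reached."""
    factor = 1.0
    for e, f in drops:
        if epoch >= e:
            factor = f
    return eta0 * factor

def cosine_decay(eta0, epoch, total_epochs):
    """Smoothly anneal the rate from eta0 down to 0 over the whole run."""
    return eta0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 30, 60):
    print(piecewise_constant(0.1, epoch), round(cosine_decay(0.1, epoch, 90), 4))
```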

Monitoring the Update Ratio

A useful diagnostic is to track the ratio of the gradient norm to the weight norm: $\frac{\|\nabla_\Theta L(\Theta; \mathcal{D})\|_2}{\|\Theta\|_2}$. If this ratio is very small, the updates are negligible relative to the weights, suggesting the learning rate should be increased; if it is very large, the learning rate should be decreased.

2.2 Momentum

Plain SGD often oscillates erratically, especially in loss landscapes where the curvature differs a lot across directions (ill-conditioned landscapes). Momentum smooths out these oscillations by incorporating information from past update directions:

$$\mathbf{d} \leftarrow m \cdot \mathbf{d} + \eta_t \nabla_W \ell(W; x, y)$$

$$W \leftarrow W - \mathbf{d}$$

Here $m \in [0, 1)$ is the momentum coefficient. The vector $\mathbf{d}$ accumulates a running average of past gradients (with exponential decay). The effect is that consistent gradient directions get reinforced, while oscillating components cancel out — leading to faster, more stable convergence.
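
The two update equations translate directly into code. The sketch below (illustrative names; the objective is a made-up quadratic) runs momentum SGD on a landscape whose curvature differs by a factor of 100 across directions, exactly the ill-conditioned setting described above.

```python
import numpy as np

def sgd_momentum_step(W, d, grad, eta, m=0.9):
    """One momentum update: d <- m*d + eta*grad; W <- W - d."""
    d = m * d + eta * grad
    return W - d, d

# Minimize the ill-conditioned quadratic f(w) = 0.5*(w1^2 + 100*w2^2).
W = np.array([1.0, 1.0])
d = np.zeros(2)
for _ in range(200):
    grad = np.array([W[0], 100.0 * W[1]])   # gradient of f
    W, d = sgd_momentum_step(W, d, grad, eta=0.01, m=0.9)

print(W)   # close to the minimum at (0, 0)
```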

Practical Note

There are also adaptive methods (AdaGrad, RMSProp, Adam) that maintain per-parameter learning rates. The core idea: coordinates that have already changed a lot get a smaller effective step size. These are very popular in modern deep learning and often work well out of the box with less learning rate tuning.

3. Regularization in Neural Networks

Neural networks can have thousands to millions of parameters, which makes them highly expressive but also prone to overfitting. This section covers the main countermeasures.

3.1 Weight Decay (L2 Regularization)

Just like ridge regression, we add a penalty term that discourages large weights:

$$\Theta^* = \arg\min_\Theta \sum_{i=1}^{n} \ell(\Theta; x_i, y_i) + \lambda \|\Theta\|_2^2$$

The hyperparameter $\lambda > 0$ controls the strength of the penalty. Larger $\lambda$ pushes weights toward zero, reducing model complexity at the cost of potentially underfitting. The technique is also called weight decay because the gradient of the penalty term, $2\lambda \Theta$, shrinks ("decays") the weights a little at every step.
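
In code, the penalty simply adds $2\lambda W$ to the data gradient. The sketch below (illustrative helper name) shows that with a zero data gradient, each step multiplies the weights by $1 - 2\eta\lambda$, which is where the name "decay" comes from.

```python
import numpy as np

def weight_decay_step(W, grad_loss, eta, lam):
    """SGD on loss + lam*||W||^2: the penalty contributes the extra
    gradient term 2*lam*W, shrinking the weights every step."""
    return W - eta * (grad_loss + 2.0 * lam * W)

W = np.array([1.0, -2.0, 0.5])
# With a zero data gradient, each step multiplies W by (1 - 2*eta*lam):
W_next = weight_decay_step(W, np.zeros_like(W), eta=0.1, lam=0.05)
print(W_next)   # W * (1 - 2*0.1*0.05) = W * 0.99
```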

3.2 Early Stopping

Instead of training until the optimization converges fully, we monitor the model's performance on a validation set during training. When the training error keeps going down but the validation error starts to rise, the model is beginning to overfit. We stop training right before this point and use the model from that epoch.

Why does this work?

Early stopping effectively limits how far the parameters can travel from their initialization. Since the parameters start small (from random initialization), stopping early means the learned function stays relatively "simple." This acts as an implicit form of regularization — conceptually similar to constraining the norm of the weight vector.
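
A minimal early-stopping loop, sketched on a deliberately overparameterized toy regression problem (all data and hyperparameters below are made up for illustration): track validation error at every step, remember the best checkpoint, and stop once it has not improved for a fixed patience.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy regression with far more features than the target needs.
X = rng.uniform(-1, 1, size=40)
y = np.sin(3 * X) + 0.3 * rng.standard_normal(40)
Phi = np.vander(X, 15)                      # degree-14 features: easy to overfit
Phi_tr, y_tr = Phi[:25], y[:25]
Phi_va, y_va = Phi[25:], y[25:]

w = np.zeros(15)                            # start from small weights
eta, patience = 0.05, 50
best_va, best_w, since_best = np.inf, w.copy(), 0

for step in range(5000):
    grad = Phi_tr.T @ (Phi_tr @ w - y_tr) / len(y_tr)
    w -= eta * grad                         # one gradient-descent step
    va = np.mean((Phi_va @ w - y_va) ** 2)  # validation error
    if va < best_va:
        best_va, best_w, since_best = va, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:          # no improvement for a while: stop
            break

final_va = np.mean((Phi_va @ w - y_va) ** 2)
print(best_va <= final_va)                  # kept checkpoint is at least as good
```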

3.3 Dropout

Dropout (Srivastava et al., 2014) is a regularization technique unique to neural networks. The idea is elegant: during each iteration of SGD, we randomly "drop out" (remove) each hidden unit independently with probability $1-p$. The remaining sub-network is trained normally. In the next iteration, a different random subset of units is dropped.

Training Phase

For each SGD step, every hidden unit is kept with probability $p$ and deleted with probability $1-p$. The weights associated with deleted units are frozen for that step. Only the surviving sub-network receives gradient updates.

Test Phase

At test time, all units are used (no dropout), but all weights are multiplied by $p$ to compensate. The reasoning: during training, each unit was active only a fraction $p$ of the time, so its weights were calibrated for a network where roughly $p$ fraction of units are active. Multiplying by $p$ ensures the expected activation magnitude matches what was seen during training.
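
The train/test behavior above can be sketched in a few lines of NumPy (the function name is illustrative). Note that scaling a unit's output by $p$ at test time is equivalent to scaling its outgoing weights by $p$, since the next layer is linear in its inputs.

```python
import numpy as np

def dropout_forward(h, p, train, rng=None):
    """Dropout as described above: keep each unit with probability p
    during training; at test time use all units but scale by p."""
    if train:
        mask = (rng.random(h.shape) < p).astype(h.dtype)
        return h * mask            # dropped units output 0 for this step
    return h * p                   # test time: match expected activation scale

rng = np.random.default_rng(0)
h = np.ones(100_000)               # toy layer of constant activations
p = 0.8
h_train = dropout_forward(h, p, train=True, rng=rng)
h_test = dropout_forward(h, p, train=False)

print(h_train.mean())   # ≈ 0.8: roughly a fraction p of units survive
print(h_test.mean())    # exactly 0.8: test-time scaling matches
```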

Why Dropout Works — Intuition

Dropout prevents units from co-adapting, i.e., relying too heavily on specific other units. Since any unit may be absent in a given training step, each unit must learn features that are useful on its own and in combination with many different random subsets of other units. This encourages more robust, generalizable representations. You can also think of it as training an exponential number of different "thinned" networks simultaneously and averaging their predictions at test time.

4. Batch Normalization

4.1 Motivation

We normalize input data (zero mean, unit variance) as a standard preprocessing step — it helps optimization converge faster. But in deep networks, each layer receives input from the previous layer, and as the weights change during training, the distribution of these intermediate inputs keeps shifting. This phenomenon is called internal covariate shift. Weight initialization helps at the start, but after a few SGD steps the weights may drift far from their initial values, and the benefits of careful initialization are lost.

4.2 The Batch Normalization Layer

Batch normalization (Ioffe & Szegedy, 2015) addresses this by standardizing activations during training, not just at initialization. It is inserted as an additional layer in the network. Given a mini-batch $S$ of activation values, it computes:

Batch Normalization — Algorithm

Input: Activation values $x_j$ for all $j \in S$ in the current mini-batch.

Learnable parameters: $\gamma$ (scale), $\beta$ (shift)

  1. Compute mini-batch mean: $\mu_S = \frac{1}{|S|} \sum_{j \in S} x_j$
  2. Compute mini-batch variance: $\sigma_S^2 = \frac{1}{|S|} \sum_{j \in S} (x_j - \mu_S)^2$
  3. Normalize: $\hat{x}_j = \frac{x_j - \mu_S}{\sqrt{\sigma_S^2 + \varepsilon}}$ (where $\varepsilon$ prevents division by zero)
  4. Scale and shift: $\bar{x}_j = \gamma \hat{x}_j + \beta$

Output: $\bar{x}_j$ for all $j \in S$

Why include learnable $\gamma$ and $\beta$?

After step 3, the activations have zero mean and unit variance. But this isn't necessarily optimal — maybe a different mean and variance would make learning easier. The learnable parameters $\gamma$ and $\beta$ give the network the flexibility to undo the normalization if that helps. Crucially, if the network "wants" mean 0 and variance 1, it simply learns $\beta = 0$ and $\gamma = 1$.

Test-Time Behavior

During inference, we may not have a mini-batch to compute statistics over (e.g., predicting a single example). To handle this, batch normalization maintains a moving average of the mean and variance during training (using an exponential moving average with momentum $\alpha$). At test time, these stored statistics are used instead of mini-batch statistics.
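
Putting the algorithm and the test-time behavior together, a minimal batch-norm layer might look like the following (NumPy sketch, forward pass only; the class and attribute names are illustrative):

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch norm for (batch, features) inputs, following the
    algorithm above, with running statistics for test time."""
    def __init__(self, num_features, eps=1e-5, alpha=0.9):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.eps, self.alpha = eps, alpha
        self.run_mean = np.zeros(num_features)
        self.run_var = np.ones(num_features)

    def forward(self, x, train=True):
        if train:
            mu = x.mean(axis=0)               # step 1: mini-batch mean
            var = x.var(axis=0)               # step 2: mini-batch variance
            # exponential moving average, used at test time
            self.run_mean = self.alpha * self.run_mean + (1 - self.alpha) * mu
            self.run_var = self.alpha * self.run_var + (1 - self.alpha) * var
        else:
            mu, var = self.run_mean, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)   # step 3: normalize
        return self.gamma * x_hat + self.beta        # step 4: scale and shift

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
x = 5.0 + 2.0 * rng.standard_normal((64, 4))  # off-center, wide activations
out = bn.forward(x, train=True)
print(out.mean(axis=0))   # ≈ 0 per feature
print(out.std(axis=0))    # ≈ 1 per feature
```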

4.3 Benefits of Batch Normalization

Batch normalization keeps the activation scales controlled throughout training rather than only at initialization, which makes optimization less sensitive to the initial weights and typically allows larger learning rates. The noise in the mini-batch statistics also acts as a mild regularizer.

Beyond Batch Normalization

Batch normalization is not the only normalization technique. Alternatives include Layer Normalization (normalizes across features instead of across the batch — popular in NLP/transformers), Group Normalization, Instance Normalization, and Weight Standardization. Each has different trade-offs depending on the architecture and task.

5. Convolutional Neural Networks (CNNs)

5.1 Motivation: Images as Input

In computer vision tasks, the input is a digital image represented as a grid of pixel values. A grayscale image of size $m \times n$ is a matrix; an RGB image is a tensor of shape $m \times n \times 3$ (one matrix per color channel). We could flatten this into a vector and feed it to a fully connected NN, but this ignores the spatial structure of images: nearby pixels are strongly correlated, while distant pixels are less so.

A fully connected layer connecting, say, a $100 \times 100$ image to a hidden layer of 1000 units would require $10{,}000 \times 1{,}000 = 10{,}000{,}000$ parameters — just for one layer. This is both wasteful and prone to overfitting.

5.2 Key Idea: Local Connectivity & Weight Sharing

CNNs replace the fully connected layers (at least in the early part of the network) with convolutional layers, which exploit two principles:

  1. Local connectivity: each hidden unit is connected only to a small spatial patch of the input rather than to every pixel.
  2. Weight sharing: the same filter weights are applied at every spatial position, so a pattern detector learned in one location is reused everywhere.

5.3 The Convolution Operation

Given an input (e.g., a matrix representing a grayscale image) and a small filter $W$, the convolution $x * W$ is computed by sliding $W$ across every position in $x$ and computing the dot product at each position. The result is a new matrix (often called a feature map).

2D Convolution Example

Place a $3 \times 3$ filter at the top-left corner of a $7 \times 7$ image. Treat both the overlapping $3\times3$ patch and the filter as 9-dimensional vectors, and compute their dot product — that gives you the $(1,1)$ entry of the output. Slide the filter one step to the right, repeat, and so on through all positions. The result is a $5 \times 5$ feature map.

For RGB images (3D tensors), the filter is also 3D (e.g., $3 \times 3 \times 3$). You still slide it across spatial positions, taking the dot product of the full 3D patch with the 3D filter at each location. Applying multiple filters gives multiple feature maps, stacked into an output tensor.

5.4 Padding and Stride

Two important hyperparameters control how the convolution is performed:

Output Dimension Formula

For an $n \times n$ image, $m$ filters of size $f \times f$, padding $p$, and stride $s$, the output tensor has dimensions:

$$\left(\frac{n + 2p - f}{s} + 1\right) \times \left(\frac{n + 2p - f}{s} + 1\right) \times m$$

This requires that $s$ divides $(n + 2p - f)$.
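
The formula is easy to wrap in a small helper (illustrative name), including the divisibility check:

```python
def conv_output_dim(n, f, p, s):
    """Spatial output size for an n x n input, f x f filter,
    padding p, stride s; requires s to divide n + 2p - f."""
    assert (n + 2 * p - f) % s == 0, "stride must divide n + 2p - f"
    return (n + 2 * p - f) // s + 1

print(conv_output_dim(7, 3, 0, 1))   # 5: the 7x7 example from above
print(conv_output_dim(7, 3, 1, 1))   # 7: padding 1 preserves the size
print(conv_output_dim(32, 5, 0, 3))  # 10
```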

5.5 Putting It Together

A typical CNN alternates convolutional layers (which extract local features) with nonlinear activations and sometimes pooling layers (which downsample feature maps). The later layers may be fully connected. The key advantage over fully connected networks is a massive reduction in parameters due to weight sharing, and the architecture's natural ability to recognize patterns regardless of their position in the image (translation equivariance).

Key Takeaways