machine-learning · neural-networks · deep-learning

How Neural Networks Actually Learn: Backpropagation Explained

Most tutorials hand-wave over backprop. This one walks through the actual math — gradients, the chain rule, and why any of it works — with code you can run.



Every neural network tutorial eventually says "and then we run backpropagation." Most stop there. This one won't.

Backpropagation is just the chain rule from calculus, applied systematically across a graph of operations. Once you see that, the whole thing clicks.

What We're Actually Trying to Do

A neural network is a function. You feed it an input $x$, it produces an output $\hat{y}$, and you measure how wrong it is with a loss function $L$.

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Training means adjusting the weights $W$ so that $L$ gets smaller. The question is: which direction do we adjust them?

That's what the gradient tells us. The gradient $\frac{\partial L}{\partial W}$ is a vector pointing in the direction of steepest increase in $L$. We go the opposite way.

$$W \leftarrow W - \alpha \cdot \frac{\partial L}{\partial W}$$

This is gradient descent. $\alpha$ is the learning rate — how big a step to take.
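
To make the update rule concrete, here's a minimal sketch of gradient descent on a one-parameter toy loss, $L(w) = (w - 3)^2$ (not part of the network we'll build, just an illustration):

w = 0.0          # starting weight
alpha = 0.1      # learning rate
for step in range(50):
    grad = 2 * (w - 3)        # dL/dw for L(w) = (w - 3)^2
    w = w - alpha * grad      # the update rule above
print(w)                      # approaches 3, the minimum of L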

The Chain Rule Is the Whole Story

Consider a simple two-layer network:

$$z_1 = W_1 x, \quad a_1 = \sigma(z_1), \quad z_2 = W_2 a_1, \quad L = \text{loss}(z_2, y)$$

To update $W_1$, we need $\frac{\partial L}{\partial W_1}$. We don't have a direct formula for that — but the chain rule says:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$

Each of these individual derivatives is simple. We just multiply them. That's backpropagation — computing this product backwards through the network, layer by layer.
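
To see the multiplication in action, here's a scalar version of that chain with made-up numbers and a squared-error loss (a sketch, not the vectorized implementation below):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# arbitrary example values for the input, target, and weights
x, y = 1.5, 0.5
w1, w2 = 0.8, -1.2

# forward pass
z1 = w1 * x
a1 = sigmoid(z1)
z2 = w2 * a1
L = (z2 - y) ** 2

# each local derivative is simple; the gradient is their product
dL_dz2  = 2 * (z2 - y)
dz2_da1 = w2
da1_dz1 = a1 * (1 - a1)      # sigmoid'(z1)
dz1_dw1 = x
dL_dw1  = dL_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)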

A Minimal Implementation from Scratch

Here's a single hidden-layer network in pure NumPy, with backprop written explicitly:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Forward pass
def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    return z1, a1, z2

# Mean squared error loss
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

# Backward pass — this is backprop
def backward(X, y, z1, a1, z2, W2):
    n = X.shape[0]

    # Gradient of loss w.r.t. z2 (output layer pre-activation)
    dL_dz2 = 2 * (z2 - y) / n          # shape: (n, output_dim)

    # Gradients for W2 and b2
    dL_dW2 = a1.T @ dL_dz2             # chain: dz2/dW2 = a1
    dL_db2 = dL_dz2.sum(axis=0)

    # Backprop through W2 into a1
    dL_da1 = dL_dz2 @ W2.T             # chain: dz2/da1 = W2

    # Backprop through sigmoid
    dL_dz1 = dL_da1 * sigmoid_grad(z1) # chain: da1/dz1 = sigmoid'

    # Gradients for W1 and b1
    dL_dW1 = X.T @ dL_dz1
    dL_db1 = dL_dz1.sum(axis=0)

    return dL_dW1, dL_db1, dL_dW2, dL_db2

# Training loop
np.random.seed(42)
X = np.random.randn(100, 4)   # 100 examples, 4 features
y = np.random.randn(100, 1)   # regression target

W1 = np.random.randn(4, 8) * 0.01
b1 = np.zeros(8)
W2 = np.random.randn(8, 1) * 0.01
b2 = np.zeros(1)
lr = 0.01

for step in range(500):
    z1, a1, z2 = forward(X, W1, b1, W2, b2)
    loss = mse_loss(z2, y)

    dW1, db1, dW2, db2 = backward(X, y, z1, a1, z2, W2)

    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if step % 100 == 0:
        print(f"Step {step:4d} | Loss: {loss:.4f}")

Run this and you'll see the loss fall. Each dL_dW follows exactly one chain of derivatives — nothing magical.
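
If you want to convince yourself the backward pass is correct, a finite-difference check on a single weight is a quick sanity test (a sketch reusing the arrays and functions defined above):

# nudge one entry of W1 and compare the numeric slope with backward()'s answer
eps = 1e-5
i, j = 0, 0

W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[i, j] += eps
W1_minus[i, j] -= eps

loss_plus = mse_loss(forward(X, W1_plus, b1, W2, b2)[2], y)
loss_minus = mse_loss(forward(X, W1_minus, b1, W2, b2)[2], y)
numeric = (loss_plus - loss_minus) / (2 * eps)

z1, a1, z2 = forward(X, W1, b1, W2, b2)
analytic = backward(X, y, z1, a1, z2, W2)[0][i, j]   # dL_dW1[i, j]

print(f"numeric: {numeric:.8f}  analytic: {analytic:.8f}")   # the two should match closely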

Why Deep Networks Are Tricky

Two classic problems emerge in deeper networks:

Vanishing gradients. Sigmoid squashes inputs into $(0, 1)$. Its gradient is at most $0.25$. Multiply 20 of those together and you get $0.25^{20} \approx 10^{-12}$. Gradients in early layers essentially vanish — those weights barely move. This is why ReLU replaced sigmoid in hidden layers: $\text{ReLU}'(x) = 1$ for $x > 0$, so gradients pass through unchanged.
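
You can see the shrinkage numerically with the sigmoid_grad helper from the NumPy code above (a rough sketch: real pre-activations depend on the weights, here they're just standard-normal samples):

rng = np.random.default_rng(0)
grad = 1.0
for _ in range(20):
    z = rng.standard_normal()      # stand-in for one layer's pre-activation
    grad *= sigmoid_grad(z)        # multiply in that layer's local derivative
print(f"gradient factor after 20 sigmoid layers: {grad:.2e}")   # far below 1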

Exploding gradients. The opposite: gradients amplify through layers, causing weight updates to be enormous and training to diverge. Gradient clipping is the standard fix — cap the gradient norm before the update:

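# grads stands for the list of gradients from the backward pass, e.g. [dW1, db1, dW2, db2]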
max_norm = 1.0
grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
if grad_norm > max_norm:
    grads = [g * (max_norm / grad_norm) for g in grads]

What PyTorch Does Differently

You almost never write backward passes manually. PyTorch's autograd builds a computation graph as you do the forward pass, then .backward() traverses it:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(100, 4)
y = torch.randn(100, 1)

for step in range(500):
    optimizer.zero_grad()       # clear previous gradients
    y_pred = model(X)           # forward pass builds the graph
    loss = loss_fn(y_pred, y)
    loss.backward()             # backprop: fills .grad on each parameter
    optimizer.step()            # W -= lr * W.grad

Same math. PyTorch just automates the chain rule through its graph.
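
The gradient-clipping trick from earlier also has a built-in equivalent, torch.nn.utils.clip_grad_norm_, called between loss.backward() and optimizer.step():

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
optimizer.step()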

The Key Insight

Backpropagation isn't a special algorithm — it's calculus. The reason it works is that neural networks are just compositions of differentiable functions, and the chain rule tells us how to differentiate compositions.

Every activation, every weight matrix, every loss function is just a node in a graph. Each node knows its local derivative. Multiply them in reverse order and you have the gradient of the loss with respect to any parameter in the network.

That's the whole thing.