How Neural Networks Actually Learn: Backpropagation Explained
Most tutorials hand-wave over backprop. This one walks through the actual math — gradients, the chain rule, and why any of it works — with code you can run.
Every neural network tutorial eventually says "and then we run backpropagation." Most stop there. This one won't.
Backpropagation is just the chain rule from calculus, applied systematically across a graph of operations. Once you see that, the whole thing clicks.
What We're Actually Trying to Do
A neural network is a function. You feed it an input x, it produces an output ŷ, and you measure how wrong it is with a loss function L(ŷ, y).
Training means adjusting the weights W so that L gets smaller. The question is: which direction do we adjust them?
That's what the gradient tells us. The gradient ∇L is a vector pointing in the direction of steepest increase in L. We go the opposite way:

W ← W − η ∇L

This is gradient descent. η is the learning rate — how big a step to take.
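To make the update rule concrete before any networks get involved, here's gradient descent on a one-parameter toy function, f(w) = (w − 3)². The gradient is 2(w − 3), and stepping against it walks w toward the minimum at 3:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def f_grad(w):
    return 2 * (w - 3)

w = 0.0   # initial guess
lr = 0.1  # learning rate (eta)
for _ in range(100):
    w -= lr * f_grad(w)  # W <- W - eta * gradient

print(round(w, 4))  # converges to the minimum at w = 3.0
```

Each step shrinks the distance to the minimum by a constant factor; the same update, applied to millions of weights at once, is all a training loop does.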
The Chain Rule Is the Whole Story
Consider a simple two-layer network:

z1 = W1 x + b1
a1 = σ(z1)
ŷ = W2 a1 + b2
L = (ŷ − y)²

To update W1, we need ∂L/∂W1. We don't have a direct formula for that — but the chain rule says:

∂L/∂W1 = (∂L/∂ŷ) · (∂ŷ/∂a1) · (∂a1/∂z1) · (∂z1/∂W1)
Each of these individual derivatives is simple. We just multiply them. That's backpropagation — computing this product backwards through the network, layer by layer.
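You can verify a chain like this on a single scalar. Here L(w) = (σ(w·x) − y)², the analytic gradient is 2(a − y) · a(1 − a) · x with a = σ(w·x), and a finite-difference check confirms it (the values of x, y, w are arbitrary):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y, w = 1.5, 0.2, 0.7

# Chain rule by hand: dL/dw = dL/da * da/dz * dz/dw
a = sigmoid(w * x)
analytic = 2 * (a - y) * a * (1 - a) * x

# Numerical check via a centered finite difference
loss = lambda w: (sigmoid(w * x) - y) ** 2
eps = 1e-6
numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(analytic - numerical))  # tiny: the chain rule agrees with the slope
```

Backprop is this same check-by-multiplication, done for every weight in the network at once.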
A Minimal Implementation from Scratch
Here's a single hidden-layer network in pure NumPy, with backprop written explicitly:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Forward pass
def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    return z1, a1, z2

# Mean squared error loss
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

# Backward pass — this is backprop
def backward(X, y, z1, a1, z2, W2):
    n = X.shape[0]
    # Gradient of loss w.r.t. z2 (output layer pre-activation)
    dL_dz2 = 2 * (z2 - y) / n          # shape: (n, output_dim)
    # Gradients for W2 and b2
    dL_dW2 = a1.T @ dL_dz2             # chain: dz2/dW2 = a1
    dL_db2 = dL_dz2.sum(axis=0)
    # Backprop through W2 into a1
    dL_da1 = dL_dz2 @ W2.T             # chain: dz2/da1 = W2
    # Backprop through sigmoid
    dL_dz1 = dL_da1 * sigmoid_grad(z1) # chain: da1/dz1 = sigmoid'
    # Gradients for W1 and b1
    dL_dW1 = X.T @ dL_dz1
    dL_db1 = dL_dz1.sum(axis=0)
    return dL_dW1, dL_db1, dL_dW2, dL_db2

# Training loop
np.random.seed(42)
X = np.random.randn(100, 4)   # 100 examples, 4 features
y = np.random.randn(100, 1)   # regression target

W1 = np.random.randn(4, 8) * 0.01
b1 = np.zeros(8)
W2 = np.random.randn(8, 1) * 0.01
b2 = np.zeros(1)

lr = 0.01
for step in range(500):
    z1, a1, z2 = forward(X, W1, b1, W2, b2)
    loss = mse_loss(z2, y)
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, z2, W2)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    if step % 100 == 0:
        print(f"Step {step:4d} | Loss: {loss:.4f}")
```
Run this and you'll see the loss fall. Each dL_dW follows exactly one chain of derivatives — nothing magical.
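If you want to convince yourself the backward pass is right, the standard tool is a numerical gradient check: nudge one weight by a tiny amount, re-run the forward pass, and compare the measured slope with what backprop reports. A sketch (the forward/backward math is inlined so the snippet runs on its own):

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

# A tiny random problem instance
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
y = rng.standard_normal((5, 1))
W1 = rng.standard_normal((4, 8)) * 0.1
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.1
b2 = np.zeros(1)

def loss_at(W1):
    a1 = sigmoid(X @ W1 + b1)
    return np.mean((a1 @ W2 + b2 - y) ** 2)

# Backprop gradient for W1 — the same chain as in backward() above
z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
dL_dz2 = 2 * (z2 - y) / X.shape[0]
dL_dz1 = (dL_dz2 @ W2.T) * a1 * (1 - a1)
dL_dW1 = X.T @ dL_dz1

# Central finite difference on a single entry of W1
eps, i, j = 1e-6, 2, 3
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
num = (loss_at(Wp) - loss_at(Wm)) / (2 * eps)

print(abs(num - dL_dW1[i, j]))  # tiny: backprop matches the measured slope
```

Checking a handful of randomly chosen entries this way catches almost every backprop bug.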
Why Deep Networks Are Tricky
Two classic problems emerge in deeper networks:
Vanishing gradients. Sigmoid squashes inputs into (0, 1). Its gradient is at most 0.25. Multiply 20 of those together and you get about 9 × 10⁻¹³. Gradients in early layers essentially vanish — those weights barely move. This is why ReLU replaced sigmoid in hidden layers: ReLU′(x) = 1 for x > 0, so gradients pass through unchanged.
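The arithmetic is brutal enough to show directly:

```python
# sigmoid'(x) peaks at 0.25, so a 20-layer chain of sigmoids
# shrinks the gradient by at least a factor of 0.25**20.
shrink = 0.25 ** 20
print(f"{shrink:.1e}")  # 9.1e-13 — early-layer gradients are effectively zero
```

A ReLU chain, by contrast, multiplies by 1 wherever the unit is active, so the factor for those paths stays exactly 1 regardless of depth.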
Exploding gradients. The opposite: gradients amplify through layers, causing weight updates to be enormous and training to diverge. Gradient clipping is the standard fix — cap the gradient norm before the update:
```python
# `grads` is the list of gradient arrays, e.g. [dW1, db1, dW2, db2]
max_norm = 1.0
grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
if grad_norm > max_norm:
    grads = [g * (max_norm / grad_norm) for g in grads]
```
What PyTorch Does Differently
You almost never write backward passes manually. PyTorch's autograd builds a computation graph as you do the forward pass, then .backward() traverses it:
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(100, 4)
y = torch.randn(100, 1)

for step in range(500):
    optimizer.zero_grad()      # clear previous gradients
    y_pred = model(X)          # forward pass builds the graph
    loss = loss_fn(y_pred, y)
    loss.backward()            # backprop: fills .grad on each parameter
    optimizer.step()           # apply the update (Adam adapts the step per parameter)
```
Same math. PyTorch just automates the chain rule through its graph.
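You can watch autograd do the chain rule on a single scalar. For L = (w·x − y)², the hand-derived gradient is 2(w·x − y)·x, and .backward() produces the same number (the values here are arbitrary):

```python
import torch

x = torch.tensor(1.5)
y = torch.tensor(0.2)
w = torch.tensor(0.7, requires_grad=True)

loss = (w * x - y) ** 2
loss.backward()                          # autograd fills w.grad

analytic = 2 * (w.detach() * x - y) * x  # dL/dw by the chain rule, by hand
print(float(w.grad), float(analytic))    # the two match
```

Every .grad a PyTorch model accumulates during training is built from exactly these local-derivative products, just over a much larger graph.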
The Key Insight
Backpropagation isn't a special algorithm — it's calculus. The reason it works is that neural networks are just compositions of differentiable functions, and the chain rule tells us how to differentiate compositions.
Every activation, every weight matrix, every loss function is just a node in a graph. Each node knows its local derivative. Multiply them in reverse order and you have the gradient of the loss with respect to any parameter in the network.
That's the whole thing.