
Gradient Descent: The Algorithm That Trains Every Model

From the basic idea of walking downhill on a loss surface, to momentum, Adam, learning rate schedules, and the practical choices that actually matter when training models.



Every machine learning model you've ever used — from a linear regression to GPT — was trained with some variant of gradient descent. The idea is almost embarrassingly simple. The implementation details are where the real engineering lives.

The Core Idea: Walk Downhill

A loss function $L(\theta)$ measures how wrong your model is. It's a surface in high-dimensional parameter space. Training means finding the lowest point on that surface.

The gradient $\nabla_\theta L$ points uphill, toward higher loss. So we take a small step in the opposite direction:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

$\alpha$ is the learning rate. Too large: you overshoot and diverge. Too small: training takes forever.

# Vanilla gradient descent (sketch: a PyTorch model, loss_fn, and full dataset X, y assumed in scope)
for step in range(n_steps):
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))  # d(loss)/d(param) for every parameter
    with torch.no_grad():                                        # update weights outside autograd
        for param, grad in zip(model.parameters(), grads):
            param -= learning_rate * grad

This is the entire algorithm. Everything that follows is a refinement of this one update.

Stochastic vs Batch vs Mini-Batch

Batch gradient descent computes the gradient on the full dataset before updating. Precise, but slow — you wait through the whole dataset for one step.

Stochastic gradient descent (SGD) updates after every single example. Fast, but noisy — the gradient from one example is a rough estimate of the true gradient.

Mini-batch gradient descent is the standard: compute the gradient on a small batch (typically 32–512 examples), then update. You get the efficiency of parallelism (batches fit on GPU) with reasonable gradient estimates.

# Mini-batch training loop
for epoch in range(n_epochs):
    for X_batch, y_batch in dataloader:   # dataloader shuffles & batches
        optimizer.zero_grad()
        loss = loss_fn(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()

The noise in mini-batch SGD is actually useful — it acts as a regularizer, helping models generalize rather than overfit.

The Problem with Vanilla SGD

Plain gradient descent has two major failure modes on real loss surfaces:

Ravines. Loss surfaces are often much steeper in some directions than others. SGD oscillates wildly across the steep dimension while making slow progress along the gentle one.

Saddle points. In high-dimensional spaces, saddle points (zero gradient but not a minimum) are everywhere. Pure SGD can stall at them.
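To make the ravine problem concrete, here is a toy sketch (not from the article): plain gradient descent on a quadratic loss that is 100x steeper in one direction than the other. The iterate ping-pongs across the steep axis while barely moving along the gentle one.

# Toy ravine: loss = 0.5 * (100 * x^2 + y^2), steep in x, gentle in y
import numpy as np

theta = np.array([1.0, 1.0])   # start at (x, y) = (1, 1)
lr = 0.015
for step in range(50):
    grad = np.array([100 * theta[0], theta[1]])  # gradient of the toy loss
    theta = theta - lr * grad
    if step % 10 == 0:
        print(step, theta)  # x flips sign every step; y shrinks by only 1.5% per step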

Momentum: Accumulate History

Momentum fixes the oscillation problem by accumulating a moving average of past gradients:

$$v_{t+1} = \beta v_t + (1 - \beta)\, \nabla_\theta L$$
$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$

Think of it physically: a ball rolling downhill accumulates speed in the consistent direction and cancels out the oscillations. $\beta = 0.9$ is the standard starting point.

# SGD with momentum, written out by hand (in practice: torch.optim.SGD(..., momentum=0.9))
v = {p: torch.zeros_like(p) for p in model.parameters()}
beta, lr = 0.9, 0.01

for X_batch, y_batch in dataloader:
    loss = loss_fn(model(X_batch), y_batch)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            v[p] = beta * v[p] + (1 - beta) * p.grad  # moving average of gradients
            p -= lr * v[p]                            # step along the averaged direction
            p.grad = None                             # clear for the next batch

Adaptive Learning Rates: RMSProp

Different parameters may need different learning rates: with sparse features, some weights receive gradient signal on nearly every batch while others are touched only rarely. RMSProp adapts the step size per parameter:

$$s_{t+1} = \beta s_t + (1 - \beta)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_{t+1} + \epsilon}}\, g_t$$

Parameters with historically large gradients get a smaller effective learning rate. Parameters with small gradients get a larger one.
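In PyTorch this is a one-liner; a sketch using the library's built-in implementation, where alpha is the library's name for the smoothing constant (the $\beta$ in the formula above):

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)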

Adam: The Default Choice

Adam (Adaptive Moment Estimation) combines momentum and RMSProp. It's the default optimizer for most deep learning tasks:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad \text{(second moment)}$$

Bias-corrected estimates (important early in training when $m$ and $v$ are near zero):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

Standard hyperparameters: $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

One line. Under the hood it's tracking two moving averages per parameter.
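To make that bookkeeping concrete, here is a minimal hand-rolled version of a single Adam step mirroring the equations above. This is an illustrative sketch, not the library's code; in practice, use torch.optim.Adam.

# One Adam update for a parameter tensor p at step t (t starts at 1)
# m, v: torch.zeros_like(p), maintained across steps
import torch

def adam_step(p, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    g = p.grad
    m.mul_(b1).add_(g, alpha=1 - b1)          # first moment: moving average of gradients
    v.mul_(b2).addcmul_(g, g, value=1 - b2)   # second moment: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    with torch.no_grad():
        p -= lr * m_hat / (v_hat.sqrt() + eps)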

Learning Rate Schedules

A fixed learning rate is rarely optimal. Common schedules:

Linear warmup + cosine decay (standard for transformers):

from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(optimizer, T_max=n_steps - warmup_steps)

for step in range(n_steps):
    if step < warmup_steps:
        # Linear warmup: ramp the LR from ~0 up to base_lr over the first warmup_steps steps
        scale = (step + 1) / warmup_steps
        for g in optimizer.param_groups:
            g['lr'] = base_lr * scale

    train_step()

    if step >= warmup_steps:
        scheduler.step()   # cosine decay begins once warmup has finished

ReduceLROnPlateau: cut the learning rate when validation loss stops improving. Good for training from scratch when you're not sure about the schedule.

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=5, factor=0.5   # halve the LR after 5 epochs without improvement
)
# call once per epoch, after computing validation loss:
scheduler.step(val_loss)

Practical Choices That Actually Matter

Batch size. Larger batches give better gradient estimates but tend to generalize worse (they settle into sharper minima). A commonly cited rule: if you increase the batch size by $k\times$, increase the learning rate by $\sqrt{k}$; going from 64 to 256 is $4\times$, so roughly double the learning rate.

Gradient clipping. Essential for transformers and RNNs. Without it, one bad batch can send your weights to infinity:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale gradients if their norm exceeds 1.0
optimizer.step()

Weight decay. L2 regularization on parameters. For Adam, use AdamW, which decouples weight decay from the gradient update; plain Adam folds the L2 penalty into the gradient, where the adaptive scaling dilutes it, so it no longer acts as true weight decay:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

A Mental Model for the Loss Surface

A useful (imperfect) intuition: imagine a high-dimensional bowl. Early training rapidly descends toward the bottom. Late training is slower — you're navigating a flat region with small gradients, looking for local minima that generalize well.

The loss surface for real models isn't convex. There are many local minima. Remarkably, large neural networks trained with SGD almost always find solutions that generalize well — the noise in mini-batch training helps escape sharp minima that overfit, and tends toward flat minima that generalize.

When training isn't working: check the learning rate first. Diverging loss → too high. Loss barely moving → too low. A 10× change in either direction is usually the fastest diagnostic.

Summary

Optimizer        | When to use
SGD + momentum   | Image models (ResNet, ViT) with known schedules
Adam             | Default for most tasks, especially NLP
AdamW            | Any transformer training
RMSProp          | RNNs, RL

The algorithm is simple. The craft is in learning rates, schedules, batch sizes, and clipping thresholds — and the only way to get intuition for those is to train models and watch what happens.