Gradient checkpointing is a memory-for-compute trade-off technique for neural network training. Instead of storing every intermediate activation from the forward pass—which consumes memory proportional to network depth—it saves only a strategically chosen subset (the checkpoints). During backpropagation, the missing activations are recomputed on demand by re-running the forward pass from the nearest checkpoint. With checkpoints placed every √n layers in an n-layer network, peak activation memory drops from O(n) to O(√n), at the cost of roughly one extra forward pass.
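A minimal sketch of the idea, assuming a toy chain of scalar "layers" with hand-written derivatives (the names `forward_checkpointed` and `backward_from_checkpoints`, and the `every` parameter, are illustrative, not from any library; real frameworks such as PyTorch expose this via `torch.utils.checkpoint`):

```python
import math

# Toy chain of scalar layers: each entry is (function, derivative).
# Composite: sin((2x + 3)^2)
layers = [
    (lambda x: x * 2.0,      lambda x: 2.0),          # f(x) = 2x
    (lambda x: x + 3.0,      lambda x: 1.0),          # f(x) = x + 3
    (lambda x: x ** 2,       lambda x: 2.0 * x),      # f(x) = x^2
    (lambda x: math.sin(x),  lambda x: math.cos(x)),  # f(x) = sin(x)
]

def forward_checkpointed(x, every=2):
    """Run the chain, saving only every `every`-th input as a checkpoint."""
    checkpoints = {}  # layer index -> activation entering that layer
    for i, (f, _) in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x
        x = f(x)
    return x, checkpoints

def backward_from_checkpoints(checkpoints, every=2):
    """Chain-rule gradient d(output)/d(input), recomputing each segment's
    activations from its checkpoint instead of reading stored ones."""
    grad = 1.0
    starts = sorted(checkpoints)  # segment start indices, e.g. [0, 2]
    for s in reversed(starts):    # process segments last-to-first
        end = min(s + every, len(layers))
        # Recompute the activations inside this segment (extra forward work).
        acts = [checkpoints[s]]
        for i in range(s, end - 1):
            acts.append(layers[i][0](acts[-1]))
        # Multiply in the local derivatives, last layer of the segment first.
        for i in reversed(range(s, end)):
            grad *= layers[i][1](acts[i - s])
    return grad

out, cps = forward_checkpointed(0.5, every=2)
grad = backward_from_checkpoints(cps, every=2)
```

Here only the activations entering layers 0 and 2 are ever stored; the values entering layers 1 and 3 are rebuilt during the backward sweep, which is exactly the recompute-from-nearest-checkpoint step described above.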
