Gradient accumulation addresses a common challenge when training neural networks: batch sizes limited by memory, especially on memory-constrained hardware.
Larger batch sizes typically smooth out training by reducing the variance and noise in each optimization step, but they require more memory. When restricted to small batch sizes, the gradient estimates become noisier and the optimization trajectory more erratic.
Gradient accumulation sidesteps this trade-off by simulating a larger effective batch size without the extra memory demand. Instead of updating the model parameters after every backward pass, the gradients from several consecutive forward and backward passes are summed into a buffer; only then is a single, larger optimizer step performed and the buffer cleared. The effective batch size is the per-pass batch size multiplied by the number of accumulation steps, so this emulates training with a larger batch, smoothing the optimization process and potentially improving convergence without additional memory overhead.
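As a concrete illustration, below is a minimal PyTorch-style sketch of such an accumulation loop. The model, optimizer, dummy data, and `accumulation_steps` value are placeholders chosen for the example, not details from the text above.

```python
# Minimal gradient accumulation sketch (assumed setup, for illustration only).
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
accumulation_steps = 4                          # accumulate gradients over 4 micro-batches

# Dummy data: 16 micro-batches of 8 samples each (illustrative only).
data = [(torch.randn(8, 128), torch.randint(0, 10, (8,))) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    outputs = model(inputs)
    # Scale the loss so the summed gradients match the average over the
    # larger effective batch.
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()                             # gradients accumulate in the .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                        # one update using the accumulated gradients
        optimizer.zero_grad()                   # clear buffers for the next accumulation cycle
```

With a micro-batch of 8 and 4 accumulation steps, each optimizer step here behaves roughly like an update computed on a batch of 32, while only one micro-batch ever resides in memory at a time.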