Stochastic gradient descent

An efficient, stochastic variant of Gradient descent. Instead of computing the loss and its gradient over the whole dataset at every step, split the data into small batches (mini-batches) and perform many gradient descent updates, one per batch, cycling through the batches.
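A minimal sketch of the idea, assuming a linear regression model with mean squared error loss. The function name `sgd_linear_regression`, the hyperparameter values, and the synthetic data are illustrative assumptions, not taken from any particular library or paper.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Fit weights w for y ~ X @ w with mini-batch SGD (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # Reshuffle each epoch so the batches differ from pass to pass.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this batch only.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

# Synthetic data (assumed for illustration): 256 samples, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w_hat = sgd_linear_regression(X, y)
print(w_hat)
```

Each update uses only `batch_size` samples, so a step costs far less than a full-dataset gradient, at the price of a noisier gradient estimate.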

Smith2021origin and Barrett2021implicit argue that SGD provides implicit Regularization. Here is a note on these papers by Ferenc Huszár.

Deep learning