How to Train BERT with an Academic Budget

We present a recipe for training a BERT-like masked language model in 24 hours, using only 8 Nvidia Titan-V GPUs (12GB each).
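The masked language modeling objective referenced above can be illustrated with a minimal sketch of BERT-style input corruption: roughly 15% of positions are selected as prediction targets, and of those, 80% are replaced with a `[MASK]` token, 10% with a random token, and 10% left unchanged. The function name, the `-100` ignore-label convention, and the exact token ids here are illustrative assumptions, not the paper's implementation.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style masking sketch (assumed, not the paper's code).

    Selects ~mlm_prob of positions as prediction targets; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 at non-target positions (ignored by the loss).
    """
    rng = rng or random.Random(0)
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # predict the original token here
            r = rng.random()
            if r < 0.8:
                input_ids[i] = mask_id               # 80%: [MASK]
            elif r < 0.9:
                input_ids[i] = rng.randrange(vocab_size)  # 10%: random
            # else 10%: keep the original token
    return input_ids, labels
```

The model is then trained with cross-entropy loss only at the positions where `labels` is not `-100`.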