How to pre-train transformer models?

Transformer model, BERT, GPT, T5

Implementations

Cost

How much does it cost to train a BERT model? Sharir2020cost estimates the training cost for models of different parameter counts.

Clova’s LaRva team [1] reports that pre-training BERT costs about $7,000 with 16 TPU v2 devices over 4 days. The article “The Staggering Cost of Training SOTA AI Models” cites $6,912 for a one-time BERT pre-training run, or about $500 for the BERT-Base model.

Izsak2021how proposes a recipe for training BERT in 24 hours with eight 12 GB GPUs.
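The figures above can be sanity-checked with simple device-hour arithmetic. The sketch below is only an illustration: the helper name `rental_cost` is made up for this note, and the rate of $4.50 per TPU v2 device-hour is an assumption that is not stated in the sources cited here. With that assumed rate, 16 devices for 4 days comes out at the $6,912 figure quoted above.

```python
def rental_cost(devices: int, hours: float, usd_per_device_hour: float) -> float:
    """Total on-demand cost: devices x wall-clock hours x hourly rate."""
    return devices * hours * usd_per_device_hour


# ASSUMPTION: $4.50 per TPU v2 device-hour. This rate is not given in the
# text above; substitute the current price of whatever hardware you rent.
TPU_V2_USD_PER_HOUR = 4.50

# 16 TPU v2 devices running for 4 days (96 hours).
bert_tpu_cost = rental_cost(devices=16, hours=4 * 24,
                            usd_per_device_hour=TPU_V2_USD_PER_HOUR)

# The 24-hour recipe uses 8 GPUs for 24 hours, i.e. 192 GPU-hours.
# No GPU price is assumed here, so only the device-hours are computed.
budget_recipe_gpu_hours = 8 * 24

print(f"BERT on 16 TPU v2 for 4 days: ${bert_tpu_cost:,.0f}")
print(f"24-hour recipe: {budget_recipe_gpu_hours} GPU-hours")
```

Multiplying the GPU-hours by your own provider's hourly rate gives a comparable estimate for the 24-hour recipe.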

Tips

It’s important to be able to see progress during training [1]. The LaRva team also built a dynamic data pipeline that generates the training set on the fly.
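To make the two tips concrete, here is a minimal sketch of an on-the-fly pipeline combined with periodic progress logging. It is not the LaRva team's implementation: the function names (`mask_tokens`, `stream_examples`), the toy corpus, and the whitespace tokenization are all hypothetical. The masking follows the scheme from the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged), and a fresh masking is drawn each epoch instead of being precomputed and stored.

```python
import random
from typing import Iterator, List, Optional, Tuple

MASK = "[MASK]"
Example = Tuple[List[str], List[Optional[str]]]


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                mask_prob: float = 0.15) -> Example:
    """Return (inputs, labels); a label is None where nothing is predicted."""
    inputs: List[str] = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


def stream_examples(corpus: List[str], vocab: List[str], epochs: int,
                    seed: int = 0) -> Iterator[Example]:
    """Lazily yield freshly masked examples; nothing is materialized up front."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for sentence in corpus:
            yield mask_tokens(sentence.split(), vocab, rng)


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "dogs chase cats in the yard"]
    vocab = sorted({word for s in corpus for word in s.split()})
    log_every = 2  # hypothetical logging interval
    seen_targets = 0
    for step, (inputs, labels) in enumerate(
            stream_examples(corpus, vocab, epochs=3), start=1):
        seen_targets += sum(label is not None for label in labels)
        # A real training step would run the model on (inputs, labels) here.
        if step % log_every == 0:
            print(f"step {step}: {seen_targets} masked positions so far")
```

In a real run the logged quantity would be the training loss or a validation metric rather than a count of masked positions; the point is only that the loop reports something at a fixed interval while the data is generated lazily.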

How to scale the BERT Training with Nvidia GPUs?

Options

Google’s Cloud TPU can be used for training. There is a tutorial with source code: https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379 and https://colab.research.google.com/drive/1nVn6AFpQSzXBt8_ywfx6XR8ZfQXlKGAz

Hugging Face has a tutorial for training a new language model from scratch: https://huggingface.co/blog/how-to-train

Pre-training can also be done on AWS; see “Amazon Web Services achieves fastest training times for BERT and Mask R-CNN”. There are also a couple of offerings in the AWS Marketplace: https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=BERT