A useful way to understand BERT is from the perspective of machine translation with the attention mechanism. The talk BERT and other pretrained language models, given by first author Jacob Devlin, explains the model very well.
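
As a rough illustration of the attention mechanism mentioned above, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification under stated assumptions: it leaves out the learned query/key/value projections, multiple heads, and masking that a real Transformer uses, and the function names are chosen for this example only. In machine translation the queries come from the decoder and the keys and values from the encoder; in BERT's self-attention all three come from the same input sequence.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query is compared with every key; the resulting weights are used
    to take a weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to each key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Self-attention: the same toy "token" vectors act as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = attention(tokens, tokens, tokens)
```

Each output vector is a mixture of all input vectors, which is how every token representation in BERT can depend on the whole sentence.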

A common pipeline for many NLP tasks is simply to fine-tune the pre-trained BERT model. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (Colab notebook).

Because BERT is a large model, there have been several attempts to reduce the number of parameters, such as DistilBERT.
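
To make the size difference concrete, the following back-of-the-envelope sketch counts the weights of a BERT-style encoder from its configuration. The configuration values (vocabulary size 30522, hidden size 768, feed-forward size 3072, 512 positions, 12 layers for BERT-base and 6 for DistilBERT) are assumptions taken from the commonly published base, uncased configurations, and the helper `encoder_params` is a hypothetical name used only for this illustration. Task-specific heads are not counted.

```python
def encoder_params(vocab_size, hidden, layers, intermediate,
                   max_positions, type_vocab=0, pooler=False):
    """Count weights and biases of a BERT-style Transformer encoder."""
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and (optional) segment embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: two linear layers, plus LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    total = embeddings + layers * (attention + feed_forward)
    if pooler:
        # Dense layer applied to the first token's final representation.
        total += hidden * hidden + hidden
    return total


# Assumed configurations (base, uncased variants).
bert_base = encoder_params(30522, 768, 12, 3072, 512, type_vocab=2, pooler=True)
distilbert = encoder_params(30522, 768, 6, 3072, 512)

print(f"BERT-base:  {bert_base:,}")    # roughly 110M
print(f"DistilBERT: {distilbert:,}")   # roughly 66M
print(f"ratio:      {distilbert / bert_base:.2f}")
```

Under these assumptions DistilBERT keeps about 60% of BERT-base's parameters; the saving is less than half because the embedding table, which is not reduced, accounts for a sizeable share of the total.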

There are also domain-specific models, such as SciBERT. See also Language models for health records.


How to pretrain transformer models? See also: Implementations.
