BERT
A useful way to understand BERT is from the perspective of Machine translation with the Attention mechanism. The first author Jacob Devlin's talk, BERT and other pretrained language models, explains the model very well.
Rogers2020primer reviews our understanding of how BERT works. Tenney2019BERT showed that BERT rediscovers the classical NLP pipeline.
- Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
- From BERT to ALBERT: Pre-trained Language Models: https://medium.com/@hamdan.hussam/from-bert-to-albert-pre-trained-langaug-models-5865aa5c3762
- Tenney2019BERT
- BERTScore: Evaluating Text Generation with BERT
A common pipeline for many NLP tasks is simply to fine-tune a pre-trained BERT. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (Colab notebook). Mosbach2020stability examines the stability of the fine-tuning process.
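A minimal sketch of such a fine-tuning loop with the Hugging Face transformers library; the bert-base-uncased checkpoint, the toy batch, and the hyperparameters are illustrative placeholders, not taken from the tutorial:

```python
# Minimal fine-tuning sketch (assumes transformers and torch are installed;
# the dataset and hyperparameters are placeholders).
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy batch; in practice this comes from a DataLoader over the task dataset.
texts = ["a great movie", "a terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # BERT is typically fine-tuned for only a few epochs
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # returns a loss when labels are given
    outputs.loss.backward()
    optimizer.step()
```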
Because BERT is a big model, there are attempts to reduce the number of parameters, such as DistilBERT.
There are also domain-specific models, such as SciBERT. See also Language models for health records.
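A quick way to make the size difference concrete is to load the public Hugging Face checkpoints and count parameters (the hub identifiers below are the public names of these models; treat the printed counts as approximate):

```python
# Compare parameter counts of BERT-base, DistilBERT, and SciBERT.
from transformers import AutoModel

for name in ["bert-base-uncased",
             "distilbert-base-uncased",
             "allenai/scibert_scivocab_uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```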
Tutorials
Pre-training
How do we pre-train transformer models? See also Implementations.
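A rough sketch of the masked-language-modelling objective using the transformers Trainer, assuming a toy in-memory corpus; real BERT pre-training also uses next-sentence prediction and a very large corpus:

```python
# Sketch of masked-language-model pre-training (assumption: a tiny toy corpus;
# real pre-training needs a huge corpus and, for original BERT, next-sentence prediction).
from transformers import (BertTokenizerFast, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())  # train from scratch rather than from a checkpoint

corpus = Dataset.from_dict({"text": ["some raw sentences ...", "more raw text ..."]})
tokenized = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                       remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```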
Interpretation
Clark2019what proposed methods to examine where BERT's attention heads attend.
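A simple probe in that spirit (not the paper's actual analysis code): ask transformers to return the per-layer attention matrices and look at where each token attends most.

```python
# Inspect BERT's self-attention weights for one sentence: for a chosen layer and
# head, print the token each input token attends to most strongly.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    attentions = model(**inputs).attentions  # one (batch, heads, seq, seq) tensor per layer

layer, head = 0, 0  # pick any layer/head to inspect
for i, token in enumerate(tokens):
    j = attentions[layer][0, head, i].argmax().item()
    print(f"{token:>8} attends most to {tokens[j]}")
```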
Variants
Transformer model < > ALBERT | T5