BERT
A useful way to understand BERT is from the perspective of Machine translation with the Attention mechanism. The first author Jacob Devlin's talk, BERT and other pretrained language models, explains the model very well.
Rogers2020primer reviews our understanding of how BERT works. Tenney2019BERT showed that BERT rediscovers the classical NLP pipeline.
- Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
- From BERT to ALBERT: Pre-trained Language Models: https://medium.com/@hamdan.hussam/from-bert-to-albert-pre-trained-langaug-models-5865aa5c3762
- Tenney2019BERT
- BERTScore: Evaluating Text Generation with BERT
A common pipeline for many NLP tasks is simply to fine-tune a pre-trained BERT. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (Colab notebook). Mosbach2020stability examines the stability of the fine-tuning process.
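A minimal sketch of such a fine-tuning loop with the Hugging Face transformers library; the bert-base-uncased checkpoint, the toy batch, and the hyperparameters are illustrative placeholders, not taken from the tutorial:

```python
# Minimal fine-tuning sketch (assumes transformers and torch are installed;
# the dataset and hyperparameters are placeholders).
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy batch; in practice this comes from a DataLoader over the task dataset.
texts = ["a great movie", "a terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # BERT is typically fine-tuned for only a few epochs
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # returns a loss when labels are given
    outputs.loss.backward()
    optimizer.step()
```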
Because BERT is a big model, there are attempts to reduce the number of parameters, such as DistilBERT.
There are also domain-specific models, such as SciBERT. See also Language models for health records.
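A quick way to make the size difference concrete is to load the public Hugging Face checkpoints and count parameters (the hub identifiers below are the public names of these models; treat the printed counts as approximate):

```python
# Compare parameter counts of BERT-base, DistilBERT, and SciBERT.
from transformers import AutoModel

for name in ["bert-base-uncased",
             "distilbert-base-uncased",
             "allenai/scibert_scivocab_uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```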
Tutorials
Pre-training
How do we pre-train transformer models? See also Implementations.
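A rough sketch of the masked-language-modelling objective using the transformers Trainer, assuming a toy in-memory corpus; real BERT pre-training also uses next-sentence prediction and a very large corpus:

```python
# Sketch of masked-language-model pre-training (assumption: a tiny toy corpus;
# real pre-training needs a huge corpus and, for original BERT, next-sentence prediction).
from transformers import (BertTokenizerFast, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())  # train from scratch rather than from a checkpoint

corpus = Dataset.from_dict({"text": ["some raw sentences ...", "more raw text ..."]})
tokenized = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                       remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```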
Interpretation
Clark2019what proposed methods to examine where BERT's attention heads attend.
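A simple probe in that spirit (not the paper's actual analysis code): ask transformers to return the per-layer attention matrices and look at where each token attends most.

```python
# Inspect BERT's self-attention weights for one sentence: for a chosen layer and
# head, print the token each input token attends to most strongly.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    attentions = model(**inputs).attentions  # one (batch, heads, seq, seq) tensor per layer

layer, head = 0, 0  # pick any layer/head to inspect
for i, token in enumerate(tokens):
    j = attentions[layer][0, head, i].argmax().item()
    print(f"{token:>8} attends most to {tokens[j]}")
```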
Variants
Transformer model < > ALBERT | T5