A useful way to understand BERT is probably from the perspective of machine translation with the attention mechanism. First author Jacob Devlin's talk, BERT and other pretrained language models, explains the model very well.
- Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
- From BERT to ALBERT: Pre-trained Language Models: https://firstname.lastname@example.org/from-bert-to-albert-pre-trained-langaug-models-5865aa5c3762
- BERT Rediscovers the Classical NLP Pipeline
- BERTScore: Evaluating Text Generation with BERT
A common pipeline for many NLP tasks is simply to fine-tune a pre-trained BERT model. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (colab notebook).
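The fine-tuning loop itself is short. The following is a minimal sketch, not the tutorial's code, and it assumes the Hugging Face `transformers` library and PyTorch are installed. To keep it offline and fast, it uses a hypothetical tiny, randomly initialized configuration and a made-up toy batch. Real fine-tuning would load pretrained weights instead, as noted in the comment.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Hypothetical tiny configuration so the sketch runs without downloading weights.
# For real fine-tuning, load a pretrained checkpoint instead, e.g.
#   BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=16,
    num_labels=2,
)
model = BertForSequenceClassification(config)

# Toy batch (made up for illustration): 4 sequences of 8 token ids, binary labels.
input_ids = torch.randint(1, config.vocab_size, (4, 8))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

model.train()
losses = []
for step in range(20):
    optimizer.zero_grad()
    # Passing labels makes the model return a classification loss alongside the logits.
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    losses.append(outputs.loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With a real checkpoint, the only structural changes are loading the model with `from_pretrained`, tokenizing actual text with the matching tokenizer, and iterating over a dataset rather than a single fixed batch.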
Because BERT is a large model, there have been attempts to reduce its number of parameters, such as DistilBERT.
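To get a feel for where the parameters sit, here is a rough back-of-the-envelope count in plain Python. The default sizes follow the published BERT-base configuration. The `EncoderConfig` class and the helper functions are hypothetical names for this sketch, and the "DistilBERT-like" variant is an assumption: it simply halves the layers and drops the segment embeddings and the pooler.

```python
from dataclasses import dataclass


@dataclass
class EncoderConfig:
    """Hypothetical container for the sizes that determine the parameter count."""
    vocab_size: int = 30522
    hidden: int = 768
    layers: int = 12
    intermediate: int = 3072
    max_positions: int = 512
    type_vocab: int = 2   # segment (token-type) embeddings; 0 means none
    pooler: bool = True


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_parameters(cfg: EncoderConfig) -> int:
    h = cfg.hidden
    # Token, position and segment embedding tables, followed by a LayerNorm.
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab) * h + layer_norm(h)
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(h, h) + layer_norm(h)
    # Two-layer feed-forward block, followed by a LayerNorm.
    feed_forward = linear(h, cfg.intermediate) + linear(cfg.intermediate, h) + layer_norm(h)
    pooler = linear(h, h) if cfg.pooler else 0
    return embeddings + cfg.layers * (attention + feed_forward) + pooler


bert_params = count_parameters(EncoderConfig())
# Assumed DistilBERT-like layout: 6 layers, no segment embeddings, no pooler.
distil_params = count_parameters(EncoderConfig(layers=6, type_vocab=0, pooler=False))

print(f"BERT-base-like:  {bert_params:,}")
print(f"DistilBERT-like: {distil_params:,}")
print(f"reduction:       {1 - distil_params / bert_params:.0%}")
```

Under these assumptions the counts come out at roughly 109M and 66M parameters. The embedding tables are untouched by halving the depth, which is why removing half the layers shrinks the model by less than half.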
There are also domain-specific models, such as SciBERT. See also Language models for health records.
How to pretrain transformer models? See also Implementations