A useful way to understand BERT is probably from the perspective of machine translation. With the attention mechanism
- Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
- From BERT to ALBERT: Pre-trained Language Models: https://email@example.com/from-bert-to-albert-pre-trained-langaug-models-5865aa5c3762
- BERT Rediscovers the Classical NLP Pipeline
- BERTScore: Evaluating Text Generation with BERT
A common pipeline for many NLP tasks is simply to fine-tune the pre-trained BERT model. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (Colab notebook).
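As a rough illustration of what fine-tuning means in practice, here is a minimal sketch of a single training step for sequence classification. It assumes PyTorch and the Hugging Face `transformers` library, neither of which is named in the text above. The helper names `build_model` and `fine_tune_step`, the `bert-base-uncased` checkpoint, and the tiny fallback configuration are illustrative choices, not taken from the tutorial.

```python
import torch
from torch.optim import AdamW
from transformers import BertConfig, BertForSequenceClassification


def build_model(pretrained=True, num_labels=2):
    """Return a BERT classifier (hypothetical helper for this sketch).

    pretrained=True downloads the public "bert-base-uncased" weights.
    pretrained=False builds a tiny, randomly initialised model that is only
    useful for checking that the training code runs.
    """
    if pretrained:
        return BertForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=num_labels
        )
    config = BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=64,
        num_labels=num_labels,
    )
    return BertForSequenceClassification(config)


def fine_tune_step(model, optimizer, input_ids, attention_mask, labels):
    """Run one gradient update on a batch and return the loss as a float."""
    model.train()
    optimizer.zero_grad()
    # Passing labels makes the model compute the classification loss itself.
    outputs = model(
        input_ids=input_ids, attention_mask=attention_mask, labels=labels
    )
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()
```

A real run would call `build_model(pretrained=True)`, create an optimizer such as `AdamW(model.parameters(), lr=2e-5)` (a small learning rate is a common choice, assumed here), tokenize the task data with the matching tokenizer, and loop `fine_tune_step` over the batches for a few epochs.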
Because BERT is a big model, there have been attempts to reduce its number of parameters, such as DistilBERT.
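To get a feel for the sizes involved, the parameter count can be worked out from the architecture alone. The sketch below assumes the standard `bert-base-uncased` configuration (hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions, 12 layers). It also assumes that DistilBERT keeps the same layer shape but uses 6 layers and drops the token-type embeddings and the pooler. None of these numbers appear in the text above, and the function names are hypothetical. Task-specific heads are not counted.

```python
def encoder_layer_params(hidden, intermediate):
    """Parameters in one Transformer encoder layer (weights plus biases)."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (
        intermediate * hidden + hidden
    )
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, each with scale and bias
    return attention + feed_forward + layer_norms


def encoder_params(layers, hidden=768, intermediate=3072, vocab=30522,
                   max_positions=512, type_vocab=2, pooler=True):
    """Parameters in the embeddings, the encoder stack and the optional pooler."""
    embeddings = (
        vocab * hidden            # token embeddings
        + max_positions * hidden  # position embeddings
        + type_vocab * hidden     # segment (token-type) embeddings
        + 2 * hidden              # embedding LayerNorm
    )
    total = embeddings + layers * encoder_layer_params(hidden, intermediate)
    if pooler:
        total += hidden * hidden + hidden
    return total


bert_base = encoder_params(layers=12)
distilbert = encoder_params(layers=6, type_vocab=0, pooler=False)

print(f"BERT-base:  {bert_base:,}")   # BERT-base:  109,482,240
print(f"DistilBERT: {distilbert:,}")  # DistilBERT: 66,362,880
print(f"ratio:      {distilbert / bert_base:.2f}")  # ratio:      0.61
```

Under these assumptions, halving the number of layers removes about 40% of the parameters rather than 50%, because the token embedding matrix (about 23 million parameters) is kept unchanged.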
There are also domain-specific models, such as SciBERT. See also Language models for health records.