A useful way to understand BERT is probably from the perspective of machine translation. With the Attention mechanism

A common pipeline for many NLP tasks is simply to fine-tune a pre-trained BERT model. Here is an excellent tutorial: BERT Fine-Tuning Tutorial with PyTorch (Colab notebook).
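As a rough illustration of what such a fine-tuning pipeline can look like, here is a minimal sketch for text classification. It is not taken from the tutorial above. It assumes the Hugging Face `transformers` library and PyTorch are installed, and the function name `fine_tune_bert`, the checkpoint `bert-base-uncased`, and the hyperparameter defaults are illustrative choices rather than recommendations.

```python
def fine_tune_bert(texts, labels, model_name="bert-base-uncased",
                   num_labels=2, epochs=3, lr=2e-5, batch_size=16):
    """Fine-tune a pre-trained BERT checkpoint on (texts, labels).

    `texts` is a list of strings and `labels` a list of integer class ids.
    Hypothetical helper for illustration; defaults are assumptions.
    """
    # Imported here so the definition loads even if the libraries are missing.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Load the pre-trained tokenizer and encoder; a fresh, randomly
    # initialized classification head is placed on top of the encoder.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )

    # Tokenize all examples at once, padding to the longest sequence.
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # All weights (encoder and head) are updated, with a small learning rate.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, batch_labels in loader:
            optimizer.zero_grad()
            # Passing `labels` makes the model return the cross-entropy loss.
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=batch_labels)
            out.loss.backward()
            optimizer.step()
    return tokenizer, model
```

A call such as `fine_tune_bert(["great movie", "terrible plot"], [1, 0])` would download the checkpoint and train on the two toy examples. A real run would add a validation split, a learning-rate schedule, and GPU placement.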

Because BERT is a large model, there have been attempts to reduce its number of parameters, such as DistilBERT.
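To give a sense of the sizes involved, the following sketch counts encoder parameters from the architecture alone. The configuration values are assumptions based on the published `bert-base-uncased` and `distilbert-base-uncased` checkpoints (hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions; 12 layers for BERT versus 6 for DistilBERT, which also drops the token-type embeddings and the pooler). Task-specific heads are not counted, and the function names are illustrative.

```python
def transformer_layer_params(hidden, intermediate):
    """Parameters in one encoder layer (weights and biases)."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden)      # two dense layers
    layer_norms = 2 * (2 * hidden)              # scale and shift, twice per layer
    return attention + feed_forward + layer_norms


def encoder_params(layers, hidden=768, intermediate=3072, vocab=30522,
                   positions=512, token_types=2, pooler=True):
    """Total parameters of a BERT-style encoder with the given shape."""
    embeddings = (vocab + positions + token_types) * hidden \
        + 2 * hidden                            # plus the embedding LayerNorm
    total = embeddings + layers * transformer_layer_params(hidden, intermediate)
    if pooler:
        total += hidden * hidden + hidden       # dense layer over the [CLS] vector
    return total


bert_base = encoder_params(layers=12)
distilbert = encoder_params(layers=6, token_types=0, pooler=False)

print(f"BERT-base:  {bert_base:,}")    # 109,482,240
print(f"DistilBERT: {distilbert:,}")   # 66,362,880
print(f"ratio:      {distilbert / bert_base:.2f}")
```

Under these assumptions DistilBERT has roughly 60% of BERT-base's parameters. Most of the saving comes from halving the number of layers, while the embedding matrix (about 23M parameters) is unchanged.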

There are also domain-specific models, such as SciBERT. See also Language models for health records.