Transformer model
The core architecture behind LLMs. Its central building block is the Attention mechanism; a minimal code sketch follows the links below.
- http://jalammar.github.io/illustrated-transformer/
- https://www.youtube.com/watch?v=-QH8fRhqFHM : GPT-style models are decoder-only (generation); BERT-style models are encoder-only (representation and understanding); the original encoder-decoder stack suits sequence-to-sequence tasks such as translation.
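A minimal sketch of the operation these references center on, scaled dot-product attention (softmax(QK^T / sqrt(d_k)) V from "Attention Is All You Need"), in PyTorch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise query-key similarity
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                             # weighted average of the values

x = torch.randn(1, 5, 64)  # self-attention: queries, keys, values all come from x
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 5, 64])
```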
Google’s T5 paper (“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”) provides a unified framework to understand and train transformer models by casting every task as text-to-text.
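A toy illustration of the three model families mentioned above, assuming the standard Hugging Face hub checkpoints (gpt2, bert-base-uncased, t5-small):

```python
from transformers import pipeline

# Decoder-only (GPT-2): autoregressive text generation.
gen = pipeline("text-generation", model="gpt2")
print(gen("The transformer", max_new_tokens=20)[0]["generated_text"])

# Encoder-only (BERT): masked-token prediction, i.e. learned representations.
mask = pipeline("fill-mask", model="bert-base-uncased")
print(mask("Transformers use [MASK] to relate tokens.")[0]["token_str"])

# Encoder-decoder (T5): every task framed as text-to-text, e.g. translation.
t5 = pipeline("translation_en_to_de", model="t5-small")
print(t5("Attention is all you need.")[0]["translation_text"])
```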
Tutorials and reviews
- A walkthrough of transformer architecture code by Mark Riedl
- Transformers from scratch
- “Attention”, “Transformers”, in Neural Network “Large Language Models” by Cosma Shalizi
- Understanding Encoder And Decoder LLMs by Sebastian Raschka
- The Illustrated Transformer by Jay Alammar
- The Attention Mechanism in Large Language Models by Luis Serrano
Implementations
https://huggingface.co/blog/how-to-train shows how to train a transformer model from scratch. See also How to pretrain transformer models, or A complete Hugging Face tutorial: how to build and train a vision transformer. A minimal pretraining sketch follows the list below.
- gpt-fast: https://github.com/pytorch-labs/gpt-fast (simple, PyTorch-native, efficient transformer text generation)
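A minimal from-scratch pretraining sketch in the spirit of those tutorials, not their exact code; the tiny model configuration, dataset, and hyperparameters here are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reuse an existing tokenizer
tokenizer.pad_token = tokenizer.eos_token              # GPT-2 has no pad token

config = GPT2Config(n_layer=4, n_head=4, n_embd=256)   # tiny, randomly initialized
model = GPT2LMHeadModel(config)

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train = raw.map(tokenize, batched=True, remove_columns=["text"])
train = train.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tiny-gpt2",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM
)
trainer.train()
```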
Applications
Transformers are used in areas beyond Language models, including Computer vision (Vision transformer) and Reinforcement learning (Decision transformer).
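As a toy illustration of the vision use case, a minimal sketch running a pretrained Vision Transformer through the Hugging Face pipeline API (the checkpoint name is the standard hub one; the image path is hypothetical):

```python
from transformers import pipeline

# Image classification with a pretrained Vision Transformer checkpoint.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(classifier("cat.jpg")[0])  # "cat.jpg" is a hypothetical local image path
```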