GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Grouped-query attention

Second, we propose grouped-query attention (GQA), an interpolation between multi-head and multi-query attention with single key and value heads per subgroup of query heads. We show that uptrained GQA achieves quality close to multi-head attention while being almost as fast as multi-query attention.
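To make the grouping concrete, the following minimal NumPy sketch (an illustration under assumed shapes, not the paper's implementation; the function name and the num_groups parameter are ours for exposition) computes attention in which each subgroup of query heads shares a single key head and a single value head.

    # Minimal sketch of grouped-query attention (illustrative, not the paper's code).
    import numpy as np

    def grouped_query_attention(q, k, v, num_groups):
        """q: (num_q_heads, seq, d); k, v: (num_groups, seq, d).
        Each subgroup of query heads shares one key head and one value head."""
        num_q_heads, seq_len, d = q.shape
        heads_per_group = num_q_heads // num_groups
        outputs = []
        for h in range(num_q_heads):
            g = h // heads_per_group                 # group index for this query head
            scores = q[h] @ k[g].T / np.sqrt(d)      # attend against the shared key head
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            outputs.append(weights @ v[g])           # read from the shared value head
        return np.stack(outputs)                     # (num_q_heads, seq, d)

    # Example: 8 query heads grouped into 2 key-value groups (shapes are assumptions).
    q = np.random.randn(8, 16, 64)
    k = np.random.randn(2, 16, 64)
    v = np.random.randn(2, 16, 64)
    out = grouped_query_attention(q, k, v, num_groups=2)
    print(out.shape)                                 # (8, 16, 64)

Setting num_groups equal to the number of query heads recovers multi-head attention, while a single group recovers multi-query attention; intermediate group counts give the interpolation described above.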