GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Grouped-query attention

Second, we propose grouped-query attention (GQA), an interpolation between multi-head and multi-query attention with single key and value heads per subgroup of query heads. We show that uptrained GQA achieves quality close to multi-head attention while being almost as fast as multi-query attention.
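To make the grouping concrete, the following minimal NumPy sketch (an illustration under assumed shapes, not the paper's implementation; the function name and the num_groups parameter are ours for exposition) computes attention in which each subgroup of query heads shares a single key head and a single value head.

    # Minimal sketch of grouped-query attention (illustrative, not the paper's code).
    import numpy as np

    def grouped_query_attention(q, k, v, num_groups):
        """q: (num_q_heads, seq, d); k, v: (num_groups, seq, d).
        Each subgroup of query heads shares one key head and one value head."""
        num_q_heads, seq_len, d = q.shape
        heads_per_group = num_q_heads // num_groups
        outputs = []
        for h in range(num_q_heads):
            g = h // heads_per_group                 # group index for this query head
            scores = q[h] @ k[g].T / np.sqrt(d)      # attend against the shared key head
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            outputs.append(weights @ v[g])           # read from the shared value head
        return np.stack(outputs)                     # (num_q_heads, seq, d)

    # Example: 8 query heads grouped into 2 key-value groups (shapes are assumptions).
    q = np.random.randn(8, 16, 64)
    k = np.random.randn(2, 16, 64)
    v = np.random.randn(2, 16, 64)
    out = grouped_query_attention(q, k, v, num_groups=2)
    print(out.shape)                                 # (8, 16, 64)

Setting num_groups equal to the number of query heads recovers multi-head attention, while a single group recovers multi-query attention; intermediate group counts give the interpolation described above.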