Grouped-query attention

Attention mechanism used in LLaMA 2 (70B) and LLaMA 3.

An improvement on multi-query attention, proposed in Ainslie2023gqa.

It is an interpolation between multi-head and multi-query attention: the query heads are divided into groups, and each group shares a single key/value head. With one group it reduces to multi-query attention; with as many groups as query heads it reduces to multi-head attention.
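A minimal NumPy sketch of the idea (function name and tensor shapes are my own choices, not from the paper): each query head indexes into the key/value head of its group.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_groups):
    """q: (n_q_heads, seq, d); k, v: (n_groups, seq, d).

    Each contiguous block of query heads shares one K/V head.
    """
    n_q_heads, seq, d = q.shape
    heads_per_group = n_q_heads // n_groups
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // heads_per_group  # K/V head shared by this query head's group
        scores = q[h] @ k[g].T / np.sqrt(d)
        # softmax over keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[g]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
kv = rng.standard_normal((2, 4, 16))  # 2 K/V heads (groups)
out = grouped_query_attention(q, kv, kv, n_groups=2)
```

Setting `n_groups=1` gives multi-query attention, and `n_groups=n_q_heads` gives standard multi-head attention, while the K/V cache shrinks by a factor of `n_q_heads / n_groups`.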