FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- https://arxiv.org/abs/2205.14135
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Attention mechanism, FlashAttention
The time and memory complexity of self-attention are quadratic in the sequence length.
FlashAttention is an IO-aware, exact self-attention algorithm. It minimizes reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM (SRAM bandwidth >> HBM bandwidth >> CPU DRAM bandwidth).
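The core idea is tiling plus an online softmax: process K/V in blocks small enough to fit in SRAM, and maintain running row-wise max and denominator statistics so the full N x N score matrix is never materialized in HBM. Below is a minimal NumPy sketch of that recurrence, not the paper's CUDA kernel: it is simplified to a single head with no masking, tiles only K/V while keeping Q resident, and the function name and block size are illustrative assumptions.

```python
# Minimal sketch of the tiling + online-softmax idea behind FlashAttention.
# Assumptions: single head, no causal mask, K/V tiled but Q kept resident.
import numpy as np

def flash_attention_sketch(Q, K, V, block_size=64):
    """Exact attention softmax(Q K^T / sqrt(d)) V, computed block-by-block
    over K/V so the (N, N) score matrix is never formed."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    O = np.zeros((N, d))      # running (unnormalized) output
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]  # block loaded "into SRAM"
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                # (N, B) scores for this block
        m_new = np.maximum(m, S.max(axis=1))  # updated row maxima
        P = np.exp(S - m_new[:, None])        # block softmax numerator

        correction = np.exp(m - m_new)        # rescale earlier partial results
        l = correction * l + P.sum(axis=1)
        O = correction[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]

# Sanity check against a naive reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_sketch(Q, K, V), ref)
```

Because the rescaling factor exp(m - m_new) corrects all previously accumulated partial sums whenever a new block raises the row maximum, the result is exact attention (as the assert verifies), while memory traffic per block stays proportional to the block size rather than the full sequence length.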