FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Attention mechanism, FlashAttention

The time and memory complexity of standard self-attention are quadratic in the sequence length, because the full matrix of attention scores between all query-key pairs is materialized.
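A minimal sketch (assumed for illustration, not taken from the paper) of standard attention in NumPy, showing where the quadratic cost comes from: the score matrix S has shape (N, N), so doubling the sequence length N quadruples its size.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Q, K, V: (N, d) arrays for one attention head
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                    # (N, N) score matrix -> O(N^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # row-wise softmax
    return P @ V                                # (N, d) output

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = standard_attention(Q, K, V)
```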

FlashAttention is an IO-aware, exact self-attention algorithm. It minimizes the reads and writes between GPU high-bandwidth memory (HBM) and on-chip GPU SRAM (SRAM speed >> HBM speed >> CPU DRAM speed).
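A hedged sketch of the tiling idea behind this (NumPy, not the actual CUDA kernel): keys and values are streamed in blocks, and running softmax statistics are maintained so the full (N, N) score matrix is never written to slow memory. The block size Bc is a hypothetical parameter standing in for whatever fits in on-chip SRAM.

```python
import numpy as np

def tiled_attention(Q, K, V, Bc=128):
    # Blockwise attention with an online softmax; numerically equal to
    # standard attention but never materializes the (N, N) score matrix.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, N, Bc):        # stream over K/V blocks ("SRAM-sized" tiles)
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        S = Q @ Kj.T * scale                         # (N, Bc) block of scores
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])               # unnormalized block probabilities
        correction = np.exp(m - m_new)               # rescale earlier partial results
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]                            # final normalization

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), standard_attention(Q, V=V, K=K), atol=1e-6)
```

The check against standard_attention illustrates that the tiled version is exact (not an approximation); the savings come purely from avoiding reads and writes of the full score matrix.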