FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Attention mechanism, FlashAttention

The time and memory complexity of standard self-attention are quadratic in the sequence length, because the full matrix of attention scores between all query-key pairs is materialized.
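A minimal sketch (assumed for illustration, not taken from the paper) of standard attention in NumPy, showing where the quadratic cost comes from: the score matrix S has shape (N, N), so doubling the sequence length N quadruples its size.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Q, K, V: (N, d) arrays for one attention head
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                    # (N, N) score matrix -> O(N^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # row-wise softmax
    return P @ V                                # (N, d) output

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = standard_attention(Q, K, V)
```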

FlashAttention is an IO-aware, exact self-attention algorithm. It minimizes the reads and writes between GPU high-bandwidth memory (HBM) and on-chip GPU SRAM (SRAM speed >> HBM speed >> CPU DRAM speed).
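A hedged sketch of the tiling idea behind this (NumPy, not the actual CUDA kernel): keys and values are streamed in blocks, and running softmax statistics are maintained so the full (N, N) score matrix is never written to slow memory. The block size Bc is a hypothetical parameter standing in for whatever fits in on-chip SRAM.

```python
import numpy as np

def tiled_attention(Q, K, V, Bc=128):
    # Blockwise attention with an online softmax; numerically equal to
    # standard attention but never materializes the (N, N) score matrix.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, N, Bc):        # stream over K/V blocks ("SRAM-sized" tiles)
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        S = Q @ Kj.T * scale                         # (N, Bc) block of scores
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])               # unnormalized block probabilities
        correction = np.exp(m - m_new)               # rescale earlier partial results
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]                            # final normalization

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), standard_attention(Q, V=V, K=K), atol=1e-6)
```

The check against standard_attention illustrates that the tiled version is exact (not an approximation); the savings come purely from avoiding reads and writes of the full score matrix.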