Transformers, parallel computation, and logarithmic depth
- https://arxiv.org/abs/2402.09268
- Clayton Sanford, Daniel Hsu, Matus Telgarsky
A constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC). As a consequence, logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations.
Transformer layers ~ MPC communication rounds: constant depth corresponds to a constant number of rounds, so log-depth transformers inherit the power of log-round parallel algorithms.
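The paper's log-depth claim is exercised on hop/composition-style tasks; the intuition is the classic parallel "pointer doubling" trick, where each round composes the current mapping with itself, so k sequential hops collapse into about log2(k) all-positions-at-once rounds. Below is a minimal Python sketch of that trick (my own illustration, not the paper's transformer construction; function names are made up) contrasted with the naive k-step baseline.

```python
import math
import random


def k_hop_parallel(pointer: list[int], k: int) -> list[int]:
    """For every position i, return the node reached after k pointer hops,
    via pointer doubling: O(log k) rounds, each updating all positions
    "in parallel" (simulated here with list comprehensions)."""
    hop = pointer[:]                     # current power of the map, starts at f^1
    result = list(range(len(pointer)))   # identity map: 0 hops applied so far
    remaining = k
    while remaining > 0:
        if remaining & 1:
            result = [hop[r] for r in result]  # fold in the current power of f
        hop = [hop[h] for h in hop]            # square the map: doubles the hop count
        remaining >>= 1
    return result


def k_hop_sequential(pointer: list[int], k: int) -> list[int]:
    """Naive baseline: k sequential hops from each position."""
    out = []
    for i in range(len(pointer)):
        cur = i
        for _ in range(k):
            cur = pointer[cur]
        out.append(cur)
    return out


if __name__ == "__main__":
    random.seed(0)
    n, k = 32, 13
    ptr = [random.randrange(n) for _ in range(n)]
    assert k_hop_parallel(ptr, k) == k_hop_sequential(ptr, k)
    print(f"k={k} hops resolved in {k.bit_length()} doubling rounds "
          f"instead of {k} sequential steps")
```

Each doubling round is the rough analogue of one MPC communication round (and, per the paper's simulation result, of a constant number of attention layers), which is where the "logarithmic depth suffices" reading of the abstract comes from.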