Scaling Laws for Neural Language Models

Test loss follows a remarkably clean power law in dataset size, model size, and compute. There is no apparent saturation: the loss keeps decreasing as each of these is scaled up.
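
A minimal sketch of what this relationship looks like for model size, assuming the power-law form L(N) = (N_c / N)^{alpha_N}; the constants `N_C` and `ALPHA_N` below are illustrative placeholders, not fitted values, and the same form applies to dataset size and compute.

```python
import numpy as np

# Power-law scaling of test loss with model size N.
# Constants are placeholders chosen for illustration only.
N_C = 8.8e13      # assumed "critical" parameter count (placeholder)
ALPHA_N = 0.076   # assumed scaling exponent (placeholder)

def loss_from_params(n_params: float) -> float:
    """Predicted test loss L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N

# The predicted loss keeps decreasing as N grows -- no saturation in this form.
for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"N = {n:.0e}  ->  predicted loss ~ {loss_from_params(n):.3f}")
```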

