Utility is in the Eye of the User: A Critique of NLP Leaderboards
https://arxiv.org/abs/2009.13888
Kawin Ethayarajh, Dan Jurafsky
Model evaluation, Human evaluation
See also Ethayarajh2022authenticity