Often, once a de facto evaluation standard emerges, little attention is paid to the evaluation tasks themselves, even though they need to be critically examined.
Another common issue is ignoring the uncertainty in evaluation results: a difference between point estimates from a handful of runs may not be meaningful. Agarwal2021deep makes this point about the evaluation of reinforcement learning.
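One simple way to make that uncertainty visible is to report an interval rather than a single number. The sketch below computes a percentile bootstrap confidence interval for the mean score over a few runs. It is a minimal illustration, not necessarily the procedure recommended by Agarwal2021deep; the function name `bootstrap_ci`, its parameters, and the scores in `run_scores` are hypothetical and made up for the example.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Hypothetical helper for illustration; returns (mean, lower, upper).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the empirical distribution of resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Made-up scores from five runs (e.g., five random seeds) of one model.
run_scores = [0.71, 0.64, 0.80, 0.69, 0.75]
mean, lower, upper = bootstrap_ci(run_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only a few runs the interval is typically wide, which is the point: if the intervals of two models overlap heavily, a ranking based on their means alone is weakly supported.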
Evaluations often leave out information that matters to practitioners (e.g., the cost of the model or its latency). As a result, a benchmark can reward models that score well but are impractical to deploy, so the incentives it creates are misaligned with what users need. See, for instance, Ethayarajh2020utility.
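As a rough illustration of reporting cost next to quality, the sketch below measures accuracy and mean per-example latency in the same evaluation loop. The function `evaluate_with_cost`, the toy model, and the data are all hypothetical; wall-clock latency on one machine is only one of several cost measures a practitioner might care about.

```python
import time


def evaluate_with_cost(predict, inputs, labels):
    """Return accuracy together with mean per-example prediction latency.

    Hypothetical helper for illustration; `predict` is any callable
    that maps one input to one predicted label.
    """
    if len(inputs) != len(labels) or not labels:
        raise ValueError("inputs and labels must be non-empty and equal in length")
    correct = 0
    total_seconds = 0.0
    for x, y in zip(inputs, labels):
        start = time.perf_counter()
        prediction = predict(x)  # time only the model call itself
        total_seconds += time.perf_counter() - start
        correct += int(prediction == y)
    n = len(labels)
    return {"accuracy": correct / n, "mean_latency_s": total_seconds / n}


# Toy stand-in for a model: predicts the parity of an integer.
def toy_model(x):
    return x % 2


report = evaluate_with_cost(toy_model, [1, 2, 3, 4], [1, 0, 1, 1])
print(report)
```

Reporting both numbers in one record makes it harder to compare models on accuracy alone while silently ignoring what they cost to run.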