Testing theory of mind in large language models and humans

Theory of mind, LLMs

The authors ran a battery of theory of mind tests on GPT-4, GPT-3.5, and LLaMA2-70B-Chat and compared the results with human performance. GPT-4 performed better than humans on most tasks, although it tended to fail the faux pas test, where LLaMA2 performed better than humans.