Testing theory of mind in large language models and humans

https://www.nature.com/articles/s41562-024-01882-z
James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano & Cristina Becchio

The authors ran a battery of theory-of-mind tests on GPT-4, GPT-3.5, and LLaMA2-70B-Chat and compared the results with human performance. GPT-4 performed at or above human level on most tasks but struggled with the faux pas test, the one test on which LLaMA2 outperformed humans.