Benchmarking Large Language Models As AI Research Agents https://arxiv.org/abs/2310.03302 Qian Huang, Jian Vora, Percy Liang, Jure Leskovec LLMs, Capacity