Benchmarking Large Language Models As AI Research Agents

LLMs, Capacity