Large language models are increasingly used for diverse tasks, yet we have limited insight into their understanding of chemistry. Now ChemBench—a benchmarking framework containing more than 2,700 question–answer pairs—has been developed to assess their chemical knowledge and reasoning, revealing that the best models surpass human chemists on average but struggle with some basic tasks.
- Adrian Mirza
- Nawaf Alampara
- Kevin Maik Jablonka