Multimodal scientific reasoning remains a significant challenge for large language models in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, the authors systematically evaluate 40 proprietary and open-source multimodal large language models on a curated benchmark of Olympiad-style chemistry questions, revealing that some models struggle with vision-language fusion, while chain-of-thought prompting can enhance both accuracy and visual grounding.
- Yiming Cui
- Xin Yao
- Guoping Hu