Abstract
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, their performance across diverse scientific domains remains underexplored: existing benchmarks focus primarily on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assessing the scientific context-understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, and integrates diverse data modalities, including structured tables, knowledge graphs, and unstructured text. Through a variety of question formats, SciCUEval systematically evaluates four core competencies: relevant-information identification, information-absence detection, multi-source information integration, and context-aware inference. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, provide a fine-grained analysis of their strengths and limitations in scientific context understanding, and offer valuable insights for the future development of scientific-domain LLMs.
Data availability
The complete SciCUEval dataset used in this study has been deposited in figshare (https://doi.org/10.6084/m9.figshare.29924687)27.
Code availability
The SciCUEval evaluation scripts for this study have been uploaded to GitHub (https://github.com/HICAI-ZJU/SciCUEval).
References
Bai, J. et al. Qwen technical report. arXiv:2309.16609 (2023).
OpenAI et al. GPT-4o system card. arXiv:2410.21276 (2024).
Dubey, A. et al. The Llama 3 herd of models. arXiv:2407.21783 (2024).
Qwen et al. Qwen2.5 technical report. arXiv:2412.15115 (2025).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. arXiv:1903.10676 (2019).
Mann, B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
Mialon, G. et al. Augmented language models: a survey. arXiv:2302.07842 (2023).
Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).
Liu, J. et al. RepoQA: evaluating long context code understanding. arXiv:2406.06025 (2024).
Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs struggle with long in-context learning. arXiv:2404.02060 (2024).
Bai, Y. et al. LongBench: a bilingual, multitask benchmark for long context understanding. arXiv:2308.14508 (2023).
Bai, Y. et al. LongBench v2: towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204 (2024).
Wellawatte, G. P. et al. ChemLit-QA: a human-evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 6, 020601 (2025).
Zhong, X. et al. Benchmarking retrieval-augmented generation for chemistry. arXiv:2505.07671 (2025).
Fang, X. et al. Large language models (LLMs) on tabular data: prediction, generation, and understanding - a survey. arXiv:2402.17944 (2024).
He, X. et al. G-Retriever: retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems 37, 132876–132907 (2024).
Talmor, A. & Berant, J. The Web as a knowledge-base for answering complex questions. arXiv:1803.06643 (2018).
International Atomic Energy Agency (IAEA). Nuclear Data Services - ENSDF Query Form. Available at: https://www-nds.iaea.org/relnsd/NdsEnsdf/QueryForm.html.
The Materials Project. Available at: https://next-gen.materialsproject.org.
National Center for Biotechnology Information (NCBI). Available at: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72.
Gene Ontology Consortium. Available at: https://geneontology.org/docs/download-ontology/.
Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research gkw985 (2016).
Zheng, S. et al. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics 22, bbaa344 (2021).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Scientific Data 10, 67 (2023).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019).
Ding, K. & Tang, Y. SciCUEval: a comprehensive dataset for evaluating scientific context understanding in large language models. figshare https://doi.org/10.6084/m9.figshare.29924687 (2025).
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 model card (2024).
DeepSeek-AI et al. DeepSeek-V3 technical report. arXiv:2412.19437 (2024).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).
Yang, A. et al. Qwen3 technical report. arXiv:2505.09388 (2025).
Meta. Llama 4: leading intelligence. Available at: https://www.llama.com/models/llama-4/ (accessed 19 May 2025).
Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).
Team GLM et al. ChatGLM: a family of large language models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024).
Gemma Team et al. Gemma 2: improving open language models at a practical size. arXiv:2408.00118 (2024).
Zhang, D. et al. SciGLM: training scientific language models with self-reflective instruction annotation and tuning. arXiv:2401.07950 (2024).
Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. LlaSMol: advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv:2402.09391 (2024).
Zhang, D. et al. ChemLLM: a chemical large language model. arXiv:2402.06852 (2024).
Zhao, Z. et al. ChemDFM: dialogue foundation model for chemistry. arXiv:2401.14818 (2024).
Chen, J., Lin, H., Han, X. & Sun, L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 17754–17762 (2024).
Acknowledgements
This work is funded by the New Generation Artificial Intelligence - National Science and Technology Major Project (2025ZD0122801, H.C.), NSFC grants 62301480 (K.D.), 62302433 (Q.Z.), and U23A20496 (Q.Z.), and the Ant Group Research Fund (K.D.). The AI-driven experiments, simulations, and model training were performed on the robotic AI-Scientist platform of the Chinese Academy of Sciences.
Author information
Contributions
J.Y. and Y.T. contributed equally to this work. J.Y., Y.T., and K.D. conceived the study and designed the method. J.Y. and Y.T. implemented the method, conducted the experiments, and performed the result analyses. K.F. and M.R. provided guidance and assistance in dataset construction. J.Y., Y.T., Q.Z., K.D., and H.C. wrote and revised the manuscript. K.D. and H.C. supervised the entire project. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Yu, J., Tang, Y., Feng, K. et al. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. Sci Data (2026). https://doi.org/10.1038/s41597-026-06594-9