Abstract
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, their performance across diverse scientific domains remains underexplored: existing benchmarks focus primarily on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assessing the scientific context-understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, and integrates diverse data modalities, including structured tables, knowledge graphs, and unstructured text. Through a variety of question formats, SciCUEval systematically evaluates four core competencies: relevant-information identification, information-absence detection, multi-source information integration, and context-aware inference. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, provide a fine-grained analysis of their strengths and limitations in scientific context understanding, and offer valuable insights for the future development of scientific-domain LLMs.
Data availability
The complete SciCUEval dataset used in this study has been deposited in figshare (https://doi.org/10.6084/m9.figshare.29924687)27.
Code availability
The SciCUEval evaluation scripts for this study have been uploaded to GitHub (https://github.com/HICAI-ZJU/SciCUEval).
References
Bai, J. et al. Qwen technical report. arXiv:2309.16609 (2023).
OpenAI et al. GPT-4o system card. arXiv:2410.21276 (2024).
Dubey, A. et al. The Llama 3 herd of models. arXiv:2407.21783 (2024).
Qwen et al. Qwen2.5 technical report. arXiv:2412.15115 (2025).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. arXiv:1903.10676 (2019).
Mann, B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
Mialon, G. et al. Augmented language models: a survey. arXiv:2302.07842 (2023).
Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).
Liu, J. et al. RepoQA: evaluating long context code understanding. arXiv:2406.06025 (2024).
Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs struggle with long in-context learning. arXiv:2404.02060 (2024).
Bai, Y. et al. LongBench: a bilingual, multitask benchmark for long context understanding. arXiv:2308.14508 (2023).
Bai, Y. et al. LongBench v2: towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204 (2024).
Wellawatte, G. P. et al. ChemLit-QA: a human-evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 6, 020601 (2025).
Zhong, X. et al. Benchmarking retrieval-augmented generation for chemistry. arXiv:2505.07671 (2025).
Fang, X. et al. Large language models (LLMs) on tabular data: prediction, generation, and understanding - a survey. arXiv:2402.17944 (2024).
He, X. et al. G-Retriever: retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems 37, 132876–132907 (2024).
Talmor, A. & Berant, J. The Web as a knowledge-base for answering complex questions. arXiv:1803.06643 (2018).
International Atomic Energy Agency (IAEA). Nuclear Data Services - ENSDF Query Form. Available at: https://www-nds.iaea.org/relnsd/NdsEnsdf/QueryForm.html.
The Materials Project. Available at: https://next-gen.materialsproject.org.
National Center for Biotechnology Information (NCBI). Available at: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72.
Gene Ontology Consortium. Available at: https://geneontology.org/docs/download-ontology/.
Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research gkw985 (2016).
Zheng, S. et al. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics 22, bbaa344 (2021).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Scientific Data 10, 67 (2023).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019).
Ding, K. & Tang, Y. SciCUEval: a comprehensive dataset for evaluating scientific context understanding in large language models. figshare https://doi.org/10.6084/m9.figshare.29924687 (2025).
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 model card (2024).
DeepSeek-AI et al. DeepSeek-V3 technical report. arXiv:2412.19437 (2024).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).
Yang, A. et al. Qwen3 technical report. arXiv:2505.09388 (2025).
Meta. Llama 4: leading intelligence. Available at: https://www.llama.com/models/llama-4/ (accessed 19 May 2025).
Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).
Team GLM et al. ChatGLM: a family of large language models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024).
Gemma Team et al. Gemma 2: improving open language models at a practical size. arXiv:2408.00118 (2024).
Zhang, D. et al. SciGLM: training scientific language models with self-reflective instruction annotation and tuning. arXiv:2401.07950 (2024).
Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. LlaSMol: advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv:2402.09391 (2024).
Zhang, D. et al. ChemLLM: a chemical large language model. arXiv:2402.06852 (2024).
Zhao, Z. et al. ChemDFM: dialogue foundation model for chemistry. arXiv:2401.14818 (2024).
Chen, J., Lin, H., Han, X. & Sun, L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 17754–17762 (2024).
Acknowledgements
This work is funded by the New Generation Artificial Intelligence - National Science and Technology Major Project (2025ZD0122801, H.C.), NSFC grants 62301480 (K.D.), 62302433 (Q.Z.), and U23A20496 (Q.Z.), and the Ant Group Research Fund (K.D.). The AI-driven experiments, simulations, and model training were performed on the robotic AI-Scientist platform of the Chinese Academy of Sciences.
Author information
Contributions
J.Y. and Y.T. contributed equally to this work. J.Y., Y.T., and K.D. conceived the study and designed the method. J.Y. and Y.T. implemented the method, conducted the experiments, and performed the result analyses. K.F. and M.R. provided guidance and assistance in dataset construction. J.Y., Y.T., Q.Z., K.D., and H.C. wrote and revised the manuscript. K.D. and H.C. supervised the entire project. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Yu, J., Tang, Y., Feng, K. et al. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. Sci Data (2026). https://doi.org/10.1038/s41597-026-06594-9