SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
  • Data Descriptor
  • Open access
  • Published: 26 February 2026


  • Jing Yu1,2,
  • Yuqi Tang (ORCID: 0009-0003-4903-7234)1,3,
  • Kehua Feng (ORCID: 0009-0001-9620-0511)1,4,
  • Lei Liang5,
  • Qiang Zhang (ORCID: 0000-0003-1636-5269)3,
  • Keyan Ding (ORCID: 0000-0003-2900-7313)1 &
  • Huajun Chen1,4

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Data acquisition
  • Databases
  • Interdisciplinary studies

Abstract

Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily target general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured text. SciCUEval systematically evaluates four core competencies through a variety of question formats: relevant-information identification, information-absence detection, multi-source information integration, and context-aware inference. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, provide a fine-grained analysis of their strengths and limitations in scientific context understanding, and offer insights for the future development of scientific-domain LLMs.
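A benchmark of this shape is typically scored as per-competency accuracy over question items. The minimal sketch below illustrates that idea only; the item schema (fields such as "competency", "question", "context", "options", and "answer") is a hypothetical illustration, not the actual SciCUEval format, which should be taken from the figshare dataset and GitHub scripts.

```python
# Hypothetical sketch: per-competency accuracy over SciCUEval-style
# multiple-choice items. The item schema below is an assumption made
# for illustration, not the dataset's real format.
from collections import defaultdict

ITEMS = [
    {"competency": "relevant-information-identification",
     "question": "What melting point does the context give for NaCl?",
     "context": "NaCl melts at 801 C and boils at 1413 C.",
     "options": {"A": "801 C", "B": "1413 C", "C": "Not stated"},
     "answer": "A"},
    {"competency": "information-absence-detection",
     "question": "What is the density of NaCl according to the context?",
     "context": "NaCl melts at 801 C and boils at 1413 C.",
     "options": {"A": "2.16 g/cm3", "B": "1.00 g/cm3", "C": "Not stated"},
     "answer": "C"},
]

def evaluate(predict, items):
    """Score predict(question, context, options) -> option letter,
    returning accuracy grouped by competency."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["competency"]] += 1
        guess = predict(item["question"], item["context"], item["options"])
        if guess == item["answer"]:
            correct[item["competency"]] += 1
    return {c: correct[c] / total[c] for c in total}

# A trivial baseline that always answers "A" gets the first item right
# (the answer is stated in the context) and misses the absence item.
scores = evaluate(lambda q, c, o: "A", ITEMS)
```

A real model would replace the lambda with a call that formats the question, context, and options into a prompt and parses the returned option letter.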

Data availability

The complete SciCUEval dataset used in this study has been uploaded to figshare (https://doi.org/10.6084/m9.figshare.29924687)27.

Code availability

The SciCUEval evaluation scripts for this study have been uploaded to GitHub (https://github.com/HICAI-ZJU/SciCUEval).

References

  1. Bai, J. et al. Qwen technical report. arXiv:2309.16609 (2023).

  2. OpenAI et al. GPT-4o System Card. arXiv:2410.21276 (2024).

  3. Dubey, A. et al. The Llama 3 herd of models. arXiv:2407.21783 (2024).

  4. Qwen et al. Qwen2.5 Technical Report. arXiv:2412.15115 (2025).

  5. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv:1903.10676 (2019).

  6. Mann, B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).

  7. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).


  8. Mialon, G. et al. Augmented language models: A survey. arXiv:2302.07842 (2023).

  9. Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).

  10. Liu, J. et al. RepoQA: Evaluating long context code understanding. arXiv:2406.06025 (2024).

  11. Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs struggle with long in-context learning. arXiv:2404.02060 (2024).

  12. Bai, Y. et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv:2308.14508 (2023).

  13. Bai, Y. et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204 (2024).

  14. Wellawatte, G. P. et al. ChemLit-QA: A human evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 6, 020601 (2025).


  15. Zhong, X. et al. Benchmarking Retrieval-Augmented Generation for Chemistry. arXiv:2505.07671 (2025).

  16. Fang, X. et al. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding-A Survey. arXiv:2402.17944 (2024).

  17. He, X. et al. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems 37, 132876–132907 (2024).


  18. Talmor, A. & Berant, J. The Web as a knowledge-base for answering complex questions. arXiv:1803.06643 (2018).

  19. International Atomic Energy Agency (IAEA). Nuclear Data Services - ENSDF Query Form. Available at: https://www-nds.iaea.org/relnsd/NdsEnsdf/QueryForm.html.

  20. The Materials Project. Available at: https://next-gen.materialsproject.org.

  21. National Center for Biotechnology Information (NCBI). Available at: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72.

  22. Gene Ontology Consortium. Available at: https://geneontology.org/docs/download-ontology/.

  23. Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research gkw985 (2016).

  24. Zheng, S. et al. PharmKG: A dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics 22, bbaa344 (2021).


  25. Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Scientific Data 10(1), 67 (2023).


  26. Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv:1908.10084 (2019).

  27. Ding, K. & Tang, Y. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. figshare https://doi.org/10.6084/m9.figshare.29924687 (2025).

  28. Anthropic AI. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024).

  29. DeepSeek-AI et al. DeepSeek-V3 Technical Report. arXiv:2412.19437 (2024).

  30. Guo, D. et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).

  31. Yang, A. et al. Qwen3 Technical Report. arXiv:2505.09388 (2025).

  32. Meta. Llama 4: Leading Intelligence. Available at: https://www.llama.com/models/llama-4/. Accessed: 2025-05-19.

  33. Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).

  34. Team GLM et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024).

  35. Gemma Team et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 (2024).

  36. Zhang, D. et al. SciGLM: Training scientific language models with self-reflective instruction annotation and tuning. arXiv:2401.07950 (2024).

  37. Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv:2402.09391 (2024).

  38. Zhang, D. et al. ChemLLM: A Chemical Large Language Model. arXiv:2402.06852 (2024).

  39. Zhao, Z. et al. ChemDFM: Dialogue Foundation Model for Chemistry. arXiv:2401.14818 (2024).

  40. Chen, J., Lin, H., Han, X. & Sun, L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence 38(16), 17754–17762 (2024).



Acknowledgements

This work is funded by New Generation Artificial Intelligence - National Science and Technology Major Project (2025ZD0122801, H.C.), NSFC62301480 (K.D.), NSFC62302433 (Q.Z.), NSFCU23A20496 (Q.Z.), and Ant Group Research Fund (K.D.). The AI-driven experiments, simulations and model training were performed on the robotic AI-Scientist platform of Chinese Academy of Sciences.

Author information

Authors and Affiliations

  1. ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, China

    Jing Yu, Yuqi Tang, Kehua Feng, Keyan Ding & Huajun Chen

  2. The Polytechnic Institute, Zhejiang University, Hangzhou, China

    Jing Yu

  3. ZJU-UIUC, Zhejiang University, Jiaxing, China

    Yuqi Tang & Qiang Zhang

  4. College of Computer Science and Technology, Zhejiang University, Hangzhou, China

    Kehua Feng & Huajun Chen

  5. AntGroup, Hangzhou, China

    Lei Liang


Contributions

J.Y. and Y.T. contributed equally to this work. J.Y., Y.T., and K.D. conceived the study and designed the method. J.Y. and Y.T. implemented the method, conducted the experiments, and performed the result analyses. K.F. and M.R. provided guidance and assistance in dataset construction. J.Y., Y.T., Q.Z., K.D., and H.C. wrote and revised the manuscript. K.D. and H.C. supervised the entire project. All authors reviewed and approved the final paper.

Corresponding authors

Correspondence to Keyan Ding or Huajun Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Yu, J., Tang, Y., Feng, K. et al. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. Sci Data (2026). https://doi.org/10.1038/s41597-026-06594-9


  • Received: 19 August 2025

  • Accepted: 08 January 2026

  • Published: 26 February 2026

  • DOI: https://doi.org/10.1038/s41597-026-06594-9

