Abstract
Accelerator-based boron neutron capture therapy (BNCT) is a binary radiation therapy that has developed rapidly in recent years. This study systematically evaluated and compared four mainstream large language model (LLM) families [ChatGPT, Bard (Gemini), Claude, and ERNIE Bot] in answering BNCT-related knowledge questions, providing a reference for exploring their potential in BNCT professional education. Forty-seven bilingual BNCT questions covering key concepts, clinical practice, and reasoning tasks were constructed, and each model family was tested across five rounds in two languages and two question formats. Accuracy, reasoning ability, uncertainty expression, and version effects were analyzed. ChatGPT (72.8%) and Claude (70.4%) showed significantly higher overall accuracy than Bard (Gemini) (62.0%) and ERNIE Bot (55.6%) (p < 0.001). Both high-performing models answered reasoning-based questions significantly better than fact-based questions (p < 0.001). The average performance improvement from version updates (7.51 ± 8.46 percentage points) was numerically higher than the change during same-version maintenance (0.61 ± 8.68 percentage points, p = 0.126). Although language and questioning method showed statistically significant effects, the effect sizes were minimal (η²p < 0.01). Uncertainty acknowledgment rates varied significantly among the model families (4.7%–23.7%, p = 0.003). ChatGPT can provide relatively accurate knowledge for the popularization of BNCT. However, existing general-purpose LLMs still cannot accurately answer all BNCT questions and differ significantly in how they express uncertainty.
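The between-model accuracy comparison reported above (p < 0.001) can be illustrated with a chi-square test of independence on a models × (correct/incorrect) contingency table. The sketch below is not the authors' actual analysis; the answer counts are hypothetical, chosen only to match the reported accuracy rates at an assumed 500 scored responses per model family.

```python
# Hypothetical counts reproducing the reported accuracy rates
# (72.8%, 70.4%, 62.0%, 55.6%) at an assumed n = 500 per model.
counts = {
    "ChatGPT":       (364, 136),  # (correct, incorrect)
    "Claude":        (352, 148),
    "Bard (Gemini)": (310, 190),
    "ERNIE Bot":     (278, 222),
}

# Chi-square statistic for the 4x2 contingency table, computed by hand.
total_correct = sum(c for c, _ in counts.values())
total_wrong = sum(w for _, w in counts.values())
grand = total_correct + total_wrong

chi2 = 0.0
for correct, wrong in counts.values():
    n = correct + wrong
    exp_c = n * total_correct / grand  # expected correct under independence
    exp_w = n * total_wrong / grand    # expected incorrect
    chi2 += (correct - exp_c) ** 2 / exp_c + (wrong - exp_w) ** 2 / exp_w

df = len(counts) - 1  # (rows - 1) * (cols - 1) = 3 * 1
# The critical value for df = 3 at alpha = 0.001 is about 16.27, so a
# statistic well above it is consistent with the reported p < 0.001.
print(f"chi2 = {chi2:.2f}, df = {df}")
```

With these illustrative counts the statistic is roughly 41, far above the df = 3 critical value, so the direction of the reported result is easy to reproduce even though the true per-cell counts are not given in the abstract.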
Data availability
The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.
References
Cheng, X., Li, F. & Liang, L. Boron neutron capture therapy: clinical application and research progress. Curr. Oncol. 29, 7868–7886 (2022).
Shi, Y. et al. Localized nuclear reaction breaks boron drug capsules loaded with immune adjuvants for cancer immunotherapy. Nat. Commun. 14, 1884 (2023).
Mirzaei, H. R. et al. Boron neutron capture therapy: moving toward targeted cancer therapy. J. Cancer Res. Ther. 12, 520–525 (2016).
Sauerwein, W. A. G. Principles and roots of neutron capture therapy. In Neutron Capture Therapy: Principles and Applications (eds Sauerwein, W. et al.) 1–16 (Springer, 2012). https://doi.org/10.1007/978-3-642-31334-9_1.
Matsumoto, Y. et al. A critical review of radiation therapy: from particle beam therapy (proton, carbon, and BNCT) to beyond. J. Pers. Med. 11, 825 (2021).
Matsumura, A. et al. Initiatives toward clinical boron neutron capture therapy in Japan. Cancer Biother. Radiopharm. 38, 201–207 (2023).
Zhou, T. et al. The current status and novel advances of boron neutron capture therapy clinical trials. Am. J. Cancer Res. 14, 429–447 (2024).
Japanese Society of Neutron Capture Therapy | Home. http://www.jsnct.jp/e/.
Brin, D. et al. How large language models perform on the United States medical licensing examination: a systematic review. Preprint at medRxiv https://doi.org/10.1101/2023.09.03.23294842 (2023).
Anderson, L. W. & Krathwohl, D. R. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives (Longman, 2001).
Mayer, R. E. Rote versus meaningful learning. Theory Pract. 41, 226–232 (2002).
Halpern, D. F. Thought and Knowledge: An Introduction to Critical Thinking 5th edn. (Psychology Press, 2014).
Nitko, A. J. & Brookhart, S. M. Educational Assessment of Students (Pearson/Allyn & Bacon, 2011).
OpenAI et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2024).
Bommarito II, M. & Katz, D. M. GPT takes the bar exam. Preprint at https://doi.org/10.48550/arXiv.2212.14402 (2022).
Suárez, A. et al. Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers. Int. Endod. J. 57, 108–113 (2024).
Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27, 120 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Mirzadeh, I. et al. GSM-symbolic: understanding the limitations of mathematical reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2410.05229 (2024).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2212.09196 (2023).
H, Z. et al. Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference. Brief. Bioinform. 26 (2025).
Qiu, P. et al. Towards building multilingual language model for medicine. Nat. Commun. 15, 8384 (2024).
Qin, L. et al. A survey of multilingual large language models. Patterns 6 (2025).
Lai, V. D. et al. Okapi: instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2307.16039 (2023).
Wang, L. et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digital Med. 7, 41 (2024).
Maharjan, J. et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci. Rep. 14, 14156 (2024).
Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digital Med. 8, 274 (2025).
Hao, G., Wu, J., Pan, Q. & Morello, R. Quantifying the uncertainty of LLM hallucination spreading in complex adaptive social networks. Sci. Rep. 14, 16375 (2024).
Kim, Y. et al. Medical hallucination in foundation models and their impact on healthcare. Preprint at medRxiv https://doi.org/10.1101/2025.02.28.25323115 (2025).
Liu, M. et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using the Japanese national medical examination. Int. J. Med. Inf. 193, 105673 (2025).
Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family.
Funding
This study was supported by the Natural Science Research Project for Anhui Universities (No. KJ2021A0311) and the Anhui Province Health Research Project (No. AHWJ2022b082). The funding sources were not involved in the research design, data collection, analysis, writing, or publication decisions.
Author information
Authors and Affiliations
Contributions
**Shen Shumin** : Conceptualization, Data analysis, Writing – original draft, Writing – review & editing. **Wang Shanghu** : Conceptualization, Data collection, Writing – original draft, Writing – review & editing. **Gao Mingzhu** : Data curation, Methodology, Writing – review & editing. **Yang Yuecai** : Methodology, Investigation, Writing – review & editing. **Wu Xiuwei** : Data curation, Investigation, Writing – review & editing. **Wu Jinjin** : Data curation, Investigation, Writing – review & editing. **Zhou Dacheng** : Data curation, Methodology, Writing – review & editing. **Wang Nianfei** : Conceptualization, Project administration, Supervision, Writing – review & editing.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no conflicts of interest. There are likewise no conflicts of interest between the authors and the companies whose LLMs were tested.
Ethical approval
This study used only publicly available Internet data and did not involve human subjects; therefore, no ethical approval was required.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shen, S., Wang, S., Gao, M. et al. Performance comparison of large language models in boron neutron capture therapy knowledge assessment. Sci Rep (2026). https://doi.org/10.1038/s41598-026-36322-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-36322-7