Abstract
Accelerator-based boron neutron capture therapy (BNCT) is a binary radiation therapy that has developed rapidly in recent years. This study systematically evaluated and compared the performance of four mainstream model families [ChatGPT, Bard (Gemini), Claude, and ERNIE Bot] in answering BNCT-related knowledge questions, providing a reference for exploring their potential in BNCT professional education. Forty-seven bilingual BNCT questions covering key concepts, clinical practice, and reasoning tasks were constructed, and the four model families were tested across five rounds in two languages and two question formats. Accuracy, reasoning ability, uncertainty expression, and version effects were analyzed. ChatGPT (72.8%) and Claude (70.4%) showed significantly higher overall accuracy than Bard (Gemini) (62.0%) and ERNIE Bot (55.6%) (p < 0.001). Both high-performing models answered reasoning-based questions significantly better than fact-based questions (p < 0.001). The average performance improvement from version updates (7.51 ± 8.46 percentage points) was numerically higher than the change during same-version maintenance (0.61 ± 8.68 percentage points, p = 0.126). Although language and questioning method showed statistically significant effects, the effect sizes were minimal (ηp² < 0.01). Uncertainty acknowledgment rates varied significantly among the model families (4.7%-23.7%, p = 0.003). ChatGPT can provide relatively accurate knowledge for the popularization of BNCT. However, existing general-purpose LLMs still cannot accurately answer all BNCT questions, and they show significant differences in uncertainty expression.
Introduction
Boron neutron capture therapy (BNCT) is a precise binary radiotherapy that acts at the cellular level1. Binary radiotherapy refers to a two-component therapeutic strategy in which a non-toxic agent is selectively accumulated within tumor cells and subsequently activated by an external physical trigger to generate a localized cytotoxic effect2,3. The principle of BNCT is based on the selective uptake of ¹⁰B-containing compounds by tumor cells. When the tumor is irradiated with thermal neutrons, ¹⁰B captures a neutron and the reaction releases an alpha particle and a ⁷Li nucleus, whose ranges are only about 10 microns (roughly one cell diameter)4. Theoretically, BNCT therefore spares normal organs better than radiotherapy based on photons (X-rays and gamma rays) or charged particles (carbon ions and protons)5. In addition, alpha particles are high linear energy transfer (LET) particles with a stronger tumor-killing effect, and this process does not depend on oxygen, so BNCT is also effective against hypoxic tumors6. Dependence on reactor neutron sources long limited the development of BNCT. In March 2020, Japan's Ministry of Health, Labour and Welfare officially approved the world's first accelerator-based BNCT system and its boron drug for treating recurrent head and neck tumors, a significant milestone in the history of BNCT. More than ten countries and regions, including China, have since built accelerator-based BNCT systems, and registered clinical research is gradually being carried out7.
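For reference, the neutron capture reaction underlying this mechanism can be written out explicitly; the Q-values and branching ratio below are standard textbook figures rather than values taken from the cited sources:

$$^{10}\mathrm{B} + n_{\mathrm{th}} \rightarrow {}^{4}\mathrm{He}\,(\alpha) + {}^{7}\mathrm{Li} + 2.79\ \mathrm{MeV} \quad (\approx 6\%)$$

$$^{10}\mathrm{B} + n_{\mathrm{th}} \rightarrow {}^{4}\mathrm{He}\,(\alpha) + {}^{7}\mathrm{Li}^{*} + 2.31\ \mathrm{MeV}, \qquad {}^{7}\mathrm{Li}^{*} \rightarrow {}^{7}\mathrm{Li} + \gamma\,(0.48\ \mathrm{MeV}) \quad (\approx 94\%)$$

Both charged products deposit their energy within roughly one cell diameter, which is the basis of the selectivity described above.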
However, BNCT differs from photon and particle radiation therapy in its treatment principles, workflows, hardware, and software. The Japanese Society of Neutron Capture Therapy (JSNCT) established a BNCT practitioner qualification certification scheme in 2015 and issued updated certification requirements in 20178. Accelerator-based BNCT clinical research has also been conducted in China; however, no certification scheme for BNCT qualification exists there yet. It is therefore imperative to establish a training and assessment system for BNCT-related personnel.
In December 2022, OpenAI introduced ChatGPT, a chatbot based on a large language model (LLM), marking the beginning of a new era in generative artificial intelligence (AI). Throughout 2023, other major companies such as Google, Anthropic, and Baidu released their own products, including Bard (Gemini), Claude, and ERNIE Bot. Although the data used to train these LLMs is not composed exclusively of professional medical texts, it includes publicly available medical information to some extent, enabling the models to perform reasonably well in medical knowledge assessments9. In this study, we constructed a BNCT question bank and tested ChatGPT, Bard (Gemini), Claude, and ERNIE Bot on it, comparing the performance of these LLMs in answering BNCT-related knowledge questions.
Materials and methods
Data retrieval and test question construction
We searched the PubMed and CNKI databases for original research articles, reviews, and other works published in the last five years (the search strategy and results are provided in Appendix 1). The retrieved literature focused on the latest BNCT progress and key knowledge points. Two evaluators independently screened the literature and drafted candidate topics, while two tumor radiotherapy experts determined the final questions, options, and standard answers.
We designed comprehensive BNCT knowledge assessment questions to evaluate the models' mastery and application of professional knowledge. The test set covered a wide range of BNCT topics, including fundamental theoretical knowledge and the latest research progress, and various question types, such as objective, reasoning, and numerical problems. The final test set consisted of 47 questions, 36 fact-based and 11 reasoning-based (Appendix 2). Question classification was based on Bloom's taxonomy of educational objectives10,11. This classification method has been widely applied in educational measurement12,13 and AI evaluation studies14,15.
Definition of fact-based questions: Questions that can be answered directly through memory or clear knowledge. For example, who discovered the neutron?
Definition of reasoning-based questions: Questions that require logical reasoning, computational analysis, or information integration to derive answers. For example, given 100 mg of a radioactive nuclide, after five half-lives, the remaining mass of the element is ( ) mg.
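As a quick worked check of the sample reasoning question above (a minimal sketch in R; the values are those stated in the example, not items from the test set):

```r
# Remaining mass after n half-lives: m = m0 / 2^n
initial_mass_mg <- 100   # value given in the sample question
n_half_lives    <- 5
initial_mass_mg / 2^n_half_lives   # 3.125 mg remain after five half-lives
```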
LLMs version and test methods
All models were accessed through their official web interfaces under uniform hardware and software conditions (Google Chrome v120.0). The specific versions of ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), and ERNIE Bot (Baidu) evaluated in this study, along with their respective testing periods, are presented in Table 1. Each response was generated in full before the next question was posed, to avoid contextual contamination. The model versions were verified and recorded for each testing round.
To assess the LLMs’ performance over time, tests were conducted in five separate rounds (December 2023, January 2024, March 2024, April 2025, and June 2025). Each question was presented to the models in both Chinese and English under two questioning conditions: (1) direct questioning, where the question was posed exactly as written, and (2) prompt-based questioning, where the question was embedded in a contextual scenario instructing the model to answer as a BNCT specialist.
For example, direct questioning: The full name of BNCT is ( ).
Prompt-based questioning: 'Suppose you are a seasoned doctor specializing in BNCT tumor radiation therapy and are now taking a test. The question is: "The full name of BNCT is ( )." Please answer this question and provide a detailed explanation, reasoning, or calculation for your answer.'
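The two questioning conditions can be illustrated with a small R sketch; `build_prompt` is a hypothetical helper, and the role-play preamble paraphrases the example above rather than reproducing the exact script used in testing:

```r
# Assemble a test item under either questioning condition (illustrative only)
build_prompt <- function(question, method = c("direct", "prompt")) {
  method <- match.arg(method)
  if (method == "direct") {
    return(question)                       # question posed exactly as written
  }
  paste0(                                  # question embedded in a role-play scenario
    "Suppose you are a seasoned doctor specializing in BNCT tumor radiation ",
    "therapy and are now taking a test. The question is: '", question, "' ",
    "Please answer this question and provide a detailed explanation, ",
    "reasoning, or calculation for your answer."
  )
}

build_prompt("The full name of BNCT is ( ).", method = "prompt")
```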
Two clinicians independently scored the answers against predetermined standard responses, with a maximum of one point per question. Scoring discrepancies were resolved by consensus.
Recording and Scoring: We recorded the LLM answer content for each question. The scores were summarized in an Excel spreadsheet (Appendix 3) for statistical analysis.
Evaluation metrics and calculation methods
Accuracy
Accuracy was measured by comparing the LLM responses with pre-validated BNCT answers. For each model, the overall accuracy reported in the abstract and main results was calculated as the mean accuracy across all testing conditions:

Accuracy (%) = (number of correct responses / total number of responses) × 100
This represents the average performance across all test instances (5 rounds × 2 languages × 2 methods × 47 questions) for each model. Individual accuracy rates for specific conditions (e.g., by language or round) were calculated separately using the same formula applied to the respective data subsets.
A higher accuracy rate indicates better language model performance16.
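A minimal R sketch of this calculation is shown below, assuming the scored responses are stored one row per response with a binary correct column; the file name results.csv and the column names are placeholders rather than the authors' actual files. Wilson 95% confidence intervals, as reported in the Results, can be obtained with prop.test():

```r
# Assumed layout: one row per scored response (model, round, language, method, correct)
responses <- read.csv("results.csv")

# Mean accuracy per model across all testing conditions
aggregate(correct ~ model, data = responses, FUN = mean)

# Wilson 95% confidence interval (in %) for each model's overall accuracy
wilson_ci <- function(x) {
  round(100 * prop.test(sum(x), length(x), correct = FALSE)$conf.int, 1)
}
tapply(responses$correct, responses$model, wilson_ci)
```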
Uncertainty acknowledgment
The “hallucination” phenomenon refers to LLMs producing absurd or unreal content17. When a model directly states “I don’t know” or similar, indicating an inability to provide a relevant answer, this is considered an expression of “uncertainty acknowledgment”. The uncertainty acknowledgment rate was calculated as follows:

Uncertainty acknowledgment rate (%) = (number of incorrect responses acknowledging uncertainty / total number of incorrect responses) × 100
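A corresponding R sketch, reusing the hypothetical responses data frame from the accuracy sketch and assuming an additional binary acknowledged column (1 if the model explicitly stated “I don't know” or similar):

```r
# Uncertainty acknowledgment rate (%) among incorrect answers, per model
wrong <- subset(responses, correct == 0)
aggregate(acknowledged ~ model, data = wrong,
          FUN = function(x) round(100 * mean(x), 1))
```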
Statistical analysis
All statistical analyses were performed using R version 4.5.0. Continuous variables are reported as mean ± standard deviation (SD) or median with interquartile range (IQR), and categorical variables are reported as frequencies and percentages. Normality was assessed using the Shapiro–Wilk test. Group comparisons were performed using one-way ANOVA with Bonferroni correction or the Kruskal–Wallis test with Dunn’s post hoc test, and within-model comparisons used paired t-tests with Cohen’s d for effect size. Binary outcomes were analyzed using Wilson score confidence intervals and chi-square tests. Associations between AI performance and time-series trends were assessed using Mann–Whitney U tests. All tests were two-tailed with α = 0.05, and sample sizes were calculated to achieve 80% power for detecting medium effect sizes.
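The main group comparisons could be reproduced along the following lines; this is a sketch using the hypothetical responses data frame defined above, and Dunn's post hoc test itself requires an add-on package (e.g. FSA::dunnTest), so Bonferroni-corrected pairwise Wilcoxon tests are shown as a base-R stand-in:

```r
# Chi-square test on binary correctness across the four model families
chisq.test(table(responses$model, responses$correct))

# Per-condition accuracy (model x round x language x method),
# then a Kruskal-Wallis test across model families
per_cond <- aggregate(correct ~ model + round + language + method,
                      data = responses, FUN = mean)
kruskal.test(correct ~ model, data = per_cond)

# Pairwise comparisons with Bonferroni correction (base-R stand-in for Dunn's test)
pairwise.wilcox.test(per_cond$correct, per_cond$model,
                     p.adjust.method = "bonferroni")
```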
Results
Comparison of the overall performance of four model families
The four model families showed significant performance differences in BNCT medical knowledge testing (Fig. 1). Claude demonstrated the best and most stable performance, with a distribution peak concentrated in the 85–90% range and a tail extending toward lower accuracy (a left-skewed distribution). ChatGPT showed moderate performance, with a relatively concentrated distribution but a peak in a lower range (75–80%). Bard (Gemini) and ERNIE Bot showed more dispersed distributions, indicating greater performance fluctuations, with ERNIE Bot showing the flattest distribution.
Performance distribution of four model families in BNCT knowledge assessment. Ridge plot showing performance distributions of four model families (ChatGPT, Claude, Bard(Gemini), ERNIE Bot) on a 47-item BNCT knowledge test across 20 conditions each. Statistical markers: mean (yellow star), median (red circle), Q1 (blue triangle up), Q3 (blue triangle down). Labels show mean accuracy; parentheses indicate mean ± SD. n = 940 observations per model.
ChatGPT achieved the highest accuracy at 72.8% (95% CI: 69.9–75.6%), followed by Claude (70.4%, 95% CI: 67.5–73.3%) and Bard (Gemini) (62.0%, 95% CI: 58.9–65.1%), with ERNIE Bot performing lowest (55.6%, 95% CI: 52.5–58.8%) (Fig. 2). One-way ANOVA with Bonferroni post-hoc correction showed significant differences between all model pairs (p < 0.001) except between ChatGPT and Claude (p > 0.05).
AI model accuracy in BNCT knowledge assessment. Comparison of accuracy rates across four model families evaluated on 47 BNCT questions under 80 testing conditions each (n = 3,760 responses per model). Error bars represent 95% confidence intervals. Models are arranged by ascending performance. One-way ANOVA revealed significant differences between models (F(3, 15,036) = 242.8, p < 0.001).
Comparison of model families’ reasoning ability
The model families showed significant heterogeneity in handling different cognitive task types (Fig. 3). ChatGPT, Claude, and ERNIE Bot all performed better on reasoning-based questions than on fact-based questions, with ChatGPT and Claude showing larger effect sizes (both p < 0.001) than ERNIE Bot (p = 0.021). Bard (Gemini) showed relatively consistent performance on both question types, with no significant difference between fact-based and reasoning-based questions (p = 0.237). These findings reveal distinct performance patterns across the model families: ChatGPT and Claude demonstrated clear advantages on complex problems requiring comprehensive analysis and logical reasoning; Bard (Gemini) maintained balanced performance across both question types; and ERNIE Bot, despite a significant advantage on reasoning questions, exhibited the weakest overall performance.
Comparative analysis of factual versus reasoning performance across four model families. Grouped bar chart displaying accuracy rates for factual (coral red) and reasoning (sky blue) questions. Each model pair connected by horizontal brackets with p-values. Error bars represent 95% confidence intervals. Statistical analysis employed paired t-tests with Bonferroni correction. Models arranged by descending overall performance with exact percentages displayed.
Impact of LLM version updates on performance
During the testing period, all four model families underwent substantial version updates, showing different performance evolution patterns (Fig. 4). Claude showed the most significant continuous improvement, rising from 57.4% in Round 1 to 81.4% in Round 5, and achieved the largest single improvement among all version updates when upgrading from Claude 3 Opus to Claude 3.7 Sonnet (Round 3 to 4: 19.7 percentage points). ChatGPT maintained relatively stable high performance, slightly declining from 69.1% to 62.8% (Round 3), then recovering to 80.9% after the GPT-4.5 version update and ultimately reaching 82.4%. Despite this initial slight decline, the ChatGPT family remained among the highest performers overall. Bard (Gemini) showed a stable upward trend, improving from 50.0% to 77.7%. ERNIE Bot showed fluctuating performance initially (59.6% down to 44.7%) but improved markedly after the introduction of ERNIE Bot 4.5 and ERNIE X1 Turbo, ultimately reaching 66.5%. The average performance improvement during version updates (7.51 ± 8.46 percentage points) was numerically higher than the changes during same-version maintenance (0.61 ± 8.68 percentage points), though this difference was not statistically significant (p = 0.126).
Temporal evolution of AI model performance across five testing rounds with version transitions. Line graph depicting accuracy rates over five consecutive testing rounds. Bard(Gemini) (red), ChatGPT (cyan), Claude (green), ERNIE Bot (dark blue). Triangular markers (▲) indicate version updates; circular markers (●) represent within-version iterations. Complete version names displayed at data points. Statistical comparison between version updates and same-version changes using Mann–Whitney U test.
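A minimal sketch of the comparison referenced in the caption above, assuming the round-to-round accuracy changes are collected in a hypothetical data frame changes with a delta column (change in percentage points) and a logical version_update flag:

```r
# Mann-Whitney U test: accuracy changes at version updates vs. within-version rounds
wilcox.test(delta ~ version_update, data = changes)
```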
Impact of questioning methods and languages on BNCT knowledge response accuracy
The language effects varied only slightly across models (Fig. 5). ChatGPT performed identically in English and Chinese (both 73.2%), Claude performed marginally better in English (70.9%) than in Chinese (70.0%), Bard (Gemini) achieved the same accuracy in both languages (63.4%), and ERNIE Bot exhibited negligible differences between the two languages (Chinese: 55.5%; English: 55.7%). Overall, language had minimal practical impact on accuracy.
Heatmap visualization of language and prompting method effects on AI model performance in BNCT knowledge assessment. Heatmap matrix displaying accuracy rates with standard errors across four language-method combinations. Color gradient encodes accuracy from lowest (deep blue) to highest (dark orange). Models are arranged vertically by performance ranking. Horizontal axis shows Chinese-Direct, Chinese-Indirect, English-Direct, and English-Indirect conditions. Each cell displays the mean accuracy and standard error.
Prompt-guided questioning provided slight advantages for ChatGPT (prompt: 73.8%, direct: 71.7%) and Bard (Gemini) (prompt: 63.8%, direct: 60.2%), whereas Claude (prompt: 70.6%, direct: 70.2%) and ERNIE Bot (prompt: 55.7%, direct: 55.5%) showed minimal variation between the two questioning methods.
The three-factor ANOVA showed that intrinsic model capabilities accounted for the largest variance in performance (F = 26.15, p < 2 × 10⁻¹⁶, ηp² = 0.021), while language, questioning method, and all interactions were not significant (F = 0.234, p = 0.629, ηp² < 0.001).
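This analysis could be sketched in R as follows, reusing the hypothetical per-condition accuracy table per_cond from the Methods sketch; partial eta squared is derived from the ANOVA table sums of squares:

```r
# Three-factor ANOVA on per-condition accuracy
fit <- aov(correct ~ model * language * method, data = per_cond)
summary(fit)

# Partial eta squared for a given term: SS_effect / (SS_effect + SS_error)
partial_eta_sq <- function(aov_fit, term) {
  tab <- summary(aov_fit)[[1]]
  ss_effect <- tab[trimws(rownames(tab)) == term, "Sum Sq"]
  ss_error  <- tab[trimws(rownames(tab)) == "Residuals", "Sum Sq"]
  ss_effect / (ss_effect + ss_error)
}
partial_eta_sq(fit, "model")
```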
Comparison of “Uncertainty Acknowledgment” among different model families
The four model families showed significant differences in uncertainty acknowledgment rates when providing incorrect answers (p = 0.003) (Fig. 6). Bard (Gemini) demonstrated the highest uncertainty acknowledgment rate (17.6% ± 13.8%), actively expressing “don’t know” in nearly one-fifth of its incorrect responses, reflecting relatively conservative awareness of its knowledge boundaries. ChatGPT (17.4% ± 15.1%) showed a similar level, while ERNIE Bot (13.2% ± 9.2%) exhibited an intermediate tendency. Claude showed the lowest uncertainty acknowledgment rate (4.7% ± 7.6%), preferring to provide definitive answers rather than acknowledge knowledge gaps when its answers were incorrect. Pairwise comparisons revealed that ChatGPT (p < 0.01), Bard (Gemini) (p < 0.001), and ERNIE Bot (p < 0.01) all had significantly higher acknowledgment rates than Claude. These differences may stem from fundamental variations in training objectives, alignment strategies, or architectural design. The higher acknowledgment rates of Bard (Gemini) and ChatGPT provide more transparent indications of knowledge boundaries, whereas Claude’s low acknowledgment rate prioritizes response completeness. Both strategies carry potential value and risks in medical AI applications.
Discussion
This study represents the first systematic evaluation of mainstream model families’ performance in answering BNCT-related medical knowledge questions. Our comprehensive assessment across multiple dimensions revealed that ChatGPT and Claude significantly outperformed Bard(Gemini) and ERNIE Bot. Notably, high-performing models showed enhanced reasoning capabilities over factual recall, while exhibiting distinct patterns in uncertainty acknowledgment and version-dependent performance improvements.
Despite the rapid advances in LLM capabilities, their application in specialized medical domains, such as BNCT, remains unexplored. Current research demonstrates that LLMs have achieved remarkable performance in general medical knowledge assessment, with GPT-4 reaching 80–90% accuracy on USMLE examinations9 and Med-PaLM 2 becoming the first AI system to achieve human expert-level performance on medical licensing questions18. However, these achievements focus on broad medical knowledge domains, and no studies have specifically examined LLM performance in BNCT-related assessments. Our study addresses this gap by providing the first systematic evaluation of mainstream LLMs on BNCT-specific knowledge, offering insights into the capabilities and limitations of AI in this specialized medical field.
ChatGPT and Claude demonstrated significantly higher accuracy on reasoning-based tasks than on fact-based questions, raising profound questions about whether LLMs truly possess reasoning capabilities. Some researchers argue that LLMs merely generate seemingly plausible answers through pattern matching and statistical correlations19, while others have presented evidence of the emergence of reasoning abilities in LLMs20. Our findings tend to support the latter perspective—the models’ superior performance on multi-step logical inference problems in BNCT dosimetry calculations and treatment protocol design is difficult to explain solely through simple pattern recognition. This “reasoning advantage” may stem from extensive structured scientific literature in the training data, enabling models to learn the identification and application of causal relationship chains21. However, whether this capability constitutes genuine reasoning or merely sophisticated pattern matching remains an open question requiring further investigation.
Our findings reveal that language and questioning methodologies have minimal impact on LLM performance. The negligible language differences align with existing evidence showing that multilingual LLMs maintain consistent performance across languages. Studies have demonstrated that models such as MMed-Llama 3 achieve comparable results on English medical benchmarks despite multilingual training22, and contemporary research has confirmed stable capabilities across languages for structured knowledge tasks23,24. Our observation of minimal differences between prompt-based and direct questioning contrasts with extensive research demonstrating significant improvements through prompt engineering. Medical applications show that chain-of-thought prompting can improve clinical guideline consistency from 62.9% to 77.5%25, while specialized strategies like OpenMedLM outperform fine-tuning for medical question-answering26. The minimal effect sizes (ηp² < 0.001) suggest that for well-established medical knowledge domains, such as BNCT, model architecture and training data may be more decisive factors than linguistic presentation or prompting strategies.
The significant variation in uncertainty acknowledgment rates (4.7%-17.6%) among the models has important implications for medical AI applications. Medical hallucination studies indicate that LLMs frequently exhibit overconfidence, with hallucination rates of 47% in some applications27. Bard (Gemini)’s higher uncertainty rate (17.6%) suggests a more conservative approach, while Claude’s lower rate (4.7%) indicates greater confidence. However, poor calibration between confidence and accuracy can mislead clinicians28. These findings highlight the importance of evaluating uncertainty expression in specialized medical domains, as general-purpose medical AI assessments may not adequately capture the unique requirements of highly specialized fields like BNCT29.
Version updates represent the most significant driver of LLM performance improvement, with major architectural revisions yielding substantially larger gains than incremental updates. Our observed average improvement of 7.51 percentage points during version transitions aligns with broader field patterns30. The Claude 2 to Claude 3 Opus transition demonstrated a two-fold accuracy improvement on challenging questions31, while GPT-4o achieved 89.2% accuracy on the Japanese Medical Licensing Examination compared to GPT-4's 76.8%30. These performance gains reflect fundamental advances in model architecture, training methodologies, and dataset curation. For specialized medical applications, such as BNCT, architectural improvements enhance both domain-specific knowledge representation and performance reliability.
This study had several limitations. First, no authoritative BNCT examination standard has yet been established. Second, the number of test questions and the sample size were limited; a larger, more standardized question bank would improve robustness. Finally, the dynamic nature of LLMs’ computational resources and training data may lead to varying results over time, necessitating further research.
The rapid evolution of LLM capabilities, combined with the accelerating pace of medical innovation, suggests that specialized AI-assisted education and decision support will become increasingly viable. As BNCT and similar emerging therapies gain broader clinical adoption, the development of dedicated medical AI systems trained on curated clinical datasets will likely bridge the current gap between general-purpose LLMs and domain-specific expertise, ultimately enhancing both medical education quality and patient care outcomes in specialized therapeutic areas.
Conclusion
ChatGPT demonstrated superior overall performance and the strongest reasoning capabilities among the tested model families. However, existing general-purpose LLMs still cannot accurately answer all BNCT questions and show significant differences in uncertainty expression. The results support the use of LLMs as auxiliary tools for BNCT education but emphasize the need to develop specialized models and establish standardized application frameworks to ensure their safety and reliability in medical applications.
Data availability
The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.
References
Cheng, X., Li, F. & Liang, L. Boron neutron capture therapy: clinical application and research progress. Curr. Oncol. 29, 7868–7886 (2022).
Shi, Y. et al. Localized nuclear reaction breaks boron drug capsules loaded with immune adjuvants for cancer immunotherapy. Nat. Commun. 14, 1884 (2023).
Mirzaei, H. R. et al. Boron neutron capture therapy: moving toward targeted cancer therapy. J. Cancer Res. Ther. 12, 520–525 (2016).
Sauerwein, W. A. G. Principles and roots of neutron capture therapy. In Neutron Capture Therapy: Principles and Applications (eds Sauerwein, W. et al.) 1–16 (Springer, 2012). https://doi.org/10.1007/978-3-642-31334-9_1.
Matsumoto, Y. et al. A critical review of radiation therapy: from particle beam therapy (proton, carbon, and BNCT) to beyond. J. Pers. Med. 11, 825 (2021).
Matsumura, A. et al. Initiatives toward clinical boron neutron capture therapy in Japan. Cancer Biother. Radiopharm. 38, 201–207 (2023).
Zhou, T. et al. The current status and novel advances of boron neutron capture therapy clinical trials. Am. J. Cancer Res. 14, 429–447 (2024).
Japanese Society of Neutron Capture Therapy. http://www.jsnct.jp/e/.
Brin, D. et al. How large language models perform on the United States medical licensing examination: a systematic review. Preprint at https://doi.org/10.1101/2023.09.03.23294842 (2023).
Anderson, L. W. & Krathwohl, D. R. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives (Longman, 2001).
Mayer, R. E. Rote versus meaningful learning. Theory Pract. 41, 226–232 (2002).
Halpern, D. F. Thought and Knowledge: An Introduction to Critical Thinking 5th edn. (Psychology Press, 2014).
Nitko, A. J. & Brookhart, S. M. Educational Assessment of Students (Pearson/Allyn & Bacon, 2011).
OpenAI et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2024).
Bommarito II, M. & Katz, D. M. GPT takes the bar exam. Preprint at https://doi.org/10.48550/arXiv.2212.14402 (2022).
Suárez, A. et al. Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers. Int. Endod. J. 57, 108–113 (2024).
Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27, 120 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Mirzadeh, I. et al. GSM-symbolic: understanding the limitations of mathematical reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2410.05229 (2024).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2212.09196 (2023).
H, Z. et al. Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference. Brief. Bioinform. 26, (2025).
Qiu, P. et al. Towards building multilingual language model for medicine. Nat. Commun. 15, 8384 (2024).
Qin, L. et al. A survey of multilingual large language models. Patterns 6, (2025).
Lai, V. D. et al. Okapi: instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2307.16039 (2023).
Wang, L. et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digital Med. 7, 41 (2024).
Maharjan, J. et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci. Rep. 14, 14156 (2024).
Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digital Med. 8, 274 (2025).
Hao, G., Wu, J., Pan, Q. & Morello, R. Quantifying the uncertainty of LLM hallucination spreading in complex adaptive social networks. Sci. Rep. 14, 16375 (2024).
Kim, Y. et al. Medical hallucination in foundation models and their impact on healthcare. Preprint at https://doi.org/10.1101/2025.02.28.25323115 (2025).
Liu, M. et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using japanese national medical examination. Int. J. Med. Inf. 193, 105673 (2025).
Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family.
Funding
This study was supported by the Natural Science Research Project for Anhui Universities (No. KJ2021A0311) and the Anhui Province Health Research Project (No. AHWJ2022b082). The funding sources were not involved in the research design, data collection, analysis, writing, or publication decisions.
Author information
Contributions
Shen Shumin: Conceptualization, Data analysis, Writing – original draft, Writing – review & editing. Wang Shanghu: Conceptualization, Data collection, Writing – original draft, Writing – review & editing. Gao Mingzhu: Data curation, Methodology, Writing – review & editing. Yang Yuecai: Methodology, Investigation, Writing – review & editing. Wu Xiuwei: Data curation, Investigation, Writing – review & editing. Wu Jinjin: Data curation, Investigation, Writing – review & editing. Zhou Dacheng: Data curation, Methodology, Writing – review & editing. Wang Nianfei: Conceptualization, Project administration, Supervision, Writing – review & editing.
Ethics declarations
Competing interests
The authors declare that there are no conflicts of interest. The authors also have no conflicts of interest with the companies whose LLMs were tested.
Ethical approval
This study used only publicly available Internet data and did not involve human subjects. Therefore, no specific ethical considerations were required in this study.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shen, S., Wang, S., Gao, M. et al. Performance comparison of large language models in boron neutron capture therapy knowledge assessment. Sci Rep 16, 5321 (2026). https://doi.org/10.1038/s41598-026-36322-7