Abstract
Background
The performance of global large language models (LLMs), trained largely on Western data, in addressing disease-related questions in other settings and languages is unknown. Taking myopia as an illustration, we compared global and Chinese-domain LLMs in addressing Chinese-specific myopia-related questions.
Methods
Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, Llama-2 7B Chat) and Chinese-domain LLMs (Huatuo-GPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries across 10 domains. Three myopia experts evaluated the accuracy of the responses on a three-point scale. Responses rated “Good” were further evaluated for comprehensiveness and empathy on a five-point scale. For responses rated “Poor”, the LLMs were prompted to self-correct, and the revised responses were re-analyzed.
Results
The top three LLMs for accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), which also had the highest proportions of “Good” responses (94.8%). The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059 versus ChatGPT-3.5). The top five for empathy were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052 versus ChatGPT-3.5). Baidu ERNIE 4.0 received no “Poor” ratings, while the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%.
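The accuracy figures above are reported as mean ± SD. The abstract does not state how the three experts' ratings were combined, so the following is only a minimal sketch under the assumption that each response's accuracy score is the sum of three ratings on a 1 to 3 scale (giving totals from 3 to 9, which is consistent with means near 8.6 to 8.7). The rating values, the label for the middle grade, and all function names are hypothetical and not taken from the study.

```python
from statistics import mean, stdev

# Hypothetical ratings: one tuple per response, one value per expert.
# Assumed coding (not stated in the abstract): 1 = Poor, 2 = middle grade, 3 = Good.
ratings = [
    (3, 3, 3),
    (3, 2, 3),
    (3, 3, 2),
    (1, 2, 1),
]


def total_scores(rows):
    """Sum the three expert ratings for each response (possible range 3 to 9)."""
    return [sum(row) for row in rows]


def summarize(rows):
    """Return (mean, sample SD) of the per-response total scores."""
    totals = total_scores(rows)
    return mean(totals), stdev(totals)


def self_correction_rate(before, after):
    """Share of initially 'Poor' responses whose rating improved after re-prompting."""
    improved = sum(1 for b, a in zip(before, after) if a > b)
    return improved / len(before)


if __name__ == "__main__":
    m, sd = summarize(ratings)
    print(f"accuracy: {m:.2f} ± {sd:.2f}")
    # Two initially 'Poor' responses, one of which improved after self-correction.
    print(f"self-correction: {self_correction_rate([1, 1], [3, 1]):.0%}")
```

Under this assumed scheme, a self-correction rate of 50% to 100% would mean that between half and all of a model's initially “Poor” responses received a better rating after re-prompting.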
Conclusions
Both global and Chinese-domain LLMs performed effectively in addressing Chinese-specific myopia-related queries. Global LLMs achieved optimal performance in Chinese-language settings despite being trained primarily on non-Chinese data and in English.
Data availability
The datasets generated and analyzed during the current study are available in the GitHub repository [https://github.com/YukinoWmy/Original-Chinese-question-response-pairs-and-their-English-translations-with-scoring].
Acknowledgements
We would like to acknowledge the funding support of the National Key R&D Program of China (2022YFC2502800), the National Natural Science Fund of China (82388101), and the Beijing Natural Science Foundation (IS23096).
Author information
Contributions
ZHJ, YYX, and YCT conceptualized and designed the research. ZHJ, YYX, ZYW, and YXH analyzed the data. ZWL and SMY confirmed the experimental design. ZHJ, YYX, ZP, QW, and GYW conducted the experimental investigation. YCT, YXW, XFW, and TYW reviewed the first manuscript and suggested revisions. YCT and YXW developed the experimental concept and supervised the whole experiment.
Ethics declarations
Competing interests
TYW discloses receiving consulting fees from multiple pharmaceutical companies including Aldropika Therapeutics, Bayer, Boehringer Ingelheim, Genentech, Iveric Bio, Novartis, Plano, Oxurion, Roche, Sanofi, Shanghai Henlius. Additionally, TYW holds roles as an inventor, patent holder, and co-founder of the start-up companies EyRiS and Visre. The remaining authors have reported no conflicts of interest.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, Z., Xu, Y., Lim, Z.W. et al. Comparative performance analysis of global and chinese-domain large language models for myopia. Eye 39, 2015–2022 (2025). https://doi.org/10.1038/s41433-025-03775-5
DOI: https://doi.org/10.1038/s41433-025-03775-5