Comparative performance analysis of global and Chinese-domain large language models for myopia

Abstract

Background

The performance of global large language models (LLMs), which are trained largely on Western data, in managing disease in other settings and languages is unknown. Taking myopia as an illustration, we compared global and Chinese-domain LLMs in addressing Chinese-specific myopia-related questions.

Methods

Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Llama-2 7B Chat) and Chinese-domain LLMs (Huatuo-GPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries across 10 domains. Three myopia experts evaluated the accuracy of the responses on a three-point scale. Responses rated “Good” were further evaluated for comprehensiveness and empathy on a five-point scale. For responses rated “Poor”, the LLMs were prompted to self-correct, and the revised responses were re-analyzed.
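The grading workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' scoring procedure: the numeric coding of the three-point scale, the middle label "Borderline", the majority-vote consensus rule, and all function names are hypothetical, and the ratings are toy values rather than study data. Summing three experts' scores gives a range of 3 to 9, which is consistent with the reported accuracy means close to 9, but the abstract does not state how totals were computed.

```python
from collections import Counter
from statistics import mean, stdev

# Assumed numeric coding; the abstract names only "Good" and "Poor",
# so the middle label "Borderline" is a placeholder.
SCORE = {"Poor": 1, "Borderline": 2, "Good": 3}


def total_score(ratings):
    """Sum three experts' ratings for one response (assumed range 3 to 9)."""
    return sum(SCORE[r] for r in ratings)


def consensus(ratings):
    """Majority label; a three-way split falls back to the middle label (assumed tie rule)."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "Borderline"


def next_step(ratings):
    """Route a response to the follow-up described in the Methods."""
    label = consensus(ratings)
    if label == "Good":
        return "rate comprehensiveness and empathy (five-point scale)"
    if label == "Poor":
        return "re-prompt for self-correction"
    return "no further evaluation"


# Toy ratings for three responses from one hypothetical LLM (not study data).
responses = [
    ("Good", "Good", "Good"),
    ("Good", "Borderline", "Good"),
    ("Poor", "Poor", "Borderline"),
]
totals = [total_score(r) for r in responses]
print(f"accuracy: {mean(totals):.2f} ± {stdev(totals):.2f}")
for r in responses:
    print(r, "->", next_step(r))
```

Keeping the routing step separate from the scoring step mirrors the two-stage design in the Methods: every response receives an accuracy grade, and only the grade determines which follow-up evaluation applies.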

Results

The top three LLMs for accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), which also had the highest proportions of “Good” responses (94.8%). The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059, versus ChatGPT-3.5). The top five for empathy were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052, versus ChatGPT-3.5). Baidu ERNIE 4.0 received no “Poor” ratings; the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%.
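One way to read the self-correction figures is as the share of initially “Poor” responses that received a better rating after re-prompting. The abstract does not define the calculation, so the sketch below is an assumption; the function name, the label ordering, and the example ratings are hypothetical.

```python
def self_correction_rate(before, after):
    """Percentage of initially "Poor" responses rated better after re-prompting.

    `before` and `after` are equal-length lists of labels for the same
    responses. Returns None when no response was rated "Poor".
    """
    # Assumed ordering of the three-point scale; "Borderline" is a placeholder label.
    order = {"Poor": 1, "Borderline": 2, "Good": 3}
    poor = [i for i, label in enumerate(before) if label == "Poor"]
    if not poor:
        return None  # nothing to correct, as for an LLM with no "Poor" ratings
    improved = sum(1 for i in poor if order[after[i]] > order[before[i]])
    return 100.0 * improved / len(poor)


# Toy example (not study data): two "Poor" responses, one improves.
before = ["Poor", "Good", "Poor"]
after = ["Good", "Good", "Poor"]
print(self_correction_rate(before, after))
```

Returning None rather than 0 for an LLM without “Poor” responses keeps "no errors to correct" distinct from "failed to correct any errors".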

Conclusions

Both global and Chinese-domain LLMs performed effectively in addressing Chinese-specific myopia-related queries. Global LLMs showed optimal performance in Chinese-language settings despite being trained primarily on non-Chinese data and in English.

Fig. 1
Fig. 2: Results of Accuracy Evaluation for 9 LLM Chatbots.
Fig. 3

Data availability

The datasets generated and analyzed during the current study are available in the GitHub repository at https://github.com/YukinoWmy/Original-Chinese-question-response-pairs-and-their-English-translations-with-scoring.

Acknowledgements

We would like to acknowledge the funding support of the National Key R&D Program of China (2022YFC2502800), the National Natural Science Fund of China (82388101), and the Beijing Natural Science Foundation (IS23096).

Author information

Contributions

ZHJ, YYX, and YCT conceptualized and designed the research. ZHJ, YYX, ZYW, and YXH analyzed the data. ZWL and SMY verified the experimental design. ZHJ, YYX, ZP, QW, and GYW conducted the experimental investigation. YCT, YXW, XFW, and TYW reviewed the first draft of the manuscript and suggested revisions. YCT and YXW developed the experimental concept and supervised the entire study.

Corresponding authors

Correspondence to Yaxing Wang or Yih Chung Tham.

Ethics declarations

Competing interests

TYW discloses receiving consulting fees from multiple pharmaceutical companies including Aldropika Therapeutics, Bayer, Boehringer Ingelheim, Genentech, Iveric Bio, Novartis, Plano, Oxurion, Roche, Sanofi, Shanghai Henlius. Additionally, TYW holds roles as an inventor, patent holder, and co-founder of the start-up companies EyRiS and Visre. The remaining authors have reported no conflicts of interest.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, Z., Xu, Y., Lim, Z.W. et al. Comparative performance analysis of global and chinese-domain large language models for myopia. Eye 39, 2015–2022 (2025). https://doi.org/10.1038/s41433-025-03775-5

  • DOI: https://doi.org/10.1038/s41433-025-03775-5
