
  • Article
  • Published:

“Comparative analysis of large language models against the NHS 111 online triaging for emergency ophthalmology”

Abstract

Background

This study presents a comprehensive evaluation of the performance of various large language models in generating responses for ophthalmology emergencies and compares their accuracy with the established United Kingdom’s National Health Service 111 online system.

Methods

We included 21 ophthalmology-related emergency scenario questions from the NHS 111 triaging algorithm. These questions were based on four ophthalmology emergency themes as laid out in the NHS 111 algorithm. Responses generated from NHS 111 online were compared with responses from different LLM chatbots to determine the accuracy of the LLM responses. We included a range of models: ChatGPT-3.5, Google Bard, Bing Chat, and ChatGPT-4.0. The accuracy of each LLM-chatbot response was compared against the NHS 111 triage using a two-prompt strategy. Answers were graded as follows: −2 as “Very poor”, −1 as “Poor”, 0 as “No response”, 1 as “Good”, 2 as “Very good”, and 3 as “Excellent”.
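The grading procedure above can be sketched in code. This is a hypothetical illustration, not the authors' actual analysis pipeline: the scale values mirror the 6-point rubric described in the Methods, but the example scores and the function name `proportion_good_or_better` are invented for demonstration.

```python
# Hypothetical sketch of the study's grading scheme: map each graded
# LLM-chatbot response (scored against the NHS 111 triage outcome) onto
# the 6-point scale, then compute the share scoring >= 1 ("Good" or
# better). All scores below are illustrative, not study data.
GRADE_LABELS = {
    -2: "Very poor",
    -1: "Poor",
    0: "No response",
    1: "Good",
    2: "Very good",
    3: "Excellent",
}

def proportion_good_or_better(scores):
    """Fraction of graded responses scoring >= 1 ("Good" or better)."""
    if not scores:
        raise ValueError("no scores provided")
    return sum(1 for s in scores if s >= 1) / len(scores)

# Illustrative grades for one model across eight scenarios
example_scores = [3, 2, 1, 2, -1, 3, 1, 2]
share = proportion_good_or_better(example_scores)
print(f"{share:.0%} graded Good or better")  # 88% for these made-up scores
```

A per-model version of this calculation would yield the kind of overall accuracy figure reported in the Results.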

Results

Overall, the LLMs attained good accuracy in this study when compared against the NHS 111 responses. A score of ≥1, graded as “Good”, was achieved by 93% of all LLM responses; this indicates that at least part of the answer contained correct information and that no incorrect information was present. Results were very similar across both prompts, with no marked difference between them.

Conclusions

The high accuracy and safety observed in LLM responses support their potential as effective tools for providing timely information and guidance to patients. LLMs hold promise for enhancing patient care and healthcare accessibility in the digital age.


Fig. 1: Graphical representation of the comparison of LLM responses to NHS 111 responses.
Fig. 2: Diagrammatic representation of the accuracy of responses.
Fig. 3: Breakdown comparison of the LLM responses to NHS 111 responses.


Data availability

Data supporting the findings of this study are available upon written request to the corresponding author.


Author information

Authors and Affiliations

Authors

Contributions

Chrishan Gunasekera (CG) was responsible for the study conception and design. Shaheryar Ahmed Khan (SAK) was involved with the study design, data collection, and literature review. CG and SAK were responsible for data analysis and interpretation of results. Both the authors contributed to draft manuscript preparation, reviewed the results, and approved the final version of the manuscript.

Corresponding author

Correspondence to Shaheryar Ahmed Khan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Khan, S.A., Gunasekera, C. “Comparative analysis of large language models against the NHS 111 online triaging for emergency ophthalmology”. Eye 39, 1301–1308 (2025). https://doi.org/10.1038/s41433-025-03605-8

Download citation


  • DOI: https://doi.org/10.1038/s41433-025-03605-8
