npj Digital Medicine
A scalable framework for evaluating health language models
  • Article
  • Open access
  • Published: 27 February 2026

  • Neil Mallinar1,
  • A. Ali Heydari1,
  • Xin Liu1,
  • Anthony Z. Faranesh1,
  • Brent Winslow1,
  • Nova Hammerquist1,
  • Benjamin Graef2,
  • Cathy Speed1,
  • Mark Malhotra1,
  • Shwetak Patel1,
  • Javier L. Prieto1,
  • Daniel McDuff1 &
  • Ahmed A. Metwally1

npj Digital Medicine (2026). Cite this article

  • 3468 Accesses

  • 2 Citations

  • 5 Altmetric


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Machine learning
  • Metabolic syndrome
  • Pre-diabetes

Abstract

Large language models (LLMs) have emerged as powerful tools for analyzing and interpreting complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information spanning lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, relevance, and safety. Current evaluation practices, particularly for open-ended text responses, rely heavily on human experts. This reliance introduces human factors (differing perspectives, potential biases, inconsistencies), is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where assessing a response requires domain expertise and must account for multifaceted patient data that is often nuanced and diverse. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying critical gaps in model responses using a minimal set of targeted rubric questions. Our approach builds on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of precise, granular targets answerable with simple Boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield substantially higher inter-rater agreement among both expert and non-expert human evaluators, as well as in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods.
This enhanced efficiency and scalability, particularly through automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
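The adaptive evaluation described above (a minimal set of Boolean rubric questions, expanded only where a gap is found) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the function name, the base/follow-up rubric split, and the `judge` callable (standing in for a human rater or an automated LLM rater) are assumptions made for the example.

```python
def adaptive_boolean_eval(response, base_rubric, followup_rubrics, judge):
    """Illustrative two-stage Boolean rubric evaluation.

    base_rubric: Boolean rubric questions that are always asked.
    followup_rubrics: maps a base question to granular follow-up
        questions asked only when the base check fails.
    judge: callable (response, question) -> bool, standing in for
        a human rater or an automated LLM rater.
    """
    results = {}
    for question in base_rubric:
        passed = judge(response, question)
        results[question] = passed
        if not passed:  # expand the rubric only where a gap was found
            for followup in followup_rubrics.get(question, []):
                results[followup] = judge(response, followup)
    return results


# Toy usage with a keyword-matching stand-in judge.
answer = "Increase daily activity and monitor fasting glucose."
base = ["mentions activity?", "mentions diet?"]
followups = {"mentions diet?": ["mentions fiber intake?"]}
toy_judge = lambda resp, q: q.split()[1].rstrip("?") in resp.lower()
print(adaptive_boolean_eval(answer, base, followups, toy_judge))
```

In the toy run, the satisfied item needs no follow-up while the failed diet check triggers one extra granular question; asking granular questions only where gaps appear, rather than for every item up front, is the intuition behind the reported evaluation-time savings over Likert-based rating.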

Similar content being viewed by others

Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI

Article Open access 29 March 2024

Benchmarking large language models for personalized, biomarker-based health intervention recommendations

Article Open access 27 October 2025

Large language models in biomedicine and healthcare

Article Open access 01 December 2025

Data availability

All data related to prompts, queries, synthetic personas, and evaluation rubrics are provided in the supplementary material. Details on accessing data from the WEAR-ME study can be found in Metwally et al.45.

References

  1. Arora, A. & Arora, A. The promise of large language models in health care. Lancet 401, 641 (2023).


  2. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  3. McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).


  4. Shi, W. et al. EHRAgent: Code empowers large language models for Few-Shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22315–22339 (2024).

  5. Gottweis, J. et al. Towards an AI Co-Scientist. arXiv preprint (2025).

  6. Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. https://doi.org/10.1038/s41591-025-03888-0 (2025).

  7. Merrill, M.A. et al. Transforming wearable data into personal health insights using large language model agents. Nat. Commun. 17, 1143 (2026).


  8. Heydari, A. A. et al. The anatomy of a personal health agent https://arxiv.org/abs/2508.20148 (2025).

  9. Fraser, H. et al. Comparison of diagnostic and triage accuracy of Ada Health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: Clinical data analysis study. JMIR mHealth uHealth 11, e49995 (2023).


  10. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).


  11. Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).


  12. Yang, Z., Meng, Z., Zheng, X. & Wattenhofer, R. Assessing adversarial robustness of large language models: An empirical study https://arxiv.org/abs/2405.02764 (2024).

  13. Cao, Y. et al. Toward generalizable evaluation in the LLM era: A survey beyond benchmarks https://arxiv.org/abs/2504.18838 (2025).

  14. Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (TMLR) (2022).

  15. Likert, R. A Technique for the Measurement of Attitudes (Archives of Psychology, 22, 1932).

  16. Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33 (1977).

  17. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (2020).

  18. Westland, J. C. Information loss and bias in Likert survey responses. PLoS ONE 17, e0271949 (2022).


  19. Elangovan, A., Liu, L., Xu, L., Bodapati, S. & Roth, D. ConSiDERS-The-Human evaluation framework: Rethinking human evaluation for generative large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024).

  20. Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G. & Arawjo, I. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST ’24) (2024).

  21. Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers (2023).

  22. Chang, Y. et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (TIST) (2023).

  23. Guo, Z. et al. Evaluating large language models: A comprehensive survey. arXiv preprint (2023).

  24. Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digit. Med. 7, 1–14 (2024).


  25. Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. arXiv preprint (2023).

  26. Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 1–20 (2024).


  27. Vu, T. et al. Foundational autoraters: Taming large language models for better automatic evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024).

  28. Zhong, M. et al. Towards a unified Multi-Dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2023–2038 (2022).

  29. Min, S. et al. FactScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12076–12100 (2023).

  30. Lee, Y. et al. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. HEAL Workshop at CHI (2024).

  31. Dasgupta, S., Frost, N., Moshkovitz, M. & Rashtchian, C. Explainable k-Means and k-Medians clustering. Proceedings of the 37th International Conference on Machine Learning (2020).

  32. Gemini Team, Google. Gemini: A family of highly capable multimodal models. arXiv preprint https://arxiv.org/abs/2312.11805 (2023).

  33. Saab, K. et al. Capabilities of Gemini models in medicine. arXiv preprint (2024).

  34. Yang, L. et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint (2024).

  35. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).


  36. StatPearls. Ace the endocrinology, diabetes, & metabolism exam. https://www.statpearls.com/boardreview/Endocrinology (2025).

  37. American Board of Internal Medicine. Endocrinology, diabetes, & metabolism exam scoring. https://www.abim.org/maintenance-of-certification/assessment-information/endocrinology-diabetes-metabolism/scoring-results (2025).

  38. American Board of Internal Medicine. Initial certification pass rates. https://www.abim.org/Media/yeqiumdc/certification-pass-rates.pdf (2024).

  39. BoardVitals. Cardiology board review questions [2025] - BoardVitals. https://www.boardvitals.com/cardiology-board-review/?utm_term=&utm_campaign=Performance+Max+-+BoardReview&utm_source=google&utm_medium=cpc&hsa_acc=3629361371&hsa_cam=16996727962&hsa_grp=&hsa_ad=&hsa_src=x&hsa_tgt=&hsa_kw=&hsa_mt=&hsa_net=adwords&hsa_ver=3&utm_content=april-flash&gad_source=1&gclid=EAIaIQobChMI7tewscPJiwMVAQCtBh3AiAH2EAAYASAAEgI_S_D_BwE (2025).

  40. Panickssery, A., Bowman, S. R. & Feng, S. LLM evaluators recognize and favor their own generations. In The 38th Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=4NJBV6Wp0h (2024).

  41. Liu, A. et al. DeepSeek-V3 technical report https://arxiv.org/abs/2412.19437 (2025).

  42. Hurst, A. et al. GPT-4o system card https://arxiv.org/abs/2410.21276 (2024).

  43. OpenAI. OpenAI o3 Model. https://openai.com/index/introducing-o3-and-o4-mini/ (2025).

  44. Prieto, J. L. New Fitbit study explores metabolic health. https://blog.google/products/fitbit/new-quest-fitbit-study-metabolic-health/ (2024).

  45. Metwally, A. A. et al. Insulin resistance prediction from wearables and routine blood biomarkers. Nature (In-Press). https://arxiv.org/abs/2505.03784.

  46. The Cleveland Clinic. Hypercholesterolemia, Cleveland Clinic. https://my.clevelandclinic.org/health/diseases/23921-hypercholesterolemia (2025).

  47. Fabbri, A. R. et al. SummEval: Re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. (2020).

  48. Gopalakrishnan, K. et al. Topical-Chat: Towards Knowledge-Grounded Open-Domain conversations. INTERSPEECH (2019).

  49. Clark, E. et al. All that’s ’human’ is not gold: Evaluating human evaluation of generated text https://arxiv.org/abs/2107.00061 (2021).

  50. Pfohl, S. R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 30, 3590–3600 (2024).


  51. Gehrmann, S., Clark, E. & Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. https://arxiv.org/abs/2202.06935 (2022).

  52. Yu, F. When AIs judge AIs: The rise of Agent-as-a-Judge evaluation for LLMs https://arxiv.org/abs/2508.02994 (2025).

  53. Fisher, R. A. Statistical Methods for Research Workers (Oliver & Boyd, Edinburgh, 1925).

  54. Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19 (1966).

  55. Shrout, P. & Fleiss, J. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).


  56. Liljequist, D., Elfving, B. & Roaldsen, K. S. Intraclass correlation – a discussion and demonstration of basic features. PLOS ONE 14, e0219854 (2019).


  57. Hackl, V., Müller, A. E., Granitzer, M. & Sailer, M. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings. Front. Educ. 8, 1272229 (2023).



Acknowledgements

This study was funded by Google LLC. We are deeply grateful to the members of the Human Research Laboratory at Google for helping set up evaluation workflows for human evaluators, in particular Erik Schenck and Derek Peyton. We thank our expert evaluators Michelle Jonelis, Narayan Krishnamurthy, Thuan Dang, Timothy Wong, and Andreas Michaelides, and non-expert evaluators Aayush Ranjan, Pawan, Shwetank Dhruva, and Nitesh Tiwari.

Author information

Author notes
  1. These authors contributed equally: Neil Mallinar, A. Ali Heydari.

Authors and Affiliations

  1. Google Research, Mountain View, CA, USA

    Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff & Ahmed A. Metwally

  2. Vituity, Emeryville, CA, USA

    Benjamin Graef


Contributions

N.M., A.A.H., D.M., and A.A.M. conceptualized and designed the research. N.M., A.A.H., and B.G. conducted data curation. N.M., A.A.H., D.M., and A.A.M. analyzed and visualized data. N.M., A.A.H., X.L., D.M., and A.A.M. wrote the original draft of the paper. N.M., A.A.H., X.L., A.Z.F., B.W., N.H., B.G., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. reviewed and edited the paper. A.A.M. contributed to project administration. D.M. and A.A.M. contributed to project supervision.

Corresponding authors

Correspondence to Daniel McDuff or Ahmed A. Metwally.

Ethics declarations

Competing interests

A.A.H., X.L., A.Z.F., B.W., N.H., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. are or were employees of Alphabet at the time of submission and may own stock as part of the standard compensation package. N.M. was an intern at Google during this research. All other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Supplementary Data 1 (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Mallinar, N., Heydari, A.A., Liu, X. et al. A scalable framework for evaluating health language models. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02492-x


  • Received: 28 March 2025

  • Accepted: 15 February 2026

  • Published: 27 February 2026

  • DOI: https://doi.org/10.1038/s41746-026-02492-x
