Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing and interpreting complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information spanning lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, relevance, and safety. However, current evaluation practices, particularly for open-ended text responses, rely heavily on human experts. This reliance introduces human factors (differing perspectives, potential biases, and inconsistencies), is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where assessing responses requires domain expertise and consideration of nuanced, multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying critical gaps in model responses using a minimal set of targeted rubric questions. Our approach builds on recent work in more general evaluation settings that contrasts a small set of complex evaluation targets with a larger set of precise, granular targets answerable with simple Boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield substantially higher inter-rater agreement among both expert and non-expert human evaluators, as well as in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods.
This enhanced efficiency and scalability, particularly through automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
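To make the inter-rater agreement comparison above concrete, the following sketch contrasts chance-corrected agreement (Cohen's kappa, implemented by hand) on Boolean rubric answers versus 5-point Likert ratings. All rater data here are hypothetical illustrations, not values from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of exact agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in labels)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of ten model responses by two evaluators.
# Boolean rubric item, e.g. "Does the response address the user's biomarkers?"
bool_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
bool_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# 5-point Likert item, e.g. "Rate the personalization of the response (1-5)."
likert_a = [4, 5, 2, 4, 1, 3, 5, 4, 2, 4]
likert_b = [3, 4, 2, 5, 2, 2, 5, 3, 1, 4]

print(f"Boolean rubric kappa: {cohens_kappa(bool_a, bool_b):.2f}")  # 0.78
print(f"Likert-scale kappa:   {cohens_kappa(likert_a, likert_b):.2f}")  # 0.11
```

With these illustrative data, the Boolean item reaches substantial agreement while the Likert item lands near chance, mirroring the direction of the effect reported in the abstract (granular binary judgments are easier for raters to answer consistently than graded scales).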
Data availability
All data related to prompts, queries, synthetic personas and evaluation rubrics are provided in the supplementary material. Data access to the WEAR-ME study can be found in Metwally et al.45.
References
Arora, A. & Arora, A. The promise of large language models in health care. Lancet 401, 641 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature (2022).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Shi, W. et al. EHRAgent: Code empowers large language models for Few-Shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22315–22339 (2024).
Gottweis, J. et al. Towards an AI Co-Scientist. arXiv preprint (2025).
Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. https://doi.org/10.1038/s41591-025-03888-0 (2025).
Merrill, M.A. et al. Transforming wearable data into personal health insights using large language model agents. Nat. Commun. 17, 1143 (2026).
Heydari, A. A. et al. The anatomy of a personal health agent https://arxiv.org/abs/2508.20148 (2025).
Fraser, H. et al. Comparison of diagnostic and triage accuracy of Ada Health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: Clinical data analysis study. JMIR mHealth uHealth 11, e49995 (2023).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).
Yang, Z., Meng, Z., Zheng, X. & Wattenhofer, R. Assessing adversarial robustness of large language models: An empirical study https://arxiv.org/abs/2405.02764 (2024).
Cao, Y. et al. Toward generalizable evaluation in the LLM era: A survey beyond benchmarks https://arxiv.org/abs/2504.18838 (2025).
Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (TMLR) (2022).
Likert, R. A Technique for the Measurement of Attitudes (Archives of Psychology, 22, 1932).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33 (1977).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. ICLR 2020 (2019).
Westland, J. C. Information loss and bias in Likert survey responses. PLoS ONE 17, e0271949 (2022).
Elangovan, A., Liu, L., Xu, L., Bodapati, S. & Roth, D. ConSiDERS-The-Human evaluation framework: Rethinking human evaluation for generative large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024).
Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G. & Arawjo, I. Who validates the validators? aligning LLM-assisted evaluation of LLM outputs with human preferences. UIST ’24: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (2024).
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers (2023).
Chang, Y. et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (TIST) (2023).
Guo, Z. et al. Evaluating large language models: A comprehensive survey. arXiv preprint (2023).
Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digit. Med. 7, 1–14 (2024).
Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. arXiv preprint (2023).
Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 1–20 (2024).
Vu, T. et al. Foundational autoraters: Taming large language models for better automatic evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024).
Zhong, M. et al. Towards a unified Multi-Dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2023–2038 (2022).
Min, S. et al. FactScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12076–12100 (2023).
Lee, Y. et al. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. HEAL Workshop at CHI (2024).
Dasgupta, S., Frost, N., Moshkovitz, M. & Rashtchian, C. Explainable k-Means and k-Medians clustering. Proceedings of the 37th International Conference on Machine Learning (2020).
Gemini Team, G. Gemini: A family of highly capable multimodal models. arXiv preprint https://arxiv.org/abs/2312.11805 (2023).
Saab, K. et al. Capabilities of Gemini models in medicine. arXiv preprint (2024).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint (2024).
Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
StatPearls. Ace the endocrinology, diabetes, & metabolism exam. https://www.statpearls.com/boardreview/Endocrinology (2025).
American Board of Internal Medicine. Endocrinology, diabetes, & metabolism exam scoring. https://www.abim.org/maintenance-of-certification/assessment-information/endocrinology-diabetes-metabolism/scoring-results (2025).
American Board of Internal Medicine. Initial certification pass rates. https://www.abim.org/Media/yeqiumdc/certification-pass-rates.pdf (2024).
BoardVitals. Cardiology board review questions [2025] - BoardVitals. https://www.boardvitals.com/cardiology-board-review/ (2025).
Panickssery, A., Bowman, S. R. & Feng, S. LLM evaluators recognize and favor their own generations. In The 38th Annual Conference on Neural Information Processing Systems https://openreview.net/forum?id=4NJBV6Wp0h (2024).
Liu, A. et al. DeepSeek-V3 technical report https://arxiv.org/abs/2412.19437 (2025).
Hurst, A. et al. GPT-4o system card https://arxiv.org/abs/2410.21276 (2024).
OpenAI. OpenAI o3 Model. https://openai.com/index/introducing-o3-and-o4-mini/ (2025).
Prieto, J. L. New Fitbit study explores metabolic health. https://blog.google/products/fitbit/new-quest-fitbit-study-metabolic-health/ (2024).
Metwally, A. A. et al. Insulin resistance prediction from wearables and routine blood biomarkers. Nature (in press). https://arxiv.org/abs/2505.03784.
The Cleveland Clinic. Hypercholesterolemia, Cleveland Clinic. https://my.clevelandclinic.org/health/diseases/23921-hypercholesterolemia (2025).
Fabbri, A. R. et al. SummEval: Re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. (2020).
Gopalakrishnan, K. et al. Topical-Chat: Towards Knowledge-Grounded Open-Domain conversations. INTERSPEECH (2019).
Clark, E. et al. All that’s ’human’ is not gold: Evaluating human evaluation of generated text https://arxiv.org/abs/2107.00061 (2021).
Pfohl, S. R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 30, 3590–3600 (2024).
Gehrmann, S., Clark, E. & Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. https://arxiv.org/abs/2202.06935 (2022).
Yu, F. When AIs judge AIs: The rise of Agent-as-a-Judge evaluation for LLMs https://arxiv.org/abs/2508.02994 (2025).
Fisher, R. A. Statistical Methods for Research Workers (Oliver & Boyd (Edinburgh), 1925).
Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19 (1966).
Shrout, P. & Fleiss, J. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Liljequist, D., Elfving, B. & Roaldsen, K. S. Intraclass correlation – a discussion and demonstration of basic features. PLOS ONE 14, e0219854 (2019).
Hackl, V., Müller, A. E., Granitzer, M. & Sailer, M. Is GPT-4 a reliable rater? evaluating consistency in GPT-4’s text ratings. Front. Educ. 8, 1272229 (2023).
Acknowledgements
This study was funded by Google LLC. We are deeply grateful to the members of the Human Research Laboratory at Google for helping set up evaluation workflows for human evaluators, in particular Erik Schenck and Derek Peyton. We thank our expert evaluators Michelle Jonelis, Narayan Krishnamurthy, Thuan Dang, Timothy Wong, and Andreas Michaelides and non-expert evaluators Aayush Ranjan, Pawan, Shwetank Dhruva, and Nitesh Tiwari.
Author information
Authors and Affiliations
Contributions
N.M., A.A.H., D.M., and A.A.M. conceptualized and designed the research. N.M., A.A.H., and B.G. conducted data curation. N.M., A.A.H., D.M., and A.A.M. analyzed and visualized data. N.M., A.A.H., X.L., D.M., and A.A.M. wrote the original draft of the paper. N.M., A.A.H., X.L., A.Z.F., B.W., N.H., B.G., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. reviewed and edited the paper. A.A.M. contributed to project administration. D.M., A.A.M. contributed to project supervision.
Corresponding authors
Ethics declarations
Competing interests
A.A.H., X.L., A.Z.F., B.W., N.H., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. are or were employees of Alphabet at the time of submission, and may own stock as part of the standard compensation package. N.M. was an intern at Google during this research. All other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mallinar, N., Heydari, A.A., Liu, X. et al. A scalable framework for evaluating health language models. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02492-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-026-02492-x