Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing and interpreting complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information spanning lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, relevance, and safety. However, current evaluation practices, particularly for open-ended text responses, rely heavily on human experts. This reliance introduces human factors (differing perspectives, potential biases, and inconsistencies), is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where assessing responses requires domain expertise and consideration of nuanced, multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying critical gaps in model responses using a minimal set of targeted rubric questions. Our approach builds on recent work in more general evaluation settings that contrasts a small set of complex evaluation targets with a larger set of precise, granular targets answerable with simple Boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield substantially higher inter-rater agreement among both expert and non-expert human evaluators, as well as in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods.
This enhanced efficiency and scalability, particularly through automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
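To make the inter-rater agreement comparison above concrete, the following sketch contrasts chance-corrected agreement (Cohen's kappa, implemented by hand) on Boolean rubric answers versus 5-point Likert ratings. All rater data here are hypothetical illustrations, not values from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of exact agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in labels)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of ten model responses by two evaluators.
# Boolean rubric item, e.g. "Does the response address the user's biomarkers?"
bool_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
bool_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# 5-point Likert item, e.g. "Rate the personalization of the response (1-5)."
likert_a = [4, 5, 2, 4, 1, 3, 5, 4, 2, 4]
likert_b = [3, 4, 2, 5, 2, 2, 5, 3, 1, 4]

print(f"Boolean rubric kappa: {cohens_kappa(bool_a, bool_b):.2f}")  # 0.78
print(f"Likert-scale kappa:   {cohens_kappa(likert_a, likert_b):.2f}")  # 0.11
```

With these illustrative data, the Boolean item reaches substantial agreement while the Likert item lands near chance, mirroring the direction of the effect reported in the abstract (granular binary judgments are easier for raters to answer consistently than graded scales).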
Data availability
All data related to prompts, queries, synthetic personas and evaluation rubrics are provided in the supplementary material. Data access to the WEAR-ME study can be found in Metwally et al.45.
References
Arora, A. & Arora, A. The promise of large language models in health care. Lancet 401, 641 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature (2022).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Shi, W. et al. EHRAgent: Code empowers large language models for Few-Shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22315–22339 (2024).
Gottweis, J. et al. Towards an AI Co-Scientist. arXiv preprint (2025).
Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. https://doi.org/10.1038/s41591-025-03888-0 (2025).
Merrill, M.A. et al. Transforming wearable data into personal health insights using large language model agents. Nat. Commun. 17, 1143 (2026).
Heydari, A. A. et al. The anatomy of a personal health agent https://arxiv.org/abs/2508.20148 (2025).
Fraser, H. et al. Comparison of diagnostic and triage accuracy of Ada Health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: Clinical data analysis study. JMIR mHealth uHealth 11, e49995 (2023).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).
Yang, Z., Meng, Z., Zheng, X. & Wattenhofer, R. Assessing adversarial robustness of large language models: An empirical study https://arxiv.org/abs/2405.02764 (2024).
Cao, Y. et al. Toward generalizable evaluation in the LLM era: A survey beyond benchmarks https://arxiv.org/abs/2504.18838 (2025).
Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (TMLR) (2022).
Likert, R. A Technique for the Measurement of Attitudes (Archives of Psychology, 22, 1932).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33 (1977).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. ICLR 2020 (2019).
Westland, J. C. Information loss and bias in Likert survey responses. PLoS ONE 17, e0271949 (2022).
Elangovan, A., Liu, L., Xu, L., Bodapati, S. & Roth, D. ConSiDERS-The-Human evaluation framework: Rethinking human evaluation for generative large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024).
Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G. & Arawjo, I. Who validates the validators? aligning LLM-assisted evaluation of LLM outputs with human preferences. UIST ’24: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (2024).
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers (2023).
Chang, Y. et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (TIST) (2023).
Guo, Z. et al. Evaluating large language models: A comprehensive survey. arXiv preprint (2023).
Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digit. Med. 7, 1–14 (2024).
Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. arXiv preprint (2023).
Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 1–20 (2024).
Vu, T. et al. Foundational autoraters: Taming large language models for better automatic evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024).
Zhong, M. et al. Towards a unified Multi-Dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2023–2038 (2022).
Min, S. et al. FactScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12076–12100 (2023).
Lee, Y. et al. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. HEAL Workshop at CHI (2024).
Dasgupta, S., Frost, N., Moshkovitz, M. & Rashtchian, C. Explainable k-Means and k-Medians clustering. Proceedings of the 37th International Conference on Machine Learning (2020).
Gemini Team, G. Gemini: A family of highly capable multimodal models. arXiv preprint https://arxiv.org/abs/2312.11805 (2023).
Saab, K. et al. Capabilities of Gemini models in medicine. arXiv preprint (2024).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint (2024).
Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
StatPearls. Ace the endocrinology, diabetes, & metabolism exam. https://www.statpearls.com/boardreview/Endocrinology (2025).
American Board of Internal Medicine. Endocrinology, diabetes, & metabolism exam scoring. https://www.abim.org/maintenance-of-certification/assessment-information/endocrinology-diabetes-metabolism/scoring-results (2025).
American Board of Internal Medicine. Initial certification pass rates. https://www.abim.org/Media/yeqiumdc/certification-pass-rates.pdf (2024).
BoardVitals. Cardiology board review questions [2025] - BoardVitals. https://www.boardvitals.com/cardiology-board-review/ (2025).
Panickssery, A., Bowman, S. R. & Feng, S. LLM evaluators recognize and favor their own generations. In The 38th Annual Conference on Neural Information Processing Systems https://openreview.net/forum?id=4NJBV6Wp0h (2024).
Liu, A. et al. DeepSeek-V3 technical report https://arxiv.org/abs/2412.19437 (2025).
Hurst, A. et al. GPT-4o system card https://arxiv.org/abs/2410.21276 (2024).
OpenAI. OpenAI o3 Model. https://openai.com/index/introducing-o3-and-o4-mini/ (2025).
Prieto, J. L. New Fitbit study explores metabolic health. https://blog.google/products/fitbit/new-quest-fitbit-study-metabolic-health/ (2024).
Metwally, A. A. et al. Insulin resistance prediction from wearables and routine blood biomarkers. Nature (in press). https://arxiv.org/abs/2505.03784.
The Cleveland Clinic. Hypercholesterolemia, Cleveland Clinic. https://my.clevelandclinic.org/health/diseases/23921-hypercholesterolemia (2025).
Fabbri, A. R. et al. SummEval: Re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. (2020).
Gopalakrishnan, K. et al. Topical-Chat: Towards Knowledge-Grounded Open-Domain conversations. INTERSPEECH (2019).
Clark, E. et al. All that’s ’human’ is not gold: Evaluating human evaluation of generated text https://arxiv.org/abs/2107.00061 (2021).
Pfohl, S. R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 30, 3590–3600 (2024).
Gehrmann, S., Clark, E. & Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. https://arxiv.org/abs/2202.06935 (2022).
Yu, F. When AIs judge AIs: The rise of Agent-as-a-Judge evaluation for LLMs https://arxiv.org/abs/2508.02994 (2025).
Fisher, R. A. Statistical Methods for Research Workers (Oliver & Boyd (Edinburgh), 1925).
Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19 (1966).
Shrout, P. & Fleiss, J. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Liljequist, D., Elfving, B. & Roaldsen, K. S. Intraclass correlation – a discussion and demonstration of basic features. PLOS ONE 14, e0219854 (2019).
Hackl, V., Müller, A. E., Granitzer, M. & Sailer, M. Is GPT-4 a reliable rater? evaluating consistency in GPT-4’s text ratings. Front. Educ. 8, 1272229 (2023).
Acknowledgements
This study was funded by Google LLC. We are deeply grateful to the members of the Human Research Laboratory at Google for helping set up evaluation workflows for human evaluators, in particular Erik Schenck and Derek Peyton. We thank our expert evaluators Michelle Jonelis, Narayan Krishnamurthy, Thuan Dang, Timothy Wong, and Andreas Michaelides and non-expert evaluators Aayush Ranjan, Pawan, Shwetank Dhruva, and Nitesh Tiwari.
Author information
Authors and Affiliations
Contributions
N.M., A.A.H., D.M., and A.A.M. conceptualized and designed the research. N.M., A.A.H., and B.G. conducted data curation. N.M., A.A.H., D.M., and A.A.M. analyzed and visualized data. N.M., A.A.H., X.L., D.M., and A.A.M. wrote the original draft of the paper. N.M., A.A.H., X.L., A.Z.F., B.W., N.H., B.G., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. reviewed and edited the paper. A.A.M. contributed to project administration. D.M., A.A.M. contributed to project supervision.
Corresponding authors
Ethics declarations
Competing interests
A.A.H., X.L., A.Z.F., B.W., N.H., C.S., M.M., S.P., J.L.P., D.M., and A.A.M. are or were employees of Alphabet at the time of submission, and may own stock as part of the standard compensation package. N.M. was an intern at Google during this research. All other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mallinar, N., Heydari, A.A., Liu, X. et al. A scalable framework for evaluating health language models. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02492-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-026-02492-x