Fig. 3: Variations and gaps in the human evaluation landscape of LLMs in healthcare.

A Variations in metrics used for human evaluation of LLM outputs. B Composition of annotator types involved in evaluations. C Top 10 model types used in publications related to LLMs in healthcare with human evaluation. D Number of models evaluated per study, highlighting the predominance of single-model evaluations.