Fig. 3: Variations and gaps in the human evaluation landscape of LLMs in healthcare. | npj Health Systems

From: Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization

A Variations in metrics used for human evaluation of LLM outputs. B Composition of annotator types involved in evaluations. C Top 10 model types used in publications on LLMs in healthcare that include human evaluation. D Number of models evaluated per study, highlighting the predominance of single-model evaluations.
