Fig. 1: Model Rankings by Human and Automated Evaluators.
From: Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation

Mean ranks of five LLMs across 100 ICM consultation scenarios, as evaluated by a radiologist and three LLM-based scorers. Lower values indicate better performance; error bars represent 95% confidence intervals.
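
As a rough illustration of the quantities plotted, the sketch below ranks models within a single scenario and then summarizes one model's ranks as a mean with a 95% confidence interval. The caption does not say how the intervals were computed; the normal approximation (z = 1.96), the average-rank handling of ties, and the names `rank_models` and `mean_rank_ci` are all assumptions made here for illustration, not the authors' method.

```python
import math
import statistics


def rank_models(scores):
    """Rank one scenario's scores: highest score gets rank 1.

    Tied models share the average of the ranks they span (an assumption;
    the source does not describe tie handling).
    """
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over every model tied with the model at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_rank_ci(ranks, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(ranks)
    mean = statistics.mean(ranks)
    half_width = z * statistics.stdev(ranks) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Hypothetical scores for one scenario; these are not data from the study.
scenario_ranks = rank_models({"model_a": 5, "model_b": 3, "model_c": 3})

# Hypothetical ranks of one model across five scenarios.
mean, lower, upper = mean_rank_ci([1, 2, 3, 2, 2])
```

With 100 scenarios per model, a normal approximation is a common default, though a t-based or bootstrap interval would be an equally plausible choice.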