Fig. 4: Evaluation Workflow for Language Model Responses.
From: Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation

Schematic of the evaluation pipeline. Each clinical query was simultaneously processed by five LLMs (three cloud-based and two locally deployable models). The generated responses were anonymized and evaluated by a radiologist and three LLM-based evaluators using both human ranking and rubric-based scoring.
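
The anonymization step in this workflow can be illustrated with a minimal sketch. It assumes the five responses to a query are held in a dictionary keyed by model name. The function name `anonymize_responses`, the placeholder model names, the placeholder answer texts, and the "Response A" style labels are all hypothetical and are not taken from the study.

```python
import random


def anonymize_responses(responses, seed=None):
    """Shuffle responses and relabel them so evaluators cannot see the source model.

    responses: dict mapping a model name to its response text.
    Returns (blinded, key): `blinded` maps a neutral label to the text,
    and `key` maps the same label back to the model name for unblinding.
    """
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)  # randomize order so position does not reveal the model

    blinded, key = {}, {}
    for index, model in enumerate(models):
        label = f"Response {chr(ord('A') + index)}"  # Response A, Response B, ...
        blinded[label] = responses[model]
        key[label] = model
    return blinded, key


# Hypothetical placeholders: three cloud-based and two locally deployable models.
responses = {
    "cloud_model_1": "answer text 1",
    "cloud_model_2": "answer text 2",
    "cloud_model_3": "answer text 3",
    "local_model_1": "answer text 4",
    "local_model_2": "answer text 5",
}

blinded, key = anonymize_responses(responses, seed=0)
```

Evaluators would see only `blinded`; `key` would be kept aside and used after ranking and rubric scoring to map each label back to its model.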