Fig. 1: Study overview.
From: Evaluating clinical AI summaries with large language models as judges

Five distinct training strategies for large language models (LLMs) using the PDSQI-9 instrument were evaluated. The experiments comprised expert-driven prompt engineering, supervised fine-tuning, direct preference optimization, and multi-agent architectures, which together represent the LLM-as-a-Judge framework for clinical summarization.