Fig. 6: Aligning FORTE score utility with physician expert-based evaluation and LLM-as-the-judge.
From: Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation

a The general and average FORTE scores correlate more strongly with the DocLens LLM-as-the-judge evaluation than the traditional metrics do, as measured by two-sided Pearson correlation analysis. Notably, the correlation increased when sentence-pairing pre-processing was applied. b Intraclass correlation coefficients (ICC) among the traditional metrics, FORTE, LLM-based DocLens, and the five-item medical expert evaluation, presented as a heatmap.
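
For readers unfamiliar with the statistic in panel a, the following is a minimal sketch of how a Pearson correlation coefficient between an automatic metric and an LLM-judge score could be computed. It is not the authors' analysis code. The function name `pearson_r` and the two score lists are hypothetical and made up purely for illustration, and the two-sided significance test is omitted.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-report scores (illustrative only, NOT data from the paper)
metric_scores = [0.21, 0.35, 0.40, 0.58, 0.73]  # e.g. an automatic metric
judge_scores = [0.30, 0.33, 0.52, 0.61, 0.80]   # e.g. an LLM-judge score

r = pearson_r(metric_scores, judge_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the metric ranks and scales reports in much the same way as the judge. A two-sided test additionally asks whether r differs from zero in either direction.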