Fig. 4: Gauging the competence of feature-oriented radiology task evaluation (FORTE) scoring and the effects of negation removal in RRG. | Nature Communications

From: Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation

a 2D scatter plot of the discrepancy in radiology keyword recall between the CVIT-conditioned generated reports and the original ground truth. Words were classified either as radiology keywords or as grammatical filler words; the CVIT process significantly boosted keyword usage in the BrainGPT models. Note that “no”, a word used for differential diagnosis, stood out as the top-ranked recalled keyword and was further investigated in the negation removal experiments. b With the keywords further divided into distinct categories (degree, landmark, feature, impression), we benchmarked inter-model performance and tested the effects of negation removal on the BrainGPT models. Scoring gains from CVIT and RVIT are marked by red dashed lines, and scoring gains from negation removal by blue dashed lines. Statistical significance was assessed using a two-sided Wilcoxon signed-rank test. c The instruction tuning and negation removal tests were repeated using traditional evaluation metrics. Statistical significance for these comparisons was also determined using a two-sided Wilcoxon signed-rank test. d Heatmap and 2D scatter plot of the two-sided Pearson correlation analysis, showing distinct evaluation spectra for the traditional metrics and the FORTE keyword categories. Whereas the traditional evaluation metrics were highly homogeneous, the FORTE evaluation addresses diverse aspects of the radiology description context. (* p < 0.05, ** p < 0.01, *** p < 0.001).
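As a rough illustration of category-wise keyword recall and of how removing negated sentences can change it, the sketch below scores a generated report against a reference. It is a simplified stand-in, not the FORTE implementation: the keyword lists, the function names, and the rule that drops any sentence containing the word "no" are all assumptions made for this example.

```python
import re

# Hypothetical keyword lists for illustration only; not the FORTE vocabulary.
KEYWORDS = {
    "degree": {"mild", "severe"},
    "landmark": {"ventricle", "frontal"},
    "feature": {"hemorrhage", "infarct"},
    "impression": {"atrophy", "stroke"},
}


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in a report."""
    return set(re.findall(r"[a-z]+", text.lower()))


def remove_negated_sentences(text):
    """Drop sentences containing the standalone word 'no'.

    This is an assumed, deliberately crude form of negation removal.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences
            if not re.search(r"\bno\b", s, flags=re.IGNORECASE)]
    return " ".join(kept)


def category_recall(reference, generated, keywords=KEYWORDS):
    """Per-category recall of reference keywords found in the generated report.

    A category with no keywords in the reference gets None instead of a score.
    """
    ref_tokens = tokenize(reference)
    gen_tokens = tokenize(generated)
    scores = {}
    for category, vocab in keywords.items():
        expected = vocab & ref_tokens
        if not expected:
            scores[category] = None
            continue
        scores[category] = len(expected & gen_tokens) / len(expected)
    return scores


reference = "Mild atrophy. No hemorrhage in the frontal lobe."
generated = "Mild atrophy of the frontal lobe."

raw_scores = category_recall(reference, generated)
denegated_scores = category_recall(remove_negated_sentences(reference), generated)
```

In this toy case the "feature" keyword appears only in a negated reference sentence, so it counts as a miss before negation removal and drops out of scoring afterwards.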
