Table 1 Comparison of our proposed RadFM with other foundation models on nine existing datasets, together with ablation studies

From: Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data

| Task | Dataset | Metric | OpenFlamingo (Few-shot) | MedVInT | LLaVA-Med | MedFlamingo (Few-shot) | RadFM (w/o Ins-tuning) | RadFM (w/o Our Data) | RadFM |
|---|---|---|---|---|---|---|---|---|---|
| Disease diagnosis | VinDr-Mammo | ACC | 49.92 (48.20, 51.65) | 50.06 (48.52, 51.59) | 50.27 (49.20, 51.52) | 49.80 (48.33, 51.42) | 49.80 (48.15, 51.53) | 55.35 (54.35, 559.47) | 59.96 (58.41, 61.59) |
| | | F1 | 57.01 (54.64, 60.08) | 66.56 (65.2, 67.93) | 56.48 (55.45, 57.73) | 64.92 (63.52, 66.32) | 60.32 (58.25, 62.31) | 60.57 (58.75, 62.58) | 62.11 (60.09, 63.75) |
| | VinDr-SpineXr | ACC | 50.33 (47.13, 53.53) | 49.93 (46.99, 52.86) | 49.85 (47.95, 52.23) | 49.61 (46.05, 53.16) | 52.19 (49.18, 55.17) | 64.43 (61.76, 67.22) | 68.82 (65.92, 71.47) |
| | | F1 | 31.79 (26.99, 36.58) | 62.32 (59.38, 65.25) | 54.83 (51.88, 57.45) | 63.23 (59.74, 66.74) | 34.19 (30.17, 38.09) | 65.86 (62.62, 68.81) | 67.69 (64.5, 70.98) |
| | VinDr-PCXR | ACC | 49.85 (45.40, 54.31) | 50.29 (45.88, 54.69) | 49.62 (45.79, 53.64) | 49.37 (44.44, 54.31) | 50.12 (45.21, 54.60) | 51.82 (46.46, 57.09) | 56.32 (51.82, 61.21) |
| | | F1 | 41.44 (33.77, 49.10) | 66.29 (62.36, 70.23) | 47.81 (42.33, 53.42) | 66.94 (62.57, 71.32) | 43.33 (40.37, 40.88) | 49.14 (43.66, 56.18) | 37.53 (28.88, 43.67) |
| | CXR-Mix | ACC | 50.63 (50.07, 51.03) | 49.2 (48.53, 49.88) | 53.26 (52.72, 53.91) | 50.00 (49.50, 50.51) | 77.71 (77.25, 77.95) | 78.63 (78.51, 79.10) | 83.62 (83.23, 83.97) |
| | | F1 | 24.83 (24.11, 25.54) | 67.22 (66.62, 67.82) | 22.63 (22.70, 24.53) | 66.11 (65.72, 66.61) | 74.42 (73.98, 75.01) | 78.35 (77.85, 78.93) | 82.99 (82.58, 83.49) |
| | RadChest-CT | ACC | 50.93 (49.13, 52.72) | 50.07 (47.68, 52.45) | 51.09 (50.05, 52.63) | 50.39 (48.34, 52.43) | 51.97 (50.05, 53.31) | 69.72 (67.44, 71.53) | 72.95 (71.06, 74.78) |
| | | F1 | 43.49 (41.18, 45.99) | 66.57 (64.45, 68.69) | 44.42 (42.00, 46.55) | 63.31 (61.39, 65.23) | 38.67 (36.37, 41.46) | 67.84 (65.64, 70.11) | 71.86 (69.42, 83.49) |
| Medical VQA | PMC-VQA | BLEU | 11.10 (8.93, 13.41) | 23.73 (21.03, 26.73) | 13.66 (11.68, 15.52) | 11.03 (9.27, 13.49) | 5.23 (3.23, 8.84) | 14.01 (10.92, 17.25) | 17.99 (14.80, 20.83) |
| | | ROUGE | 13.03 (10.63, 15.46) | 27.24 (24.04, 30.91) | 18.14 (16.46, 20.20) | 13.06 (10.93, 15.66) | 5.82 (2.03, 10.09) | 14.23 (11.20, 17.66) | 19.43 (16.56, 23.55) |
| | | UMLS_Precision | 7.60 (5.41, 10.83) | 19.64 (16.2, 23.59) | 16.38 (12.67, 20.25) | 6.45 (4.05, 8.97) | 18.63 (14.84, 20.76) | 13.24 (9.90, 17.02) | 20.74 (17.39, 24.71) |
| | | UMLS_Recall | 7.56 (5.40, 10.51) | 18.88 (15.51, 22.68) | 13.34 (10.59, 16.07) | 6.10 (4.04, 8.97) | 15.03 (12.07, 18.34) | 12.94 (9.39, 15.86) | 14.14 (11.19, 17.37) |
| | | BERT-Sim | 52.08 (50.43, 54.07) | 57.81 (55.49, 59.76) | 42.46 (41.50, 43.44) | 51.37 (49.57, 53.01) | 47.85 (44.20, 49.37) | 57.57 (55.85, 60.19) | 63.85 (62.04, 65.94) |
| | VQA-RAD | BLEU | 33.98 (26.75, 41.85) | 35.1 (28.44, 41.55) | 31.55 (24.89, 38.35) | 35.97 (29.14, 45.45) | 22.03 (15.67, 30.38) | 43.98 (36.58, 50.51) | 52.24 (44.97, 59.43) |
| | | ROUGE | 35.26 (28.21, 43.91) | 39.2 (31.36, 46.33) | 37.47 (30.83, 44.47) | 38.64 (31.42, 48.23) | 22.67 (14.92, 28.57) | 44.70 (38.35, 50.81) | 52.74 (45.39, 61.05) |
| | | UMLS_Precision | 14.72 (6.86, 24.22) | 16.46 (7.83, 25.93) | 13.30 (12.14, 14.50) | 18.70 (8.76, 29.61) | 60.30 (50.88, 67.07) | 61.52 (53.65, 69.51) | 62.12 (54.01, 71.12) |
| | | UMLS_Recall | 14.52 (7.63, 23.33) | 15.94 (7.72, 25.48) | 12.16 (10.09, 13.93) | 17.46 (8.76, 27.85) | 39.43 (32.59, 47.12) | 41.14 (34.49, 48.76) | 42.82 (32.31, 51.54) |
| | | BERT-Sim | 71.49 (67.63, 74.96) | 71.39 (66.94, 75.46) | 68.28 (64.07, 72.00) | 73.40 (69.62, 77.32) | 58.88 (56.74, 61.08) | 80.64 (77.55, 83.89) | 81.52 (77.41, 85.17) |
| | SLAKE | BLEU | 27.16 (22.01, 32.56) | 24.81 (20.23, 30.52) | 21.43 (17.07, 25.35) | 23.62 (18.06, 28.26) | 24.39 (15.81, 30.74) | 67.44 (63.74, 71.68) | 78.56 (72.2, 83.28) |
| | | ROUGE | 29.36 (24.23, 34.73) | 29.08 (24.06, 34.8) | 29.92 (25.31, 34.09) | 24.86 (19.47, 29.94) | 24.81 (16.93, 30.59) | 67.90 (63.58, 74.28) | 79.42 (75.15, 84.05) |
| | | UMLS_Precision | 23.02 (17.52, 30.73) | 23.32 (18.08, 29.42) | 23.14 (18.29, 28.86) | 18.28 (13.23, 23.38) | 68.87 (64.43, 73.27) | 76.09 (71.63, 80.21) | 81.5 (76.81, 86.87) |
| | | UMLS_Recall | 22.71 (17.48, 29.53) | 23.74 (18, 30.08) | 23.31 (18.29, 27.98) | 19.21 (13.38, 24.37) | 57.38 (52.49, 63.66) | 72.04 (67.59, 76.36) | 74.42 (66.7, 81.19) |
| | | BERT-Sim | 69.42 (66.09, 72.04) | 67.7 (64.94, 70.69) | 69.14 (66.53, 70.92) | 66.93 (63.98, 70.32) | 62.35 (61.15, 63.66) | 90.93 (89.46, 92.30) | 93.30 (90.99, 95.60) |
| Report generation | MIMIC-CXR | BLEU | 23.79 (22.62, 24.86) | 0.04 (0.01, 0.08) | 11.29 (9.92, 12.86) | 22.65 (20.93, 24.06) | 11.06 (8.36, 14.43) | 20.63 (17.16, 25.43) | 19.43 (16.12, 23.25) |
| | | ROUGE | 35.83 (33.7, 37.96) | 2.69 (2.26, 3.15) | 13.91 (12.63, 15.29) | 27.29 (25.63, 29.04) | 15.05 (12.72, 19.54) | 25.42 (21.89, 29.47) | 26.18 (23.07, 29.86) |
| | | UMLS_Precision | 16.75 (15.74, 17.88) | 26.67 (11.19, 42.12) | 10.50 (8.42, 12.88) | 22.36 (20.13, 24.33) | 21.80 (19.26, 24.29) | 43.64 (36.96, 49.45) | 45.51 (40.47, 52.77) |
| | | UMLS_Recall | 24.93 (22.86, 27.38) | 0.52 (0.2, 0.88) | 10.71 (8.37, 13.85) | 19.64 (17.89, 21.43) | 15.97 (12.92, 18.48) | 22.73 (19.64, 26.57) | 23.39 (20.18, 27.53) |
| | | BERT-Sim | 65.91 (65.20, 66.70) | 34.48 (32.69, 36.02) | 49.20 (48.22, 50.35) | 66.03 (65.37, 66.83) | 63.13 (61.31, 64.87) | 64.22 (61.74, 65.97) | 66.77 (64.87, 68.58) |

  1. We adopt a few-shot prompting setting for the Flamingo-style models (OpenFlamingo and MedFlamingo), and a zero-shot instruction-prompting strategy for MedVInT, LLaVA-Med and RadFM. "w/o Ins-tuning" denotes training without domain-specific instruction tuning, and "w/o Our Data" denotes training without any of our newly collected data, i.e., using the combination of existing datasets only. ACC, F1, BLEU, ROUGE, UMLS_Precision, UMLS_Recall and BERT-Sim are reported according to task type; each metric is the average score over all test samples. Numbers in parentheses indicate 95% CIs. Percentage (%) signs are omitted throughout the table.
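
As the footnote states, each entry is the mean of a per-sample metric with a 95% CI. As a rough illustration only (not the authors' evaluation code), a nonparametric bootstrap over per-sample scores is one common way to obtain such an interval; the function and variable names below are hypothetical, and the exact resampling scheme used in the paper may differ.

```python
import numpy as np

def bootstrap_ci(per_sample_scores, n_boot=1000, alpha=0.05, seed=0):
    """Mean of per-sample scores with a nonparametric bootstrap 95% CI.

    `per_sample_scores` is assumed to hold one metric value per test
    sample (e.g., accuracy, F1 or BLEU, expressed in %).
    """
    scores = np.asarray(per_sample_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample test samples with replacement and recompute the mean each time.
    boot_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lower, upper)

# Example with synthetic per-sample accuracies (0/1), reported in %.
per_sample = 100 * np.random.default_rng(1).integers(0, 2, size=500)
mean, (lo, hi) = bootstrap_ci(per_sample)
print(f"{mean:.2f} ({lo:.2f}, {hi:.2f})")
```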