Table 1 Comparison of our proposed RadFM with other foundation models on nine existing datasets, together with ablation studies

From: Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data

| Task | Dataset | Metric | OpenFlamingo (Few-shot) | MedVInT | LLaVA-Med | MedFlamingo (Few-shot) | RadFM (w/o Ins-tuning) | RadFM (w/o Our Data) | RadFM |
|---|---|---|---|---|---|---|---|---|---|
| Disease diagnosis | VinDr-Mammo | ACC | 49.92 (48.20, 51.65) | 50.06 (48.52, 51.59) | 50.27 (49.20, 51.52) | 49.80 (48.33, 51.42) | 49.80 (48.15, 51.53) | 55.35 (54.35, 559.47) | 59.96 (58.41, 61.59) |
| | | F1 | 57.01 (54.64, 60.08) | 66.56 (65.2, 67.93) | 56.48 (55.45, 57.73) | 64.92 (63.52, 66.32) | 60.32 (58.25, 62.31) | 60.57 (58.75, 62.58) | 62.11 (60.09, 63.75) |
| | VinDr-SpineXr | ACC | 50.33 (47.13, 53.53) | 49.93 (46.99, 52.86) | 49.85 (47.95, 52.23) | 49.61 (46.05, 53.16) | 52.19 (49.18, 55.17) | 64.43 (61.76, 67.22) | 68.82 (65.92, 71.47) |
| | | F1 | 31.79 (26.99, 36.58) | 62.32 (59.38, 65.25) | 54.83 (51.88, 57.45) | 63.23 (59.74, 66.74) | 34.19 (30.17, 38.09) | 65.86 (62.62, 68.81) | 67.69 (64.5, 70.98) |
| | VinDr-PCXR | ACC | 49.85 (45.40, 54.31) | 50.29 (45.88, 54.69) | 49.62 (45.79, 53.64) | 49.37 (44.44, 54.31) | 50.12 (45.21, 54.60) | 51.82 (46.46, 57.09) | 56.32 (51.82, 61.21) |
| | | F1 | 41.44 (33.77, 49.10) | 66.29 (62.36, 70.23) | 47.81 (42.33, 53.42) | 66.94 (62.57, 71.32) | 43.33 (40.37, 40.88) | 49.14 (43.66, 56.18) | 37.53 (28.88, 43.67) |
| | CXR-Mix | ACC | 50.63 (50.07, 51.03) | 49.2 (48.53, 49.88) | 53.26 (52.72, 53.91) | 50.00 (49.50, 50.51) | 77.71 (77.25, 77.95) | 78.63 (78.51, 79.10) | 83.62 (83.23, 83.97) |
| | | F1 | 24.83 (24.11, 25.54) | 67.22 (66.62, 67.82) | 22.63 (22.70, 24.53) | 66.11 (65.72, 66.61) | 74.42 (73.98, 75.01) | 78.35 (77.85, 78.93) | 82.99 (82.58, 83.49) |
| | RadChest-CT | ACC | 50.93 (49.13, 52.72) | 50.07 (47.68, 52.45) | 51.09 (50.05, 52.63) | 50.39 (48.34, 52.43) | 51.97 (50.05, 53.31) | 69.72 (67.44, 71.53) | 72.95 (71.06, 74.78) |
| | | F1 | 43.49 (41.18, 45.99) | 66.57 (64.45, 68.69) | 44.42 (42.00, 46.55) | 63.31 (61.39, 65.23) | 38.67 (36.37, 41.46) | 67.84 (65.64, 70.11) | 71.86 (69.42, 83.49) |
| Medical VQA | PMC-VQA | BLEU | 11.10 (8.93, 13.41) | 23.73 (21.03, 26.73) | 13.66 (11.68, 15.52) | 11.03 (9.27, 13.49) | 5.23 (3.23, 8.84) | 14.01 (10.92, 17.25) | 17.99 (14.80, 20.83) |
| | | ROUGE | 13.03 (10.63, 15.46) | 27.24 (24.04, 30.91) | 18.14 (16.46, 20.20) | 13.06 (10.93, 15.66) | 5.82 (2.03, 10.09) | 14.23 (11.20, 17.66) | 19.43 (16.56, 23.55) |
| | | UMLS_Precision | 7.60 (5.41, 10.83) | 19.64 (16.2, 23.59) | 16.38 (12.67, 20.25) | 6.45 (4.05, 8.97) | 18.63 (14.84, 20.76) | 13.24 (9.90, 17.02) | 20.74 (17.39, 24.71) |
| | | UMLS_Recall | 7.56 (5.40, 10.51) | 18.88 (15.51, 22.68) | 13.34 (10.59, 16.07) | 6.10 (4.04, 8.97) | 15.03 (12.07, 18.34) | 12.94 (9.39, 15.86) | 14.14 (11.19, 17.37) |
| | | BERT-Sim | 52.08 (50.43, 54.07) | 57.81 (55.49, 59.76) | 42.46 (41.50, 43.44) | 51.37 (49.57, 53.01) | 47.85 (44.20, 49.37) | 57.57 (55.85, 60.19) | 63.85 (62.04, 65.94) |
| | VQA-RAD | BLEU | 33.98 (26.75, 41.85) | 35.1 (28.44, 41.55) | 31.55 (24.89, 38.35) | 35.97 (29.14, 45.45) | 22.03 (15.67, 30.38) | 43.98 (36.58, 50.51) | 52.24 (44.97, 59.43) |
| | | ROUGE | 35.26 (28.21, 43.91) | 39.2 (31.36, 46.33) | 37.47 (30.83, 44.47) | 38.64 (31.42, 48.23) | 22.67 (14.92, 28.57) | 44.70 (38.35, 50.81) | 52.74 (45.39, 61.05) |
| | | UMLS_Precision | 14.72 (6.86, 24.22) | 16.46 (7.83, 25.93) | 13.30 (12.14, 14.50) | 18.70 (8.76, 29.61) | 60.30 (50.88, 67.07) | 61.52 (53.65, 69.51) | 62.12 (54.01, 71.12) |
| | | UMLS_Recall | 14.52 (7.63, 23.33) | 15.94 (7.72, 25.48) | 12.16 (10.09, 13.93) | 17.46 (8.76, 27.85) | 39.43 (32.59, 47.12) | 41.14 (34.49, 48.76) | 42.82 (32.31, 51.54) |
| | | BERT-Sim | 71.49 (67.63, 74.96) | 71.39 (66.94, 75.46) | 68.28 (64.07, 72.00) | 73.40 (69.62, 77.32) | 58.88 (56.74, 61.08) | 80.64 (77.55, 83.89) | 81.52 (77.41, 85.17) |
| | SLAKE | BLEU | 27.16 (22.01, 32.56) | 24.81 (20.23, 30.52) | 21.43 (17.07, 25.35) | 23.62 (18.06, 28.26) | 24.39 (15.81, 30.74) | 67.44 (63.74, 71.68) | 78.56 (72.2, 83.28) |
| | | ROUGE | 29.36 (24.23, 34.73) | 29.08 (24.06, 34.8) | 29.92 (25.31, 34.09) | 24.86 (19.47, 29.94) | 24.81 (16.93, 30.59) | 67.90 (63.58, 74.28) | 79.42 (75.15, 84.05) |
| | | UMLS_Precision | 23.02 (17.52, 30.73) | 23.32 (18.08, 29.42) | 23.14 (18.29, 28.86) | 18.28 (13.23, 23.38) | 68.87 (64.43, 73.27) | 76.09 (71.63, 80.21) | 81.5 (76.81, 86.87) |
| | | UMLS_Recall | 22.71 (17.48, 29.53) | 23.74 (18, 30.08) | 23.31 (18.29, 27.98) | 19.21 (13.38, 24.37) | 57.38 (52.49, 63.66) | 72.04 (67.59, 76.36) | 74.42 (66.7, 81.19) |
| | | BERT-Sim | 69.42 (66.09, 72.04) | 67.7 (64.94, 70.69) | 69.14 (66.53, 70.92) | 66.93 (63.98, 70.32) | 62.35 (61.15, 63.66) | 90.93 (89.46, 92.30) | 93.30 (90.99, 95.60) |
| Report generation | MIMIC-CXR | BLEU | 23.79 (22.62, 24.86) | 0.04 (0.01, 0.08) | 11.29 (9.92, 12.86) | 22.65 (20.93, 24.06) | 11.06 (8.36, 14.43) | 20.63 (17.16, 25.43) | 19.43 (16.12, 23.25) |
| | | ROUGE | 35.83 (33.7, 37.96) | 2.69 (2.26, 3.15) | 13.91 (12.63, 15.29) | 27.29 (25.63, 29.04) | 15.05 (12.72, 19.54) | 25.42 (21.89, 29.47) | 26.18 (23.07, 29.86) |
| | | UMLS_Precision | 16.75 (15.74, 17.88) | 26.67 (11.19, 42.12) | 10.50 (8.42, 12.88) | 22.36 (20.13, 24.33) | 21.80 (19.26, 24.29) | 43.64 (36.96, 49.45) | 45.51 (40.47, 52.77) |
| | | UMLS_Recall | 24.93 (22.86, 27.38) | 0.52 (0.2, 0.88) | 10.71 (8.37, 13.85) | 19.64 (17.89, 21.43) | 15.97 (12.92, 18.48) | 22.73 (19.64, 26.57) | 23.39 (20.18, 27.53) |
| | | BERT-Sim | 65.91 (65.20, 66.70) | 34.48 (32.69, 36.02) | 49.20 (48.22, 50.35) | 66.03 (65.37, 66.83) | 63.13 (61.31, 64.87) | 64.22 (61.74, 65.97) | 66.77 (64.87, 68.58) |

  1. We adopt a few-shot prompting setting for the Flamingo-style models (OpenFlamingo and MedFlamingo), and a zero-shot instruction-prompting strategy for MedVInT, LLaVA-Med and RadFM. "w/o Ins-tuning" denotes training without domain-specific instruction tuning, and "w/o Our Data" denotes training without any of our newly collected data, i.e., using the combination of existing datasets only. ACC, F1, BLEU, ROUGE, UMLS_Precision, UMLS_Recall and BERT-Sim are reported according to task type; each metric is the average score over all test samples. Numbers in parentheses indicate 95% CIs. Percentage (%) signs are omitted throughout the table.
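
As the footnote states, each entry is the mean of a per-sample metric with a 95% CI. As a rough illustration only (not the authors' evaluation code), a nonparametric bootstrap over per-sample scores is one common way to obtain such an interval; the function and variable names below are hypothetical, and the exact resampling scheme used in the paper may differ.

```python
import numpy as np

def bootstrap_ci(per_sample_scores, n_boot=1000, alpha=0.05, seed=0):
    """Mean of per-sample scores with a nonparametric bootstrap 95% CI.

    `per_sample_scores` is assumed to hold one metric value per test
    sample (e.g., accuracy, F1 or BLEU, expressed in %).
    """
    scores = np.asarray(per_sample_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample test samples with replacement and recompute the mean each time.
    boot_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lower, upper)

# Example with synthetic per-sample accuracies (0/1), reported in %.
per_sample = 100 * np.random.default_rng(1).integers(0, 2, size=500)
mean, (lo, hi) = bootstrap_ci(per_sample)
print(f"{mean:.2f} ({lo:.2f}, {hi:.2f})")
```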