Table 2 Accuracy of language models across zero-shot prompting, conventional online RAG, and RaR on the RadioRAG dataset

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Zero-shot: Accuracy (%) | Total correct (n) | P-value | Conventional online RAG: Accuracy (%) | Total correct (n) | P-value | RaR: Accuracy (%) | Total correct (n) |
|---|---|---|---|---|---|---|---|---|
| Ministral-8B | 47 ± 5 [38,57] | 49 | 0.020 | 51 ± 5 [41,61] | 53 | 0.051 | 66 ± 5 [57,76] | 69 |
| Mistral Large (123B) | 72 ± 4 [63,81] | 75 | 0.146 | 74 ± 4 [65,83] | 77 | 0.273 | 81 ± 4 [72,88] | 84 |
| Llama3.3-8B | 62 ± 5 [53,71] | 65 | 0.807 | 63 ± 5 [55,72] | 66 | 0.999 | 65 ± 5 [57,74] | 68 |
| Llama3.3-70B | 76 ± 4 [67,84] | 79 | 0.212 | 73 ± 4 [63,81] | 76 | 0.081 | 83 ± 4 [75,89] | 86 |
| Llama3-Med42-8B | 67 ± 5 [58,77] | 70 | 0.263 | 67 ± 5 [59,77] | 70 | 0.383 | 75 ± 4 [66,84] | 78 |
| Llama3-Med42-70B | 72 ± 4 [63,80] | 75 | 0.263 | 75 ± 4 [67,83] | 78 | 0.705 | 79 ± 4 [71,87] | 82 |
| Llama4 Scout 16E | 76 ± 4 [67,85] | 79 | 0.392 | 80 ± 4 [72,88] | 83 | 0.999 | 81 ± 4 [73,88] | 84 |
| DeepSeek R1-70B | 78 ± 4 [70,86] | 81 | 0.859 | 76 ± 4 [67,84] | 79 | 0.662 | 80 ± 4 [72,88] | 83 |
| DeepSeek R1 (671B) | 82 ± 4 [74,89] | 85 | 0.859 | 79 ± 4 [71,87] | 82 | 0.999 | 80 ± 4 [72,88] | 83 |
| DeepSeek-V3 (671B) | 76 ± 4 [67,84] | 79 | 0.106 | 80 ± 4 [72,88] | 83 | 0.273 | 86 ± 4 [78,92] | 89 |
| Qwen 2.5-0.5B | 37 ± 5 [27,46] | 38 | 0.726 | 46 ± 5 [37,56] | 48 | 0.737 | 42 ± 5 [32,52] | 43 |
| Qwen 2.5-3B | 54 ± 5 [44,63] | 56 | 0.146 | 53 ± 5 [43,62] | 55 | 0.171 | 65 ± 5 [56,74] | 68 |
| Qwen 2.5-7B | 55 ± 5 [45,64] | 57 | 0.041 | 59 ± 5 [49,68] | 61 | 0.171 | 71 ± 4 [62,80] | 74 |
| Qwen 2.5-14B | 68 ± 4 [59,77] | 71 | 0.752 | 67 ± 5 [57,76] | 69 | 0.549 | 72 ± 4 [63,81] | 75 |
| Qwen 2.5-70B | 70 ± 5 [62,79] | 73 | 0.185 | 73 ± 4 [64,82] | 76 | 0.599 | 78 ± 4 [70,86] | 81 |
| Qwen 3-8B | 66 ± 5 [57,75] | 69 | 0.157 | 73 ± 4 [65,81] | 76 | 0.862 | 76 ± 4 [68,84] | 79 |
| Qwen 3-235B | 82 ± 4 [74,89] | 85 | 0.999 | 84 ± 4 [75,90] | 87 | 0.999 | 83 ± 4 [75,89] | 86 |
| GPT-3.5-turbo | 57 ± 5 [47,66] | 59 | 0.146 | 62 ± 5 [53,71] | 64 | 0.540 | 68 ± 5 [60,77] | 71 |
| GPT-4-turbo | 76 ± 4 [67,84] | 79 | 0.999 | 76 ± 4 [67,84] | 79 | 0.999 | 77 ± 4 [69,85] | 80 |
| o3 | 86 ± 4 [78,92] | 89 | 0.781 | 85 ± 4 [77,91] | 88 | 0.705 | 88 ± 3 [81,93] | 91 |
| GPT-5 | 82 ± 4 [74,89] | 85 | 0.097 | 80 ± 4 [72,88] | 83 | 0.081 | 88 ± 3 [82,94] | 92 |
| MedGemma-4B-it | 56 ± 5 [46,65] | 58 | 0.157 | 52 ± 5 [42,62] | 54 | 0.051 | 66 ± 5 [57,75] | 69 |
| MedGemma-27B-text-it | 71 ± 4 [62,79] | 74 | 0.146 | 75 ± 4 [66,84] | 78 | 0.438 | 81 ± 4 [73,88] | 84 |
| Gemma-3-4B-it | 46 ± 5 [37,56] | 48 | 0.094 | 53 ± 5 [43,62] | 55 | 0.273 | 62 ± 5 [52,71] | 64 |
| Gemma-3-27B-it | 65 ± 5 [57,75] | 68 | 0.157 | 66 ± 5 [58,75] | 69 | 0.270 | 76 ± 4 [67,85] | 79 |

  1. Accuracy is reported as mean ± standard deviation in percent, with 95% confidence intervals in brackets and the total number of correct answers per method alongside. Results are based on 104 questions, using bootstrapping with 1,000 repetitions, sampling with replacement while preserving pairing. For each model, P-values were calculated with McNemar's test on paired outcomes relative to RaR and adjusted for multiple comparisons using the false discovery rate; p < 0.05 was considered statistically significant.
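The statistical pipeline described in the footnote (paired bootstrap for accuracy confidence intervals, an exact McNemar's test on paired outcomes, and Benjamini-Hochberg false-discovery-rate adjustment) can be sketched as below. The per-question outcome vectors are not published in the table, so the `zero_shot` and `rar` arrays here are simulated placeholders; only the procedures themselves follow the footnote.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

# Hypothetical paired outcomes (True = correct) on 104 questions; the real
# per-question results are not in the table, so these are simulated.
n = 104
zero_shot = rng.random(n) < 0.76
rar = rng.random(n) < 0.83

def bootstrap_accuracy(outcomes, reps=1000, rng=rng):
    """Mean, SD, and 95% CI of accuracy (%) via bootstrap resampling.

    Resampling the question indices (rather than each method separately)
    is what preserves the pairing across methods.
    """
    idx = rng.integers(0, len(outcomes), size=(reps, len(outcomes)))
    accs = outcomes[idx].mean(axis=1) * 100
    return accs.mean(), accs.std(), np.percentile(accs, [2.5, 97.5])

def mcnemar_exact(a, b):
    """Exact McNemar's test: a binomial test on the discordant pairs."""
    only_a = int(np.sum(a & ~b))   # a correct, b wrong
    only_b = int(np.sum(~a & b))   # a wrong, b correct
    if only_a + only_b == 0:
        return 1.0
    return binomtest(only_a, only_a + only_b, 0.5).pvalue

def fdr_bh(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(adj, 0, 1)
    return out

mean_acc, sd_acc, ci = bootstrap_accuracy(zero_shot)
p_raw = mcnemar_exact(zero_shot, rar)
```

In practice one would collect the raw McNemar p-values for all models in a comparison (e.g. all zero-shot-vs-RaR tests) and pass them to `fdr_bh` together, since the adjustment depends on the whole family of tests.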