Table 4 Response time comparison between zero-shot and RaR strategies on the RadioRAG dataset

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model/group name | Zero-shot (s) | RaR (s) | Absolute difference (s) | Relative increase (times) |
| --- | --- | --- | --- | --- |
| **DeepSeek-V3 group** | 98.55 ± 53.58 | 412.7 ± 156.7 | 314.2 ± 141.6 | 4.2× |
| **Large (120–250B) group** | 63.7 ± 29.4 | 845.1 ± 744.7 | 781.4 ± 715.2 | 13.3× |
| Llama4 Scout 16E | 49.6 ± 24.6 | 462.3 ± 190.2 | 412.6 ± 169.7 | 9.3× |
| Mistral Large | 43.9 ± 23.9 | 369.7 ± 142.0 | 325.8 ± 126.0 | 8.4× |
| Qwen 3-235B | 97.5 ± 54.6 | 1703.3 ± 787.6 | 1605.8 ± 744.0 | 17.5× |
| **Medium (≈70B) group** | 78.7 ± 51.4 | 230.58 ± 44.8 | 151.8 ± 34.3 | 2.9× |
| DeepSeek R1-70B | 151.3 ± 83.4 | 282.8 ± 95.0 | 131.3 ± 68.3 | 1.9× |
| Llama3-Med42-70B | 42.2 ± 22.4 | 177.0 ± 39.5 | 134.8 ± 27.9 | 4.2× |
| Llama3.3-70B | 78.5 ± 43.6 | 216.7 ± 60.7 | 138.2 ± 34.7 | 2.8× |
| Qwen 2.5-70B | 42.6 ± 22.2 | 245.7 ± 76.8 | 203.1 ± 58.5 | 5.8× |
| **Gemma 27B group** | 75.8 ± 38.2 | 214.1 ± 54.9 | 138.3 ± 16.7 | 2.8× |
| Gemma-3-27B-it | 48.8 ± 28.6 | 175.3 ± 37.4 | 126.5 ± 26.2 | 3.6× |
| MedGemma-27B-text-it | 102.8 ± 56.1 | 253.0 ± 75.2 | 150.1 ± 38.4 | 2.5× |
| **Small (7–8B) group** | 22.0 ± 39.9 | 132.9 ± 33.9 | 110.9 ± 9.3 | 6.0× |
| Llama3-Med42-8B | 1.4 ± 0.7 | 108.0 ± 3.7 | 106.6 ± 3.3 | 76.5× |
| Llama3.3-8B | 8.4 ± 4.0 | 116.3 ± 7.6 | 107.9 ± 4.6 | 13.9× |
| Ministral-8B | 3.7 ± 2.2 | 124.9 ± 11.8 | 121.2 ± 10.4 | 34.0× |
| Qwen 2.5-7B | 3.4 ± 1.6 | 122.8 ± 11.4 | 119.4 ± 10.4 | 36.0× |
| Qwen 3-8B | 93.2 ± 53.4 | 192.3 ± 49.8 | 99.1 ± 33.9 | 2.1× |
| **Mini (3–4B) group** | 11.4 ± 5.4 | 126.3 ± 6.3 | 114.9 ± 8.4 | 11.1× |
| Gemma-3-4B-it | 17.5 ± 7.9 | 127.7 ± 13.1 | 110.2 ± 7.0 | 7.3× |
| MedGemma-4B-it | 9.6 ± 5.4 | 119.4 ± 9.9 | 109.8 ± 9.1 | 12.5× |
| Qwen 2.5-3B | 7.1 ± 3.7 | 131.7 ± 13.7 | 124.6 ± 11.0 | 18.6× |
| **Average** | 53.7 ± 28.4 | 324.4 ± 270.2 | 271.2 ± 257.3 | 6.7 ± 4.1× |

  1. Average per-question response times (n = 104) are reported in seconds as mean ± standard deviation for both individual models and aggregated model groups. On the RadioRAG dataset, a fixed overhead of 10,554.6 s per model, corresponding to context generation, was evenly distributed across all questions, contributing approximately 101.5 s per question. For time analysis, models were grouped based on parameter scale and architectural characteristics into six categories: the DeepSeek mixture of experts (MoE) group, the large model group (120–250B), the medium-scale group (~70B), the Gemma-27B group, the small model group (7–8B), and the mini model group (3–4B). “Absolute difference” denotes the increase in average response time per question introduced by the RaR method, and “Relative increase” refers to the ratio of mean RaR time to mean zero-shot time per group. Statistics are computed at the group level.
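The two derived quantities in the note can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the paper; the only inputs assumed are the fixed overhead (10,554.6 s), the question count (n = 104), and the DeepSeek-V3 group means taken from the table:

```python
# Verify the per-question overhead and the "Relative increase" ratio
# using values stated in Table 4 and its note.

n_questions = 104            # questions in the RadioRAG dataset
fixed_overhead_s = 10554.6   # one-time context-generation cost per model (s)

# Fixed overhead amortized evenly across all questions (~101.5 s each).
overhead_per_question = fixed_overhead_s / n_questions

# Relative increase = mean RaR time / mean zero-shot time,
# illustrated here for the DeepSeek-V3 group row.
zero_shot_mean_s = 98.55
rar_mean_s = 412.7
relative_increase = rar_mean_s / zero_shot_mean_s

print(f"overhead per question: {overhead_per_question:.1f} s")  # 101.5 s
print(f"relative increase: {relative_increase:.1f}x")           # 4.2x
```

Both results match the reported values (≈101.5 s per question; 4.2× for the DeepSeek-V3 group).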