Table 4 Response time comparison between zero-shot and RaR strategies on the RadioRAG dataset

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model/group name | Zero-shot (s) | RaR (s) | Absolute difference (s) | Relative increase (times) |
| --- | --- | --- | --- | --- |
| **DeepSeek-V3 group** | 98.55 ± 53.58 | 412.7 ± 156.7 | 314.2 ± 141.6 | 4.2× |
| **Large (120–250B) group** | 63.7 ± 29.4 | 845.1 ± 744.7 | 781.4 ± 715.2 | 13.3× |
| Llama4 Scout 16E | 49.6 ± 24.6 | 462.3 ± 190.2 | 412.6 ± 169.7 | 9.3× |
| Mistral Large | 43.9 ± 23.9 | 369.7 ± 142.0 | 325.8 ± 126.0 | 8.4× |
| Qwen 3-235B | 97.5 ± 54.6 | 1703.3 ± 787.6 | 1605.8 ± 744.0 | 17.5× |
| **Medium (≈70B) group** | 78.7 ± 51.4 | 230.58 ± 44.8 | 151.8 ± 34.3 | 2.9× |
| DeepSeek R1-70B | 151.3 ± 83.4 | 282.8 ± 95.0 | 131.3 ± 68.3 | 1.9× |
| Llama3-Med42-70B | 42.2 ± 22.4 | 177.0 ± 39.5 | 134.8 ± 27.9 | 4.2× |
| Llama3.3-70B | 78.5 ± 43.6 | 216.7 ± 60.7 | 138.2 ± 34.7 | 2.8× |
| Qwen 2.5-70B | 42.6 ± 22.2 | 245.7 ± 76.8 | 203.1 ± 58.5 | 5.8× |
| **Gemma 27B group** | 75.8 ± 38.2 | 214.1 ± 54.9 | 138.3 ± 16.7 | 2.8× |
| Gemma-3-27B-it | 48.8 ± 28.6 | 175.3 ± 37.4 | 126.5 ± 26.2 | 3.6× |
| MedGemma-27B-text-it | 102.8 ± 56.1 | 253.0 ± 75.2 | 150.1 ± 38.4 | 2.5× |
| **Small (7–8B) group** | 22.0 ± 39.9 | 132.9 ± 33.9 | 110.9 ± 9.3 | 6.0× |
| Llama3-Med42-8B | 1.4 ± 0.7 | 108.0 ± 3.7 | 106.6 ± 3.3 | 76.5× |
| Llama3.3-8B | 8.4 ± 4.0 | 116.3 ± 7.6 | 107.9 ± 4.6 | 13.9× |
| Ministral-8B | 3.7 ± 2.2 | 124.9 ± 11.8 | 121.2 ± 10.4 | 34.0× |
| Qwen 2.5-7B | 3.4 ± 1.6 | 122.8 ± 11.4 | 119.4 ± 10.4 | 36.0× |
| Qwen 3-8B | 93.2 ± 53.4 | 192.3 ± 49.8 | 99.1 ± 33.9 | 2.1× |
| **Mini (3–4B) group** | 11.4 ± 5.4 | 126.3 ± 6.3 | 114.9 ± 8.4 | 11.1× |
| Gemma-3-4B-it | 17.5 ± 7.9 | 127.7 ± 13.1 | 110.2 ± 7.0 | 7.3× |
| MedGemma-4B-it | 9.6 ± 5.4 | 119.4 ± 9.9 | 109.8 ± 9.1 | 12.5× |
| Qwen 2.5-3B | 7.1 ± 3.7 | 131.7 ± 13.7 | 124.6 ± 11.0 | 18.6× |
| **Average** | 53.7 ± 28.4 | 324.4 ± 270.2 | 271.2 ± 257.3 | 6.7 ± 4.1× |

  1. Average per-question response times (n = 104) are reported in seconds as mean ± standard deviation for both individual models and aggregated model groups. On the RadioRAG dataset, a fixed overhead of 10,554.6 s per model, corresponding to context generation, was evenly distributed across all questions, contributing approximately 101.5 s per question. For time analysis, models were grouped based on parameter scale and architectural characteristics into six categories: the DeepSeek mixture of experts (MoE) group, the large model group (120–250B), the medium-scale group (~70B), the Gemma-27B group, the small model group (7–8B), and the mini model group (3–4B). “Absolute difference” denotes the increase in average response time per question introduced by the RaR method, and “Relative increase” refers to the ratio of mean RaR time to mean zero-shot time per group. Statistics are computed at the group level.
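The two derived quantities in the note can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the paper; the only inputs assumed are the fixed overhead (10,554.6 s), the question count (n = 104), and the DeepSeek-V3 group means taken from the table:

```python
# Verify the per-question overhead and the "Relative increase" ratio
# using values stated in Table 4 and its note.

n_questions = 104            # questions in the RadioRAG dataset
fixed_overhead_s = 10554.6   # one-time context-generation cost per model (s)

# Fixed overhead amortized evenly across all questions (~101.5 s each).
overhead_per_question = fixed_overhead_s / n_questions

# Relative increase = mean RaR time / mean zero-shot time,
# illustrated here for the DeepSeek-V3 group row.
zero_shot_mean_s = 98.55
rar_mean_s = 412.7
relative_increase = rar_mean_s / zero_shot_mean_s

print(f"overhead per question: {overhead_per_question:.1f} s")  # 101.5 s
print(f"relative increase: {relative_increase:.1f}x")           # 4.2x
```

Both results match the reported values (≈101.5 s per question; 4.2× for the DeepSeek-V3 group).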