Table 4 Response time comparison between zero-shot and RaR strategies on the RadioRAG dataset
From: Multi-step retrieval and reasoning improves radiology question answering with large language models
Model/group name | Zero-shot time (s) | RaR time (s) | Absolute difference (s) | Relative increase (×) |
|---|---|---|---|---|
DeepSeek-V3 group | 98.55 ± 53.58 | 412.7 ± 156.7 | 314.2 ± 141.6 | 4.2× |
Large (120–250B) group | 63.7 ± 29.4 | 845.1 ± 744.7 | 781.4 ± 715.2 | 13.3× |
Llama4 Scout 16E | 49.6 ± 24.6 | 462.3 ± 190.2 | 412.6 ± 169.7 | 9.3× |
Mistral Large | 43.9 ± 23.9 | 369.7 ± 142.0 | 325.8 ± 126.0 | 8.4× |
Qwen 3-235B | 97.5 ± 54.6 | 1703.3 ± 787.6 | 1605.8 ± 744.0 | 17.5× |
Medium (≈70B) group | 78.7 ± 51.4 | 230.58 ± 44.8 | 151.8 ± 34.3 | 2.9× |
DeepSeek R1-70B | 151.3 ± 83.4 | 282.8 ± 95.0 | 131.3 ± 68.3 | 1.9× |
Llama3-Med42-70B | 42.2 ± 22.4 | 177.0 ± 39.5 | 134.8 ± 27.9 | 4.2× |
Llama3.3-70B | 78.5 ± 43.6 | 216.7 ± 60.7 | 138.2 ± 34.7 | 2.8× |
Qwen 2.5-70B | 42.6 ± 22.2 | 245.7 ± 76.8 | 203.1 ± 58.5 | 5.8× |
Gemma 27B group | 75.8 ± 38.2 | 214.1 ± 54.9 | 138.3 ± 16.7 | 2.8× |
Gemma-3-27B-it | 48.8 ± 28.6 | 175.3 ± 37.4 | 126.5 ± 26.2 | 3.6× |
MedGemma-27B-text-it | 102.8 ± 56.1 | 253.0 ± 75.2 | 150.1 ± 38.4 | 2.5× |
Small (7–8B) group | 22.0 ± 39.9 | 132.9 ± 33.9 | 110.9 ± 9.3 | 6.0× |
Llama3-Med42-8B | 1.4 ± 0.7 | 108.0 ± 3.7 | 106.6 ± 3.3 | 76.5× |
Llama3.3-8B | 8.4 ± 4.0 | 116.3 ± 7.6 | 107.9 ± 4.6 | 13.9× |
Ministral-8B | 3.7 ± 2.2 | 124.9 ± 11.8 | 121.2 ± 10.4 | 34.0× |
Qwen 2.5-7B | 3.4 ± 1.6 | 122.8 ± 11.4 | 119.4 ± 10.4 | 36.0× |
Qwen 3-8B | 93.2 ± 53.4 | 192.3 ± 49.8 | 99.1 ± 33.9 | 2.1× |
Mini (3–4B) group | 11.4 ± 5.4 | 126.3 ± 6.3 | 114.9 ± 8.4 | 11.1× |
Gemma-3-4B-it | 17.5 ± 7.9 | 127.7 ± 13.1 | 110.2 ± 7.0 | 7.3× |
MedGemma-4B-it | 9.6 ± 5.4 | 119.4 ± 9.9 | 109.8 ± 9.1 | 12.5× |
Qwen 2.5-3B | 7.1 ± 3.7 | 131.7 ± 13.7 | 124.6 ± 11.0 | 18.6× |
Average | 53.7 ± 28.4 | 324.4 ± 270.2 | 271.2 ± 257.3 | 6.7 ± 4.1× |
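
The two derived columns appear to follow from the mean times: the absolute difference looks like RaR minus zero-shot, and the relative increase looks like RaR divided by zero-shot. The table does not state this rule, so the sketch below is an assumption. The helper name `derive` and the choice of rows are illustrative only. Recomputing from the rounded means reproduces several rows but not all of them exactly, which suggests the published values were computed from unrounded data.

```python
# Mean response times in seconds, copied from three rows of the table.
# ASSUMPTION: derived columns are computed from the means as shown below;
# the table itself does not state the rule.
rows = {
    "DeepSeek-V3 group": (98.55, 412.7),
    "Mistral Large": (43.9, 369.7),
    "Llama3-Med42-70B": (42.2, 177.0),
}


def derive(zero_shot_s, rar_s):
    """Return (absolute difference in seconds, relative increase as a ratio)."""
    return rar_s - zero_shot_s, rar_s / zero_shot_s


for name, (zero_shot_s, rar_s) in rows.items():
    diff_s, ratio = derive(zero_shot_s, rar_s)
    print(f"{name}: +{diff_s:.1f} s, {ratio:.1f}x")
```

The standard deviations are not recomputed here, since they cannot be derived from the means alone and would require the per-question timings.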