Table 2 Accuracy of language models across zero-shot prompting, conventional online RAG, and RaR on the RadioRAG dataset
From: Multi-step retrieval and reasoning improves radiology question answering with large language models
Model name | Zero-shot: Accuracy (%) | Zero-shot: Total correct (n) | Zero-shot: P-value | Conventional online RAG: Accuracy (%) | Conventional online RAG: Total correct (n) | Conventional online RAG: P-value | RaR: Accuracy (%) | RaR: Total correct (n) |
|---|---|---|---|---|---|---|---|---|
Ministral-8B | 47 ± 5 [38,57] | 49 | 0.020 | 51 ± 5 [41,61] | 53 | 0.051 | 66 ± 5 [57,76] | 69 |
Mistral Large (123B) | 72 ± 4 [63,81] | 75 | 0.146 | 74 ± 4 [65,83] | 77 | 0.273 | 81 ± 4 [72,88] | 84 |
Llama3.3-8B | 62 ± 5 [53,71] | 65 | 0.807 | 63 ± 5 [55,72] | 66 | 0.999 | 65 ± 5 [57,74] | 68 |
Llama3.3-70B | 76 ± 4 [67,84] | 79 | 0.212 | 73 ± 4 [63,81] | 76 | 0.081 | 83 ± 4 [75,89] | 86 |
Llama3-Med42-8B | 67 ± 5 [58,77] | 70 | 0.263 | 67 ± 5 [59,77] | 70 | 0.383 | 75 ± 4 [66,84] | 78 |
Llama3-Med42-70B | 72 ± 4 [63,80] | 75 | 0.263 | 75 ± 4 [67,83] | 78 | 0.705 | 79 ± 4 [71,87] | 82 |
Llama4 Scout 16E | 76 ± 4 [67,85] | 79 | 0.392 | 80 ± 4 [72,88] | 83 | 0.999 | 81 ± 4 [73,88] | 84 |
DeepSeek R1-70B | 78 ± 4 [70,86] | 81 | 0.859 | 76 ± 4 [67,84] | 79 | 0.662 | 80 ± 4 [72,88] | 83 |
DeepSeek R1 (671B) | 82 ± 4 [74,89] | 85 | 0.859 | 79 ± 4 [71,87] | 82 | 0.999 | 80 ± 4 [72,88] | 83 |
DeepSeek-V3 (671B) | 76 ± 4 [67,84] | 79 | 0.106 | 80 ± 4 [72,88] | 83 | 0.273 | 86 ± 4 [78,92] | 89 |
Qwen 2.5-0.5B | 37 ± 5 [27,46] | 38 | 0.726 | 46 ± 5 [37,56] | 48 | 0.737 | 42 ± 5 [32,52] | 43 |
Qwen 2.5-3B | 54 ± 5 [44,63] | 56 | 0.146 | 53 ± 5 [43,62] | 55 | 0.171 | 65 ± 5 [56,74] | 68 |
Qwen 2.5-7B | 55 ± 5 [45,64] | 57 | 0.041 | 59 ± 5 [49,68] | 61 | 0.171 | 71 ± 4 [62,80] | 74 |
Qwen 2.5-14B | 68 ± 4 [59,77] | 71 | 0.752 | 67 ± 5 [57,76] | 69 | 0.549 | 72 ± 4 [63,81] | 75 |
Qwen 2.5-70B | 70 ± 5 [62,79] | 73 | 0.185 | 73 ± 4 [64,82] | 76 | 0.599 | 78 ± 4 [70,86] | 81 |
Qwen 3-8B | 66 ± 5 [57,75] | 69 | 0.157 | 73 ± 4 [65,81] | 76 | 0.862 | 76 ± 4 [68,84] | 79 |
Qwen 3-235B | 82 ± 4 [74,89] | 85 | 0.999 | 84 ± 4 [75,90] | 87 | 0.999 | 83 ± 4 [75,89] | 86 |
GPT-3.5-turbo | 57 ± 5 [47,66] | 59 | 0.146 | 62 ± 5 [53,71] | 64 | 0.540 | 68 ± 5 [60,77] | 71 |
GPT-4-turbo | 76 ± 4 [67,84] | 79 | 0.999 | 76 ± 4 [67,84] | 79 | 0.999 | 77 ± 4 [69,85] | 80 |
o3 | 86 ± 4 [78,92] | 89 | 0.781 | 85 ± 4 [77,91] | 88 | 0.705 | 88 ± 3 [81,93] | 91 |
GPT-5 | 82 ± 4 [74,89] | 85 | 0.097 | 80 ± 4 [72,88] | 83 | 0.081 | 88 ± 3 [82,94] | 92 |
MedGemma-4B-it | 56 ± 5 [46,65] | 58 | 0.157 | 52 ± 5 [42,62] | 54 | 0.051 | 66 ± 5 [57,75] | 69 |
MedGemma-27B-text-it | 71 ± 4 [62,79] | 74 | 0.146 | 75 ± 4 [66,84] | 78 | 0.438 | 81 ± 4 [73,88] | 84 |
Gemma-3-4B-it | 46 ± 5 [37,56] | 48 | 0.094 | 53 ± 5 [43,62] | 55 | 0.273 | 62 ± 5 [52,71] | 64 |
Gemma-3-27B-it | 65 ± 5 [57,75] | 68 | 0.157 | 66 ± 5 [58,75] | 69 | 0.270 | 76 ± 4 [67,85] | 79 |
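Accuracy in the table is reported with an uncertainty and a bracketed 95% confidence interval. One common way such intervals are obtained for question-answering benchmarks is nonparametric bootstrapping over per-question correctness; the sketch below illustrates that approach. The function name, resampling parameters, and the example counts are assumptions for illustration, not details taken from the article.

```python
import random

def bootstrap_accuracy_ci(correct, n_items, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy from per-question correctness.

    `correct` questions answered correctly out of `n_items` total.
    All parameter choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Represent each question as 1 (correct) or 0 (incorrect).
    outcomes = [1] * correct + [0] * (n_items - correct)
    # Resample with replacement and recompute accuracy for each replicate.
    stats = sorted(
        sum(rng.choices(outcomes, k=n_items)) / n_items
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical example: 49 correct answers out of 104 questions.
low, high = bootstrap_accuracy_ci(49, 104)
```

The percentile bootstrap makes no normality assumption, which suits the small per-model sample sizes implied by the "Total correct (n)" column.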