Table 2 Accuracy of language models across zero-shot prompting, conventional online RAG, and RaR on the RadioRAG dataset

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Zero-shot: Accuracy (%) | Total correct (n) | P-value | Conventional online RAG: Accuracy (%) | Total correct (n) | P-value | RaR: Accuracy (%) | Total correct (n) |
|---|---|---|---|---|---|---|---|---|
| Ministral-8B | 47 ± 5 [38,57] | 49 | 0.020 | 51 ± 5 [41,61] | 53 | 0.051 | 66 ± 5 [57,76] | 69 |
| Mistral Large (123B) | 72 ± 4 [63,81] | 75 | 0.146 | 74 ± 4 [65,83] | 77 | 0.273 | 81 ± 4 [72,88] | 84 |
| Llama3.3-8B | 62 ± 5 [53,71] | 65 | 0.807 | 63 ± 5 [55,72] | 66 | 0.999 | 65 ± 5 [57,74] | 68 |
| Llama3.3-70B | 76 ± 4 [67,84] | 79 | 0.212 | 73 ± 4 [63,81] | 76 | 0.081 | 83 ± 4 [75,89] | 86 |
| Llama3-Med42-8B | 67 ± 5 [58,77] | 70 | 0.263 | 67 ± 5 [59,77] | 70 | 0.383 | 75 ± 4 [66,84] | 78 |
| Llama3-Med42-70B | 72 ± 4 [63,80] | 75 | 0.263 | 75 ± 4 [67,83] | 78 | 0.705 | 79 ± 4 [71,87] | 82 |
| Llama4 Scout 16E | 76 ± 4 [67,85] | 79 | 0.392 | 80 ± 4 [72,88] | 83 | 0.999 | 81 ± 4 [73,88] | 84 |
| DeepSeek R1-70B | 78 ± 4 [70,86] | 81 | 0.859 | 76 ± 4 [67,84] | 79 | 0.662 | 80 ± 4 [72,88] | 83 |
| DeepSeek R1 (671B) | 82 ± 4 [74,89] | 85 | 0.859 | 79 ± 4 [71,87] | 82 | 0.999 | 80 ± 4 [72,88] | 83 |
| DeepSeek-V3 (671B) | 76 ± 4 [67,84] | 79 | 0.106 | 80 ± 4 [72,88] | 83 | 0.273 | 86 ± 4 [78,92] | 89 |
| Qwen 2.5-0.5B | 37 ± 5 [27,46] | 38 | 0.726 | 46 ± 5 [37,56] | 48 | 0.737 | 42 ± 5 [32,52] | 43 |
| Qwen 2.5-3B | 54 ± 5 [44,63] | 56 | 0.146 | 53 ± 5 [43,62] | 55 | 0.171 | 65 ± 5 [56,74] | 68 |
| Qwen 2.5-7B | 55 ± 5 [45,64] | 57 | 0.041 | 59 ± 5 [49,68] | 61 | 0.171 | 71 ± 4 [62,80] | 74 |
| Qwen 2.5-14B | 68 ± 4 [59,77] | 71 | 0.752 | 67 ± 5 [57,76] | 69 | 0.549 | 72 ± 4 [63,81] | 75 |
| Qwen 2.5-70B | 70 ± 5 [62,79] | 73 | 0.185 | 73 ± 4 [64,82] | 76 | 0.599 | 78 ± 4 [70,86] | 81 |
| Qwen 3-8B | 66 ± 5 [57,75] | 69 | 0.157 | 73 ± 4 [65,81] | 76 | 0.862 | 76 ± 4 [68,84] | 79 |
| Qwen 3-235B | 82 ± 4 [74,89] | 85 | 0.999 | 84 ± 4 [75,90] | 87 | 0.999 | 83 ± 4 [75,89] | 86 |
| GPT-3.5-turbo | 57 ± 5 [47,66] | 59 | 0.146 | 62 ± 5 [53,71] | 64 | 0.540 | 68 ± 5 [60,77] | 71 |
| GPT-4-turbo | 76 ± 4 [67,84] | 79 | 0.999 | 76 ± 4 [67,84] | 79 | 0.999 | 77 ± 4 [69,85] | 80 |
| o3 | 86 ± 4 [78,92] | 89 | 0.781 | 85 ± 4 [77,91] | 88 | 0.705 | 88 ± 3 [81,93] | 91 |
| GPT-5 | 82 ± 4 [74,89] | 85 | 0.097 | 80 ± 4 [72,88] | 83 | 0.081 | 88 ± 3 [82,94] | 92 |
| MedGemma-4B-it | 56 ± 5 [46,65] | 58 | 0.157 | 52 ± 5 [42,62] | 54 | 0.051 | 66 ± 5 [57,75] | 69 |
| MedGemma-27B-text-it | 71 ± 4 [62,79] | 74 | 0.146 | 75 ± 4 [66,84] | 78 | 0.438 | 81 ± 4 [73,88] | 84 |
| Gemma-3-4B-it | 46 ± 5 [37,56] | 48 | 0.094 | 53 ± 5 [43,62] | 55 | 0.273 | 62 ± 5 [52,71] | 64 |
| Gemma-3-27B-it | 65 ± 5 [57,75] | 68 | 0.157 | 66 ± 5 [58,75] | 69 | 0.270 | 76 ± 4 [67,85] | 79 |

  1. Accuracy is reported as mean ± standard deviation in percent, with 95% confidence intervals in brackets and the total number of correct answers per method alongside. Results are based on 104 questions, using bootstrapping with 1,000 repetitions, sampling with replacement while preserving pairing. For each model, P-values were calculated with McNemar's test on paired outcomes relative to RaR and adjusted for multiple comparisons using the false discovery rate; p < 0.05 was considered statistically significant.
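The statistical pipeline described in the footnote (paired bootstrap for accuracy confidence intervals, an exact McNemar's test on paired outcomes, and Benjamini-Hochberg false-discovery-rate adjustment) can be sketched as below. The per-question outcome vectors are not published in the table, so the `zero_shot` and `rar` arrays here are simulated placeholders; only the procedures themselves follow the footnote.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

# Hypothetical paired outcomes (True = correct) on 104 questions; the real
# per-question results are not in the table, so these are simulated.
n = 104
zero_shot = rng.random(n) < 0.76
rar = rng.random(n) < 0.83

def bootstrap_accuracy(outcomes, reps=1000, rng=rng):
    """Mean, SD, and 95% CI of accuracy (%) via bootstrap resampling.

    Resampling the question indices (rather than each method separately)
    is what preserves the pairing across methods.
    """
    idx = rng.integers(0, len(outcomes), size=(reps, len(outcomes)))
    accs = outcomes[idx].mean(axis=1) * 100
    return accs.mean(), accs.std(), np.percentile(accs, [2.5, 97.5])

def mcnemar_exact(a, b):
    """Exact McNemar's test: a binomial test on the discordant pairs."""
    only_a = int(np.sum(a & ~b))   # a correct, b wrong
    only_b = int(np.sum(~a & b))   # a wrong, b correct
    if only_a + only_b == 0:
        return 1.0
    return binomtest(only_a, only_a + only_b, 0.5).pvalue

def fdr_bh(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(adj, 0, 1)
    return out

mean_acc, sd_acc, ci = bootstrap_accuracy(zero_shot)
p_raw = mcnemar_exact(zero_shot, rar)
```

In practice one would collect the raw McNemar p-values for all models in a comparison (e.g. all zero-shot-vs-RaR tests) and pass them to `fdr_bh` together, since the adjustment depends on the whole family of tests.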