Table 3 Hallucination and relevance metrics for RaR-powered responses on the RadioRAG dataset (n = 104)

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Context relevant | Hallucination (relevant context, incorrect response) | Correct despite irrelevant context | Zero-shot incorrect → RaR correct |
| --- | --- | --- | --- | --- |
| Ministral-8B | 46% (48/104) | 14% (15/104) | 35% (36/104) | 26% (27/104) |
| Mistral Large (123B) | 46% (48/104) | 6% (6/104) | 40% (42/104) | 12% (13/104) |
| Llama3.3-8B | 46% (48/104) | 17% (18/104) | 37% (38/104) | 12% (13/104) |
| Llama3.3-70B | 46% (48/104) | 6% (6/104) | 42% (44/104) | 11% (11/104) |
| Llama3-Med42-8B | 46% (48/104) | 11% (11/104) | 39% (41/104) | 16% (17/104) |
| Llama3-Med42-70B | 46% (48/104) | 7% (7/104) | 39% (41/104) | 12% (13/104) |
| Llama4 Scout 16E | 46% (48/104) | 5% (5/104) | 39% (41/104) | 9% (9/104) |
| DeepSeek R1-70B | 46% (48/104) | 5% (5/104) | 38% (40/104) | 8% (8/104) |
| DeepSeek R1 (671B) | 46% (48/104) | 3% (3/104) | 37% (38/104) | 6% (6/104) |
| DeepSeek-V3 (671B) | 46% (48/104) | 4% (4/104) | 43% (45/104) | 12% (13/104) |
| Qwen 2.5-0.5B | 46% (48/104) | 26% (27/104) | 21% (22/104) | 21% (22/104) |
| Qwen 2.5-3B | 46% (48/104) | 13% (14/104) | 33% (34/104) | 21% (22/104) |
| Qwen 2.5-7B | 46% (48/104) | 12% (12/104) | 37% (38/104) | 23% (24/104) |
| Qwen 2.5-14B | 46% (48/104) | 10% (10/104) | 36% (37/104) | 15% (16/104) |
| Qwen 2.5-70B | 46% (48/104) | 5% (5/104) | 37% (38/104) | 12% (13/104) |
| Qwen 3-8B | 46% (48/104) | 6% (6/104) | 36% (37/104) | 17% (18/104) |
| Qwen 3-235B | 46% (48/104) | 5% (5/104) | 41% (43/104) | 6% (6/104) |
| GPT-3.5-turbo | 46% (48/104) | 13% (14/104) | 36% (37/104) | 21% (22/104) |
| GPT-4-turbo | 46% (48/104) | 9% (9/104) | 39% (41/104) | 8% (8/104) |
| o3 | 46% (48/104) | 2% (2/104) | 43% (45/104) | 3% (3/104) |
| GPT-5 | 46% (48/104) | 3% (3/104) | 45% (47/104) | 7% (7/104) |
| MedGemma-4B-it | 46% (48/104) | 17% (18/104) | 38% (39/104) | 20% (21/104) |
| MedGemma-27B-text-it | 46% (48/104) | 3% (3/104) | 38% (39/104) | 15% (16/104) |
| Gemma-3-4B-it | 46% (48/104) | 20% (21/104) | 36% (37/104) | 25% (26/104) |
| Gemma-3-27B-it | 46% (48/104) | 7% (7/104) | 37% (38/104) | 20% (21/104) |
| Average | 46% ± 0.0 | 9.2% ± 6.1 | 37.4% ± 4.9 | 14.3% ± 6.5 |

  1. “Context relevant” was evaluated at the dataset level: each question was labeled as having relevant or irrelevant retrieved context, and that label was shared across all models (48/104 questions were judged to have clinically appropriate context), which is why this column is identical for every model. “Hallucination” denotes an incorrect model answer despite relevant context. “Correct despite irrelevant context” counts correct answers given when the retrieved context was not clinically useful. The final column reports the percentage of questions answered incorrectly under zero-shot prompting but correctly with RaR.
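The four columns reduce to simple counts over per-question labels: a dataset-level context-relevance flag plus per-model correctness under RaR and under zero-shot prompting. A minimal sketch of that computation (the `QuestionResult` record and `table_metrics` helper are hypothetical illustrations, not the paper's evaluation pipeline, and the rounding may differ slightly from the published table):

```python
# Hypothetical per-question record; one list of these per model.
from dataclasses import dataclass

@dataclass
class QuestionResult:
    context_relevant: bool   # dataset-level label, shared across all models
    rar_correct: bool        # was the RaR-powered answer correct?
    zeroshot_correct: bool   # was the zero-shot answer correct?

def table_metrics(results: list[QuestionResult]) -> dict[str, str]:
    """Compute the four table columns as 'P% (k/n)' strings."""
    n = len(results)
    pct = lambda k: f"{round(100 * k / n)}% ({k}/{n})"
    return {
        # Fraction of questions whose retrieved context was relevant.
        "Context relevant": pct(sum(r.context_relevant for r in results)),
        # Incorrect despite relevant context.
        "Hallucination (relevant context, incorrect response)":
            pct(sum(r.context_relevant and not r.rar_correct for r in results)),
        # Correct even though context was not clinically useful.
        "Correct despite irrelevant context":
            pct(sum(not r.context_relevant and r.rar_correct for r in results)),
        # Questions RaR rescued from a zero-shot failure.
        "Zero-shot incorrect → RaR correct":
            pct(sum(not r.zeroshot_correct and r.rar_correct for r in results)),
    }
```

Note that the first two columns condition on the shared context label (so "Context relevant" is constant across models), while the last column compares the two prompting regimes per question.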