Table 3. Hallucination and relevance metrics for RaR-powered responses on the RadioRAG dataset (n = 104)
From: Multi-step retrieval and reasoning improves radiology question answering with large language models
Model name | Context relevant | Hallucination (relevant context, incorrect response) | Correct despite irrelevant context | Zero-shot incorrect → RaR correct |
---|---|---|---|---|
Ministral-8B | 46% (48/104) | 14% (15/104) | 35% (36/104) | 26% (27/104) |
Mistral Large (123B) | 46% (48/104) | 6% (6/104) | 40% (42/104) | 12% (13/104) |
Llama3.3-8B | 46% (48/104) | 17% (18/104) | 37% (38/104) | 12% (13/104) |
Llama3.3-70B | 46% (48/104) | 6% (6/104) | 42% (44/104) | 11% (11/104) |
Llama3-Med42-8B | 46% (48/104) | 11% (11/104) | 39% (41/104) | 16% (17/104) |
Llama3-Med42-70B | 46% (48/104) | 7% (7/104) | 39% (41/104) | 12% (13/104) |
Llama4 Scout 16E | 46% (48/104) | 5% (5/104) | 39% (41/104) | 9% (9/104) |
DeepSeek R1-70B | 46% (48/104) | 5% (5/104) | 38% (40/104) | 8% (8/104) |
DeepSeek R1 (671B) | 46% (48/104) | 3% (3/104) | 37% (38/104) | 6% (6/104) |
DeepSeek-V3 (671B) | 46% (48/104) | 4% (4/104) | 43% (45/104) | 12% (13/104) |
Qwen 2.5-0.5B | 46% (48/104) | 26% (27/104) | 21% (22/104) | 21% (22/104) |
Qwen 2.5-3B | 46% (48/104) | 13% (14/104) | 33% (34/104) | 21% (22/104) |
Qwen 2.5-7B | 46% (48/104) | 12% (12/104) | 37% (38/104) | 23% (24/104) |
Qwen 2.5-14B | 46% (48/104) | 10% (10/104) | 36% (37/104) | 15% (16/104) |
Qwen 2.5-70B | 46% (48/104) | 5% (5/104) | 37% (38/104) | 12% (13/104) |
Qwen 3-8B | 46% (48/104) | 6% (6/104) | 36% (37/104) | 17% (18/104) |
Qwen 3-235B | 46% (48/104) | 5% (5/104) | 41% (43/104) | 6% (6/104) |
GPT-3.5-turbo | 46% (48/104) | 13% (14/104) | 36% (37/104) | 21% (22/104) |
GPT-4-turbo | 46% (48/104) | 9% (9/104) | 39% (41/104) | 8% (8/104) |
o3 | 46% (48/104) | 2% (2/104) | 43% (45/104) | 3% (3/104) |
GPT-5 | 46% (48/104) | 3% (3/104) | 45% (47/104) | 7% (7/104) |
MedGemma-4B-it | 46% (48/104) | 17% (18/104) | 38% (39/104) | 20% (21/104) |
MedGemma-27B-text-it | 46% (48/104) | 3% (3/104) | 38% (39/104) | 15% (16/104) |
Gemma-3-4B-it | 46% (48/104) | 20% (21/104) | 36% (37/104) | 25% (26/104) |
Gemma-3-27B-it | 46% (48/104) | 7% (7/104) | 37% (38/104) | 20% (21/104) |
Average | 46% ± 0.0 | 9.2% ± 6.1 | 37.4% ± 4.9 | 14.3% ± 6.5 |
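
The arithmetic behind the table can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes that each percentage is the raw count divided by n = 104 and rounded to a whole number, that the Average row is the mean of those rounded percentages across the 25 models, and that the ± value is a sample standard deviation. The names `hallucination_counts` and `to_percent` are hypothetical; the counts are copied from the hallucination column above.

```python
from statistics import mean, stdev

N_QUESTIONS = 104  # dataset size given in the table caption

# Hallucination counts per model, copied from the table rows in order
# (Ministral-8B first, Gemma-3-27B-it last).
hallucination_counts = [
    15, 6, 18, 6, 11, 7, 5, 5, 3, 4, 27, 14, 12,
    10, 5, 6, 5, 14, 9, 2, 3, 18, 3, 21, 7,
]


def to_percent(count, total=N_QUESTIONS):
    """Convert a raw count to a whole-number percentage, as displayed in the table."""
    return round(100 * count / total)


# Assumption: the Average row is computed over the rounded per-model percentages.
percents = [to_percent(c) for c in hallucination_counts]
mean_pct = mean(percents)
sd_pct = stdev(percents)  # assumption: sample (n - 1) standard deviation

print(f"{mean_pct:.1f}% ± {sd_pct:.1f}")
```

Under these assumptions the result rounds to 9.2% ± 6.1, which agrees with the Average row for the hallucination column. The same steps apply to the other columns by swapping in their counts.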