Table 3 Hallucination and relevance metrics for RaR-powered responses on the RadioRAG dataset (n = 104)

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Context relevant | Hallucination (relevant context, incorrect response) | Correct despite irrelevant context | Zero-shot incorrect → RaR correct |
| --- | --- | --- | --- | --- |
| Ministral-8B | 46% (48/104) | 14% (15/104) | 35% (36/104) | 26% (27/104) |
| Mistral Large (123B) | 46% (48/104) | 6% (6/104) | 40% (42/104) | 12% (13/104) |
| Llama3.3-8B | 46% (48/104) | 17% (18/104) | 37% (38/104) | 12% (13/104) |
| Llama3.3-70B | 46% (48/104) | 6% (6/104) | 42% (44/104) | 11% (11/104) |
| Llama3-Med42-8B | 46% (48/104) | 11% (11/104) | 39% (41/104) | 16% (17/104) |
| Llama3-Med42-70B | 46% (48/104) | 7% (7/104) | 39% (41/104) | 12% (13/104) |
| Llama4 Scout 16E | 46% (48/104) | 5% (5/104) | 39% (41/104) | 9% (9/104) |
| DeepSeek R1-70B | 46% (48/104) | 5% (5/104) | 38% (40/104) | 8% (8/104) |
| DeepSeek R1 (671B) | 46% (48/104) | 3% (3/104) | 37% (38/104) | 6% (6/104) |
| DeepSeek-V3 (671B) | 46% (48/104) | 4% (4/104) | 43% (45/104) | 12% (13/104) |
| Qwen 2.5-0.5B | 46% (48/104) | 26% (27/104) | 21% (22/104) | 21% (22/104) |
| Qwen 2.5-3B | 46% (48/104) | 13% (14/104) | 33% (34/104) | 21% (22/104) |
| Qwen 2.5-7B | 46% (48/104) | 12% (12/104) | 37% (38/104) | 23% (24/104) |
| Qwen 2.5-14B | 46% (48/104) | 10% (10/104) | 36% (37/104) | 15% (16/104) |
| Qwen 2.5-70B | 46% (48/104) | 5% (5/104) | 37% (38/104) | 12% (13/104) |
| Qwen 3-8B | 46% (48/104) | 6% (6/104) | 36% (37/104) | 17% (18/104) |
| Qwen 3-235B | 46% (48/104) | 5% (5/104) | 41% (43/104) | 6% (6/104) |
| GPT-3.5-turbo | 46% (48/104) | 13% (14/104) | 36% (37/104) | 21% (22/104) |
| GPT-4-turbo | 46% (48/104) | 9% (9/104) | 39% (41/104) | 8% (8/104) |
| o3 | 46% (48/104) | 2% (2/104) | 43% (45/104) | 3% (3/104) |
| GPT-5 | 46% (48/104) | 3% (3/104) | 45% (47/104) | 7% (7/104) |
| MedGemma-4B-it | 46% (48/104) | 17% (18/104) | 38% (39/104) | 20% (21/104) |
| MedGemma-27B-text-it | 46% (48/104) | 3% (3/104) | 38% (39/104) | 15% (16/104) |
| Gemma-3-4B-it | 46% (48/104) | 20% (21/104) | 36% (37/104) | 25% (26/104) |
| Gemma-3-27B-it | 46% (48/104) | 7% (7/104) | 37% (38/104) | 20% (21/104) |
| Average | 46% ± 0.0 | 9.2% ± 6.1 | 37.4% ± 4.9 | 14.3% ± 6.5 |

  1. “Context relevant” was evaluated at the dataset level: each question was labeled as having relevant or irrelevant retrieved context, and that label was shared across all models (48/104 questions were judged to have clinically appropriate context), which is why this column is identical for every model. “Hallucination” denotes an incorrect model answer despite relevant context. “Correct despite irrelevant context” counts correct answers given when the retrieved context was not clinically useful. The final column reports the percentage of questions answered incorrectly under zero-shot prompting but correctly with RaR.
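The four columns reduce to simple counts over per-question labels: a dataset-level context-relevance flag plus per-model correctness under RaR and under zero-shot prompting. A minimal sketch of that computation (the `QuestionResult` record and `table_metrics` helper are hypothetical illustrations, not the paper's evaluation pipeline, and the rounding may differ slightly from the published table):

```python
# Hypothetical per-question record; one list of these per model.
from dataclasses import dataclass

@dataclass
class QuestionResult:
    context_relevant: bool   # dataset-level label, shared across all models
    rar_correct: bool        # was the RaR-powered answer correct?
    zeroshot_correct: bool   # was the zero-shot answer correct?

def table_metrics(results: list[QuestionResult]) -> dict[str, str]:
    """Compute the four table columns as 'P% (k/n)' strings."""
    n = len(results)
    pct = lambda k: f"{round(100 * k / n)}% ({k}/{n})"
    return {
        # Fraction of questions whose retrieved context was relevant.
        "Context relevant": pct(sum(r.context_relevant for r in results)),
        # Incorrect despite relevant context.
        "Hallucination (relevant context, incorrect response)":
            pct(sum(r.context_relevant and not r.rar_correct for r in results)),
        # Correct even though context was not clinically useful.
        "Correct despite irrelevant context":
            pct(sum(not r.context_relevant and r.rar_correct for r in results)),
        # Questions RaR rescued from a zero-shot failure.
        "Zero-shot incorrect → RaR correct":
            pct(sum(not r.zeroshot_correct and r.rar_correct for r in results)),
    }
```

Note that the first two columns condition on the shared context label (so "Context relevant" is constant across models), while the last column compares the two prompting regimes per question.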