Fig. 4: Comparative accuracy distributions and inference-time multipliers for zero-shot versus RaR strategies across model groups (RadioRAG dataset).

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

Accuracy results are shown for a small-scale models (Ministral-8B, Gemma-3-4B-it, Qwen2.5-7B, Qwen2.5-3B, Qwen2.5-0.5B, Qwen3-8B, Llama3-8B), b large models (o3, GPT-5, DeepSeek-R1, Qwen3-235B, GPT-4-turbo, DeepSeek-V3), c mid-sized models (GPT-3.5-turbo, Llama3.3-70B, MistralLarge, Qwen2.5-70B, Llama4Scout16E, Gemma-3-27B-it, DeepSeek-R1-70B), d the Qwen2.5 family across different parameter sizes (Qwen2.5-70B, 14B, 7B, 3B, and 0.5B), and e medically fine-tuned models (MedGemma27B-text-it, MedGemma4B-it, Llama3-Med42-70B, Llama3-Med42-8B). f Distribution of RaR-to-zero-shot runtime multipliers (× slower/faster) across all models. All comparisons were performed on the RadioRAG benchmark dataset (n = 104). Boxplots display accuracy (%) distributions (n = 1000) for zero-shot (orange) and RaR (blue): boxes span Q1–Q3, the central line marks the median (Q2), whiskers extend to 1.5 × IQR, and dots mark outliers. The line chart shows mean accuracy versus model size for zero-shot (green), onlineRAG (orange), and RaR (purple) across the Qwen2.5 family. For each model, p-values were calculated using McNemar's test on paired per-question outcomes relative to RaR and were adjusted for multiple comparisons using the false discovery rate. A p < 0.05 was considered statistically significant.
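For readers who wish to reproduce the significance testing described above, the following is a minimal sketch of the procedure: McNemar's test on paired per-question correctness indicators (RaR versus a baseline strategy on the same n = 104 questions), followed by false-discovery-rate (Benjamini–Hochberg) adjustment across models. All variable names and the simulated correctness data are illustrative placeholders, not values from the paper; the choice of the exact binomial variant of the test is likewise an assumption.

```python
# Sketch of McNemar's test with FDR adjustment, as described in the legend.
# Data below are simulated; only the procedure mirrors the figure's methods.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

def mcnemar_p(rar_correct: np.ndarray, baseline_correct: np.ndarray) -> float:
    """P-value for paired binary outcomes on the same set of questions."""
    # Build the 2x2 agreement table between the two strategies.
    both = np.sum(rar_correct & baseline_correct)
    rar_only = np.sum(rar_correct & ~baseline_correct)
    base_only = np.sum(~rar_correct & baseline_correct)
    neither = np.sum(~rar_correct & ~baseline_correct)
    table = [[both, rar_only], [base_only, neither]]
    # Exact binomial test is suitable when discordant counts are small
    # (assumption; the paper does not state which variant was used).
    return mcnemar(table, exact=True).pvalue

# Per-model p-values, then FDR (Benjamini-Hochberg) correction.
rng = np.random.default_rng(0)
raw_p = []
for _ in range(7):  # e.g., the seven small-scale models in panel a
    rar = rng.random(104) < 0.75        # simulated correctness indicators
    zero_shot = rng.random(104) < 0.60
    raw_p.append(mcnemar_p(rar, zero_shot))

reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(p_adj, reject)  # adjusted p-values; significant where p_adj < 0.05
```

McNemar's test is the natural choice here because both strategies answer the identical question set, so only the discordant pairs (questions one strategy gets right and the other gets wrong) carry information about their difference.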
