Fig. 3: Factuality assessment of LLM responses on the RadioRAG dataset.
From: Multi-step retrieval and reasoning improves radiology question answering with large language models

Each bar plot shows the proportion of cases per model falling into a specific factuality category, with models ordered by descending percentage. Comparisons were based on the RadioRAG benchmark dataset (n = 104). a Hallucinations: Cases in which the provided context was relevant, but the model still generated an incorrect response (context = 1, response = 0). b Context irrelevance tolerance: Cases where the model produced a correct response despite the retrieved context being unhelpful or irrelevant (context = 0, response = 1). c RaR correction: Instances where the zero-shot response was incorrect but the RaR strategy produced a correct response (zero-shot = 0, RaR = 1).
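The three categories reduce to predicates over the binary scores named in the caption. A minimal sketch, assuming each case is a record of hypothetical flags `context` (retrieved context relevant?), `response` (RaR response correct?), and `zero_shot` (zero-shot response correct?), with 1 = yes and 0 = no:

```python
def categorize(case: dict) -> list[str]:
    """Return the factuality categories a scored case falls into.

    Flag names are illustrative, not from the paper's code:
    'context', 'response' (RaR answer), 'zero_shot'.
    """
    cats = []
    # Panel a: relevant context, yet the model answered incorrectly.
    if case["context"] == 1 and case["response"] == 0:
        cats.append("hallucination")
    # Panel b: correct answer despite unhelpful or irrelevant context.
    if case["context"] == 0 and case["response"] == 1:
        cats.append("context_irrelevance_tolerance")
    # Panel c: zero-shot answer wrong, but RaR answer correct.
    if case["zero_shot"] == 0 and case["response"] == 1:
        cats.append("rar_correction")
    return cats

# Example: irrelevant context, wrong zero-shot answer, correct RaR answer.
print(categorize({"context": 0, "response": 1, "zero_shot": 0}))
# → ['context_irrelevance_tolerance', 'rar_correction']
```

Note that panels b and c can overlap, as in the example above; the bar plots report each category's proportion independently.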