Table 1 Specifications of the language models evaluated in this study

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Parameters (billion) | Category | Accessibility | Knowledge cutoff date | Developer | Context length (thousand tokens) |
|---|---|---|---|---|---|---|
| Ministral-8B | 8 | IT | Open-source | October 2023 | Mistral AI | 128 |
| Mistral Large | 123 | IT | Open-source | November 2024 | Mistral AI | 128 |
| Llama3.3-8B | 8 | IT | Open-weights | March 2023 | Meta AI | 8 |
| Llama3.3-70B | 70 | IT | Open-weights | December 2023 | Meta AI | 128 |
| Llama3-Med42-8B | 8 | IT, clinically aligned | Open-weights | August 2024 | M42 Health AI Team | 8 |
| Llama3-Med42-70B | 70 | IT, clinically aligned | Open-weights | August 2024 | M42 Health AI Team | 8 |
| Llama4 Scout 16E | 17 | IT, 17B active parameters | Open-weights | August 2023 | Meta AI | 10,000 (10 M tokens) |
| DeepSeek R1-70B | 70 | Reasoning | Open-source | January 2025 | DeepSeek | 128 |
| DeepSeek-R1 | 671 | Reasoning | Open-source | January 2025 | DeepSeek | 128 |
| DeepSeek-V3 | 671 | Mixture of experts | Open-source | July 2024 | DeepSeek | 128 |
| Qwen 2.5-0.5B | 0.5 | IT | Open-source | September 2024 | Alibaba Cloud | 32 |
| Qwen 2.5-3B | 3 | IT | Open-source | September 2024 | Alibaba Cloud | 32 |
| Qwen 2.5-7B | 7 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 2.5-14B | 14 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 2.5-70B | 70 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 3-8B | 8 | Reasoning, mixture of experts | Open-source | December 2024 | Alibaba Cloud | 32 |
| Qwen 3-235B | 235 | Reasoning, mixture of experts | Open-source | July 2025 | Alibaba Cloud | 32 |
| GPT-3.5-turbo | Undisclosed | IT | Proprietary | September 2021 | OpenAI | 16 |
| GPT-4-turbo | Undisclosed | IT | Proprietary | December 2023 | OpenAI | 128 |
| o3 | Undisclosed | Reasoning | Proprietary | June 2024 | OpenAI | 200 |
| GPT-5 | Undisclosed | IT, reasoning | Proprietary | September 2024 | OpenAI | 128 |
| MedGemma-4B-it | 4 | Gemma 3-based, multimodal, IT, clinical reasoning | Open-weights | July 2025 | Google DeepMind | 128 |
| MedGemma-27B-text-it | 27 | Gemma 3-based, text only, IT, clinical reasoning | Open-weights | July 2025 | Google DeepMind | ≥ 128 |
| Gemma-3-4B-it | 4 | IT | Open-weights | August 2024 | Google DeepMind | 128 |
| Gemma-3-27B-it | 27 | IT | Open-weights | August 2024 | Google DeepMind | 128 |

Summary of the 25 LLMs assessed across zero-shot prompting, conventional online RAG, and the proposed radiology Retrieval and Reasoning (RaR). Listed for each model are the parameter count (in billions), training category (e.g., instruction-tuned (IT) or reasoning-optimized), accessibility, knowledge cutoff date, developer, and context length (in thousands of tokens). Evaluations were conducted between July 1 and August 22, 2025. GPT-5 is included as a widely used system-level benchmark rather than a single fixed model architecture, as it dynamically routes queries across underlying models depending on the task.