Table 1 Specifications of the language models evaluated in this study

From: Multi-step retrieval and reasoning improves radiology question answering with large language models

| Model name | Parameters (billion) | Category | Accessibility | Knowledge cutoff date | Developer | Context length (thousand tokens) |
|---|---|---|---|---|---|---|
| Ministral-8B | 8 | IT | Open-source | October 2023 | Mistral AI | 128 |
| Mistral Large | 123 | IT | Open-source | November 2024 | Mistral AI | 128 |
| Llama3.3-8B | 8 | IT | Open-weights | March 2023 | Meta AI | 8 |
| Llama3.3-70B | 70 | IT | Open-weights | December 2023 | Meta AI | 128 |
| Llama3-Med42-8B | 8 | IT, clinically aligned | Open-weights | August 2024 | M42 Health AI Team | 8 |
| Llama3-Med42-70B | 70 | IT, clinically aligned | Open-weights | August 2024 | M42 Health AI Team | 8 |
| Llama4 Scout 16E | 17 | IT, 17B active parameters | Open-weights | August 2023 | Meta AI | 10,000 (10 M tokens) |
| DeepSeek R1-70B | 70 | Reasoning | Open-source | January 2025 | DeepSeek | 128 |
| DeepSeek-R1 | 671 | Reasoning | Open-source | January 2025 | DeepSeek | 128 |
| DeepSeek-V3 | 671 | Mixture of experts | Open-source | July 2024 | DeepSeek | 128 |
| Qwen 2.5-0.5B | 0.5 | IT | Open-source | September 2024 | Alibaba Cloud | 32 |
| Qwen 2.5-3B | 3 | IT | Open-source | September 2024 | Alibaba Cloud | 32 |
| Qwen 2.5-7B | 7 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 2.5-14B | 14 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 2.5-70B | 70 | IT | Open-source | September 2024 | Alibaba Cloud | 131 |
| Qwen 3-8B | 8 | Reasoning, mixture of experts | Open-source | December 2024 | Alibaba Cloud | 32 |
| Qwen 3-235B | 235 | Reasoning, mixture of experts | Open-source | July 2025 | Alibaba Cloud | 32 |
| GPT-3.5-turbo | Undisclosed | IT | Proprietary | September 2021 | OpenAI | 16 |
| GPT-4-turbo | Undisclosed | IT | Proprietary | December 2023 | OpenAI | 128 |
| o3 | Undisclosed | Reasoning | Proprietary | June 2024 | OpenAI | 200 |
| GPT-5 | Undisclosed | IT, reasoning | Proprietary | September 2024 | OpenAI | 128 |
| MedGemma-4B-it | 4 | Gemma 3-based, multimodal, IT, clinical reasoning | Open-weights | July 2025 | Google DeepMind | 128 |
| MedGemma-27B-text-it | 27 | Gemma 3-based, text only, IT, clinical reasoning | Open-weights | July 2025 | Google DeepMind | ≥ 128 |
| Gemma-3-4B-it | 4 | IT | Open-weights | August 2024 | Google DeepMind | 128 |
| Gemma-3-27B-it | 27 | IT | Open-weights | August 2024 | Google DeepMind | 128 |

Summary of the 25 LLMs assessed across zero-shot prompting, conventional online RAG, and the proposed radiology Retrieval and Reasoning (RaR). Listed for each model are the parameter count (in billions), training category (e.g., instruction-tuned (IT) or reasoning-optimized), accessibility, knowledge cutoff date, developer, and context length (in thousands of tokens). Evaluations were conducted between July 1 and August 22, 2025. GPT-5 is included as a widely used system-level benchmark rather than a single fixed model architecture, as it dynamically routes queries across underlying models depending on the task.