Table 1. Architectural comparison of our proposed model with baseline 3D medical vision-language models.

From: A data-efficient 3D medical vision-language model using only a 2D encoder

| Model Name | Architecture |
| --- | --- |
| RadFM [5] | 3D Vision Transformer (ViT) Encoder + Perceiver Module + MedLLaMA-13B LLM |
| M3D-LaMed [4] | 3D Vision Transformer (ViT) Encoder + 3D Spatial Pooling Perceiver + LLaMA-2-7B LLM |
| Med3DVLM [6] | DCFormer Vision Encoder + Dual-Stream MLP-Mixer Projector + Qwen2.5-7B-Instruct LLM |
| Ours | 2D Vision Encoder + Feature Enhancement and Token Compression + LLaMA-2-7B LLM |