Table 1 Architectural comparison of our proposed model with baseline 3D medical vision-language models.

Model Name	Architecture
RadFM⁵	3D Vision Transformer (ViT) Encoder + Perceiver Module + MedLLaMA-13B LLM
M3D-LaMed⁴	3D Vision Transformer (ViT) Encoder + 3D Spatial Pooling Perceiver + LLaMA-2-7B LLM
Med3DVLM⁶	DCFormer Vision Encoder + Dual-Stream MLP-Mixer Projector + Qwen2.5-7B-Instruct LLM
Ours	2D Vision Encoder + Feature Enhancement and Token Compression + LLaMA-2-7B LLM
