Table 1 Architectural comparison of our proposed model with baseline 3D medical vision-language models.
From: A data-efficient 3D medical vision-language model using only a 2D encoder
| Model Name | Architecture |
|---|---|
| RadFM<sup>5</sup> | 3D Vision Transformer (ViT) Encoder + Perceiver Module + MedLLaMA-13B LLM |
| M3D-LaMed<sup>4</sup> | 3D Vision Transformer (ViT) Encoder + 3D Spatial Pooling Perceiver + LLaMA-2-7B LLM |
| Med3DVLM<sup>6</sup> | DCFormer Vision Encoder + Dual-Stream MLP-Mixer Projector + Qwen2.5-7B-Instruct LLM |
| Ours | 2D Vision Encoder + Feature Enhancement and Token Compression + LLaMA-2-7B LLM |
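
To make the last row of the table concrete, the sketch below illustrates the general idea of processing a 3D volume with a 2D encoder: each slice is encoded independently, and the resulting tokens are then compressed before being passed to a language model. It is a toy illustration under stated assumptions, not the authors' implementation. The names `toy_encoder_2d`, `encode_volume`, and `compress_tokens` are hypothetical, the patch-mean "encoder" is only a stand-in for a real 2D vision encoder, averaging across adjacent slices is an assumed compression scheme, and the feature enhancement step is omitted because the table does not specify it.

```python
from statistics import fmean
from typing import List, Sequence

Token = List[float]        # one feature vector
Slice = List[List[float]]  # one 2D slice of shape H x W


def toy_encoder_2d(slice_2d: Slice, patch: int = 2) -> List[Token]:
    """Hypothetical stand-in for a 2D vision encoder.

    Emits one 1-dimensional token per non-overlapping patch (the patch mean).
    """
    h, w = len(slice_2d), len(slice_2d[0])
    tokens: List[Token] = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            vals = [
                slice_2d[i][j]
                for i in range(r, min(r + patch, h))
                for j in range(c, min(c + patch, w))
            ]
            tokens.append([fmean(vals)])
    return tokens


def encode_volume(volume: Sequence[Slice]) -> List[List[Token]]:
    """Apply the 2D encoder independently to every slice of the volume."""
    return [toy_encoder_2d(s) for s in volume]


def compress_tokens(per_slice: List[List[Token]], group: int = 2) -> List[Token]:
    """Assumed compression: average tokens that share a patch position
    across each run of `group` adjacent slices."""
    out: List[Token] = []
    for start in range(0, len(per_slice), group):
        chunk = per_slice[start:start + group]
        for pos in range(len(chunk[0])):
            dim = len(chunk[0][pos])
            out.append([fmean([s[pos][d] for s in chunk]) for d in range(dim)])
    return out


# Toy volume: 4 slices of 4 x 4 voxels, where slice k is filled with the value k.
volume = [[[float(k)] * 4 for _ in range(4)] for k in range(4)]
per_slice_tokens = encode_volume(volume)              # 4 slices x 4 tokens
visual_tokens = compress_tokens(per_slice_tokens, 2)  # 8 tokens after compression
```

In this toy setting, slice-wise encoding yields 16 tokens in total, and grouping slices in pairs halves that to 8. The sequence length presented to the language model is therefore controlled by the compression step rather than by a 3D encoder.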