Fig. 2: Employing clinical visual instruction tuning (CVIT) to fine-tune BrainGPT from the baseline Otter model.

From: Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation

Starting from the end-to-end Otter foundation model, we tested four distinct fine-tuning conditions: two were regular visual instruction tuning (RVIT), namely plain instruction and in-example instruction, and the other two were clinical visual instruction tuning (CVIT), namely template instruction and keyword instruction. To enable multi-image in-context learning, we formatted the data as image-instruction-answer triplets, arranging the instructions and binarized images into standardized JSON files using the Otter MIMIC-IT pipeline. Specifically, the Otter framework integrates visual data (via a frozen CLIP ViT-L/14 encoder) and language data (via the LLaMA-7B large language model) through a trainable perceiver resampler module. In the LLaMA-7B architecture, gated cross-attention layers are added to distribute attention evenly across volumetric CT scans. The parameters of the remaining modules, except for the input/output embeddings, are frozen to minimize training cost. Model efficacy was evaluated via traditional metrics, LLM-as-a-judge evaluation, and a FORTE scoring system tailored for radiology report generation (RRG) evaluation.
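The sketch below illustrates, in Python, the two steps described above: packing one image-instruction-answer triplet into a MIMIC-IT-style JSON record, and freezing all modules except the perceiver resampler, the gated cross-attention layers, and the input/output embeddings. The field names and the substring-based module matching are illustrative assumptions, not the published pipeline.

```python
# A minimal sketch (not the authors' code) of the triplet formatting and
# selective-freezing scheme described in the caption. All field names and
# module-name keywords below are assumptions for illustration only.
import json


def make_triplet(sample_id, image_ids, instruction, answer):
    """Pack one training example as an image-instruction-answer triplet."""
    return {
        "id": sample_id,            # unique sample identifier (assumed field name)
        "image_ids": image_ids,     # ordered CT slice references for multi-image in-context learning
        "instruction": instruction, # RVIT or CVIT prompt text
        "answer": answer,           # target report text
    }


record = make_triplet(
    "case_0001",
    ["case_0001_slice_00", "case_0001_slice_01"],
    "Describe the findings in this brain CT volume.",
    "No acute intracranial hemorrhage or midline shift.",
)
print(json.dumps(record, indent=2))


def freeze_for_instruction_tuning(model, trainable_keywords=("perceiver", "gated_cross_attn", "embed")):
    """Freeze every parameter whose name does not match a trainable module keyword.

    `model` is any PyTorch module; the keyword list is a hypothetical naming
    convention standing in for the resampler, gated cross-attention, and
    input/output embedding parameters kept trainable in the caption.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
```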