Fig. 1: Overview of the medical multimodal multitask foundation model (M3FM).
From: Medical multimodal multitask foundation model for lung cancer screening

a The M3FM architecture consists of four components: the Computed Tomography Vision Transformer (CTViT), the Text Transformer, the Task Encoder, and the Predictors. b Pretraining CTViT on multiscale CT volumes with voxel size-aware masked image modeling. Scale 1, Scale 2, ⋯, and Scale S denote S different image sizes. c Pretraining the Text Transformer with masked language modeling, where T1, T2, ⋯, and T5 denote a sequence of text tokens; T1, T2, and T4 are inputs; T3 and T5 are targets; and MASK is a special token indicating that the corresponding input token is masked. d Training the shared M3FM jointly with flexible multimodal and synergistic multitask learning using our distributed task-parallel strategy. Each device handles a single task with task-specific inputs, targets, and loss functions. D, N, and R denote the numbers of devices, tasks, and image regions, respectively. Different tasks may share the same multimodal inputs, as on devices 1 and 2, or use various multimodal or single-modality inputs, as on devices 2, 3, 4, and 5. e M3FM inference flexibly handles multiscale CT volumes (indicated by the rectangular boxes of different sizes and colors), clinical data, and multiple tasks. The colors of the CT bounding boxes match those of the corresponding questions and predicted answers. For example, to answer Question 16, M3FM takes as inputs the orange region of the CT volume (automatically localized by an organ localization model), the corresponding voxel size, and the clinical text data. Questions 17 and 18 are two examples of auxiliary information-retrieval tasks for clinical data modeling, which take only the clinical text as input. Question 19 predicts the Lung CT Screening Reporting and Data System (Lung-RADS) score from lung nodule descriptions in a radiology report. Abbreviations: reticular/…/scar, reticular/reticulonodular opacities/honeycombing/fibrosis/scar (where / means or); COVID-19, Coronavirus Disease 2019.
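To make the four-component composition in panel a concrete, here is a minimal sketch of a forward pass that fuses CT patch tokens, text tokens, and a task embedding before a task-specific predictor. All module internals, dimensions, and the final-token readout are illustrative assumptions standing in for the paper's actual CTViT, Text Transformer, Task Encoder, and Predictors.

```python
import torch
import torch.nn as nn

class M3FMSketch(nn.Module):
    """Hypothetical four-component composition: vision + text + task -> prediction."""
    def __init__(self, dim=256, n_answers=4):
        super().__init__()
        self.ctvit = nn.Linear(512, dim)           # stands in for the CTViT patch encoder
        self.text = nn.Embedding(1000, dim)        # stands in for the Text Transformer
        self.task_encoder = nn.Embedding(20, dim)  # one embedding per task/question
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.predictor = nn.Linear(dim, n_answers)  # task-specific head

    def forward(self, ct_patches, text_ids, task_id):
        # ct_patches: (B, N, 512); text_ids: (B, L); task_id: (B,)
        tokens = torch.cat([self.ctvit(ct_patches),
                            self.text(text_ids),
                            self.task_encoder(task_id).unsqueeze(1)], dim=1)
        fused = self.fusion(tokens)
        return self.predictor(fused[:, -1])        # read the prediction off the task token

model = M3FMSketch()
out = model(torch.randn(2, 10, 512), torch.randint(0, 1000, (2, 6)), torch.tensor([0, 3]))
```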
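Panel b's voxel size-aware masked image modeling can be sketched as follows: CT patch tokens are randomly masked, a learned embedding of the physical voxel size (dx, dy, dz) conditions every token, and the loss is computed only on the masked patches. The BERT-style mask-token scheme, shapes, and mask ratio here are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelAwareMIM(nn.Module):
    """Masked image modeling over CT patch tokens, conditioned on voxel size."""
    def __init__(self, patch_dim=512, embed_dim=256, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.voxel_embed = nn.Linear(3, embed_dim)   # (dx, dy, dz) in mm
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(embed_dim, patch_dim)  # reconstruct raw patch values

    def forward(self, patches, voxel_size):
        # patches: (B, N, patch_dim); voxel_size: (B, 3)
        tokens = self.patch_embed(patches)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        tokens = tokens + self.voxel_embed(voxel_size).unsqueeze(1)  # physical-scale conditioning
        recon = self.head(self.encoder(tokens))
        return F.mse_loss(recon[mask], patches[mask])  # loss on masked patches only

loss = VoxelAwareMIM()(torch.randn(2, 32, 512),
                       torch.tensor([[0.7, 0.7, 1.0], [1.0, 1.0, 2.5]]))
```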
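The masked language modeling setup in panel c, with T1, T2, and T4 as inputs and T3 and T5 as targets, reduces to a short sketch: replace the target positions with a MASK id and train the model to predict only those positions. Vocabulary size, dimensions, and the fixed mask positions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, EMBED = 1000, 0, 128

embed = nn.Embedding(VOCAB, EMBED)
layer = nn.TransformerEncoderLayer(EMBED, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(EMBED, VOCAB)

tokens = torch.randint(1, VOCAB, (1, 5))       # [T1, T2, T3, T4, T5]
inputs, targets = tokens.clone(), tokens.clone()
mask = torch.tensor([[False, False, True, False, True]])  # mask T3 and T5
inputs[mask] = MASK_ID                         # model sees T1, T2, MASK, T4, MASK

logits = lm_head(encoder(embed(inputs)))       # (1, 5, VOCAB)
loss = F.cross_entropy(logits[mask], targets[mask])  # predict only T3 and T5
```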
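Finally, the distributed task-parallel strategy in panel d can be sketched in a single process: each "device" holds one task with its own inputs, targets, and loss function, while gradients on the shared backbone accumulate across tasks before one synchronized update, mimicking an all-reduce. The task names, shapes, and losses below are hypothetical placeholders, not the paper's actual tasks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Linear(64, 32)                       # stands in for the shared M3FM backbone
heads = {"risk": nn.Linear(32, 2),               # task-specific predictors
         "size": nn.Linear(32, 1)}
losses = {"risk": lambda y, t: F.cross_entropy(y, t),
          "size": lambda y, t: F.mse_loss(y.squeeze(-1), t)}
batches = {"risk": (torch.randn(8, 64), torch.randint(0, 2, (8,))),
           "size": (torch.randn(8, 64), torch.randn(8))}

opt = torch.optim.Adam(list(shared.parameters())
                       + [p for h in heads.values() for p in h.parameters()])
opt.zero_grad()
for task, (x, t) in batches.items():             # in practice, one task per device
    loss = losses[task](heads[task](shared(x)), t)
    loss.backward()                              # grads on `shared` accumulate across tasks,
                                                 # mimicking an all-reduce over devices
opt.step()                                       # one synchronized update of shared weights
```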