Fig. 2: Dataset construction and summary. | Nature Communications

From: Medical multimodal multitask foundation model for lung cancer screening

a The general data construction workflow consists of four steps: medical task definition, task-specific multimodal data collection, multimodal data processing and alignment, and multimodal question-answering construction. b The data used in this study were collected from two data centers, the National Lung Screening Trial (NLST) and the Medical Imaging and Data Resource Center (MIDRC), and two medical institutes, Wake Forest University School of Medicine (WFUSM) and Massachusetts General Hospital (MGH); their key characteristics are summarized here. From these data, a large volumetric computed tomography (CT) pretraining dataset and a simulated clinical dataset were constructed. The detailed configuration can be found in Supplementary Table 3. The blue boxes indicate the publicly available OpenM3Chest dataset. c Patient sex and age distributions of the data collected from the involved data centers; age is reported as mean ± standard deviation. d Distributions of the training, validation, and test datasets over all tasks. e Distributions of the independent evaluation datasets from MGH. f Distributions of the independent evaluation, full-dose (FD) CT, and fine-tuning datasets from WFUSM. CVD Cardiovascular Disease, Reticular/... /scar, reticular/reticulonodular opacities/honeycombing/fibrosis/scar where / means or, COVID-19 Coronavirus Disease 2019, Lung-RADS Lung CT Screening Reporting and Data System, CAC Coronary Artery Calcification. Source data are provided as a Source Data file.
