Fig. 4: Comprehensive analysis of tumor segmentation performance and data scaling effects.
From: Large-scale generative tumor synthesis in computed tomography images for improving tumor recognition

a Overall Dice score comparison with baseline tumor segmentation models8,48–51. We conduct a five-fold evaluation on 12 downstream datasets (number of volumes n = 3686). Box plots show the mean (center), 25th and 75th percentiles (bounds of box), and minima to maxima (whiskers). Significance levels at which FreeTumor outperforms the baseline SwinUNETR51 under a two-sided paired t test are ***p < 1 × 10⁻³ and ****p < 1 × 10⁻⁴. Exact p-values for the comparison between FreeTumor and SwinUNETR51 are: p = 6.048 × 10⁻⁷ for liver tumors, p = 4.017 × 10⁻⁷ for pancreas tumors, p = 1.043 × 10⁻⁵ for kidney tumors, p = 7.366 × 10⁻⁵ for lung tumors, and p = 9.062 × 10⁻⁴ for COVID-19. b Qualitative segmentation results of FreeTumor. Organ segmentation results are also shown for better visualization. c–g The effectiveness of scaling up training datasets. We evaluate the correlation between the scale of the segmentation training data and segmentation performance. Note that the foundation models53,54,70 cannot use unlabeled data in segmentation training, so their segmentation training data scale is the same as that of the baseline models8,48–51. h Comparison between FreeTumor and previous methods8,38,46,48–51 in data utilization. We assess these methods along three dimensions: the scale of the training datasets (number of CT volumes), the use of unlabeled data in synthesis training, and the use of unlabeled data in segmentation training. Source data are provided as a Source Data file.
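
For readers unfamiliar with the two quantities named in panel a, the following is a minimal sketch of the Dice score and of the test statistic behind a paired t test. It is not the authors' evaluation code. The function names `dice_score` and `paired_t_statistic` are hypothetical, masks are simplified to flat lists of 0/1 values, and treating two empty masks as a perfect match is an assumed convention. The p-value itself is omitted because it requires the Student t distribution with n − 1 degrees of freedom.

```python
import math
from statistics import mean, stdev


def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean(d) / (sd(d) / sqrt(n)), d = a - b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Illustrative values only, not data from the figure.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
example_t = paired_t_statistic([0.8, 0.7, 0.9], [0.7, 0.5, 0.6])
```

In this sketch, each element of `scores_a` and `scores_b` would be the Dice score of two methods on the same case, which is what makes the test paired.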