Extended Data Fig. 4: Comparing BiomedParse with biomedical-specific text prompt segmentation models.

Bar plot comparing BiomedParse with biomedical-specific text prompt segmentation models across different organs on CT in terms of Dice score. Each bar shows the mean of the distribution, with error bars indicating the 95% confidence interval. The sample sizes for the target organs are as follows: n = 27,779 for All, n = 4,409 for Aorta, n = 864 for Bladder, n = 1,677 for Duodenum, n = 1,964 for Esophagus, n = 712 for Gallbladder, n = 4,105 for Inferior vena cava, n = 635 for Left adrenal gland, n = 1,776 for Left kidney, n = 4,648 for Liver, n = 1,345 for Pancreas, n = 571 for Right adrenal gland, n = 1,649 for Right kidney, n = 1,587 for Spleen, and n = 1,837 for Stomach. Asterisks indicate the significance level at which BiomedParse outperforms the best-competing method, with two-sided paired t-test p-value < 1 × 10⁻² for **, p-value < 1 × 10⁻³ for ***, p-value < 1 × 10⁻⁴ for ****. Exact p-values for the comparison between BiomedParse and SegVol are as follows: p-value < 2.23 × 10⁻³⁰⁸ for All, p-value < 1.86 × 10⁻⁵⁸ for Aorta, p-value < 1.73 × 10⁻⁷ for Bladder, p-value < 3.44 × 10⁻⁸⁶ for Duodenum, p-value < 5.00 × 10⁻¹⁸⁵ for Esophagus, p-value < 3.37 × 10⁻¹⁵ for Gallbladder, p-value < 6.28 × 10⁻⁹⁹ for Inferior vena cava, p-value < 5.08 × 10⁻¹⁰ for Left adrenal gland, p-value < 9.26 × 10⁻³¹ for Left kidney, p-value < 3.31 × 10⁻³⁷ for Liver, p-value < 2.27 × 10⁻⁵⁶ for Pancreas, p-value < 1.01 × 10⁻¹⁶ for Right adrenal gland, p-value < 2.98 × 10⁻²⁰ for Right kidney, p-value < 1.09 × 10⁻²⁰ for Spleen, and p-value < 4.68 × 10⁻²⁵ for Stomach.
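The quantities in this caption (per-sample Dice score, mean with 95% confidence interval, and a two-sided paired test) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names are hypothetical, the confidence interval uses a normal approximation, and the p-value uses a large-sample normal approximation to the paired t-test, which is reasonable only for sample sizes in the hundreds or more, as reported here.

```python
import math
import sys
from statistics import NormalDist, mean, stdev

def dice(pred, target):
    """Dice score between two binary masks given as flat 0/1 sequences."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval of the mean."""
    m = mean(scores)
    half = NormalDist().inv_cdf(0.975) * stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

def paired_test(a, b):
    """Paired t statistic and an approximate two-sided p-value.

    The p-value uses the standard normal in place of the t distribution
    (an approximation that is close only for large n).
    """
    diffs = [x - y for x, y in zip(a, b)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    # 2 * (1 - Phi(|t|)) written with erfc to keep precision in the tail.
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

# Toy per-sample Dice scores for two hypothetical models on the same samples.
model_a = [0.90, 0.85, 0.80, 0.95]
model_b = [0.80, 0.80, 0.70, 0.90]
t_stat, p_value = paired_test(model_a, model_b)

# Smallest positive normalized double, about 2.23e-308.
smallest_double = sys.float_info.min
```

The bound 2.23 × 10⁻³⁰⁸ reported for All matches the smallest positive normalized double-precision value (`sys.float_info.min`), which suggests, though the caption does not say so, that the computed p-value fell at or below floating-point resolution.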