Fig. 2: Benchmarking of segmentation results.

a–c Evaluations based on manual annotations of cells in three independent wild-type samples (02–04) with seven time points per embryo. a The Dice ratio of the segmentations generated by 3DUNet, CellProfiler, FusionNet, RACE, SingleCellDetector, B-CShaper, and CShaper. Cell numbers are averaged at corresponding time points for each of the three embryos. Significance levels are derived from a one-sided Wilcoxon rank-sum test over n = 3 independent embryos (n.s., not significant, p > 0.10; *p ≤ 0.10; **p ≤ 0.05); error bars represent the standard deviation (SD). b The average Hausdorff distance between the segmentation results produced by these methods and the ground truth for each sample. Significance levels are derived from a one-sided Wilcoxon rank-sum test over n = 353, 261, and 470 independent cells for Samples 02–04, respectively (***p ≤ 0.01); error bars represent the SD. c Object-level F1 scores based on 1084 independent cells at different IoU thresholds. d, e Statistics for 17 additional samples (04–20), imaged and segmented across the 4- to 350-cell stages. The embryos' time scales are proportionally normalized to their average. d Distribution of the cell volume inconsistency coefficient (ρc) over time (tc). e The number and ratio of lost cells over developmental time, where the last time point of the four-cell stage is set as the starting time point (indicated with a black point). Each color represents an individual embryo. Solid and dashed lines denote the total number of cells identified by nucleus tracing (total) and the number of cells that were unsuccessfully segmented (lost), respectively. Source data are provided as a Source Data file.
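
For readers unfamiliar with the metrics in panels a and c, the following is a minimal sketch of how a Dice ratio and an object-level F1 score at a given IoU threshold are commonly computed from label images. It is not the authors' evaluation code. The function names `dice_ratio` and `object_f1`, the use of label 0 as background, and the greedy one-to-one matching rule are assumptions made for illustration; the caption does not specify how predicted and annotated cells were matched. The Hausdorff distance of panel b is not covered here.

```python
import numpy as np


def dice_ratio(pred_mask, true_mask):
    """Dice ratio 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    total = pred_mask.sum() + true_mask.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * np.logical_and(pred_mask, true_mask).sum() / total


def object_f1(pred_labels, true_labels, iou_threshold=0.5):
    """Object-level F1 = 2TP / (2TP + FP + FN).

    A predicted object is a true positive if its IoU with a not-yet-matched
    ground-truth object reaches iou_threshold (greedy matching, an assumption).
    Label 0 is treated as background (also an assumption).
    """
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    true_ids = [i for i in np.unique(true_labels) if i != 0]

    matched_true = set()
    tp = 0
    for p in pred_ids:
        p_mask = pred_labels == p
        for t in true_ids:
            if t in matched_true:
                continue
            t_mask = true_labels == t
            inter = np.logical_and(p_mask, t_mask).sum()
            union = np.logical_or(p_mask, t_mask).sum()  # never 0: both masks non-empty
            if inter / union >= iou_threshold:
                matched_true.add(t)
                tp += 1
                break

    fp = len(pred_ids) - tp  # predicted objects with no match
    fn = len(true_ids) - tp  # annotated objects that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Toy 1D "images": two annotated cells, three predicted objects.
true_labels = np.array([1, 1, 1, 1, 0, 2, 2, 2, 2, 0])
pred_labels = np.array([1, 1, 1, 0, 0, 2, 2, 0, 0, 3])

dice_cell_1 = dice_ratio(pred_labels == 1, true_labels == 1)
f1_loose = object_f1(pred_labels, true_labels, iou_threshold=0.5)
f1_strict = object_f1(pred_labels, true_labels, iou_threshold=0.7)
```

In this toy case, raising the IoU threshold from 0.5 to 0.7 turns the second predicted cell from a match into a miss, which lowers the F1 score. This is the kind of threshold dependence that panel c summarizes.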