Abstract
Multiplexed imaging techniques require the identification of different cell types in tissue. To realize their potential for cellular and molecular analysis, high-throughput and accurate analytical approaches are needed to parse vast amounts of data, particularly in clinical settings. Nuclear segmentation errors propagate through all downstream steps of cell phenotyping and single-cell spatial analyses. Here, we benchmark and compare the nuclear segmentation tools commonly used on multiplexed immunofluorescence data by evaluating their performance across 7 tissue types encompassing ~20,000 labeled nuclei from human tissue samples. Pre-trained deep learning models outperform classical nuclear segmentation algorithms. Overall, Mesmer is recommended as it exhibits the highest nuclear segmentation accuracy, with a 0.67 F1-score at an IoU threshold of 0.5 on our composite dataset. The pre-trained StarDist model is recommended when computational resources are limited, providing ~12x shorter run time on CPU and ~4x on GPU relative to Mesmer, but it struggles in dense nuclear regions.
Introduction
Multiplexed immunofluorescence (mIF) imaging improves on conventional immunohistochemistry (IHC) as a diagnostic tool to facilitate analysis of tissue composition, spatial biomarker distribution, and cell-cell interactions1,2,3,4,5,6,7,8,9,10,11,12, allowing for correlation with diagnostic and prognostic outcomes. There is also the added benefit of applying multiple markers on a single section when tissue samples are scarce13. Cell segmentation is an integral process in mIF workflows14,15 following which spatial molecular analysis is conducted16,17. First, individual nuclei are segmented within a tissue section. This is followed by either a fixed pixel expansion around the nuclei or the use of cytoplasm/cell membrane markers for subsequent whole-cell segmentation.
Classical nuclear segmentation techniques involving thresholding (the most common being Otsu thresholding18), morphological operations, and watershed algorithms19,20,21 are easier to implement and well known to the research and clinical community but can be limited in the accuracy they provide22. They also involve a large number of user-input parameters that vary from algorithm to algorithm, making standardized performance comparisons difficult. Proprietary software (e.g., inForm® (Akoya Biosciences) or HALO® (Indica Labs)), commonly used in clinical settings, offers a seamless graphical user interface to implement the entire pipeline of cell segmentation and phenotyping in one platform. However, such software is costly and does not provide much room for customization. Deep learning-based open-source algorithms23,24,25,26,27,28,29,30,31,32,33 are highly accurate, customizable, and more generalizable but can be non-intuitive to implement without computational expertise. For any given spatial analysis task, nuclear segmentation is an important component of the analytical pipeline and can have significant impacts on the biological insights drawn from the dataset, particularly if cells are erroneously counted or missed at the beginning of the analysis.
Any errors during nuclear segmentation can propagate into downstream analysis steps such as cell phenotyping, immune profiling, and prognosis prediction. In part, it is difficult to achieve accurate and reliable nuclear segmentation due to differences in nuclear size, shape, and density (the number of nuclei per unit area of tissue) across tissue types, imaging conditions of different microscopes, and illumination differences within a whole slide image (WSI)23,34. Classical algorithms suffer from these differences between WSIs, or even between different fields within a WSI, because they generally require parameter tuning to optimize performance (labor- and time-intensive, requiring expertise, with operator-dependent results). Pre-trained deep learning models suffer because their training data contain only a subset of these differences, and they tend to underperform when exposed to such alterations, although this can often be improved by further re-training. Segmentation is especially challenging in crowded regions of tissue, where nuclei overlap, making it difficult to demarcate one from another24,35. The challenge of nuclear segmentation has persisted for many years and continues to add unreliability to mIF workflows27,36.
Currently, there are no benchmarks in the field to quantitatively evaluate these algorithms for a given task making the mIF analytical pipeline subjective and qualitative. Therefore, it is crucial to be able to quantitatively evaluate the different nuclear segmentation algorithms37, and carefully choose the right platform for a given dataset and task.
In this study, we objectively compare the performance of commonly used nuclear segmentation algorithms/models on different datasets (spanning 7 human tissue types collected at different institutions and ~20,000 nuclei) and provide bounds on both their accuracy and computational efficiency. This provides an expectation for the upper bound of the accuracy of mIF workflows (e.g., single-cell spatial mapping and profiling) when using one of the algorithms discussed in this study. We provide recommendations as to the best algorithms/models to use when working with certain data types (e.g., tissue type, nuclear sparsity/density of the dataset) and constraints (e.g., GPU availability and computation time). To do this, we curated multiple datasets of spectrally unmixed DAPI signal with paired ground truth nuclear annotations (Fig. 1A). A subset of the datasets was generated in-house (Fig. 1A(ii–vi)) and has been made publicly available via Zenodo repository 1330671438 for benchmark analysis of other algorithms/models not considered in this study, while the rest was obtained from an already publicly available dataset39. We quantitatively evaluated the in-built nuclear segmentation algorithms of GUI-based open-source platforms (Fiji21,40, CellProfiler41,42,43,44, and QuPath45) and of one commercial software package with a proprietary nuclear segmentation algorithm (inForm® (Akoya Biosciences)), all of which utilize classical segmentation techniques (Fig. 1B, C). Similarly, we evaluated three pre-trained deep-learning models for nuclear segmentation (Mesmer23, Cellpose29, and StarDist28), which we implemented in Python (Fig. 1B, C). Largely, this study provides an expectation for and compares (using quantitative metrics, qualitative segmentation results, and two benchmark datasets) the performance of commonly available nuclear segmentation algorithms for use in the analysis of multiplexed immunofluorescence data of human tissue samples in translational workflows. Our work can be used as a guide to select the optimal nuclear segmentation platform given the type and nature of the data, along with an open-source implementation of the chosen algorithm without the need for licensed software. Additionally, all code and methods released through this manuscript can be directly translated to benchmark any other nuclear/cell segmentation algorithms on a new set of data.
A Nuclear segmentation is performed on spectrally unmixed DAPI signal. Two datasets were used in this study: in-house imaged DAPI ROIs (vi), for which binary ground truth nuclei masks were annotated (ii–v), and an external, publicly available dataset, which was merged with our in-house dataset (vii) for evaluation of parameter-free deep learning algorithms with enhanced statistical power. B Deep learning algorithms (implemented in Python) and classical algorithms (implemented in various platforms with a GUI) were evaluated. C ROIs were sampled from the fields (which were sampled from WSIs). The segmentation predictions of the different algorithms on the sampled ROIs were compared with the ground truth nuclear masks to compute object-level quantitative metrics for comparison and benchmarking of nuclear segmentation performance. Scale bars in red: A(i), C 100 μm; A(ii–v) 10 μm.
Results
Deep learning generally outperforms classical algorithms with varied performance between tissue types
We investigated the quantitative performance of the seven nuclear segmentation tools averaged over the entire in-house dataset, consisting of thousands of nuclei across three tissue types. Figure 2A(i) shows the average performance along with a 95% confidence interval across all images (three tissue types combined) at an IoU threshold of 0.5. Figure 2A(ii) illustrates the performance of the different platforms across a range of IoU threshold values, capturing how well each platform localizes nuclear boundaries; the corresponding area under the curve is shown in Fig. 2A(iii). Overall, it is evident that the deep learning platforms (Cellpose, Mesmer, StarDist) perform better than the morphological platforms, as represented both by a higher F1-score at an IoU of 0.5 and by a higher area under the curve (AUC) across IoU thresholds. Based on the results of the quantitative evaluation of the nuclear segmentation platforms, it is safe to conclude that Fiji and CellProfiler are limited in their accuracy relative to the other platforms. Furthermore, of the tested morphological operations-based platforms, QuPath's in-built segmentation tool (freely available) performs better than or comparably to inForm®, which is expensive software to license. Cellpose consistently outperformed the other platforms at IoU thresholds greater than 0.5 for our dataset.
A Evaluation results for the composite dataset, with results stratified by tissue type in B tonsil, C melanoma, and D breast. (i) Mean nuclear segmentation F1-score (averaged across the evaluation ROIs) at an intersection over union (IoU) threshold of 0.5 with error bars showing 95% confidence interval. The IoU threshold of 0.5 is the most lenient threshold required to ensure a maximum of one true positive predicted nucleus for each ground truth nucleus. (ii) Mean nuclear segmentation F1-score (averaged across the evaluation ROIs) at varying IoU thresholds. A higher IoU threshold results in a stricter condition for classifying a predicted nucleus as a true positive. (iii) The area under the curve (AUC) in (ii) for each segmentation algorithm. The AUC is an aggregate metric, a higher value of which indicates an algorithm’s ability to better segment nuclei on an object level (F1-score at low IoU thresholds) as well as accurately capturing nuclear morphology (F1-score at high IoU thresholds).
Next, we investigated the performance of the segmentation platforms for each individual tissue type in our cohort (Fig. 2B–D). Again, deep learning platforms performed better. However, different models emerged as the best choice for each tissue type. For tonsil tissue, Cellpose outperforms the other platforms and QuPath surpasses StarDist in segmentation F1-scores at all IoU thresholds, as shown in Fig. 2B(ii). While Mesmer matches the performance of Cellpose at low IoU thresholds for the tonsil dataset, it loses accuracy at stricter/higher thresholds (Fig. 2B(ii)). We noticed that the tonsil samples had more non-specific staining than the other samples, making them harder to segment, yet Cellpose still performed the segmentation task with high accuracy. For melanoma (skin), Mesmer outperforms the other platforms; however, StarDist performs better at stricter IoU thresholds (>0.75), as evident in Fig. 2C(i) and (ii). Mesmer also gives the best performance in breast tissue, not only at an IoU threshold of 0.5 but across all IoU thresholds, as evident in Fig. 2D. Cellpose did not show the same level of robustness in segmentation as it did on the tonsil dataset. We believe this is due to a preprocessing step in which Cellpose linearly scales the raw pixel intensities such that the 1st and 99th percentiles correspond to 0 and 1, respectively29. As a result, pixel intensity data with high variance and/or many outliers will cause Cellpose to perform poorly, as is observed with the breast data. Across all tissue types, the three deep learning models outperform the other morphological operations-based platforms. It is also evident that each platform handled touching nuclei, high or low nuclear density, varying nuclear sizes, and non-specific DAPI staining in intrinsically different ways, as described in Table 1. Since the deep learning models have been trained on large cohorts of nuclei spanning different types of images, they are more agnostic to these variations, making them consistently top performers. A brief description of the overall principle of nuclear segmentation for five of the top-performing platforms and their ease of implementation is given in Table 1.
Mean (averaged across the 60 evaluation ROIs) A F1-score, D precision, E recall, and F Jaccard index, evaluated at an IoU threshold of 0.5, with error bars showing the 95% confidence interval. The IoU threshold of 0.5 is the most lenient threshold required to ensure a maximum of one true positive predicted nucleus for each ground truth nucleus. B Boxes show the median and quartiles of the F1-scores (evaluated at 0.5 IoU threshold) across the 60 evaluation ROIs; whiskers extend up to 1.5 times the inter-quartile range, with outliers shown. C Mean F1-score (averaged across 60 ROIs) at varying IoU thresholds, with the area under the curve shown. A higher IoU threshold results in a stricter condition for classifying a predicted nucleus as a true positive. G Nuclear prediction computation time for each algorithm using CPU and GPU. Boxes show the median and quartiles of the computation time (n = 27 fields); whiskers extend up to 1.5 times the inter-quartile range, with outliers shown.
Evaluation of performance on the composite merged dataset and stratified by tissue type
Since the three pre-trained deep learning models outperformed all other platforms across tissue types and IoU thresholds on our in-house benchmarking dataset, we benchmarked their performance on a large, publicly available dataset to achieve higher statistical power. This was efficiently carried out as the deep learning models are parameter-free (no manual optimization required) and our pipeline is batch-operable in Python. We utilized our pipeline to evaluate the performance on the merged dataset of ~20,000 nuclei (Fig. 3A–F), spanning 7 different tissue types. It is evident from Fig. 3A, C that at an IoU threshold of 0.5 and across all IoU thresholds, Mesmer outperforms the other algorithms (mean F1-score at 0.5 IoU threshold = 0.67, AUC = 0.55). Cellpose (mean F1-score at 0.5 IoU threshold = 0.65, AUC = 0.53) was in between Mesmer and StarDist (mean F1-score at 0.5 IoU threshold = 0.63, AUC = 0.51) for nuclear segmentation performance (Fig. 3A, C). Figure 3B also illustrates that Mesmer did not have any outliers on the box plot, suggesting that its performance is consistent irrespective of the nature of the region of interest (ROI). However, Cellpose and StarDist exhibited extremely low performance on some ROIs, as shown by their respective outlier points on the box plot.
Figure 4 shows the performance of the three pre-trained models on the individual tissue types using the F1-score at 0.5 IoU threshold and the F1-score vs. IoU threshold AUC as the main metrics for evaluation. Mesmer had the highest performance for pancreas, skin, tongue, and breast while Cellpose had the best performance for tonsil and colon tissue types.
A Pancreas, B lung, C skin, D tonsil, E colon, F breast, and G tongue tissue type. (i) Mean F1-score (averaged across ROIs) evaluated at an IoU threshold of 0.5 with error bars showing 95% confidence interval. The IoU threshold of 0.5 is the most lenient threshold required to ensure a maximum of one true positive predicted nucleus for each ground truth nucleus. (ii) Mean F1-score (averaged across ROIs) at varying IoU thresholds, with area under the curve shown. A higher IoU threshold results in a stricter condition for classifying a predicted nucleus as a true positive.
StarDist accuracy drops in regions of high nuclear density
To investigate whether a model performs differently based on the nuclear sparsity/density of a region, we split the merged dataset into groups by a cut-off based on the median (~278 nuclei/10⁵ pixels), upper quartile (~368 nuclei/10⁵ pixels), or lower quartile (~200 nuclei/10⁵ pixels) nuclear density (Fig. 5). The StarDist model performs worse above the median nuclear density (Fig. 5C(i),(ii)) than below it (Fig. 5B(i),(ii)), shown by the drop in mean F1-score at 0.5 IoU threshold from 0.69 to 0.57 and the drop in F1-AUC from 0.57 to 0.46. This difference is even more pronounced above the upper quartile nuclear density (Fig. 5D(i),(ii)) versus below the lower quartile (Fig. 5A(i),(ii)), with the mean F1-score at 0.5 IoU threshold dropping from 0.67 to 0.47 and the F1-AUC from 0.56 to 0.39. Another observation to note is that the performance of Cellpose is the most stable/consistent across densities, even though the absolute performance mostly stays higher for Mesmer.
ROIs that have a nuclear density less than the lower quartile of the merged dataset (A) are regarded as sparse while ROIs that have a nuclear density more than the upper quartile (D) are regarded as dense. ROIs with nuclear density less than the median (B) and more than the median (C) are also analyzed. (i) Mean F1-score (averaged across ROIs) at varying IoU thresholds, with area under the curve shown. A higher IoU threshold results in a stricter condition for classifying a predicted nucleus as a true positive. Mean (averaged across ROIs) with error bars showing 95% confidence interval of (ii) F1-score, (iii) precision, (iv) recall, (v) Jaccard index are evaluated at an IoU threshold of 0.5. The IoU threshold of 0.5 is the most lenient threshold required to ensure a maximum of one true positive predicted nucleus for each ground truth nucleus.
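A minimal sketch of this density stratification, assuming binary ground truth masks in which touching nuclei are already separated by the 2-pixel gap described in Methods (the list `roi_masks` is a hypothetical input, not a variable from the released pipeline):

```python
import numpy as np
from skimage.measure import label

def nuclear_density(binary_mask):
    """Nuclei per 10^5 pixels for one ROI. Connected components count
    individual nuclei because touching nuclei are separated by a 2-pixel
    background gap in the masks."""
    n_nuclei = label(binary_mask > 0).max()
    return n_nuclei / binary_mask.size * 1e5

# roi_masks: hypothetical list of 2D binary ground truth masks (one per ROI).
densities = np.array([nuclear_density(m) for m in roi_masks])
q1, med, q3 = np.percentile(densities, [25, 50, 75])
sparse_rois = densities < q1  # below the lower quartile (~200 nuclei/10^5 px)
dense_rois = densities > q3   # above the upper quartile (~368 nuclei/10^5 px)
```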
Pre-trained models exhibit higher precision than recall with varying relative differences affecting performance
In almost all cases, all three pre-trained models exhibit a higher precision than recall (Figs. 5(iii),(iv) and 6(ii),(iii)). This means that the error in nuclear segmentation can be attributed to the pre-trained models generally producing more false negatives (under-segmented or completely missed nuclei) than false positives (over-segmented or falsely predicted nuclei). This is an interesting bias, and one that is observed with all three pre-trained models.
A Pancreas, B lung, C skin, D tonsil, E colon, F breast, and G tongue tissue type. Mean (averaged across ROIs) with error bars showing 95% confidence interval of (i) F1-score, (ii) precision, (iii) recall, (iv) Jaccard index are evaluated at an IoU threshold of 0.5. The IoU threshold of 0.5 is the most lenient threshold required to ensure a maximum of one true positive predicted nucleus for each ground truth nucleus.
In addition to performing the best on the composite dataset, Mesmer also performed better than Cellpose and StarDist models (as measured by the F1-score at 0.5 IoU threshold and the F1-score AUC) when stratified by the tissue type in pancreas (Fig. 6A), skin (Fig. 6C), breast (Fig. 6F), and tongue (Fig. 6G) tissue and in regions of low nuclear density (Fig. 5A). This is generally because Mesmer has the highest recall (sensitivity), especially when its precision is comparable to or lower than that of Cellpose and StarDist (where Cellpose and StarDist under-segment). A representative region of breast tissue and the performance of the three pre-trained models is shown in Fig. 7A.
A Breast tissue type and B tonsil tissue type with high nuclear density. (i) Spectrally unmixed mIF data has the (ii) DAPI channel extracted and (iii) ground truth nuclei annotated. The pre-trained models are applied to the DAPI channels to yield (iv) binary nuclear masks, which are (v) overlaid on the ground truth mask for comparison to visualize the model-derived differences in nuclear segmentation. The differences between the models in the (vi) recall and (vii) precision, which both constitute the F1-score, are also visualized. All F1-score, precision, and recall metrics are evaluated at an IoU threshold of 0.5, and nuclei that have an IoU lower than the threshold are counted as false positives in the precision and false negatives in the recall calculation. Scale bar in yellow, 25 µm.
Cellpose performed better than the Mesmer and StarDist models (as measured by F1-score at 0.5 IoU threshold and the F1-score AUC) in tonsil (Fig. 6D) and colon (Fig. 6E) tissue types and in regions of extremely high nuclear density (Fig. 5D). This is usually because Cellpose has the highest precision, especially when its recall is comparable to or lower than that of Mesmer and StarDist (where Mesmer and StarDist over-segment). A representative region of tonsil tissue (with high nuclear density) and this relative performance gain of Cellpose over Mesmer is shown in Fig. 7B.
Except for lung (Fig. 6B), StarDist performs worse than the Mesmer and/or Cellpose models (as measured by the F1-score at 0.5 IoU threshold and the F1-score AUC) in all other tissue types. This is usually because StarDist has the lowest recall, especially when its precision is comparable to or higher than that of Mesmer and Cellpose (StarDist has a bias to under-segment). This can especially be seen with tonsil tissue (Fig. 6D) and in regions of high nuclear density (Fig. 5C, D). A representative region of tonsil tissue (with high nuclear density) and this relative performance drop of StarDist is shown in Fig. 7B.
Another quantitative metric reported in this study to compare the pre-trained models is the Jaccard index (Figs. 5(v) and 6(iv)). It can be seen from their computation formulae (Eqs. (2) and (5)) that the Jaccard index and F1-score are mathematically related. One of the metrics can be calculated simply by knowing the value of the other. While the Jaccard index is objectively a stricter metric (its value will always be lower than the F1-score), the relative differences between the three models have a conserved trend between the Jaccard index and F1-score (Figs. 5 and 6).
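Concretely, writing both metrics in terms of true positive (TP), false positive (FP), and false negative (FN) counts makes this one-to-one relationship explicit:

$$\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad J = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \;\;\Longrightarrow\;\; J = \frac{\mathrm{F1}}{2 - \mathrm{F1}}$$

Because F1 ≤ 1, it follows that J ≤ F1, so the Jaccard index is always the stricter of the two while preserving the relative ranking between models.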
StarDist is the most computationally time-efficient and Cellpose benefits the most from GPU usage
Figure 3G shows the computation times for both CPU and GPU usage for the three pre-trained models on the external dataset, which consists of 27 fields sampled from 27 WSIs. One ROI (400 × 400 pixels) is sampled from each segmented field for performance evaluation. Five large fields from the in-house dataset were left out of this analysis because they were significantly larger than those in the external dataset which could confound the computational time analysis.
Figure 3G shows that the StarDist model has the lowest computation time for both GPU (~4x and ~15x shorter median run time than Mesmer and Cellpose, respectively) and CPU (~12x and ~89x shorter median run time than Mesmer and Cellpose, respectively). As such, StarDist is the recommended nuclear segmentation tool for studies where computational resources are constrained, especially when the data do not contain overly crowded regions of nuclei, where it tends to struggle with accuracy. It is worth noting that Cellpose sees the most improvement in processing time, with a ~12x shorter median run time for GPU over CPU usage. Mesmer's computation time for both CPU and GPU is in between that of Cellpose and StarDist, with a high level of accuracy for most tissue types and nuclear density levels. As such, Mesmer is the recommended model for general-purpose nuclear segmentation use cases.
Discussion
In this study, we quantitatively evaluated the performance of seven nuclear segmentation tools by comparing their predictions with ground truth nuclear annotations. We performed comparative analyses between the seven segmentation tools on an ROI level: F1-scores were calculated for each tool in each ROI at varying IoU thresholds, comparing hundreds of nuclei between the predicted and ground truth nuclear masks for each ROI. Based on the quantitative evaluation results on the in-house dataset, it is safe to conclude that Fiji and CellProfiler are limited in their accuracy relative to the other platforms. Further, we noticed that inForm® and QuPath performed quite well but, in most cases, were outperformed by newer deep-learning platforms.
Even though deep learning platforms performed the best across datasets, they are only slowly being adopted in clinical settings and molecular labs. Reasons for this barrier to applying deep learning models in clinical practice include their non-intuitive implementation and a lack of interpretability, reliability, and transparency, although studies are being conducted to improve interpretability46. Another barrier to clinical application is the frequent need to fine-tune models on new datasets, either because the tissue type differs from the data the models were originally trained on and/or because differences between institutions and protocols cause variability in stain quality and appearance47. Deep learning models like Cellpose and StarDist allow the training of custom models on user-obtained data for even better segmentation results28,29. Mesmer was trained on the largest dataset to date for nuclear annotations (TissueNet)23. TissueNet contains 2D nuclear annotations of 1.2 million nuclei from six imaging platforms, nine organs, and three species to allow generalizability.
In this study, we benchmark the performance of three pre-trained deep learning models, Cellpose, Mesmer, and StarDist, which are trained specifically to segment nuclei in fluorescence imaging. It is important to note that there are newly developed foundation models that can perform segmentation on multiple different imaging modalities and that boast higher nuclear segmentation accuracies. These foundation models can be applied for nuclear segmentation in H&E-stained data48, as well as fluorescence data49. Another benchmark study also compares different multimodality cell segmentation approaches with Cellpose50. There is also a novel embedding-based nucleus and cell segmentation approach that performs with high accuracy and low processing time51, with a separate channel-invariant architecture to specifically deal with the multi-channel nature of mIF data52. In this work, we have shown the increased nuclear segmentation accuracy that deep learning models provide compared to platforms using classical segmentation techniques. We evaluated the segmentation results on 60 ROIs and ~20,000 nuclei in a high-throughput manner. The developed pipeline provides all of the quantitative metrics in this study within 2 minutes. Additionally, we have benchmarked the nuclear segmentation accuracy of the pre-trained models of Cellpose, Mesmer, and StarDist for use in scalable studies with human tissue samples.
We primarily use an object-level F1-score, evaluated at multiple IoU thresholds to compare the different algorithms and models. For density, counts, and whole slide-level single-cell spatial analysis, the F1-score at an IoU threshold of 0.5 is a good metric as it captures the accuracy of the object detection task. For spatial profiling of mIF data using nuclear and membrane markers, it is important to accurately segment the boundaries of the nuclei as well as to capture the staining of markers in and around the nuclei so that the cell phenotype classifiers can accurately assign phenotypes to the cells. For studies where the morphology of individual nuclei is being studied, the F1-score at high IoU thresholds is a better metric. Therefore, we also define another composite metric: the area under the curve (AUC) of the F1-score across all IoU thresholds, providing a collective measure of both the accuracy of the object detection task and the segmented morphology.
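A minimal sketch of this composite AUC metric follows; the exact IoU threshold grid and whether the area is normalized by the threshold span are assumptions here, not specifications from the released pipeline:

```python
import numpy as np

# Hypothetical grid of IoU thresholds spanning both the lenient and strict
# regimes discussed above; the pipeline's exact grid may differ.
iou_thresholds = np.linspace(0.05, 0.95, 19)

# Dummy F1 values for one ROI (F1 typically decays as the threshold tightens).
f1_scores = np.clip(1.05 - 1.1 * iou_thresholds, 0.0, 1.0)

# Trapezoidal area under the F1 vs. IoU-threshold curve; dividing by the
# threshold span keeps the AUC on the same 0-1 scale as the F1-score itself.
auc = np.trapz(f1_scores, iou_thresholds) / (iou_thresholds[-1] - iou_thresholds[0])
print(f"F1-IoU AUC: {auc:.3f}")
```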
From the results of this study, it is clear that Cellpose model performance can be improved by re-annotating false negative predictions and Mesmer's can be improved by removing/editing false positive predictions. This can be done via a human-in-the-loop approach to improve model performance on specific datasets. In line with this, our lab has worked extensively with Cellpose on other datasets and has seen significant improvement in nuclear segmentation accuracy by providing fine-tuning annotations for the nuclei that the base Cellpose model misses/under-segments. However, the need for extensive training data in deep learning models is a challenge in their customization and evaluation. Research is being conducted on using a combination of classical segmentation techniques (Otsu thresholding and watershed algorithms) to automate and fast-track dataset annotation for deep learning segmentation model training and testing53,54, enabling their widespread use.
It is important to note that even though we are only considering nuclei in this study as we utilize the DAPI channel, the same pipeline can be extended for cell segmentation. Directly segmenting whole cells is difficult because whole cell morphologies are more variable than that of nuclei alone55, and ensuring adequate cytoplasm/membrane markers for mIF with a limited number of markers can be challenging. When cell membrane or cytoplasm markers are not available for every cell type in a dataset, a commonly used procedure to approximate the whole cell boundary from the DAPI channel is to have the cytoplasm be an area within a fixed radius from the nucleus56. Additionally, we benchmark and compare the nuclear segmentation performance of algorithms and pre-trained models on Vectra Polaris data. Future studies can translate our pipeline to other mIF technologies like CODEX and other imaging platforms.
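A minimal sketch of this fixed-radius whole-cell approximation, using scikit-image's expand_labels; the radius here is an illustrative value, not a recommendation from this study:

```python
import numpy as np
from skimage.segmentation import expand_labels

# Toy instance mask with two nuclei (labels 1 and 2); in practice, this is
# the integer label image produced by any of the benchmarked tools.
nuclei_labels = np.zeros((40, 40), dtype=int)
nuclei_labels[8:14, 8:14] = 1
nuclei_labels[8:14, 20:26] = 2

# Grow each nucleus outward by a fixed pixel radius to approximate the whole
# cell; expand_labels stops growth where neighboring labels would collide,
# so adjacent cells never overlap.
cell_labels = expand_labels(nuclei_labels, distance=5)
```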
Methods
Datasets
(1) In-house datasets: formalin-fixed paraffin-embedded (FFPE) tissue sections were used to acquire 6-plex multispectral images, aka "multiplexed immunofluorescence images," using the PhenoImager HT (Akoya Biosciences) and the 7-color Opal reagent kit. The data used in this study were collected at two different sites (University of Washington and Fred Hutchinson Cancer Center). Multiplexed immunofluorescence (mIF) whole slide images (WSIs) at ×20 magnification from 3 human tissue types (tonsil, melanoma, and breast) comprised this dataset. These were imaged across 5 different imaging runs. One multispectral field of view was chosen from each WSI using the Phenochart™ whole slide viewer software (Akoya Biosciences), and 4 subregions were sampled from each field. We also developed Python code to sample and extract a field from a WSI given its coordinates (a minimal sketch of this step appears after this list); the code is provided in the GitHub repository and eliminates the need for Phenochart. Therefore, we used a total of 20 images (256 × 256 pixels) for our nuclear segmentation performance analysis (3967 total ground truth nuclei). This sampling strategy was selected to incorporate heterogeneity across tissue samples and account for intra-tissue heterogeneity. We also attempted to sample the subregions across a range of nuclear densities. This curated dataset is freely available via the Zenodo repository 1330671438 for use by other developers and labs.
(2) Public datasets for benchmarking: the external benchmarking dataset is hosted on Synapse39. We included data from 6 human tissue types (Fig. 4B) spanning a total of 16,197 nuclei extracted from 40 images (400 × 400 pixels) sampled from 27 WSIs at ×20 magnification. The full external dataset contains data from three multiplex fluorescence imaging technologies: (1) sequential multiplex IF unmixing via Akoya Vectra/PhenoImager HT, (2) sequential multiplex IF narrowband capture via Ultivue InSituPlex with Zeiss Axioscan image capture, and (3) cyclic multiplex IF narrowband capture via Akoya CODEX. We incorporated the 16,197 nuclei from the 40 ROIs imaged using the Akoya PhenoImager HT system that had DAPI intensity signal paired with the ground truth data. While compatible with our pipeline, the Akoya CODEX and Ultivue InSituPlex data did not have enough nuclear annotations (as opposed to whole-cell annotations) to yield meaningful benchmarking using this pipeline.
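A minimal sketch of the field-extraction step referenced in (1), assuming a single-plane (Y, X) TIFF readable with tifffile; the released code in the GitHub repository may handle pyramid levels and multispectral planes differently:

```python
import tifffile

def extract_field(wsi_path, x, y, width, height):
    """Crop a rectangular field from a WSI given its top-left pixel
    coordinates. Loads the full-resolution plane into memory; for very
    large WSIs, tifffile's aszarr interface can read only the window."""
    wsi = tifffile.imread(wsi_path)  # assumes a single (Y, X) plane
    return wsi[y:y + height, x:x + width]

# Hypothetical usage: a 256 x 256 ROI with its top-left corner at (1024, 2048).
# roi = extract_field("slide.qptiff", x=1024, y=2048, width=256, height=256)
```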
Study approval
Only de-identified tissue samples were used, which do not classify as human subjects and did not require a separate IRB approval.
Hardware environment
The algorithms were run using two different hardware environments, as described below:
(1) CPU:
- System: Windows 11 Enterprise
- CPU: 12th Gen Intel(R) Core(TM) i7-12700K, 3.60 GHz
- RAM: 64 GB

(2) GPU:
- System: Windows 11 Enterprise
- GPU: NVIDIA GeForce RTX 3060, 12 GB
- RAM: 64 GB
- CPU: 12th Gen Intel(R) Core(TM) i7-12700K, 3.60 GHz
- CUDA: version 12.7
Preprocessing
We do not perform any pre-processing of the DAPI intensity signal before nuclear segmentation with any algorithm. However, the deep learning frameworks of Mesmer23, StarDist28, and Cellpose29 all perform some in-built normalization before implementing their respective segmentation algorithms.
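The in-built normalization differs between frameworks and is applied automatically by each. As an illustration only, a percentile rescaling in the style described for Cellpose in the Results (1st/99th percentiles mapped to 0 and 1) can be sketched as follows; this is a minimal sketch, not the exact routine any of the three frameworks implements:

```python
import numpy as np

def percentile_normalize(img, low=1.0, high=99.0):
    """Rescale intensities so the low/high percentiles map to 0 and 1.
    Values outside that range fall below 0 or above 1, which is why images
    with many outliers can degrade downstream segmentation."""
    lo, hi = np.percentile(img, [low, high])
    return (img.astype(np.float32) - lo) / max(hi - lo, 1e-6)
```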
Nuclear segmentation
Seven algorithms were compared in this study for their nuclear segmentation performance. Cellpose29 (pre-trained “nuclei” model), Mesmer23, and StarDist28 (pre-trained “2D_versatile_fluo” model) employ emerging pre-trained deep learning frameworks. The in-built nuclear segmentation tools of platforms that utilize classical morphological segmentation techniques were also implemented: Fiji21,40, CellProfiler41,42,43,44, inForm® (Akoya Biosciences), and QuPath45. inForm® and QuPath are two of the commonly used segmentation platforms, particularly for mIF data in labs and clinical settings14,57. An overview of our analysis pipeline is given in Fig. 1. The guidelines for implementing algorithms and the associated Python code are available through the Supplementary Note 1 and GitHub repository. Once the segmented masks were predicted from each algorithm, one-pixel erosion post-processing on connecting parts of distinct nuclei was implemented to visualize the overlaid segmented masks and ground truth masks for qualitative comparisons.
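For reference, a minimal sketch of how the three pre-trained models are typically invoked on a single-channel DAPI image is shown below. Function signatures vary across library versions, so treat these calls as indicative rather than as the exact pipeline code, which is available in the GitHub repository:

```python
import numpy as np

# Stand-in for one ROI of spectrally unmixed DAPI intensities.
dapi = np.random.rand(256, 256).astype(np.float32)

# StarDist, pre-trained "2D_versatile_fluo" model.
from stardist.models import StarDist2D
from csbdeep.utils import normalize
sd_model = StarDist2D.from_pretrained("2D_versatile_fluo")
sd_labels, _ = sd_model.predict_instances(normalize(dapi))

# Cellpose, pre-trained "nuclei" model (channels=[0, 0] means grayscale).
from cellpose import models
cp_model = models.Cellpose(model_type="nuclei")
cp_labels, *_ = cp_model.eval(dapi, channels=[0, 0])

# Mesmer (DeepCell), nuclear compartment. Mesmer expects a (batch, y, x, 2)
# stack of nuclear + membrane channels; a blank membrane channel is a common
# workaround for nuclear-only input.
from deepcell.applications import Mesmer
stack = np.stack([dapi, np.zeros_like(dapi)], axis=-1)[np.newaxis, ...]
mesmer_labels = Mesmer().predict(stack, compartment="nuclear")[0, ..., 0]
```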
Ground truth nuclei annotation
Nuclei were annotated in this study for the generation of the in-house dataset (Fig. 1A(vi)). In our evaluation of 2D nuclear segmentation, all the algorithms predict non-overlapping nuclei. To preserve this in the ground truth masks, we annotated the ground truth nuclei and split the overlapping nuclei such that the nucleus on top (determined observationally) incorporated the overlapping region while the nucleus underneath was truncated, so that the resulting two nuclei do not overlap. The GIMP image editor (https://www.gimp.org/) was used to create ground truth annotations. Boundaries of the nuclei were annotated using a 1 × 1 pixel square brush (Fig. 1A(iii)). The annotations were then filled with white (foreground) using a bucket fill tool, and touching regions of nuclei were separated by a 2-pixel gap (Fig. 1A(iv)). The opacity of the annotated binary mask layer was then turned up to 100% before exporting the ground truth binary mask as a .TIFF file (Fig. 1A(v)).
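The same 2-pixel separation convention is enforced programmatically on predicted label masks during post-processing (see Evaluation). A minimal sketch of one way to do this, assuming a 2D integer label image; the released pipeline's implementation may differ:

```python
import numpy as np
from scipy import ndimage as ndi

def separate_touching_labels(labels):
    """Clear a one-pixel rim on each side of every label-label interface so
    that touching nuclei end up separated by a 2-pixel background gap."""
    lab = labels.astype(np.int32)
    sentinel = lab.max() + 1
    # Maximum label in each 3x3 neighborhood (background 0 never wins).
    mx = ndi.maximum_filter(lab, size=3)
    # Minimum non-background label in each 3x3 neighborhood.
    mn = ndi.minimum_filter(np.where(lab == 0, sentinel, lab), size=3)
    # A labeled pixel sits on a label-label interface when two different
    # non-background labels meet within its neighborhood.
    interface = (lab > 0) & (mx != mn) & (mn != sentinel)
    out = lab.copy()
    out[interface] = 0
    return out
```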
Ground truth nuclei were used in this study from both the in-house and external datasets. Representative ground truth annotations are shown in Fig. 7 and Supplementary Fig. 1.
Evaluation
Nuclear segmentation performed by the different algorithms was evaluated in this study by measuring the difference between the nuclei mask output from the algorithms and a reference ground truth (annotated) nuclei mask. The primary evaluation metric in this study was an object-based F1-score. A higher F1-score indicates an algorithm's ability to more accurately segment nuclei on an object level. F1-scores are a commonly used metric for the evaluation of nuclear segmentation performance in mIF workflows36,58. The F1-scores in this study were computed as described by Caicedo et al.22, using a threshold on the overlapping area between the predicted and ground truth nuclei masks22. The intersection-over-union (IoU) between ground truth nuclei (T) and nuclei predicted by the individual algorithms (P) was first calculated as:

$$\mathrm{IoU}(T_i, P_j) = \frac{|T_i \cap P_j|}{|T_i \cup P_j|} \quad (1)$$
For n ground truth nuclei and m predicted nuclei, an n × m matrix was computed with all IoU values between ground truth and predicted nuclei. The resulting matrix was sparse because few prediction–ground truth nuclei pairs overlap enough to have non-zero IoU scores22. A true positive predicted nucleus was identified if it had an IoU greater than a pre-set threshold with a ground truth nucleus. Predicted nuclei that did not have an IoU with any ground truth nucleus greater than the threshold were identified as false positives, and ground truth nuclei that did not have an IoU with any predicted nucleus greater than the threshold were identified as false negatives. Overlapping nuclei were treated as distinct nuclei separated by a dividing line that was two pixels in width. The two-pixel gap was added both in the ground truth annotations and in the post-processing of binary nuclei masks output from the various digital segmentation tools, for consistency. Setting the IoU threshold at 0.5 or above ensures that only one true positive predicted nucleus can be assigned to a ground truth nucleus. We computed the F1-scores across a varying range of IoU thresholds to capture the holistic performance of the different segmentation algorithms. For IoU thresholds less than 0.5, there is a possibility of multiple matches of (true positive) predicted nuclei with one ground truth nucleus. In this case, the prediction with the highest IoU is assigned as the true positive while the others are assigned as false positives. We also generated three other object-level quantitative metrics: precision, recall, and Jaccard index. It is worth noting that the F1-score is the harmonic mean of the precision and recall metrics. Precision measures the fraction of predicted nuclei that are true positives, and recall measures the fraction of ground truth nuclei that have a corresponding true positive prediction. The Jaccard index is the ratio of the number of true positive nuclei (intersection of predictions and ground truth) to the number of nuclei in the union of the predictions and ground truth. Using the computed IoU matrix and a defined IoU threshold, the number of true positives (TP), false positives (FP), and false negatives (FN) can be counted. The object-level metrics are calculated as:

$$\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \quad (2)$$

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \quad (3)$$

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (4)$$

$$\mathrm{Jaccard\;index} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \quad (5)$$
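A minimal sketch of this matching and metric computation, assuming 2D integer label images (0 = background) and an IoU threshold of at least 0.5, for which every match is unique; this is illustrative code, not the released evaluation pipeline:

```python
import numpy as np

def iou_matrix(gt, pred):
    """Pairwise IoU between ground truth and predicted instance labels.
    Returns an (n_gt, n_pred) matrix; it is sparse in practice because
    only overlapping pairs have non-zero IoU."""
    n_gt, n_pred = gt.max(), pred.max()
    # Joint histogram of (gt label, pred label) pairs gives intersection areas.
    inter = np.histogram2d(gt.ravel(), pred.ravel(),
                           bins=(np.arange(n_gt + 2), np.arange(n_pred + 2)))[0]
    union = inter.sum(axis=1, keepdims=True) + inter.sum(axis=0, keepdims=True) - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    return iou[1:, 1:]  # drop the background row and column

def object_metrics(gt, pred, iou_thresh=0.5):
    """F1, precision, recall, and Jaccard index at one IoU threshold."""
    iou = iou_matrix(gt, pred)
    tp = int((iou > iou_thresh).sum())  # matches are unique for thresh >= 0.5
    fp = iou.shape[1] - tp              # predictions with no matching nucleus
    fn = iou.shape[0] - tp              # ground truth nuclei with no match
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    jaccard = tp / max(tp + fp + fn, 1)
    return f1, precision, recall, jaccard
```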
The protocol and code used to generate quantitative metrics and subsequent visualization are in Supplementary Note 1 and the GitHub repository, respectively.
Statistics and reproducibility
Confidence intervals, the number of evaluation regions, and the number of nuclei in each evaluation region are reported where relevant. Quantitative comparison between the nuclear segmentation tools was performed on a region-level. Quantitative metrics for each region were computed to assess the object-level segmentation performance of all nuclei in the respective regions.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The DAPI intensity signal of ROIs from our in-house generated dataset, along with the paired nuclear ground truth and segmentation binary masks used for evaluating the different algorithms/models, is available in Zenodo repository 13306714 at https://doi.org/10.5281/zenodo.1330671438. The Zenodo repository also contains a spreadsheet directly identifying the specific ROIs that we used to curate the merged dataset in this study from the external dataset in the Synapse repository at https://doi.org/10.7303/SYN2762481239. Source data for figures can be found in Supplementary Data 1.
Code availability
The guidance and code for nuclear segmentation with various platform GUIs and python-based models and visualization/evaluation of segmentations are available in Supplementary Note 1 and the GitHub repository at https://github.com/Shachi-Mittal-Lab/nuclear_segmentation. All code is available under an open-source MIT license.
References
Lim, J. C. T. et al. An automated staining protocol for seven-colour immunofluorescence of human tissue sections for diagnostic and prognostic use. Pathology 50, 333–341 (2018).
Almekinders, M. M. et al. Comprehensive multiplexed immune profiling of the ductal carcinoma in situ immune microenvironment regarding subsequent ipsilateral invasive breast cancer risk. Br. J. Cancer 127, 1201–1213 (2022).
Wang, J. et al. Multiplexed immunofluorescence identifies high stromal CD68+PD-L1+ macrophages as a predictor of improved survival in triple negative breast cancer. Sci. Rep. 11, 1–12 (2021).
Vanguri, R. S. et al. Tumor immune microenvironment and response to neoadjuvant chemotherapy in hormone receptor/HER2+ early stage breast cancer. Clin. Breast Cancer 22, 538–546 (2022).
Yaseen, Z. et al. Validation of an accurate automated multiplex immunofluorescence method for immuno-profiling melanoma. Front. Mol. Biosci. 9, 810858 (2022).
Huang, Y. K. et al. Macrophage spatial heterogeneity in gastric cancer defined by multiplex immunohistochemistry. Nat. Commun. 10, 1–15 (2019).
Taube, J. M. et al. The Society for Immunotherapy in Cancer statement on best practices for multiplex immunohistochemistry (IHC) and immunofluorescence (IF) staining and validation. J. Immunother. Cancer 8, 155 (2020).
Wortman, J. C. et al. Spatial distribution of B cells and lymphocyte clusters as a predictor of triple-negative breast cancer outcome. npj Breast Cancer 7, 1–13 (2021).
Hoyt, C. C. Multiplex immunofluorescence and multispectral imaging: forming the basis of a clinical test platform for immuno-oncology. Front. Mol. Biosci. 8, 674747 (2021).
Wilson, C. M. et al. Challenges and opportunities in the statistical analysis of multiplex immunofluorescence data. Cancers 13, 3031 (2021).
Tzoras, E. et al. Dissecting tumor-immune microenvironment in breast cancer at a spatial and multiplex resolution. Cancers 14, https://doi.org/10.3390/cancers14081999 (2022).
Tomimatsu, K. et al. Precise immunofluorescence canceling for highly multiplexed imaging to capture specific cell states. Nat. Commun. 15, 1–16 (2024).
Tan, W. C. C. et al. Overview of multiplex immunohistochemistry/immunofluorescence techniques in the era of cancer immunotherapy. Cancer Commun. 40, 135–153 (2020).
Rojas, F., Hernandez, S., Lazcano, R., Laberiano-Fernandez, C. & Parra, E. R. Multiplex immunofluorescence and the digital image analysis workflow for evaluation of the tumor immune environment in translational research. Front. Oncol. 12, 1 (2022).
Parra, E. R. et al. Procedural requirements and recommendations for multiplex immunofluorescence tyramide signal amplification assays to support translational oncology studies. Cancers 12, 255 (2020).
Liu, C. C. et al. Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering. Nat. Commun. 14, 1–16 (2023).
Bortolomeazzi, M. et al. A SIMPLI (Single-cell Identification from MultiPLexed Images) approach for spatially-resolved tissue phenotyping at single-cell resolution. Nat. Commun. 13, 1–14 (2022).
Otsu, N. Threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cyber. SMC-9, 62–66 (1979).
Roerdink, J. B. T. M. & Meijster, A. The watershed transform: definitions, algorithms and parallelization strategies. Fundam. Inf. 41, 187–228 (2000).
Irshad, H., Veillard, A., Roux, L. & Racoceanu, D. Methods for nuclei detection, segmentation, and classification in digital histopathology: a review-current status and future potential. IEEE Rev. Biomed. Eng. 7, 97–114 (2014).
Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
Caicedo, J. C. et al. Evaluation of deep learning strategies for nucleus segmentation in fluorescence images. Cytometry 95, 952 (2019).
Greenwald, N. F. et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat. Biotechnol. 40, 555 (2022).
Yang, L. et al. NuSeT: a deep learning tool for reliably separating and analyzing crowded cells. PLoS Comput. Biol. 16, e1008193 (2020).
Caicedo, J. C. et al. Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nat. Methods 16, 1247–1253 (2019).
Hollandi, R. et al. nucleAIzer: a parameter-free deep learning framework for nucleus segmentation using image style transfer. Cell Syst. 10, 453–458.e6 (2020).
Han, W., Cheung, A. M., Yaffe, M. J. & Martel, A. L. Cell segmentation for immunofluorescence multiplexed images using two-stage domain adaptation and weakly labeled data for pre-training. Sci. Rep. 12, 1–14 (2022).
Schmidt, U., Weigert, M., Broaddus, C. & Myers, G. Cell detection with star-convex polygons. Lect. Notes Comput. Sci. 11071 LNCS, 265–273 (2018).
Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18, 100–106 (2020).
Al-Kofahi, Y., Zaltsman, A., Graves, R., Marshall, W. & Rusu, M. A deep learning-based algorithm for 2-D cell segmentation in microscopy images. BMC Bioinforma. 19, 1–11 (2018).
Vahadane, A. et al. Development of an automated combined positive score prediction pipeline using artificial intelligence on multiplexed immunofluorescence images. Comput. Biol. Med. 152, 106337 (2023).
Falk, T. et al. U-Net: deep learning for cell counting, detection, and morphometry. Nat. Methods 16, 67–70 (2018).
Maric, D. et al. Whole-brain tissue mapping toolkit using large-scale highly multiplexed immunofluorescence imaging and deep neural networks. Nat. Commun. 12, 1–12 (2021).
Meijering, E. Cell segmentation: 50 years down the road [life Sciences]. IEEE Signal Process Mag. 29, 140–145 (2012).
Kumar, N. et al. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imaging 36, 1550–1560 (2017).
Kromp, F. et al. Evaluation of deep learning architectures for complex immunofluorescence nuclear image segmentation. IEEE Trans. Med. Imaging 40, 1934–1949 (2021).
Chen, H. & Murphy, R. F. Evaluation of cell segmentation methods without reference segmentations. Mol. Biol. Cell 34, ar50 (2023).
Sankaranarayanan, A., Khachaturov, G., Smythe, K. S. & Mittal, S. Standardized framework and quantitative benchmarking: nuclear segmentation platforms for multiplexed immunofluorescence data in translational studies. Zenodo https://doi.org/10.5281/zenodo.12775616 (2024).
Aleynick, N. et al. Cross-platform dataset of multiplex fluorescent cellular object image annotations. Sci. Data 10, 1–6 (2023).
Schneider, C. A., Rasband, W. S. & Eliceiri, K. W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 9, 671–675 (2012).
McQuin, C. et al. CellProfiler 3.0: next-generation image processing for biology. PLoS Biol. 16, e2005970 (2018).
Lamprecht, M. R., Sabatini, D. M. & Carpenter, A. E. CellProfiler: free, versatile software for automated biological image analysis. Biotechniques 42, 71–75 (2007).
Carpenter, A. E. et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 7, 1–11 (2006).
Jones, T. R. et al. CellProfiler analyst: data exploration and analysis software for complex image-based screens. BMC Bioinforma. 9, 1–16 (2008).
Bankhead, P. et al. QuPath: open source software for digital pathology image analysis. Sci. Rep. 7, 1–7 (2017).
Teng, Q., Liu, Z., Song, Y., Han, K. & Lu, Y. A survey on the interpretability of deep learning in medical diagnosis. Multimed. Syst. 28, 2335 (2022).
Durkee, M. S., Abraham, R., Clark, M. R. & Giger, M. L. Artificial intelligence and cellular segmentation in tissue microscopy images. Am. J. Pathol. 191, 1693–1701 (2021).
Zhao, T. et al. A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities. Nat. Methods 1–11 https://doi.org/10.1038/s41592-024-02499-w (2024).
Israel, U. et al. A foundation model for cell segmentation. bioRxiv https://doi.org/10.1101/2023.11.17.567630 (2024).
Ma, J. et al. The multimodality cell segmentation challenge: toward universal solutions. Nat. Methods 21, 1103–1113 (2024).
Goldsborough, T. et al. InstanSeg: an embedding-based instance segmentation algorithm optimized for accurate, efficient and portable cell segmentation. arXiv:2408.15954 (2024).
Goldsborough, T. et al. A novel channel invariant architecture for the segmentation of cells and nuclei in multiplexed images using InstanSeg. bioRxiv https://doi.org/10.1101/2024.09.04.611150 (2024).
Englbrecht, F., Ruider, I. E. & Bausch, A. R. Automatic image annotation for fluorescent cell nuclei segmentation. PLoS ONE 16, e0250093 (2021).
Helmbrecht, H., Lin, T. J., Janakiraman, S., Decker, K. & Nance, E. Prevalence and practices of immunofluorescent cell image processing: a systematic review. Front. Cell Neurosci. 17, 1188858 (2023).
Bosisio, F. M. et al. Next-generation pathology using multiplexed immunohistochemistry: mapping tissue architecture at single-cell level. Front. Oncol. 12, 918900 (2022).
Harms, P. W. et al. Multiplex immunohistochemistry and immunofluorescence: a practical update for pathologists. Mod. Pathol. 36, 100197 (2023).
Parra, E. R., Francisco-Cruz, A. & Wistuba, I. I. State-of-the-art of profiling immune contexture in the era of multiplexed staining and digital analysis to study paraffin tumor tissues. Cancers 11, 247 (2019).
De León Rodríguez, S. G. et al. A machine learning workflow of multiplexed immunofluorescence images to interrogate activator and tolerogenic profiles of conventional type 1 dendritic cells infiltrating melanomas of disease-free and metastatic patients. J. Oncol. 2022, 9775736 (2022).
Acknowledgements
This work was supported by the startup grant provided by the Department of Chemical Engineering, University of Washington. The multispectral imaging system was jointly funded by the Department of Chemical Engineering and the Department of Laboratory Medicine and Pathology at the University of Washington.
Author information
Authors and Affiliations
Contributions
S.M. and A.S. conceptualized and designed the research. A.S. and K.S. acquired, processed, and analyzed multispectral imaging data. A.S. and G.K. performed the nuclear segmentation across platforms and annotated data for ground truth nuclei segmentation masks. A.S. created the standardized postprocessing and evaluation scheme. S.M., A.S. and G.K. wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Ophelia Bu.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sankaranarayanan, A., Khachaturov, G., Smythe, K.S. et al. Quantitative benchmarking of nuclear segmentation algorithms in multiplexed immunofluorescence imaging for translational studies. Commun Biol 8, 836 (2025). https://doi.org/10.1038/s42003-025-08184-8