Introduction

Cell segmentation serves as a crucial component for numerous applications, including survival prediction1, tumor/non-tumor classification2, as well as cell counting3. Our focus is on cell segmentation for histopathology images, which are obtained from tissue biopsies. Generally, cell segmentation models require pixel-level annotations, which are labor-intensive and expensive to obtain. This is because, typically, many cells are present within a single image, and annotating them requires trained pathologists. Weakly Supervised Image Segmentation (WSIS) addresses this challenge by using weak annotations, which may be present as image-level annotations4, scribbles5, point annotations6, or bounding boxes7,8,9,10,11. We study bounding box supervision, as it offers a more accurate estimate of a cell boundary12.

In this work, we explore WSIS with bounding box supervision for cell segmentation in histopathology images, utilizing the Segment Anything Model (SAM)13. While SAM has demonstrated remarkable zero-shot performance in various segmentation tasks on natural images, its application to weak supervision, especially for cell segmentation, remains unexplored. We present BoxCell, a SAM-based method for generating segmentation masks using bounding boxes as prompts. BoxCell uses SAM in two ways – at train and test times. At train time, SAM, when prompted with gold bounding box annotations, generates pseudo-masks for training images. These (image, pseudo-mask) pairs supervise the training of a standalone image segmentation model, such as CaraNet14. At test time, BoxCell generates two segmentation masks. (1) The first mask is generated by this standalone segmenter model. (2) An object detector like YOLO15, trained with box supervision, predicts bounding boxes on a test image, which are given as prompts to SAM to generate the second mask. The two masks have complementary strengths, with one excelling in localization and the other in learning cell shapes; BoxCell reconciles them using a novel integer programming formulation with intensity and spatial constraints. Our experiments on three cell segmentation datasets (CoNSeP16, MoNuSeg17, and TNBC18) demonstrate BoxCell’s significant gains over state-of-the-art weakly supervised segmentation systems, achieving 6-10 point better Dice scores.

Related work

Segment anything in image segmentation

Several studies have examined the zero-shot performance of SAM in both natural and medical image segmentation, showing strong results in common scenes but limited performance in complex or low-contrast settings, and on small or irregular objects19,20. Its robustness to image corruptions has also been explored, making it applicable in real-world scenarios21. In medical imaging, SAM has been applied to tasks like liver tumor and brain MRI segmentation22,23. However, in whole slide images (WSIs), SAM performs well on large structures but struggles with small, densely packed cells24. Recent efforts like Segment Any Cell25, CellSAM26, MedSAM27 and \(\mu\)-SAM28 improve performance by fine-tuning SAM or guiding it with object detectors and box prompts. A common observation across these works is that SAM benefits significantly from prompt-based supervision, particularly with bounding boxes, in challenging medical segmentation tasks.

Weakly supervised image segmentation (WSIS)

Many studies address weak supervision via scribbles29, class/attention guidance30,31,32, or bounding box supervision for natural image segmentation. Early methods rely on multi-instance learning (MIL). They assume that bounding boxes are tight8,33, so a line connecting two opposite edges must contain at least one positive pixel. More recent approaches like BoxInst9 and BoxTeacher7 use box-based mask alignment, and BoxSnake11 uses polygon-based instance segmentation. Despite the strides made in WSIS for natural images, its progress in cell segmentation remains relatively limited34. This limitation can be attributed to challenges like ambiguous boundaries and low contrast between foreground and background35.

Ensembling segmentation masks

To ensemble multiple segmentation masks, a classic approach involves outputting the average foreground probability for each pixel36. Another method entails creating an ensemble with low precision and high recall, defined by the model diversity metric36. Alternatively, EmergeNet37 introduces a weighted average of all masks, with weights determined by their performance on the validation set. En-Seg38 uses multiple masks to produce an averaged segmentation mask. We conduct a comparative analysis of BoxCell against these methods (wherever possible), demonstrating BoxCell’s superior performance.

Segmentation mask refinement

Various studies propose post-processing methods to enhance segmentation masks, including GrabCut39 and conditional random fields (CRF)40,41. A popular extension of these works is DenseCRF42, which uses a fully connected CRF that considers all pairs of pixels in an image. While this approach may be effective on some datasets, DenseCRF typically assumes that all images exhibit consistent and strong contrast between the foreground and background regions, which is often not the case in histopathology images. We conduct a comparative analysis of BoxCell with DenseCRF and demonstrate BoxCell’s superior performance.

Fig. 1

Inference pipeline for BoxCell, which produces masks \(M_{D}\) and \(M_{S}\) using the ITD and ITS. These masks are split into a \(K\) \(\times\) \(K\) grid, and GMMs are trained to estimate a probability map (P). An ILP solver refines P based on intensity and spatial constraints. This figure was created using draw.io.

BoxCell: Our proposed method

Weakly supervised image segmentation (WSIS) takes in a training dataset \(D = \{(X^T_{k}, B^T_{k})\}_{k=1}^{|D|}\), where \(X^T_{k}\) is a training image, and \(B^T_{k}\) (in our setting) represents bounding box annotations for the target class. Its goal is to train a model, which, given a test image \(X\), predicts a foreground (cell) segmentation mask \(M\). The proposed method, BoxCell, consists of two components: the Inference Time Detector (ITD) and the Inference Time Segmenter (ITS). Both components leverage SAM, a prompt-based general-purpose segmentation model, operating at both training and test stages (see Fig. 1).

ITD and ITS operate on a test image \(X\) and independently generate segmentation masks, \(M_D\) and \(M_S\). We find that the two masks possess complementary strengths, and better results can be achieved by merging the two. BoxCell achieves this via a novel Integer Linear Programming (ILP) formulation. The ILP outputs a final mask \(M\) by balancing the probability of pixel classification into classes (foreground and background) against the goal that neighboring pixels with similar intensities should be assigned the same class.

Generating segmentation masks

Fig. 2

Workflows with SAM in weak supervision. ITD uses the detection model \(D(\theta )\) to predict bounding boxes. The detection model is trained on the training data and is used to predict bounding boxes, which are used as box prompts for SAM during inference. ITS uses segmentation masks predicted by SAM as pseudo ground truth to train \(S(\phi )\). We only call \(S(\phi )\) during inference.

Inference time detector (ITD)

ITD (see Fig. 2) trains an object detector \(D(\theta )\) such as YOLOv815 using images \(X^T_{k}\) and the sets of gold bounding boxes \(B^T_{k}\). The object detector is trained with objectness, classification, and localization losses. Objectness loss (\(L_{obj}\)) penalizes errors in the confidence score indicating whether a box contains an object. Classification loss (\(L_{cls}\)) is computed as the binary cross-entropy between the predicted class and the ground truth class. Localization loss (\(L_{loc}\)) is the error in predicted bounding box coordinates as compared to ground truth bounding box coordinates. The total detection loss is the sum of all three losses, given as \(L_{det} = L_{obj} + L_{cls} + L_{loc}\). This object detector \(D(\theta )\) predicts a set of bounding boxes \(\hat{B}\) for a test image \(X\). Each box \(b \in \hat{B}\) is used as a prompt to SAM to generate a segmentation mask within that box. All box-level masks are combined to generate the image-level segmentation mask \(M_D\).
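For concreteness, the sketch below illustrates the ITD inference path under stated assumptions: it uses the official segment_anything and ultralytics packages, the checkpoint paths and image path are placeholders, and the confidence/NMS thresholds follow the implementation details reported later.

```python
# Minimal sketch of ITD inference: detector boxes -> SAM box prompts -> union mask M_D.
# Assumes `pip install ultralytics segment-anything`; file paths are placeholders.
import numpy as np
import cv2
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("test_image.png"), cv2.COLOR_BGR2RGB)

detector = YOLO("yolov8x_cells.pt")  # detector trained on (image, box) pairs
boxes = detector(image, conf=0.3, iou=0.5)[0].boxes.xyxy.cpu().numpy()  # (N, 4) xyxy

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

M_D = np.zeros(image.shape[:2], dtype=np.uint8)
for box in boxes:
    # One SAM call per predicted cell box; keep the single best mask per box.
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    M_D |= masks[0].astype(np.uint8)  # union of box-level masks gives the image-level mask
```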

Inference time segmenter (ITS)

During training, SAM generates masks \(M^T_{k}\) for the training images (\(X^T_{k}\)) using ground truth bounding boxes \(B^T_{k}\) as prompts. Despite SAM’s errors, these masks serve as pseudo-masks for training a standalone segmentation model \(Sg(\phi )\) like CaraNet14 – it is trained using a sum of Dice loss and BCE: \(L_{\text {seg}} = L_{\text {bce}} + L_{\text {dice}}\) (see Fig. 2). Binary cross-entropy (\(L_{bce}\)) improves the pixel-level classification of the segmentation mask, and Dice loss (\(L_{dice}\)) encourages overlap between the prediction and the pseudo ground-truth masks, thereby improving the localization of the predicted masks. At test time, \(Sg(\phi )\) runs on \(X\) to directly generate an image-level segmentation mask, \(M_S\).
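A minimal PyTorch sketch of the combined objective \(L_{\text {seg}} = L_{\text {bce}} + L_{\text {dice}}\) computed against SAM pseudo-masks is given below; the smoothing constant eps is an assumption, and the segmenter architecture itself (e.g., CaraNet) is not shown.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits: torch.Tensor, pseudo_mask: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """L_seg = L_bce + L_dice, computed against SAM-generated pseudo-masks (float 0/1)."""
    bce = F.binary_cross_entropy_with_logits(logits, pseudo_mask)
    probs = torch.sigmoid(logits)
    inter = (probs * pseudo_mask).sum(dim=(-2, -1))
    denom = probs.sum(dim=(-2, -1)) + pseudo_mask.sum(dim=(-2, -1))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    return bce + dice
```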

Integer programming for reconciling segmentation masks

We find that \(M_D\) excels in localization, whereas \(M_S\) is better at shapes; BoxCell merges the two for better performance. We make two key observations. First, for histopathological images, the intensity values within pixels of one class (foreground or background) vary significantly, due to variations in tissue structure and amount of staining from one part of the image to another. So, any intensity distribution learned over the whole image is likely to be noisy, but could be meaningful if learned over a small patch of the image. Second, there still exists perceptible contrast between pixel intensities in the vicinity of the boundary of a segmentation mask. Following these observations, BoxCell learns Gaussian Mixture Models (GMMs) to model patch-level intensity distributions for foreground and background. It then casts an ILP that maximizes the GMM prediction probability along with a soft constraint that neighboring pixels are assigned different classes only if their intensity difference is high. To do so, BoxCell partitions the pixels (i,j) of a test image X into three sets: F, B and A. Here, F (and B) is the set where masks \(M_D\) and \(M_S\) agree that the pixel is in the foreground (resp., background); and A is where they disagree, i.e., \(A=\{(i,j)~|~M_D(i,j)\ne M_S(i,j)\}\). BoxCell accepts the pixel labels in F and B for the final mask M and only attempts to reassign labels in A.
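A small NumPy sketch of this partition, assuming \(M_D\) and \(M_S\) are binary arrays of the same shape:

```python
import numpy as np

def partition_pixels(M_D: np.ndarray, M_S: np.ndarray):
    """Split pixels into foreground agreement F, background agreement B, and disagreement A."""
    F = (M_D == 1) & (M_S == 1)   # both masks say foreground -> kept as foreground
    B = (M_D == 0) & (M_S == 0)   # both masks say background -> kept as background
    A = M_D != M_S                # disagreement -> relabeled by the ILP
    return F, B, A
```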

Learning GMMs

BoxCell splits the image X of size \(L\times H\) into \(K^2\) mutually exclusive and collectively exhaustive patches of size \(L/K \times H/K\) each. For each patch \(\gamma\), it learns two GMMs, one for foreground pixel intensities, and one for background. It uses pixels in \(F \cup B\) to learn these GMMs and ignores ambiguous pixels in the patch. More formally, \(G^{f}_{\gamma }\) and \(G^{b}_{\gamma }\) are N-component 3-dimensional (RGB) GMMs over the foreground and background pixels in a patch \(\gamma\). Let \(w = \{w_1, w_2, \ldots , w_N\}\) be the mixture weights such that \(\sum w_n=1\) and \(0\le w_n \le 1\). Let \(\mu = \{\mu _1, \mu _2, \ldots , \mu _N\}, \mu _n \in \mathbb {R}^3\) and \(\Sigma = \{\Sigma _1, \Sigma _2, \ldots , \Sigma _N\}\), \(\Sigma _n \in \mathbb {R}^{3\times 3}\), respectively, denote the means and covariances. The likelihood density of an RGB pixel \(c = (c^1, c^2, c^3)\) belonging to a mixture G is given by \(p'(c ~| \ G; \mu , \Sigma ) = \sum _{n=1}^{N}w_n \mathcal {N}(c; \mu _n,\Sigma _n)\), where \(\mathcal {N}\) is the multivariate Gaussian density

$$\begin{aligned} \mathcal {N}(c; \mu _n, \Sigma _n) = \frac{1}{(2\pi )^{1.5}|\Sigma _n|^{0.5}}\exp \left( -\frac{(c-\mu _n)^T\Sigma _n^{-1}(c-\mu _n)}{2}\right) \end{aligned}$$
(1)

Since each pixel can either belong to foreground or background, we normalise probabilities as

$$\begin{aligned} p(c ~| \ G^{f}_{\gamma }) = \frac{p'(c ~| \ G^{f}_{\gamma })}{p'(c ~| \ G^{f}_{\gamma }) + p'(c ~| \ G^{b}_{\gamma })} \end{aligned}$$
(2)

Here, \(p(c~|~G^{f}_{\gamma })\) is the probability of pixel c being in the foreground, and \(p(c~|~G^{b}_{\gamma }) = 1 - p(c~|~G^{f}_{\gamma })\) of it being in the background. Note that these probabilities are solely based on pixel intensities and do not incorporate any spatial information. BoxCell merges \(p(c~|~G^{f}_{\gamma })\) for all the patches \(\gamma\) to create a complete probability distribution over the entire image – we denote it as P(c) for RGB pixel c.
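The scikit-learn sketch below illustrates the patch-level GMM step and the normalization of Eq. (2), assuming a \(K \times K\) grid over an RGB image (indexed H×W for simplicity) and the F/B partition above; the neutral fallback of 0.5 for patches with too few unambiguous pixels is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_probability_map(image, F, B, K=5, n_components=2):
    """Per-patch foreground probability P(c) from two RGB GMMs, normalized as in Eq. (2)."""
    H, W = image.shape[:2]
    P = np.zeros((H, W))
    hs, ws = H // K, W // K
    for gi in range(K):
        for gj in range(K):
            sl = (slice(gi * hs, H if gi == K - 1 else (gi + 1) * hs),
                  slice(gj * ws, W if gj == K - 1 else (gj + 1) * ws))
            patch = image[sl].reshape(-1, 3).astype(float)
            fg, bg = F[sl].ravel(), B[sl].ravel()
            if fg.sum() < n_components or bg.sum() < n_components:
                P[sl] = 0.5  # too few unambiguous pixels to fit a GMM; stay neutral
                continue
            g_f = GaussianMixture(n_components).fit(patch[fg])  # foreground GMM G^f
            g_b = GaussianMixture(n_components).fit(patch[bg])  # background GMM G^b
            pf = np.exp(g_f.score_samples(patch))  # p'(c | G^f)
            pb = np.exp(g_b.score_samples(patch))  # p'(c | G^b)
            P[sl] = (pf / (pf + pb + 1e-12)).reshape(image[sl].shape[:2])
    return P
```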

Integer linear programming

For a pixel (i,j), let its color information (RGB) be denoted as \(c_{ij}\) (a 3-tuple). The ILP first defines a binary variable \(x_{ij}\) (for ambiguous pixels), which is 1 iff the pixel is assigned the foreground label. It defines a part of the objective function, \(O_{idf}\), where idf stands for Intensity Distribution Factor:

$$\begin{aligned} O_{idf} = {\sum _{(i,j)\in A}x_{ij}P(c_{ij}) + (1-x_{ij})(1-P(c_{ij}))} \end{aligned}$$
(3)

To ensure well-formedness of cells, the ILP imposes that neighboring pixels that are assigned different labels must differ in their intensities. For this, it defines binary edge variables \(e_{ij0}\) and \(e_{ij1}\), which encode the edges between pixels (i,j) and \((i+1,j)\), and between (i,j) and \((i,j+1)\), respectively. The edge variables are assigned 0 if both pixels on the edge belong to the same class, and 1 otherwise. This is encoded in constraints as \(e_{ij0} = |x_{ij}-x_{(i+1)j}|\) and \(e_{ij1} = |x_{ij}-x_{i(j+1)}|\). If two neighboring pixels get different labels, the objective function is penalized based on their intensity differences:

$$\begin{aligned} O_{scf} = \sum _{i=1}^{L-1}\sum _{j=1}^{H}e_{ij0}S_{ij0} + \sum _{i=1}^{L}\sum _{j=1}^{H-1}e_{ij1}S_{ij1} \end{aligned}$$
(4)

We name this part of the objective \(O_{scf}\), where scf stands for Spatially Constraining Factor. Here, \(S_{ij0}\) and \(S_{ij1}\) are functions of intensity differences, for which we employ the color similarity metric9, with \(\theta\) as a hyperparameter:

$$\begin{aligned} S_{ij0}&= \exp \Bigg (\frac{-||c_{ij}-c_{(i+1)j}||_2}{\theta }\Bigg ), \end{aligned}$$
(5)
$$\begin{aligned} S_{ij1}&= \exp \Bigg (\frac{-||c_{ij}-c_{i(j+1)}||_2}{\theta }\Bigg ). \end{aligned}$$
(6)

Overall, the complete ILP formulation is as follows, with \(x_{ij}\) values computing the final segmentation mask labels in M for pixels in set A:

$$\begin{aligned} \begin{aligned} \underset{x_{ij}, e_{ij0}, e_{ij1}}{\text {maximize}} O_{idf} - \lambda O_{scf} \\ \text {subject to} \\ e_{ij0} = |x_{ij} - x_{(i+1)j}|, \\ e_{ij1} = |x_{ij} - x_{i(j+1)}|, \\ x_{ij}, e_{ij0}, e_{ij1} \in \{0,1\}. \end{aligned} \end{aligned}$$
(7)
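The sketch below expresses this formulation with PuLP and the open-source CBC solver (the paper’s experiments use Gurobi; solver alternatives are discussed later). The absolute values in Eq. (7) are linearized with two inequalities per edge, which is equivalent at the optimum because the edge variables are penalized in the maximized objective. Function and variable names are illustrative; pixels outside A are treated as fixed constants, following the text.

```python
import numpy as np
import pulp

def reconcile_ilp(P, M_D, M_S, image, lam=2.0, theta=25.0):
    """Relabel disagreement pixels by maximizing O_idf - lam * O_scf (Eq. 7)."""
    H, W = P.shape
    prob = pulp.LpProblem("boxcell_reconcile", pulp.LpMaximize)
    # Binary variables only for ambiguous pixels; agreed pixels stay fixed constants.
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i, j in zip(*np.nonzero(M_D != M_S))}
    label = lambda i, j: x.get((i, j), int(M_D[i, j]))  # fixed pixels: M_D == M_S
    O_idf = pulp.lpSum(v * float(P[p]) + (1 - v) * float(1 - P[p]) for p, v in x.items())
    edge_terms, img = [], image.astype(float)
    for i in range(H):
        for j in range(W):
            for di, dj in ((1, 0), (0, 1)):  # down and right neighbours
                ni, nj = i + di, j + dj
                if ni >= H or nj >= W:
                    continue
                if (i, j) not in x and (ni, nj) not in x:
                    continue  # both endpoints fixed: the term is a constant
                s = float(np.exp(-np.linalg.norm(img[i, j] - img[ni, nj]) / theta))
                e = pulp.LpVariable(f"e_{i}_{j}_{di}{dj}", cat="Binary")
                prob += e >= label(i, j) - label(ni, nj)  # together: e >= |x - x'|
                prob += e >= label(ni, nj) - label(i, j)
                edge_terms.append(s * e)
    prob += O_idf - lam * pulp.lpSum(edge_terms)  # objective, added after constraints
    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    M = M_D.copy()
    for p, v in x.items():
        M[p] = int(round(v.value()))
    return M
```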
Fig. 3

Qualitative analysis of segmentation masks. Column 1 is the original image; columns 2-5 show cropped masks (region marked with a red box) generated by three comparison models and BoxCell; the last column is the ground truth. BoxCell exhibits the best results, providing more accurate masks with better cell boundaries and shapes.

Results

The primary goal of our experiments is to compare BoxCell’s performance with existing box-supervised segmentation methods. Moreover, we wish to understand the qualitative differences between ITD and ITS masks, if any. Finally, we also compare BoxCell’s ILP formulation against existing mask merging and mask refinement approaches.

Datasets

CoNSeP: Colorectal nuclear segmentation and phenotypes (CoNSeP)16 is a nuclear segmentation and classification dataset of H&E stained images. Each image has dimensions 1000\(\times\)1000. The dataset covers a single cancer type, colorectal adenocarcinoma (CRA). It consists of a total of 41 whole slide images (WSI), which contain a total of 24,319 annotated cells of 3 classes: inflammatory cells, epithelial cells, and spindle cells. A total of 27 images are used for training and validation, and the remaining 14 are used for testing. Further, we split the 1000\(\times\)1000 images into four sub-images of dimension 500\(\times\)500. This results in a dataset of 98 train, 10 validation and 56 test images.

MoNuSeg: Multi-organ nuclei segmentation (MoNuSeg)43 is a nuclei segmentation dataset of H&E images representing cell nuclei from seven different organs (breast, liver, kidney, prostate, bladder, colon and stomach) to ensure diversity of nuclear appearances. It consists of a total of 51 images containing 28,846 annotated cells. A total of 37 are used for training and validation, and 14 are used as test images. We split each 1000\(\times\)1000 image into four 500\(\times\)500 images, resulting in 133 train, 15 validation and 56 test images.

TNBC: It consists of H&E slides of triple negative breast cancer patients taken at 40x magnification18. The dataset contains 50 images with a total of 4022 annotated cells. Each image has dimensions 512\(\times\)512. This dataset proves valuable for evaluating model performance under varying degrees of cellularity. Out of 50 images, we use 34 as training samples, 5 for validation and 11 for the test set.

Since all datasets originally provide segmentation masks, we converted the instance segmentation masks into bounding boxes and used them to train our object detector, keeping the gold masks held out during training. Some of these datasets also assign classes to the cells, but for our problem, we are only interested in binary segmentation and ignore the class labels.
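A sketch of this conversion, assuming each instance mask encodes cells as positive integer IDs with 0 as background:

```python
import numpy as np

def masks_to_boxes(instance_mask: np.ndarray) -> np.ndarray:
    """Derive (x_min, y_min, x_max, y_max) box annotations from an instance-labelled mask."""
    boxes = []
    for inst_id in np.unique(instance_mask):
        if inst_id == 0:  # 0 denotes background
            continue
        ys, xs = np.nonzero(instance_mask == inst_id)
        boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])
    return np.array(boxes)
```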

Evaluation metrics

To evaluate BoxCell’s performance on the semantic segmentation task, we use the Dice coefficient and Intersection over Union (IoU) metrics to quantify the similarity between the predicted and ground truth masks.
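For reference, a minimal NumPy implementation of the two metrics on binary masks (the smoothing term eps is an assumption to avoid division by zero):

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Dice coefficient and IoU between binary predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```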

Instance segmentation

Although BoxCell generates semantic segmentation masks, we convert these into instance segmentation masks to evaluate performance alongside instance segmentation methods. This conversion uses the outputs from both the ITD and BoxCell segmentation masks. While generating the ITD mask, each cell is assigned a unique instance ID. We then check for overlaps between the ITD and BoxCell masks, and in regions where overlap exists, we assign the corresponding ITD instance ID to those pixels of the BoxCell mask. For regions predicted by BoxCell but not by ITD, we assign unique instance IDs to ensure they are included in the instance segmentation mask evaluation. In cases where the BoxCell mask contains connected cells, we apply k-means clustering to separate them into distinct instances. Additionally, we evaluate the instance segmentation quality using three metrics: Aggregated Jaccard Index (AJI)43, Panoptic Quality (PQ)16, and Boundary F1-score (BF1)44. Panoptic Quality (PQ) combines detection quality (DQ) and segmentation quality (SQ), offering a comprehensive measure of both the accuracy of object detection and the precision of segmentation. Among our baselines, the WSIS methods, except for SPN, directly produce instance segmentation masks. For ensembling and mask refinement methods, we follow the same process used in BoxCell to generate instance segmentation masks.
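A simplified sketch of this conversion is shown below; it transfers ITD instance IDs by overlap and labels the remaining BoxCell-only regions with fresh IDs via connected components, omitting the k-means splitting of touching cells described above.

```python
import numpy as np
from scipy import ndimage

def to_instances(boxcell_mask: np.ndarray, itd_instances: np.ndarray) -> np.ndarray:
    """Transfer ITD instance IDs onto the BoxCell semantic mask; fresh IDs elsewhere.

    The k-means splitting of touching cells described in the text is omitted here.
    """
    out = np.where(boxcell_mask > 0, itd_instances, 0)      # overlap: inherit ITD IDs
    leftover = (boxcell_mask > 0) & (itd_instances == 0)    # BoxCell-only foreground
    lbl, _ = ndimage.label(leftover)                         # connected components
    out[leftover] = lbl[leftover] + itd_instances.max()      # assign new, unique IDs
    return out
```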

Implementation details

For the object detection component (ITD), we employ YOLOv8x 15 trained for 300 epochs with early stopping based on validation performance. The initial learning rate is set to 0.01 and gradually reduced by a factor of 0.01 during training using a cosine decay schedule with a 5-epoch warmup. We use a batch size of 32. Data augmentations include random horizontal and vertical flips, rotations (\(\pm 90^\circ\)), and color jitter (brightness/contrast \(\pm 0.2\)). During inference, we apply a detector confidence threshold of 0.3 and an NMS IoU threshold of 0.5. All experiments are conducted on NVIDIA RTX-5000 and Tesla A100 GPUs.

For the auxiliary segmentation stage (ITS), we adopt CaraNet14, which utilizes a reverse axial attention mechanism effective for small or tiny object segmentation, as well as SegFormer45. CaraNet is trained for 200 epochs using the AdamW optimizer with an initial learning rate of \(1\times 10^{-4}\), cosine decay scheduling, and a 5-epoch warmup. We set the batch size to 16 and apply early stopping with a patience of 20 epochs based on validation Dice. The same data augmentations as ITD are used, and during inference, small objects with an area smaller than 20 pixels are removed. For comparison, we also evaluate BBTP++8, BoxInst9, BoxSnake11, BoxTeacher7, and SPN34, all trained under identical settings for fairness.

To reconcile the predictions from ITD and ITS, we propose an Integer Linear Programming (ILP)-based mask fusion strategy. We compare our ILP method against several existing mask merging strategies. (i) AP (Averaging) averages the ITD and ITS masks element-wise and binarizes at a threshold of 0.5. (ii) LP (Low Precision Averaging)36 applies the same averaging but uses a higher binarization threshold of 0.9 to emphasize high-confidence regions. (iii) ENet46 computes a weighted sum of ITD and ITS masks, where the weights are tuned on the validation set and remain balanced across datasets (e.g., CoNSeP: ITD=0.522 vs. ITS=0.477; MoNuSeg: ITD=0.523 vs. ITS=0.476; TNBC: ITD=0.493 vs. ITS=0.507). (iv) ILP (ours) formulates the mask reconciliation as an integer linear programming problem, achieving consistent improvements across datasets, as shown in Table 5.

Following mask fusion, we refine the results using DenseCRF post-processing with Pairwise Gaussian (size \(3\times 3\)) and Pairwise Bilateral (size \(5\times 5\)) kernels for 10 inference iterations. For the BoxCell ILP solver, each image is partitioned into a \(K\times K\) grid for RGB-GMM modeling, where we use \(K=5\) in all experiments. The ILP is solved with Gurobi47, and hyperparameters were tuned via grid search, with \(\lambda \in \{0.5,1,2,5\}\) (pairwise smoothness weight), \(\theta \in \{1,5,10,25,30\}\) (color similarity scaling), and the number of GMM components \(\in \{2,3,4,5,6\}\). The best results were obtained with \(\lambda =2\), \(\theta =25\), and 2 GMM components. For an image of size \(500\times 500\), the complete processing time is approximately 67–71 seconds on an Intel Xeon processor with 32 CPU cores.
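A sketch of the DenseCRF refinement step using the pydensecrf package is shown below; the kernel sizes and 10 iterations follow the settings above, while the compatibility and srgb values are assumptions.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def densecrf_refine(image: np.ndarray, fg_prob: np.ndarray, iters: int = 10) -> np.ndarray:
    """DenseCRF post-processing with Gaussian (3x3) and bilateral (5x5) pairwise kernels."""
    h, w = fg_prob.shape
    probs = np.stack([1.0 - fg_prob, fg_prob]).astype(np.float32)   # (2, H, W) softmax-like
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                           # 3x3 Gaussian kernel
    d.addPairwiseBilateral(sxy=5, srgb=13,                           # 5x5 bilateral kernel
                           rgbim=np.ascontiguousarray(image.astype(np.uint8)), compat=10)
    Q = np.array(d.inference(iters)).reshape(2, h, w)
    return (Q[1] > Q[0]).astype(np.uint8)
```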

Table 1 Comparison of Bounding Box Supervised Methods.

Comparison of weakly supervised approaches

Semantic segmentation

Table 1 compares all models in the weakly supervised image segmentation setting. BoxCell achieves substantial 6-10 point Dice improvements compared to the strongest non-SAM competitors across datasets, demonstrating the significant merit of our approach. Even when competing against SAM-based methods, BoxCell maintains a consistent 6-7 point Dice advantage. We believe that this is because the heuristics imposed by weak supervision losses are insufficient to guide SAM. For instance, BoxInst employs intensity-dependent losses in local neighborhoods that remain static across image patches, ignoring underlying image distribution variations. Similarly, SAM-BBTP++ lacks intensity-based criteria for mask integrity and fails to penalize contradictory predictions in neighborhoods. In preliminary experiments, we also made other attempts to use weak supervision losses for training SAM and found that they generally confuse SAM (because of catastrophic forgetting). Figure 3 illustrates sample predictions from each dataset, where BoxCell demonstrates superior accuracy in capturing cell boundaries and shapes.

We also compare with mask refinement methods like DenseCRF. DenseCRF uses a unary classifier to learn long-term dependencies. However, we observed that in cases where the background varies throughout the image and the contrast between the foreground and background is minimal, such as in histopathology images, DenseCRF produces suboptimal performance. Therefore, learning local features proves to be more beneficial, as done in BoxCell. BoxCell achieves a 1–4 point Dice gain compared to DenseCRF. Closest to our work, ENSeg-ILP focuses on averaging segmentation results, whereas BoxCell aims to maximize accuracy by reconciling two complementary predictions (ITD and ITS), resulting in notable Dice score improvements (ITD: 82 \(\rightarrow\) 85, ITS: 80 \(\rightarrow\) 85). In addition, BoxCell introduces spatial constraints based on pixel color, contributing an additional 1.1–1.5 Dice points compared to using intensity constraints alone, as employed by ENSeg-ILP (see Table 7). Although the code for ENSeg-ILP is not publicly available for a direct comparison, BoxCell’s enhancements highlight its clear advantage.

As described in the dataset details, each CoNSeP and MoNuSeg WSI (\(1000\times 1000\)) was divided into four \(500\times 500\) sub-images for evaluation, with per-WSI analysis showing negligible differences (CoNSeP: 81.39 vs. 81.41; MoNuSeg: 81.74 vs. 81.73). TNBC was evaluated at full resolution. We performed paired t-test analysis to evaluate the statistical significance of BoxCell-ILP against existing bounding box supervised approaches. Statistical analysis confirms that BoxCell achieves significant improvements (\(p < 10^{-4}\)), with 95% confidence intervals for mean Dice gains of [5.7–5.8], [8.1–8.2], and [5.2–5.3] on CoNSeP, MoNuSeg, and TNBC, respectively, validating its robustness across datasets.

Lastly, we compare BoxCell with several SAM variants (Table 2), including the recently introduced SAM2 49, designed for video segmentation, and MedSAM 50, trained specifically on medical images. We also include \(\mu\)-SAM, a SAM variant optimized for microscopy images. As shown in Table 2, BoxCell consistently improves ITD-ITS performance across all SAM backbones. For instance, on the CoNSeP dataset, BoxCell improves performance from 79.97/79.80 to 81.36 with SAM2, and on MoNuSeg, from 80.35/79.80 to 81.83. On TNBC, it increases from 82.24/80.54 to 84.46. Similarly, with MedSAM, the Dice score rises from 71.00 to 74.91, though its overall performance remains lower than SAM and SAM2 for cell segmentation. Quantitatively, BoxCell achieves improvements of approximately \(+6.5\) Dice on CoNSeP, \(+4.5\) Dice on MoNuSeg, and \(+2.3\) Dice on TNBC over \(\mu\)-SAM and MedSAM. These results highlight that beyond simply adapting SAM to biomedical images, BoxCell’s integer linear programming–based reconciliation of inference-time detection and segmentation yields consistent 2–3 Dice point gains across backbones and substantial absolute improvements overall.

Table 2 Comparison of BoxCell with various SAM backbones.

Instance segmentation

Table 3 compares BoxCell with detector-agnostic instance conversion methods, including connected components and watershed segmentation. In our current pipeline, the number of clusters (K) for splitting touching regions is determined by the number of overlapping bounding boxes from the detector, with centroids initialized at corresponding box centers. This introduces a degree of dependence on the detector. To assess its impact, we implemented detector-agnostic alternatives (connected-component and watershed) and found that, while they achieve reasonable performance, BoxCell’s detector-based instance conversion consistently yields substantially higher PQ and AJI scores across all datasets.

Table 3 Comparison of detector-agnostic instance conversion methods with BoxCell.

Furthermore, Table 4 compares BoxCell with other baselines on the instance segmentation task. Although BoxCell is primarily designed for semantic segmentation, we extend it to instance segmentation and observe consistent improvements in PQ, AJI, and Boundary F1. BoxCell achieves 2–7 point PQ gains over weakly supervised instance segmentation methods, particularly excelling in densely packed or low-contrast regions where competing approaches often struggle.

Table 4 Performance Comparison for Instance Segmentation.

Comparison with mask merging methods

Table 5 reports experiments where masks \(M_D\) and \(M_S\) from ITD and ITS are merged. We compare several mask merging strategies from the literature and our proposed ILP-based approach. AP (Averaging) averages the ITD and ITS masks element-wise and binarizes at a threshold of 0.5. LP (Low Precision Averaging)36 uses the same averaging but applies a higher threshold of 0.9 to emphasize high-confidence regions. ENet46 computes a weighted sum of ITD and ITS masks, where the weights are tuned on the validation set. The weights are nearly balanced across datasets, indicating stability (e.g., CoNSeP: ITD=0.522 vs. ITS=0.477; MoNuSeg: ITD=0.523 vs. ITS=0.476; TNBC: ITD=0.493 vs. ITS=0.507). Finally, ILP (ours) reconciles ITD and ITS masks through integer linear programming. Across all datasets, BoxCell’s ILP achieves consistent improvements over these methods.
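For reference, the AP and LP strategies reduce to a single averaging-and-thresholding step; the sketch below assumes \(M_D\) and \(M_S\) are soft foreground maps in [0, 1] (with strictly binary masks, thresholds of 0.5 and 0.9 degenerate to a union and an intersection rule, respectively).

```python
import numpy as np

def merge_average(M_D: np.ndarray, M_S: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """AP / LP merging: average the two soft masks and binarize (0.5 for AP, 0.9 for LP)."""
    return ((M_D.astype(float) + M_S.astype(float)) / 2.0 >= threshold).astype(np.uint8)
```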

Table 5 Comparison of Mask Merging Methods.
Table 6 Comparison of Segmentors for ITS.
Table 7 Ablation Table.

Segmentation backbones

We utilized CaraNet as the backbone for training the Inference Time Segmenter (ITS) model. Additionally, we evaluated a transformer-based approach, SegFormer, for training ITS, as shown in Table 6. The table compares the performance of BoxCell with CaraNet as the ITS model (BoxCell-CaraNet) and BoxCell with SegFormer as the ITS model (BoxCell-SegFormer). BoxCell-SegFormer demonstrated performance comparable to BoxCell-CaraNet. However, due to the limited dataset size, larger transformer-based models like SegFormer may not yield significant benefits, as evidenced in the results.

Discussion: ablation and error analysis

Ablation study

We conduct an ablation study to evaluate the individual contributions of various components to the overall model performance, as summarized in Table 7. Specifically, we examine the impact of the Intensity Distribution Factor (IDF) and the Spatial Constraining Factor (SCF), alongside the ITD and ITS predictions. Incorporating IDF into the ITD-ITS framework yields an improvement of 1–2 points, while SCF contributes a gain of 0.1–0.7 points—particularly beneficial in scenarios with high foreground–background contrast. The combination of all four components results in the best overall performance.

Solver alternatives

We evaluated BoxCell with multiple alternatives to Gurobi, including open-source ILP solvers (CBC, OR-Tools), approximate inference (\(\alpha\)-expansion), and a sparse graph formulation. Table 8 reports the average runtime per image and Dice scores across datasets. Across datasets, we find that open-source solvers (OR-Tools and CBC) and approximate methods (\(\alpha\)-expansion, sparse graph) achieve accuracy within 1–2 Dice points of the Gurobi baseline, while offering different trade-offs in runtime. The sparse graph formulation is especially effective, reducing runtime significantly while maintaining competitive Dice. These results demonstrate that BoxCell is not tied to a commercial solver: open-source or approximate alternatives can also be used.

Table 8 Comparison of solver performance across datasets. Gurobi achieves the best Dice scores consistently, while open-source alternatives (e.g., OR-Tools) provide competitive results.

Impact of detector threshold

To analyze the impact of the detector threshold, we varied the YOLOv8-m detection score threshold (0.1–0.9) and measured the resulting segmentation Dice scores across datasets (see Table 9). Segmentation quality is relatively stable across a wide range of detector thresholds, with only minor fluctuations on CoNSeP at high thresholds. Dice peaks at a threshold of 0.3 for CoNSeP, 0.7 for MoNuSeg, and 0.5 for TNBC. This indicates that BoxCell is not overly sensitive to the detector operating point: once bounding boxes are reasonably accurate, the ILP-based reconciliation mitigates detection errors.

Table 9 Segmentation Dice scores across datasets at different YOLOv8-m detector thresholds.

Effect of grid size K

In our method, the grid size \(K \times K\) is used to partition each image for local RGB-GMM modeling. In all our experiments we used \(K=5\). To assess sensitivity, we varied K in the range \(\{3,5,10,15,20\}\). Table 10 provides Dice scores for varying values of K across the three datasets. As shown in Table 10, the performance remains stable across a wide range of K, with only marginal differences (within \(\pm 1\) Dice). This indicates that the method is not highly sensitive to the choice of grid size.

Table 10 Sensitivity of BoxCell performance (Dice score) to grid size K.

Cross-domain generalization and stain robustness

We evaluate BoxCell against the strongest baseline in cross-domain settings (Table 11), reporting train–test performance across datasets (C: CoNSeP, M: MoNuSeg, T: TNBC). BoxCell consistently outperforms SAM-BBTP++, demonstrating superior domain generalization. While performance decreases compared to within-domain results (81.39, 81.74, and 85.01 for CoNSeP, MoNuSeg, and TNBC, respectively), BoxCell still surpasses the closest baseline by effectively reconciling ITD and ITS. Training on one dataset and testing on another yields substantial Dice score gains (e.g., C \(\rightarrow\) M: +22.63, C \(\rightarrow\) T: +26.00, M \(\rightarrow\) C: +24.98), further confirming better robustness to domain shifts.

Table 11 Cross-domain performance (train \(\rightarrow\) test).

Stain Robustness: To evaluate stain robustness, we generated stain-varied test sets for each dataset. Figure 4 illustrates the original samples (row 1) and their stain-varied counterparts (row 2). Table 12 shows that BoxCell demonstrates markedly greater robustness to stain variation compared to the baseline SAM-BBTP++. This indicates that BoxCell’s reconciliation of ITD and ITS effectively handles staining variability, making it a more reliable approach for real-world scenarios.

Fig. 4

Original and stain variation images across datasets.

Table 12 Performance comparison across datasets under stain variation and no stain variation conditions.

Compute capacity

Table 13 provides a parameter-count and runtime comparison for BoxCell and the closest baseline. Both models have comparable parameter counts, ensuring similar capacity. The longer runtime of BoxCell arises mainly from the ILP optimization step, not from model size. While BoxCell is slower due to the ILP solver, it achieves substantially higher Dice scores across datasets. A variant of BoxCell with a sparse solver (see row 2 in Table 13) offers faster computation (15 sec/image) while maintaining strong segmentation performance.

Table 13 Compute Capacity comparison between BoxCell and SAM-BBTP.

Error analysis

On analyzing the masks qualitatively, we find that ITS benefits from a global perspective, leveraging the full image view for producing masks. As a result, ITS has a better shape understanding. Conversely, in ITD, SAM’s inference is localized to the region defined by the box prompt. In scenarios where cell boundaries are ambiguous, ITS tends to overshoot the segmentation. In such cases, ITD often performs better, due to the localization provided by the box prompt. Another notable difference is the compounding of errors due to the pipelined nature of ITD. ITD is not able to recover from the mistakes of the object detector; if the box is a false positive, SAM generally outputs a false mask, and if a cell is missed by the detector, SAM can never output a mask for it. This is not an issue for ITS as it does not have a multi-step pipeline. With the ILP, BoxCell mitigates issues of both component models. See Fig. 5 for an illustration. BoxCell demonstrates a reduced dependence on the size of bounding boxes – it can make mask predictions outside the bounding boxes (A, Row 1). It can generate segmentation masks even when the detection model predicts no bounding box, thus mitigating ITD’s false negatives problem (B, Row 1). Finally, it yields qualitatively crisper boundaries, a characteristic only observed in BoxCell across all models (C, Row 1).

Fig. 5

With only ITD, foreground is never predicted outside box prompts; BoxCell can do so, reducing the number of false negatives and improving the mask quality for wrongly sized boxes (A and B). BoxCell produces finer segmentation masks (C). BoxCell performs less effectively for images with low contrast between foreground and background (A, Row 2), where its capability to mitigate false positives is limited (B, Row 2).

Common failure modes for BoxCell include images with low contrast between the intensity of foreground and background pixels (A in Fig. 5, Row 2). The human ability to detect such cells relies on the shape of these faintly different regions of intensities. The method fails to detect and mitigate false positives in such cases (B in Fig. 5, Row 2). It tends to merge instances of different cells because it lacks direct box supervision. It also tends to segment regions with intensity variation despite them not being cells. The detector’s performance bottlenecks all detector-based models because of the high dependence on the box prompts. Even with the best-performing detection model, we still retain many false positives that generate segmentation masks even when no actual cell is present. False negatives do not get segmented if the detector fails to detect them. Although BoxCell’s ILP tries to mitigate these effects, they persist in some cases.

Effect on annotation efficiency

We observe that BoxCell substantially reduces the annotation time required by pathologists. To quantify this, we conducted an annotation-efficiency study with two pathologists, each annotating 25 randomly selected cells (total \(n=50\)). Manual polygon annotation, performed using LabelMe, required an average of 17.82 seconds per cell (95% CI: 17.39–18.25, SD 1.5), including identifying cell boundaries, drawing polygons, and assigning class labels. In contrast, BoxCell requires only bounding box generation, taking 5.73 seconds per cell, followed by 4.20 seconds for verifying and refining the generated instance masks, resulting in a total of 9.92 seconds per cell (95% CI: 8.93–10.91, SD 3.54). This represents a 7.9‑second reduction, corresponding to a 44.4% improvement in annotation efficiency.

Conclusion

We present BoxCell, the first approach to use SAM-based segmentation over histopathological images when only bounding box supervision is available. It computes two segmentation masks using SAM at train and test time, and reconciles them via a novel ILP that balances pixel likelihood and neighborhood objectives. Our experiments over three benchmark datasets show that our proposed approach consistently beats current weak supervision methods by up to 10 Dice points. Additionally, we compare with mask ensembling and refinement methods and show the effectiveness of BoxCell. Our work opens up new possibilities for leveraging SAM in weak-supervision settings and for using constrained optimization strategies to post-process segmentation masks.