Introduction

White blood cell (WBC) detection is a cornerstone of modern medical diagnostics, providing critical insights into a wide range of health conditions, including anemia, infections, inflammation, and immune system disorders1,2. Despite its significance, WBC detection remains a formidable challenge due to the inherent complexities of medical imaging, such as variations in staining techniques, imaging conditions, and the frequent occurrence of multicellular adhesion phenomena3,4,5,6,7,8,9,10,11,12,13,14,15. Recent advancements in deep learning have revolutionized the field of medical image analysis, with Convolutional Neural Networks (CNNs) and Transformers emerging as the two dominant architectures. While CNNs excel in hierarchical feature extraction, Transformers leverage self-attention mechanisms to capture long-range dependencies, making them highly effective in complex vision tasks. However, both architectures have limitations when applied to WBC detection, particularly in handling scale variability, computational efficiency, and feature redundancy. This study addresses these challenges by proposing a novel hybrid model, MCDAF-Net, which integrates the strengths of CNNs and Transformers while introducing innovative modules to enhance performance.

CNNs have long been the backbone of medical image recognition due to their ability to construct hierarchical representations through convolution and pooling operations. Several studies have demonstrated their efficacy in WBC detection. For instance, Geng et al. 3 utilized the attention mechanism of Mask R-CNN4 to improve WBC segmentation accuracy. Zheng et al. 5 combined Itti’s visual attention model with an adaptive center-surround difference operator and an enhanced CenterNet model for WBC detection. Sivarao et al. 6 proposed a framework using SegNet for segmentation and EfficientNet for feature extraction, achieving state-of-the-art classification of WBC subtypes. Islam et al. 7 further advanced CNN-based methods by incorporating image pre-processing techniques and interpretability tools such as SHapley Additive exPlanations (SHAP, v0.40.0) and Gradient-weighted Class Activation Mapping++ (Grad-CAM++, v1.7.1), outperforming existing models.

Despite these advancements, CNN-based approaches often rely on a two-stage detection framework, which introduces parameter redundancy and increases computational overhead. To address these issues, researchers have proposed optimized architectures. Xu et al. 16 introduced TE-YOLOF, which incorporates depth-separable convolutions and EfficientNet17 as a backbone. Han et al. 18 developed MID-YOLO, integrating attention mechanisms to enhance contextual understanding. Wang et al. 19 enhanced YOLOv5 with coordinate attention to better handle large-scale WBC samples. Polejowska et al. 20 investigated the impact of image quality on detection performance using YOLOv5 and RetinaNet21. However, CNN-based detectors still struggle with scale variability and feature representation, particularly for small objects, highlighting the need for more robust solutions.

Transformers22, originally designed for natural language processing, have recently gained traction in vision tasks due to their self-attention mechanisms, which excel at capturing long-range dependencies. Carion et al. 23 redefined object detection as a set prediction problem, achieving competitive results on the COCO dataset24. Zhu et al. 25 introduced Deformable-DETR, which enhances multi-scale feature handling. Wang et al. 26 proposed PnP-DETR to address spatial redundancy, reducing computational load. Zhang et al. 27 improved DETR’s training convergence and query significance through contrastive denoising training. In medical imaging, Chen et al. 28 designed a Vision Transformer (ViT) with shifted windows and transfer learning for WBC classification on the BCCD dataset. Li et al. 29 improved the Detection Transformer’s residual module, while Dipto et al. 30 employed Explainable AI (XAI) with federated learning to accelerate ViT training. Katar et al. 31 developed an interpretable ViT model using self-attention and Score-CAM for clinical applications.

Despite their promise, Transformers face significant challenges in medical imaging. Their self-attention mechanisms incur high computational complexity, making them less efficient for high-resolution images. Additionally, Transformers require large datasets to perform optimally, a limitation in data-scarce medical domains. These challenges underscore the need for innovative approaches that can leverage the strengths of Transformers while mitigating their limitations.

The integration of CNNs and Transformers has emerged as a promising approach to leverage their complementary strengths. Marzahl et al. 32 proposed a leukocyte detection method combining region-based proposals with attention mechanisms. Huang et al. 33 introduced ARML, which enhances feature representation with adaptive attention-aware residuals. Nugraha et al. 34 combined YOLOv8 and DETR to enhance multi-object detection performance, particularly with insufficient datasets. Tarimo et al. 35 designed a 2-way-2-stage approach integrating YOLO for fast object detection and ViT for robust image representation. Zhang et al. 36 enhanced YOLO with multi-scale feature integration, while Bayat et al. 37 proposed a multi-attention framework for fine-grained WBC classification.

However, current hybrid models remain in their early developmental stages, struggling with complex, multi-scale features and channel feature redundancy. To address these challenges, this study introduces MCDAF-Net, a novel network model tailored for WBC detection. The key innovations of this work include:

  • Attention Multi-scale Sensing Module (AMSM): This module combines multi-scale dilation convolution with self-attention mechanisms to capture critical features effectively, addressing the limitations of traditional CNNs and Transformers in handling scale variability.

  • Cross-Deformation Convolution Module (CDCM): This module extends feature representation through a ‘Split-Crossover-Fusion Deformation’ strategy, reducing channel feature redundancy and enhancing the model’s ability to distinguish between closely adhered WBCs.

  • Multi-Scale Cross-Deformation Attention Fusion Network (MCDAF-Net): By integrating AMSM and CDCM, this module achieves superior performance on public datasets such as LISC, BCCD, and WBCDD, setting a new benchmark for WBC detection accuracy and efficiency.

The proposed MCDAF-Net represents a significant advancement in WBC detection by addressing the limitations of existing CNN and Transformer-based models. By combining the hierarchical feature extraction capabilities of CNNs with the long-range dependency modeling of Transformers, MCDAF-Net offers a robust and efficient solution for medical image analysis. The integration of AMSM and CDCM not only improves the model’s ability to handle scale variability and channel redundancy but also enhances its interpretability and generalizability. This makes MCDAF-Net particularly suitable for clinical applications, where accuracy and efficiency are paramount. Furthermore, the model’s performance on public datasets demonstrates its potential to set new standards in WBC detection, paving the way for future research in medical image analysis.

The rest of the paper is organized as follows: In section “Methods”, the MCDAF-Net is introduced. Section “Experiments and results” shows the experimental results on three different data sets. Finally, a summary is presented in section “Discussion”.

Methods

Architecture

In this section, we present the overall structure of MCDAF-Net, shown in Fig. 1. It consists of three components: a ResNet5038 backbone for extracting shallow features, the MCDAF module for refining feature extraction, and a Transformer structure for acquiring long-range context dependencies. Within the MCDAF module there are three submodules: the AMSM; the ADP module, which consists of Adaptive Average Pooling and a 1 \(\times\) 1 Depthwise Convolution (DWconv)39; and the CDCM. The AMSM and ADP modules run in parallel, while the CDCM follows the AMSM in sequence. Concretely, for the ResNet50 output feature map X, we first make four copies of it. We then downsample X with the ADP module to obtain the feature map \(X_{1}\); this reduces the dimensionality of the feature map, lowering computational complexity while retaining the key information. We use the AMSM to obtain features \(X_{2}\), \(X_{3}\), and \(X_{4}\) with a broader effective receptive field, and subsequently obtain channel-refined features with the CDCM. Finally, the refined feature maps, together with the position encoding, are fed into the Transformer structure, and a feed-forward neural network produces the bounding box and classification results for each object. Our MCDAF module exploits both a broader effective receptive field and channel feature reconstruction; it can adaptively adjust feature weights to capture key features and can be attached to any CNN architecture to improve feature representation.
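The data flow just described can be sketched as follows. All four callables are placeholders for the submodules detailed in the following sections, and the array shapes are illustrative, not the paper's actual configuration:

```python
import numpy as np

def mcdaf_forward(x, adp, amsm, cdcm, integrate):
    """Schematic data flow of the MCDAF module: an ADP branch in
    parallel with a three-rate AMSM branch, followed by CDCM channel
    refinement and a final integration step (Eq. 10)."""
    x1 = adp(x)                                    # ADP branch -> X1
    x2, x3, x4 = (amsm(x, n) for n in (2, 3, 6))   # AMSM at rates 2, 3, 6
    refined = cdcm(x2, x3, x4)                     # CDCM channel refinement
    return integrate(x1, refined)                  # fused output

# Toy run with identity placeholders on an N x C x H x W array.
x = np.random.rand(1, 8, 16, 16)
out = mcdaf_forward(
    x,
    adp=lambda t: t,
    amsm=lambda t, n: t,
    cdcm=lambda a, b, c: a + b + c,
    integrate=lambda a, b: np.concatenate([a, b], axis=1))
print(out.shape)  # (1, 16, 16, 16)
```

Swapping the lambdas for the real AMSM, CDCM, and ADP modules reproduces the structure of Fig. 1.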

Fig. 1

The proposed model structure consists of a backbone network, a multi-scale feature reconstruction network, and a Transformer structure.

AMSM for features positioning

To take advantage of better feature positioning and a broader effective receptive field, we introduce the AMSM, as shown in Fig. 2. This module combines several convolutional operations, whose purpose is to obtain a larger receptive field, with the self-attention mechanism22, which weights the concatenated features as key features. Specifically, given an input feature map \(\textit{X} \in R^{N \times C \times H \times W}\), where N is the batch size, C is the number of channels, and H, W are the height and width of the feature map, we apply horizontal, vertical40, and dilated convolutions41 in parallel, with kernels of 1\({\times }\)3 and 3\({\times }\)1 for the first two, and then concatenate the outputs as follows:

$$\begin{aligned}&{F_{1}} = Conv_{2}(X) \oplus Conv_{3}(X) \end{aligned}$$
(1)
$$\begin{aligned}&{F_{2}} = Conv_{4,n}(X), \quad n = 2,3,6 \end{aligned}$$
(2)
$$\begin{aligned}&{\textit{Attention Weights} = softmax \left( \frac{QK^{T}}{\sqrt{C}}\right) } \end{aligned}$$
(3)
$$\begin{aligned}&{ {X_i} = \textit{Attention Weights} \odot Conv_{5}(Cat({F_1},{F_2})), \quad i = 2,3,4 } \end{aligned}$$
(4)

where X is the feature extracted by ResNet50, and \(Conv_{2}\), \(Conv_{3}\), and \(Conv_{5}\) are convolutions with kernels of 1\({\times }\)3, 3\({\times }\)1, and 3\({\times }\)3, respectively. \(Conv_{4,n}\) is the dilated convolution; we set the dilation rate n \(\in \left\{ 2,3,6\right\}\), so the AMSM processes X along three parallel paths and yields the features \(X_{2}\), \(X_{3}\), and \(X_{4}\). The self-attention mechanism is then employed to assign attention weights to the concatenated features. Attention weights represent the importance or relevance of each key feature in the given context, serving as a score that indicates how much focus should be placed on each feature (attention weights \(\in \left( 0,1\right)\)). Self-attention is needed here because it dynamically highlights the most relevant features, improving feature selection and providing a more flexible, context-aware model. By combining these convolutional operations with self-attention, the AMSM captures both local and global contextual information, enhancing overall model performance.
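A minimal PyTorch sketch of one AMSM path (Eqs. 1–4) might look as follows. The channel count, the single-head attention layout, and the reduction of the attention map to a per-position score are simplifying assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AMSM(nn.Module):
    """Sketch of the Attention Multi-scale Sensing Module (Eqs. 1-4)."""

    def __init__(self, c: int, dilation: int):
        super().__init__()
        # Horizontal / vertical convolutions (Eq. 1)
        self.conv2 = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1))
        self.conv3 = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0))
        # Dilated 3x3 convolution (Eq. 2), rate n in {2, 3, 6}
        self.conv4 = nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation)
        # 3x3 convolution applied after concatenation (Eq. 4)
        self.conv5 = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.conv2(x) + self.conv3(x)            # Eq. 1
        f2 = self.conv4(x)                            # Eq. 2
        feat = self.conv5(torch.cat([f1, f2], dim=1)) # Conv5(Cat(F1, F2))
        n, c, h, w = feat.shape
        # Single-head self-attention over spatial positions (Eq. 3)
        q = feat.flatten(2).transpose(1, 2)           # (N, HW, C)
        k = feat.flatten(2)                           # (N, C, HW)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)
        # Average attention each position receives, as a per-position score
        weights = attn.mean(dim=1).view(n, 1, h, w)
        return weights * feat                         # Eq. 4 (element-wise)

x = torch.randn(1, 8, 16, 16)
out = AMSM(8, dilation=2)(x)
print(out.shape)  # torch.Size([1, 8, 16, 16])
```

Running one instance per dilation rate in {2, 3, 6} would yield the three features \(X_{2}\), \(X_{3}\), and \(X_{4}\).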

Fig. 2

The architecture of the AMSM consists of two parts: the horizontal-vertical-dilation convolution and the self-attention mechanism.

CDCM for channel-refined features

To leverage channel-wise refined features, we propose the CDCM, as shown in Fig. 3. The module implements a ‘Split Crossover-Fusion Deformation’ strategy that systematically processes channel-wise refined features through three coordinated phases. This architecture works in conjunction with the AMSM module to reconstruct the attention maps generated by AMSM using the channel features from the CDCM. The deformable kernel explicitly models channel interactions, while the proposed module amplifies cross-channel discriminative patterns and suppresses redundant feature responses.

Split crossover

For given multi-scale features \(X_{2}\), \(X_{3}\), and \(X_{4}\) \(\in R^{c \times h \times w}\), we first split each of them along the channel dimension into two halves of \(\frac{c}{2}\) channels. In this way, the multi-scale feature maps are split into six features named \(X^1_{2}\), \(X^2_{2}\), \(X^1_{3}\), \(X^2_{3}\), \(X^1_{4}\), and \(X^2_{4}\). We then apply 1\({\times }\)1 Pointwise convolution (Pconv) to obtain the features \(X^2_{2}\), \(X^2_{3}\), and \(X^2_{4}\). Compared to standard convolution, Pconv42 extracts and integrates representative information between channels while reducing the number of parameters and the computational effort. The remaining features are left unchanged and serve to preserve the accuracy of the original features. We then aggregate the features \(X^2_{2}\), \(X^2_{3}\), and \(X^2_{4}\) obtained after the Pconv operation with the split original features \(X^1_{2}\), \(X^1_{3}\), and \(X^1_{4}\) to form the merged representative shape features, where Cross-Block(N) (with N \(\in \left\{ 1,2,3\right\}\)) denotes the operations in Eqs. (5), (6), and (7), respectively. The ‘Split-Crossover’ stage described above can be formulated as follows:

$$\begin{aligned}&{X_{2c}} = Cat (X^1_{2} \oplus X^2_{3}, X^1_{2} \oplus X^2_{4}) \end{aligned}$$
(5)
$$\begin{aligned}&{X_{3c}} = Cat (X^1_{3} \oplus X^2_{2}, X^1_{3} \oplus X^2_{4})\end{aligned}$$
(6)
$$\begin{aligned}&{X_{4c}} = Cat (X^1_{4} \oplus X^2_{2}, X^1_{4} \oplus X^2_{3}) \end{aligned}$$
(7)

where \(\oplus\) is element-wise summation, ‘Cat’ is concatenation, and each split original feature, e.g., \(X^1_{2}\), lies in \(R^{\frac{c}{2} \times h \times w}\). Splitting the features for reconstruction not only reuses the original features but also further refines the channel features and enhances the feature representation.
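A minimal sketch of the ‘Split-Crossover’ stage (Eqs. 5–7), assuming for illustration that a single shared Pconv is applied to all three second halves:

```python
import torch
import torch.nn as nn

def split_crossover(x2, x3, x4, pconv):
    """Sketch of the CDCM 'Split-Crossover' stage (Eqs. 5-7).

    Each scale feature is split into channel halves; the second half is
    passed through a 1x1 pointwise convolution, then summed with the
    first halves of the other scales and concatenated back."""
    c = x2.shape[1] // 2
    x2a, x2b = x2[:, :c], pconv(x2[:, c:])
    x3a, x3b = x3[:, :c], pconv(x3[:, c:])
    x4a, x4b = x4[:, :c], pconv(x4[:, c:])
    x2c = torch.cat([x2a + x3b, x2a + x4b], dim=1)  # Eq. 5
    x3c = torch.cat([x3a + x2b, x3a + x4b], dim=1)  # Eq. 6
    x4c = torch.cat([x4a + x2b, x4a + x3b], dim=1)  # Eq. 7
    return x2c, x3c, x4c

c = 8
pconv = nn.Conv2d(c // 2, c // 2, kernel_size=1)  # Pconv
feats = [torch.randn(1, c, 16, 16) for _ in range(3)]
outs = split_crossover(*feats, pconv)
print([o.shape for o in outs])  # each torch.Size([1, 8, 16, 16])
```

Each output keeps the original channel count c, so the crossover features can be added residually to the AMSM features in the next stage.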

Fusion deformation

The features obtained from the ‘Split-Crossover’ operation are added residually to the features from the AMSM. Weight diffusion is then performed using DeformConv2d43, followed by the fusion of features at three different scales. The combination of these two modules not only fuses features across scales but also enhances the model’s ability to capture multi-level contextual information, improving its robustness and generalization. As a key component, the CDCM leverages residual connections and weight diffusion through DeformConv2d to strengthen channel interactions, enabling the model to emphasize discriminative features while suppressing redundant ones, which leads to a more efficient and effective representation of the data. The formulas are as follows:

$$\begin{aligned}&{X_{i}}^{\prime } = X_{i} + X_{ic}, i = 2,3,4 \end{aligned}$$
(8)
$$\begin{aligned}&{X_{2,3,4}}^{\prime } =DeformConv (Cat ({X_2}^{\prime }, {X_3}^{\prime },{X_4}^{\prime })) \end{aligned}$$
(9)

where \({X_{i}}^{\prime }\) is the feature obtained by the residual connection, which is intended to improve model performance and promote feature reuse. A standard convolution uses a fixed kernel and therefore cannot model complex spatial transformations well; deformable convolution enhances spatial modeling by learning additional offsets that allow the kernel to adaptively adjust its sampling positions.

Fig. 3
Fig. 3
Full size image

The architecture of the CDCM involves two main components. This module separates and cross-fuses multiple feature maps in order to facilitate the reconstruction of channel features.

Integration features

After the AMSM and CDCM modules are applied, the multi-scale feature \({X_{2,3,4}}^{\prime }\), with channel redundancy removed, is obtained. Unlike standard convolution, we use the ADP (yellow background in Fig. 1) module containing the \(1 \times 1\) DWconv to obtain the local features \(X_1\). This drastically reduces the model’s parameter count while preserving the network’s ability to learn cross-channel correlations and feature interactions. Meanwhile, adaptive average pooling extracts the global average information of the input features. Finally, we again use a \(1 \times 1\) DWconv to integrate the two fused features and obtain more accurate feature information, which is calculated as follows:

$$\begin{aligned}&{X}^{\prime } = DWconv (Cat({X_1}, {X_{2,3,4}}^{\prime })) \end{aligned}$$
(10)

where \({X_1}\) is the feature obtained by the ADP module. The combination of adaptive average pooling and DWconv leverages their respective strengths to preserve the spatial information of the inputs, reduce the risk of overfitting, and effectively reduce the number of model parameters, thereby enhancing the feature representation.

In brief, we adopt the AMSM, CDCM, and ADP modules to obtain more comprehensive multi-scale features and eliminate channel redundancy. Overall, AMSM can be deployed standalone or integrated with the CDCM operation. By arranging the AMSM and CDCM modules in a sequential manner and then juxtaposing them with DWconv and adaptive average pooling, the proposed MCDAF module is established.

Loss function

To better classify WBCs and localize their bounding boxes, we use the \(\textit{DIOU}\) loss44 in place of the \(\textit{GIOU}\) loss45 of the baseline model. This change better addresses the WBC overlap problem and makes bounding box regression more stable. The loss function of our model thus consists of two parts: the cross-entropy loss for object classification, and the sum of the \(\textit{L1}\) loss and the \(\textit{DIOU}\) loss for bounding box regression. They are defined as follows:

$$\begin{aligned} L_{\mathrm{{cls}}}&= -\sum (y_i * \log (p_i)) \end{aligned}$$
(11)
$$\begin{aligned} L_{bbox}&= \varepsilon \frac{1}{n} \sum _{i=1}^{n} \left| y_j-\hat{y_j} \right| + \theta \frac{\rho ^{2} (G,F)}{c^{2} } \end{aligned}$$
(12)

where \(y_i\) in Eq. (11) denotes the ground-truth label for category i, taking the value 0 or 1, \(p_i\) denotes the probability that the model predicts category i, and \(\sum\) denotes summation over all categories. The smaller the cross-entropy loss, the smaller the discrepancy between the model predictions and the ground-truth labels, i.e., the better the model performs. In Eq. (12), \(y_j\) is the ground-truth value and \(\hat{y_j}\) is the predicted value, \(\rho ^{2} (G,F)\) represents the squared Euclidean distance between the center points of the ground-truth and predicted boxes, and c is the diagonal length of the smallest box that contains both. \(\varepsilon\) and \(\theta\) are hyperparameters that can be tuned according to the relevant training data.
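A plain-Python sketch of the DIoU loss for axis-aligned boxes, combining 1 − IoU with the \(\rho^{2}/c^{2}\) center-distance penalty that appears in Eq. (12):

```python
def diou_loss(box_p, box_g):
    """DIoU loss for boxes in (x1, y1, x2, y2) form:
    1 - IoU + rho^2(centers) / c^2, where c is the diagonal of the
    smallest enclosing box."""
    # Intersection area
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_p) + area(box_g) - inter)
    # Squared center distance rho^2
    cp = ((box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2)
    cg = ((box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2)
    rho2 = (cp[0] - cg[0]) ** 2 + (cp[1] - cg[1]) ** 2
    # Squared diagonal of the smallest enclosing box
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1.0 - iou + rho2 / c2

print(diou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

Unlike GIoU, the center-distance term still provides a gradient when the predicted and ground-truth boxes overlap heavily, which is why DIoU helps with adhered WBCs.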

Table 1 Total number of various cells and their corresponding counts in the RSLI, LISC, WBCDD, and BCCD datasets.

Experiments and results

Datasets

To validate the model, we used four datasets in total, three of which are widely used public datasets: LISC46, BCCD47, and WBCDD48; RSLI49 is a private dataset. These datasets contain 250, 364, 684, and 288 samples, respectively. Both the LISC and WBCDD datasets contain five types of cells: Neutrophil, Monocyte, Eosinophil, Lymphocyte, and Basophil. The cell types and their counts in each dataset are shown in Table 1, where \(\times\) means not categorized. The details of the four datasets are as follows:

RSLI: This dataset consists of hematological images obtained from peripheral blood smears. The images were captured using a Motic Moticam Pro 252A microscope camera with an N800-D motorized autofocus microscope. The dataset contains 288 rapidly stained peripheral blood smear images, including 1170 white blood cells (WBCs), comprising 1016 neutrophils, 171 monocytes, 123 lymphocytes, and 61 eosinophils. The spatial resolution of each image is \(2048 \times 1536\).

LISC: This dataset is a collection of hematological images obtained from peripheral blood of healthy subjects. Smears were stained by the Gismo-right technique and observations were captured on an Axioskope 40 microscope at 100X magnification using a Sony model SSCDC50AP camera. The spatial resolution of each image is \(720 \times 576\).

BCCD: This dataset contains 364 WBC images taken from peripheral blood and annotated by experts. The smears were stained using the Giemsa-staining technique, and observations were captured using a CCD color camera with 100X conventional light microscopy. Each WBC image extracted from the smear images was annotated by an expert into one of five categories. Each with a resolution of \(640 \times 480\).

WBCDD: This dataset, labeled by a doctor who observes the patient’s blood images through a microscope, contains 684 images with a resolution size of \(4000 \times 3000\).

Data augmentation

To verify the validity of the proposed method, we performed data augmentation on all four datasets to assess the robustness of the trained model. We augmented every training set, increasing its size tenfold, through techniques such as rotating the images at multiple angles, flipping them symmetrically, adjusting contrast and brightness, and adding Gaussian noise. To address class imbalance, such as the low numbers of eosinophils and basophils in the RSLI dataset, we applied augmentation weighting to balance the classes; the same approach was applied to the other datasets.
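The augmentation pipeline described above can be sketched as follows. The parameter ranges and the restriction to 90-degree rotations are illustrative, and in practice the bounding-box annotations must be transformed alongside the images:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Sketch of the augmentations used here: rotation, symmetric flip,
    brightness/contrast jitter, and additive Gaussian noise."""
    img = np.rot90(img, k=int(rng.integers(0, 4)))       # rotation
    if rng.random() < 0.5:
        img = np.fliplr(img)                             # symmetric flip
    alpha = rng.uniform(0.8, 1.2)                        # contrast factor
    beta = rng.uniform(-20, 20)                          # brightness shift
    img = np.clip(alpha * img.astype(np.float32) + beta, 0, 255)
    img += rng.normal(0.0, 5.0, img.shape)               # Gaussian noise
    return np.clip(img, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
aug = augment(img)
print(aug.shape, aug.dtype)  # (64, 64, 3) uint8
```

Applying the function repeatedly, with class-dependent repetition counts, would implement the tenfold expansion and the class-balancing weighting described above.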

Table 2 Comparison of detection results of different leukocyte detection models on the LISC, the RSLI, and the WBCDD dataset. The best results are in bold.
Table 3 Comparison of detection results of different leukocyte detection models on the BCCD dataset. The best results are in bold.
Fig. 4

Detection results of the different methods on the LISC dataset. ‘Ground Truth’ shows the ground-truth labels; (A–D) show the results of DETR, Deformable-DETR, DINO-DETR, and our method, respectively. The first to fifth rows show Basophils, Eosinophils, Lymphocytes, Monocytes, and Neutrophils, each with the ground truth (black) and the detection boxes, together with the predicted category and confidence; red boxes mark missed or incorrect detections.

Implement details

We implemented the model with the PyTorch deep learning framework on Windows 10, using a 2.50 GHz \({\text {Intel}}^{(R)}\) \({\text {Core}}^{(TM)}\) i7 CPU, 64 GB RAM, and an NVIDIA RTX 3090 GPU (24 GB memory). The backbone of our model is initialized with a pre-trained ResNet50 network, whose weights are then fine-tuned via transfer learning. To prevent overfitting, we employed an early stopping strategy, monitoring the validation loss and halting training if no improvement was observed for a specified number of epochs.

For the four datasets, we trained the model for 300 epochs with a batch size of 4, using a learning rate of 0.00001 for the backbone network and 0.0001 for the rest of the model. The StepLR strategy decays the learning rate to 0.1 times its value every 200 epochs. The detection network is optimized with the AdamW optimizer, with hyperparameters \(\beta _{1}\) and \(\beta _{2}\) set to 0.9 and 0.999, respectively, and a weight decay of 0.0001. Each dataset is split into training, validation, and test sets at a ratio of 8:1:1.
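The optimizer and schedule described above can be sketched as follows; the two Conv2d layers merely stand in for the ResNet50 backbone and the rest of MCDAF-Net:

```python
import torch

# Stand-ins for the pre-trained backbone and the remaining model parts.
backbone = torch.nn.Conv2d(3, 8, 3)
head = torch.nn.Conv2d(8, 16, 3)

# Two learning rates (1e-5 backbone, 1e-4 elsewhere), AdamW with the
# stated betas and weight decay, StepLR decaying by 0.1 every 200 epochs.
optimizer = torch.optim.AdamW(
    [{"params": backbone.parameters(), "lr": 1e-5},
     {"params": head.parameters(), "lr": 1e-4}],
    betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)

for epoch in range(300):
    # ... training and validation steps would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # backbone lr decayed to ~1e-6
```

Using two parameter groups lets the pre-trained backbone be fine-tuned more gently than the freshly initialized detection layers.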

Confidence threshold selection

We compared our method with a range of previous methods on various leukocyte object detection datasets, including Faster R-CNN50, RetinaNet21, SSD51, TE-YOLOF16, DETR23, Deformable-DETR25, DINO-DETR27, YOLOV1152, and D-FINE53. To achieve the best detection results for each method, we tested them at different confidence thresholds. Finally, we set the confidence threshold to 0.5, which achieved the best performance for most methods.

Comparison of other methods

To fairly compare the performance of each detection model, our comparative results are based on the object detection results reported in the corresponding papers and are computed with the same evaluation metrics. We report \({\text{AP}}_{50}\) (Average Precision at an IoU threshold of 0.50), \({\text{AP}}_{75}\) (Average Precision at an IoU threshold of 0.75), and AP (Average Precision), as well as AP for individual cell types.

The results of our model on the LISC leukocyte dataset are shown in Table 2. Our method achieves 80.7%, 100%, and 99.5% on the LISC dataset for AP, \({\text{AP}}_{50}\), and \({\text{AP}}_{75}\), respectively. The experimental results show that our model effectively improves the accuracy of leukocyte object detection by using different sets of convolutional operations to expand receptive fields and focus on feature manipulation. Compared to the end-to-end object detection model DETR, our model shows improvements of 5.2%, 0.8%, and 0.3% in AP, \({\text{AP}}_{50}\), and \({\text{AP}}_{75}\), respectively. Compared to SSD, the conventional one-stage detector based on multi-level feature extraction, AP, \({\text{AP}}_{50}\), and \({\text{AP}}_{75}\) improve by 10.4%, 3.9%, and 6.8%, respectively. Against the two-stage detector Faster R-CNN, our model gains 4.2% and 2.6% in AP and \({\text{AP}}_{75}\), respectively. In Table 2, AP values are reported for each category of leukocyte, which helps assess the performance of the model. On the LISC dataset, we observe notably larger improvements in the AP values for monocytes and lymphocytes compared to other models. These results offer an important reference for further evaluating model performance.

To further evaluate the effectiveness of the model, we conducted the same experiments on two other public datasets, BCCD and WBCDD, and the private dataset RSLI. The results are shown in Tables 2 and 3. Our model performs well on all datasets and essentially obtains the best detection results. In the WBCDD dataset evaluation, our model achieves the best AP metric for the Basophils category, likely due to their unique morphological features, which our multi-scale feature fusion and attention mechanism effectively capture. More importantly, our model outperforms others in overall metrics (AP, AP50, AP75), demonstrating its global robustness and generalizability across all categories, rather than overfitting to a single class. Furthermore, to ensure transparency and fairness in our comparison, Table 4 provides a detailed overview of the hyperparameters for each of the models under consideration.

Fig. 5

Detection results of the different methods on the RSLI dataset. ‘Ground Truth’ shows the ground-truth labels; (A–D) show the results of DETR, Deformable-DETR, DINO-DETR, and our method, respectively. The first to fourth rows show Eosinophils, Lymphocytes, Monocytes, and Neutrophils, each with the ground truth (black) and the detection boxes, together with the predicted category and confidence; red boxes mark missed or incorrect detections.

Table 4 Compare the detailed parameters in the methodology.
Fig. 6

Detection results of the different methods on the WBCDD dataset. ‘Ground Truth’ shows the ground-truth labels; (A–D) show the results of DETR, Deformable-DETR, DINO-DETR, and our method, respectively. The first to fifth rows show Neutrophils, Lymphocytes, Basophils, Monocytes, and Eosinophils, each with the ground truth (black) and the detection boxes, together with the predicted category and confidence; red boxes mark missed or incorrect detections.

Ablation studies

To ensure the methodological integrity required for comprehensive model assessment, we conducted a systematic ablation study on all benchmark datasets to rigorously validate the efficacy of the individual modules. Although data augmentation mitigated the class imbalance problem as far as possible, to further ensure the validity of the study we merged the three datasets with the same number of categories (RSLI, LISC, and WBCDD) and repeated the ablation experiments on the merged data. This cross-dataset ablation methodology not only supports the generalizability of our findings but also guards against overfitting to dataset-specific artifacts or annotation biases, establishing a sound basis for interpreting the technical contributions.

Module effectiveness analysis

The proposed MCDAF module integrates three core components, the AMSM, the CDCM, and the ADP module, complemented by the DIoU loss function to specifically address cell overlap. It should be emphasized that the sequential architecture creates an operational dependency between the AMSM and the CDCM: the CDCM requires the feature maps generated by the AMSM as inputs, making their combined implementation mandatory rather than optional. As demonstrated quantitatively in Table 5 through ablation studies across all benchmark datasets, each constituent module makes a significant performance contribution. Notably, the progressive integration of these components systematically enhances detection accuracy, with the complete MCDAF configuration achieving the best metrics, confirming both the individual efficacy and the synergistic value of our architectural design.

Table 5 Ablation study results and AP values at different dilation rates. The best results are in bold.

In particular, although the BCCD dataset has more detailed annotations than the other datasets, covering white blood cells as well as other blood cells, this breadth brings a further challenge: the close arrangement of different blood cell types often leads to frequent object adhesion and occlusion, which increases the complexity of analysis. Accordingly, as Table 5 shows, the DIOU loss function introduced to handle cell overlap is particularly beneficial on this dataset.

Analysis on dilation rate

To explore the effect of different dilation rates on the AMSM, we varied the rates during training on the WBCDD dataset from (2, 3, 4) to (4, 5, 6) and compared the AP, AP50, and AP75 metrics, as shown in Table 6. These metrics perform best with dilation rates of 2, 3, and 6. This superiority is largely attributable to the Transformer structure’s strength in understanding context, while the local information we need is supplied by the AMSM. However, a smaller receptive field may impair the ability to pinpoint regions of interest, which is why dilation rates of 2, 3, and 6 outperform rates of 2, 3, and 4.

Fig. 7

Detection results of the different methods on the BCCD dataset. ‘Ground Truth’ shows the ground-truth labels; (A–D) show the results of DETR, Deformable-DETR, DINO-DETR, and our method, respectively. The first to fifth rows show Basophils, Eosinophils, Lymphocytes, Monocytes, and Neutrophils, each with the ground truth (black) and the detection boxes, together with the predicted category and confidence; red boxes mark missed or incorrect detections.

Table 6 AP, AP50, and AP75 values at different dilation rates. The best results are in bold.

Model visualization analysis

To visually and convincingly demonstrate the accuracy and usefulness of the model predictions, we chose competitive Transformer-based models for comparative visualization. Figures 4, 5, 6, and 7 present the cell categories and locations predicted by the models, superimposed on the original images from the LISC, RSLI, WBCDD, and BCCD datasets together with their ground truth and bounding boxes. Specifically, the black boxes mark the ground truth, the colored bounding boxes illustrate the predictions of our model together with their confidence levels, and the red boxes indicate missed or incorrect detections. The prediction boxes correspond accurately to the detected cells, demonstrating the model’s ability to recognize and localize leukocytes with high accuracy. Regarding the presentation of the BCCD dataset in Fig. 7, the original dataset comprises three categories: ‘Platelets’, ‘RBC’, and ‘WBC’. Since the number of ‘RBC’ instances is very large, labelling each of them would impair the visualization; we therefore display only the visualization results for the ‘RBC’ and ‘WBC’ categories.

The detailed comparisons in these figures show that our model achieves high prediction confidence and near-perfect localization accuracy across these five key leukocyte categories. This not only confirms the model's validity but also underscores its potential value in medical image analysis, disease diagnosis, and related fields.

Discussion

This paper introduces a WBC detection method aimed at improving cell classification by integrating multi-scale features and eliminating channel feature redundancy. Its strong classification performance is demonstrated through extensive experiments. However, two aspects merit further discussion.

Integration of multi-scale features

The proposed method integrates multi-scale features to enhance WBC detection by capturing both local details (e.g., texture, morphology) and global context (e.g., shape, spatial relationships). Local details help identify microstructural features, while global context provides broader spatial information, ensuring more accurate detection. This approach is particularly effective for WBCs, which exhibit significant variability in size, morphology, and structure, and are susceptible to background clutter.

Unlike traditional Transformer-based methods that focus on long-range dependencies, our method emphasizes the fusion of multi-scale information, extracting richer features at different scales. This allows us to better handle complex backgrounds and improve both detection and classification accuracy, especially in the presence of significant background noise.

Elimination of channel feature redundancy

In this paper, we propose a module that addresses channel feature redundancy by reducing it through the interaction of multi-scale features. While multi-scale features are effective in enhancing local information extraction in Transformer-based models, they tend to introduce many redundant features, which are difficult to manage in high-dimensional feature spaces. To address this, we incorporate the CDCM module and integrate it sequentially with the multi-scale feature extraction module. This not only improves computational efficiency but also enhances the discriminative power of the model by focusing on the most salient features for WBC identification. The approach keeps the model lightweight and versatile, managing feature redundancy effectively without compromising performance, especially in complex, high-dimensional feature spaces.
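The exact internals of CDCM are described elsewhere in the paper; as a minimal illustrative stand-in (an assumption, not the authors' module), channel redundancy can be suppressed by squeezing per-channel statistics and gating each channel, so weakly informative channels are down-weighted before subsequent layers:

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Illustrative channel-reweighting sketch: global average pooling
    squeezes each channel to a scalar, a small bottleneck learns channel
    dependencies, and a sigmoid gate in (0, 1) rescales every channel,
    attenuating redundant ones."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # expand back
            nn.Sigmoid(),                                   # per-channel weight
        )

    def forward(self, x):
        return x * self.gate(x)  # broadcast the gate over H and W
```

Because the gate preserves the tensor shape, such a module can be dropped between the multi-scale extraction stage and the rest of the network, consistent with the sequential integration described above.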

Conclusions

In this paper, we introduce MCDAF-Net, a novel architecture designed to enhance receptive field efficiency through the integration of multi-scale dilated convolutions and horizontal-vertical convolutions. This design not only broadens the receptive field but also preserves detailed local information. The incorporation of the self-attention mechanism further optimizes feature extraction by selectively focusing on salient regions, thereby improving the precision of feature representation. Additionally, the Channel Dependency and Correlation Modulation (CDCM) module effectively reconstructs channel features, significantly reducing redundancy and enhancing feature distinctiveness. The MCDAF module, being plug-and-play, offers great flexibility, allowing seamless integration into various convolutional operations without modifying the underlying network structure. To tackle the prevalent issue of cell overlap in leukocyte detection, we employ the DIoU loss function, which markedly improves detection accuracy. Our model's robust performance across multiple datasets (LISC, BCCD, WBCDD, and the private RSLI dataset) demonstrates its practical applicability.
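The DIoU loss follows the standard published formulation, 1 − IoU + d²/c², where d is the distance between the box centers and c is the diagonal of the smallest box enclosing both; the extra center-distance term penalizes misaligned but overlapping boxes, which is why it helps with adhering cells. A sketch for boxes in (x1, y1, x2, y2) format (the tensor layout is an assumption for illustration):

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """Distance-IoU loss: 1 - IoU + d^2 / c^2 for corner-format boxes."""
    # Intersection area
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers
    d2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) / 2) ** 2 \
       + ((pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) / 2) ** 2

    # Squared diagonal of the smallest enclosing box
    ew = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    eh = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = ew ** 2 + eh ** 2 + eps

    return 1.0 - iou + d2 / c2
```

For identical boxes the loss is (numerically) zero, and unlike plain IoU loss it still yields a useful gradient when the predicted and ground-truth boxes do not overlap at all.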

Future research will focus on two primary directions. First, despite our efforts to reduce the number of parameters, the current model still has a substantial parameter count. Therefore, we plan to develop lightweight variants of our model to enhance computational efficiency and scalability. Second, when integrating the attention mechanism with dilated convolution in other datasets, the dilation rate often needs to be adjusted manually. To address this, we aim to develop adaptive modules that can automatically adjust the dilation rate based on the dataset characteristics, thereby improving the model’s adaptability and generalization ability.