Introduction

Plasmodium is a genus of single-celled parasites that cause malaria, a deadly mosquito-borne disease. According to the World Health Organization (WHO), the number of malaria cases rose by around 5 million in 2022 compared with the previous year, affecting 249 million people in total1. According to recent WHO estimates, malaria cases are expected to treble as a result of the COVID-19 outbreak. This motivates government and private organizations to invest more in malaria detection research.

Light microscopy is the standard method of detecting malaria parasites, but its accuracy is sometimes doubtful2,3. It has a sensitivity of 99% but a specificity of only 57%4, and it requires competent medical staff. Additionally, it involves time-consuming, repetitive, and error-prone processes. More advanced techniques, such as polymerase chain reaction (PCR) and rapid diagnostic assays, are evolving. Machine learning (ML) has also made medical diagnosis easier to carry out, decreasing the need for experts and costly medical equipment. However, earlier ML-based malaria detection and classification algorithms relied on handcrafted feature extraction to discriminate medical images, which limited them in terms of data dependency and accuracy, and they were not tested against a proper benchmark5.

In recent years, deep learning (DL) based methods have been adopted for medical image analysis in classification, recognition, localization, and segmentation tasks6,7. As a result of this development, the use of convolutional neural network (CNN) models to detect malaria parasites is becoming more prevalent. CNN models can be integrated into existing systems relatively easily because of their simplicity and ease of use, and they excel at learning complex features and patterns from raw data. In the case of malaria parasite detection, these models can automatically learn discriminative features from microscopic images that may not be easily identifiable or extractable with the handcrafted features of traditional ML-based methods. However, the existing DL-based methods fall short in terms of processing speed.

Several ML approaches for Red Blood Cell (RBC) classification have been developed, including support vector machine (SVM), k-nearest neighbors (KNN), and linear discriminant analysis (LDA)8. Furthermore, sophisticated DL techniques, such as Attention-Dense Circular Net (ADCN)9, deep belief-based classification networks10, Mosquito-Net, LeNet11, AlexNet12, and GoogleNet13, have been developed to identify malaria-infected blood cells using the Malaria Cell Images Dataset from the NIH. Despite the effectiveness of DL-based categorization for malaria parasites, there are still significant obstacles that are mentioned below.

  • Challenges: Limited availability of training data, parasite similarities across classes, and poor smear quality all affect the effectiveness of detection algorithms. Malaria parasite detection in thin blood smears is difficult due to low contrast and blurry borders. Minor variations in these characteristics can impact the classification of malaria parasites. Furthermore, intra-class similarity and variance are important factors influencing the accuracy of malaria parasite detection.

  • Motivation and Contributions: Pre-trained CNN models can be useful in such situations; however, when training data is limited, they may overfit and produce biased results. Furthermore, the huge number of parameters in these models makes them computationally expensive at inference time and slows processing. To address these issues, this research presents a lightweight CNN model, DANet, for distinguishing malaria-infected from uninfected blood smears. In addition, a Dilated Attention block is proposed to improve blood cell classification, thereby increasing the efficiency of malaria parasite identification. The proposed model achieves 97.06% classification accuracy and a 96.98% F1-score using only 20% of the training samples. DANet is designed to have fewer parameters while maintaining high accuracy and can run on mobile devices such as the Raspberry Pi 4b.

The rest of the paper is organized as follows: Section 2 reviews past research on the topic to provide context for the proposed methodology. Section 3 presents the proposed methodology, describing the methods adopted to solve the issues found in prior studies. Section 4 describes the experimental setup, dataset, pre-processing, and evaluation metrics. Section 5 reports the results achieved through experimentation and discusses the performance of the proposed model. Section 6 concludes the paper by reviewing the important findings and suggesting potential directions for future research.

Related work

Extensive research has been conducted on malaria parasite identification and classification using computer-aided diagnostic systems, following two basic methodologies: traditional ML techniques and DL-based approaches, both of which are reviewed in this section.

The traditional method of malaria parasite detection normally consists of several important stages: image preprocessing, blood component segmentation, component classification, and parasite candidate generation. Image preprocessing techniques are used to improve the quality of blood smear images, increasing the precision of subsequent processing steps such as cell segmentation, feature extraction, and classification with typical machine learning approaches. Devi et al.14 developed a hybrid classifier that combined SVM, KNN, Naive Bayes, and Artificial Neural Networks (ANN), achieving a 96.3% accuracy. Their model extracted cell pixels from the background using thresholding and watershed approaches and was then trained on morphologically segmented images. Gezahegn et al.15 suggested a method employing handcrafted Scale Invariant Feature Transform (SIFT) features for malaria infection classification, utilizing an SVM classifier. However, due to the limitations of handcrafted features for training, their approach achieved just 78.89% accuracy. May et al.16 used a median filter to minimize impulse noise and a Wiener filter to reduce additive noise in the preprocessing stages. Malihi et al.17 used the Otsu thresholding approach to preprocess blood samples before classifying them with a KNN classifier. Mandal et al.18 developed a logistic regression-based classification strategy for parasite identification, which achieved an accuracy of 88.77%. Anggraini et al.19 used a Bayesian classifier to categorize malaria parasites into various stages, such as ring shapes and other artifacts, and achieved a 93.3% accuracy. Kshipra Charpe et al.20 presented a method, evaluated on 15 RBC images, that used the watershed transform for segmentation followed by parasite classification. Somasekar et al.21 used an SVM based on morphological procedures to categorize infected cells from 76 images.
For detecting malaria-infected RBC images, these strategies primarily used classic computer vision and ML techniques, such as thresholding algorithms and feature extraction. While some traditional methods produced satisfactory results, they frequently required substantial prior information and complex preparation steps, and they were tested on tiny datasets, limiting the reliability of their findings.

Recent advances in DL have transformed computer vision and disease identification, with CNN-based models leading the way. Popular DL architectures such as VGG, ResNet, AlexNet, InceptionNet, and EfficientNet, pre-trained on large-scale datasets such as ImageNet, have exhibited outstanding performance in a variety of computer vision applications, including medical image classification. Bibin et al.10 developed a Deep Belief Network for malaria-infected image categorization, which uses Restricted Boltzmann Machines (RBM) to process image pixels. Vijayalakshmi et al.22 introduced VGG-SVM, a hybrid CNN and ML model in which a pre-trained VGG network extracts features that are then classified by an SVM. Notably, VGG-SVM attained an accuracy of 89.21% with VGG16 and 93.13% with VGG19. Pattanaik et al.23 developed an auto-encoder-based model, specifically the Stacked Sparse Auto Encoder (SSAE), to obtain improved features, with subsequent classification using a Functional Link ANN (FLANN), obtaining 89.10% accuracy with only 1182 images. They also developed Multi-Magnification ResNet (MM-ResNet), which addressed the vanishing gradient problem by concatenating input and output layers and produced encouraging results. Furthermore, Kumar et al.24 introduced Mosquito-Net, an attention-based classification model designed for lightweight malaria detection that can be deployed on mobile devices. Beyond malaria, Khan et al. (2023)25 explored deep learning applications for COVID-19 detection, evaluating models such as ResNet, VGG, and EfficientNet on radiographic and clinical datasets. Their methodology revolves around transfer learning and ensemble strategies, achieving high diagnostic accuracy but at significant computational cost. While existing models perform well, their computational complexity limits their applicability to mobile devices, and several models lack validation against accepted benchmarks. To address these limitations, we present a lightweight attention-based CNN model for malaria detection and evaluate it on the publicly available NIH dataset.

Despite significant advances in malaria parasite detection, several important research gaps remain unaddressed in the literature. Many high-performing deep learning models, such as VGG, ResNet, and EfficientNet, involve millions of parameters and demand substantial computational resources, limiting their feasibility for real-time diagnosis in low-resource or point-of-care environments. Furthermore, most existing approaches adapt generic computer vision architectures without explicitly accounting for the unique morphological characteristics of malaria parasites, potentially overlooking subtle yet crucial features necessary for accurate detection. The robustness of current models is also hindered by their sensitivity to variations in blood smear quality, low contrast, blurry cell boundaries, and high intra-class similarity, all of which present persistent classification challenges. Additionally, various reported methods have been evaluated on small or homogeneous datasets, raising concerns about their generalizability to broader clinical scenarios. To address these limitations, we propose DANet, a lightweight attention-based CNN that integrates a novel Dilated Attention Block (DAB) to effectively capture multi-scale contextual features while preserving computational efficiency. DANet achieves a high classification accuracy of 97.95% with only 2.3 million parameters, offering a domain-specific, resource-efficient solution suitable for real-time and low-cost deployment in healthcare facilities.

Methodology

Fig. 1. An illustration of the two proposed Dilated Attention blocks: (a) DAB-H, featuring an additional max pool and a \(1\times 1\) convolution layer (highlighted in a red box), and (b) DAB-S with a skip connection.

This section explains the basic components, provides an overview of the architecture, and discusses the theoretical framework behind the proposed Dilated Attention Network (DANet). The architecture of the proposed DANet is shown in Fig. 2.

Dilated attention block

Classifying malaria parasite-infected and uninfected RBC blood smear images poses unique challenges, mostly owing to the blood smear’s similar color tones, the absence of well-defined boundaries, and the varying morphologies of infected cells, as shown in Fig. 3. To address these challenges, attention mechanisms are crucial for increasing classification accuracy. In the proposed model, we introduce a Dilated Attention Block (DAB) to capture and highlight important characteristics in blood smear images.

We use a multi-dilation technique in our architecture to enlarge the receptive field without adding a substantial number of parameters, keeping the model highly efficient. Initially, a conventional convolutional layer with a \(1 \times 1\) kernel is used. Next, we apply three independent \(3 \times 3\) convolutional layers with different dilation factors (DF), namely DF = 1, 2, and 3. This varied dilation provides a wider view of the input while effectively gathering contextual information. The outputs of these convolutional layers are then concatenated to create a preliminary fused feature map, \(f_{preConcat} \in \mathbb {R}^{W \times H \times 3C}\). This fusion helps the model learn features associated with unclear boundaries and blood tissue variations. By combining information from a larger region surrounding each pixel and expanding the receptive field of the input features, the model acquires a broader understanding of the image, making it easier to distinguish small details.

$$\begin{aligned} f_{preConcat} = Concatenate(Conv_1(f_c),Conv_2(f_c),Conv_3(f_c)) \end{aligned}$$
(1)

In Equation 1, \(f_c\) denotes the feature map produced by the \(1 \times 1\) convolutional layer. The notation \(Conv_k\) refers to a \(3 \times 3\) convolution with a dilation factor of k, where \(k \in \{1, 2, 3\}\).
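To make the effect of the dilation factors concrete, the following minimal sketch computes the spatial extent and weight count of each \(3 \times 3\) branch. The function names and the channel count C = 8 are illustrative assumptions and are not taken from the authors' implementation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights (bias excluded); it does not depend on the dilation."""
    return kernel_size * kernel_size * in_channels * out_channels


C = 8  # illustrative channel count (assumption)
for dilation in (1, 2, 3):
    extent = effective_kernel_size(3, dilation)
    weights = conv_weight_count(3, C, C)
    print(f"DF={dilation}: covers {extent}x{extent} pixels with {weights} weights")
```

Each branch uses the same number of weights, while the area it covers grows from \(3 \times 3\) to \(5 \times 5\) and \(7 \times 7\), which is why dilation widens the receptive field without adding parameters.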

The network then performs two distinct pooling operations: average pooling and max pooling. The features derived from these procedures are concatenated and denoted as \(f_{\text {poolConcat}} \in \mathbb {R}^{W \times H \times 6C}\). Max pooling identifies the most important properties within a region by extracting the largest value, whereas average pooling computes the average value to capture the data’s general trends and characteristics. This combination technique efficiently captures both detailed features and overarching patterns, improving the network’s generalizability and robustness.

$$\begin{aligned} f_{\text {poolConcat}} = Concatenate(MaxPool(f_{preConcat}), AvgPool(f_{preConcat})) \end{aligned}$$
(2)

Pooling can be done in two ways: first, by keeping the same dimensions as the input, as shown in Fig. 1b, and second, by halving the spatial size of the feature map relative to the input, as shown in Fig. 1a. Both pooling methods have been tested in this paper to determine their effectiveness. The pooled features are refined with a \(3 \times 3\) convolution layer, followed by layer normalization, a \(1 \times 1\) convolution layer, and a sigmoid operation. This procedure aims to create an attention map that captures structural information of the affected blood smear.

$$\begin{aligned} f_{att} = Sigmoid(Conv(LN(Conv(f_{\text {poolConcat}}))))\end{aligned}$$
(3)

Following this, we conduct an element-wise multiplication between the attention feature map and the input feature.

$$\begin{aligned} f_{DAB}^s = f_{att} \otimes f_{in}\end{aligned}$$
(4)
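The same-dimension variant described by Equations 1 to 4 can be sketched in PyTorch as follows. This is a minimal illustration under several assumptions that the text does not specify: the class name `DABSketch`, pooling with a \(3 \times 3\) window and stride 1, `GroupNorm` with a single group as a stand-in for layer normalization on feature maps, and an attention map with the same channel count as the input. The skip connection mentioned in the caption of Fig. 1b is not modeled, because Equation 4 does not define it.

```python
import torch
import torch.nn as nn


class DABSketch(nn.Module):
    """Hypothetical same-dimension dilated attention block (Equations 1-4)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution producing f_c
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        # Three 3x3 branches with dilation factors 1, 2, 3; padding keeps W x H
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        # Stride-1 pooling keeps the spatial size (assumed window size)
        self.max_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.avg_pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # 3x3 conv -> normalization -> 1x1 conv -> sigmoid (Equation 3)
        self.fuse = nn.Conv2d(6 * channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, channels)  # layer-norm stand-in (assumption)
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f_c = self.reduce(f_in)
        f_pre = torch.cat([conv(f_c) for conv in self.branches], dim=1)   # 3C channels
        f_pool = torch.cat([self.max_pool(f_pre), self.avg_pool(f_pre)], dim=1)  # 6C
        f_att = torch.sigmoid(self.project(self.norm(self.fuse(f_pool))))
        return f_att * f_in  # element-wise multiplication (Equation 4)


x = torch.randn(2, 8, 56, 56)
y = DABSketch(channels=8)(x)
print(y.shape)
```

Because the attention values lie between 0 and 1, the block can only rescale each input activation, which is what allows it to suppress background regions while preserving the input's shape.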

However, this step is only feasible if the pooling procedure produces the same feature dimension as the input. When the feature dimension is halved, there is a dimension mismatch with the input. To address this, we apply max pooling and a \(1 \times 1\) convolution layer to the input features. These halved features are then multiplied element-wise with the attention feature map.

$$\begin{aligned} f_{DAB}^h = f_{att} \otimes Conv(MaxPool(f_{in})) \end{aligned}$$
(5)

where \(f_{DAB}^s\) and \(f_{DAB}^h\) denote the outputs of the same-dimension and halved versions of the dilated attention block, respectively. The architectures of \(DAB-H\) and \(DAB-S\) are shown in Fig. 1a and b, respectively.

Overall architecture

Figure 2 shows the complete architecture of the proposed model. This network has a convolution layer, a few max pooling layers, nine DABs, two fully connected layers, and an output layer with a LogSoftMax activation function. To accommodate both attention variants, we create two separate model versions, using \(DAB-H\) and \(DAB-S\), which are shown in Fig. 2a and b, respectively.
Fig. 2. Architecture of the proposed model using two attention blocks: (a) DANet-H and (b) DANet-S. DANet-H integrates DAB-H, which includes a max pooling layer, resulting in fewer max pooling layers compared to DANet-S.

DANet-H: The model takes images of dimension \(224 \times 224 \times 3\), starting with a convolution layer with 8 filters of size \(3 \times 3\). This stage is followed by a max pooling layer with a pool size of \(2 \times 2\) and a stride of 2, reducing the spatial dimensions of the feature maps. Next, the model has nine \(DAB-H\) blocks. After the initial attention block, the number of filters is doubled every two attention blocks, gradually increasing the model’s ability to capture more complex features. To reduce the feature representation, an average pooling operation with a filter size of \(7 \times 7\) is used. This reduction stage is followed by two fully connected layers of 128 and 64 neurons, respectively, culminating in an output layer for classification.

DANet-S: In this configuration, the model uses nine \(DAB-S\) blocks and max pooling layers. The model starts with a convolution layer of 8 filters, each \(3 \times 3\) in size, to extract initial features from the input image. Next, a max pooling layer with a pool size of \(2 \times 2\) and a stride of 2 is used to reduce the spatial dimensions of the feature maps for better computational efficiency. The architecture then applies \(DAB-S\) blocks, followed by max pooling, to emphasize key features while reducing spatial dimensions. After the initial attention block, the model stacks two \(DAB-S\) blocks followed by max pooling, and the number of filters is doubled for each consecutive pair of attention blocks. This systematic increase in filter count aims to gradually improve the network’s ability to detect and represent more complex features inherent in the data. The fully connected layers and the final classification layer are the same as in DANet-H.
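One reading of the filter-doubling rule can be written as a short sketch. It assumes that the first attention block keeps the 8 filters of the initial convolution and that each subsequent pair of blocks doubles the count; the function name is hypothetical, and the exact per-block widths should be checked against Fig. 2.

```python
def dab_filter_schedule(num_blocks: int = 9, base_filters: int = 8) -> list:
    """Filters per attention block: the first block uses base_filters,
    then the count doubles for each consecutive pair of blocks."""
    schedule = [base_filters]
    filters = base_filters
    for index in range(1, num_blocks):
        if index % 2 == 1:  # first block of a new pair
            filters *= 2
        schedule.append(filters)
    return schedule


print(dab_filter_schedule())
```

Under this assumption the last pair of blocks has 128 filters, which would match the width of the first fully connected layer after global average pooling.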

Experiments

Dataset

The proposed model is trained and evaluated using the Malaria Cell Images Dataset (Footnote 1) from the National Institutes of Health (NIH). The RBC micrograph images were collected at Chittagong Medical College Hospital. This dataset contains images of blood smear slides from 150 malaria-infected patients and 50 healthy persons, which were labeled by professional physicians from the Mahidol Oxford Tropical Medicine Research Unit in Bangkok. The dataset is freely available on a variety of platforms, including the National Library of Medicine (NLM), Kaggle, and the NIH database. It is divided into two categories, parasitized and uninfected, as shown in Fig. 3. The dataset contains a total of 27,558 images, half of which are parasitized blood smear images and the rest uninfected samples. The dataset has been divided into three groups, as shown in Table 1: 70% for training, 10% for validation, and 20% for testing, though data-split variations may apply to the compared models as reported in their original studies.
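As a quick arithmetic check of the 70/10/20 split, the sketch below derives per-class counts from the 27,558 images. The rounding rule (round validation and test, assign the remainder to training) is an assumption, so the training and validation counts are derived values and may differ slightly from those listed in Table 1.

```python
TOTAL_IMAGES = 27_558
FRACTIONS = {"train": 0.70, "validation": 0.10, "test": 0.20}

per_class = TOTAL_IMAGES // 2  # the dataset is balanced across the two classes


def split_counts(n: int, fractions: dict) -> dict:
    """Round the validation and test shares; training gets the remainder."""
    counts = {
        name: round(n * share)
        for name, share in fractions.items()
        if name != "train"
    }
    counts["train"] = n - sum(counts.values())
    return counts


counts = split_counts(per_class, FRACTIONS)
print(per_class, counts)
```

With this rounding, each class contributes 2756 test images, for a test set of 5512 images in total.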

Fig. 3. Different types of uninfected and parasitized cells.

Table 1 Training, test, and validation size class-wise distribution.

Data pre-processing and augmentation

During the pre-processing stage, two fundamental operations are performed. Each image is first resized to \(224 \times 224\) and then normalized from the original range of [0, 255] to [0, 1]. This normalization step helps keep gradients well-behaved and accelerates convergence during model training. Three augmentation strategies are used to reduce the risk of overfitting: random rotation, horizontal flip, and vertical flip. These strategies help to improve the model’s generalization capabilities. The random rotation has a maximum rotation angle of 15\(^\circ\), whereas the horizontal and vertical flips are each applied at random to only 20% of the training data. Figure 4 illustrates the data augmentation techniques applied to the NIH Malaria Dataset, showing original and augmented images of parasitized and uninfected cells. These augmentations simulate variations in smear orientation and appearance, improving DANet’s ability to handle low-contrast images and diverse parasite morphologies. This contributes to the model’s robustness, as demonstrated by its performance across reduced dataset samples.

Fig. 4. Different types of uninfected and parasitized cells after applying data pre-processing and augmentation techniques.

Experimental setup

The proposed model has been trained in the Kaggle notebook environment, which provides a 16 GB Nvidia Tesla T4 GPU and 12 GB of RAM. The programming language used is Python 3.7.6, and the model has been implemented using PyTorch version 1.9.0. All training runs and tests have been carried out on the Nvidia Tesla T4 GPU. The RMSprop optimizer is applied for model training, using a default learning rate of 0.001 and the Negative Log Likelihood (NLL) loss function. In addition, the Sigmoid-output variant of the model has been trained with the Binary Cross Entropy loss function for performance comparison. The hyperparameters listed in Table 2 have been determined through empirical testing on the training set, with adjustments made to optimize accuracy and computational efficiency. The model is trained for 75 epochs on the Malaria Cell Images Dataset from the NIH.
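A single training step with the stated optimizer and loss can be sketched as below. The tiny linear model and random tensors are placeholders (assumptions) standing in for DANet and real image batches; only the RMSprop learning rate of 0.001 and the LogSoftMax plus NLL pairing come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for DANet: logits for two classes, then LogSoftMax
logits_head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
log_softmax = nn.LogSoftmax(dim=1)

optimizer = torch.optim.RMSprop(logits_head.parameters(), lr=0.001)
criterion = nn.NLLLoss()

images = torch.rand(4, 3, 8, 8)        # placeholder batch
labels = torch.tensor([0, 1, 1, 0])    # placeholder class indices

optimizer.zero_grad()
logits = logits_head(images)
loss = criterion(log_softmax(logits), labels)

# LogSoftMax followed by NLL loss equals cross-entropy on the raw logits
reference = F.cross_entropy(logits, labels)

loss.backward()
optimizer.step()
print(float(loss), float(reference))
```

The two printed values agree, which shows that the LogSoftMax and NLL combination is the usual cross-entropy objective expressed in two stages.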

The hyperparameters used to develop the model are given in Table 2. We have not used any additional explicit regularization techniques because DANet has a very low parameter count, and it does not show any signs of overfitting during training.

Table 2 Hyperparameters used to train the proposed model.

Evaluation metrics

We use multiple performance metrics to evaluate the proposed model in this study, including accuracy, sensitivity (recall), specificity, precision, F1-score, and parameter count. The metrics are specified as follows:

Accuracy The accuracy metric is the percentage of correctly predicted cases out of the total number of cases evaluated by the model. It provides an overall evaluation of the model’s predictive performance.

Fig. 5. Training and validation loss and accuracy curves of different models: (a) DANet-S + LogSoftMax, and (b) DANet-H + LogSoftMax.

$$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \end{aligned}$$

Sensitivity (Recall) Sensitivity quantifies the proportion of true positive predictions relative to all actual positive cases. It gauges the model’s ability to correctly identify positive instances.

$$\begin{aligned} \text {Sensitivity or Recall} = \frac{TP}{TP + FN} \times 100 \end{aligned}$$

Specificity Specificity indicates the ratio of true negative predictions to all actual negative instances. It assesses the model’s capability to accurately identify negative cases.

$$\begin{aligned} \text {Specificity} = \frac{TN}{TN + FP} \times 100 \end{aligned}$$

Precision Precision represents the ratio of correctly predicted positive cases to all predicted positive samples. It reflects the model’s accuracy in identifying positive instances.

$$\begin{aligned} \text {Precision} = \frac{TP}{TP + FP} \times 100 \end{aligned}$$

F1-score The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of a model’s performance by considering both precision and recall. Because precision and recall are already expressed as percentages, no further scaling is needed.

$$\begin{aligned} \text {F1-score} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$

Here, TP denotes true positive outcomes, TN represents true negative results, FP signifies false positive outcomes, and FN denotes false negative results. These metrics collectively offer a comprehensive evaluation of the model’s effectiveness in malaria detection.
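The metric definitions above translate directly into code. The sketch below uses hypothetical confusion-matrix counts chosen only for illustration; they are not results from the paper, and the function name is an assumption.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return accuracy, sensitivity, specificity, precision and F1 in percent."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * recall,
        "specificity": 100 * tn / (tn + fp),
        "precision": 100 * precision,
        "f1": 100 * 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only; not the paper's results
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Here the F1-score is computed from precision and recall as fractions and scaled to a percentage once at the end.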

Table 3 Performance comparison of the proposed model and existing models. Bold texts represent the proposed model.
Fig. 6. ROC curves of the different models: (a) DANet-S + Sigmoid, (b) DANet-H + LogSoftMax, and (c) DANet-S + LogSoftMax.

Fig. 7. FLOPs and accuracy comparison of DANet-S + LogSoftMax for different numbers of channels.

Fig. 8. Comparison of the processing time of DANet for different numbers of channels on Raspberry Pi 4b and Nvidia Tesla T4 GPU.

Fig. 9. Sample images of uninfected and parasitized blood smears with the critical regions identified by Grad-CAM. These activation maps highlight the pixels that are important to the model’s classification decisions and illustrate the attention performance of the proposed DANet-S + LogSoftMax.

Results and discussion

To evaluate and compare the classification performance of the proposed model, we have used several existing methods trained on the Malaria Cell Images Dataset from the NIH, such as EfficientNetB026, ResNet15226, NASNetMobile26, InceptionV326, InceptionResNetV226, Yang F. et al.27, VGG1628, VGG1928, ResNet5024, DPN9232, DenseNet12131, DCNN(Falcon)-TL33, VGG16-SVM22, Mosquito-Net24, Alex-Net24, Xception-Net24, and DLRFNet34. This paper uses a variety of evaluation criteria, including accuracy, sensitivity, specificity, precision, and F1-score. Table 3 presents test results for existing and proposed models. The proposed model is compared to existing ones in three versions: DANet-S + Sigmoid, which incorporates a DAB-S block and Sigmoid activation in the final layer; DANet-S + LogSoftMax, which integrates a DAB-S block and LogSoftMax activation in the final layer; and DANet-H + LogSoftMax, which features a DAB-H block and LogSoftMax activation in the final layer. For models with Sigmoid activation in the final layer, Binary Cross Entropy loss is used for training, whereas NLL loss is used for those with LogSoftMax activation.

Figure 5 shows the accuracy and loss curves for the proposed models. Figure 5a shows the performance of DANet-S + LogSoftMax, whereas Fig. 5b depicts DANet-H + LogSoftMax. These figures show that DANet-S + LogSoftMax has a faster convergence rate and stronger training stability than DANet-H + LogSoftMax. Furthermore, DANet-S + LogSoftMax has a smaller gap between training and validation accuracy and loss than DANet-H + LogSoftMax, indicating almost no overfitting.

Each version of the proposed model outperforms existing models on all evaluation measures. The DANet-S + LogSoftMax model has the best performance, with 97.95% accuracy, 97.76% precision, 98.07% sensitivity, 97.87% specificity, and 97.86% F1-score. The DANet-S + Sigmoid model has the second-best performance, with 97.75% accuracy and 97.73% F1-score. In comparison, the DPN92 model produces the lowest results among the methods considered here, with 87.88% accuracy and 87.85% F1-score. Notably, the proposed model not only achieves higher accuracy and F1-score, but it also has the lowest parameter count, with just 2.3 million parameters. In comparison, DPN92, which has the lowest accuracy and F1-score, requires many more parameters (37.7 million).

All versions perform similarly in terms of ROC. DANet-S + LogSoftMax has an AUC of 98%, DANet-S + Sigmoid an AUC of 97%, and DANet-H + LogSoftMax an AUC of 96%. Figure 6 presents the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves for the three variants of the proposed model. The PR curves reveal that DANet-S + LogSoftMax maintains consistently high precision across a broad recall range, whereas the alternative variants exhibit a sharper precision drop at higher recall levels, underscoring the robustness of the proposed approach in minimizing false negatives without sacrificing precision.

We also test the DANet-S + LogSoftMax model with increasing numbers of channels, measuring both accuracy and floating-point operations (FLOPs), as shown in Fig. 7. As the number of channels increases from 2 to 8, both FLOPs and accuracy increase. However, increasing the number of channels to 16 raises the model’s parameter count and complexity, which makes the model more prone to overfitting and reduces accuracy by about 1%. The use of 8 channels therefore provides the best balance between complexity and performance.

Figure 8 illustrates a performance comparison of the proposed model on two different hardware platforms: a mobile device (Raspberry Pi 4b) and a high-performance device (Nvidia Tesla T4 GPU). The Raspberry Pi 4b is equipped with an ARM Cortex-A72 CPU and 2 GB of RAM, while the Nvidia Tesla T4 GPU boasts 16 GB of vRAM and 12 GB of system RAM. The results demonstrate that all three versions of the proposed model perform efficiently on both high-performing systems and mobile devices like the Raspberry Pi 4b. However, due to the significantly lower processing power of the Raspberry Pi 4b compared to the Nvidia Tesla T4 GPU, the processing speed on the Raspberry Pi is slower. Additionally, it is important to note that as the number of channels in the model increases, both the model parameters and FLOPs increase as shown in Fig. 7. This results in longer processing times on devices with limited computational capacity, such as the Raspberry Pi. Nevertheless, the proposed model remains lightweight, so the increase in processing time on high-performing devices like the Nvidia Tesla T4 GPU is minimal.

Table 4 shows how the model performs when trained on different proportions of the training dataset, from 20% to 100%. Regardless of the amount of training data, the model is assessed on the same test set throughout. This experiment demonstrates the model’s capacity to generalize and remain resilient even with limited data. The model performs similarly across dataset sizes, indicating its robustness to data shortage. While increasing data volume generally improves the model’s diagnostic capacity, limiting the dataset size does not considerably reduce it. Notably, on several classification measures, the model trained on only 40% of the training data outperforms the models trained on larger fractions.

We have used Gradient-weighted Class Activation Mapping (Grad-CAM) to better understand the input properties that influence the model’s classification. Figure 9 shows both the input image and the associated Grad-CAM visualization. This visualization is a heatmap that highlights the sections of the image where the model concentrates its attention, indicating the presence or absence of malaria infection and guiding its decision-making process. In malaria-infected images, the model focuses attention on specific areas, but in uninfected images, it spreads attention across the cell. In both instances, the model does not focus on locations outside of the blood cell. This highlights the model’s capacity to attend to significant characteristics within the cell and make correct classifications depending on the presence or absence of malaria infection.

Table 4 Model performance across different dataset sizes of DANet-S + LogSoftMax.
Fig. 10
figure 10

Confusion Matrix of the proposed model DANet-S + LogSoftMax. This plot illustrates the high-dimensional data embedded into two dimensions to showcase clustering patterns and distribution differences among the uninfected and parasitized samples.

To assess the performance of our binary classification model for distinguishing between Parasitized and Unparasitized cells, we have evaluated it on a test set comprising 2756 Parasitized and 2756 Unparasitized samples, totaling 5512 instances. The model’s performance is quantified using several metrics, including accuracy, precision, sensitivity (recall), specificity, and F1 score. Figure 10 presents the confusion matrix, which provides a detailed breakdown of the classification results.

Fig. 11

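t-SNE visualization of the proposed models: (a) DANet-S + LogSoftMax, (b) DANet-H + LogSoftMax. Each panel embeds the high-dimensional features into two dimensions to show how the uninfected and parasitized samples cluster.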

Using t-SNE plots, we analyze how the proposed approach better aligns the target class distribution with that of the source. The plots depict the feature distribution across all samples under consideration. As shown in Fig. 11, the clusters representing the two classes are clearly separated, demonstrating the efficacy of our technique. However, some overlapping samples point to occasional misclassification. Figure 11 also shows that the DANet-S + LogSoftMax configuration produces more distinct t-SNE clusters than DANet-H + LogSoftMax, suggesting that the former configuration classifies better.

Cross-validation results

To assess whether the DANet-S + LogSoftMax model is overfitting, we apply a 5-fold cross-validation scheme. Table 5 presents the results, showing the model’s performance across multiple metrics. On the test set, the model achieves an overall accuracy of 97.95% and an F1 score of 97.87%. Under cross-validation, the average accuracy decreases slightly to 97.25%, with an average F1 score of 97.20%. This slight reduction reflects the more stringent evaluation provided by cross-validation, which better assesses the model’s robustness and generalization. Notably, to the best of our knowledge, none of the existing models in the literature has been evaluated with cross-validation, making a direct comparison difficult.
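A 5-fold scheme of this kind can be sketched as below. This is an illustration under assumptions: the text does not say whether the folds are stratified or how they are shuffled, so stratification, the seed, and the names `stratified_k_fold`, `cross_validate`, and `train_and_score` are all hypothetical. The `train_and_score` callback stands in for training the model on one split and returning a metric such as accuracy.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold keeps the class ratio.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def cross_validate(labels, train_and_score, k=5, seed=0):
    """Average a user-supplied scoring callback over the k folds."""
    return mean(train_and_score(train, test)
                for train, test in stratified_k_fold(labels, k, seed))


# Hypothetical toy labels (0 = uninfected, 1 = parasitized).
toy_labels = [0] * 50 + [1] * 50
splits = list(stratified_k_fold(toy_labels, k=5))
```

Every sample appears in exactly one test fold, so the averaged score uses all of the data for evaluation without ever testing on samples seen during that fold's training.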

The DANet-S + LogSoftMax model uses the same training protocol for both standard training and cross-validation, ensuring a consistent and rigorous evaluation. Despite the minor decrease in cross-validation performance, the model still outperforms existing models, demonstrating robustness across evaluation techniques. These results indicate that the model does not overfit, as it performs well both on the test set and in cross-validation, which supports its suitability for real-world applications.

Table 5 Results obtained using the 5-Fold cross-validation scheme.

Statistical test

We performed McNemar’s test to compare the three proposed models trained on the Malaria Cell Images Dataset from the NIH. This non-parametric test operates on paired nominal data, here the per-sample correct or incorrect outcomes of two models on the same test set; the results are shown in Table 6. The p-value is the probability of observing a disagreement between the two models at least as large as the one measured if both models had the same error rate. A p-value below 0.05 (5%) therefore indicates that the two models are statistically different. Table 6 compares the three versions of the proposed model, with p-values approaching zero for all pairs. Thus, we find that the differences between the models are statistically significant.
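McNemar's test depends only on the two discordant counts: the samples that one model classifies correctly and the other does not. The sketch below shows the exact (binomial) two-sided p-value and the continuity-corrected chi-squared statistic. Which variant was used for Table 6 is not stated, so both are given as assumptions, and the function names are illustrative.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b -- samples model A gets right and model B gets wrong
    c -- samples model B gets right and model A gets wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # Under the null hypothesis each discordant sample is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_chi2(b, c):
    """Continuity-corrected statistic (1 degree of freedom); needs b + c > 0."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

A strongly one-sided split of the discordant samples gives a small p-value, while an even split gives a p-value of 1.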

Table 6 McNemar’s test between the DANet-S + LogSoftMax, DANet-H + LogSoftMax and DANet-S + Sigmoid models.
Table 7 Comparison of classification accuracies between the proposed model and state-of-the-art models on the ALL dataset. The proposed model’s performance is highlighted in bold.
Fig. 12

ROC for ALL dataset with DANet-S + LogSoftMax.

Additional experimentation

To evaluate the model’s performance on an additional dataset and assess whether it is biased towards the Malaria Cell Images Dataset, the Acute Lymphoblastic Leukemia (ALL) image dataset (Footnote 2) is considered. This dataset contains 3256 peripheral blood smear images from 89 suspected ALL patients, whose blood samples were processed and stained by skilled lab workers. It includes two distinct classes: benign and malignant. The ALL group comprises Early Pre-B, Pre-B, and Pro-B ALL malignant lymphoblast subtypes, while hematogone represents the benign class. The sample distribution consists of 985 Early Pre-B, 963 Pre-B, 804 Pro-B ALL, and 504 benign cases.

The proposed model achieves an accuracy of 99.08% on this dataset. In contrast, pre-trained models such as InceptionV3 [45] and ResNet50 [46] achieve lower accuracies of 96.93% and 97.85%, respectively. As shown in Table 7, the proposed model outperforms state-of-the-art models, demonstrating generalization across different datasets. This performance further highlights the proposed model’s effectiveness in distinguishing between malignant and benign cases in the ALL dataset. Figure 12 presents the ROC curves of the proposed model on the ALL dataset. The model attains an AUC of 0.99 for the benign class, 0.99 for the Early Pre-B class, 0.99 for the Pre-B class, and 1.00 for the Pro-B class.
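Per-class AUC values like these are usually obtained one-vs-rest: each class in turn is treated as positive and all others as negative. The sketch below assumes that scheme (the text does not state it) and computes the AUC as the probability that a positive sample is scored above a negative one. The function names and the toy probabilities are hypothetical and are not taken from the ALL experiments.

```python
def roc_auc(scores, is_positive):
    """AUC as P(positive outranks negative); ties count as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, class_names):
    """Per-class AUC from rows of predicted class probabilities."""
    return {
        name: roc_auc([row[k] for row in prob_rows], [y == name for y in labels])
        for k, name in enumerate(class_names)
    }


# Hypothetical predictions for five samples.
classes = ["Benign", "Early Pre-B", "Pre-B", "Pro-B"]
toy_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.3, 0.4, 0.2, 0.1],
]
toy_truth = ["Benign", "Early Pre-B", "Pre-B", "Pro-B", "Benign"]
per_class_auc = one_vs_rest_auc(toy_probs, toy_truth, classes)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation gives the same value more efficiently.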

Conclusion

This research presents DANet, a lightweight yet high-performing architecture for malaria parasite detection, which introduces the Dilated Attention Block (DAB) as a novel attention mechanism. Two variants, DANet-S and DANet-H, have been trained and evaluated on the NIH Malaria Cell Images Dataset, achieving competitive results, with DANet-S reaching 97.95% classification accuracy and 97.86% F1-score while using only 2.3 million model parameters. The model performs consistently across varying numbers of training samples, and the differences among its variants are statistically significant under McNemar’s test. In addition to outperforming previous models, DANet is computationally efficient, making it suitable for deployment on edge devices such as the Raspberry Pi 4B for real-time diagnosis. However, our study is limited by its reliance on the NIH Malaria Cell Images Dataset due to the scarcity of open-access alternatives, which may affect generalization to different imaging conditions or populations. Furthermore, no explicit regularization techniques were applied because the low parameter count prevented overfitting in our experiments; this choice may limit scalability to more complex datasets. We also acknowledge that McNemar’s test is applied only to the DANet variants, as results of the external baseline models are unavailable. Future work will address these challenges by expanding validation to diverse datasets, integrating domain adaptation methods, and testing real-time deployment in clinical settings, thereby enhancing the model’s practical utility for accessible, rapid, and reliable malaria diagnosis in resource-limited environments.