Introduction

Agricultural production plays a crucial role in the sustainable economic and societal growth of a country. High-quality crop yield is essential for satisfying global food demands and improving public health. However, several key factors, such as environmental barriers, pollution, and climate change, adversely affect crop yield and quality. Moreover, poor soil-nutrition management causes severe plant stress, leading to different diseases and resulting in substantial financial loss. Thus, early-stage plant nutrition diagnosis and disease detection are of utmost importance for the overall health monitoring of plants1. Nutrition management in agriculture is a decisive task for maintaining plant growth. In recent times, machine learning (ML) techniques have proven successful for developing decision support systems over traditional manual supervision of agricultural yield. Moreover, nutrient management is critical for improving production growth, and calls for robust and low-cost solutions. Intelligent automated systems based on ML effectively build more accurate predictive models, which are relevant for improving agricultural production.

Nutrient deficiency in plants exhibits certain visual symptoms and may cause poor crop yields2. Diagnosis of plant nutrient inadequacy using deep learning and related intelligent methods is an emerging area in precision agriculture and plant pathology3. Automated detection and classification of nutrient deficiencies using computer vision and artificial intelligence have been studied in the recent literature4,5,6,7,8. Diagnosing nutrient deficiencies in various plants (e.g., rice, banana, guava, palm oil, apple, lettuce, etc.) is vital because soil ingredients often cannot provide the nutrients required for plant growth9,10,11,12. Also, early-stage detection of leaf diseases (e.g., potato, rice, cucumber, etc.) and pests is essential for monitoring crop yield production13. A few approaches to disease detection and nutrient deficiency diagnosis in rice leaves have been developed and studied in recent times14,15,16,17,18. Hence, monitoring plant health, disease, and nutrition inadequacy poses a challenging image classification problem in artificial intelligence (AI) and machine learning (ML)19.

This paper proposes a deep learning method for plant health diagnosis by integrating a graph convolutional network (GCN) on top of a backbone deep convolutional neural network (CNN). The complementary discriminatory features of different local regions of input leaf images are aggregated into a holistic representation for plant nutrition and disease classification. GCNs were originally developed for semi-supervised node classification20. Over time, several variations of GCNs have been developed for graph-structured data21. Furthermore, GCNs are effective for message propagation over image and video data in various applications, and several works have applied GCNs to image recognition22,23. However, little research attention has been given to adopting GCNs specifically for plant disease prediction and nutrition monitoring24. Thus, in this work, we have studied the effectiveness of GCNs in solving the current problem of plant health analysis regarding nutrition deficiency and disease classification for several categories of plants.

The proposed method, called Plant Nutrition Deficiency and Disease Network (PND-Net), attempts to establish a correlation between different regions of the leaves for identifying infected and defective regions at multiple granularities. To this end, region pooling in local contexts and spatial pooling in a pyramidal structure have been explored for a holistic feature representation that captures subtle discriminations of plant health conditions. Existing approaches have built the graph-based correlation directly upon the CNN features, but they have often failed to capture finer descriptions of the input data. In this work, we have integrated two different feature pooling techniques for generating the node features of the graph. This mixing enables an enhanced feature representation, which is further improved by graph layer activations in the hidden layers of the GCN. The effectiveness of the proposed strategy has been analysed with rigorous experiments on two plant nutrition deficiency and two plant disease classification datasets. In addition, the method has been tested on two different human cancer classification tasks to assess its generalization. The key contributions of this work are:

  • A deep learning method, called PND-Net, is devised by integrating a graph convolutional module upon a base CNN to enhance the feature representation for improving the classification performances of unhealthy leaves.

  • A combination of fixed-size region-based pooling with multi-scale spatial pyramid pooling progressively enhances feature aggregation, building spatial relations between the regions via the neighborhood nodes of a spatial graph structure.

  • Experimental studies have been carried out to validate the proposed method on four public datasets covering plant disease classification and nutrition deficiency classification. To assess the generalization of the proposed method, further experiments have been conducted on the cervical cancer cell (SIPaKMeD) and breast cancer histopathology image (BreakHis 40\(\times \) and 100\(\times \)) datasets. The proposed PND-Net has achieved state-of-the-art performances on these six public datasets of different categories.

The rest of this paper is organized as follows: “Related works” summarizes related works. “Proposed method” describes the proposed methodology. The experimental results are showcased in “Results and performance analysis”, followed by the conclusion in “Conclusion”.

Related works

Several works have addressed plant disease detection, most of which were tested on controlled datasets acquired in a laboratory set-up. Only a few works have developed unconstrained datasets considering realistic field conditions, which have been studied in this work. Recent works are briefly reviewed below.

Methods on plant nutrition deficiencies

Bananas are one of the most widely consumed staple foods across the world. An image dataset depicting the visual deficiency symptoms of eight essential nutrients, namely boron, calcium, iron, potassium, manganese, magnesium, sulphur, and zinc, has been developed25 and is tested in this work. The CoLeaf dataset contains images of coffee plant leaves and has been tested for nutritional deficiency recognition and classification26. The nutritional status of oil palm leaves, particularly the status of chlorophyll and macro-nutrients (e.g., N, K, Ca, and Mg), has been evaluated from proximal multispectral images using machine learning techniques27. The identification and categorization of common macro-nutrient (e.g., nitrogen, phosphorus, potassium, etc.) deficiencies in rice plants have been addressed17,28. The percentage of micronutrient deficiencies in rice plants has been estimated using CNNs and Random Forest (RF)28. Detection of biotic-stressed rice leaves and abiotic-stressed leaves caused by NPK (nitrogen, phosphorus, and potassium) deficiencies has been experimented with using a CNN29.

A supervised monitoring system of tomato leaves uses a CNN to recognize and classify the type of nutrient deficiency in tomato plants and achieved 86.57% accuracy30. Nutrient deficiency symptoms have been recognized in RGB images using CNN-based (e.g., EfficientNet) transfer learning on orange with 98.52% accuracy and on sugar beet with 98.65% accuracy31. Classification of nutrient deficiencies in rice plants has reported 97.0% accuracy by combining a CNN with reinforcement learning32. An R-CNN object detector has achieved 82.61% accuracy for identifying nutrient deficiencies in chili leaves33. Feature aggregation schemes combining HSV and RGB features for color, GLCM and LBP for texture, and Hu moments and centroid distance for shape have been examined for nutrient deficiency identification in chili plants34; however, this method performed best using a CNN with 97.76% accuracy. An ensemble of CNNs has reported 98.46% accuracy for classifying groundnut plant leaf images35. An intelligent robotic system with wireless control for monitoring the nutrition essentials of spinach plants in a greenhouse has been evaluated with 86% precision36. The nutrient status and health conditions of Romaine lettuce plants in a hydroponic setup have been tested using a CNN with 90% accuracy37. The identification and categorization of common macro-nutrient (e.g., nitrogen, phosphorus, potassium, etc.) deficiencies in rice plants using pixel ratio analysis in the HSV color space has been evaluated with more than 90% accuracy17. A method for estimating leaf nutrient concentrations of citrus trees from unmanned aerial vehicle (UAV) multi-spectral images has been developed and tested with a gradient-boosting regression tree model with moderate precision38.

Approaches on plant diseases

The classification of healthy and diseased citrus leaf images using a CNN on a Platform as a Service (PaaS) cloud has been developed. The method has been tested using pre-trained backbones and a proposed CNN, and attained 98.0% accuracy and a 99.0% F1-score39. A modified transfer learning (TL) method using three pre-trained CNNs has been tested for potato leaf disease detection, and DenseNet169 achieved 99.0% accuracy40. Likewise, a CNN-based transfer learning method has been adapted for detecting powdery mildew disease with 98.0% accuracy in bell pepper leaves41, and in woody fruit leaves with 85.90% accuracy42. A two-stage transfer learning method has combined Faster-RCNN for leaf detection and a CNN for maize plant disease recognition in a natural environment, obtaining a 99.70% F1-score43. A hybrid model integrating a CNN and random forest (RF) has classified rice hispa disease into distinct intensity levels, attaining 97.46% accuracy44. An improved YOLOv5 network has been developed for cucumber leaf disease and pest detection, reporting 73.8% precision13. A fusion of the VGG16 and AlexNet architectures has attained 95.82% testing accuracy for pepper leaf disease classification45. Likewise, disease classification of black pepper has gained 99.67% accuracy using ResNet-1846. A ConvNeXt with an attention module, namely CBAM-ConvNeXt, has improved the performance with 85.42% accuracy for classifying soybean leaf disease47. A channel extension residual structure with an adaptive channel attention mechanism and a bidirectional information fusion block has been devised for leaf disease classification48; this technique achieved 99.82% accuracy on the PlantVillage dataset. A smartphone application has been developed for detecting habanero plant disease and obtained 98.79% accuracy49. In addition, an ensemble method for a crop monitoring system, identifying plant diseases at early stages using an IoT-enabled system, has been presented with a best precision of 84.6%50. A dataset comprising five types of disorders of apple orchards has been developed, and the best accuracy, tested using a CNN, is 97.3%51. A lightweight model using the ViT structure has been developed for rice leaf disease classification and attained a 91.70% F1-score52.

Methods on graph convolutional networks (GCN)

Though several deep learning approaches have been developed for plant health analysis, little progress has been achieved using GCNs for visual recognition of plant diseases53. The SR-GNN integrates relation-aware feature representation leveraging context-aware attention with a GCN module22. Cervical cell classification methods have been developed by exploring the potential correlations of clusters through a GCN54 and feature rank analysis55. On the other hand, fusions of multiple CNNs, transfer learning, and other deep learning methods have been developed for early detection of breast cancer56; this fusion method achieved a 99.0% F1-score on an ultrasound breast cancer dataset using VGG-16. In this work, a GCN-based method has been developed that captures the regional importance of local contextual features for solving plant disease recognition and human cancer image classification challenges.

Proposed method

The proposed method, called PND-Net, combines deep features using a CNN and a GCN in an end-to-end pipeline, as shown in Fig. 1. Firstly, a backbone CNN computes high-level deep features from input images. Then, a GCN is stacked upon the CNN for refining the deep features using region-based pooling and pyramid pooling strategies, capturing finer details of contextual regions at multiple scales. Finally, a precise feature map is built for improving the classification performance.

Figure 1

Proposed GCN-based method, PND-Net, for visual classification of plant disease and nutrition inadequacy.

Background of graph convolutional network (GCN)

GCNs have been widely used across several domains and applications, such as node classification, edge attribute modeling, citation networks, knowledge graphs, and several other tasks, through graph-based representation. A GCN could be formulated by stacking multiple graph convolutional layers with non-linearity upon traditional convolutional layers, i.e., a CNN. In practice, this kind of stacking of GCN layers at a deeper level of a network enhances the model’s learning capacity. Moreover, graph convolutional layers are effective for alleviating the overfitting issue and can address the vanishing gradient problem by adopting the normalization trick, which is a foundation of GCN modeling. A widely used multi-layer GCN algorithm was proposed by Kipf and Welling20, and has been adopted here. It explores an efficient and fast layer-wise propagation method relying on a first-order approximation of spectral convolutions on graph structures. It is scalable and well suited for semi-supervised node classification on graph-based data. A GCN layer can be simplified into a linear formulation, parameterized by a convolution with filter \(g_{\theta }\) and parameters \(\theta \), which can further be reduced to a single parameter per layer. The simplified graph convolution is concisely defined as20:

$$\begin{aligned} g_{\theta } *X\approx \theta \big (I_P+D^{-0.5}AD^{-0.5} \big )X \end{aligned}$$
(1)

This propagation term (\(\Psi \)) could further be re-normalized to mitigate vanishing or exploding gradients within a deep network.

$$\begin{aligned} \Psi = I_P+D^{-0.5}AD^{-0.5} \rightarrow \tilde{\textbf{D}}^{-0.5} \tilde{\textbf{A}} \tilde{\textbf{D}}^{-0.5} \end{aligned}$$
(2)

where the binary adjacency matrix \(\tilde{\textbf{A}}=\textbf{A}+\textbf{I}_{{P}}\) denotes \(\textbf{A}\) with self-connections, \(\textbf{I}_{{P}}\) is the identity matrix, the degree matrix is \(\tilde{\textbf{D}}_{ii}= \sum _{j} \tilde{\textbf{A}}_{ij}\), and X is the input signal to the graph. The simplified convolved signal matrix \(\Omega \) is given as

$$\begin{aligned} \Omega = \tilde{\textbf{D}}^{-0.5} \tilde{\textbf{A}} \tilde{\textbf{D}}^{-0.5} X \Theta \end{aligned}$$
(3)

where the input features are \(X \in \mathbb {R}^{{P}\times {C}}\), the filter parameters are \(\Theta \in \mathbb {R}^{{C}\times {F}}\), and \(\Omega \in \mathbb {R}^{{P}\times {F}}\) is the convolved signal matrix. Here, P is the number of nodes, C is the number of input channels, and F is the number of filters/feature maps. This form of graph convolution (Eq. 3) is applied to address the current problem, as described in “Graph convolutional network (GCN)”.
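
As a concrete illustration of Eqs. (1)–(3), the following NumPy sketch computes the re-normalized adjacency (Eq. 2) and the convolved signal matrix \(\Omega \) (Eq. 3); the toy graph, signal, and parameter values are illustrative placeholders rather than PND-Net settings.

```python
import numpy as np

def normalized_adjacency(A):
    """Renormalization trick (Eq. 2): D~^{-1/2} A~ D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])            # self-connections: A~ = A + I_P
    d = A_tilde.sum(axis=1)                     # degrees: D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def graph_convolution(A, X, Theta):
    """Simplified spectral convolution (Eq. 3): Omega = D~^{-1/2} A~ D~^{-1/2} X Theta."""
    return normalized_adjacency(A) @ X @ Theta  # (P,P) @ (P,C) @ (C,F) -> (P,F)

# Toy example: P = 4 nodes, C = 3 input channels, F = 2 feature maps.
rng = np.random.default_rng(0)
A = (rng.random((4, 4)) > 0.5).astype(float)
A = np.maximum(A, A.T)                          # make the adjacency undirected
np.fill_diagonal(A, 0)                          # binary, no explicit self-loops
X, Theta = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))
print(graph_convolution(A, X, Theta).shape)     # (4, 2)
```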

Convolutional feature representation

A standard backbone CNN is used for deep feature extraction. An input leaf image with its class label, denoted \(I_l\) \(\in \) \(\mathbb {R}^{h\times w\times 3}\), is passed through the base CNN to extract a feature map \(\textbf{F}\) \(\in \) \(\mathbb {R}^{h\times w\times C}\), where h, w, and C imply the height, width, and channels, respectively. However, this squeezed high-level feature map is not suitable for describing local non-overlapping regions. Hence, the output base feature map is spatially up-sampled to \(\textbf{F}\) \(\in \) \(\mathbb {R}^{H\times W\times C}\), and \(\omega \) distinct small regions are computed, given as \(\textbf{F}\) \(\in \) \(\mathbb {R}^{\omega \times h\times w\times C}\). These regions represent complementary information at different spatial contexts. However, due to the fixed dimensions of the regions, the importance of each region is uniformly distributed, which could be tuned further for extracting more distinguishable information. A simple pooling technique could further be applied at multiple scales to enhance the spatial feature representation. To this end, the region-pooled feature vectors are reshaped into an aggregated spatial feature space upon which multi-scale pyramidal pooling is possible. In addition, this kind of feature representation captures the overall spatiality, helping to understand the informative features holistically and solve the current problem.
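
The following TensorFlow sketch illustrates this up-sampling and region partitioning; the grid size and up-sampled resolution are illustrative assumptions, not the exact configuration of PND-Net.

```python
import tensorflow as tf

def region_partition(feat, grid=2, up_size=(14, 14)):
    """Up-sample a base feature map and split it into grid x grid non-overlapping
    regions: (B, h, w, C) -> (B, omega, h', w', C) with omega = grid**2."""
    feat = tf.image.resize(feat, up_size)                  # (B, H, W, C)
    H, W, C = up_size[0], up_size[1], feat.shape[-1]
    rh, rw = H // grid, W // grid
    feat = tf.reshape(feat, (-1, grid, rh, grid, rw, C))   # split H and W axes
    feat = tf.transpose(feat, (0, 1, 3, 2, 4, 5))          # gather the grid axes
    return tf.reshape(feat, (-1, grid * grid, rh, rw, C))  # (B, omega, rh, rw, C)

# e.g., a ResNet-50 output (1, 7, 7, 2048) -> four 7x7 regions
x = tf.random.normal((1, 7, 7, 2048))
print(region_partition(x).shape)                           # (1, 4, 7, 7, 2048)
```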

Spatial pyramid pooling (SPP)

The SPP layer was originally introduced to alleviate the fixed-length input constraints of conventional deep networks, which effectively boosted model performance57. Generally, an SPP layer is added upon the last convolutional layer of a backbone CNN. This pooling layer generates a fixed-length feature vector and afterward passes the feature map to a fully connected or classification layer. SPP enhances the feature aggregation capability at a deeper layer of a network. Most importantly, SPP applies multi-level spatial bins for pooling while preserving the spatial relevance of the feature map. It has provided robust performance improvements across diverse computer vision problems, including plant/leaf image recognition.

A typical region pooling technique loses its spatial information while passing through a global average pooling (GAP) layer to make it compatible with the GCN. As a result, region pooling followed by a GAP layer aggressively eliminates the informativeness of regions and their correlation, and thus often fails to build an effective feature vector. Also, inter-region interactions are ignored when a GAP layer is applied upon region-based pooling alone. Therefore, it is essential to correlate the inter-region interactions for selecting essential features, which could further be enriched and propagated through the GCN layer activations.

Our objective is to exploit multi-level pooling at different pyramid levels of \(n \times n\) bins on top of the fixed-size regions of the input image. As a result, the spatial relationships between different image regions are preserved, thereby increasing the learning capacity of the proposed PND-Net. The input feature space prior to pyramid pooling is given as \(\textbf{F}^{\omega \times (HW)\times C}\), which has been derived from \(\textbf{F}^{\omega \times H\times W\times C}\). It enables the selection of contextual features of neighboring regions (i.e., inter-regions) through pyramid pooling simultaneously. This little adjustment in the spatial dimension of the input features prior to pooling captures the interactions between the local regions of the input leaf image. Experimental results reflect that pyramidal pooling indeed improves image classification accuracy over region pooling alone.

$$\begin{aligned} {F}_{SPP}=\textit{PyramidPooling}\Big ({\textbf {F}}_{\delta _i\times \delta _i}; {\textbf {F}}_{\delta _j\times \delta _j}\Bigl ) \end{aligned}$$
(4)

where \(\delta _i\) and \(\delta _j\) define the window sizes, which pool a total of \(P=(i\times i) + (j\times j)\) feature maps after SPP, given as \(\textbf{F}^{P\times C}\). These feature maps are further fed into a GCN module, described next. The key components of the proposed method are pictorially illustrated in Fig. 1.
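
An illustrative sketch of this pyramid pooling is given below; since the exact binning of Eq. (4) is configurable, the pooling here is written as adaptive averaging over the flattened region-spatial axis with assumed levels of 2\(\times \)2 and 4\(\times \)4 bins.

```python
import tensorflow as tf

def pyramid_pool(feat, levels=(2, 4)):
    """Pyramid pooling (Eq. 4) over a flattened feature axis.
    feat: (B, S, C) with S = omega * H * W; each level n contributes n*n pooled
    vectors, giving P = sum(n*n) node features of shape (B, P, C)."""
    pooled = []
    S = feat.shape[1]
    for n in levels:
        bins = n * n
        # split the first (S // bins) * bins positions into equal segments,
        # truncating any remainder, and average each segment into one vector
        splits = tf.split(feat[:, : (S // bins) * bins, :], bins, axis=1)
        pooled += [tf.reduce_mean(s, axis=1, keepdims=True) for s in splits]
    return tf.concat(pooled, axis=1)              # (B, P, C), P = 4 + 16 = 20

x = tf.random.normal((1, 4 * 7 * 7, 2048))        # omega = 4 regions of 7x7, C = 2048
print(pyramid_pool(x).shape)                      # (1, 20, 2048)
```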

Graph convolutional network (GCN)

A graph \(G=({P},E)\), with P nodes and E edges, is constructed for feature propagation. A GCN is applied to build spatial relations between the features through the graph G. The nodes are characterized by the deep feature maps, and each node outputs C convolved features. The edges E are described by an undirected adjacency matrix \(\textbf{A} \in \mathbb {R}^{{P}\times {P}}\) representing the node-level interactions. This graph convolution has been applied to \(F_{SPP}\) (i.e., \(\textbf{F}^{P\times C}\)), described above. The layer-wise feature propagation rule is defined as:

$$\begin{aligned} \textbf{G}^{(l+1)} = \sigma \left( \hat{\tilde{\textbf{A}}}\textbf{G}^{(l)}\textbf{W}^{(l)}\right) ; \hspace{0.2 cm} \text {with} \hspace{0.2 cm} \textbf{G}^{(0)}=\textbf{F}^{P\times C}, \hspace{0.2 cm} \text {and} \hspace{0.2 cm} \textbf{G}^{(L)}=\textbf{F}_{G}\in \mathbb {R}^{{P}\times {C}} \end{aligned}$$
(5)

where \(l=0, 1, \dots , L-1\) indexes the layers and \(\textbf{W}^{(l)}\) is the weight matrix of the l-th layer. A non-linear activation function (e.g., ReLU) is denoted by \(\sigma (.)\). The symmetrically normalized adjacency matrix is \(\hat{\tilde{\textbf{A}}}=Q\tilde{\textbf{A}}Q\), where \(Q=\tilde{\textbf{D}}^{-1/2}\) and \(\tilde{\textbf{D}}\) is the diagonal node degree matrix of \(\tilde{\textbf{A}}\) (defined in Eq. 2). The reshaped convolutional feature map \({\textbf {F}}\) is fed into two graph convolutional layers, which capture local neighborhoods via the rectified linear unit (ReLU) activations in the graph convolutional layers. The dimension of the output feature maps remains the same as the input of the GCN layers, i.e., \( \textbf{G}^{(L)}\rightarrow {\textbf {F}}_{G}\) \(\in \mathbb {R}^{{P}\times {C}}\). Although the node features could be squeezed to a lower dimension, this may lose essential information pertinent to spatial modeling; hence, the channel dimension is kept uniform within the network pipeline in our study. Afterward, the graph-transformed feature maps (\({\textbf {F}}_{G}\)) are pooled using a GAP layer for selecting the most discriminative channel-wise feature maps of the nodes.
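
A minimal sketch of the propagation rule of Eq. (5) as a Keras layer is given below; the normalized adjacency is assumed to be precomputed, and the all-ones matrix used here is a placeholder (see “Summary of convolutional network architectures” for how the complete-graph adjacency is formed).

```python
import tensorflow as tf

class GraphConv(tf.keras.layers.Layer):
    """One graph convolutional layer of Eq. (5):
    G^(l+1) = sigma(A_hat @ G^(l) @ W^(l)), with A_hat precomputed."""
    def __init__(self, units, a_hat):
        super().__init__()
        self.units = units
        self.a_hat = tf.cast(a_hat, tf.float32)            # (P, P) normalized adjacency

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="glorot_uniform", name="w")

    def call(self, x):                                     # x: (B, P, C)
        x = tf.einsum("pq,bqc->bpc", self.a_hat, x)        # neighborhood aggregation
        return tf.nn.relu(tf.einsum("bpc,cf->bpf", x, self.w))

# Two stacked GCN layers keeping the channel dimension C = 2048, as in PND-Net.
P, C = 20, 2048
a_hat = tf.ones((P, P)) / P                                # placeholder complete graph
nodes = tf.random.normal((1, P, C))
out = GraphConv(C, a_hat)(GraphConv(C, a_hat)(nodes))      # (1, 20, 2048)
```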

Figure 2

Sample images of the banana dataset showing the nutrition deficiencies of iron, calcium, and magnesium.

Figure 3

Sample images of coffee leaves with nutrition deficiencies of boron, manganese, and nitrogen.

Classification module

Generally, regularization is a standard way to tackle the training-related challenges of a network, such as overfitting. Here, layer normalization and dropout layers are interposed as regularization techniques for handling overfitting. Lastly, \({F}_{final}\) is passed through a softmax layer to compute the output probability of the predicted class label \(\bar{b}\), corresponding to the actual label \(b \in Y\) of the object classes Y.

$$\begin{aligned} \textit{F}_{final}=\textit{Regularization} \Bigl ( GAP\bigl ({F}_{G}\bigr ) \Bigr ), \hspace{0.3 cm} \text {and} \hspace{0.3 cm} {Y}_{pred}={Softmax}\Bigl (\textit{F}_{final}\Bigr ). \end{aligned}$$
(6)

The categorical cross-entropy loss function (\(\mathscr {L}_{CE}\)) and the stochastic gradient descent (SGD) optimizer with a \(10^{-3}\) learning rate have been chosen for the experiments.

$$\begin{aligned} \mathscr {L}_{CE}= - \sum _{i=1}^{N} Y_{i}\log \hat{Y}_i \end{aligned}$$
(7)

where \(Y_i\) is the actual class label, \(\hat{Y}_i\) is the predicted class probability computed by the softmax activation function \(\sigma (.)\) in the classification layer, and N is the total number of classes.
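
An illustrative Keras sketch of the classification module (Eq. 6) and the training objective (Eq. 7) is given below; the layer ordering follows the description above, while the number of classes is a placeholder.

```python
import tensorflow as tf

def classification_head(num_classes):
    """Sketch of Eq. (6): GAP over the P node features, normalization and
    dropout as regularization, then a softmax classifier."""
    return tf.keras.Sequential([
        tf.keras.layers.GlobalAveragePooling1D(),      # (B, P, C) -> (B, C)
        tf.keras.layers.LayerNormalization(),          # regularization, per the text
        tf.keras.layers.Dropout(0.3),                  # drop-out rate used in the paper
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

head = classification_head(num_classes=8)              # e.g., 8 banana classes
head.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
             loss="categorical_crossentropy",          # Eq. (7)
             metrics=["accuracy"])
```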

Results and performance analysis

At first, the implementation description is provided, followed by a summary of the datasets. The experiments have been conducted using conventional classification and cross-validation protocols. The performances are evaluated using the standard well-known metrics: accuracy, precision, recall, and F1-score (Eq. 8).

$$\begin{aligned} \begin{aligned} {Accuracy}&=\frac{TP+TN}{TP+TN+FP+FN} \\ {Precision }&=\frac{TP}{TP+FP} \\ {Recall}&= \frac{TP}{TP+FN} \\ {F1}\text {-}{Score }&= 2\times \frac{{Precision}\times {Recall}}{{Precision} + {Recall}} \end{aligned} \end{aligned}$$
(8)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. However, accuracy is not a good assessment metric when the data distributions among the classes are imbalanced. To overcome such misleading evaluation, precision and recall are useful metrics, based on which the F1-score is measured. These three metrics are widely used for evaluating predictive performance when classes are imbalanced. In addition, we have evaluated the performance using confusion matrices, which provide a reliable assessment of our model. The performances have been compared with existing methods, as discussed below.
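
For clarity, the metrics of Eq. (8) can be derived directly from a confusion matrix, as in the following sketch; the macro averaging shown here is one common convention, and the toy matrix is illustrative.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy and macro-averaged precision, recall and F1 (Eq. 8).
    cm[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # TP / (TP + FP), per class
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # TP / (TP + FN), per class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

cm = np.array([[50, 2, 1], [3, 45, 2], [0, 4, 43]])  # toy 3-class confusion matrix
print(metrics_from_confusion(cm))
```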

Implementation details

A concise description of the model development regarding the hardware resources, software implementation, data distribution, evaluation protocols, and related details is furnished below for easier understanding.

Summary of convolutional network architectures

The Inception-V3, Xception, ResNet-50, and MobileNet-V2 backbone CNNs with pre-trained ImageNet weights are used for convolutional feature computation from the input images. The Inception module focuses on increasing network depth using 5\(\times \)5, 3\(\times \)3, and 1\(\times \)1 convolutions58. Later, the 5\(\times \)5 convolution was replaced by factorizing it into 3\(\times \)3 filter sizes59. Afterward, the Inception module was further refined by decoupling the channel-wise and spatial correlations via point-wise and depth-wise separable convolutions, which are the building blocks of the Xception architecture60. The separable convolution applies a depth-wise convolution for spatial aggregation (3\(\times \)3 filters) followed by a point-wise convolution (1\(\times \)1 filters) for cross-channel aggregation into a single feature map. Xception is a three-flow architecture developed with depth-wise separable convolution layers and residual connections. The residual connection, a.k.a. shortcut connection, is the central idea of the deep residual learning framework, widely known as the ResNet architecture61. Residual learning represents an identity mapping through a shortcut connection, following a simple addition of the feature maps of previous layers rendered using 3\(\times \)3 and 1\(\times \)1 convolutions. This identity mapping does not incur additional computational overhead and still eases the degradation problem. In a similar fashion, MobileNet-V2 uses bottleneck separable convolutions with kernel size 3\(\times \)3 and inverted residual connections62. It is a memory-efficient framework suitable for mobile devices.

These backbones are widely used in existing works on diverse image classification problems (e.g., human activity recognition, object classification, disease prediction, etc.) due to their superior architectural designs63,64 at reasonable computational cost. Here, these backbones are used for a fair performance comparison with the state-of-the-art methods developed for plant nutrition and disease classification65. We have customized the top layers of the base CNNs for adding the GCN module without altering their inherent layer-wise building blocks and convolutional design, such as the kernel sizes, skip connections, output feature dimension, and other design parameters. The basic characteristics of these backbone CNNs are briefed in Table 1. The network depth, model size, and parameters have increased due to the addition of GCN layers upon the base CNN, as evident in Table 1.

Table 1 Design specifications of backbone CNNs and characteristics of PND-Net.
Table 2 Details of implementation specifications.

Two GCN layers have been used with ReLU activation, and the feature size is the same as the base CNN’s output channel dimension. For example, the channel feature size of ResNet-50, Xception, and Inception-V3 is 2048, which is kept as the GCN’s channel feature dimension. The adjacency matrix is built as a complete graph, considering the overall spatial relations among the different neighborhood regions. Therefore, each region is related to all other regions even if they are far apart, which helps capture long-distance feature interactions and build a holistic feature representation via a complete graph structure. Batch normalization and a drop-out rate of 0.3 are applied in the overall network design to reduce overfitting.
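
A short sketch of such a complete-graph adjacency with the symmetric normalization of Eq. (2) is given below; the choice of P = 20 nodes is illustrative.

```python
import numpy as np

def complete_graph_adjacency(P):
    """Symmetrically normalized adjacency of a complete graph over P regions,
    so every node attends to every other node (and itself via A~ = A + I)."""
    A = np.ones((P, P)) - np.eye(P)        # complete graph: all off-diagonal edges
    A_tilde = A + np.eye(P)                # add self-connections -> all-ones matrix
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# e.g., P = 20 nodes; every normalized entry equals 1/P = 0.05
print(complete_graph_adjacency(20)[0, :3])
```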

Data pre-processing and data splitting techniques

The basic pre-processing technique provided by the Keras applications for each backbone has been applied. It converts the input images from RGB to BGR, and then each color channel is zero-centered with respect to the ImageNet dataset, without any scaling. Data augmentation methods such as random rotation (\(\pm 25^{\circ }\)), scaling (\(\pm \)0.25), Gaussian blur, and random cropping to a 224\(\times \)224 image size from the 256\(\times \)256 input size are applied on-the-fly for data diversity in the image samples.
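
A minimal sketch of this pre-processing and augmentation pipeline is given below, assuming the ResNet-50 variant of the Keras pre-processing; the Gaussian blur step is omitted here for brevity.

```python
import tensorflow as tf

# Backbone-specific pre-processing from the Keras applications (RGB -> BGR and
# zero-centering w.r.t. ImageNet statistics for ResNet-50, without scaling).
preprocess = tf.keras.applications.resnet50.preprocess_input

# On-the-fly augmentations approximating those described above.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(25.0 / 360.0),   # random rotation of +/- 25 degrees
    tf.keras.layers.RandomZoom(0.25),               # scaling of +/- 0.25
    tf.keras.layers.RandomCrop(224, 224),           # 224x224 crop from 256x256 inputs
])

image = tf.random.uniform((1, 256, 256, 3), maxval=255.0)
batch = preprocess(augment(image, training=True))
print(batch.shape)                                  # (1, 224, 224, 3)
```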

We have maintained the same train-test split provided with the datasets, e.g., PlantDoc. However, the other plant datasets do not provide any specific image distribution. Thus, we have randomly divided those datasets into train and test samples following a 70:30 split ratio, as followed in several works. The details of the image distribution are provided in Table 3. For cross-validation, we have randomly divided the training samples into training and validation sets with a 4:1 ratio, i.e., five-fold cross-validation in a disjoint manner, which is a standard technique adopted in other methods66. The test set remains unaltered for both evaluation schemes for a clear performance comparison. Finally, the average test accuracy of five executions on each dataset has been reported here as the overall performance of PND-Net.
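
The splitting scheme can be sketched as follows; the arrays standing in for images and labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical arrays standing in for image paths and class ids.
images = np.arange(1000)
labels = np.random.randint(0, 7, size=1000)

# 70:30 train-test split for datasets shipped without an official split.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.30, random_state=42)

# Five disjoint train/validation folds (4:1) drawn from the training portion;
# the test set stays untouched across both evaluation schemes.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(kfold.split(x_train)):
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} validation")
```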

A summary of the implementation specification indicating the hardware and software environments, training hyper-parameters, data augmentations, and estimated times (milliseconds) of training and inference is given in Table 2. Our model is trained with a mini-batch size of 12 for 150 epochs, and the learning rate is divided by 5 after 100 epochs. No other criterion, such as early stopping, has been followed. The proposed method is developed in TensorFlow 2.x using Python.
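
The learning-rate step can be expressed as a simple Keras scheduler, sketched below under the stated schedule.

```python
import tensorflow as tf

def step_decay(epoch, lr):
    """Divide the learning rate by 5 once, after 100 of the 150 epochs."""
    return lr / 5.0 if epoch == 100 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)
# model.fit(train_ds, epochs=150, callbacks=[lr_callback])  # mini-batch size 12
```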

Dataset description

Figure 4

Sample images of potato leaves infected by bacteria, pests, and nematodes.

Figure 5

Sample images of infected leaves of soybean, tomato, and bell pepper from the PlantDoc dataset.

The four plant datasets used in this work are summarized in Table 3. These datasets were collected from public repositories such as Mendeley Data and Kaggle.

  • The Banana nutrition deficiency dataset represents healthy samples and the visual symptoms of deficiencies of boron, calcium, iron, magnesium, potassium, sulphur, and zinc. Samples of this dataset are shown in Fig. 2. More details are provided in Ref. 25.

  • The Coffee nutrition deficiency dataset (CoLeaf-DB)26 represents healthy samples, and the deficiency classes are: boron, calcium, iron, manganese, magnesium, nitrogen, potassium, phosphorus, and other deficiencies. Samples of the dataset are illustrated in Fig. 3.

  • The Potato disease classes are: virus, Phytophthora, pest, nematode, fungi, bacteria, and healthy. Samples of this dataset are shown in Fig. 4. The dataset is collected from the Mendeley repository67.

  • PlantDoc is a realistic plant disease dataset65 comprising different disease classes of apple, tomato, potato, strawberry, soybean, raspberry, grape, corn, bell pepper, and others. Examples are shown in Fig. 5.

  • The Breast Cancer Histopathology Image Classification (BreakHis)68 dataset with 40\(\times \) and 100\(\times \) magnifications contains eight classes: four benign (adenosis, fibroadenoma, phyllodes tumor, and tubular adenoma) and four malignant (ductal carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma). Samples of this dataset are exemplified in Fig. 6.

  • The SIPaKMeD dataset69 contains 4050 single-cell images, which are useful for classifying cervical cells in Pap smear images, shown in Fig. 7. This dataset is categorized into five classes based on cytomorphological features.

Figure 6

Sample images of the BreakHis-40\(\times \) dataset.

Figure 7

Sample images of the SIPaKMeD dataset.

Table 3 Dataset summary with the baseline accuracy (%) using different base CNNs only.
Table 4 Overall performances (%) of the proposed PND-Net built upon different standard base CNNs.

Result analysis and performance comparison

A summary of the datasets with their data distributions, and the baseline accuracy (%) achieved by the aforesaid base CNNs, are briefed in Table 3. The baseline models are developed using the pre-trained CNN backbones with ImageNet weights. A backbone CNN extracts the base output feature map, which is pooled by a global average pooling layer and classified with a softmax layer. Four backbone CNNs with different design characteristics have been used for generalizing our proposed method. The baseline accuracies are reasonable and consistent across the various datasets, as evident in Table 3.

Two different evaluation strategies, i.e., conventional classification and k-fold cross-validation (\(k=5\)), have been followed. An average performance has been estimated from multiple executions on each dataset and reported here. The top-1 accuracies (%) of the proposed PND-Net, comprising two GCN layers with feature dimension 2048 on top of different backbone CNNs, are given in Table 4. The overall performance of PND-Net on all datasets is significantly improved over the baselines, clearly showing the efficiency of the proposed method. In addition, the PND-Net model has been tested with five-fold cross-validation for a robust performance analysis (“Fivefold cross validation experiments”). These cross-validation results (Tables 5, 6 and 7) on each dataset could be considered as benchmark performances using several metrics. Our method has achieved state-of-the-art performances on these datasets for plant disease and nutrition deficiency recognition.

Table 5 The performance of PND-Net on the potato disease dataset using fivefold cross validation.

An experimental study has been carried out on two more public datasets for human medical image analysis. The BreakHis dataset with 40\(\times \) and 100\(\times \) magnifications68 and the SIPaKMeD dataset69 have been evaluated for generalization. The SIPaKMeD dataset69, useful for classifying cervical cells in Pap smear images (Fig. 7) and categorized into five classes based on cytomorphological features, has been classified using the proposed PND-Net. The conventional classification results are given in Table 4, and the cross-validation performances are provided in Tables 6 and 7.

Table 6 The performance of PND-Net on the BreakHis-40\(\times \) dataset using fivefold cross validation.

Fivefold cross validation experiments

The fivefold cross-validation experiments on the various datasets have been conducted for evaluating the performance of PND-Net using the ResNet-50 and Xception backbones, and the results are given in Tables 5, 6, and 7. The actual train set is divided into five disjoint subsets of images for each dataset. In each experiment, four out of five subsets are used for training and the remaining one is validated independently. Finally, the average validation result of the five folds is reported.

The results of five-fold cross-validation on the potato leaf disease dataset are provided in Table 5. The numbers of potato leaf images in each fold for training, validation, and testing are 1608, 402, and 869, respectively. The results using different metrics are computed, and the last row reports the average cross-validation performance on this dataset.

Likewise, the five-fold cross-validation performance on the BreakHis-40\(\times \) dataset is presented in Table 6. In this experiment, the training set in each fold contains 1280 images, and the validation set contains the remaining 320 images. The test set contains 400 images, which remains the same as in the aforesaid experiments. The model of each fold has been validated and then tested on the test set. Lastly, the average result of the five-fold cross-validation has been computed and is given in the last row of Table 6.

A similar experimental set-up of five-fold cross-validation has been followed for the other datasets. The average performances of PND-Net on these datasets are provided in Table 7. The average cross-validation results are better than those of the conventional classification approach on the potato disease (ResNet-50: 94.48%) and BreakHis-40\(\times \) (ResNet-50: 97.10%) datasets. The reason could be that the variation of the validation set in each fold enhances the learning capacity of the model through training data diversity. As a result, improved performances have been achieved on diverse datasets. The results are consistent with those of the conventional classification method on the other datasets, as described above. The overall performances on different datasets validate the generalization capability of the proposed PND-Net.

Table 7 Fivefold cross validation and test accuracy (%) of PND-Net on different datasets.
Table 8 Model parameters (millions) of PND-Net including base CNNs and GCNs with feature sizes of 1024 and 2048, specified within parentheses.
Figure 8

Confusion matrices have been computed using the proposed PND-Net with ResNet-50 backbone on: (a) top-row: PlantDoc; (b) bottom-row: potato, coffee, and banana datasets.

Figure 9

Confusion matrix on the BreakHis-40\(\times \) dataset (left) and smear PAP cell dataset (right) using the proposed PND-Net built upon the Xception backbone.

Figure 10

The t-SNE plots on the Potato leaf dataset using PND-Net with ResNet-50 (left) and Inception-V3 (right).

Figure 11

The Grad-CAM outputs for various datasets are shown, from left to right: nutrition deficiency, potato and corn diseases, and breast cancer. The top row shows the original images, and the corresponding Grad-CAM images are shown in the bottom row.

Model complexity and visualization

The model parameters are computed in millions, as provided in Table 8. The model parameters have been estimated for three cases: (a) the baseline, i.e., the backbone CNN only; and the output feature dimension of the GCN layers set to (b) 1024 and (c) 2048. The average computational time of PND-Net using ResNet-50 has been estimated: the training time is 15.4 ms per image, the inference time is 5.8 ms per image, and the model size is 122 MB (given in Table 2). The confusion matrices on the four plant datasets are shown in Fig. 8, indicating the overall performance using ResNet-50. Also, the feature map distributions are clearly separated into different clusters in the t-SNE diagrams70 of two backbone models on the potato leaf dataset, shown in Fig. 10. The gradient-weighted class activation mapping (Grad-CAM)71 is illustrated in Fig. 11 for visual explanations, which clearly show the discriminative regions of different images.
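
The Grad-CAM visualizations of Fig. 11 can be reproduced with a short routine such as the following minimal sketch, which assumes a Keras model and the name of its last convolutional layer (e.g., “conv5_block3_out” for ResNet-50).

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_idx):
    """Minimal Grad-CAM71 sketch: weight the last convolutional feature maps by
    the pooled gradients of the target class score, then ReLU and normalize."""
    conv_layer = model.get_layer(last_conv_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[tf.newaxis])
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)            # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))      # GAP over spatial dims
    cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", conv_out, weights))[0]
    return cam / (tf.reduce_max(cam) + 1e-8)          # heat map in [0, 1]

# e.g., heat = grad_cam(model, img, "conv5_block3_out", predicted_class)
```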

Performance comparison

The highest reported accuracy on Banana nutrition classification was 78.76% on the raw dataset and 87.89% on an augmented version of the original dataset72. In contrast, our method has attained 84.0% using the lightweight MobileNet-V2 and a best of 90.0% using ResNet-50 on the raw dataset, implying a significant improvement in accuracy on this dataset.

The performances of PND-Net on CoLeaf-DB (the Coffee dataset) are very similar across backbones, and the best accuracy (90.54%) is attained by Xception. The differences in performance with the other base CNNs are very small, implying consistent performance. The elementary result using ResNet-50 reported on this recent public dataset is 87.75%26. Thus, our method has set new benchmark results on CoLeaf-DB for further enhancement in the future. Likewise, the Potato Leaf Disease dataset is a new one67, collected from the Mendeley data source. We are the first to provide in-depth results on this realistic dataset acquired in an uncontrolled environment.

A deep learning method has attained 81.53% accuracy using an Xception backbone and 78.34% accuracy using an Inception-V3 backbone on the PlantDoc dataset73. In contrast, our PND-Net has attained 84.30% accuracy using Xception and 81.0% using Inception-V3, respectively. This evinces that PND-Net is more effective in discriminating plant diseases compared to the best reported existing methods. Clearly, the proposed graph-based network (PND-Net) is capable of distinguishing different types of nutrition deficiencies and plant diseases with a higher success rate on real-world public datasets.

The BreakHis dataset has been studied for 4-class and binary classification in several existing works. However, we have compared our results with works classifying the images into 8 categories at the image level for a fair comparison. The top-1 accuracy attained using Xception is 94.83%, whereas the state-of-the-art accuracy on this dataset is 93.40 ± 1.8%, achieved using a hybrid harmonization technique74. The accuracy reported using a class structure-based deep CNN is 92.8 ± 2.1%75. Our cross-validation results (ResNet-50: 97.10%) improve over these existing methods.

Several deep learning methods have been evaluated on the SIPaKMeD dataset. A CNN-based method achieved 95.35 ± 0.42% accuracy69, a PCA-based technique obtained 97.87% accuracy for 5-class classification76, 98.30% was reported using Xception77, and 98.26% using a DarkNet-based exemplar pyramid deep model78. A GCN-based method has reported 98.37 ± 0.57% accuracy54. A few more comparative results have been studied in Ref. 79. In contrast, our method has achieved 98.98 ± 0.20% accuracy, and 99.10% test accuracy with cross-validation, using the Xception backbone on this dataset. The confusion matrices on both human disease datasets are shown in Fig. 9. Overall, the rigorous experimental results imply that the proposed method has achieved state-of-the-art performances on different types of datasets representing plant nutrition deficiency, plant disease, and human disease classification.

Table 9 Ablation study: top-1 accuracy (%) of baseline CNNs in addition to region pooling.

Ablation study

An in-depth ablation study has been carried out to observe the efficacy of the key components of PND-Net. Firstly, the significance of computing different local regions is studied. These fixed-size regional descriptors are combined to create a holistic representation of the feature maps over the baseline features. Notably, the region pooling technique has improved the overall performances on all the datasets; e.g., the gain is more than 12% on the Banana nutrition deficiency dataset using the ResNet-50 backbone. The results of this study are provided in Table 9.

Afterward, a component-level study has been conducted by removing a module from the proposed PND-Net to observe its influence on performance. An ablation study depicting the significance of the spatial pyramid pooling (SPP) layer has been conducted, and the results are shown in Table 10. As the selection of discriminatory information at multiple pyramidal scales is avoided, the model may overlook finer details which could otherwise have been captured by the SPP layer. This causes an obvious degradation of the capacity of the network architecture, which is evident from the performances. Thus, capturing multi-scale features is useful to select relevant features for effective learning of plant health conditions.

Table 10 Ablation study: top-1 accuracy (%) except SPP layer in the PND-Net architecture.
Table 11 Ablation study: top-1 accuracy except the GCN layers.
Table 12 Ablation study: top-1 accuracy using one GCN layer (%).
Table 13 Ablation study: top-1 accuracy (%) with feature dimension 1024 in GCN layers of PND-Net.

Next, the GCN module is excluded from the network architecture, and experiments have been conducted with the regional features selected by our composite pooling modules (i.e., regions + SPP) from the upsampled high-level deep features of a base CNN. The results are provided in Table 11.

Figure 12

(a) The performances of various formulations of the numbers of regions and spatial pyramid pooling feature vectors; (b) the performances of different channel-wise node features within the GCN layer activation and propagation in the proposed method using the ResNet-50 backbone.

It is evident that the GCN module indeed improves performance remarkably. In the case of the Banana dataset using the Xception backbone, the accuracy of PND-Net is 89.25%, whereas, without the GCN layers, the accuracy degrades to 81.46%, implying a 7.79% drop. Moreover, one GCN layer (Banana: 86.0%) does not suffice to render state-of-the-art performance on these plant datasets. The results of considering a one-layer GCN on all datasets are demonstrated in Table 12. Indeed, two GCN layers are beneficial in enhancing the performance over one GCN layer, which is evident in the literature22. Hence, two GCN layers are included in the proposed PND-Net architecture.

A comparative study on different numbers of regions and numbers of pyramid-pooled feature vectors using ResNet-50 is shown in Fig. 12a, which clearly implies a gradual improvement in accuracy on the PlantDoc and Banana datasets. Lastly, the influence of different feature vector sizes in the GCN layer activations has been studied. In this study, channel dimensions of 1024 and 2048 have been chosen for building the graph structures using the ResNet-50 backbone, with the same channel dimensions carried through the PND-Net architecture. The results of such variations (Fig. 12b) provide insightful implications about the performance of the GCN layers.

The performances of PND-Net with a GCN output feature vector size of 1024 are summarized in Table 13. The results are very competitive with those of the GCN size of 2048. Thus, the model with a 1024 GCN feature size could be preferred, considering a trade-off between the model's parametric capacity and its performance. The detailed experimental studies imply an overall performance boost on all datasets, and the proposed PND-Net achieves state-of-the-art results. In addition, new public datasets have been benchmarked for further enhancement.

However, other categories of images, such as high-resolution and hyperspectral images, have not been evaluated. One reason is the unavailability of such plant datasets for public research. Also, data modalities such as soil-sensor information could be utilized for developing fusion-based approaches. Several existing ensemble methods have used multiple backbones, which suffer from higher computational complexity. Though our method performs better than several existing works, the computational complexity of PND-Net regarding model parameters and size could be improved. The reason is the plugging of the GCN module upon the backbone CNN, which incurs more parameters. To address this challenge, the graph convolutional layer could be simplified to reduce the model complexity. In addition, more realistic agricultural datasets representing field conditions such as occlusion, cluttered backgrounds, lighting variations, and others could be developed. These limitations of the proposed PND-Net will be explored in the near future.

Conclusion

In this paper, a deep network called PND-Net has been proposed for plant nutrition deficiency and disease recognition using a GCN module, which is added on top of a CNN backbone. The performances have been evaluated on four image datasets representing plant nutrition deficiencies and leaf diseases. These datasets have recently been introduced publicly for assessment. The method has been generalized by building the deep network using four standard backbone CNNs, and the network architecture has been improved by incorporating pyramid pooling over region-pooled feature maps and feature propagation via a GCN. We are the first to evaluate these nutrition inadequacy datasets for monitoring plant health and growth. Our method has attained state-of-the-art performance on the PlantDoc dataset for plant disease recognition. We encourage researchers to further enhance the results on these public datasets for early-stage detection of plant abnormalities, which is essential for sustainable agricultural growth. Furthermore, experiments have been conducted on the BreakHis (40\(\times \) and 100\(\times \) magnifications) and SIPaKMeD datasets, which are suitable for human health diagnosis. The proposed PND-Net has attained enhanced performances on these datasets too. In the future, new deep learning methods will be developed for early-stage disease detection of plants and health monitoring with balanced nutrition using other data modalities and imaging techniques.