Introduction

Rising cases of crop diseases, driven by climate change, globalisation, and large-scale agriculture, pose a major threat to global food security and agricultural sustainability. Prompt detection and interception are therefore essential, not only to reduce yield losses but also to avert larger epidemics. Manual field inspection is time-consuming, requires trained personnel to distinguish diseases that are often difficult to identify, and, because it relies largely on the human eye, is prone to error. Hence, there is an urgent need for modern technology (e.g., precision agriculture, artificial intelligence (AI), and drone imaging) that enables scalable, accurate, and near real-time disease surveillance over large agricultural landscapes1,2.

Recent studies have shown that, under controlled conditions, AI-driven computer vision models can classify crop diseases with impressive accuracy3,4. Uncrewed aerial vehicles (UAVs) equipped with multispectral and hyperspectral cameras have also been used successfully for large-scale agricultural imaging5,6. Yet several key limitations remain. Most existing models are not real-time and are therefore unsuitable for deployment on low-end devices. They often do not incorporate environmental contextual data, such as temperature, humidity, and soil moisture, which are essential for understanding disease spread7,8. Furthermore, most studies focus on a single crop or a small area, limiting scalability and generalisability9. For example, conventional Convolutional Neural Network (CNN) based models, such as MobileNet10 and Siamese Networks11, and the more recent Vision Transformer (ViT) approach12, achieve high accuracy under laboratory conditions but struggle in diverse real-world agricultural scenarios because of dataset bias and environmental variation. These constraints call for an advanced, unified framework that combines deep learning with drone imaging, IoT sensor fusion, and edge computing for genuine real-time monitoring in agriculture.

Despite this promising evidence, research on multimodal data and hybrid deep learning models that can provide actionable insights at scale remains scarce. Existing work typically considers either drone imagery without environmental context or IoT sensor data alone, so neither provides a holistic picture of crop health. Moreover, only a limited number of systems are tailored for real-time, low-latency execution on edge devices that can be deployed in the field. This lack of integration and field validation has impeded the large-scale adoption of AI-based agricultural disease monitoring7,13.

To bridge these gaps, we present AgroVisionNet, a hybrid deep learning model that integrates CNN-based spatial feature extraction with Transformer-based contextual attention and IoT sensor data fusion. The model uses edge computing to enable on-device, real-time disease detection across large agricultural areas. The system is intended to provide farmers with early warnings and actionable information so that they can implement interventions that reduce crop losses and promote sustainability. By integrating visual and environmental information, AgroVisionNet improves disease diagnosis accuracy and generalisation across different conditions. The use of drones in agriculture, along with representative applications, is shown in Fig. 1.

While earlier CNN-Transformer models for plant disease detection rely only on visual cues, AgroVisionNet employs an adaptive fusion strategy that weighs the visual and IoT modalities with learnable coefficients (α, β), which are trained end-to-end via backpropagation. This allows the model to dynamically emphasise the more informative modality as field conditions change. In addition, TensorFlow Lite quantisation is applied to optimise the network for the edge, reducing computational cost and enabling real-time inference on low-power devices such as the Jetson Nano. This dual innovation, sensor-aware adaptive fusion combined with efficient on-edge deployment, distinguishes AgroVisionNet from conventional hybrid architectures and enables practical, field-ready, scalable precision agriculture. The paper makes the following contributions.

  1. We propose AgroVisionNet, whose backbone is a new hybrid CNN-Transformer architecture in which the CNN extracts spatial crop-health descriptors from drone images while the Transformer captures long-range contextual dependencies, making the backbone well suited to multimodal agricultural environments.

  2. Rather than simply concatenating image embeddings and time-aligned IoT/environmental features, we present an adaptive cross-modal fusion block with learnable weights, making the overall model robust to changing field conditions and sensor noise.

  3. We design an edge-centric optimisation pipeline (TensorFlow → TFLite dynamic quantisation, fixed 224 × 224 inputs, low-precision inference) and show that the complete multimodal model runs on an NVIDIA Jetson Nano with latency suitable for in-field operation.

  4. We create a reference dataset that pairs drone RGB/multispectral images with corresponding sensor streams from multiple crops and locations, and we provide train/validation/test splits for reproducible benchmarking.

  5. We integrate explainable AI (Grad-CAM) visualisations into our pipeline to expose disease-spot activation maps, giving agronomists greater transparency into the automated decisions.

In contrast to previous studies, which either fused image features with fixed sensor-based metadata or operated only in the cloud, AgroVisionNet incorporates an adaptive, learnable fusion mechanism that temporally aligns sensor readings with image embeddings. Moreover, we explicitly design and validate the model for low-latency execution on an embedded edge platform (Jetson Nano) using TensorFlow Lite quantisation and input-size standardisation, details rarely reported for agricultural disease-detection models. Finally, we create a synchronised drone–IoT dataset across different crops to enable multimodal, reproducible evaluation.

The rest of this paper is structured as follows. Section “Related work” reviews related work on drones, AI for plant disease detection, IoT integration, and precision agriculture. Section “Proposed framework” describes the proposed system, including data acquisition, model architecture, and algorithms. Section “Experimental results” provides details on the experimental setup, the dataset, and the performance evaluation. Results and limitations of the study are discussed in Sect. “Discussion”. The paper concludes with a summary in Sect. “Conclusion and future work”.

Related work

This section explores the recent literature on using artificial intelligence (AI), drones, and computer vision to improve agricultural monitoring and disease detection. The studies depict AI’s role in precision farming, with special attention to early detection and effective monitoring. In line with the objectives of this study, this review organises references into categories: drone use, AI-based disease diagnostics, IoT integration, sensor technology, and challenges.

Role of drones in agricultural monitoring

Drones have emerged as a disruptive technology in agricultural engineering, enabling high-resolution surveying of vast farmland areas with a minimal workforce14. They have significantly improved the efficiency of crop management tasks, including disease detection, pest monitoring, and environmental assessment. Slimani et al.1 showed that drones can reliably classify plant diseases and reduce the human error inherent in traditional inspection practices. Similarly, Abbas et al.2 found drone-based techniques for early crop disease diagnosis to be faster and more efficient than conventional ground-based methods.

Several studies have examined the use of UAVs in agriculture. Velusamy et al.7,13 addressed the application of UAVs to crop health and pest monitoring, proposing new approaches to increase the precision and robustness of aerial imaging data. Xue et al.14 and Dutta and Goswami15 reviewed the use of UAVs to overcome labour shortages and reduce environmental impact. In addition, Mogili and Deepak16 examined UAV systems for pesticide spraying, motivated by the health risks of direct farmer exposure to chemicals sprayed at ground level.

The use of drones for field-based disease assessment and crop monitoring was studied in more detail by Kumar et al.17 and Puri et al.18, who reported that although UAVs show high potential for precision farming, drone hardware, image-processing analysis, and autonomous flight control still need improvement to increase reliability and scalability.

Chin et al.10 and Peña et al.19 recently reviewed these issues, examining the use of drones for plant disease detection and suggesting future research directions. Peña et al.19 also discussed UAV applications for phytosanitation in oil palm plantations, particularly pest and disease surveillance at plantation scale. Hasanaliyeva et al.8 summarised recent advances in UAV-based plant disease detection systems and stressed the importance of digital technology for augmenting decision-making.

More recently, developments combining IoT with drones have broadened their range of applications. Doggalli et al.9 and Gao et al.12 showed that UAVs integrated with IoT networks can improve pest and disease monitoring by facilitating real-time information sharing. These works also explored energy-efficiency and flight-endurance issues, both of which are essential for large-scale drone deployment.

Across these applications, drones have proven highly effective for disease detection, pest control, irrigation management, and environmental monitoring. Yet most previous methods rely primarily on visual data and do not incorporate environmental sensors or real-time edge computing, which restricts their scalability and their ability to react as conditions change. Motivated by these observations, we develop the AgroVisionNet framework, integrating drone imagery and IoT sensors into a real-time, multimodal solution for large-scale crop disease detection.

Artificial intelligence for plant disease detection

AI is a critical component in improving the precision of plant disease detection. Misra et al.4 provided perspective on the use of AI in agriculture, in particular on opportunities to enhance economic viability and feasibility, and illustrated practical issues regarding the effectiveness of different ML/DL techniques20. Ahmad et al. and Ghosh et al.21 reviewed studies on DL for plant disease diagnosis, while22,23 identified gaps and suggested developing tools that meet farmers’ needs. Latif et al.24 introduced a high-accuracy DCNN-based method for rice disease detection but identified dataset bias as a limitation. Kamilaris et al.25 noted that DL models are highly accurate across various classification problems but remain underexplored for many agricultural problems beyond those covered by traditional techniques. Ai et al.26 and Nguyen et al.27 applied convolutional neural networks to crop disease classification, but only on a limited number of datasets.

While Ferentinos28 reported good accuracy with CNNs, overfitting remained a limitation. Lee et al.29 and Shewale and Daruwala30 sought to enhance CNN specificity for broader agricultural use. Alkan et al.31 reviewed hybrid DL models, and many other enhancements of these models for more accurate detection have been proposed32,33. Parez et al.34 presented a lightweight DL model for disease detection, whereas Albattah et al.35 and Jafar et al.36 stressed intelligent agricultural solutions that merge AI with IoT.

Recently, considerable effort has gone into improving plant disease recognition using lightweight CNN variants and Vision Transformer (ViT) architectures, which are suitable for practical agricultural deployment and offer fast, accurate image classification. Sandler et al.37 introduced the MobileNetV3 architecture and demonstrated that an efficient convolutional design reduces overall complexity, making mobile neural networks usable on drones and end devices. Similarly, Kunduracıoğlu and Paçal38 used EfficientNet models to detect sugarcane leaf disease and concluded that compact models can still achieve high accuracy under the realistic resource constraints of field environments.

Vision Transformers (ViTs) have also attracted growing interest for agricultural imagery. Paçal39 proposed data-efficient ViT models for sugarcane leaf disease detection and demonstrated their generalisation performance on small datasets. Kunduracıoğlu and Paçal40 further compared CNNs and ViTs for grape disease detection, highlighting that the self-attention mechanism provides better feature embeddings and classification accuracy under changing environmental conditions.

Building on these advances, hybrid backbones that combine CNNs and ViTs have emerged as a promising approach, extracting local spatial features with CNNs while leveraging global attention mechanisms. Shandilya et al.41 designed a hybrid CNN-ViT model for maize leaf disease identification that outperformed standalone CNNs and Transformers by efficiently fusing local and global feature learning. Moreover, Paçal and Kunduracıoglu42 conducted an extensive meta-study of CNN- and ViT-based models, showing that hybrid models are better suited to large, complex plant datasets.

Although these works are effective and novel, most focus solely on image data and do not involve the IoT-based environmental sensing or edge-computing optimisation needed for real-time, large-scale agricultural monitoring. Motivated by these limitations, we propose AgroVisionNet to bridge the gap between deep learning research and practical constraints, combining robust multimodal data fusion with deployable edge inference for scalable crop disease detection.

Deep learning models in precision agriculture

Deep learning models are increasingly central to precision agriculture. Zhang et al.43 and Altalak et al.44 reviewed the applications of DL in dense scene parsing, while other papers proposed techniques to better handle input data45,46,47. Yuan et al.48,49 examined transfer learning methods and datasets for agricultural disease detection with a view to real-world deployment. Coulibaly et al.50,51 identified critical applications and challenges of DL in agriculture, complemented by a bibliometric analysis52. Ren et al.53 and Bharman et al.54 provided insights into the strengths and weaknesses of DL models and identified areas for future research. Wang et al.55 reviewed DL applications in hyperspectral imaging, and the need for careful data processing was underlined in56,57. Sharma et al.58 used segmented images to improve CNN performance, and related approaches were investigated by Zhang et al. and Pradhan et al.59,60,61,62. Studying a publicly available dataset63, the authors noted limitations in achieving high classification accuracy for disease. Rakhmatulin et al.64 discussed deep neural network approaches for real-time weed detection in precision agriculture, covering strengths such as CNN-based feature extraction, challenges in datasets and preprocessing, and the integration of such automated processes with IoT systems for practical precision agriculture.

Integration of IoT, big data, and AI

The convergence of IoT, big data, and AI has driven agricultural growth. These technologies help address several challenges in the agri-food industry65,66 and were discussed by Purnama and Sejati3. Dutta and Mitra6 considered smart farming through IoT and sensors for agriculture, with an emphasis on automation. Dhanya et al., Neupane and Gurel67, and PiLi (2023) also emphasised sensor fusion and the use of real-time data for diagnosing disease68. Li et al.69 recently reviewed AI techniques for disease classification, feature extraction, and their efficiency. Albattah et al.35 and Peña et al.19 addressed combining sustainable development with advanced technology. Materne and Inoue70 reported on IoT-based systems for detecting pests and diseases at an early stage. Darwin et al.71 and Senior et al.52 described the development of big data analytics for crop monitoring and sustainability.

Sensor and imaging technology advancements

Soil type, pest pressure on pest-sensitive crops, and soil moisture are crucial variables in defining crop-production trends over time and are therefore treated as detailed input variables. Buchelt et al.5 explored explainable AI as a way to enhance drone-based analysis, while Gomez et al.72 and Fountsop et al.73 reported disease detection through image annotation and model compression. A further study74 reviewed drone imaging systems for plant disease identification performed in real time, with results delivered directly to end-users. Song et al.75 and Mhango et al.76 emphasised the key role that high-resolution imagery plays in precision agriculture77. Bauer et al.78 and Peng et al.7 reviewed computer-vision applications for aerial phenotyping and proposed further data integration. Ang et al.79 used hyperspectral and multispectral data for crop analysis.

Challenges in data collection and processing

Data collection and processing remain major bottlenecks for these tasks. Velusamy et al. and Mogili and Deepak16 identified energy-efficiency and data-uncertainty issues for UAVs. Gao et al.12 proposed frameworks to address these limitations. Wani et al.11,80 and Albattah et al.35 emphasised data quality, with35 focusing on ethics and environmental adaptability. Isiaka et al.81, Alkan et al.31, and Hasanaliyeva et al.8 addressed the gathering of different types of data. Coulibaly et al.52 consequently called for improved statistical validation techniques. Sustainable farming is receiving growing attention: Doggalli et al.9 and Sharma et al.58 suggested methods for enhancing the robustness of disease-detection systems, while Peña et al.19 and Bhatti et al.82 examined the use of renewable resources in agriculture. Albattah et al.35, Isiaka et al.81, and Shahi et al.83 highlighted the stakeholder engagement and policies needed for a sustainable future. Siddappa84 proposed innovations to drone systems to enhance their functionality. Mowla and Avsar85 surveyed wireless protocols for smart agriculture, covering coverage, energy, and scalability, and discussed interoperability and connectivity issues in large-scale farm deployments. Mowla and Gk86 provided a comprehensive review of deep learning-based weed-detection networks, including region-based and convolutional architectures, datasets, and challenges for real-time precision agriculture applications. Francis and Deisy89 proposed a CNN-based visual learning framework for accurate plant disease detection and classification using image-based feature extraction and deep convolutional architectures.

Table 1 Synthesised review of existing Drone-/IoT-Based crop disease detection studies and identified research gap for AgroVisionNet.

The literature clearly indicates that technologies such as AI and drones have the potential to transform agricultural disease diagnosis. This research responds to the crucial challenge of outbreak detection by employing innovative imaging technologies, IoT-enabled integration, and deep learning. The proposed work addresses the identified gaps in sustainability, data, processing, and real-time application to improve productivity and resilience. Table 1 consolidates key works already cited in the manuscript, contrasts their modalities, speeds, and robustness, and highlights the unresolved need for an edge-deployable, multimodal CNN-Transformer framework such as AgroVisionNet.

Proposed framework

Given these challenges and opportunities, we designed AgroVisionNet as a next-generation framework that combines deep learning, drones, IoT sensors, and edge computing to tame the complexity and scale of agricultural disease monitoring. The framework addresses the limitations of current systems by being integrated, real-time, context-aware, and scalable across heterogeneous agricultural settings. It consists of several components that together enhance the sensitivity of disease detection. Drones outfitted with multispectral and hyperspectral imaging systems acquire high-resolution (cm-scale) images of crops, while IoT sensors track environmental parameters such as temperature, humidity, and soil moisture. The resulting multimodal dataset is pre-processed, denoised, georeferenced, and augmented to maximise quality and robustness while keeping noise levels low. The model combines CNN-based spatial feature extraction from imagery with Transformer layers that capture contextual relations. Integrating IoT sensor data with visual features not only improves disease classification accuracy but also yields higher-level interpretations of disease dynamics. Finally, edge computing processes data in real time to minimise latency, and the framework delivers timely insights, disease heat maps, and in-depth reports to the end-user.

Overview

Figure 1 shows the proposed framework, in which UAVs with onboard computer vision monitor large agricultural fields and raise alerts when plant disease outbreaks are detected. The architecture is implemented in three main modules. In the data acquisition module, drones equipped with multispectral and hyperspectral cameras obtain a high-resolution image dataset from different farming fields. GPS-equipped drones fitted with high-resolution cameras and environmental sensors can scan thousands of acres, and autonomous flight paths are designed from geospatial information to maximise coverage. Simultaneously, the images are relayed through an airborne station and streamed continuously to a ground control station for near real-time processing.

The data processing module applies a complete preprocessing pipeline to each captured image: noise reduction suppresses background interference from the environment, while georeferencing aligns the images with their geographic coordinates. The dataset is enlarged with augmentation techniques such as flipping, rotation, and scaling. Disease classification and detection are handled by a hybrid deep learning model (HLDM) that combines CNN and Transformer architectures63. This helps the model extract the most important attributes (such as colour, texture, and shape), yielding better classification accuracy and more separable features for distinguishing different plant diseases. Transfer learning with pre-trained models such as ResNet50 and EfficientNet yielded faster training and higher-confidence predictions. Explainable AI (XAI) methods provide transparency about affected areas and produce actionable, interpretable information to help mitigate their effects.

Fig. 1 Block Diagram of the Proposed Methodology for AI and Drone-Based Agricultural Monitoring.

The drones also carried IoT-enabled sensors for environmental sensing, providing context such as temperature and soil moisture. Edge computing allows the drones to analyse the data they collect locally, close to the point of collection, which minimises latency and enables rapid responses to disease outbreaks. Once analysed, the data are converted into relevant insights in the decision-support module. The output is presented as heatmaps of disease severity and prevalence for each monitored field. Based on this information, farmers receive in-depth reports in the app describing the degree of spread, the crop affected, and how to prevent further spread.

The methodology introduced several innovations. A hybrid model architecture that combines the strengths of CNNs and Transformers achieved high accuracy in disease detection. Edge computing, with selective communication to the cloud, provided real-time monitoring throughout the process, with the drone performing processing at the edge for timely responses. Multispectral imaging outperformed conventional detection methods at stages when disease symptoms are not yet visible to the naked eye. In addition, XAI integration ensured that predictions were interpretable and that users received actionable, trustworthy insights. These findings confirm that the proposed system is feasible and effective in improving the precision, efficiency, and sustainability of agricultural monitoring and disease management.

The proposed deep learning model

Figure 2 shows that the proposed AgroVisionNet is a two-branch hybrid architecture where (i) drone images and (ii) IoT/environmental signals are processed separately at the first stage and then fused in a dedicated multimodal block to produce the final disease classification.

For the visual branch, an input image (224 × 224 × 3) is fed into a CNN backbone consisting of five convolutional stages (Conv–BN–ReLU) with 3 × 3 kernels and filter sizes {64, 128, 256, 512, 512}. After each stage, 2 × 2 max pooling is applied to reduce spatial resolution while retaining discriminative leaf textures, lesion colours, margins, and shapes. This results in a small feature map (generally 7 × 7 × 512), which is then flattened and linearly projected onto a 512-dimensional embedding that can be treated like a sequence.

Since CNNs by themselves are not able to capture long-range dependencies, we feed this visual embedding to a Transformer encoder with two layers, eight attention heads, a model dimension of 512, a feedforward dimension of 2048, and residual + layer norm. The self-attention functionality reweights spatial pixel tokens, ensuring that areas with disease symptoms (spots, necrosis, rust-like patches) receive higher attention, which is essential since symptoms are sometimes minor or partly hidden in UAV/drone images.
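
The following is a minimal sketch of the visual branch described above, written with tf.keras. The layer names, the pooling-to-token step, and the global pooling at the end are illustrative assumptions rather than the exact implementation; the stage filters, token dimension, number of heads, and feed-forward width follow the values stated in the text.

```python
# Sketch of the visual branch: five Conv-BN-ReLU stages with 2x2 max pooling,
# followed by two Transformer encoder layers over the flattened 7x7 grid of tokens.
import tensorflow as tf
from tensorflow.keras import layers

def conv_stage(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(2)(x)

def transformer_block(x, d_model=512, heads=8, ff_dim=2048):
    attn = layers.MultiHeadAttention(num_heads=heads, key_dim=d_model // heads)(x, x)
    x = layers.LayerNormalization()(x + attn)              # residual + layer norm
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization()(x + ff)

def build_visual_branch(input_shape=(224, 224, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    for f in (64, 128, 256, 512, 512):
        x = conv_stage(x, f)                               # 224 -> 7 after five poolings
    x = layers.Reshape((-1, 512))(x)                       # 49 spatial tokens of size 512
    for _ in range(2):                                     # two encoder layers
        x = transformer_block(x)
    f_v = layers.GlobalAveragePooling1D()(x)               # 512-d visual embedding F_v
    return tf.keras.Model(inp, f_v, name="visual_branch")
```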

Fig. 2 Proposed Model (AgroVisionNet) of the Proposed AI-Driven System for Agricultural Disease Monitoring.

In parallel, the sensor/IoT branch ingests the time-aligned environmental vector (e.g., temperature, humidity, soil moisture, light intensity). This vector is first normalised and passed through two fully connected layers (64 → 128 units, ReLU) to obtain a sensor embedding \(\:{F}_{s}^{{\prime\:}}\in\:{\mathbb{R}}^{128}\). A linear projection then upsamples it to the same dimension as the visual embedding (128 → 512) to make the two modalities compatible.

The two modalities are combined in an explicit fusion block. First, the visual embedding \(\:{F}_{v}\in\:{\mathbb{R}}^{512}\) and the sensor embedding \(\:{F}_{s}^{{\prime\:}}\in\:{\mathbb{R}}^{512}\) are concatenated and passed through a fusion MLP (512 + 512 → 512, ReLU). On top of this, we apply an adaptive, learnable weighting \(\:{F}_{\text{fusion}}=\alpha\:{F}_{v}+\beta\:{F}_{s}^{{\prime\:}},\) where \(\:\alpha\:\) and \(\:\beta\:\) are learnable parameters optimised during training rather than fixed or hand-tuned. This makes the fusion data-driven: when visual cues are reliable, \(\:\alpha\:\) dominates; under visually ambiguous conditions with strong sensor cues (e.g., high humidity favouring a disease), \(\:\beta\:\) dominates.
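
A compact sketch of the sensor branch and the adaptive weighting is shown below, again in tf.keras. The scalar weights α and β are trainable variables updated by backpropagation, as described; their initialisation, and the omission of the concatenation + fusion-MLP stage for brevity, are simplifications made for illustration.

```python
# Sketch of the sensor branch and the adaptive weighting F_fusion = alpha*F_v + beta*F_s'.
import tensorflow as tf
from tensorflow.keras import layers

class AdaptiveFusion(layers.Layer):
    """Learnable weighted sum of the visual and sensor embeddings."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.alpha = self.add_weight(name="alpha", shape=(), initializer="ones", trainable=True)
        self.beta = self.add_weight(name="beta", shape=(), initializer="ones", trainable=True)

    def call(self, f_v, f_s):
        return self.alpha * f_v + self.beta * f_s          # both inputs are (batch, 512)

def build_fusion_head(num_classes, d_model=512, num_sensors=4):
    f_v = layers.Input(shape=(d_model,), name="visual_embedding")  # from the visual branch
    s_in = layers.Input(shape=(num_sensors,), name="iot_vector")   # temp, humidity, soil, light
    s = layers.Dense(64, activation="relu")(s_in)
    s = layers.Dense(128, activation="relu")(s)
    f_s = layers.Dense(d_model)(s)                                  # project 128 -> 512
    fused = AdaptiveFusion()(f_v, f_s)                              # F_fusion = alpha*F_v + beta*F_s'
    x = layers.Dense(256, activation="relu")(fused)                 # classification head
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model([f_v, s_in], out, name="fusion_head")
```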

The fused representation is then fed to the classification head (Dense 256 → Dropout → Dense C with Softmax), where C is the number of crop-disease classes. Grad-CAM is applied to the visual branch to produce heatmaps showing which image regions contributed to the decision, ensuring interpretability for agronomists.
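
A minimal Grad-CAM sketch for the visual branch follows, assuming the branch is available as a single-input Keras model; the `layer_name` argument is a placeholder that must match the last convolutional layer of the deployed model.

```python
# Grad-CAM over the last convolutional stage of the visual branch.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name, class_index=None):
    """Heatmap of the image regions that drove the prediction."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)                  # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weighted sum over channels
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)     # normalise to [0, 1]
    return cam.numpy()                                      # upsample and overlay on the image
```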

Finally, the entire architecture is designed with edge deployment in mind: the input size is fixed at 224 × 224, all layers are TensorFlow/TFLite exportable, and a single fusion point avoids expensive cross-modal attention. This allows the model to run on Jetson Nano–class devices in near real time while retaining the full multimodal benefit.

Table 2 Notations used in the proposed System.

Mathematical perspective

The proposed system operates on a mathematical framework that integrates image processing, feature extraction, and classification underpinned by advanced deep learning models. The first step involves preprocessing the raw images captured by drones. Each image, denoted as \(\:I\left(x,y,\lambda\:\right)\), where \(\:x\) and \(\:y\) represent spatial dimensions and \(\:\lambda\:\) represents the wavelength for multispectral imaging, is processed to remove noise and enhance clarity. Noise reduction is achieved using a Gaussian filter, represented by the convolution operation as in Eq. 1.

$$\:{I}_{filtered}\left(x,y\right)=\sum\:_{u=-k}^{k}\sum\:_{v=-k}^{k}G\left(u,v\right)\cdot\:I\left(x-u,y-v\right)$$
(1)

where \(\:G\left(u,v\right)\)is the Gaussian kernel and\(\:\:k\) defines the kernel size. Feature extraction uses a hybrid deep learning model comprising Convolutional Neural Networks (CNNs) and Transformer architectures. The CNN layers extract spatial features by applying convolution operations to preprocessed images. For a feature map F, the convolution is expressed as in Eq. 2.

$$\:{F}_{i,j}^{l}=\sigma\:\left(\sum\:_{m=1}^{M}\sum\:_{n=1}^{N}{W}_{m,n}^{l}\cdot\:{I}_{i+m,j+n}+{b}^{l}\right)$$
(2)

where \(\:{W}_{m,n}^{l}\) are the convolutional weights, \(\:{b}^{l}\) is the bias term, \(\:\sigma\:\) is the activation function, and \(\:l\) is the layer index. The Transformer architecture is applied for feature attention and aggregation to model complex relationships. The attention mechanism computes the importance of each feature using the query \(\:\left(Q\right)\), key \(\:\left(K\right)\), and value \(\:\left(V\right)\) matrices, as in Eq. 3.

$$\:Attention\left(Q,K,V\right)=softmax\left(\frac{{QK}^{T}}{\sqrt{{d}_{k}}}\right)V,$$
(3)

where \(\:{d}_{k}\) is the dimension of the key vectors. This mechanism ensures that critical features are prioritized for disease classification. The classification task predicts the disease class \(\:C\:\)based on the extracted features. The classification probability for each class \(\:p\left(C=c|F\right)\)is calculated using a softmax function as in Eq. 4.

$$\:p\left(C=c|F\right)=\frac{exp\left({z}_{c}\right)}{{\sum\:}_{i=1}^{C}exp\left({z}_{i}\right)}$$
(4)

where \(\:{z}_{c}\) is the output logit for class \(\:c\), and \(\:C\) is the total number of classes. The predicted class is determined as in Eq. 5.

$$\:\widehat{C}=\underset{c\in\:C}{\text{argmax}}p\left(C=c|F\right)$$
(5)

The decision support module integrates contextual data from IoT sensors, modelled as \(\:S\left(t\right)=\left\{{s}_{1}\left(t\right),{s}_{2}\left(t\right),\dots\:,{s}_{n}\left(t\right)\right\}\:\), where \(\:{s}_{i}\left(t\right)\) represents the sensor reading at time \(\:t\). The combined insights from image classification and sensor data generate heatmaps and reports. The spatial distribution of disease severity \(\:D\left(x,y\right)\) is visualised using interpolated values as in Eq. 6.

$$\:D\left(x,y\right)=\sum\:_{i=1}^{N}{w}_{i}\cdot\:{d}_{i}$$
(6)

where \(\:{d}_{i}\) represents the disease intensity at a point \(\:i\), and \(\:{w}_{i}\) are the interpolation weights. This mathematical framework has been successfully implemented, demonstrating high accuracy and efficiency in identifying disease outbreaks in large agricultural areas. The results validate the feasibility of the proposed system in transforming agricultural monitoring through advanced computational techniques.
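
Equation 6 leaves the choice of interpolation weights open. The sketch below uses inverse-distance weighting as one reasonable instantiation; this weighting scheme, the grid resolution, and the example points are assumptions, not the paper's stated choice.

```python
# Inverse-distance-weighted interpolation of point-wise disease intensities d_i
# onto a regular grid, producing the severity map D(x, y) of Eq. 6.
import numpy as np

def severity_map(points, intensities, grid_x, grid_y, power=2, eps=1e-6):
    gx, gy = np.meshgrid(grid_x, grid_y)                       # grid coordinates
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)          # (G, 2)
    dists = np.linalg.norm(grid[:, None, :] - points[None, :, :], axis=2)  # (G, N)
    w = 1.0 / (dists ** power + eps)                           # unnormalised IDW weights
    w /= w.sum(axis=1, keepdims=True)                          # weights sum to 1 per grid cell
    return (w @ intensities).reshape(gy.shape)                 # D(x, y) = sum_i w_i * d_i

# Example: three detections interpolated over a 50 m x 50 m plot at 1 m resolution.
pts = np.array([[5.0, 10.0], [30.0, 25.0], [45.0, 40.0]])
d = np.array([0.9, 0.4, 0.7])
heatmap = severity_map(pts, d, np.arange(50), np.arange(50))
```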

Flow of the proposed system

Figure 3 shows the workflow of the proposed AI-based system for monitoring multiple diseases per crop or field, from deploying drones to delivering insights that aid decision-making. It is designed as a highly mechanised system that automatically captures, processes, and analyses quality data and turns it into insights, enabling rapid detection and management of crop diseases across large agricultural areas. First, drones are flown over the fields to capture high-resolution multispectral and hyperspectral images together with environmental sensor readings. This data acquisition step provides both crop imagery and time-aligned sensor data (temperature, humidity, and so on) that give context to the crop-health model.

After acquisition, the data are cleaned to make them suitable for analysis. Atmospheric noise is removed so the images are free of interference, and georeferencing maps the photos to their geographic coordinates, allowing diseased areas to be located. The dataset is made more robust through further processing, such as data augmentation. Feature extraction is performed with a Convolutional Neural Network (CNN), which specialises in capturing the spatial and texture features needed to distinguish healthy crops from diseased ones. Classification is then performed by the hybrid deep learning model, which combines the complementary strengths of the CNN and the Transformer: informative local features from the CNN and semantic/contextual information learned by the Transformer yield more precise predictions of disease categories.

Fig. 3 Flowchart of the Proposed System for AI-Driven Agricultural Disease Monitoring.

Data fusion: after classification, the IoT sensor readings are combined with the extracted image features to improve the analysis. Coupling the two data types aids interpretation because soil moisture, temperature, and similar variables vary seasonally and are correlated with disease severity. The aggregated data are processed locally using edge computing, enabling real-time processing and reducing latency. The resulting insights are delivered as disease heatmaps and detailed reports: the heatmaps visualise hot spots and the spread of disease across the monitored areas, while the reports identify affected crops, disease severity, and basic advisories. These insights are exposed to farmers through simple mobile or web-based interfaces that are easy to access and use, giving farmers and other agricultural stakeholders the information they need for decision-making. Overall, the system offers a complete solution for farm disease monitoring, integrating state-of-the-art image capture and transmission, AI-based image analysis, IoT data collection, and edge computing.

Edge computing integration

The AgroVisionNet system is built on the principle of edge computing for real-time in situ detection of agricultural diseases. For this work, we ran the model on an NVIDIA Jetson Nano edge device, powered by a quad-core ARM Cortex-A57 CPU, 4 GB RAM, and a 128-core Maxwell GPU. This platform was chosen for its small size, low power consumption (5–10 W), and the ability to perform on-device deep learning inference without constant reliance on the cloud.

The data pipeline begins with the multispectral and hyperspectral images collected by the drone and the IoT sensor data, which include temperature, humidity, and soil moisture. These streams are pre-processed on the drone and then wirelessly transmitted to the edge device. The deployed AgroVisionNet model runs classification on the edge device and produces both predictions and heatmaps. Results are displayed on a local interface and, optionally and selectively, transmitted to a central server, so the application maintains low latency even when connectivity is intermittent.

During implementation, considerable effort was devoted to verifying real-time feasibility. Latency was approximately 1.2 s per image, and throughput reached up to 45 images per second under the best settings. Each edge device uses pipelined execution, with preprocessing, inference, and data transfer running concurrently. These optimisations are essential for AgroVisionNet to meet the stringent timing requirements of real-world agricultural sensing.
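
A minimal sketch of the TensorFlow → TFLite dynamic-range quantisation step and a simple per-image latency measurement is shown below. The file paths, the use of a single-input (visual) model, and the dummy input are placeholders and assumptions for illustration.

```python
# Export the trained Keras model with TFLite dynamic-range quantisation and
# time a single inference. Paths and the example input shape are placeholders.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("agrovisionnet_visual.h5")      # placeholder path

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]                # dynamic-range quantisation
tflite_model = converter.convert()
open("agrovisionnet.tflite", "wb").write(tflite_model)

interpreter = tf.lite.Interpreter(model_path="agrovisionnet.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(1, 224, 224, 3).astype(np.float32)               # dummy 224x224 frame
start = time.perf_counter()
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
pred = interpreter.get_tensor(out["index"])
print("latency (s):", time.perf_counter() - start)
```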

IoT data normalisation and fusion strategy

The IoT sensor data recorded in AgroVisionNet comprise temperature, humidity, soil moisture, and light intensity. Because these readings come in different scales and units, they are difficult to combine directly with visual features. To solve this, we normalise the IoT data so that all readings are mapped to the same range, ensuring that no single sensor type dominates the others simply because of its scale.
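
The exact normalisation is not specified beyond mapping readings to a common range; the sketch below assumes min-max scaling to [0, 1] with sensor-specific bounds, and the bounds themselves are illustrative assumptions.

```python
# Min-max scaling of heterogeneous sensor readings to [0, 1] so that no modality
# dominates the fusion because of its physical units.
import numpy as np

SENSOR_RANGES = {                       # assumed plausible field ranges per sensor
    "temperature_c": (0.0, 50.0),
    "humidity_pct": (0.0, 100.0),
    "soil_moisture_pct": (0.0, 60.0),
    "light_klux": (0.0, 120.0),
}

def normalise_reading(values: dict) -> np.ndarray:
    out = []
    for key, (lo, hi) in SENSOR_RANGES.items():
        v = np.clip(values[key], lo, hi)
        out.append((v - lo) / (hi - lo))
    return np.asarray(out, dtype=np.float32)   # 4-d vector fed to the sensor branch

vec = normalise_reading({"temperature_c": 31.5, "humidity_pct": 78.0,
                         "soil_moisture_pct": 22.0, "light_klux": 65.0})
```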

The normalised IoT readings are fed into a small fully connected layer for compact feature extraction, producing a representation aligned with the visual features computed by AgroVisionNet’s CNN and Transformer modules. In the fusion step, the IoT and visual features are combined with adaptive weights, so the complementary information from the IoT stream supplements the visual features and provides context, for example whether a leaf discolouration was caused by environmental stress or by disease.

Sensitivity testing was performed to empirically tune the contribution of the IoT data in the weight balancing, increasing model performance while ensuring that the IoT signal is not drowned out by the visual features. This allows AgroVisionNet to exploit both image-based and weather-based patterns for more sensitive and robust disease identification, and it may also improve the transferability of the trained system to regions with similar conditions but different crops. AgroVisionNet thus balances image processing and contextual knowledge from the two data sources, resulting in rapid and accurate decision-making in the field.

Proposed algorithms

We implemented a system comprising four algorithms that together achieve effective and efficient agricultural disease monitoring, covering every aspect of the workflow from data acquisition to actionable insights. The first handles drone-based data acquisition and preprocessing (cleaning and georeferencing). The second is a hybrid deep learning model with CNN and Transformer-based mechanisms for feature extraction and disease classification. The third fuses IoT sensor data using edge computing for real-time analysis. Finally, the decision-support algorithm turns the outputs into heat maps and reports that translate the data into actionable guidance for combating crop diseases.

Algorithm 1 Data Acquisition and Preprocessing.

Algorithm 1 covers drone-based data acquisition and preprocessing. Drones are first dispatched along pre-planned flight paths that have been tested over large areas and account for wind. The drones carry high-resolution multispectral and hyperspectral imaging sensors along with environmental sensors that provide context, e.g. ground/air temperature, humidity, and soil moisture.

Data acquired from the drone are preprocessed to improve quality and make them suitable for further analysis. A Gaussian filter removes external noise such as shadows or uneven illumination by smoothing the images, ensuring the input data are free of artefacts that could affect feature extraction. Georeferencing maps the processed images to geographic coordinates, allowing diseased areas to be positioned accurately. Correlating local weather patterns with this mapping supports actionable insights and appropriate recommendations for farmers.

We also perform data augmentation, such as rotation, flipping, and scaling, to make the dataset more diverse and robust. This step is essential for mitigating overfitting during deep learning model training and improving generalisation to new data. The algorithm returns cleaned, augmented, georeferenced images ready for feature extraction and classification, formalising the data collection and preparation procedure at the heart of the system.
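
Below is a minimal sketch of the Gaussian denoising and geometric augmentation steps using OpenCV; the 5 × 5 kernel, the 15-degree rotation angle, and the 1.2 scale factor are assumptions rather than the exact settings used.

```python
# Gaussian denoising (Eq. 1) followed by simple geometric augmentations
# (flip, rotate, scale), as described for Algorithm 1.
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    return cv2.GaussianBlur(image, (5, 5), 0)      # 5x5 Gaussian smoothing

def augment(image: np.ndarray) -> list:
    h, w = image.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)   # 15-degree rotation matrix
    scaled = cv2.resize(image, None, fx=1.2, fy=1.2)
    return [
        cv2.flip(image, 1),                        # horizontal flip
        cv2.flip(image, 0),                        # vertical flip
        cv2.warpAffine(image, rot, (w, h)),        # rotation, same canvas size
        scaled[:h, :w],                            # scale up, then crop back to (h, w)
    ]
```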

Algorithm 2 Feature Extraction and Disease Classification.

Algorithm 2 performs image feature extraction and disease classification using the hybrid deep learning model (CNN and Transformer architecture). In the first stage, the preprocessed image passes through the CNN to extract relevant spatial features (texture, colour, and shape). The convolutional layers apply a series of progressively more complex convolutions, generating feature maps that highlight salient patterns presumed to correlate with disease pathology. Pooling layers downsize these feature maps, retaining the important features while discarding unnecessary computation.

The CNN features are then fed to the Transformer layers for contextual refinement. The Transformer’s attention mechanism assigns weights to different parts of the image according to their relevance, highlighting the most important regions. Capturing long-range dependencies is crucial for high-dimensional data because specific regions are often critical for distinguishing similar disease types or stages. The combination is well balanced: the CNN excels at local features while the Transformer captures long-range context, so together they give the algorithm complementary strengths.

A classification is made from the output of the Transformer layers using a fully connected layer. At the output layer, a softmax activation computes the probabilities of all disease classes, and the class with the highest probability is returned as the model prediction. The model also supports explainable AI (XAI) methods that highlight the image areas driving the classification decision; this transparency builds trust and provides end users with actionable insights. Algorithm 2 therefore represents the core analytic functionality of the system and provides robust, accurate disease classification.

Algorithm 3 IoT-Based Data Fusion and Edge Computing.

Algorithm 3 describes how real-time environmental data are collected from IoT sensors in the agricultural fields. These sensors measure parameters that strongly influence crop vigour and disease development, such as temperature, humidity, soil moisture, and light intensity. The image features and sensor data are normalised for consistency and compatibility, and the algorithm then fuses the two sources with adaptive weights so that the visual and environmental information complement each other. Adding environmental information that is not accessible in the image gives the model more leverage, enhancing the robustness and dependability of disease prediction beyond what purely visual symptoms can provide.

Fusing the data locally, using the edge computing capabilities of the drones and nearby devices, enables immediate analysis. By processing data close to its origin instead of sending everything to centralised servers, edge computing minimises latency, which is essential for responding quickly to fast-spreading diseases. The algorithm returns a fused feature set that draws on the strengths of both visual and environmental data and is ready for generating actionable insights. This combination of IoT and edge computing provides a holistic and efficient approach to crop health monitoring.

Algorithm 4 Decision Support and Visualization.

In the first step of Algorithm 4, disease severity values are interpolated over the monitored spatial field. In the second step, the spatial distribution of disease is calculated from the features produced by the previous algorithm, resulting in a continuous severity map. This interpolation allows localised disease hotspots to be identified and visualised even in regions with few images or direct measurements.

The algorithm then generates heat maps that visualise the spread and severity of disease across the field. Colour gradients indicate how concentrated the infections are at each location, making it easy to identify the most affected zones. Detailed reports complement the visual output with crucial information such as the detected disease, its severity, the affected areas, and recommendations. These reports are written in plain language and amount to implementation plans, whether treatment or prevention regimens for specific diseases.

These insights are presented through a simple interface (e.g., mobile or web apps). The interactive interface allows farmers and other stakeholders to view the heat maps and reports in real time, so decisions about emerging disease threats can be made as quickly as possible. The combined visual and textual outputs of this algorithm provide timely, actionable results that support better decision-making. Packaging the analysis into practical, interpretable outputs is the final and crucial step that connects the research to end-user utility.

The AgroVisionNet model was subsequently deployed on an NVIDIA Jetson Nano edge device (4 GB RAM, quad-core ARM Cortex-A57 1.43 GHz CPU, and 128-core Maxwell GPU) for real-time field testing. The size of this system also makes it suitable for drone-based agricultural flights. During field deployment the device consumed 5–10 W on average, depending on computational load and active sensors. Despite these hardware constraints, AgroVisionNet remains lightweight and achieves efficient, real-time disease detection and data fusion.

Algorithm complexity analysis

To evaluate the computational efficiency of AgroVisionNet, we analysed the time and space complexities of its core modules. The framework comprises four primary algorithmic stages: CNN feature extraction, Transformer context modelling, IoT data normalisation and fusion, and classification. All were optimised for performance and for the low-compute requirements of edge devices and drones.

The CNN feature extraction module is applied to both multispectral and hyperspectral images to enhance disease-related spatial characteristics. Its time complexity is O(N × K² × C), for N images, K × K kernels, and C channels. The space complexity is O(M), where M is the number of parameters learnt by the network, approximately 8.2 million for AgroVisionNet. The Transformer-based contextual modelling captures global dependencies and interrelationships among feature maps. Its time complexity is O(L² × D), where L denotes the number of feature tokens and D the dimensionality of each token, and its space complexity is O(L × D), accounting for attention weights and intermediate feature representations.

The IoT data normalisation and fusion procedure combines visual features with environmental sensor inputs. Its computational overhead is low because the number of sensor readings is small: time and space complexities are both O(P), where P is the number of sensor features processed. The classification stage, which produces the predicted disease class, has time complexity O(H × W), where H and W are the dimensions of the fused feature vector propagated through the fully connected layers, and space complexity O(H × W) for storing the fused feature weights and outputs. With these bounds, AgroVisionNet guarantees real-time inference with low latency and a small memory footprint, making it well suited to deployment on drones and edge devices for large-scale agricultural monitoring.

Evaluation methodology

To measure the performance of the proposed system, we use the following metrics: accuracy, precision, recall, F1-score, latency, and computational efficiency. These metrics confirm the system’s disease detection and classification performance, its real-time data processing, and its ability to deliver actionable insights for agricultural monitoring. Classification performance is evaluated with standard metrics. The accuracy \(\:\left(A\right)\) of the system is calculated as the ratio of correctly classified instances \(\:\left(TP+TN\right)\) to the total number of instances \(\:\left(TP+TN+FP+FN\right)\), as in Eq. 7.

$$\:A=\frac{TP+TN}{TP+TN+FP+FN}$$
(7)

where \(\:TP\) represents true positives, \(\:TN\) true negatives, \(\:FP\) false positives, and \(\:FN\) false negatives. \(\:Precision\left(P\right)\), which measures the proportion of correct positive predictions among all positive predictions, is calculated as in Eq. 8.

$$\:P=\frac{TP}{TP+FP}$$
(8)

\(\:Recall\left(R\right)\), also known as sensitivity, evaluates the system’s ability to identify all positive cases and is given by Eq. 9.

$$\:R=\frac{TP}{TP+FN}$$
(9)

The F1-score (F1) provides a harmonic mean of precision and recall to balance their trade-offs, expressed in Eq. 10.

$$\:F1=2\cdot\:\frac{P\cdot\:R}{P+R}$$
(10)

\(\:Latency\left(L\right)\:\)is evaluated to ensure the system meets real-time requirements. It is the average time to process an input image and generate actionable insights, as in Eq. 11.

$$\:L=\frac{{\sum\:}_{i=1}^{N}{t}_{i}}{N}$$
(11)

Where\(\:\:{t}_{i}\) is the time taken for the \(\:i\)-th image, and \(\:N\) is the total number of images processed. Computational efficiency is assessed by measuring the system’s throughput \(\:T\), which is the number of images processed per second, as in Eq. 12.

$$\:T=\frac{N}{{\sum\:}_{i=1}^{N}{t}_{i}}$$
(12)

To evaluate spatial accuracy in disease severity mapping, the mean squared error \(\:MSE\) between predicted disease severity \(\:\left(\widehat{D}\left(x,y\right)\right)\) and ground truth severity \(\:\left(D\left(x,y\right)\right)\) is calculated as in Eq. 13.

$$\:MSE=\frac{1}{N}\sum\:_{i=1}^{N}{\left(D\left({x}_{i},{y}_{i}\right)-\widehat{D}\left({x}_{i},{y}_{i}\right)\right)}^{2}$$
(13)

To illustrate the improvement, overall system performance was compared with baseline methods. To show that the proposed methodology outperforms the baselines, we performed statistical significance tests, such as paired t-tests. Together, these evaluation metrics demonstrate the system’s real-time capability and usability.
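
A compact sketch of how the reported metrics can be computed from predictions and timing logs follows; it mirrors Eqs. 7–12, and the macro-averaging over classes is an assumption rather than the authors’ stated protocol.

```python
# Computing the metrics of Eqs. 7-12 from integer-encoded labels and per-image times.
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)                               # Eq. 7
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp + 1e-12))                 # Eq. 8, per class
        recalls.append(tp / (tp + fn + 1e-12))                    # Eq. 9, per class
    p, r = np.mean(precisions), np.mean(recalls)                  # macro averages
    f1 = 2 * p * r / (p + r + 1e-12)                              # Eq. 10
    return acc, p, r, f1

def timing_metrics(per_image_seconds):
    t = np.asarray(per_image_seconds)
    latency = t.mean()                                            # Eq. 11
    throughput = len(t) / t.sum()                                 # Eq. 12
    return latency, throughput
```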

To benchmark the performance of AgroVisionNet, we compared it with four commonly used baseline models: VGG16, ResNet50, Inception V3, and DenseNet121. These architectures were selected because they cover a range of classical and modern CNNs used in agricultural image analysis: VGG16 is a classical CNN architecture, ResNet50 introduces residual connections for deeper learning, Inception V3 performs multi-scale feature extraction, and DenseNet121 emphasises feature reuse and parameter efficiency. This choice guarantees a fair and thorough comparison of AgroVisionNet with established deep learning methods.

Experimental results

The experimental results evaluate the effectiveness of the proposed AgroVisionNet model using large-scale multispectral crop imagery and IoT sensor datasets collected over extensive agricultural areas. The dataset covers various crop diseases under different conditions and was extensively annotated by experts, providing large-scale ground truth. This study compares the proposed model against widely used deep learning models, namely VGG16, ResNet50, Inception V3, and DenseNet121 (Rakhmatulin et al.64, Ravi et al.12), for agricultural applications. Experiments were conducted on a workstation equipped with an NVIDIA RTX 3090 GPU, with edge simulations on a Jetson Nano, using PyTorch and TensorFlow Lite for implementation and optimisation.

Dataset description

A unique benchmark dataset was created to simulate real agricultural conditions and was used for testing and comparing the AgroVisionNet method. The dataset comprises 10,000 high-resolution multispectral and hyperspectral images of different crop types infected with various diseases, collected over six months from several geographical locations. Its purpose is to provide a varied and challenging sample for model training, validation, and testing.

The data were captured using drone-mounted multispectral cameras that record key spectral bands required to detect subtle disease symptoms not visible in standard RGB images. To improve detection accuracy, hyperspectral imaging was acquired during data collection, offering greater spectral depth and detail. This combination provides a strong representation of disease phenotypes across crop growth stages and environments.

Real-time environmental sensor data (temperature, humidity, soil moisture, and light intensity) were recorded alongside the image acquisition on each flight. Each picture was timestamped and paired with its corresponding environmental measurements for visual-environmental integration. This alignment also supports deeper analysis of the environmental drivers of disease onset and progression, making the dataset relevant for precision agriculture. The diversity of the dataset is summarised in Table 3, which lists the crop types, disease classes, number of images, and geographical regions included. Such broad coverage allows the system to generalise well across agricultural areas of varying types and sizes.

We collected a custom multimodal dataset with an RGB/multispectral camera on a drone and co-located IoT nodes (temperature, humidity, soil moisture, light intensity) over large agricultural plots. A total of 10,000 image samples were retained after quality filtering and annotation by two expert agricultural scientists. For reproducible training, the dataset was stratified by crop type and disease class and split into 70% (7,000 images) for training, 15% (1,500 images) for validation, and 15% (1,500 images) for testing. All baseline and SOTA experiments reported in Sect. 4 use the same split to avoid evaluation bias. Timestamp matching aligned each sensor record with its corresponding drone image prior to fusion.
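As a minimal sketch of the stratified 70/15/15 split described above, the following could be used; the metadata file name, column names, and CSV layout are assumptions for illustration only.

```python
# Sketch of a stratified 70/15/15 split by crop type and disease class (assumed CSV layout).
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("annotations.csv")                     # hypothetical metadata file
strata = meta["crop_type"] + "_" + meta["disease_class"]  # balance both factors jointly

train_df, hold_df = train_test_split(meta, test_size=0.30, stratify=strata, random_state=42)
val_df, test_df = train_test_split(
    hold_df, test_size=0.50,
    stratify=hold_df["crop_type"] + "_" + hold_df["disease_class"],
    random_state=42)

print(len(train_df), len(val_df), len(test_df))           # roughly 7000 / 1500 / 1500
```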

Table 3 Distribution of the dataset by crop type, disease category, and geographical region.

The dataset was annotated and validated by agricultural experts to guarantee high-quality ground truth. Disease type, severity level, and affected regions were annotated for every image, making the data suitable for both classification and localisation. The labelling also supports the construction of disease heat maps and action-oriented decision support applications.

Data preparation included noise reduction (to suppress environmental artefacts), geo-referencing (to register images spatially), and data augmentation consisting of flips, rotations, and scaling.
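A minimal OpenCV-based sketch of this preparation step is shown below; the kernel sizes, scaling range, and function names are illustrative assumptions rather than the exact pipeline.

```python
# Sketch of the denoising, resizing, and flip/rotate/scale augmentation described above.
import cv2
import numpy as np

def preprocess(path, size=224):
    img = cv2.imread(path)                      # BGR uint8 image
    img = cv2.GaussianBlur(img, (3, 3), 0)      # mild denoising of environmental artefacts
    return cv2.resize(img, (size, size))        # standardise resolution

def augment(img, rng=np.random.default_rng(0)):
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)                                        # horizontal flip
    img = np.ascontiguousarray(np.rot90(img, k=rng.integers(0, 4)))   # 0/90/180/270 rotation
    h, w = img.shape[:2]
    scale = rng.uniform(0.9, 1.1)                                     # mild random re-scaling
    img = cv2.resize(img, (int(w * scale), int(h * scale)))
    return cv2.resize(img, (w, h))                                    # back to the input size
```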

This extensive preparation makes the dataset suitable for a variety of analyses, e.g., disease detection, environmental association studies, and real-time operational deployment, and enables AgroVisionNet to generalise effectively across a wide range of crop species, disease types, growth stages, and environmental conditions.

Experimental setup

The training and evaluation of the deep learning models were performed on a workstation with an NVIDIA RTX 3090 graphics processing unit, an Intel 9th-generation processor, and 64 GB of RAM. For real-time processing, edge computing tests were carried out on an NVIDIA Jetson Nano, providing a realistic view of performance in a resource-constrained environment. The software stack used PyTorch for training and testing models, OpenCV for preprocessing, and TensorFlow Lite to optimise models for edge deployment. The system uses the MQTT protocol to stream environmental measurements from the IoT sensors into the system in real time and synchronises them with the image data. The drones were fitted with multispectral and hyperspectral cameras and captured high-resolution images in the spectral bands needed for disease detection, while ground sensors recorded temperature, humidity, soil moisture, and light intensity in synchrony. Having both data streams enabled unified crop health monitoring. Real-time upload and tracking of drone operation zones were implemented via a mobile ground-station application.
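The following is a hedged sketch of how such MQTT sensor streaming and timestamping could look; the broker address, topic name, and payload fields are assumptions, and the callback style follows the paho-mqtt 1.x API.

```python
# Sketch of receiving IoT readings over MQTT and buffering them with timestamps
# for later pairing with drone frames (paho-mqtt 1.x callback style).
import json, time
import paho.mqtt.client as mqtt

readings = []                                   # (timestamp, sensor dict) buffer

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload.decode())  # e.g. {"temp": 24.1, "hum": 61, ...}
    readings.append((time.time(), payload))

client = mqtt.Client()
client.on_message = on_message
client.connect("192.168.1.10", 1883)            # hypothetical field-gateway broker
client.subscribe("farm/plot7/sensors")          # hypothetical topic
client.loop_start()                             # background network loop
```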

Hyperparameter configurations for the different deep learning models were explicitly defined to ease replication of the experiments. For the CNN component, the learning rate was 1 × 10⁻⁴ and the batch size 32, with five convolutional layers; all convolutional layers used 3 × 3 kernels and ReLU activations, and the model was optimised with Adam. A dropout rate of 0.5 was applied to reduce overfitting, and training ran for a maximum of 50 epochs. The Transformer part of the hybrid had four layers, eight attention heads, and a hidden dimension of 512; the feed-forward dimension was 2048 with a dropout of 0.1, and the learning rate was 2 × 10⁻⁴. Early stopping with a patience of 10 epochs on validation loss was used, and weights were initialised with Xavier initialisation. The preprocessing pipeline used OpenCV for tasks such as Gaussian noise reduction and data augmentation, and images were georeferenced in QGIS to assign their coordinates. Initialising the CNN layers with ImageNet-pretrained weights (transfer learning) helped the model generalise better. IoT sensor data were streamed in real time through an MQTT broker, and during inference, image features were fused with sensor data using a weighted fusion technique.
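The layer sizes below follow the stated hyperparameters (five convolutional blocks, four Transformer layers, eight heads, hidden size 512, feed-forward size 2048); everything else, including the class name, channel widths, and the scalar fusion weight, is an illustrative assumption rather than the authors’ exact implementation.

```python
# Minimal PyTorch sketch of a CNN + Transformer + weighted-fusion hybrid of the kind described.
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    def __init__(self, n_classes=4, n_sensors=4, d_model=512):
        super().__init__()
        self.cnn = nn.Sequential(                     # five conv blocks, 3x3 kernels, ReLU
            *[nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
              for c_in, c_out in [(3, 32), (32, 64), (64, 128), (128, 256), (256, d_model)]])
        enc = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, dim_feedforward=2048,
                                         dropout=0.1, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=4)
        self.sensor_proj = nn.Linear(n_sensors, d_model)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight (assumption)
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(d_model, n_classes))

    def forward(self, image, sensors):
        f = self.cnn(image)                           # (B, 512, 7, 7) for 224x224 input
        tokens = f.flatten(2).transpose(1, 2)         # (B, 49, 512) patch tokens
        ctx = self.transformer(tokens).mean(dim=1)    # context-pooled visual embedding
        s = self.sensor_proj(sensors)                 # project IoT readings to d_model
        fused = self.alpha * ctx + (1 - self.alpha) * s
        return self.head(fused)
```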

The TensorFlow models were converted to TensorFlow Lite with dynamic quantization to optimise them for edge deployment on the NVIDIA Jetson Nano. Image resolution was standardised to 224 × 224 to match the model input. The system was built to perform inference on single images with minimal latency, enabling real-time analysis. Disease-severity heatmaps were created in Matplotlib, and a Flask user interface provides farmers with insights via a web or mobile application. This detailed description of the experimental setup is intended to allow other researchers to reproduce the experiments and validate the proposed system under identical conditions, and to extend it with refinements or additional implementations.
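As a hedged illustration of the single-image latency measurement on the edge device, the snippet below uses the tflite_runtime interpreter; the model file name is a hypothetical artifact and the zero-filled frame is a stand-in for a real 224 × 224 input.

```python
# Sketch of single-image inference-latency measurement on a Jetson-class edge device.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interp = Interpreter(model_path="agrovisionnet_int8.tflite")   # hypothetical converted model
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]

frame = np.zeros(inp["shape"], dtype=inp["dtype"])   # stand-in for a preprocessed frame
start = time.perf_counter()
interp.set_tensor(inp["index"], frame)
interp.invoke()
probs = interp.get_tensor(out["index"])
print(f"latency: {time.perf_counter() - start:.3f} s/image")
```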

Systematic hyperparameter tuning was used to find optimal training settings. Grid search explored combinations of learning rates, batch sizes, and regularisation strengths. The learning rate was ultimately set to 0.001 with a cosine decay policy, and a batch size of 32 was chosen for fast convergence. Convergence speed and stability were balanced by using the Adam optimizer with default momentum parameters (β1 = 0.9, β2 = 0.999).

Training was scheduled for a maximum of 120 epochs with early stopping to avoid overfitting. Dropout layers with a rate of 0.4 and L2 regularisation with a coefficient of 0.0005 were used to further prevent overfitting, and data augmentation (rotation, flipping, and contrast variation) was adopted to enhance dataset diversity and robustness. The main hyperparameters and optimisation protocols adopted in this study are presented in Table 4 to make the results reproducible for future research.
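A minimal sketch of this optimisation setup (Adam with default betas, cosine decay, L2 weight decay, early stopping) follows; the placeholder model and the placeholder validation loss are assumptions so the snippet stays self-contained.

```python
# Sketch of the training loop scaffolding described above (values from the text).
import torch
import torch.nn as nn

model = nn.Linear(512, 4)                       # stand-in for the full AgroVisionNet module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=5e-4)   # L2 coefficient 0.0005
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)

best_val, patience, bad = float("inf"), 10, 0
for epoch in range(120):                        # maximum training schedule
    # ... run one training epoch here ...
    val_loss = 0.0                              # placeholder; replace with real validation loss
    scheduler.step()
    if val_loss < best_val:
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:                     # early stopping on validation loss
            break
```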

Table 4 Summary of hyperparameters and optimization strategies for AgroVisionNet training.

To test performance under different network conditions, we simulated three scenarios: (i) a stable, high-bandwidth connection, (ii) moderate connectivity characteristic of rural agricultural areas, and (iii) low- or no-connectivity environments representing remote areas. These cases were used to check the robustness of the framework in realistic deployment scenarios in which network availability may vary. The edge device operated autonomously, storing data locally during periods of limited connectivity and synchronising with the cloud once the connection was re-established.

Field deployment verified the practical usability of edge computing. The NVIDIA Jetson Nano was connected wirelessly to the drone and IoT data sources through a dedicated communication module, and it analysed the incoming data streams on-board so that decisions could be made in near real time without relying on an always-on, high-bandwidth connection.

Network testing was performed in three scenarios: a high-bandwidth network, a moderate rural network, and offline operation. In offline mode, the device still processed data locally and stored results for later synchronisation. This strategy demonstrated the flexibility of the system under the variable connectivity conditions often observed in agricultural areas.

Performance analysis

This section provides a detailed performance analysis of AgroVisionNet, covering accuracy, precision, recall, F1-score, latency, and throughput. It also compares the proposed model against state-of-the-art deep learning models, namely VGG16, ResNet50, Inception V3, and DenseNet121. The comparison shows that AgroVisionNet outperforms existing models in classification accuracy across the datasets and in real-time deployment speed, confirming its viability for large-scale monitoring of agricultural diseases.

Fig. 4
figure 4

Confusion Matrices for AgroVisionNet and Baseline Models.

Figure 4 provides, for comparative purposes, the confusion matrices of AgroVisionNet and the four baseline deep learning architectures (VGG16, ResNet50, Inception V3, and DenseNet121) on the wheat disease test dataset with four categories: Healthy, Rust, Blight, and Smut. Each subfigure 4(a)–(e) illustrates model performance through the counts of correctly and incorrectly classified instances.

As seen in Subfigure 4(a), the confusion matrix for AgroVisionNet is strongly concentrated along the diagonal, indicating high classification accuracy and low misclassification across disease classes. The small off-diagonal entries confirm that AgroVisionNet discriminates well even between visually similar categories (e.g., Rust vs. Blight).

In contrast, the off-diagonal elements of the baseline models are more dispersed, indicating higher rates of false positives and false negatives. Confusion between Blight and Smut is moderate in VGG16 and ResNet50, whereas Inception V3 and DenseNet121 are more consistent and precise but still fall short of AgroVisionNet. Overall, the figure underscores AgroVisionNet’s ability to differentiate wheat disease classes with high accuracy, precision, and recall, corroborating that its architecture enables more rapid and reliable diagnosis of crop diseases than traditional deep learning baselines.

Table 5 Performance comparison of AgroVisionNet with baseline deep learning models.

Table 5 compares the performance of AgroVisionNet against the baseline deep learning models (VGG16, ResNet50, Inception V3, and DenseNet121). AgroVisionNet achieves the highest accuracy of 94.2% (baseline accuracies range from 87.8% to 91.3%) and performs statistically significantly better than all baselines. The precision (0.94), recall (0.89), and F1-score (0.91) values, as shown in81, demonstrate its excellent classification capability. In real-time evaluation, AgroVisionNet also delivers the best speed, with the lowest latency (1.2 s per image) and the highest throughput (45 images per second). These improvements are attributed to the model’s hybrid architecture and edge-computing optimisations, which support practical deployment for large-scale, real-time agriculture.

Fig. 5
figure 5

Performance Comparison of AgroVisionNet and Baseline Models Across Six Metrics with Differentials.

Figure 5 presents an extensive comparison of AgroVisionNet against the baseline models VGG16, ResNet50, Inception V3, and DenseNet121 on six key evaluation metrics: accuracy, precision, recall, F1-score, latency, and throughput. AgroVisionNet outperforms all baseline models on every metric; the performance differentials are annotated in the figure, with the cases where AgroVisionNet leads highlighted in red.

AgroVisionNet achieves a very high accuracy of 94.2%, a 2.9% improvement over the closest baseline, DenseNet121, which demonstrates its suitability for agricultural image classification. This improvement is attributed to AgroVisionNet’s architectural design, which strengthens feature extraction and incorporates layers tailored to the characteristics of agricultural images. Precision and recall reach 93.5% and 92.8%, respectively; values this high imply few false positives and false negatives and indicate the built-in robustness of AgroVisionNet, giving it reliable and consistent behaviour across datasets. The F1-score of 93.1% shows a good balance between precision and recall, 2.6% higher than DenseNet121, likely due to better-configured layers and improved generalisation. Latency is reduced to 1.2 s per image, whereas the baseline models require several seconds per image, which lowers the barrier to real-time agricultural disease detection applications.

AgroVisionNet also achieves a high throughput of 45 images/s, roughly 1.5× that of VGG16 and substantially higher than the other baselines. This matters because high-volume data processing is key to scaling agricultural monitoring systems; the architectural optimisations use computational resources and parallel processing more efficiently. Overall, AgroVisionNet is superior to existing methods on most metrics because it is purpose-built for agricultural disease detection, with domain-relevant choices of preprocessing, feature extraction, and model layers. These improvements deliver greater accuracy, faster processing, and better scalability, overcoming major obstacles to agricultural disease monitoring, and together they position AgroVisionNet as a next-generation AI-powered solution for precision agriculture.

To provide deeper insight into AgroVisionNet’s classification performance, a confusion matrix was generated from the test dataset. This matrix shows the numbers of true positives, false negatives, false positives, and true negatives for each disease category.

Table 6 Confusion matrix showing true positives, false negatives, and misclassifications for wheat disease categories in the test dataset.

Table 6 displays the raw counts to enable clear per-class analysis of the model’s performance. The results suggest that most confusions occurred between visually similar diseases, such as rust and blight in wheat and blast and sheath blight in rice, reflecting the difficulty of differentiating early-stage symptoms with overlapping visual features. Despite these difficult cases, AgroVisionNet achieved high correct-classification rates for most major disease groups, indicating that linking visual information with IoT data worked well.

Fig. 6
figure 6

Confusion Matrix for Wheat Disease Categories.

The confusion matrix for the classification of wheat disease using AgroVisionNet is shown in Fig. 6. The diagonal cells are for correct predictions, and the off-diagonal cells are for misclassifications between the disease types. This visualisation shows the model’s strong performance in each category and where similar cases are sometimes misclassified.

To provide a fair comparison, the classical CNN baselines were adapted to receive the same IoT feature stream as AgroVisionNet. A multimodal variant of each baseline was formed by concatenating the environmental features (temperature, humidity, and soil moisture) with the network’s penultimate-layer embeddings. This experiment isolates the gain from AgroVisionNet’s fusion strategy rather than the benefit of IoT data in general; a minimal sketch of such a variant is shown below.
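The following sketch illustrates, under stated assumptions, one way such a multimodal baseline could be built: ResNet50 embeddings concatenated with three environmental features before a linear classifier. The class name and the number of output classes are illustrative.

```python
# Sketch of a multimodal baseline: penultimate ResNet50 embeddings + IoT features.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultimodalBaseline(nn.Module):
    def __init__(self, n_classes=4, n_sensors=3):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.classifier = nn.Linear(2048 + n_sensors, n_classes)

    def forward(self, image, sensors):
        emb = self.features(image).flatten(1)          # (B, 2048) penultimate embedding
        return self.classifier(torch.cat([emb, sensors], dim=1))
```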

Table 7 Quantitative comparison of multimodal (Image + IoT) baseline models with AgroVisionNet.

The results in Table 7 show the classical CNN baselines adapted to more than a single modality by combining IoT sensor data with the image embedding. The moderate improvements over the image-only variants indicate that the multimodal features capture useful information, while AgroVisionNet, with its adaptive fusion and edge-optimised architecture, still achieves both the highest accuracy and throughput, validating its potential for real-time precision agriculture.

Fig. 7
figure 7

Comparative Analysis of Accuracy, Precision, Recall, F1-Score, and Throughput between Multimodal (Image + IoT) Baseline Models and the Proposed AgroVisionNet.

As shown in Fig. 7, incorporating IoT data gave every multimodal baseline variant a small but clear improvement over its image-only counterpart, confirming that environmental context is useful for prediction. Nevertheless, AgroVisionNet outperforms them on all metrics, with 1.6–2.0% higher accuracy and F1-score than the best multimodal baseline (DenseNet121 + IoT). The enhancement is due to its adaptive fusion block and Transformer-based context modelling, which align more closely with spatial-spectral and environmental cues. The added complexity is justified, given that the model still achieves higher throughput (45 images/s) on the edge. These findings indicate that AgroVisionNet’s strength does not stem solely from multimodal input but from its fusion and optimisation strategy, enabling a transparent and valid comparison.

Ablation study

To make the ablation analysis interpretable, rather than removing components of AgroVisionNet outright, we replaced each one, one at a time, with a simplified surrogate. When the CNN feature extractor was not employed, a standard Global Average Pooling layer was used to flatten the image embeddings; for sequential and contextual dependencies, a two-layer bidirectional LSTM was used in place of the Transformer layers; and without the fusion module, the image and sensor streams were simply concatenated rather than combined through learnt weights. This replacement strategy keeps the data flow intact while isolating each module’s contribution to performance and efficiency; the surrogates are sketched below.
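A hedged sketch of the three surrogates follows; layer sizes are chosen only so that downstream shapes stay compatible with the hybrid sketch given earlier and are not the authors’ exact choices.

```python
# Sketch of the replacement-based ablation surrogates described above.
import torch
import torch.nn as nn

# (i) w/o Transformer: contextual block replaced by a two-layer bidirectional LSTM;
# hidden size 256 keeps the bidirectional output 512-d so downstream layers are unchanged.
bilstm_surrogate = nn.LSTM(input_size=512, hidden_size=256,
                           num_layers=2, bidirectional=True, batch_first=True)

# (ii) w/o CNN: the image embedding is reduced to a single globally pooled vector.
gap_surrogate = nn.AdaptiveAvgPool2d(1)

# (iii) w/o fusion module: plain concatenation instead of learned weighted fusion.
def naive_fusion(visual_vec: torch.Tensor, sensor_vec: torch.Tensor) -> torch.Tensor:
    return torch.cat([visual_vec, sensor_vec], dim=1)
```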

Table 8 Detailed ablation study showing component contribution and runtime Impact.

As shown in Table 8, the Transformer block is the most influential component for learning contextual features, yielding an accuracy improvement of nearly 2.5% over the Bi-LSTM alternative. The CNN encoder provides fine-grained spatial discrimination (especially around subtle lesion boundaries); replacing it with global pooling increases throughput but causes a noticeable accuracy loss. The fusion module, which adaptively weights the visual and environmental inputs, further improves predictions by compensating for a less informative input modality; its absence leads to lower accuracy and more false-positive detections. The latency and throughput analysis also confirms that, with all modules enabled, the model maintains good runtime performance (≈ 45 images/s) on the Jetson Nano, so the full architecture remains feasible for real-time deployment. Overall, the ablation corroborates that each component provides unique, complementary strengths and that the full AgroVisionNet offers the best trade-off between prediction performance and computational efficiency.

Table 9 Percentage performance degradation when removing individual AgroVisionNet components.

As shown in Table 9, the largest contributions to overall accuracy and class-wise balance come from the CNN feature-extraction block, followed by the IoT sensor integration and the Transformer layers. Because each component was replaced with a simpler alternative (keeping the pipeline valid), the degradation reported here reflects only that module’s contribution. For edge optimisation, disabling quantisation increased latency from 1.2 s/image to 2.5 s/image and decreased throughput from 45 images/s to roughly 20 images/s, underlining the need for optimisation for real-time field deployment on the Jetson Nano.

Fig. 8
figure 8

Ablation Study: Performance Metrics Comparison Across AgroVisionNet Variants.

Figure 8 displays four key classification metrics for the complete AgroVisionNet model and its three ablated variants. The x-axis lists the configurations: (i) Full AgroVisionNet, (ii) w/o Transformer (Bi-LSTM used instead), (iii) w/o CNN (global average pooling instead), and (iv) w/o Fusion Module (plain concatenation rather than weighted fusion). The y-axis shows the metric values in percent.

The analysis shown in Fig. 8 confirms that the full AgroVisionNet outperforms all variants on every metric (accuracy 94.2%, precision 93.5%, recall 92.8%, F1-score 93.1%), indicating that the complete architecture is optimal. Swapping the Transformer for a Bi-LSTM drops all metrics by 2–2.5 percentage points, a significant though not dramatic decline, which highlights the value of the Transformer’s contextual modelling. The lowest-scoring variant is w/o CNN (GAP), meaning that the convolutional backbone is critical for robust spatial features. The variant without the Fusion Module performs marginally better than the GAP variant but still worse than the full model, indicating that a well-designed fusion scheme outperforms naive concatenation. Overall, the figure provides additional visual evidence that each AgroVisionNet component contributes to the recognition performance.

Fig. 9
figure 9

Ablation Study: Runtime Performance Analysis (Latency vs. Throughput) of AgroVisionNet Configurations.

Figure 9 compares the runtime efficiency of AgroVisionNet and its ablated variants in terms of latency (s/image) and throughput (images/s). The evaluated configurations are: (i) Full AgroVisionNet, (ii) w/o Transformer (replaced by Bi-LSTM), (iii) w/o CNN (replaced by global average pooling), and (iv) w/o Fusion Module (direct concatenation). The graph uses two vertical axes: latency on the left (red) and throughput on the right (blue).

The full AgroVisionNet has the highest latency (1.2 s/image) and the lowest throughput (45 images/s) among the variants, consistent with its multimodal architecture and greater model complexity. Replacing the Transformer with a Bi-LSTM reduces latency to 1.0 s/image and increases throughput to 52 images/s, a reasonable trade-off that gains speed with only a small sacrifice in accuracy. Because its feature extraction is far simpler, the w/o CNN (GAP) configuration is the fastest (0.8 s/image latency, 58 images/s throughput), yet it is also the variant whose accuracy suffers most. The model without the fusion module sits in between (1.1 s/image, 48 images/s), offering a moderate balance between speed and performance. In sum, Fig. 9 shows the speed–accuracy trade-off of each architectural component: the full AgroVisionNet attains the highest recognition accuracy for a relatively modest additional computational load, whereas the reduced variants optimise runtime at the expense of predictive performance.

Computational efficiency

We evaluated the computational efficiency of AgroVisionNet in terms of inference time, throughput, and power consumption for real-time embedded drone deployment. In a stable network environment, the average inference latency was 1.2 s/image with a throughput of 45 images/s; under moderate connectivity, latency rose to 1.8 s, while in low-connectivity conditions data were buffered locally at the edge until synchronisation became possible. Reported power consumption was in the range of 5–10 W, demonstrating that the edge device can operate under field conditions where power supply is limited. These results confirm that the proposed framework provides low-latency, energy-efficient processing for real-time decision-making in agricultural interventions.

Table 10 Computational performance of AgroVisionNet under different network conditions on the edge device.

The computational performance of AgroVisionNet under different network conditions (latency, throughput, and power consumption) is presented in Table 10. The results show how the system adapts to high-bandwidth, moderate, and low-connectivity environments while maintaining consistent real-time inference, enabling effective edge deployment in variable agricultural field scenarios.

Fig. 10
figure 10

Computational Performance of AgroVisionNet Under Varying Network Conditions.

Figure 10 shows the performance of AgroVisionNet under various network conditions, evaluated in terms of latency, throughput, and power consumption for high-bandwidth, medium-connectivity, and low-connectivity scenarios. The results indicate that the framework supports energy-efficient, real-time operation on edge devices across field conditions with changing network availability.

Statistical significance testing

Statistical significance testing was employed to validate the performance gains of AgroVisionNet over the baselines. A two-tailed paired t-test was performed on multiple performance metrics (accuracy, precision, recall, and F1-score) using a threshold of p < 0.05 and a 95% confidence interval. The results show that AgroVisionNet’s improvements over traditional deep learning models such as VGG16, ResNet50, and Inception V3 are statistically significant, indicating that the differences are not due to random variation.
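As a minimal sketch of such a paired two-tailed test, the snippet below compares per-run accuracies of AgroVisionNet against one baseline with SciPy; the listed numbers are placeholders, not the study’s measurements.

```python
# Sketch of a paired two-tailed t-test on per-run metric values (placeholder numbers).
from scipy.stats import ttest_rel

agrovision_acc = [0.941, 0.944, 0.939, 0.943, 0.942]   # hypothetical per-run accuracies
densenet_acc   = [0.912, 0.915, 0.910, 0.914, 0.913]

t_stat, p_value = ttest_rel(agrovision_acc, densenet_acc)   # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```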

The p-values from statistical analysis for each metric are summarised in Table 11. All p-values are below the significance level, thereby supporting that AgroVisionNet significantly outperforms baseline models in a meaningful and repeatable way.

Table 11 P-values from statistical significance testing comparing AgroVisionNet with baseline models.

These results show that the gain obtained with AgroVisionNet is not due to random fluctuations and indicate the great potential of this approach for real agricultural decision-making.

Fig. 11
figure 11

P-Values from Statistical Significance Testing Comparing AgroVisionNet With Baseline Models.

Figure 11 shows the significance-testing results for the different metrics. The red dashed line indicates the p-value threshold of 0.05 for statistical significance. All p-values for comparisons between AgroVisionNet and the baselines fall below this threshold, confirming that the performance improvements are statistically meaningful rather than due to chance.

Error analysis

A detailed error analysis was conducted to identify and investigate the failure modes of AgroVisionNet relative to baseline models such as VGG16, ResNet50, and Inception V3. We examined several particularly challenging scenarios in the drone imagery, e.g., images taken under poor lighting, mutual occlusion (e.g., overlapping leaves), and visually similar disease symptoms that are difficult to discriminate between classes.

Our approach performs considerably better in these cases, particularly compared with baseline models such as VGG16, which record higher error rates because they struggle to model both local fine-grained details and the global context of images under such challenging circumstances. ResNet50 and Inception V3 achieved slightly better accuracy than VGG16 but still struggled with occlusion and with distinguishing similar disease classes. By combining localised feature extraction with global attention, AgroVisionNet’s hybrid CNN-Transformer architecture significantly reduced these errors, giving the model a better way to separate subtle disease signals from the noise encountered in complex field sites. Table 12 compares error rates across scenarios; the superior robustness and generalisation of AgroVisionNet across the different challenges are evident from the results.

Table 12 Error distribution comparison between AgroVisionNet and baseline models.

These results show that, within the datasets provided, a significant reduction in misclassification rates can be achieved by effectively targeting the primary sources of error, as demonstrated by AgroVisionNet. The residual errors were associated with extremely high environmental noise, severe image-quality degradation, and under-represented classes. They also point to future optimisations, such as including more samples of rare diseases and applying additional preprocessing methods to remove noise.

Fig. 12
figure 12

Error Distribution Comparison Between AgroVisionNet and Baseline Models.

Figure 12 compares error rates for three challenging scenarios: lighting issues, occlusion, and similar-symptom errors. The lower error rates of AgroVisionNet relative to the baseline models highlight its robustness to adverse conditions in agricultural fields and its ability to classify crop diseases when visual and environmental conditions are not ideal.

Impact of IoT sensor data on model performance

Experimental setup for IoT integration

To assess the benefit of the IoT sensor data available to AgroVisionNet, two experimental scenarios were designed. The first variant employed only image input, with disease classification based exclusively on multispectral and hyperspectral images captured by drones. The second setting was multimodal, gathering both images and environmental sensor data (e.g., temperature, humidity, soil moisture, light intensity).

Sensor information further helped establish the field environment’s context, allowing the model to differentiate visually similar conditions caused by environmental stress from actual disease symptoms. This comparison experiment helped quantify the benefits we could achieve from integrating IoT data.

Quantitative results

A performance comparison between image-only and multimodal approaches is shown in Table 13. The integration of IoT data clearly improved AgroVisionNet’s performance across all classification measures: accuracy, precision, recall, and F1-score.

Table 13 Comparison of AgroVisionNet performance with Image-Only and multimodal (Image + IoT Data) Configurations.

These results also show that IoT sensor data provide essential environmental context for reducing disease misclassifications, especially when visual cues are ambiguous. For example, leaf discolouration caused by drought was misclassified as a disease when using images alone but was correctly identified in the multimodal configuration. This highlights the added value of incorporating IoT data for accurate, real-time agricultural disease diagnosis.

Explainable AI (XAI) analysis

Validating the predictions made by AgroVisionNet is essential for ensuring the reliability of agricultural disease detection, which is where explainability comes into play. We used Grad-CAM, Integrated Gradients (IG), and Layer-wise Relevance Propagation (LRP) to interpret the model’s decisions. These XAI methods were applied to correctly and incorrectly classified samples from the wheat test dataset to visualise the discriminative regions relevant to the classification decisions.

Grad-CAM generated class-discriminative heatmaps of the most relevant spatial areas in the convolutional layers of AgroVisionNet. The red-highlighted areas coincide well with the diseased regions on the leaves (Fig. 13(a)), indicating that the model attends to agriculturally relevant features rather than random texture cues. As seen in Fig. 13(b), Integrated Gradients provided pixel-wise attributions based on the model’s gradients, further supporting the observation that the model focuses on image regions where visible infection is present. As shown in Fig. 13(c), LRP decomposed the final prediction into per-pixel contribution scores, yielding similar saliency patterns across disease types.
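A minimal hook-based Grad-CAM sketch is shown below; it assumes a generic image-only classifier `model` and a handle to its last convolutional layer, so it is an illustration of the technique rather than the authors’ exact implementation.

```python
# Sketch of Grad-CAM via forward/backward hooks on the last convolutional block.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    feats, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image)                                # image: (1, 3, 224, 224), assumed API
    logits[0, target_class].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel-wise importance
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # (1, H, W) class activation map
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear")
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalised heatmap
```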

To validate interpretability quantitatively, we computed insertion–deletion curves, which measure how the model’s confidence changes as the most salient pixels are added or removed step by step. The area under the insertion curve (AUC_ins) was 0.84 and the area under the deletion curve (AUC_del) was 0.27, reflecting high model fidelity and consistent localisation of disease regions. These findings confirm that the produced explanations correlate strongly with the model’s internal decision-making process and give agronomists and end users a clear, interpretable rationale through visualisation.
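The deletion side of this protocol could be sketched as follows, assuming an image-only classifier and a per-pixel saliency map; the step count and masking value are assumptions, and the insertion curve is computed analogously by adding salient pixels to a blurred or blank canvas.

```python
# Sketch of the deletion curve: blank the most salient pixels in blocks and track confidence.
import numpy as np
import torch

def deletion_auc(model, image, saliency, target_class, steps=20):
    order = torch.argsort(saliency.flatten(), descending=True)   # most salient pixels first
    x = image.clone()
    confidences = []
    per_step = order.numel() // steps
    for s in range(steps + 1):
        with torch.no_grad():
            prob = torch.softmax(model(x), dim=1)[0, target_class].item()
        confidences.append(prob)
        idx = order[s * per_step:(s + 1) * per_step]
        x.view(x.shape[0], x.shape[1], -1)[..., idx] = 0         # remove the next pixel block
    return np.trapz(confidences, dx=1.0 / steps)                 # area under the curve
```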

Fig. 13
figure 13

Explainable AI Visualisations for AgroVisionNet Predictions.

The interpretability of AgroVisionNet is analysed using three complementary XAI techniques, as illustrated in Fig. 13. Panel 13(a) shows Grad-CAM heatmaps in which the high-intensity areas coincide with the lesion regions of diseased wheat leaves, indicating focused attention on the actual disease pattern. Panel 13(b) shows Integrated Gradients attribution maps that highlight pixel-level importance consistent with visible signs of infection. Panel 13(c) shows Layer-wise Relevance Propagation (LRP), in which relevance is consistently concentrated on the diseased part of the image across the healthy, rust, blight, and smut categories. Collectively, these visualisations confirm that AgroVisionNet bases its predictions on biologically meaningful signals rather than background noise, thereby enhancing model transparency, interpretability, and trust for practical precision-agriculture applications.

Edge inference configuration and performance ablation

We exported a TensorFlow Lite model and performed post-training quantisation to analyse the deployment of AgroVisionNet on resource-constrained edge devices requiring real-time inference. The input size was set to 224 × 224 × 3 as a good trade-off between accuracy and computational cost. All experiments were run on a Jetson Nano (4 GB RAM, quad-core ARM A57 CPU, 128-core Maxwell GPU) with Ubuntu 18.04 and JetPack 4. The base AgroVisionNet model was trained in FP32 and then optimised for inference using FP16 and INT8 post-training quantisation. The quantised models were deployed with a batch size of one to mimic real-time streaming from cameras mounted on UAVs.

Quantisation involved operator fusion (Conv + BatchNorm + ReLU) and dynamic-range calibration on 500 validation images from the dataset. Quantising the model to an INT8 representation gave a good trade-off between inference speed and accuracy. The FP32 model served as the unoptimised baseline, while the FP16 model offered a middle ground with modest size savings and moderate speed gains. The quantised versions were evaluated under identical conditions to allow a fair comparison of inference performance, throughput, and model size; a conversion sketch is shown below.
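The snippet below sketches a standard TensorFlow Lite post-training INT8 conversion with a representative dataset of about 500 calibration images; the SavedModel path and the placeholder calibration list are assumptions, and the real pipeline may differ in details.

```python
# Sketch of post-training INT8 quantisation with a representative calibration set.
import numpy as np
import tensorflow as tf

calibration_images = [np.zeros((224, 224, 3), dtype=np.float32)] * 500  # placeholder images

def representative_data():
    for image in calibration_images[:500]:
        yield [image[None, ...]]                       # batch of one, shape (1, 224, 224, 3)

converter = tf.lite.TFLiteConverter.from_saved_model("agrovisionnet_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

with open("agrovisionnet_int8.tflite", "wb") as f:
    f.write(converter.convert())
```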

Table 14 Accuracy–speed trade-off for AgroVisionNet on Jetson Nano (input 224 × 224 × 3, batch size = 1).

The results in Table 14 demonstrate that the INT8 TFLite model achieves the best trade-off between speed and accuracy, enabling real-time inference at approximately 45 frames per second with only a marginal 1.1% reduction in accuracy compared with the full FP32 baseline. This confirms that the reported throughput is realistic only under the quantised configuration. All classification metrics reported in the previous sections correspond to the FP32 model, whereas the real-time evaluation results represent the INT8-quantised deployment. To ensure reproducibility, the exact TensorFlow-to-TFLite conversion scripts and Jetson Nano configuration used in this study will be made publicly available in the project repository upon acceptance.

Comparison with existing methods

To demonstrate the effectiveness of the proposed AgroVisionNet framework, we qualitatively compared its performance with several state-of-the-art (SOTA) deep learning and object-detection models commonly used in crop disease diagnosis and monitoring. These include classical CNN-based classifiers (VGG16, ResNet50, and DenseNet121), a DDS-oriented object detection method for vine leaf disease detection31, and the approach proposed in72, which uses YOLO to design an object detector for standard bean disease classification. We further included YOLO11 and D-FINE, two of the latest real-time object detectors with state-of-the-art performance in dynamic agricultural conditions, as external benchmarks.

The VGG16, ResNet50, and DenseNet121 models are commonly used for plant disease detection because of their excellent feature extraction and good generalisation performance across diverse datasets. However, these models are primarily applied to image-level classification, which cannot capture spatial context or support real-time, large-scale disease detection in the field. In our studies, these models exhibited a bottleneck in recognising disease symptoms under harsh conditions (e.g., overlapping leaves, varying illumination, and partial occlusion), which are commonly encountered in crop images from UAVs. Their results were reliable but not sensitive enough in detection, and post-processing was needed to interpret the outputs.

The deep learning hybrid model proposed by Ahmet Alkan et al.31 achieved better results by combining multiple CNN backbones to enhance feature learning. It was more accurate than single-CNN systems but lacked a global attention mechanism and real-time processing, resulting in slow inference and poor scalability. Likewise, the YOLO-based model introduced by Daniela Gomez et al.72 achieved remarkable real-time localisation of disease areas compared with conventional classifiers; it was faster than region-proposal-network (RPN)-based architectures for detection and bounding-box generation, and fast enough at field level for operational use. However, its accuracy dropped significantly when symptoms were inconspicuous or when environmental conditions such as dust, shadowing, and varying light degraded image quality.

Modern models such as YOLO11 and D-FINE were more robust in processing complex scenes and recognising multiple diseases simultaneously. These architectures feature sophisticated attention mechanisms and multi-scale feature extraction, giving them strong sensitivity to small visual cues. Despite performing well, they remained limited to unimodal visual data and demanded substantial compute and bandwidth for peak performance, which made deployment on edge devices difficult and limited their use for real-time drone monitoring over large agricultural fields.

AgroVisionNet, on the other hand, yields better qualitative results by addressing the main shortcomings of these methods. Its integrated CNN-Transformer architecture fuses local spatial and global contextual information, which is well suited to fine-grained disease symptoms under varied field conditions. Furthermore, by integrating IoT data, AgroVisionNet incorporates environmental parameters, i.e., temperature, humidity, and soil moisture, into its decision-making process. This multimodal mechanism yields context-conditioned representations that are comparatively stronger, enabling reliable detection even when visual signals are weak or variable. Table 15 compares AgroVisionNet with the state-of-the-art models with respect to detection accuracy, real-time performance, scalability, and multimodal edge deployment.

Table 15 Qualitative comparison of AgroVisionNet and state-of-the-art models for crop disease detection.

From an operational deployment standpoint, AgroVisionNet’s edge-computing compatibility is a distinct advantage. Whereas YOLO11 and D-FINE could only run inference on a powerful centralised server, AgroVisionNet worked well on drone-mounted edge devices, enabling efficient, real-time disease detection in the field. This flexibility also minimised bandwidth requirements and increased responsiveness, making it an appropriate approach for rapid intervention in extensive agricultural operations.

In qualitative field observations, AgroVisionNet produced sharper and more consistent bounding-box detections, discriminating between overlapping leaves and surrounding plants. Explainable heatmaps were also generated as visual outputs through the integrated XAI modules to support interpretation of the results by farmers and agricultural professionals. In contrast, the detections of the other models were often noisy, and it was hard to understand why they flagged certain regions.

In conclusion, although CNN classifiers such as VGG16, ResNet50, and DenseNet121, hybrid deep learning solutions for real-time environmental monitoring (Gautam et al., 2018), and YOLO-based detectors (Qin et al., 2020; Sharma A. et al., 2019a, b), including the latest architectures YOLO11 and D-FINE, have all contributed to crop disease detection, they still exhibit several limitations when tested in real-time, scalable, context-aware agricultural surveillance scenarios. The proposed AgroVisionNet framework qualitatively outperformed them by leveraging multimodal data fusion, edge-based real-time processing, and hybrid deep learning, providing a comprehensive solution for sustainable crop disease management in large agricultural fields.

Although we have limited the experiments to a single dataset, the strong positive results for all crop types and disease categories present in this dataset suggest that AgroVisionNet has high scalability potential. Validation against independent datasets and across different field conditions should be conducted to assess its field-applicability.

To strengthen the evaluation, we re-trained and evaluated two widely adopted recent vision models, EfficientNetV2-S90 and Swin Transformer (Swin-T)88, using the same dataset, preprocessing, and evaluation metrics as AgroVisionNet. This quantitative comparison provides a fair basis and demonstrates the benefits of multimodal fusion and Transformer-based contextual reasoning over purely visual models.

Table 16 Quantitative comparison of AgroVisionNet with recent state-of-the-art models.

We observe from Table 16 that the proposed AgroVisionNet outperforms two strong SOTA baselines, EfficientNetV2-S and Swin-T, across accuracy, precision, recall, F1-score, latency, and throughput on the same dataset and split. AgroVisionNet delivers the best overall classification performance yet maintains edge-friendly inference speed, making the case for multimodal fusion and task-specific optimisation.

Fig. 14
figure 14

Model Quantitative Performance Comparison.

Figure 14 compares the performance metrics of AgroVisionNet and the two SOTA baselines. Relative to EfficientNetV2-S and Swin-T, the proposed model achieves higher accuracy and F1-score, suggesting better generalisation in complex field conditions. The throughput plot also confirms its real-time performance on edge hardware, maintaining 45 images per second despite the hybrid CNN-Transformer architecture. This performance gain stems mainly from AgroVisionNet’s adaptive cross-modal fusion, which leverages environmental sensor data when unambiguous visual features are lacking, a situation that purely visual SOTA models cannot resolve. These findings reinforce the empirical benchmark that multimodal reasoning with lightweight Transformers better satisfies the practical requirements of accuracy, interpretability, and deployability.

Discussion

AgroVisionNet outperforms the aforementioned state-of-the-art methods by integrating spatial-spectral reasoning and multimodal environmental context via a modality fusion approach. VGG16 and ResNet50 are traditional CNN-based models that model only local textures, neglecting long-range contextual cues, leading to misclassifications under different lighting and occlusions when visually similar disease patterns appear. The second branch of AgroVisionNet (the Transformer branch) introduces self-attention, which establishes dynamic relationships among distant leaf regions, enabling the model to differentiate between true disease lesions and illumination artefacts or background noise.

As discussed in the error analysis (Table 12; Fig. 12), the baseline models exhibit higher false negatives and false positives, frequently confounding rust and blight symptoms owing to similar colour distributions or partial occlusions. AgroVisionNet, in contrast, fuses spatial features learned by the CNN encoder with contextual dependencies learned by the Transformer, reducing such ambiguities by 30–40% in the fine-grained analysis. This separation is aided by the IoT sensor-fusion module, whose temperature and humidity data help the model discriminate environmental stress from pathogen-induced discolouration.

However, the confusion matrix (Fig. 4) shows that some misclassifications persist between visually similar classes, such as rust and smut. These errors are usually associated with a scarcity of minority-disease samples and with acquisition constraints, since high acquisition costs sometimes prevent proper lighting during drone photography. This shortcoming could be mitigated by class-specific data augmentation for rare classes and adaptive illumination compensation during preprocessing.

From an application perspective, multimodal, attention-based models can make precision agriculture substantially more efficient by lowering false-alarm rates and providing more accurate early detection, although manually labelling such data remains expensive. Precise classification not only enables timely, targeted pesticide interventions but also supports population monitoring at scale via edge devices (e.g., Jetson Nano), offering a route to sustainability with low-latency field deployment.

Dataset limitations and biases

While the dataset was specifically selected to represent diversity in crop type, disease status, and geography, some limitations could constrain the robustness and generalisability of the AgroVisionNet framework developed here. Rare diseases are under-represented, so the model’s exposure to early or atypical presentations is limited, which has resulted in weaker detection of such infrequent cases. Disease categories are also imbalanced, with some having thousands of images and others only a few, which can bias the model towards the more frequent diseases.

Environmental noise is a further limitation. Although images were taken across several regions and seasons, samples of extreme conditions such as drought, heavy rain, or pest infestation are limited, so the model’s performance degrades in these cases.

Geographic distribution may also introduce bias, as some areas contributed more data than others, which may reduce the model’s generalisability to under-represented regions with different environmental and agricultural contexts. To address these challenges, future work will add samples from more areas, crop classes, and growth conditions. Data augmentation will be used to balance under-represented classes, and external validation on independent datasets will be conducted to demonstrate AgroVisionNet’s generalisation across practical settings.

Ethical and privacy considerations

All drone imaging and IoT data collection took place on private test plots for which the farm owners provided informed consent. Data were anonymised and GPS coordinates generalised before model training. The system stores only non-identifiable sensor data and image fragments needed for disease inference. Subsequent deployments will continue to respect data-minimisation principles and comply with local privacy and agricultural regulations.

Practical deployment challenges

Although AgroVisionNet demonstrates robust performance in experimental evaluation, several practical issues must be addressed before successful real-world deployment. The first is drone flight regulation: many jurisdictions impose strict rules, including altitude limits, restricted zones, and licensing requirements, and non-compliance can lead to legal consequences and the suspension of drone operations. Seamless deployment therefore requires coordination with local aviation authorities and adherence to their regulations.

Battery life is equally important in the field. Most drone batteries sustain flight for only 20 to 40 minutes at a time, which is often insufficient to cover a typical crop farm. This constrains data collection, especially in poorly connected or complex areas; replaceable battery packs, drone swarms, and time-optimised flight scheduling are suggested mitigations. Sensor calibration is also critical: the multispectral and hyperspectral sensors used for disease detection must be calibrated regularly, as dust accumulation and long-term use can degrade their accuracy. Incorrect sensor readings lead to erroneous classifications and poor AgroVisionNet predictions, so regular calibration procedures and automated self-test capabilities are needed for long-term operation. By actively addressing these common issues, AgroVisionNet can move beyond the constraints of controlled experimental conditions (scale, platform integration) towards large-scale application across diverse agricultural scenarios. These operational procedures must be further refined before the framework can be used routinely in practice.

Adaptability to other domains

AgroVisionNet was developed for agricultural disease detection, but its flexibility, extensibility, and the nature of the problems it solves make it applicable to other domains that require real-time visual analysis and context-aware use of environmental data. The generalisability of the framework, supported by its key components (the architecture, the image-acquisition system, IoT sensor integration, and edge computing), enables its use in different monitoring scenarios. The system can be adapted for forestry, where it can monitor tree health, watch for pests, and track early signs of forest fires. Drones with thermal or multispectral cameras can capture canopy-level data, while ground-based IoT sensors provide soil moisture and temperature measurements; authorities can use this combined data to implement more proactive forest-management policies and reduce ecosystem degradation.

AgroVisionNet is also applicable to environmental monitoring, including water quality, pollution sources, and climate variables. Layering ecological sensor data on top of visual imagery creates a holistic view of the environment, and edge computing delivers actionable insights in near real time even in remote or bandwidth-constrained areas. The system is designed to adapt to domain changes (e.g., sensor type or classifier upgrades), and this scalability makes it possible to apply AgroVisionNet not only in agriculture but also in sustainable resource management, conservation, and environmental-protection interventions. The limitations of the proposed study are discussed in Sect. 5.5, with indications of how it could be improved and implemented more widely.

Limitations of the study

The present study has some limitations that warrant further exploration. First, the dataset centres on specific crops and diseases, which limits the system’s generalisability to other agricultural situations. Second, although integrating IoT sensors improves accuracy, variability in sensor placement and data quality can reduce performance in heterogeneous environments. Third, edge computing enables real-time processing but is not inherently scalable to large deployments spanning multiple fields with heterogeneous resources that must be optimised and managed. Wider datasets, adaptive sensor fusion, and distributed edge architectures will make the system more robust and scalable, thereby removing these limitations.

Conclusion and future work

The AgroVisionNet architecture was developed for real-time agricultural disease identification, utilising drone imagery and IoT sensor data through edge computing to address the challenges outlined above. Specifically, it proposes a scalable multimodal deep learning model consisting of a CNN for spatial feature extraction, a Transformer for context-aware processing, and environmental sensing. The experimental results showed that the accuracy, precision, and real-time responsiveness of the proposed approach are high enough to have the potential to transform precision agriculture and to become an integral component of data-driven decision-making for farmers. The contributions cover the hybrid CNN-Transformer design, adaptive cross-modal fusion of IoT and visual data, and edge-level processing to reduce inference latency. Extensive comparisons, statistical significance testing, error analysis, and ablation studies were performed to demonstrate the validity and scalability of the proposed method, and the results show that AgroVisionNet improves classification accuracy while also shortening the cycle from data collection to decision-making in field use.

However, the approach is not applicable in every situation. The dataset, although diverse, does not cover rare diseases or extreme environmental conditions, and operational deployment challenges remain, including drone flight policy, battery life, and sensor calibration. These limitations should be addressed to further improve system stability and scalability. We will continue to grow the dataset with larger, multi-year collections from diverse crops, regions, and climates, and extend the system to fully distributed edge-computing deployments across wider, networked areas spanning multiple farms. Moreover, AgroVisionNet is a scalable platform that can transfer to other key verticals, such as forestry and environmental monitoring, offering a more generic solution beyond agriculture. By combining cutting-edge deep learning with practical deployment considerations, AgroVisionNet represents a valid step towards future agricultural systems and supports the development of technology-based sustainable agriculture, with gains in productivity, resource efficiency, and farmer income.