Table 1 Synthesised review of existing drone-/IoT-based crop disease detection studies and the research gaps addressed by AgroVisionNet.

From: AI-driven drone technology and computer vision for early detection of crop disease in large agricultural areas

| Study/source in manuscript | Approach/focus | Limitation in the context of this work | Research gap filled by AgroVisionNet |
|---|---|---|---|
| Pacal87 (maize disease on a large dataset) | High-accuracy CNN on single-modality crop images | Image-only; does not consider field/ambient variability from sensors | Adds an IoT/environmental stream to make detection robust to field conditions |
| Pacal et al.88 (AI-enhanced MetaFormer for olive leaves) | Transformer-style vision for leaf diseases | Strong visual modelling, but unimodal and not edge-oriented | Combines CNN + Transformer with edge-optimised deployment for field drones |
| Pacal and Işık (CNN + ViT for corn leaf disease) | Comparison of CNN and ViT for precise disease identification | No multimodal fusion; assumes good-quality images | Learnable fusion block (image + IoT) to handle visually ambiguous cases |
| Avşar and Mowla85 (smart-agri wireless/IoT review) | Rich discussion of sensor-based smart farming | Sensing only; no image-level disease confirmation | Fuses sensor cues with drone imagery in one DL pipeline |
| Mowla and Gök86 (weed detection networks) | Deep networks for vegetation/weed detection | Vision task only; not multimodal and not UAV-scale | Extends to UAV-based, multimodal crop-disease detection with attention |
| Daniela Gomez72 (YOLO-style agri detector) | Real-time object detection for leaf/bean diseases | Fast but image-only; performance drops under dust, shadow, and occlusion | Keeps the real-time property but adds multimodal fusion to stabilise predictions |
| Ahmet Alkan et al.31 (hybrid DL for vine leaf disease) | Multiple CNNs for richer features | Heavier, slower, still unimodal, not edge-friendly | Lighter CNN–Transformer backbone plus TFLite quantisation for Jetson-class devices (see the quantisation sketch below the table) |
| This work (AgroVisionNet) | CNN + 2-layer Transformer + adaptive fusion with learnable α, β (sketched below the table) | — | Provides a single, edge-deployable, multimodal architecture that aligns image features with sensor context and supports XAI |
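To make the "adaptive fusion (learnable α, β)" entry concrete: the idea is a small module that projects the image and sensor feature vectors into a shared space and combines them with trainable weights. The PyTorch sketch below is an illustration only, not the paper's implementation; the module name `AdaptiveFusion`, the projection sizes, and the softmax normalisation of α and β are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Toy fusion block: weights image and sensor features with learnable alpha/beta."""

    def __init__(self, img_dim: int, sensor_dim: int, fused_dim: int = 256):
        super().__init__()
        # Project both modalities into a shared space so they can be summed.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.sensor_proj = nn.Linear(sensor_dim, fused_dim)
        # Two learnable scalars; softmax keeps them positive and summing to 1.
        self.raw_weights = nn.Parameter(torch.zeros(2))  # [alpha_raw, beta_raw]

    def forward(self, img_feat: torch.Tensor, sensor_feat: torch.Tensor) -> torch.Tensor:
        alpha, beta = torch.softmax(self.raw_weights, dim=0)
        return alpha * self.img_proj(img_feat) + beta * self.sensor_proj(sensor_feat)

# Usage: fuse a 512-d image embedding with a 16-d IoT sensor vector (dims assumed).
fusion = AdaptiveFusion(img_dim=512, sensor_dim=16)
fused = fusion(torch.randn(8, 512), torch.randn(8, 16))  # -> shape (8, 256)
```

Normalising the two weights with a softmax is one common design choice: it keeps α + β = 1, so the fused feature stays on a stable scale while training shifts reliance between the image and sensor streams.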
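The edge-deployment path named in the last two rows (TFLite quantisation for Jetson-class devices) typically amounts to post-training quantisation with the TFLite converter. The snippet below shows that generic workflow, not the paper's actual export script; the tiny Keras model and the output file name are placeholders standing in for the CNN–Transformer backbone, which the table does not specify.

```python
import tensorflow as tf

# Placeholder: any trained Keras model standing in for the real backbone.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Post-training dynamic-range quantisation: weights are stored as 8-bit integers,
# shrinking the model for edge inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("agrovisionnet_quant.tflite", "wb") as f:  # hypothetical file name
    f.write(tflite_model)
```

Dynamic-range quantisation (`Optimize.DEFAULT`) converts only the weights to 8-bit; full-integer quantisation of activations would additionally require a representative dataset passed to the converter.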