Table 1 Synthesised review of existing drone- and IoT-based crop disease detection studies and the research gaps that AgroVisionNet addresses.
Study/source in manuscript | Approach/focus | Limitation in the context of this work | Research gap filled by AgroVisionNet |
|---|---|---|---|
Pacal87 (maize disease on a large dataset) | High-accuracy CNN on single-modality crop images | Image-only; does not consider field/ambient variability from sensors | Add an IoT/environmental stream to make detection robust to field conditions |
Pacal et al.88 (AI-enhanced MetaFormer for olive leaves) | Transformer-style vision for leaf diseases | Strong visual modelling, but unimodal and not edge-oriented | Combine CNN + Transformer with edge-optimised deployment for field drones |
Pacal and Işık (CNN + ViT for corn leaf disease) | Comparison of CNN and ViT for precise disease ID | No multimodal fusion; assumes good-quality images | Learnable fusion block (image + IoT) to handle visually ambiguous cases |
Avşar and Mowla85 (smart-agri wireless/IoT review) | Rich discussion of sensor-based smart farming | Sensing only; no image-level disease confirmation | Fuse sensor cues with drone imagery in one DL pipeline |
Mowla and Gök86 (weed detection networks) | Deep networks for vegetation/weed detection | Focused on vision task only, not multimodal and not UAV-scale | Extend to UAV-based, multimodal crop-disease detection with attention |
Daniela Gomez/YOLO-style agri detector72 | Real-time object detection for leaf/bean diseases | Fast but image-only; performance drops under dust, shadow, and occlusion | Keep real-time property but add multimodal fusion to stabilise predictions |
Hybrid DL for vine leaf disease (Ahmet Alkan et al.)31 | Multiple CNNs for richer features | Heavier, slower, still unimodal, not edge-friendly | Lighter CNN–Transformer backbone plus TFLite/quantisation for Jetson-class devices |
This work (AgroVisionNet) | CNN + 2-layer Transformer + adaptive fusion (learnable α, β) | — | Provides a single, edge-deployable, multimodal architecture that aligns image features with sensor context and supports XAI |
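
The adaptive fusion with learnable α and β listed in the last row can be illustrated with a minimal sketch. The table does not give the fusion formula, so the version below rests on assumptions: the image and sensor embeddings have already been projected to the same length, the fused vector is the weighted sum α·f_img + β·f_sensor, and α and β come from a softmax over two learnable scalars so that they stay positive and sum to 1. The function name `adaptive_fusion` and all numeric values are hypothetical.

```python
import math


def adaptive_fusion(img_feat, sensor_feat, alpha_logit, beta_logit):
    """Fuse two same-length feature vectors with a learnable weighted sum.

    Assumption: alpha and beta are a softmax over two learnable scalars.
    The table only states that the weights are learnable; this
    normalisation is an illustrative choice, not the authors' method.
    """
    if len(img_feat) != len(sensor_feat):
        raise ValueError("feature vectors must have the same length")
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(alpha_logit, beta_logit)
    ea = math.exp(alpha_logit - m)
    eb = math.exp(beta_logit - m)
    alpha = ea / (ea + eb)
    beta = eb / (ea + eb)
    fused = [alpha * x + beta * s for x, s in zip(img_feat, sensor_feat)]
    return fused, alpha, beta


# Hypothetical 3-dimensional embeddings (made-up numbers, not from the paper).
fused, alpha, beta = adaptive_fusion([1.0, 0.0, 2.0], [0.0, 1.0, 2.0], 0.0, 0.0)
```

With equal logits both modalities receive the same weight. During training, the two logits would be updated by gradient descent together with the rest of the network, which lets the model lean on the sensor stream when the image evidence is ambiguous.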