Table 1 Synthesised review of existing drone-/IoT-based crop disease detection studies and the research gaps addressed by AgroVisionNet.

From: AI-driven drone technology and computer vision for early detection of crop disease in large agricultural areas

| Study/source in manuscript | Approach/focus | Limitation in the context of this work | Research gap filled by AgroVisionNet |
|---|---|---|---|
| Pacal87 (maize disease on a large dataset) | High-accuracy CNN on single-modality crop images | Image-only; does not consider field/ambient variability from sensors | Adds an IoT/environmental stream to make detection robust to field conditions |
| Pacal et al.88 (AI-enhanced MetaFormer for olive leaves) | Transformer-style vision for leaf diseases | Strong visual modelling, but unimodal and not edge-oriented | Combines CNN + Transformer with edge-optimised deployment for field drones |
| Pacal and Işık (CNN + ViT for corn leaf disease) | Comparison of CNN and ViT for precise disease identification | No multimodal fusion; assumes good-quality images | Learnable fusion block (image + IoT) to handle visually ambiguous cases |
| Avşar and Mowla85 (smart-agri wireless/IoT review) | Rich discussion of sensor-based smart farming | Sensing only; no image-level disease confirmation | Fuses sensor cues with drone imagery in one DL pipeline |
| Mowla and Gök86 (weed detection networks) | Deep networks for vegetation/weed detection | Vision task only; not multimodal and not UAV-scale | Extends to UAV-based, multimodal crop-disease detection with attention |
| Daniela Gomez72 (YOLO-style agri detector) | Real-time object detection for leaf/bean diseases | Fast but image-only; performance drops under dust, shadow, and occlusion | Keeps the real-time property but adds multimodal fusion to stabilise predictions |
| Ahmet Alkan et al.31 (hybrid DL for vine leaf disease) | Multiple CNNs for richer features | Heavier, slower, still unimodal, not edge-friendly | Lighter CNN–Transformer backbone plus TFLite quantisation for Jetson-class devices (see the quantisation sketch below the table) |
| This work (AgroVisionNet) | CNN + 2-layer Transformer + adaptive fusion with learnable α, β (sketched below the table) | — | Provides a single, edge-deployable, multimodal architecture that aligns image features with sensor context and supports XAI |
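To make the "adaptive fusion (learnable α, β)" entry concrete: the idea is a small module that projects the image and sensor feature vectors into a shared space and combines them with trainable weights. The PyTorch sketch below is an illustration only, not the paper's implementation; the module name `AdaptiveFusion`, the projection sizes, and the softmax normalisation of α and β are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Toy fusion block: weights image and sensor features with learnable alpha/beta."""

    def __init__(self, img_dim: int, sensor_dim: int, fused_dim: int = 256):
        super().__init__()
        # Project both modalities into a shared space so they can be summed.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.sensor_proj = nn.Linear(sensor_dim, fused_dim)
        # Two learnable scalars; softmax keeps them positive and summing to 1.
        self.raw_weights = nn.Parameter(torch.zeros(2))  # [alpha_raw, beta_raw]

    def forward(self, img_feat: torch.Tensor, sensor_feat: torch.Tensor) -> torch.Tensor:
        alpha, beta = torch.softmax(self.raw_weights, dim=0)
        return alpha * self.img_proj(img_feat) + beta * self.sensor_proj(sensor_feat)

# Usage: fuse a 512-d image embedding with a 16-d IoT sensor vector (dims assumed).
fusion = AdaptiveFusion(img_dim=512, sensor_dim=16)
fused = fusion(torch.randn(8, 512), torch.randn(8, 16))  # -> shape (8, 256)
```

Normalising the two weights with a softmax is one common design choice: it keeps α + β = 1, so the fused feature stays on a stable scale while training shifts reliance between the image and sensor streams.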
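The edge-deployment path named in the last two rows (TFLite quantisation for Jetson-class devices) typically amounts to post-training quantisation with the TFLite converter. The snippet below shows that generic workflow, not the paper's actual export script; the tiny Keras model and the output file name are placeholders standing in for the CNN–Transformer backbone, which the table does not specify.

```python
import tensorflow as tf

# Placeholder: any trained Keras model standing in for the real backbone.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Post-training dynamic-range quantisation: weights are stored as 8-bit integers,
# shrinking the model for edge inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("agrovisionnet_quant.tflite", "wb") as f:  # hypothetical file name
    f.write(tflite_model)
```

Dynamic-range quantisation (`Optimize.DEFAULT`) converts only the weights to 8-bit; full-integer quantisation of activations would additionally require a representative dataset passed to the converter.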