Abstract
Existing lightweight Convolutional Neural Network (CNN) detectors deployed on Unmanned Aerial Vehicle (UAV) platforms struggle with small object recognition and fail to capture long-range spatial dependencies, while standard Vision Transformer (ViT) architectures suffer from quadratic computational complexity that prohibits real-time inference on embedded hardware. This paper bridges this gap by proposing an integrated framework that adapts ViT for UAV-based real-time object detection through edge computing infrastructure. Our work presents three key contributions: (1) a hierarchical attention mechanism with shifted windows that reduces complexity from O(n²) to O(n), (2) a dynamic token pruning strategy that adaptively discards uninformative background tokens based on attention variance, and (3) a dual-mode edge-UAV collaborative architecture enabling seamless switching between autonomous onboard processing and server-assisted computation. The lightweight ViT variant achieves a 68% reduction in floating-point operations (FLOPs) while preserving 94.3% relative accuracy. Through systematic optimization combining mixed-precision quantization, structured pruning, and operator fusion, we obtain an 11.2× inference speedup over baseline implementations. Experiments on our collected aerial dataset demonstrate 73.9% mAP@0.5:0.95 at 39.2 frames per second (FPS) on an NVIDIA Jetson Xavier NX, surpassing YOLOv5s by 4.7% in accuracy under identical real-time constraints. Notably, small object detection improves by 7.4% in Average Precision (AP) compared to CNN baselines. Week-long field trials on a DJI Matrice 300 RTK validate sustained performance across varying illumination, platform vibration, and intermittent network connectivity, confirming practical viability for time-critical applications including search and rescue, disaster response, and infrastructure inspection.
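To make the second contribution concrete, the sketch below shows one way attention-variance token pruning could be realized in PyTorch. It is a minimal sketch under assumed tensor shapes: the function name, the head-averaged variance criterion, and the fixed `keep_ratio` policy are illustrative choices, not the authors' implementation, which is governed by a pruning threshold \(\tau\) (Eq. (5)) and layer-wise ratios \(r_l\) (Eq. (11)) listed in the Abbreviations below.

```python
import torch

def prune_background_tokens(tokens: torch.Tensor, attn: torch.Tensor,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the tokens whose received attention varies most across queries.

    tokens: (B, N, D) patch embeddings entering the next block.
    attn:   (B, H, N, N) attention weights from the preceding MSA layer.
    """
    # Average attention over heads, then score each token by the variance of
    # the attention it receives across query positions; near-constant
    # (low-variance) columns are treated as uninformative background.
    received = attn.mean(dim=1)                        # (B, N, N)
    score = received.var(dim=1)                        # (B, N), variance over queries
    k = max(1, int(keep_ratio * tokens.shape[1]))
    keep_idx = score.topk(k, dim=1).indices            # (B, k) retained token indices
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, dim=1, index=keep_idx)  # (B, k, D)
```

Here `attn` would come from the preceding multi-head self-attention layer; gathering the top-scoring tokens preserves the batch layout, so downstream transformer blocks run unchanged on the reduced token set.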
Data availability
All data generated and analyzed during this study are presented in the Supplementary Materials, including detailed experimental results, per-class performance metrics, ablation study data, and field deployment logs. The newly collected UAV dataset specifications and annotation guidelines are provided in Supplementary Tables S1–S3. The VisDrone dataset is publicly available at http://aiskyeye.com/ and https://github.com/VisDrone/VisDrone-Dataset. The UAVDT dataset is publicly available at https://sites.google.com/view/grli-uavdt/.
Abbreviations
- UAV: Unmanned Aerial Vehicle
- ViT: Vision Transformer
- CNN: Convolutional Neural Network
- FPS: Frames Per Second
- mAP: mean Average Precision
- AP: Average Precision
- IoU: Intersection over Union
- NMS: Non-Maximum Suppression
- FLOPs: Floating Point Operations
- GPU: Graphics Processing Unit
- CPU: Central Processing Unit
- MSA: Multi-head Self-Attention
- W-MSA: Window-based Multi-head Self-Attention
- HD: High Definition
- IMU: Inertial Measurement Unit
- CLAHE: Contrast Limited Adaptive Histogram Equalization
- TDP: Thermal Design Power
- CUDA: Compute Unified Device Architecture
- cuDNN: CUDA Deep Neural Network library
- ONNX: Open Neural Network Exchange
- COCO: Common Objects in Context
- TP: True Positive
- FP: False Positive
- FN: False Negative
- DSP: Digital Signal Processor
- \(X\): Input sequence matrix, Eq. (1)
- \(N\): Number of tokens, Sect. 2.1
- \(D\): Feature dimension, Sect. 2.1
- \(Q, K, V\): Query, key, value matrices, Eq. (1)
- \(d_k\): Key dimension, Eq. (1)
- \(h, w, C\): Height, width, channel dimensions, Eq. (3)
- \(W\): Weight matrix, Eq. (4)
- \(W_q\): Quantized weights, Eq. (4)
- \(\Delta\): Quantization step size, Eq. (4)
- \(b\): Bit-width, Eq. (4)
- \(M\): Binary pruning mask / window size, Eq. (5) / Eq. (10)
- \(\tau\): Pruning threshold, Eq. (5)
- \(T\): Temperature parameter, Eq. (6)
- \(\alpha, \beta, \eta\): Workload allocation weights, Eq. (18)
- \(r_l\): Pruning ratio at layer \(l\), Eq. (11)
- \(A_l\): Attention scores at layer \(l\), Eq. (11)
- \(\lambda_{cls}, \lambda_{loc}, \lambda_{ctr}\): Loss function weights, Eq. (14)
- \(\mathcal{A}\): Detection accuracy, Eq. (8)
- \(\mathcal{L}\): Inference latency, Eq. (8)
- \(\mathcal{E}\): Energy consumption, Eq. (8)
- \(\theta\): Model parameters, Eq. (8)
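To make the quantization symbols above concrete: Eq. (4), defined in the main text and not reproduced here, relates the full-precision weights \(W\), the quantized weights \(W_q\), the step size \(\Delta\), and the bit-width \(b\). As a standard reference form only, and under the assumption of a symmetric uniform quantizer, these quantities are typically related by

\[
\Delta = \frac{\max|W|}{2^{b-1}-1}, \qquad
W_q = \Delta \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{W}{\Delta}\right),\, -(2^{b-1}-1),\, 2^{b-1}-1\right),
\]

so that with \(b = 8\) the rounded values fall in \([-127, 127]\) (INT8 storage), consistent with the mixed-precision quantization described in the abstract; the exact form of Eq. (4) in the paper may differ.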
Acknowledgements
The authors would like to thank the Department of Computer Science at Lishui University for providing computational resources and experimental facilities. We also acknowledge the developers of the VisDrone and UAVDT datasets for making their data publicly available.
Funding
This research was supported by the China University Industry-University-Research Innovation Fund Project (Project Number: 2025DX006).
Author information
Contributions
WZ and KC conceptualized the research framework and designed the methodology. WZ developed the lightweight Vision Transformer architecture, implemented the detection module, and conducted the algorithm optimization. KC designed the edge computing infrastructure, established the experimental platform, and coordinated the field deployment trials. WZ performed the data collection, dataset construction, and model training. Both authors contributed to the experimental design and result analysis. WZ drafted the initial manuscript. KC provided critical revisions and supervised the overall research direction. Both authors reviewed, edited, and approved the final manuscript. KC served as the corresponding author and handled all correspondence.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval and consent to participate
Not applicable. This study involves technical research on computer vision algorithms and UAV systems and does not involve human participants. All aerial imagery used in the experiments was collected in compliance with local aviation regulations and privacy laws, and no identifiable personal information was included in the datasets.
Consent for publication
All authors have reviewed and approved the final manuscript for publication. The authors consent to the publication of this work in its current form.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhu, W., Chen, K. Real-time object detection for unmanned aerial vehicles based on vision transformer and edge computing. Sci Rep (2026). https://doi.org/10.1038/s41598-026-37938-5