Table 1 Brief related work with research gaps and future scope in the YOLO series.

From: An optimized YOLO NAS based framework for realtime object detection

| Author | Dataset | Model | Results | Research Gap |
| --- | --- | --- | --- | --- |
| Alina Ciocarlan et al.23 | NUAA-SIRST and IRSTD-1k datasets | A deep a contrario framework with an NFA module is proposed. | The model outperformed other SOTA models at detecting tiny objects. | The model cannot produce bounding boxes for real-time object detection. |
| Sri Padma et al.24 | Kaggle image dataset and images captured from live-stream videos | YOLOv2 | The paper presents a comprehensive study of YOLOv2 and an improved YOLOv2 for face mask detection. The improved model achieved an accuracy of 95%. | The proposed model cannot detect face masks when an image contains more than two persons. |
| Sijie et al.25 | PASCAL VOC07+12 and MS COCO datasets | A lightweight YOLOv4 detection model for security contexts is proposed, with the attention-based MobileNetV2-CA replacing the original backbone. | The model achieved a mAP of 74.73%. | Future work will reduce the accuracy loss and optimize the model for better detection accuracy. |
| Zhengwei et al.26 | Drone vehicle dataset and UCAS-AOD dataset | A lightweight rotational YOLOv5 (R-YOLOv5) is proposed for vehicle detection in dense scenes. | The model's accuracy on the two datasets was 84% and 90%, respectively. | The model cannot detect very small objects under occlusion or difficult lighting conditions. |
| Chhaya Gupta et al.27 | MS-COCO dataset | A fine-tuned, transfer-learned YOLOv6 is introduced for real-time object recognition. | The suggested model produced better results than SSD, Faster R-CNN, Mask R-CNN, YOLOv4, and YOLOv6. | A kernel pruning algorithm will be applied to further improve detection accuracy. |
| Ignazio et al.28 | Agricultural UAV dataset | YOLOv7 is used to detect weeds in chicory crops. | The model achieved a mAP of 56%. | The model cannot yet detect all types of crop weeds. |
| Abdur et al.29 | MS-COCO dataset | YOLOv7 is used to detect and count vehicles in real time. | The model produced favorable outcomes. | The model could not detect fast-moving vehicles in videos. |
| Armstrong et al.30 | Custom dataset | YOLOv8 is used to detect helmet violations in real-time video frames. | With a mAP of 58.61%, the model produced satisfactory results. | The model can be enhanced further with transfer learning for better detection accuracy. |
| Haitong et al.31 | Tiny person dataset and PASCAL VOC 2007 dataset | DC-YOLOv8 is suggested for real-time object detection using camera sensors. | The model performed well compared to YOLOX, YOLOR, YOLOv3, scaled YOLOv5, YOLOv7-tiny, and YOLOv8. | The model cannot detect a person sitting in a fast-moving vehicle without a helmet. |
| Yugen Yi et al.32 | DUTS, HKU-IS, PASCAL-S, ECSSD, and DUT-OMRON datasets | A GPONet optimization network merged with a gate fusion network is proposed for salient object detection. | The model achieved good results when compared with other SOTA models. | The model detected some non-salient objects, and its segmentation was inaccurate. |
| Arpita Dutta et al.33 | eBDtheque, DCM, Manga, BCBId, and COMICS datasets | A framework named EmoComicNet is proposed for comic emotion analysis. | The model achieved better results. | The model is limited to Bangla and English comic datasets. |
| Priyanka et al.34 | YOLO-NAS | Tropical cyclone intensity is estimated in real time from satellite images using the YOLO-NAS model. | The model achieved an accuracy of 81%. | The model's performance depends on image quality and is limited by computational and real-time processing constraints. |
| Nguyen et al.35 | YOLO-NAS | Container damage detection using a deep learning model. | The model achieved a mAP of 91.2%. | Despite its advantages, the model faces limitations, including its reliance on high-quality, annotated datasets and potential challenges in detecting damage in cluttered or occluded environments. Minor defects, such as rust stains, may also go undetected, and its high computational demands can pose difficulties in resource-constrained settings. |
| Anil Kumar et al.36 | RMFD, IMFD, and real-time video captures | MobileNetV2 with transfer learning (CAFFE) | Achieved 95.3% detection accuracy for masked/unmasked faces; inference tested on webcam. | No support for multi-object environments; lacks real-time speed benchmarking (FPS); no metaheuristic tuning included. |
| Anil Kumar et al.37 | Custom masked face dataset | Modified MobileNetV2, CAFFE framework, transfer learning | Achieved 91.2% accuracy for masked face age prediction, with robust performance in low-light and real-time conditions. | No cross-dataset validation; lacks integration with object detection pipelines like YOLO; no optimization strategy applied. |