Introduction

Underwater object detection1,2,3,4,5 plays a pivotal role in marine exploration, particularly in unmanned underwater exploration. Manual underwater detection is dangerous and time-consuming and cannot satisfy the growing demand for ocean development6,7. Vision-based underwater detection provides a promising way to meet this demand due to its high efficiency and reliable performance8,9. Underwater detection methods based on traditional vision10,11 mainly rely on fixed descriptors that extract a limited set of features to identify and locate underwater targets. These descriptors are meticulously crafted to capture specific object attributes such as contours12 and edges13, which makes it challenging for them to handle complex marine environments. The advent of Convolutional Neural Networks (CNNs) advanced computer vision by extracting features automatically. CNN-based object detection models are pioneered by two-stage and one-stage methods such as Faster R-CNN14, YOLO15, and EfficientDet16. These techniques have attained state-of-the-art (SOTA) performance in various detection scenarios17,18,19,20, and deep learning detectors are therefore extensively utilized in underwater exploration21,22,23. Underwater applications require excellent real-time detection performance, and one-stage models, especially the YOLO series algorithms, fulfill this requirement very well. However, traditional object detection methods are only partially effective for underwater tasks, primarily because numerous factors degrade image quality in underwater environments. Examples are provided in Fig. 1a, where targets are blurred by image quality degradation. Moreover, the natural camouflage of underwater organisms makes it challenging to distinguish the foreground from the background. As shown in Fig. 1b, the colors of echinus and starfish closely resemble those of the surrounding rocks, making them difficult to distinguish.

Fig. 1

Challenges in underwater object detection: examples of difficult targets.

To address the challenges of complex underwater environments, Chen et al.24 first enhance the underwater image data and then apply the YOLOv5 method for underwater object detection. Although detection performance can be improved in this way, the data processing stage is time-consuming and the resulting gain is limited. Zhang et al.25 incorporate the Transformer architecture into YOLOv5, enhancing the ability of the detector to extract features of underwater targets. Although the detection performance is significantly improved, the Transformer increases the parameters of YOLOv5, which negatively impacts detection efficiency. To improve the feature extraction ability of models and the spatial semantic localization of underwater targets while maintaining detection efficiency, this paper proposes a Partial Semantic Encoding Module (PSEM). This method applies partial channel convolutions26 to keep the model lightweight while enhancing its ability to integrate the semantic features of underwater targets, thereby improving model perception. Furthermore, an attention-based detection head named the Split Dimension Weighting Head (SDWH) is proposed. The detection head weights the foreground semantics rather than the background. Both methods can be applied to YOLO series detectors and bring significant performance improvements. The contributions of this paper are as follows.

  • A Partial Semantic Encoding Module (PSEM) is proposed. In this method, the local semantic information of underwater targets obtained through convolution is processed using residual connections and channel concatenation to ensure semantic accuracy. Partial channel convolutions are applied for refinement of the semantics, maintaining a lightweight model structure.

  • A Split Dimension Weighting Head (SDWH) is designed to weight the target semantic information of feature maps at different scales, enhancing the ability of models to discriminate foreground objects. SDWH employs attention mechanisms27 to sequentially weight the fused high-level semantic information across channel, spatial, and hierarchical dimensions in a cascaded manner, leading to the final detection output. Through attention weighting, the foreground information of underwater targets is effectively enhanced while interfering background information is suppressed, thereby improving detection performance.

  • PSEM and SDWH can be integrated into the neck and head structure of YOLO series. PSEM can replace the original convolution operations in the neck structure of the YOLO models, enhancing the fusion of high-level feature information. The fused features exhibit more accurate semantics. SDWH applies multi-dimensional semantic weighting to the fused features, further refining the spatial and channel information of foreground targets, thereby enhancing the detection capability for foreground objects of YOLO models.

The rest of this paper is structured as follows. “Related work” section introduces relevant deep learning algorithms for underwater object detection and relevant real-time detectors. In “Materials and methods” section, the relevant theories of PSEM and SDWH methods are introduced. The overall structure, which describes the modifications made to YOLO series models, is described. “Experimental results and discussion” section presents the experiments conducted with related methods on the datasets. Finally, “Conclusions” section concludes the paper.

Related work

In this section, “Underwater object detection” section introduces the relevant algorithms for underwater object detection, including one-stage, two-stage, and Transformer-based detection algorithms. It also analyzes the advantages and disadvantages of each algorithm. The real-time detection algorithms, along with YOLO-related algorithms and their application in underwater scenarios, are described in “Real-time object detectors” section.

Underwater object detection

The intricate underwater environment, insufficient lighting, and limitations of imaging equipment present substantial challenges for underwater object detection. In recent years, researchers have proposed a variety of underwater object detection models. Early methods primarily relied on handcrafted features. For example, Chuang et al.28 designed a fish recognition framework in which object parts are initialized using saliency and slack markers to ensure accurate matching, and non-rigid component models are subsequently learned based on separability, fitness, and discrimination criteria. Gupta et al.29 proposed a novel model for salient image detection in underwater scenes, identifying text or characters as salient regions. This approach aims to overcome the limitations of traditional feature extraction methods, which often depend on human supervision and lack automation. Sudhakar et al.30 identified that factors such as light disturbance, suspended particles, marine snow, and color deviation interfere with underwater image acquisition. To address this, they developed an underwater foreground extraction algorithm for the region of interest, involving color correction, contrast enhancement, and target area segmentation using the interactive GrabCut algorithm. The aforementioned methods utilize specific feature description operators to identify and localize underwater target areas with bounding boxes. However, these feature descriptors typically depend on manual feature selection and preset thresholds. The quality of the features directly impacts subsequent applications, and efficiency remains a concern. As data scale increases, meeting the growing demands of applications becomes challenging.

The emergence of deep learning has propelled object detection advancements by automating feature extraction, significantly enhancing generalization across diverse application scenarios. Significant progress has been made by directly applying vanilla detectors such as YOLOv332 and Faster R-CNN14 to underwater object detection31. To overcome the limitations of manually set thresholds and feature selection in traditional methods, many researchers have applied deep learning to underwater object detection. Chen et al.33 treated noisy data as outliers and introduced an Invert Multi-Class Adaboost (IMA) algorithm to disregard the learning of these potential outliers, which yielded strong performance on noisy underwater datasets. Fan et al.34 introduced FERNet, a network designed to extract multi-scale semantic information from underwater images. These one-stage detection algorithms avoid many limitations caused by manual involvement and efficiently complete detection tasks. However, one-stage detectors apply a single feature extractor, which is ineffective for extracting multi-scale underwater features. Therefore, these algorithms require significant improvements in detection accuracy.

Two-stage detection networks and Transformer-based methods are also commonly employed in underwater applications. For example, Zeng et al.35 integrated the Adversarial Occlusion Network (AON) with Faster R-CNN to enhance detection of small targets. Song et al.36 applied weighted processing to the extracted features during the candidate box selection step, further refining the feature information and enhancing detection performance. Dulhare et al.37 applied illumination transformation, angle transformation, and image degradation techniques to augment the dataset, which is then used to train Faster R-CNN, significantly improving detection performance. Gao et al.38 proposed an underwater object detection algorithm based on the Transformer architecture, named PE-Transformer. This method leverages the capability of the Transformer to capture long-range semantic information to extract effective features; by employing a cross-fusion approach to integrate features, the model achieves strong feature perception for multi-scale targets. Although two-stage and Transformer-based underwater detection methods achieve high accuracy, they are hindered by high model complexity and low detection efficiency. At the same time, numerous enhancements applied to the original image to improve detection accuracy are time-consuming, and the resulting accuracy gains are often limited. The YOLO algorithm improved with PSEM and SDWH achieves a significant enhancement in detection performance while maintaining efficient detection speed, and it requires no additional data processing.

Real-time object detectors

The YOLO series has risen to prominence for its efficiency in real-time object detection after years of advancement. RT-DETR39 enhances the encoder–decoder architecture to address the inefficiency of DETR-like models, surpassing YOLO-L/X in both speed and precision. However, DETR-like structures still lag in the small model domain, where YOLO models remain state-of-the-art (SOTA) in balancing accuracy and speed. Although SSD40 demonstrates excellent real-time performance, it faces challenges such as redundant bounding boxes and lacks robustness in detecting small objects. For underwater exploration, which has high real-time requirements, YOLO models remain the preferred choice. YOLOv1-v315,32,41 established the foundational single-stage detection architecture with backbone, neck, and head components, and utilized multi-scale branches to predict objects of various sizes, becoming representative one-stage detectors. However, due to their low accuracy, these classic models cannot adequately meet the requirements of underwater detection tasks. YOLOv442 optimizes the Darknet backbone and introduces enhancements such as the Mish activation function, PANet, and advanced data augmentation techniques. YOLOv543 builds on YOLOv442 with an enhanced data augmentation strategy and a wider range of model variants. Due to its improved data augmentation, YOLOv5 exhibits robust adaptability to complex environments, making it a popular choice for underwater detection research. YOLOX44 integrates a decoupled head, anchor-free design, and multi-positive assignment into its architecture, establishing a new paradigm for YOLO model design. Nevertheless, the real-time performance of YOLOX is not as good as that of YOLOv5. YOLOv645 introduces reparameterization to the YOLO series for the first time; thanks to its Rep-PAN neck and EfficientRep backbone, YOLOv6 has excellent real-time performance. YOLOv746 and YOLOv947 concentrate on evaluating the impact of the gradient path on model efficacy. However, these detectors do not achieve good lightweight performance. YOLOv848 merges the merits of its predecessors to achieve SOTA performance within the YOLO series.

Because of the excellent real-time detection performance of YOLO methods, many researchers have improved and applied them to underwater object detection. For example, Lei et al.49 utilized the Swin-Transformer architecture to replace the backbone of YOLOv5 and optimized the Feature Pyramid Network (FPN) with residual connection mechanisms, which enhances the ability to extract context from underwater datasets. Fu et al.50 introduced SE attention into the YOLOv5 backbone, modified the neck structure, and added the CBAM attention mechanism before the output, enhancing detection performance. Liu et al.51 proposed the TC-YOLO network, which improves the YOLOv5 backbone by introducing a coordinate attention mechanism and a Transformer structure to enhance detection performance; image augmentation methods are additionally used to further improve detection metrics. Guo et al.52 improved YOLOv8 by incorporating the FasterNet structure, making the model more lightweight and further enhancing detection efficiency. These YOLO-based methods have achieved notable success in underwater object detection. However, attention mechanisms impact detection efficiency, while lightweight models improve detection speed at the cost of accuracy. The YOLO-n models improved with PSEM and SDWH enhance detection performance while maintaining efficient detection speed, achieving a better balance between accuracy and real-time performance. Moreover, PSEM and SDWH are universally applicable to the YOLO-n models.

Materials and methods

In this section, a network overview is provided in “Overview” section, following which the proposed partial semantic encoding module is described in “Partial semantic encoding module (PSEM)” section. The proposed split dimension weighting head is explained in “Split dimension weighting head (SDWH)” section. The methods described in this paper utilize publicly available datasets. The RUOD and UTDAC2020 datasets were used for testing and evaluation. All data and images used in this study are openly accessible, and the original authors of these datasets have granted permission for their inclusion in open-access publications. This research does not involve ethical issues or experiments related to biological samples. All methods were conducted in accordance with relevant guidelines and regulations. It has been confirmed that all experimental protocols have been approved by Shandong Zhongqing Intelligent Technology Co., Ltd and informed consent has been obtained from all participants and/or their legal guardians.

Fig. 2

Demonstration of model architecture improvements in YOLO series.

Fig. 3

Overview of improvements in YOLO series methods.

Overview

As shown in Fig. 2a, in the YOLO series algorithms, the neck structure is responsible for multi-scale feature fusion, playing a crucial role in the ability of models to recognize underwater objects of varying sizes. Therefore, enhancing the effectiveness of multi-scale feature fusion is critical to improving underwater object detection performance. As shown in Fig. 2b, the original neck structure is improved by introducing the PSEM method, which enhances the feature extraction capability of the model while improving the ability of YOLO to perceive features of multi-scale objects. To further enhance the model’s ability to recognize underwater foreground objects, SDWH is introduced before the final output of the model. By weighting the fused effective information and reducing the impact of irrelevant information, the recognition capability of YOLO models is improved.

The overview of improvements in YOLO series methods is provided in Fig. 3. The YOLO series models comprise three essential components: a backbone for extracting features, a neck for integrating features, and a head for making predictions; CNNs are used in all three parts. The models take input images of dimensions \(H\times W\times 3\). The backbone network extracts multi-scale features via multi-layer convolutions, encompassing low-level edge and texture features as well as high-level semantic features. This paper selects DarkNet, comprising five blocks, as the backbone for extracting multi-scale features. The feature maps of the last three blocks, defined as \(C_i\in \textbf{R}^{h_i\times w_i\times c_i}(c_{i=\{1,2,3\}}=\{256, 512, 1024\})\), are employed for further processing, where \(h_i\) and \(w_i\) are formulated as \(H/2^i\) and \(W/2^i\). The neck structure plays a crucial role in integrating features from multi-layer feature maps. Using single-layer feature maps for prediction has been demonstrated to be insufficient, as it fails to accommodate the detection of objects with varying scales. To optimize the utilization of multi-layer feature maps, the pyramid structure has been introduced to facilitate feature fusion and reuse. To improve the effectiveness of semantic feature fusion for underwater foreground targets, PSEM is used in the neck structure of YOLO series detectors.

As shown in Fig. 3, from top to bottom, the neck network first adopts a \(3 \times 3\) convolution to process \(C_3\) and obtain \(F_3\), which can be formulated as

$$\begin{aligned} F_3=Conv_{3*3}(C_3), \end{aligned}$$
(1)

where \(Conv_{3*3}(.)\)53 is a \(3 \times 3\) convolution. Then the feature maps \(C_i(i=1,2)\) and \(F_{i}(i=2,3)\) are used as inputs, with upsampling aligning the channels and sizes of feature maps in different layers. Adjacent feature maps are then fused by a simple concatenation operation. The process can be defined as

$$\begin{aligned} F_i=PSEM(Concat(C_{i},Up(F_{i+1}))),\quad i=1,2, \end{aligned}$$
(2)

where \(F_i\in \textbf{R}^{h_i\times w_i\times c_i}(i=1,2,3)\), PSEM(.) is the proposed partial semantic encoding module, Concat(, )53 is the concatenation operation, and Up(.)43 is the \(2\times\) upsampling operation using bilinear interpolation. Secondly, from bottom to top, the neck network adopts a \(3 \times 3\) convolution to process \(F_1\) and obtain \(P_1\), which can be computed as

$$\begin{aligned} P_1=Conv_{3*3}(F_1). \end{aligned}$$
(3)

From the bottom to top, the feature maps \(F_i(i=2,3)\) and \(P_{i-1}(i=2,3)\) are used as inputs, with downsampling aligning the channels and sizes of feature maps in different layers. Adjacent feature maps are then fused by a simple concatenation operation. The process can be defined as follows:

$$\begin{aligned} P_i=PSEM(Concat(F_{i},Down(P_{i-1}))),\quad i=2,3, \end{aligned}$$
(4)

where \(P_i\in \textbf{R}^{h_i\times w_i\times c_i}(i=1,2,3)\) and Down(.) is the \(2\times\) downsampling operation that halves the spatial resolution to match \(F_i\). Finally, SDWH is utilized to weight the object semantics, which can be formulated as

$$\begin{aligned} O_i=SDWH(P_i),\quad i=1,2,3, \end{aligned}$$
(5)

where \(O_i\in \textbf{R}^{h_i\times w_i\times c_i}\) is the outputs of YOLO series methods, SDWH(.) is the detection head named split dimension weighting head. PSEM and SDWH can be applied to boost the performance of YOLO methods by enhancing feature fusion and weighting the foreground of underwater datasets.
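The fusion pipeline of Eqs. (1)–(5) can be sketched in PyTorch as follows. This is a hypothetical illustration, not the authors' implementation: the PSEM blocks are stood in for by \(1\times 1\) convolutions, SDWH is omitted, and a max-pool serves as the bottom-up resolution alignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeckSketch(nn.Module):
    """Sketch of the PSEM-augmented neck wiring (placeholder modules)."""
    def __init__(self, chans=(256, 512, 1024)):
        super().__init__()
        c1, c2, c3 = chans
        self.conv_c3 = nn.Conv2d(c3, c3, 3, padding=1)   # Eq. (1)
        # placeholder PSEM blocks: fuse concatenated maps back to c_i channels
        self.psem_f2 = nn.Conv2d(c2 + c3, c2, 1)
        self.psem_f1 = nn.Conv2d(c1 + c2, c1, 1)
        self.conv_f1 = nn.Conv2d(c1, c1, 3, padding=1)   # Eq. (3)
        self.psem_p2 = nn.Conv2d(c2 + c1, c2, 1)
        self.psem_p3 = nn.Conv2d(c3 + c2, c3, 1)
        self.down = nn.MaxPool2d(2)                      # bottom-up size alignment

    def forward(self, c1, c2, c3):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="bilinear")
        f3 = self.conv_c3(c3)                                        # Eq. (1)
        f2 = self.psem_f2(torch.cat([c2, up(f3)], dim=1))            # Eq. (2), i=2
        f1 = self.psem_f1(torch.cat([c1, up(f2)], dim=1))            # Eq. (2), i=1
        p1 = self.conv_f1(f1)                                        # Eq. (3)
        p2 = self.psem_p2(torch.cat([f2, self.down(p1)], dim=1))     # Eq. (4), i=2
        p3 = self.psem_p3(torch.cat([f3, self.down(p2)], dim=1))     # Eq. (4), i=3
        return p1, p2, p3
```

For a \(640\times 640\) input, the three pyramid levels carry 256, 512, and 1024 channels, and each output \(P_i\) retains the spatial size of the corresponding \(C_i\).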

Fig. 4

The structure of partial semantic encoding module (PSEM).

Partial semantic encoding module (PSEM)

The PSEM (partial semantic encoding module) utilizes residual point-wise summation and concatenation to perform global semantic encoding on feature maps. Subsequently, it employs partial convolution26 to distill refined channel semantic information of small underwater targets while remaining lightweight. As shown in Fig. 4a, \(X\in \textbf{R}^{h_i\times w_i\times c_i}\) is the input of the module, which is divided into two branches. In the first branch, \(C_{in1}\) is processed by three CBS blocks to obtain \(P_{in1}\), which is then used for pixel-wise summation. This can be derived as

$$\begin{aligned} C_{out1}=CBS(C_{in1}) + P_{in1}, \end{aligned}$$
(6)

where CBS(.)46, shown in Fig. 4b, is a block which combines a \(3\times 3\) convolution, a batch normalization and a SiLU function, and \(C_{in1},P_{in1},C_{out1}\in \textbf{R}^{h_i\times w_i\times c_i}\). In the other branch, \(C_{in2}\) is processed by a \(1\times 1\) convolution, which can be computed as

$$\begin{aligned} P_{in2}=Conv_{1*1}(C_{in2}), \end{aligned}$$
(7)

where \(C_{in2},P_{in2}\in \textbf{R}^{h_i\times w_i\times c_i}\), \(Conv_{1*1}(.)\)53 is a \(1\times 1\) convolution. \(P_{in2}\) and \(C_{out1}\) are processed by concatenation operation to get \(C_{out2}\). Finally, PConv which is shown in Fig. 4c, is used to process \(C_{out2}\) and get \(C_{out3}\). The operation can be formulated as follows:

$$\begin{aligned} C_{out3}=PConv(Concat(P_{in2},C_{out1})), \end{aligned}$$
(8)

where \(C_{out3}\in \textbf{R}^{h_i\times w_i\times c_i}\), PConv(.)26 is a partial channel convolution. Concat(.)53 is the concatenation operation.

The PSEM is utilized to enhance the channel semantic information between multi-scale feature maps, which can improve the fusion effect of network features for underwater small targets. Thereby, the module enhances the detection performance of the YOLO methods in underwater environments.
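A minimal PyTorch sketch of the PSEM data flow in Eqs. (6)–(8) is given below. The channel bookkeeping is an assumption: the input is split into two half-channel branches so that the final concatenation restores the original width, and the partial convolution operates on a quarter of the channels; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out):
    # CBS block: 3x3 convolution + batch normalization + SiLU (Fig. 4b)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv over the first 1/ratio channels, identity elsewhere."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.cp = channels // ratio
        self.conv = nn.Conv2d(self.cp, self.cp, 3, padding=1)

    def forward(self, x):
        a, b = x[:, :self.cp], x[:, self.cp:]
        return torch.cat([self.conv(a), b], dim=1)

class PSEM(nn.Module):
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.pre = nn.Sequential(cbs(half, half), cbs(half, half), cbs(half, half))
        self.res = cbs(half, half)           # CBS in Eq. (6)
        self.pw = nn.Conv2d(half, half, 1)   # 1x1 convolution in Eq. (7)
        self.pconv = PConv(c)                # partial convolution in Eq. (8)

    def forward(self, x):
        c_in1, c_in2 = x.chunk(2, dim=1)     # split the input into two branches
        p_in1 = self.pre(c_in1)              # three CBS blocks
        c_out1 = self.res(p_in1) + p_in1     # Eq. (6): residual pixel-wise summation
        p_in2 = self.pw(c_in2)               # Eq. (7)
        return self.pconv(torch.cat([p_in2, c_out1], dim=1))  # Eq. (8)
```

Because PConv only convolves a fraction of the channels, the refinement step adds far fewer parameters than a full-width convolution, which is what keeps the module lightweight.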

Split dimension weighting head (SDWH)

Attention mechanisms can emphasize more important semantic information while attenuating the semantic features of less significance27 by weighting. Attention weighting can enhance the ability of the models to distinguish foreground and recognize the positioning of small targets in underwater realms.

Fig. 5

The structure of Split Dimension Weighting Head (SDWH).

SDWH utilizes three types of attention to weight the output features of the neck network, as shown in Fig. 5a. Firstly, a \(1\times 1\) convolution is used to integrate channel information, and then level-wise weighting is applied to obtain \(O_{in1}\), as shown in Fig. 5b. The specific formulas are as follows:

$$\begin{aligned} & O_{in}=Conv_{1*1}(C_{out3}), \end{aligned}$$
(9)
$$\begin{aligned} & O_{in1}=O_{in}\odot HSigmoid(Relu(Conv_{1*1}(AvgPool(O_{in})))), \end{aligned}$$
(10)

where \(C_{out3},O_{in1}\in \textbf{R}^{h_i\times w_i\times c_i}\), \(\odot\) is pixel-wise multiplication, AvgPool(.)54 is the operation of average pooling, Relu(.)55 is an activation function, HSigmoid(.)56 is Hard sigmoid activation function. \(Conv_{1*1}(.)\)53 is a \(1 \times 1\) convolution. Detection targets possess various scales, corresponding to different scale-sized feature maps. Level-wise can enhance the scale perception ability of YOLO methods, by altering the expressive capability of different levels within the neck. Secondly, spatial-wise enhances the localization ability of detectors for foreground targets by weighting the spatial semantic information of \(O_{in1}\), which is provided in Fig. 5c. The processing can be shown below:

$$\begin{aligned} O_{in2}=O_{in1}\odot Conv_{3*3}(Conv_{3*3}(O_{in1})) \odot Sigmoid(Conv_{3*3}(O_{in1})), \end{aligned}$$
(11)

where \(O_{in2}\in \textbf{R}^{h_i\times w_i\times c_i}\), Sigmoid(.)56 is an activation function, and \(Conv_{3*3}(.)\)53 is a \(3\times 3\) convolution. Underwater targets may appear at arbitrary positions within the image, corresponding to different spatial feature maps. The spatial perception ability of YOLO methods can be enhanced by modifying the expressive capability of different spatial locations.

Finally, task-wise processes the global semantic information of \(O_{in2}\), which can be seen in Fig. 5d. The process is defined as

$$\begin{aligned} O_i=Relu(O_{in2} \odot HSigmoid(FC(Relu(FC(AvgPool(O_{in2})))))), \end{aligned}$$
(12)

where \(O_{i}\in \textbf{R}^{h_i\times w_i\times c_i}\), FC(.)57 is a fully connected layer, and Relu(.)55 is an activation function. The different task information representations of detection targets are concentrated in the channel dimension. By weighting them, the important semantics of detection targets can be highlighted, leading to improved detection performance.

SDWH weights feature maps across three dimensions: level, spatial, and channel, sequentially. This enhances the ability of detectors to discern underwater foreground targets and improves the localization capabilities of the YOLO methods.
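The cascaded weighting of Eqs. (9)–(12) can be sketched in PyTorch as follows. This is a hedged illustration: the channel reduction ratio in the task-wise branch and the layer widths are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDWH(nn.Module):
    """Sketch of the split dimension weighting head, Eqs. (9)-(12)."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.proj = nn.Conv2d(c, c, 1)            # 1x1 integration, Eq. (9)
        self.level_fc = nn.Conv2d(c, c, 1)        # level-wise weighting, Eq. (10)
        self.sp1 = nn.Conv2d(c, c, 3, padding=1)  # spatial-wise convs, Eq. (11)
        self.sp2 = nn.Conv2d(c, c, 3, padding=1)
        self.sp3 = nn.Conv2d(c, c, 3, padding=1)
        self.fc1 = nn.Linear(c, c // reduction)   # task-wise FC layers, Eq. (12)
        self.fc2 = nn.Linear(c // reduction, c)

    def forward(self, x):
        o = self.proj(x)                                                  # Eq. (9)
        w = F.hardsigmoid(F.relu(self.level_fc(F.adaptive_avg_pool2d(o, 1))))
        o1 = o * w                                                        # Eq. (10)
        o2 = o1 * self.sp2(self.sp1(o1)) * torch.sigmoid(self.sp3(o1))    # Eq. (11)
        g = F.adaptive_avg_pool2d(o2, 1).flatten(1)
        t = F.hardsigmoid(self.fc2(F.relu(self.fc1(g))))
        return F.relu(o2 * t.view(o2.size(0), -1, 1, 1))                  # Eq. (12)
```

Each stage multiplies the feature map by weights in the (0, 1) range produced from pooled statistics, so foreground responses are amplified relative to background ones without changing the map's shape.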

Experimental results and discussion

In this section, the datasets used for model training and testing and the evaluation metrics employed in the model testing are introduced in “Datasets and evaluation metrics” section. The hyperparameter settings and the choice of optimizer for the model are described in “Implementation details” section. The testing and comparative experiments of the algorithm on the UTDAC2020 dataset are presented in “Demonstration of the proposed methods on UTDAC2020” section. The experiments on the RUOD dataset are described in “Demonstration of the proposed methods on RUOD” section. The analysis of the real-time performance and the complexity are introduced in “Complexity and real-time performance analysis” section.

Datasets and evaluation metrics

To evaluate the performance of the proposed method, the UTDAC2020 and RUOD datasets are utilized. UTDAC2020 consists of 6461 images with 4 categories, while RUOD consists of 14,000 images with 10 categories. We randomly select 1036 images from the RUOD dataset to test the proposed methods, and the remaining images are used to train the models. Similarly, 518 images from the UTDAC2020 dataset are selected to test the models, and the rest are used to train the detectors.

For a comprehensive evaluation, average precision (AP) at specific IoU thresholds is used to assess the performance of the proposed methods. AP@0.5 and AP@0.75 are the AP values at IoU thresholds of 0.5 and 0.75, respectively. The mean Average Precision (mAP), calculated as the mean of per-class AP values at a threshold of 0.5, is used as the primary metric. Moreover, APs, APm, and APl are used to evaluate the mAP performance on small, medium, and large objects, respectively.
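To make the thresholded metrics concrete, the following hypothetical helper computes the IoU between two boxes; a prediction counts as a true positive for AP@0.5 when its IoU with a matched ground-truth box is at least 0.5.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

gt, pred = (10, 10, 50, 50), (15, 15, 55, 55)
print(iou(gt, pred) >= 0.5)  # True: IoU is about 0.62, a true positive at AP@0.5
```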

Implementation details

Our methods are implemented on an NVIDIA GeForce RTX 2080 SUPER GPU. Network parameters are initialized with a normal distribution. Stochastic Gradient Descent (SGD) is used as the optimizer, with a weight decay of 0.0001 and a momentum of 0.9. With a batch size of 32, the initial learning rate is set to 0.01 and is decreased by a factor of 0.1 at epochs 24 and 30 out of a total of 50 epochs. During the training and testing phases, input images are resized to 640 \(\times\) 640 pixels and augmented by horizontal and vertical flipping with a probability of 0.5.
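The optimizer and learning-rate schedule described above can be reproduced with a few lines of PyTorch; the placeholder model is only for illustration.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 16, 3)  # placeholder network for illustration
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0001)
# the learning rate is multiplied by 0.1 at epochs 24 and 30 of 50 total epochs
scheduler = MultiStepLR(optimizer, milestones=[24, 30], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over 640x640 inputs with batch size 32 ...
    optimizer.step()   # stand-in for the per-batch parameter updates
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # ~1e-4 after both milestones
```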

Demonstration of the proposed methods on UTDAC2020

Ablation experiments on UTDAC2020

The proposed methods, PSEM and SDWH, aim to address the issues of feature fusion and resistance to background interference for underwater targets. In the ablation experiments, the YOLO series models YOLOv5n, YOLOv6n, and YOLOv8n are utilized as baselines. The improvement scheme for the baselines is shown in Fig. 3: the convolution operations in the neck networks following the concatenation operations are replaced with PSEM, and SDWH is added to the detection head of the detectors.

Then, we evaluate the design choices of PSEM and SDWH, adding them incrementally to analyze their effects. The evaluation results are shown in Table 1. Applying PSEM and SDWH consistently improves the performance of the models. PSEM improves YOLOv8n from 80.1% to 81.0% mAP, and SDWH brings a larger improvement of 2.7% mAP; the AP50 and AP75 performances improve simultaneously. Using both together brings a considerable improvement of 2.8% mAP (80.1% vs 82.9%). In addition, ablation experiments were conducted on YOLOv5n and YOLOv6n. PSEM improves YOLOv5n from 80.8% to 80.9% mAP, and SDWH improves it from 80.8% to 81.2%; AP50 and AP75 improve simultaneously. Using both brings an improvement of 1% mAP (80.8% vs 81.8%). For YOLOv6n, PSEM brings an improvement from 77.2% to 78.1% mAP, and SDWH a larger improvement of 2.7% mAP, with AP50 and AP75 again improving simultaneously. Using both brings a considerable improvement of 3% mAP (77.2% vs 80.2%). The experiments demonstrate the significance of PSEM for feature fusion and of SDWH for weighting foreground targets.

Table 1 Ablation analyses on UTDAC2020 dataset.
Fig. 6

Examples of detection results on UTDAC2020 dataset.

Fig. 7

Typical underwater objects of the UTDAC2020 dataset and visualizations of the feature maps are depicted. (a) Shows the ground truth of the objects. (b,c) Visualize the feature maps, in which the color represents the degree of activation. The feature maps obtained by YOLOv8 and the improved YOLOv8 with the proposed methods are shown in (b,c), respectively. The improved YOLOv8 extracts the feature information of the ground truth exactly.

Performance demonstration on UTDAC2020

Examples of underwater object detection are shown in Fig. 7. The corresponding feature maps obtained by YOLOv8 and the improved YOLOv8 are visualized in Fig. 7b,c. Evidently, the activation of underwater target features achieved by the improved YOLOv8 surpasses that of YOLOv8, which means that the proposed methods can guide the attention of the network towards the ground truth regions. These methods significantly enhance the identification of underwater target features, thereby improving the classification and regression of underwater objects.

To further demonstrate the improved performance of YOLO models with PSEM and SDWH, a performance comparison is conducted with underwater real-time detection algorithms. The results of the comparison experiments are provided in Table 2. Among the compared methods, MIPAM-YOLO achieves the highest mAP of 81.2% and AP75 of 48.2%. In comparison, YOLOv8Plus achieves 82.9% mAP and 48.4% AP75, while YOLOv5Plus achieves the highest AP75 of 49.1%. YOLOv8Plus also achieves the highest performance in the APs, APm, and APl metrics. The improvements are obvious, which demonstrates the effectiveness of PSEM and SDWH.

Examples of underwater object detection in UTDAC2020 are shown in Fig. 6. The results demonstrate that PSEM and SDWH are efficient at detecting underwater targets in low contrast environments.

Table 2 Performance of the YOLO series detectors using PSEM and SDWH assessed on UTDAC2020 dataset.
Table 3 Ablation analyses on RUOD dataset.

Demonstration of the proposed methods on RUOD

Ablation experiments on RUOD

The ablation experiments are conducted on the RUOD dataset to comprehensively validate the effectiveness of PSEM and SDWH. The results are provided in Table 3. As shown in Table 3, a single PSEM or SDWH can stably boost performance. The improvements from PSEM for YOLOv8n, YOLOv5n, and YOLOv6n are 0.9%, 0.5%, and 1.1% mAP, respectively, while those from SDWH are 1.9%, 0.9%, and 3.1% mAP. Using both PSEM and SDWH improves YOLOv8n, YOLOv5n, and YOLOv6n by 2.7%, 1.5%, and 3.7% mAP, respectively. Concurrently, the AP50 and AP75 performances improve simultaneously. The experiments highlight the importance of PSEM for feature fusion and the role of SDWH in weighting foreground targets.

Fig. 8

Examples of detection results on RUOD dataset.

Fig. 9

Typical underwater targets of the RUOD dataset are depicted alongside visualizations of their feature maps. (a) shows the ground truth of these targets, while (b) and (c) display the feature maps from YOLOv8 and from the version enhanced by our proposed methods, respectively, with colors indicating activation levels. The enhanced YOLOv8 accurately extracts feature information corresponding to the ground truth.

Performance demonstration on RUOD

Examples of underwater object detection on the RUOD dataset are illustrated in Fig. 9, with the feature maps generated by YOLOv8 and its enhanced version depicted in Fig. 9b,c, respectively. Clearly, the improved YOLOv8 outperforms YOLOv8 in activating features related to underwater targets, indicating that the proposed methods effectively direct the attention of the network towards the ground-truth regions. These approaches notably enhance the extraction of underwater target features, thus improving the classification and regression of underwater objects. To further illustrate their effectiveness, PSEM and SDWH are applied to the YOLO series detectors, and the comparison results are presented in Table 4. Among the existing methods, MIPAM-YOLO achieves the highest mAP of 79.8% and KCF-Faster R-CNN achieves the highest AP75 of 47.1%. In comparison, YOLOv8Plus achieves 80.9% mAP and YOLOv5Plus achieves 47.3% AP75; YOLOv5Plus also achieves the highest APs of 19.0%, while YOLOv8Plus achieves the best performance on the APm and APl metrics. These clear improvements demonstrate the effectiveness of PSEM and SDWH.

Figure 8 showcases examples of detection results on the RUOD dataset. The results demonstrate the effectiveness of PSEM and SDWH in detecting weak and small targets. The experiments and visualizations collectively substantiate the efficacy of the proposed methods and highlight the substantial enhancement of underwater target detection achievable with our approach. Notably, PSEM and SDWH integrate seamlessly with YOLO series detectors, yielding a discernible performance improvement in underwater applications.

Table 4 Performance of the YOLO series detectors using PSEM and SDWH assessed on RUOD dataset.
Table 5 Complexity and real-time performance comparison of prevalent detectors before and after applying the proposed PSEM and SDWH.

Complexity and real-time performance analysis

Due to the high real-time requirements of underwater environment detection, network complexity and real-time performance are significant concerns for evaluating the practicality of detectors.

Taking YOLOv8n, YOLOv5n, and YOLOv6n as the baselines, we report the floating point operations (FLOPs), parameters (Params), and frames per second (FPS) to compare the complexity and real-time performance of the methods. As shown in Table 5, for YOLOv8n, the addition of PSEM and SDWH increases the FLOPs by about 10% and the Params by about 5%, while the inference speed drops by 7 FPS. For YOLOv5n, the FLOPs increase by about 18% and the Params by about 16%, with the inference speed reduced by 9 FPS. For YOLOv6n, the FLOPs increase by about 9% and the Params by about 10%, while the inference speed drops by 14 FPS.
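FPS figures of this kind are typically obtained with a warm-up-then-average timing loop. The sketch below is a minimal, framework-agnostic version of such a measurement; the `model` callable and the toy stand-in are placeholders for illustration, not the detectors evaluated above.

```python
import time

def measure_fps(model, inputs, warmup=10, iters=100):
    """Average single-batch inference speed in frames per second.

    `model` is any callable (e.g. a detector's forward pass) and
    `inputs` is one preprocessed batch. Warm-up iterations keep
    one-off costs (lazy allocation, caching, JIT) out of the timing.
    """
    for _ in range(warmup):
        model(inputs)
    start = time.perf_counter()
    for _ in range(iters):
        model(inputs)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Toy stand-in for a detector: a fixed amount of work per call.
dummy_model = lambda x: sum(i * i for i in range(1000))
fps = measure_fps(dummy_model, None)
print(f"{fps:.1f} FPS")
```

Averaging over many iterations after warm-up is what makes the FPS numbers in Table 5 comparable across models.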

As a result, YOLO series detectors enhanced with PSEM and SDWH can still meet real-time requirements in underwater applications, and the additional time cost introduced by the proposed methods falls within an acceptable range.

Conclusions

This paper proposes the Partial Semantic Encoding Module (PSEM), which enhances feature fusion by locally encoding channel information in the neck network features, and the Split Dimension Weighting Head (SDWH), an attention mechanism integrated into the detection head. PSEM utilizes a residual pixel-wise addition block to encode the spatial semantics of the feature maps in the neck network, and employs partial channel convolution to enrich the contextual information at low cost. SDWH operates on feature maps fused by concatenation, weighting their hierarchical, spatial, and channel semantics separately before the final detection outputs are produced. PSEM and SDWH improve the ability of detectors to locate and identify underwater foreground targets, and can be seamlessly integrated into different YOLO series models for efficient detection in the underwater realm. Experiments conducted on the UTDAC2020 and RUOD datasets show that PSEM and SDWH effectively improve model performance at an acceptable computational cost, indicating further potential for real-world underwater applications.
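To make the split-dimension weighting idea concrete, the NumPy sketch below weights a fused feature map along its channel and spatial dimensions independently. It is a conceptual illustration of the principle behind SDWH under our own simplifying assumptions (sigmoid-squashed pooled statistics as weights), not the paper's exact module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_dimension_weighting(feat):
    """Conceptual sketch: weight a (C, H, W) feature map along
    separate dimensions, in the spirit of SDWH (not the exact module).

    Channel weights come from global average pooling over H and W;
    spatial weights from averaging over the channel axis. Each weight
    set is squashed to (0, 1) with a sigmoid and applied
    multiplicatively, so salient channels and locations are
    emphasized independently of one another.
    """
    c_weights = sigmoid(feat.mean(axis=(1, 2)))   # (C,) channel attention
    s_weights = sigmoid(feat.mean(axis=0))        # (H, W) spatial attention
    return feat * c_weights[:, None, None] * s_weights[None, :, :]

feat = np.random.randn(8, 16, 16)   # a toy fused feature map
out = split_dimension_weighting(feat)
```

Because each dimension is pooled and weighted on its own, the mechanism can suppress background channels and background locations separately, which is the intuition behind weighting foreground targets.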