Introduction

Underwater pipelines are prone to corrosion and deformation, which, if undetected, can cause serious economic and environmental damage. Traditionally, inspection relies on remotely operated vehicles (ROVs), with data manually analyzed by skilled operators1. However, this process is costly, time-consuming, and prone to human error2, motivating the need for automated solutions. Despite advances in artificial intelligence (AI), the noisy and low-quality data collected in underwater environments limit the effectiveness of existing approaches1. ROVs typically provide operational, ultrasonic, and vision-based data.

In3, pipelines were identified by adjusting the ROV's thrusters: the vertical thruster maintained depth, equal forces on the side thrusters enabled forward motion, and differential forces induced angular deviations. Experiments showed that the ROV followed the pipeline path with errors of 0.072 m and 0.037 m along the x- and y-axes, respectively. A different method for identifying and positioning buried pipelines was presented in4, where a multi-sensor surveying system acquired acoustic profile images and both over- and under-water topography. A position deviation correction method improved identification accuracy, reducing the average error to 0.06 m and 0.07 m along the x- and y-axes. Deep learning techniques have also been explored. In5, GoogleNet was applied to side-scan sonar images, achieving a 90% identification accuracy. The study further highlighted the importance of dataset selection in transfer learning, reporting that pre-training with ImageNet improved accuracy by about 10%. Similarly, field experiments in6 used multi-beam and forward-looking sonar to construct a dataset capturing various structural damages. A segmentation network combining channel and spatial attention mechanisms demonstrated strong segmentation performance while maintaining relatively fast inference speed. More recently, a method was presented in7 for identifying and positioning exposed underwater pipelines in 3D sonar images using an enhanced YOLOv5 model. The region-growing algorithm, initialized from YOLOv5-detected bounding box centers, refined the pipeline localization. The final positioning was determined using spatial relationships among the pipelines, ROV, and tracking ship. This method achieved a reported performance of 77%. While methods using magnetic8, sonar9, or ultrasonic data have been explored, each suffers from nonlinear distortions, low contrast, or multipath reflections. In contrast, optical sensors offer high resolution, fast acquisition, and additional cues such as shadows, markings, and textures10, making them well-suited for pipeline inspection. Yet optical images also face challenges such as blur, distortion, and scattering11.

Classical methods for optical image analysis—edge detection, morphological operations, and machine learning models such as SVMs and random forests12,13,14—depend heavily on edge information and require complex preprocessing. Their accuracy degrades in harsh underwater environments15. Deep learning offers significant advantages by automatically extracting robust features, reducing preprocessing, and improving accuracy16,17. Recent studies have applied CNNs and transformers to pipeline identification, with methods ranging from YOLO-based detectors and segmentation networks such as UNet and DeepLabv3+ to transformer-enhanced models. While these approaches achieve strong results on specific datasets, performance often drops under different underwater conditions, highlighting poor generalization. For pipeline identification, a transfer learning approach18 was applied to six neural networks: UNet, SegNet, DeepLabv3+(ResNet-18), DeepLabv3+(ResNet-34), DeepLabv3+(ResNet-50), and DeepLabv3+(ResNet-101). The best-performing network was DeepLabv3+(ResNet-101), which was initially trained on ImageNet and then fine-tuned on the PASCAL VOC2012 dataset. Using 11,463 underwater pipeline images, this network achieved a mean Intersection over Union (mIoU) of 99.12%. The performance of YOLOv5, YOLOv6, YOLOv7, and YOLOv8 was also evaluated for underwater pipeline identification19. Experiments were conducted on the same dataset of 3,021 images collected by an ROV. The results showed that YOLOv5 achieved the highest mAP of 97.10%, followed by YOLOv7 with 96.30%, YOLOv6 with 95.30%, and YOLOv8 with 95.10%. The effect of data augmentation and transfer learning techniques has been investigated in MobileNet, MobileNet-V2, Inception-V3, Xception, and Inception-ResNet-V2 networks for underwater cable image identification20. Following a comparative analysis of these models, MobileNet-V2 outperformed the others, achieving the highest accuracy of 93.50% while also requiring the lowest computational time.

Recently, transformers have been widely adopted for underwater image processing due to their ability to model long-range dependencies and capture complex features. These capabilities are particularly useful for underwater images, which often suffer from poor lighting and challenging environmental conditions such as sea fog or sand cover21,22. A method named TR–YOLOv5s was introduced in23, comprising preprocessing, down-sampling, automatic identification, and localization steps. Identification was performed using a transformer module combined with the YOLOv5s model, enhanced by an attention mechanism. Experiments demonstrated that this method achieved a mAP of 85.6%, representing a 12.5% improvement over YOLOv5s, with a mean test time of approximately 0.068 s. An underwater object identification algorithm based on Faster R-CNN was presented in24 to address challenges such as color offset, low contrast, and object blur. In this approach, the Swin-Transformer (Swin-T) served as the backbone, and deep and shallow feature maps were fused using a path aggregation network. Online hard example mining improved training efficiency, and replacing ROI pooling with ROI align enhanced identification accuracy, achieving a mAP of 80.54% on the URPC2018 dataset containing 5,543 images. The Swin-T25 generates a hierarchical feature representation using a shifted windowing process, which limits self-attention computation to non-overlapping local windows while allowing cross-window connections. This mechanism enables efficient extraction of local features while preserving global information. Pre-trained weights on ImageNet further improve the robustness of feature representations. Swin-T has been applied at various scales and demonstrates linear computational complexity with respect to image size. It has achieved state-of-the-art performance on COCO object detection and ADE20K semantic segmentation, significantly surpassing previous methods.

For underwater object segmentation, a method was proposed in26, using the Efficient Fish Segmentation Network (EFS-Net) and Multi-level Feature Accumulation-based Segmentation Network (MFAS-Net). EFS-Net employed convolutional layers in the early stages for optimal feature extraction, while MFAS-Net used feature refinement and transfer blocks to enhance low-level information and propagate it to deeper stages. Additionally, MFAS-Net applied multi-level feature accumulation to improve pixel-wise predictions for indistinct objects. The networks were evaluated on the DeepFish and SUIM datasets, achieving mIoUs of 76.42% and 92.0%, respectively. A simple scaling method was introduced in27 to allow users to scale baseline models to target resource constraints while maintaining efficiency. This method was applied to MobileNets and ResNet, and neural architecture search was used to design a baseline network that could be scaled into a family of models, called EfficientNets. EfficientNets achieved accuracies of 91.70% on CIFAR-100 and 98.80% on Flowers, demonstrating that mobile-sized models can be effectively scaled while surpassing state-of-the-art performance with significantly fewer parameters.

To address these limitations, we propose a hybrid segmentation model combining the Simplified Swin-Transformer (Swin-T) and a modified Efficient Fish Segmentation Network (EFS-Net). Swin-T leverages hierarchical self-attention with shifted windows, enabling simultaneous local feature extraction and global context preservation. The modified EFS-Net incorporates trainable layers (Strided-Conv, Tra-Conv) and EfficientNetB027 as the initial feature extractor, providing stable, multiscale representations even from imperfect data. This hybrid design enhances accuracy, robustness, and efficiency for underwater pipeline segmentation. An additional contribution of this work is the introduction of a large-scale and challenging underwater pipeline dataset, HOMOMO, consisting of 123,876 RGB images captured across 1.2 km of seabed pipelines under diverse conditions including sea fog, sea snow, low light, and complex occlusions. The main contributions of this paper are summarized as follows:

  1. A novel hybrid segmentation framework that combines a Simplified Swin-Transformer and a Modified EFS-Net via a three-head cross-attention fusion module. This design leverages global contextual and local spatial features simultaneously while maintaining lightweight computational cost.

  2. Introduction of the HOMOMO dataset, a large-scale and challenging underwater pipeline segmentation benchmark comprising over 120,000 expert-annotated images captured under diverse real-world conditions.

  3. Extensive experimental validation across three datasets, demonstrating superior accuracy, robustness, and generalization compared to state-of-the-art CNN and Transformer-based segmentation models.

  4. A practical contribution toward enabling efficient and reliable autonomous inspection of subsea pipelines, with potential applications in offshore oil and gas, renewable energy, and critical infrastructure monitoring.

The remainder of this paper is organized as follows: Section 2 reviews related work on deep learning-based object recognition. Section 3 describes the proposed method in detail. Section 4 presents experimental results, dataset descriptions, and comparisons with state-of-the-art approaches. Section 5 concludes the paper.

Related work

This section reviews related deep learning work on object recognition and explains why these architectures were not adopted in the proposed model, providing context for understanding the proposed method.

A soft-assignment color histogram strategy was introduced in28 to develop a differentiable underwater color disparity for underwater images. An underwater image enhancement framework was also developed based on visual–textual fusion. In addition, the discrete wavelet transform was employed to decompose images into low- and high-frequency components. Low-frequency color restoration was guided by the differentiable underwater color disparity, and high-frequency detail refinement was guided by detail-intensity regularization. This guided fusion ensured that enhanced images exhibited both natural color appearance and finely reconstructed textures. A non-deep-learning, histogram-based color compensation method was also introduced in29, which applied multiple attribute adjustment techniques, including max-min intensity stretching, luminance map-guided weighting, and high-frequency edge mask fusion. In addition, a multilayer information fusion and self-organized stitching method was introduced in30 for improving the clarity of underwater scenes; however, the approach struggled in turbid waters with severe blue-green attenuation. An end-to-end architecture31 based on AquaSketch-enhanced cross-scale information fusion was presented to address the underestimation of basic sketch features caused by underwater image distortion. The architecture used a top-down dual-branch pyramid for cross-scale information fusion to overcome the insufficient integration of multiscale representations of underwater objects. An adaptive attenuated channel compensation approach was developed in32 based on optimal channel pre-correction and a salient absorption map-guided fusion method to eliminate color deviations in the RGB color space. It then used an algorithm to enhance the contrast of the L channel and an adaptive color distribution specification method to improve contrast and match the color distribution in the Lab color space. Finally, an edge-enhanced mask fusion technique was applied to correct blurry details. This non-deep-learning approach improved underwater images to be as colorful as natural images. A plot classification network, named S2G-GCN, was presented in33, integrating spectrum-to-graph modeling and a graph convolutional network (GCN). First, the constant false alarm rate detection algorithm was applied to R-D spectra to capture potential target plots. For each plot, an echo energy diffusion region was built to include several resolution cells around its spectral peak. Then, these cells were modeled as a graph, where each node corresponded to a cell and edges were defined by spatial proximity and energy similarity between neighboring nodes. Finally, a GCN-based classifier was employed to learn discriminative features from the constructed graph and classified each detected plot as true target, sea clutter, ground clutter, or noise. In the proposed method, no initial processing is performed on the input image; instead, the architecture itself identifies pipelines in the image, even if it is completely noisy and of low quality.

An underwater salient instance segmentation architecture was introduced in34 based on the Segment Anything Model (SAM) for the underwater domain. The architecture used an underwater-adaptive ViT encoder to incorporate underwater-domain visual prompts into the segmentation network. An out-of-the-box underwater salient feature prompter generator (SFPG) was also designed to generate salient prompters instead of explicitly providing foreground points or boxes as prompts in SAM. A WaterMask was designed in35 for underwater image instance segmentation. First, a differences-similarity graph attention module was devised to recover detailed information lost to image quality degradation and down-sampling. Then, a multi-level feature refinement module was presented to predict foreground and boundary masks separately using features at different scales, and to guide the network via a boundary-mask strategy with a boundary-learning loss to yield finer predictions. Our model achieves more efficient segmentation performance than the methods in34 and 35.

A geometric mapping framework was presented in36 to address multiple matches in cross-modal retrieval. Two rectangular matching strategies, P2RM and R2RM, were developed: P2RM treated all retrieved candidates as rectangles with zero volume and the query as a box, while R2RM encoded all heterogeneous data into rectangles. Both strategies can be employed to improve the retrieval performance of baselines using off-the-shelf approaches. An evidence-based multi-feature fusion model was introduced in37 to prevent DNNs from being deceived by contaminated features from a single block view. First, the model introduced an evidential deep learning approach to produce a reasonable uncertainty estimate for features from different blocks within an architecture. Then, it integrated multi-block features at the evidence level using Dempster-Shafer theory for trusted prediction. A geometric representation was presented in38 to estimate the semantics of heterogeneous data via sector embedding. The input data (image/text) was projected onto a sector, with its symmetric axis representing mean semantics and its aperture estimating uncertainty. A sector matching loss was also introduced to encourage candidates to be contained within the apertures of a query sector. An approach named ACMR was presented in39 to learn both discriminative and modality-invariant representations for cross-modal retrieval. ACMR employed two processes: a feature projector that generated modality-invariant and discriminative representations, and a modality classifier that detected the modality of an item given an unknown feature representation. Triplet constraints were also introduced to ensure that the cross-modal semantic data structure was well preserved when projected into a common subspace. Our model does not apply any retrieval techniques for underwater pipeline recognition.

A model was presented in40 for collision-free intelligent vehicle navigation to avoid obstacles using deep reinforcement learning. The model utilized multimodal perception to achieve reliable online interaction with the surroundings and used transfer learning to deploy policies learned in a virtual environment in real-world environments. Because the model integrated camera, Lidar, and inertial measurement unit data, a series of cross-domain self-attention layers was applied. The ASHT-KD teacher-student approach was presented in41 for all-day mobile visual place recognition. The framework learned the all-day place recognizer through knowledge transfer from several teachers to a limited number of students, depending on the environment's complexity. The teacher network was a two-level sampling ViT pipeline, while the Siamese student network was a lightweight pipeline consisting of only a one-level down-sampling ViT. The model has drawbacks, including high computational complexity, strong dependence on the quality of the teacher model, complex hyper-parameter tuning, and the risk of information loss. An end-to-end trainable dark-enhanced Net was presented in42 to alleviate the impact of poor illumination and environmental noise for mobile place recognition. First, a lightweight dark enhancement module was trained to improve image illumination quality. Then, a dual-level sampling pyramid transformer was constructed to extract discriminative features by aggregating descriptors. Moreover, a re-ranking method based on the cross-entropy loss was used for final place matching. The objectives and applications of the methods in40,41,42 and of our method are distinct. Applying transfer learning through a teacher-student approach is a direction of our future research on underwater pipeline recognition.

A network was introduced in43 for landslide extraction that leveraged the characteristics of context association. A two-branch multiscale context feature extraction module captured contextual relationships across different scales through an attention mechanism while concurrently extracting context information within the same scale. A supervised classifier was also presented to refine the prediction accuracy. The model's limitations are its accuracy in landslide delineation and its adaptability across diverse scenarios, both of which require further improvement. A multifaceted collaborative salient object detector was presented in44 based on the Transformer architecture for optical remote sensing images, incorporating aspects of localization, balance, and affinity. The network focused on locating targets, balancing detailed features, and modeling global image-level context. The global distribution affinity learning module leveraged deep features to construct image-level global context association graphs through explicit affinity learning, recognizing global patterns within images. This module also fostered a more cohesive expression of multilayer features by applying deep supervision within the decoding layer. A cross-view intelligent person search method was presented in45 based on multi-feature constraints. First, a global-local context-aware module was established to extract differential personnel features. Second, a semantic complementarity and feature aggregation module was constructed to address personnel-scale feature constraints across different contexts. Third, the method was constrained to use only person spatial, person identity, and detection confidence features to improve person search accuracy. Our EFSNET-Swin-T segmentation model achieves high accuracy on underwater pipeline images. The objectives and applications of the methods in43,44,45 and of our method are distinct.

The Mamba architecture was explored in46 for remote sensing change detection tasks. Three architectures, MambaBCD, MambaSCD, and MambaBDA, were developed for binary change detection, semantic change detection, and building damage assessment, respectively. The encoders across all three architectures used the cutting-edge Visual Mamba architecture, enabling full learning of global spatial contextual information from input images. For the change decoder, three spatial-temporal relationship modeling mechanisms were introduced that leveraged their attributes to capture the spatial-temporal interactions among multi-temporal features, thereby obtaining accurate change information. A Mamba-Convolution network was presented in47 for underwater image enhancement, which inherited the global dependency modeling of the Mamba architecture and the local dependency modeling of the convolution architecture to improve enhancement performance. To capture global and local dependencies in image features, a Mamba-convolution hybrid block was introduced, integrating the global modeling capability of Mamba blocks with the local modeling capability of the CNN-based feature attention module. Moreover, a cross-fusion Mamba block was presented to fuse the image feature maps of encoder-decoder layers at different levels. A multi-label method was presented in48 that used an ensemble learning approach to detect coral reef conditions and extract ecological information. The method's architecture combined Swin-Transformer-Small, Swin-Transformer-Base, and EfficientNet-B7. The method classified coral reef conditions as healthy, compromised, dead, and rubble. It also identified corresponding stressors, including competition, disease, predation, and physical issues. Our method modifies EfficientNet-B0 and combines it with the simplified Swin-T for underwater pipeline recognition, achieving a mAP of 98.98%.

Proposed method

We propose a hybrid framework for semantic segmentation of underwater pipelines that integrates two complementary architectures: the Simplified Swin-Transformer (Swin-T) and a Modified Efficient Fish Segmentation Network (EFS-Net). The overall flow of the method is illustrated in Fig. 1, and its major components are described below.

Fig. 1. The flowchart of the proposed method.

Input image

Each input is an underwater RGB image of 1080 × 1920 pixels. To reduce computational cost, images are resized to 256 × 256 before processing. This size was selected after numerous experiments on underwater pipeline datasets to achieve the highest recognition accuracy: reducing the resolution below 256 × 256 removed edge and texture details in underwater pipeline images, especially those covered by sand and underwater plants, and degraded the method's performance. Therefore, 256 × 256 is the best size for our experiments.
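As an illustration, a minimal preprocessing sketch is given below (PyTorch/torchvision is an assumption; the paper does not name its implementation framework):

```python
# Minimal preprocessing sketch, assuming PyTorch/torchvision.
# Resizes a 1080 x 1920 underwater RGB frame to the experimentally
# selected 256 x 256 input size.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                         # HWC uint8 -> CHW float in [0, 1]
    transforms.Resize((256, 256), antialias=True),
])
```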

Modified EFS-Net

The architecture of the Modified EFS-Net is shown in Fig. 2. Unlike the original EFS-Net, which uses five convolutional layers for feature extraction, we replace them with a simplified EfficientNetB0 to capture multiscale features while preserving spatial details. The original EfficientNetB0 consists of seven MBConv (Mobile Inverted Bottleneck Convolution) blocks, but only the first four are retained to emphasize low-level features such as edges and textures, which are crucial for pipeline segmentation. Features extracted by this backbone are then processed through Strided-Conv layers and subsequently passed through the down-sampling and up-sampling modules.
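A minimal sketch of this truncation follows, assuming torchvision's EfficientNet-B0 (where `features[0]` is the stem and `features[1..7]` are the seven MBConv stages); the paper's modified variant may differ in stride configuration, since it reports 80-channel features at 32 × 32:

```python
# Sketch: keep EfficientNet-B0's stem plus its first four MBConv stages.
# The torchvision stage indexing is an assumption about how the paper's
# "first four blocks" map onto this implementation.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

backbone = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
truncated = nn.Sequential(*list(backbone.features.children())[:5])

feats = truncated(torch.randn(1, 3, 256, 256))
print(feats.shape)  # (1, 80, 16, 16) with default strides; the paper reports
                    # 80 channels at 32 x 32, implying an adjusted stride.
```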

Fig. 2. The structure of the Modified EFS-Net.

MBConv

Each MBConv block consists of depthwise separable convolutions, linear bottlenecks, and inverted residuals49. This design significantly reduces parameters and computational cost while maintaining feature extraction accuracy.
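A minimal MBConv sketch is shown below (the squeeze-and-excitation stage is omitted for brevity, and layer widths are illustrative):

```python
# Inverted residual (MBConv) sketch: 1x1 expansion -> 3x3 depthwise conv ->
# 1x1 linear bottleneck, with a skip connection when shapes allow.
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),           # pointwise expansion
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),              # depthwise conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False),          # linear bottleneck
            nn.BatchNorm2d(out_ch),                         # no activation (linear)
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```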

Down-sampling

The Down-Sampling block comprises Strided-Conv, ReLU-BN, and Conv layers across three stages. Unlike pooling, Strided-Conv preserves spatial information, enabling more accurate segmentation. At the final stage (“max-depth”), deeper semantic features are extracted, producing output feature maps of size 256 × 4 × 4.

Up-sampling

The Up-Sampling block mirrors the down-sampling process, employing Transposed Convolutions (Tra-Conv), ReLU-BN, and Conv layers over three stages. This reconstruction restores spatial resolution while avoiding the spatial loss typically introduced by un-pooling.
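The down- and up-sampling stages described above can be sketched as follows (channel widths are illustrative, not the paper's exact values):

```python
# One down-sampling stage (Strided-Conv -> BN/ReLU -> Conv) and its mirrored
# up-sampling stage (Tra-Conv -> BN/ReLU -> Conv); a sketch, not the exact
# three-stage configuration of the paper.
import torch.nn as nn

def down_stage(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),  # Strided-Conv halves H, W
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
    )

def up_stage(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2),    # Tra-Conv doubles H, W
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
    )
```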

Simplified Swin-T

The Simplified Swin-Transformer (Fig. 3) is derived from the original Swin-T (Swin-UperNet), but only the first three stages are retained for efficiency in underwater pipeline identification.

Fig. 3. The structure of the Simplified Swin-T.

The components of Simplified Swin-T are:

  • Patch Partitioning: The input image \(x \in R^{3 \times W \times H}\) is divided into 4 × 4 non-overlapping patches.

  • Linear Embedding: Each patch is embedded into a 96-channel feature representation.

  • Swin-Transformer Block: Using shifted window multi-head self-attention (SW-MSA)25, local windows (8 × 8) are processed efficiently, while window shifting allows inter-region information exchange.

  • Patch Merging: Features are hierarchically merged across scales, reducing spatial size while increasing channel depth for multiscale representation.

The schematic of patch merging is shown in Fig. 4, where W is the input width, H is the input length, C is the number of channels, and s is the stage number.
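A minimal patch-merging sketch in the Swin style is given below (each 2 × 2 patch neighborhood is concatenated to 4C channels, normalized, and linearly reduced to 2C):

```python
# Swin-style patch merging: halves W and H, doubles C.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]               # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]               # bottom-left
        x2 = x[:, 0::2, 1::2, :]               # top-right
        x3 = x[:, 1::2, 1::2, :]               # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```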

Fig. 4. The schematic of patch merging.

Feature fusion

Features from the two encoders are fused using a three-head cross-attention mechanism (Fig. 5). Cross-attention adaptively weights feature maps from each encoder using the trainable weight matrices \(W_Q\), \(W_K\), and \(W_V\), thereby emphasizing salient spatial and contextual cues. Compared to concatenation, this approach reduces computational complexity and avoids redundant information.

Fig. 5. The structure of the three-head cross-attention50 and CBAM.

The three-head cross-attention module takes two feature maps: one from the Modified EFS-Net (80 channels, 32 × 32) and the other from the Simplified Swin-T (384 channels, 16 × 16). After spatial alignment via Tra-Conv, both features are projected to 32 × 32 × 256. The three-head attention computes the attention weights by (1).

$$Attention=Softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right) \otimes V$$
(1)

where Q, K, and V are linear projections of the input features, \(d_k\) is the key dimension, and the symbol \(\otimes\) denotes element-wise multiplication with broadcasting. Each head specializes in a different aspect of pipeline recognition: local edges, segment relationships, and global structure. The output of the attention module (1024 × 85 × 3), after BN and ReLU, is concatenated with a feature of size (1024 × 255) and then projected to 32 × 32 × 256. This output is refined by a CBAM (Convolutional Block Attention Module) block, which consists of channel attention and spatial attention modules. The CBAM channel attention module generates Channel-Wise Weights (CWW) according to (2) to emphasize pipeline-relevant features.

$$CWW=Sigmoid\left( {MLP\left( {AvgPool\left( F \right)} \right)+\,MLP\left( {MaxPool\left( F \right)} \right)} \right)$$
(2)

where F is the final output of the three-head cross-attention after projection. Subsequently, the CBAM spatial attention module generates Spatial-Wise Weights (SWW) according to (3), producing a spatial mask that focuses computation on regions likely to contain pipelines.

$$\begin{gathered} F^{\prime}=CWW \otimes F \hfill \\ F^{\prime}_{cat}=Concat\left( AvgPool\left( F^{\prime} \right);\, MaxPool\left( F^{\prime} \right) \right) \hfill \\ SWW=Sigmoid\left( Conv_{7 \times 7}\left( F^{\prime}_{cat} \right) \right) \hfill \\ \end{gathered}$$
(3)

The final output of CBAM is \(SWW \otimes CWW \otimes F\).
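A minimal CBAM sketch following (2) and (3) is shown below (the reduction ratio of 16 is an assumption):

```python
# CBAM sketch: channel attention (CWW, Eq. 2) followed by spatial attention
# (SWW, Eq. 3); output = SWW (x) CWW (x) F.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared MLP of Eq. (2)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # Eq. (3)

    def forward(self, f):                          # f: (B, C, H, W)
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))         # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))          # MLP(MaxPool(F))
        cww = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = cww * f                               # F' = CWW (x) F
        cat = torch.cat([f1.mean(dim=1, keepdim=True),
                         f1.amax(dim=1, keepdim=True)], dim=1)
        sww = torch.sigmoid(self.conv7(cat))       # SWW
        return sww * f1                            # SWW (x) CWW (x) F
```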

The inclusion of CBAM further enhances spatial and channel attention. It suppresses noise and stabilizes training. We observed experimentally that using three attention heads yielded the best trade-off between accuracy and complexity for underwater pipeline segmentation.

When comparing the cross-attention feature fusion method with other commonly used feature fusion methods, several advantages become apparent. First, cross-attention is performed adaptively and dynamically, based on the importance of the features extracted by each network. Consequently, more attention is given to the most important features and, therefore, to the most relevant parts of the image. Second, cross-attention can better model the relationships between features, allowing for more accurate identification of objects in complex images with multiple features and intricate relationships. Third, cross-attention has lower computational complexity than the concatenation method due to its use of the attention mechanism and dynamic weighting for feature fusion. However, calculating attention scores requires additional time during training, which can be problematic, especially for large datasets and complex networks. In contrast, the concatenation fusion method combines features without considering their relative importance, increasing the dimensionality of the feature vector, which can lead to higher computational complexity and the inclusion of redundant information. In transformer-based fusion methods51, both cross-attention and self-attention are employed to fuse features of different types, such as images and text from different inputs. Since the proposed method uses only image inputs, there is no need for the additional complexity of self-attention alongside cross-attention.
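To make the fusion concrete, a minimal sketch of the three-head cross-attention of (1) follows. The per-head dimension of 85 matches the 1024 × 85 × 3 output shape quoted above; note the sketch uses the standard matrix product with V, whereas the text defines \(\otimes\) as element-wise multiplication with broadcasting, which would be a drop-in change:

```python
# Three-head cross-attention sketch: Q from the Modified EFS-Net branch,
# K/V from the Simplified Swin-T branch (both aligned to 32 x 32 x 256).
import torch
import torch.nn as nn

class ThreeHeadCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=3, head_dim=85):
        super().__init__()
        inner = heads * head_dim                      # 3 * 85 = 255
        self.heads, self.head_dim = heads, head_dim
        self.wq = nn.Linear(dim, inner, bias=False)   # W_Q
        self.wk = nn.Linear(dim, inner, bias=False)   # W_K
        self.wv = nn.Linear(dim, inner, bias=False)   # W_V

    def forward(self, q_feat, kv_feat):               # both: (B, N, dim), N = 32 * 32
        b, n, _ = q_feat.shape
        def split(t):                                 # (B, N, 255) -> (B, 3, N, 85)
            return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        q = split(self.wq(q_feat))
        k = split(self.wk(kv_feat))
        v = split(self.wv(kv_feat))
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                # weighted sum over V
        return out.transpose(1, 2).reshape(b, n, -1)  # (B, N, 255)
```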

Decoder

The decoder (Fig. 6) reconstructs the segmentation mask through four steps:

  1. A 3 × 3 Conv layer (128 filters) with ReLU and BN processes the fused features.

  2. Up-sampling quadruples the feature dimensions; a 3 × 3 Conv (64 filters) reduces the channel count.

  3. Another up-sampling restores the input resolution with a 3 × 3 Conv (32 filters).

  4. Finally, a 1 × 1 Conv with sigmoid activation generates the binary segmentation mask.

Fig. 6. The structure of the decoder26.

The decoder architecture employs a designed progressive up-sampling strategy that optimally balances computational efficiency with reconstruction accuracy. The gradual channel reduction (256→128→64→32) following each up-sampling operation ensures efficient memory usage while preserving essential features for pipeline segmentation. Sequential 3 × 3 convolutions with BN provide sufficient receptive field and training stability, while the final 1 × 1 convolution with sigmoid activation directly produces the binary segmentation mask. This design is particularly advantageous for detecting thin and partially occluded pipelines, as the staged reconstruction process minimizes information loss and maintains boundary precision throughout the decoding stages.
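A sketch of this decoder is given below (bilinear up-sampling is an assumption; the fused feature map enters at 32 × 32 × 256 and exits at the 256 × 256 input resolution):

```python
# Decoder sketch: progressive up-sampling with channel reduction
# 256 -> 128 -> 64 -> 32, then a 1x1 conv + sigmoid binary mask.
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    conv_bn_relu(256, 128),                        # step 1: 3x3 conv, 128 filters
    nn.Upsample(scale_factor=4, mode='bilinear'),  # step 2: quadruple H, W
    conv_bn_relu(128, 64),
    nn.Upsample(scale_factor=2, mode='bilinear'),  # step 3: restore 256 x 256
    conv_bn_relu(64, 32),
    nn.Conv2d(32, 1, kernel_size=1),               # step 4: 1x1 conv
    nn.Sigmoid(),                                  # binary segmentation mask
)
```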

Table 1 compares the main characteristics of the Simplified Swin-T and Modified EFS-Net feature extractors. By combining global contextual reasoning (Swin-T) with efficient local feature extraction (EFS-Net), the proposed hybrid design achieves accurate, robust, and computationally efficient segmentation of underwater pipelines.

Table 1 Characteristics of two feature extractors of simplified Swin-T and modified EFS-Net.

Overall, our contribution is not merely an improved Swin-T but a novel fusion paradigm for underwater pipeline recognition, consisting of a three-head cross-attention fusion and a sequential feature refinement pipeline. In the three-head cross-attention fusion:

  • Query: comes from the Modified EFS-Net, which leverages local spatial features for attention guidance.

  • Key/Value: come from the Simplified Swin-T, which provides global contextual information.

  • Three-head cross-attention: enables multi-scale feature interaction across different representation sub-spaces.

In addition, our sequential feature refinement pipeline represents a deliberate design choice where:

  • Cross-attention enables dynamic feature selection.

  • Concatenation preserves maximum information from both branches.

  • CBAM provides adaptive spatial-channel refinement of combined features, which is particularly effective for challenging underwater conditions.

  • Decoder reconstructs precise segmentation masks.

Therefore, our key conceptual contribution lies in the purpose-built design for underwater pipeline recognition, which presents unique challenges that generic segmentation models fail to address effectively. We intentionally leverage the complementary strengths of each component in the proposed architecture:

  • Modified EFS-Net’s strength: Preserves spatial details and edge information critical for pipeline boundary detection.

  • Simplified Swin-T’s strength: Provides global contextual understanding for pipeline trajectory prediction.

  • Fusion innovation of the hybrid architecture: The cross-attention mechanism intelligently balances these aspects specifically for linear structure detection.

The Simplified Swin-T configuration is not a mere parameter reduction. Our analysis revealed that for linear structure detection:

  • Deep abstraction in Stage 4 of Swin-T actually harms linear feature preservation.

  • Early-stage Swin features (Stages 1–3) provide optimal balance of context and spatial resolution.

  • This configuration represents domain-aware architectural optimization, not simplification.

Fig. 7. The complete architecture diagram (Input → Dual Encoder → Three-Head Cross-Attention → Decoder).

While individual components exist separately, our integration methodology introduces:

  • Asymmetric feature alignment: EFS-Net features (preserving pipe edges) guide Swin-T feature selection.

  • Progressive context integration: From local pipe textures to global pipeline networks.

  • Robustness to underwater degradation: Specifically designed to handle turbidity and low visibility.

The proposed architecture addresses pipeline-specific challenges:

  • Continuous structure maintenance across long distances.

  • Occlusion resilience from marine fouling.

  • Scale invariance for pipes of varying diameters.

  • Orientation awareness for pipeline route analysis.

As a result, the proper design of these components in the proposed model and the proper adjustment of their inputs and outputs establish the high performance of the proposed method for the specific task of underwater pipeline recognition on the comprehensive and challenging HOMOMO dataset. The complete architecture diagram of the proposed model (Input → Dual Encoder → Three-head cross-attention → Decoder) is shown in Fig. 7.

Results, evaluation, and comparison

The proposed method was implemented on an NVIDIA A100 GPU (40 GB HBM2, 1,555 GB/s memory bandwidth). Experiments were conducted on three datasets.

Image datasets

HOMOMO: A custom dataset of 123,876 RGB images (1920 × 1080, resized to 256 × 256) captured along 1.2 km of seabed pipelines under challenging conditions (sea fog, sea snow, sand, vegetation), recorded by a professional diver to obtain real underwater pipeline videos. Data were split into training (60%), validation (20%), and testing (20%). Training data were augmented (rotation, scaling, color inversion, Gaussian noise), producing 187,000 images resized to 256 × 256. Images were labeled into two classes: pipeline and non-pipeline.
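A sketch of such an augmentation pipeline is given below (parameter ranges are assumptions; for segmentation, the geometric transforms must also be applied jointly to the mask, which is omitted here):

```python
# Training-time augmentation sketch: rotation, scaling, color inversion,
# Gaussian noise, then resize to 256 x 256. Ranges are illustrative.
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.02):
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),  # random scaling
    transforms.RandomInvert(p=0.3),                        # color inversion
    transforms.Lambda(add_gaussian_noise),
    transforms.Resize((256, 256), antialias=True),
])
```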

Roboflow: 5,980 simulated RGB images (640 × 640, resized to 256 × 256) containing synthetic underwater pipelines with varying lighting and reflection conditions. Available at: https://universe.roboflow.com/underwaterpipes/underwater_pipes_orginal_pictures.

YouTube: 28,622 real RGB images (720 × 1280, resized to 256 × 256) with fog, buried pipelines, and vegetation, gathered from YouTube online sources.

Experimental setup

The proposed model was trained with a fused loss function combining Cross Entropy and Dice Loss52, optimized using Adam (\(l_r = 0.0001\), batch size = 16, 100 epochs). Five-fold cross-validation was applied. Transfer learning with ImageNet pre-trained weights was used, followed by fine-tuning on HOMOMO. To evaluate generalizability, the fine-tuned model was tested directly on Roboflow and YouTube without retraining. The hyper-parameters of the proposed model are shown in Table 2.
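A minimal sketch of the fused loss is shown below (binary sigmoid outputs are assumed, matching the decoder's one-channel mask):

```python
# Fused loss sketch: binary cross-entropy + Dice loss, optimized with Adam
# (lr = 1e-4, batch = 16, 100 epochs, per the text).
import torch
import torch.nn as nn

bce = nn.BCELoss()

def dice_loss(pred, target, eps=1.0):
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def fused_loss(pred, target):          # pred, target: (B, 1, H, W) in [0, 1]
    return bce(pred, target) + dice_loss(pred, target)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```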

Table 2 Hyper-parameters of the proposed method.

Evaluation metrics

Performance was assessed using standard metrics: mean Intersection over Union (mIoU), Accuracy, Precision, Recall, Specificity, F-score, and F-boundary, defined in (4) to (10)3,26. The F-boundary metric was computed with a threshold of 2 pixels, determined experimentally; this is the maximum acceptable distance between a predicted boundary pixel and its corresponding ground-truth boundary pixel for the prediction to count as a true positive.

$$IoU=\frac{T_P}{T_P+F_N+F_P}$$
(4)
$$Accuracy=\frac{T_P+T_N}{T_P+T_N+F_P+F_N}$$
(5)
$$Precision=\frac{T_P}{T_P+F_P}$$
(6)
$$Recall=\frac{T_P}{T_P+F_N}$$
(7)
$$Specificity=\frac{T_N}{T_N+F_P}$$
(8)
$$F\text{-}score=\frac{2 \times Precision \times Recall}{Precision+Recall}$$
(9)
$$F\text{-}boundary=\frac{2 \times Precision_b \times Recall_b}{Precision_b+Recall_b}$$
(10)

where \(T_P\) is the number of correctly identified pipeline pixels, \(T_N\) is the number of correctly identified non-pipeline pixels, \(F_N\) is the number of pipeline pixels that are missed, and \(F_P\) is the number of pixels incorrectly identified as pipeline. For the computation of \(Recall_b\) and \(Precision_b\), the same counts are taken over edge pixels: \(T_P\) is the number of correctly identified pipeline-edge pixels, \(T_N\) is the number of correctly identified non-pipeline-edge pixels, \(F_N\) is the number of pipeline-edge pixels that are missed, and \(F_P\) is the number of pixels incorrectly identified as pipeline edges.
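For reference, a sketch of the pixel-wise metrics (4) to (9) computed from binary masks follows (zero-division guards and the boundary extraction with 2-pixel tolerance needed for F-boundary are omitted):

```python
# Pixel-wise segmentation metrics from boolean masks (pipeline = True).
import torch

def segmentation_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    tp = (pred & gt).sum().item()       # pipeline pixels correctly identified
    tn = (~pred & ~gt).sum().item()     # non-pipeline pixels correctly identified
    fp = (pred & ~gt).sum().item()      # pixels wrongly labeled as pipeline
    fn = (~pred & gt).sum().item()      # pipeline pixels missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "iou": tp / (tp + fn + fp),                    # Eq. (4)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. (5)
        "precision": precision,                        # Eq. (6)
        "recall": recall,                              # Eq. (7)
        "specificity": tn / (tn + fp),                 # Eq. (8)
        "f_score": 2 * precision * recall / (precision + recall),  # Eq. (9)
    }
```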

Ablation and simulation results

The results of ablation experiments on the components of the proposed model are shown in Table 3. These results clearly demonstrate the additive effect of each component of the proposed architecture in improving underwater pipeline segmentation. At the baseline level, the single models Swin-T with one head (87.5% mIoU) and EFS-Net with one head (85.2% mIoU) each demonstrated their complementary capabilities in understanding global context and extracting spatial details. Integrating these two architectures in the hybrid model led to a significant jump to 92.8% mIoU, indicating the synergistic effect of combining the two approaches. Introducing the three-head attention mechanism then improved performance to 96.3% mIoU, confirming the importance of simultaneously processing different scales in pipeline detection. Finally, adding the CBAM module in the full model yielded a final accuracy of 98.44% mIoU and 82.01% F-boundary, indicating the vital role of this module in boundary refinement. This steady improvement across all metrics (precision from 85.8% to 98.98%, recall from 85.5% to 98.52%, and F-score from 85.6% to 98.74%) justifies the hierarchical design of the proposed architecture and the contribution of each added component in achieving near-perfect accuracy.

Table 3 Ablation experiments results of different components of the proposed model.

The proposed method is trained and tested on the HOMOMO dataset, but only tested on the Roboflow and YouTube datasets. These experimental results, including their standard deviations, are shown in Table 4. The results confirm the generalizability of the proposed method and its ability to recognize underwater pipelines in unseen data.

Table 4 Results of underwater pipelines recognition using the proposed method in HOMOMO, Roboflow, and YouTube including their standard deviations.

Table 5 compares the proposed hybrid architecture with the original EFS-Net, Modified EFS-Net, original Swin-T, and Simplified Swin-T in terms of complexity, speed, and recognition, showing that the proposed method performs best overall owing to the combination of well-suited components: the Modified EFS-Net and the Simplified Swin-T.

Table 5 Comparison of the proposed hybrid architecture with original EFS-Net, modified EFS-Net, original Swin-T (Swin-UperNet), and simplified Swin-T in complexity, speed, and recognition.

As shown, the proposed architecture achieved the best performance across all metrics; e.g., mIoU improved by 11.73% over the original Swin-T, 21.32% over the original EFS-Net, 16.63% over the Simplified Swin-T, and 18.52% over the Modified EFS-Net. Testing time was slightly higher than that of the original EFS-Net (+1–2 ms) but lower than that of the original Swin-T (–10 ms) and the Simplified Swin-T (–6.5 ms), representing an excellent accuracy–efficiency balance. In addition, the complexity of the proposed architecture in terms of parameter count, GFLOPs, memory usage, and frames per second (FPS) confirms that the model provides an excellent trade-off between computational efficiency and segmentation accuracy, making the proposed method both theoretically sound and practically viable.

Generalization results

The top rows of Table 6 report the proposed method's performance on HOMOMO, Roboflow, and YouTube. Despite being trained only on HOMOMO, the method generalized well to unseen datasets, maintaining high segmentation accuracy under varying conditions (e.g., pipelines occluded by sand or vegetation, low light, fog).

Table 6 Comparison of the proposed method with state-of-the-art methods in HOMOMO, Roboflow, and YouTube datasets.

Visual examples in Fig. 8 illustrate accurate detection even under challenges such as pipelines hidden by seabed sand, sea fog, sea snow, and limited light. The maximum observed error was 18.71%, calculated as the ratio of misclassified pixels to ground-truth pipeline pixels.

Fig. 8. Visual examples of the proposed method on HOMOMO, Roboflow, and YouTube.

Comparison with state-of-the-art methods

For fair benchmarking, U-Net, DeepLabV3 (ResNet101), SwinUNet, Mask2Former, YOLOv5, TransUNet, YOLOv11, and YOLOv12 were re-implemented under the same setup. Results (Table 6) show that the proposed method consistently outperformed all baselines in mIoU, Accuracy, Precision, Recall, F-score, and F-boundary. While YOLOv5 achieved lower inference time, its accuracy lagged behind; the small speed gap is negligible compared to the proposed model's substantial accuracy gains. Notably, the proposed method surpassed SwinUNet in both segmentation and boundary detection (F-boundary). Performance improvements over TransUNet highlight strong adaptability to diverse datasets. Figures 9 and 10 further illustrate performance gains over state-of-the-art models, confirming that the proposed hybrid approach provides both robust segmentation accuracy and generalization to unseen data.

Fig. 9. Comparison of the proposed method's performance with state-of-the-art methods on HOMOMO.

Fig. 10. Performance improvements of the proposed method with respect to state-of-the-art methods on the HOMOMO dataset.

Conclusion

Accurate identification of underwater pipelines is critical for the safety and reliability of marine infrastructure, yet existing models often fail under varying environmental conditions. This paper proposed a hybrid segmentation framework that integrates a Simplified Swin-Transformer with a Modified EFS-Net, fused through a cross-attention module. The design leverages the Swin-Transformer's ability to capture long-range dependencies and EfficientNetB0's strength in extracting local features, while maintaining a lightweight structure for computational efficiency. Extensive experiments demonstrated the superiority of the proposed method. On the challenging HOMOMO dataset, the model achieved a mIoU of 98.44% and a mean F-boundary of 82.01%, and consistently outperformed state-of-the-art approaches including U-Net, DeepLabV3+ (ResNet101), Swin-UNet, TransUNet, Mask2Former, YOLOv5, YOLOv11, and YOLOv12. Importantly, the method generalized effectively to unseen datasets (Roboflow and YouTube), where it maintained strong accuracy despite variations such as sea fog, sand occlusion, and vegetation. Compared to baseline models, the proposed network achieved substantial improvements in segmentation metrics while retaining competitive inference speed. These results establish the proposed framework as a practical and robust solution for real-world underwater inspection. By effectively balancing accuracy, generalization, and efficiency, it offers a new paradigm for underwater visual perception.

While the proposed method demonstrates state-of-the-art performance in underwater pipeline segmentation, it may experience reduced sensitivity for pipelines narrower than 2 pixels due to information loss during feature down-sampling, though this represents a fundamental trade-off between computational efficiency and spatial resolution common across deep learning approaches. Similarly, completely buried pipelines or those heavily obscured by marine growth pose significant challenges, as our vision-based method relies on visual cues, a limitation shared by all optical imaging techniques in turbid environments. Furthermore, our evaluation focused exclusively on computer vision-based methods to maintain a controlled comparison within the scope of this research, acknowledging that multi-sensor approaches incorporating sonar or Lidar could provide complementary advantages in scenarios where optical visibility is severely compromised. These limitations, however, highlight valuable directions for future work rather than diminish the substantive advances achieved by the proposed method in optical pipeline recognition. Future work will focus on extending the framework to multi-class segmentation tasks, real-time deployment on embedded hardware, and integration into autonomous inspection systems.