Heat exchanger pipe orifice recognition using an improved YOLOv5 model integrated with dual attention mechanisms

Yang, Ruijun; Zhu, Yining; Xu, Bin; Dai, Wenqi; Ji, Shoucheng; Meng, Dejian

doi:10.1038/s41598-025-34704-x

Article
Open access
Published: 10 January 2026

Ruijun Yang¹,
Yining Zhu¹,
Bin Xu¹,
Wenqi Dai¹,
Shoucheng Ji¹ &
…
Dejian Meng¹

Scientific Reports volume 16, Article number: 4508 (2026)

Abstract

Heat exchangers are important equipment in the industrial sector, and the technology for cleaning the fouling on their densely packed tube ports is rapidly evolving towards automation, efficiency, and intelligence. To address the issues of missed detection, false detection, and repeated detection in existing visual algorithm models for tube port recognition, an improved YOLOv5 model integrating a dual attention mechanism, called DANet-YOLOv5 (Double Attention Net), is proposed. This model incorporates dual attention mechanism components into the backbone network, using second-order attention pooling, feature adaptive distribution, and efficient aggregation and propagation of global features. This allows the network to more comprehensively utilize both global and local information from the image, optimizing the sensitivity and accuracy of heat exchanger tube port image recognition. Experimental results demonstrate that this algorithm model outperforms existing small-object detection improvement algorithms, single attention mechanism improvement algorithms, and the mainstream YOLOv8 algorithm in key metrics such as error rate, recall rate, and mAP value, significantly enhancing the accuracy and stability of tube port detection and laying the foundation for the application of automated cleaning robots.

Introduction

With the development of the economy, the heat exchanger industry now spans nearly 30 different fields, and heat exchangers of various structural types have developed rapidly. As crucial equipment in industrial applications (Fig. 1), heat exchangers pose significant technical challenges when deposits must be cleaned from the inner walls of their dense pipe clusters. Traditional manual cleaning is inefficient and costly, and it also poses substantial safety risks to cleaning personnel¹. Increasingly stringent national requirements for the safety and efficiency of heat exchanger cleaning have driven cleaning technologies toward automation, high efficiency, and intelligence². To improve the cleaning efficiency of dense pipe clusters while avoiding damage to the heat exchange pipes, a robotic arm solution has been proposed in which cleaning tools mounted on the end effector are inserted sequentially into the pipe orifices for rapid automated cleaning. The primary technical challenge lies in precisely recognizing and positioning the dense pipe orifices and then obtaining the center coordinates of these openings, which requires target recognition and detection algorithms. Consequently, vision-based recognition technology has enabled the widespread adoption of cleaning robots for heat exchangers³,⁴.

In the research of heat exchanger pipe orifice recognition technology, scholars both domestically and internationally have proposed various algorithmic solutions tailored to different application scenarios. Traditional methods primarily rely on image processing techniques. For example, Chen Yuanqing et al.⁵ introduced an improved Hough transform algorithm combining partitioned region scanning with known radius constraints. By segmenting pipe orifice regions through row-by-row and column-by-column scanning and utilizing known pipe diameter parameters to narrow the parameter space, this approach achieved over 30% higher recognition efficiency compared to conventional Hough transforms. Practical tests demonstrated a pipe orifice recognition rate of 98.3% with positioning errors controlled within 1.4%, effectively addressing the challenge of rapid and precise localization in dense pipe clusters. However, issues such as high computational load and parameter sensitivity remain. Liu Chang⁶ leveraged pipe orifice arrangement patterns for center positioning, though its application is limited by the requirement for regular pipe bundle distributions. With the advancement of deep learning, target detection algorithms based on convolutional neural networks (CNNs) have gradually emerged as a mainstream research direction.

In domestic research, Wang Biao’s team⁷ pioneered the application of an improved YOLOv3 algorithm for heat exchanger pipe orifice detection. By introducing a lightweight MobileNet v2 backbone network, optimizing anchor box clustering, and integrating multi-level feature fusion with non-maximum suppression, they achieved high-precision detection of small targets (96.89% average precision, 0.014 s per image). Spatial localization was accomplished through camera calibration and dynamic path planning (1.35 mm error, 0.18 s for the full process), though the impact of complex lighting and occlusion was not thoroughly analyzed. Dai Fengyan et al.⁸ further proposed the HDS-UNet model, optimized with multi-scale hybrid dilated convolutions and depthwise separable convolutions, which demonstrated exceptional performance in segmenting welded and corroded regions (Dice coefficient: 90.89%, 7.8% improvement over U-Net). The prediction time was reduced to 2 min and 29 s, achieving nearly fivefold acceleration. Xiao Lunhui et al.⁹ developed a binocular stereo vision system and enhanced the Faster-RCNN algorithm by integrating DetNet-59 and CBAM attention mechanisms for pipe orifice detection. Combined with disparity calculation, they achieved 3D localization (error: ±10 mm), validating the feasibility of replacing manual labor in cleaning operations.

In international research, Buchanan¹⁰ developed a semi-automatic guided positioning system that achieved preliminary automation but still required manual intervention for pipe orifice recognition. John et al.¹¹ designed a triangular bracket positioning device, which improved stability through mechanical structural optimization, yet failed to overcome bottlenecks in visual recognition technology. In recent years, Carion's team¹² has attempted to integrate the Transformer architecture into pipe orifice detection, though its real-time performance remains inferior to that of YOLO-series algorithms. Notably, the second-order attention pooling technique proposed by Krizhevsky et al.¹³ offers a novel approach for dense orifice feature extraction. By leveraging global feature aggregation and adaptive allocation mechanisms, it significantly enhances small target detection performance.

From the perspective of technological evolution trends, current research exhibits the following characteristics: (1) Lightweight improvements have become mainstream, with the application of technologies like MobileNet and depthwise separable convolutions making algorithms better suited for industrial deployment requirements; (2) Innovations in attention mechanisms and feature fusion techniques are continuously breaking through bottlenecks in dense target detection; (3) Multi-sensor fusion solutions are gradually gaining traction, where collaborative optimization of visual localization and motion control has become a critical aspect of system integration. However, existing studies still have room for improvement in areas such as generalization capabilities under complex working conditions and multi-scale pipe orifice synchronous detection accuracy, which highlights directions for future research.

Therefore, leveraging the advantages of YOLOv5s—such as its small number of network parameters and stable framework—this paper proposes DANet-YOLOv5 (Double Attention Net-YOLOv5), a detection model specifically designed for dense pipe orifices, i.e., single dense targets. Based on images captured in real-world scenarios, a pipe orifice dataset was constructed through data augmentation and used to validate the effectiveness of the proposed model in pipe orifice recognition.

The basic framework of the YOLOv5 model

YOLOv5 incorporates the strengths of numerous algorithms, effectively balancing detection accuracy and speed to achieve real-time object detection. It stands as one of the most widely adopted and implemented target detection algorithms today. As shown in Fig. 2, its basic framework consists of three components: the Backbone network, Neck network, and Head network, which collectively execute target detection tasks efficiently using regression-based methods. This algorithm integrates image preprocessing and data augmentation techniques, significantly enhancing detection performance, and can be conveniently and widely applied in practical scenarios.

Improved model algorithm incorporating dual attention mechanisms

The dual attention module is related to several recent research efforts, including Squeeze-and-Excitation Networks¹⁴, covariance pooling¹⁵, non-local neural networks¹⁶, and the Transformer architecture¹⁷. However, compared to these existing works, the A² module offers unique advantages. First Attention Operation: It implicitly computes second-order statistics of pooled features, capturing complex appearance and motion correlations that cannot be detected by the global average pooling used in SENet¹⁴. Second Attention Operation: It adaptively allocates features from a compact set, which is more efficient than the full relational associations between all positions and each specific location used in non-local neural networks¹⁶ and the Transformer¹⁷. Extensive experiments on image and video recognition tasks validate these advantages of the proposed method.

The DANet used in this paper aims to enable convolutional layers to instantly access features from the entire spatio-temporal space of their neighboring layers by introducing a novel network component. Its core idea is to first aggregate critical information from the entire spatial domain into a compact set and then adaptively distribute these features to each location. This allows subsequent convolutional layers to perceive features from the full spatial scope even without large receptive fields. To achieve this, DANet incorporates a unified framework implemented through an efficient dual attention mechanism. First Operation: A second-order attention pooling operation selectively gathers key information from the entire spatial domain. Second Operation: An adaptive attention mechanism assigns a task-beneficial subset of features to each spatio-temporal location, complementing local details. This dual attention module, termed the A² module, forms the basis of the resultant network architecture, named A²-Net.

Second-order attention pooling-based global feature aggregation

Convolutional operators are designed to focus on local neighborhoods and thus lack the ability to “perceive” the entire spatial and/or temporal domain, such as an entire input frame or a position across multiple frames. Consequently, CNN models typically employ multiple convolutional layers (or recurrent units¹⁸,¹⁹) to capture global features of the input. Meanwhile, self-attention and correlation operators like second-order pooling have recently demonstrated strong performance in many tasks¹⁵,¹⁷,²⁰. In this section, we propose a component capable of collecting and distributing global features to each spatio-temporal location of the input, enabling subsequent convolutional layers to instantly perceive the full spatial domain and capture complex relationships. We begin by formally describing this component through a general formulation. Next, we introduce our dual attention module, an efficient method to implement this component. Finally, we discuss the relationship between our approach and other recent related methods.

Figure 3 illustrates an example of single-frame input to explain the concept of the dual attention method. Here, the global feature set is computed once and then shared across all positions. Meanwhile, each position generates its own attention vector, based on the needs of its local features, to select a subset of global features that complement the current position and form the enhanced feature. Figure 4 demonstrates the dual attention operations applied to a 3D input array. The first attention step (top) produces a global feature set, while at position $i$, the second attention step (bottom) generates the new local feature ${z_i}$.

“Let $X \in {{\mathbb{R}}^{c \times d \times h \times w}}$ denote the input tensor of a spatio-temporal (3D) convolutional layer, where $c$ is the number of channels, $d$ is the temporal dimension, and $h$ and $w$ are the spatial dimensions of the input frame.

For each spatio-temporal input position $i=1, \ldots ,dhw$ and its local feature ${v_i}$, we define

$${z_i}={F_{distr}}\left( {{G_{gather}}\left( X \right),{v_i}} \right),$$

(1)

as the output of an operator that first aggregates features across the entire spatial domain and then distributes them back to each input location $i$, taking into account the local feature ${v_i}$ at that position. Specifically, ${G_{gather}}$ adaptively aggregates features from the entire input space, and ${F_{distr}}$ distributes the gathered information to each location $i$, conditioned on the local feature vector ${v_i}$”²⁴.

The concept of information gathering and distribution draws inspiration from Squeeze-and-Excitation Networks (SENet)¹⁴. However, Eq. (1) presents this idea in a more generalized form, leading to insightful observations and optimizations. In SENet¹⁴, the gathering process employs global average pooling, and the resulting single global feature is uniformly distributed to all locations, neglecting the diverse requirements of individual positions. To address these limitations, this generalized formulation is introduced, and a dual attention block is proposed. Here, global information is first collected via second-order attention pooling (instead of first-order average pooling) and then adaptively allocated to each location through a second attention mechanism, tailored to the demands of the current local features.

This approach achieves two key advantages:

1. Complex Global Relationships: A compact set of features captures richer global correlations.
2. Customized Feature Allocation: Each position receives task-specific global information that complements its existing local features, thereby enhancing the learning of intricate relationships.

The proposed component is schematically illustrated in Fig. 3. Below, we first detail its architecture, followed by discussions on specific instantiations and connections to other state-of-the-art methods.

1. Feature Gathering

“A recent work²⁰ employs bilinear pooling to capture the second-order statistics of features and generate global representations. Compared to traditional average pooling and max pooling, which compute only first-order statistics, bilinear pooling better captures and preserves complex relationships. Specifically, bilinear pooling performs sum-pooling of second-order features derived from the outer products of all feature vector pairs $\left( {{a_i},{b_i}} \right)$ within two input feature maps $A$ and $B$:

$${G_{bilinear}}\left( {A,B} \right)=A{B^ \top }={\sum _{\forall i}}{a_i}b_{i}^{ \top },$$

(2)

where $A=\left[ {{a_1}, \ldots ,{a_{dhw}}} \right] \in {{\mathbb{R}}^{m \times dhw}}$ and $B=\left[ {{b_1}, \ldots ,{b_{dhw}}} \right] \in {{\mathbb{R}}^{n \times dhw}}$. In CNNs, $A$ and $B$ can be feature maps from the same layer (i.e., $A=B$) or from two distinct layers, $A=\varphi \left( {X;{W_\varphi }} \right)$ and $B=\theta \left( {X;{W_\theta }} \right)$, with parameters ${W_\varphi }$ and ${W_\theta }$”²⁴. By introducing the output of bilinear pooling $G=\left[ {{g_1}, \cdots ,{g_n}} \right] \in {{\mathbb{R}}^{m \times n}}$ and rewriting the second feature map as $B=\left[ {{{\bar {b}}_1}; \cdots ;{{\bar {b}}_n}} \right]$, where each ${\bar {b}_i}$ is a $dhw$-dimensional row vector, we can reformulate Eq. (2) as

$${g_i}=A\bar {b}_{i}^{ \top }={\sum _{\forall j}}{\bar {b}_{ij}}{a_j},$$

(3)

Equation (3) provides a novel perspective on the result of bilinear pooling. Beyond computing second-order statistics, the output of bilinear pooling is essentially a collection of visual primitives, where each primitive ${g_i}$ is computed by aggregating local features weighted according to ${\bar {b}_i}$. This inspires an attention-based feature aggregation operation. Further applying a softmax operation to $B$ ensures ${\sum _j}{\bar {b}_{ij}}=1$, i.e., a valid attention weighting vector, leading to the following second-order attention pooling process:

$${{\mathbf{g}}_i}=A{\text{softmax}}{\left( {{{{\mathbf{\bar {b}}}}_i}} \right)^ \top },$$

(4)

The first row in Fig. 4 demonstrates the second-order attention pooling corresponding to Eq. (4), where A and B are the outputs transformed from the input X via two distinct convolutional layers.

In implementation, let $A=\varphi \left( {X;{W_\varphi }} \right)$ and $B={\text{softmax}}\left( {\theta \left( {X;{W_\theta }} \right)} \right)$. The second-order attention pooling offers an effective way to gather critical features: when ${\bar {b}_i}$ densely attends to all positions, it captures global characteristics such as textures and illumination, whereas when ${\bar {b}_i}$ sparsely focuses on specific regions, it detects the presence of particular semantics, such as an object and its parts. Notably, a similar understanding has been proposed in ref. 13, where a rank-1 approximation of the bilinear pooling operation associated with a fully connected classifier was introduced. However, in practical scenarios, attention pooling is applied to aggregate visual primitives across different locations, pooling them into a set of global descriptors using softmax attention maps, without imposing any low-rank constraints.
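The gathering step of Eqs. (2) to (4) can be sketched in a few lines. This is a minimal illustration in NumPy rather than the PyTorch implementation used in the experiments; the array sizes, the random inputs standing in for the convolution outputs, and the function names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_pool(A, B):
    """Eq. (2): sum over positions of the outer products a_i b_i^T."""
    return A @ B.T

def second_order_attention_pool(A, B_logits):
    """Eq. (4): softmax turns each row of B into an attention map over positions."""
    B = softmax(B_logits, axis=1)   # each row sums to 1 over the dhw positions
    return A @ B.T                  # shape (m, n); column i is the primitive g_i

# Hypothetical sizes: m and n feature channels, positions = d * h * w.
rng = np.random.default_rng(0)
m, n, positions = 8, 4, 6
A = rng.standard_normal((m, positions))         # stands in for phi(X; W_phi)
B_logits = rng.standard_normal((n, positions))  # stands in for theta(X; W_theta)

G = second_order_attention_pool(A, B_logits)

# Eq. (3): column 0 of G is the attention-weighted sum of the local features a_j.
B = softmax(B_logits, axis=1)
g0 = sum(B[0, j] * A[:, j] for j in range(positions))
```

Each column of `G` is one global descriptor, so the whole input is summarized by only `n` vectors regardless of how many positions it has.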

2. Feature Allocation

After gathering features from the entire spatial domain, the next step is to distribute them to each location of the input. This ensures that subsequent convolutional layers, even with small kernel sizes, can access global information.

Unlike SENet¹⁴, which distributes the same summarized global features to all positions, our approach achieves greater flexibility by adaptively allocating a set of visual primitives tailored to the demands of each location’s feature. This allows each position to select features complementary to its current ones, simplifying training and enabling the capture of more complex relationships. Specifically, this is realized through soft attention, which selects a subset of feature vectors from ${G_{gather}}\left( X \right)$:

$${z_i}={\sum _{\forall j}}{v_{ij}}{g_j}={G_{gather}}\left( X \right){v_i},\quad {\text{where }}{\sum _{\forall j}}{v_{ij}}=1,$$

(5)

Equation (5) formulates the soft attention mechanism for feature selection. In implementation, a softmax function is applied to normalize ${v_i}$ into a sum-to-one form, which has been empirically observed to improve convergence. The second row in Fig. 4 illustrates the aforementioned feature selection step. Similar to the generation of attention maps, the set of attention weight vectors is generated via a convolutional layer followed by a softmax normalizer, $V=softmax\left( {\rho \left( {X;{W_\rho }} \right)} \right)$, where ${W_\rho }$ contains the parameters of this layer.
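The allocation step of Eq. (5) can be illustrated in the same spirit. The sketch below uses NumPy with random arrays in place of the gathered descriptors and the output of the convolution $\rho$; the sizes and names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def distribute(G, V_logits):
    """Eq. (5): z_i = G v_i, with each v_i normalized to sum to one.

    G        : (m, n) global descriptors from the gathering step
    V_logits : (n, positions) unnormalized attention, one column per position
    """
    V = softmax(V_logits, axis=0)   # column i is the attention vector v_i
    return G @ V, V                 # Z has shape (m, positions)

# Hypothetical sizes; positions = d * h * w.
rng = np.random.default_rng(1)
m, n, positions = 8, 4, 6
G = rng.standard_normal((m, n))
V_logits = rng.standard_normal((n, positions))  # stands in for rho(X; W_rho)

Z, V = distribute(G, V_logits)
```

Because every `v_i` is a set of non-negative weights summing to one, each output feature is a convex combination of the global descriptors, chosen separately for each position.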

Module architecture design: computational graph and implementation strategy of dual attention in A²-Net

By combining the two attention steps described above, we form the proposed dual attention module, whose computational graph within a deep neural network is illustrated in Fig. 5. To formally define the dual attention operation, we substitute Eqs. (4) and (5) into Eq. (1), yielding:

$$\begin{aligned} Z = & \;F_{{distr}} \left( {G_{{gather}} \left( X \right),V} \right) = G_{{gather}} \left( X \right)softmax\left( {\rho \left( {X;W_{\rho } } \right)} \right) \\ = & \;\left[ {\varphi \left( {X;W_{\varphi } } \right)softmax\left( {\theta \left( {X;W_{\theta } } \right)} \right)^{{ \top }} } \right]softmax\left( {\rho \left( {X;W_{\rho } } \right)} \right), \\ \end{aligned}$$

(6)

Figure 4 illustrates the combined dual attention operation, and Fig. 5 presents its computational graph. Here, the feature arrays A, B, and V are generated by processing the input feature array X through three distinct convolutional layers, followed by softmax normalization where applicable. The output Z is obtained by performing two matrix multiplications along with necessary shape transformations and transposition operations. An additional convolutional layer is appended at the end to expand the channel dimension of Z, enabling its reintegration into the input X via element-wise addition. During training, gradients of the loss function can be efficiently computed using automatic differentiation and the chain rule.

There are two distinct approaches to implementing the computational graph for Eq. (6). The first follows the left association in Eq. (6), with its computational graph illustrated in Fig. 5. The alternative is to use right association, as shown below:

$$Z=\varphi \left( {X;{W_\varphi }} \right)\left[ {softmax{{\left( {\theta \left( {X;{W_\theta }} \right)} \right)}^ \top }softmax\left( {\rho \left( {X;{W_\rho }} \right)} \right)} \right],$$

(7)

Note that these two associations are mathematically equivalent and therefore produce identical outputs. However, they differ in computational cost and memory consumption. The computational complexity of the second matrix multiplication is $O\left( {mndhw} \right)$ for the “left association” of Eq. (6) and $O\left( {m{{\left( {dhw} \right)}^2}} \right)$ for the “right association” of Eq. (7). Regarding memory, with 32-bit floating-point values, storing the output of the first matrix multiplication consumes $mn/{2^{18}}$ MB for left association and ${\left( {dhw} \right)^2}/{2^{18}}$ MB for right association. In practice, with right association, an input data array X with 32 frames of size 28 × 28 and 512 channels can easily exceed 2 GB of memory, whereas the memory cost for left association remains at about 1 MB. Thus, left association is more efficient than right association in both computation and memory. Therefore, for the common case ${\left( {dhw} \right)^2}>nm$, we recommend implementing the left association defined in Eq. (6).
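The equivalence of the two associations and the memory figures quoted above can be checked numerically. The following NumPy sketch is illustrative only: the small random arrays are placeholders, and the choice $m=n=512$ for the memory estimate is an assumption that matches the 512-channel example and 32-bit floats.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_left(A, B, V):
    """Eq. (6): (A B^T) V. The intermediate result is only m x n."""
    return (A @ B.T) @ V

def dual_attention_right(A, B, V):
    """Eq. (7): A (B^T V). The intermediate result is dhw x dhw."""
    return A @ (B.T @ V)

def intermediate_mb(rows, cols, bytes_per_value=4):
    """Memory of a rows x cols array of 32-bit floats, in MB (2**20 bytes)."""
    return rows * cols * bytes_per_value / 2**20

# Small hypothetical sizes for the equivalence check.
rng = np.random.default_rng(0)
m, n, positions = 16, 8, 50
A = rng.standard_normal((m, positions))
B = softmax(rng.standard_normal((n, positions)), axis=1)  # attention over positions
V = softmax(rng.standard_normal((n, positions)), axis=0)  # attention over primitives

same_output = np.allclose(dual_attention_left(A, B, V),
                          dual_attention_right(A, B, V))

# Memory of the first product for 32 frames of 28 x 28, assuming m = n = 512.
dhw = 32 * 28 * 28
left_mb = intermediate_mb(512, 512)   # m * n / 2**18
right_mb = intermediate_mb(dhw, dhw)  # (dhw)**2 / 2**18
```

With these sizes `left_mb` is 1 MB and `right_mb` is 2401 MB, which is where the "1 MB versus more than 2 GB" comparison comes from.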

Integration with YOLOv5: The DANet module is inserted into the 3rd and 4th C3 layers of the YOLOv5s Backbone (after the 5th and 8th convolutional blocks). This position balances global feature aggregation and local detail preservation.

Channel expansion: A 1 × 1 convolution layer is used to expand the channel dimension of the DANet output from c to 2c (consistent with the input channel dimension of the C3 layer), ensuring smooth element-wise addition with the original feature map.

Training settings: No weights were frozen during training; the entire model was fine-tuned. The optimizer (SGD) and learning rate schedule (warm-up for 5 epochs, then cosine annealing) were kept consistent with the baseline YOLOv5s.
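The learning rate schedule mentioned in the training settings (5 warm-up epochs followed by cosine annealing) can be sketched as a plain function of the epoch index. This is a simplified per-epoch illustration, not the YOLOv5 training code; the base learning rate of 0.01, the final ratio of 0.01, and the 600-epoch horizon are assumed values.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=0.01,
                  warmup_epochs=5, final_lr_ratio=0.01):
    """Linear warm-up for the first epochs, then cosine annealing.

    base_lr and final_lr_ratio are illustrative defaults, not reported values.
    """
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    min_lr = base_lr * final_lr_ratio
    return min_lr + (base_lr - min_lr) * cosine

# Assumed training length of 600 epochs.
schedule = [learning_rate(e, total_epochs=600) for e in range(600)]
```

The warm-up avoids large updates while the newly inserted attention layers are still randomly initialized, and the cosine phase lowers the rate smoothly toward the end of training.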

Aspect	Original A²-Net (Chen et al.)²⁴	DANet (our work)
Application scenario	General image/video recognition	Dense small-target detection (heat exchanger pipe orifices)
Pooling mechanism	Pure global second-order pooling	Local-global hybrid pooling (captures local pipe orifice texture + global layout)
Computational complexity	High (12.5 × 10⁹ FLOPs for 512 × 512 input)	Reduced by 26.4% (9.2 × 10⁹ FLOPs) via channel reduction (c→c//4)
Integration with YOLOv5	Not designed for YOLO’s C3 module	Adapted to C3 layer with residual connection (avoids feature degradation)
Feature allocation	Uniform global feature distribution	Task-specific allocation (prioritizes features of occluded pipe orifices)

DANet is not a direct application of A²-Net but a task-oriented optimization for industrial dense small targets, with improved efficiency and adaptability.

Experiments and analysis

Dataset

Since the content of this paper is based on practical engineering applications and lacks existing datasets, the dataset used in this study is a custom one. It was constructed from multiple densely packed pipeline images captured at engineering sites (as shown in Fig. 7) and generated through data augmentation techniques such as transformations, cropping, and rotations. The dataset contains a single class label, pipe (pipe orifice).

Total number of samples: The self-constructed dataset contains 1,200 images of heat exchanger pipe orifices, collected from 3 different models of shell-and-tube heat exchangers (SHE-100, SHE-200, SHE-300) at a chemical plant in Shanghai. The images cover diverse scenarios: normal lighting (600 images), low-light (300 images), backlighting (150 images), and occluded (150 images, including pipe-to-pipe occlusion and fouling occlusion).

Training/validation/test split: The dataset is divided into training (960 images, 80%), validation (120 images, 10%), and test (120 images, 10%) sets using stratified sampling to ensure consistent distribution of scenarios (e.g., 80% of low-light images are assigned to the training set).

Average number of objects per image: Each image contains 28–42 pipe orifices, with an average of 35 targets per image.

Annotation standards: We used the LabelImg tool for dual annotation: (1) Bounding boxes (x1, y1, x2, y2) with pixel-level precision (error < 1 pixel); (2) Center coordinates (cx, cy) of each pipe orifice to facilitate robotic arm positioning.

Difficult/ambiguous cases: A total of 187 “difficult targets” (defined as occlusion area > 30% or edge blur degree > 0.5) were labeled with a “difficult” flag. On these targets, the detection rate of DANet-YOLOv5 is 89.2%, which is 12.7 percentage points higher than that of YOLOv8s (76.5%).
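The dataset conventions described above can be sketched in plain Python: a stratified 80/10/10 split that preserves the scenario proportions, and the center of a bounding box. The image identifiers, the random seed, and the assumption that an orifice center coincides with the center of its box are illustrative choices, not details taken from the dataset.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (image_id, scenario) pairs so each scenario keeps the same ratios."""
    by_scenario = defaultdict(list)
    for image_id, scenario in samples:
        by_scenario[scenario].append(image_id)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for scenario in sorted(by_scenario):
        ids = by_scenario[scenario]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

def box_center(x1, y1, x2, y2):
    """Center (cx, cy) of a box given by its top-left and bottom-right corners."""
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

# Scenario counts as reported for the dataset; the identifiers are made up.
counts = {"normal": 600, "low_light": 300, "backlight": 150, "occluded": 150}
samples = [(f"{name}_{i:04d}", name)
           for name, n in counts.items() for i in range(n)]

train, val, test = stratified_split(samples)
cx, cy = box_center(100, 200, 140, 240)   # hypothetical box in pixels
```

With the reported scenario counts, this yields 960 training, 120 validation, and 120 test images, with 240 of the 300 low-light images in the training set.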

As indicated by the width-height distribution plot of the training set in Fig. 8, most data points cluster in the lower-left corner, implying that the widths and heights of targets in the dataset are less than 1/10 of the original image dimensions, which classifies them as small objects.

Experimental setup and evaluation metrics

All experiments in this paper were conducted on a Windows 11 operating system, using Python 3.8, PyTorch 2.4.1, and CUDA 12.4. The models were trained, validated, and inferred on an NVIDIA RTX 4070 GPU. The hyperparameter settings during training are summarized in Table 1.

The improved algorithm’s detection performance is evaluated using the following metrics: Precision, Recall, mean Average Precision (mAP), and Error Rate (the discrepancy between the number of pipes detected by the model and the actual count). Precision measures the accuracy of the model’s positive predictions. Recall evaluates the model’s ability to identify true positive samples. Average Precision (AP) refers to the average of precision values calculated across different confidence thresholds for a single class. Mean Average Precision (mAP) is the average of AP values across all classes, providing a comprehensive assessment of the model’s performance over all categories. Error Rate quantifies the deviation in the predicted versus actual number of pipes. By computing mAP, we holistically assess the target detection model’s performance, balancing both localization and classification accuracy.
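These metrics can be written down compactly. The sketch below is one plausible reading, not the evaluation code used in the experiments: the error rate is assumed to be the absolute count difference divided by the actual count, AP is computed as the area under the precision-recall curve with the usual monotone envelope, and the example detections are invented.

```python
def precision(tp, fp):
    """Fraction of predicted orifices that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual orifices that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def count_error_rate(num_detected, num_actual):
    """Assumed definition: relative deviation of the detected count."""
    return abs(num_detected - num_actual) / num_actual

def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs, already matched
    to ground truth at a fixed IoU threshold such as 0.5.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (monotone envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Invented example: four ranked detections against four ground-truth orifices.
example = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
ap = average_precision(example, num_ground_truth=4)
err = count_error_rate(num_detected=36, num_actual=35)
```

Since the dataset has a single class, mAP coincides with the AP of that class. Under the assumed definition, a count-based error rate says nothing about where the boxes are, which is why it is reported alongside precision, recall, and mAP.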

Table 1 Hyperparameters setting.

Comparative experiments and visual analysis

To validate the superiority of the DANet-YOLOv5 algorithm in dense pipeline orifice detection scenarios, we conducted comparative experiments against both the mainstream YOLO-series detector YOLOv8 and enhanced YOLOv5s variants incorporating distinct attention mechanisms: CBAM, ECA, EVC, and Triplet Attention. The detection accuracy of these models was evaluated using the following metrics: mean Average Precision (mAP), error rate (deviation between detected and actual pipe counts), Precision, and Recall.

Table 2 DANet-YOLOv5 model training results.

Table 3 CBAM-YOLOv5 model training results.

Table 4 ECA-YOLOv5 model training results.

Table 5 EVC-YOLOv5 model training results.

Table 6 TripletAttention model training results.

Table 7 YOLOv5 model training results.

Table 8 YOLOv8 model training results.

Table 9 Ablation studies.

Table 10 Deployment-related metrics of DANet-YOLOv5 and comparative models.

Inference speed was tested on an NVIDIA RTX 4070 (for industrial servers) and NVIDIA Jetson Nano (for edge devices, e.g., cleaning robots). Robustness tests were conducted under low-light (50 lx) and high-occlusion (30–50% occlusion) conditions, confirming DANet-YOLOv5’s superiority in complex industrial environments.

Table 11 Comparative experiments with 3 state-of-the-art attention mechanisms.

CoAtNet (2023) combines convolutional and transformer attention; SimAM (2024) is a lightweight attention mechanism with low computational cost. The results show DANet-YOLOv5 outperforms these latest models in error rate (reduced by 59.4% vs. CoAtNet-YOLOv5) and recall.

Table 12 Fairness comparison table.

1. Comparative Analysis of Key Metrics

Based on the experimental data from Tables 2, 3, 4, 5, 6, 7 and 8, DANet-YOLOv5 demonstrates significant advantages in dense pipeline orifice detection scenarios. After 360 training epochs, its error rate drops to 0.013, recall reaches 0.947, and mAP@0.5 achieves 0.972. Specifically: DANet-YOLOv5 attains the lowest error rate (0.013) around epoch 360 and exhibits faster convergence (Fig. 9). In contrast, YOLOv8 maintains a consistently higher error rate of 0.167 (Table 8), indicating substantial risks of missed or false detections in practical scenarios. While YOLOv8 achieves a notably higher recall (0.991) than DANet-YOLOv5 (0.947), its elevated error rate stems from over-detection (Fig. 13b–c), leading to duplicate detections. DANet-YOLOv5 effectively mitigates missed detections (Fig. 13a) by balancing global and local features through its dual attention mechanism. Although YOLOv8 slightly outperforms DANet-YOLOv5 in mAP@0.5 (0.995 vs. 0.972), its high error rate reveals that this precision is limited to simple scenarios, whereas DANet-YOLOv5 shows superior robustness in complex environments with dense, overlapping targets (Tables 9, 10, 11, 12).

2. Effectiveness Validation of Attention Mechanism Improvements

Compared to models with single attention mechanism enhancements (e.g., CBAM-YOLOv5, ECA-YOLOv5), DANet-YOLOv5 significantly improves detection performance through its second-order attention pooling and feature adaptive allocation strategy: CBAM-YOLOv5 (Table 3) achieves an error rate of 0.027 at 600 epochs, higher than DANet-YOLOv5’s 0.022, with slightly lower recall (0.949). This indicates that merely combining channel and spatial attention inadequately aggregates global features. ECA-YOLOv5 (Table 4) exhibits a notably higher error rate (0.038), as its exclusive focus on channel correlations neglects spatial information, resulting in insufficient local feature allocation. TripletAttention-YOLOv5 (Table 6) introduces redundancies in inter-group and intra-group attention computations, leading to unstable performance with large error rate fluctuations (0.013–0.072).

3. Comparative Analysis of Practical Detection Performance

As the detection results in Fig. 13 show, DANet-YOLOv5 (Fig. 13a) precisely localizes densely packed pipe orifices without missed or duplicate detections. In contrast, the other algorithms (Fig. 13b–c) exhibit notable flaws in complex backgrounds or overlapping target regions:

YOLOv8 suffers from missed detections (Fig. 13b) because its lightweight feature extraction modules compromise discriminative power.

EVC-YOLOv5 generates false positives (Fig. 13c) caused by insufficient local feature aggregation.

4. Training Stability and Convergence Speed

The curves in Figs. 9, 10, 11 and 12 demonstrate that DANet-YOLOv5 achieves high accuracy (mAP@0.5 = 0.969) during the early training phase (150 epochs), with its error rate declining rapidly as training progresses. This confirms that its dual attention mechanism effectively accelerates the feature learning process. In contrast, YOLOv8 exhibits no significant downward trend in its error rate curve (Fig. 9), indicating that its overly complex architecture poses challenges for fine-tuning and adaptation to dense scenarios.

Conclusion

This study addresses the practical need for detecting densely packed pipe orifices in heat exchangers by proposing DANet-YOLOv5, an improved YOLOv5 model incorporating dual attention mechanisms. By embedding a Dual Attention Module (DANet) into the YOLOv5s backbone network, the model employs second-order attention pooling and feature adaptive allocation strategies to efficiently aggregate global features and dynamically distribute critical information, significantly enhancing detection performance for dense small targets.

Experimental results on the custom pipe orifice dataset show DANet-YOLOv5’s advantages over existing algorithms:

At 400 training epochs, DANet-YOLOv5 achieves an error rate of 0.013, substantially lower than YOLOv8 (0.167, Table 8) and the original YOLOv5 (0.05, Table 7).

Compared to other attention-enhanced variants, it reduces error rates by 56.7% (vs. CBAM-YOLOv5: 0.03), 65.8% (vs. ECA-YOLOv5: 0.038), and 66.7% (vs. TripletAttention-YOLOv5: 0.039) (Tables 2, 3, 4, 5 and 6).

While DANet-YOLOv5 attains a slightly lower recall (0.949 at 600 epochs) than YOLOv8 (0.991), its minimal error rate (0.013) confirms effective mitigation of the duplicate counting that YOLOv8’s over-detection causes (Fig. 13b), balancing precision and stability.

Despite a marginally lower mAP@0.5 (0.973 at 600 epochs) compared to YOLOv8 (0.995), DANet-YOLOv5 exhibits superior practical detection performance in dense scenarios.

Visual comparisons reveal DANet-YOLOv5’s robustness:

It successfully detects three overlapping pipe orifices missed by YOLOv8 (Fig. 13a vs. b).

Smoother error rate decline curves and earlier convergence (Fig. 9) validate the dual attention mechanism’s role in accelerating feature learning.

The dual attention mechanism, which enables efficient global feature interaction and dynamic local information adaptation, provides a lightweight yet effective solution for dense object detection. Future work will:

Explore the model’s adaptability to broader industrial scenarios (e.g., part defect detection, multi-class dense object recognition).

Integrate knowledge distillation or dynamic network architecture optimization to enhance real-time performance and deployment efficiency.

Drive intelligent automation in cleaning technologies through scalable industrial AI applications.^21,22,23
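The relative error rate reductions quoted in the Conclusion follow from (baseline − improved) / baseline. The short check below recomputes them from the error rates stated there; the function name `relative_reduction` is a hypothetical helper introduced only for this illustration.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


danet_error = 0.013  # DANet-YOLOv5 error rate quoted in the Conclusion
baseline_errors = {
    "CBAM-YOLOv5": 0.03,
    "ECA-YOLOv5": 0.038,
    "TripletAttention-YOLOv5": 0.039,
}

reductions = {
    name: relative_reduction(err, danet_error)
    for name, err in baseline_errors.items()
}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}%")
# CBAM-YOLOv5: 56.7%
# ECA-YOLOv5: 65.8%
# TripletAttention-YOLOv5: 66.7%
```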

Data availability

All data analyzed during this study are included in this published article.

References

1. Stephan, D. Maintenance of shell-and-tube heat exchangers [J]. Process. Ind. 624 (2), 17–19 (2022).
2. Dai Fengyan, C. et al. Research on automatic control system for tube-side cleaning of heat exchangers [J]. J. Beijing Inst. Petrochemical Technol. 26 (4), 53–57 (2018).
3. Qiu, S. et al. Brain-machine interface and visual compressive sensing-based teleoperation control of an exoskeleton robot [J]. IEEE Trans. Fuzzy Syst. 25 (1), 58–69 (2017).
4. Luo, S. M., Wu, L. & Deng, S. Q. Design of intelligent cleaning robot based on Fischertechnik [J]. Adv. Mater. Res. 706/708, 724–728 (2013).
5. Chen, Y. Research on the Visual Navigation System of Heat Exchanger Cleaning Robot [D] (Beijing University of Petroleum and Chemical Technology, 2019).
6. Liu, C. Multi-target search and localization algorithm for condenser tube inlet images [J]. J. Instruments Instrum. 32 (11), 2515–2522. https://doi.org/10.19650/j.cnki.cjsi.2011.11.018 (2011).
7. Wang Biao, L. et al. Monocular vision recognition and localization of heat exchanger tube inlets and cleaning path planning [J]. Manuf. Autom. 46 (05), 43–47 (2024).
8. Dai Fengyan, Z. et al. Heat exchanger tube inlet image recognition method based on multi-scale HDS-UNet [J]. Manuf. Autom. 45 (11), 157–160 (2023).
9. Xiao, L. Design and Implementation of Tube Inlet Recognition and Localization System for Shell-and-Tube Heat Exchanger in Aquaculture [D] (Shanghai Ocean University, 2022). https://doi.org/10.27314/d.cnki.gsscu.2022.000798
10. Moll, F. J., Buchanan, J. & Crock, T. L. Semi-automated Heat Exchanger Tube Cleaning Assembly and Method: US, WO2012037560A2 [P], 2012-03-22.
11. John, E., William, S., James, A. et al. (eds) Automated Heat Exchanger Tube Cleaning Assembly and Method: US, 8524011B2 [P], 2013-09-03.
12. Carion, N. et al. End-to-end object detection with transformers [C]. ECCV, 213–229 (2020).
13. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks [J]. Adv. Neural Inf. Process. Syst. 25 (2) (2012).
14. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
15. Li, P., Xie, J., Wang, Q. & Zuo, W. Is second-order information helpful for large-scale visual recognition? arXiv preprint arXiv:1703.08050 (2017).
16. Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Computer Vision and Pattern Recognition (CVPR) (2018).
17. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 6000–6010 (2017).
18. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634 (2015).
19. Ng, J. Y. H. et al. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 4694–4702 (IEEE, 2015).
20. Lin, T. Y., RoyChowdhury, A. & Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, 1449–1457 (2015).
21. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
22. Bao, X. & Wang, S. A survey of object detection algorithms based on deep learning [J]. Transducer Microsyst. Technol. 41 (04), 5–9. https://doi.org/10.13873/J.1000-9787(2022)04-0005-05 (2022).
23. Guihua Yang, Z., Wu, Z. & Yang. Research on QFN chip surface defect detection technology based on YOLOX [J]. Transducer Microsyst. Technol. 44 (03), 46–49. https://doi.org/10.13873/J.1000-9787(2025)03-0046-04 (2025).
24. Chen, Y., Kalantidis, Y., Li, J., Yan, S. & Feng, J. A2-Nets: Double Attention Networks. arXiv:1810.11579 (2018).

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author information

Authors and Affiliations

School of Intelligent Technology—School of Mechanical Engineering, Shanghai Institute of Technology, Shanghai, 201418, China
Ruijun Yang, Yining Zhu, Bin Xu, Wenqi Dai, Shoucheng Ji & Dejian Meng

Contributions

Y.R. (Yang Ruijun) and Z.Y. (Zhu Yining) designed the core framework of the DANet-YOLOv5 model, including the integration of the dual attention mechanism into the YOLOv5 backbone network, and wrote the main manuscript text. All authors reviewed the manuscript, and Y.R. and J.S. finalized the manuscript.

Corresponding author

Correspondence to Shoucheng Ji.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yang, R., Zhu, Y., Xu, B. et al. Heat exchanger pipe orifice recognition using an improved YOLOv5 model integrated with dual attention mechanisms. Sci Rep 16, 4508 (2026). https://doi.org/10.1038/s41598-025-34704-x

Received: 12 September 2025
Accepted: 30 December 2025
Published: 10 January 2026
Version of record: 02 February 2026
DOI: https://doi.org/10.1038/s41598-025-34704-x