Introduction

As urbanization continues to accelerate, the maintenance and management of urban infrastructure have become increasingly important1. As essential components of urban drainage, power, and communication systems, road manhole covers are directly linked to the daily lives and public safety of citizens. However, due to factors such as aging, damage, or theft, issues like missing or damaged road manhole covers frequently arise, posing risks to public safety and causing inconvenience to commuters. The rapid and precise detection of road manhole covers is thus critical for improving infrastructure maintenance efficiency and safeguarding public safety, with broad applications in smart city management, autonomous driving, and road maintenance2.

In recent years, deep learning-based object detection methods have made considerable advancements in image processing. These approaches are broadly categorized into two groups: two-stage and one-stage methods. Two-stage methods, exemplified by Faster R-CNN3, first generate a set of candidate regions of interest (RoIs) that are likely to contain objects, and then classify and refine the locations of the objects within these regions to produce the final detection outcomes. In contrast, one-stage methods, such as the YOLO series4,5,6,7,8,9 and SSD10, bypass the region proposal stage by directly performing classification and bounding box regression on the feature maps produced by the backbone network. This end-to-end design streamlines the model architecture and improves detection speed. Among these, the YOLO series is particularly renowned for its speed and accuracy. As the latest iteration, YOLOv11 leverages the strengths of its predecessors while further enhancing both speed and accuracy11, positioning it as a highly promising solution for road manhole cover detection applications.

However, directly applying the YOLOv11 algorithm to the task of road manhole cover detection presents several challenges. First, road manhole covers come in circular and square shapes, with varying sizes and materials, and are often affected by factors such as lighting, occlusion, and dirt. Furthermore, road manhole cover defects are complex and can be categorized as lost, broken, or misaligned. Differences in damage severity and coverage make certain types visually difficult to distinguish, which presents additional challenges for feature extraction.

To address these challenges, this paper introduces an Edge-Enhanced Feature Aggregation method named EEFA-YOLO for road manhole cover defect detection, using YOLOv11 as the baseline. This approach proposes two new modules: Multi-Scale Edge Enhancement (MSEE) and Feature Aggregation Pyramid (FAP), specifically designed for the characteristics of road manhole cover defects. Additionally, a dataset encompassing diverse scenes and road manhole cover types was constructed for training and evaluating algorithm performance.

The main contributions of this study are summarized as follows:

  • Proposed the Multi-Scale Edge Enhancement (MSEE) module, combining multi-scale feature extraction, edge enhancement, and convolutional operations, with the aim of capturing features across scales while emphasizing edge details. By integrating these features into an enhanced representation through convolutional layers, this module improves model sensitivity to targets and edge information, particularly suited for detecting fine objects in complex backgrounds.

  • Proposed the Feature Aggregation Pyramid (FAP) module, built on feature aggregation and diffusion mechanisms, which combines local and contextual information using multiple convolutional kernels. A unique diffusion mechanism then evenly distributes rich contextual information across scales, substantially enhancing the network’s ability to extract information and effectively mitigating challenges related to scale variation and background interference.

  • Developed a specialized dataset for road manhole cover defect detection, covering four target classes: good, lost, broken, and misaligned manhole covers. These categories comprehensively capture common defect characteristics in real-world applications, with each target class meticulously annotated to provide high-quality label information. This dataset serves as a robust foundation for model training and evaluation, supporting future research on manhole cover defect detection.

The organization of the remainder of this paper is as follows: the "Related work" section summarizes the progress made in research on detecting road manhole covers. The "Methodology" section outlines the proposed approach in detail. The "Experiments" section presents the experimental design, dataset, and results, accompanied by an in-depth analysis of findings. Finally, the "Conclusion" section summarizes the effectiveness of the proposed method and outlines potential directions for future research.

Related work

The YOLO and its evolutions

The YOLO series has become widely adopted for real-time applications, owing to its remarkable efficiency and speed, in contrast to traditional object detection methods such as the R-CNN family. Since its introduction in 2015, the YOLO algorithm has evolved significantly. YOLOv14 transformed object detection by reformulating it as a regression task, utilizing a single neural network to predict bounding boxes and class probabilities directly within an end-to-end detection system. While it offered high real-time performance, it had limitations in detecting small objects and achieving precise localization. YOLOv25 introduced batch normalization, anchor boxes, and multi-scale training to improve model performance. YOLOv312 adopted multi-scale feature maps and a Darknet-53 backbone, enhancing detection accuracy and performance for small objects. YOLOv46 incorporated CSPDarknet53, the Mish activation function, and CIoU loss, further optimizing detection speed and accuracy. YOLOv5, released by Ultralytics, streamlined the training process and enhanced model extensibility. YOLOv613 is an object detection framework developed by Meituan’s Visual Intelligence Department, focusing on industrial applications. It integrates the latest network design, training strategies, testing techniques, quantification, and optimization methods, aiming to achieve a balance between high precision and high inference efficiency. YOLOv77 and YOLOv8 advanced upon their predecessors with complex network designs and Transformer-based enhancements, further improving model expressiveness and robustness. YOLOv98 introduced programmable gradient information (PGI) and the generalized ELAN (GELAN) architecture, achieving higher parameter efficiency with a lightweight, fast, and accurate design. YOLOv109, developed by researchers at Tsinghua University, introduced an innovative real-time detection approach that overcame the constraints related to post-processing techniques and the model’s structural design in earlier iterations of YOLO. By removing Non-Maximum Suppression (NMS) and refining various model components, YOLOv10 significantly reduced computational overhead while maintaining exceptional performance and efficiency.

Developed by Ultralytics, YOLOv11 represents the latest generation of YOLO models, integrating substantial improvements in architecture and training methods. Through enhanced model structures, advanced feature extraction techniques, and optimized training protocols, YOLOv11 achieves new state-of-the-art (SOTA) results across diverse object detection tasks14.

Road manhole cover detection

Road manhole cover detection is a crucial task in urban management and maintenance. Traditional detection methods often rely on image processing techniques, such as edge detection and shape analysis. While these methods are effective under controlled conditions, they are susceptible to environmental factors, such as lighting variations and occlusion in real-world scenarios. Recently, the advent of deep learning has significantly improved the accuracy and robustness of manhole cover detection by leveraging convolutional neural networks (CNNs) for feature extraction in conjunction with object detection methods.

Guan et al.15 proposed the MGB-YOLO model, an enhanced version of the YOLOv5s network, for road manhole cover detection. This model integrates MobileNet-V3, GAM, and Bottleneck CSP, achieving a balance between accuracy and efficiency, making it ideal for vehicle-mounted embedded devices. Zhang et al.16 introduced an improved Faster R-CNN model specifically designed for manhole cover detection by optimizing the feature extraction network to handle small objects and adjusting the candidate region generation strategy and loss function. This approach significantly enhanced detection accuracy and computational efficiency. Zhang et al.17 addressed the issue of limited data by employing a data-augmented deep learning model to detect abnormal manhole covers. Through techniques such as flipping, rotation, cropping, and brightness adjustments, they created a diverse dataset that improved model recognition performance across various environments.

Ji et al.18 adopted a multi-sensor fusion approach that combines LiDAR, camera, and GPS data to collect comprehensive road surface information, enabling efficient and accurate manhole cover detection. By utilizing multi-source data fusion and deep learning, their model demonstrated robust performance in complex road environments, successfully identifying both the spatial location and status of manhole covers. Wang et al.19 applied super-resolution reconstruction to UAV-captured images of manhole covers, enhancing image clarity and emphasizing detailed features. By performing deep learning-based classification on these high-resolution images, they accurately identified manhole cover types, achieving improved classification accuracy even on low-resolution images, thereby enhancing system generalization. Similarly, Lukas et al.20 rasterized 3D point cloud data from mobile mapping to facilitate 2D deep learning analysis. Their approach, which employed transfer learning within a fully convolutional neural network (FCNN), leveraged large-scale image data to improve model generalization, yielding promising results despite the limited availability of manhole cover data.

Meanwhile, some research advances in other fields also provide valuable ideas for our study. For example, Huang et al.21 developed a multi-scale feature aggregation and adaptive fusion network for the automatic and accurate segmentation of pavement crack defects. Similarly, DSTNet22 integrates a dual-stream transformer module into a single 2D convolutional neural network (CNN) architecture to improve the classification performance of building cracks while reducing computational costs. CCTNet23 enhances crack recognition by processing more input pixels and combining convolutional channel attention with a window-based self-attention mechanism. In the domain of video analysis, Zhang et al.24 proposed a perceptual video compression framework that incorporates deep learning for efficient compression and salient feature extraction in object detection. These studies demonstrate the potential of deep learning in feature enhancement and compression optimization, providing inspiration for manhole cover detection. In the field of remote sensing, ADSFNet25 achieves accurate detection of changing targets by fusing multi-source data, offering valuable insights for designing multi-dimensional feature fusion strategies in manhole cover defect detection.

Although the above studies have provided ideas for the detection of road manhole cover defects, they often overlook the visual similarity among different defect types and the subtle differences in edge features, leading to limited detection performance. The proposed EEFA-YOLO introduces two novel modules: MSEE and FAP. The MSEE module enhances edge information, improving the model’s sensitivity to subtle differences and effectively capturing edge variations. The FAP module, through feature aggregation and diffusion, ensures that multi-scale contextual information is fully diffused across scales, enhancing the model’s ability to represent features. By addressing challenges such as occlusions and diverse environmental backgrounds, this approach significantly improves the model’s ability to handle real-world complexities, effectively overcoming the limitations of previous methods.

Methodology

Overview of YOLOv11

YOLOv11 stands out for its impressive combination of speed, accuracy, and efficiency, making it one of the most powerful models created by Ultralytics to date. Compared to the YOLOv8 model, YOLOv11 introduces several key improvements, including replacing the C2f module with C3K2, adding a C2PSA module after the SPPF module, and incorporating the head architecture from YOLOv1026. By employing depthwise separable convolutions, YOLOv11 reduces redundant computations and improves efficiency. These improvements enhance the model’s feature extraction capabilities, enabling it to more accurately capture complex patterns and details in challenging environments. The main innovations of YOLOv11 include enhanced feature extraction, optimized speed and efficiency, high accuracy with fewer parameters, environmental adaptability, and support for a wide range of tasks. The architecture of YOLOv11 consists of three main components: the Backbone, the Neck, and the Head, which work in synergy to form an efficient process for feature extraction, feature fusion, and object classification and localization.

The loss function of the YOLO series consists of three parts: classification loss, localization loss (bounding box regression loss), and confidence loss. Classification loss optimizes the accuracy of the model’s category predictions. Localization loss minimizes the difference between predicted bounding boxes and ground-truth bounding boxes, while confidence loss addresses class imbalance in object detection and improves the model’s performance on small targets and difficult samples.
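To make the composition of these terms concrete, the following minimal sketch shows one way the three losses could be combined into a single training objective; the function name and the weighting coefficients are illustrative placeholders and do not reflect YOLOv11’s actual formulation.

```python
def detection_loss(cls_loss, box_loss, conf_loss,
                   w_cls=1.0, w_box=1.0, w_conf=1.0):
    """Weighted sum of the classification, localization, and confidence terms
    described above; the weights are hypothetical and would be tuned in practice."""
    return w_cls * cls_loss + w_box * box_loss + w_conf * conf_loss
```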

Architecture of EEFA-YOLO

This study builds upon YOLOv11 as the baseline model and proposes a novel Edge-Enhanced Feature Aggregation detection network, EEFA-YOLO, tailored to the characteristics of road manhole cover damage detection. Similar to mainstream object detection networks, the proposed EEFA-YOLO consists of three components: the Backbone, the Neck, and the Head27. As illustrated in Fig. 1, the Backbone constructs a five-layer feature pyramid using a CNN to extract features at varying scales. Layers 2 to 5 consist of the MSEE and CBS modules. The MSEE module combines multi-scale feature extraction, edge enhancement, and convolutional operations to extract features at different scales while highlighting edge information, thereby strengthening the resulting feature representation. In the Neck, the FAP module further refines and integrates the features. This module consists of the Feature Aggregation (FA) and C3K2 modules, which aggregate and diffuse features, utilizing skip connections to merge multi-scale features. The Head receives the processed features from the Neck and carries out object detection, predicting object locations and classes.

Fig. 1

Architecture of EEFA-YOLO. The Backbone builds a five-layer feature pyramid for multi-scale feature extraction, while the Neck, utilizing the FAP module, further refines and integrates the features. The Head layer performs target localization and classification.

Figure 2 presents the detailed module structure of EEFA-YOLO, with the novel methods proposed in this paper highlighted in red to emphasize the contributions of this study.

Backbone: Consists of multiple convolutional blocks (CBS), which are used to extract features from the input image. Each CBS block comprises convolution operations (with k representing the kernel size and s representing the stride), BatchNorm, and SiLU layers. To improve the model’s ability to capture edge information, the MSEE module is added at specific layers. This module strengthens edge features, thus improving the extraction of features from road manhole covers with varying shapes and damage degrees. At the bottom, the Backbone integrates SPPF and C2PSA modules, boosting multi-scale feature map representation and enabling better adaptation to various target sizes.

Neck: FAP performs upsampling and downsampling on features from different scales extracted by the Backbone, and merges these multi-scale features using the FA module. This process generates rich, multi-scale feature representations, which are then diffused to subsequent layers. By employing the FA module, the model can capture target information across various scales more effectively, improving detection accuracy in complex backgrounds.

Head: Consists of multiple decoupled detection heads, each independently predicting the target’s class and position. Additionally, the Head utilizes a multi-scale feature fusion strategy, enabling the model to perform target detection at various scales. This further enhances detection performance, particularly in complex or dynamic environments.

Fig. 2

Structure of the EEFA-YOLO Model.

MSEE module

The shapes of road manhole covers range from circular to square, and factors such as material, lighting, occlusion, and damage lead to significant visual differences. Furthermore, the types of road manhole cover defects are complex, including damage, improper coverage, and displacement. Variations in severity, coverage, and other factors make some defect types difficult to distinguish visually, posing challenges for feature extraction. Given these characteristics, edge features can effectively address these issues by highlighting the distinct characteristics of different defect types. Therefore, based on the YOLOv11 backbone, this paper proposes the Multi-Scale Edge Enhancement (MSEE) module, as shown in Fig. 3. This module integrates multi-scale feature extraction, edge information enhancement, and convolution operations. The main goal of the module is to extract features at multiple scales, highlight edge information, and then combine these multi-scale features, ultimately outputting enhanced features through convolutional layers. By combining feature extraction with edge enhancement, the module provides strong representational capability. The module includes the following steps:

Multi-Scale Feature Extraction: Multi-scale pooling is performed using AvgPool to extract local information at different sizes, facilitating the capture of multi-level features of the image.

Edge Enhance: The Edge Enhance (EE) module is specifically designed to extract edge information, enhancing the network’s sensitivity to edges, a crucial aspect of visual tasks such as object detection and semantic segmentation.

Feature Fusion: Features extracted at different scales are aligned to the same scale through interpolation and then concatenated together. Finally, these features are fused into a unified feature representation through convolution layers, enhancing the model’s ability to perceive multi-scale features28.

Fig. 3

Structure of the MSEE Module, (a) Overall Architecture of the MSEE Module, (b) Composition of the Edge Enhance Module.

Figure 3a shows the structure of the MSEE module. Given the input feature map \(F_{\textrm{input}}\), we first obtain multi-scale feature maps through average pooling operations with different window sizes of 3, 6, 9, and 12. This step captures information at different resolutions using multi-scale average pooling. Each of the resulting multi-scale feature maps is then processed by a convolution operation to obtain the feature map \(F_{\mathrm {conv,s_i}}\), which is represented as:

$$\begin{aligned} \begin{aligned} F_{conv,s_i}=\textrm{Conv}(\textrm{AvgPool}_{k=s_i}(F_{\textrm{input}})) \end{aligned} \end{aligned}$$
(1)

here, \(s_i\in \{3,6,9,12\}\) represents the size of the pooling window. After this, the resulting feature maps \(F_{\mathrm {conv,s_i}}\) are passed through the EE module. In the EE module, the feature maps first undergo average pooling, and the pooled result is then subtracted from the original input to extract edge information. This operation allows for the extraction of prominent edge features. The final output of the EE module is obtained by applying a convolution operation and adding the result to \(F_{\mathrm {conv,s_i}}\), the input from the previous layer. The output of the EE module is expressed as \(F_{\mathrm {edge,s_i}}\):

$$\begin{aligned} \begin{aligned} F_{\textrm{edge},s_i}=F_{\textrm{conv},s_i}+Conv(F_{\textrm{conv},s_i}-\textrm{AvgPool}(F_{\textrm{conv},s_i})) \end{aligned} \end{aligned}$$
(2)

The edge-enhanced feature maps \(F_{\mathrm {edge,s_i}}\) are then upsampled to restore them to their original resolution. Afterward, the upsampled feature maps \(F_{\mathrm {up,s_i}}\) from all scales are concatenated with the input feature map \(F_{\textrm{input}}\). A convolution operation is then applied to generate the final output feature map \(F_{\textrm{output}}\), which is expressed as:

$$\begin{aligned} \begin{aligned} F_{\textrm{up},s_i}=\textrm{UpSample}(F_{\textrm{edge},s_i})\\F_{\textrm{output}}=\textrm{Conv}(\textrm{Concat}\big (\textrm{Conv}(F_{\textrm{input}}),F_{\textrm{up},3},F_{\textrm{up},6},F_{\textrm{up},9},F_{\textrm{up},12}\big )) \end{aligned} \end{aligned}$$
(3)

The MSEE module aims to preserve spatial resolution while improving the representation of object contours and boundaries by selectively enhancing edge features. It operates by leveraging edge-aware filters and multi-resolution processing, ensuring the retention of both low-level edge details and high-level semantic features. This approach enhances the model’s ability to accurately detect road manhole covers in complex road environments, where subtle visual differences and various environmental factors can challenge detection accuracy.
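To make the computation in Eqs. (1)–(3) concrete, the following PyTorch sketch gives one possible implementation of the MSEE module. The pooling stride (set equal to the window size), the 1\(\times\)1 and 3\(\times\)3 kernel choices, and the bilinear upsampling mode are our assumptions, as the paper specifies only the overall structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeEnhance(nn.Module):
    """Edge Enhance (EE) block, Eq. (2): subtract a locally averaged map from the
    input to expose edges, refine the edge map with a conv, and add it back."""

    def __init__(self, channels, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv(x - self.pool(x))


class MSEE(nn.Module):
    """Multi-Scale Edge Enhancement sketch, Eqs. (1)-(3): multi-scale average
    pooling, per-scale conv + edge enhancement, upsampling, and a fusion conv."""

    def __init__(self, in_ch, out_ch, pool_sizes=(3, 6, 9, 12)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), EdgeEnhance(in_ch))
             for _ in pool_sizes])
        self.input_conv = nn.Conv2d(in_ch, in_ch, 1)
        self.fuse = nn.Conv2d(in_ch * (len(pool_sizes) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [self.input_conv(x)]
        for size, branch in zip(self.pool_sizes, self.branches):
            y = F.avg_pool2d(x, kernel_size=size, stride=size)   # Eq. (1)
            y = branch(y)                                        # Conv + EE, Eq. (2)
            outs.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                      align_corners=False))      # upsample, Eq. (3)
        return self.fuse(torch.cat(outs, dim=1))                 # concat + fusion conv
```

Under these assumptions, `MSEE(256, 256)(torch.randn(1, 256, 40, 40))` returns an edge-enhanced feature map with the same spatial resolution as the input.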

FAP module

Multi-scale feature fusion is critical to improving detection performance in object detection tasks. However, directly combining features from different scales is challenging due to disparities in representation, resolution, and semantics. Traditional methods like feature stacking or concatenation overlook the complementary nature and scale differences, limiting fusion effectiveness. They typically fuse features at a single level, ignoring multi-level semantic and resolution information, and the importance of features across different scales29.

To address these issues, we propose a FAP module, deployed in the network’s Neck section. The FAP module efficiently weights features at various scales, integrates local and contextual information, and propagates it across scales, significantly boosting feature fusion. With FAP, our model excels in multi-scale feature fusion, enhancing detection accuracy and generalization for road manhole cover detection. The structure of the FAP module is shown in Fig. 2, consisting of the FA, C3K2, and feature diffusion mechanisms. The FA module receives inputs from three different scales and aggregates multi-channel information through a series of parallel depthwise convolution operations. The feature diffusion mechanism, comprising the C3K2 module, upsampling, and concatenation operations, allows features rich in contextual information to be effectively diffused across scales, facilitating subsequent object detection and classification tasks.

Fig. 4

Submodule Composition of the FAP Module (a) FA Module, (b) C3K2 Module.

The FA module, as shown in Fig. 4, aligns the high-dimensional features \(C_h\in \mathbb {R}^{H_h\times W_h\times C_h}\), the current-layer features \(C_p\in \mathbb {R}^{H\times W\times C}\), and the low-dimensional features \(C_l\in \mathbb {R}^{H_l\times W_l\times C_l}\) using operations such as Conv and ADown. After this, the aligned features are concatenated to obtain \(C_u\), represented as:

$$\begin{aligned} \begin{aligned} \mathrm {C_u=Concat(Conv(C_h),Conv(C_p),ADown(C_l))} \end{aligned} \end{aligned}$$
(4)

Where Conv(·), Concat(·), and ADown(·) denote convolution, concatenation, and down-sampling, respectively. The resultant features undergo parallel depthwise convolutions (DWConv) with kernel sizes 5, 7, 9, and 11 to capture contextual information, avoiding sparsity by not using dilated convolutions. A 1\(\times\)1 convolution then combines local and contextual features, integrating varying receptive field sizes. The final output is:

$$\begin{aligned} \begin{aligned} Z=\mathrm {Conv}_1\left( \mathrm {C_u}+\sum _{i=2}^5\mathrm {DWConv}_{2i+1}(\mathrm {C_u})\right) \end{aligned} \end{aligned}$$
(5)

The FAP module handles multi-scale inputs, selects feature dimensions adaptively, captures extensive contextual information, and maintains feature integrity during fusion. The diffusion mechanism distributes rich features across scales, enhancing detection accuracy.
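The sketch below gives one possible PyTorch realization of the FA module in Eqs. (4) and (5). The 1\(\times\)1 alignment convolutions, the nearest-neighbour upsampling of the deeper input, and the simplified stride-2 ADown block are assumptions; the paper specifies only the overall aggregation structure and the depthwise kernel sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ADown(nn.Module):
    """Simplified stand-in for the ADown down-sampling block (a stride-2 conv);
    the original ADown design is not reproduced here."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)


class FeatureAggregation(nn.Module):
    """FA module sketch: align three scales and concatenate them (Eq. 4), then
    mix with parallel depthwise convs of kernel sizes 5/7/9/11 plus a residual,
    followed by a 1x1 convolution (Eq. 5)."""

    def __init__(self, ch_high, ch_cur, ch_low, out_ch):
        super().__init__()
        self.align_high = nn.Conv2d(ch_high, out_ch, 1)  # deeper, lower-resolution input
        self.align_cur = nn.Conv2d(ch_cur, out_ch, 1)    # current-level input
        self.align_low = ADown(ch_low, out_ch)           # shallower, higher-resolution input
        cat_ch = out_ch * 3
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(cat_ch, cat_ch, k, padding=k // 2, groups=cat_ch)
             for k in (5, 7, 9, 11)])
        self.fuse = nn.Conv2d(cat_ch, out_ch, 1)

    def forward(self, c_high, c_cur, c_low):
        h, w = c_cur.shape[-2:]
        x_h = F.interpolate(self.align_high(c_high), size=(h, w), mode="nearest")
        c_u = torch.cat([x_h, self.align_cur(c_cur), self.align_low(c_low)], dim=1)  # Eq. (4)
        z = c_u + sum(dw(c_u) for dw in self.dwconvs)                                # Eq. (5)
        return self.fuse(z)
```

For example, `FeatureAggregation(512, 256, 128, 256)` would fuse a 20\(\times\)20 deep map, a 40\(\times\)40 current-level map, and an 80\(\times\)80 shallow map into a single 40\(\times\)40 output, which the diffusion path can then redistribute across scales.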

Experiments

Dataset

The dataset used in this study is partially derived from the dataset designated for the China College Student Service Outsourcing Competition, with additional data collected from online sources and manually captured images in real-world scenarios. In total, the dataset consists of 1,300 images of road manhole covers, covering three common types of defects: broken, lost, and misaligned. Furthermore, there are significant variations in the severity, area, and aspect ratio of each defect type. The dataset is split into training, validation, and test sets with a 7:2:1 ratio. Together with intact covers, the targets are categorized into four classes, and Fig. 5 shows the number of instances and target sizes for each class.
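As a minimal illustration of the split described above (the shuffling and seeding strategy is our assumption rather than a detail reported here):

```python
import random

def split_dataset(image_paths, seed=0):
    """Split a list of image paths into train/val/test with the 7:2:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.7 * len(paths)), int(0.2 * len(paths))
    return (paths[:n_train],                  # ~70% training
            paths[n_train:n_train + n_val],   # ~20% validation
            paths[n_train + n_val:])          # ~10% test
```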

Fig. 5

Dataset Categories and Target Sizes. The left image represents the distribution of the dataset categories, Class_0: Good, Class_1: Broken, Class_2: Lost, and Class_3: Misaligned. The right image shows the proportion of target pixels in the images.

Figure 6 shows some example images from the dataset. “Good” indicates that the manhole cover is intact and in place, without any damage or loss. “Broken” denotes that the cover has visible damage or cracks but is still in place. “Lost” indicates that the manhole cover is completely missing, exposing the opening. “Misaligned” means the cover has shifted or been opened, with the opening partially or completely exposed. From the images, it can be seen that intact and misaligned covers may appear visually similar, especially when the cover has only slightly shifted, which makes classification challenging. The shape and position of the cover can also vary with viewing angle, further complicating detection. For example, some covers may be partially or fully obscured by irregular objects, and outdoor lighting conditions may affect image clarity and cover visibility, particularly under shadows or uneven lighting. Moreover, varying backgrounds such as grass, sidewalks, or roads can introduce additional interference for detection. Broken covers may have slight cracks or extensive damage, causing significant variation within the broken class and making it difficult for methods to consistently recognize covers with different degrees of damage. Some covers with minimal damage could be misclassified as good, while some misaligned covers could be mistakenly identified as good because most of the opening remains covered, leading to blurred classification boundaries. These challenges pose significant difficulties for detection methods.

Fig. 6

Image Dataset of Four Types of Manhole Covers.

Evaluation metrics and experimental environment

Evaluation metrics

In this experiment, network performance is assessed using precision (P), recall (R), average precision (AP), and mean average precision (mAP). All predictions are treated as positive samples: true positives (TP) are correctly detected targets, false positives (FP) are incorrect detections, and false negatives (FN) are missed targets.

Precision measures the probability of correct predictions among all predictions, evaluating the accuracy of the algorithm’s outputs. Recall reflects the ratio of correctly predicted results to the actual occurrences, assessing the algorithm’s ability to detect all target objects. These metrics correspond to the probabilities of false detection and missed detection, respectively.

$$\begin{aligned} \begin{aligned} precision=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}\\ recall=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}} \end{aligned} \end{aligned}$$
(6)

The average precision (AP) for each class is the area under the precision-recall (P-R) curve, obtained by varying the confidence threshold over the predictions. Mean average precision (mAP) is the average of AP values across all classes. The formulas for these metrics are:

$$\begin{aligned} \begin{aligned} mAP=\frac{\sum _{n=1}^NAP_n}{N} \end{aligned} \end{aligned}$$
(7)
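For reference, the sketch below computes these quantities from detection counts and per-class results; the all-point interpolation of the P-R curve is an assumption, since the exact integration scheme is not specified.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, Eq. (6)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """AP as the area under the P-R curve (all-point interpolation assumed)."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, float)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, float)[order], [0.0]))
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class APs, Eq. (7)."""
    return sum(ap_per_class) / len(ap_per_class)
```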

Experimental environment

The training parameters are configured as follows: the model is trained for 300 epochs with a batch size of 16. The training utilizes 8 worker threads, and the model is optimized using the SGD optimizer with a learning rate scheduler. The initial learning rate is set to 1 \(\times\) \(10^{-3}\), with a minimum value of 1 \(\times\) \(10^{-5}\). To mitigate overfitting, a weight decay strategy is applied, with a weight decay coefficient of 5 \(\times\) \(10^{-4}\) and a momentum of 0.937. Additionally, an Early Stopping strategy is employed during training.
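A minimal PyTorch sketch of this optimizer configuration is shown below; the cosine annealing schedule and the stand-in model are assumptions, since only the initial and minimum learning rates, momentum, and weight decay are reported.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # hypothetical stand-in for the EEFA-YOLO network

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.937, weight_decay=5e-4)
# Decay the learning rate from 1e-3 toward the reported floor of 1e-5.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=300, eta_min=1e-5)

for epoch in range(300):
    # ... training passes with batch size 16 (and early stopping) would go here ...
    scheduler.step()
```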

The experimental environment consists of the Red Hat 4.8.5-28 operating system, PyTorch 2.1, Python 3.11, and CUDA 12.1. The training is carried out on a cluster with four A800 GPUs, each equipped with 256GB of memory. To accelerate the experimental process while maintaining result accuracy, various levels of data augmentation are applied based on the model size. With a 640\(\times\)640 pixel, three-channel input, the proposed method contains a total of 565 layers, 2.68 million parameters, and a computational cost of 7.6 GFLOPs. After 300 training epochs, the final model size is 5.6 MB.

Ablation study

To validate the effectiveness of each module in the proposed method, ablation experiments were conducted on the road manhole covers dataset using YOLOv11 as the baseline. The results of these experiments are presented in Table 1.

Table 1 Ablation Study on Road Manhole Covers Dataset.

Table 1 presents the results of the ablation study conducted on the road manhole covers dataset. Without MSEE and FAP modules, the baseline model achieved a precision of 86, recall of 82.5, mAP_0.5 of 89.3, and mAP_0.5:0.95 of 77.7. This performance is primarily due to YOLOv11’s inherent feature extraction capabilities, which allow it to capture major information in the images and achieve a mAP_0.5 of 89.3 on the road manhole covers dataset. After adding the MSEE module, precision slightly decreased to 84.3, but recall increased to 83.5, with mAP_0.5 rising to 91.1 and mAP_0.5:0.95 improving to 79.1. The MSEE module enhances the model’s sensitivity to edge features, enabling it to capture finer details, which improves recall and mAP. However, the focus on edge enhancement might lead to some false positives, explaining the slight drop in precision. Enabling the FAP module resulted in a precision of 83.2, recall of 83.2, mAP_0.5 of 90.6, and mAP_0.5:0.95 of 78.9. The FAP module, through feature aggregation and diffusion mechanisms, helps the model better integrate contextual information, enhancing its adaptability to multi-scale targets. This improvement boosts recall and mAP, although precision slightly decreased compared to the baseline. When both MSEE and FAP modules were enabled, precision significantly improved to 88.8, recall increased to 85.1, mAP_0.5 rose to 91.7, and mAP_0.5:0.95 reached 79.7. This combination produced the best results, highlighting the synergistic effect between the MSEE and FAP modules. MSEE enhances the model’s perception of details and edges, while FAP further integrates local and global information, significantly improving the model’s performance in complex backgrounds.

Comparative experiment

To further analyze the effectiveness of our method, feature maps for two manhole cover images are shown in Fig. 7. The first row shows the misaligned image, with the feature map generated from the fifth layer of the backbone network. The second row shows the broken damage case, with the feature map after feature fusion in the Neck module. The color intensity represents different levels of activation strength, with green and yellow regions indicating higher activation, which corresponds to features important for model decision-making. From the first row, it can be observed that our method, by utilizing the MSEE module in the backbone, extracts more edge information. The activation regions are clearer compared to the baseline method, indicating that our approach is better at focusing on edge information. This enhances the model’s ability to differentiate visually similar damage types. In the second row, after feature fusion in the Neck, the feature map produced by our method exhibits clearer hierarchical feature representations with richer activation regions, demonstrating a higher sensitivity to different types of damage. Visually, the feature maps from our method focus more on specific image areas, allowing for the extraction of more distinct features. The baseline shows more scattered activation, suggesting that it does not pay enough attention to the target regions, which impacts detection accuracy, resulting in false positives or missed detections.

Fig. 7

Comparison of the feature maps generated by different methods. (a) shows two images from the manhole cover dataset, (b) Feature maps generated by the Baseline method, and (c) Feature maps produced by our method.

As observed in the figure, the feature maps generated by our method contain richer detail information, exhibit more prominent hierarchical structures, and capture a substantial amount of fine-grained features, demonstrating the effectiveness of our approach in focusing on critical details.

Fig. 8

Detection results of EEFA-YOLO on the manhole cover dataset.

Figure 8 illustrates the detection results for the four categories of manhole cover defects in the dataset. Some of these categories are visually very similar, such as “Good” and “Misaligned”. The third row includes images containing two targets, where the more distant target is distorted due to the shooting angle. Our method can detect these cases accurately, demonstrating that it maintains high detection precision and robustness for manhole cover defects, even under challenging conditions such as perspective distortions or visual similarities between defect categories. This highlights the superior detection capability of our method in handling complex real-world scenarios.

Fig. 9

Training and validation losses, precision, recall, and mAP metrics of the EEFA-YOLO over 300 epochs.

Figure 9 illustrates the training and validation performance of the EEFA-YOLO model. The training and validation losses for bounding box regression, classification, and distribution focal loss (top left) show a consistent downward trend, indicating effective model convergence. Precision, recall, and mAP metrics (top right and bottom row) steadily improve and stabilize, demonstrating the model’s robust performance in accurately detecting and classifying targets. These results confirm that the model achieves a good balance between fitting the training data and generalizing to unseen data.

Fig. 10

Multi-class ROC curve for EEFA-YOLO.

Figure 10 illustrates the ROC curves for our task using the EEFA-YOLO model. Each curve represents the performance of the model for a specific class, evaluated by the true positive rate (TPR) versus the false positive rate (FPR). The area under the curve (AUC) values for Class_0, Class_1, Class_2, and Class_3 are 0.89, 0.82, 0.97, and 0.83, respectively, indicating high predictive accuracy for Class_2 and robust performance across all classes compared to random guessing (dashed line).

Table 2 presents a comparison of the detection performance (COCO metrics) of various methods on the Road Manhole Covers test set, including Faster-RCNN, Cascade-RCNN, YOLOX, YOLOv8n, YOLOv10n, YOLOv11n, and our proposed EEFA-YOLO. From the table, it can be observed that EEFA-YOLO outperforms most of the other methods across most of the metrics, particularly in mAP_50 and mAP_m, demonstrating its superior ability to handle medium and large-scale targets effectively. This highlights the efficacy of the proposed MSEE and FAP modules in enhancing feature extraction and improving detection performance for manhole covers with varying sizes and conditions.

Table 2 Performance Comparison of Detection Methods on the Road Manhole Cover Dataset.

The mAP_50 of EEFA-YOLO reaches 0.905, significantly higher than the other methods, particularly Faster-RCNN and Cascade-RCNN, demonstrating its robustness under a lower IOU threshold. The mAP_75 is 0.869, slightly lower than YOLOv8n but still surpassing the other methods. The mAP_m is 0.778 and the mAP_l is 0.771, both superior to the compared methods, with mAP_m achieving the best result. This indicates that EEFA-YOLO performs well in detecting both medium and large-scale targets.

This improvement is primarily attributed to EEFA-YOLO’s unique feature extraction modules, which enhance the model’s ability to capture features from the target, allowing for better recognition and differentiation of detection objects. The significant advantage in mAP_50 suggests that the model is capable of detecting targets with higher precision under lower IOU thresholds, likely due to its specialized network architecture that enables more accurate learning of target boundaries. The good performance on medium and large-scale targets is likely a result of the network structure or feature pyramid design, which allows the model to maintain high feature discriminability across different scales.

Conclusion

This paper presents an enhanced YOLO-based detection model, EEFA-YOLO, to address the challenges faced in road manhole cover detection, such as complex backgrounds, diverse shapes, and varying damage types. By introducing the MSEE and FAP modules, EEFA-YOLO greatly enhances the model’s capability to detect visually similar targets in challenging environments. The MSEE module enhances edge information, enabling the model to more accurately recognize manhole covers in different conditions, especially in abnormal scenarios such as breakage, loss, and misalignment. The FAP module aggregates and diffuses multi-scale features, allowing the model to better adapt to target scale variations while suppressing background interference. Experimental results demonstrate the significant advantages of EEFA-YOLO across multiple evaluation metrics, particularly in mAP_50 and medium-scale detection accuracy (mAP_m). When compared with mainstream detection models and various versions of YOLO, EEFA-YOLO achieves superior detection performance. Therefore, the proposed EEFA-YOLO provides effective and reliable technical support for the detection of road manhole covers in smart city management, contributing to enhanced road safety and maintenance efficiency. It also offers an innovative approach and reference for future intelligent infrastructure detection systems.

Although EEFA-YOLO demonstrates superior performance in road manhole cover detection, several issues and limitations remain. For instance, while the introduction of the MSEE and FAP modules effectively enhances detection accuracy, it also increases computational complexity and model size. In real-time applications, such as dynamic monitoring and instant alerts, the inference speed of EEFA-YOLO may not meet the requirements. While the method performs well on multi-scene test datasets, its robustness and generalization ability have not been fully validated under extreme lighting conditions (such as night and backlight) or adverse weather conditions (such as rain and snow). Future work could optimize EEFA-YOLO through techniques such as model pruning, quantization, and knowledge distillation to reduce computational complexity and enhance inference speed, making it more suitable for real-time applications. Additionally, employing data augmentation techniques (such as synthesizing data under extreme weather and lighting conditions) or designing more robust feature extraction modules could further improve the model’s performance in complex environments. Introducing adaptive detection mechanisms that allow the model to dynamically adjust parameters in response to environmental changes could also enhance its robustness and adaptability.