Introduction

Remote sensing images are pivotal in disciplines such as geographic information systems (GISs), environmental monitoring, disaster assessment, and urban planning, where they are extensively applied in ground object recognition, resource monitoring, risk prediction, and urban development decision-making1. However, the inherent high dimensionality and large-scale nature of remote sensing data pose significant challenges for the efficient and accurate detection of ground objects, indicating an urgent need for innovative solutions2. In response to these challenges, automated and intelligent ground object detection techniques have gained increasing prominence. Leveraging their superior pattern recognition capabilities, machine learning and deep learning methods can autonomously extract complex spatial features from remote sensing images, greatly enhancing detection accuracy and efficiency. These approaches not only adaptively learn latent patterns within the data but also deliver precise insights into the types, distributions, and dynamic trends of ground objects3. Additionally, the integration of real-time data processing with multisource data fusion has substantially improved the analytical efficiency and practical applications of remote sensing images, facilitating more comprehensive and precise ground object monitoring4. By contrast, traditional methods that rely on manually designed features and classifiers often fall short in addressing the complexity, dynamics, and scale of high-resolution remote sensing images, and their limited detection speed and accuracy fail to meet the demands of real-world applications. Therefore, advancing automation and intelligent technologies for remote sensing image target detection is essential for increasing the efficiency and accuracy of practical applications5.

The rapid advancement of artificial intelligence has brought machine learning and deep learning to the forefront of remote sensing image analysis, where they offer notable advantages such as automated feature extraction, high detection precision, and rapid processing of large-scale datasets6,7. As a result, the application of AI technologies to ground object detection and recognition in remote sensing images has become a research focus, particularly with respect to improving detection efficiency, enhancing model generalization, and addressing complex scenarios8,9. Machine learning, a key branch of artificial intelligence, predicts and makes decisions by uncovering patterns in data. Its applications in remote sensing image detection include traditional methods such as support vector machines (SVMs)10, random forests (RFs)11, K-nearest neighbors (KNNs)12, and linear regression13, which rely on manually extracted features to convert image pixels or blocks into feature vectors for classification on the basis of spectral or texture characteristics. For example, Lizarazo14 evaluated SVM-based automated classification on hyperspectral datasets, noting its high accuracy through parameter tuning but limited capability for automated optimization, whereas Van der Linden et al.15 reported accuracy degradation in segmentation-based classification of urban hyperspectral data as the segmentation size increased. Other studies, such as those by Liu et al.16 and Jin et al.17, demonstrated the efficacy of RF models on high-resolution imagery, achieving kappa coefficients of 0.93 and supporting generalization through comprehensive training, and Huang et al.18 and Costache et al.19 further showed that KNN-based methods refine classification accuracy for hyperspectral imagery. Furthermore, supervised and unsupervised learning techniques such as K-means clustering20 and principal component analysis (PCA)21 excel at analyzing the statistical characteristics of data with or without labeled datasets: UAV image classification combining K-means with thresholding has achieved high accuracies22, unsupervised approaches such as Lk-means based on Lévy flight trajectories23 have demonstrated superior clustering performance, and PCA has been effectively applied to rank critical parameters for soil erosion analysis24 and to identify peatland degradation via Landsat-8 data25. Additionally, ensemble learning, which integrates predictions from multiple classifiers such as AdaBoost and XGBoost, has been shown to improve detection accuracy in remote sensing: Kang et al.26 optimized AdaBoost for high-resolution imagery to achieve a kappa coefficient of 0.83, Hu et al.27 and Xiao et al.28 demonstrated that ensemble methods enhance classification accuracy on specific datasets, and Zheng et al.29 achieved notable results in salt marsh vegetation classification via XGBoost with time series interpolation. Although traditional machine learning methods have improved detection accuracy, their reliance on feature engineering presents challenges for high-dimensional data. Deep learning methods, including CNNs, RNNs, GANs, and GNNs, and particularly CNNs, address these issues by autonomously learning data features; for instance, recent advances such as attention mechanisms in EfficientNet-B3 models30 and CNN-Transformer hybrids31 have further improved classification performance across diverse datasets, underscoring the transformative potential of deep learning in remote sensing applications.

Convolutional neural networks (CNNs) are specialized neural networks designed for processing image data that excel at extracting key features. Building upon CNNs, the You Only Look Once (YOLO) system is an end-to-end object detection framework that formulates detection as a unified regression task. This approach enables simultaneous prediction of object categories and bounding boxes in a single forward pass, simplifying the detection pipeline and significantly enhancing both speed and efficiency. By integrating feature extraction and various detection components, YOLO achieves high precision and speed, making it particularly suitable for real-time applications32. Research on YOLO has advanced through successive versions. In YOLO-V3, Qu et al.33 enhanced real-time performance and accuracy for remote sensing images by incorporating techniques such as image segmentation, DIoU loss functions, CBAM, and ASFF, achieving mAP and frame rate improvements of 5.36% and 3.07 FPS, respectively. Similarly, Yang et al.34 improved the precision, recall, and F1 scores for airplane detection in remote sensing images by optimizing the loss function and IoU evaluation, with increases of 12.57%, 5.11%, and 0.0863, respectively. Subsequent iterations, such as YOLO-V4, built upon these foundations; Zakria et al.35 improved the detection performance on the DOTA dataset by introducing optimized classification and anchor box schemes. Further advancements, including CNN pruning strategies and novel loss functions, enabled models such as the multiconvolution YOLO-V4 by Shen et al.36 to achieve a balance of accuracy and speed on challenging datasets such as DOTA and DOTA V2. YOLO-V5 introduced new innovations such as K-Means++ for anchor box optimization and a dual IoU decoupling head (DDH), resulting in significant mAP improvements on datasets such as DIOR37. Enhanced small-target detection and integration with CBAM modules were also reported by Ding et al.38, who achieved 88.5% accuracy at 26 FPS. In YOLO-V7, models such as CURI-YOLO-V739 were optimized for UAV remote sensing applications, achieving superior real-time performance and adaptability for embedded systems. Similarly, YOLO-V8 introduced groundbreaking algorithms such as YOLO-GE40, which combines ghost convolutions with CSP strategies to increase the robustness of small-target detection. This version also demonstrated its versatility across applications, from subsidence funnel detection41 to postdisaster damage assessment42. The most recent versions, such as YOLO-V10 and YOLO-V11, continue to refine these approaches with lightweight architectures such as GAS-YOLO43 and segmentation-focused models such as YOLO-V11-Seg44, showing exceptional performance in resource-constrained and complex scenarios. Collectively, these advancements underscore the transformative potential of the YOLO series in high-speed, high-precision object detection tasks.

This study employs the YOLOv11 model for training and detecting ground targets in high-resolution remote sensing images, systematically investigating its potential to improve detection accuracy and efficiency. Through a series of detection experiments targeting diverse object categories within remote sensing data, the results confirm YOLOv11’s exceptional capability in processing large-scale datasets, navigating complex backgrounds, and addressing multitarget scenarios. The study further evaluates the model’s adaptability and robustness in high-resolution imagery, highlighting its marked advantages in precise target localization and enhanced detection efficiency. These findings provide valuable theoretical insights and practical guidance for advancing the intelligent application of target detection technologies in remote sensing image analysis.

Overview of the YOLOv11 algorithm

Algorithm overview

YOLOv11, developed by the Ultralytics team, represents the latest advancement in the YOLO series of object detection models, building upon the classic YOLO architecture with groundbreaking innovations and enhancements. Designed as a highly efficient, accurate, and robust detection system, YOLOv11 demonstrates exceptional performance in handling complex environments, dynamic scenes, and small object detection tasks. The model integrates state-of-the-art technical modules, including C3k2 blocks, spatial pyramid pooling fusion (SPPF), and cross-scale pixel spatial attention (C2PSA), which significantly enhance its feature extraction capabilities and improve its multiscale target recognition accuracy. Furthermore, these innovations strengthen the model’s adaptability to diverse backgrounds and targets of varying sizes, achieving an optimal balance between precision and computational efficiency. In addition to its core detection capabilities, YOLOv11 expands its functionality through enhanced multitask processing, enabling seamless execution of advanced tasks such as instance segmentation and pose estimation alongside traditional object detection. These advancements make YOLOv11 a versatile solution for a wide range of applications, including intelligent monitoring, autonomous driving, and UAV vision systems. By providing superior performance across these domains, YOLOv11 sets a new benchmark for efficient and intelligent object detection45.

As illustrated in Fig. 1, the YOLO series models are compared on the COCO dataset in terms of detection accuracy (mAP50-95) and inference latency (ms/img), highlighting the performance advantages of YOLOv11. As the latest evolution in the YOLO series, YOLOv11 demonstrates outstanding performance across all latency ranges. Specifically, it achieves a peak mAP50-95 of approximately 54, representing a 1.5–2 percentage point improvement over YOLOv10 and a 6–10 percentage point gain over YOLOv5 and YOLOv6-3.0, with particularly notable accuracy in the medium-to-high latency range. In low-latency scenarios (2–4 ms), YOLOv11 maintains a detection accuracy above 52, significantly outperforming models such as YOLOv9, YOLOv10, and PP-YOLOE+, underscoring its real-time efficiency. In high-latency scenarios (8–14 ms), while the performance curve stabilizes, YOLOv11 retains its accuracy advantage, reflecting exceptional stability and robustness. The incorporation of innovative modules such as SPPF and C2PSA enhances YOLOv11’s ability to detect multiscale targets, offering a distinct advantage in small object detection and handling complex scenes. These advancements allow YOLOv11 to achieve a superior balance between accuracy and latency, surpassing its predecessors.

Fig. 1 Comparison of mAP and inference latency for YOLO series models on the COCO dataset46.

In addition to these comparisons, Shi et al.47 analyzed single-stage and two-stage detectors, summarizing the strengths and weaknesses of representative algorithms such as YOLO, CornerNet, R-CNN, SPPNet, and AutoFPN. As detailed in Table 1, another state-of-the-art object detection framework, RT-DETR, which leverages a transformer-based end-to-end design, was benchmarked against the best-performing YOLO models. On the COCO dataset, YOLOv11-x achieves an mAP50-95 of 54.7%, second only to YOLOv9-e (55.6%) and surpassing YOLOv10-x (54.4%) as well as earlier YOLO versions, demonstrating notable accuracy gains. Moreover, YOLOv11-x stands out with a parameter count of 56.9 M, which is significantly lower than those of YOLOv8-x (68.2 M) and RT-DETR-R101 (76.0 M), while remaining close to YOLOv9-e (58.1 M). This lightweight design enhances computational efficiency and reduces hardware resource requirements. For example, YOLOv10-x, with the lowest parameter count of 29.5 M, exhibits slightly lower accuracy (by 0.3 percentage points) than YOLOv11-x, making it suitable for less accuracy-critical applications. In contrast, models such as YOLOv8-x and RT-DETR-R101, while offering similar accuracies, have significantly higher parameter counts, resulting in increased resource demands.

Table 1 Performance comparison of YOLO series models and RT-DETR-R101 on the COCO dataset46.

From a computational perspective, YOLOv11-x achieves a FLOP count of 194.9B, representing a 25% reduction relative to RT-DETR-R101 (259.0B) and aligning closely with YOLOv9-e (192.5B), demonstrating its high computational efficiency. With a moderate parameter count, YOLOv11-x achieves an optimal balance between accuracy and model size, significantly enhancing its practical applicability. Overall, YOLOv11-x exemplifies a well-optimized trade-off among the parameter count, computational complexity, and detection accuracy. By maintaining a high mAP50-95 (54.7%) while reducing its parameter count relative to most comparable models, YOLOv11-x enhances deployment flexibility and inference efficiency, making it particularly well suited for real-time and resource-constrained applications. Its design not only reflects cutting-edge advancements in object detection but also offers valuable insights for academic research and real-world implementations60.

Network architecture

As illustrated in Fig. 2, YOLOv11’s network architecture is composed of three key components, the backbone, neck, and head, which together form a comprehensive pipeline for object detection, spanning from input preprocessing to final output generation. Initially, the raw input image undergoes preprocessing steps, such as normalization and resizing, to align with the model’s input requirements. The preprocessed image is then fed into the backbone, the core feature extraction module, where features are progressively refined through multiple layers, including convolutional layers (Conv), C3K2 modules, SPPF modules, and C2PSA modules. The Conv module handles initial feature processing, whereas the C3K2 module employs a multibranch architecture to extract both fine-grained details and deep hierarchical information. The SPPF module integrates multiscale contextual information through pyramid pooling, and the C2PSA module further enhances feature representation by leveraging cross-scale pixel space attention, increasing the model’s ability to focus on critical target regions. These feature maps, which contain rich local and global information, are subsequently passed to the neck for advanced integration.

Fig. 2 Structure diagram of the YOLOv11 network48.

The neck is designed to aggregate and enhance the multiscale features extracted by the backbone. Through operations such as upsampling and feature concatenation, it combines local details with the global semantic context. Additionally, the integrated features are further refined by the C3K2 module, ensuring higher-quality feature representations. The enhanced feature maps generated by the neck are then routed to the head for the final detection tasks.

The head component of YOLOv11 is responsible for object classification and bounding box regression. Using a multibranch design, the head processes the multiscale features provided by the neck, where each branch is optimized to detect objects of specific sizes, including small, medium, and large targets. By employing nonmaximum suppression (NMS), the head produces the final detection outputs, encompassing the object categories, bounding box coordinates, and confidence scores.

This streamlined pipeline, encompassing input preprocessing, feature extraction, fusion, and detection, ensures seamless collaboration among modules with distinct yet complementary roles. This design significantly enhances object detection accuracy and efficiency while also improving the model’s adaptability to multiscale targets and complex scenes. These innovations enable YOLOv11 to address the challenges of modern object detection tasks effectively in diverse scenarios48.
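To make this pipeline concrete, the short sketch below shows how a pretrained YOLOv11 model can be loaded and run for inference through the Ultralytics Python API, which wraps the backbone-neck-head pipeline described above. The weight file name follows Ultralytics' public naming and the image path is a placeholder; neither is taken from this study.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv11 model (file name follows Ultralytics' convention;
# the exact checkpoint used in this study may differ).
model = YOLO("yolo11x.pt")

# Run inference on a remote sensing image (path is a placeholder).
results = model.predict(source="example_rs_image.jpg", imgsz=1280, conf=0.25)

# Each result holds bounding boxes, class indices, and confidence scores
# produced by the head after non-maximum suppression.
for r in results:
    for box in r.boxes:
        cls_id = int(box.cls[0])                 # predicted class index
        score = float(box.conf[0])               # confidence score
        x1, y1, x2, y2 = box.xyxy[0].tolist()    # bounding box corners (pixels)
        print(cls_id, score, (x1, y1, x2, y2))
```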

Main module

Conv module

The Conv module, as illustrated in Fig. 3, comprises three key components: a two-dimensional convolutional layer (Conv2d), a batch normalization layer (BatchNorm2d), and the SiLU activation function. Conv2d is responsible for performing convolution operations on the input feature maps, extracting local spatial features such as edges and textures to provide the foundational representations for subsequent processing. To stabilize the outputs from Conv2d, BatchNorm2d is applied to normalize the feature maps, which not only accelerates model convergence but also alleviates issues such as gradient vanishing or exploding, thereby improving the model’s generalization ability. Finally, the SiLU activation function enhances feature expressiveness by retaining more information from negative values. Compared with the traditional ReLU activation function, SiLU excels in capturing intricate details and modeling complex nonlinear relationships, making it particularly suitable for tasks that require fine-grained feature extraction and higher representational capacity49.

Fig. 3 Conv module49.
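As a minimal illustration of the structure described above, the following PyTorch sketch stacks Conv2d, BatchNorm2d, and SiLU in the same order; the channel counts and kernel size are illustrative choices, not values taken from the YOLOv11 source.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv2d + BatchNorm2d + SiLU, mirroring the Conv module described above."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # stabilizes activations and speeds convergence
        self.act = nn.SiLU()               # smooth activation retaining negative information

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Example: a 640 x 640 RGB image mapped to 64 feature channels
x = torch.randn(1, 3, 640, 640)
y = ConvBlock(3, 64)(x)
print(y.shape)  # torch.Size([1, 64, 640, 640])
```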

C3K2 module

The C3K2 module serves as the core feature extraction component of YOLOv11 and is designed to strike a balance between feature representation capability and computational efficiency by leveraging a multibranch architecture and residual connections. As depicted in Fig. 4, the module supports two configurations—c3k = True and c3k = False—to adapt to different task requirements. In the c3k = True configuration, the module utilizes a lightweight C3k branch, where the input features are split into multiple groups through a split operation, processed independently, and subsequently fused via feature concatenation (Concat). This lightweight design significantly reduces computational complexity, making it particularly suitable for real-time applications. Conversely, in the c3k = False configuration, the module employs a more advanced bottleneck structure (Fig. 5), incorporating additional convolutional and activation layers to extract deeper, more complex features, which are ideal for tasks demanding high precision. The C3K2 module’s multibranch architecture enables efficient extraction of multiscale and diverse features, while the fusion operations ensure seamless integration of local details and global semantics. Furthermore, the incorporation of residual connections addresses gradient vanishing issues, allowing for effective training even in deeper networks. By offering a flexible trade-off between lightweight efficiency and high representational capacity, the C3K2 module is highly adaptable to the varied requirements of modern object detection tasks, ensuring robust performance across a wide range of scenarios49.

Fig. 4 C3k2 module49.

Fig. 5 BottleNeck module49.
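The following simplified PyTorch sketch illustrates the split-process-concatenate idea behind the C3K2 module together with a residual bottleneck (cf. Figs. 4 and 5). It is a conceptual approximation under assumed channel counts, not the Ultralytics implementation.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c1, c2, k=1, s=1):
    # Conv2d + BatchNorm2d + SiLU, as in the Conv module sketch above
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class Bottleneck(nn.Module):
    """Two 3x3 conv blocks with an optional residual connection (cf. Fig. 5)."""
    def __init__(self, ch: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = conv_bn_silu(ch, ch, k=3)
        self.cv2 = conv_bn_silu(ch, ch, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3k2Like(nn.Module):
    """Simplified C3k2-style block: split the channels, refine one branch, concatenate, fuse."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = conv_bn_silu(in_ch, 2 * hidden, k=1)                 # project, then split
        self.blocks = nn.Sequential(*[Bottleneck(hidden) for _ in range(n)])
        self.cv2 = conv_bn_silu(2 * hidden, out_ch, k=1)                # fuse after concatenation

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)                              # split into two channel groups
        return self.cv2(torch.cat((a, self.blocks(b)), dim=1))

print(C3k2Like(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```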

SPPF module

The spatial pyramid pooling fast (SPPF) module is a pivotal component of YOLOv11 and is designed to enhance object detection performance by efficiently fusing multiscale features. As illustrated in Fig. 6, the module begins with a convolutional layer that processes the input feature maps, extracting essential features while reducing their spatial dimensions to minimize computational overhead. This is followed by three successive MaxPool2d operations, each expanding the receptive field and capturing multiscale information, transitioning from local details to global semantics. These operations effectively preserve critical features while suppressing redundancy, ensuring robust feature extraction. The output feature maps from the pooling layers are subsequently integrated via a concatenation operation (Concat), which combines contextual information across multiple scales. This fusion produces output features that balance fine-grained detail with the global semantic context, significantly enriching the feature representation. A final convolutional layer refines the fused features, optimizing the channel structure to generate high-quality inputs for downstream network layers. Designed for efficiency and robustness, the SPPF module streamlines pooling and concatenation processes to lower computational complexity, making it particularly well suited for real-time detection tasks. Its ability to extract and integrate multiscale information greatly enhances the model’s ability to detect small objects and process complex scenes. Moreover, the SPPF module improves adaptability to multiobject detection scenarios, providing a versatile solution for diverse detection tasks49.

Fig. 6 SPPF module49.
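A compact PyTorch sketch of this design, one convolution followed by three cascaded max-pooling operations whose outputs are concatenated and fused, is given below; the kernel size and channel counts are assumptions for illustration rather than YOLOv11's actual settings.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c1, c2, k=1, s=1):
    # Conv2d + BatchNorm2d + SiLU, as in the Conv module sketch above
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class SPPF(nn.Module):
    """Spatial pyramid pooling (fast): conv, three cascaded max-pools, concat, conv."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = conv_bn_silu(in_ch, hidden, 1)            # reduce channels before pooling
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = conv_bn_silu(hidden * 4, out_ch, 1)        # fuse the four concatenated scales

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # each successive pooling pass enlarges the receptive field
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```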

C2PSA module

The cross-scale pixel spatial attention (C2PSA) module introduces an innovative attention mechanism in YOLOv11 aimed at improving feature representation through cross-scale attention and pixel-level spatial optimization. This design enhances the model’s capacity to detect complex objects and capture fine-grained details, as illustrated in Fig. 7. The module architecture comprises an initial convolutional layer, multiple cascaded PSABlocks (pixel spatial attention blocks), feature concatenation (Concat), and a final convolutional layer for feature integration. The process begins with the initial convolutional layer, which extracts basic features from the input. The extracted features are then processed by a series of PSABlocks, which iteratively refine the feature representation. Each PSABlock includes a pixel spatial attention mechanism and a residual connection. The attention mechanism emphasizes critical features in target regions via a weighting strategy, effectively reducing background interference and improving the focus on salient details. Moreover, the residual connections facilitate gradient flow, ensuring stable training and efficient feature learning. Following the cascaded PSABlocks, the output features are concatenated to integrate multilevel information. A final convolutional layer further refines these features, aligns feature dimensions and enhances overall representational quality. By combining cross-scale attention with pixel-level optimization, the C2PSA module effectively balances detail preservation and the global context, making it particularly suitable for detecting intricate objects in challenging scenarios. This design ensures robust feature extraction, contributing to the overall accuracy and efficiency of YOLOv1149.

Fig. 7 C2PSA module49.
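The sketch below illustrates only the core weighting idea of pixel spatial attention with a residual connection; it is a deliberately simplified stand-in, not the actual C2PSA implementation, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PixelSpatialAttention(nn.Module):
    """Toy pixel-level spatial attention with a residual connection (illustrative only)."""
    def __init__(self, ch: int):
        super().__init__()
        # Predict one attention weight per spatial position from the feature map.
        self.attn = nn.Sequential(nn.Conv2d(ch, ch // 4, 1), nn.SiLU(),
                                  nn.Conv2d(ch // 4, 1, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.attn(x)     # (B, 1, H, W) weights in [0, 1]
        return x + x * w     # emphasize salient regions while keeping a residual path

x = torch.randn(1, 128, 40, 40)
print(PixelSpatialAttention(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```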

Detection module

The detect module serves as the core of object detection in YOLOv11 and is responsible for predicting target categories and bounding box locations. It incorporates a lightweight architecture, a multitask separation mechanism, and advanced loss function optimization to achieve high efficiency and accuracy. As illustrated in Fig. 8, the input features are initially processed through a series of Conv and depthwise separable convolutions (DWGConv). DWGConv significantly reduces the parameter count and computational complexity while preserving robust feature extraction capabilities, thereby enhancing the module’s efficiency and suitability for real-time applications. Following this initial processing, the feature maps are divided into separate paths via a multipath parallel design to handle bounding box prediction and target classification independently. In the bounding box prediction path, Conv generates the coordinates of the bounding boxes, and the discrepancy between the predicted and ground-truth values is quantified by bounding box loss (box loss). This guides the model’s optimization for the regression task, ensuring accurate spatial localization. On the other hand, the classification path predicts the probability distribution of target categories, with classification loss (Cls loss) minimizing the error between the predicted and actual categories to refine the classification accuracy. This task-specific separation ensures that both bounding box regression and target classification are optimized independently yet harmoniously, leveraging efficient feature utilization and error minimization. By integrating advanced convolutional designs and effective loss functions, the Detect module achieves an optimal balance between computational efficiency and detection accuracy, making it a critical enabler of YOLOv11’s high performance in diverse object detection tasks49.

Fig. 8 Detection module49.
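The following sketch illustrates the decoupled-branch idea: one path regresses box coordinates and the other predicts class logits for each spatial cell. It omits YOLOv11's distribution-based box encoding and NMS, and the channel count and the assumed 20-class setting are illustrative choices only.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c1, c2, k=3, s=1):
    # Conv2d + BatchNorm2d + SiLU, as in the Conv module sketch above
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class DecoupledHead(nn.Module):
    """Illustrative decoupled detection head: separate box-regression and classification paths."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.box_branch = nn.Sequential(conv_bn_silu(in_ch, in_ch),
                                        nn.Conv2d(in_ch, 4, 1))            # 4 box coordinates per cell
        self.cls_branch = nn.Sequential(conv_bn_silu(in_ch, in_ch),
                                        nn.Conv2d(in_ch, num_classes, 1))  # class logits per cell

    def forward(self, x):
        return self.box_branch(x), self.cls_branch(x)

feat = torch.randn(1, 256, 20, 20)             # one scale of neck output
box, cls = DecoupledHead(256, 20)(feat)
print(box.shape, cls.shape)  # torch.Size([1, 4, 20, 20]) torch.Size([1, 20, 20, 20])
```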

Dataset acquisition and processing

Sources and characteristics of remote sensing image datasets

The datasets used for training and testing the YOLOv11 model in this study are primarily publicly available high-resolution remote sensing image datasets, including DIOR50, NWPU VHR-1051, and RSOD52,53. These datasets encompass a wide variety of ground objects, such as golf courses, vehicles, highway toll stations, train stations, chimneys, storage tanks, ships, ports, airplanes, tennis courts, athletic tracks, dams, basketball courts, highway service areas, stadiums, airports, baseball fields, bridges, wind turbines, and interchanges. In total, the training set includes 42,234 images, whereas the validation and test sets consist of 21,117 and 7038 images, respectively. Sample images from the datasets are shown in Fig. 9. These datasets provide a diverse range of training samples, enabling the model to learn robust representations and facilitating reliable evaluations of its generalizability and robustness in practical applications. The target objects in the datasets exhibit high density and considerable scale variation, with a notable prevalence of small objects and multiscale targets. Furthermore, remote sensing images often present challenges due to complex background interferences, including cloud occlusion, shadows, and variations in lighting conditions. These factors increase the difficulty of detection tasks, making them ideal for assessing the performance of detection models in handling real-world complexities.

Fig. 9 Example of the dataset samples.

Data augmentation and annotation

Data augmentation and annotation play pivotal roles in improving the generalization ability and detection accuracy of YOLOv11. To increase the diversity of training samples, a range of data augmentation techniques were employed, including random scaling, which scales images and their corresponding target objects to various degrees, enabling the model to effectively handle targets of different sizes; flipping, where horizontal and vertical flipping of images increases dataset variability and enhances robustness; rotation, where randomly rotating images by specific angles allows the model to detect targets from diverse orientations; color jittering, which adjusts image brightness, contrast, and saturation to simulate lighting variations, improving the model’s resilience to illumination changes; blurring, which applies light blurring to simulate motion blur or image capture artifacts, enabling detection under suboptimal conditions; and cropping, where random cropping positions target objects in different areas of the image, training the model to detect objects across various locations and perspectives. These data augmentation techniques effectively simulate real-world variations in observation angles, lighting conditions, and target scales, enhancing the model’s robustness in complex scenarios.
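As an illustration, the sketch below composes the listed operations with torchvision transforms; the probabilities and ranges are placeholder values, and for detection tasks the bounding boxes would need to be transformed consistently with the images (frameworks such as Ultralytics apply equivalent augmentations internally).

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline covering the operations listed above;
# all probabilities and ranges are placeholders, not the study's settings.
augment = T.Compose([
    T.RandomResizedCrop(size=1280, scale=(0.6, 1.0)),              # random scaling + cropping
    T.RandomHorizontalFlip(p=0.5),                                 # horizontal flip
    T.RandomVerticalFlip(p=0.5),                                   # vertical flip
    T.RandomRotation(degrees=15),                                  # small random rotation
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # lighting variation
    T.GaussianBlur(kernel_size=3),                                 # light blurring
])

# augmented = augment(pil_image)  # apply to a PIL image loaded elsewhere
```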

In line with the characteristics of high-resolution remote sensing imagery, a hybrid approach combining manual and automated annotation methods was employed to label each target object with precision, paying special attention to the annotation accuracy of small- and multiscale targets. Each annotation file corresponds to an individual image and includes the class ID, normalized coordinates of the bounding box center relative to the image dimensions, and the normalized width and height of the bounding box. The annotation format is defined as <class_id> <x_center> <y_center> <width> <height>, where x_center and y_center denote the normalized center coordinates and width and height represent the normalized dimensions of the bounding box. Examples of annotations include "4 0.542 0.538 0.737 0.632" (Fig. 10a), "10 0.426 0.505 0.516 0.973" (Fig. 10b), and "14 0.500 0.419 0.56 0.701" (Fig. 10c). This combination of advanced data augmentation techniques and precise annotation methods ensures that YOLOv11 is well equipped to handle the challenges posed by complex remote sensing imagery, ultimately improving its performance across diverse detection tasks.

Fig. 10 Example of the labels of the dataset.
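A small helper like the following, written purely for illustration, converts one annotation line in this normalized format back into a pixel-space bounding box; the 800 × 800 image size used in the example is an assumption.

```python
def parse_yolo_label(line: str, img_w: int, img_h: int):
    """Convert one line '<class_id> <x_center> <y_center> <width> <height>' (normalized)
    into a class ID and a pixel-space box (x_min, y_min, x_max, y_max)."""
    class_id, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h     # center in pixels
    w, h = float(w) * img_w, float(h) * img_h         # size in pixels
    return int(class_id), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

# Example using the annotation "4 0.542 0.538 0.737 0.632" from Fig. 10a,
# assuming an 800 x 800 image for illustration.
print(parse_yolo_label("4 0.542 0.538 0.737 0.632", 800, 800))
```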

Model training

Parameter settings

The initial parameter settings during model training are critical for both the performance of YOLOv11 and the convergence of the training process. In this study, the following parameter configurations were applied: (1) Epochs: The total number of training epochs was set to 1000 to ensure sufficient iterations for feature learning. The model achieved its optimal performance at the 496th epoch, indicating an efficient learning process. (2) Batch size: A batch size of 64 was selected after careful tuning to strike a balance between computational efficiency and memory usage, optimizing both training speed and resource allocation. (3) Imgsz (Image Size): The input image dimensions were fixed at 1280 × 1280, which preserved image details while maintaining computational efficiency during training. (4) Pretrained Weights: The pretrained weights "Yolo-V11x.pt," derived from a large-scale dataset, provided strong initialization for the model, significantly improving detection accuracy by enabling the network to leverage prior knowledge. (5) Learning rate: The initial learning rate was set to 0.0001 and dynamically adjusted via a cosine learning rate scheduler, ensuring a gradual reduction in the learning rate over the course of training to promote stable convergence and prevent oscillations. (6) Optimizer: The AdamW optimizer (adaptive moment estimation with weight decay)69 was utilized, which combines the adaptive learning properties of Adam with weight decay regularization to better manage gradient updates and mitigate the risk of overfitting. These meticulously chosen parameters effectively laid the groundwork for rapid convergence and enhanced final performance, ensuring the robustness of YOLOv11 in diverse detection tasks.
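These settings map naturally onto the Ultralytics training API; the sketch below shows one plausible way to express them, where the dataset YAML path is a placeholder and the weight file name follows Ultralytics' public naming rather than the exact checkpoint identifier quoted above.

```python
from ultralytics import YOLO

# Initialize from pretrained weights (file name follows Ultralytics' convention;
# the study refers to the checkpoint as "Yolo-V11x.pt").
model = YOLO("yolo11x.pt")

# Train with the hyperparameters described in this section;
# "rs_dataset.yaml" is a placeholder for the dataset configuration file.
model.train(
    data="rs_dataset.yaml",
    epochs=1000,        # total training epochs
    batch=64,           # batch size
    imgsz=1280,         # input image size
    lr0=0.0001,         # initial learning rate
    cos_lr=True,        # cosine learning-rate schedule
    optimizer="AdamW",  # AdamW optimizer
)
```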

Evaluation and analysis of the experimental results

Evaluation and analysis of loss functions

In object detection tasks, the training process typically integrates multiple loss functions to achieve comprehensive performance optimization across various subtasks. These include Box_Loss, Cls_Loss (classification loss), and Dfl_Loss (distribution focal loss), each tailored to specific objectives. Box_Loss focuses on refining the spatial accuracy of bounding box predictions, ensuring the precise localization of detected objects. Cls_Loss addresses the accuracy of target classification by minimizing the discrepancy between the predicted and actual categories. Moreover, Dfl_Loss enhances boundary refinement by learning the fine-grained distribution of object edges, effectively improving the model’s ability to distinguish subtle differences between overlapping or closely positioned objects. The combination of these loss functions enables holistic optimization, ensuring robust performance in classification, localization, and boundary precision tasks.

Overview of loss functions
  1. (1)

    Box_Loss is a crucial metric in object detection tasks; it measures the difference between the predicted and ground-truth bounding boxes and directly influences the localization accuracy. YOLOv11x incorporates an enhanced complete intersection over union (CIoU) method within Box_Loss, offering a more comprehensive bounding box evaluation than traditional IoU and GIoU methods do. This design makes the CIoU particularly effective in dense multiobject scenes and small object detection tasks, where precise localization is essential. The CIoU loss, which forms the basis of Box_Loss, is calculated via the equations below. Unlike the IoU and GIoU, the CIoU integrates three critical components. First, the center point distance accounts for the deviation between the centers of the predicted and ground-truth boxes. By including this term, the CIoU ensures precise target localization and mitigates vanishing gradient issues in low-overlap scenarios. Second, aspect ratio consistency, represented by the term v, optimizes the alignment between the shapes of the predicted and ground-truth boxes, particularly in situations where the object shapes vary significantly. Third, comprehensiveness and stability are achieved by combining the IoU, center point distance, and aspect ratio consistency into a unified framework. This holistic approach significantly enhances bounding box localization, especially in tasks involving small or densely packed targets, and provides a robust evaluation metric for accurately localizing objects under challenging conditions54 (a minimal code sketch of Eqs. (1)–(3) follows this list).

    $$Box\_loss = 1 - CIoU = 1 - \left[ {IoU - \frac{{\rho^{2} \left( {b,b^{*} } \right)}}{{c^{2} }} - \alpha v} \right]$$
    (1)
    $$v = \frac{4}{{\pi^{2} }} \times \left( {\arctan \frac{{w^{*} }}{{h^{*} }} - \arctan \frac{w}{h}} \right)^{2}$$
    (2)
    $$\alpha = \frac{v}{{\left( {1 - IoU} \right) + v}}$$
    (3)

    where:

    IoU: Intersection over union, the ratio of the overlapping area to the union area of the predicted and ground-truth boxes, which measures spatial overlap.

    \({\rho }^{2}(b,{b}^{*})\): Euclidean distance between the center points of the predicted box (b) and the ground-truth box (b*), which is used to quantify center point deviation.

    c: Diagonal length of the smallest enclosing rectangle covering both the predicted and ground-truth boxes, used to normalize the distance term \(\frac{{\rho }^{2}(b,{b}^{*})}{{c}^{2}}\), ensuring scale invariance.

    v: Aspect ratio consistency term, which assesses the alignment between the predicted and ground-truth box shapes.

    α: Adjustment factor balancing the contribution of v to the overall loss.

    w, h: Width and height of the predicted box.

    w*, h*: Width and height of the ground-truth box.

  2. (2)

    Cls_Loss plays a crucial role in object detection tasks, as it measures the accuracy of category predictions and serves as an essential component for optimizing the model’s classification performance. YOLOv11 adopts weighted binary cross entropy (WBCE) loss to address the challenges of class imbalance in multiclass detection tasks and to improve the model’s classification accuracy for categories with fewer samples, as described in Eq. (4). WBCE offers several distinct advantages. First, it effectively handles class imbalance by introducing class weights (wi), which amplify the contribution of small-sample categories to the overall loss. This ensures that the model learns to classify minority categories accurately without being dominated by majority classes. Second, the gradient computation of the WBCE is both simple and stable, enabling rapid convergence during the initial training phase and ensuring consistent optimization of target category probabilities in later stages. Third, WBCE is particularly suited for multiclass detection tasks, as it independently optimizes the classification probabilities for each category, improving the model’s adaptability to scenarios involving multiple targets. By integrating WBCE into YOLOv11, the model effectively balances the learning process across classes, enabling improved classification performance for underrepresented categories and ensuring robust adaptability to multitarget scenarios. This design not only enhances training stability but also ensures optimal classification accuracy across diverse datasets54.

    $$Cls\_loss = - \mathop \sum \limits_{i = 1}^{N} \left[ {w_{i} \times y_{i} \times log\left( {p_{i} } \right) + \left( {1 - y_{i} } \right)log\left( {1 - p_{i} } \right)} \right]$$
    (4)

    where:

    N: Total number of target categories.

    yi: Ground-truth label for the i-th sample, where yi = 1 if the sample belongs to the category and yi = 0 otherwise.

    pi: Predicted probability of the i-th sample belonging to a specific category, representing the model’s confidence.

    wi: Class weight, which is designed to balance the contributions of different categories to the loss function.

  3. (3)

    Dfl_Loss is a novel loss function in YOLOv11x that is specifically designed to enhance bounding box quality by refining the predicted distribution of bounding box coordinates. Unlike traditional regression losses such as L1 or L2, Dfl_Loss models the predicted bounding box coordinates as a probability distribution. This approach enables the model to focus not only on optimizing overall localization but also on capturing finer details in high-confidence regions, thereby improving both the precision and regression quality of bounding boxes. The formula for Dfl_Loss is presented in Eq. (5). The primary innovation of Dfl_Loss lies in weighting the regression error of each pixel by the predicted probability distribution. By assigning greater importance to regions with higher confidence, the model can better refine bounding box details. This is particularly advantageous for scenarios requiring precise localization, such as small object detection or objects with complex shapes. Furthermore, Dfl_Loss complements CIoU, which focuses on overall bounding box localization accuracy. Together, these two loss functions work synergistically to achieve significant improvements in bounding box prediction performance. By integrating the probabilistic modeling of bounding box coordinates, Dfl_Loss allows YOLOv11x to achieve more precise and robust localization. The synergistic use of Dfl_Loss and CIoU ensures optimal balance between global localization accuracy and fine-grained bounding box detail refinement, significantly enhancing the model’s detection performance in challenging scenarios54.

    $$Dfl\_loss = \mathop \sum \limits_{j = 1}^{N} \left( {p_{j} \times \left| {x_{j} - x_{j}^{*} } \right|} \right)$$
    (5)

    where:

    N: Total number of candidate bounding box predictions (distribution positions) included in the summation.

    pj: Predicted probability distribution for candidate bounding box j.

    xj, xj*: Predicted and ground-truth coordinates of the bounding box, respectively.
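The following minimal PyTorch sketch implements the CIoU-based Box_loss of Eqs. (1)-(3) for axis-aligned boxes given as corner coordinates; it is an illustrative reimplementation rather than the Ultralytics code, and the weighted binary cross entropy of Eq. (4) and the distribution weighting of Eq. (5) follow analogous elementwise forms.

```python
import math
import torch

def ciou_box_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Box_loss = 1 - CIoU for boxes given as (x1, y1, x2, y2), following Eqs. (1)-(3)."""
    # Intersection and union for the IoU term
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # Squared center distance normalized by the enclosing-box diagonal (rho^2 / c^2)
    cx_p, cy_p = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx_t, cy_t = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    enc_w = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    enc_h = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = enc_w ** 2 + enc_h ** 2 + 1e-7

    # Aspect-ratio consistency term v and its weight alpha (Eqs. (2) and (3))
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + 1e-7)) - torch.atan(w_p / (h_p + 1e-7))) ** 2
    alpha = v / ((1 - iou) + v + 1e-7)

    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou  # Box_loss per Eq. (1)

# Example: a predicted box slightly offset from its ground-truth box
pred = torch.tensor([[10.0, 10.0, 60.0, 50.0]])
true = torch.tensor([[12.0, 8.0, 58.0, 52.0]])
print(ciou_box_loss(pred, true))
```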

Evaluation and analysis of loss functions

Figure 11 illustrates the dynamic trends of the three loss functions (Box_loss, Cls_loss, and Dfl_loss) during model training and validation, with their characteristics quantitatively summarized in Table 2. All three loss functions demonstrate desirable properties, including rapid convergence, curve stability, and consistent behavior between the training and validation sets. In the early training phase (first 3–4 epochs), all loss functions rise sharply to a peak before decreasing significantly and eventually stabilizing. This trend indicates that the model can quickly adapt to data distributions and optimize parameters in the early stage, achieving loss convergence through continued refinement in later stages. Additionally, the similar curve trends between the training and validation sets reflect the model’s robust stability and generalizability.

  1. (1)

    Box_loss: As shown in Fig. 11a, Box_loss decreases from 1.0728 to 0.6272 in the training set, a reduction of 41.53%, and from 1.2353 to 0.8465 in the validation set, a reduction of 31.47%. These results indicate that YOLOv11 effectively minimizes bounding box localization errors, particularly in complex backgrounds and across various object scales. The smooth decline in Box_loss further highlights the model’s stability when processing challenging scenarios. The enhanced regression capability is largely attributed to the introduction of the C3k2 module and spatial pyramid pooling fusion (SPPF), which provide richer contextual information and improve multiscale object detection, especially for small and complex targets.

  2. (2)

    Cls_loss: Fig. 11b shows that Cls_loss decreases significantly, from 1.0374 to 0.3730 in the training set (64.05% reduction) and from 0.8327 to 0.4169 in the validation set (49.93% reduction). This sharp decline reflects the model’s strong performance in distinguishing object categories, even in complex backgrounds. The introduction of the cross-scale pixel spatial attention (C2PSA) mechanism enhances feature representation and improves category differentiation. Additionally, the multitask learning framework facilitates collaborative optimization between classification and regression tasks, demonstrating robustness in diverse remote sensing images.

  3. (3)

    Dfl_loss: As illustrated in Fig. 11c, Dfl_loss decreases from 1.1578 to 0.6861 in the training set (40.74% reduction) and from 1.1451 to 0.7418 in the validation set (35.22% reduction). The consistent decline in Dfl_loss underscores the model’s ability to refine boundary predictions for objects, even in complex distributions. The optimization benefits from YOLOv11’s multiscale feature extraction and enhanced boundary precision through spatial pyramid pooling fusion (SPPF) and deeper feature representations. This improvement is especially evident in scenarios involving small or densely distributed objects.

Fig. 11 Loss curves for the training and validation sets.

Table 2 Changes in each loss function on the training and validation sets.

The analysis reveals that YOLOv11 excels in classification, bounding box regression, and distribution prediction tasks, particularly under high-resolution, complex-background, and multitarget conditions. The stable convergence of Box_loss and the rapid decline in Cls_loss demonstrate effective collaborative optimization in localization and classification tasks. Moreover, the robust performance of Dfl_loss further validates the model’s ability to refine complex boundary predictions.

Evaluation and analysis of the accuracy metrics

In object detection tasks, key metrics such as precision, recall, mAP50, and mAP50-95 are commonly used to evaluate model performance. These metrics provide a multidimensional assessment, enabling the optimization of detection accuracy and robustness. As defined in Eq. (6), precision measures the proportion of true positive predictions among all the predictions classified as positive by the model. This metric reflects the reliability of the model’s predictions; a higher precision value indicates fewer false positives, meaning that the model is more accurate in identifying actual objects and less prone to mistakenly classifying background or negative samples as targets. Recall, as described in Eq. (7), quantifies the proportion of true positive samples correctly detected by the model out of all ground-truth positive samples. Higher recall values suggest a lower likelihood of missed detections, allowing the model to capture all the targets comprehensively. Equation (8) introduces mAP50, which calculates the mean average precision (AP) across all classes at a fixed IoU threshold of 0.5. The AP is derived from the area under the precision-recall curve, which represents the balance between precision and recall for each class. mAP50 provides a performance summary at a moderate overlap threshold, where a detection is considered correct if the IoU between the predicted and ground-truth bounding boxes is at least 50%. In contrast, mAP50-95, as defined in Eq. (9), evaluates the mean AP across multiple IoU thresholds ranging from 0.5 to 0.95. This metric introduces a stricter evaluation standard by considering the model’s performance at varying degrees of overlap, ensuring that it performs well under both lower overlap conditions (IoU = 0.5) and more challenging high overlap thresholds (IoU = 0.95). As such, mAP50-95 serves as a more rigorous and comprehensive measure of the model’s ability to handle precise localization and accurate detection. By combining these metrics, a holistic evaluation of YOLOv11’s performance can be achieved, providing insights into its precision, recall capabilities, and adaptability to diverse detection scenarios55.

$$Precision = \frac{TP}{{TP + FP}}.$$
(6)
$$Recall = \frac{TP}{{TP + FN}}$$
(7)
$$mAP_{50} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} AP_{50,i} .$$
(8)
$$mAP_{50 - 95} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \frac{1}{K}\mathop \sum \limits_{j = 1}^{K} AP_{i,j}$$
(9)

where:

TP (true positive): The number of samples correctly identified as positive by the model, representing accurate detections.

FP (false positive): The number of samples incorrectly classified as positive by the model, reflecting false alarms or misclassifications.

FN (false negative): The number of actual positive samples that the model failed to detect, representing missed detections.

N: The total number of categories in the dataset.

AP50,i: The average precision for class i at an IoU threshold of 0.5, calculated as the area under the precision-recall curve for that class; it summarizes the model’s balance between precision and recall across recall levels.

K: The number of IoU thresholds considered, typically set to 10 (ranging from 0.5 to 0.95 with a step size of 0.05), to capture performance under varying levels of detection overlap.

APi,j: The average precision for class i at a specific IoU threshold j, which quantifies the detection accuracy under a given overlap condition.
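For illustration, the small helper below evaluates Eqs. (6) and (7), together with the F1 score discussed in a later subsection, from hypothetical detection counts; the numbers in the example are not results from this study.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts (Eqs. (6), (7), and (10))."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for one class: 88 correct detections, 12 false alarms, 15 misses
print(detection_metrics(88, 12, 15))  # (0.88, ~0.854, ~0.867)
```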

As illustrated in Fig. 12, the four evaluation metrics (precision, recall, mAP50, and mAP50-95) demonstrate notable similarities in their trends. During the initial training phase (first three epochs), all the metrics exhibit a temporary decline, followed by rapid recovery and steady improvement. In the later training stages (after 150 epochs), these metrics stabilize, reflecting the model’s ability to adapt to complex scenarios and multiscale targets in the early phase and its subsequent optimization of localization and classification tasks for comprehensive accuracy and robustness. The early decline is likely due to the model’s initial difficulties in adapting to specific challenges in remote sensing data, such as small objects and complex boundaries. However, with the incorporation of multitask optimization strategies and enhanced feature extraction modules, the model quickly corrects these shortcomings and achieves stable convergence.

Fig. 12 Accuracy curve.

As shown in Table 3, precision and recall serve as fundamental metrics for evaluating false positive and false negative rates, respectively. The precision improves from an initial value of 0.7440, temporarily dropping to 0.6050 before increasing to 0.8861, a 19.10% increase, indicating reduced false detections and improved detection accuracy. Recall starts at 0.5966, decreases to 0.4932, and eventually reaches 0.8563, reflecting a 43.54% improvement, highlighting the model’s ability to reduce missed detections and achieve higher coverage. These improvements are driven by the cross-scale pixel spatial attention (C2PSA) mechanism and the spatial pyramid pooling fusion (SPPF) module, which enhance the model’s ability to detect small and diverse objects in complex backgrounds.

Table 3 Changes in each accuracy metric during training.

mAP50 and mAP50-95 provide a more comprehensive evaluation of the model’s detection capabilities. mAP50 improves from an initial value of 0.6578, decreases to 0.4967 at its lowest value, and increases to 0.8920, representing a 35.60% increase, underscoring the model’s strong localization capabilities under lenient IoU conditions. Moreover, mAP50-95, which starts at 0.5731, decreases to 0.3935, and eventually reaches 0.8646, achieves a remarkable 50.85% improvement. This significant gain demonstrates YOLOv11’s ability to handle high-precision localization tasks, particularly for challenging remote sensing targets such as building boundaries and road intersections. These results are attributed to the integration of the C3k2 module and deep feature fusion techniques, which enhance multiscale feature extraction and boundary optimization, enabling the model to overcome complex target distributions and background noise.

Overall, the initial decline across these metrics reflects the model’s adaptation to the complexities of remote sensing data, while the subsequent improvements validate YOLOv11’s robustness and optimization efficiency. The stabilization of all curves after 150 epochs highlights the model’s strong convergence and effective multitask collaborative optimization. By combining high precision and recall, YOLOv11 effectively balances false positive and false negative rates, adapting efficiently to multiscale targets. Furthermore, the significant improvements in mAP50 and mAP50-95 confirm the model’s superior classification and localization accuracy under both lenient and strict IoU thresholds, making it highly suitable for high-resolution and complex remote sensing applications.

Evaluation and analysis of the F1 score

As defined in Eq. (10), the F1 score is a critical metric for evaluating model performance in classification tasks. It represents the harmonic mean of precision and recall, providing a balanced measure of the two. When precision and recall are similar in value, the F1 score is high, whereas significant discrepancies between the two lead to lower scores. This makes the F1 score particularly suitable for tasks with imbalanced class distributions, such as detecting minority class targets. The score ranges from 0 to 1, with values closer to 1 reflecting superior classification accuracy and consistency.

As shown in Fig. 13, the F1 score closely mirrors the trends of precision and recall, with the following phase-specific characteristics: (1) Initial phase (1–3 epochs): During the first three epochs, the F1 score decreases sharply to 0.5434, corresponding to a simultaneous decline in precision and recall. This suggests that the model initially struggles to adapt to the complexities of remote sensing imagery, such as small or occluded targets, resulting in temporarily reduced classification performance. (2) Rapid improvement phase (4–50 epochs): As the model learns and adapts to the multiscale features of remote sensing targets, both precision and recall improve significantly, driving a rapid increase in the F1 score. This phase reflects the model’s ability to optimize classification performance through feature learning and task adaptation. (3) Stable convergence phase (after 50 epochs): Beyond 50 epochs, the F1 score stabilizes, reaching a final value of 0.8709. This finding indicates that YOLOv11 achieves robust classification performance under both lenient IoU conditions (e.g., mAP50 of 90.28%) and stricter evaluation standards. This robustness is particularly crucial for detecting minority class targets, including small vehicles and isolated buildings, in remote sensing imagery.

Fig. 13 F1 score plot.

YOLOv11’s ability to maintain high precision and recall while minimizing false positives and false negatives is a testament to its advanced design. Techniques such as multiscale feature extraction, attention mechanism enhancements, and multitask optimization enable the model to handle complex and diverse target distributions effectively. These strengths make YOLOv11 a highly reliable solution for remote sensing tasks involving multitarget detection and challenging scenarios56.

$$F1\, Score = 2 \times \frac{Precision \times Recall}{{Precision + Recall}}$$
(10)

Evaluation and analysis of the precision-recall curve

The precision‒recall (PR) curve is a critical tool for evaluating the performance of classification and object detection models. By visually representing the relationship between precision and recall across different confidence thresholds, the PR curve provides a comprehensive view of the model’s trade-offs in classification and detection tasks. The horizontal axis represents Recall, whereas the vertical axis represents Precision, illustrating the model’s ability to balance accuracy and coverage under varying threshold conditions. When the confidence threshold is set low, the model prioritizes detecting more positive samples, resulting in higher recall. However, this may also lead to an increase in false positives, causing a decline in precision. Conversely, with a higher confidence threshold, the model applies stricter criteria for positive samples, leading to improved precision but potentially lower recall. This trade-off, captured dynamically by the PR curve, highlights the model’s performance across different operating conditions. The area under the PR curve, quantified as the AP, serves as a key metric for evaluating single-class detection performance. A higher AP value indicates that the PR curve is closer to the ideal state (upper right corner), reflecting stronger detection capabilities for the corresponding category. In multiclass object detection tasks, the mean AP across all categories is calculated as the mean average precision (mAP), which provides an overall measure of the model’s detection performance. mAP@0.5, which is calculated under a lenient IoU threshold of 0.5, is used to evaluate the model’s overall performance when lower overlap requirements are acceptable. This metric is well suited for tasks where a degree of localization flexibility is permissible. In contrast, mAP@0.5–95, which averages AP values across IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, offers a more stringent and comprehensive assessment. By considering performance under both lenient and strict overlap conditions, mAP@0.5–95 reflects the model’s precision and robustness across diverse detection scenarios.

This dual evaluation framework, which combines lenient and strict thresholds, enables a more nuanced understanding of the model’s strengths and weaknesses, making the PR curve and mAP metrics indispensable tools in modern object detection research.
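As a simplified illustration of how an AP value summarizes a PR curve, the sketch below accumulates true and false positives over confidence-sorted detections and integrates precision over recall with a trapezoidal approximation; COCO-style evaluation instead uses interpolated precision at fixed recall points, and the example detections are hypothetical.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Approximate area under the precision-recall curve for one class at a fixed IoU threshold."""
    order = np.argsort(-np.asarray(scores))        # sort detections by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1 - tp)
    recall = cum_tp / max(num_gt, 1)               # Eq. (7) evaluated at each confidence threshold
    precision = cum_tp / (cum_tp + cum_fp)         # Eq. (6) evaluated at each confidence threshold
    # Integrate precision over recall (simple trapezoidal approximation of the PR-curve area)
    return float(np.trapz(precision, recall))

# Hypothetical detections for one class: confidence scores, TP/FP flags, and 4 ground-truth objects
scores = [0.95, 0.90, 0.80, 0.70, 0.60]
is_tp = [1, 1, 0, 1, 0]
print(average_precision(scores, is_tp, num_gt=4))

# Averaging such AP values over all classes yields mAP@0.5; repeating the evaluation
# over IoU thresholds from 0.5 to 0.95 and averaging again yields mAP@0.5-95.
```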

Figure 14 shows the PR curves of YOLOv11 for various object categories in remote sensing image detection tasks, revealing the model’s performance across different target types. The curves and their corresponding AP values highlight the strengths and limitations of the model, which are influenced by factors such as target characteristics, background complexity, target scale, class distribution, and dataset variability. To provide a deeper understanding, we analyze three performance levels, namely, high-performing, moderately performing, and low-performing categories, and discuss the numerical differences within the context of remote sensing tasks.

  1. (1)

    High-performing categories (mAP@0.5 > 0.90): Targets in this category, including Airplane (0.985), Tenniscourt (0.979), Ship (0.966), Basketballcourt (0.948), Baseballfield (0.950), and Stadium (0.949), exhibit PR curves close to the upper-right corner, reflecting high precision and recall. This performance stems from the distinct characteristics of these targets, such as their regular shapes, well-defined boundaries, and strong contrast with simple backgrounds (e.g., airplanes on runways or ships on water). These features are easily recognized by the model. The moderate size of these targets further avoids challenges related to extremely small or large scales. YOLOv11’s advanced modules, including SPPF and C3k2, enhance feature extraction and boundary detection, contributing to its outstanding performance for highly regular targets.

  2. (2)

    Moderately performing categories (0.80 < mAP@0.5 < 0.90): Categories such as Expressway-Toll-Station (0.897), Storagetank (0.941), and Dam (0.842) show good overall performance but encounter slight decreases in precision at high recall. Factors such as background complexity and target diversity contribute to this decline. For example, toll stations often appear in highway settings with numerous vehicles, increasing the likelihood of false positives. The attention mechanisms in YOLOv11, such as C2PSA, help mitigate these challenges by focusing on salient regions, enabling consistent detection performance in diverse environments.

(3) Low-performing categories (mAP@0.5 ≤ 0.80): Targets such as Vehicle (0.800) and Bridge (0.649) show steeper PR curves, with significant precision drops under high recall conditions. These challenges arise from small target sizes, low contrast with complex backgrounds, and irregular textures. Bridges, for example, often blend with roads, making boundary distinction difficult. Data imbalance in these categories further limits the model’s generalizability.

Fig. 14 Precision‒recall curves.

Overall, YOLOv11 demonstrates robust multitask optimization, excelling in detecting regular and multiscale targets while maintaining computational efficiency. The PR curves indicate a strong balance between precision and recall, reducing the number of false positives and negatives. Future improvements, such as enhancing small-target detection, optimizing background suppression, and addressing class imbalance, could further expand YOLOv11’s applications in bridge monitoring, port management, and urban planning.

Evaluation and analysis of the test set results

Confidence is a key metric in object detection models, serving as an indicator of the reliability of prediction results. It quantifies both the likelihood that an object exists within a region and the quality of the bounding box prediction, making it a comprehensive measure of detection accuracy. During the postprocessing phase, confidence values are used to filter predictions and guide nonmaximum suppression (NMS), which directly impacts the precision and recall of the model. As defined in Eq. (11), confidence is determined by two primary factors: the predicted probability of the target category and the overlap between the predicted and ground-truth bounding boxes. It reflects the model’s certainty in the accuracy of a specific prediction, expressed as a probabilistic value ranging from 0 to 1. Higher confidence values (closer to 1) indicate greater certainty in the prediction, whereas lower values (closer to 0) suggest a lack of confidence or unreliable predictions. Confidence not only guides the detection process by filtering out low-quality predictions but also plays a crucial role in balancing the trade-off between false positives and false negatives. Its effectiveness depends on an appropriate threshold setting, which allows the model to maximize precision while maintaining an acceptable level of recall. By combining probabilistic prediction and geometric alignment (IoU), confidence serves as a robust metric for evaluating and improving the quality of detection results in object detection models.

$$Confidence = P\left( Object \right) \times IoU_{predicted,\,true}$$
(11)

where:

P(Object): The probability predicted by the model that the region contains a target object;

IoU_{predicted,true}: The intersection over union between the predicted bounding box and the ground-truth bounding box, which quantifies the degree of overlap.
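As a hedged illustration of Eq. (11) and the confidence-based postprocessing described above, the sketch below computes confidence as P(Object) × IoU for matched boxes and then applies threshold filtering and NMS with torchvision; the box coordinates, objectness scores, and thresholds are illustrative assumptions rather than values from this study, and at inference time the IoU term is estimated by the model rather than computed against ground truth.

```python
import torch
from torchvision.ops import box_iou, nms

def confidence(p_object: torch.Tensor, pred_boxes: torch.Tensor,
               true_boxes: torch.Tensor) -> torch.Tensor:
    """Eq. (11): confidence = P(Object) * IoU(predicted, true).

    pred_boxes/true_boxes: (N, 4) tensors in (x1, y1, x2, y2) format,
    matched one-to-one here purely for illustration."""
    iou = box_iou(pred_boxes, true_boxes).diagonal()
    return p_object * iou

# Illustrative predictions (coordinates and objectness scores are made up).
pred = torch.tensor([[10., 10., 50., 50.], [12., 11., 52., 49.], [200., 80., 240., 120.]])
gt   = torch.tensor([[11., 10., 51., 50.], [11., 10., 51., 50.], [205., 82., 238., 118.]])
p_obj = torch.tensor([0.95, 0.90, 0.40])

conf = confidence(p_obj, pred, gt)

# Postprocessing: keep confident boxes, then suppress duplicates with NMS.
conf_threshold, nms_iou = 0.5, 0.45            # assumed thresholds
keep = conf >= conf_threshold
kept_boxes, kept_scores = pred[keep], conf[keep]
final = nms(kept_boxes, kept_scores, nms_iou)  # indices of surviving boxes
print(conf, final)
```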

The trained model successfully identified all test samples, achieving a confidence score exceeding 85% for 80% of the samples. Additionally, the model demonstrated remarkable efficiency, with its high-speed recognition capabilities enabling the rapid classification and detection of a large number of targets. These results highlight the model’s potential for real-time applications, offering reliable and efficient performance in scenarios requiring large-scale target detection and classification. Figure 15 provides examples of detection results, which were analyzed to reveal the following insights:

(1) High-confidence categories (average confidence ≥ 0.9): Categories such as Airplane (0.9638), Tennis Court (0.9554), Ship (0.9455), Harbor (0.9405), Storage Tank (0.9342), Basketball Court (0.9232), Baseball Field (0.9207), Stadium (0.9189), Airport (0.9165), Windmill (0.9159), Chimney (0.9142), Expressway Toll Station (0.9137), Golf Course (0.9072), and Ground Track Field (0.9016) presented consistently high confidence scores. These results indicate that the model is highly reliable in detecting these targets, likely because of their distinct shapes and features. The simplicity of their surrounding environments further enhances detection accuracy, allowing the model to consistently achieve high certainty.

(2) Medium-confidence categories (0.8 ≤ average confidence < 0.9): For categories such as Vehicle (0.8587), Train Station (0.8333), and Dam (0.8715), the model achieves moderate confidence scores, reflecting relatively good but not optimal reliability. These targets often appear in more complex or cluttered scenes, which may obscure their distinct features or lead to confusion with other objects. For example, train stations are often situated in densely populated areas, and vehicles may be obscured by surrounding traffic or background elements.

(3) Low-confidence categories (average confidence ≤ 0.8): Categories such as Bridge (0.7563) and Overpass (0.7890) showed lower confidence levels, indicating greater uncertainty in the model’s predictions. Detection of these targets is hindered by factors such as complex and overlapping backgrounds or the variability in structural features. Bridges, for instance, often blend with roads or other infrastructure, making their boundaries difficult to distinguish. Similarly, Harbor (0.8000) exhibited lower confidence due to the diversity of objects (e.g., ships, buildings, cargo) typically present in port areas, complicating the detection process.

Fig. 15 Example diagram of the validation set results (single category).
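The three confidence bands discussed above can be reproduced mechanically once per-detection confidences are available. The sketch below assumes that detections are stored as simple (category, confidence) pairs, which is not the exact format produced in this study, and averages confidence per category before assigning each class to the high, medium, or low band using the 0.9 and 0.8 cut-offs.

```python
from collections import defaultdict

def bucket_by_confidence(detections, high=0.9, low=0.8):
    """detections: iterable of (category, confidence) pairs (assumed format).

    Cut-offs follow the analysis: >= 0.9 high, <= 0.8 low, otherwise medium."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, conf in detections:
        sums[category] += conf
        counts[category] += 1
    buckets = {"high": {}, "medium": {}, "low": {}}
    for category, total in sums.items():
        avg = total / counts[category]
        band = "high" if avg >= high else ("low" if avg <= low else "medium")
        buckets[band][category] = round(avg, 4)
    return buckets

# Illustrative detections (values are made up, not taken from the test set).
sample = [("Airplane", 0.97), ("Airplane", 0.96), ("Bridge", 0.74), ("Bridge", 0.77)]
print(bucket_by_confidence(sample))
```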

Confidence serves as an essential metric for assessing the reliability of detection results, but it should not be considered in isolation. High confidence values may occasionally be assigned to incorrect detections. Therefore, confidence metrics must be combined with other performance indicators, such as precision, recall, and mAP, to provide a more comprehensive evaluation of the model’s effectiveness. Together, these metrics ensure that the model’s performance is not only reliable but also robust across various detection scenarios.

As illustrated in Fig. 16, multiclass scenarios represent a significant challenge in object detection, characterized by the simultaneous presence of multiple target categories within a single image under complex conditions. These scenarios are often associated with highly complex backgrounds, dense target distributions, and diverse class features, placing stringent demands on detection models. The model must effectively identify each target category while maintaining high confidence, even in challenging environments with mixed distributions of dynamic and static targets. Through test set analysis, the model’s performance in multiclass scenarios highlights its adaptability and detection capabilities across multiple dimensions.

Fig. 16 Example diagram of the validation set results (multiple categories).

First, the model exhibits exceptional confidence performance in handling complex backgrounds. In images containing multiple target categories, increased background complexity often makes detection significantly more challenging. For example, highway scenarios frequently include vehicles and infrastructure, such as bridges and toll stations, which may exhibit similarities in shape and texture. Additionally, background noise, such as road signs, trees, and shadows, can further complicate detection. Despite these challenges, the model achieves precise recognition of targets while maintaining high confidence, which is attributed to its advanced feature extraction and background suppression mechanisms. This capability underscores the model’s reliability in real-world applications with substantial environmental interference.

Second, the model demonstrates remarkable adaptability and stability in dynamic multiclass environments. In highly dynamic scenarios such as ports and airports, target distributions are heterogeneous, encompassing both stationary objects (e.g., docked ships, static buildings) and moving targets (e.g., vehicles, aircraft). These targets may be distributed unevenly in space, with complex intercategory interactions. The model effectively detects multiple targets in such environments, maintaining consistently high confidence across categories. This indicates robust performance in simultaneously managing classification and localization tasks under dynamic conditions.

Additionally, the model excels in scenarios that combine indoor and outdoor targets. For example, scenes featuring stadiums and ground track fields often present distinct environmental characteristics, coupled with challenges such as uneven spatial distributions and varying lighting conditions. The model successfully learns the intricate relationships between scene features and target attributes, enabling precise classification and high-confidence detection of diverse targets. This reflects its ability to maintain consistent performance across heterogeneous scenarios.

In overlapping and occluded target scenarios, the model further shows its ability to handle complex relationships between targets. For example, overlapping scenes such as ground track fields and tennis courts pose challenges because of blurred or partially occluded boundaries. Nonetheless, the model effectively identifies target categories and maintains high confidence by leveraging both local features and global contextual information. This highlights its ability to differentiate targets even in environments with significant occlusion, enhancing its robustness and reliability.

Overall, the test results demonstrate that the model’s performance in multiclass scenarios attests to its advanced feature extraction, background adaptability, and capacity to manage complex target relationships. From handling intricate backgrounds to adapting to dynamic environments and addressing occlusion challenges, the model consistently delivers reliable and efficient detection performance. This adaptability positions the model as a valuable tool for practical applications in remote sensing imagery, with broad potential in areas such as traffic monitoring, port management, and urban planning, where intelligent analysis of multitarget scenarios is essential.

Application case testing

To thoroughly evaluate the model’s performance across diverse scenarios, four representative locations were randomly selected from Google Earth for application testing: Michigan Stadium, Olympiastadion Berlin, Port of Los Angeles, and Denver International Airport (Fig. 17). These scenarios encompass a range of complex environments, including large buildings, ports, and airports, aiming to validate the model’s object detection capabilities in diverse settings and assess its generalizability and robustness in real-world applications.

Fig. 17 Application case test.
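For readers who wish to reproduce this kind of spot check, a minimal inference sketch is given below. It assumes that the trained weights were exported as a best.pt checkpoint, that the four Google Earth captures are saved under hypothetical local file names, and that the ultralytics Python API is available; it therefore illustrates the workflow rather than the exact test setup used here.

```python
from ultralytics import YOLO

# Hypothetical paths: trained YOLOv11 weights and locally saved Google Earth captures.
model = YOLO("runs/detect/train/weights/best.pt")
scenes = [
    "cases/michigan_stadium.png",
    "cases/olympiastadion_berlin.png",
    "cases/port_of_los_angeles.png",
    "cases/denver_international_airport.png",
]

# Run detection on each scene; the conf/iou thresholds are illustrative choices.
results = model.predict(source=scenes, conf=0.25, iou=0.45, save=True)

for scene, result in zip(scenes, results):
    # Summarize detected classes and confidences for a quick qualitative check.
    names = [result.names[int(c)] for c in result.boxes.cls]
    confs = [round(float(c), 3) for c in result.boxes.conf]
    print(scene, list(zip(names, confs)))
```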

In large building scenarios, such as Michigan Stadium and Olympiastadion Berlin, the model was tested for its ability to handle complex geometries, multilayered structures, and occlusions. These buildings feature intricate architectural designs and diverse details, often accompanied by obstructions such as trees and vehicles, posing challenges for boundary recognition and feature extraction. The model effectively captured the overall shapes of these structures, accurately located boundaries, and identified associated targets in the surrounding complex backgrounds (e.g., spectator stands and parking areas). This performance highlights the model’s stability and adaptability in processing complex contours and multitarget scenarios.

In port scenarios, such as the Port of Los Angeles, the testing focus shifted to the model’s ability to handle multicategory recognition and densely distributed targets. Ports are dynamic environments containing diverse objects, including ships and ground vehicles, with varying sizes and dense distributions that sometimes blend with water backgrounds. The model demonstrated strong multitarget detection capabilities, accurately identified tightly packed containers in terminals, distinguished docked ships from moving vehicles, and maintained reliable performance in these challenging settings. This underscores its effectiveness in multicategory classification and localization tasks within complex backgrounds.

In airport scenarios, such as Denver International Airport, the emphasis was on the model’s performance in large-scale, dynamic environments. Airports typically include runways, terminals, aprons, and targets such as aircraft and ground vehicles. These targets are widely distributed and exhibit mixed states of motion, combining static and dynamic elements. The model successfully detected stationary structures, such as terminals, as well as dynamic objects, including aircraft on runways, achieving high accuracy despite challenges such as lighting variations and shadow interference. These results demonstrate the model’s strong adaptability and robust detection performance in highly dynamic and complex environments.

Overall, the test results reveal that the model excels in multitarget, multibackground, and mixed dynamic‒static scenarios, highlighting its generalizability and robustness in complex environments. This exceptional performance in multicategory detection tasks underscores the model’s potential for practical applications in remote sensing imagery. Furthermore, the tests provide valuable insights into future optimization directions. For example, improving the resolution of small targets and enhancing background suppression mechanisms could further increase detection accuracy in dense environments. In dynamic scenarios, integrating temporal information and trajectory prediction techniques could enhance the model’s ability to process sequential data, paving the way for more advanced applications in traffic monitoring, port management, and urban planning.

Discussion

Analysis of misclassification and false detection

The trained YOLOv11 model offers significant advantages in remote sensing applications, particularly through its end-to-end design, which enables rapid detection. By processing multiple targets in a single forward pass, the model efficiently predicts both locations and categories. For objects with distinct geometric features, such as airplanes and ships, the model delivers exceptional performance. Similarly, fixed-shape and relatively large targets, such as those in sports fields, are characterized by clear contours and features in high-resolution remote sensing images. The model effectively capitalizes on these characteristics, achieving high accuracy in extracting critical information for detection tasks. However, limitations emerge when addressing extremely small targets or targets embedded in complex backgrounds. These challenges are particularly pronounced in remote sensing scenarios. For example, vehicles captured at great distances may occupy minimal pixel areas, rendering them less distinguishable from the overall image. As illustrated in Fig. 18, the vehicles on the right appear as tiny pixel clusters, which increases the likelihood of the model misclassifying them as background or failing to detect them entirely.

Fig. 18 Misdetection cases.

The observed issues in misclassification and false detection arise from several key factors:

(1) Insufficient feature representation for small targets: Small targets, such as distant vehicles, ships, or auxiliary building structures, often lack sufficient texture and shape features owing to their minimal pixel coverage. This deficiency makes it challenging for the model to extract salient features effectively. Furthermore, small targets may blend into surrounding backgrounds or be overshadowed by noise, leading to higher rates of false detections and missed detections. For example, vehicles situated in complex road conditions or dense traffic environments may be difficult to differentiate from fine details such as road markings, shadows, or other background elements.

(2) Background complexity interference: Targets in remote sensing imagery often reside in intricate backgrounds, which can interfere with the feature extraction process and complicate recognition tasks. Low contrast or merged boundaries between targets and their backgrounds exacerbate this issue, increasing the likelihood of misclassification and false detection.

(3) Similarity between target categories: When multiple target categories with overlapping characteristics coexist in an image, distinguishing between them becomes challenging. For example, densely arranged containers and small ships in a port scene may share similar shapes and textures, causing confusion during detection. Similarly, vehicles and ground equipment in airports often exhibit feature similarities that lead to misclassification.

(4) Occlusion and target overlap: High-density target distributions frequently result in occlusion or overlap, as observed in scenarios such as athletic fields or ports. Partial occlusion makes it difficult for the model to extract sufficient features from obscured targets, leading to reduced classification and localization accuracy. For example, overlapping ships or buildings may be mistaken as a single target, resulting in missed detections or classification errors.

(5) Scale variation and resolution differences: Remote sensing images often contain multiscale targets, with the same category appearing in significantly different sizes. The model must adapt to these scale variations, as distant small targets and close-up large targets pose different detection challenges. Furthermore, differences in image resolution can lead to feature loss for highly detailed targets in lower-resolution images, further increasing the risk of misclassification or false detection.

These findings highlight critical areas for future optimization, such as enhancing small-target feature extraction, improving background suppression techniques, and developing mechanisms to adapt to scale variations and resolution inconsistencies. Addressing these challenges would significantly enhance the model’s overall robustness and accuracy in remote sensing applications.
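A simple way to quantify the error types discussed above is to match predictions to ground-truth annotations by class and IoU and then tally true positives, false detections, and missed detections. The sketch below implements a greedy, class-aware matcher for this purpose; it is a simplified diagnostic tool rather than the evaluation protocol of this study, and the example boxes are invented for illustration.

```python
import torch
from torchvision.ops import box_iou

def count_errors(pred_boxes, pred_classes, gt_boxes, gt_classes, iou_thr=0.5):
    """Greedy, class-aware matching of predictions to ground truth.

    pred_boxes/gt_boxes: (N, 4)/(M, 4) tensors in (x1, y1, x2, y2) format;
    pred_classes/gt_classes: integer tensors of length N/M.
    Returns (true positives, false detections, missed detections)."""
    ious = box_iou(pred_boxes, gt_boxes)                 # (N, M) overlap matrix
    matched = torch.zeros(len(gt_boxes), dtype=torch.bool)
    tp = fp = 0
    for i in range(len(pred_boxes)):
        # Only unmatched ground-truth boxes of the same class qualify.
        candidates = (gt_classes == pred_classes[i]) & ~matched
        if candidates.any():
            j = int(torch.argmax(ious[i] * candidates))
            if candidates[j] and ious[i, j] >= iou_thr:
                tp += 1
                matched[j] = True
                continue
        fp += 1                                          # false detection
    fn = int((~matched).sum())                           # missed detections
    return tp, fp, fn

# Illustrative example: two small vehicles, one detected, plus one spurious box.
gt_b = torch.tensor([[0., 0., 10., 10.], [30., 30., 38., 38.]])
gt_c = torch.tensor([0, 0])
pr_b = torch.tensor([[1., 1., 10., 10.], [100., 100., 110., 110.]])
pr_c = torch.tensor([0, 0])
print(count_errors(pr_b, pr_c, gt_b, gt_c))              # -> (1, 1, 1)
```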

Comparison of the experimental results with YOLOv10

On the basis of the data in Table 4 and Table 5, YOLOv11 demonstrates substantial improvements over YOLOv10 in both loss function optimization and accuracy metrics, underscoring its enhanced performance in object detection tasks. These improvements are attributed to advancements in model architecture and training strategies, which have significantly improved adaptability and robustness for multitarget detection in complex remote sensing scenarios.

Table 4 Comparison of YOLOv10 and YOLOv11 in terms of the training interval of the loss functions.
Table 5 Comparison of the accuracy metric training intervals of YOLOv10 and YOLOv11.

In terms of loss functions, YOLOv11 achieves superior results compared with YOLOv10 across Box_Loss, Cls_Loss, and DFL_Loss during both training and validation. Specifically, YOLOv11 reduces Box_Loss to 0.6272 in training and 0.8465 in validation, both of which are lower than YOLOv10’s corresponding values. This indicates greater precision in boundary localization, particularly in addressing challenges such as small-object detection and boundary ambiguity in complex backgrounds. These improvements are largely driven by the multiscale feature extraction modules, such as C3k2 and SPPF, which strengthen the model’s ability to perceive target boundaries across varying scales. Similarly, YOLOv11 significantly lowers Cls_Loss, reflecting improved differentiation of target categories, particularly in complex multicategory scenarios. Additionally, the reduced DFL_Loss highlights YOLOv11’s ability to capture fine details and improve regression accuracy, providing critical support for high-precision object detection.

In terms of accuracy, YOLOv11 outperforms YOLOv10 across all key metrics, including precision, recall, mAP, and the F1 score. The precision increased from 0.8714 with YOLOv10 to 0.8861 with YOLOv11, demonstrating a reduced false positive rate. This improvement is attributed to the integration of the cross-scale pixel spatial attention (C2PSA) mechanism, which enhances the model’s ability to focus on salient features while mitigating background interference. Recall improved from 0.8344 to 0.8563, further highlighting YOLOv11’s reduced missed detection rate, particularly in scenarios with densely distributed targets or small objects. Furthermore, mAP@0.5 and mAP@0.5–95 both exhibited substantial improvements, reaching 0.8920 and 0.8646, respectively. These results indicate that YOLOv11 excels under both lenient and strict IoU conditions, demonstrating stronger classification and localization capabilities. The F1 score also increased from 0.8524 to 0.8709, reflecting a better balance between precision and recall.
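As a quick arithmetic check, the F1 score is the harmonic mean of precision and recall, so the reported values can be verified directly from the quoted precision and recall figures; small discrepancies in the last digit are expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values quoted in the text (already rounded to four decimals).
print(round(f1_score(0.8714, 0.8344), 4))  # YOLOv10: ~0.8525 (reported 0.8524)
print(round(f1_score(0.8861, 0.8563), 4))  # YOLOv11: ~0.8710 (reported 0.8709)
```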

Overall, YOLOv11’s enhancements in both loss function optimization and accuracy metrics provide a significant performance improvement over YOLOv10. These improvements enable more accurate detection in complex environments, particularly in scenarios involving small targets, densely distributed objects, and challenging backgrounds. This makes YOLOv11 a highly robust and adaptable solution for remote sensing applications, offering improved accuracy and reliability in multitarget detection tasks. Future optimization efforts could focus on further refining small-object detection capabilities and enhancing the model’s adaptability to highly dynamic environments, paving the way for broader applications in remote sensing imagery.

As shown in Fig. 19, a comparative analysis of detection performance highlights that YOLOv11 outperforms YOLOv10 across various tasks, including large-target detection, small-target detection, multitarget detection, and the reduction of false negatives and false positives. In large-target detection tasks, such as Stadium and Overpass (Figs. 19a and b), both models exhibit comparable confidence levels, yet YOLOv11 demonstrates slight but consistent improvements. This indicates further refinement in YOLOv11’s feature extraction and classification capabilities, enabling it to better capture the prominent features of large-scale targets. For small-target detection tasks, such as Ship (Fig. 19c), YOLOv11 achieves significantly higher average confidence levels than YOLOv10 does. This improvement is attributed primarily to advancements in its multiscale feature extraction modules (e.g., C3k2 and SPPF), which enhance its ability to identify and capture the fine details of small targets. In terms of false negatives and false positives (Fig. 19d), YOLOv11 has notable advantages. For example, YOLOv10 yields false positives in the Overpass scenario and misses detections in the Expressway Service Area. In contrast, YOLOv11 achieves higher accuracy by effectively minimizing such errors, showing enhanced robustness in both classification and localization tasks. Furthermore, in multitarget detection scenarios (Fig. 19e), YOLOv11 consistently delivers higher confidence levels, particularly in scenes with complex target distributions. This improvement is largely due to the integration of its cross-scale pixel spatial attention (C2PSA) mechanism, which allows the model to focus better on key target regions while suppressing background noise, thereby improving both precision and recall. A notable example of YOLOv11’s performance advantage is observed in the Expressway Toll Station detection task (Fig. 19f). YOLOv10 frequently misclassifies this category, leading to false positives, whereas YOLOv11 effectively avoids such errors, demonstrating its superior ability to differentiate between visually similar target categories in multiclass detection tasks.

Fig. 19 Comparison of the performance between YOLOv10 and YOLOv11.

Overall, YOLOv11 shows significant advancements in increasing detection confidence, reducing false negatives and false positives, and handling multitarget scenarios in complex environments. These improvements highlight its potential as a reliable and precise solution for remote sensing imagery analysis and multitarget detection tasks. By addressing challenges such as small-target identification, complex background interference, and densely distributed objects, YOLOv11 provides robust support for practical applications in areas such as transportation monitoring, urban planning, and resource management.

Future directions for optimization and improvement

While YOLOv11 has demonstrated remarkable advancements in remote sensing object detection, there remains ample potential for further optimization to address the increasing complexity of application scenarios and task requirements. Future improvements can be explored in the following areas:

1. Expanding dataset diversity and improving class balance: Increasing the scale of datasets and enhancing sample diversity will be pivotal for addressing the limitations of current data. For underrepresented categories, techniques such as generative adversarial networks (GANs) or other data augmentation methods can be utilized to generate high-quality samples, mitigating sample scarcity. Additionally, collecting images across varying resolutions, angles, weather, and lighting conditions can enrich dataset characteristics, improving the model’s adaptability to diverse and complex scenarios.

2. Integrating segmentation models for improved accuracy: The integration of object segmentation techniques into the detection pipeline can significantly increase accuracy by extracting precise boundary information. This is particularly beneficial for small objects in complex backgrounds. Joint detection‒segmentation frameworks could enable simultaneous classification, localization, and boundary segmentation, leading to more granular detection in multitarget scenarios, and multitask learning approaches could further enhance localization, classification, and boundary refinement, especially in dynamic or high-density target environments.

3. Enhancing the model structure and module design: (1) Convolutional advancements: Incorporating innovative techniques such as wavelet convolutions (WTConv) can expand receptive fields and extract low-frequency information, addressing challenges in small-object detection and long-range dependencies. Similarly, StripPooling modules, which capture horizontal and vertical dependencies, can optimize global and local feature representations, improving performance in dense and occluded target distributions. (2) Attention mechanism improvements: The introduction of single-head self-attention (SHSA) or deformable large kernel attention (DLKA) can enhance the model’s ability to extract salient features, particularly in complex scenes with small or dynamic targets. (3) Module optimization: Adopting CMUNeXt blocks for global context modeling can improve classification accuracy, whereas dynamic feature fusion (DFF) modules can strengthen feature interactions across layers, enabling better detection of intricate object details.

4. Comparing YOLOv11 with other models: In future work, it would be beneficial to compare YOLOv11 with other state-of-the-art models, such as Faster R-CNN57, RetinaNet58, and EfficientDet59, to better understand the strengths and limitations of YOLOv11 in various application scenarios. For example, Faster R-CNN has shown excellent performance in small object detection due to its region proposal network (RPN), while RetinaNet excels at handling class imbalance in complex backgrounds. EfficientDet, with its efficient architecture, achieves a good balance between speed and accuracy, especially in real-time applications60,61. By conducting comparative studies, we could identify specific areas where YOLOv11 may be further improved and explore potential integration strategies that combine the strengths of multiple models.

5. Scenario-driven model customization: Given the diversity of remote sensing applications, task-specific optimizations are essential. For military surveillance, enhancements that target camouflaged or small mobile objects can improve long-range detection accuracy. In ecological monitoring, emphasis could be placed on detecting vegetation and water bodies, whereas for urban planning, models could focus on transportation infrastructure analysis, such as roads and bridges. Scenario-driven customization can significantly enhance task adaptability and expand practical applications in various domains.

6. Optimizing annotation and training strategies: Automating annotation processes with weakly supervised or semisupervised learning can increase annotation efficiency and consistency. Dynamic annotation optimization strategies could further improve data quality. During training, dynamic learning rate adjustments based on real-time performance feedback can enhance parameter tuning. Additionally, adaptive loss weight optimization for rare categories can improve the model’s ability to detect underrepresented objects, strengthening its performance in multiclass detection tasks; a minimal sketch of these two training-side ideas follows this list.
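To make the last point more concrete, the sketch below shows, under the assumption of a generic PyTorch training loop rather than the actual YOLOv11 training pipeline, how inverse-frequency class weights and a validation-mAP-driven ReduceLROnPlateau scheduler could be combined; the class counts, layer sizes, and hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Hypothetical per-class sample counts; rare classes receive larger loss weights.
num_classes = 20
class_counts = torch.tensor([5000., 300., 1200.] + [1000.] * (num_classes - 3))
class_weights = class_counts.sum() / (num_classes * class_counts)  # inverse-frequency weighting

criterion = nn.CrossEntropyLoss(weight=class_weights)   # adaptive loss weighting
head = nn.Linear(256, num_classes)                      # stand-in for a detection head
optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.937)
# Halve the learning rate whenever the validation metric stops improving.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=5)

for epoch in range(30):
    features = torch.randn(64, 256)                     # placeholder training batch
    labels = torch.randint(0, num_classes, (64,))
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    val_map = torch.rand(1).item()                      # placeholder validation mAP
    scheduler.step(val_map)                             # feedback-driven LR adjustment
```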

In summary, these future directions aim to further enhance YOLOv11’s robustness, adaptability, and precision across diverse and challenging remote sensing scenarios. By addressing current limitations in small-object detection, complex backgrounds, and dynamic environments, YOLOv11 can evolve into an even more versatile tool for applications such as traffic monitoring, urban planning, and environmental conservation.

Conclusion

This study employed the YOLOv11 model for training, detection, and identification of land cover targets in remote sensing imagery, encompassing 20 target categories with a total of 70,389 samples. After 496 training epochs, the model demonstrated robust and consistent performance. The rapid convergence of the loss functions, including Box_Loss, Cls_Loss, and DFL_Loss, underscores the model’s effective optimization in localization, classification, and detail refinement tasks. The evaluation metrics further highlight the model’s high precision and robustness, achieving a precision of 0.8861, a recall of 0.8563, an mAP@0.5 of 0.8920, an mAP@0.5–95 of 0.8646, and an F1 score of 0.8709. These results confirm the model’s reliability in accurately detecting and classifying objects under stringent IoU thresholds, with 80% of the test samples achieving recognition confidence exceeding 85%. Compared with YOLOv10, YOLOv11 performs significantly better in detecting small targets, adapting to complex backgrounds, and handling densely distributed multitarget scenarios. These enhancements stem from the model’s architectural advancements, such as its multiscale feature extraction modules and attention mechanisms, which collectively contribute to its superior performance. Additionally, real-world application tests conducted on randomly selected locations from Google Earth further validated the model’s adaptability and robustness. Despite occasional missed detections in high-resolution images with complex backgrounds, the model consistently demonstrated stable detection of most targets and maintained high confidence in its classification results. In dynamic environments such as ports and airports, the model effectively addresses the challenges posed by densely distributed targets, highlighting its potential for complex real-world applications.

In summary, the YOLOv11 model excels in remote sensing object detection, offering high accuracy, robustness, and outstanding multitarget detection capabilities in complex scenarios. While opportunities remain to enhance small-target detection and address challenges in highly intricate backgrounds, the model’s overall performance underscores its considerable potential for practical applications. This method provides a reliable foundation for intelligent remote sensing imagery processing. Future research will prioritize dataset expansion and model optimization to increase adaptability in multitarget scenarios and improve small-target detection, fostering broader implementation in fields such as environmental conservation and urban planning.