Background & Summary

With the development of space technology and the continuous launch of human spacecraft, the number of non-cooperative space targets is steadily increasing1,2. In addition to naturally occurring space rocks, various artificial satellites and debris resulting from the malfunction and disintegration of spacecraft are widely distributed in orbit. Throughout this paper, the term non-cooperative target is used in the rendezvous-and-proximity-operations (RPO) context: such a target passively drifts in orbit, provides no relative navigation aids or telemetry, and lacks active attitude control, so it must be perceived through vision-based algorithms. To improve satellite safety and extend spacecraft lifetimes, developing on-orbit servicing technologies is crucial. As shown in Fig. 1, accurate aerospace perception3 is a critical prerequisite for successful on-orbit missions. Tasks such as target monitoring4 and debris removal5 rely on space target detection: targets must be located quickly and their types identified. In contrast, on-orbit assembly6 and maintenance7 require the identification of satellite components; these tasks are based on component segmentation techniques, in which the various components are differentiated by their semantic labels. Traditional perception methods rely on ground-based optical sensors that are limited by atmospheric turbulence and adverse weather, which reduces the accuracy and timeliness needed for on-orbit servicing tasks. Imagery from spaceborne sensors is better suited to accurate, real-time perception of targets in space.

Fig. 1
figure 1

Relationship between space target perception and space missions.

Artificial intelligence technologies, particularly deep learning, have developed rapidly in recent years. The biggest challenge in applying deep learning to space target perception is the lack of high-quality, large-scale annotated datasets. Unlike natural images, sufficient images of actual space targets are extremely difficult to obtain. However, data are the core of deep learning models, and the quality of the training data largely determines a model's performance. Therefore, a dataset that closely simulates the actual space environment and contains many well-annotated space targets is urgently needed. In addition, current datasets are typically designed for specific perception tasks, such as space debris detection or satellite antenna recognition, whereas actual on-orbit servicing requires the perception of multiple aspects of the same target. Establishing a space target dataset that supports multiple perception tasks is therefore also essential.

Motivated by these challenges, this study created a benchmark dataset that supports both space target detection and component segmentation tasks. With the help of Blender software, we simulated a realistic space environment that included sunlight illumination, the Earth, a background of stars, and a spaceborne observation camera.

Next, we collected and refined 3D models of 26 space targets, including satellites, space debris, and space rocks. After importing the models into the simulation, we analyzed the surface materials, textures, orbital characteristics, and operational states of the different space targets. We further refined and adjusted the models to make the scene more realistic. By randomly varying lighting conditions and camera positions, we generated 200,000 images to support the subsequent training and validation of deep learning models.

The generated samples were annotated to meet the requirements of both space target detection and component segmentation. For the object detection task, we divided all satellites into 10 categories based on their functions and distinct visual features; together with space debris and space rocks, this results in 12 categories. Each image was annotated with bounding boxes and the corresponding categories. Subsequently, we extracted all images in the dataset belonging to the satellite category and used semantic segmentation annotations to label four distinct components: body, solar panels, antenna, and observation payloads. Note that our annotation method bridges two space target perception tasks: a model can detect and classify the type of satellite in an image and then segment its components. This approach has not been explored in previous research.

After constructing the dataset, we conducted a statistical analysis of the space targets in the dataset, calculating their distribution according to their size, position, and other relevant characteristics. Further, we selected 10 representative object detection methods and 10 semantic segmentation methods to test their performance on our dataset. We also analyzed the methods’ performance across different target categories for target detection. Similarly, for component segmentation, we evaluated the performance of each algorithm on different components. Beyond the supervised baseline models reported here, NCSTP is designed to support hybrid and self-supervised training paradigms, which hold strong potential for improving synthetic-to-real transfer in space imagery.

Based on these results, we analyzed the performance of current methods, discussed the characteristics and task challenges of the dataset, and outlined future research directions.

Related dataset

Many scholars have tried constructing space target datasets with different methods for perceptual tasks in recent years. The spacecraft pose estimation dataset (SPEED)8 was initially released by ESA and Stanford University for satellite pose estimation. Because of limitations such as restricted poses, single lighting conditions, and small dataset size, an improved version, the Next-Generation Dataset for Spacecraft Pose Estimation (SPEED+)9, was later developed. URSO10 comprises synthetic and actual images of the Soyuz and Dragon spacecraft in space: one dataset for the Dragon spacecraft and two datasets for the Soyuz model with different operating ranges. Hoang et al.11 collected 3117 images of satellites and space stations, including synthetic and authentic images; it is the first publicly available dataset for space target detection and component segmentation. SPARK is a multi-modal dataset that provides paired RGB and depth images; it contains 150,000 images covering 11 types of spacecraft and space debris. Zhang et al.12 proposed a space target recognition dataset containing 50,000 Blender-generated images, including 11 classes of satellites and 35 classes of space debris. STAR-24K13 is a dataset for space target detection with approximately 24,000 images, constructed by extracting images released on the NASA and ESA websites. UESD is a UE4-based dataset proposed by Zhao et al.14 for spacecraft component segmentation. In Table 1, we compare NCSTP with other space target datasets across multiple criteria, such as the number of images, target categories, annotation methods, and supported perception tasks.

Table 1 Quantitative statistics for each dataset.

The contributions of our work are summarized as follows.

  1. (1)

    This paper constructs a large-scale non-cooperative space target dataset that integrates multiple perception tasks, including space target detection and component segmentation. For satellite targets, fine-grained recognition and sequential perception are supported: a satellite is first detected and then its components are segmented. The NCSTP dataset provides rich, high-quality data samples and benchmark evaluation conditions for space target perception research.

  2. (2)

    We present a comprehensive and systematic review of datasets and methods related to non-cooperative space target perception tasks, including space target detection and component segmentation, with analysis and perspective on future research trends.

  3. (3)

    This article evaluates the performance of 20 advanced methods for space target detection and component segmentation on NCSTP. It also presents a direction to improve tasks according to challenges and the characteristics of the dataset.

Methods

This section provides a detailed description of the generation and annotation process of the NCSTP dataset.

Data generation

Training deep learning models requires many samples. However, the availability of directly collected 2D images of space targets is limited. To generate a rich and diverse set of space target images, we chose to use 3D models. After importing the models with optimized material types and spectral characteristics into the simulated space environment, we generated multiple samples of the same target by varying attributes such as apparent size, pose, and lighting conditions.

Space target modelling

The simulation of space objects can be divided into two parts: structural modeling and optical property modeling. The former involves constructing a three-dimensional geometric model of the object, while the latter requires selecting appropriate materials for the object to ensure it exhibits realistic surface spectral characteristics. We collected 3D models of space targets from official NASA repositories to ensure their reliability and authenticity (https://science.nasa.gov/3d-resources/). There were 26 types of 3D models, comprising 16 types of satellites, 6 types of space debris, and 4 types of space rocks.

Although the three-dimensional models obtained from these authoritative agencies ensure the accuracy of the geometric structures, their material and texture information is incomplete. Therefore, we inspected each model, added high-resolution texture maps, and adjusted its material properties based on the types of materials used in actual satellites.

Notably, the surface of the satellite body is typically covered with a layer of golden material. This material consists of a polyimide polymer combined with thin metallic films of gold, silver, or aluminum and is commonly referred to as multi-layer insulation (MLI). The primary material used in solar panels is silicon, a crucial semiconductor material with strong absorption of sunlight. Antennas are typically constructed from aluminum alloy, characterized by a gray or silver-gray appearance and offering excellent corrosion resistance, thermal conductivity, and electrical conductivity. The material of the observation payload typically depends on the type of payload; here, we model it using a Disney-type mixture of metallic and non-metallic components.

Different materials exhibit distinct thermophysical parameters and optical properties. The reflectivity, metallic value, roughness, and index of refraction (IOR) of a material all significantly influence its radiative characteristics. In Blender, we carefully configured these parameters to closely match the actual properties of the materials.
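As an illustration, the following bpy sketch configures a Principled BSDF material for an MLI-like golden surface. The object name and the numeric values are hypothetical placeholders rather than the exact settings used to build NCSTP.

```python
import bpy

def make_mli_material(name="MLI_gold"):
    """Illustrative Principled BSDF setup for an MLI-like golden surface."""
    mat = bpy.data.materials.new(name)
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["Base Color"].default_value = (0.85, 0.65, 0.20, 1.0)  # golden tint
    bsdf.inputs["Metallic"].default_value = 0.9    # thin metallic film
    bsdf.inputs["Roughness"].default_value = 0.35  # crinkled foil scatters light
    bsdf.inputs["IOR"].default_value = 1.45
    return mat

# Assign the material to an imported satellite body mesh (object name is hypothetical).
body = bpy.data.objects.get("Satellite_Body")
if body is not None:
    body.data.materials.append(make_mli_material())
```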

Simulated space environment

In this study, we simulated a space environment using Blender software. In addition to the optimized 3D target models, the simulation included the Earth, an observation camera, a star background, and a light source that mimicked the Sun.

Due to the large distance between space objects and the light source, the sunlight in the scene can be approximated as parallel light that uniformly illuminates the entire target. To simulate this, we utilized the 'Sun' type light source in Blender, which is positioned at an infinite distance. The star background is emulated in Blender using star-field textures that approximate realistic stellar density and brightness.

The Earth model also replicated surface features such as mountains, deserts, forests, and plains, and was surrounded by an atmospheric layer model that greatly enhanced its appearance. Additionally, the application of different texture maps to the Earth model is controlled procedurally to vary the Earth background under different lighting conditions, making the images more closely resemble the real scenes encountered in servicing missions.

To achieve realistic rendering of lighting and material appearances, it is necessary to calculate the radiance reaching the camera after accounting for the physical interactions of light, such as reflection and refraction, on the surfaces of spatial objects. This process requires analyzing the spectral reflectance properties of the object surfaces, specifically the variation of reflectance with respect to wavelength. Our approach primarily involves the definition of a Bidirectional Reflectance Distribution Function (BRDF) model (Eq. 1).

$$f(p,{\omega }_{i},{\omega }_{o})=\frac{d{L}_{o}(p,{\omega }_{o})}{d{E}_{i}(p,{\omega }_{i})}$$
(1)

Here, p denotes the surface point being shaded. The incident lighting direction \({\omega }_{i}\) is defined as the unit vector pointing from the surface toward the light source, whereas the view direction \({\omega }_{o}\) is the unit vector pointing from the surface toward the camera. The term \({L}_{o}(p,{\omega }_{o})\) represents the outgoing radiance at p in the direction \({\omega }_{o}\), and \(d{E}_{i}(p,{\omega }_{i})\) denotes the differential irradiance incident at p from the solid angle around \({\omega }_{i}\).

The spectral reflectance characteristics vary across different materials. The Cook-Torrance reflection model (Eq. 2), a key component of BRDF, can simulate micro-surface reflections on metallic and non-metallic surfaces. Therefore, we ultimately selected this model to calculate the radiance received by the camera after accounting for physical interactions, such as reflection and refraction, of light with the surfaces of various components of the space target.

$${f}_{r}=\frac{{K}_{d}}{\pi }+\frac{{K}_{s}}{\pi }\cdot \frac{F\cdot D\cdot G}{\cos \,{\theta }_{i}\cdot \,\cos \,{\theta }_{r}}$$
(2)

Here, \({\theta }_{i}\) denotes the incident zenith angle between the surface normal n and the incident direction l, while \({\theta }_{r}\) denotes the reflection zenith angle between n and the view direction v. The diffuse reflectance coefficient is represented by Kd, and the specular reflectance coefficient by Ks. In addition, F, D and G denote the Fresnel term, the normal-distribution function of microfacets, and the geometric attenuation term, respectively.
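For reference, the snippet below evaluates Eq. (2) numerically for a single surface point. The microfacet terms D, F, and G are filled in with common choices (GGX distribution, Schlick Fresnel, Smith geometry); these are assumptions made for illustration and may differ from the exact terms used inside Blender's Principled BSDF.

```python
import numpy as np

def cook_torrance_brdf(n, l, v, kd, ks, f0, roughness):
    """Evaluate Eq. (2): f_r = Kd/pi + Ks/pi * F*D*G / (cos(theta_i) * cos(theta_r))."""
    n, l, v = (x / np.linalg.norm(x) for x in (n, l, v))
    h = (l + v) / np.linalg.norm(l + v)                 # half vector
    cos_i = max(np.dot(n, l), 1e-4)                     # cos(theta_i)
    cos_r = max(np.dot(n, v), 1e-4)                     # cos(theta_r)
    cos_h = max(np.dot(n, h), 0.0)

    a2 = roughness ** 4                                 # GGX convention: alpha = roughness^2
    D = a2 / (np.pi * ((cos_h ** 2) * (a2 - 1.0) + 1.0) ** 2)   # normal distribution
    F = f0 + (1.0 - f0) * (1.0 - max(np.dot(h, v), 0.0)) ** 5   # Schlick Fresnel
    k = (roughness + 1.0) ** 2 / 8.0
    G = (cos_i / (cos_i * (1 - k) + k)) * (cos_r / (cos_r * (1 - k) + k))  # Smith geometry

    return kd / np.pi + ks / np.pi * (F * D * G) / (cos_i * cos_r)

# Example: grazing sunlight on a moderately rough, strongly specular surface.
f_r = cook_torrance_brdf(n=np.array([0.0, 0.0, 1.0]), l=np.array([1.0, 0.0, 1.0]),
                         v=np.array([0.0, 0.3, 1.0]), kd=0.1, ks=0.9, f0=0.9,
                         roughness=0.35)
```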

We utilized the Principled BSDF shader within Blender, along with the Cycles rendering engine, to implement this optical model. By modelling the light source and the spectral reflectance characteristics of the materials on the target surface, we achieved lighting effects that closely approximate real conditions in space.

After the simulated environment was set up and the optimized 3D models imported, we conducted a comprehensive survey of each target's functions and orbital distribution so that plausible trajectories could be generated for each model. To generate videos, the targets' motion in space was simulated by continuously altering each target's pose and position, as well as the Earth parameters and lighting conditions. By extracting frames from the videos, we obtained the required target images. For each space target in the generated data, Fig. 2 shows the different poses caused by various observation angles, different scales resulting from different observation distances, extreme imaging conditions such as high exposure or low lighting due to varying positions relative to the Sun, and RGB and grayscale images captured by different imaging sensors.

Fig. 2
figure 2

Multiple imaging conditions in our dataset. Examples illustrate pose changes from diverse view angles, scale variation due to range, illumination extremes under different Sun–target–sensor geometries and sensor modalities.
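A minimal bpy rendering loop in this spirit is sketched below. The object names ('Sun', 'Camera', 'Target'), the sampling ranges, and the output paths are illustrative assumptions; the actual scenes also animate the Earth and the target trajectories as described above.

```python
import math
import random
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.render.resolution_x, scene.render.resolution_y = 576, 324

sun = bpy.data.objects["Sun"]        # 'Sun'-type light approximating parallel sunlight
cam = bpy.data.objects["Camera"]     # observation camera
target = bpy.data.objects["Target"]  # imported space target model

# Keep the camera pointed at the target while its position is randomized.
track = cam.constraints.new(type='TRACK_TO')
track.target = target

random.seed(0)
for frame in range(10):
    target.rotation_euler = [random.uniform(0.0, 2.0 * math.pi) for _ in range(3)]
    cam.location = (0.0, -random.uniform(20.0, 200.0), random.uniform(-5.0, 5.0))
    sun.rotation_euler = [random.uniform(0.0, 2.0 * math.pi) for _ in range(3)]

    scene.render.filepath = f"//frames/target_{frame:05d}.png"
    bpy.ops.render.render(write_still=True)
```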

Data annotation

We designed a two-level space target classification scheme for this study. Based on the origin of the targets, they were classified as satellites, space debris, or space rocks, as shown in Fig. 3. Because space missions require further identification of the types of satellites and their components, we performed a fine-grained classification for all categorized satellites. The functions and structural features were used to divide the satellites into 10 fine-grained categories: CubeSats, cylindrical satellites, geo-communication satellites, LEO communication satellites, the space telescope, SAR satellites, navigation satellites, symmetrical Earth observation satellites, and single-panel Earth observation satellites.

Fig. 3
figure 3

Example of generated data and two-level space target categories.

After the two-level classification rules were defined, we classified the images of the 26 space targets previously generated. Specifically, the images of the six types of space debris were categorized as space debris, and the four types of space rocks were classified as space rocks. The 16 types of satellite target images were then further categorized into 10 fine-grained satellite categories according to the second-level classification rules.

Based on the two-level classification criteria, we annotated the bounding boxes and categories of all targets in the images to meet the target detection and recognition requirements. All satellite components in the images identified as “satellite” were further annotated in the form of masks to support component recognition. Four components, solar panels, antennas, bodies, and observation payloads, were annotated according to the structural diagram of each type of satellite. Because the source 3D models come in heterogeneous formats and naming conventions, we normalized these component variants into a shared taxonomy, as sketched below, so that models learn function rather than shape style.
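Such a mapping can be as simple as a lookup table from raw part names to the four shared component labels. The part names and integer ids below are hypothetical; the released mask files define the authoritative label values.

```python
# Hypothetical raw part names mapped onto the shared 4-component taxonomy.
COMPONENT_TAXONOMY = {
    "bus": "body", "main_structure": "body",
    "solar_array": "solar panel", "solar_wing": "solar panel",
    "dish": "antenna", "high_gain_antenna": "antenna",
    "telescope_tube": "observation payload", "camera_assembly": "observation payload",
}
LABEL_IDS = {"background": 0, "body": 1, "solar panel": 2,
             "antenna": 3, "observation payload": 4}

def label_id(part_name: str) -> int:
    """Map a raw part name from a 3D model to a shared component label id."""
    return LABEL_IDS[COMPONENT_TAXONOMY.get(part_name, "background")]
```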

Solar panels absorb energy from the Sun to power the satellite while the antenna collects signals. The body is the satellite’s core, where various support systems are typically integrated. The observation payload typically includes optical and infrared cameras. Because the payload often needs to be oriented toward the target, it is not usually integrated with the body, so we annotated it as a separate component.

The labels for object detection consist of the coordinates of the target’s bounding box and the target category. The labels for semantic segmentation are mask images of the same size as the original image, where each pixel value represents a specific component. Figure 4 shows the annotation results for a satellite image.

Fig. 4
figure 4

Multi-task annotation provided by our dataset.

Data Records

The NCSTP15 dataset is available at https://doi.org/10.6084/m9.figshare.28606754. Figure 5 visualises the two top-level branches NCSTP_det_recog and NCSTP_comseg, each pre-split into train, valid and test subsets. In the detection branch, lossless PNG frames are stored within the split folders, and the accompanying COCO-style16 annotation files (train.json, val.json, test.json) are placed in a directory that follows the official COCO field specification. In the segmentation branch, each image is associated with a single-channel, 8-bit mask named <basename>_mask.png.

Fig. 5
figure 5

Directory structure of the NCSTP dataset.
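A minimal loading sketch is shown below. The exact file and folder paths are illustrative (adjust them to the downloaded archive); the detection annotations follow the standard COCO fields, and the segmentation masks are plain single-channel PNGs.

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image

root = Path("NCSTP")  # local extraction path (adjust to your download location)

# Detection branch: COCO-style JSON annotations.
with open(root / "NCSTP_det_recog" / "annotations" / "train.json") as f:
    coco = json.load(f)
print(len(coco["images"]), "images,", len(coco["annotations"]), "boxes")
print("categories:", [c["name"] for c in coco["categories"]])

# Segmentation branch: each image has a single-channel 8-bit mask <basename>_mask.png.
img_path = root / "NCSTP_comseg" / "train" / "example.png"   # hypothetical file name
mask = np.array(Image.open(img_path.with_name(img_path.stem + "_mask.png")))
print("component labels present:", np.unique(mask))          # 0 denotes background
```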

The final dataset contains 200,000 images with a uniform resolution of 576 × 324 pixels. Each image includes a single space target. We provide bounding-box and category annotations for all images, so every image can be used to train and validate object detection and recognition models. We split the NCSTP dataset into training, validation, and test sets at a 7:1:2 ratio, with 140,000, 20,000, and 40,000 images, respectively. Splits were stratified by top-level class (and, for satellites, by the 10 fine-grained subtypes), and each subset was sampled with a fixed random seed.
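The 7:1:2 split can be reproduced, for example, with a stratified two-stage split and a fixed random seed. The metadata variables and the seed value below are placeholders, not the exact values used to build the released splits.

```python
from sklearn.model_selection import train_test_split

# Placeholder metadata: in practice, image ids and class labels come from the
# generation logs of the simulated scenes.
image_ids = [f"img_{i:06d}" for i in range(1000)]
labels = [i % 12 for i in range(1000)]   # 12 detection categories

# 80/20 first, then 12.5% of the remainder for validation -> 70/10/20 overall.
trainval_ids, test_ids, trainval_y, _ = train_test_split(
    image_ids, labels, test_size=0.20, stratify=labels, random_state=42)
train_ids, val_ids = train_test_split(
    trainval_ids, test_size=0.125, stratify=trainval_y, random_state=42)
```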

The dataset has 100,000 satellite images. We annotated the components for these images and stored them as mask images with the same resolution. This means that there are a total of 100,000 satellite images of different categories that can be used for model training and validation of component segmentation. Similarly, we split this subset into training, validation, and test sets at a 7:1:2 ratio, with 70,000, 10,000, and 20,000 images, respectively.

Table 2 shows the statistical results of the dataset for each space target perception task and annotation format.

Table 2 NCSTP dataset statistics.

Data overview

The dataset we constructed in this study has unique features and advantages, providing foundational data support for target visual perception tasks in space. Moreover, it presents new challenges for researchers in this field.

We have compiled statistical information on the sample distribution of the NCSTP dataset. Figure 6 illustrates the types of space targets included in the dataset and provides an intuitive visualization of the sample counts for each target class. It also presents a scatter plot of target size distributions. Figure 7 shows the pixel distribution for each component within the satellite category and calculates the proportion of pixels corresponding to each component (excluding the background) in the image.

Fig. 6
figure 6

The quantity and size distribution of generated data according to target category.

Fig. 7
figure 7

The distribution of pixel counts and proportions for each component.

The features of the NCSTP dataset are summarized as follows:

  1. (1)

    Large scale: Our dataset is currently the most extensive space object dataset, containing 200,000 images. These images can be used for space object detection and recognition tasks. Among them, 100,000 images categorized as satellites can be used for component segmentation tasks.

  2. (2)

    Multiple and sequential tasks: Our dataset currently supports the most diverse range of space object perception tasks, including detection, recognition, and satellite component segmentation. Additionally, our dataset supports sequential tasks on the same object.

  3. (3)

    Reasonable classification approach: We designed a two-level classification approach for all space objects. We further performed a fine-grained classification of satellite objects based on their functions and significant structural features.

Technical Validation

Based on the NCSTP, we evaluated and compared the performance of state-of-the-art (SOTA) models. The technical validation comprised two parts: (1) space target detection and recognition and (2) space target component segmentation.

Space target detection and recognition

In computer vision research, deep learning-based object detection models can simultaneously provide the bounding boxes and categories of objects in an image. Because each image in our dataset contains only one space target, a deep learning-based object detection algorithm can perform both detection and recognition for an image in the NCSTP dataset. We selected 10 object detection methods for the detection and recognition experiments: Faster R-CNN17, Yolov318, Centernet19, DETR20, Sparse R-CNN21, YoloF22, deformable DETR23, YoloX24, VitDet25, and DiffusionDet26. These algorithms cover a range of paradigms, including CNN-based two-stage and one-stage methods, anchor-free methods, transformer-based approaches, and recent diffusion-model-based methods. All models were fine-tuned for 12 epochs from pre-trained weights, and the performance of each method was evaluated after the final epoch. ResNet50 was used as the backbone of all methods except for YOLOv3, which uses Darknet53; YOLOX, which uses CSPDarknet; and ViTDet, which uses ViT-base. All methods were trained and tested on a server equipped with an NVIDIA RTX 4090 GPU to ensure fairness in the comparative experiments.
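The fine-tuning recipe can be reproduced with standard toolchains. The sketch below uses torchvision's Faster R-CNN purely to illustrate the 12-epoch, pre-trained-weights setup; the data loader, learning rate, and milestones are assumptions, not the exact benchmark configuration.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 13  # 12 NCSTP target categories + background (torchvision convention)

# Start from COCO pre-trained weights and replace the classification head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
model = model.to("cuda")

optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

# Replace with a DataLoader yielding (images, targets) in torchvision detection format.
train_loader = []

for epoch in range(12):                   # 12-epoch fine-tuning schedule
    model.train()
    for images, targets in train_loader:
        images = [img.to("cuda") for img in images]
        targets = [{k: v.to("cuda") for k, v in t.items()} for t in targets]
        loss = sum(model(images, targets).values())  # dict of losses in training mode
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```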

Evaluation metrics

We quantitatively evaluated the performance of the models using accuracy, model size, and computational complexity. For accuracy, we used mean average precision (mAP) as the standard metric, where mAP is the average of the average precision (AP) across categories and AP is the area under the interpolated precision-recall curve. Calculating mAP requires precision and recall values. Precision is the ability of a model to identify only relevant objects; it is the percentage of positive predictions that are correct. Recall is the ability of a model to find all ground-truth bounding boxes; it is the percentage of ground-truth objects that are correctly detected. The formulas for calculating precision and recall are as follows:

$${\rm{Precision}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}}$$
(3)
$${\rm{Recall}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}$$
(4)

where TP represents true positives (correctly detected objects), FP represents false positives (incorrectly detected objects), and FN represents false negatives (undetected objects). These formulas require a criterion that differentiates a correct detection from an incorrect one. A common choice is the intersection over union (IoU). In object detection, the IoU is the overlapping area of the predicted bounding box Bp and the ground-truth bounding box Bg divided by the area of their union; that is,

$$IOU=\frac{area({B}_{p}\cap {B}_{g})}{area({B}_{p}\cup {B}_{g})}$$
(5)

By comparing the IoU with a given threshold t, we can classify a detection as correct or incorrect: if IoU ≥ t, the detection is considered correct; if IoU < t, it is incorrect.

After obtaining the precision and recall values, we can calculate the AP, which considers both metrics. Before doing so, we interpolate the P-R curve to smooth it and reduce the impact of curve fluctuations. Given a recall value r, the interpolated precision Pinterp(r) is the maximum precision observed at any recall value greater than or equal to r. The AP is then obtained by summing the interpolated precision weighted by the recall increments, which is equivalent to the area under the interpolated precision-recall curve:

$$AP=\mathop{\sum }\limits_{i=1}^{n-1}({r}_{i+1}-{r}_{i}){p}_{{\rm{interp}}}({r}_{i+1})$$
(6)

where

$${P}_{{\rm{interp}}}(r)=\mathop{\max }\limits_{{r}_{n}\ge r}P({r}_{n})$$
(7)

The mAP is simply the average AP over all classes; that is,

$$mAP=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}A{P}_{i}$$
(8)

where APi is the AP in the ith class, and N is the total number of classes in the dataset.
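A compact per-class implementation of Eqs. (3)-(8) is sketched below: an IoU test (Eq. 5) decides whether a detection counts as a true positive, and the all-point interpolated AP (Eqs. 6-7) is computed from the resulting precision-recall points. This is a didactic sketch; the benchmark numbers reported below follow the standard COCO evaluation.

```python
import numpy as np

def iou(box_p, box_g):
    """IoU of two boxes given as [x1, y1, x2, y2] (Eq. 5)."""
    x1, y1 = np.maximum(box_p[:2], box_g[:2])
    x2, y2 = np.minimum(box_p[2:], box_g[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_p) + area(box_g) - inter)

def average_precision(recalls, precisions):
    """All-point interpolated AP (Eqs. 6-7); recalls must be sorted ascending."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Interpolation (Eq. 7): precision at r is the max precision at any recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    steps = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))  # Eq. 6
```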

In addition, we calculated the FLOPs and the number of parameters (Params) for each model. FLOPs are commonly used to measure the computational complexity of a model, indicating how many floating-point operations are required during execution. Params is the total number of parameters that need to be trained in the model, reflecting the size of the model.

Results and analysis

Table 3 presents the overall performance of the 10 algorithms on the NCSTP dataset. Specifically, we provide the backbone, mAP, FLOPs, and Params of each algorithm. We varied the IoU threshold t from 0.5 to 0.95 in steps of 0.05; the final mAP value was obtained by calculating the mAP at each threshold in this range and averaging the results. Additionally, following the COCO definition16, we calculated the mAP of each method separately for large, medium, and small objects.

Table 3 Comparison of benchmark evaluation results for target detection and recognition.
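In practice, these numbers can be reproduced with the COCO evaluation API once a model's detections are exported in COCO result format; the file names below are illustrative.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("NCSTP_det_recog/annotations/test.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")              # a model's exported results

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints mAP@[.5:.95] and the small/medium/large breakdown
```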

Table 4 shows the mAP values of the 10 models for each space object category. The abbreviations of the category names in the table correspond to their full names as follows: CubeSat (CU), cylindrical satellite (CY), geo-communication satellite (GE), LEO communication satellite (LE), space telescope (ST), SAR satellite (SA), navigation satellite (NA), symmetrical Earth observation satellites (SY), single-panel Earth observation satellites (SI), space debris (DE), and space rocks (RO).

Table 4 Comparison of mAP for each class.

Figure 8 shows the overall performance of each algorithm for space object detection and recognition on the NCSTP dataset. Each circle in the figure represents a model. The horizontal axis indicates the FLOPs of the model, with positions further to the right representing higher computational cost. The vertical axis represents mAP, with higher positions indicating better accuracy. The radius of each circle corresponds to the model's Params, with larger radii indicating more parameters. At the per-class level, the satellite subclasses are generally easier than debris and rocks. The DETR variants show the steepest drop on rocks, while Sparse R-CNN remains robust.

Fig. 8
figure 8

Object detection model performance scatter plot on NCSTP dataset.

While Sparse R-CNN yields the highest AP on NCSTP, it is relatively demanding for embedded deployment. For on-orbit scenarios, compute and power budgets are tight, so models with small footprints are preferable. YOLOv3 requires the fewest FLOPs, but its mAP is lower than that of the other algorithms, and its parameter count is comparatively large, so it needs more storage. YOLOX has the smallest Params and relatively low FLOPs, but its accuracy is lower than that of half of the other algorithms. Taking into account model accuracy, storage requirements, and computational cost, YOLOF offers the most balanced performance for on-orbit space target detection and recognition.

Overall, transformer-based algorithms typically have more parameters. Additionally, all methods experience a performance drop on small targets, indicating that further research is needed to improve the detection and recognition of distant space objects. The accuracy requirements and on-orbit resource constraints of the actual mission should guide the choice of the most suitable method.

Space target component segmentation

This study transformed space target component recognition into a semantic segmentation problem. For each input image, we labelled the various components of the satellite in the form of masks and separated them from the background. Semantic segmentation aims to assign each pixel in an image to a predefined object class.

We selected 10 semantic segmentation methods for the satellite component recognition experiments, namely FCN27, U-Net28, DeepLabV3+29, GCnet30, Fast-SCNN31, SETR32, Segformer33, K-net34, MaskFormer35, and Mask2Former36. These methods cover both CNN-based and transformer-based approaches. All models were fine-tuned for 160,000 iterations from pre-trained weights, and the performance of each method was evaluated after the final iteration. Regarding backbones, SETR uses ViT-Large, Segformer uses MiT-B0, and Fast-SCNN has no separate backbone; all other methods use ResNet50. As in the space target detection and recognition experiments, all methods were trained and tested on a server with an NVIDIA RTX 4090 GPU.

Evaluation metrics

As with the object detection and recognition tasks, we evaluated the performance of these models on segmenting satellite components in the NCSTP dataset in terms of accuracy, model size, and computational complexity. For model size and computational complexity, we again use Params and FLOPs as evaluation metrics. The accuracy of the semantic segmentation models was measured using the mean intersection over union (mIoU).

In semantic segmentation, the IoU is the ratio of the intersection and union of the ground-truth labels and the predicted values for a specific class. The mIoU is the average of the IoU for each class in the dataset. Its calculation formula is as follows:

$$mIoU=\frac{1}{k+1}\mathop{\sum }\limits_{i=0}^{k}\frac{{p}_{ii}}{{\sum }_{j=0}^{k}\,{p}_{ij}+{\sum }_{j=0}^{k}\,{p}_{ji}-{p}_{ii}}$$
(9)

where Pij represents the number of pixels of class i that are predicted as class j. In a confusion matrix for semantic segmentation, Pij indicates how many pixels from the ground truth of class i are misclassified as class j by the model. The formula can also be written as follows:

$$mIoU=\frac{1}{k+1}\,\mathop{\sum }\limits_{i=0}^{k}\frac{TP}{FN+FP+TP}$$
(10)

where TP represents true positives, the number of correctly predicted pixels for a specific class; FN represents false negatives, the pixels of a specific class that were not predicted correctly; and FP represents false positives, the pixels incorrectly predicted as a specific class.

In addition to the mIOU, we also calculated another metric to measure model accuracy, mACC. Accuracy refers to the proportion of correctly predicted pixels to the total pixels. mACC is the average accuracy across all categories:

$$mACC=\frac{1}{k+1}\,\mathop{\sum }\limits_{i=0}^{k}\frac{TP}{FN+TP}$$
(11)

However, the mACC only considers the ratio of true positives to false negatives and, unlike the mIoU, does not account for the impact of false positives. Compared with the mACC, the mIoU more rigorously evaluates the performance of segmentation models, especially under class imbalance, and better reflects a model's fine-grained prediction capability. Therefore, this paper primarily uses mIoU to measure model accuracy.

It is worth mentioning that in the images of the NCSTP dataset, aside from the pixels occupied by space targets, a significant portion is labeled as background with a pixel value of 0. Because our evaluation focuses solely on the segmentation accuracy of satellite components, the mIoU and mACC values we report exclude the background class (i.e., pixels labeled 0 are ignored).
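The metrics of Eqs. (9)-(11) can be computed from a per-pixel confusion matrix, dropping the background label 0, as in the sketch below (the class count of 1 background plus 4 components follows the annotation scheme).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """cm[i, j] = number of pixels with ground-truth class i predicted as class j."""
    valid = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_macc(pred, gt, num_classes=5, ignore_background=True):
    """mIoU (Eq. 10) and mACC (Eq. 11); label 0 is excluded when ignore_background."""
    cm = confusion_matrix(pred.ravel().astype(int), gt.ravel().astype(int), num_classes)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    iou = tp / np.maximum(tp + fn + fp, 1)   # guard against division by zero
    acc = tp / np.maximum(tp + fn, 1)
    start = 1 if ignore_background else 0
    return iou[start:].mean(), acc[start:].mean()
```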

Results and analysis

Table 5 presents the performance of these 10 representative semantic segmentation algorithms on the NCSTP dataset. The table reports the mIoU and mACC values for each model across all component categories, as well as the IoU and accuracy values for the four components: body, solar panel, antenna, and observation payload. In addition, we list the backbone used by each model, along with its Params and FLOPs. As with the object detection algorithms, to comprehensively compare overall performance, we plotted the mIoU, Params, and FLOPs of all models in Fig. 9, where the X-axis represents FLOPs, the Y-axis represents mIoU, and each colored circle represents a model; the radius of the circle indicates the size of the model's Params.

Table 5 Comparison of benchmark evaluation results for component segmentation.
Fig. 9
figure 9

Semantic segmentation model performance scatter plot on NCSTP dataset.

Among all the semantic segmentation models, Mask2Former36 achieved the highest accuracy, with an mIOU of 88.1. Its Params was at a moderate level, but its FLOPs value was relatively high, indicating that this model’s deployment and inference require considerable computational resources. In contrast, the Fast-SCNN model had very small Params and FLOPs, allowing it to be deployed and run under extremely limited storage and computational resources. However, its mIOU was also much lower than that of other models. Additionally, SegFormer and DeepLabv3+ exhibited high mIOU while maintaining relatively low Params and FLOPs, making their overall performance quite impressive.

A detailed analysis of the models' performance for each component category revealed that all models achieved higher segmentation accuracy for solar panels than for the other components. This is likely because solar panels typically have clear boundaries with other components, possess similar structures across different satellites, and generally occupy more pixels than other component types. The practical implication of this bias is that, in space operations, different components attract different levels of interest; if a given mission places greater emphasis on antennas or payloads, more targeted perception algorithms tailored to those components will be needed.

The semantic segmentation algorithm can be chosen for tasks with different resource constraints. If higher accuracy is desired and resources are sufficient, Mask2Former36 can be selected. However, if the model needs to be deployed on platforms with limited computational and storage resources, such as small satellites or space drones, Fast-SCNN would be a better choice. While this sacrifices accuracy, it meets the requirements for real-time tasks.