Abstract
Estimating the 6D pose of objects is crucial for robots to interact with their environment, yet 6D object pose estimation from RGB images remains difficult in cluttered scenes and under heavy occlusion. Most existing methods estimate the pose in two stages: they first extract object features and then apply a PnP/RANSAC procedure. These techniques typically localize a set of key-points by regressing their coordinates, which makes them vulnerable to occlusion, degrades multi-object pose estimation, and prevents the 6D pose from being regressed directly through a training loss. In this paper, we propose an end-to-end framework based on a convolutional neural network (CNN) and a self-attention mechanism for single- and multi-object 6D pose estimation from RGB images at low computational cost. Our method uses feature fusion to extract local features and combines multi-head self-attention (MHSA) with iterative refinement to improve pose estimation, and it can be scaled according to the available computational resources. Our experiments on the Linemod and Occlusion Linemod benchmarks show that our method achieves 97.45% and 84.38% in terms of the ADD(-S) metric, respectively.
Introduction
An object’s pose determines its orientation and position in space. Generally, a 3D rotation and translation with six degrees of freedom (6DoF) describe an object’s pose. Estimating the 6D pose of an object is a significant challenge in computer vision, as it involves recovering both the object’s 3D orientation and its 3D location in camera-centered coordinates, i.e., the 3D rotation and translation vectors that together define the 6D pose1,2. Nowadays, 6D object pose estimation has become indispensable for many successful applications such as driverless vehicles, manufacturing robots, and augmented reality2,3,4,5,6,7. Therefore, understanding and perceiving an object’s pose is essential for many applications, including robotic grasping and enabling robots to interact with actual objects in the real world8. For instance, in the industrial sector, a robot must be able to analyze an object’s pose in order to manipulate it9,10. Consequently, 6D pose estimation provides detailed information on both the 3D location and orientation of objects11, allowing robots to perform many manipulation tasks such as object grasping, assembly, and human-robot interaction by learning from demonstration.
In order to perform 6D pose estimation of objects, a method should be robust against heavy occlusion, sensor noise, background clutter, and changing illumination conditions while meeting the speed requirements of real-time applications2,4,9. In recent years, both convolution and self-attention have advanced computer vision considerably. Convolutional neural networks (CNNs) have become dominant in the field, and CNN-based methods perform well on various tasks such as image recognition, classification, object detection, and semantic segmentation, achieving strong results on many benchmarks12. At the same time, vision transformers13,14 have made dramatic achievements in many vision tasks thanks to self-attention15,16. For 6D object pose estimation, current techniques can estimate the 6D pose in cluttered environments with outstanding efficiency, driven by the recent success of deep learning17,10. Most existing methods use a two-stage approach: they first establish 2D-3D correspondences with a reference image or the object’s model11 and then recover the 6D pose using a RANSAC-based Perspective-n-Point (PnP) algorithm for each object separately18,23. Consequently, they are not practical in many real-world applications, since they take more time and have a higher computational cost. Other, end-to-end methods4,5,10,24,25,26 estimate the 6D pose of objects directly. However, several significant challenges remain, such as heavy occlusion, cluttered scenes, and unpredictable lighting conditions. To tackle changing illumination, a plausible and effective strategy is extensive training across a wide range of lighting scenarios2,4,9. Occlusion and cluttered scenes, in contrast, remain open areas of research that demand further exploration. Clearly, these complexities require innovative techniques to enhance the performance and reliability of object pose estimation.
In this paper, we introduce a new method to accurately estimate the 6D pose of objects from RGB images, designed to provide fast and efficient inference with an end-to-end approach. Existing methods depend on resource-intensive multi-stage procedures such as RANSAC and PnP, which frequently suffer from computational inefficiency. In contrast, our method obviates the need for these procedures. By integrating self-attention and iterative refinement, our model directly estimates object poses, even in cluttered environments, which markedly improves precision and computational efficiency. The proposed method comprises three essential components: backbone, feature fusion, and prediction subnetworks. The backbone, based on Densenet, enables richer feature extraction from occluded objects through its densely connected layers; embedding MHSA in it allows the network to focus on essential regions and enhances pose estimation in challenging environments with overlapping objects. The feature fusion component consolidates multi-scale features to capture essential information from varying feature resolutions, ensuring that the model learns effectively from both global and local contexts and can therefore predict an object under occlusion (Sec. 3.1.2). To further improve robustness to occlusion, our method follows the same principle as5 by employing pixel-wise voting for the center of an object instead of predicting the whole object. Finally, the prediction subnetworks improve 6D pose estimation by combining iterative refinement and MHSA in the translation and rotation tasks. This iterative approach ensures that the model refines its predictions progressively, achieving higher accuracy regardless of the number of objects in the scene. In summary, the key contributions of this work are:
-
We propose a novel framework architecture that seamlessly integrates feature fusion and combines convolution and self-attention operations to regress the 6D object pose directly from RGB images.
-
Our method incorporates rotation and translation networks that leverage iterative refinement modules alongside MHSA to iteratively estimate and refine the 3D rotation and translation.
-
We conduct extensive experiments on the Occlusion Linemod27 and Linemod datasets28 benchmarks. Our results show a significant advancement over existing methods that utilize RGB input, particularly for multi-object pose estimation, and also contribute to enhancing computational efficiency, making our method a compelling choice for real-world applications.
Related work
Two-stage pose estimation approach
Considerable literature that uses a two-stage approach has been published7,18,20,21,29,30, and the two-stage approach dominates the field of 6D pose estimation: first detect 2D key-points or 2D-3D correspondences with the object’s 3D model in the image, and then estimate the 6D pose using a RANSAC-based Perspective-n-Point algorithm. Rad et al.7 propose BB8, a framework that predicts object poses as the 2D projections of the corners of their 3D bounding boxes and then uses a PnP algorithm to determine the 3D poses from this 2D-3D relationship. However, when the object is partially unseen, BB8 may not obtain an accurate 3D bounding box during the estimation stage. To address this issue, Hu et al.20 suggest dividing the image into multiple patches and requiring each patch to predict the locations of the 2D projections and the object to which they belong. The 6D pose is then estimated by aggregating all patches belonging to the same object and applying the PnP algorithm. Similarly, Su et al. introduce ZebraPose21, an RGB-based method for estimating the 6D object pose. The method utilizes a hierarchical binary grouping strategy to build dense 2D-3D correspondences by employing a coarse-to-fine surface encoding technique. Specifically, ZebraPose assigns binary descriptors to 3D vertices and uses a progressive training strategy to predict correspondences. PVNet29, on the other hand, regresses pixel-wise unit vectors that point toward key-points. This vector-field representation focuses on local features and establishes the relationship between different parts of the object, enabling PVNet to recover the occluded parts of the object from the visible portions, even if they lie outside the image. In contrast, DPOD30 employs a UV map for rich correspondences. Recent approaches, such as 6D-Diff23, introduce a diffusion-based framework for 6D object pose estimation that mitigates noise and indeterminacy by employing reverse diffusion for 2D-3D correspondences, followed by the Perspective-n-Point (PnP) algorithm to recover the 6D pose. Likewise, BDR6D22 aims to improve 6D pose estimation by leveraging multi-modal data through the fusion of RGB images and depth information and then recovers the 6D pose using the PnP algorithm. From another perspective, Wang et al.31 propose a self-training framework for unsupervised domain adaptation in satellite pose estimation, leveraging domain-agnostic geometrical constraints and fine-grained segmentation to enhance the accuracy of predicted sparse keypoints, and then use PnP to estimate the poses; they introduce adversarial training for mask alignment without real data annotations. Nevertheless, the PnP technique is sensitive to minor errors in the 2D representation, making pose estimation challenging, particularly in the presence of occlusion. Although two-stage methods have been widely adopted, their limitations hinder their effectiveness, especially in multi-object scenarios: the need for an additional regression step for each object substantially increases computational expense, and as the number of objects increases, these methods struggle to maintain accuracy, resulting in a significant performance drop in multi-object pose estimation. In contrast, by combining iterative refinement and self-attention modules instead of RANSAC or PnP, our method reduces computational complexity and improves pose estimation accuracy, particularly in scenarios with multiple objects in cluttered scenes.
End-to-end pose estimation approach
One of the most cited studies is that of Xiang et al.5, who propose PoseCNN, which is based on a convolutional neural network (CNN) for 6D object pose estimation in cluttered scenes. PoseCNN aims to estimate an object’s 3D translation and rotation by using pixel-wise voting to determine the object’s center and estimating its distance from the camera to handle occlusion. In a related approach, Ullah et al.26 propose a fully convolutional, parallel architecture for pixel-wise dense estimation of 3D translation and orientation. Zhou et al.14 introduce a Deep Fusion Transformer (DFTr) network that integrates cross-modality features by leveraging semantic similarity between color and depth data and employs weighted vector-wise voting with global optimization for 3D keypoint localization. Generally, the two-stage approach is robust due to the use of the PnP algorithm for 6D pose estimation. Building on this, to leverage PnP in an end-to-end manner, Chen et al.32 propose EPro-PnP, a novel probabilistic PnP layer for end-to-end pose estimation. Specifically, EPro-PnP outputs a distribution of poses with a differentiable probability density on the SE(3) manifold, and it treats 2D-3D coordinates and corresponding weights as intermediate variables. Meanwhile, other researchers10,24,33,34 use a pose refinement network instead of the PnP algorithm to improve the performance of 6D object pose estimation in an end-to-end fashion. For instance, HybridPose33 employs multiple representations such as geometric information, key-points, and symmetry correspondences to estimate the 6D pose of objects by improving reprojection error adjustment. Similarly, Di et al.10 leverage 2D-3D correspondences and self-occlusion to establish a two-layer representation for 3D objects. They utilize a shared encoder and two independent decoders to provide 2D-3D correspondences and self-occlusion information and then combine the outputs to regress the 6D pose parameters directly. Along the same lines, Iwase et al.34 recently improved performance by formulating object pose refinement as an optimization problem based on feature alignment. From another perspective, Bukschat et al.24 extend EfficientDet35 to estimate the 6D poses of objects by adding rotation and translation subnetworks alongside the classification and bounding box regression subnetworks. The most intriguing aspect of their work is the use of 6D augmentation to enrich the dataset, as the benchmark datasets they use are small. Taken together, these studies provide important insights into 6D object pose estimation. However, most of the aforementioned works focus on developing a method for a single object rather than multi-object scenarios. Consequently, poor performance is observed when some of these methods are applied to pose estimation for multiple objects. While these methods represent a significant advance, they struggle with challenges such as occlusion and cluttered scenes. Our method addresses these issues by integrating feature fusion, self-attention, and iterative refinement to dynamically focus on relevant objects in the image, which improves pose estimation in heavily occluded and cluttered environments.
Attention mechanism and convolution
As our work is a vision task and the method is based on combining self-attention with convolution modules, we also summarize related work on vision tasks that builds on self-attention and convolution. Convolutional neural networks and vision transformers have made significant progress in computer vision in recent years, thanks to the use of convolution operations and attention mechanisms in CNNs and ViTs, respectively. Recently, considerable literature has grown around combining convolution and attention mechanisms for various vision tasks. Many researchers focus on improving transformer models by adding convolution operations. Xiao et al.36 use convolution at an early stage as a stem with a transformer to improve performance and training stability. Wu et al.37 utilize convolutional token embedding and strided convolution to decrease the computational cost of self-attention. Another work combines convolution and self-attention in a hybrid network structure called Conformer38, which uses convolution to extract local features and self-attention for global representation, thereby integrating the two and keeping the feature representation from deteriorating. Zhong et al.39 and Zhou et al.14 employ Transformer-based methods for 6D pose estimation, Trans6D and DFTr, respectively. Trans6D uses Transformers to capture global dependencies effectively, mitigating information loss, and two novel modules enhance its accuracy and robustness: a patch-aware feature fusion module and a pure Transformer-based pose refinement module. On the other hand, many previously described attention techniques applied to images show that they may help convolutional neural networks overcome their weaknesses regarding locality40. Several studies investigate the idea of applying attention modules or utilizing additional relational data to improve the performance of convolutional neural networks12. Srinivas et al.41 replace the convolutional layers with self-attention in the model’s final stages. Bello et al.40 propose to augment convolution with self-attention by concatenating feature maps from the self-attention pipeline with convolutions in certain layers. Pan et al.12 propose a hybrid model that comprises convolution and self-attention modules in a parallel manner. Building on approaches12,40, our method combines convolution and self-attention to address the challenges of 6D pose estimation. We improve pose estimation in complex, occluded environments by leveraging self-attention to capture global dependencies and convolutional operations for local feature refinement. This hybrid approach significantly reduces the performance gap in multi-object scenarios while maintaining computational efficiency.
Methodology
We propose a framework to estimate the 6D object pose for single and multiple objects from RGB images. Pose estimation involves detecting objects and calculating their 3D translations and orientations. In particular, a 6D pose is described by a transformation (R, t) from the object’s coordinate system to the camera’s coordinate system, where R and t are the 3D rotation and translation, respectively. Estimating the 6D pose of an object in the environment is critical and faces many challenges, such as heavy occlusion and cluttered backgrounds.
To resolve these challenges, (1) we employ a feature fusion technique that focuses on understanding how different parts of an object are related (Sec. 3.1.2). Instead of predicting the entire object, our method predicts key center points, which allows the model to remain effective even under heavy occlusion. (2) Self-attention is used to highlight and focus on the regions of interest and ignore the background clutter. However, applying self-attention to high-resolution images brings its own difficulties, such as high computational cost and large memory consumption. To address them, we follow the work41 in our design and consider the following: (1) use convolutions to learn low-resolution feature maps and abstract from large images effectively; (2) use self-attention to process and combine the information that the convolutions have collected in the feature maps. Convolutions thus perform spatial downsampling of high-resolution features, and self-attention is then applied to the low-resolution features.
Network architecture
Our method’s overall structure consists of five levels based on the feature resolutions. It is separated into three components, namely the backbone, the feature fusion, and the prediction subnetworks, as illustrated in Fig. 1. In our method, we incorporate two input streams: the input image and the camera parameters as a vector \(a \in R^6\), used to compute the object translation, which comprises the focal lengths of the pinhole camera \(f_x\) and \(f_y\), the principal point coordinates \(p_x\) and \(p_y\), and the scaling factors of the image and the translation measurement. The input image undergoes feature extraction in the backbone component. Specifically, we employ Densenet as the backbone to extract features comprehensively. Notably, within the backbone, the embedded MHSA operates at level 4 to enhance feature representation by capturing long-range dependencies. Features extracted at different resolutions within the backbone are fused in the feature fusion component to create a comprehensive feature representation. This fusion process is pivotal for synthesizing information from various scales and resolutions, enriching the feature set. Finally, in the prediction subnetworks component, features from different levels of the feature fusion are utilized for diverse tasks, including classification, bounding box regression, rotation, and translation. To reduce the number of parameters and optimize computational efficiency, we adopt separable convolutions instead of conventional convolutions throughout the architecture, except for the backbone, effectively reducing computational costs while maintaining performance. Generally, a separable convolution consists of a depth-wise convolution (each input channel is treated independently), followed by another convolution called a pointwise convolution, which combines the output channels produced by the depth-wise convolution42,43. We explain the overall architecture in more detail as follows:
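As a concrete illustration, the sketch below shows this depthwise + pointwise decomposition using the tf.keras API cited in42; the filter count, normalization, and activation choices are illustrative assumptions, not the exact released configuration.

```python
import tensorflow as tf

def separable_conv_block(x, filters, kernel_size=3):
    """Depthwise convolution (each input channel filtered independently)
    followed by a pointwise 1x1 convolution that mixes the channels."""
    x = tf.keras.layers.SeparableConv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.activations.swish(x)  # SiLU/Swish activation

# Example: a 64x64 feature map with 112 channels keeps its spatial size
features = tf.random.normal([1, 64, 64, 112])
out = separable_conv_block(features, filters=112)
```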
Backbone
Typical image processing uses convolutional neural networks, which have dominated computer vision for several years because of their performance and lower computational cost compared to traditional neural networks. Our model’s backbone is the densely connected convolution network Densenet44. In particular, dense connectivity promotes feature reuse and reduces the number of parameters, which is a crucial element of the Densenet design. Densenet is more effective than other designs at reusing features, since feature concatenation is used, which is known as feature reusability. The benchmark datasets for 6D pose estimation that we use are small; besides transfer learning and data augmentation, the feature reusability provided by Densenet’s dense connectivity helps enrich the learned features, as shown in Fig. 2. Accordingly, compared to a traditional CNN, Densenet may learn a mapping with fewer parameters, since duplicate mappings do not need to be learned. Densenet generally comprises three primary parts: the initial layers (stem), dense blocks, and transition layers. We modify Densenet by embedding MHSA into it. Attention mechanisms are methods for directing focus to the most important areas of an image while ignoring unimportant areas45. Unlike traditional convolutional methods that often fail to preserve feature integrity in chaotic environments, our incorporation of MHSA into the Densenet architecture allows the network to dynamically prioritize and consolidate pertinent features from partially obscured objects12. This results in enhanced pose estimation accuracy in situations characterized by significant occlusion and numerous overlapping objects, in contrast to methods such as5,24. Theoretically, self-attention is a more adaptable operation that can simulate convolutional models’ behavior when encoding local features46,47. The attention module performs its calculations repeatedly and concurrently through so-called attention heads15: it divides its Query, Key, and Value parameters N ways and independently routes each split through a different head, and the related attention computations are then combined into a final attention score. This is called MHSA, which consists of numerous self-attention blocks that capture the intricate interactions between the various parts of the sequence (see Fig. 3). As mentioned above, identifying an object in a cluttered scene is critical and challenging in 6D pose estimation tasks; thus, we use the MHSA module to resolve this issue and estimate the relevance of an object to other objects in the scene. Densenet typically has 4 dense blocks, commonly referred to as [block1, block2, block3, block4]. We embed the MHSA between blocks 3 and 4, where the feature map resolution is low, as inspired by Srinivas et al.41, who found that adding self-attention there achieved significant performance gains in their experiments.
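A minimal sketch of how MHSA can be applied to the low-resolution feature map between dense blocks 3 and 4 is shown below, using tf.keras.layers.MultiHeadAttention; the head count, the residual/LayerNorm wiring, and the omission of positional encodings are illustrative assumptions.

```python
import tensorflow as tf

def mhsa_on_feature_map(feat, num_heads=4):
    """Multi-head self-attention over the spatial positions of a feature map.

    The (H, W) grid is flattened into a sequence of H*W tokens so that every
    position can attend to every other position, then reshaped back.
    """
    _, h, w, c = feat.shape
    tokens = tf.reshape(feat, (-1, h * w, c))
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=c // num_heads)
    attended = attn(query=tokens, value=tokens, key=tokens)           # Q, K, V share the same tokens
    tokens = tf.keras.layers.LayerNormalization()(tokens + attended)  # residual connection
    return tf.reshape(tokens, (-1, h, w, c))

# Example: a low-resolution map such as the output of dense block 3
out = mhsa_on_feature_map(tf.random.normal([1, 16, 16, 256]), num_heads=4)
```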
Feature fusion
Utilizing convolution kernels in a convolutional neural network to extract local features has emerged as the most effective approach for various vision tasks12,41. However, in order to achieve a reliable 6D pose estimate, it is crucial to integrate local features from various stages of the backbone into a unified representation. This is where feature fusion techniques play a crucial role. Feature fusion typically consists of several bottom-up and top-down multi-scale pathways for gathering feature maps from various input features35,48, as illustrated in Fig. 1 (middle). In our method, we adopt the BiFPN35 module as the multi-scale feature fusion. The fusion within BiFPN involves iteratively combining features at different resolutions from the different blocks of the backbone. Specifically, features from lower-resolution layers (which contain semantic information) are merged with features from higher-resolution layers (which retain more detailed spatial information). This merging process is accomplished through a sequence of weighted connections, which enable the network to dynamically determine the significance of each feature map at various scales. By fusing local and global features, the network can make predictions based on the visible portion of an object in the scene, thereby minimizing occlusion issues4. The fusion process enables the model to accurately capture the interconnections between the various parts of an object; distinguishing the parts of each object and the relationships between them enhances the understanding of the object’s structure41,49. We follow Tan et al.35 in scaling up the fusion block width and depth, as they found the best performance by scaling the width and depth with the following equations:
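Assuming the original EfficientDet35 coefficients are kept unchanged, these scaling rules take the form:

\(W_{bifpn} = 64 \cdot \left(1.35^{\phi}\right), \qquad D_{bifpn} = 3 + \phi\)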
where \(\phi\) is the scale-up parameter.
Prediction subnetworks
6D pose estimation usually uses four computer vision tasks: object detection, classification, rotation, and translation. We define four separate networks for those four tasks, as seen in Fig. 1(right), and then concatenate each network at different levels into a single output.
Classification/Bounding Box Regression Subnetworks These two subnetworks are adopted from the EfficientDet35 network and consist of convolution layers followed by batch normalization and a SiLU activation function. Both have the same depth and width. In our experiments, we follow the work24 to balance depth and width. We fix the width of both networks to be the same as the feature fusion component width (i.e., \(W_{class}=W_{box}=W_{fusion}\)), whereas the depth is specified by the equation:
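Assuming the same rule as in EfficientDet35 and EfficientPose24, this depth grows slowly with the scaling coefficient:

\(D_{class} = D_{box} = 3 + \lfloor \phi / 3 \rfloor\)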
where \(\lfloor \cdot \rfloor\) denotes the floor function.
Rotation Subnetwork
The central and crucial parts of 6D object pose estimation are the 3D rotation and translation of the object; therefore, the rotation and translation subnetworks are designed carefully to be simple yet effective. In the rotation network, we use the axis-angle representation, since it requires only three scalar values to represent a rotation, making it more compact than alternatives such as quaternions. This compactness reduces the memory footprint and computational complexity of the rotation network, making it more efficient during training and inference50. Prior work51 has shown that the axis-angle representation achieves significant performance24. Our rotation network predicts a single rotation vector for each anchor box. The rotation of 3D points by an angle \(\theta\) around an axis v, where \(\Vert v \Vert _2=1\), can be captured by a rotation matrix \(R = \exp (\theta [v]_{\times })\), where \([v]_{\times }\) is the vector’s skew-symmetric operator, i.e.,
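in standard form, for \(v=(v_1, v_2, v_3)^T\),

\([v]_{\times } = \begin{pmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{pmatrix}\)

so that the matrix exponential reduces to the Rodrigues formula \(R = I + \sin \theta \,[v]_{\times } + (1-\cos \theta )\,[v]_{\times }^2\).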
Hence, for each axis-angle vector \(y= \theta v\), there is a corresponding rotation matrix R, and vice versa. The rotation subnetwork comprises two crucial modules: an iterative refinement module and an MHSA module. The MHSA module allows the network to focus on the critical areas (target objects) and discard the clutter, while the iterative refinement module estimates the pose of an object and corrects pose estimation failures iteratively. The iterative refinement module consists of stacked depthwise-separable convolution layers, each followed by batch normalization and a SiLU activation function. The number of convolution layers in each iterative refinement module is calculated using Eq. 2. The output of each module is used as part of the input to the next (MHSA) module to make the overall pose estimation more accurate4. Our architecture is organized horizontally by feature resolution; the most effective way to leverage the self-attention mechanism while keeping the computational cost reasonable is to apply it to features with small resolutions. For low feature resolutions, we combine the MHSA module and the iterative refinement module4,52, whereas the MHSA module is removed for high-resolution features. Figure 4 depicts the structure of the rotation network. We initialize the rotation network with a separable convolution layer, followed by a sequence of MHSA and iterative refinement modules. We follow the traditional and influential architectures Densenet and Resnet in designing the rotation subnetwork, utilizing skip connections to pass data between these two modules in two fundamental ways: addition and concatenation. The input of each module is connected directly to its output using addition, to address the degradation problem as the network goes deeper. Simultaneously, we use concatenation to ensure feature reusability and enrich the features, since our dataset is small. Equation (2) specifies the number of repeated MHSA and iterative refinement modules; the same equation is also used to specify the number of heads in MHSA and the depth of the refinement module.
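To make the wiring concrete, the sketch below outlines one possible realization of a rotation-head stage under the description above; it is a minimal sketch in which the widths, stage count, anchor count, and the exact placement of the addition/concatenation skips are illustrative assumptions rather than the released architecture.

```python
import tensorflow as tf
L = tf.keras.layers

def refinement_module(x, width, depth):
    """Stacked depthwise-separable convs with BatchNorm + SiLU and an additive skip."""
    y = x
    for _ in range(depth):
        y = L.SeparableConv2D(width, 3, padding="same", use_bias=False)(y)
        y = L.BatchNormalization()(y)
        y = L.Activation("swish")(y)  # SiLU
    return L.Add()([x, y])  # addition skip: counters degradation in deeper stacks

def mhsa_block(x, num_heads):
    """Self-attention over the spatial positions of a low-resolution map (B, H, W, C)."""
    _, h, w, c = x.shape
    t = L.Reshape((h * w, c))(x)
    t = L.MultiHeadAttention(num_heads=num_heads, key_dim=max(c // num_heads, 1))(t, t)
    return L.Reshape((h, w, c))(t)

def rotation_head(feat, width=64, num_stages=3, num_heads=3, anchors=9):
    """Predict one axis-angle vector (3 values) per anchor location."""
    x = L.SeparableConv2D(width, 3, padding="same")(feat)
    for _ in range(num_stages):
        a = mhsa_block(x, num_heads)                  # focus on object regions, suppress clutter
        r = refinement_module(a, width, depth=num_heads)
        x = L.Concatenate()([x, r])                   # concatenation skip: feature reuse
        x = L.SeparableConv2D(width, 1, padding="same")(x)  # project back to the working width
    return L.SeparableConv2D(anchors * 3, 3, padding="same")(x)

# Example on a low-resolution fused feature map
rot = rotation_head(tf.keras.Input(shape=(16, 16, 64)))
```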
Translation Subnetwork The translation is the offset of the object’s coordinate system from the camera coordinate system. To regress the translation of an object \(t=(t_x, t_y, t_z)^T\), we follow5,24 by splitting the translation into two tasks: regressing the distance \(t_z\) and predicting the center point \(c= (c_x,c_y)^T\); we can then recover \(t_x\) and \(t_y\) with the following equation:
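Assuming the standard pinhole projection model used in5,24, the remaining components follow from the predicted center and distance as:

\(t_x = \dfrac{(c_x - p_x)\, t_z}{f_x}, \qquad t_y = \dfrac{(c_y - p_y)\, t_z}{f_y}\)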
where \(f_x\) and \(f_y\) refer to the camera focal lengths and \((p_x,p_y)^T\) is the principal point.
Figure 5: Illustration of the camera and object coordinate systems5.
As seen in Fig. 5, the translation can be estimated by localizing the object center and estimating the center’s distance from the camera. Besides feature fusion, estimating the center of the object instead of the whole object enables the network to estimate the pose even if part of the object is occluded. The structure of the translation subnetwork is similar to that of the rotation network. However, instead of regressing the translation directly in the iterative refinement module, the translation is divided into predicting the center point (x, y) and regressing the distance z. The outputs of each iterative module are produced by two separable convolution layers representing the xy and z components.
Compound scaling
We follow the EfficientDet35 scaling strategy, scaling up the architecture via a hyperparameter \(\phi\) that controls the input image resolution and various parts of the architecture, such as the BiFPN width and depth, the subnetwork depth, and the depth of the iterative refinement module, which equals the number of heads in each MHSA module.
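A minimal sketch of such a scaling schedule is shown below; the constants follow the EfficientDet/EfficientPose defaults and the input resolutions reported in the running-time experiments, and should be read as assumptions rather than the exact released configuration.

```python
def scaled_config(phi: int) -> dict:
    """Map the compound scaling coefficient phi to the main hyperparameters."""
    return {
        "input_resolution": 512 + 128 * phi,     # 512x512 at phi=0, 768x768 at phi=2
        "bifpn_width": int(64 * (1.35 ** phi)),  # EfficientDet additionally rounds widths to multiples of 8
        "bifpn_depth": 3 + phi,
        "subnet_depth": 3 + phi // 3,            # classification/box/rotation/translation subnet depth
        "num_heads": 3 + phi // 3,               # MHSA heads = iterative-refinement depth
    }

print(scaled_config(0))  # lightweight model ("OURS" on Linemod)
print(scaled_config(2))  # higher-capacity model used on Occlusion Linemod
```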
Experiments
In our experiments, we evaluate our method on two benchmark datasets for 6D pose estimation, Linemod28 and Occlusion Linemod27, and conduct an ablation study. We compare our method with several representative methods5,53,24,34,39,21,26,14,23,22 that use RGB color images.
Dataset
In order to make our method suitable for real-world applications, we deliberately opt for real RGB images instead of synthetic ones. This choice ensures that our method remains resilient to the complexities inherent in real-world scenarios. The Linemod dataset consists of 13 objects; each scene has only one annotated object, as shown in Fig. 6 (bottom), and it is extensively used by classical methods. A subset of the Linemod images was further annotated to produce Occlusion Linemod. Each of its images contains several annotated objects, as shown in Fig. 6 (top), and pose estimation is difficult since they are severely occluded. One of our objectives is to build a model for multi-object pose estimation, so we use the Occlusion Linemod dataset to train the model for multiple objects, since it contains eight annotated objects in each scene. For comparison with other methods, we follow previous works5,29 in splitting the dataset into training and testing sets. This partitioning selects training images such that there is a minimum angular separation of 15 degrees between object poses; as a result, approximately 15% of the images are designated for training, while the remaining 85% are allocated for testing.
Evaluation metrics
We evaluate our method with the average distance metric ADD(-S)54. It computes the average distance between the model points transformed by the ground-truth pose and by the predicted pose, with a maximum threshold of 10 cm. Since our datasets include both symmetric and asymmetric objects, for asymmetric objects ADD is defined as:
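Following the standard formulation54 and the symbols introduced below, ADD can be written as:

\(\mathrm{ADD} = \dfrac{1}{m} \sum _{x \in M} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\| _2\)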
where R and t are the ground-truth rotation and translation, respectively, \({\hat{R}}\) and \({\hat{t}}\) are the estimated rotation and translation applied to the model points, M denotes the set of the object’s 3D model points, and m is the number of points. For symmetric objects, ADD-S is defined as:
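Correspondingly, the standard symmetric variant takes the closest-point distance:

\(\mathrm{ADD\text{-}S} = \dfrac{1}{m} \sum _{x_1 \in M} \min _{x_2 \in M} \left\| (Rx_1 + t) - (\hat{R}x_2 + \hat{t}) \right\| _2\)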
Implementation details
Our model is implemented using the TensorFlow framework. The backbone is initialized with a Densenet model trained on Imagenet. The Keras ReduceLROnPlateau callback is used to reduce the learning rate; the initial learning rate is 1e-4, and the learning rate is reduced if no improvement is observed for 20 epochs. Data augmentation is used to address the limited size of the dataset and boost performance. The model is trained for 3000 epochs, evaluated every 10 epochs, with a batch size of 4. Since our model performs multiple tasks, each task has its own loss function (e.g., classification and bounding box regression), while rotation and translation are combined into a single loss function that takes both symmetric and asymmetric objects into account. For asymmetric objects, the loss function is defined as follows:
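Consistent with the symbols described below, the asymmetric transformation loss takes the same ADD-style form (the relative weighting against the classification and box losses is omitted here):

\(L_{asym} = \dfrac{1}{m} \sum _{x \in M} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\| _2\)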
where \({\hat{R}}\) and \({\hat{t}}\) indicate the predicted rotation and translation, respectively; R and t refer to the ground truth; M denotes the set of the object’s 3D model points, and m denotes the number of points. For symmetric objects, the loss function is the same, except that the minimum distance between each predicted point and any point in the ground-truth set is used instead of the distance between matching points, as shown below:
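Likewise, for symmetric objects the loss replaces the point-to-point distance with the closest-point distance:

\(L_{sym} = \dfrac{1}{m} \sum _{x_1 \in M} \min _{x_2 \in M} \left\| (Rx_1 + t) - (\hat{R}x_2 + \hat{t}) \right\| _2\)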
Ablation study
In this section, we conduct ablation studies on the Occlusion Linemod dataset to assess the impact of various design choices on the robustness of our method.
Figure 7: Ablation study results with different configurations in terms of ADD(-S) accuracy: (a) the baseline model with/without the self-attention mechanism, (b) accuracy when removing the FPN, (c) accuracy when excluding iterative refinement, (d) accuracy after increasing the depth of iterative refinement, and (e), (f) accuracies when using the Resnet and Efficientnet backbones, respectively.
Figure 7 shows how the baseline model’s performance compares with the different configurations.
Impact of component removal
Excluding MHSA leads to a considerable drop in overall performance, with an average ADD(-S) score of 78.18% as shown in Table 2, compared to the baseline model’s 81.84%. The decrease is especially noticeable for Ape (54.95%) and Duck (63.96%), which indicates that MHSA is essential for dealing with complicated environments and for enhancing pose estimation of textureless objects. Similarly, removing the FPN leads to a substantial performance decrease, with an average score of 77.10%. The Cat and Holepuncher objects, in particular, exhibit significant drops, to 52.63% and 80.57%, respectively. This demonstrates the crucial importance of multi-scale feature fusion for precise pose estimation, especially for objects that vary in size and shape. The absence of iterative refinement also leads to a performance decline, with an average score of 77.69%. These results indicate that feature fusion has a considerable impact on the model, providing a 5.10% improvement in performance. Further experimentation reveals that increasing the iterative steps, with the same depth as the \(\phi = 2\) model and four iterations, results in an average performance of 81.72%, nearly identical to the baseline model. This indicates the robustness and stability of MHSA across varying configurations, such as the balance between the depth and width of the model’s components, highlighting its ability to maintain high performance even as model complexity is adjusted. It also implies that although iterative refinement typically enhances accuracy, it may not be equally useful for all objects, especially those that require a more direct approach to pose estimation.
Performance across backbone architectures and scaling configurations
Our evaluation also includes the performance of the Resnet and Efficientnet backbone architectures. Resnet and Efficientnet attain mean scores of 78.92% and 79.22%, respectively, while the baseline with the Densenet backbone achieves an average score of 81.84%. This demonstrates Densenet’s capability to effectively capture and exploit dense feature connections for pose estimation. Additionally, we performed experiments with various models; Table 1 lists each model’s configuration. We compare our proposed model under different hyperparameters such as image resolution, BiFPN depth and width, number of heads, etc. As can be seen in Table 2, the \(\phi = 2\) model performs significantly better, achieving 84.38% on average and outperforming the \(\phi = 0\) baseline model by 2.54%. A possible explanation is that the more objects there are, the deeper the model must be to obtain accurate results. Considering efficiency, since the Linemod dataset is trained with an independent model for each object, it does not need a complex model, so the \(\phi =0\) model is used and denoted by “OURS” in all experiments on the Linemod dataset, while the \(\phi =2\) model is compared with state-of-the-art models on the Occlusion Linemod dataset.
Experimental results and comparison
As mentioned in Sect. 4.1, we evaluate our method on the Occlusion Linemod and Linemod datasets. Our method achieves significant performance compared with state-of-the-art methods that use RGB images on both two benchmark datasets. Our model achieves an average accuracy of ADD(-S) 84.38% and 97.45% on the Occlusion Linemod and Linemod, respectively.
Comparison on the Occlusion Linemod
Table 3 presents a comparative evaluation on the Occlusion Linemod dataset of our proposed method against notable state-of-the-art methods: EfficientPose24, Zebrapose21, Ullah et al.26, Deepfusion14, 6D-diff23, and RDPN6D53. It is important to note that some methods, such as RDPN6D, leverage RGB-D inputs, while others, including our approach, rely solely on RGB. For a fair comparison, we focus on methods that use the dataset for both training and testing, and we exclude comparisons with methods that use the Occlusion Linemod dataset solely for evaluation. Our method achieves the highest overall average ADD(-S) score of 84.38%, outperforming all other methods in occlusion scenes. EfficientPose follows with an average score of 83.98%, while methods like Zebrapose and Deepfusion exhibit lower average scores of 76.9% and 77.7%, respectively. For individual objects, our method exhibits competitive performance across numerous objects compared to other methods. It outperforms previous approaches by achieving the highest accuracy on objects such as Ape (66.46%) and Duck (78.85%). In addition, although 6D-diff gives the highest results for Can (97.9%) and Glue (92.0%), our method consistently achieves strong performance of 94.73% and 90.28% for these objects, respectively. RDPN6D benefits from using depth information, yielding strong results on objects like the Holepuncher; nevertheless, its overall performance falls behind our method’s, highlighting the efficiency and accuracy of our approach. This notable outcome emphasizes the strength and dependability of our method, particularly in difficult scenarios characterized by extensive occlusion and the presence of several objects. Furthermore, Fig. 8 presents qualitative results, demonstrating the accurate correspondence between predicted and ground-truth 3D bounding boxes on the Occlusion dataset.
Comparison on the Linemod
Table 4 compares our method with previous state-of-the-art methods5,22,24,26,34,39 on the Linemod dataset in terms of ADD(-S). Most methods use additional processing, such as RANSAC or PnP algorithms, to obtain and refine the 6D pose estimate. In contrast, our method eliminates these additional steps and boosts performance by focusing on the important regions and discarding background clutter. Our method achieves performance close to Ullah et al.26, who report the highest average ADD(-S) score of 97.62%, while our method reaches 97.45%, outperforming PoseCNN5 by 8.85%. Although methods like BDR6D22 rely on additional steps to estimate poses, our method shows significant improvement while maintaining computational efficiency. Notably, our method exhibits exceptional precision across several objects, achieving a perfect 100% score for four objects (Eggbox, Glue, Bench Vise, and Lamp), which underscores its robustness, particularly in handling symmetric objects. Symmetric objects generally yield higher scores for most methods due to reduced ambiguity in pose estimation. However, our method also excels on non-symmetric objects such as Driller (99.80%) and Cat (99.40%). Although our method is generally successful, it performs less effectively on the Ape and Duck objects, achieving accuracies of 86.09% and 92.11%, respectively; their limited dimensions and lack of notable visual characteristics impede their visibility and reliable pose estimation in complex environments. Nevertheless, our approach retains a strong advantage, improving performance for 9 of the 13 objects and surpassing an accuracy of 98% on them.
As noted above, the subpar performance on the Ape object can be attributed to its small size and lack of distinctive features, which reduce its visibility within the scene. Fig. 9 shows our method’s qualitative results for some objects.
Running time
Computational efficiency depends on many factors, such as the hardware configuration and image resolution. We compare our method with other methods, such as Efficientpose24, Zebrapose21, Ullah et al.26, Deepfusion14, and RDPN6D53, based on the results reported in their respective papers, which use different hardware configurations and image resolutions. For instance, RDPN6D53 uses an RTX 3090 GPU, while Ullah et al.26 use an Nvidia 2080Ti GPU. We measure our model on an NVIDIA A100-PCIE GPU. It is essential to keep these hardware variations in mind when analyzing the outcomes. While our model was tested with 512x512 and 768x768 image resolutions, comparable details regarding the image sizes of EfficientPose, ZebraPose, and Deepfusion are unavailable, which could affect direct runtime comparison. As shown in Fig. 10, our model exhibits efficiency comparable to the most advanced methods available for multi-object pose estimation. More precisely, when our lightweight \(\phi =0\) model is applied to an image with a resolution of 512x512, it takes approximately 27 milliseconds per image, a level of computational efficiency similar to RDPN6D, which reports a runtime of 29 ms per image. Our higher-capacity model, denoted \(\phi =2\), processes images with a resolution of 768x768 while still achieving competitive processing times. To thoroughly assess our model’s efficiency under resource-constrained conditions, we also evaluated it on a personal computer equipped with an Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz and an NVIDIA GeForce GTX 1650 GPU with 4GB memory. The lightweight \(\phi =0\) model achieved a runtime of 93 ms per image on the GPU and 160 ms per image on the CPU, showing solid performance even without strong GPU acceleration. The \(\phi =2\) model maintained runtimes of 195 ms per image on the GPU and 276 ms on the CPU. These results highlight the robustness, efficiency, and adaptability of our models across varying hardware environments.
Conclusion
We introduced an end-to-end method for 6D object pose estimation that leverages convolutional neural networks and self-attention mechanisms. Our method, structured across five levels of feature resolution, incorporates three key components: a Densenet-based backbone, feature fusion based on the BiFPN module, and prediction subnetworks that handle four tasks: classification, bounding box regression, rotation, and translation. We designed the translation and rotation subnetworks to be simple and effective by combining MHSA with an iterative refinement module to boost the performance of 6D object pose estimation. Our experiments show that our method achieves strong ADD(-S) performance of 97.45% and 84.38% on the Linemod and Occlusion Linemod datasets, respectively, showcasing its effectiveness in challenging environments with heavy occlusion and clutter. Furthermore, our ablations revealed the contribution of each component, validating the importance of feature fusion, MHSA, and iterative refinement for overall accuracy. However, our method does have limitations. For very small and textureless objects, its efficacy may decrease due to the increased difficulty of extracting features from small objects. Another notable constraint is that the method cannot be applied to new objects, as it is trained only on specific objects from small datasets. This limitation may hinder its usefulness in situations where the model encounters objects that were not included in the training data.
Data availability
The dataset employed and/or analyzed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/metwalli/linemod-occlusionlinemod-dataset.
Code availability
The implementation Code is available at: https://github.com/Metwalli/SMO-6DPose.
References
Kehl, W., Manhardt, F., Tombari, F., Ilic, S. & Navab, N. Ssd-6d: Making rgb-based 3D detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, 1521–1529 (2017).
Hoque, S., Arafat, M. Y., Xu, S., Maiti, A. & Wei, Y. A comprehensive review on 3D object detection and 6D pose estimation with deep learning. IEEE Access 9, 143746–143770 (2021).
He, Z., Feng, W., Zhao, X. & Lv, Y. 6D pose estimation of objects: Recent technologies and challenges. Appl. Sci. 11, 228 (2020).
Wang, C. et al. Densefusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3343–3352 (2019).
Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. Posecnn: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv preprint arXiv:1711.00199 (2017).
Tremblay, J. et al. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects. arXiv preprint arXiv:1809.10790 (2018).
Rad, M. & Lepetit, V. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, 3828–3836 (2017).
Bauer, D. et al. Challenges for monocular 6d object pose estimation in robotics. IEEE Trans. Robot. (2024).
Park, K., Mousavian, A., Xiang, Y. & Fox, D. Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10710–10719 (2020).
Di, Y. et al. So-pose: Exploiting self-occlusion for direct 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12396–12405 (2021).
Guan, J., Hao, Y., Wu, Q., Li, S. & Fang, Y. A survey of 6dof object pose estimation methods for different application scenarios. Sensors 24, 1076 (2024).
Pan, X. et al. On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 815–825 (2022).
Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition At Scale. arXiv preprint arXiv:2010.11929 (2020).
Zhou, J., Chen, K., Xu, L., Dou, Q. & Qin, J. Deep fusion transformer network with weighted vector-wise keypoints voting for robust 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13967–13977 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation By Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473 (2014).
Labbé, Y., Carpentier, J., Aubry, M. & Sivic, J. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, 574–591 (Springer, 2020).
Chen, D., Li, J., Wang, Z. & Xu, K. Learning canonical shape space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11973–11982 (2020).
Cai, M. & Reid, I. Reconstruct locally, localize globally: A model free method for object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3153–3163 (2020).
Hu, Y., Hugonot, J., Fua, P. & Salzmann, M. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3385–3394 (2019).
Su, Y. et al. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6738–6748 (2022).
Liu, P., Zhang, Q. & Cheng, J. Bdr6d: Bidirectional deep residual fusion network for 6D pose estimation. IEEE Trans. Autom. Sci. Eng. 21 (2024).
Xu, L., Qu, H., Cai, Y. & Liu, J. 6d-diff: A keypoint diffusion framework for 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9676–9686 (2024).
Bukschat, Y. & Vetter, M. Efficientpose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv preprint arXiv:2011.04307 (2020).
Castro, P. & Kim, T.-K. Crt-6D: Fast 6D object pose estimation with cascaded refinement transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 5746–5755 (2023).
Ullah, F., Wei, W., Fan, Z. & Yu, Q. 6d object pose estimation based on dense convolutional object center voting with improved accuracy and efficiency. Vis. Comput. 1–14 (2023).
Brachmann, E. et al. Learning 6d object pose estimation using 3D object coordinates. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13, 536–551 (Springer, 2014).
Hinterstoisser, S. et al. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, 858–865 (IEEE, 2011).
Peng, S., Liu, Y., Huang, Q., Zhou, X. & Bao, H. Pvnet: Pixel-wise voting network for 6Dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4561–4570 (2019).
Zakharov, S., Shugurov, I. & Ilic, S. Dpod: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF international conference on computer vision, 1941–1950 (2019).
Wang, Z., Chen, M., Guo, Y., Li, Z. & Yu, Q. Bridging the domain gap in satellite pose estimation: A self-training approach based on geometrical constraints. IEEE Transactions on Aerospace and Electronic Systems (2024).
Chen, H. et al. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2781–2790 (2023).
Song, C., Song, J. & Huang, Q. Hybridpose: 6D object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 431–440 (2020).
Iwase, S., Liu, X., Khirodkar, R., Yokota, R. & Kitani, K. M. Repose: Fast 6D object pose refinement via deep texture rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3303–3312 (2021).
Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10781–10790 (2020).
Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural. Inf. Process. Syst. 34, 30392–30400 (2021).
Wu, H. et al. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31 (2021).
Peng, Z. et al. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 367–376 (2021).
Zhang, Z., Chen, W., Zheng, L., Leonardis, A. & Chang, H. J. Trans6D: Transformer-based 6d object pose estimation and refinement. In European Conference on Computer Vision, 112–128 (Springer, 2022).
Bello, I., Zoph, B., Vaswani, A., Shlens, J. & Le, Q. V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3286–3295 (2019).
Srinivas, A. et al. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16519–16529 (2021).
tf.keras.layers.SeparableConv2D TensorFlow v2.9.1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/SeparableConv2D (Accessed 29 Aug 2022).
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251–1258 (2017).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017).
Guo, M.-H. et al. Attention mechanisms in computer vision: A survey. Comput. Vis. Med. 8, 331–368 (2022).
Pérez, J., Marinković, J. & Barceló, P. On the Turing Completeness of Modern Neural Network Architectures. arXiv preprint arXiv:1901.03429 (2019).
Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 54, 1–41 (2022).
Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
Hu, H., Gu, J., Zhang, Z., Dai, J. & Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3588–3597 (2018).
Zhou, Y., Barnes, C., Lu, J., Yang, J. & Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5745–5753 (2019).
Mahendran, S., Ali, H. & Vidal, R. 3D pose regression using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2174–2182 (2017).
Kanazawa, A., Black, M. J., Jacobs, D. W. & Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7122–7131 (2018).
Hong, Z.-W., Hung, Y.-Y. & Chen, C.-S. Rdpn6d: Residual-based dense point-wise network for 6dof object pose estimation based on RGB-D images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5251–5260 (2024).
Hinterstoisser, S. et al. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11, 548–562 (Springer, 2013).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62001452), Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China (No. 2021ZZ116), Science and Technology Program of Fuzhou City (No. 2022ZD001) and Science and Technology Program of Fujian Province (No. 2023T3040).
Funding
This work was supported by the National Natural Science Foundation of China (No. 62001452), Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China (No. 2021ZZ116), Science and Technology Program of Fuzhou City (No. 2022ZD001) and Science and Technology Program of Fujian Province (No. 2023T3040).
Author information
Contributions
Author Metwalli Al-Selwi: Conceptualisation, Methodology, Experiments, write-review & editing; Authors Ning Huang and Yan Chao: prepared drawings and figures, Writing - review & editing; Authors Yin Gao and Qiming Li: Experiments, Results analysis, Writing - review & editing; Author Dr. Li Jun: Writing - review & editing, Supervision.
Ethics declarations
Competing interests
We declare that there are no competing interests among the authors.
Ethics approval
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Al-Selwi, M., Ning, H., Gao, Y. et al. Enhancing object pose estimation for RGB images in cluttered scenes. Sci Rep 15, 8745 (2025). https://doi.org/10.1038/s41598-025-90482-6