Abstract
Estimating the 6D pose of objects is crucial for robots to interact with their environment, yet 6D object pose estimation from RGB images remains difficult in cluttered scenes and under heavy occlusion. Most existing methods estimate the pose in two stages: they first extract object features and then apply a PnP/RANSAC procedure. These techniques typically localize a set of key-points by regressing their coordinates, which makes them vulnerable to occlusion, degrades multi-object pose estimation, and prevents the 6D pose from being regressed directly through a training loss. In this paper, we propose an end-to-end framework based on a convolutional neural network (CNN) and a self-attention mechanism for single- and multi-object 6D pose estimation from RGB images at low computational cost. Our method uses feature fusion to extract local features and combines multi-head self-attention (MHSA) with iterative refinement to improve pose estimation, and it can be scaled according to the available computational resources. Our experiments on the Linemod and Occlusion Linemod benchmarks show that our method achieves 97.45% and 84.38% in terms of the ADD(-S) metric, respectively.
Introduction
An object’s pose determines its orientation and position in space. Generally, a 3D rotation and translation with six degrees of freedom (6DoF) describe an object’s pose. Estimating the 6D pose of an object is a significant challenge in computer vision, as it involves recovering both the object’s 3D orientation and its 3D location in camera-centered coordinates, i.e., the 3D rotation and translation vectors that together define the 6D pose1,2. Nowadays, 6D object pose estimation has become indispensable for many successful applications such as driverless vehicles, manufacturing robots, and augmented reality2,3,4,5,6,7. Therefore, understanding and perceiving an object’s pose is essential for many applications, including robotic grasping and enabling robots to interact with actual objects in the real world8. For instance, in the industrial sector, a robot must be able to analyze an object’s pose in order to manipulate it9,10. Consequently, 6D pose estimation provides detailed information on both the 3D location and orientation of objects11, allowing robots to perform many manipulation tasks such as object grasping, assembly, and human-robot interaction by learning from demonstration.
In order to perform 6D pose estimation of objects, a method should be robust against heavy occlusion, sensor noise, background clutter, and changing illumination conditions while meeting the speed requirements of real-time applications2,4,9. In recent years, both convolution and self-attention have advanced computer vision considerably. Convolutional neural networks (CNNs) have become dominant in the field, and CNN-based methods perform well on various tasks such as image recognition, classification, object detection, and semantic segmentation, achieving strong results on many benchmarks12. At the same time, vision transformers13,14 have made dramatic achievements in many vision tasks thanks to self-attention15,16. For 6D object pose estimation, current techniques can estimate the 6D pose in cluttered environments with outstanding efficiency, driven by the recent success of deep learning17,10. Most existing methods use a two-stage approach: they first establish 2D-3D correspondences with a reference image or the object’s model11 and then recover the 6D pose using a RANSAC-based Perspective-n-Point (PnP) algorithm for each object separately18,23. Consequently, they are not practical in many real-world applications, since they take more time and have a higher computational cost. Other, end-to-end methods4,5,10,24,25,26 estimate the 6D pose of objects directly. However, several significant challenges remain, such as heavy occlusion, cluttered scenes, and unpredictable lighting conditions. To tackle changing illumination, a plausible and effective strategy is extensive training across a wide range of lighting scenarios2,4,9. Occlusion and cluttered scenes, in contrast, remain open areas of research that demand further exploration. Clearly, these complexities require innovative techniques to enhance the performance and reliability of object pose estimation.
In this paper, we introduce a new method to accurately estimate the 6D pose of objects from RGB images, designed to provide fast and efficient inference with an end-to-end approach. Existing methods depend on resource-intensive multi-stage procedures such as RANSAC and PnP, which frequently suffer from computational inefficiency. In contrast, our method obviates the need for these procedures. By integrating self-attention and iterative refinement, our model directly estimates object poses, even in cluttered environments, which markedly improves precision and computational efficiency. The proposed method comprises three essential components: backbone, feature fusion, and prediction subnetworks. The backbone, based on Densenet, enables richer feature extraction from occluded objects through its densely connected layers; embedding MHSA in it allows the network to focus on essential regions and enhances pose estimation in challenging environments with overlapping objects. The feature fusion component consolidates multi-scale features to capture essential information from varying feature resolutions, ensuring that the model learns effectively from both global and local contexts and can therefore predict an object under occlusion (Sec. 3.1.2). To further improve robustness to occlusion, our method follows the same principle as5 by employing pixel-wise voting for the center of an object instead of predicting the whole object. Finally, the prediction subnetworks improve 6D pose estimation by combining iterative refinement and MHSA in the translation and rotation tasks. This iterative approach ensures that the model refines its predictions progressively, achieving higher accuracy regardless of the number of objects in the scene. In summary, the key contributions of this work are:
-
We propose a novel framework architecture that seamlessly integrates feature fusion and combines convolution and self-attention operations to regress the 6D object pose directly from RGB images.
-
Our method incorporates rotation and translation networks that leverage iterative refinement modules alongside MHSA to iteratively estimate and refine the 3D rotation and translation.
-
We conduct extensive experiments on the Occlusion Linemod27 and Linemod datasets28 benchmarks. Our results show a significant advancement over existing methods that utilize RGB input, particularly for multi-object pose estimation, and also contribute to enhancing computational efficiency, making our method a compelling choice for real-world applications.
Related work
Two-stage pose estimation approach
Considerable literature that uses a two-stage approach has been published7,18,20,21,29,30, and the two-stage approach dominates the field of 6D pose estimation: first detect 2D key-points or 2D-3D correspondences with the object’s 3D model in the image, and then estimate the 6D pose using a RANSAC-based Perspective-n-Point algorithm. Rad et al.7 propose BB8, a framework that predicts object poses as the 2D projections of the corners of their 3D bounding boxes and then uses a PnP algorithm to determine the 3D poses from this 2D-3D relationship. However, when the object is partially unseen, BB8 may not obtain an accurate 3D bounding box during the estimation stage. To address this issue, Hu et al.20 suggest dividing the image into multiple patches and requiring each patch to predict the locations of the 2D projections and the object to which they belong. The 6D pose is then estimated by aggregating all patches belonging to the same object and applying the PnP algorithm. Similarly, Su et al. introduce ZebraPose21, an RGB-based method for estimating the 6D object pose. The method utilizes a hierarchical binary grouping strategy to build dense 2D-3D correspondences by employing a coarse-to-fine surface encoding technique. Specifically, ZebraPose assigns binary descriptors to 3D vertices and uses a progressive training strategy to predict correspondences. PVNet29, on the other hand, regresses pixel-wise unit vectors that point toward key-points. This vector-field representation focuses on local features and establishes the relationship between different parts of the object, enabling PVNet to recover the occluded parts of the object from the visible portions, even if they lie outside the image. In contrast, DPOD30 employs a UV map for rich correspondences. Recent approaches, such as 6D-Diff23, introduce a diffusion-based framework for 6D object pose estimation that mitigates noise and indeterminacy by employing reverse diffusion for 2D-3D correspondences, followed by the Perspective-n-Point (PnP) algorithm to recover the 6D pose. Likewise, BDR6D22 aims to improve 6D pose estimation by leveraging multi-modal data through the fusion of RGB images and depth information and then recovers the 6D pose using the PnP algorithm. From another perspective, Wang et al.31 propose a self-training framework for unsupervised domain adaptation in satellite pose estimation, leveraging domain-agnostic geometrical constraints and fine-grained segmentation to enhance the accuracy of predicted sparse keypoints, and then use PnP to estimate the poses; they introduce adversarial training for mask alignment without real data annotations. Nevertheless, the PnP technique is sensitive to minor errors in the 2D representation, making pose estimation challenging, particularly in the presence of occlusion. Although two-stage methods have been widely adopted, their limitations hinder their effectiveness, especially in multi-object scenarios: the need for an additional regression step for each object substantially increases computational expense, and as the number of objects increases, these methods struggle to maintain accuracy, resulting in a significant performance drop in multi-object pose estimation. In contrast, by combining iterative refinement and self-attention modules instead of RANSAC or PnP, our method reduces computational complexity and improves pose estimation accuracy, particularly in scenarios with multiple objects in cluttered scenes.
End-to-end pose estimation approach
One of the most cited studies is that of Xiang et al.5, who propose PoseCNN, which is based on a convolutional neural network (CNN) for 6D object pose estimation in cluttered scenes. PoseCNN aims to estimate an object’s 3D translation and rotation by using pixel-wise voting to determine the object’s center and estimating its distance from the camera to handle occlusion. In a related approach, Ullah et al.26 propose a fully convolutional, parallel architecture for pixel-wise dense estimation of 3D translation and orientation. Zhou et al.14 introduce a Deep Fusion Transformer (DFTr) network that integrates cross-modality features by leveraging semantic similarity between color and depth data and employs weighted vector-wise voting with global optimization for 3D keypoint localization. Generally, the two-stage approach is robust due to the use of the PnP algorithm for 6D pose estimation. Building on this, to leverage PnP in an end-to-end manner, Chen et al.32 propose EPro-PnP, a novel probabilistic PnP layer for end-to-end pose estimation. Specifically, EPro-PnP outputs a distribution of poses with a differentiable probability density on the SE(3) manifold, and it treats 2D-3D coordinates and corresponding weights as intermediate variables. Meanwhile, other researchers10,24,33,34 use a pose refinement network instead of the PnP algorithm to improve the performance of 6D object pose estimation in an end-to-end fashion. For instance, HybridPose33 employs multiple representations such as geometric information, key-points, and symmetry correspondences to estimate the 6D pose of objects by improving reprojection error adjustment. Similarly, Di et al.10 leverage 2D-3D correspondences and self-occlusion to establish a two-layer representation for 3D objects. They utilize a shared encoder and two independent decoders to provide 2D-3D correspondences and self-occlusion information and then combine the outputs to regress the 6D pose parameters directly. Along the same lines, Iwase et al.34 recently improved performance by formulating object pose refinement as an optimization problem based on feature alignment. From another perspective, Bukschat et al.24 extend EfficientDet35 to estimate the 6D poses of objects by adding rotation and translation subnetworks alongside the classification and bounding box regression subnetworks. The most intriguing aspect of their work is the use of 6D augmentation to enrich the dataset, as the benchmark datasets they use are small. Taken together, these studies provide important insights into 6D object pose estimation. However, most of the aforementioned works focus on developing a method for a single object rather than multi-object scenarios. Consequently, poor performance is observed when some of these methods are applied to pose estimation for multiple objects. While these methods represent a significant advance, they struggle with challenges such as occlusion and cluttered scenes. Our method addresses these issues by integrating feature fusion, self-attention, and iterative refinement to dynamically focus on relevant objects in the image, which improves pose estimation in heavily occluded and cluttered environments.
Attention mechanism and convolution
As our work is a vision task and the method is based on combining self-attention with convolution modules, we also summarize related work on vision tasks that builds on self-attention and convolution. Convolutional neural networks and vision transformers have made significant progress in computer vision in recent years, thanks to the use of convolution operations and attention mechanisms in CNNs and ViTs, respectively. Recently, considerable literature has grown around combining convolution and attention mechanisms for various vision tasks. Many researchers focus on improving transformer models by adding convolution operations. Xiao et al.36 use convolution at an early stage as a stem with a transformer to improve performance and training stability. Wu et al.37 utilize convolutional token embedding and strided convolution to decrease the computational cost of self-attention. Another work combines convolution and self-attention in a hybrid network structure called Conformer38, which uses convolution to extract local features and self-attention for global representation, thereby integrating the two and keeping the feature representation from deteriorating. Zhong et al.39 and Zhou et al.14 employ Transformer-based methods for 6D pose estimation, Trans6D and DFTr, respectively. Trans6D uses Transformers to capture global dependencies effectively, mitigating information loss, and two novel modules enhance its accuracy and robustness: a patch-aware feature fusion module and a pure Transformer-based pose refinement module. On the other hand, many previously described attention techniques applied to images show that they may help convolutional neural networks overcome their weaknesses regarding locality40. Several studies investigate the idea of applying attention modules or utilizing additional relational data to improve the performance of convolutional neural networks12. Srinivas et al.41 replace the convolutional layers with self-attention in the model’s final stages. Bello et al.40 propose to augment convolution with self-attention by concatenating feature maps from the self-attention pipeline with convolutions in certain layers. Pan et al.12 propose a hybrid model that comprises convolution and self-attention modules in a parallel manner. Building on approaches12,40, our method combines convolution and self-attention to address the challenges of 6D pose estimation. We improve pose estimation in complex, occluded environments by leveraging self-attention to capture global dependencies and convolutional operations for local feature refinement. This hybrid approach significantly reduces the performance gap in multi-object scenarios while maintaining computational efficiency.
Methodology
We propose a framework to estimate the 6D object pose for single and multiple objects from RGB images. Pose estimation involves detecting objects and calculating their 3D translations and orientations. In particular, a 6D pose is described by a transformation (R, t) from the object’s coordinate system to the camera’s coordinate system, where R and t are the 3D rotation and translation, respectively. Estimating the 6D pose of an object in the environment is critical and faces many challenges, such as heavy occlusion and cluttered backgrounds.
To resolve these challenges, (1) we employ a feature fusion technique that focuses on understanding how different parts of an object are related (Sec. 3.1.2). Instead of predicting the entire object, our method predicts key center points, which allows the model to remain effective even under heavy occlusion. (2) Self-attention is used to highlight and focus on the regions of interest and ignore the background clutter. However, applying self-attention to high-resolution images brings its own difficulties, such as high computational cost and large memory consumption. To address them, we follow the work41 in our design and consider the following: (1) use convolutions to learn low-resolution feature maps and abstract from large images effectively; (2) use self-attention to process and combine the information that the convolutions have collected in the feature maps. Convolutions thus perform spatial downsampling of high-resolution features, and self-attention is then applied to the low-resolution features.
Network architecture
Our method’s overall structure consists of five levels based on the feature resolutions. It is separated into three components, namely the backbone, the feature fusion, and the prediction subnetworks, as illustrated in Fig. 1. In our method, we incorporate two input streams: the input image and the camera parameters as a vector \(a \in R^6\), used to compute the object translation, which comprises the focal lengths of the pinhole camera \(f_x\) and \(f_y\), the principal point coordinates \(p_x\) and \(p_y\), and the scaling factors of the image and the translation measurement. The input image undergoes feature extraction in the backbone component. Specifically, we employ Densenet as the backbone to extract features comprehensively. Notably, within the backbone, the embedded MHSA operates at level 4 to enhance feature representation by capturing long-range dependencies. Features extracted at different resolutions within the backbone are fused in the feature fusion component to create a comprehensive feature representation. This fusion process is pivotal for synthesizing information from various scales and resolutions, enriching the feature set. Finally, in the prediction subnetworks component, features from different levels of the feature fusion are utilized for diverse tasks, including classification, bounding box regression, rotation, and translation. To reduce the number of parameters and optimize computational efficiency, we adopt separable convolutions instead of conventional convolutions throughout the architecture, except for the backbone, effectively reducing computational costs while maintaining performance. Generally, a separable convolution consists of a depth-wise convolution (each input channel is treated independently), followed by another convolution called a pointwise convolution, which combines the output channels produced by the depth-wise convolution42,43. We explain the overall architecture in more detail as follows:
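As a concrete illustration, the sketch below shows this depthwise + pointwise decomposition using the tf.keras API cited in42; the filter count, normalization, and activation choices are illustrative assumptions, not the exact released configuration.

```python
import tensorflow as tf

def separable_conv_block(x, filters, kernel_size=3):
    """Depthwise convolution (each input channel filtered independently)
    followed by a pointwise 1x1 convolution that mixes the channels."""
    x = tf.keras.layers.SeparableConv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.activations.swish(x)  # SiLU/Swish activation

# Example: a 64x64 feature map with 112 channels keeps its spatial size
features = tf.random.normal([1, 64, 64, 112])
out = separable_conv_block(features, filters=112)
```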
Backbone
Typical image processing uses convolutional neural networks, which have dominated computer vision for several years because of their performance and lower computational cost compared to traditional neural networks. Our model’s backbone is the densely connected convolution network Densenet44. In particular, dense connectivity promotes feature reuse and reduces the number of parameters, which is a crucial element of the Densenet design. Densenet is more effective than other designs at reusing features, since feature concatenation is used, which is known as feature reusability. The benchmark datasets for 6D pose estimation that we use are small; besides transfer learning and data augmentation, the feature reusability provided by Densenet’s dense connectivity helps enrich the learned features, as shown in Fig. 2. Accordingly, compared to a traditional CNN, Densenet may learn a mapping with fewer parameters, since duplicate mappings do not need to be learned. Densenet generally comprises three primary parts: the initial layers (stem), dense blocks, and transition layers. We modify Densenet by embedding MHSA into it. Attention mechanisms are methods for directing focus to the most important areas of an image while ignoring unimportant areas45. Unlike traditional convolutional methods that often fail to preserve feature integrity in chaotic environments, our incorporation of MHSA into the Densenet architecture allows the network to dynamically prioritize and consolidate pertinent features from partially obscured objects12. This results in enhanced pose estimation accuracy in situations characterized by significant occlusion and numerous overlapping objects, in contrast to methods such as5,24. Theoretically, self-attention is a more adaptable operation that can simulate convolutional models’ behavior when encoding local features46,47. The attention module performs its calculations repeatedly and concurrently through so-called attention heads15: it divides its Query, Key, and Value parameters N ways and independently routes each split through a different head, and the related attention computations are then combined into a final attention score. This is called MHSA, which consists of numerous self-attention blocks that capture the intricate interactions between the various parts of the sequence (see Fig. 3). As mentioned above, identifying an object in a cluttered scene is critical and challenging in 6D pose estimation tasks; thus, we use the MHSA module to resolve this issue and estimate the relevance of an object to other objects in the scene. Densenet typically has 4 dense blocks, commonly referred to as [block1, block2, block3, block4]. We embed the MHSA between blocks 3 and 4, where the feature map resolution is low, as inspired by Srinivas et al.41, who found that adding self-attention there achieved significant performance gains in their experiments.
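A minimal sketch of how MHSA can be applied to the low-resolution feature map between dense blocks 3 and 4 is shown below, using tf.keras.layers.MultiHeadAttention; the head count, the residual/LayerNorm wiring, and the omission of positional encodings are illustrative assumptions.

```python
import tensorflow as tf

def mhsa_on_feature_map(feat, num_heads=4):
    """Multi-head self-attention over the spatial positions of a feature map.

    The (H, W) grid is flattened into a sequence of H*W tokens so that every
    position can attend to every other position, then reshaped back.
    """
    _, h, w, c = feat.shape
    tokens = tf.reshape(feat, (-1, h * w, c))
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=c // num_heads)
    attended = attn(query=tokens, value=tokens, key=tokens)           # Q, K, V share the same tokens
    tokens = tf.keras.layers.LayerNormalization()(tokens + attended)  # residual connection
    return tf.reshape(tokens, (-1, h, w, c))

# Example: a low-resolution map such as the output of dense block 3
out = mhsa_on_feature_map(tf.random.normal([1, 16, 16, 256]), num_heads=4)
```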
Feature fusion
Utilizing convolution kernels in a convolutional neural network to extract local features has emerged as the most effective approach for various vision tasks12,41. However, in order to achieve a reliable 6D pose estimate, it is crucial to integrate local features from various stages of the backbone into a unified representation. This is where feature fusion techniques play a crucial role. Feature fusion typically consists of several bottom-up and top-down multi-scale pathways for gathering feature maps from various input features35,48, as illustrated in Fig. 1 (middle). In our method, we adopt the BiFPN35 module as the multi-scale feature fusion. The fusion within BiFPN involves iteratively combining features at different resolutions from the different blocks of the backbone. Specifically, features from lower-resolution layers (which contain semantic information) are merged with features from higher-resolution layers (which retain more detailed spatial information). This merging process is accomplished through a sequence of weighted connections, which enable the network to dynamically determine the significance of each feature map at various scales. By fusing local and global features, the network can make predictions based on the visible portion of an object in the scene, thereby minimizing occlusion issues4. The fusion process enables the model to accurately capture the interconnections between the various parts of an object; distinguishing the parts of each object and the relationships between them enhances the understanding of the object’s structure41,49. We follow Tan et al.35 in scaling up the fusion block width and depth, as they found the best performance by scaling the width and depth with the following equations:
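Assuming the original EfficientDet35 coefficients are kept unchanged, these scaling rules take the form:

\(W_{bifpn} = 64 \cdot \left(1.35^{\phi}\right), \qquad D_{bifpn} = 3 + \phi\)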
where \(\phi\) is the scale-up parameter.
Prediction subnetworks
6D pose estimation usually uses four computer vision tasks: object detection, classification, rotation, and translation. We define four separate networks for those four tasks, as seen in Fig. 1(right), and then concatenate each network at different levels into a single output.
Classification/Bounding Box Regression Subnetworks These two subnetworks are adopted from the EfficientDet35 network and consist of convolution layers followed by batch normalization and a SiLU activation function. Both have the same depth and width. In our experiments, we follow the work24 to balance depth and width. We fix the width of both networks to be the same as the feature fusion component width (i.e., \(W_{class}=W_{box}=W_{fusion}\)), whereas the depth is specified by the equation:
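Assuming the same rule as in EfficientDet35 and EfficientPose24, this depth grows slowly with the scaling coefficient:

\(D_{class} = D_{box} = 3 + \lfloor \phi / 3 \rfloor\)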
where \(\lfloor \cdot \rfloor\) denotes the floor function.
Rotation Subnetwork
The central and crucial parts of 6D object pose estimation are the 3D rotation and translation of the object; therefore, the rotation and translation subnetworks are designed carefully to be simple yet effective. In the rotation network, we use the axis-angle representation, since it requires only three scalar values to represent a rotation, making it more compact than alternatives such as quaternions. This compactness reduces the memory footprint and computational complexity of the rotation network, making it more efficient during training and inference50. Prior work51 has shown that the axis-angle representation achieves significant performance24. Our rotation network predicts a single rotation vector for each anchor box. The rotation of 3D points by an angle \(\theta\) around an axis v, where \(\Vert v \Vert _2=1\), can be captured by a rotation matrix \(R = \exp (\theta [v]_{\times })\), where \([v]_{\times }\) is the vector’s skew-symmetric operator, i.e.,
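in standard form, for \(v=(v_1, v_2, v_3)^T\),

\([v]_{\times } = \begin{pmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{pmatrix}\)

so that the matrix exponential reduces to the Rodrigues formula \(R = I + \sin \theta \,[v]_{\times } + (1-\cos \theta )\,[v]_{\times }^2\).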
Hence, for each axis-angle vector \(y= \theta v\), there is a corresponding rotation matrix R, and vice versa. The rotation subnetwork comprises two crucial modules: an iterative refinement module and an MHSA module. The MHSA module allows the network to focus on the critical areas (target objects) and discard the clutter, while the iterative refinement module estimates the pose of an object and corrects pose estimation failures iteratively. The iterative refinement module consists of stacked depthwise-separable convolution layers, each followed by batch normalization and a SiLU activation function. The number of convolution layers in each iterative refinement module is calculated using Eq. 2. The output of each module is used as part of the input to the next (MHSA) module to make the overall pose estimation more accurate4. Our architecture is organized horizontally by feature resolution; the most effective way to leverage the self-attention mechanism while keeping the computational cost reasonable is to apply it to features with small resolutions. For low feature resolutions, we combine the MHSA module and the iterative refinement module4,52, whereas the MHSA module is removed for high-resolution features. Figure 4 depicts the structure of the rotation network. We initialize the rotation network with a separable convolution layer, followed by a sequence of MHSA and iterative refinement modules. We follow the traditional and influential architectures Densenet and Resnet in designing the rotation subnetwork, utilizing skip connections to pass data between these two modules in two fundamental ways: addition and concatenation. The input of each module is connected directly to its output using addition, to address the degradation problem as the network goes deeper. Simultaneously, we use concatenation to ensure feature reusability and enrich the features, since our dataset is small. Equation (2) specifies the number of repeated MHSA and iterative refinement modules; the same equation is also used to specify the number of heads in MHSA and the depth of the refinement module.
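To make the wiring concrete, the sketch below outlines one possible realization of a rotation-head stage under the description above; it is a minimal sketch in which the widths, stage count, anchor count, and the exact placement of the addition/concatenation skips are illustrative assumptions rather than the released architecture.

```python
import tensorflow as tf
L = tf.keras.layers

def refinement_module(x, width, depth):
    """Stacked depthwise-separable convs with BatchNorm + SiLU and an additive skip."""
    y = x
    for _ in range(depth):
        y = L.SeparableConv2D(width, 3, padding="same", use_bias=False)(y)
        y = L.BatchNormalization()(y)
        y = L.Activation("swish")(y)  # SiLU
    return L.Add()([x, y])  # addition skip: counters degradation in deeper stacks

def mhsa_block(x, num_heads):
    """Self-attention over the spatial positions of a low-resolution map (B, H, W, C)."""
    _, h, w, c = x.shape
    t = L.Reshape((h * w, c))(x)
    t = L.MultiHeadAttention(num_heads=num_heads, key_dim=max(c // num_heads, 1))(t, t)
    return L.Reshape((h, w, c))(t)

def rotation_head(feat, width=64, num_stages=3, num_heads=3, anchors=9):
    """Predict one axis-angle vector (3 values) per anchor location."""
    x = L.SeparableConv2D(width, 3, padding="same")(feat)
    for _ in range(num_stages):
        a = mhsa_block(x, num_heads)                  # focus on object regions, suppress clutter
        r = refinement_module(a, width, depth=num_heads)
        x = L.Concatenate()([x, r])                   # concatenation skip: feature reuse
        x = L.SeparableConv2D(width, 1, padding="same")(x)  # project back to the working width
    return L.SeparableConv2D(anchors * 3, 3, padding="same")(x)

# Example on a low-resolution fused feature map
rot = rotation_head(tf.keras.Input(shape=(16, 16, 64)))
```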
Translation Subnetwork The translation is the offset of the object’s coordinate system from the camera coordinate system. To regress the translation of an object \(t=(t_x, t_y, t_z)^T\), we follow5,24 by splitting the translation into two tasks: regressing the distance \(t_z\) and predicting the center point \(c= (c_x,c_y)^T\); we can then recover \(t_x\) and \(t_y\) with the following equation:
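Assuming the standard pinhole projection model used in5,24, the remaining components follow from the predicted center and distance as:

\(t_x = \dfrac{(c_x - p_x)\, t_z}{f_x}, \qquad t_y = \dfrac{(c_y - p_y)\, t_z}{f_y}\)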
where \(f_x\) and \(f_y\) refer to the camera focal lengths and \((p_x,p_y)^T\) is the principal point.
Figure 5: Illustration of the camera and object coordinate systems5.
As seen in Fig. 5, the translation can be estimated by localizing the object center and estimating the center’s distance from the camera. Besides feature fusion, estimating the center of the object instead of the whole object enables the network to estimate the pose even if part of the object is occluded. The structure of the translation subnetwork is similar to that of the rotation network. However, instead of regressing the translation directly in the iterative refinement module, the translation is divided into predicting the center point (x, y) and regressing the distance z. The outputs of each iterative module are produced by two separable convolution layers representing the xy and z components.
Compound scaling
We follow the EfficientDet35 scaling strategy, scaling up the architecture via a hyperparameter \(\phi\) that controls the input image resolution and various parts of the architecture, such as the BiFPN width and depth, the subnetwork depth, and the depth of the iterative refinement module, which equals the number of heads in each MHSA module.
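A minimal sketch of such a scaling schedule is shown below; the constants follow the EfficientDet/EfficientPose defaults and the input resolutions reported in the running-time experiments, and should be read as assumptions rather than the exact released configuration.

```python
def scaled_config(phi: int) -> dict:
    """Map the compound scaling coefficient phi to the main hyperparameters."""
    return {
        "input_resolution": 512 + 128 * phi,     # 512x512 at phi=0, 768x768 at phi=2
        "bifpn_width": int(64 * (1.35 ** phi)),  # EfficientDet additionally rounds widths to multiples of 8
        "bifpn_depth": 3 + phi,
        "subnet_depth": 3 + phi // 3,            # classification/box/rotation/translation subnet depth
        "num_heads": 3 + phi // 3,               # MHSA heads = iterative-refinement depth
    }

print(scaled_config(0))  # lightweight model ("OURS" on Linemod)
print(scaled_config(2))  # higher-capacity model used on Occlusion Linemod
```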
Experiments
In our experiments, we evaluate our method on two benchmark datasets for 6D pose estimation, Linemod28 and Occlusion Linemod27, and conduct an ablation study. We compare our method with several representative methods5,53,24,34,39,21,26,14,23,22 that use RGB color images.
Dataset
In order to make our method suitable for real-world applications, we deliberately opt for real RGB images instead of synthetic ones. This choice ensures that our method remains resilient to the complexities inherent in real-world scenarios. The Linemod dataset consists of 13 objects; each scene has only one annotated object, as shown in Fig. 6 (bottom), and it is extensively used by classical methods. A subset of the Linemod images was further annotated to produce Occlusion Linemod. Each of its images contains several annotated objects, as shown in Fig. 6 (top), and pose estimation is difficult since they are severely occluded. One of our objectives is to build a model for multi-object pose estimation, so we use the Occlusion Linemod dataset to train the model for multiple objects, since it contains eight annotated objects in each scene. For comparison with other methods, we follow previous works5,29 in splitting the dataset into training and testing sets. This partitioning selects training images such that there is a minimum angular separation of 15 degrees between object poses; as a result, approximately 15% of the images are designated for training, while the remaining 85% are allocated for testing.
Evaluation metrics
We evaluate our method with the average distance metric ADD(-S)54. It computes the average distance between the model points transformed by the ground-truth pose and by the predicted pose, with a maximum threshold of 10 cm. Since our datasets include both symmetric and asymmetric objects, for asymmetric objects ADD is defined as:
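Following the standard formulation54 and the symbols introduced below, ADD can be written as:

\(\mathrm{ADD} = \dfrac{1}{m} \sum _{x \in M} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\| _2\)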
where R and t are the ground-truth rotation and translation, respectively, \({\hat{R}}\) and \({\hat{t}}\) are the estimated rotation and translation applied to the model points, M denotes the set of the object’s 3D model points, and m is the number of points. For symmetric objects, ADD-S is defined as:
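Correspondingly, the standard symmetric variant takes the closest-point distance:

\(\mathrm{ADD\text{-}S} = \dfrac{1}{m} \sum _{x_1 \in M} \min _{x_2 \in M} \left\| (Rx_1 + t) - (\hat{R}x_2 + \hat{t}) \right\| _2\)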
Implementation details
Our model is implemented using the TensorFlow framework. The backbone is initialized with a Densenet model trained on Imagenet. The Keras ReduceLROnPlateau callback is used to reduce the learning rate; the initial learning rate is 1e-4, and the learning rate is reduced if no improvement is observed for 20 epochs. Data augmentation is used to address the limited size of the dataset and boost performance. The model is trained for 3000 epochs, evaluated every 10 epochs, with a batch size of 4. Since our model performs multiple tasks, each task has its own loss function (e.g., classification and bounding box regression), while rotation and translation are combined into a single loss function that takes both symmetric and asymmetric objects into account. For asymmetric objects, the loss function is defined as follows:
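Consistent with the symbols described below, the asymmetric transformation loss takes the same ADD-style form (the relative weighting against the classification and box losses is omitted here):

\(L_{asym} = \dfrac{1}{m} \sum _{x \in M} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\| _2\)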
where \({\hat{R}}\) and \({\hat{t}}\) indicate the predicted rotation and translation, respectively; R and t refer to the ground truth; M denotes the set of the object’s 3D model points, and m denotes the number of points. For symmetric objects, the loss function is the same, except that the minimum distance between each predicted point and any point in the ground-truth set is used instead of the distance between matching points, as shown below:
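Likewise, for symmetric objects the loss replaces the point-to-point distance with the closest-point distance:

\(L_{sym} = \dfrac{1}{m} \sum _{x_1 \in M} \min _{x_2 \in M} \left\| (Rx_1 + t) - (\hat{R}x_2 + \hat{t}) \right\| _2\)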
Ablation study
In this section, we conduct ablation studies on the Occlusion Linemod dataset to assess the impact of various design choices on the robustness of our method.
Figure 7: Ablation study results with different configurations in terms of ADD(-S) accuracy: (a) the baseline model with/without the self-attention mechanism, (b) accuracy when removing the FPN, (c) accuracy when excluding iterative refinement, (d) accuracy after increasing the depth of iterative refinement, and (e), (f) accuracies when using the Resnet and Efficientnet backbones, respectively.
Figure 7 shows how the baseline model’s performance compares with the different configurations.
Impact of component removal
Excluding MHSA leads to a considerable drop in overall performance, with an average ADD(-S) score of 78.18% as shown in Table 2, compared to the baseline model’s 81.84%. The decrease is especially noticeable for Ape (54.95%) and Duck (63.96%), which indicates that MHSA is essential for dealing with complicated environments and for enhancing pose estimation of textureless objects. Similarly, removing the FPN leads to a substantial performance decrease, with an average score of 77.10%. The Cat and Holepuncher objects, in particular, exhibit significant drops, to 52.63% and 80.57%, respectively. This demonstrates the crucial importance of multi-scale feature fusion for precise pose estimation, especially for objects that vary in size and shape. The absence of iterative refinement also leads to a performance decline, with an average score of 77.69%. These results indicate that feature fusion has a considerable impact on the model, providing a 5.10% improvement in performance. Further experimentation reveals that increasing the iterative steps, with the same depth as the \(\phi = 2\) model and four iterations, results in an average performance of 81.72%, nearly identical to the baseline model. This indicates the robustness and stability of MHSA across varying configurations, such as the balance between the depth and width of the model’s components, highlighting its ability to maintain high performance even as model complexity is adjusted. It also implies that although iterative refinement typically enhances accuracy, it may not be equally useful for all objects, especially those that require a more direct approach to pose estimation.
Performance across backbone architectures and scaling configurations
Our evaluation also includes the performance of the Resnet and Efficientnet backbone architectures. Resnet and Efficientnet attain mean scores of 78.92% and 79.22%, respectively, while the baseline with the Densenet backbone achieves an average score of 81.84%. This demonstrates Densenet’s capability to effectively capture and exploit dense feature connections for pose estimation. Additionally, we performed experiments with various models; Table 1 lists each model’s configuration. We compare our proposed model under different hyperparameters such as image resolution, BiFPN depth and width, number of heads, etc. As can be seen in Table 2, the \(\phi = 2\) model performs significantly better, achieving 84.38% on average and outperforming the \(\phi = 0\) baseline model by 2.54%. A possible explanation is that the more objects there are, the deeper the model must be to obtain accurate results. Considering efficiency, since the Linemod dataset is trained with an independent model for each object, it does not need a complex model, so the \(\phi =0\) model is used and denoted by “OURS” in all experiments on the Linemod dataset, while the \(\phi =2\) model is compared with state-of-the-art models on the Occlusion Linemod dataset.
Experimental results and comparison
As mentioned in Sect. 4.1, we evaluate our method on the Occlusion Linemod and Linemod datasets. Our method achieves significant performance compared with state-of-the-art methods that use RGB images on both two benchmark datasets. Our model achieves an average accuracy of ADD(-S) 84.38% and 97.45% on the Occlusion Linemod and Linemod, respectively.
Comparison on the Occlusion Linemod
Table 3 presents a comparative evaluation on the Occlusion Linemod dataset of our proposed method against notable state-of-the-art methods: EfficientPose24, Zebrapose21, Ullah et al.26, Deepfusion14, 6D-diff23, and RDPN6D53. It is important to note that some methods, such as RDPN6D, leverage RGB-D inputs, while others, including our approach, rely solely on RGB. For a fair comparison, we focus on methods that use the dataset for both training and testing, and we exclude comparisons with methods that use the Occlusion Linemod dataset solely for evaluation. Our method achieves the highest overall average ADD(-S) score of 84.38%, outperforming all other methods in occlusion scenes. EfficientPose follows with an average score of 83.98%, while methods like Zebrapose and Deepfusion exhibit lower average scores of 76.9% and 77.7%, respectively. For individual objects, our method exhibits competitive performance across numerous objects compared to other methods. It outperforms previous approaches by achieving the highest accuracy on objects such as Ape (66.46%) and Duck (78.85%). In addition, although 6D-diff gives the highest results for Can (97.9%) and Glue (92.0%), our method consistently achieves strong performance of 94.73% and 90.28% for these objects, respectively. RDPN6D benefits from using depth information, yielding strong results on objects like the Holepuncher; nevertheless, its overall performance falls behind our method’s, highlighting the efficiency and accuracy of our approach. This notable outcome emphasizes the strength and dependability of our method, particularly in difficult scenarios characterized by extensive occlusion and the presence of several objects. Furthermore, Fig. 8 presents qualitative results, demonstrating the accurate correspondence between predicted and ground-truth 3D bounding boxes on the Occlusion dataset.
Comparison on the Linemod
Table 4 compares our method with previous state-of-the-art methods5,22,24,26,34,39 on the Linemod dataset in terms of ADD(-S). Most methods use additional processing, such as RANSAC or PnP algorithms, to obtain and refine the 6D pose estimate. In contrast, our method eliminates these additional steps and boosts performance by focusing on the important regions and discarding background clutter. Our method achieves performance close to Ullah et al.26, who report the highest average ADD(-S) score of 97.62%, while our method reaches 97.45%, outperforming PoseCNN5 by 8.85%. Although methods like BDR6D22 rely on additional steps to estimate poses, our method shows significant improvement while maintaining computational efficiency. Notably, our method exhibits exceptional precision across several objects, achieving a perfect 100% score for four objects (Eggbox, Glue, Bench Vise, and Lamp), which underscores its robustness, particularly in handling symmetric objects. Symmetric objects generally yield higher scores for most methods due to reduced ambiguity in pose estimation. However, our method also excels on non-symmetric objects such as Driller (99.80%) and Cat (99.40%). Although our method is generally successful, it performs less effectively on the Ape and Duck objects, achieving accuracies of 86.09% and 92.11%, respectively; their limited dimensions and lack of notable visual characteristics impede their visibility and reliable pose estimation in complex environments. Nevertheless, our approach retains a strong advantage, improving performance for 9 of the 13 objects and surpassing an accuracy of 98% on them.
As noted above, the subpar performance on the Ape object can be attributed to its small size and lack of distinctive features, which reduce its visibility within the scene. Fig. 9 shows our method’s qualitative results for some objects.
Running time
Computational efficiency depends on many factors, such as the hardware configuration and image resolution. We compare our method with other methods, such as Efficientpose24, Zebrapose21, Ullah et al.26, Deepfusion14, and RDPN6D53, based on the results reported in their respective papers, which use different hardware configurations and image resolutions. For instance, RDPN6D53 uses an RTX 3090 GPU, while Ullah et al.26 use an Nvidia 2080Ti GPU. We measure our model on an NVIDIA A100-PCIE GPU. It is essential to keep these hardware variations in mind when analyzing the outcomes. While our model was tested with 512x512 and 768x768 image resolutions, comparable details regarding the image sizes of EfficientPose, ZebraPose, and Deepfusion are unavailable, which could affect direct runtime comparison. As shown in Fig. 10, our model exhibits efficiency comparable to the most advanced methods available for multi-object pose estimation. More precisely, when our lightweight \(\phi =0\) model is applied to an image with a resolution of 512x512, it takes approximately 27 milliseconds per image, a level of computational efficiency similar to RDPN6D, which reports a runtime of 29 ms per image. Our higher-capacity model, denoted \(\phi =2\), processes images with a resolution of 768x768 while still achieving competitive processing times. To thoroughly assess our model’s efficiency under resource-constrained conditions, we also evaluated it on a personal computer equipped with an Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz and an NVIDIA GeForce GTX 1650 GPU with 4GB memory. The lightweight \(\phi =0\) model achieved a runtime of 93 ms per image on the GPU and 160 ms per image on the CPU, showing solid performance even without strong GPU acceleration. The \(\phi =2\) model maintained runtimes of 195 ms per image on the GPU and 276 ms on the CPU. These results highlight the robustness, efficiency, and adaptability of our models across varying hardware environments.
Conclusion
We introduced an end-to-end method for 6D object pose estimation that leverages convolutional neural networks and self-attention mechanisms. Our method, structured across five levels of feature resolution, incorporates three key components: a Densenet-based backbone, feature fusion based on the BiFPN module, and prediction subnetworks that handle four tasks: classification, bounding box regression, rotation, and translation. We designed the translation and rotation subnetworks to be simple and effective by combining MHSA with an iterative refinement module to boost the performance of 6D object pose estimation. Our experiments show that our method achieves strong ADD(-S) performance of 97.45% and 84.38% on the Linemod and Occlusion Linemod datasets, respectively, showcasing its effectiveness in challenging environments with heavy occlusion and clutter. Furthermore, our ablations revealed the contribution of each component, validating the importance of feature fusion, MHSA, and iterative refinement for overall accuracy. However, our method does have limitations. For very small and textureless objects, its efficacy may decrease due to the increased difficulty of extracting features from small objects. Another notable constraint is that the method cannot be applied to new objects, as it is trained only on specific objects from small datasets. This limitation may hinder its usefulness in situations where the model encounters objects that were not included in the training data.
Data availability
The dataset employed and/or analyzed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/metwalli/linemod-occlusionlinemod-dataset.
Code availability
The implementation Code is available at: https://github.com/Metwalli/SMO-6DPose.
References
Kehl, W., Manhardt, F., Tombari, F., Ilic, S. & Navab, N. Ssd-6d: Making rgb-based 3D detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, 1521–1529 (2017).
Hoque, S., Arafat, M. Y., Xu, S., Maiti, A. & Wei, Y. A comprehensive review on 3D object detection and 6D pose estimation with deep learning. IEEE Access 9, 143746–143770 (2021).
He, Z., Feng, W., Zhao, X. & Lv, Y. 6D pose estimation of objects: Recent technologies and challenges. Appl. Sci. 11, 228 (2020).
Wang, C. et al. Densefusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3343–3352 (2019).
Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. Posecnn: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv preprint arXiv:1711.00199 (2017).
Tremblay, J. et al. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects. arXiv preprint arXiv:1809.10790 (2018).
Rad, M. & Lepetit, V. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, 3828–3836 (2017).
Bauer, D. et al. Challenges for monocular 6d object pose estimation in robotics. IEEE Trans. Robot. (2024).
Park, K., Mousavian, A., Xiang, Y. & Fox, D. Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10710–10719 (2020).
Di, Y. et al. So-pose: Exploiting self-occlusion for direct 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12396–12405 (2021).
Guan, J., Hao, Y., Wu, Q., Li, S. & Fang, Y. A survey of 6dof object pose estimation methods for different application scenarios. Sensors 24, 1076 (2024).
Pan, X. et al. On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 815–825 (2022).
Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition At Scale. arXiv preprint arXiv:2010.11929 (2020).
Zhou, J., Chen, K., Xu, L., Dou, Q. & Qin, J. Deep fusion transformer network with weighted vector-wise keypoints voting for robust 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13967–13977 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation By Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473 (2014).
Labbé, Y., Carpentier, J., Aubry, M. & Sivic, J. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, 574–591 (Springer, 2020).
Chen, D., Li, J., Wang, Z. & Xu, K. Learning canonical shape space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11973–11982 (2020).
Cai, M. & Reid, I. Reconstruct locally, localize globally: A model free method for object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3153–3163 (2020).
Hu, Y., Hugonot, J., Fua, P. & Salzmann, M. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3385–3394 (2019).
Su, Y. et al. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6738–6748 (2022).
Liu, P., Zhang, Q. & Cheng, J. Bdr6d: Bidirectional deep residual fusion network for 6D pose estimation. IEEE Trans. Autom. Sci. Eng. 21 (2024).
Xu, L., Qu, H., Cai, Y. & Liu, J. 6d-diff: A keypoint diffusion framework for 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9676–9686 (2024).
Bukschat, Y. & Vetter, M. Efficientpose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv preprint arXiv:2011.04307 (2020).
Castro, P. & Kim, T.-K. Crt-6D: Fast 6D object pose estimation with cascaded refinement transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 5746–5755 (2023).
Ullah, F., Wei, W., Fan, Z. & Yu, Q. 6d object pose estimation based on dense convolutional object center voting with improved accuracy and efficiency. Vis. Comput. 1–14 (2023).
Brachmann, E. et al. Learning 6d object pose estimation using 3D object coordinates. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13, 536–551 (Springer, 2014).
Hinterstoisser, S. et al. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, 858–865 (IEEE, 2011).
Peng, S., Liu, Y., Huang, Q., Zhou, X. & Bao, H. Pvnet: Pixel-wise voting network for 6Dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4561–4570 (2019).
Zakharov, S., Shugurov, I. & Ilic, S. Dpod: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF international conference on computer vision, 1941–1950 (2019).
Wang, Z., Chen, M., Guo, Y., Li, Z. & Yu, Q. Bridging the domain gap in satellite pose estimation: A self-training approach based on geometrical constraints. IEEE Transactions on Aerospace and Electronic Systems (2024).
Chen, H. et al. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2781–2790 (2023).
Song, C., Song, J. & Huang, Q. Hybridpose: 6D object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 431–440 (2020).
Iwase, S., Liu, X., Khirodkar, R., Yokota, R. & Kitani, K. M. Repose: Fast 6D object pose refinement via deep texture rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3303–3312 (2021).
Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10781–10790 (2020).
Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural. Inf. Process. Syst. 34, 30392–30400 (2021).
Wu, H. et al. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31 (2021).
Peng, Z. et al. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 367–376 (2021).
Zhang, Z., Chen, W., Zheng, L., Leonardis, A. & Chang, H. J. Trans6D: Transformer-based 6d object pose estimation and refinement. In European Conference on Computer Vision, 112–128 (Springer, 2022).
Bello, I., Zoph, B., Vaswani, A., Shlens, J. & Le, Q. V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3286–3295 (2019).
Srinivas, A. et al. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16519–16529 (2021).
tf.keras.layers.SeparableConv2D TensorFlow v2.9.1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/SeparableConv2D (Accessed 29 Aug 2022).
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251–1258 (2017).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017).
Guo, M.-H. et al. Attention mechanisms in computer vision: A survey. Comput. Vis. Med. 8, 331–368 (2022).
Pérez, J., Marinković, J. & Barceló, P. On the Turing Completeness of Modern Neural Network Architectures. arXiv preprint arXiv:1901.03429 (2019).
Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 54, 1–41 (2022).
Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
Hu, H., Gu, J., Zhang, Z., Dai, J. & Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3588–3597 (2018).
Zhou, Y., Barnes, C., Lu, J., Yang, J. & Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5745–5753 (2019).
Mahendran, S., Ali, H. & Vidal, R. 3D pose regression using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2174–2182 (2017).
Kanazawa, A., Black, M. J., Jacobs, D. W. & Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7122–7131 (2018).
Hong, Z.-W., Hung, Y.-Y. & Chen, C.-S. Rdpn6d: Residual-based dense point-wise network for 6dof object pose estimation based on RGB-D images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5251–5260 (2024).
Hinterstoisser, S. et al. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11, 548–562 (Springer, 2013).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62001452), Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China (No. 2021ZZ116), Science and Technology Program of Fuzhou City (No. 2022ZD001) and Science and Technology Program of Fujian Province (No. 2023T3040).
Funding
This work was supported by the National Natural Science Foundation of China (No. 62001452), Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China (No. 2021ZZ116), Science and Technology Program of Fuzhou City (No. 2022ZD001) and Science and Technology Program of Fujian Province (No. 2023T3040).
Author information
Contributions
Author Metwalli Al-Selwi: Conceptualisation, Methodology, Experiments, write-review & editing; Authors Ning Huang and Yan Chao: prepared drawings and figures, Writing - review & editing; Authors Yin Gao and Qiming Li: Experiments, Results analysis, Writing - review & editing; Author Dr. Li Jun: Writing - review & editing, Supervision.
Ethics declarations
Competing interests
We declare that there are no competing interests among the authors.
Ethics approval
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Al-Selwi, M., Ning, H., Gao, Y. et al. Enhancing object pose estimation for RGB images in cluttered scenes. Sci Rep 15, 8745 (2025). https://doi.org/10.1038/s41598-025-90482-6