Abstract
The 6D pose estimation of objects plays a crucial role in robotic sorting. To improve both the accuracy and the efficiency of pose estimation during grasping, this study proposes a deep learning pose estimation method based on PVN3D. Specifically, the backbone network of the image feature extraction model is optimized with dense connections and learnable grouped convolutions, which improves accuracy while reducing model complexity. In addition, the core idea of convolutional neural networks is integrated into point cloud feature extraction: dynamic convolution is used to process point cloud features, making point cloud processing more efficient and flexible. A parameter-free attention mechanism further improves the accuracy of the model. Extensive experiments on the publicly available LineMOD and YCB-Video datasets and on three core datasets of the BOP benchmark show that the proposed method has significant advantages in both computational efficiency and estimation accuracy, improving pose estimation accuracy and real-time performance at the same time.
Introduction
With the continuous advancements in robotics technology and deep learning, sorting technology driven by machine vision has been widely applied in various fields, such as industrial manufacturing, agricultural production, and nuclear waste treatment. A key challenge when robots perform sorting tasks is how to quickly and accurately determine the six-degree-of-freedom (6D) pose information of the target object under visual sensors1,2. At present, the research focus in this field is developing efficient and robust algorithms to cope with complex scenarios such as tightly stacked objects and mutual occlusion, ensuring accurate acquisition of the complete 6D pose of each object.
On the basis of the differences in input information types, pose-estimation methods can be roughly divided into three categories: those based on RGB images, 3D point-cloud data, and RGB-D image fusion3,4. Methods based on RGB images rely mainly on visual cues such as color distributions, texture features, and edge contours in the image to infer the pose states of objects. This type of method demonstrates good performance in scenes with rich surface textures and prominent feature points, and can efficiently complete feature extraction and matching tasks. However, when facing objects with scarce surface textures or single colors, such as white walls or smooth metal surfaces, traditional feature detection algorithms (such as SIFT, SURF, etc.) often struggle to extract sufficient feature points, which can adversely affect the accuracy and stability of pose estimation. The method of using three-dimensional point cloud data directly utilizes the three-dimensional geometric structure information of objects for pose estimation. However, owing to the large number of discrete spatial points that make up three-dimensional point cloud data, the data scale is massive and the arrangement is disorderly, posing great challenges to data processing and computation. Compared with the previous two methods, RGB-D images integrate RGB color information and depth information, providing more comprehensive and rich data support for pose estimation. By combining RGB information with depth information, RGB-D-based methods can more accurately perceive the shape features, texture details, and spatial positions of objects, thereby significantly improving the accuracy and robustness of pose estimation. However, despite significant progress in RGB-D pose estimation via deep learning methods, many challenges still need to be addressed, such as difficulty in feature extraction caused by smooth surfaces of target objects, interference factors in complex environments, and occlusion issues.
Although transformer-based and diffusion-based frameworks such as PoseDiffusion5 and foundation-model approaches such as FoundationPose6 have recently achieved impressive generalization in 6D pose estimation, their adoption in robotic systems remains constrained by several factors. These methods typically rely on large-scale pre-training on synthetic data and complex attention architectures that incur heavy computational and memory costs. Their inference latency often exceeds 200 ms per frame, which limits their applicability to real-time robotic manipulation. Moreover, global attention in diffusion or transformer backbones emphasizes semantic context at the expense of fine-grained geometric fidelity, making them less suitable for small or texture-less industrial objects.
In contrast, our design focuses on lightweight, geometry-aware efficiency. The CondenseNet backbone provides a learnable dense connectivity pattern that preserves multi-scale RGB features while greatly reducing redundant parameters. The Omni-Dimensional Dynamic Convolution (ODConv)7 module enhances local geometric adaptivity in point-cloud space by dynamically modulating convolutional kernels across spatial and channel dimensions, which addresses the local-detail loss that global-attention networks often suffer. The SimAM attention mechanism8 further refines discriminative feature regions through parameter-free energy-based weighting, enabling robust focus on occluded or partially visible structures with negligible computational overhead. Together, these components deliver a compact network that maintains high-accuracy 6D pose estimation with approximately 19% faster inference and 11% smaller model size than PVN3D9, meeting real-time requirements unmet by transformer or diffusion frameworks.
Beyond the classical PVN3D lineage, we also compare our approach with more recent architectures such as RCVPose10 and GDRNPP. RCVPose employs transformer-based region-consistent voting to enhance long-range feature reasoning, but its quadratic attention complexity hinders deployment on embedded systems. GDRNPP integrates dense geometric regression for high-precision correspondence learning but demands large memory and extensive offline refinement. Our method achieves comparable accuracy with significantly lower computational cost, demonstrating that efficient convolutional designs can still compete effectively with heavier attention or diffusion frameworks, particularly in robotic sorting and industrial scenarios where latency and resource constraints are critical.
This study proposes a new pose-estimation method based on deep learning, which substantially modifies the original feature extraction network of PVN3D9. The backbone network of the image extraction model is optimized via dense connections and learnable grouped convolution. This improvement not only reduces the complexity of the model but also improves its accuracy, making it easier to deploy on resource-constrained devices. Integrating the idea of convolutional neural networks into point cloud feature extraction and using dynamic convolution to process point cloud features makes point cloud data processing more efficient and flexible. In addition, by introducing a parameter-free attention mechanism, the model can identify and extract features more effectively, thereby further improving accuracy. To comprehensively evaluate the effectiveness of the proposed method, extensive comparative and ablation experiments were conducted on the publicly available LineMOD (LM) dataset.
The experimental results show that, compared with the single-mode pose estimation network that uses only image information or point cloud information, the proposed method has significant advantages in terms of model size and inference accuracy. Compared with the original PVN3D network, our method not only reduces the model complexity but also achieves significant improvements in the accuracy and real-time performance. In contrast to prior works such as PVN3D, FFB6D, and DenseFusion, which rely on static feature fusion, our method introduces a dynamic and biologically inspired fusion strategy that jointly optimizes efficiency, adaptivity, and robustness under occlusion.
To summarize, our contributions include the following:
- The image feature extraction network is improved by merging all feature output layers, promoting feature reuse and reducing information loss. Moreover, redundant convolution kernels were removed, improving model efficiency.
- The idea of convolutional neural networks is integrated into the point cloud feature extraction process, dynamic convolution techniques are used to process point cloud features, and a parameter-free attention mechanism is introduced to identify and extract features more effectively.
- For the LM and YCB video datasets, our method demonstrated significant advantages in model size, inference speed, and detection accuracy, validating the effectiveness and practicality of the proposed method.
The remainder of this paper is organized as follows. The section “Related work” provides a brief overview of RGB-, point-cloud-, and RGB-D-based pose estimation methods. A detailed description of the proposed method is provided in the section “Methodology”. The extensive experiments conducted to evaluate the proposed method are described in the section “Experiments”. Finally, the conclusion is presented in the section “Conclusions”.
Related work
RGB image-based pose estimation
Research on pose estimation based on RGB images has evolved significantly from traditional geometric methods to deep learning. Early research relied on manually designed features and geometric constraints, but it was not robust to texture loss and occlusion and required precise 3D model matching, which limited practical applications. With the rise of deep learning, research has shifted toward end-to-end pose regression. Kendall et al. (2015) proposed PoseNet11, which pioneered the use of CNNs for direct regression of camera poses. The NOCS method proposed by Wang et al. (2019) achieves category-level pose estimation by normalizing the object coordinate space, greatly improving the algorithm’s generalizability12. The PVNet proposed by Peng et al. (2019) uses a vector field voting mechanism to predict keypoint positions, improving robustness under occlusion13. However, this method still suffers from ambiguity when estimating the poses of symmetric objects and requires additional prior knowledge of 3D shapes. The PoseDiffusion model5 was the first to apply diffusion models to 6D pose estimation; it predicts object poses through a gradual denoising process, significantly improving estimation accuracy in complex scenes. However, methods based on diffusion models have high computational costs and have difficulty meeting real-time requirements.
Point cloud-based pose estimation
The development of point cloud pose estimation methods is closely related to the advancement of 3D perception technology. Early work relied mainly on traditional registration algorithms, such as the FPFH feature combined with the ICP algorithm proposed by Rusu et al. (2009), which performed well in structured scenes14. The GICP algorithm proposed by Segal et al. (2009) further improved the registration accuracy15. However, these methods are highly sensitive to point cloud density and quality, and their performance decreases sharply for sparse or noisy point clouds. With the emergence of deep learning frameworks such as PointNet16, point cloud processing methods improved substantially. PointNet++17, another classic architecture in point cloud processing, jointly models local geometric features and the global context of point clouds through a hierarchical feature learning framework. Together, PointNet and PointNet++ laid the foundation for learning directly from unordered 3D point clouds. The Point Transformer proposed by Zhao et al. (2021) effectively captures long-range dependencies between points through a self-attention mechanism, demonstrating superior performance in complex scenes18. DenseFusion19 improves the accuracy of pose estimation through pixel-by-pixel feature fusion. FFB6D20 uses a full-flow bidirectional fusion network to further improve performance. However, these methods have high computational complexity and poor real-time performance, and their robustness to missing areas in point clouds still needs to be improved. Subsequent studies, such as GDR-Net21 and RCVPose10, demonstrated that incorporating local geometric reasoning can further enhance accuracy under occlusion. Motivated by these insights, our method retains a point-based branch but replaces static convolution with dynamic kernels to increase spatial adaptability.
RGB-D-based pose estimation
RGB-D methods combine visual and depth information and therefore have both textural and geometric advantages. Shotton et al. (2013) were the first to use random forest regression of scene coordinates22, laying the foundation for subsequent research. The dense visual odometry based on RGB-D proposed by Kerl et al. in 2013 further promoted the development of this field23. However, this type of method depends strongly on the accuracy of depth sensors and performs poorly at long range or on reflective surfaces. The PoseCNN proposed by Xiang et al. (2018) was the first to achieve end-to-end RGB-D pose estimation24. The FFB6D network designed by He et al. (2021) further improved the estimation accuracy in complex scenes through bidirectional feature fusion. PVN3D uses PointNet++ and a deep Hough voting network for 3D keypoint detection and estimates the 6D pose of an object through a least squares fitting algorithm. GDR-Net further improves the accuracy of pose estimation through geometry-aware dense correspondence learning, but has low computational efficiency and many model parameters. These works establish the standard paradigm of RGB-D fusion for 6D pose estimation. Our framework builds upon this lineage but emphasizes lightweight design and adaptive feature modeling to improve real-time applicability.
Attention mechanisms
Attention mechanisms improve feature discrimination by weighting informative regions. Channel-attention methods such as SENet25 and combined spatial-channel methods such as CBAM26 have achieved significant success but at the cost of additional parameters. In contrast, SimAM is a parameter-free attention module inspired by energy-minimization principles in biological vision. It computes neuron importance through variance-based energy functions, enhancing salient spatial structures without extra weights or FLOPs. Given its minimal overhead and strong localization ability, SimAM is particularly suitable for compact RGB-D architectures. In our framework, SimAM refines the fused RGB-D features to emphasize key geometric cues under partial occlusion, complementing the efficiency of CondenseNet and the adaptivity of ODConv.
Methodology
As shown in Fig. 1, this study proposes an innovative pose estimation network model that takes RGB images captured by a depth camera and corresponding depth images as inputs. For RGB images, we use CondenseNet27 for feature extraction. Owing to its excellent computational efficiency and lightweight model design, CondenseNet not only significantly reduces the number of parameters and computational costs but also deeply explores the rich feature information in RGB images, providing a solid foundation for subsequent processing. For depth images, we convert them into a set of local regions of point cloud data through sampling and grouping layers. To capture the local geometric and shape features of point clouds, we introduce ODConv7. By considering the directionality of point cloud data, ODConv can more effectively extract spatial structural information from point clouds, significantly enhancing the expressive power of the model. In addition, we introduced the SimAM attention mechanism to further enhance key features and improve the performance of the model8. Unlike prior integration-based networks such as FFB6D20 and DenseFusion19, which simply fuse RGB and point-cloud features through feature concatenation, our design introduces mutually adaptive feature interaction via CondenseNet, ODConv, and SimAM. Specifically, CondenseNet’s learnable dense connectivity allows compact RGB encoding that complements point-wise geometric reasoning in ODConv. ODConv dynamically adapts convolutional kernels to spatial and channel variations, which addresses the geometric rigidity of PVN3D. Finally, SimAM provides parameter-free attention that enhances discriminative key-point responses under occlusion. The joint effect forms a lightweight yet geometrically adaptive architecture not present in previous PVN3D-based extensions.
Schematic diagram of the 6D pose estimation process. First, image and point cloud features are extracted. The dense fusion module then feeds the extracted features to the 3D keypoint detection module, the instance semantic segmentation module and the center-of-mass voting module. Finally, the least squares method is used to estimate the 6D pose of the target.
After extracting image features and point cloud features, we use a Dense Fusion module to feed these features separately to the 3D keypoint detection module, the instance semantic segmentation module, and the center point voting module to achieve deep fusion and utilization of multimodal features. Finally, we utilize the correspondence between the camera coordinate system and the target object coordinate system, and use the least squares method to accurately estimate the 6D pose of the target, thereby achieving precise positioning and pose estimation of the target object.
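The final least-squares step can be sketched as a rigid fit between keypoints in the object frame and the same keypoints detected in the camera frame. The snippet below is a minimal illustration using the SVD-based (Kabsch) solution, not the authors' implementation; the function name `fit_pose_least_squares` and the assumption of already-matched keypoint pairs are hypothetical.

```python
import numpy as np

def fit_pose_least_squares(model_kps, camera_kps):
    """Find R, t minimizing sum_i ||R @ m_i + t - c_i||^2 for matched keypoints."""
    model_kps = np.asarray(model_kps, dtype=float)    # (N, 3), object frame
    camera_kps = np.asarray(camera_kps, dtype=float)  # (N, 3), camera frame
    mu_m = model_kps.mean(axis=0)
    mu_c = camera_kps.mean(axis=0)
    # Cross-covariance of the centered keypoint sets.
    H = (model_kps - mu_m).T @ (camera_kps - mu_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_m
    return R, t
```

With noise-free correspondences the fit recovers the pose exactly; with voted keypoints it returns the pose that minimizes the summed squared residual.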
Feature extraction based on dense group convolution
ResNet28, a milestone CNN design, alleviates the vanishing-gradient problem in deep network training through residual connections, improving model performance. However, residual connections offer limited feature reuse and feature-transfer efficiency, and increasing the network depth leads to a dramatic increase in parameters and computational complexity, slow training and inference, and poor suitability for resource-constrained devices. This study therefore introduces the CondenseNet densely connected network, which adopts a learnable dense connection pattern and can dynamically adjust cross-layer connection paths to optimize the network structure more flexibly. While retaining the advantages of feature reuse, its optimization strategy further reduces the number of parameters, making the model more lightweight.
The structural diagram of CondenseNet is shown in Fig. 2. The core of CondenseNet is the dense connection layer, where each layer is directly connected to all previous layers to achieve comprehensive feature reuse. This connection method ensures that each layer can obtain the feature information extracted by all previous layers, thereby increasing the diversity and expressive power of the features. CondenseNet introduces learnable grouped convolution, which means that the network can dynamically adjust cross layer connection paths and automatically select the optimal connection method on the basis of training data and learning task requirements. This flexibility enables CondenseNet to adapt better to different application scenarios and datasets. CondenseNet further reduces the number of model parameters by optimizing the network structure and connection patterns. This not only reduces the storage requirements of the model, but also improves computational efficiency.
Diagram of the CondenseNet structure. Each layer is directly connected to all previous layers for full feature reuse.
This study used CondenseNet as the backbone network for RGB image feature extraction, significantly improving the overall performance and application value of the model. Owing to the combination of dense connections and learnable connection patterns, CondenseNet performs well in feature extraction, as each layer can fully utilize the feature information of all previous layers, improving the efficiency of feature reuse. Moreover, this structure also promotes the smoothness of gradient flow, effectively alleviating the problem of gradient vanishing in deep network training and enabling the model to converge more stably. In pose estimation tasks, CondenseNet can significantly accelerate inference speed without sacrificing accuracy, making it particularly suitable for robot sorting tasks that require high real-time performance. In addition, the dynamic feature selection mechanism of CondenseNet enhances the robustness and generalization ability of the model, making it perform better in complex scenarios.
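The two ingredients described above, dense connectivity and learnable grouped convolution, can be illustrated with a small sketch. It assumes the pruning rule from the CondenseNet paper, in which each filter group keeps the input channels with the largest L1 weight norm; the function names, the 1x1 weight layout, and the `condense_factor` parameter are illustrative and are not taken from this paper's code.

```python
import numpy as np

def dense_block_channels(c_in, growth_rate, num_layers):
    """Input width of each layer when every layer receives all earlier outputs."""
    return [c_in + i * growth_rate for i in range(num_layers)]

def condense_mask(weight, groups, condense_factor):
    """Per filter group, keep the 1/condense_factor input channels with largest L1 norm.

    weight: (c_out, c_in) matrix of a 1x1 convolution.
    Returns a boolean mask with the same shape as weight.
    """
    c_out, c_in = weight.shape
    assert c_out % groups == 0 and c_in % condense_factor == 0
    keep = c_in // condense_factor
    rows_per_group = c_out // groups
    mask = np.zeros(weight.shape, dtype=bool)
    for g in range(groups):
        rows = slice(g * rows_per_group, (g + 1) * rows_per_group)
        importance = np.abs(weight[rows]).sum(axis=0)  # L1 norm per input channel
        kept = np.argsort(importance)[-keep:]          # indices of strongest inputs
        mask[rows, kept] = True
    return mask
```

In this sketch, a condensation factor of 4 leaves one quarter of the original 1x1 weights, which is the source of the parameter savings, while the dense concatenation keeps every earlier feature map available for selection.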
Point cloud feature extraction based on omni-dimensional dynamic convolution
PointNet++ demonstrates good performance and computational efficiency in tasks such as 3D object detection and scene segmentation. However, in the face of high-density noise, nonuniform point distributions, and real-time processing requirements for large-scale point clouds in complex scenes, this architecture still has shortcomings: limited local feature extraction ability, weak robustness to point density changes, and high computational resource consumption, which restrict its practical application in high-density and large-scale scenes.
In response to the above issues, this study introduces the dynamic kernel learning mechanism of convolutional neural networks into point cloud feature extraction and proposes an improved point cloud processing framework based on Omni-Dimensional Dynamic Convolution (ODConv). As shown in Fig. 3, this module generates dimension-aware attention tensors through a global feature aggregation branch (global average pooling followed by a multilayer perceptron) and establishes a dynamic weight modulation mechanism along four orthogonal dimensions: the spatial dimension, the input channel, the output channel, and the convolutional kernel itself. This decoupled multi-dimensional attention strategy achieves fine-grained modeling of spatial topological relationships, channel correlations, and local pattern variations in point cloud data by learning the four attentions in parallel. The core advantage of this mechanism lies in its ability to model the complex characteristics of point cloud data dynamically. In the spatial dimension, dynamically adjusting the receptive field range alleviates the feature extraction bias caused by uneven point cloud density. In the channel dimension, an attention mechanism adaptively calibrates feature channels, reducing the interference of noise with key semantic features. In the convolutional kernel dimension, the use of deformable convolution techniques combined with geometric prior knowledge enhances the model’s ability to complete and reconstruct incomplete point cloud data.
ODConv structure diagram. Dynamic weight modulation mechanism along four orthogonal dimensions: spatial dimension, input channel, output channel, and the convolution kernel itself.
After integrating omni-dimensional dynamic convolution, PointNet++ achieves a significant optimization of its network architecture and substantially enhances its feature extraction capability. ODConv constructs a multidimensional dynamic modulation mechanism that enables convolutional kernels to adapt along multiple dimensions, such as space, channels, and kernel morphology, effectively capturing complex geometric structures and subtle feature differences in point cloud data. This design enhances the network’s ability to analyze complex scenes and fine structures, and it is notably more robust to high-density noise and nonuniform point distributions. ODConv also balances accuracy and speed by combining parameter-sharing kernel decomposition with grouped convolution, improving performance while maintaining computational efficiency.
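The four-way modulation can be sketched as follows. This is a simplified illustration under stated assumptions rather than the network used in this study: the attention heads are plain linear maps with hypothetical names, the kernel bank is a 1D convolution bank of shape `(n_kernels, c_out, c_in, k)`, and, following the ODConv paper, sigmoid is used for the spatial and channel attentions and softmax for the kernel attention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def odconv_attentions(x, heads, reduce_w):
    """x: (c_in, n_points). Derive the four attention vectors from pooled features."""
    pooled = x.mean(axis=1)                      # global average pooling over points
    hidden = np.maximum(reduce_w @ pooled, 0.0)  # squeeze layer followed by ReLU
    a_s = sigmoid(heads["spatial"] @ hidden)     # (k,)         kernel positions
    a_c = sigmoid(heads["in"] @ hidden)          # (c_in,)      input channels
    a_f = sigmoid(heads["out"] @ hidden)         # (c_out,)     output channels
    a_w = softmax(heads["kernel"] @ hidden)      # (n_kernels,) whole kernels
    return a_s, a_c, a_f, a_w

def odconv_kernel(bank, a_s, a_c, a_f, a_w):
    """bank: (n_kernels, c_out, c_in, k). Modulate along all four dimensions, then sum."""
    scaled = (bank
              * a_w[:, None, None, None]
              * a_f[None, :, None, None]
              * a_c[None, None, :, None]
              * a_s[None, None, None, :])
    return scaled.sum(axis=0)                    # (c_out, c_in, k) input-dependent kernel
```

Because the attentions are computed from the input itself, each local region of the point cloud is convolved with its own effective kernel, which is the adaptivity the paragraph above refers to.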
Focused characterisation based on SimAM attention
The ball query mechanism of PointNet++ has significant drawbacks in complex scenes because of its fixed-radius sampling: it cannot adapt at occluded boundaries, resulting in confusion between occluded objects and their features; sparse geometric features in areas with uneven density are prone to excessive smoothing, resulting in loss of local details; and multilevel feature fusion adopts simple addition or concatenation, ignoring hierarchical semantic differences, limiting multiscale expression ability, and reducing model accuracy. To this end, we introduce the SimAM attention mechanism, inspired by the biological visual suppression process, which dynamically adjusts attention weights based on energy minimization, enabling the network to focus on key features while suppressing irrelevant ones, thereby enhancing feature extraction and semantic understanding in complex scenes.
SimAM is a parameter-free attention mechanism whose core objective is to dynamically adjust feature weights by minimizing an energy function. Its network structure is shown in Fig. 4.
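The fixed-radius behaviour can be made concrete with a minimal ball-query sketch. It is a simplified stand-in for the PointNet++ grouping layer (which additionally pads short groups by repeating indices); the function name and arguments are illustrative.

```python
import numpy as np

def ball_query(points, centers, radius, max_samples):
    """Return, for each center, the indices of up to max_samples points within radius."""
    groups = []
    for c in centers:
        dist = np.linalg.norm(points - c, axis=1)
        idx = np.flatnonzero(dist <= radius)[:max_samples]
        groups.append(idx)
    return groups
```

With one shared radius, a center in a dense region fills its group immediately while a center in a sparse region collects very few neighbours, which is the density sensitivity discussed above.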
SimAM network structure. 3 Dynamic adjustment of feature weights for spatial attention and channel attention.
For a target neuron \(t\) and the remaining neurons \(x_i\) in the same channel, the energy function is defined as:
where \(\hat{t} = w_t t + b_t\) and \(\hat{x}_i = w_t x_i + b_t\) are affine transforms of the activations, and \(y_t\) and \(y_o\) are target values for the “highlighted” and “non-highlighted” groups, respectively. Minimizing the above equation is equivalent to training linear separability between neuron \(t\) and the other neurons within the same channel. Using binary labels with \(y_t = 1\) and \(y_o = -1\) and adding a regularization term, the final energy function is defined as follows:
This form encourages the transformed target neuron to be close to +1 while pushing the other neurons toward −1. Solving the above equation in closed form yields the weight and bias:
where \(\mu_t\) and \(\sigma_t^2\) denote the mean and variance, respectively, and the final simplified energy equation is as follows:
A lower \(e_t^{\star}\) indicates that the target neuron is easier to separate from its surroundings; we therefore treat \(\frac{1}{e_t^{\star}}\) as an importance signal. Following the definition of the attention mechanism, the features are rescaled through a sigmoid function, which bounds overly large values in \(E\). Here \(E\) groups the energies \(e_t^{\star}\) over all spatial and channel dimensions, \(X\) is the input feature, and \(\odot\) denotes element-wise multiplication. The final SimAM formula is as follows:
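For reference, the SimAM formulation that the definitions in this subsection follow can be written out as below. This is a transcription of the standard form from the SimAM paper8, not a new derivation; \(M\) (the number of neurons in one channel), \(\lambda\) (the regularization coefficient), and the channel-wide statistics \(\hat{\mu}\) and \(\hat{\sigma}^2\) are symbols taken from that formulation and are assumed to match the notation intended here.

```latex
% Energy of target neuron t against the other M - 1 neurons x_i of its channel
e_t(w_t, b_t, \mathbf{y}, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1} \sum_{i=1}^{M-1} (y_o - \hat{x}_i)^2

% With y_t = 1, y_o = -1 and the regularizer \lambda w_t^2
e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1} \sum_{i=1}^{M-1} \bigl(-1 - (w_t x_i + b_t)\bigr)^2 + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2

% Closed-form minimizer
w_t = -\frac{2 (t - \mu_t)}{(t - \mu_t)^2 + 2 \sigma_t^2 + 2 \lambda}, \qquad b_t = -\frac{1}{2} (t + \mu_t) w_t

% Minimal energy
e_t^{\star} = \frac{4 (\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2 \hat{\sigma}^2 + 2 \lambda}

% Feature refinement
\tilde{X} = \mathrm{sigmoid}\!\left(\frac{1}{E}\right) \odot X
```

Since \(1/e_t^{\star} = (t - \hat{\mu})^2 / \bigl(4(\hat{\sigma}^2 + \lambda)\bigr) + 1/2\), the attention weight of every neuron follows from the channel mean and variance alone, which is why the module adds no learnable parameters.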
Introducing the SimAM attention mechanism into PointNet++ can significantly improve the processing of complex point cloud scenes. Through dynamic weight allocation and the energy minimization principle, it compensates for the limitations of fixed-radius ball queries: it adaptively adjusts attention weights at occluded boundaries and in areas of uneven density, takes into account both local geometric structure and global contextual information, suppresses interference from irrelevant points, and enhances key features. SimAM also improves the multilevel feature fusion strategy by adaptively adjusting weights at different levels, addressing the tendency of traditional methods to ignore semantic differences and enhancing the semantic consistency of multiscale features. Its parameter-free design requires no additional learnable parameters and has low computational overhead and fast inference speed, making it suitable for tasks with high real-time requirements.
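A minimal NumPy sketch of the SimAM weighting is given below. It assumes the closed-form inverse energy from the SimAM paper and a feature layout of `(channels, n)`, where `n` is the number of neurons (points or pixels) per channel; the default `lam=1e-4` and the function names are assumptions for illustration, not settings reported in this study.

```python
import numpy as np

def simam_weights(x, lam=1e-4):
    """x: (channels, n). Return sigmoid(1 / e*) for every neuron."""
    x = np.asarray(x, dtype=float)
    n = x.shape[1] - 1
    d = (x - x.mean(axis=1, keepdims=True)) ** 2   # squared deviation from channel mean
    var = d.sum(axis=1, keepdims=True) / n         # channel variance
    inv_energy = d / (4.0 * (var + lam)) + 0.5     # closed-form 1 / e*
    return 1.0 / (1.0 + np.exp(-inv_energy))

def simam(x, lam=1e-4):
    """Reweight features element-wise; no learnable parameters are involved."""
    return np.asarray(x, dtype=float) * simam_weights(x, lam)
```

Neurons that deviate strongly from their channel mean receive larger weights, so distinctive responses are emphasized relative to the uniform background of the channel.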
Experiments
Experimental environment
The models were trained on a machine running Ubuntu 18.04 with an Intel(R) Core(TM) i7-14700 CPU, 16 GB of memory, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory, using PyTorch 1.9.0 and CUDA 11.1. The model was trained with the Adam optimizer and an initial learning rate of \(10^{-4}\). The learning rate was decayed by a factor of 0.1 every 70 epochs following a fixed step schedule. The network was trained for 300 epochs on the LineMOD (LM) dataset and 500 epochs on the YCB-Video dataset with a batch size of 32. Furthermore, to evaluate generalization and robustness, we tested the proposed model against state-of-the-art (SOTA) methods on three core datasets from the BOP benchmark: Occluded LineMOD (LM-O), T-LESS, and Homebrewed (HB). Each model was trained for 100 epochs with a batch size of 32 under identical configurations to ensure a fair comparison.
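The step schedule described above can be written out explicitly. The helper below is an illustrative sketch of the stated rule (initial rate \(10^{-4}\), multiplied by 0.1 every 70 epochs); in PyTorch the same behaviour is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=70, gamma=0.1)`, although the training script itself is not shown in this paper.

```python
import math

def step_lr(epoch, base_lr=1e-4, step_size=70, gamma=0.1):
    """Learning rate at a given epoch under a fixed step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rates at the start of each decay stage of a 300-epoch LM run.
stage_rates = [step_lr(e) for e in (0, 70, 140, 210, 280)]
```

Under this rule a 300-epoch run passes through five stages, ending four decades below the initial rate.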
Datasets
The datasets used in this experiment are the LM public dataset and the YCB video dataset.
The LM dataset, one of the classic benchmarks in object pose estimation, covers diverse scenes and contains 13 objects of different shapes, such as water bottles, plastic bottles, phones, and egg cartons. Because the model structure adopted in this study requires a large amount of data, we adopt a data augmentation scheme based on background replacement to expand the training set. We chose the SUN2012 dataset, which contains 118,000 high-quality images covering natural scenes and indoor environments, as the background source. Its scene diversity provides representative background environments for the target objects in the LM dataset. Each RGB image was composited with randomly selected real-world backgrounds from the SUN2012 dataset to increase visual diversity. This augmentation encourages the network to focus on geometric and structural cues rather than relying solely on texture or color information. As a result, it reduces overfitting, improves robustness to illumination and background variation, and significantly enhances the model’s ability to extract discriminative features from low-texture or reflective objects, which are common in real-world industrial scenarios.
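Background replacement of this kind amounts to a masked composite. The sketch below assumes that a per-pixel object mask is available for each training image and that the background has already been cropped or resized to the same resolution; the function name is illustrative.

```python
import numpy as np

def replace_background(rgb, object_mask, background):
    """Keep object pixels from rgb and take every other pixel from background.

    rgb, background: (H, W, 3) images of equal size; object_mask: (H, W), nonzero on the object.
    """
    mask = object_mask.astype(bool)[..., None]   # (H, W, 1), broadcast over channels
    return np.where(mask, rgb, background)
```

Only the RGB image changes in this operation, so the depth map and the pose annotation of the original sample remain valid for the composited image.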
As shown in Fig. 5, the enhanced dataset significantly improves data diversity while maintaining the original annotation accuracy. This data augmentation method effectively alleviates the problem of training sample scarcity, and more importantly, enhances the model’s adaptability to real-world scenarios, providing reliable and comprehensive data support for subsequent model training.
(Left) Sample of LM dataset image following data enhancement. (Right) Composite image of objects with different backgrounds prepared using LM dataset.
The YCB-Video dataset is an important benchmark for evaluating 6D object pose estimation. By collecting high-quality RGB-D video sequences of 92 real indoor scenes, it records multi-view observations of objects in everyday environments. Its construction takes into account the complexity of practical application scenarios, in particular dynamic lighting variations, partial occlusions, multi-object interactions, and continuous viewpoint changes. Such diverse and realistic settings make YCB-Video an effective testbed for assessing model robustness and generalization beyond controlled laboratory conditions. Evaluating our method on YCB-Video demonstrates its applicability to practical robotic perception tasks, where objects often appear in cluttered, low-texture, and dynamically illuminated scenes.
Evaluation criteria
This study used widely used evaluation metrics in the field of pose estimation, namely, the average distance metric (ADD) and average nearest distance metric (ADD-S), to evaluate the accuracy of 6D pose estimation and compare it with other related advanced methods in the current field.
1. ADD: The average distance metric (ADD) is a common measure of the accuracy of 6D pose estimation. It quantifies the difference between the predicted 6D pose (rotation and translation parameters) and the real pose. In practice, the known point cloud of the target object is first transformed with the predicted pose and with the real pose, which yields two point clouds; the Euclidean distances between corresponding points of the two clouds are then calculated, and their average is taken as the evaluation index. The specific formula is as follows:
\[\text{ADD}=\frac{1}{N}\sum_{i=1}^{N}\left\Vert \left({R}_{p}{p}_{i}+T\right)-\left({R}^{*}{p}_{i}+{T}^{*}\right)\right\Vert\]
where N denotes the total number of sampling points in the point cloud data, \(\:{p}_{i}\) denotes the ith sampling point of the target object, \(\:{R}_{p}\) and \(\:T\) denote the predicted rotation matrix and translation vector, respectively, and \(\:{R}^{*}\) and \(\:{T}^{*}\) denote the true rotation matrix and translation vector, respectively. This metric intuitively reflects the deviation between the predicted pose and the real pose and is suitable for the evaluation of asymmetric objects.
2. ADD-S: Unlike the ADD metric, the ADD-S metric does not rely on a strict point-to-point correspondence. Instead, it calculates the distance from each point in the predicted point cloud to the nearest point in the real point cloud and averages these distances. The formula is as follows:
\[\text{ADD-S}=\frac{1}{N}\sum_{i=1}^{N}\underset{j}{\text{min}}\left\Vert \left({R}_{p}{p}_{i}+T\right)-\left({R}^{*}{p}_{j}+{T}^{*}\right)\right\Vert\]
where i indexes the sampled points of the predicted point cloud, j traverses all the sampled points of the real point cloud, and \(\:{\text{m}\text{i}\text{n}}_{j}\) denotes that, for each predicted point \(\:{R}_{p}{p}_{i}+T\), the closest point \(\:{R}^{*}{p}_{j}+{T}^{*}\) in the real point cloud is used.
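The two metrics can be illustrated with a short Python sketch. It is a minimal example under simplifying assumptions, not the evaluation code used in the experiments: rotations are 3x3 nested lists, points are 3-tuples, and the nearest-neighbour search in ADD-S is brute force. The helper names `transform`, `add_metric`, and `add_s_metric` are hypothetical.

```python
import math


def transform(points, rotation, translation):
    """Apply p -> R p + T to every 3D point (R is a 3x3 nested list)."""
    return [
        tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        for p in points
    ]


def add_metric(points, r_pred, t_pred, r_true, t_true):
    """ADD: mean distance between corresponding transformed points."""
    pred = transform(points, r_pred, t_pred)
    true = transform(points, r_true, t_true)
    return sum(math.dist(a, b) for a, b in zip(pred, true)) / len(points)


def add_s_metric(points, r_pred, t_pred, r_true, t_true):
    """ADD-S: mean distance from each predicted point to its nearest true point."""
    pred = transform(points, r_pred, t_pred)
    true = transform(points, r_true, t_true)
    # Brute-force nearest neighbour, O(N^2); adequate for a small sketch.
    return sum(min(math.dist(a, b) for b in true) for a in pred) / len(pred)


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    half_turn_z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]  # 180 degrees about z
    model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]  # toy object, symmetric about z
    zero = (0.0, 0.0, 0.0)
    print("ADD  :", add_metric(model, half_turn_z, zero, identity, zero))
    print("ADD-S:", add_s_metric(model, half_turn_z, zero, identity, zero))
```

In the toy example the predicted pose differs from the true pose by a half turn about the symmetry axis. ADD penalizes this because corresponding points are swapped, whereas ADD-S does not, which is why ADD-S is the usual choice for symmetric objects. For dense point clouds, a spatial index such as a k-d tree would normally replace the brute-force search.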
Comparison experiment
Using the evaluation metrics described above, this section presents comparative experiments with other advanced algorithms on the LM and YCB video datasets.
The comparison results on the LM dataset are shown in Table 1, with the first column listing the target category and the remaining columns giving the accuracies of the different algorithms. Bold values mark the best results. Specifically, in the Ape, Cam, Can, Cat, Driller, Duck, Eggbox, Glue, Holepuncher, and Phone categories, the ADD-S-0.1d accuracy of our algorithm is at a leading level, far exceeding algorithms such as PointFusion29, RCVPose, FoundationPose, Pix2Pose30, and PVNet13. Unlike diffusion- or transformer-based frameworks that require extensive pre-training, the proposed hybrid network achieves comparable accuracy with one-tenth of the parameters, offering a practical solution for real-time robotic pose estimation. For example, the Ape category accuracy is 96.5% (versus 21.6% for PoseCNN and 58.1% for Pix2Pose), and the Cam category accuracy is 96.3% (higher than FoundationPose’s 91.8%). Compared with the baseline PVN3D, our algorithm retains an advantage in most categories with fewer parameters (~28 M vs. ~39 M for PVN3D). Although its accuracy is slightly lower than that of PVNet in the Benchvise, Iron, and Lamp categories (by 3.2%, 5.5%, and 4.1%, respectively), the overall performance is better. The overall average accuracy is 95.9%, which is higher than that of PointFusion (73.7%), RCVPose (92.7%), FoundationPose (92.6%), Pix2Pose (72.4%), PVNet (86.3%), PVN3D (90.6%), GDR-NET (93.7%), and GDRNPP (95.3%).
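As an illustration of how an accuracy such as ADD-S-0.1d is typically obtained, the following Python sketch counts a pose as correct when its error is below 10% of the object diameter, which is the usual reading of the "0.1d" suffix and is assumed here. The function names and the sample numbers are hypothetical, and the unweighted mean over categories is likewise an assumption about how a table average is formed.

```python
def pose_accuracy(errors, diameter, fraction=0.1):
    """Fraction of test poses whose error is below `fraction` * diameter."""
    if not errors:
        raise ValueError("errors must not be empty")
    threshold = fraction * diameter
    return sum(e < threshold for e in errors) / len(errors)


def mean_accuracy(per_category):
    """Unweighted mean of per-category accuracies (assumed averaging scheme)."""
    return sum(per_category.values()) / len(per_category)


if __name__ == "__main__":
    # Hypothetical ADD-S errors (metres) for an object with a 0.10 m diameter.
    errors = [0.005, 0.020, 0.009, 0.011]
    print(pose_accuracy(errors, diameter=0.10))
    print(mean_accuracy({"object_a": 0.5, "object_b": 1.0}))
```

Normalizing the threshold by the diameter keeps the criterion comparable across small and large objects.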
As shown in Fig. 6, the detection accuracies of various algorithms on different objects are compared. The accuracy of other algorithms fluctuates greatly with the target category, which affects the reliability of detection. The algorithm in this article maintains a high accuracy of approximately 95% in all 13 categories, with minimal fluctuations, demonstrating strong robustness and stability and providing reliable guarantees for practical applications.
Comparison of the detection accuracies of the selected algorithms for different object types.
Table 2 presents the comparative results on the YCB video dataset, reporting the ADD and ADD-S metrics of PoseCNN, DenseFusion, PVN3D, and our algorithm for 21 object categories. The data show that, except for the Pudding-box category, our algorithm performs best in both ADD and ADD-S, and its average detection accuracy is 1.7% higher than that of the second-best algorithm. Based on the experimental results on the LM and YCB video datasets, our algorithm outperforms the other algorithms in detection accuracy, robustness, and stability.
The method proposed in this study exhibits significant robustness in dealing with target object occlusion. To analyze the sensitivity of different methods to the degree of occlusion, this paper calculates the model performance of each object under different occlusion percentages. Figure 7 shows the relationship between the occlusion percentage and ADD-S < 2 cm accuracy: when there is slight occlusion (a small number of points are occluded), the accuracy of each method is similar; however, as the occlusion ratio increases, the performance difference gradually becomes significant. Especially in severely occluded scenes, the accuracy of other advanced algorithms decreases sharply, whereas that of our method remains stable. The experiment shows that the proposed method has the advantage of robustness in occlusion processing, effectively improving the reliability and practicality of pose estimation algorithms in practical applications.
Performance of different methods at different occlusion percentages.
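The occlusion analysis can be reproduced in outline by grouping test samples into occlusion bins and computing the ADD-S < 2 cm accuracy within each bin. The sketch below assumes that each sample is described by an occlusion fraction in [0, 1) and an ADD-S error in metres; how the occlusion fraction is measured, the bin edges, and the function name `accuracy_by_occlusion` are assumptions made for illustration.

```python
def accuracy_by_occlusion(samples, bin_edges, threshold=0.02):
    """Accuracy (error < threshold) per occlusion bin.

    samples: iterable of (occlusion_fraction, add_s_error_in_metres).
    bin_edges: increasing edges; bins are half-open intervals [low, high).
    Returns a list of (low, high, accuracy), with accuracy None for empty bins.
    """
    samples = list(samples)
    results = []
    for low, high in zip(bin_edges, bin_edges[1:]):
        errors = [err for occ, err in samples if low <= occ < high]
        accuracy = sum(e < threshold for e in errors) / len(errors) if errors else None
        results.append((low, high, accuracy))
    return results


if __name__ == "__main__":
    # Hypothetical samples: (occlusion fraction, ADD-S error in metres).
    samples = [(0.05, 0.010), (0.10, 0.030), (0.50, 0.015), (0.55, 0.050), (0.90, 0.010)]
    for low, high, acc in accuracy_by_occlusion(samples, [0.0, 0.3, 0.6, 1.0]):
        print(f"occlusion {low:.0%} to {high:.0%}: accuracy = {acc}")
```

Plotting the per-bin accuracy of each method against the bin centre gives a curve of the kind summarized in Fig. 7.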
Ablation experiment
This study optimized RGB feature extraction and point cloud feature extraction. In this section, the effectiveness of the proposed improvements is verified through ablation experiments. As shown in Table 3, PVN3D denotes the network before improvement, Condense denotes adding CondenseNet to the original network, Condense + ODConv denotes additionally introducing ODConv into PointNet++ after the replacement with CondenseNet, and the last column denotes adding the SimAM attention mechanism on top of the previous improvements.
After adding CondenseNet, the average accuracy of the model improved by 2.8%. For example, the recognition accuracy of Ape objects increased from 94.0% to 95.3%, demonstrating the advantage of CondenseNet in feature extraction. On this basis, introducing ODConv into PointNet++ improved the model accuracy by a further 1.5%, and the recognition accuracy of Cam objects increased from 93.6% to 95.3%, which shows the effectiveness of ODConv in point cloud feature extraction. Finally, adding the SimAM attention mechanism increased the average accuracy by another 1.0%. For example, the recognition accuracy of Driller objects increased from 96.2% to 97.0%, which demonstrates the role of the SimAM attention mechanism in strengthening the model’s attention to key features and thereby improving the overall performance of the model.
Table 4 compares the average inference time and model size of the original and improved models on the test set. Compared with the original algorithm, the improved algorithm proposed in this article achieves a clear gain in inference speed: for the inference task involving 13 object categories and a total of 1009 test images, it reduces the average detection time by 19% and the model size by 11%.
Figure 8 shows the visualization results of the predictions on the YCB video dataset. These results were obtained by transforming the object point clouds with the predicted poses and reprojecting them onto the RGB images, so that the overlaid point clouds serve as a basis for visual comparison. Specifically, the first row in Fig. 8 presents the original RGB images, which show the appearance, spatial position, and possible occlusion and lighting changes of the objects. The second row shows the prediction results of PVN3D. After pose transformation and reprojection, the area covered by the point cloud may not completely match the target object, especially at the edges, where there is a significant deviation. In contrast, the third row shows the prediction results of the improved model proposed in this article. After the same pose transformation and reprojection process, the point clouds generated by our model fit the contour of the target object more closely, and the match at the edges is also significantly improved. These visualization results indicate that the algorithm proposed in this paper predicts poses with greater accuracy and stability in complex environments with varying lighting conditions and occlusion.
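To make the parameter-free nature of SimAM concrete, the following Python sketch applies SimAM-style reweighting to a single feature channel. It follows the published SimAM formulation as we understand it (an inverse-energy term per activation passed through a sigmoid) and is not the implementation used in the experiments; the function name `simam`, the nested-list representation, and the default regularizer `e_lambda = 1e-4` are assumptions for illustration.

```python
import math


def simam(feature_map, e_lambda=1e-4):
    """Reweight one channel (H x W nested list of floats) with SimAM-style attention."""
    values = [v for row in feature_map for v in row]
    n = len(values) - 1
    if n < 1:
        raise ValueError("need at least two spatial positions")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / n

    def weight(v):
        # Inverse of the minimal energy: larger for activations far from the mean.
        inv_energy = (v - mean) ** 2 / (4.0 * (variance + e_lambda)) + 0.5
        return 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid

    # No learnable parameters are involved: the weights depend only on the input.
    return [[v * weight(v) for v in row] for row in feature_map]


if __name__ == "__main__":
    channel = [[0.0, 0.0], [0.0, 4.0]]  # one distinctive activation
    print(simam(channel))
```

The attention weights are computed from the channel statistics alone, which is why adding such a module leaves the parameter count unchanged.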
Comparison of experimental visualizations. The first, second, and third rows of images are original RGB images, PVN3D prediction results, and the prediction results of the improved model proposed in this study, respectively.
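The reprojection used for such visualizations can be sketched as follows: model points are transformed by the estimated pose and then projected with a pinhole camera model. The sketch assumes an ideal pinhole camera without lens distortion; the function name `project_points` and the intrinsic values in the example are hypothetical and are not taken from the datasets.

```python
def project_points(points, rotation, translation, fx, fy, cx, cy):
    """Transform 3D model points by a pose (R, T) and project them to pixels.

    rotation: 3x3 nested list, translation: 3-tuple in metres.
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    Points that end up behind the camera (z <= 0) are skipped.
    """
    pixels = []
    for p in points:
        x, y, z = (
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        if z <= 0:
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    model = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    # Hypothetical intrinsics; the object is placed 1 m in front of the camera.
    print(project_points(model, identity, (0.0, 0.0, 1.0), 500.0, 500.0, 320.0, 240.0))
```

Drawing the returned pixel coordinates over the RGB image yields overlays of the kind shown in Fig. 8, where a better pose produces a tighter fit to the object contour.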
To further evaluate the effect of each module (CondenseNet for model compression, ODConv for robustness enhancement, and SimAM for efficient feature refinement), we report accuracy, parameter count, runtime, and computational cost in Table 5. Replacing the original PVN3D backbone with CondenseNet reduces parameters from 39 M to 26 M (−33%) and GFLOPs from 598 to 466 (−22%), while improving accuracy by 2.8%. Integrating ODConv adds a slight computational cost (+2 M parameters, +14 GFLOPs) but improves robustness to occlusion and noise, yielding a further 1.5% accuracy gain. Finally, SimAM introduces no additional parameters and negligible computational overhead (< 1%), yet improves accuracy by another 1.0%. The complete model achieves 95.9% ADD-S-0.1d accuracy with 36% lower GFLOPs and 19% faster inference than the PVN3D baseline.
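The per-step figures quoted above can be checked with simple arithmetic. The short Python sketch below recomputes the relative changes for the CondenseNet step, the parameter count after adding ODConv, and the cumulative accuracy from the values stated in the text; the helper name `relative_change` is hypothetical.

```python
def relative_change(before, after):
    """Relative change in percent; negative values mean a reduction."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # CondenseNet step: parameters (millions) and GFLOPs as quoted in the text.
    print(f"parameters: {relative_change(39, 26):.1f}%")
    print(f"GFLOPs:     {relative_change(598, 466):.1f}%")
    # ODConv adds 2 M parameters on top of the CondenseNet backbone.
    print(f"parameters after ODConv: {26 + 2} M")
    # Cumulative accuracy: baseline plus the three reported gains.
    print(f"accuracy: {90.6 + 2.8 + 1.5 + 1.0:.1f}%")
```

The resulting 28 M parameters agree with the approximate figure quoted in the comparison with PVN3D, and the accuracy gains sum to the reported 95.9%.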
Visualization of the impact of CondenseNet, ODConv, and SimAM on the LM dataset. The bar chart shows accuracy, parameter count, runtime, and GFLOPs under incremental module integration, highlighting progressive accuracy gains with minimal increases in computational cost.
To provide a more intuitive understanding of the trade-offs between performance and computational efficiency, Fig. 9 visualizes the impact of CondenseNet, ODConv, and SimAM on the LineMOD dataset. Each group of bars represents an incremental integration of modules on top of the PVN3D baseline. As shown in the figure, the blue bars (accuracy) increase steadily from 90.6% to 95.9% as the modules are added, demonstrating the complementary contribution of each component. Meanwhile, the orange (parameters) and purple (GFLOPs) bars show a significant reduction after replacing the backbone with CondenseNet, followed by a slight increase with ODConv due to dynamic kernel generation, and a nearly unchanged cost when SimAM is included. The green bars (runtime) reveal consistent inference-speed improvement, confirming that the proposed combination achieves both higher accuracy and better efficiency. Overall, the visualization highlights that CondenseNet primarily contributes to model compression, ODConv enhances robustness and geometric adaptivity, and SimAM refines feature selectivity with negligible overhead.
Comparison with the state of the art
We compare our model with several state-of-the-art methods (PoseDiffusion, FoundationPose, RCVPose, GDRNPP, and PVN3D) on three core datasets of the BOP benchmark: Occluded LineMOD (LM-O)32, T-LESS33 and HB34. The results are presented in Table 6. As shown in the table, our proposed model outperforms all other methods, achieving the highest accuracy.
Conclusions
To address a core challenge in robot sorting technology, namely the accuracy and efficiency of 6D object pose estimation, this study proposes a deep learning solution based on the PVN3D framework. The scheme analyzes the shortcomings of existing methods and improves pose estimation performance through a series of optimization strategies. For image feature extraction, the backbone network is optimized with dense connections and learnable grouped convolutions. Owing to its efficient computation and lightweight structure, this backbone significantly reduces the number of parameters and the computational cost while fully mining the rich feature information in RGB images. For depth images, ODConv is used to capture the local geometric features and shape information of point clouds and to account for the directionality of point cloud data, which enhances the extraction of spatial geometric features. In addition, the SimAM attention mechanism is introduced to further enhance key features and improve the overall performance of the model. Extensive experiments were conducted on the widely used LM dataset, the YCB video dataset, and three other core datasets. The results show that the method reduces the computational complexity of the model while improving the accuracy of pose estimation.
Although the proposed method achieves competitive accuracy and efficiency across multiple benchmarks, several limitations remain: (1) Scalability: The model performance may degrade when applied to large-scale or high-density point-cloud scenes due to increased memory demand during dynamic convolution. (2) Extreme occlusion and low texture: While the method shows robustness on moderately occluded objects, accuracy drops under severe occlusion (> 70%) or with texture-less surfaces, which may require future integration with generative or diffusion-based priors. (3) Real-time deployment: Although our model is lightweight and faster than PVN3D, real-time performance on embedded or edge devices still requires additional optimization, such as pruning or quantization. Future work will explore these directions to further enhance scalability and real-world applicability.
Data availability
The datasets generated and analyzed during the current study are available in the LineMOD repository: https://github.com/paroj/linemod_dataset, YCB-Video repository: https://rse-lab.cs.washington.edu/projects/posecnn/, LM-O: https://service.tib.eu/ldmservice/dataset/lm-o, HB: https://service.tib.eu/ldmservice/dataset/homebreweddb--hb- and T-LESS: https://cmp.felk.cvut.cz/~hodanto2/darwinset/download.html.
References
Guan, J., Hao, Y., Wu, Q., Li, S. & Fang, Y. A survey of 6DoF object pose estimation methods for different application scenarios. Sensors https://doi.org/10.3390/s24041076 (2024).
Thalhammer, S. et al. Challenges for monocular 6-D object pose estimation in robotics. IEEE Trans. Robot. 40, 4065–4084. https://doi.org/10.1109/TRO.2024.3433870 (2024).
Hoque, S., Arafat, M. Y., Xu, S., Maiti, A. & Wei, Y. A comprehensive review on 3D object detection and 6D pose estimation with deep learning. IEEE Access 9, 143746–143770. https://doi.org/10.1109/ACCESS.2021.3114399 (2021).
Liu, J. et al. Deep learning-based object pose estimation: A comprehensive survey. arXiv:2405.07801. https://doi.org/10.48550/arXiv.2405.07801 (2024).
Wang, J., Rupprecht, C. & Novotny, D. PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 9739–9749.
Wen, B., Yang, W., Kautz, J. & Birchfield, S. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17868–17879 (IEEE, 2024).
Li, C., Zhou, A. & Yao, A. Omni-Dimensional Dynamic Convolution. In 2022 International Conference on Learning Representations (ICLR 2022).
Yang, L., Zhang, R. Y., Li, L. & Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139 (eds Meila, M. & Zhang, T.) 11863–11874 (PMLR, 2021).
He, Y. et al. PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11629–11638 (IEEE, 2020).
Wu, Y., Zand, M., Etemad, A. & Greenspan, M. Vote from the center: 6 DoF pose estimation in RGB-D images by radial keypoint voting. In Computer Vision – ECCV 2022, 335–352 (Springer Nature Switzerland, 2022).
Kendall, A., Grimes, M. & Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In 2015 IEEE International Conference on Computer Vision (ICCV), 2938–2946.
Wang, H. et al. Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2637–2646.
Peng, S. et al. Pixel-Wise Voting Network for 6DoF Pose Estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4556–4565.
Rusu, R. B., Blodow, N. & Beetz, M. Fast Point Feature Histograms (FPFH) for 3D registration. In 2009 IEEE International Conference on Robotics and Automation, 3212–3217.
Segal, A. V., Hähnel, D. & Thrun, S. Generalized-ICP. In Robotics: Science and Systems (2009).
Charles, R. Q., Su, H., Kaichun, M. & Guibas, L. J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 77–85.
Qi, C. R., Yi, L., Su, H. & Guibas, L. J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30 (2017).
Zhao, H., Jiang, L., Jia, J., Torr, P. & Koltun, V. Point Transformer. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 16239–16248 (IEEE).
Wang, C. et al. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3338–3347.
He, Y., Huang, H., Fan, H., Chen, Q. & Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3002–3012.
Wang, G., Manhardt, F., Tombari, F. & Ji, X. GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16606–16616 (IEEE, 2021).
Shotton, J. et al. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2930–2937.
Kerl, C., Sturm, J. & Cremers, D. Dense visual SLAM for RGB-D cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2100–2106.
Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv:1711.00199 (2017). https://ui.adsabs.harvard.edu/abs/2017arXiv171100199X
Hu, J., Shen, L. & Sun, G. Squeeze-and-Excitation Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141 (IEEE, 2018).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: convolutional block attention module. In Computer Vision – ECCV 2018, 3–19 (Springer International Publishing, 2018).
Huang, G., Liu, S., van der Maaten, L. & Weinberger, K. Q. CondenseNet: An Efficient DenseNet Using Learned Group Convolutions. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2752–2761.
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (IEEE, 2016).
Xu, D., Anguelov, D., & Jain, A. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 244–253.
Park, K., Patten, T. & Vincze, M. Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 7667–7676.
Liu, X. et al. A Geometry-Guided and fully Learning-Based object pose estimator. IEEE Trans. Pattern Anal. Mach. Intell. 47, 5742–5759. https://doi.org/10.1109/TPAMI.2025.3553485 (2025).
Brachmann, E. et al. Learning 6D object pose estimation using 3D object coordinates. In Computer Vision – ECCV 2014, 536–551 (Springer International Publishing, 2014).
Hodan, T. et al. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-Less Objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 880–888 (IEEE, 2017).
Kaskman, R., Zakharov, S., Shugurov, I. & Ilic, S. HomebrewedDB: RGB-D Dataset for 6D Pose Estimation of 3D Objects. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2767–2776 (IEEE, 2019).
Author information
Contributions
Conceptualization, H.Z. and J.T.; methodology, H.Z. and L.W.; formal analysis, H.Z.; investigation, J.T., L.W. and J.C.; resources, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, J.T. and H.Z.; supervision, L.W. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, H., Tong, J., Wei, L. et al. Enhanced RGB-D feature extraction for 6D pose estimation. Sci Rep 16, 4656 (2026). https://doi.org/10.1038/s41598-025-34757-y