Introduction

3D Human Pose Estimation (3D HPE) aims to recover the spatial positions of body joints as accurately as possible. 3D HPE methods can be roughly divided into two categories: single-stage (end-to-end training)1,2 and multi-stage (phase-based training)3,4,5. Multi-stage methods consist of a detecting stage and a lifting stage: the detecting stage estimates 2D joint coordinates from input images, and the lifting stage computes 3D joint coordinates from the 2D ones. Thanks to the great progress of detection methods, researchers now mostly focus on the lifting stage, which remains challenging as an inherently ill-posed problem. Among the many lifting approaches, those based on Graph Convolutional Networks (GCNs) built on the human skeleton structure achieve excellent performance. These GCN-based approaches aggregate joint features through a fixed message-passing mechanism constrained by prior knowledge of the human body, and can be trained efficiently with low overhead. However, a static human skeleton structure can hardly capture the full flexibility of human poses, especially in cooperative actions of the limbs, such as clapping hands.

To overcome this limitation, some researchers propose dynamic GCNs (or Hypergraph Convolutional Networks, HCNs) that learn to construct dynamic graph structures for 3D HPE. Although these methods achieve promising performance without additional training data, they overlook the fact that training samples of distinct human actions call for different optimal graph structures, which disturbs the learning process and leads to inadequate progress in dynamic graph construction. The root cause is the lack of guidance and supervision in dynamic graph construction.

In this paper, we propose a novel Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) to address the above problems. The dual chain in DCD-HCN consists of the Dynamic Hypergraph Construction Chain (DHCC) and the Hypergraph Convolution Chain (HCC), as shown in Fig. 1. In detail, the Selector generates an adaptive dynamic hypergraph in DHCC, and the Processor conducts hypergraph convolution in HCC with the dynamic hypergraph generated by the Selector. The Selector and Processor are combined into the Selector-Processor block (SPblock) in DCD-HCN. During training, we use the hypergraph construction loss to guide DHCC and the joint error loss to supervise HCC. The gradients of the two chains are separated to avoid mutual influence from the joint error loss of different inputs, because confusing the optimal hypergraph structures of different inputs can interfere with the adaptability of dynamic hypergraph construction. In contrast to prior works like SemGCN3, which rely on static or semi-dynamic graphs with coupled construction and convolution, our DCD-HCN introduces a dual-chain architecture that fully decouples dynamic hypergraph construction from convolution. This enables independent optimization of adaptive hyperedges via an unsupervised signal, significantly enhancing flexibility for complex pose variations and improving generalization across datasets.

Fig. 1

The dual chain in DCD-HCN. 2D joint features are taken as the input, and each chain is trained separately with its own loss function, \({L_{DHCC}}\) or \({L_{HCC}}\).

Benefiting from the Selector-Processor architecture, DCD-HCN can generate integral (binary) dynamic hypergraph structures while keeping decimal (real-valued) adjacency weights. This strengthens the edge constraints on node feature aggregation in dynamic hypergraph convolution, which cannot be realized in end-to-end methods because they demand gradient propagation between dynamic graph construction and graph convolution.

Since decomposing the independence of the hypergraph into the independence of individual hyperedges can greatly enhance the adaptability of the model, we need to guarantee that each possible hyperedge matches a unique weight. We therefore introduce a hyperedge weight parameter pool in each Processor. If a hyperedge is not chosen, its unique weight parameter is neither used nor trained, and the features of the joints in this hyperedge are not aggregated. With this matching mechanism, the active dynamic weights change synchronously with the dynamic hypergraphs. Our main contributions are summarized as follows:

  • We propose a Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) for 3D HPE, which decouples dynamic hypergraph convolution into dynamic hypergraph construction and hypergraph convolution by training them with an unsupervised hypergraph construction loss and a supervised joint error loss, respectively.

  • DCD-HCN introduces a Selector-Processor block (SPblock) to realize the interplay between DHCC and HCC through dynamic integral hypergraphs, which helps our method enforce the constraint on node feature aggregation in dynamic hypergraph convolution.

  • DCD-HCN involves a hyperedge-weight parameter matching mechanism that keeps a one-to-one correspondence between dynamic hyperedges and hyperedge weights, improving the training of the proposed network at reasonable computational cost.

  • We achieve state-of-the-art generalization performance on the MPI-INF-3DHP dataset while maintaining leading test results on the Human3.6M dataset.

Related works

3D human pose estimation

Recently, 3D human pose estimation (3D HPE) has made great progress. Martinez et al.6 used a fully connected network to map 2D joint coordinates to 3D space and achieved excellent results. SemGCN3 successfully applied graph convolutional networks (GCNs) to 3D HPE. Liu et al.4 designed a Semi-Dynamic Hypergraph Neural Network based on Feng et al.7, which used dynamic graph construction to improve the flexibility of feature aggregation in GCNs. Subsequently, researchers proposed a variety of improved methods based on GCNs and the prior knowledge of 3D human pose8,9,10,11,12 by making various adjustments to the graphs. Ci et al.5 proposed a local connection network (LCN) based on human skeleton joint connections. The success of LCN shows that enforcing the constraint of node connections on feature aggregation is of great significance for a model with good generalization performance. Xie et al.13 employed a multi-stage approach combining high-order graph convolution and Transformer to handle the lifting from 2D to 3D, using dynamic adjacency matrices to capture dynamic joint relationships and address issues like self-occlusion and depth ambiguity. Li et al.14 processed data through parallel subnetworks from sparse to fine representations, utilizing multi-scale graph structure construction to generate denser topologies for learning feature aggregation of pose nodes. However, these multi-stage methods often suffer from drawbacks such as higher computational complexity and increased parameter counts, leading to potential error propagation across stages. Our DCD-HCN mitigates these issues through its decoupled dynamic hypergraph architecture, which enables more streamlined and adaptive processing. Moreover, different from previous GCN-based work that uses decimal (real-valued) graph structures to represent the human body, we introduce integral dynamic hypergraphs in DCD-HCN, which better enforce the constraint on node feature aggregation in dynamic hypergraph convolution.

Dynamic graph convolutional network

Dynamic graph convolutional networks improve the flexibility of node feature aggregation by constructing dynamic graph adjacency matrices. These methods can be divided into three categories based on the way the graph is constructed: graph adjacency matrix parameterization, non-learning methods, and hidden layer features as adjacency matrices. Graph adjacency matrix parameterization converts some or all elements of the static graph adjacency matrix into learnable parameters3,15,16,17,18. Non-learning methods generally construct graph structures dynamically by manually defined rules4,7,19,20. Other methods incorporate learnable modules to generate adjacency matrices9,16,21,22,23,24,25,26,27,28,29. Since these methods cannot explicitly quantify the effect of the generated graph structure, it is difficult to improve them further by exploring the role of dynamic graph construction. In this paper, we propose an unsupervised hypergraph construction loss to purposefully guide dynamic hypergraph construction in one chain of the dual-chain architecture, which is realized by separating dynamic hypergraph construction from dynamic hypergraph convolution.

Fig. 2

(a) The structure of DCD-HCN; its backbone consists of stacked SPblocks. (b) The Selector, which constructs the hypergraph incidence matrix as the dynamic hypergraph. (c) The Processor, which performs hypergraph convolution on the pose features based on the hypergraph incidence matrix generated by the Selector.

Transformer

Transformer was originally proposed for natural language processing30, performing well by shortening the propagation distance between sequence elements through the self-attention mechanism and position-encoding embeddings. Recently, researchers found that Transformer can also be used for vision tasks: the well-known ViT was proposed as a pure Transformer architecture31, breaking the cross-task model stereotype through large-scale pre-training. More and more researchers now introduce the Transformer architecture into different vision tasks, including 3D HPE2,11,32,33,34. The good results show that Transformer modules can greatly enhance the joint feature extraction capability of 3D HPE networks. Therefore, we also leverage a Transformer encoder for the global embedding of node features.

DCD-HCN: a dual chain dynamic hypergraph convolution network

As shown in Fig. 2, the proposed Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) first embeds the 2D coordinates of the skeleton joints into a high-dimensional space using a node embedding layer, then stacks multiple SPblocks to learn the indistinct latent patterns among nodes. Finally, the node features are mapped to 3D coordinates by another node embedding layer. The first node embedding layer and each SPblock are followed by a global embedding layer.
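To make the layout concrete, the following is a minimal, self-contained PyTorch sketch of this stacking order. The node embeddings are replaced by plain linear stand-ins and the SPblocks by identity placeholders (the real modules are defined in the subsections below); only the wiring is meant to be accurate, and all class names are ours.

```python
import torch
import torch.nn as nn

class DCDHCN(nn.Module):
    def __init__(self, num_joints=17, dim=128, num_blocks=8, heads=8):
        super().__init__()
        self.node_in = nn.Linear(2, dim)    # stand-in for the SemGConv-based node embedding
        self.sp_blocks = nn.ModuleList(
            [nn.Identity() for _ in range(num_blocks)]  # stand-ins for the SPblocks
        )
        self.global_embeds = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(num_blocks + 1)]  # one after the first embedding and each SPblock
        )
        self.node_out = nn.Linear(dim, 3)   # maps node features back to 3D coordinates

    def forward(self, x):                   # x: (B, 17, 2) 2D joint coordinates
        p = self.global_embeds[0](self.node_in(x))
        for blk, ge in zip(self.sp_blocks, self.global_embeds[1:]):
            p = ge(blk(p))                  # SPblock followed by a global embedding layer
        return self.node_out(p)             # (B, 17, 3) predicted 3D joints

print(DCDHCN()(torch.randn(4, 17, 2)).shape)  # torch.Size([4, 17, 3])
```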

Fig. 3

Illustration of weight extraction from the hyperedge weight parameter pool. (a) Extract the hyperedge weights following the original nodes. (b) Merge the nodes in the same group into a new node.

Node embedding layer

Inspired by Zhao et al.3, we construct a static graph convolution layer based on the SemGConv layer, whose forward propagation is expressed in Eqs. 1 and 2.

$$\begin{aligned} {A = \textrm{softmax}\left( {adj \odot W,axis = 1} \right) } \end{aligned}$$
(1)
$$\begin{aligned} {Y = \left( {A \odot I} \right) X\Theta ^{1} + \left( {A \odot \left( {1 - I} \right) } \right) X\Theta ^{2}} \end{aligned}$$
(2)

where adj is the adjacency matrix of the human skeleton model, W is the weight matrix of the skeleton edge set, I is the identity matrix, X is the input feature matrix, \(\Theta ^{1}\) and \(\Theta ^{2}\) are the parameter matrices that embed node features in the self-link branch and the non-self-link branch, and \(\odot\) denotes element-wise multiplication.
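A minimal PyTorch sketch of this layer follows, assuming the softmax of Eq. 1 is masked to the skeleton edges so that non-edges receive zero attention (as in SemGConv), and that adj includes self-loops; the class name and initialization are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEmbedding(nn.Module):
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        # adj: (J, J) binary skeleton adjacency, assumed to include self-loops
        self.register_buffer("adj", adj)
        self.W = nn.Parameter(torch.ones_like(adj))           # learnable edge weights
        self.theta1 = nn.Linear(in_dim, out_dim, bias=False)  # Theta^1: self-link branch
        self.theta2 = nn.Linear(in_dim, out_dim, bias=False)  # Theta^2: non-self-link branch

    def forward(self, x):                                     # x: (B, J, in_dim)
        # Eq. (1): row-wise softmax, masked so non-edges get zero attention
        logits = self.W.masked_fill(self.adj == 0, float("-inf"))
        A = F.softmax(logits, dim=1)
        I = torch.eye(A.size(0), device=A.device)
        # Eq. (2): self links and non-self links use separate parameters
        return (A * I) @ self.theta1(x) + (A * (1 - I)) @ self.theta2(x)
```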

It can be seen from Eqs. 1 and 2 that when all node embedding layers share the same static human skeleton model, each layer aggregates the features of connected joints in the same pattern. As Liu et al.4 point out, this can hardly capture the complex relationships of non-neighbouring joints.

Global embedding layer

Since the graph convolution layer only aggregates features within the same feature dimension across different nodes and across all feature dimensions of a single node, it cannot aggregate features across different feature dimensions of different nodes. For example, the information of adjacent nodes alone is not enough to estimate the positions of nodes that are overlapped or missing. Therefore, we choose the Transformer encoder layer to extend the receptive field of the subsequent graph convolution to the whole graph.
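A minimal sketch of this layer: one standard PyTorch Transformer encoder layer whose self-attention lets every joint attend to every other joint, extending the receptive field to the whole graph. The dimensions follow the implementation details given later (128-d node features, 8 heads).

```python
import torch
import torch.nn as nn

# Global embedding layer: mixes information across all joints and feature dims.
global_embedding = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)

features = torch.randn(4, 17, 128)   # (batch, joints, feature dim)
out = global_embedding(features)     # same shape; every joint now sees the whole graph
```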

SPblock

The SPblock consists of two parts: the Selector (Eq. 3) and the Processor (Eq. 4). The Selector constructs the integral hypergraph structure \(H^{(l)}\) dynamically from the pose features \(P^{({l - 1})}\), where l is the layer index. The Processor then extracts node features with \(H^{(l)}\).

$$\begin{aligned} H^{(l)} = Selector\left( P^{({l - 1})} \right) \end{aligned}$$
(3)
$$\begin{aligned} P^{(l)} = Processor\left( H^{(l)},W^{(l)}, P^{({l - 1})} \right) \end{aligned}$$
(4)
Algorithm 1

Hypergraph Incidence Matrix Construction

Selector

We use a Transformer encoder layer to encode the pose features P, capturing global long-range dependencies among joints through multi-head self-attention. The Transformer module extracts global contextual features, which are decoupled and passed to the Processor for GCN-based local aggregation; the long-range dependencies it captures are integrated into the graph convolution via residual connections, where the Transformer-augmented attention weights refine the adjacency matrix. The encoded features are then column-normalized via \(\textrm{softmax}\) to obtain the intermediate feature hid. Afterwards, we construct the integral hypergraph according to Algorithm 1. The construction algorithm first calculates the mean value of each column of hid and sets the elements of each column to 0 when they are smaller than the mean value, otherwise 1, thereby establishing the association between the intermediate feature hid and the dynamic hypergraph incidence matrix H. After that, we partition the nodes into g groups: if any node in a group is selected, we regard the group as a new selected node, otherwise as an unselected node. Finally, we calculate the hyperedge indices \(\omega\), which will be used to match hyperedge weight parameters in the Processors.
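A minimal sketch of Algorithm 1 under our reading of the text: hid has one row per joint and one column per candidate hyperedge, and groups (a hypothetical mapping, see Fig. 3b) assigns each of the 17 joints to one of g symmetry groups.

```python
import torch

def build_hypergraph(hid, groups, g):
    # hid: (J, E) column-normalized intermediate features from the Selector
    # groups: (J,) long tensor mapping each joint to one of g symmetry groups
    mean = hid.mean(dim=0, keepdim=True)      # per-column (per-hyperedge) mean
    H = (hid >= mean).float()                 # binarize: 1 if above the column mean, else 0

    Hg = torch.zeros(g, hid.size(1))
    Hg.index_add_(0, groups, H)               # a group is selected if any member joint is
    Hg = (Hg > 0).float()

    powers = 2 ** torch.arange(g)             # encode each group-level column as a binary
    omega = (Hg.t() * powers).sum(dim=1).long()  # number -> index into the 2^g weight pool
    return H, omega

hid = torch.rand(17, 32)                      # 17 joints, 32 candidate hyperedges
groups = torch.randint(0, 8, (17,))           # hypothetical joint-to-group assignment
H, omega = build_hypergraph(hid, groups, g=8)
```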

Processor

Every Processor internally stores a pool of weight parameters W covering all possible hyperedges, and the indices \(\omega\) of the hyperedge set generated by the Selector are used to extract the weight subset \(W(\omega )\) corresponding to that hyperedge set, as shown in Fig. 3a.

Fig. 4

The convergence curves of the two core loss functions over training steps. The joint loss converges to 0.0012, and the hypergraph loss converges to 0.0021. We normalize the loss values to \([0, 1]\) for better visualization.

Each parameter in the weight parameter pool W corresponds to one hyperedge. This allows the Processor to use the corresponding parameters when the Selector chooses different hyperedges for different input features, thus ensuring the independence of the dynamic hypergraph and its hyperedges. The hypergraph convolution formula is adapted from Feng et al.7, as shown in Eqs. 5 and 6. Note that hyperedge \(e \in E\), where E is the hyperedge set of the dynamic hypergraph; node \(v \in V\), where V is the node set; and hyperedge weight \(w(e) \in W\), where W is the weight set of the dynamic hypergraph.

$$\begin{aligned}&h\left( v,e \right) = {\left\{ \begin{array}{ll} 1, &{} v \in e \\ 0, &{} v \notin e \end{array}\right. } \\&d(v) = {\sum \limits _{e \in E}{w(e)h(v,e)}} \\&d(e) = {\sum \limits _{v \in V}{h\left( v,e \right) }} \end{aligned}$$
(5)

where h(v, e) denotes whether the node v belongs to the hyperedge e, d(v) denotes the degree of the node v, and d(e) denotes the degree of the hyperedge e.

$$\begin{aligned} {X^{({l + 1})} = \sigma \left( {D_{v}^{- \frac{1}{2}}H^{(l)}{W(\omega )D}_{e}^{- 1}(H^{(l)})^{T}D_{v}^{- \frac{1}{2}}X^{(l)}\Theta ^{(l)}} \right) } \end{aligned}$$
(6)

where \(D_{v}\) denotes the diagonal matrix of the degrees of the nodes, \(D_{v}^{ii} = d\left( V_{i} \right)\), \(D_{e}\) denotes the diagonal matrix of the degrees of the dynamic hyperedges, \(D_{e}^{jj} = d\left( E_{j} \right)\). \(W(\omega )\) denotes the learnable weights corresponding to the dynamic hypergraph. \(H^{(l)}\) denotes the incidence matrix of the hypergraph. \(\Theta ^{(l)}\) denotes the matrix of learnable weights used to change the feature dimension of the nodes.

Regarding the hyperedge weight parameter pool W as a one-dimensional array, the hyperedge weights \(W(\omega )\) are extracted using the hyperedge indices \(\omega\), and then combined with the hypergraph incidence matrix H to perform the hypergraph convolution on the pose features \(X^{(l)}\).

If there are many nodes, the number of node subsets grows exponentially, which hurts training efficiency. Therefore, according to the symmetry of the human body structure, some nodes of the human skeleton model are merged into one group, as shown in Fig. 3b. A sketch of the resulting Processor follows.
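This is a minimal sketch of the Processor implementing Eq. 6 under stated assumptions: the pool holds one scalar weight per possible group-level hyperedge (hence \(2^g\) entries), \(\sigma\) is taken as ReLU (our choice), and H and \(\omega\) come from the Selector sketch above.

```python
import torch
import torch.nn as nn

class Processor(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super().__init__()
        self.pool = nn.Parameter(torch.ones(2 ** g))          # one weight per possible hyperedge
        self.theta = nn.Linear(in_dim, out_dim, bias=False)   # Theta^(l): feature transform
        self.act = nn.ReLU()                                  # sigma (our choice)

    def forward(self, x, H, omega):     # x: (B, J, in_dim), H: (J, E), omega: (E,)
        w = self.pool[omega]            # W(omega): weights of the selected hyperedges only
        d_v = (H * w).sum(dim=1).clamp(min=1e-6)   # d(v) = sum_e w(e) h(v, e)
        d_e = H.sum(dim=0).clamp(min=1e-6)         # d(e) = sum_v h(v, e)
        dv = d_v.pow(-0.5)
        # Eq. (6): Dv^-1/2 H W(omega) De^-1 H^T Dv^-1/2 X Theta
        M = (dv[:, None] * H * (w / d_e)) @ (H.t() * dv)
        return self.act(M @ self.theta(x))
```

Because only `self.pool[omega]` enters the computation, unselected hyperedges receive no gradient, matching the one-to-one weight matching mechanism described above.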

Loss function

Dynamic hypergraph construction loss

We use the dynamic hypergraph construction loss to train the Selectors with their intermediate features hid, freeing them from relying solely on the joint error loss. This loss does not impose a direct or rigid constraint on the graph structure. Rather, it implicitly guides the Selectors of other samples to explore effective hypergraph connections suitable for similar poses by leveraging the hypergraph structure of the best-performing sample within the batch. As a result, during network parameter optimization, the model adaptively imitates and generalizes these high-quality structural patterns. We take the instance with the smallest joint error loss value \(\mathcal {L}_{joint}\) in the batch N as the target, and calculate the mean difference between the output of each Selector and this target from the best instance in the same batch as the dynamic hypergraph construction loss, as expressed in Eq. 7. The unsupervised training signal for hypergraph construction (\(L_{DHCC}\)) is a pivotal innovation in our learning strategy, promoting self-supervised adaptation of hypergraphs without relying on labeled joint errors. Unlike prior coupled approaches (e.g., SemGCN), which unify supervision and risk interference between the optimal structures of different poses, our decoupled strategy applies \(L_{DHCC}\) solely to the DHCC for independent hyperedge optimization, while \(L_{HCC}\) supervises the HCC. This resolves the optimization challenge and yields superior generalization and efficiency over related methods.

$$\begin{aligned} \mathcal {L}_{hypergraph}=\frac{1}{N\times S} \sum _{n=1}^{N}\sum _{s=1}^{S} \left( {hid_{n}^{s}-hid^{s}_{\mathop {\arg \min }\limits _{n}\mathcal {L}_{joint}}}\right) \end{aligned}$$
(7)
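Below is a direct reading of Eq. 7 in code: the signed mean difference between each sample's Selector features and those of the batch's best instance (a squared or absolute difference would be a natural variant). Detaching the target so that gradients flow only into the other samples' Selectors is our assumption.

```python
import torch

def hypergraph_loss(hid, joint_loss):
    # hid: (N, S) flattened Selector features; joint_loss: (N,) per-sample joint error
    best = hid[joint_loss.argmin()].detach()   # target: hid of the best instance in the batch
    return (hid - best).mean()                 # Eq. (7): mean over batch N and feature dim S
```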

Joint error loss

The 3D coordinates of human joints predicted by DCD-HCN are defined as \(Joints = \{ {\overset{\sim }{J}}_{i} \mid i = 1,\ldots ,n \}\), and the targets are \(\left\{ J_{i} \mid i = 1,\ldots ,n \right\}\). We choose the mean squared error as the joint error loss, as shown in Eq. 8.

$$\begin{aligned} \mathcal {L}_{Joints} = \frac{1}{n}{\sum \limits _{i=1}^{n}\left( {\overset{\sim }{J}}_{i} - J_{i} \right) ^{2}} \end{aligned}$$
(8)

Network training

Due to the separation of the dual chain, we can train the Selector and the Processor independently. Since it is impossible to determine whether the current parameter distribution of the Selector generates suitable dynamic hypergraph structures, we follow the Expectation-Maximization (EM) algorithm to alternate the training of the Selector and the Processor. In the E step, we train the Processor under the current parameter distribution of the Selector until it stabilizes. In the M step, we optimize the parameter distribution of the Selector with the dynamic hypergraph construction loss. As shown in Fig. 4, both the joint error loss (\(L_{joint}\)) and the dynamic hypergraph construction loss (\(L_{hypergraph}\)) gradually stabilize and converge as training proceeds, with no severe oscillation or divergence. This demonstrates that the alternating optimization strategy is stable and effective in practice.
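A minimal sketch of this alternating schedule, under assumptions we name explicitly: `model(x2d)` returns the 3D prediction together with the Selectors' stacked intermediate features `hid`; `selector_params` and `processor_params` split the model's parameters by chain; `hypergraph_loss` is the Eq. 7 sketch above; `loader` yields `(x2d, y3d)` batches. The per-batch alternation is ours; the paper trains each phase until it stabilizes.

```python
import torch

opt_p = torch.optim.Adam(processor_params)   # HCC: trained with L_joints only
opt_s = torch.optim.Adam(selector_params)    # DHCC: trained with L_hypergraph only

for epoch in range(300):
    for x2d, y3d in loader:
        # E step: optimize the Processors under the current Selectors.
        pred, hid = model(x2d)
        loss_j = ((pred - y3d) ** 2).mean()                   # Eq. (8)
        opt_p.zero_grad(); loss_j.backward(); opt_p.step()

        # M step: optimize the Selectors with the construction loss.
        pred, hid = model(x2d)
        per_sample = ((pred - y3d) ** 2).mean(dim=(1, 2))     # per-instance L_joints
        loss_h = hypergraph_loss(hid.flatten(1), per_sample)  # Eq. (7)
        opt_s.zero_grad(); loss_h.backward(); opt_s.step()
```

Because each optimizer holds only one chain's parameters, the gradients of the two chains stay separated as described in the introduction.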

Fig. 5

Qualitative results on four scenarios in Human3.6M.

Experiments

Datasets and evaluation metrics

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human pose dataset35. We employ the average Percentage of Correct Keypoints (PCK) with a threshold of 150 mm and the Area Under the Curve (AUC) over PCK thresholds as the evaluation metrics.

Human3.6M

The Human3.6M is a large-scale public dataset for 3D human pose estimation36,37. In this paper, the ground truth (GT) of 2D and 3D human pose data provided by the dataset are used for training, and two standard evaluation protocols, Protocol 1 and Protocol 2, are used for experiments. Both protocols use subjects S1, S5, S6, S7, and S8 as the training set and subjects S9 and S11 as the testing set. Under protocol 1, after aligning the predicted 3D human joint coordinates and the GT with the root joint (central hip joint), the Mean Per-Joint Position Error (MPJPE) is calculated in millimeters. Protocol 2 utilizes a rigid transformation to align the predictions with the GT and then calculates the Procrustes analysis MPJPE (P-MPJPE).

Implementation details

Table 1 The detailed experimental configuration.
Table 2 Quantitative results in MPJPE(mm) between the estimated pose and the GT on Human3.6M under Protocol 1. We show the results using the 2D Pose detector as the input under Configuration 1 (above the middle line) and the 2D GT as the input under Configuration 2 (below). The top two best methods of each action are highlighted in bold and underlined respectively.

We stack 8 SPblocks as our Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) architecture. The hidden node feature dimension is 128, and all Transformer encoders use 8 heads with the official PyTorch implementation. Our models are trained with the Adam optimizer with parameters \(\beta _1=0.9\), \(\beta _2=0.999\), and \(\epsilon =1e-8\). The learning rate follows exponential decay with a decay step of 32000 and \(\gamma\) of 0.96, and L2 regularization is applied, both to prevent overfitting. The total loss is a weighted combination of \(L_{DHCC}\) and \(L_{HCC}\) with default equal weights, optimized via grid search to balance unsupervised hypergraph construction and supervised joint error minimization. The number of training epochs is set to 300, and the batch size is 200. The detailed training configuration is shown in Table 1. All experiments are conducted on one Titan X GPU with PyTorch. For the Human3.6M dataset, the data pre-processing method follows Ci et al.5. No other data augmentation is used to ensure fairness of the experiments. We adopt the pre-trained SH model42 as the 2D pose detector for Human3.6M.
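A sketch of this optimization setup in PyTorch, assuming `model` is the DCD-HCN instance; the initial learning rate and the L2 (weight decay) coefficient are not stated in the text, so the values below are placeholders.

```python
import torch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                   # placeholder: initial learning rate not stated
    betas=(0.9, 0.999), eps=1e-8,
    weight_decay=1e-5,         # L2 regularization; placeholder value
)
# Exponential decay: multiply the lr by gamma=0.96 every 32000 steps,
# assuming scheduler.step() is called once per optimizer step.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=32000, gamma=0.96)
```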

Comparison with state-of-the-art methods

Table 3 Quantitative results in P-MPJPE(mm) on Human3.6M under Protocol 2. Other settings are the same as in Table 2.
Table 4 Quantitative comparison with state-of-the-art methods on MPI-INF-3DHP. The top two best methods of each action are highlighted in bold and underlined respectively.

We compare our model with state-of-the-art methods on Human3.6M in Tables 2 and 3, and present qualitative results in Fig. 5. In the experiments, our model takes the 2D coordinates of 17 joints from a single view and a single frame as input and outputs the predicted 3D joint coordinates. Our method achieves the best performance (34.9 mm) under Protocol 1 with GT as input. Compared to Zhang et al.40's average result of 35.3 mm, this represents an improvement of 0.4 mm. In particular, we achieve large-margin improvements in taking photos, posing, sitting down, walking dog and walking together under both configurations. Under Protocol 2, our model also achieves good results, as shown in Table 3. Although our method's average result is not as strong as Li et al.14, it achieves a significant lead in specific scenarios such as Photo, Pose, SitD, and WalkD. We attribute this to our dynamic hypergraph construction and decoupled dual-chain architecture, which adaptively capture high-order joint relationships and enhance robustness to noisy 2D inputs in these complex, motion-intensive actions. Overall, our model obtains the lowest MPJPE in 7 scenes.

To assess generalization ability, we evaluate our method on the MPI-INF-3DHP dataset. The DCD-HCN is trained on the Human3.6M training set and then tested on the MPI-INF-3DHP testing set. PCK indicates the percentage of correct keypoints with a threshold of 150 mm, and AUC indicates the area under the curve computed over thresholds from 0 mm to 150 mm at 5 mm intervals. The results in Table 4 demonstrate that our method outperforms other state-of-the-art methods while only using Human3.6M for training, achieving the highest PCK of 88.1 and AUC of 52.5. This validates the effectiveness of our model in generalizing to unseen scenes. Furthermore, in terms of computational complexity, our approach strikes a good balance with 11.22M parameters and 17.94 FLOPs. Compared to other methods6,44, ours has fewer parameters yet delivers superior performance.

Table 5 Quantitative comparison of different dynamic hypergraph construction methods.

Ablation study

To verify the impact of each component in DCD-HCN, we conduct ablation experiments on the Human3.6M dataset under Protocol 1 with 2D GT as input.

Impact of hypergraph construction methods

According to the graph construction method, dynamic hypergraphs can be categorized as follows: adjacency matrix parameterization, hidden layer features as adjacency matrices, and non-learning methods. The first two can be used simultaneously, which we record as the mixed learnable mode. In the comparative experiments, the original Selector module and the weight parameter pool are removed from our DCD-HCN, only the joint error loss function is used, and W is removed from Eq. 6 in model-P, model-H, and model-M.

For parameterization (model-P), we set a learnable tensor as the dynamic hypergraph incidence matrix, which directly participates in the hypergraph convolution. For hidden layer features (model-H), we add a new Transformer encoder layer, trained together with the Processors, to generate the dynamic hypergraph incidence matrix. For the mixed learnable mode (model-M), we adopt both methods at the same time, adding the parameterized adjacency matrix and the output of the Transformer encoder layer to form the dynamic hypergraph incidence matrix. For the non-learning method (model-N), we construct dynamic hyperedges following the method of Liu et al.4 to generate the dynamic hypergraph incidence matrix. The experimental results are shown in Table 5. They show that separating dynamic hypergraph construction from convolution and the edge-weight parameter matching mechanism both improve prediction accuracy.

Furthermore, we validate the impact of the dynamic hypergraph construction loss on 3D pose estimation performance. Specifically, our model employs both the proposed dual chain architecture and the dynamic hypergraph construction loss, while the baseline models (model-P, model-H, and model-M) use only the traditional joint error loss, with graph construction and convolution coupled end-to-end. The results show that our method achieves a significantly lower MPJPE of 34.9 mm than the others, directly verifying the effectiveness of the proposed loss and the corresponding decoupled training framework in improving hypergraph construction quality and 3D pose estimation accuracy.

Sensitivity analysis

We investigated the influence of each module within the primary components of DCD-HCN on model performance, providing evidence for our sensitivity analysis. The main parts of DCD-HCN are the node embedding layer (NE), the Selector-Processor block (SPblock), and the global embedding layer (GE). We design 4 settings, as shown in Table 6. In Settings 1 and 2, we use a fully connected layer to replace NE. In Setting 3, we double GE to balance the parameters. These settings reveal the model's robustness to structural variations. In Setting 2, replacing NE with a fully connected layer while adding SPblock tends to induce over-smoothing, underscoring the necessity of the node embedding layer for effective feature extraction. The results of Setting 4 also demonstrate this phenomenon. Furthermore, Setting 4 surpasses Setting 3 by 11.9%, demonstrating SPblock's superior capability in uncovering indistinct latent node relationships while maintaining stability across parameter-balanced modifications. The collaborative function of all modules confirms the resilience of DCD-HCN to moderate disturbances, enhancing its reliability in practical applications.

Table 6 Quantitative comparison among different module settings in DCD-HCN. NE means node embedding layer, SP means SPblock and GE means global embedding layer.
Table 7 Evaluation results on Human3.6M under different scales of node groups. Params represent the size of the parameter pool.

The scale of parameter pool

As discussed in the Processor section, the scale of the parameter pool stored in the Processor grows exponentially with the number of nodes. Too many nodes would lead to an excessive number of parameters, so we group nodes in Algorithm 1 to improve matching efficiency.

The main sources of complexity in our model are the Transformer encoder serving as the global embedding layer, which captures long-range dependencies, and the hyperedge weight pool stored within the Processor, which enables adaptive high-order aggregations. To mitigate this complexity, we incorporate a node grouping strategy that compresses the 17 joints into k groups, reducing the exponential growth of the hyperedge weight pool from \(2^{17}\) to \(2^k\) possibilities while balancing performance and efficiency. In Table 7, we compare different scales by gradually reducing the number of nodes per group. We observe that no compression or a small compression scale leads to inferior results. However, over-compression means that even when the dynamic hypergraph constructed by the Selector varies widely, the Processor tends to reuse the same weight parameters, which renders the dynamic hypergraph meaningless and also yields inferior results. Thus, we select 8 as the number of groups according to the results.
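The arithmetic behind this trade-off is simple: the pool holds one weight per possible group-level hyperedge, so its size is \(2^g\) for g groups.

```python
# Pool size as a function of the number of groups g: grouping the 17 joints
# shrinks the hyperedge weight pool from 2^17 = 131072 entries down to
# 2^8 = 256 at the selected g = 8.
for g in (17, 12, 8, 4):
    print(f"g = {g:2d} -> pool size 2^{g} = {2 ** g}")
```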

Conclusion

We propose a novel Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) for 3D human pose estimation. DCD-HCN has a dynamic hypergraph construction chain (Selector) guided by the dynamic hypergraph construction loss, and a hypergraph convolution chain (Processor) supervised by the joint error loss. The integral dynamic hypergraphs generated by the Selectors enforce the graph structure constraint on graph convolution, and the hyperedge weight parameter pool decomposes the independence of hypergraphs into that of hyperedges. Extensive experiments show that DCD-HCN has a remarkable generalization advantage over previous dynamic graph convolution networks.

Although our DCD-HCN achieves SOTA generalization performance on the MPI-INF-3DHP dataset, it fails in some action scenes, especially Purch, Sit, Smoke and Walk, when taking detected 2D poses as input. DCD-HCN and previous GCN-based works all fail in these four scenes, which we attribute to an inherent defect of GCNs. By contrast, DCD-HCN makes great progress in the scenes where GCNs perform well, such as Photo, SitD, and WalkD, where non-GCN-based works do not perform well. Overall, the DCD-HCN framework enhances both the performance and adaptability of GCNs within this theoretical framework. Additionally, DCD-HCN adopts Transformer encoder layers as parts of the Selector and the global embedding layer, which increases the number of parameters.