Introduction

3D Human Pose Estimation (3D HPE) aims to recover the spatial positions of body joints as accurately as possible. 3D HPE methods can be roughly divided into two categories: single-stage (end-to-end training)1,2 and multi-stage (phase-based training)3,4,5. Multi-stage methods consist of a detecting stage and a lifting stage: the detecting stage estimates 2D joint coordinates from input images, and the lifting stage computes 3D joint coordinates from the 2D ones. Thanks to the great progress of detection methods, researchers now mostly focus on the lifting stage, which remains challenging as an inherently ill-posed problem. Among the many lifting approaches, those based on Graph Convolutional Networks (GCNs) built on the human skeleton structure achieve excellent performance. These GCN-based approaches aggregate joint features through a fixed message-passing mechanism constrained by prior knowledge of the human body, and can be trained efficiently with low overhead. However, a static human skeleton structure can hardly capture the full flexibility of human poses, especially in cooperative actions of the limbs, such as clapping hands.

To overcome this limitation, some researchers propose dynamic GCNs (or Hypergraph Convolutional Networks, HCNs) that learn to construct dynamic graph structures for 3D HPE. Although these methods achieve promising performance without additional training data, they overlook the fact that training samples of distinct human actions call for different optimal graph structures, which disturbs the learning process and leads to inadequate progress in dynamic graph construction. The root cause is the lack of guidance and supervision in dynamic graph construction.

In this paper, we propose a novel Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) to address the above problems. The dual chain in DCD-HCN consists of the Dynamic Hypergraph Construction Chain (DHCC) and the Hypergraph Convolution Chain (HCC), as shown in Fig. 1. In detail, the Selector generates an adaptive dynamic hypergraph in DHCC, and the Processor conducts hypergraph convolution in HCC with the dynamic hypergraph generated by the Selector. The Selector and Processor are combined into the Selector-Processor block (SPblock) in DCD-HCN. During training, we use the hypergraph construction loss to guide DHCC and the joint error loss to supervise HCC. The gradients of the two chains are separated to avoid mutual influence from the joint error loss of different inputs, because confusing the optimal hypergraph structures of different inputs can interfere with the adaptability of dynamic hypergraph construction. In contrast to prior works like SemGCN3, which rely on static or semi-dynamic graphs with coupled construction and convolution, our DCD-HCN introduces a dual-chain architecture that fully decouples dynamic hypergraph construction from convolution. This enables independent optimization of adaptive hyperedges via an unsupervised signal, significantly enhancing flexibility for complex pose variations and improving generalization across datasets.

Fig. 1

The dual chain in DCD-HCN. 2D joint features are taken as the input, and each chain is trained separately with its own loss function, \({L_{DHCC}}\) or \({L_{HCC}}\).

Benefiting from the Selector-Processor architecture, DCD-HCN can generate integral (binary) dynamic hypergraph structures while keeping decimal (real-valued) adjacency weights. This strengthens the edge constraints on node feature aggregation in dynamic hypergraph convolution, which cannot be realized in end-to-end methods because they demand gradient propagation between dynamic graph construction and graph convolution.

Since decomposing the independence of the hypergraph into the independence of individual hyperedges can greatly enhance the adaptability of the model, we need to guarantee that each possible hyperedge matches a unique weight. We therefore introduce a hyperedge weight parameter pool in each Processor. If a hyperedge is not chosen, its unique weight parameter is neither used nor trained, and the features of the joints in this hyperedge are not aggregated. With this matching mechanism, the active dynamic weights change synchronously with the dynamic hypergraphs. Our main contributions are summarized as follows:

  • We propose a Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) for 3D HPE, which decouples dynamic hypergraph convolution into dynamic hypergraph construction and hypergraph convolution by training them with an unsupervised hypergraph construction loss and a supervised joint error loss, respectively.

  • DCD-HCN introduces a Selector-Processor block (SPblock) to realize the interplay between DHCC and HCC through dynamic integral hypergraphs, which helps our method enforce the constraint on node feature aggregation in dynamic hypergraph convolution.

  • DCD-HCN involves a hyperedge-weight parameter matching mechanism that keeps a one-to-one correspondence between dynamic hyperedges and hyperedge weights, improving the training of the proposed network at reasonable computational cost.

  • We achieve state-of-the-art generalization performance on the MPI-INF-3DHP dataset while maintaining leading test results on the Human3.6M dataset.

Related works

3D human pose estimation

Recently, 3D human pose estimation (3D HPE) has made great progress. Martinez et al.6 used a fully connected network to map 2D joint coordinates to 3D space and achieved excellent results. SemGCN3 successfully applied graph convolutional networks (GCNs) to 3D HPE. Liu et al.4 designed a Semi-Dynamic Hypergraph Neural Network based on Feng et al.7, which used dynamic graph construction to improve the flexibility of feature aggregation in GCNs. Subsequently, researchers proposed a variety of improved methods based on GCNs and the prior knowledge of 3D human pose8,9,10,11,12 by making various adjustments to the graphs. Ci et al.5 proposed a local connection network (LCN) based on human skeleton joint connections. The success of LCN shows that enforcing the constraint of node connections on feature aggregation is of great significance for a model with good generalization performance. Xie et al.13 employed a multi-stage approach combining high-order graph convolution and Transformer to handle the lifting from 2D to 3D, using dynamic adjacency matrices to capture dynamic joint relationships and address issues like self-occlusion and depth ambiguity. Li et al.14 processed data through parallel subnetworks from sparse to fine representations, utilizing multi-scale graph structure construction to generate denser topologies for learning feature aggregation of pose nodes. However, these multi-stage methods often suffer from drawbacks such as higher computational complexity and increased parameter counts, leading to potential error propagation across stages. Our DCD-HCN mitigates these issues through its decoupled dynamic hypergraph architecture, which enables more streamlined and adaptive processing. Moreover, different from previous GCN-based work that uses decimal (real-valued) graph structures to represent the human body, we introduce integral dynamic hypergraphs in DCD-HCN, which better enforce the constraint on node feature aggregation in dynamic hypergraph convolution.

Dynamic graph convolutional network

Dynamic graph convolutional networks improve the flexibility of node feature aggregation by constructing dynamic graph adjacency matrices. These methods can be divided into three categories based on the way the graph is constructed: graph adjacency matrix parameterization, non-learning methods, and hidden layer features as adjacency matrices. Graph adjacency matrix parameterization converts some or all elements of the static graph adjacency matrix into learnable parameters3,15,16,17,18. Non-learning methods generally construct graph structures dynamically by manually defined rules4,7,19,20. Other methods incorporate learnable modules to generate adjacency matrices9,16,21,22,23,24,25,26,27,28,29. Since these methods cannot explicitly quantify the effect of the generated graph structure, it is difficult to improve them further by exploring the role of dynamic graph construction. In this paper, we propose an unsupervised hypergraph construction loss to purposefully guide dynamic hypergraph construction in one chain of the dual-chain architecture, which is realized by separating dynamic hypergraph construction from dynamic hypergraph convolution.

Fig. 2

(a) The structure of DCD-HCN; its backbone consists of stacked SPblocks. (b) The Selector, which constructs the hypergraph incidence matrix as the dynamic hypergraph. (c) The Processor, which performs hypergraph convolution on the pose features based on the hypergraph incidence matrix generated by the Selector.

Transformer

Transformer was originally proposed for natural language processing30, performing well by shortening the propagation distance between sequence elements through the self-attention mechanism and position-encoding embeddings. Recently, researchers found that Transformer can also be used for vision tasks: the well-known ViT was proposed as a pure Transformer architecture31, breaking the cross-task model stereotype through large-scale pre-training. More and more researchers now introduce the Transformer architecture into different vision tasks, including 3D HPE2,11,32,33,34. The good results show that Transformer modules can greatly enhance the joint feature extraction capability of 3D HPE networks. Therefore, we also leverage a Transformer encoder for the global embedding of node features.

DCD-HCN: a dual chain dynamic hypergraph convolution network

As shown in Fig. 2, the proposed Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) first embeds the 2D coordinates of the skeleton joints into a high-dimensional space using a node embedding layer, then stacks multiple SPblocks to learn the indistinct latent patterns among nodes. Finally, the node features are mapped to 3D coordinates by another node embedding layer. The first node embedding layer and each SPblock are followed by a global embedding layer.
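To make the layout concrete, the following is a minimal, self-contained PyTorch sketch of this stacking order. The node embeddings are replaced by plain linear stand-ins and the SPblocks by identity placeholders (the real modules are defined in the subsections below); only the wiring is meant to be accurate, and all class names are ours.

```python
import torch
import torch.nn as nn

class DCDHCN(nn.Module):
    def __init__(self, num_joints=17, dim=128, num_blocks=8, heads=8):
        super().__init__()
        self.node_in = nn.Linear(2, dim)    # stand-in for the SemGConv-based node embedding
        self.sp_blocks = nn.ModuleList(
            [nn.Identity() for _ in range(num_blocks)]  # stand-ins for the SPblocks
        )
        self.global_embeds = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(num_blocks + 1)]  # one after the first embedding and each SPblock
        )
        self.node_out = nn.Linear(dim, 3)   # maps node features back to 3D coordinates

    def forward(self, x):                   # x: (B, 17, 2) 2D joint coordinates
        p = self.global_embeds[0](self.node_in(x))
        for blk, ge in zip(self.sp_blocks, self.global_embeds[1:]):
            p = ge(blk(p))                  # SPblock followed by a global embedding layer
        return self.node_out(p)             # (B, 17, 3) predicted 3D joints

print(DCDHCN()(torch.randn(4, 17, 2)).shape)  # torch.Size([4, 17, 3])
```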

Fig. 3

Illustration of weight extraction from the hyperedge weight parameter pool. (a) Extract the hyperedge weights following the original nodes. (b) Merge the nodes in the same group into a new node.

Node embedding layer

Inspired by Zhao et al.3, we construct a static graph convolution layer based on the SemGConv layer, whose forward propagation is expressed in Eqs. 1 and 2.

$$\begin{aligned} {A = \textrm{softmax}\left( {adj \odot W,axis = 1} \right) } \end{aligned}$$
(1)
$$\begin{aligned} {Y = \left( {A \odot I} \right) X\Theta ^{1} + \left( {A \odot \left( {1 - I} \right) } \right) X\Theta ^{2}} \end{aligned}$$
(2)

where adj is the adjacency matrix of the human skeleton model, W is the weight matrix of the skeleton edge set, I is the identity matrix, X is the input feature matrix, \(\Theta ^{1}\) and \(\Theta ^{2}\) are the parameter matrices that embed node features in the self-link branch and the non-self-link branch, and \(\odot\) denotes element-wise multiplication.
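A minimal PyTorch sketch of this layer follows, assuming the softmax of Eq. 1 is masked to the skeleton edges so that non-edges receive zero attention (as in SemGConv), and that adj includes self-loops; the class name and initialization are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEmbedding(nn.Module):
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        # adj: (J, J) binary skeleton adjacency, assumed to include self-loops
        self.register_buffer("adj", adj)
        self.W = nn.Parameter(torch.ones_like(adj))           # learnable edge weights
        self.theta1 = nn.Linear(in_dim, out_dim, bias=False)  # Theta^1: self-link branch
        self.theta2 = nn.Linear(in_dim, out_dim, bias=False)  # Theta^2: non-self-link branch

    def forward(self, x):                                     # x: (B, J, in_dim)
        # Eq. (1): row-wise softmax, masked so non-edges get zero attention
        logits = self.W.masked_fill(self.adj == 0, float("-inf"))
        A = F.softmax(logits, dim=1)
        I = torch.eye(A.size(0), device=A.device)
        # Eq. (2): self links and non-self links use separate parameters
        return (A * I) @ self.theta1(x) + (A * (1 - I)) @ self.theta2(x)
```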

It can be seen from Eqs. 1 and 2 that when all node embedding layers share the same static human skeleton model, each layer aggregates the features of connected joints in the same pattern. As Liu et al.4 point out, this can hardly capture the complex relationships of non-neighbouring joints.

Global embedding layer

Since the graph convolution layer only aggregates features within the same feature dimension across different nodes and across all feature dimensions of a single node, it cannot aggregate features across different feature dimensions of different nodes. For example, the information of adjacent nodes alone is not enough to estimate the positions of nodes that are overlapped or missing. Therefore, we choose the Transformer encoder layer to extend the receptive field of the subsequent graph convolution to the whole graph.
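A minimal sketch of this layer: one standard PyTorch Transformer encoder layer whose self-attention lets every joint attend to every other joint, extending the receptive field to the whole graph. The dimensions follow the implementation details given later (128-d node features, 8 heads).

```python
import torch
import torch.nn as nn

# Global embedding layer: mixes information across all joints and feature dims.
global_embedding = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)

features = torch.randn(4, 17, 128)   # (batch, joints, feature dim)
out = global_embedding(features)     # same shape; every joint now sees the whole graph
```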

SPblock

The SPblock consists of two parts: the Selector (Eq. 3) and the Processor (Eq. 4). The Selector constructs the integral hypergraph structure \(H^{(l)}\) dynamically from the pose features \(P^{({l - 1})}\), where l is the layer index. The Processor then extracts node features with \(H^{(l)}\).

$$\begin{aligned} H^{(l)} = Selector\left( P^{({l - 1})} \right) \end{aligned}$$
(3)
$$\begin{aligned} P^{(l)} = Processor\left( H^{(l)},W^{(l)}, P^{({l - 1})} \right) \end{aligned}$$
(4)
Algorithm 1

Hypergraph Incidence Matrix Construction

Selector

We use a Transformer encoder layer to encode the pose features P, capturing global long-range dependencies among joints through multi-head self-attention. The Transformer module extracts global contextual features, which are decoupled and passed to the Processor for GCN-based local aggregation; the long-range dependencies it captures are integrated into the graph convolution via residual connections, where the Transformer-augmented attention weights refine the adjacency matrix. The encoded features are then column-normalized via \(\textrm{softmax}\) to obtain the intermediate feature hid. Afterwards, we construct the integral hypergraph according to Algorithm 1. The construction algorithm first calculates the mean value of each column of hid and sets the elements of each column to 0 when they are smaller than the mean value, otherwise 1, thereby establishing the association between the intermediate feature hid and the dynamic hypergraph incidence matrix H. After that, we partition the nodes into g groups: if any node in a group is selected, we regard the group as a new selected node, otherwise as an unselected node. Finally, we calculate the hyperedge indices \(\omega\), which will be used to match hyperedge weight parameters in the Processors.
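A minimal sketch of Algorithm 1 under our reading of the text: hid has one row per joint and one column per candidate hyperedge, and groups (a hypothetical mapping, see Fig. 3b) assigns each of the 17 joints to one of g symmetry groups.

```python
import torch

def build_hypergraph(hid, groups, g):
    # hid: (J, E) column-normalized intermediate features from the Selector
    # groups: (J,) long tensor mapping each joint to one of g symmetry groups
    mean = hid.mean(dim=0, keepdim=True)      # per-column (per-hyperedge) mean
    H = (hid >= mean).float()                 # binarize: 1 if above the column mean, else 0

    Hg = torch.zeros(g, hid.size(1))
    Hg.index_add_(0, groups, H)               # a group is selected if any member joint is
    Hg = (Hg > 0).float()

    powers = 2 ** torch.arange(g)             # encode each group-level column as a binary
    omega = (Hg.t() * powers).sum(dim=1).long()  # number -> index into the 2^g weight pool
    return H, omega

hid = torch.rand(17, 32)                      # 17 joints, 32 candidate hyperedges
groups = torch.randint(0, 8, (17,))           # hypothetical joint-to-group assignment
H, omega = build_hypergraph(hid, groups, g=8)
```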

Processor

Every Processor internally stores a pool of weight parameters W covering all possible hyperedges, and the indices \(\omega\) of the hyperedge set generated by the Selector are used to extract the weight subset \(W(\omega )\) corresponding to that hyperedge set, as shown in Fig. 3a.

Fig. 4

The convergence curves of the two core loss functions over training steps. The joint loss converges to 0.0012, and the hypergraph loss converges to 0.0021. We normalize the loss values to \([0, 1]\) for better visualization.

Each parameter in the weight parameter pool W corresponds to one hyperedge. This allows the Processor to use the corresponding parameters when the Selector chooses different hyperedges for different input features, thus ensuring the independence of the dynamic hypergraph and its hyperedges. The hypergraph convolution formula is adapted from Feng et al.7, as shown in Eqs. 5 and 6. Note that hyperedge \(e \in E\), where E is the hyperedge set of the dynamic hypergraph; node \(v \in V\), where V is the node set; and hyperedge weight \(w(e) \in W\), where W is the weight set of the dynamic hypergraph.

$$\begin{aligned}&h\left( v,e \right) = {\left\{ \begin{array}{ll} 1, &{} v \in e \\ 0, &{} v \notin e \end{array}\right. } \\&d(v) = {\sum \limits _{e \in E}{w(e)h(v,e)}} \\&d(e) = {\sum \limits _{v \in V}{h\left( v,e \right) }} \end{aligned}$$
(5)

where h(v, e) denotes whether the node v belongs to the hyperedge e, d(v) denotes the degree of the node v, and d(e) denotes the degree of the hyperedge e.

$$\begin{aligned} {X^{({l + 1})} = \sigma \left( {D_{v}^{- \frac{1}{2}}H^{(l)}{W(\omega )D}_{e}^{- 1}(H^{(l)})^{T}D_{v}^{- \frac{1}{2}}X^{(l)}\Theta ^{(l)}} \right) } \end{aligned}$$
(6)

where \(D_{v}\) denotes the diagonal matrix of the degrees of the nodes, \(D_{v}^{ii} = d\left( V_{i} \right)\), \(D_{e}\) denotes the diagonal matrix of the degrees of the dynamic hyperedges, \(D_{e}^{jj} = d\left( E_{j} \right)\). \(W(\omega )\) denotes the learnable weights corresponding to the dynamic hypergraph. \(H^{(l)}\) denotes the incidence matrix of the hypergraph. \(\Theta ^{(l)}\) denotes the matrix of learnable weights used to change the feature dimension of the nodes.

Regarding the hyperedge weight parameter pool W as a one-dimensional array, the hyperedge weights \(W(\omega )\) are extracted using the hyperedge indices \(\omega\), and then combined with the hypergraph incidence matrix H to perform the hypergraph convolution on the pose features \(X^{(l)}\).

If there are many nodes, the number of node subsets grows exponentially, which hurts training efficiency. Therefore, according to the symmetry of the human body structure, some nodes of the human skeleton model are merged into one group, as shown in Fig. 3b. A sketch of the resulting Processor follows.
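This is a minimal sketch of the Processor implementing Eq. 6 under stated assumptions: the pool holds one scalar weight per possible group-level hyperedge (hence \(2^g\) entries), \(\sigma\) is taken as ReLU (our choice), and H and \(\omega\) come from the Selector sketch above.

```python
import torch
import torch.nn as nn

class Processor(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super().__init__()
        self.pool = nn.Parameter(torch.ones(2 ** g))          # one weight per possible hyperedge
        self.theta = nn.Linear(in_dim, out_dim, bias=False)   # Theta^(l): feature transform
        self.act = nn.ReLU()                                  # sigma (our choice)

    def forward(self, x, H, omega):     # x: (B, J, in_dim), H: (J, E), omega: (E,)
        w = self.pool[omega]            # W(omega): weights of the selected hyperedges only
        d_v = (H * w).sum(dim=1).clamp(min=1e-6)   # d(v) = sum_e w(e) h(v, e)
        d_e = H.sum(dim=0).clamp(min=1e-6)         # d(e) = sum_v h(v, e)
        dv = d_v.pow(-0.5)
        # Eq. (6): Dv^-1/2 H W(omega) De^-1 H^T Dv^-1/2 X Theta
        M = (dv[:, None] * H * (w / d_e)) @ (H.t() * dv)
        return self.act(M @ self.theta(x))
```

Because only `self.pool[omega]` enters the computation, unselected hyperedges receive no gradient, matching the one-to-one weight matching mechanism described above.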

Loss function

Dynamic hypergraph construction loss

We use the dynamic hypergraph construction loss to train the Selectors with their intermediate features hid, freeing them from relying solely on the joint error loss. This loss does not impose a direct or rigid constraint on the graph structure. Rather, it implicitly guides the Selectors of other samples to explore effective hypergraph connections suitable for similar poses by leveraging the hypergraph structure of the best-performing sample within the batch. As a result, during network parameter optimization, the model adaptively imitates and generalizes these high-quality structural patterns. We take the instance with the smallest joint error loss value \(\mathcal {L}_{joint}\) in the batch N as the target, and calculate the mean difference between the output of each Selector and this target from the best instance in the same batch as the dynamic hypergraph construction loss, as expressed in Eq. 7. The unsupervised training signal for hypergraph construction (\(L_{DHCC}\)) is a pivotal innovation in our learning strategy, promoting self-supervised adaptation of hypergraphs without relying on labeled joint errors. Unlike prior coupled approaches (e.g., SemGCN), which unify supervision and risk interference between the optimal structures of different poses, our decoupled strategy applies \(L_{DHCC}\) solely to the DHCC for independent hyperedge optimization, while \(L_{HCC}\) supervises the HCC. This resolves the optimization challenge and yields superior generalization and efficiency over related methods.

$$\begin{aligned} \mathcal {L}_{hypergraph}=\frac{1}{N\times S} \sum _{n=1}^{N}\sum _{s=1}^{S} \left( {hid_{n}^{s}-hid^{s}_{\mathop {\arg \min }\limits _{n}\mathcal {L}_{joint}}}\right) \end{aligned}$$
(7)
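Below is a direct reading of Eq. 7 in code: the signed mean difference between each sample's Selector features and those of the batch's best instance (a squared or absolute difference would be a natural variant). Detaching the target so that gradients flow only into the other samples' Selectors is our assumption.

```python
import torch

def hypergraph_loss(hid, joint_loss):
    # hid: (N, S) flattened Selector features; joint_loss: (N,) per-sample joint error
    best = hid[joint_loss.argmin()].detach()   # target: hid of the best instance in the batch
    return (hid - best).mean()                 # Eq. (7): mean over batch N and feature dim S
```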

Joint error loss

The 3D coordinates of human joints predicted by DCD-HCN are defined as \(Joints = \{ {\overset{\sim }{J}}_{i} \mid i = 1,\ldots ,n \}\), and the targets are \(\left\{ J_{i} \mid i = 1,\ldots ,n \right\}\). We choose the mean squared error as the joint error loss, as shown in Eq. 8.

$$\begin{aligned} \mathcal {L}_{Joints} = \frac{1}{n}{\sum \limits _{i=1}^{n}\left( {\overset{\sim }{J}}_{i} - J_{i} \right) ^{2}} \end{aligned}$$
(8)

Network training

Due to the separation of the dual chain, we can train the Selector and the Processor independently. Since it is impossible to determine whether the current parameter distribution of the Selector generates suitable dynamic hypergraph structures, we follow the Expectation-Maximization (EM) algorithm to alternate the training of the Selector and the Processor. In the E step, we train the Processor under the current parameter distribution of the Selector until it stabilizes. In the M step, we optimize the parameter distribution of the Selector with the dynamic hypergraph construction loss. As shown in Fig. 4, both the joint error loss (\(L_{joint}\)) and the dynamic hypergraph construction loss (\(L_{hypergraph}\)) gradually stabilize and converge as training proceeds, with no severe oscillation or divergence. This demonstrates that the alternating optimization strategy is stable and effective in practice.
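A minimal sketch of this alternating schedule, under assumptions we name explicitly: `model(x2d)` returns the 3D prediction together with the Selectors' stacked intermediate features `hid`; `selector_params` and `processor_params` split the model's parameters by chain; `hypergraph_loss` is the Eq. 7 sketch above; `loader` yields `(x2d, y3d)` batches. The per-batch alternation is ours; the paper trains each phase until it stabilizes.

```python
import torch

opt_p = torch.optim.Adam(processor_params)   # HCC: trained with L_joints only
opt_s = torch.optim.Adam(selector_params)    # DHCC: trained with L_hypergraph only

for epoch in range(300):
    for x2d, y3d in loader:
        # E step: optimize the Processors under the current Selectors.
        pred, hid = model(x2d)
        loss_j = ((pred - y3d) ** 2).mean()                   # Eq. (8)
        opt_p.zero_grad(); loss_j.backward(); opt_p.step()

        # M step: optimize the Selectors with the construction loss.
        pred, hid = model(x2d)
        per_sample = ((pred - y3d) ** 2).mean(dim=(1, 2))     # per-instance L_joints
        loss_h = hypergraph_loss(hid.flatten(1), per_sample)  # Eq. (7)
        opt_s.zero_grad(); loss_h.backward(); opt_s.step()
```

Because each optimizer holds only one chain's parameters, the gradients of the two chains stay separated as described in the introduction.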

Fig. 5

Qualitative results on four scenarios in Human3.6M.

Experiments

Datasets and evaluation metrics

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human pose dataset35. We employ the average Percentage of Correct Keypoints (PCK) with a threshold of 150 mm and the Area Under the Curve (AUC) over PCK thresholds as the evaluation metrics.

Human3.6M

The Human3.6M is a large-scale public dataset for 3D human pose estimation36,37. In this paper, the ground truth (GT) of 2D and 3D human pose data provided by the dataset are used for training, and two standard evaluation protocols, Protocol 1 and Protocol 2, are used for experiments. Both protocols use subjects S1, S5, S6, S7, and S8 as the training set and subjects S9 and S11 as the testing set. Under protocol 1, after aligning the predicted 3D human joint coordinates and the GT with the root joint (central hip joint), the Mean Per-Joint Position Error (MPJPE) is calculated in millimeters. Protocol 2 utilizes a rigid transformation to align the predictions with the GT and then calculates the Procrustes analysis MPJPE (P-MPJPE).

Implementation details

Table 1 The detailed experimental configuration.
Table 2 Quantitative results in MPJPE(mm) between the estimated pose and the GT on Human3.6M under Protocol 1. We show the results using the 2D Pose detector as the input under Configuration 1 (above the middle line) and the 2D GT as the input under Configuration 2 (below). The top two best methods of each action are highlighted in bold and underlined respectively.

We stack 8 SPblocks as our Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) architecture. The hidden node feature dimension is 128, and all Transformer encoders use 8 heads with the official PyTorch implementation. Our models are trained with the Adam optimizer with parameters \(\beta _1=0.9\), \(\beta _2=0.999\), and \(\epsilon =1e-8\). The learning rate follows exponential decay with a decay step of 32000 and \(\gamma\) of 0.96, and L2 regularization is applied, both to prevent overfitting. The total loss is a weighted combination of \(L_{DHCC}\) and \(L_{HCC}\) with default equal weights, optimized via grid search to balance unsupervised hypergraph construction and supervised joint error minimization. The number of training epochs is set to 300, and the batch size is 200. The detailed training configuration is shown in Table 1. All experiments are conducted on one Titan X GPU with PyTorch. For the Human3.6M dataset, the data pre-processing method follows Ci et al.5. No other data augmentation is used to ensure fairness of the experiments. We adopt the pre-trained SH model42 as the 2D pose detector for Human3.6M.
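A sketch of this optimization setup in PyTorch, assuming `model` is the DCD-HCN instance; the initial learning rate and the L2 (weight decay) coefficient are not stated in the text, so the values below are placeholders.

```python
import torch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                   # placeholder: initial learning rate not stated
    betas=(0.9, 0.999), eps=1e-8,
    weight_decay=1e-5,         # L2 regularization; placeholder value
)
# Exponential decay: multiply the lr by gamma=0.96 every 32000 steps,
# assuming scheduler.step() is called once per optimizer step.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=32000, gamma=0.96)
```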

Comparison with state-of-the-art methods

Table 3 Quantitative results in P-MPJPE(mm) on Human3.6M under Protocol 2. Other settings are the same as in Table 2.
Table 4 Quantitative comparison with state-of-the-art methods on MPI-INF-3DHP. The top two best methods of each action are highlighted in bold and underlined respectively.

We compare our model with state-of-the-art methods on Human3.6M in Tables 2 and 3, and present qualitative results in Fig. 5. In the experiments, our model takes the 2D coordinates of 17 joints from a single view and a single frame as input and outputs the predicted 3D joint coordinates. Our method achieves the best performance (34.9 mm) under Protocol 1 with GT as input. Compared to Zhang et al.40's average result of 35.3 mm, this represents an improvement of 0.4 mm. In particular, we achieve large-margin improvements in taking photos, posing, sitting down, walking dog and walking together under both configurations. Under Protocol 2, our model also achieves good results, as shown in Table 3. Although our method's average result is not as strong as Li et al.14, it achieves a significant lead in specific scenarios such as Photo, Pose, SitD, and WalkD. We attribute this to our dynamic hypergraph construction and decoupled dual-chain architecture, which adaptively capture high-order joint relationships and enhance robustness to noisy 2D inputs in these complex, motion-intensive actions. Overall, our model obtains the lowest MPJPE in 7 scenes.

To assess generalization ability, we evaluate our method on the MPI-INF-3DHP dataset. The DCD-HCN is trained on the Human3.6M training set and then tested on the MPI-INF-3DHP testing set. PCK indicates the percentage of correct keypoints with a threshold of 150 mm, and AUC indicates the area under the curve computed over thresholds from 0 mm to 150 mm at 5 mm intervals. The results in Table 4 demonstrate that our method outperforms other state-of-the-art methods while only using Human3.6M for training, achieving the highest PCK of 88.1 and AUC of 52.5. This validates the effectiveness of our model in generalizing to unseen scenes. Furthermore, in terms of computational complexity, our approach strikes a good balance with 11.22M parameters and 17.94 FLOPs. Compared to other methods6,44, ours has fewer parameters yet delivers superior performance.

Table 5 Quantitative comparison of different dynamic hypergraph construction methods.

Ablation study

To verify the impact of each component in DCD-HCN, we conduct ablation experiments on the Human3.6M dataset under Protocol 1 with 2D GT as input.

Impact of hypergraph construction methods

According to the graph construction method, dynamic hypergraphs can be categorized as follows: adjacency matrix parameterization, hidden layer features as adjacency matrices, and non-learning methods. The first two can be used simultaneously, which we record as the mixed learnable mode. In the comparative experiments, the original Selector module and the weight parameter pool are removed from our DCD-HCN, only the joint error loss function is used, and W is removed from Eq. 6 in model-P, model-H, and model-M.

For parameterization (model-P), we set a learnable tensor as the dynamic hypergraph incidence matrix, which directly participates in the hypergraph convolution. For hidden layer features (model-H), we add a new Transformer encoder layer, trained together with the Processors, to generate the dynamic hypergraph incidence matrix. For the mixed learnable mode (model-M), we adopt both methods at the same time, adding the parameterized adjacency matrix and the output of the Transformer encoder layer to form the dynamic hypergraph incidence matrix. For the non-learning method (model-N), we construct dynamic hyperedges following the method of Liu et al.4 to generate the dynamic hypergraph incidence matrix. The experimental results are shown in Table 5. They show that separating dynamic hypergraph construction from convolution and the edge-weight parameter matching mechanism both improve prediction accuracy.

Furthermore, we validate the impact of the dynamic hypergraph construction loss on 3D pose estimation performance. Specifically, our model employs both the proposed dual chain architecture and the dynamic hypergraph construction loss, while the baseline models (model-P, model-H, and model-M) use only the traditional joint error loss, with graph construction and convolution coupled end-to-end. The results show that our method achieves a significantly lower MPJPE of 34.9 mm than the others, directly verifying the effectiveness of the proposed loss and the corresponding decoupled training framework in improving hypergraph construction quality and 3D pose estimation accuracy.

Sensitivity analysis

We investigated the influence of each module within the primary components of DCD-HCN on model performance, providing evidence for our sensitivity analysis. The main parts of DCD-HCN are the node embedding layer (NE), the Selector-Processor block (SPblock), and the global embedding layer (GE). We design 4 settings, as shown in Table 6. In Settings 1 and 2, we use a fully connected layer to replace NE. In Setting 3, we double GE to balance the parameters. These settings reveal the model's robustness to structural variations. In Setting 2, replacing NE with a fully connected layer while adding SPblock tends to induce over-smoothing, underscoring the necessity of the node embedding layer for effective feature extraction. The results of Setting 4 also demonstrate this phenomenon. Furthermore, Setting 4 surpasses Setting 3 by 11.9%, demonstrating SPblock's superior capability in uncovering indistinct latent node relationships while maintaining stability across parameter-balanced modifications. The collaborative function of all modules confirms the resilience of DCD-HCN to moderate disturbances, enhancing its reliability in practical applications.

Table 6 Quantitative comparison among different module settings in DCD-HCN. NE means node embedding layer, SP means SPblock and GE means global embedding layer.
Table 7 Evaluation results on Human3.6M under different scales of node groups. Params represent the size of the parameter pool.

The scale of parameter pool

As discussed in the Processor section, the scale of the parameter pool stored in the Processor grows exponentially with the number of nodes. Too many nodes would lead to an excessive number of parameters, so we group nodes in Algorithm 1 to improve matching efficiency.

The main sources of complexity in our model are the Transformer encoder serving as the global embedding layer, which captures long-range dependencies, and the hyperedge weight pool stored within the Processor, which enables adaptive high-order aggregations. To mitigate this complexity, we incorporate a node grouping strategy that compresses the 17 joints into k groups, reducing the exponential growth of the hyperedge weight pool from \(2^{17}\) to \(2^k\) possibilities while balancing performance and efficiency. In Table 7, we compare different scales by gradually reducing the number of nodes per group. We observe that no compression or a small compression scale leads to inferior results. However, over-compression means that even when the dynamic hypergraph constructed by the Selector varies widely, the Processor tends to reuse the same weight parameters, which renders the dynamic hypergraph meaningless and also yields inferior results. Thus, we select 8 as the number of groups according to the results.
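The arithmetic behind this trade-off is simple: the pool holds one weight per possible group-level hyperedge, so its size is \(2^g\) for g groups.

```python
# Pool size as a function of the number of groups g: grouping the 17 joints
# shrinks the hyperedge weight pool from 2^17 = 131072 entries down to
# 2^8 = 256 at the selected g = 8.
for g in (17, 12, 8, 4):
    print(f"g = {g:2d} -> pool size 2^{g} = {2 ** g}")
```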

Conclusion

We propose a novel Dual Chain Dynamic Hypergraph Convolution Network (DCD-HCN) for 3D human pose estimation. DCD-HCN has a dynamic hypergraph construction chain (Selector) guided by the dynamic hypergraph construction loss, and a hypergraph convolution chain (Processor) supervised by the joint error loss. The integral dynamic hypergraphs generated by the Selectors enforce the graph structure constraint on graph convolution, and the hyperedge weight parameter pool decomposes the independence of hypergraphs into that of hyperedges. Extensive experiments show that DCD-HCN has a remarkable generalization advantage over previous dynamic graph convolution networks.

Although our DCD-HCN achieves SOTA generalization performance on the MPI-INF-3DHP dataset, it fails in some action scenes, especially Purch, Sit, Smoke and Walk, when taking detected 2D poses as input. DCD-HCN and previous GCN-based works all fail in these four scenes, which we attribute to an inherent defect of GCNs. By contrast, DCD-HCN makes great progress in the scenes where GCNs perform well, such as Photo, SitD, and WalkD, where non-GCN-based works do not perform well. Overall, the DCD-HCN framework enhances both the performance and adaptability of GCNs within this theoretical framework. Additionally, DCD-HCN adopts Transformer encoder layers as parts of the Selector and the global embedding layer, which increases the number of parameters.