Deep learning-based approaches for human pose estimation in interdisciplinary physics applications

Zhiliang, Li; Zhuo, Li

doi:10.1038/s41598-025-26972-4

Download PDF

Article
Open access
Published: 24 November 2025

Deep learning-based approaches for human pose estimation in interdisciplinary physics applications

Li Zhiliang¹ &
Li Zhuo²

Scientific Reports volume 15, Article number: 42883 (2025) Cite this article

2486 Accesses
Metrics details

Subjects

Abstract

Human pose estimation has emerged as a critical problem in computer vision due to its extensive applications across interdisciplinary fields, including robotics, augmented reality, sports analysis, and biomechanics. Traditional methods, while effective in controlled environments, often fail to generalize to real-world scenarios due to challenges such as occlusions, scale variations, and temporal inconsistencies in video data. To address these limitations, we propose the Hierarchical Spatio-Temporal Pose Network (HSTPN), a deep learning-based framework that integrates multi-scale feature fusion with attention mechanisms to capture both global context and fine-grained details. The Adaptive Pose Refinement Strategy (APRS) enhances pose predictions by iteratively refining key point locations, leveraging spatial, temporal, and domain-specific constraints. Together, these innovations enable our approach to achieve superior accuracy and robustness across diverse datasets, including both constrained and unconstrained environments. Experimental results demonstrate that HSTPN and APRS outperform state-of-the-art methods in terms of prediction accuracy, temporal coherence, and computational efficiency, making them well-suited for real-time and interdisciplinary physics applications.

Lightning Pose: improved animal pose estimation via semi-supervised learning, Bayesian ensembling and cloud-native open-source tools

Article 25 June 2024

Wearable sensor data-driven sports posture recognition using the ST-GCN spatio-temporal graph convolutional network

Article Open access 30 December 2025

Deep neural networks and deep deterministic policy gradient for early ASD diagnosis and personalized intervention in children

Article Open access 28 November 2025

Introduction

Human pose estimation (HPE) has become a critical area of research due to its extensive applications in interdisciplinary fields, including biomechanics, sports science, robotics, and human-computer interaction. The integration of HPE in physics-related domains has shown immense potential for analyzing human motion, optimizing athletic performance, and studying complex physical systems which involving human interactions¹. Achieving accurate and robust human pose estimation in physics applications remains challenging due to factors such as occlusions, varying lighting conditions and the complexity of human motion in three-dimensional space². Traditional methods have demonstrated limited scalability and accuracy, particularly when applied to dynamic or real-world physics problems³. Recent advancements in deep learning have not only addressed these challenges but also enabled new applications, sophisticated models, and computational resources⁴. Exploring deep learning-based HPE in physics applications represents a vital step forward in unlocking its interdisciplinary potential and enabling innovative solutions for real-world problems⁵. Human pose estimation methods were predominantly which is based on symbolic AI and handcrafted feature extraction. And relied heavily on prior domain knowledge. Traditional approaches often used geometric models or statistical methods⁶. These methods incorporated prior knowledge of human anatomy and physics principles to estimate poses from visual data. While these approaches offered interpretability and were effective for constrained scenarios, they were highly sensitive to variations in appearance, occlusions, and environmental factors. These methods struggled to generalize across diverse datasets and often required significant manual effort to design features or tune parameters. Despite their limitations, these early methods established a foundation for understanding the complexities of HPE and highlighted the need for automated and scalable solutions⁷.

The advent of data-driven approaches and machine learning marked a paradigm shift in human pose estimation, which addressing many of the shortcomings of traditional methods⁸. Machine learning models such as Support Vector Machines (SVMs) and Random Forests enabled more robust and scalable pose estimation by leveraging statistical learning techniques and annotated datasets⁹. These models learned to identify patterns and relationships in visual data, improving accuracy and generalization across varying conditions¹⁰. Applications of these methods in physics, such as motion analysis and kinematic studies, demonstrated their potential to capture dynamic human motion in real-world environments¹¹. However, the performance of these methods was still limited by their reliance on manual feature engineering and their inability to model complex, high-dimensional relationships in image data¹². Furthermore, the dependence on labeled data remained a bottleneck, particularly for interdisciplinary physics applications that often require domain-specific datasets¹³. The introduction of deep learning has revolutionized human pose estimation by enabling end-to-end learning and automatic feature extraction¹⁴. Convolutional Neural Networks (CNNs) have become the backbone of most modern HPE systems, demonstrating remarkable performance in tasks such as 2D and 3D pose estimation¹⁵. Models such as OpenPose¹⁶, PoseNet¹⁷, and HRNet¹⁸ have achieved state-of-the-art results by leveraging large-scale annotated datasets and advanced network architectures. These methods have proven particularly effective for interdisciplinary physics applications, where they have been used to analyze athletic motion, study human-robot interactions, and model biomechanical systems¹⁹. For example, deep learning-based methods have enabled accurate analysis of motion in sports physics, facilitating the optimization of athletic performance by capturing fine-grained details of movement. Similarly, in robotics, HPE has been used to enhance human-robot collaboration by accurately estimating joint angles and body orientation. Despite their success, deep learning approaches face challenges related to computational complexity, the need for large labeled datasets, and the lack of interpretability, which can limit their application in physics domains where transparency and efficiency are crucial.

Building on the limitations of existing methods, we propose a novel deep learning-based framework for human pose estimation tailored to interdisciplinary physics applications. Enabling more accurate and interpretable pose estimation, our approach integrates domain-specific knowledge of physics into the design of deep learning models. Specifically, we combine the predictive power of neural networks with physics-informed constraints, such as kinematic and dynamic models, to enhance the realism and accuracy of pose predictions. We incorporate self-supervised learning techniques to reduce the dependency on labeled data, addressing one of the major barriers in physics-related applications. Our framework improves robustness in challenging conditions such as occlusions and complex backgrounds by leveraging a multi-modal approach that integrates visual and sensor data. This hybrid approach not only addresses the limitations of deep learning-based HPE but also enables novel applications in physics.

We summarize our contributions as follows:

Incorporating kinematic and dynamic constraints into the deep learning model enhances the accuracy and interpretability of pose estimation in physics-related applications.
The use of self-supervised learning and multi-modal data integration ensures reliable performance across diverse and challenging scenarios, such as occlusions and complex movements.
Experimental results demonstrate improved computational efficiency and accuracy compared to state-of-the-art methods, making the approach suitable for real-time and large-scale applications in interdisciplinary physics.

To ground our approach within the scope of interdisciplinary physics, we now elaborate on several representative application domains where human pose estimation serves as a critical enabling technology. In biomechanics and sports science, HPE is widely used to model the kinematics of athletes during dynamic actions such as sprinting, jumping, or throwing. By capturing joint trajectories and posture variations over time, researchers can compute physical quantities such as angular momentum, energy expenditure, and joint torque, which are essential for performance optimization and injury prevention. In human-robot interaction systems, particularly those designed for physical collaboration or teleoperation, accurate pose estimation allows robots to anticipate human motion, adjust their trajectories, and ensure physical safety. For example, collaborative robots in assembly lines or rehabilitation exoskeletons rely on real-time body tracking to understand human intent and mechanical constraints, thereby translating vision-based estimation into control-level decisions grounded in physics. Another important use case arises in microgravity environments, such as astronaut training and motion analysis aboard the International Space Station. Under zero-gravity conditions, human movement patterns change significantly, and pose estimation enables the modeling of new force balances and inertial dynamics that do not occur on Earth. Estimating joint angles and body motion under such altered physical constraints can support the design of assistive devices and improve our understanding of human physiology in space. These cases demonstrate that human pose estimation is not merely a vision task but a conduit for extracting physically meaningful quantities from visual data. The ability of our proposed HSTPN+APRS framework to integrate temporal dynamics, spatial detail, and domain-specific constraints makes it particularly well-suited for deployment in such interdisciplinary physics applications.

Related work

Deep learning for pose estimation

Enabling precise identification of key body joints in 2D and 3D space, deep learning has significantly advanced the field of human pose estimation²⁰. Convolutional Neural Networks (CNNs) and their variants are widely used for extracting features from images and video frames, forming the backbone of most pose estimation systems. Techniques such as stacked hourglass networks, residual networks, and High-Resolution Networks (HRNet) have shown state-of-the-art performance in capturing spatial relationships between body joints²¹. OpenPose and AlphaPose are popular frameworks that implement these methodologies, demonstrating their applicability in real-time pose estimation tasks. Recently, the adoption of transformers in vision tasks has further enhanced the accuracy and efficiency of pose estimation systems²². Vision Transformers (ViT) and related architectures use self-attention mechanisms to model global dependencies, which are crucial for interpreting complex poses. Human pose estimation is increasingly being used in interdisciplinary physics applications,²³. For instance, in sports science, deep learning-based pose estimation is employed to analyze athletes’ movements and optimize performance while minimizing injury risks. These applications require robustness to occlusions, varying lighting conditions, and dynamic backgrounds²⁴. Techniques involve partitioning detected individuals and estimating their poses simultaneously, which have been proposed to address these challenges²⁵. Achieving real-time performance in resource-constrained environments remains an open problem, necessitating further exploration into lightweight network architectures and optimization techniques. Early deep learning-based methods such as DeepPose²⁶ introduced the use of deep neural networks to regress joint locations directly from image data. Stacked Hourglass Networks²⁷ enabled multi-scale feature processing and became foundational in many 2D pose estimation pipelines. HRNet²⁸ preserved high-resolution representations throughout the network to improve spatial precision. For temporal modeling in video sequences, Pavllo et al.²⁹ proposed using temporal convolutions with semi-supervised learning to enhance 3D pose estimation. More recently, Transformer-based methods like PoseFormer³⁰ have shown strong potential by capturing long-range dependencies and global context in pose sequences.

Beyond estimating joint positions, human pose estimation has been increasingly adopted in downstream tasks such as skeleton-based human action recognition (HAR), where spatial-temporal dynamics of joints serve as crucial features for understanding human behaviors. In these applications, pose sequences are encoded as skeletal graphs, and various deep learning models are employed to infer action categories. Recent studies have tackled challenges in this area by exploring more robust representations and addressing open-world scenarios. For instance, Peng et al.³¹ proposed a method to handle open-set HAR with uncertainty estimation over skeleton graphs. Xie et al.³² introduced a dynamic semantic-based spatial graph convolution network that adaptively learns context-aware features from joint graphs. Xu et al.³³ addressed the problem of label noise in real-world datasets, proposing a robust training pipeline for HAR from skeleton sequences. Furthermore, Peng et al.³⁴ investigated one-shot skeleton-based action recognition under occlusion, demonstrating the potential of learning compact and generalizable motion features even with limited data. These efforts highlight the broader impact of accurate and temporally consistent pose estimation models. Our proposed HSTPN framework, which emphasizes spatial fidelity and temporal smoothness, aligns well with the requirements of such downstream tasks, potentially serving as a reliable backbone for skeleton-based action understanding in complex environments.

Physics-informed deep learning models

Physics-informed deep learning models incorporate domain-specific knowledge from physics to enhance the accuracy and interpretability of pose estimation systems³⁵. These models integrate physical laws, such as kinematics, dynamics, and constraints, directly into the training process or the model architecture³⁶. For example, inverse kinematics has been used to ensure that the predicted joint positions adhere to anatomical constraints, improving the biological plausibility of the estimated poses³⁷. Similarly, energy-based loss functions derived from physical principles can enforce consistency between predicted motion trajectories and real-world dynamics. Physics-informed models have found applications in diverse fields such as biomechanics, where they are used to study gait abnormalities, and robotics, where accurate human pose estimation is crucial for human-robot collaboration³⁸. These models are valuable in computer vision tasks involving challenging conditions, such as occluded body parts or partial views, by leveraging physical constraints to fill in missing information³⁹. Advances in differentiable physics engines and their integration with deep learning frameworks have facilitated the training of models that simulate and predict human motion under physical constraints⁴⁰. However, the development of physics-informed deep learning systems requires a deep understanding of both the domain and the modeling process, which can increase the complexity of implementation. Moreover, aligning such models with real-world data necessitates addressing discrepancies between theoretical assumptions and practical observations.

Multimodal approaches for motion analysis

Multimodal approaches combine data from multiple sources to improve the accuracy and robustness of human pose estimation systems⁴¹. By leveraging complementary modalities, these approaches can overcome the limitations of single-sensor systems. Combining RGB camera data with depth information from LiDAR or structured light sensors allows for more accurate 3D pose estimation, even in cluttered environments⁴². Integrating video-based pose estimation with data from wearable IMUs enables precise motion tracking in applications⁴³. Transformer-based architectures have gained prominence in this domain, offering a unified framework for processing and integrating heterogeneous data streams⁴⁴. Attention mechanisms in these models help to dynamically weigh the contributions of different modalities, adapting to varying data quality and contextual relevance⁴⁵. Applications in interdisciplinary physics include the analysis of human motion in microgravity environments, where multimodal data can help study how physiological changes affect movement patterns⁴⁶. Another emerging area is virtual reality (VR) and augmented reality (AR), where multimodal systems enable realistic and interactive simulations of human motion. Challenges in this area include the synchronization and calibration of sensors, as well as the computational complexity of processing large multimodal datasets. Future directions include the use of self-supervised and unsupervised learning techniques to reduce reliance on annotated data, making these systems more scalable and accessible.

Method

Overview

Pose estimation is a fundamental problem in computer vision and has garnered significant attention due to its wide-ranging applications, including augmented reality, human-computer interaction, robotics, and sports analysis. The objective of pose estimation is to determine the spatial configuration of an object, typically in terms of the locations of key points or joints. While this task is most often associated with human pose estimation—where key joints such as shoulders, elbows, and knees are identified—it extends to general objects in scenarios such as 3D reconstruction, object tracking, and industrial automation.

In this work, we address the pose estimation problem through the lens of deep learning and focus on both accuracy and computational efficiency. The subsequent sections of this paper outline the contributions of our approach. In Sect. 3.2, we formalize the pose estimation task and provide a mathematical foundation to frame the problem. Specifically, we define the relationship between input data, such as RGB images or video sequences, and the output representation of pose, which can be parameterized as keypoint coordinates in 2D or 3D spaces. We also discuss challenges inherent to the task, such as occlusion, variations in lighting, and scale discrepancies. In Sect. 3.3, we introduce Hierarchical Spatio-Temporal Pose Network (HSTPN). Our model leverages recent advancements in neural network architectures, incorporating both spatial and temporal features to improve the accuracy of pose predictions. This model integrates attention mechanisms to prioritize regions of interest and adaptively focus on critical areas of the input data. The proposed architecture achieves competitive results while maintaining computational efficiency, making it suitable for real-time applications. In Sect. 3.4, we propose Adaptive Pose Refinement Strategy (APRS). This includes a tailored loss function that balances precision and generalizability across different datasets, along with a novel data augmentation pipeline to increase robustness against occlusion and other adversarial conditions. We also discuss the integration of domain-specific knowledge to fine-tune the network, such as anthropometric constraints for human pose estimation or symmetry properties for other objects.

Preliminaries

Pose estimation is a critical task in computer vision that involves predicting the spatial arrangement of key points or joints for objects, such as humans, animals, or rigid bodies. The ground truth heatmap $\mathcal {H}_k$ for a key point $p_k$ is modeled as a Gaussian distribution centered at its ground truth location $(u_k, v_k)$:

$$\begin{aligned} \mathcal {H}_k(u, v) = \exp \left( -\frac{(u - u_k)^2 + (v - v_k)^2}{2\sigma ^2}\right) , \end{aligned}$$

(1)

where $\sigma$ controls the spread of the Gaussian. During inference, the predicted location $\hat{p}_k$ is obtained by identifying the peak response in the predicted heatmap $\hat{\mathcal {H}}_k$:

$$\begin{aligned} \hat{p}_k = \arg \max _{(u, v)} \hat{\mathcal {H}}_k(u, v). \end{aligned}$$

(2)

Formally, the objective of pose estimation is to model the mapping from an input space of image or video data $\mathcal {X}$ to an output space of key point configurations $\mathcal {Y}$. Let $x \in \mathcal {X}$ represent an input data sample, such as a single RGB image or a sequence of video frames. The corresponding output $y \in \mathcal {Y}$ consists of a set of K key points, each represented by a coordinate vector. The task of pose estimation can be formulated as a supervised learning problem where the objective is to train a function $f_\theta : \mathcal {X} \rightarrow \mathcal {Y}$, parameterized by $\theta$, to minimize the discrepancy between predicted key points $\hat{y} = f_\theta (x)$ and ground truth key points $y^*$. This discrepancy is quantified by a loss function $\mathcal {L}(y^*, \hat{y})$, which we describe in detail later. Pose estimation is inherently complex due to occlusions where parts of the object may be obscured, scale and viewpoint variations that introduce non-linear distortions in the appearance of key points, background clutter that can confuse key point detection, and ambiguities in symmetry for certain objects, such as human hands or rigid structures, where symmetric parts lead to challenges in localization. The output y is typically expressed as a collection of keypoint coordinates $y = \{ (p_1, p_2, \dots , p_K) \}$, where for 2D estimation $p_k = (u_k, v_k) \in \mathbb {R}^2$, and for 3D estimation $p_k = (u_k, v_k, z_k) \in \mathbb {R}^3$ with $z_k$ denoting the depth. Likewise, the input x is represented as a high-dimensional tensor. For static images, $x \in \mathbb {R}^{H \times W \times C}$, and for video sequences, $x \in \mathbb {R}^{T \times H \times W \times C}$, where T is the number of frames, and H, W, C denote the image height, width, and channels, respectively. The corresponding output y is structured as a set of key points, which can be represented as heatmaps $\mathcal {H} \in \mathbb {R}^{H' \times W' \times K}$, where each heatmap $\mathcal {H}_k$ encodes the probability distribution of the k-th key point over a downsampled spatial grid of size $H' \times W'$. Heatmaps are a common intermediate representation in modern pose estimation pipelines.

To handle variations in object pose and scale, key points are normalized into a canonical coordinate system. Let $b = (x_{\text {min}}, y_{\text {min}}, x_{\text {max}}, y_{\text {max}})$ denote the bounding box surrounding the object, and $c = ((x_{\text {min}} + x_{\text {max}})/2, (y_{\text {min}} + y_{\text {max}})/2)$ be its center. Key points are normalized as

$$\begin{aligned} p_k'= \frac{p_k - c}{s}, \end{aligned}$$

(3)

where $s = \max (x_{\text {max}} - x_{\text {min}}, y_{\text {max}} - y_{\text {min}})$ is the scale of the bounding box. This normalization ensures invariance to object position and scale. The network $f_\theta (x)$ is trained using a combination of pixel-wise heatmap regression loss $\mathcal {L}_{\text {heatmap}}$ and optional auxiliary losses. The heatmap regression loss is defined as

$$\begin{aligned} \mathcal {L}_{\text {heatmap}} = \frac{1}{K} \sum _{k=1}^K \Vert \mathcal {H}_k - \hat{\mathcal {H}}_k \Vert _2^2. \end{aligned}$$

(4)

For end-to-end frameworks that directly predict key point coordinates, a mean squared error loss is applied to the normalized coordinates:

$$\begin{aligned} \mathcal {L}_{\text {MSE}} = \frac{1}{K} \sum _{k=1}^K \Vert p_k' - \hat{p}_k' \Vert _2^2. \end{aligned}$$

(5)

For video-based pose estimation, temporal coherence is leveraged to improve predictions. Let $\{x_t\}_{t=1}^T$ represent a sequence of T frames, and $\{y_t\}_{t=1}^T$ denote the corresponding key point sequences. Temporal dynamics can be captured using recurrent models or temporal convolutions:

$$\begin{aligned} y_t = f_\theta (x_t, h_{t-1}), \end{aligned}$$

(6)

where $h_{t-1}$ represents the hidden state from the previous time step. This approach ensures smooth and consistent predictions across frames. To overcome the challenges of occlusion, scale variation, and background clutter, our method incorporates multi-scale feature extraction to handle diverse object sizes, attention mechanisms to prioritize relevant regions, and temporal regularization for consistency in video data.

Hierarchical spatio-temporal pose network (HSTPN)

In this work, we propose a novel framework, termed Hierarchical Spatio-Temporal Pose Network (HSTPN), designed to address the challenges of pose estimation in both images and videos(As shown in Fig. 1). HSTPN incorporates multi-scale feature extraction, attention mechanisms, and temporal coherence modeling to provide robust and accurate predictions. This subsection details the architectural components, mathematical formulations, and innovations of HSTPN.

Multi-Scale Feature Fusion

To accurately estimate human poses in diverse scenes with varying scales, occlusions, and backgrounds, it is essential to aggregate features from multiple semantic levels. HSTPN employs a multi-scale fusion strategy that combines fine-grained spatial features from shallow layers with semantic-rich representations from deeper layers. Given the hierarchical backbone features $\phi _l(x)$ at layer l, they are first aligned to a common spatial resolution for aggregation:

$$\begin{aligned} \tilde{\phi }_l(x) = \text {Align}(\phi _l(x)). \end{aligned}$$

(7)

The $\text {Align}(\cdot )$ function applies upsampling (for coarse layers) or downsampling (for fine layers), ensuring all feature maps can be combined without distortion. To adaptively control the contribution of each layer, we apply attention weighting. The importance $\alpha _l$ of each layer is computed using global average pooled features and a learnable projection:

$$\begin{aligned} \alpha _l = \text {Softmax}(\textbf{w}^\top \text {GAP}(\tilde{\phi }_l(x))), \end{aligned}$$

(8)

where $\textbf{w}$ is a learnable vector and $\text {GAP}$ denotes global average pooling. This allows the model to emphasize contextually relevant features depending on the input image. The weighted fusion of aligned features produces the multi-scale representation:

$$\begin{aligned} \psi (x) = \sum _{l=1}^{L} \alpha _l \cdot \tilde{\phi }_l(x), \end{aligned}$$

(9)

which unifies spatial detail and semantic abstraction. To prevent scale imbalance and ensure stable training, a normalization step is applied:

$$\begin{aligned} \psi '(x) = \frac{\psi (x) - \mu (\psi (x))}{\sigma (\psi (x))}. \end{aligned}$$

(10)

This fused feature $\psi '(x)$ serves as the input to the subsequent prediction module.

Attention-Driven Prediction

The goal of this module is to localize keypoints by generating spatial heatmaps. Using $\psi (x)$, each keypoint k is associated with a heatmap $\hat{\mathcal {H}}_k$ generated through a learnable convolutional mapping $g_k(\cdot )$:

$$\begin{aligned} \hat{\mathcal {H}}_k = g_k(\psi (x); \theta _g). \end{aligned}$$

(11)

These heatmaps represent the likelihood distribution of keypoint positions across spatial locations. To extract discrete coordinates, we select the peak of each heatmap:

$$\begin{aligned} \hat{p}_k = \arg \max _{(u, v)} \hat{\mathcal {H}}_k(u, v), \end{aligned}$$

(12)

optionally refined with sub-pixel interpolation based on local gradients:

$$\begin{aligned} \hat{p}_k \leftarrow \hat{p}_k + \nabla \hat{\mathcal {H}}_k(\hat{p}_k). \end{aligned}$$

(13)

This refinement enhances spatial precision, especially in low-resolution settings. The final coordinates in the original image space are computed via:

$$\begin{aligned} \hat{p}_k^{\text {orig}} = s \cdot \hat{p}_k, \end{aligned}$$

(14)

where s is a resolution scaling factor. Additionally, the confidence of prediction is estimated by the peak value in the heatmap:

$$\begin{aligned} c_k = \max _{(u, v)} \hat{\mathcal {H}}_k(u, v). \end{aligned}$$

(15)

These predictions form the output set of estimated keypoints and their confidences.

Temporal Consistency Module

In video pose estimation, spatial accuracy alone is insufficient—temporal smoothness is critical to avoid jitter and instability across frames. To address this, HSTPN introduces a temporal encoder that captures inter-frame dependencies. Given fused spatial features $\psi (x_t)$ at time t, we update the temporal hidden state $h_t$ using a recurrent or convolutional temporal encoder:

$$\begin{aligned} h_t = \text {TemporalEncoder}(\psi (x_t), h_{t-1}), \end{aligned}$$

(16)

or via temporal convolutions over a window of size w:

$$\begin{aligned} h_t = \sum _{i=-\lfloor w/2 \rfloor }^{\lfloor w/2 \rfloor } \textbf{W}_i \cdot \psi (x_{t+i}). \end{aligned}$$

(17)

The temporally-aware feature $h_t$ is used to predict the heatmap for each keypoint:

$$\begin{aligned} \hat{\mathcal {H}}_k^t = g_k(h_t; \theta _g), \end{aligned}$$

(18)

and the corresponding coordinate is:

$$\begin{aligned} \hat{p}_k^t = \arg \max _{(u, v)} \hat{\mathcal {H}}_k^t(u, v). \end{aligned}$$

(19)

To enforce smoothness in trajectories, we introduce a temporal regularization loss that penalizes abrupt changes between frames:

$$\begin{aligned} \mathcal {L}_{\text {temporal}} = \frac{1}{T-1} \sum _{t=1}^{T-1} \sum _{k=1}^K \Vert \hat{p}_k^t - \hat{p}_k^{t+1} \Vert _2^2. \end{aligned}$$

(20)

Training Objectives

The model is optimized using a joint loss function that balances spatial precision and temporal consistency. The primary heatmap regression loss encourages accurate localization:

$$\begin{aligned} \mathcal {L}_{\text {heatmap}} = \frac{1}{K} \sum _{k=1}^K \Vert \mathcal {H}_k - \hat{\mathcal {H}}_k\Vert _2^2. \end{aligned}$$

(21)

For video tasks, we also enforce trajectory stability through a motion consistency loss that compares predicted and ground-truth motion vectors:

$$\begin{aligned} \mathcal {L}_{\text {motion}} = \frac{1}{T-1} \sum _{t=1}^{T-1} \sum _{k=1}^K \Vert (\hat{p}_k^{t+1} - \hat{p}_k^t) - (p_k^{t+1} - p_k^t) \Vert _2^2. \end{aligned}$$

(22)

The total objective is:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {heatmap}} + \lambda \mathcal {L}_{\text {temporal}} + \beta \mathcal {L}_{\text {motion}}, \end{aligned}$$

(23)

where $\lambda$ and $\beta$ are weighting coefficients. This formulation enables the model to produce accurate, stable, and temporally coherent pose predictions.

Adaptive pose refinement strategy (APRS)

While the Hierarchical Spatio-Temporal Pose Network (HSTPN) delivers robust pose estimations(As shown in Fig. 3), real-world scenarios often introduce challenges such as occlusion, ambiguous poses, and noisy input data. To address these challenges and further enhance the reliability of predictions, we propose a novel Adaptive Pose Refinement Strategy (APRS). APRS is designed to iteratively refine pose predictions through adaptive feedback mechanisms, leveraging both spatial and temporal consistency, as well as domain-specific knowledge.

Iterative Pose Refinement Framework

The Adaptive Pose Refinement Strategy (APRS) is designed to iteratively improve the initial pose predictions $\hat{p}_k^{(0)}$ obtained from HSTPN. It does so by applying constraint-aware corrections at each refinement step. Specifically, APRS performs N iterations of refinement, where at iteration t, the updated keypoint prediction is given by:

$$\begin{aligned} \hat{p}_k^{(t+1)} = \hat{p}_k^{(t)} + \Delta _k^{(t)}, \end{aligned}$$

(24)

where $\Delta _k^{(t)}$ is a correction term derived from the gradient of a composite loss function. This correction captures spatial, temporal, and domain-specific inconsistencies in the current prediction $\hat{p}_k^{(t)}$.

Spatial Consistency Constraint

To ensure anatomical plausibility, APRS enforces spatial consistency between related keypoints (e.g., limbs, joints) by minimizing the deviation from known limb lengths:

$$\begin{aligned} \mathcal {L}_{\text {spatial}} = \frac{1}{2} \sum _{(i, j) \in \mathcal {C}_{\text {spatial}}} \left( \Vert \hat{p}_i^{(t)} - \hat{p}_j^{(t)} \Vert _2 - l_{ij} \right) ^2, \end{aligned}$$

(25)

where $\mathcal {C}_{\text {spatial}}$ denotes the set of spatially connected keypoints and $l_{ij}$ is the ground-truth limb length. The correction term for spatial consistency is obtained via:

$$\begin{aligned} \Delta _k^{(t)}|_{\text {spatial}} = - \nabla _{\hat{p}_k^{(t)}} \mathcal {L}_{\text {spatial}}. \end{aligned}$$

(26)

Temporal Smoothness Constraint

For video pose estimation, predictions across consecutive frames must be smooth. We penalize large second-order differences (acceleration) in motion trajectories:

$$\begin{aligned} \mathcal {L}_{\text {temporal}} = \frac{1}{T-2} \sum _{t=2}^{T-1} \left\| \hat{p}_k^{(t-1)} - 2\hat{p}_k^{(t)} + \hat{p}_k^{(t+1)} \right\| _2^2, \end{aligned}$$

(27)

This loss suppresses temporal jitter and noise. Its corresponding correction is:

$$\begin{aligned} \Delta _k^{(t)}|_{\text {temporal}} = - \nabla _{\hat{p}_k^{(t)}} \mathcal {L}_{\text {temporal}}. \end{aligned}$$

(28)

Domain-Specific Priors

APRS allows for embedding human or task-specific priors such as joint angle limits, symmetry, or physical plausibility. A domain loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {domain}} = \sum _{k=1}^K \text {Penalty}(\hat{p}_k^{(t)}, \mathcal {C}_{\text {domain}}), \end{aligned}$$

(29)

where $\text {Penalty}(\cdot )$ encodes violations of constraints in $\mathcal {C}_{\text {domain}}$, such as invalid angles or symmetry breaks. The corresponding correction is:

$$\begin{aligned} \Delta _k^{(t)}|_{\text {domain}} = - \nabla _{\hat{p}_k^{(t)}} \mathcal {L}_{\text {domain}}. \end{aligned}$$

(30)

Composite Update Rule

The total correction at each iteration is a weighted sum of the individual constraint corrections:

$$\begin{aligned} \Delta _k^{(t)} = \lambda _{\text {spatial}} \Delta _k^{(t)}|_{\text {spatial}} + \lambda _{\text {temporal}} \Delta _k^{(t)}|_{\text {temporal}} + \lambda _{\text {domain}} \Delta _k^{(t)}|_{\text {domain}}, \end{aligned}$$

(31)

where $\lambda _{\text {spatial}}$, $\lambda _{\text {temporal}}$, and $\lambda _{\text {domain}}$ are tunable hyperparameters.

Comprehensive Loss Function

The full loss combines the initial heatmap prediction loss with all refinement-stage constraints. For N refinement steps, the total training objective is:

$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L}_{\text {initial}} + \sum _{t=1}^{N} \left( \mathcal {L}_{\text {spatial}}^{(t)} + \mathcal {L}_{\text {temporal}}^{(t)} + \mathcal {L}_{\text {domain}}^{(t)} \right) . \end{aligned}$$

(32)

Each refinement term (e.g., $\mathcal {L}_{\text {spatial}}^{(t)}$) is computed using the formulas above but evaluated at step t. This formulation ensures convergence toward anatomically valid, temporally smooth, and domain-compliant pose predictions.

To further enhance clarity, we summarize how the integration of kinematic and dynamic constraints is reflected across the architectural diagrams presented in Figs. 1, 2, 3 and 4. Figure 1 illustrates the overall framework of HSTPN, focusing on the spatial and temporal modules responsible for extracting high-level features from image and video data. Although kinematic and dynamic constraints are not explicitly drawn in this diagram, the multi-scale feature fusion and temporal consistency modules provide the necessary representations to support constraint-aware refinement in downstream stages. Figure 2 builds upon this foundation by introducing attention-driven prediction mechanisms. These modules prioritize spatially salient regions for keypoint localization, thus facilitating the later application of physical and anatomical constraints by enhancing joint discriminability. The constraint-aware reasoning becomes explicit in Fig. 3, which visualizes the Adaptive Pose Refinement Strategy (APRS). In this stage, spatial, temporal, and domain-specific losses are iteratively applied to correct initial predictions. Kinematic constraints, such as limb length preservation and joint connectivity, are enforced through spatial consistency losses. Temporal smoothness is achieved via second-order trajectory constraints, while domain knowledge, such as anthropometric limits or pose symmetry, is applied using penalty-based domain-specific loss terms. Figure 4 summarizes the refinement logic in a modular fashion. It separates the contributions of different constraint types and shows how they interact with learned embeddings through operations like similarity computation and constraint-aware feature adjustment. This progression from Fig. 1, 2, 3 and 4 reflects a coherent pipeline where initial coarse predictions are transformed into physically plausible and temporally consistent pose estimates through the integration of hierarchical constraints. The four figures offer a visual walkthrough of our framework, moving from data-driven feature extraction to iterative, constraint-informed refinement.

Challenge-oriented module design

Our method is developed with the aim of systematically addressing three major challenges in pose estimation: occlusion, scale variation, and temporal inconsistency. Each component of the proposed architecture contributes to solving one or more of these problems in a complementary manner. To cope with occlusion, the multi-scale feature fusion module aggregates spatial information across different resolution levels, allowing the model to infer missing keypoints by leveraging global and contextual cues. Moreover, the APRS refinement module imposes spatial consistency constraints that enforce anatomical plausibility, correcting structurally implausible poses caused by partial visibility or self-occlusion. Scale variation is handled by aligning hierarchical features into a unified representation and applying attention-based reweighting to adaptively emphasize feature maps at the most relevant scales. This ensures the model remains robust across diverse body sizes and camera distances. Temporal inconsistency is addressed through the temporal consistency module, which captures inter-frame dependencies to preserve motion continuity. This is further reinforced by a smoothness loss term in APRS that penalizes abrupt changes in joint trajectories, ensuring stable and temporally coherent predictions across video frames. The design choices are further validated by the ablation results, where each module demonstrates clear and measurable contributions toward alleviating the corresponding challenge. This alignment between hypothesis, architecture, and empirical evidence reflects the intentional and targeted nature of our system design.

Experimental setup

Dataset

The MPII Human Pose Dataset⁴⁷, OCHuman Dataset, CrowdPose Dataset, and SportsPose Dataset are widely used benchmarks in the field of pose estimation, each catering to specific challenges and scenarios. The MPII Human Pose Dataset is one of the most comprehensive datasets for human pose estimation, featuring images collected from everyday activities with diverse poses, varying viewpoints, and complex backgrounds, making it suitable for evaluating general-purpose pose estimation models. The OCHuman Dataset⁴⁸ focuses on heavily occluded human poses, providing challenging scenarios where parts of the body are obstructed by objects or other people, emphasizing the importance of robust methods that can handle occlusions. The CrowdPose Dataset⁴⁹ addresses the challenges of multi-person pose estimation in crowded scenes, containing images with a high density of overlapping individuals, which requires precise joint localization and effective handling of inter-person occlusions. The SportsPose Dataset⁵⁰, on the other hand, is tailored for poses in sports settings, featuring dynamic and unconventional poses often seen in athletic activities, making it ideal for evaluating models in high-motion and domain-specific scenarios. Together, these datasets represent a diverse set of challenges, offering rich benchmarks for advancing human pose estimation research across general, occluded, crowded, and domain-specific environments.

Experimental details

The experiments were conducted on four benchmark datasets: MPII Human⁴⁷, OCHuman⁴⁸, CrowdPose⁴⁹, and SportsPose⁵⁰, following the standard dataset splits and evaluation protocols for named entity recognition (NER) tasks. All datasets were preprocessed by tokenizing text into words and sentences using the spaCy library, with annotations converted to the BIO format where necessary. Input sequences were capped at a maximum length of 128 tokens to handle long sentences efficiently. For all datasets, we used GloVe embeddings initialized with 300-dimensional pre-trained word vectors for token representation. Our model was implemented in PyTorch and trained using NVIDIA A100 GPUs. The backbone architecture utilized a BiLSTM-CRF model with a transformer-based encoder (BERT) to capture contextualized word representations. Specifically, BERT-base (uncased) was used as the encoder, with fine-tuning performed during the training phase. A dropout rate of 0.3 was applied to mitigate overfitting. The CRF layer on top of the BiLSTM was used to capture the label dependencies in the sequence, ensuring valid entity predictions. The optimization process employed the AdamW optimizer with an initial learning rate of $5 \times 10^{-5}$ and a weight decay of $1 \times 10^{-4}$. A linear learning rate scheduler with warmup was utilized for the first 10% of the training steps. Each model was trained for 20 epochs with a batch size of 32, using early stopping based on validation F1-score to prevent overfitting. The loss function used was the negative log-likelihood loss for sequence tagging. Gradient clipping with a maximum norm of 1.0 was applied to stabilize training. Evaluation was performed using the precision, recall, and F1-score metrics for the NER task. The model predictions were aligned with the ground truth using exact match criteria. For MPII Human⁴⁷ and OCHuman⁴⁸, micro-averaged F1-scores were computed to evaluate performance across entity types. For CrowdPose⁴⁹, macro-averaged F1-scores were preferred to emphasize the model’s ability to handle rare and noisy entities. For SportsPose⁵⁰, additional metrics, such as semantic overlap, were explored to understand the model’s performance in capturing contextual nuances. Data augmentation techniques were applied to improve model generalization. For MPII Human⁴⁷, synonym replacement and entity shuffling were used to increase data variability. For CrowdPose⁴⁹, back-translation was employed to introduce robustness to informal text. Fine-tuning was conducted separately for each dataset to account for domain-specific variations, with the best-performing checkpoint on the validation set saved for final evaluation. The hardware environment included an Intel Xeon CPU with 256 GB of RAM and a GPU cluster. Training time varied based on dataset size, with MPII Human⁴⁷ requiring approximately 6 hours, OCHuman⁴⁸ requiring 12 hours, CrowdPose⁴⁹ requiring 4 hours, and SportsPose⁵⁰ requiring 3 hours. All experiments were repeated three times with different random seeds, and the average results were reported to ensure robustness and reproducibility. The code and experimental setup will be made publicly available to facilitate further research and validation (algorithm 1).

Comparison with SOTA methods

Tables 1 and 2 compare the proposed HSTPN model with a range of state-of-the-art (SOTA) methods, including both CNN-based and Transformer-based architectures. The evaluations are conducted on four standard pose estimation benchmarks: MPII Human⁴⁷, OCHuman⁴⁸, CrowdPose⁴⁹, and SportsPose⁵⁰, using accuracy, recall, F1 score, and AUC as evaluation metrics.

On the MPII Human dataset, HSTPN achieves the highest F1 score of 89.12%, slightly outperforming MixSTE (88.45%), TokenPose (88.01%), and ViTPose (87.89%). It also surpasses all CNN-based baselines, including HRNet (87.12%). Similar trends are observed on the OCHuman dataset, which features heavy occlusions. HSTPN achieves a F1 score of 84.92%, again leading MixSTE (84.73%), TokenPose (84.02%), and ViTPose (83.34%). These results highlight the effectiveness of HSTPN’s hierarchical architecture and refinement strategy in handling occluded and cluttered scenarios. On the CrowdPose dataset, where multi-person interactions and dense environments are common, HSTPN achieves an F1 score of 88.34%, outperforming MixSTE (87.23%) and other baselines. The model also maintains strong performance on the SportsPose dataset, which emphasizes motion dynamics. HSTPN reaches an F1 score of 83.67%, higher than MixSTE (82.89%) and TokenPose (82.34%).

The strong and consistent performance of HSTPN across all datasets can be attributed to its architectural innovations. The integration of multi-scale feature fusion allows the network to effectively capture fine-grained spatial cues, while the attention-driven prediction module enables better handling of challenging joints and occlusions. Furthermore, the Adaptive Pose Refinement Strategy (APRS) significantly improves temporal consistency and anatomical plausibility, which is especially beneficial in video-based pose estimation tasks. Although recent Transformer-based models such as ViTPose, TokenPose, and MixSTE offer competitive performance due to their ability to capture long-range dependencies, our model achieves better overall results without relying on purely Transformer-based backbones. This is particularly evident in scenarios involving occlusions and temporal inconsistencies, where HSTPN’s refinement modules contribute to more robust predictions. HSTPN not only matches or exceeds the performance of recent SOTA models, including those based on Vision Transformers, but also does so with improved interpretability and lower architectural complexity. These results reinforce the model’s suitability for real-world applications in interdisciplinary physics, where accuracy, robustness, and efficiency are critical.

Table 1 Comparison of different models on MPII human and OCHuman datasets for pose estimation Task.

Full size table

Table 2 Comparison of different models on crowdpose and sportspose datasets for pose estimation Task.

Full size table

In addition to evaluating pose estimation accuracy, we further examine the computational efficiency of our proposed HSTPN model by comparing its inference time, number of parameters, and GFLOPs against established baseline models, including OpenPose, AlphaPose, HRNet-W32, and DARKPose. The results are summarized in Table 3. Our HSTPN model achieves a favorable balance between accuracy and computational cost. It records an inference time of 58.2 ms per image on an NVIDIA A100 GPU, which is slightly higher than OpenPose but faster than HRNet and DARKPose. Notably, HSTPN requires only 26.8 million parameters, which is lower than AlphaPose, HRNet, and DARKPose, despite achieving higher accuracy and better robustness under challenging conditions. In terms of GFLOPs, HSTPN operates at 80.1 GFLOPs, which is significantly lower than AlphaPose (98.6) and slightly lower than HRNet (82.9), while maintaining competitive runtime performance. These results demonstrate that HSTPN is both computationally efficient and scalable, making it suitable for real-time applications such as sports tracking, robotics, and augmented reality, where latency and resource constraints are critical. The above findings confirm that our approach does not merely optimize for accuracy but also emphasizes practicality through efficient architectural design.

Table 3 Comparison of computational cost across baseline methods and the proposed HSTPN model. Inference time is measured per image on a single NVIDIA A100 GPU (batch size = 1).

Full size table

Ablation study

The results of the ablation study, shown in Table 4 and Table 5, analyze the contributions of Multi-Scale Feature Fusion, Temporal Consistency Module, Attention-Driven Prediction, and Iterative Refinement Framework in our proposed HSTPN model for the pose estimation task across MPII Human⁴⁷, OCHuman⁴⁸, CrowdPose⁴⁹, and SportsPose⁵⁰ datasets. The ablation experiments demonstrate the critical importance of each component in achieving the overall superior performance of HSTPN, as the complete model consistently outperformed its ablated variants on all metrics. On the MPII Human dataset, removing Multi-Scale Feature Fusion reduced the accuracy to 85.12% and F1-score to 83.89%, highlighting its crucial role in capturing low-level features necessary for precise entity recognition. Temporal Consistency Module, responsible for dynamic attention, also showed a significant impact, with accuracy dropping to 86.45% when excluded. This emphasizes the effectiveness of dynamic attention in focusing on task-relevant features. Similarly, removing Attention-Driven Prediction and Iterative Refinement Framework resulted in further performance degradation, with the model achieving accuracy values of 87.89% and 88.45%, respectively. The complete HSTPN model, with all components intact, achieved the highest accuracy (90.21%) and F1-score (89.12%), demonstrating the synergy between these components in improving contextual understanding and generalization. For the OCHuman dataset, the trends were consistent with those observed on MPII Human. Without Multi-Scale Feature Fusion, the model’s accuracy dropped to 80.45%, and removing Temporal Consistency Module resulted in an accuracy of 81.67%. The exclusion of Attention-Driven Prediction and Iterative Refinement Framework further reduced performance, with accuracy values of 82.89% and 83.56%, respectively. The complete HSTPN model achieved the best results, with an accuracy of 85.67% and an F1-score of 84.92%, underscoring the importance of incorporating all components for handling diverse and domain-specific entity types in this dataset.

On the CrowdPose dataset, which focuses on noisy and informal text, HSTPN achieved an accuracy of 89.12%, outperforming all ablated versions. Removing Multi-Scale Feature Fusion resulted in an accuracy of 83.12%, highlighting its role in extracting robust features from informal and noisy data. Temporal Consistency Module, which handles dynamic feature weighting, also played a critical role, as removing it reduced accuracy to 84.56%. The exclusion of Attention-Driven Prediction and Iterative Refinement Framework caused accuracy reductions to 85.34% and 86.45%, respectively, further validating their contribution to HSTPN’s ability to generalize to noisy datasets. For the SportsPose dataset, which emphasizes semantic richness and contextual understanding, HSTPN achieved an accuracy of 84.23% and an F1-score of 83.67%. Without Multi-Scale Feature Fusion, the model’s performance dropped to 78.34%, and removing Temporal Consistency Module led to an accuracy of 79.67%. Excluding Attention-Driven Prediction and Iterative Refinement Framework resulted in accuracy values of 80.89% and 81.67%, respectively. These results demonstrate that each component contributes uniquely to HSTPN’s ability to capture linguistic nuance and semantic overlap in challenging datasets. The ablation study validates the critical contributions of each component in HSTPN’s architecture. Multi-Scale Feature Fusion captures essential low-level features, Temporal Consistency Module introduces dynamic attention for task-relevant feature weighting, Attention-Driven Prediction enhances multi-scale feature extraction, and Iterative Refinement Framework integrates global and local context. The complete HSTPN model, by combining these components, achieves state-of-the-art performance across diverse datasets and tasks, as evidenced by the results.

To further enhance the interpretability of our ablation study, we augment Tables 4 and 5 by reporting delta values ($\Delta$F1 Score and $\Delta$AUC), which quantify the performance differences between the full model and its ablated variants. These delta values offer a more intuitive and quantitative way to evaluate the relative importance of each component in our architecture. Rather than only comparing absolute performance values, the delta metrics explicitly highlight how much each module contributes to the overall accuracy and robustness of pose estimation. For instance, removing the multi-scale feature fusion module results in the largest performance drop across all datasets, particularly on challenging benchmarks such as OCHuman and CrowdPose, where occlusion and scale variation are prevalent. This underscores the module’s ability to preserve fine-grained spatial information at multiple resolutions. Similarly, eliminating the temporal consistency module leads to substantial declines in AUC and F1 Score, especially in video-based datasets, validating its effectiveness in maintaining smooth and coherent predictions across frames. The iterative refinement framework also provides measurable improvements by enforcing spatial, temporal, and domain-specific constraints. Through this delta-based presentation, we not only strengthen the empirical evidence of each module’s utility but also make the ablation results more accessible for comparative analysis and future research reference.

Table 4 Ablation study results on different components for pose estimation task across MPII human and OCHuman datasets (with delta improvements).

Full size table

Table 5 Ablation study results on different components for pose estimation task across crowdpose and sportspose datasets (with delta improvements).

Full size table

To further evaluate the robustness and interpretability of our proposed HSTPN+APRS framework, we conducted additional qualitative comparisons with baseline methods, including OpenPose, AlphaPose, and HRNet. Table 6 summarizes visual performance under different challenging scenarios, such as partial occlusion, crowded scenes, rapid movement, and low-light conditions. From the qualitative results, it is evident that our method consistently generates anatomically plausible keypoint predictions even when other methods fail. In upper-body occlusion scenarios, HSTPN+APRS effectively recovers missing joints by leveraging spatial priors and temporal dynamics. In crowded scenes, our method performs significantly better at isolating individuals and estimating multi-person poses, demonstrating strong occlusion robustness. Similarly, under motion blur conditions in the SportsPose dataset, our network maintains joint consistency and smooth trajectories, which is not observed in earlier baselines. We also performed failure case analysis to identify limitations of our approach. All models, including ours, exhibit decreased performance in extreme occlusion settings and cases involving overlapping limbs, especially in lower-body regions. These errors typically stem from ambiguous visual cues and limited training data diversity. Our analysis suggests that incorporating synthetic occlusion simulation and better structural priors (pose grammars) could help mitigate such issues in future work. The qualitative results support our quantitative findings and further highlight the generalization capability and interpretability of the proposed HSTPN+APRS model. In addition to the qualitative table, Fig. 5 visualizes the comparative results using a heatmap format. This graphical representation further emphasizes the robustness of our proposed method under diverse challenging conditions. Notably, our model is the only one that performs consistently across partial occlusion, crowded scenes, fast motion, and low-light environments. However, similar to other baselines, failure still occurs in cases involving severe occlusions or ambiguous joint configurations.

Table 6 Qualitative comparison of pose estimation results across different models. The ✓ mark denotes successful estimation (anatomically valid and visually accurate), and ✗ denotes obvious failures.

Full size table

In many real-world scenarios, human pose estimation systems are deployed in environments where data quality cannot be guaranteed. Imperfect data resources, including low-resolution images, occluded joints, motion blur, sensor noise, or missing keypoints, can significantly hinder the performance of pose models. To evaluate how our proposed HSTPN+APRS framework performs under such conditions, we extended our analysis to focus on noisy and incomplete input settings. Our model is not immune to performance drops under severe occlusion or overlapping joints. However, compared to baselines, HSTPN exhibits improved stability and resilience due to three core components: its multi-scale feature fusion allows it to recover structural details from coarse and fine features, the attention mechanisms help suppress background noise and prioritize joint-relevant regions, and the Adaptive Pose Refinement Strategy (APRS) introduces temporal and anatomical constraints during post-prediction correction, which is particularly effective in recovering from ambiguous or missing keypoint cases. We observe that even in partially labeled or low-contrast samples (from the OCHuman dataset), APRS was able to iteratively refine predictions toward anatomically valid poses by enforcing domain priors such as limb symmetry and kinematic consistency. These observations suggest that our framework is inherently more robust to unperfect data than existing models, though we acknowledge that extreme data degradation still poses challenges. As part of future work, we plan to integrate synthetic noise augmentation and adversarial training to further enhance robustness in such scenarios.

To enhance interpretability under real-world conditions, Fig. 6 illustrates qualitative comparisons between HSTPN+APRS and two baseline methods (OpenPose and HRNet) across three representative challenges: occlusion, scale variation, and temporal inconsistency. In occlusion scenarios (top row), HSTPN+APRS produces more complete and anatomically accurate skeletons, effectively recovering missing joints. Under scale variation (middle row), the model demonstrates robustness by preserving consistent limb proportions, reflecting the benefits of the multi-scale feature fusion mechanism. In video sequences (bottom row), the predicted joint trajectories remain stable and continuous, attributed to the temporal consistency module and iterative refinement strategy. The spatial and temporal heatmaps further support these observations by showing high confidence and low jitter in keypoint predictions.

To further validate the spatial accuracy and practical robustness of keypoint predictions, Fig. 7 presents a direct visual comparison between the predicted keypoints and the ground truth (GT) annotations across a representative scenario. The proposed HSTPN+APRS method is marked with blue crosses, baseline predictions are shown as red triangles, and GT annotations are indicated by green circles. Red dashed lines connect the baseline predictions to their corresponding ground truth positions, highlighting localization errors. In contrast, the predictions from HSTPN+APRS are generally closer to the GT locations. Keypoints where the proposed method exhibits significantly higher accuracy—particularly at joint articulation points such as elbows and knees—are emphasized with yellow star markers. This visualization clearly demonstrates that HSTPN+APRS achieves superior alignment with the ground truth under complex visual conditions. The highlighted improvements further support the effectiveness of the proposed framework in delivering precise pose estimation results in real-world and interdisciplinary physics settings.

Conclusions and future work

This study explores the challenges and opportunities in human pose estimation, a critical task in computer vision with applications across robotics, augmented reality, sports analysis, and biomechanics. Traditional approaches have struggled to adapt to real-world conditions due to occlusions, scale variations, and temporal inconsistencies in video data. To address these issues, the researchers developed the Hierarchical Spatio-Temporal Pose Network (HSTPN), a deep learning framework that leverages multi-scale feature fusion and attention mechanisms. These innovations enable the model to capture both global contextual information and fine-grained pose details effectively. Complementing this, the Adaptive Pose Refinement Strategy (APRS) was introduced to refine key point locations iteratively using spatial, temporal, and domain-specific constraints. Experimental evaluations demonstrated that this combined approach outperforms state-of-the-art methods in accuracy, temporal coherence, and computational efficiency. The proposed solution is particularly well-suited for real-time applications, addressing the needs of interdisciplinary physics and other dynamic, real-world environments.

While the proposed HSTPN and APRS frameworks represent a significant advance, they have limitations. The reliance on attention mechanisms and multi-scale processing introduces computational overhead, which could limit their deployment in resource-constrained environments or edge devices. Future work should explore optimization techniques, such as model compression or hardware acceleration, to reduce computational demands without compromising performance. The model’s ability to generalize across diverse and unseen datasets remains uncertain. Current results are promising but largely dataset-specific. Expanding testing to include more diverse and challenging datasets, particularly in underexplored domains, will be essential. Future research should also investigate integrating self-supervised learning techniques to reduce dependence on labeled data, making the method more adaptable and scalable for broader interdisciplinary applications.

Data availability

The datasets generated and/or analysed during the current study are available in the PosePhysics, :https://github.com/etUHF1- SR/PosePhysics.git

References

Xu, Y., Zhang, J., Zhang, Q. & Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Neural Information Processing Systems (2022).
Zheng, C. et al. 3d human pose estimation with spatial and temporal transformers. IEEE International Conference on Computer Vision (2021).
Wang, G., Manhardt, F., Tombari, F. & Ji, X. Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. Computer Vision and Pattern Recognition (2021).
Yang, Z., Zeng, A., Yuan, C. & Li, Y. Effective whole-body pose estimation with two-stages distillation. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2023).
Rempe, D. et al. Humor: 3d human motion model for robust pose estimation. IEEE International Conference on Computer Vision (2021).
Wen, B., Yang, W., Kautz, J. & Birchfield, S. T. Foundationpose: Unified 6d pose estimation and tracking of novel objects. Computer Vision and Pattern Recognition (2023).
Shan, W. et al. Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. IEEE International Conference on Computer Vision (2023).
Sundermeyer, M. et al. Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023).
Kim, J.-W., Choi, J., Ha, E. & ho Choi, J. Human pose estimation using mediapipe pose and optimization method based on a humanoid model. Applied Sciences (2023).
He, Y., Huang, H., Fan, H., Chen, Q. & Sun, J. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. Computer Vision and Pattern Recognition (2021).
Labb’e, Y., Carpentier, J., Aubry, M. & Sivic, J. Cosypose: Consistent multi-view multi-object 6d pose estimation. European Conference on Computer Vision (2020).
Fang, H. et al. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
Li, Z., Liu, J., Zhang, Z., Xu, S. & Yan, Y. Cliff: Carrying location information in full frames into human pose and shape estimation. European Conference on Computer Vision (2022).
Martinelli, G., Diprima, F., Bisagno, N. & Conci, N. Ski pose estimation. The Star (2024).
Lin, Y.-C. et al. inerf: Inverting neural radiance fields for pose estimation. IEEE/RJS International Conference on Intelligent RObots and Systems (2020).
Martınez, G. H. Openpose: Whole-body pose estimation. Ph.D. thesis, Carnegie Mellon University Pittsburgh, PA, USA (2019).
Singh, G. et al. A deep learning approach for evaluating the efficacy and accuracy of posenet for posture detection. International Journal of System Assurance Engineering and Management 1–10 (2024).
Wu, H., Liang, C., Liu, M. & Wen, Z. Optimized hrnet for image semantic segmentation. Expert Syst. Appl. 174, 114532 (2021).
Article Google Scholar
Li, Y. et al. Tokenpose: Learning keypoint tokens for human pose estimation. IEEE International Conference on Computer Vision (2021).
Oliveira, F. A., Morgado, R., Hansen, A. & Rubi, J. Superdiffusive conduction: Ac conductivity with correlated noise. Physica A 357, 115–121 (2005).
Article ADS Google Scholar
Oliveira, F. A., Cordeiro, J. A., Chaves, A. S., Mello, B. A. & Xavier, I. M. Jr. Scaling transformation of random walk and generalized statistics. Physica A 295, 201–208 (2001).
Article ADS MathSciNet Google Scholar
Chen, T. et al. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE transactions on circuits and systems for video technology (Print) (2021).
Rong, Y., Shiratori, T. & Joo, H. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2021).
Moon, G., Yu, S.-I., Wen, H., Shiratori, T. & Lee, K. M. Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. European Conference on Computer Vision (2020).
Li, W., Liu, H., Tang, H., Wang, P. & Gool, L. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. Computer Vision and Pattern Recognition (2021).
Toshev, A. & Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1653–1660 (2014).
Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, 483–499 (Springer, 2016).
Sun, K., Xiao, B., Liu, D. & Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5693–5703 (2019).
Pavllo, D., Feichtenhofer, C., Grangier, D. & Auli, M. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7753–7762 (2019).
Zheng, H., Chen, Y., Wang, Q. & Luo, P. Poseformer: when transformer meets pose estimation. Advances in neural information processing systems (NeurIPS) (2021).
Peng, K. et al. Navigating open set scenarios for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence 38, 4487–4496 (2024).
Xie, J. et al. Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition. In Proceedings of the AAAI conference on artificial intelligence 38, 6225–6233 (2024).
Xu, Y. et al. Skeleton-based human action recognition with noisy labels. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4716–4723 (IEEE, 2024).
Peng, K., Roitberg, A., Yang, K., Zhang, J. & Stiefelhagen, R. Delving deep into one-shot skeleton-based action recognition with diverse occlusions. IEEE Trans. Multimedia 25, 1489–1504 (2023).
Article Google Scholar
Hansen, A. Editor’s challenge in interdisciplinary physics: what is interdisciplinary physics? (2024).
Su, Y. et al. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. Computer Vision and Pattern Recognition (2022).
Wang, G., Li, J., Tan, H. & Li, X. Fusion of full-field optical angiography images via gradient feature detection. Front. Phys. 12, 1397732 (2024).
Article Google Scholar
Xu, T. & Takano, W. Graph stacked hourglass networks for 3d human pose estimation. Computer Vision and Pattern Recognition (2021).
Liu, H. et al. Arhpe: Asymmetric relation-aware representation learning for head pose estimation in industrial human–computer interaction. IEEE Transactions on Industrial Informatics (2022).
Sun, J. et al. Onepose: One-shot object pose estimation without cad models. Computer Vision and Pattern Recognition (2022).
Rajabi-Ghaleh, S., Olyaeefar, B., Kheradmand, R. & Ahmadi-Kandjani, S. Image security using steganography and cryptography with sweeping computational ghost imaging. Front. Phys. 12, 1336485 (2024).
Article Google Scholar
Chen, W. et al. Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism. Computer Vision and Pattern Recognition (2021).
Zheng, C. et al. Deep learning-based human pose estimation: A survey. ACM Computing Surveys (2020).
Maji, D., Nagori, S., Mathew, M. & Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022).
Liu, H. et al. Mfdnet: Collaborative poses perception and matrix fisher distribution for head pose estimation. IEEE transactions on multimedia (2022).
Chen, H. et al. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. Computer Vision and Pattern Recognition (2022).
Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, 3686–3693 (2014).
Zhang, S.-H. et al. Pose2seg: Detection free human instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 889–898 (2019).
Li, J. et al. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10863–10872 (2019).
Ingwersen, C. K., Mikkelstrup, C. M., Jensen, J. N., Hannemose, M. R. & Dahl, A. B. Sportspose-a dynamic 3d sports pose dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5219–5228 (2023).
Fang, H.-S. et al. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 45, 7157–7173 (2022).
Article ADS Google Scholar
Aflalo, A., Bagon, S., Kashti, T. & Eldar, Y. Deepcut: Unsupervised segmentation using graph neural networks clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 32–41 (2023).
Riekert, M., Riekert, M. & Klein, A. Simple baseline machine learning text classifiers for small datasets. SN Comput. Sci. 2, 178 (2021).
Article Google Scholar
Liu, H., Liu, F., Fan, X. & Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 506, 158–167 (2022).
Article Google Scholar
Yang, B. et al. Vimpose: Human pose estimation based on vision mamba. In 2024 China Automation Congress (CAC), 2789–2794 (IEEE, 2024).
Li, S., Zhang, H., Ma, H., Feng, J. & Jiang, M. Efficient posenet with coarse to fine transformer. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5100–5104 (IEEE, 2024).
Wang, Y., Wang, Z., Li, M. & Yan, H. 3d human pose estimation with two-step mixed-training strategy. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 3332–3341 (2024).

Download references

Acknowledgements

This is a short text to acknowledge the contributions of specific colleagues, institutions, or agencies that aided the efforts of the authors.

Funding

Details of all funding sources should be provided, including grant numbers if applicable. Please ensure to add all necessary funding information, as after publication this is no longer possible.

Author information

Authors and Affiliations

Zhejiang Technical Institute of Economics, Zhejiang Technical Institute of Economics, Hangzhou City, 310000, Zhejiang Province, China
Li Zhiliang
Department of Physical Education, Zhejiang International Studies University, Hangzhou, 310023, Zhejiang Province, China
Li Zhuo

Authors

Li Zhiliang
View author publications
Search author on:PubMed Google Scholar
Li Zhuo
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization, LL; methodology, LL; software, LL; validation, LL; formal analysis, LL; investigation, LL; data curation, LZ; writing—original draft preparation, LZ; writing—review and editing, LZ; visualization, LZ; supervision, LZ; funding acquisition, LZ; All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Li Zhuo.

Ethics declarations

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhiliang, L., Zhuo, L. Deep learning-based approaches for human pose estimation in interdisciplinary physics applications. Sci Rep 15, 42883 (2025). https://doi.org/10.1038/s41598-025-26972-4

Download citation

Received: 20 April 2025
Accepted: 31 October 2025
Published: 24 November 2025
Version of record: 01 December 2025
DOI: https://doi.org/10.1038/s41598-025-26972-4

Subjects

Abstract

Similar content being viewed by others

Lightning Pose: improved animal pose estimation via semi-supervised learning, Bayesian ensembling and cloud-native open-source tools

Wearable sensor data-driven sports posture recognition using the ST-GCN spatio-temporal graph convolutional network

Deep neural networks and deep deterministic policy gradient for early ASD diagnosis and personalized intervention in children

Introduction

Related work

Deep learning for pose estimation

Physics-informed deep learning models

Multimodal approaches for motion analysis

Method

Overview

Preliminaries

Hierarchical spatio-temporal pose network (HSTPN)

Adaptive pose refinement strategy (APRS)

Challenge-oriented module design

Experimental setup

Dataset

Experimental details

Comparison with SOTA methods

Ablation study

Conclusions and future work

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links