Scientific Reports

Table 1 Comparison of human pose estimation related research.

From: ScaleFormer architecture for scale invariant human pose estimation with enhanced mixed features

Authors	Application scenario	Research content	Potential limitations
Liu et al.²⁴	General 2D pose estimation	Proposed spatially decoupled pose estimation model (SD-Pose), transforming keypoint localization into a classification problem, using pyramid adaptive feature extractors to generate keypoint weights	Adaptive keypoint weight generation mechanism does not consider feature distribution differences caused by scale changes, showing unstable performance in multi-scale scenarios
Wang et al.²²	General 2D pose estimation	Proposed Gated Region Refinement Pose Transformer (GRRPT), extracting more refined candidate regions through multi-resolution attention mechanisms	High computational complexity when processing high-resolution feature maps, relatively low efficiency in real-time application scenarios, limited adaptability to scale changes
Chi et al.²³	Occlusion scenario pose estimation	Designed Pose Relation Transformer (PORT), reconstructing occluded joints by capturing global and local contextual relationships between joints	Heavily dependent on initial pose estimation quality, unstable performance when target scale changes, lacking scale adaptation mechanisms
Zhou et al.¹⁹	General 2D pose estimation	Proposed direction-aware pose grammar model using multi-scale BiC3D modules to promote message passing between human joints	Complex model structure, large computational overhead, no specially designed scale invariance mechanism, difficult to process inputs of different scales
Li et al.²⁵	Ego-perspective pose estimation	Designed an egocentric body pose estimation method (EgoEgo) that uses head pose estimation as an intermediate step, applying SLAM and learning methods to estimate head movement	Mainly designed for specific perspectives; insufficient robustness to scale changes; difficult to generalize to general pose estimation scenarios
Martinelli et al.²⁶	Skiing pose estimation	Developed specific pose estimation framework for skiing, using human pose priors to estimate positions of skis and ski poles	Only applicable to specific domains, lacking consideration for detection of skiing equipment at different scales, limited model generalization capability
Liao et al.²⁷	Animal pose estimation	Proposed THANet, a cross-domain method transferring human pose estimation knowledge to animal pose estimation	Feature extraction does not specifically consider scale changes; scale adaptation is more challenging because animal body sizes vary significantly
Zhang et al.³⁰	General 2D pose estimation	Proposed attention-enhanced HRNet architecture, integrating self-attention mechanisms to improve keypoint prediction accuracy	Attention mechanism does not consider feature expression consistency at different scales, large accuracy fluctuations when target distance changes
Chen et al.³¹	Lightweight pose estimation	Designed Efficient Aggregation Network (EANet), using efficient channel aggregation and efficient spatial aggregation units to reduce computational complexity	Lightweight design limits model capacity, difficult to capture complex scale change features, insufficient balance between accuracy and efficiency
Wang et al.³²	Dense scene pose estimation	Proposed DecenterNet, adopting decentralized pose representation and introducing decoupled pose evaluation mechanisms	Addresses occlusion in dense scenes but has no feature expression mechanism designed specifically for human targets of different scales
Lou et al.³³	Lightweight pose estimation	Developed LAR-Pose, combining lightweight high-resolution backbone network and dynamic residual refinement network, adopting adaptive regression loss	Adaptive regression loss does not explicitly consider the impact of scale changes on residual distribution, stability in multi-scale scenarios needs improvement
Amadi et al.³⁴	Semi-supervised 3D pose estimation	Proposed pose consistency loss function, combining biomechanical pose regularization and multi-view pose consistency objective function	Highly dependent on camera parameters, does not specifically solve pose consistency problems at different scales, limited generalization capability
Cheng et al.³⁵	Video 3D pose estimation	Designed MixPose, using mixed encoders to fuse spatiotemporal information and introducing attention modules to enhance global perception	Primarily focused on temporal information modeling, insufficient processing of spatial scale changes, difficult to adapt to dynamically changing target distances
Li et al.³⁶	Dense scene pose estimation	Proposed InferTrans, a Transformer architecture based on hierarchical structure fusion, organizing joints and limbs through tree structures	Focuses on structural information but does not consider how structural features are preserved at different scales; insufficient scale adaptability
Bai et al.³⁷	Occlusion scene pose estimation	Developed CONet, estimating poses of occluders and occludees using divide-and-conquer strategies, introducing interference point loss to improve anti-interference ability	Attention mechanism mainly designed for occlusion problems, does not specifically address feature distribution difference problems caused by scale changes
Li et al.³⁸	Monocular 3D pose estimation	Proposed TSwinPose, combining JointFlow to encode human joint movement, designing temporal SwinUnet structure to model multi-scale spatiotemporal relationships	Focused on multi-scale modeling in the time domain; feature inconsistency caused by spatial-domain scale changes is not fully resolved
Dong et al.³⁹	Coal mine scene pose estimation	Designed YH-Pose framework, using visual evidence from adjacent frames to assist current frame pose estimation, adopting temporal road modules and spatial road modules	Mainly designed for low-quality videos, insufficient consideration for scale changes in complex scenes, reduced accuracy for multi-scale targets
