Table 1 Comparison of related research on human pose estimation.

From: ScaleFormer architecture for scale invariant human pose estimation with enhanced mixed features

| Authors | Application scenario | Research content | Potential limitations |
| --- | --- | --- | --- |
| Liu et al.24 | General 2D pose estimation | Proposed spatially decoupled pose estimation model (SD-Pose), transforming keypoint localization into a classification problem, using pyramid adaptive feature extractors to generate keypoint weights | Adaptive keypoint weight generation mechanism does not consider feature distribution differences caused by scale changes, showing unstable performance in multi-scale scenarios |
| Wang et al.22 | General 2D pose estimation | Proposed Gated Region Refinement Pose Transformer (GRRPT), extracting more refined candidate regions through multi-resolution attention mechanisms | High computational complexity when processing high-resolution feature maps, relatively low efficiency in real-time application scenarios, limited adaptability to scale changes |
| Chi et al.23 | Occlusion scenario pose estimation | Designed Pose Relation Transformer (PORT), reconstructing occluded joints by capturing global and local contextual relationships between joints | Heavily dependent on initial pose estimation quality, unstable performance when target scale changes, lacking scale adaptation mechanisms |
| Zhou et al.19 | General 2D pose estimation | Proposed direction-aware pose grammar model using multi-scale BiC3D modules to promote message passing between human joints | Complex model structure, large computational overhead, no specially designed scale invariance mechanism, difficult to process inputs of different scales |
| Li et al.25 | Ego-perspective pose estimation | Designed self-body pose estimation method (EgoEgo) mediated by head pose estimation, using SLAM and learning methods to estimate head movement | Mainly designed for specific perspectives, insufficient robustness to scale changes, difficult to generalize to general pose estimation scenarios |
| Martinelli et al.26 | Skiing pose estimation | Developed specific pose estimation framework for skiing, using human pose priors to estimate positions of skis and ski poles | Only applicable to specific domains, lacking consideration for detection of skiing equipment at different scales, limited model generalization capability |
| Liao et al.27 | Animal pose estimation | Proposed THANet, a cross-domain method transferring human pose estimation knowledge to animal pose estimation | Feature extraction does not specifically consider scale change issues, larger scale adaptation challenges due to significant variations in animal body sizes |
| Zhang et al.30 | General 2D pose estimation | Proposed attention-enhanced HRNet architecture, integrating self-attention mechanisms to improve keypoint prediction accuracy | Attention mechanism does not consider feature expression consistency at different scales, large accuracy fluctuations when target distance changes |
| Chen et al.31 | Lightweight pose estimation | Designed Efficient Aggregation Network (EANet), using efficient channel aggregation and efficient spatial aggregation units to reduce computational complexity | Lightweight design limits model capacity, difficult to capture complex scale change features, insufficient balance between accuracy and efficiency |
| Wang et al.32 | Dense scene pose estimation | Proposed DecenterNet, adopting decentralized pose representation and introducing decoupled pose evaluation mechanisms | Although solving occlusion problems in dense scenes, no specifically designed feature expression mechanism for human targets of different scales |
| Lou et al.33 | Lightweight pose estimation | Developed LAR-Pose, combining lightweight high-resolution backbone network and dynamic residual refinement network, adopting adaptive regression loss | Adaptive regression loss does not explicitly consider the impact of scale changes on residual distribution, stability in multi-scale scenarios needs improvement |
| Amadi et al.34 | Semi-supervised 3D pose estimation | Proposed pose consistency loss function, combining biomechanical pose regularization and multi-view pose consistency objective function | Highly dependent on camera parameters, does not specifically solve pose consistency problems at different scales, limited generalization capability |
| Cheng et al.35 | Video 3D pose estimation | Designed MixPose, using mixed encoders to fuse spatiotemporal information and introducing attention modules to enhance global perception | Primarily focused on temporal information modeling, insufficient processing of spatial scale changes, difficult to adapt to dynamically changing target distances |
| Li et al.36 | Dense scene pose estimation | Proposed InferTrans, a Transformer architecture based on hierarchical structure fusion, organizing joints and limbs through tree structures | Although the model focuses on structural information, it does not consider structural feature preservation problems at different scales, insufficient scale adaptability |
| Bai et al.37 | Occlusion scene pose estimation | Developed CONet, estimating poses of occluders and occludees using divide-and-conquer strategies, introducing interference point loss to improve anti-interference ability | Attention mechanism mainly designed for occlusion problems, does not specifically address feature distribution difference problems caused by scale changes |
| Li et al.38 | Monocular 3D pose estimation | Proposed TSwinPose, combining JointFlow to encode human joint movement, designing temporal SwinUnet structure to model multi-scale spatiotemporal relationships | Focused on multi-scale modeling in the time domain, feature inconsistency problems caused by spatial domain scale changes not fully resolved |
| Dong et al.39 | Coal mine scene pose estimation | Designed YH-Pose framework, using visual evidence from adjacent frames to assist current frame pose estimation, adopting temporal road modules and spatial road modules | Mainly designed for low-quality videos, insufficient consideration for scale changes in complex scenes, reduced accuracy for multi-scale targets |