Table 1. Comparison of related research on human pose estimation.
From: ScaleFormer architecture for scale invariant human pose estimation with enhanced mixed features
| Authors | Application scenario | Research content | Potential limitations |
|---|---|---|---|
| Liu et al. [24] | General 2D pose estimation | Proposed a spatially decoupled pose estimation model (SD-Pose) that recasts keypoint localization as a classification problem and uses pyramid adaptive feature extractors to generate keypoint weights | The adaptive keypoint weight generation does not account for feature distribution differences caused by scale changes, so performance is unstable in multi-scale scenarios |
| Wang et al. [22] | General 2D pose estimation | Proposed the Gated Region Refinement Pose Transformer (GRRPT), which extracts more refined candidate regions through multi-resolution attention mechanisms | High computational complexity on high-resolution feature maps; relatively low efficiency in real-time applications; limited adaptability to scale changes |
| Chi et al. [23] | Occlusion scenario pose estimation | Designed the Pose Relation Transformer (PORT), which reconstructs occluded joints by capturing global and local contextual relationships between joints | Heavily dependent on the quality of the initial pose estimate; unstable when the target scale changes; no scale adaptation mechanism |
| Zhou et al. [19] | General 2D pose estimation | Proposed a direction-aware pose grammar model that uses multi-scale BiC3D modules to promote message passing between human joints | Complex model structure and large computational overhead; no dedicated scale invariance mechanism, so inputs at different scales are difficult to process |
| Li et al. [25] | Egocentric pose estimation | Designed EgoEgo, an ego-body pose estimation method mediated by head pose estimation, which uses SLAM and learning-based methods to estimate head movement | Designed mainly for a specific viewpoint; insufficient robustness to scale changes; difficult to generalize to general pose estimation scenarios |
| Martinelli et al. [26] | Skiing pose estimation | Developed a skiing-specific pose estimation framework that uses human pose priors to estimate the positions of skis and ski poles | Applicable only to a specific domain; does not consider detection of skiing equipment at different scales; limited generalization capability |
| Liao et al. [27] | Animal pose estimation | Proposed THANet, a cross-domain method that transfers human pose estimation knowledge to animal pose estimation | Feature extraction does not specifically address scale changes, and the large variation in animal body sizes makes scale adaptation more challenging |
| Zhang et al. [30] | General 2D pose estimation | Proposed an attention-enhanced HRNet architecture that integrates self-attention mechanisms to improve keypoint prediction accuracy | The attention mechanism does not consider the consistency of feature representations across scales; accuracy fluctuates considerably when the target distance changes |
| Chen et al. [31] | Lightweight pose estimation | Designed the Efficient Aggregation Network (EANet), which uses efficient channel aggregation and efficient spatial aggregation units to reduce computational complexity | The lightweight design limits model capacity, making complex scale-change features difficult to capture; the balance between accuracy and efficiency is insufficient |
| Wang et al. [32] | Dense scene pose estimation | Proposed DecenterNet, which adopts a decentralized pose representation and introduces decoupled pose evaluation mechanisms | Addresses occlusion in dense scenes but has no feature representation mechanism designed specifically for human targets at different scales |
| Lou et al. [33] | Lightweight pose estimation | Developed LAR-Pose, which combines a lightweight high-resolution backbone network with a dynamic residual refinement network and adopts an adaptive regression loss | The adaptive regression loss does not explicitly account for the impact of scale changes on the residual distribution; stability in multi-scale scenarios needs improvement |
| Amadi et al. [34] | Semi-supervised 3D pose estimation | Proposed a pose consistency loss function that combines biomechanical pose regularization with a multi-view pose consistency objective | Highly dependent on camera parameters; does not specifically address pose consistency at different scales; limited generalization capability |
| Cheng et al. [35] | Video 3D pose estimation | Designed MixPose, which uses mixed encoders to fuse spatiotemporal information and introduces attention modules to enhance global perception | Focuses primarily on temporal information modeling; handles spatial scale changes insufficiently; difficult to adapt to dynamically changing target distances |
| Li et al. [36] | Dense scene pose estimation | Proposed InferTrans, a Transformer architecture based on hierarchical structure fusion that organizes joints and limbs through tree structures | Focuses on structural information but does not consider how structural features are preserved at different scales; insufficient scale adaptability |
| Bai et al. [37] | Occlusion scenario pose estimation | Developed CONet, which estimates the poses of occluders and occludees with a divide-and-conquer strategy and introduces an interference point loss to improve anti-interference ability | The attention mechanism is designed mainly for occlusion and does not specifically address feature distribution differences caused by scale changes |
| Li et al. [38] | Monocular 3D pose estimation | Proposed TSwinPose, which uses JointFlow to encode human joint movement and a temporal SwinUnet structure to model multi-scale spatiotemporal relationships | Focuses on multi-scale modeling in the time domain; feature inconsistency caused by spatial scale changes is not fully resolved |
| Dong et al. [39] | Coal mine scene pose estimation | Designed the YH-Pose framework, which uses visual evidence from adjacent frames to assist pose estimation in the current frame and adopts temporal road modules and spatial road modules | Designed mainly for low-quality videos; gives insufficient consideration to scale changes in complex scenes; reduced accuracy for multi-scale targets |