Table 1 A comprehensive comparison with key related works across different tasks. Our framework is distinguished by its unique three-stream architecture, particularly the global contextual stream, tailored for robust 2D pose estimation in dynamic videos.
From: Learning spatio-temporal context for basketball action pose estimation with a multi-stream network
Method | Primary task | Core architecture & novelty | Key datasets & primary metric |
---|---|---|---|
OpenPose | 2D pose estimation | Bottom-up approach using Part Affinity Fields (PAFs) for real-time multi-person detection | COCO, MPII (mAP) |
AlphaPose | 2D pose estimation | Top-down approach with a symmetric spatial transformer network (SSTN) and parametric pose NMS | COCO, PoseTrack (mAP) |
PoseFlow | 2D pose tracking | Links detected poses across frames using a pose-guided optical flow model | PoseTrack (mAP) |
HRNet | 2D pose estimation | Maintains high-resolution feature representations throughout the network for precise keypoint localization | COCO, MPII (mAP) |
MHAFormer | 3D pose estimation | Multi-transformer encoder combined with a diffusion model to generate and aggregate multiple 3D pose hypotheses | Human3.6M, MPI-INF-3DHP (MPJPE) |
MLTFFPN | 3D pose estimation | Multi-level transformer with a feature frame padding network to capture longer temporal dependencies | Human3.6M, MPI-INF-3DHP (MPJPE) |
TimeSformer | Video classification | Pure transformer architecture with divided space-time self-attention for efficient video processing | Kinetics, Something-Something V2 (Top-1 Acc.) |
HyMAT | Video object detection | Hybrid multi-attention transformer (HyMAT) module to enhance relevant correlations in feature aggregation | ImageNet VID, UA-DETRAC (mAP) |
ASTABSCF | Object tracking | Deep correlation filter with adaptive spatial regularization and target-aware background suppression | OTB, LaSOT (success/precision) |
DSRVMRT | Object tracking | Multi-regularized correlation filter that leverages historical interval information to suppress response variation | OTB, LaSOT (success/precision) |
HMATN | A/V emotion recognition | Hybrid multi-attention network for fusing audio and visual modalities, preserving intra- and inter-modal relationships | Aff-Wild2, AFEW-VA (CCC, F1) |
Ours | 2D pose estimation (video) | Multi-stream network (spatial, temporal, contextual) with a hybrid fusion module and a staged training strategy; the global contextual stream is unique to this work and resolves scene-level ambiguities | PoseTrack 2017/2018 (mAP) |