Table 1 A comprehensive comparison with key related works across different tasks. Our framework is distinguished by its unique three-stream architecture, particularly the global contextual stream, tailored for robust 2D pose estimation in dynamic videos.

From: Learning spatio-temporal context for basketball action pose estimation with a multi-stream network

| Method | Primary task | Core architecture & novelty | Key datasets & primary metric |
| --- | --- | --- | --- |
| OpenPose | 2D pose estimation | Bottom-up approach using Part Affinity Fields (PAFs) for real-time multi-person detection | COCO, MPII (mAP) |
| AlphaPose | 2D pose estimation | Top-down approach with a symmetric spatial transformer network (SSTN) and parametric pose NMS | COCO, PoseTrack (mAP) |
| PoseFlow | 2D pose tracking | Links detected poses across frames using a pose-guided optical flow model | PoseTrack (mAP) |
| HRNet | 2D pose estimation | Maintains high-resolution feature representations throughout the network for precise keypoint localization | COCO, MPII (mAP) |
| MHAFormer | 3D pose estimation | Multi-transformer encoder combined with a diffusion model to generate and aggregate multiple 3D pose hypotheses | Human3.6M, MPI-INF-3DHP (MPJPE) |
| MLTFFPN | 3D pose estimation | Multi-level transformer with a feature frame padding network to capture longer temporal dependencies | Human3.6M, MPI-INF-3DHP (MPJPE) |
| TimeSformer | Video classification | Pure transformer architecture with divided space-time self-attention for efficient video processing | Kinetics, Something-Something V2 (Top-1 Acc.) |
| HyMAT | Video object detection | Hybrid multi-attention transformer (HyMAT) module to enhance relevant correlations in feature aggregation | ImageNet VID, UA-DETRAC (mAP) |
| ASTABSCF | Object tracking | Deep correlation filter with adaptive spatial regularization and target-aware background suppression | OTB, LaSOT (success/precision) |
| DSRVMRT | Object tracking | Multi-regularized correlation filter that leverages historical interval information to suppress response variation | OTB, LaSOT (success/precision) |
| HMATN | Audio-visual emotion recognition | Hybrid multi-attention network for fusing audio and visual modalities, preserving intra- and inter-modal relationships | AffWild2, AFEW-VA (CCC, F1) |
| Ours | 2D pose estimation (video) | Multi-stream network (spatial, temporal, contextual) with a hybrid fusion module and a staged training strategy; a unique global contextual stream resolves scene-level ambiguities | PoseTrack 2017/2018 (mAP) |

1. Significant values are in bold.