Table 1 A comprehensive comparison with key related works across different tasks. Our framework is distinguished by its unique three-stream architecture, particularly the global contextual stream, tailored for robust 2D pose estimation in dynamic videos.

From: Learning spatio-temporal context for basketball action pose estimation with a multi-stream network

| Method | Primary task | Core architecture & novelty | Key datasets & primary metric |
| --- | --- | --- | --- |
| OpenPose | 2D pose estimation | Bottom-up approach using Part Affinity Fields (PAFs) for real-time multi-person detection | COCO, MPII (mAP) |
| AlphaPose | 2D pose estimation | Top-down approach with a symmetric spatial transformer network (SSTN) and parametric pose NMS | COCO, PoseTrack (mAP) |
| PoseFlow | 2D pose tracking | Links detected poses across frames using a pose-guided optical flow model | PoseTrack (mAP) |
| HRNet | 2D pose estimation | Maintains high-resolution feature representations throughout the network for precise keypoint localization | COCO, MPII (mAP) |
| MHAFormer | 3D pose estimation | Multi-transformer encoder combined with a diffusion model to generate and aggregate multiple 3D pose hypotheses | Human3.6M, MPI-INF-3DHP (MPJPE) |
| MLTFFPN | 3D pose estimation | Multi-level transformer with a feature frame padding network to capture longer temporal dependencies | Human3.6M, MPI-INF-3DHP (MPJPE) |
| TimeSformer | Video classification | Pure transformer architecture with divided space-time self-attention for efficient video processing | Kinetics, Something-Something V2 (Top-1 Acc.) |
| HyMAT | Video object detection | Hybrid multi-attention transformer (HyMAT) module to enhance relevant correlations in feature aggregation | ImageNet VID, UA-DETRAC (mAP) |
| ASTABSCF | Object tracking | Deep correlation filter with adaptive spatial regularization and target-aware background suppression | OTB, LaSOT (success/precision) |
| DSRVMRT | Object tracking | Multi-regularized correlation filter that leverages historical interval information to suppress response variation | OTB, LaSOT (success/precision) |
| HMATN | Audio-visual emotion recognition | Hybrid multi-attention network for fusing audio and visual modalities, preserving intra- and inter-modal relationships | AffWild2, AFEW-VA (CCC, F1) |
| Ours | 2D pose estimation (video) | Multi-stream network (spatial, temporal, contextual) with a hybrid fusion module and a staged training strategy; a unique global contextual stream resolves scene-level ambiguities | PoseTrack 2017/2018 (mAP) |

1. Significant values are in bold.