Figure 5
From: An efficient self-attention network for skeleton-based action recognition

A spatial self-attention block. \(C\) is the number of channels, \(T\) is the number of frames, and \(V\) is the number of human joints. \(\theta\) and \(g\) denote \(1 \times 1\) convolutions, and \(3 \times 1\) denotes a \(3 \times 1\) convolution.
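
To make the tensor shapes in the caption concrete, the following is a minimal NumPy sketch of one possible spatial self-attention block over a \(C \times T \times V\) input. The caption alone does not fix the wiring, so several choices here are assumptions rather than the authors' design: the attention map is computed per frame from the \(\theta\) embedding with itself, it is normalized with a softmax over joints, the \(3 \times 1\) convolution is applied along the temporal axis after attention, and a residual connection is added. The function names and the weight names `w_theta`, `w_g`, and `w_t` are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution: a per-position linear map over channels.
    x: (C_in, T, V), w: (C_out, C_in) -> (C_out, T, V)."""
    return np.einsum("oc,ctv->otv", w, x)


def conv3x1(x, w):
    """3x1 convolution: kernel spans 3 frames and 1 joint (assumed temporal).
    x: (C_in, T, V), w: (C_out, C_in, 3) -> (C_out, T, V), zero padded in T."""
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (1, 1), (0, 0)))
    return sum(
        np.einsum("oc,ctv->otv", w[:, :, k], xp[:, k:k + T, :]) for k in range(3)
    )


def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(x, w_theta, w_g, w_t):
    """Assumed wiring: theta embeds joints, g produces values,
    a V x V attention map mixes joints within each frame."""
    q = conv1x1(x, w_theta)                      # (C_e, T, V)
    v = conv1x1(x, w_g)                          # (C, T, V)
    # Joint-to-joint similarity per frame, scaled by the embedding size.
    scores = np.einsum("ctu,ctv->tuv", q, q) / np.sqrt(q.shape[0])
    attn = softmax(scores, axis=-1)              # (T, V, V), rows sum to 1
    y = np.einsum("tuv,ctv->ctu", attn, v)       # (C, T, V)
    # Assumed: 3x1 convolution after attention, plus a residual connection.
    return x + conv3x1(y, w_t), attn


# Toy example: C=4 channels, T=5 frames, V=3 joints, embedding size 2.
rng = np.random.default_rng(0)
C, T, V, C_e = 4, 5, 3, 2
x = rng.standard_normal((C, T, V))
out, attn = spatial_self_attention(
    x,
    rng.standard_normal((C_e, C)),
    rng.standard_normal((C, C)),
    rng.standard_normal((C, C, 3)),
)
print(out.shape, attn.shape)
```

In this sketch the attention map has shape \(T \times V \times V\), one joint-to-joint matrix per frame, which is what makes the block "spatial": joints are mixed within a frame, while the \(3 \times 1\) convolution is the only step that mixes information across neighboring frames.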