Fig. 2: Overall architecture of BSP DCA-ViViT.

The architecture consists of four modules: (a) a pretrained FaceMesh detector for facial landmark extraction, (b) a ViViT model for spatial feature extraction, (c) the dual cross attention model, and (d) the dual cross attention layer. Abbreviations: PE, positional embedding; DCAL, dual cross attention layer; MLP, multilayer perceptron; SA, self attention; LN, layer norm; CA, cross attention; FF, feed-forward layer.
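
To make the components named in panel d concrete, the following is a minimal NumPy sketch of one possible dual cross attention layer built from SA, LN, CA, and FF blocks. The caption does not specify the wiring, so everything here is an assumption made for illustration: the pre-norm residual ordering (LN, then SA, CA, FF), single-head attention without learned LN scale or bias, the two generic token streams `x` and `y`, and all function and parameter names (`dual_cross_attention_layer`, `init_params`, and so on) are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over the feature dimension (no learned scale/bias in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, w):
    # Single-head scaled dot-product attention.
    # SA when q_in and kv_in are the same stream, CA when they differ.
    q, k, v = q_in @ w["q"], kv_in @ w["k"], kv_in @ w["v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ w["o"]

def feed_forward(x, w):
    # Two-layer FF block with ReLU (assumed activation).
    return np.maximum(x @ w["w1"], 0.0) @ w["w2"]

def init_params(rng, d, hidden):
    # Hypothetical parameter layout: separate SA, CA, FF weights per stream.
    def attn():
        return {n: rng.normal(0.0, d ** -0.5, (d, d)) for n in ("q", "k", "v", "o")}
    def stream():
        return {
            "sa": attn(),
            "ca": attn(),
            "ff": {
                "w1": rng.normal(0.0, d ** -0.5, (d, hidden)),
                "w2": rng.normal(0.0, hidden ** -0.5, (hidden, d)),
            },
        }
    return {"x": stream(), "y": stream()}

def dual_cross_attention_layer(x, y, params):
    # 1) SA within each stream, with residual connections.
    nx, ny = layer_norm(x), layer_norm(y)
    x = x + attention(nx, nx, params["x"]["sa"])
    y = y + attention(ny, ny, params["y"]["sa"])
    # 2) CA in both directions: each stream queries the other ("dual").
    nx, ny = layer_norm(x), layer_norm(y)
    x_ca = x + attention(nx, ny, params["x"]["ca"])
    y_ca = y + attention(ny, nx, params["y"]["ca"])
    # 3) FF on each stream, with residual connections.
    x_out = x_ca + feed_forward(layer_norm(x_ca), params["x"]["ff"])
    y_out = y_ca + feed_forward(layer_norm(y_ca), params["y"]["ff"])
    return x_out, y_out

rng = np.random.default_rng(0)
params = init_params(rng, d=8, hidden=16)
x_tokens = rng.normal(size=(5, 8))  # e.g. 5 tokens from one stream
y_tokens = rng.normal(size=(7, 8))  # e.g. 7 tokens from the other stream
x_out, y_out = dual_cross_attention_layer(x_tokens, y_tokens, params)
```

Under these assumptions each stream keeps its own token count and feature width, so several such layers can be stacked, and the two streams may have different sequence lengths because cross attention only requires a shared feature dimension.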