Fig. 10: Overview of CMSA.

The three feature tensors are concatenated along the channel dimension to form an integrated cross-modal feature representation \({X}_{fused}\). The output features \({F}_{sp}\) and \({F}_{st}\) of the two submodules are then fused by concatenation and convolution.
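The data flow described above can be illustrated with a minimal NumPy sketch. It is not the implementation of CMSA: the channel counts, the spatial size, the 1x1 kernel, and the two placeholder submodules (each reduced here to a single pointwise convolution) are assumptions made only to show how the shapes change. The helper names `concat_channels` and `conv1x1` are hypothetical.

```python
import numpy as np


def concat_channels(*feats):
    """Concatenate feature maps of shape (C_i, H, W) along the channel axis."""
    return np.concatenate(feats, axis=0)


def conv1x1(x, weight):
    """Pointwise (1x1) convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


rng = np.random.default_rng(0)
H, W = 8, 8  # assumed spatial size

# Three input feature tensors (assumed to have 16 channels each).
f_a = rng.standard_normal((16, H, W))
f_b = rng.standard_normal((16, H, W))
f_c = rng.standard_normal((16, H, W))

# Channel-wise concatenation gives X_fused with 16 + 16 + 16 = 48 channels.
x_fused = concat_channels(f_a, f_b, f_c)

# Placeholder submodules: stand-ins that produce F_sp and F_st from X_fused.
w_sp = rng.standard_normal((32, 48))
w_st = rng.standard_normal((32, 48))
f_sp = conv1x1(x_fused, w_sp)
f_st = conv1x1(x_fused, w_st)

# Fuse the two outputs: concatenate (64 channels), then convolve back to 48.
w_out = rng.standard_normal((48, 64))
f_out = conv1x1(concat_channels(f_sp, f_st), w_out)

print(x_fused.shape, f_sp.shape, f_st.shape, f_out.shape)
```

Concatenation only stacks channels, so the mixing of information between \({F}_{sp}\) and \({F}_{st}\) happens in the convolution that follows; a 1x1 kernel is the smallest choice that does this, and a larger kernel would additionally mix neighbouring spatial positions.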