Fig. 2
From: Multimodal fusion transformer network for multispectral pedestrian detection in low-light condition

Multimodal fusion backbone framework. \(R_i\) and \(T_i\) denote the RGB and thermal feature maps after convolution, respectively, and \(\theta_i\) denotes the convolution module. MFT is our proposed multimodal feature fusion module; DMFF is the introduced bimodal feature fusion module.