Fig. 5

Illustration of the SegNeXt attention. Here, d, \(k_1 \times k_2\) denotes a depth-wise convolution (d) with a kernel size of \(k_1 \times k_2\). We extract multi-scale features using convolutions and then use them as attention weights to reweigh the input of MSCA (Multi-Scale Convolutional Attention).
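The sketch below illustrates the reweighing described in the caption: multi-scale features are extracted with depth-wise convolutions and then multiplied element-wise with the module input. The specific kernel sizes (a 5×5 depth-wise convolution followed by 7, 11, and 21 strip-convolution branches) are not given in the caption and are assumptions based on the published SegNeXt design; treat this as a minimal PyTorch sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn


class MSCA(nn.Module):
    """Minimal sketch of a Multi-Scale Convolutional Attention block.

    Kernel sizes (5x5 plus 7/11/21 strip convolutions) are assumed from the
    SegNeXt paper; adjust them if your variant differs.
    """

    def __init__(self, channels: int):
        super().__init__()
        # d, 5x5: depth-wise convolution aggregating local context
        self.conv0 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Multi-scale branches: pairs of depth-wise strip convolutions
        # (1xk followed by kx1) approximate large kxk kernels cheaply.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11, 21)
        )
        # 1x1 convolution mixing channels to form the attention map
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv0(x)
        # Sum the multi-scale features extracted by the strip-convolution branches
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.conv1(attn)
        # Use the multi-scale features as attention weights to reweigh the input
        return attn * x


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(MSCA(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```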