Fig. 2

Structural diagram of the modified ResNet34 encoder. The input layer is modified to accept 6 channels (3 channels × 2 temporal images), and the Conv1 weights are initialized by averaging the pretrained RGB weights. The match channel layer converts the 256-channel features to 320 channels to align with PVT-v2, and attention gates fuse features from the ResNet34 and PVT-v2 streams.
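
The input modification and the match channel layer described in the caption can be illustrated with a minimal PyTorch sketch. It rests on several assumptions that the caption does not confirm: "averaging" is read as taking the mean of the pretrained kernel over the RGB axis and copying it into each of the 6 input channels, the match channel layer is modelled as a 1x1 convolution, and a random tensor stands in for the pretrained weights. The helper name `expand_conv1_weights` is hypothetical, and the attention gates are omitted because their structure is not specified here.

```python
import torch
import torch.nn as nn


def expand_conv1_weights(rgb_weight: torch.Tensor, in_channels: int = 6) -> torch.Tensor:
    """Build a conv1 kernel for `in_channels` inputs from a 3-channel kernel.

    Assumption: "averaging" means taking the mean over the RGB axis and
    copying that mean kernel into every new input channel.
    """
    mean_kernel = rgb_weight.mean(dim=1, keepdim=True)       # (out, 1, k, k)
    return mean_kernel.repeat(1, in_channels, 1, 1).clone()  # (out, in_channels, k, k)


# Stand-in for the pretrained ResNet34 conv1 weight, shape (64, 3, 7, 7).
torch.manual_seed(0)
pretrained_rgb = torch.randn(64, 3, 7, 7)

# 6-channel stem with the same geometry as the standard ResNet34 conv1.
conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    conv1.weight.copy_(expand_conv1_weights(pretrained_rgb, in_channels=6))

# Assumed form of the match channel layer: a 1x1 convolution, 256 -> 320 channels.
match_channels = nn.Conv2d(256, 320, kernel_size=1)

# Two RGB images stacked along the channel axis (batch of 1, 64x64 pixels).
bitemporal = torch.randn(1, 6, 64, 64)
stem_out = conv1(bitemporal)

# Dummy 256-channel feature map passed through the match channel layer.
matched = match_channels(torch.randn(1, 256, 8, 8))
```

Under this reading, every input channel starts from the same averaged kernel, so the two temporal images are treated symmetrically at initialization. Because the summed response now spans 6 channels instead of 3, one might also rescale the copied weights (for example by 3/6) to keep the activation magnitude close to that of the original stem; whether this was done is not stated in the caption.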