Table 1 Detailed architecture of the encoder.

Encoder
Block	Input	Input size	Output size	Channels	Elements
Conv1	RGB	H × W	H × W	3 → 16	3 × 3, stride 1
Layer1	RGB	H × W	H/2 × W/2	3 → 64	7 × 7, stride 2
Layer2	F (Layer1)	H/2 × W/2	H/4 × W/4	64 → 256	3 × 3 max pool, stride 2
Layer2	F (Layer1)	H/2 × W/2	H/4 × W/4	64 → 256	\(\left[ \begin{array}{cc} \begin{array}{c} 1 \times 1,\ 128 \\ 3 \times 3,\ 128 \\ 1 \times 1,\ 256 \end{array} & C=32 \end{array} \right] \times 3\)
Layer3	F (Layer2)	H/4 × W/4	H/8 × W/8	256 → 512	\(\left[ \begin{array}{cc} \begin{array}{c} 1 \times 1,\ 256 \\ 3 \times 3,\ 256 \\ 1 \times 1,\ 512 \end{array} & C=32 \end{array} \right] \times 4\)
Layer4	F (Layer3)	H/8 × W/8	H/16 × W/16	512 → 1024	\(\left[ \begin{array}{cc} \begin{array}{c} 1 \times 1,\ 512 \\ 3 \times 3,\ 512 \\ 1 \times 1,\ 1024 \end{array} & C=32 \end{array} \right] \times 23\)
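
The size and channel bookkeeping in Table 1 can be checked with a short shape trace. The following is a minimal sketch, not the authors' implementation: the names `Stage`, `CONV1`, `ENCODER_STAGES`, and `trace_shapes` are hypothetical, H and W are assumed to be divisible by 16, and Conv1 is assumed to be a separate full-resolution branch because both it and Layer1 take the RGB image as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    in_channels: int
    out_channels: int
    downsample: int  # factor by which the stage divides H and W


# Conv1 row of Table 1: 3 x 3, stride 1, so the resolution is unchanged.
# Assumption: it is a branch parallel to Layer1, since both read RGB.
CONV1 = Stage("Conv1", 3, 16, 1)

# Layer1 to Layer4 rows of Table 1. The two Layer2 rows (max pool followed
# by the three bottleneck blocks) are merged into one stage here.
ENCODER_STAGES = [
    Stage("Layer1", 3, 64, 2),
    Stage("Layer2", 64, 256, 2),
    Stage("Layer3", 256, 512, 2),
    Stage("Layer4", 512, 1024, 2),
]


def trace_shapes(height, width):
    """Return (name, channels, height, width) for each row of Table 1."""
    if height % 16 or width % 16:
        # Assumption of this sketch: the table's H/16 x W/16 is exact.
        raise ValueError("height and width must be divisible by 16")

    shapes = [(CONV1.name, CONV1.out_channels, height, width)]
    channels = 3  # RGB input
    for stage in ENCODER_STAGES:
        if stage.in_channels != channels:
            raise ValueError(f"channel mismatch entering {stage.name}")
        height //= stage.downsample
        width //= stage.downsample
        channels = stage.out_channels
        shapes.append((stage.name, channels, height, width))
    return shapes


if __name__ == "__main__":
    for name, c, h, w in trace_shapes(256, 512):
        print(f"{name}: {c} channels, {h} x {w}")
```

For a 256 × 512 input, the last entry corresponds to the Layer4 row: 1024 channels at H/16 × W/16.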
