Figure 3

From: Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion

Implementation details. First, the encoder generates four feature maps F1-F4 through progressive downsampling, which are flattened into tokens in the reshaping step. In the mapping step, these tokens are projected to a higher dimension and then restored to three-dimensional feature maps by concatenation, so that subsequent convolutions can resize them conveniently. In the resampling step, the feature maps are upsampled to a common shape while preserving both shallow and deep semantic information. Finally, multi-scale feature fusion is carried out in the fusion step. In the encoder, we employ Linear SRA, which reduces time complexity through spatial reduction. On top of the frozen Linear SRA weights, we fine-tune the encoder with trainable weight matrices whose rank is lower than that of the original SRA weights (for example, a weight of shape (196, 768) is decomposed into two trainable matrices of sizes 196×1 and 1×768, while the original weights are preserved through a residual connection in a manner similar to ResNet). In the MLP, we replace the activation function with the Gaussian Error Linear Unit (GELU) to enhance the model's robustness to noise and data biases.
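The low-rank fine-tuning of the frozen Linear SRA projection can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden illustration, not the authors' code: the class name LowRankAdapter, the initialization choices, and the exact residual form are hypothetical; only the (196, 768) weight shape and the rank-1 factors 196×1 and 1×768 come from the caption.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Sketch: a frozen linear projection (standing in for the Linear SRA
    projection) augmented with a trainable low-rank update, added residually."""
    def __init__(self, in_features=768, out_features=196, rank=1):
        super().__init__()
        # Frozen pretrained weight of shape (out_features, in_features), e.g. (196, 768).
        self.frozen = nn.Linear(in_features, out_features, bias=False)
        self.frozen.weight.requires_grad_(False)
        # Trainable low-rank factors: (rank, in_features) and (out_features, rank),
        # e.g. 1x768 and 196x1 for a (196, 768) weight.
        self.down = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.up = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Frozen path plus the low-rank correction, preserving the original
        # weights through a residual-style addition.
        return self.frozen(x) + x @ self.down.t() @ self.up.t()

# Usage: only the low-rank factors receive gradients during fine-tuning.
layer = LowRankAdapter()
x = torch.randn(4, 768)
y = layer(x)  # shape (4, 196)
```

Because the product of the factors is zero at initialization, fine-tuning starts from the pretrained behaviour and only gradually learns a correction, which is what makes this kind of low-rank adaptation cheap relative to updating the full (196, 768) weight.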