Extended Data Fig. 2: Detailed architecture design of the proposed geometry-aware, multimodal, modular, interpretable, and self-supervised ego-motion estimation method.
From: Deep learning-based robust positioning for all-weather autonomous driving

The modules consisting of encoder and decoder networks are based on the UNet architecture with skip connections. Feature extractors with an encoder are based on the ResNet18 network (ref. 17), visualized on a sample input. FC denotes fully connected layers. The pose fusion network is a multilayer perceptron. As part of the spatial transformer module, the inverse warp algorithm re-uses the input target frames to calculate the reconstruction loss. The camera input can be set to contain more frames than the range sensor owing to the camera's higher frame rate. The fused pose is the final output; it is optimized in a self-supervised manner, without ground truth, jointly with the intermediate pose and depth predictions.
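The following is a minimal sketch (not the authors' code) of the inverse-warp step that underlies such a self-supervised photometric reconstruction loss. It assumes a pinhole camera model with intrinsics K, a predicted depth map for the target frame, and a predicted relative pose given as a 3x4 [R|t] matrix; all function and tensor names are illustrative, and the L1 photometric penalty stands in for whatever loss the paper actually uses.

```python
# Hypothetical sketch of inverse warping for a photometric reconstruction
# loss; names and the exact loss are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, pose, K):
    """Reconstruct the target frame by sampling the source frame.

    src_img:   (B, 3, H, W) source image
    tgt_depth: (B, 1, H, W) predicted depth of the target frame
    pose:      (B, 3, 4) predicted relative pose [R|t], target -> source
    K:         (B, 3, 3) camera intrinsics
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, -1, -1)

    # Back-project target pixels to 3D camera points: X = D * K^-1 * p.
    cam_pts = torch.linalg.inv(K) @ pix * tgt_depth.reshape(B, 1, -1)

    # Transform into the source camera frame and project with K.
    R, t = pose[:, :, :3], pose[:, :, 3:]
    src_pts = K @ (R @ cam_pts + t)
    src_pix = src_pts[:, :2] / src_pts[:, 2:].clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] for grid_sample and warp.
    gx = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    gy = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="zeros", align_corners=True)

def reconstruction_loss(src_img, tgt_img, tgt_depth, pose, K):
    # Compare the warped source image with the real target frame (L1 here).
    warped = inverse_warp(src_img, tgt_depth, pose, K)
    return (warped - tgt_img).abs().mean()
```

Because the warped image depends differentiably on both the predicted depth and the predicted pose, minimizing this reconstruction loss trains both intermediate predictions, and through them the fused pose, without any ground-truth labels.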