Fig. 3
From: Efficient attention vision transformers for monocular depth estimation on resource-limited hardware

Visual results for the large versions of the networks. For each model and dataset, we show the input RGB image (Input), the depth ground truth (GT), the prediction of the baseline model (Base), the prediction of the best-performing optimised model (Opt), and the difference between these two predictions (Diff), shown qualitatively as a difference map and quantified by the RMSE.
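
For reference, the Diff panel corresponds to a per-pixel difference between the two predictions, summarised by a scalar RMSE. Below is a minimal NumPy sketch of that computation under this reading; the names `pred_base`, `pred_opt`, and the optional validity mask are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def diff_and_rmse(pred_base, pred_opt, valid_mask=None):
    """Per-pixel difference map and scalar RMSE between two depth maps.

    pred_base, pred_opt: H x W depth predictions in the same units (e.g. metres).
    valid_mask: optional boolean H x W mask restricting the RMSE to valid pixels
                (hypothetical; the paper does not specify masking for this panel).
    """
    diff = pred_base - pred_opt  # signed difference map (the Diff panel)
    if valid_mask is None:
        valid_mask = np.ones_like(diff, dtype=bool)
    # RMSE over the (masked) difference map, reported alongside the panel
    rmse = float(np.sqrt(np.mean(diff[valid_mask] ** 2)))
    return diff, rmse
```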