Table 9 Comparison of time complexity and parameter sizes for different modules.

From: A simple monocular depth estimation network for balancing complexity and accuracy

| Module | Time complexity | Params |
| --- | --- | --- |
| Standard Transformer [13] | \(O(H^2 \times W^2 \times C_{in})\) | 4.09M |
| DCF (Ours) | \(O(H \times W \times C_{in}^2)\) | 1.54M |
| Ordinary Convolution | \(O(H \times W \times K^2 \times C_{in} \times C_{out})\) | 65.79K |
| DyConv [41] | \(O(H \times W \times K^2 \times C_{in} \times C_{out} \times N)\) | 264.22K |
| LMC (Ours) | \(O(H \times W \times K^2 \times C_{in} \times C_{out} \times N)\) | 287.84K |
| MHSA [13] | \(O(H^2 \times W^2 \times C_{in})\) | 263.17K |
| W-MSA [35] | \(O(H \times W \times C_{in}^2 + N_w \times W_s^4 \times C_{in})\) | 66.05K |
| WAT (Ours) | \(O(H^2 \times W^2 \times C_{in} / 16)\) | 887.23K |

  1. H, W, and \(C_{in}\) denote the height, width, and number of channels of the input feature map, respectively; \(C_{out}\) denotes the number of output channels; K denotes the convolution kernel size; \(N_w\) denotes the number of windows; and \(W_s\) denotes the spatial size of each window.
  2. The best result is indicated in bold.
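
To make the scaling behaviour of these expressions concrete, the following minimal sketch evaluates the leading-order terms from the table for one feature-map size. It is illustrative only: the function names and the example dimensions (a 32 x 32 map with 64 channels, 8 x 8 windows) are assumptions that do not come from the paper, the values are asymptotic terms with constants dropped rather than measured FLOPs, and N is passed through as an opaque multiplier because the table notes do not define it.

```python
# Leading-order terms of the time-complexity expressions in Table 9.
# These are asymptotic scaling terms (constants dropped), not FLOP counts.
# All function names and example dimensions below are hypothetical.


def attention_ops(h, w, c_in):
    """Standard Transformer / MHSA: O(H^2 * W^2 * C_in)."""
    return h**2 * w**2 * c_in


def dcf_ops(h, w, c_in):
    """DCF: O(H * W * C_in^2)."""
    return h * w * c_in**2


def conv_ops(h, w, k, c_in, c_out, n=1):
    """Ordinary convolution (n=1); DyConv and LMC carry the extra factor N.

    N is not defined in the table notes, so it is treated here as a plain
    multiplier supplied by the caller.
    """
    return h * w * k**2 * c_in * c_out * n


def wmsa_ops(h, w, c_in, n_w, w_s):
    """W-MSA: O(H * W * C_in^2 + N_w * W_s^4 * C_in)."""
    return h * w * c_in**2 + n_w * w_s**4 * c_in


def wat_ops(h, w, c_in):
    """WAT: O(H^2 * W^2 * C_in / 16)."""
    return h**2 * w**2 * c_in // 16


if __name__ == "__main__":
    # Assumed example: 32 x 32 feature map, 64 channels, 8 x 8 windows.
    h, w, c = 32, 32, 64
    w_s = 8
    n_w = (h // w_s) * (w // w_s)  # number of non-overlapping windows

    print("attention:", attention_ops(h, w, c))
    print("DCF:      ", dcf_ops(h, w, c))
    print("conv 3x3: ", conv_ops(h, w, 3, c, c))
    print("W-MSA:    ", wmsa_ops(h, w, c, n_w, w_s))
    print("WAT:      ", wat_ops(h, w, c))
```

The quadratic dependence on H x W in the attention-style terms means their cost grows much faster with resolution than the convolution-style terms, which are linear in H x W; the parameter counts in the table are a separate axis and are not modelled by this sketch.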