Table 1 Summary of key characteristics for each backbone architecture.

From: Development of approach to an automated acquisition of static street view images using transformer architecture for analysis of Building characteristics

| Aspect | ViT | Swin Transformer | PVT | MobileViT | Axial Transformer |
| --- | --- | --- | --- | --- | --- |
| Core concept | Pure transformer | Hierarchical transformer | Pyramid transformer | Hybrid convolution-transformer | Axial decomposition attention |
| Feature extraction | Global context features | Multi-scale hierarchical features | Multi-scale pyramid features | Balanced local-global features | Axial long-range dependencies |
| Attention complexity | Quadratic (with image patches) | Linear scaling (with image size) | Efficient spatial-reduction (linear) | Optimized for edge devices (linear) | Linear scaling (axial decomposition) |
| Resource demand | High | Moderate | Moderate | Low | Moderate |
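
As a rough illustration of the attention complexity entries for ViT and Swin Transformer, the sketch below counts query-key pairs for global self-attention over all patches and for attention restricted to fixed-size local windows. It is a back-of-the-envelope count, not the article's implementation: the function names, the patch size of 4, and the window size of 7 are assumptions chosen for illustration, and both counts use the same patch size so that only the attention pattern differs.

```python
def global_attention_pairs(height, width, patch):
    """Query-key pairs when every patch attends to every patch (ViT-style)."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


def windowed_attention_pairs(height, width, patch, window):
    """Query-key pairs when each patch attends only to the window x window
    patches of its own local window (Swin-style)."""
    tokens = (height // patch) * (width // patch)
    return tokens * window * window


if __name__ == "__main__":
    # Illustrative settings (assumed, not taken from the article).
    for side in (224, 448, 896):
        g = global_attention_pairs(side, side, patch=4)
        w = windowed_attention_pairs(side, side, patch=4, window=7)
        print(f"{side}x{side}: global={g:,} windowed={w:,}")
```

Doubling the image side quadruples the number of patches, so the global count grows by a factor of 16 while the windowed count grows by a factor of 4. This matches the quadratic and linear labels in the table for these two backbones.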