Table 1 Summary of key characteristics for each backbone architecture.
| Aspect | ViT | Swin Transformer | PVT | MobileViT | Axial Transformer |
|---|---|---|---|---|---|
| Core concept | Pure transformer | Hierarchical transformer | Pyramid transformer | Hybrid convolution-transformer | Axial decomposition attention |
| Feature extraction | Global context features | Multi-scale hierarchical features | Multi-scale pyramid features | Balanced local-global features | Axial long-range dependencies |
| Attention complexity | Quadratic (in the number of image patches) | Linear scaling (with image size) | Efficient spatial-reduction (linear) | Optimized for edge devices (linear) | Linear scaling (axial decomposition) |
| Resource demand | High | Moderate | Moderate | Low | Moderate |
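
To make the contrast in the attention-complexity row concrete, the following minimal sketch counts query-key pairs for global self-attention (as in ViT) and for non-overlapping window attention (as in Swin Transformer). The function names, the token-grid sizes, and the window size of 7 are illustrative assumptions, not values taken from the table; the counts ignore heads, channels, and constant factors.

```python
def global_attention_pairs(grid_h: int, grid_w: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n_tokens = grid_h * grid_w
    return n_tokens * n_tokens


def window_attention_pairs(grid_h: int, grid_w: int, window: int = 7) -> int:
    """Query-key pairs when each token attends only within its own
    non-overlapping window of window x window tokens.

    Assumes grid_h and grid_w are multiples of `window`.
    """
    n_tokens = grid_h * grid_w
    return n_tokens * window * window


if __name__ == "__main__":
    # Hypothetical token grids: doubling each side quadruples the token count.
    for side in (56, 112):
        g = global_attention_pairs(side, side)
        w = window_attention_pairs(side, side)
        print(f"{side}x{side} tokens: global={g}, windowed={w}")
```

Under these assumptions, quadrupling the number of tokens multiplies the global count by 16 but the windowed count only by 4, which is the sense in which the table contrasts quadratic with linear scaling.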