Table 2 Detailed architecture of the vision transformer used. Trainable parameters: 85,801,732.
Layer name | Number of layers | Input shape | Output shape | Number of params |
---|---|---|---|---|
PatchEmbed | 1 | [1, 3, 224, 224] | [1, 196, 768] | 590 592 |
Dropout | 1 | [1, 197, 768] | [1, 197, 768] | – |
Identity | 2 | [1, 197, 768] | [1, 197, 768] | – |
Encoder Block | 12 | [1, 197, 768] | [1, 197, 768] | 85 209 604 |
LayerNorm | 1 | [1, 197, 768] | [1, 197, 768] | 1 536 |
Identity | 1 | [1, 768] | [1, 768] | – |
Dropout | 1 | [1, 768] | [1, 768] | – |
Linear | 1 | [1, 768] | [1, 4] | 3 076 |
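
The total in the caption can be cross-checked with a short parameter count. The sketch below assumes a standard ViT-Base/16 layout, which matches the shapes in the table but is not stated explicitly in it: 16×16 patches, an MLP expansion ratio of 4, biases on all linear layers, one class token, and learned position embeddings. The function name `vit_param_count` and its argument names are illustrative only.

```python
def vit_param_count(img=224, patch=16, chans=3, dim=768,
                    depth=12, mlp_ratio=4, classes=4):
    """Count trainable parameters of a ViT under the assumptions above."""
    n_patches = (img // patch) ** 2                  # 14 * 14 = 196 patches

    # Patch embedding: a conv with kernel = stride = patch size, plus bias.
    patch_embed = chans * patch * patch * dim + dim
    cls_token = dim                                  # one learnable token
    pos_embed = (n_patches + 1) * dim                # 197 positions (assumed learned)

    layer_norm = 2 * dim                             # scale and shift
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    block = 2 * layer_norm + attn + mlp              # one encoder block

    head = dim * classes + classes                   # final Linear layer
    parts = {
        "patch_embed": patch_embed,
        "cls_and_pos": cls_token + pos_embed,
        "encoder_blocks": depth * block,
        "final_norm": layer_norm,
        "head": head,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, count in vit_param_count().items():
        print(f"{name:>15}: {count:,}")
    # total: 85,801,732
```

Under these assumptions the total reproduces the 85,801,732 in the caption, and the PatchEmbed, LayerNorm, and Linear counts match their rows. The 12 encoder blocks alone come to 85,054,464 and the class token plus position embeddings to 152,064, so the Encoder Block row in the table appears to group parameters differently from this breakdown.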