Table 2 Detailed architecture of the vision transformer used. Trainable params: 85,801,732.

From: Residual self-attention vision transformer for detecting acquired vitelliform lesions and age-related macular drusen

| Layer name | Number of layers | Input shape | Output shape | Number of params |
|---|---|---|---|---|
| PatchEmbed | 1 | [1, 3, 224, 224] | [1, 196, 768] | 590,592 |
| Dropout | 1 | [1, 197, 768] | [1, 197, 768] | 0 |
| Identity | 2 | [1, 197, 768] | [1, 197, 768] | 0 |
| Encoder Block | 12 | [1, 197, 768] | [1, 197, 768] | 85,209,604 |
| LayerNorm | 1 | [1, 197, 768] | [1, 197, 768] | 1,536 |
| Identity | 1 | [1, 768] | [1, 768] | 0 |
| Dropout | 1 | [1, 768] | [1, 768] | 0 |
| Linear | 1 | [1, 768] | [1, 4] | 3,076 |
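The parameter counts for the layers outside the encoder blocks can be reproduced from the shapes in the table. A minimal sketch, assuming a standard ViT-Base/16 layout in which the patch embedding is a 16×16 convolution with bias (the dimensions 224, 16, 3, 768, and 4 are taken directly from the table):

```python
# Sanity-check Table 2's per-layer parameter counts from the listed shapes.
img_size, patch, in_ch, dim, n_classes = 224, 16, 3, 768, 4

n_patches = (img_size // patch) ** 2             # 14 x 14 = 196 patch tokens
seq_len = n_patches + 1                          # + 1 class token -> 197

# PatchEmbed: patch-size conv from 3 channels to 768, plus a bias per channel
patch_embed = dim * in_ch * patch * patch + dim  # 590,592

# Final LayerNorm: one weight and one bias per embedding dimension
layer_norm = 2 * dim                             # 1,536

# Classification head: Linear 768 -> 4, plus 4 biases
head = dim * n_classes + n_classes               # 3,076

print(n_patches, seq_len, patch_embed, layer_norm, head)
```

The Dropout and Identity layers hold no trainable parameters, so the remaining weights sit in the 12 encoder blocks.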