Table 2 Dimension-wise analysis of the proposed model's features.
| Feature Name | Branch | Dimension | Description | Role |
|---|---|---|---|---|
| Patch Embedding | ViT | \(N \times D\) | Non-overlapping \(P \times P\) pixel patches of the input image, each linearly projected to a \(D\)-dimensional token | Encodes local pixel patterns; basis for global self-attention |
| Positional Encoding | ViT | \(N \times D\) | Learnable vectors added to each patch token | Injects spatial arrangement; critical for reconstructing layout |
| Self-Attention Output | ViT | \(N \times D\) | Weighted combination of all patch tokens within each Transformer block | Captures long-range dependencies across the entire scene |
| Class Token \(\mathbf{z}_{\mathrm{cls}}\) | ViT | \(D\) | Special token embedding after the final encoder layer | Holistic, global image representation for classification |
| Convolutional Stem Map | MLP | \(h \times w \times C\) | Two-layer \(3 \times 3\) CNN feature map | Extracts fine-grained textures (e.g., foliage, sand grains) |
| Global Avg. Pool \(\mathbf{m}\) | MLP | \(C\) | Spatially averaged convolutional feature map | Condenses local features into a fixed-length vector |
| MLP Hidden Vectors \(\mathbf{h}_k\) | MLP | \(D\) | Sequence of \(K\) FC + GELU layers projecting \(\mathbf{m}\) into the shared latent space | Builds hierarchical, non-linear abstractions of local cues |
| Gating Vector \(\mathbf{g}\) | Fusion | \(D\) | Sigmoid of a linear projection of the concatenation \([\mathbf{z}_{\mathrm{cls}};\, \mathbf{h}_k]\) | Dynamically weights global vs. local features |
| Fused Feature \(\mathbf{f}\) | Fusion | \(D\) | Element-wise blend \(\mathbf{g} \odot \mathbf{z}_{\mathrm{cls}} + (1 - \mathbf{g}) \odot \mathbf{h}_k\) | Balanced representation for final classification |
| Class Probabilities | Head | 5 | Softmax over \(\mathbf{f}\) | Final predicted distribution over the five landscape classes |
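The Fusion and Head rows of Table 2 can be sketched numerically as follows. This is a minimal illustration, not the trained model: the width \(D = 64\), the gating matrix `W_g`, and the classifier matrix `W_c` are random stand-ins for what would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_CLASSES = 64, 5  # assumed latent width; five landscape classes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the two branch outputs: global ViT class token z_cls
# and the final MLP hidden vector h_k, both of dimension D.
z_cls = rng.standard_normal(D)
h_k = rng.standard_normal(D)

# Gating vector g: sigmoid of a linear projection of [z_cls; h_k].
# W_g is an illustrative random matrix; in the model it is learned.
W_g = rng.standard_normal((D, 2 * D)) / np.sqrt(2 * D)
g = sigmoid(W_g @ np.concatenate([z_cls, h_k]))  # shape (D,), values in (0, 1)

# Fused feature f: element-wise blend of global and local branches.
f = g * z_cls + (1.0 - g) * h_k                  # shape (D,)

# Classification head: linear layer + softmax over the five classes.
W_c = rng.standard_normal((NUM_CLASSES, D)) / np.sqrt(D)
probs = softmax(W_c @ f)                         # sums to 1
```

Because `g` lies in \((0, 1)\) element-wise, each coordinate of `f` is a convex combination of the corresponding global and local features, which is what lets the fusion shift weight between the two branches per dimension.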