Table 2 Dimensional analysis of the proposed model's features.

From: Global attention and local features using deep perceptron ensemble with vision Transformers for landscape design detection

| Feature Name | Branch | Dimension | Description | Role |
| --- | --- | --- | --- | --- |
| Patch Embedding | ViT | \(N \times D\) | Linear projection of the \(P \times P\) pixel patches of the input image into \(D\)-dimensional embeddings | Encodes local pixel patterns; basis for global self-attention |
| Positional Encoding | ViT | \(N \times D\) | Learnable vectors added to each patch token | Injects spatial arrangement; critical for reconstructing layout |
| Self-Attention Output | ViT | \(N \times D\) | Weighted combination of all patch tokens within each Transformer block | Captures long-range dependencies across the entire scene |
| Class Token \(\mathbf{Z}_{cls}\) | ViT | \(D\) | Special token embedding after the final encoder layer | Holistic, global image representation for classification |
| Convolutional Stem Map | MLP | \(h \times w \times C\) | Two-layer \(3 \times 3\) CNN feature map | Extracts fine-grained textures (e.g., foliage, sand grains) |
| Global Avg. Pool \(\mathbf{m}\) | MLP | \(C\) | Spatially averaged convolutional feature map | Condenses local features into a fixed-length vector |
| MLP Hidden Vectors \(\mathbf{h}_k\) | MLP | \(D\) | Sequence of \(K\) FC + GELU layers projecting \(\mathbf{m}\) into the shared latent space | Builds hierarchical, non-linear abstractions of local cues |
| Gating Vector \(\mathbf{g}\) | Fusion | \(D\) | Sigmoid of the concatenated \([\mathbf{Z}_{cls};\, \mathbf{h}_k]\) | Dynamically weights global vs. local features |
| Fused Feature \(\mathbf{f}\) | Fusion | \(D\) | Element-wise blend: \(\mathbf{f} = \mathbf{g} \odot \mathbf{Z}_{cls} + (1 - \mathbf{g}) \odot \mathbf{h}_k\) | Balanced representation for final classification |
| Class Probabilities | Head | 5 | Softmax over a linear projection of \(\mathbf{f}\) | Final predicted distribution over the five landscape classes |
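
To make the data flow behind these dimensions concrete, below is a minimal PyTorch sketch of the two-branch pipeline the table summarizes. All hyperparameter values (image size 224, patch size 16, \(D = 256\), \(C = 64\), \(K = 2\), six encoder layers) and the class name `GlobalLocalFusionNet` are illustrative assumptions, not values from the paper; the sketch reproduces the table's shapes and gated fusion, not the authors' exact implementation.

```python
# Minimal sketch of the two-branch model summarized in Table 2.
# Hyperparameters are assumptions for illustration, not the paper's values.
import torch
import torch.nn as nn

class GlobalLocalFusionNet(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim_D=256,
                 conv_C=64, mlp_K=2, depth=6, heads=8, num_classes=5):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2  # N

        # --- ViT branch ---
        # Patch embedding: linear projection of P x P patches -> N x D tokens
        self.patch_embed = nn.Conv2d(3, dim_D, kernel_size=patch_size,
                                     stride=patch_size)
        # Class token and learnable positional encodings ((N + 1) x D)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim_D))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim_D))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim_D, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # --- MLP branch ---
        # Convolutional stem: two 3x3 conv layers -> h x w x C feature map
        self.stem = nn.Sequential(
            nn.Conv2d(3, conv_C, 3, padding=1), nn.GELU(),
            nn.Conv2d(conv_C, conv_C, 3, padding=1), nn.GELU())
        # K stacked FC + GELU layers projecting the pooled vector m into R^D
        mlp_layers, in_dim = [], conv_C
        for _ in range(mlp_K):
            mlp_layers += [nn.Linear(in_dim, dim_D), nn.GELU()]
            in_dim = dim_D
        self.mlp = nn.Sequential(*mlp_layers)

        # --- Fusion and head ---
        # Gating vector g = sigmoid(W [Z_cls ; h_k])
        self.gate = nn.Linear(2 * dim_D, dim_D)
        self.head = nn.Linear(dim_D, num_classes)

    def forward(self, x):
        b = x.size(0)
        # Global branch: patch tokens + class token + positional encoding
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # B x N x D
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        z_cls = self.encoder(tokens)[:, 0]                        # B x D

        # Local branch: conv stem -> global average pool -> MLP
        m = self.stem(x).mean(dim=(2, 3))                         # B x C
        h_k = self.mlp(m)                                         # B x D

        # Gated fusion: f = g * Z_cls + (1 - g) * h_k
        g = torch.sigmoid(self.gate(torch.cat([z_cls, h_k], dim=-1)))
        f = g * z_cls + (1 - g) * h_k

        # Class probabilities: softmax over a linear projection of f
        return self.head(f).softmax(dim=-1)                      # B x 5
```

For example, `GlobalLocalFusionNet()(torch.randn(2, 3, 224, 224))` returns a 2 x 5 tensor of class probabilities; during training one would typically drop the final softmax and pass the logits to `nn.CrossEntropyLoss`.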