Table 2 Dimension-wise analysis of the proposed model's features.
| Feature Name | Branch | Dimension | Description | Role |
|---|---|---|---|---|
| Patch Embedding | ViT | \(N \times D\) | Non-overlapping \(P \times P\) pixel patches of the input image, each linearly projected to a \(D\)-dimensional token | Encodes local pixel patterns; basis for global self-attention |
| Positional Encoding | ViT | \(N \times D\) | Learnable vectors added to each patch token | Injects spatial arrangement; critical for reconstructing layout |
| Self-Attention Output | ViT | \(N \times D\) | Weighted combination of all patch tokens within each Transformer block | Captures long-range dependencies across the entire scene |
| Class Token \(\mathbf{z}_{\mathrm{cls}}\) | ViT | \(D\) | Special token embedding after the final encoder layer | Holistic, global image representation for classification |
| Convolutional Stem Map | MLP | \(h \times w \times C\) | Two-layer \(3 \times 3\) CNN feature map | Extracts fine-grained textures (e.g., foliage, sand grains) |
| Global Avg. Pool \(\mathbf{m}\) | MLP | \(C\) | Spatially averaged convolutional feature map | Condenses local features into a fixed-length vector |
| MLP Hidden Vectors \(\mathbf{h}_k\) | MLP | \(D\) | Sequence of \(K\) FC + GELU layers projecting \(\mathbf{m}\) into the shared latent space | Builds hierarchical, non-linear abstractions of local cues |
| Gating Vector \(\mathbf{g}\) | Fusion | \(D\) | Sigmoid of a linear projection of the concatenation \([\mathbf{z}_{\mathrm{cls}};\, \mathbf{h}_k]\) | Dynamically weights global vs. local features |
| Fused Feature \(\mathbf{f}\) | Fusion | \(D\) | Element-wise blend \(\mathbf{g} \odot \mathbf{z}_{\mathrm{cls}} + (1 - \mathbf{g}) \odot \mathbf{h}_k\) | Balanced representation for final classification |
| Class Probabilities | Head | 5 | Softmax over \(\mathbf{f}\) | Final predicted distribution over the five landscape classes |
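The Fusion and Head rows of Table 2 can be sketched numerically as follows. This is a minimal illustration, not the trained model: the width \(D = 64\), the gating matrix `W_g`, and the classifier matrix `W_c` are random stand-ins for what would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_CLASSES = 64, 5  # assumed latent width; five landscape classes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the two branch outputs: global ViT class token z_cls
# and the final MLP hidden vector h_k, both of dimension D.
z_cls = rng.standard_normal(D)
h_k = rng.standard_normal(D)

# Gating vector g: sigmoid of a linear projection of [z_cls; h_k].
# W_g is an illustrative random matrix; in the model it is learned.
W_g = rng.standard_normal((D, 2 * D)) / np.sqrt(2 * D)
g = sigmoid(W_g @ np.concatenate([z_cls, h_k]))  # shape (D,), values in (0, 1)

# Fused feature f: element-wise blend of global and local branches.
f = g * z_cls + (1.0 - g) * h_k                  # shape (D,)

# Classification head: linear layer + softmax over the five classes.
W_c = rng.standard_normal((NUM_CLASSES, D)) / np.sqrt(D)
probs = softmax(W_c @ f)                         # sums to 1
```

Because `g` lies in \((0, 1)\) element-wise, each coordinate of `f` is a convex combination of the corresponding global and local features, which is what lets the fusion shift weight between the two branches per dimension.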