Table 2 Detailed structure of the proposed HyperFusion-Net architecture.

From: HyperFusionNet combines vision transformer for early melanoma detection and precise lesion segmentation

| Stages | Layers | Output size |
| --- | --- | --- |
| Input image | Input layer (resize to 224 × 224) | 224 × 224 × 3 |
| Stem | Conv2D(64), 3 × 3 kernel, stride 2 + BatchNorm + ReLU | 112 × 112 × 64 |
| | MaxPooling2D, 2 × 2 pool, stride 2 | 56 × 56 × 64 |
| Transformer encoder | Multi-Path ViT with patch embedding (16 × 16), 4 encoder layers, and multi-head attention | 14 × 14 × 768 |
| | Positional Encoding + LayerNorm + FeedForward (MLP) | 14 × 14 × 768 |
| Segmentation branch | Attention U-Net Decoder with skip connections from patch embeddings | 224 × 224 × 1 |
| | Upsampling + Conv2D(128 → 64 → 32) + Attention Gates | |
| Fusion block | Cross-Attention Fusion between Transformer and U-Net outputs | 224 × 224 × 64 |
| Classifier head | Global Average Pooling + Dense(128) + Dropout(0.5) + Dense(1) + Sigmoid | 1 |
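
For readers who want to trace the tensor shapes stage by stage, below is a minimal PyTorch sketch that follows the layout and output sizes in the table. Everything the table does not pin down is an assumption: the class and parameter names are made up, the number of attention heads (8) is a guess, the 4 × 4 patch embedding applied to the 56 × 56 stem output is one way to realize an effective 16 × 16 patch on the input image, and the attention gates and patch-embedding skip connections are omitted for brevity. Treat it as a shape-level illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def up_block(cin, cout):
    # 2x bilinear upsampling followed by a 3x3 conv, BatchNorm, and ReLU.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class HyperFusionNetSketch(nn.Module):
    """Shape-level sketch of Table 2 (assumed wiring, not the paper's code).

    Attention gates and the skip connections from patch embeddings are
    omitted; hyperparameters absent from the table are illustrative guesses.
    """

    def __init__(self, embed_dim=768, num_heads=8, num_layers=4):
        super().__init__()
        self.embed_dim = embed_dim
        # Stem: 224x224x3 -> 112x112x64 (stride-2 conv) -> 56x56x64 (2x2 pool).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Patch embedding: 4x4 patches on the 56x56 stem output match an
        # effective 16x16 patch on the input image -> 14x14 = 196 tokens.
        self.patch_embed = nn.Conv2d(64, embed_dim, kernel_size=4, stride=4)
        self.pos_embed = nn.Parameter(torch.zeros(1, 196, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Segmentation branch: 14x14x768 -> 224x224 via 128 -> 64 -> 32 channels.
        self.decoder = nn.Sequential(
            up_block(embed_dim, 128),  # 28x28x128
            up_block(128, 64),         # 56x56x64
            up_block(64, 32),          # 112x112x32
            up_block(32, 32),          # 224x224x32
        )
        self.seg_head = nn.Conv2d(32, 1, kernel_size=1)  # 224x224x1 mask logits
        # Fusion block: decoder features query the transformer tokens.
        self.query_proj = nn.Conv2d(32, embed_dim, kernel_size=1)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fuse_proj = nn.Conv2d(embed_dim, 64, kernel_size=1)
        # Classifier head: GAP + Dense(128) + Dropout(0.5) + Dense(1) + Sigmoid.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.stem(x)                                   # (B, 64, 56, 56)
        tokens = self.patch_embed(feats).flatten(2).transpose(1, 2)
        tokens = self.encoder(tokens + self.pos_embed)         # (B, 196, 768)
        grid = tokens.transpose(1, 2).reshape(-1, self.embed_dim, 14, 14)
        dec = self.decoder(grid)                               # (B, 32, 224, 224)
        seg_mask = torch.sigmoid(self.seg_head(dec))           # (B, 1, 224, 224)
        # Cross-attention at the 14x14 grid so the sequence stays 196 tokens.
        q = self.query_proj(F.adaptive_avg_pool2d(dec, 14))
        q = q.flatten(2).transpose(1, 2)                       # (B, 196, 768)
        fused, _ = self.cross_attn(q, tokens, tokens)
        fused = fused.transpose(1, 2).reshape(-1, self.embed_dim, 14, 14)
        fused = F.interpolate(self.fuse_proj(fused), size=(224, 224),
                              mode="bilinear", align_corners=False)  # 224x224x64
        return seg_mask, self.classifier(fused)                # mask, probability


model = HyperFusionNetSketch()
model.eval()
mask, prob = model(torch.randn(1, 3, 224, 224))
print(mask.shape, prob.shape)  # torch.Size([1, 1, 224, 224]) torch.Size([1, 1])
```

The smoke test at the bottom confirms the two outputs match the table: a 224 × 224 × 1 segmentation mask and a single sigmoid probability. Fusing at the 14 × 14 token grid is a deliberate simplification: cross-attention over 196 queries is cheap, whereas attending at the full 224 × 224 resolution would mean 50,176 positions per map.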