Table 2 Detailed structure of the proposed HyperFusion-Net architecture.
| Stages | Layers | Output size |
|---|---|---|
| Input image | Input layer (resize to 224 × 224) | 224 × 224 × 3 |
| Stem | Conv2D(64) @ 3 × 3, stride 1 + BatchNorm + ReLU | 112 × 112 × 64 |
| | MaxPooling2D @ 2 × 2, stride 2 | 56 × 56 × 64 |
| Transformer encoder | Multi-Path ViT with patch embedding (16 × 16), 4 encoder layers, and multi-head attention | 14 × 14 × 768 |
| | Positional Encoding + LayerNorm + FeedForward (MLP) | 14 × 14 × 768 |
| Segmentation branch | Attention U-Net Decoder with skip connections from patch embeddings | 224 × 224 × 1 |
| | Upsampling + Conv2D(128 → 64 → 32) + Attention Gates | |
| Fusion block | Cross-Attention Fusion between Transformer and U-Net outputs | 224 × 224 × 64 |
| Classifier head | Global Average Pooling + Dense(128) + Dropout(0.5) + Dense(1) + Sigmoid | 1 |
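The Keras sketch below illustrates how the stages of Table 2 can be wired together; it is a minimal reference, not the authors' implementation. Details the table does not specify are assumptions: the stem's stride of 2 (chosen so the output matches 112 × 112 × 64), the number of attention heads, the MLP width, the simplified 1 × 1 sigmoid attention gates, the use of the stem features as a decoder skip connection, and the cross-attention being computed at 14 × 14 resolution before upsampling to 224 × 224 × 64.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model


class AddPositionalEmbedding(layers.Layer):
    """Adds a learned positional embedding to a (batch, tokens, dim) sequence."""
    def build(self, input_shape):
        self.pos = self.add_weight(name="pos_embed",
                                   shape=(1, input_shape[1], input_shape[2]),
                                   initializer="random_normal", trainable=True)

    def call(self, x):
        return x + self.pos


def up_block(x, filters, skip=None):
    """Upsampling + Conv2D + a simplified 1x1 sigmoid attention gate (assumption)."""
    x = layers.UpSampling2D(2)(x)
    if skip is not None:
        x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    gate = layers.Conv2D(filters, 1, activation="sigmoid")(x)
    return layers.Multiply()([x, gate])


def build_hyperfusion_net(input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)                           # 224 x 224 x 3

    # Stem: Conv2D(64) 3x3 + BatchNorm + ReLU, then 2x2 max pooling.
    # strides=2 is assumed here so the output matches the 112 x 112 x 64 row.
    x = layers.Conv2D(64, 3, strides=2, padding="same")(inputs)        # 112 x 112 x 64
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    stem = layers.MaxPooling2D(2, strides=2)(x)                        # 56 x 56 x 64

    # Transformer encoder: 16x16 patch embedding, positional encoding,
    # 4 encoder layers of multi-head attention + LayerNorm + MLP.
    patches = layers.Conv2D(768, 16, strides=16)(inputs)               # 14 x 14 x 768
    seq = layers.Reshape((196, 768))(patches)
    seq = AddPositionalEmbedding()(seq)
    for _ in range(4):
        attn = layers.MultiHeadAttention(num_heads=8, key_dim=96)(seq, seq)
        seq = layers.LayerNormalization()(seq + attn)
        mlp = layers.Dense(3072, activation="gelu")(seq)
        mlp = layers.Dense(768)(mlp)
        seq = layers.LayerNormalization()(seq + mlp)
    vit = layers.Reshape((14, 14, 768))(seq)                           # 14 x 14 x 768

    # Segmentation branch: attention-gated U-Net-style decoder.
    # The stem features serve as a 56x56 skip connection for illustration only;
    # Table 2 does not state how the stem is wired into the decoder.
    d = up_block(vit, 128)                                             # 28 x 28 x 128
    d = up_block(d, 64, skip=stem)                                     # 56 x 56 x 64
    d = up_block(d, 32)                                                # 112 x 112 x 32
    d = layers.UpSampling2D(2)(d)                                      # 224 x 224 x 32
    seg_mask = layers.Conv2D(1, 1, activation="sigmoid",
                             name="segmentation")(d)                   # 224 x 224 x 1

    # Fusion block: decoder features query the Transformer tokens via cross-attention
    # at 14x14 resolution, then the fused map is upsampled to 224 x 224 x 64.
    q = layers.Conv2D(64, 1)(layers.AveragePooling2D(16)(d))           # 14 x 14 x 64
    q = layers.Reshape((196, 64))(q)
    kv = layers.Reshape((196, 768))(vit)
    fused = layers.MultiHeadAttention(num_heads=4, key_dim=64)(q, kv)  # 196 x 64
    fused = layers.Reshape((14, 14, 64))(fused)
    fused = layers.UpSampling2D(16)(fused)                             # 224 x 224 x 64
    fused = layers.Add()([fused, layers.Conv2D(64, 1)(d)])

    # Classifier head: GAP + Dense(128) + Dropout(0.5) + Dense(1) + sigmoid.
    h = layers.GlobalAveragePooling2D()(fused)
    h = layers.Dense(128, activation="relu")(h)
    h = layers.Dropout(0.5)(h)
    cls = layers.Dense(1, activation="sigmoid", name="classification")(h)

    return Model(inputs, [seg_mask, cls], name="HyperFusionNet")


model = build_hyperfusion_net()
model.summary()
```

Calling `model.summary()` confirms the per-stage output sizes of Table 2 (112 × 112 × 64 and 56 × 56 × 64 in the stem, 14 × 14 × 768 after the encoder, 224 × 224 × 1 for the mask, 224 × 224 × 64 after fusion, and a single sigmoid output from the classifier head).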