Table 9 Ablation study on the WeedSwin model architecture. The Lightweight Backbone variant reduces computational complexity through shallower transformer depths [2, 2, 9, 2] and a smaller window size (7), cutting FLOPs by roughly 25% with minimal performance impact. The Enhanced Encoder-Decoder variant improves feature representation through additional feature levels (5) and a deeper encoder-decoder (6 layers), achieving the highest detection recall at the cost of inference speed. The Optimized Training variant explores alternative optimization strategies, namely cosine learning-rate scheduling and modified denoising parameters, balancing accuracy against computational requirements.

From: WeedSwin hierarchical vision transformer with SAM-2 for multi-stage weed detection and classification

| Model configuration | Reduced depth | Smaller window | More levels | Deeper E-D | Cosine LR | DN config | mAP↑ | mAR↑ | FPS↑ | FLOPs (T)↓ | Params (M)↓ |
|---|:---:|:---:|:---:|:---:|:---:|:---:|---|---|---|---|---|
| WeedSwin (Original) | | | | | | | **0.993** | 0.985 | **218.27** | 0.114 | 40.48 |
| Lightweight Backbone | ✓ | ✓ | | | | | 0.969 | 0.958 | 210.51 | **0.085** | **40.12** |
| Enhanced Encoder-Decoder | | | ✓ | ✓ | | | **0.993** | **0.996** | 80.60 | 0.250 | 269.00 |
| Optimized Training | | | | | ✓ | ✓ | 0.982 | 0.980 | 88.14 | 0.254 | 265.00 |

1. Significant values are in bold. The "Reduced depth" through "DN config" columns indicate the model architecture variations applied in each configuration; the remaining columns are performance metrics.
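The relative cost trade-offs in the table can be checked directly from the reported FLOPs values. A minimal sketch (the dictionary and function names are illustrative, not from the paper):

```python
# FLOPs (in TFLOPs) for each configuration, as reported in Table 9.
flops = {
    "WeedSwin (Original)": 0.114,
    "Lightweight Backbone": 0.085,
    "Enhanced Encoder-Decoder": 0.250,
    "Optimized Training": 0.254,
}

def reduction_vs_baseline(variant: str, baseline: str = "WeedSwin (Original)") -> float:
    """Relative FLOPs reduction vs. the baseline (negative means an increase)."""
    return 1.0 - flops[variant] / flops[baseline]

# The Lightweight Backbone's "~25% FLOPs reduction" claim:
print(f"{reduction_vs_baseline('Lightweight Backbone'):.1%}")  # → 25.4%
```

Running the same function on the Enhanced Encoder-Decoder variant yields a negative value, reflecting its more than twofold FLOPs increase over the baseline.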