Table 9 Ablation study on the WeedSwin model architecture. The Lightweight Backbone variant reduces computational complexity with shallower transformer depths [2, 2, 9, 2] and a smaller window size (7), achieving a 25% FLOPs reduction with minimal performance impact. The Enhanced Encoder-Decoder variant improves feature representation through more feature levels (5) and a deeper encoder-decoder (6 layers), achieving the highest detection recall at the cost of speed. The Optimized Training variant explores alternative optimization strategies with cosine learning-rate scheduling and modified denoising (DN) parameters, balancing accuracy against computational requirements.
| Model configuration | Reduced depth | Smaller window | More levels | Deeper E-D | Cosine LR | DN Config | mAP↑ | mAR↑ | FPS↑ | FLOPs (T)↓ | Params (M)↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WeedSwin (Original) |  |  |  |  |  |  | 0.993 | 0.985 | 218.27 | 0.114 | 40.48 |
| Lightweight Backbone | ✓ | ✓ |  |  |  |  | 0.969 | 0.958 | 210.51 | 0.085 | 40.12 |
| Enhanced Encoder-Decoder |  |  | ✓ | ✓ |  |  | 0.993 | 0.996 | 80.60 | 0.250 | 269.00 |
| Optimized Training |  |  |  |  | ✓ | ✓ | 0.982 | 0.980 | 88.14 | 0.254 | 265.00 |
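To make the four configurations in Table 9 concrete, the sketch below expresses each ablation as a small set of overrides on a shared baseline. This is a minimal illustration, not the authors' code: the class name `WeedSwinConfig`, the field names, and every baseline value not stated in the caption (the original stage depths, window size, feature levels, encoder-decoder depth, LR schedule, and DN parameters) are assumptions; only the per-variant overrides come from the caption.

```python
from dataclasses import dataclass, field, replace
from typing import List

@dataclass
class WeedSwinConfig:
    """Hypothetical configuration for the WeedSwin ablations in Table 9."""
    # Baseline values below are assumptions, except where the table
    # caption states a value for a specific variant.
    depths: List[int] = field(default_factory=lambda: [2, 2, 18, 2])  # assumed baseline stage depths
    window_size: int = 12        # assumed baseline; caption gives 7 only for the lightweight variant
    num_feature_levels: int = 4  # assumed baseline; Enhanced variant uses 5
    enc_dec_layers: int = 4      # assumed baseline; Enhanced variant uses 6
    lr_schedule: str = "step"    # assumed baseline; Optimized variant uses cosine scheduling
    dn_config: str = "default"   # denoising setup; exact modified parameters are not given

original = WeedSwinConfig()

# Each ablation toggles only the two factors checked in its row of Table 9.
lightweight = replace(original, depths=[2, 2, 9, 2], window_size=7)
enhanced = replace(original, num_feature_levels=5, enc_dec_layers=6)
optimized = replace(original, lr_schedule="cosine", dn_config="modified")
```

Structuring the ablations as overrides of a single baseline makes the controlled nature of the study explicit: each variant differs from the original in exactly the two checked factors, so metric changes in the table can be attributed to those factors alone.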