Table 4 Overview of model architectures evaluated on the STARSS23 validation set

Approach	Format	Params (M)	Data (h)	Model Architecture(s)	\(\text{ER}_{\le 20^{\circ}}\,\downarrow\)	\(\text{F}_{\le 20^{\circ}}\,\uparrow\)	\(\text{LE}_{\text{CD}}\,\downarrow\)	\(\text{LR}_{\text{CD}}\,\uparrow\)	\(\mathcal{E}_{\text{SELD}}\,\downarrow\)
DCASE
Wang et al.⁷⁸	FOA	63	192	Sound Separation, ResNet-Conformer, Model Ensemble	0.38	66.0%	12.8°	75.0%	0.260
Xue et al.⁷⁹	FOA	33	-	ResNet-Conformer, MS-CAM	0.44	54.2%	13.9°	67.9%	0.324
Hu et al.¹⁴⁹	FOA	85	49	EINV2	0.48	47.3%	16.1°	62.6%	0.368
Kang et al.¹²¹	FOA	202	212	ResNet-Conformer, Model Ensemble	0.43	55.8%	15.9°	71.5%	0.311
Kim and Ko¹²²	FOA	138	192	ResNet, SqEx, Model Ensemble	0.47	51.7%	15.2°	70.2%	0.334
Zhang et al.¹⁵⁰	FOA	104	128	CNN-Conformer, Model Ensemble	0.46	52.0%	14.0°	59.5%	0.356
Wu¹⁵¹	FOA	85	-	EINV2	0.54	41.1%	22.3°	62.3%	0.407
Shul et al.⁸³	FOA	2	192	Divided Spectro-Temporal Attention	0.49	42.7%	16.7°	55.2%	0.401
Kumar et al.¹⁵²	FOA	3	52	CNN-Conformer	0.39	56.0%	20.3°	63.0%	0.328
DCASE Baseline	FOA	0.7	24	SELDNet	0.57	29.9%	22.0°	47.7%	0.479
DCASE Baseline	MIC	0.7	24	SELDNet	0.62	27.8%	27.0°	44.3%	0.512
Non-DCASE
Shul et al.⁸⁷	FOA	-	24	CST-Former	0.41	57.7%	13.8°	68.3%	0.307
Berghi et al.²⁰	FOA	-	272	CNN-Conformer	0.51	50.2%	15.4°	56.4%	0.382
Hu et al.⁸⁹	FOA	34.6	-	HTS-AT, PSELDNet	0.39	62.4%	14.4°	77.7%	0.267
Mu et al.⁸⁶	FOA	26.9	-	Multi-Feature Fusion, EINV2	0.54	42.5%	18.7°	62.6%	0.398
Jiang et al.¹⁵³	FOA	63	192	ResNet-Conformer	0.42	57.0%	14.3°	67.0%	0.310
He et al.⁴⁸	FOA	-	24	Pretrained SSAST	0.49	44.4%	18.8°	62.1%	0.382
Zhang et al.⁸¹	FOA	4	28.3	CRNN10, AADA	0.53	32.9%	31.3°	40.5%	0.405

This table summarizes selected systems, detailing their recording format, model complexity (in parameter count), training data volume, and key evaluation metrics. DCASE submissions are sorted by their challenge ranking, and only systems outperforming the baseline are listed.
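
The aggregate column \(\mathcal{E}_{\text{SELD}}\) is not defined in this excerpt. The sketch below assumes the usual DCASE convention of averaging the four metrics after mapping each to an error in which lower is better, with the localization error normalized by 180°. This is an assumption rather than a definition taken from the table, and the function name `seld_score` is illustrative only.

```python
def seld_score(er, f_score, le_deg, lr):
    """Aggregate SELD error (lower is better), under the assumed convention.

    er      : location-dependent error rate, as a fraction
    f_score : location-dependent F-score, as a fraction in [0, 1]
    le_deg  : class-dependent localization error, in degrees
    lr      : class-dependent localization recall, as a fraction in [0, 1]
    """
    # Assumed formula: mean of ER, (1 - F), LE / 180 and (1 - LR).
    return (er + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr)) / 4.0


# Inputs copied from Table 4: Wang et al. and the FOA DCASE baseline.
print(round(seld_score(0.38, 0.660, 12.8, 0.750), 3))
print(round(seld_score(0.57, 0.299, 22.0, 0.477), 3))
```

Under this assumption the two calls give 0.26 and 0.479, matching the tabulated 0.260 and 0.479 for those rows.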
