Table 10 Performance of different speech-conditioned face generation systems.

System	Inputs	Size of evaluation dataset	Output image	Recall @1 (%)	Recall @3 (%)	Recall @5 (%)	Recall @10 (%)
Speech2Face: learning the face behind a voice³	Random	5000		1		5	10
	Audio 3 s		Normalized	8.54		24.8	38.54
	Audio 6 s		Normalized	10.92		30.6	45.82
Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging⁴		378	64 × 64	5.80		20.40	36.70
			128 × 128	5.00		19.40	32.30
			64 × 64 and 128 × 128	6.10		18.80	35.40
Speaker embedding SLF face generation	Random	118	128 × 128 non-normalized	0.85	2.50	4.20	8.50
	Audio and additional attributes			9.30	16.10	23.70	34.70
	Audio			8.50	13.60	17.80	29.60
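
As a reading aid for the Recall@K columns, the following is a minimal sketch of how such a retrieval score and its chance level could be computed. It assumes each query has exactly one correct face in the gallery and that the rank of that face is known (1 = best match). The function names `recall_at_k` and `chance_recall_at_k` are illustrative and are not taken from any of the systems in the table, whose exact evaluation protocols may differ.

```python
from typing import Iterable


def recall_at_k(ranks: Iterable[int], k: int) -> float:
    """Percentage of queries whose true match is ranked within the top k.

    `ranks` holds the 1-based rank of the correct gallery item per query
    (assumption: exactly one correct item per query).
    """
    ranks = list(ranks)
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def chance_recall_at_k(k: int, gallery_size: int) -> float:
    """Expected Recall@K (%) of a uniformly random ranking.

    With one correct item among `gallery_size` candidates, the chance
    that it lands in the top k is k / gallery_size.
    """
    if gallery_size <= 0:
        raise ValueError("gallery_size must be positive")
    return 100.0 * min(k, gallery_size) / gallery_size
```

Under these assumptions, a gallery of 118 faces gives chance levels of roughly 0.85%, 2.54%, 4.24% and 8.47% for K = 1, 3, 5 and 10, which is close to the Random row reported for the 118-sample evaluation set.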
