Table 10. Performance of different speech-conditioned face generation systems.
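| System | Inputs | Size of evaluation dataset | Output image | Recall@1 (%) | Recall@3 (%) | Recall@5 (%) | Recall@10 (%) |
|---|---|---|---|---|---|---|---|
| Speech2Face: learning the face behind a voice [3] | Random | 5000 | | 1 | | 5 | 10 |
| Speech2Face: learning the face behind a voice [3] | Audio 3 s | 5000 | Normalized | 8.54 | | 24.8 | 38.54 |
| Speech2Face: learning the face behind a voice [3] | Audio 6 s | 5000 | Normalized | 10.92 | | 30.6 | 45.82 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 64 × 64 | 5.80 | | 20.40 | 36.70 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 128 × 128 | 5.00 | | 19.40 | 32.30 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 64 × 64 and 128 × 128 | 6.10 | | 18.80 | 35.40 |
| Speaker embedding SLF face generation | Random | 118 | 128 × 128 non-normalized | 0.85 | 2.50 | 4.20 | 8.50 |
| Speaker embedding SLF face generation | Audio and additional attributes | 118 | 128 × 128 non-normalized | 9.30 | 16.10 | 23.70 | 34.70 |
| Speaker embedding SLF face generation | Audio | 118 | 128 × 128 non-normalized | 8.50 | 13.60 | 17.80 | 29.60 |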
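
The table reports Recall@K without stating how it is computed. As a minimal sketch, assuming the common retrieval definition (a query counts as a hit when its true match appears among the K highest-ranked gallery candidates), the metric and its chance level could be computed as follows. The function names, the `similarity` matrix layout, and the toy scores are hypothetical and are not taken from any of the cited systems.

```python
def recall_at_k(similarity, true_index, ks=(1, 3, 5, 10)):
    """Percentage of queries whose true match ranks within the top K.

    similarity[q][c] is an assumed score between query q and gallery
    candidate c (higher means more similar); true_index[q] is the index
    of the correct candidate for query q.
    """
    hits = {k: 0 for k in ks}
    for scores, target in zip(similarity, true_index):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        position = ranked.index(target)  # 0-based rank of the true match
        for k in ks:
            if position < k:
                hits[k] += 1
    n_queries = len(true_index)
    return {k: 100.0 * hits[k] / n_queries for k in ks}


def chance_recall_at_k(gallery_size, k):
    """Expected Recall@K (%) if candidates were ranked uniformly at random."""
    return 100.0 * min(k, gallery_size) / gallery_size


# Toy example: 3 queries scored against a gallery of 4 candidates.
toy_similarity = [
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.1, 0.8, 0.3],
    [0.5, 0.4, 0.3, 0.6],
]
toy_truth = [0, 1, 2]
print(recall_at_k(toy_similarity, toy_truth))
print(round(chance_recall_at_k(118, 1), 2))
```

Under this assumed definition, a gallery of 118 candidates gives a chance level of 100/118 ≈ 0.85% at K = 1, which is consistent with the Random row reported for the 118-item evaluation set.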