Table 10. Performance of different speech-conditioned face generation systems.

From: Scalable multimodal approach for face generation and super-resolution using a conditional diffusion model

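| System | Inputs | Size of evaluation dataset | Output image | Recall@1 (%) | Recall@3 (%) | Recall@5 (%) | Recall@10 (%) |
|---|---|---|---|---|---|---|---|
| Speech2Face: learning the face behind a voice [3] | Random | 5000 | | 1 | | 5 | 10 |
| Speech2Face: learning the face behind a voice [3] | Audio 3 s | 5000 | Normalized | 8.54 | | 24.8 | 38.54 |
| Speech2Face: learning the face behind a voice [3] | Audio 6 s | 5000 | Normalized | 10.92 | | 30.6 | 45.82 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 64 × 64 | 5.80 | | 20.40 | 36.70 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 128 × 128 | 5.00 | | 19.40 | 32.30 |
| Speech fusion to face: bridging the gap between human’s vocal characteristics and facial imaging [4] | | 378 | 64 × 64 and 128 × 128 | 6.10 | | 18.80 | 35.40 |
| Speaker embedding SLF face generation | Random | 118 | 128 × 128 non-normalized | 0.85 | 2.50 | 4.20 | 8.50 |
| Speaker embedding SLF face generation | Audio and additional attributes | 118 | 128 × 128 non-normalized | 9.30 | 16.10 | 23.70 | 34.70 |
| Speaker embedding SLF face generation | Audio | 118 | 128 × 128 non-normalized | 8.50 | 13.60 | 17.80 | 29.60 |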
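
As a rough aid for reading the Recall@K columns, the sketch below computes Recall@K from a list of retrieval ranks and the chance-level value K/N for a gallery of N candidates. It assumes that each query has exactly one correct face in the gallery and that a "Random" system ranks the gallery uniformly at random; the table itself does not state either assumption. The function names `recall_at_k` and `chance_recall_at_k` are hypothetical and do not come from any of the listed systems.

```python
from typing import Iterable


def recall_at_k(ranks: Iterable[int], k: int) -> float:
    """Percentage of queries whose correct item is ranked in the top k.

    `ranks` holds the 1-based rank of the correct item for each query.
    """
    ranks = list(ranks)
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def chance_recall_at_k(k: int, gallery_size: int) -> float:
    """Expected Recall@K (%) of a uniformly random ranking.

    Assumption: exactly one correct item per query, so the correct item
    lands in the top k with probability min(k, N) / N.
    """
    if gallery_size <= 0:
        raise ValueError("gallery_size must be positive")
    return 100.0 * min(k, gallery_size) / gallery_size


if __name__ == "__main__":
    # Chance level for a gallery of 118 items, the evaluation size
    # listed for the last system in the table.
    for k in (1, 3, 5, 10):
        print(f"Recall@{k}: {chance_recall_at_k(k, 118):.2f}%")
```

Under these assumptions, the values for N = 118 are about 0.85, 2.54, 4.24 and 8.47, close to the Random row reported for the 118-item evaluation set. This suggests, but does not confirm, that the row is a chance-level baseline.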