Table 8 The performance of different systems with respect to different conditional scale values.

From: Scalable multimodal approach for face generation and super-resolution using a conditional diffusion model

| System | Cond. scale | R@1 (%) | R@3 (%) | R@5 (%) | R@10 (%) | Gender accuracy (%) | Ethnicity accuracy (%) | Age group accuracy (%) | RMSE of age (years) | Average PSNR | Average SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random selection | - | 0.85 | 2.50 | 4.20 | 8.50 | - | - | - | - | - | - |
| Only image guide | - | 6.80 | 11.10 | 16.90 | 22.90 | - | - | - | - | - | - |
| Full input | 0 | 5.10 | 7.60 | 14.40 | 20.30 | 71.20 | 46.60 | 42.40 | 7.55 | 16.72 | 0.4 |
| | 2 | 9.30 | 18.60 | 28.80 | 42.40 | 89.00 | 62.70 | 55.10 | 5.87 | 16.52 | 0.38 |
| | 4 | 7.60 | 22.00 | 28.00 | 39.00 | 91.50 | 64.40 | 58.50 | 5.05 | 15.57 | 0.34 |
| | 6 | 16.90 | 24.60 | 31.40 | 46.60 | 87.30 | 64.40 | 58.50 | 5.59 | 14.96 | 0.3 |
| | 8 | 16.90 | 30.50 | 33.90 | 45.80 | 90.70 | 65.30 | 57.60 | 5.62 | 14.37 | 0.27 |
| Speaker embedding only | 0 | 1.70 | 3.40 | 4.20 | 10.20 | 60.20 | 45.80 | 47.50 | 8.52 | 10.95 | 0.24 |
| | 2 | 8.50 | 12.70 | 17.80 | 29.70 | 89.00 | 46.60 | 43.20 | 6.93 | 11.11 | 0.23 |
| | 4 | 8.50 | 13.60 | 17.80 | 29.60 | 89.00 | 47.50 | 41.50 | 7.54 | 10.52 | 0.2 |
| | 6 | 6.80 | 15.30 | 21.20 | 31.40 | 85.60 | 54.20 | 48.30 | 8.31 | 10.38 | 0.18 |
| | 8 | 4.30 | 8.50 | 14.50 | 32.50 | 88.00 | 47.90 | 43.60 | 7.96 | 10.26 | 0.16 |

  1. Significant values are given in bold.