Table 7 Text classification results (F1-scores) for test-class-mhr (fivefold CV; results averaged per class). We use the two-sample Kolmogorov-Smirnov (2S-KS) test for (a) comparisons between models trained with the same type of data, where * marks statistically significant improvements for LDA over BoW and for CNN over LDA (α = 0.05, n1 = n2 = 30); and (b) comparisons within a model trained with different types of data (column KS test). Models using fewer than all key phrases that gave results closest to those with real data are highlighted in bold. We also report the results of our ablation experiments, in which the training data contain only the context of key phrases, either real or generated.

From: Generation and evaluation of artificial mental health records for Natural Language Processing

 

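Columns F20 to F10 are ICD-10 classes.

| Model | Data | F20 | F32 | F60 | F31 | F25 | F10 | av. | KS test (D, p-value) |
|---|---|---|---|---|---|---|---|---|---|
| BoW | genuine | 0.47 | 0.31 | 0.32 | 0.20 | 0.14 | 0.24 | 0.28 | |
| | all | 0.47 | 0.33 | 0.27 | 0.23 | 0.17 | 0.23 | 0.28 | 0.07, 0.88 |
| | top+meta | 0.48 | 0.36 | 0.29 | 0.20 | 0.14 | 0.26 | 0.29 | 0.09, 0.61 |
| | one+meta | 0.46 | 0.34 | 0.29 | 0.23 | 0.14 | 0.26 | 0.29 | 0.07, 0.80 |
| | key | 0.47 | 0.27 | 0.26 | 0.11 | 0.12 | 0.23 | 0.24 | 0.17, 0.02 |
| LDA | genuine* | 0.55 | 0.47 | 0.35 | 0.32 | 0.25 | 0.40 | 0.39 | |
| | all* | 0.55 | 0.44 | 0.35 | 0.31 | 0.26 | 0.37 | 0.38 | 0.11, 0.35 |
| | top+meta* | 0.52 | 0.43 | 0.37 | 0.29 | 0.25 | 0.40 | 0.38 | 0.09, 0.51 |
| | one+meta* | 0.50 | 0.45 | 0.36 | 0.28 | 0.23 | 0.39 | 0.37 | 0.14, 0.10 |
| | key* | 0.54 | 0.45 | 0.38 | 0.30 | 0.24 | 0.40 | 0.39 | 0.07, 0.88 |
| CNN | genuine* | 0.66 | 0.59 | 0.51 | 0.37 | 0.23 | 0.53 | 0.48 | |
| | all* | 0.65 | 0.57 | 0.47 | 0.27 | 0.24 | 0.50 | 0.45 | 0.14, 0.10 |
| | top+meta* | 0.63 | 0.55 | 0.45 | 0.31 | 0.23 | 0.42 | 0.43 | 0.20, 4e−3 |
| | one+meta* | 0.59 | 0.52 | 0.42 | 0.25 | 0.15 | 0.43 | 0.39 | 0.22, 1e−3 |
| | key | 0.57 | 0.34 | 0.33 | 0.23 | 0.20 | 0.35 | 0.34 | 0.37, 1.9e−09 |
| No key phrases | | | | | | | | | |
| CNN | genuine | 0.48 | 0.34 | 0.22 | 0.22 | 0.15 | 0.12 | 0.25 | |
| | top+meta | 0.30 | 0.30 | 0.09 | 0.25 | 0.09 | 0.03 | 0.18 | 0.24, 2.7e−04 |
| LDA | genuine* | 0.41 | 0.40 | 0.32 | 0.22 | 0.20 | 0.26 | 0.30 | |
| | top+meta* | 0.29 | 0.37 | 0.28 | 0.23 | 0.14 | 0.25 | 0.26 | 0.23, 4.4e−04 |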
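
The KS test column reports a two-sample Kolmogorov-Smirnov statistic D together with a p-value. As a rough illustration of what these two numbers mean, the following minimal Python sketch computes D as the largest gap between two empirical CDFs and approximates the p-value with the asymptotic Kolmogorov series. The function names and the example score lists are hypothetical and are not taken from the paper. The authors' exact procedure and the sample sizes used for comparison (b) are not stated here, so this sketch is not expected to reproduce the p-values in the table.

```python
import math
from bisect import bisect_right


def ks_2samp_statistic(a, b):
    """Two-sample KS statistic D = max |F_a(x) - F_b(x)| over all observed x."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n1, n2 = len(a_sorted), len(b_sorted)
    d = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n1
        cdf_b = bisect_right(b_sorted, x) / n2
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_asymptotic_pvalue(d, n1, n2, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    if d <= 0.0:
        return 1.0
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    # Clamp to [0, 1] to absorb small numerical errors in the series.
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical per-run F1 scores for two training-data settings (illustration only).
scores_genuine = [0.47, 0.31, 0.32, 0.20, 0.14, 0.24]
scores_generated = [0.47, 0.27, 0.26, 0.11, 0.12, 0.23]

d_stat = ks_2samp_statistic(scores_genuine, scores_generated)
p_value = ks_asymptotic_pvalue(d_stat, len(scores_genuine), len(scores_generated))
significant = p_value < 0.05  # alpha = 0.05, as in the caption
```

With samples this small the asymptotic approximation is crude; an exact small-sample method would be preferable in practice. A small D with a large p-value indicates that the two score distributions are hard to tell apart, which is the desired outcome when comparing models trained on generated data against models trained on genuine data.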