Table 3. Distribution of annotated entities across all datasets. Here n is the number of documents in each dataset, and each cell gives the number of annotated instances of the corresponding entity.

From: Improving social determinants of health documentation in French electronic health records using large language models

| Entities | MUSCADET-InHouse (n = 1700) | | | MUSCADET-Synthetic (n = 340) | UW-FrenchSDOH (n = 364) | InHouse Tuberculosis and ALS (n = 400) |
|---|---|---|---|---|---|---|
| | Train | Dev | Test | Test | Test | Test |
| Living_Alone | 194 | 32 | 55 | 37 | 17 | 10 |
| Living_WithOthers | 412 | 61 | 128 | 89 | 73 | 44 |
| MaritalStatus_Single | 50 | 9 | 14 | 20 | 20 | 6 |
| MaritalStatus_InRelationship | 629 | 72 | 184 | 132 | 127 | 72 |
| MaritalStatus_Divorced | 70 | 13 | 14 | 17 | 18 | 8 |
| MaritalStatus_Widowed | 69 | 8 | 17 | 18 | 9 | 3 |
| Descendants_Yes | 845 | 101 | 226 | 154 | 80 | 64 |
| Descendants_No | 98 | 13 | 25 | 33 | 5 | 6 |
| Job | 828 | 109 | 247 | 216 | 95 | 58 |
| Last_job | 751 | 100 | 217 | 204 | 85 | 49 |
| Employment_Working | 348 | 53 | 92 | 114 | 56 | 15 |
| Employment_Unemployed | 113 | 18 | 31 | 17 | 9 | 7 |
| Employment_Student | 30 | 4 | 14 | 17 | 8 | 0 |
| Employment_Pensioner | 305 | 45 | 91 | 48 | 34 | 23 |
| Employment_Other | 82 | 3 | 32 | 21 | 16 | 8 |
| Alcohol | 498 | 70 | 127 | 193 | 251 | 42 |
| Tobacco | 627 | 94 | 164 | 219 | 263 | 60 |
| Drug | 78 | 11 | 20 | 101 | 136 | 17 |
| Housing_Yes | 682 | 86 | 203 | 122 | 48 | 137 |
| Housing_No | 16 | 3 | 3 | 13 | 0 | 2 |
| PhysicalActivity_Yes | 154 | 23 | 33 | 63 | 21 | 4 |
| PhysicalActivity_No | 37 | 6 | 11 | 14 | 3 | 0 |
| Income | 24 | 3 | 13 | 8 | 0 | 2 |
| Education | 58 | 5 | 18 | 20 | 3 | 1 |
| Ethnicity | 69 | 13 | 21 | 18 | 1 | 27 |