Table 9 Results of the summarization models on the objective_results division, test set 1.

From: Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | medcon | Average |
|---|---|---|---|---|---|---|---|
| Retrieval-based | | | | | | | |
| trainUMLS | 30.26 | 14.89 | 29.87 | 66.24 | 37.25 | 8.91 | 34.35 |
| trainsent | 40.52 | 18.21 | 38.87 | 73.33 | 45.79 | 12.45 | 41.03 |
| BART-based | | | | | | | |
| BART | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| BART (Division) | 30.48 | 19.16 | 27.80 | 66.64 | 43.07 | 21.56 | 39.27 |
| BART + FTSAMSum | 20.79 | 0.46 | 20.67 | 54.54 | 28.32 | 0.77 | 24.40 |
| BART + FTSAMSum (Division) | 29.45 | 18.01 | 26.63 | 66.43 | 40.75 | 20.17 | 38.01 |
| BioBART | 17.50 | 0.00 | 17.50 | 52.44 | 25.33 | 0.00 | 22.36 |
| BioBART (Division) | 35.38 | 14.33 | 32.79 | 68.40 | 47.63 | 15.69 | 39.81 |
| LED-based | | | | | | | |
| LED | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| LED (Division) | 14.04 | 4.97 | 11.08 | 48.86 | 9.61 | 7.86 | 19.09 |
| LED + FTpubMed | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| LED + FTpubMed (Division) | 10.48 | 3.64 | 8.32 | 42.43 | 7.13 | 8.86 | 16.48 |
| OpenAI (wo FT) | | | | | | | |
| Text-Davinci-002 | 41.48 | 20.12 | 39.95 | 70.61 | 50.79 | 24.42 | 44.92 |
| Text-Davinci-003 | 44.92 | 25.21 | 43.84 | 72.35 | 55.87 | 29.37 | 48.90 |
| ChatGPT | 34.50 | 17.75 | 30.84 | 66.68 | 48.51 | 22.28 | 41.29 |
| GPT-4 | 37.65 | 19.94 | 35.73 | 68.33 | 48.50 | 26.73 | 43.67 |

All columns report evaluation scores on the objective_results division.

  1. Similar to objective_exam, the BART and LED full-note generation models suffered a significant drop on the objective_results division. This may be attributable to the higher sparsity of this division, the small amount of content (sometimes only 2-3 sentences), and the appearance of the relevant text later in the sequence. The OpenAI models were in general the best performers, with the BART division-based models next best.
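The Average column is consistent with first collapsing the three ROUGE variants into a single ROUGE score and then averaging that value with BERTScore, BLEURT, and medcon. The sketch below is a hypothetical reconstruction of that aggregation (the paper's evaluation script is not reproduced here); the function and variable names are illustrative only.

```python
def aggregate_score(rouge1, rouge2, rougeL, bertscore, bleurt, medcon):
    """Combine the six reported metrics into the single 'Average' column.

    Assumption: the three ROUGE scores are averaged first, and the result
    is then averaged with BERTScore, BLEURT, and medcon. This reproduces
    the reported values, e.g. for GPT-4:
    ((37.65 + 19.94 + 35.73) / 3 + 68.33 + 48.50 + 26.73) / 4 ≈ 43.67.
    """
    rouge = (rouge1 + rouge2 + rougeL) / 3
    return (rouge + bertscore + bleurt + medcon) / 4


# Example with the trainsent row: prints 41.03, matching the table.
print(round(aggregate_score(40.52, 18.21, 38.87, 73.33, 45.79, 12.45), 2))
```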