Table 10 Results of the summarization models on the assessment_and_plan division, test set 1.

From: Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

All columns report evaluation scores on the assessment_and_plan division.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | medcon | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Retrieval-based** | | | | | | | |
| trainUMLS | 44.59 | 21.50 | 29.66 | 70.39 | 44.77 | 24.70 | 42.94 |
| trainsent | 41.28 | 19.73 | 28.02 | 69.48 | 43.18 | 18.79 | 40.28 |
| **BART-based** | | | | | | | |
| BART | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
| BART (Division) | 43.31 | 20.59 | 26.55 | 67.49 | 40.99 | 32.30 | 42.73 |
| BART + FTSAMSum | 1.52 | 0.49 | 0.87 | 35.38 | 19.79 | 1.00 | 14.28 |
| BART + FTSAMSum (Division) | 43.89 | 21.37 | 27.56 | 68.09 | 41.96 | 31.33 | 43.08 |
| BioBART | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
| BioBART (Division) | 42.44 | 19.44 | 26.42 | 67.57 | 43.88 | 31.12 | 43.00 |
| **LED-based** | | | | | | | |
| LED | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
| LED (Division) | 28.23 | 6.13 | 12.44 | 55.75 | 27.78 | 21.94 | 30.27 |
| LED + FTpubMed | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
| LED + FTpubMed (Division) | 28.00 | 5.99 | 13.07 | 55.68 | 20.95 | 25.01 | 29.33 |
| **OpenAI (w/o FT)** | | | | | | | |
| Text-Davinci-002 | 30.90 | 12.27 | 21.44 | 61.01 | 44.98 | 35.04 | 40.64 |
| Text-Davinci-003 | 35.41 | 14.86 | 25.38 | 63.97 | 49.18 | 46.40 | 46.19 |
| ChatGPT | 36.43 | 12.50 | 23.32 | 63.56 | 48.21 | 43.71 | 44.89 |
| GPT-4 | 38.16 | 14.12 | 24.90 | 64.26 | 49.41 | 42.36 | 45.44 |

  1. Similar to the objective_exam and objective_results divisions, the BART and LED full-note generation models suffered a significant drop on the assessment_and_plan division. This may be attributable to this section's text appearing later in the sequence. The OpenAI models were in general the best performers, with the BART division-based models the next best.
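For readers checking the numbers, the Average column is consistent with first averaging the three ROUGE variants into a single value and then taking the mean of that value with BERTScore, BLEURT, and medcon. The sketch below illustrates this aggregation; the function name and the verification row are illustrative and not taken from the paper's code.

```python
from statistics import mean

def aggregate_score(rouge1: float, rouge2: float, rougeL: float,
                    bertscore: float, bleurt: float, medcon: float) -> float:
    """Hypothetical reconstruction of the table's Average column:
    the three ROUGE scores are collapsed into one value before
    being averaged with the other three metrics."""
    rouge_avg = mean([rouge1, rouge2, rougeL])
    return mean([rouge_avg, bertscore, bleurt, medcon])

# Check against the trainUMLS row: mean(44.59, 21.50, 29.66) = 31.92,
# then mean(31.92, 70.39, 44.77, 24.70) = 42.94, matching the table.
print(round(aggregate_score(44.59, 21.50, 29.66, 70.39, 44.77, 24.70), 2))
```

The same computation reproduces the 7.26 reported for the full-note BART and LED models, where every metric except BLEURT (29.05) is zero.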