Table 6 Results of the summarization models evaluated at the full note level, test set 1.

From: Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | MEDCON |
| --- | --- | --- | --- | --- |
| Transcript-copy-and-paste | | | | |
| longest speaker turn | 27.84 | 9.32 | 23.44 | 32.37 |
| longest doctor turn | 27.47 | 9.23 | 23.20 | 32.33 |
| 12 speaker turns | 33.16 | 10.60 | 30.01 | 39.68 |
| 12 doctor turns | 35.88 | 12.44 | 32.72 | 47.79 |
| transcript | 32.84 | 12.53 | 30.61 | 55.65 |
| Retrieval-based | | | | |
| trainUMLS | 43.87 | 17.55 | 40.47 | 33.30 |
| trainsent | 41.59 | 15.50 | 38.20 | 26.17 |
| BART-based | | | | |
| BART | 41.76 | 19.20 | 34.70 | 43.38 |
| BART (Division) | 51.56 | 24.06 | 45.92 | 47.23 |
| BART + FTSAMSum | 40.87 | 18.96 | 34.60 | 41.55 |
| BART + FTSAMSum (Division) | 53.46 | 25.08 | 48.62 | 48.23 |
| BioBART | 39.09 | 17.24 | 33.19 | 42.82 |
| BioBART (Division) | 49.53 | 22.47 | 44.92 | 43.06 |
| LED-based | | | | |
| LED | 28.37 | 5.52 | 22.78 | 30.44 |
| LED (Division) | 34.15 | 8.01 | 29.80 | 32.67 |
| LED + FTpubMed | 27.19 | 5.30 | 21.80 | 27.44 |
| LED + FTpubMed (Division) | 30.46 | 6.93 | 26.66 | 32.34 |
| OpenAI (w/o FT) | | | | |
| Text-Davinci-002 | 41.08 | 17.27 | 37.46 | 47.39 |
| Text-Davinci-003 | 47.07 | 22.08 | 43.11 | 57.16 |
| ChatGPT | 47.44 | 19.01 | 42.47 | 55.84 |
| GPT-4 | 51.76 | 22.58 | 45.97 | 57.78 |

  1. Simple retrieval-based methods provided strong baselines, with better out-of-the-box performance than the LED models and the full-note BART models. For the fine-tuned BART and LED models, division-based generation generally worked better. OpenAI models with simple prompts gave competitive outputs despite no additional fine-tuning or dynamic prompting.
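The ROUGE-1/2/L columns above are n-gram overlap scores (reported on a 0–100 scale) between each system-generated note and the reference clinical note. As a minimal illustration only, not the paper's actual evaluation pipeline, the widely used `rouge-score` Python package can compute such scores; the note strings below are hypothetical.

```python
# Minimal sketch: ROUGE-1/2/L F1 between a generated clinical note and a
# reference note, using the `rouge-score` package (pip install rouge-score).
# Illustrative only; this is not the evaluation script used in the paper.
from rouge_score import rouge_scorer

# Hypothetical reference and system-generated notes.
reference_note = (
    "HISTORY OF PRESENT ILLNESS: The patient reports worsening right knee "
    "pain over the past two weeks. She denies fever or swelling."
)
generated_note = (
    "HPI: Patient reports right knee pain worsening over two weeks; denies fever."
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_note, generated_note)

for name, result in scores.items():
    # Report F1, scaled to percentages to match the table above.
    print(f"{name}: {100 * result.fmeasure:.2f}")
```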