Table 11 Model performance on different test set splits, comparing virtscribe dialogues with ASR transcripts versus human transcripts.

From: Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

| Test set | BART fine-tuning | Test split | ROUGE-1 | ROUGE-2 | ROUGE-L | MEDCON |
|---|---|---|---|---|---|---|
| 1 | train | ASR | 48.61 | 18.94 | 41.74 | 42.63 |
| 1 | +trainASR | ASR | 49.70 | 19.96 | 43.82 | 41.96 |
| 1 | train | human | 48.28 | 20.09 | 43.98 | 46.13 |
| 1 | +trainASR | human | 48.50 | 19.52 | 43.59 | 42.85 |
| 2 | train | ASR | 51.29 | 21.31 | 43.76 | 45.21 |
| 2 | +trainASR | ASR | 50.42 | 21.30 | 44.68 | 43.71 |
| 2 | train | human | 50.11 | 20.80 | 44.44 | 43.35 |
| 2 | +trainASR | human | 48.44 | 20.47 | 43.68 | 44.28 |
| 3 | train | ASR | 50.41 | 20.01 | 43.79 | 49.91 |
| 3 | +trainASR | ASR | 49.22 | 19.72 | 43.19 | 44.18 |
| 3 | train | human | 50.86 | 19.50 | 44.59 | 45.48 |
| 3 | +trainASR | human | 47.42 | 18.42 | 42.67 | 44.72 |

  1. The model fine-tuned on the train set is BART + FTSAMSum (Division), fine-tuned for 10 epochs on the original train set, as in the baseline methods. The +trainASR model is the same BART + FTSAMSum (Division) model fine-tuned for 3 additional epochs on the virtscribe-with-ASR split of the train set.
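The continued fine-tuning step described in the footnote can be sketched roughly as follows. This is a minimal illustration using the Hugging Face transformers and datasets libraries, not the authors' code: the checkpoint path, data file, column names (dialogue, note), batch size, and learning rate are assumptions; only the 3-epoch count comes from the footnote.

```python
# Sketch of the "+trainASR" setup: resume from a BART checkpoint already
# fine-tuned on the original train set and train 3 more epochs on the
# virtscribe-with-ASR portion of the train set.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "path/to/bart-ftsamsum-division"   # hypothetical path to the baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical CSV with one dialogue/note pair per row for the ASR split.
data = load_dataset("csv", data_files={"train": "train_virtscribe_asr.csv"})

def preprocess(batch):
    # Tokenize ASR dialogues as inputs and clinical notes as targets.
    inputs = tokenizer(batch["dialogue"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["note"], truncation=True, max_length=1024)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data["train"].map(preprocess, batched=True,
                              remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-plus-trainasr",
    num_train_epochs=3,              # 3 additional epochs, per the footnote
    per_device_train_batch_size=4,   # assumed hyperparameter
    learning_rate=5e-5,              # assumed hyperparameter
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```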