Table 2 Performance of different fine-tuned models for seizure frequency attribute extraction on the test set

From: Leveraging pretrained language models for seizure frequency extraction from epilepsy evaluation reports

| Model | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| bert-large-cased | 87.19 ± 1.74 | 90.9 ± 1.35 | 89 ± 1.43 |
| biobert-large-cased | 87.45 ± 1.68 | 90.28 ± 1.34 | 88.84 ± 1.38 |
| Bio_ClinicalBERT | 83.98 ± 1.97 | 88.05 ± 1.56 | 85.96 ± 1.67 |
| Llama-2-70b-hf | 84.64 ± 2.68 | 85.83 ± 2.17 | 85.23 ± 2.33 |
| GPT-3.5 Turbo | 88.99 ± 1.62 | 90.23 ± 1.7 | 87.91 ± 1.61 |
| GPT-4 | **90.23 ± 1.7** | **93.51 ± 1.21** | **91.84 ± 1.36** |

  1. Data are shown as mean ± standard deviation. Highest scores are highlighted.
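
For readers who want to reproduce this style of reporting, the following is a minimal sketch of how precision, recall, and F1-score are conventionally computed from match counts and then summarized as mean ± standard deviation. The source does not state how the repeated measurements were obtained, so the per-run counts, the function names `prf1` and `summarize`, and the use of the sample standard deviation are all assumptions made for illustration; the numbers are not from the paper.

```python
from statistics import mean, stdev


def prf1(tp, fp, fn):
    """Return (precision, recall, F1) as fractions from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize(values):
    """Format fractions as 'mean ± sample standard deviation' in percent."""
    return f"{100 * mean(values):.2f} ± {100 * stdev(values):.2f}"


# Hypothetical (true positive, false positive, false negative) counts,
# one tuple per evaluation run. These are illustrative values only.
runs = [(90, 10, 10), (88, 12, 9), (92, 9, 8)]

precisions, recalls, f1s = zip(*(prf1(*run) for run in runs))

print("Precision (%):", summarize(precisions))
print("Recall (%):   ", summarize(recalls))
print("F1-score (%): ", summarize(f1s))
```

Note that the F1-score is averaged across runs here rather than derived from the mean precision and mean recall, so a reported mean F1 need not equal the harmonic mean of the other two columns.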