Table 1 Performance of various fine-tuned models for seizure frequency phrase extraction on the test set

From: Leveraging pretrained language models for seizure frequency extraction from epilepsy evaluation reports

| Model | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| bert-large-cased | 77.33 ± 4.24 | 71.95 ± 4.65 | 74.51 ± 4.2 |
| biobert-large-cased | 78.83 ± 4.06 | 75.43 ± 4.27 | 77.06 ± 3.86 |
| Bio_ClinicalBERT | 70.12 ± 4.79 | 65.8 ± 4.58 | 67.84 ± 4.29 |
| Llama-2-70b-hf | 80.72 ± 4.16 | 80.69 ± 3.65 | 80.68 ± 3.58 |
| GPT-3.5 Turbo | 84.53 ± 3.85 | 77.13 ± 4.15 | 80.64 ± 3.81 |
| GPT-4 | 86.61 ± 4.28 | 85.04 ± 3.51 | 85.79 ± 3.59 |

  1. Data are shown as mean ± standard deviation. Highest scores are highlighted.
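
As a reading aid, the relationship between the three columns can be checked with a short sketch. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the table does not state how the means were aggregated, so the F1 recomputed from the mean precision and mean recall is only expected to be close to the reported mean F1, not identical. The names `f1_score`, `TABLE_1`, and `f1_gaps` are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision, mean recall, and reported mean F1 (all in %), copied from Table 1.
TABLE_1 = {
    "bert-large-cased": (77.33, 71.95, 74.51),
    "biobert-large-cased": (78.83, 75.43, 77.06),
    "Bio_ClinicalBERT": (70.12, 65.8, 67.84),
    "Llama-2-70b-hf": (80.72, 80.69, 80.68),
    "GPT-3.5 Turbo": (84.53, 77.13, 80.64),
    "GPT-4": (86.61, 85.04, 85.79),
}


def f1_gaps(rows: dict) -> dict:
    """Difference between the recomputed F1 and the reported mean F1, per model."""
    return {
        model: f1_score(p, r) - reported_f1
        for model, (p, r, reported_f1) in rows.items()
    }


if __name__ == "__main__":
    for model, gap in f1_gaps(TABLE_1).items():
        print(f"{model}: recomputed minus reported F1 = {gap:+.3f}")
```

Small gaps are expected under this assumption, because an average of per-run F1 values generally differs slightly from the F1 of the averaged precision and recall.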