Table 2. Results on human-annotated small-scale test datasets: internal hold-out (Mayo Clinic) and external (MIMIC-III)

From: Weakly supervised language models for automated extraction of critical findings from radiology reports

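| Dataset | Model | Prompting setup | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Internal hold-out test | Mistral-PT | Zero-Shot | 0.38 | 0.33 | 0.35 |
| Internal hold-out test | Mistral-PT | Few-Shot | 0.41 | 0.56 | 0.47 |
| Internal hold-out test | Mistral-WFT | Zero-Shot | 0.57 | 0.42 | 0.48 |
| Internal hold-out test | Mistral-WFT | Few-Shot | 0.63 | 0.41 | 0.49 |
| Internal hold-out test | BioMistral-PT | Zero-Shot | 0.47 | 0.23 | 0.31 |
| Internal hold-out test | BioMistral-PT | Few-Shot | 0.53 | 0.31 | 0.39 |
| Internal hold-out test | BioMistral-WFT | Zero-Shot | 0.63 | 0.51 | 0.56 |
| Internal hold-out test | BioMistral-WFT | Few-Shot | 0.68 | 0.49 | 0.57 |
| External test | Mistral-PT | Zero-Shot | 0.41 | 0.37 | 0.39 |
| External test | Mistral-PT | Few-Shot | 0.53 | 0.41 | 0.46 |
| External test | Mistral-WFT | Zero-Shot | 0.42 | 0.47 | 0.44 |
| External test | Mistral-WFT | Few-Shot | 0.45 | 0.51 | 0.48 |
| External test | BioMistral-PT | Zero-Shot | 0.38 | 0.29 | 0.33 |
| External test | BioMistral-PT | Few-Shot | 0.41 | 0.37 | 0.39 |
| External test | BioMistral-WFT | Zero-Shot | 0.57 | 0.45 | 0.50 |
| External test | BioMistral-WFT | Few-Shot | 0.65 | 0.51 | 0.57 |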

PT: Pre-trained; WFT: Weakly Fine-tuned.
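The relationship between the three metric columns can be sketched in a few lines of Python. This is an illustrative check only: it assumes the reported F1-score is the harmonic mean of the precision and recall printed in the same row, which the table does not state explicitly (a per-class averaged F1 would not follow this formula). The function name `f1_score` and the `rows` dictionary are hypothetical, and the values in `rows` are copied from the BioMistral-WFT rows of the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the BioMistral-WFT rows of the table.
# Assumption: the printed F1 is derived from these two values alone.
rows = {
    ("Internal hold-out test", "Zero-Shot"): (0.63, 0.51),
    ("Internal hold-out test", "Few-Shot"): (0.68, 0.49),
    ("External test", "Zero-Shot"): (0.57, 0.45),
    ("External test", "Few-Shot"): (0.65, 0.51),
}

for (dataset, setup), (p, r) in rows.items():
    # Round to two decimals to match the precision of the table.
    print(f"{dataset:24s} {setup:10s} F1 = {f1_score(p, r):.2f}")
```

Because the table reports precision and recall already rounded to two decimals, a value recomputed this way may differ from the printed F1-score by about 0.01 in some rows.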