Table 1 Model performance—zero-shot prompting with definitions

From: Privacy-preserving large language models for structured medical information retrieval

 

| Condition | Sens. (7b) | Sens. (13b) | Sens. (70b) | Spec. (7b) | Spec. (13b) | Spec. (70b) | PPV (7b) | PPV (13b) | PPV (70b) | NPV (7b) | NPV (13b) | NPV (70b) | Acc. (7b) | Acc. (13b) | Acc. (70b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ascites | 1.00 | 0.75 | 0.95 | 0.77 | 0.99 | 0.95 | 0.16 | 0.71 | 0.44 | 1.00 | 0.99 | 1.00 | 0.78 | 0.98 | 0.95 |
| Abdominal pain | 0.88 | 0.74 | 0.84 | 0.67 | 0.89 | 0.97 | 0.38 | 0.60 | 0.86 | 0.96 | 0.94 | 0.97 | 0.71 | 0.86 | 0.95 |
| Shortness of breath | 0.87 | 0.42 | 0.87 | 0.77 | 0.99 | 0.96 | 0.45 | 0.86 | 0.82 | 0.96 | 0.89 | 0.97 | 0.79 | 0.88 | 0.94 |
| Confusion | 0.63 | 0.59 | 0.76 | 0.89 | 0.90 | 0.94 | 0.34 | 0.34 | 0.54 | 0.96 | 0.96 | 0.98 | 0.87 | 0.87 | 0.93 |
| Liver cirrhosis | 1.00 | 0.96 | 1.00 | 0.70 | 0.99 | 0.96 | 0.16 | 0.81 | 0.56 | 1.00 | 1.00 | 1.00 | 0.71 | 0.99 | 0.96 |

Sens. = sensitivity; Spec. = specificity; PPV = positive predictive value; NPV = negative predictive value; Acc. = accuracy.

  1. Comparing the three versions of Llama 2, the largest (70b) model showed the highest overall performance, whereas the smallest (7b) model performed worst. The 13b and 70b models achieved accuracy equal to or higher than the 7b model across all conditions.
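For reference, the five metrics reported per condition and model size all derive from the same per-condition confusion-matrix counts (true/false positives and negatives). A minimal sketch of those formulas; the counts in the usage example are hypothetical and not from the study:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the five reported metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives,
    false negatives for one condition under one model.
    """
    return {
        "sensitivity": round(tp / (tp + fn), 2),  # recall on positive cases
        "specificity": round(tn / (tn + fp), 2),  # recall on negative cases
        "ppv": round(tp / (tp + fp), 2),          # positive predictive value
        "npv": round(tn / (tn + fn), 2),          # negative predictive value
        "accuracy": round((tp + tn) / (tp + fp + tn + fn), 2),
    }

# Hypothetical counts for illustration only:
print(metrics(tp=90, fp=20, tn=80, fn=10))
```

This also shows why sensitivity and PPV can diverge sharply (as for ascites and liver cirrhosis at 7b): a model that flags nearly every case has high sensitivity, but each false positive lowers PPV.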