Table 1 Model performance—zero-shot prompting with definitions

	Sensitivity			Specificity			Positive predictive value			Negative predictive value			Accuracy
	7b	13b	70b	7b	13b	70b	7b	13b	70b	7b	13b	70b	7b	13b	70b
Ascites	1.00	0.75	0.95	0.77	0.99	0.95	0.16	0.71	0.44	1.00	0.99	1.00	0.78	0.98	0.95
Abdominal pain	0.88	0.74	0.84	0.67	0.89	0.97	0.38	0.60	0.86	0.96	0.94	0.97	0.71	0.86	0.95
Shortness of breath	0.87	0.42	0.87	0.77	0.99	0.96	0.45	0.86	0.82	0.96	0.89	0.97	0.79	0.88	0.94
Confusion	0.63	0.59	0.76	0.89	0.90	0.94	0.34	0.34	0.54	0.96	0.96	0.98	0.87	0.87	0.93
Liver cirrhosis	1.00	0.96	1.00	0.70	0.99	0.96	0.16	0.81	0.56	1.00	1.00	1.00	0.71	0.99	0.96

Comparing three versions of Llama 2, the largest (70b) model showed the highest performance, whereas the smallest (7b) model performed worst. The 13b and 70b models matched or exceeded the accuracy of the 7b model for every condition; the only tie was for confusion, where the 13b and 7b models both reached 0.87.
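The five metrics reported in Table 1 are all derived from the confusion-matrix counts for each condition. The following is a minimal sketch of those standard definitions, not the study's own evaluation code. The `ConfusionCounts` class, the `metrics` function, and the example counts are hypothetical and are not taken from the study data. The example is chosen to show how a model can reach perfect sensitivity and still have a low positive predictive value when the condition is rare, a pattern similar to the 7b results for ascites and liver cirrhosis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives: condition present, model says present
    fp: int  # false positives: condition absent, model says present
    tn: int  # true negatives: condition absent, model says absent
    fn: int  # false negatives: condition present, model says absent


def metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the five metrics of Table 1 for one condition.

    Assumes every denominator is non-zero; a production version
    would need to handle empty classes explicitly.
    """
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


# Hypothetical counts, NOT from the study: a rare condition
# (10 positives among 200 notes) where the model catches every
# true case but also raises 40 false alarms.
example = ConfusionCounts(tp=10, fp=40, tn=150, fn=0)
result = metrics(example)

for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, sensitivity and negative predictive value are both 1.00 while positive predictive value is only 0.20, because the false alarms outnumber the true cases.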
