Fig. 3: Confusion matrices for extracted features with one-shot prompting.
From: Privacy-preserving large language models for structured medical information retrieval

The confusion matrices visualize the performance of the Llama 2 models with 70 billion parameters in retrieving the presence or absence of the five features ascites, abdominal pain, shortness of breath, confusion and liver cirrhosis in all n = 500 medical histories from MIMIC IV. Each matrix is divided into four quadrants, with the two labels “true” and “false” on each axis. The x-axis depicts the predicted labels and the y-axis the true labels. Values along the diagonal indicate correct predictions (true positives and true negatives), while off-diagonal values represent misclassifications (false positives and false negatives). The numbers indicate absolute counts; the figures in brackets indicate row-normalized fractions, that is, the fraction of cases within each actual class that received the given prediction, so the fractions in each row sum to 1. a shows the best one-shot prompt architecture and its results, whereas b shows that adding definitions, which improved performance with zero-shot prompting, worsened the results for one-shot prompting.
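
The row normalization described in the caption can be illustrated with a minimal sketch. It is not the authors' code: the function names `confusion_counts` and `row_normalize` and the example labels are hypothetical and chosen only for illustration. Rows correspond to the true labels and columns to the predicted labels, and each cell is printed as an absolute count followed by the row-wise fraction in brackets.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix of absolute counts.

    Rows are true labels, columns are predicted labels.
    Index 0 is "false" (feature absent), index 1 is "true" (feature present).
    """
    counts = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        counts[int(actual)][int(predicted)] += 1
    return counts


def row_normalize(counts):
    """Divide each cell by its row total so every non-empty row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([cell / total if total else 0.0 for cell in row])
    return normalized


# Hypothetical labels for a single feature; not data from the study.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, False, False, False, True, False]

counts = confusion_counts(y_true, y_pred)
fractions = row_normalize(counts)

# Print each cell as "count (fraction)", one matrix row per line.
for count_row, fraction_row in zip(counts, fractions):
    print("  ".join(f"{c} ({f:.2f})" for c, f in zip(count_row, fraction_row)))
```

Because each row is divided by the number of cases in that actual class, the bracketed value on the diagonal of the "true" row corresponds to sensitivity and the one on the diagonal of the "false" row to specificity.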