Fig. 2: Confusion matrices for extracted features with zero-shot prompting. | npj Digital Medicine


From: Privacy-preserving large language models for structured medical information retrieval


a The prompt modules used for zero-shot prompting. The detailed instruction was given first, followed by a report and the corresponding instruction formulated as a question, and then by a definition of the features to be extracted. b Confusion matrices showing the performance of the Llama 2 models with 7 billion, 13 billion and 70 billion parameters in retrieving the presence or absence of the five features ascites, abdominal pain, shortness of breath, confusion and liver cirrhosis in all n = 500 medical histories from MIMIC IV. Each matrix is divided into four quadrants, with the two labels “true” and “false” on each axis. The x-axis shows the predicted labels and the y-axis shows the true labels. The matrices are normalized by row, so each cell gives the fraction of observations of that actual class that received the given prediction, and the fractions in each row sum to 1. Values on the diagonal are correct predictions (true negatives and true positives), while off-diagonal values are misclassifications (false positives and false negatives). The “n” values give the absolute number of observations in each category. The top left matrix shows the extraction of ascites with the 70b model. Its top left quadrant (true negatives) has a score of 0.95, indicating a high rate of correct predictions for non-cases of ascites. The top right quadrant (false positives) has a score of 0.05, meaning few non-cases were incorrectly predicted as having ascites. The bottom left quadrant (false negatives) has a score of 0.05, meaning few actual cases were incorrectly identified as not having ascites. Finally, the bottom right quadrant (true positives) has a score of 0.95, indicating a high rate of correct predictions for actual cases of ascites.
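The row normalization described in the caption can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `row_normalized_confusion` and the toy labels are assumptions made for illustration only, and the toy data are not the study data; they are merely chosen so that the resulting proportions resemble the ascites example.

```python
def row_normalized_confusion(y_true, y_pred):
    """Build a 2x2 confusion matrix for binary labels.

    Rows are the true class and columns are the predicted class,
    with index 0 = "false" and index 1 = "true", matching the
    layout described in the caption.
    Returns (counts, proportions), where each row of `proportions`
    sums to 1 (or is all zeros if that true class never occurs).
    """
    counts = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        counts[int(bool(t))][int(bool(p))] += 1

    proportions = []
    for row in counts:
        total = sum(row)
        # Normalize within the actual class; guard against empty classes.
        proportions.append([c / total if total else 0.0 for c in row])
    return counts, proportions


# Hypothetical toy labels (NOT the study data): 20 non-cases and 20 cases,
# with one misclassification in each group.
y_true = [False] * 20 + [True] * 20
y_pred = [False] * 19 + [True] * 1 + [False] * 1 + [True] * 19

counts, proportions = row_normalized_confusion(y_true, y_pred)
for label, n_row, p_row in zip(("false", "true"), counts, proportions):
    print(f"true label = {label}: n = {n_row}, proportions = {p_row}")
```

In this layout, `proportions[0][0]` is the true-negative fraction, `proportions[0][1]` the false-positive fraction, `proportions[1][0]` the false-negative fraction and `proportions[1][1]` the true-positive fraction, while `counts` holds the absolute numbers that the caption refers to as “n”.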
