
Fig. 1: Model performance on human-annotated test datasets.

From: Weakly supervised language models for automated extraction of critical findings from radiology reports

a Performance on the internal hold-out (Mayo Clinic) test set; b performance on the external (MIMIC-III) test set. Each row corresponds to the pre-trained (PT) and weakly fine-tuned (WFT) versions of a model. For each model, we plot bars for ROUGE-1 (blue), BLEU (yellow), G-Eval (green), and Prometheus (red), and consider two prompting techniques for weak-label generation: zero-shot (ZS) and few-shot (FS). The score for each metric (normalized between 0 and 1) is shown above its corresponding bar. Error bars denote standard deviations. The scores for models trained using weak labels generated by FS-based prompting are shown with .
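As a minimal illustration of how the lexical-overlap metrics in this figure can be computed and aggregated (bar height as the mean, error bar as the standard deviation), the sketch below scores generated critical-findings summaries against reference annotations using the `rouge_score` and `nltk` packages. This is not the authors' evaluation code; the helper name `evaluate_reports` and the example reports are hypothetical.

```python
# Sketch only: per-report ROUGE-1 F1 and BLEU (both in [0, 1]) for generated
# critical-findings summaries, aggregated as mean +/- standard deviation,
# mirroring how the bars and error bars in Fig. 1 could be produced.
import numpy as np
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def evaluate_reports(references, predictions):
    """Return arrays of per-report ROUGE-1 F1 and BLEU scores."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    smooth = SmoothingFunction().method1  # avoid zero BLEU on short findings
    rouge1, bleu = [], []
    for ref, pred in zip(references, predictions):
        rouge1.append(scorer.score(ref, pred)["rouge1"].fmeasure)
        bleu.append(sentence_bleu([ref.split()], pred.split(),
                                  smoothing_function=smooth))
    return np.array(rouge1), np.array(bleu)


# Hypothetical reference/prediction pair, for illustration only.
refs = ["acute pulmonary embolism in the right lower lobe"]
preds = ["pulmonary embolism identified in right lower lobe"]
r1, b = evaluate_reports(refs, preds)
print(f"ROUGE-1: {r1.mean():.2f} +/- {r1.std():.2f}")  # bar height, error bar
print(f"BLEU:    {b.mean():.2f} +/- {b.std():.2f}")
```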