Fig. 6: Benchmarking with external validation data.
From: Synoptic reporting by summarizing cancer pathology reports using large language models

Performance of OpenAI’s GPT-4o and the fine-tuned LLAMA-2 7B model (the best-performing model) on the external dataset of Sushil et al. (ref. 29). We test the ability of the two models to extract the four data elements that correspond to the labels provided in the external dataset. The reported accuracy was obtained by manually comparing the labels provided in ref. 29 against the model responses. Note: LLAMA-2 was fine-tuned only on our data; it has not seen any examples from the external dataset.
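
As an illustration of the arithmetic behind a per-element accuracy, the sketch below tallies match/no-match judgements for each data element. It assumes the manual comparison has already been recorded as (element, is_match) pairs. The function name `per_element_accuracy`, the placeholder element names, and the example judgements are all hypothetical and are not taken from the paper or from ref. 29.

```python
from collections import defaultdict


def per_element_accuracy(judgements):
    """Return accuracy per data element.

    judgements: iterable of (element_name, is_match) pairs, where is_match
    records whether a reviewer judged the model response to agree with the
    reference label. This input format is an assumption for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for element, is_match in judgements:
        totals[element] += 1
        correct[element] += int(is_match)
    # Accuracy = matching responses / all responses, per element
    return {element: correct[element] / totals[element] for element in totals}


# Hypothetical reviewer judgements (placeholder names, not data from the paper)
example = [
    ("element_a", True),
    ("element_a", False),
    ("element_b", True),
    ("element_b", True),
]
print(per_element_accuracy(example))
```

Running the same tally separately for each model would give figures that can be compared element by element.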