Fig. 3: Qualitative evaluation results on accuracy, completeness, and readability.

A The overall results of the fine-tuned BART, GPT-3.5 zero-shot, GPT-4 zero-shot, and LLaMA 2 zero-shot models on a scale of 1 to 5, based on 50 randomly sampled test instances from the PubMed Text Summarization dataset. B and C display the numbers of winning, tying, and losing cases when comparing GPT-4 zero-shot against GPT-3.5 zero-shot and against the fine-tuned BART model, respectively. Table 4 reports the corresponding numerical results as a complement. Detailed results, including statistical tests and examples, are provided in Supplementary Information S3.