Fig. 2: Model performance in the complex clinical case diagnosis task under automatic evaluation.
From: Medical foundation large language models for comprehensive text analysis and beyond

The figure presents the top-K accuracy (where 1 ≤ K ≤ 5) of Me-LLaMA-70B-chat, ChatGPT, GPT-4, and LLaMA2-70B-chat on a complex clinical case diagnosis task, evaluated automatically.
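
For readers unfamiliar with the metric, top-K accuracy is the fraction of cases in which the correct diagnosis appears among a model's first K ranked candidates. The following is a minimal sketch of that calculation, not the paper's evaluation code. It assumes a prediction counts as correct when it matches the reference diagnosis exactly after lowercasing and whitespace normalization; the matching rule actually used in the automatic evaluation is not stated in this caption. The function names `normalize` and `top_k_accuracy` and the toy cases are hypothetical.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule, not the paper's)."""
    return " ".join(text.lower().split())


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not references:
        return 0.0

    hits = 0
    for ranked, truth in zip(predictions, references):
        # Only the first k ranked candidates are eligible for a match.
        candidates = {normalize(p) for p in ranked[:k]}
        if normalize(truth) in candidates:
            hits += 1
    return hits / len(references)


# Hypothetical toy data for illustration only; not taken from the paper.
toy_predictions = [
    ["pneumonia", "bronchitis", "asthma"],
    ["migraine", "tension headache", "stroke"],
    ["Gout", "septic arthritis", "pseudogout"],
    ["anemia", "hypothyroidism", "depression"],
]
toy_references = ["pneumonia", "stroke", "Septic  arthritis", "lupus"]

for k in range(1, 6):
    print(f"top-{k} accuracy: {top_k_accuracy(toy_predictions, toy_references, k):.2f}")
```

Because the candidate set only grows as K increases, top-K accuracy is non-decreasing in K, which is why curves in plots of this kind rise or stay flat from K = 1 to K = 5.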