Fig. 6: Sample pair “initial case + question” and evaluation criteria for this question.
From: Autonomous medical evaluation for guideline adherence of large language models

The figure presents a clinical scenario involving a 58-year-old female patient with breast-related symptoms: the initial case description, followed by the first question from the AMEGA benchmark. Below the scenario, an evaluation table lists the criteria used to assess the AI model’s response, including the correct diagnosis of invasive breast cancer and the symptoms the response should identify. Each criterion is assigned a specific score, illustrating how the AMEGA benchmark systematically evaluates AI models’ adherence to medical guidelines when diagnosing breast cancer.
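
The per-criterion scoring described in the caption can be pictured with a minimal sketch. This is not the AMEGA implementation. The `Criterion` class, the `score_response` function, the placeholder criterion names other than the invasive breast cancer diagnosis, and all point values are hypothetical and chosen only for illustration. The sketch also assumes that some grader, human or model, has already decided which criteria a response meets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str  # what the response is expected to mention
    points: float     # score awarded when the criterion is met


def score_response(criteria, met):
    """Return (earned, maximum) points for one question.

    `met` maps a criterion description to True/False, as decided by
    whatever grader is used; missing entries count as not met.
    """
    earned = sum(c.points for c in criteria if met.get(c.description, False))
    maximum = sum(c.points for c in criteria)
    return earned, maximum


# Hypothetical criteria and point values, not taken from the figure.
criteria = [
    Criterion("diagnosis: invasive breast cancer", 2.0),
    Criterion("symptom A identified", 1.0),
    Criterion("symptom B identified", 0.5),
]

# Hypothetical grader verdicts for a single model response.
met = {
    "diagnosis: invasive breast cancer": True,
    "symptom A identified": True,
    "symptom B identified": False,
}

earned, maximum = score_response(criteria, met)
print(f"{earned} / {maximum}")
```

Keeping the criteria as data separate from the grader verdicts means the same table could be reused across responses from different models, with only the `met` mapping changing.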