Fig. 1: The structured evaluation process of the AMEGA benchmark, showing how candidate AI models are assessed on medical cases.

The diagram uses color-coded shapes to represent the different components. Green rectangles indicate benchmark content (initial case, questions, sections, and criteria), forming the foundation of the evaluation. Blue elements (an ellipse for the Candidate Model, a parallelogram for its output) show the AI model being tested and its responses. Orange elements (an ellipse for the Evaluator Model, a parallelogram for its output) represent the evaluation process and its results. Gray rectangles display the scores at each level (criterion, section, question, and case).

The workflow progresses from left to right, starting with the initial case and questions. The Candidate Model provides answers, which the Evaluator Model assesses against predefined criteria. Solid lines show the active processing flow, while dashed lines indicate data dependencies. Circular symbols represent mathematical operations: “×” for multiplication (applying criterion scores) and “+” for addition (summing scores at each level).

A key feature is the Reask process, which allows the Candidate Model to refine its answer if the initial response does not fully meet the criteria and Reasks are still permitted. This systematic approach ensures a comprehensive evaluation of the AI model’s medical knowledge and reasoning capabilities, with scores aggregated from individual criteria up to the overall case level.
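To make the depicted scoring flow concrete, the following is a minimal Python sketch of the aggregation and Reask logic, assuming the figure's structure (criteria nested in sections, sections in questions). The class names, the `generate`/`rate` model interfaces, the [0, 1] fulfillment scale, and the question-level Reask trigger are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str      # what a correct answer should contain
    weight: float  # points awarded when the criterion is fully met

@dataclass
class Section:
    criteria: list[Criterion]

@dataclass
class Question:
    prompt: str
    sections: list[Section]

def judge(evaluator, answer: str, criterion: Criterion) -> float:
    # Hypothetical evaluator interface: returns a fulfillment degree in [0, 1].
    return evaluator.rate(answer, criterion.text)

def score_question(candidate, evaluator, case_context: str,
                   question: Question, max_reasks: int = 1) -> float:
    # Hypothetical candidate interface: free-text answer to the case question.
    answer = candidate.generate(f"{case_context}\n{question.prompt}")
    max_score = sum(c.weight for s in question.sections for c in s.criteria)
    question_score = 0.0
    for attempt in range(max_reasks + 1):
        # "x" node: fulfillment degree times criterion weight.
        # "+" nodes: criterion scores sum to section scores, sections to the question score.
        question_score = sum(
            judge(evaluator, answer, c) * c.weight
            for s in question.sections
            for c in s.criteria
        )
        if question_score >= max_score or attempt == max_reasks:
            break
        # Reask: let the Candidate Model refine its answer while Reasks remain.
        answer = candidate.generate(
            f"{case_context}\n{question.prompt}\n"
            "Your previous answer did not fully meet the criteria; please refine it.\n"
            f"Previous answer: {answer}"
        )
    return question_score

def score_case(candidate, evaluator, case_context: str,
               questions: list[Question]) -> float:
    # Final "+" node: question scores sum to the overall case score.
    return sum(score_question(candidate, evaluator, case_context, q)
               for q in questions)
```

In this sketch the Reask is triggered whenever the question falls short of its maximum score; the actual benchmark may gate Reasks per criterion or use a different threshold, but the aggregation path (criterion × weight, summed upward to section, question, and case) mirrors the figure.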