Fig. 5: Evaluated responses of Mistral-7B-instruct-v0.2 and GPT-4-1106-preview.
From: Autonomous medical evaluation for guideline adherence of large language models

This figure compares how two models answer the breast cancer diagnosis question from Fig. 6. Mistral-7B-instruct-v0.2 (left) scores 8.0/8.0 and GPT-4-1106-preview (right) scores 7.5/8.0. Both correctly identify breast cancer as the primary diagnosis. Mistral gives a concise list of symptoms and risk factors, whereas GPT-4 gives a more detailed response structured as numbered points. Highlighted text marks the key symptoms and findings that match the evaluation criteria. The 0.5-point difference may reflect variations in how completely or specifically the symptoms are described.