Fig. 1: Percentage of Accurate Predictions of Medical Fitness for Surgery Across Different Agents.

The figure shows the percentage of accurate assessments of medical fitness for surgery made by various LLMs and by human evaluators. Each bar represents the accuracy of the responses from a specific model or from the human evaluators. The overall accuracy of the GPT4_international models was 93.0%, significantly higher than that of the human evaluators (86.0%).