Extended Data Fig. 1: Evaluation of BioMedAgent using BioMed-AQA.

a. Calculation of the Win score and determination of task execution status (success or failed). b. ROC curve for the autoscoring agent, with an AUC of 0.926 indicating strong alignment with manual evaluations. c. Confusion matrix comparing autoscoring results with manual evaluations. d. Summary of BioMed-AQA and BioMed-AQA-MCQ. BioMed-AQA (n = 327) consists of open questions derived from three sources: simulated datasets (37.31%), literature-derived datasets (46.79%), and tool tutorial datasets (15.90%). BioMed-AQA-MCQ (n = 172), a multiple-choice subset of BioMed-AQA designed to enable automated and objective evaluation, comprises single-choice questions (73.26%) and multi-choice questions (26.74%). Each task from the O, P, and S categories in BioMed-AQA was paired with one corresponding multiple-choice question; M and V tasks were excluded, as they are less suitable for the multiple-choice format.
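The AUC in panel b summarizes how well the autoscoring agent's scores rank successful tasks above failed ones against the manual labels. A minimal, dependency-free sketch of the rank-based (Mann-Whitney) AUC computation is below; the score and label values are hypothetical placeholders, not the paper's data:

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs
    where the positive example scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: autoscoring agent scores vs. manual pass/fail labels
auto_scores = [0.9, 0.7, 0.6, 0.2]   # agent's confidence per task
manual = [1, 1, 0, 0]                # 1 = manual "success", 0 = "failed"
print(roc_auc(auto_scores, manual))  # 1.0 (perfect ranking)
```

An AUC of 0.926, as reported, means the autoscoring agent ranks a randomly chosen manually-passed task above a randomly chosen manually-failed one about 92.6% of the time.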