Fig. 2: Benchmark and example performance of the EvoMDT system on oncology-related reasoning tasks.
From: EvoMDT: a self-evolving multi-agent system for structured clinical decision-making in multi-cancer

a Comparison of model performance across six medical question–answering benchmarks: MedQA, MedQA-Cancer, MedMCQA-Cancer, MedQuAD, MedQARo, and ROND. The radar chart shows that EvoMDT consistently outperforms the baseline large language models (Llama3-70B, Claude-3, and Med-PaLM 2) on both accuracy-based and semantic-similarity metrics.

b Quantitative performance comparison on standardized medical benchmarks (left) and custom oncology datasets (right). EvoMDT achieved the highest accuracy (0.86–0.89) on the multiple-choice datasets and superior BERTScores (0.63–0.68) on the generative question–answering tasks, indicating closer semantic agreement with expert responses.

c Representative multiple-choice example from the MedQA dataset. The model correctly identifies cyclophosphamide (option d) as the causative agent of hemorrhagic cystitis in a patient with non-Hodgkin lymphoma following chemotherapy, a pharmacologically sound and clinically consistent answer.
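
The two metric families named in panel b can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names and toy vectors below are hypothetical, and the similarity function only mimics the greedy token-matching idea behind BERTScore. The real metric uses contextual embeddings from a pretrained language model and may add IDF weighting or baseline rescaling, none of which the caption specifies.

```python
import math


def accuracy(predicted, gold):
    """Fraction of multiple-choice items whose predicted option equals the answer key."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(candidate, reference):
    """BERTScore-style F1 over token embeddings (each argument is a list of vectors).

    Assumption: embeddings are supplied by the caller; a real BERTScore
    implementation would compute them with a pretrained model.
    """
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(max(_cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(max(_cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four answered items and two-dimensional "embeddings".
toy_accuracy = accuracy(["d", "a", "c", "b"], ["d", "a", "b", "b"])
toy_f1 = greedy_match_f1([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy case the candidate covers only one of the two reference tokens, so precision is perfect while recall is halved; the F1 score reflects both, which is why a generative answer can be fluent yet still score below an expert response.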