Fig. 2: Performance on the ASCO-SNO-ASTRO guideline.

Comparative performance of medical experts (blue bars: Nrs_a1, Nrs_a2, Nrs_r1, Nrs_r2, Rad_a1, Rad_a2, Rad_r1, Rad_r2) and large language models (orange bars: GPT-4o, Gemini, Copilot, DeepSeek) in assessing the ASCO-SNO-ASTRO clinical practice guideline. 2A: Accuracy for Strength of Recommendation. 2B: Cohen’s kappa for Strength of Recommendation. 2C: Accuracy for Quality of Evidence. 2D: Cohen’s kappa for Quality of Evidence. Participant identifiers: Nrs = neurosurgery, Rad = radiation oncology; a = attending, r = resident. Higher kappa values indicate greater chance-corrected agreement between a participant's responses and the reference standard.
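
For readers unfamiliar with the two metrics plotted here, the following is a minimal sketch of how accuracy and Cohen's kappa can be computed for one participant against a reference standard. It uses the standard unweighted definition, kappa = (p_o - p_e) / (1 - p_e), and does not claim to reproduce the authors' analysis; the function names and the example labels ("Strong", "Weak") are hypothetical and are not data from the study.

```python
from collections import Counter


def accuracy(rater, reference):
    """Fraction of items where the rater's label matches the reference."""
    matches = sum(r == g for r, g in zip(rater, reference))
    return matches / len(reference)


def cohens_kappa(rater, reference):
    """Unweighted Cohen's kappa between a rater and the reference standard."""
    n = len(reference)
    p_o = accuracy(rater, reference)  # observed agreement
    rater_counts = Counter(rater)
    ref_counts = Counter(reference)
    categories = set(rater_counts) | set(ref_counts)
    # Agreement expected by chance, from each side's marginal label frequencies.
    p_e = sum(rater_counts[c] * ref_counts[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for illustration only (not taken from the study).
reference = ["Strong", "Strong", "Weak", "Weak"]
rater = ["Strong", "Weak", "Weak", "Weak"]

print(accuracy(rater, reference))      # 0.75
print(cohens_kappa(rater, reference))  # 0.5
```

In this toy case the rater matches the reference on three of four items, but half of that agreement would be expected by chance, so kappa (0.5) is lower than accuracy (0.75). This is why the figure reports both metrics.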