Extended Data Table 8: Inter-model Cohen's kappa, based on the first run of each model, for USMLE, RECIST, Medicilline questions, and NEJM diagnostic cases

From: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning
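
For readers unfamiliar with the statistic named in the caption, the following is a minimal sketch of how Cohen's kappa between two models' answers could be computed. It is not the authors' code. The function name `cohen_kappa` and the example answer lists are hypothetical and chosen only for illustration. Kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each model's label frequencies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not taken from the source article.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both models give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each model's marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Kappa is undefined when chance agreement is already perfect
    # (both models always give the same single label).
    if expected == 1.0:
        return float("nan")

    return (observed - expected) / (1.0 - expected)


# Made-up answers from two hypothetical models on six multiple-choice questions.
model_a = ["A", "B", "C", "A", "D", "B"]
model_b = ["A", "B", "D", "A", "D", "C"]
kappa = cohen_kappa(model_a, model_b)
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance. In a table like this one, each cell would presumably hold the kappa for one pair of models on one task.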