Extended Data Table 7: Fleiss' kappa across the three runs of each model, used to assess test-retest repeatability

From: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning
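
For readers unfamiliar with the statistic named in the caption, the following is a minimal sketch of how Fleiss' kappa can be computed when each question is answered in three repeated runs, treating each run as one "rater". It is not the authors' code. The function name `fleiss_kappa`, the answer labels, and the toy ratings are hypothetical and are not data from the table.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of ratings per item.

    ratings    -- list of items; each item is a list with one label per run
    categories -- every label that may occur
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][c]: how many runs assigned label c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the fraction of agreeing run pairs
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the pooled proportion of each label
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cat)

    if p_e == 1.0:
        # Only one label was ever used, so kappa is undefined (0/0)
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 multiple-choice questions, 3 runs of one model
toy_runs = [
    ["A", "A", "A"],
    ["B", "B", "C"],
    ["C", "C", "C"],
    ["D", "A", "D"],
]
print(round(fleiss_kappa(toy_runs, ["A", "B", "C", "D"]), 3))  # 7/13, about 0.538
```

A value of 1 means the three runs always gave the same answer, while values near 0 mean agreement no better than expected by chance given how often each answer was chosen.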