Extended Data Table 6: BERTScore (F1) between the three runs of each model, assessing test-retest repeatability

From: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning