Extended Data Table 5 RECIST 1.1 performance of the 3 LLMs per response category, based on the consensus among the 3 human experts (average of the 3 runs)

From: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

  1. CR = Complete Response; PR = Partial Response; SD = Stable Disease; PD = Progressive Disease.