Extended Data Table 4 Comparison of the average linguistic evaluation scores for the “Findings to Impression” task on the 2 databases (private database of reports and MIMIC), as rated by 2 independent radiologists for each database

From: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

  1. Number of tokens per summary produced by the model, and ratio of the number of tokens in the prompted Findings section to the number of tokens in the Impression output.