Fig. 3: Diagnostic performance and time-cost analysis of the LLMs and the physicians.

a Bar chart ranking the coverage rates of the LLMs and the physicians. b Bar chart ranking the accuracies of the LLMs and the physicians. c, d The left heatmap shows the coverage rates and the right heatmap shows the accuracies of each LLM compared with physicians using or not using auxiliary methods, together with the corresponding statistical significance. e, f The left heatmap shows the coverage rates and the right heatmap shows the accuracies of each LLM compared with physicians, together with the corresponding statistical significance. g, h The left heatmap shows the coverage rates and the right heatmap shows the accuracies of each LLM compared with physicians in the GI case subgroup, together with the corresponding statistical significance. In this subgroup, Claude 3.5 Sonnet had the highest coverage rates, and Physician 4 significantly outperformed every LLM except Claude 3.5 Sonnet in diagnostic accuracy. i, j The left heatmap shows the coverage rates and the right heatmap shows the accuracies of each LLM compared with physicians in the non-GI case subgroup, together with the corresponding statistical significance. k Bar chart comparing the average time taken to answer a single question by the seven LLMs and by the 22 physicians. It also compares the fee charged by each LLM for answering a single question with the cost range of a single outpatient visit to one of the 22 experienced physicians (excluding examination and treatment fees).