Fig. 1: Performance comparison of GPT-3.5 vs GPT-4 vs Google.
From: Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

a Performance of GPT-3.5 vs GPT-4 vs Google for diagnosis. b Performance of GPT-3.5 vs GPT-4 for examination (exact adjusted p-value: p = 3.2241 × 10⁻⁶). c Performance of GPT-3.5 vs GPT-4 for treatment. Bubble plots show the pairwise comparison of two approaches. Cumulative frequency plots show the cumulative number of cases (Y-axis) against their accuracy scores (X-axis) for each disease frequency subgroup (light blue: rare, intermediate blue: less frequent, dark blue: frequent). A one-sided Mann-Whitney test was applied for statistical testing (adjusted with Bonferroni correction for multiple testing, considering n = 12 tests for diagnosis, n = 7 tests for examination and treatment).
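The Bonferroni adjustment mentioned in the caption can be illustrated with a minimal Python sketch. It assumes the common form of the correction, in which each raw p-value is multiplied by the number of tests and capped at 1; the caption does not spell out the exact procedure. The function name `bonferroni_adjust` and the raw p-values are hypothetical and are not taken from the paper. In practice the raw one-sided p-values might come from a call such as `scipy.stats.mannwhitneyu(x, y, alternative="greater")`.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Bonferroni correction: multiply each raw p-value by the number
    of tests and cap the result at 1.0.

    n_tests defaults to the number of p-values supplied; pass it
    explicitly when the family of tests is larger than the list.
    """
    if n_tests is None:
        n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw one-sided p-values (illustrative only, not from the paper).
raw_examination = [1e-4, 0.03, 0.2]

# The caption reports n = 7 tests for examination.
adjusted = bonferroni_adjust(raw_examination, n_tests=7)
```

With n = 7, a raw p-value must fall below roughly 0.05 / 7 ≈ 0.0071 for its adjusted value to stay under 0.05.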