Fig. 4: Comparison of diagnostic performance across different models.

a Bar charts of diagnostic accuracy calculated for each disease classification across different models from internal (upper, Datasets 1–5) and external (lower, Datasets 6–10) evaluations. The bar colors represent disease classifications. The line graphs below denote study centers, models used, and data providers. b Heatmaps of diagnostic performance metrics after internal (left) and external (right) evaluations of different models. For each heatmap, metrics from the text model and the text + smartphone model are normalized together by column, with values ranging from −2 (blue) to 2 (red). Disease types are classified into six categories and displayed in different colors. c Multivariate logistic regression analysis of diagnostic accuracy for all cases (left) and subgroup analysis for follow-up cases (right) during clinical evaluation. The first category in each factor is used as the reference, and ORs and 95% CIs for the other categories are calculated against these references. OR, odds ratio; CI, confidence interval; *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.
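For readers unfamiliar with how the ORs in panel c relate to a reference category, the following minimal sketch shows one common way to turn a logistic regression coefficient into an OR with a 95% CI. It assumes a Wald-type interval, which the caption does not specify; the function name `odds_ratio_with_ci` and the input values are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.959964):
    """Convert a logistic-regression coefficient into an OR with a Wald CI.

    beta: coefficient of one dummy-coded category, estimated relative to
          the factor's reference category (whose OR is 1 by definition).
    se:   standard error of that coefficient.
    z:    normal quantile for the desired coverage (about 1.96 for 95%).

    Assumption: a Wald interval is used; the source does not state the
    interval method.
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper


# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, lower, upper = odds_ratio_with_ci(beta=math.log(2), se=0.25)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under this convention, a category whose CI excludes 1 differs from the reference category at the corresponding significance level.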