Figure 3
From: Artificial intelligence in risk prediction and diagnosis of vertebral fractures

a. This Sankey diagram shows how 162 artificial intelligence (AI) models are categorized by type and by performance level, based on their AUROC scores. Of the total, 136 are diagnostic AI models and 26 are predictive AI models. The flows are color-coded by performance level and lead from the model types into four performance categories defined by AUROC range: suboptimal performance (AUROC 0.5–0.7) with 9 models, acceptable performance (AUROC 0.7–0.8) with 25 models, excellent performance (AUROC 0.8–0.9) with 20 models, and outstanding performance (AUROC 0.9+) with 49 models. A further 59 models have no reported AUROC value. Diagnostic AI models are further broken down into osteoporotic vertebral fractures (OVFs) with 11 models, vertebral compression fractures (VCFs) with 39 models, and non-pathological vertebral fractures (non-pathological VFs) with 30 models. Each fracture subgroup feeds into the performance levels, showing the distribution of model performance by diagnostic category.
b. This radar chart compares the mean performance metrics of different groups of AI models. It has five axes, one per metric: AUROC, accuracy, precision, sensitivity, and specificity, each ranging from 0 to 1. Four groups of models are compared: predictive models, non-pathological vertebral fracture (VF) diagnostic models, osteoporotic vertebral fracture (OVF) diagnostic models, and vertebral compression fracture (VCF) diagnostic models. Each group is represented by a differently colored line that traces its mean score on each metric, and the resulting shapes allow an at-a-glance comparison across metrics. The closer the edge of a shape is to the outer perimeter of the chart, the higher the group's mean score for that metric. The chart thus allows a direct comparison of the model groups, indicating the metrics on which a group performs well and those where there may be room for improvement.
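The AUROC ranges used in panel a can be expressed as a simple binning rule. The following is a minimal Python sketch, not the authors' code: the function name `auroc_category` is hypothetical, and because the caption does not say which category a boundary value such as 0.7 belongs to, the sketch assumes boundary values fall into the higher category. The counts are copied from the caption.

```python
from typing import Optional


def auroc_category(auroc: Optional[float]) -> str:
    """Map an AUROC value to the performance category named in the caption.

    Assumption: boundary values (0.7, 0.8, 0.9) go to the higher category;
    the caption does not specify this.
    """
    if auroc is None:
        return "no AUROC reported"
    if not 0.0 <= auroc <= 1.0:
        raise ValueError("AUROC must lie between 0 and 1")
    if auroc >= 0.9:
        return "outstanding"
    if auroc >= 0.8:
        return "excellent"
    if auroc >= 0.7:
        return "acceptable"
    if auroc >= 0.5:
        return "suboptimal"
    # Values below 0.5 are not covered by the caption's categories.
    return "below 0.5 (not defined in the caption)"


# Model counts per category as stated in the caption.
caption_counts = {
    "suboptimal": 9,
    "acceptable": 25,
    "excellent": 20,
    "outstanding": 49,
    "no AUROC reported": 59,
}

# The five categories account for all 162 models.
total_models = sum(caption_counts.values())
```

Summing the counts gives 162, which matches the total number of models stated in the caption.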