Figure 2

The calibration profiles of the best-performing machine learning classifiers (a–d) fitted on the RadLex mappings and of the random forest meta/ensemble learner (e,f) fitted on the predicted probabilities of the ML algorithms as features, on all outer folds combined (N = 206). Probability estimates for each report were recorded for each ML classifier, i.e., the estimated probability that the predicted target label is “ASPECTS: yes”. The reliability of these predictions can be assessed visually on calibration plots. Calibration curves are created by grouping reports into discrete bins based on the probability estimates assigned by the ML model. Thus, the probability space [0, 1] is discretized into bins (i.e., 0–0.1, 0.1–0.2, …, 0.8–0.9, 0.9–1.0; grey grid). The points represent the mean predicted probability (x-axis) and the observed fraction (y-axis) of true (“yes”) labels for the subset of reports falling into the respective bin. For an ideally calibrated model, the mean predicted probability and the observed fraction are identical within each bin, so the calibration curve lies on the diagonal (grey line). Rug plots (blue lines, findings; red lines, impressions) indicate the axis values of the aggregated bin measures (thick lines) and the probability estimates of individual reports (thin lines). ELNET (a) was better calibrated on the impressions (red), particularly in the 0.50–0.75 range, in line with its top-three ranked accuracy. The linear-kernel SVM (b) showed well-calibrated estimates over the 0.50–1.0 probability range for both the findings (blue) and the impressions (red). XGBoost (c) presented an almost ideal calibration curve on the findings (blue) while being the most accurate ML classifier (Table 2). FastText (d) achieved the highest overall accuracy when trained on the impressions (red), with partly well-calibrated estimates (0.75–1.0), but was poorly calibrated on the findings (blue). The RF meta/ensemble learner (e) showed a reasonably well-calibrated profile when trained on the probability outputs of all ML algorithms (16 ML models, covering both findings and impressions; see Table 3). The histogram inset displays the bimodal distribution of its probability estimates. It showed similar calibration profiles (f) when trained only on the estimates of the eight ML models for either the findings (blue) or the impressions (red), respectively.
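For readers who wish to reproduce this style of plot, the binning procedure described above corresponds to a standard calibration-curve computation. The following is a minimal sketch, assuming Python with scikit-learn (the paper does not specify its tooling); the variable names `y_true` and `y_prob` are hypothetical placeholders for the binary labels (“ASPECTS: yes” = 1) and a classifier's probability estimates over the 206 outer-fold reports, and the data here are synthetic.

```python
# Minimal sketch (assumed tooling: Python + scikit-learn) of the calibration
# curve construction described in the Figure 2 caption. `y_true` and `y_prob`
# are hypothetical stand-ins for the real labels and model outputs.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.0, 1.0, size=206)                 # predicted P("ASPECTS: yes")
y_true = (rng.uniform(size=206) < y_prob).astype(int)    # synthetic, roughly calibrated labels

# Discretize [0, 1] into ten equal-width bins (0-0.1, ..., 0.9-1.0) and, per
# bin, pair the observed fraction of positive labels (y-axis) with the mean
# predicted probability (x-axis); points on the diagonal indicate ideal
# calibration.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="uniform")

for x, y in zip(mean_pred, frac_pos):
    print(f"mean predicted: {x:.2f}  observed fraction positive: {y:.2f}")
```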