Fig. 8: Performance of the models used in this study in nested cross-validation and external validation.

Evaluation metrics, namely (i) sensitivity, (ii) specificity, (iii) balanced accuracy and (iv) F1 score, for five models from (a) nested CV (median of repeated nested cross-validations) on the training data (n = 382 compounds) and (b) the external test set (n = 244 compounds). The early-stage fusion and late-stage fusion models, which combine all three feature sets (Cell Painting, Gene Expression and Morgan fingerprints), achieve a higher F1 score for compounds exhibiting mitochondrial toxicity and extrapolate well into new structural space in the external test set, compared with models using Morgan fingerprints, where the F1 score falls by 60% (0.25–0.40 in absolute terms).
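
For reference, the four metrics named in this caption can be computed from a binary confusion matrix roughly as follows. This is a minimal sketch under stated assumptions, not the study's evaluation code: the function name `binary_metrics` is hypothetical, and the label 1 is assumed to denote a mitochondrially toxic compound and 0 a non-toxic one.

```python
def binary_metrics(y_true, y_pred):
    """Return sensitivity, specificity, balanced accuracy and F1 score.

    Assumes binary labels where 1 = toxic (positive class) and 0 = non-toxic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, predicted toxic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-toxic, predicted non-toxic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-toxic, predicted toxic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, predicted non-toxic

    # Guard against empty classes so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    print(binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Balanced accuracy averages sensitivity and specificity, so it is less affected by class imbalance than plain accuracy, while the F1 score here is computed for the positive (toxic) class only.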