Table 2 Correlation of reliability metrics to model performance

	Rank-based correlation to bin order
Binning metric	Balanced accuracy	Hit rate	Precision
Scaffold similarity	0.42 ± 0.06	0.51 ± 0.04	0.50 ± 0.06
Molecular core overlap	0.28 ± 0.07	0.22 ± 0.09	0.25 ± 0.07
Pharmacophore similarity	0.19 ± 0.07	0.37 ± 0.09	0.43 ± 0.08
Embedding distance	0.36 ± 0.06	0.24 ± 0.09	0.29 ± 0.08
Uncertainty	0.51 ± 0.08	0.62 ± 0.06	0.72 ± 0.04
Unfamiliarity	0.58 ± 0.04	0.52 ± 0.07	0.52 ± 0.05

Correlation (Kendall’s τ) between bin-wise performance metrics and the bin order. For each dataset, molecules are grouped into eight bins by one of six reliability metrics: mean scaffold similarity to the training set (Tanimoto on ECFPs), mean molecular core overlap with the training set (MCS fraction), mean pharmacophore similarity to the training set (cosine distance computed on CATS descriptors), Mahalanobis distance of embeddings (z vectors) to the training set, prediction uncertainty, and unfamiliarity. For every metric, bins are ordered from low to high confidence, so a correlation of 1.0 indicates perfect model calibration. Values are the mean ± standard error of the mean across all datasets. Highest correlations are reported in bold.
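
The calculation behind each table entry can be sketched as follows: rank-correlate the bin order with the bin-wise performance for each dataset, then summarize across datasets as mean ± standard error of the mean. This is a minimal illustration, not the authors' code. It assumes the tau-b variant of Kendall's τ (the caption does not say which variant was used), and the function names and the per-bin values in `example_datasets` are hypothetical placeholders, not data from the study.

```python
import math
import statistics
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (tie-corrected)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical bin-wise performance (e.g. precision) for three datasets,
# eight bins each, ordered from low to high confidence.
example_datasets = [
    [0.55, 0.60, 0.58, 0.66, 0.70, 0.69, 0.75, 0.80],
    [0.40, 0.52, 0.50, 0.49, 0.61, 0.65, 0.63, 0.70],
    [0.62, 0.60, 0.66, 0.71, 0.69, 0.74, 0.78, 0.77],
]

bin_order = list(range(8))  # 0 = lowest confidence, 7 = highest
taus = [kendall_tau_b(bin_order, perf) for perf in example_datasets]
mean_tau, sem_tau = mean_and_sem(taus)
print(f"Kendall's tau: {mean_tau:.2f} ± {sem_tau:.2f}")
```

A τ of 1.0 for a dataset means performance increases strictly from the lowest- to the highest-confidence bin; a single out-of-order pair of bins lowers τ by 2/28 when there are eight bins and no ties.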
