Fig. 1: Annotation bias in BindingDB training data and DeepPurpose predictions. | Nature Communications


From: Improving the generalizability of protein-ligand binding predictions with AI-Bind


a Distributions of the number of annotations in the benchmark BindingDB data, shown on double logarithmic axes (log-log plot), indicate that P(kp) and P(kl) are well approximated by a power law for both proteins (pink) and ligands (green), with approximate degree exponents γp = 2.84 and γl = 2.94, respectively. b The average Kd over the links at each protein degree value kp, 〈Kd〉, is negatively correlated with kp, with rSpearman(kp, 〈Kd〉) = −0.47. For the ligands, we observe a similar anti-correlation, with rSpearman(kl, 〈Kd〉) = −0.29. c The distribution of degree ratios for the proteins {ρp} and the ligands {ρl} in the original DeepPurpose training dataset (for a selected fold from the 5-fold cross-validation). The degree ratio, defined in Equation (1), is the ratio of positive annotations to total annotations for a given node in the protein-ligand interaction network. After thresholding the Kd values associated with each link to create binary labels, the hubs on average receive more positive (binding) annotations, whereas the low-degree nodes receive both binding and non-binding annotations. Because the hubs are associated with many links in the network, learning the type of binding from the degree information lets ML models achieve good performance through shortcut learning. The Source Data File provided with the manuscript contains the number of samples per data point in the plots.
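The degree ratio described for panel c can be illustrated with a minimal sketch. It assumes that a link is labelled positive (binding) when its Kd falls below a threshold. The function name `degree_ratios`, the threshold value, and the toy links are hypothetical and chosen only for illustration; they are not taken from the article or from the DeepPurpose code.

```python
from collections import defaultdict


def degree_ratios(links, kd_threshold):
    """Return {node: (degree, degree_ratio)} for a list of (node, Kd) links.

    degree       = total number of annotations (links) of the node
    degree_ratio = positive annotations / total annotations
    A link counts as positive when Kd < kd_threshold (assumed labelling rule;
    a lower Kd means stronger binding).
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for node, kd in links:
        totals[node] += 1
        if kd < kd_threshold:
            positives[node] += 1
    return {node: (totals[node], positives[node] / totals[node]) for node in totals}


# Hypothetical toy data: (protein id, Kd) pairs, one pair per link.
toy_links = [
    ("P1", 5.0), ("P1", 12.0), ("P1", 8.0), ("P1", 200.0),  # hub-like node
    ("P2", 500.0), ("P2", 3.0),
    ("P3", 900.0),
]

# The threshold of 30.0 is an illustrative assumption, not the article's value.
ratios = degree_ratios(toy_links, kd_threshold=30.0)
for node, (degree, ratio) in sorted(ratios.items()):
    print(node, degree, ratio)
```

The same function applies to ligands by passing (ligand, Kd) pairs instead. A model that sees mostly high degree ratios for hubs can predict binding from degree information, which is the shortcut the caption describes.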
