Fig. 1: Overview of the similarity computation between two protein–ligand complexes.
From: Resolving data bias improves generalization in binding affinity prediction

a, Our structure-based dataset filtering algorithm evaluates the structural similarity of protein–ligand complexes using a combination of TM scores, Tanimoto scores and pocket-aligned ligand r.m.s.d. The Tanimoto scores identify chemically similar ligands and range from 0 (no similarity) to 1 (identical). TM scores are computed with TM-align, a tool that compares protein structures by finding the optimal alignment of their three-dimensional structures and outputs a score ranging from 0 (no similarity) to 1 (identical). This score identifies proteins with high structural similarity, even when sequence identity is low (for example, when one protein is a substructure of the other). Pocket-aligned ligand r.m.s.d. scores compare the positioning of ligands within aligned protein pockets. Ligands are transformed into the same coordinate frame using the optimal alignment from TM-align, and an r.m.s.d. calculation provides a quantitative measure of positional similarity. b, Decision tree showing the exclusion criteria and the information flow of the filtering algorithm when comparing a training and a test complex. The first layer of the algorithm compares the affinity labels of the complexes (pK values; see Methods). Training complexes with high structural similarity but different activity are not excluded, to avoid excluding data points that can provide valuable insights into activity cliffs. The second layer excludes all training complexes with similar affinity and an identical ligand (Tanimoto > 0.9) to make the test complex ligands unique and avoid successful predictions through ligand memorization. The third and fourth layers exclude training complexes based on protein similarity and a combined assessment of ligand and binding conformation similarity. This four-layer approach identifies complexes with similar interaction patterns, even when traditional sequence-based methods would overlook these similarities.
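The three scores in panel a and the layered decision in panel b can be illustrated with a minimal Python sketch. It is not the authors' implementation. Only the Tanimoto > 0.9 cut-off comes from the caption. The function names, the pK tolerance, the TM-score cut-off and the way Tanimoto and r.m.s.d. are combined in the fourth layer are all hypothetical placeholders. The sketch also assumes that fingerprints are given as sets of on-bit indices, that the TM score and the alignment (rotation matrix and translation vector) have already been obtained from TM-align, and that the two ligands have matched atoms in the same order.

```python
import math
from dataclasses import dataclass


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen for this sketch: two empty fingerprints score 0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def apply_transform(coords, rotation, translation):
    """Move ligand atoms into the other complex's frame: x' = R x + t."""
    return [
        tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
              for i in range(3))
        for p in coords
    ]


def rmsd(coords_a, coords_b) -> float:
    """r.m.s.d. over paired atoms (assumes equal length and matching atom order)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


@dataclass(frozen=True)
class Thresholds:
    max_pk_diff: float = 1.0       # hypothetical: what counts as "similar affinity"
    identical_ligand: float = 0.9  # from the caption: Tanimoto > 0.9
    min_tm_score: float = 0.8      # hypothetical protein-similarity cut-off
    min_combined: float = 0.8      # hypothetical cut-off for the combined score


def exclude_training_complex(pk_train, pk_test, tm_score, tanimoto_score,
                             ligand_rmsd, t=Thresholds()) -> bool:
    """Return True if the training complex should be removed for this test complex."""
    # Layer 1: complexes with different activity are kept (possible activity cliffs).
    if abs(pk_train - pk_test) > t.max_pk_diff:
        return False
    # Layer 2: similar affinity and an (almost) identical ligand -> exclude.
    if tanimoto_score > t.identical_ligand:
        return True
    # Layer 3: without high protein similarity the complex is kept.
    if tm_score <= t.min_tm_score:
        return False
    # Layer 4: hypothetical combination of ligand and binding-pose similarity.
    combined = tanimoto_score + (1.0 - ligand_rmsd)
    return combined > t.min_combined
```

In this sketch a training set would be filtered by calling `exclude_training_complex` for every training and test pair and dropping any training complex for which it returns `True` at least once. The ordering of the checks mirrors the information flow described for panel b, so the inexpensive affinity comparison runs before any structural score is consulted.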