Fig. 2: Evaluation of different ranking methods of OSMES with known substrates of PLP-enzymes.

a Representation of the 6 ranking methods related to the best cluster (BC; red tones) and the largest cluster (LC; yellow tones). The bar plot represents the 200 conformations of a single docking run clustered with a 3 Å RMSD threshold; the LCC and BCC methods consider the number of conformations in the respective cluster. The atoms of the substrate considered in the energy-based ranking methods (BCE, LCE, BCaaE, LCaaE) are highlighted in the insets. b Scheme of the side view of the PLP pyridine ring and the three Cα bonds, with their respective angles (χ) relative to the PLP ring plane. c Catalytically favorable conformations (CFC) in the three different PLP-dependent reactions. A conformation from the docking analysis is considered a CFC if the distance (d) between the Nε of the catalytic lysine and the imine carbon is ≤5 Å in the catalytic cluster, and the bond cleaved in the expected reaction (superior circumradius) is nearly orthogonal to the PLP ring plane, that is, its angle χ has the maximum relative value (see Methods). d Bar plot highlighting in blue the number of CFC in the different clusters. The black arrow indicates the catalytic cluster (CC), which does not always coincide with BC (red) or LC (yellow). e Letter-value plot showing the distribution of the validation set (n = 42), colored according to the 7 ranking methods. BC-related methods are colored in red tones; LC-related methods are colored in yellow tones; CC-CFC is colored in blue. Individual dots, representing the ranking positions of positive controls (i.e., enzymes known to act on the substrate), are colored according to substrate (legend); the black dashed line delimits the top 10 positions. The band indicates the median, and the main box indicates the first and third quartiles, with every further minor box splitting the remaining data into two halves. f Receiver operating characteristic (ROC) curves for the different ranking methods, colored as in panel e; the dotted diagonal corresponds to an area under the ROC curve (AUROC) of 0.5.
Source data are provided as a Source Data file.
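
The CFC criterion in panel c can be illustrated with a minimal Python sketch. It is not the authors' implementation (the exact definition is given in Methods). The function names, the input layout, and the convention that χ lies between 0° and 90° and is measured against a least-squares plane through the ring atoms are all assumptions made for illustration.

```python
import numpy as np

def plane_normal(ring_coords):
    """Unit normal of the least-squares plane through the ring atoms."""
    pts = np.asarray(ring_coords, dtype=float)
    centered = pts - pts.mean(axis=0)
    # The right singular vector with the smallest singular value
    # is perpendicular to the best-fit plane.
    _, _, vt = np.linalg.svd(centered)
    return vt[-1]

def bond_plane_angle(c_alpha, substituent, normal):
    """Angle chi in degrees (0 to 90, assumed convention) between a bond and the plane."""
    bond = np.asarray(substituent, dtype=float) - np.asarray(c_alpha, dtype=float)
    bond = bond / np.linalg.norm(bond)
    sin_chi = abs(float(np.dot(bond, normal)))
    return float(np.degrees(np.arcsin(min(sin_chi, 1.0))))

def is_cfc(lys_nz, imine_c, c_alpha, substituents, ring_coords,
           cleaved, max_dist=5.0):
    """Hypothetical CFC test for one docked pose.

    substituents maps a label to the coordinates of each of the three
    atoms bonded to C-alpha; cleaved is the label of the bond expected
    to break in the reaction of interest.
    """
    d = np.linalg.norm(np.asarray(lys_nz, dtype=float)
                       - np.asarray(imine_c, dtype=float))
    normal = plane_normal(ring_coords)
    chi = {name: bond_plane_angle(c_alpha, xyz, normal)
           for name, xyz in substituents.items()}
    # The cleaved bond must have the largest chi of the three C-alpha bonds.
    most_orthogonal = max(chi, key=chi.get)
    return bool(d <= max_dist) and most_orthogonal == cleaved
```

Applied to every pose of a docking run, a function of this kind would give the per-cluster CFC counts of the sort shown in blue in panel d.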