Fig. 3: Computational filtering and selection of molecules.

A To test the extrapolation performance of the machine learning models, the reaction data set was split using four different strategies: (i) prediction of novel combinations of known N-arenes and acids (zero-dimensional [0D] split, shown in light blue); (ii) extrapolation to novel N-arenes (one-dimensional [1D] split for N-arenes [1DN], shown in pale purple); (iii) extrapolation to novel acids (1D split for acids [1DA], shown in beige); and (iv) extrapolation to both novel N-arenes and novel acids (2D split, shown in light brown). B Visualization of machine learning results for the four data set splitting strategies, showing reaction yield prediction (left and center) and binary reaction outcomes (right). Error bars indicate the standard deviation of a four-fold cross-validation; individual data points are shown. C Predicted potency for generated molecules with a predicted pIC50 ≥ 6. D Predicted reaction outcomes: molecules with a predicted reaction yield ≥ 5% are classified as positive, all others as negative. E Visualization of the subset of molecules with a predicted positive reaction outcome and a predicted pIC50 ≥ 8. F Template docking, demonstrated with an example template and four products; the coordinates of the template were kept fixed. Key amino acids are shown, i.e., Met123, Ala51, Arg57, and Tyr194. The template compound, shown in light green, indicates the initial position and placement of the docked ligands; the four products are shown in their docked conformations in light brown. G Predicted absorption, distribution, metabolism and excretion (ADME) properties of the final subset of 212 molecules. The physicochemical and ADME properties considered are (from left to right): LogD, kinetic solubility assay (LYSA) solubility in μg/mL, P-glycoprotein (P-gp) apparent permeability, and parallel artificial membrane permeability assay (PAMPA) permeability in 10⁻⁶ cm/s. The 14 molecules selected for synthesis are indicated by dashed lines.
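The four splitting strategies in panel A can be illustrated with a minimal Python sketch. It is not the authors' code: the `Reaction` record, the function names, and the identifiers `A1`/`C1` are hypothetical. It assumes only that each reaction pairs one N-arene with one acid, and it labels a test reaction by which of its two components are absent from the training set.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    arene: str        # hypothetical identifier of the N-arene
    acid: str         # hypothetical identifier of the acid
    yield_pct: float  # measured reaction yield in percent


def extrapolation_type(rxn, train_arenes, train_acids):
    """Label a test reaction by which components are unseen in training."""
    new_arene = rxn.arene not in train_arenes
    new_acid = rxn.acid not in train_acids
    if new_arene and new_acid:
        return "2D"   # both the N-arene and the acid are novel
    if new_arene:
        return "1DN"  # only the N-arene is novel
    if new_acid:
        return "1DA"  # only the acid is novel
    return "0D"       # known components, novel combination


def split_by_holdout(reactions, held_arenes=frozenset(),
                     held_acids=frozenset(), held_pairs=frozenset()):
    """Hold out components and/or specific pairs, then group the test set."""
    train, held = [], []
    for r in reactions:
        is_held = (r.arene in held_arenes or r.acid in held_acids
                   or (r.arene, r.acid) in held_pairs)
        (held if is_held else train).append(r)
    train_arenes = {r.arene for r in train}
    train_acids = {r.acid for r in train}
    test = {"0D": [], "1DN": [], "1DA": [], "2D": []}
    for r in held:
        test[extrapolation_type(r, train_arenes, train_acids)].append(r)
    return train, test


# Toy data set (made-up yields, for illustration only).
reactions = [
    Reaction("A1", "C1", 40.0), Reaction("A1", "C2", 10.0),
    Reaction("A2", "C1", 0.0),  Reaction("A2", "C2", 25.0),
    Reaction("A3", "C1", 5.0),  Reaction("A1", "C3", 60.0),
    Reaction("A3", "C3", 15.0),
]
train, test = split_by_holdout(
    reactions, held_arenes={"A3"}, held_acids={"C3"},
    held_pairs={("A2", "C2")},
)
for split_name, rxns in test.items():
    print(split_name, [(r.arene, r.acid) for r in rxns])
```

In this toy example, holding out one pair, one N-arene, and one acid produces one test reaction of each type. A cross-validated version would repeat the hold-out over several folds, as in the four-fold scheme of panel B.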
pIC50 = negative decadic logarithm of the half-maximal inhibitory concentration (IC50) in mol/L.
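As a worked illustration of this definition and of the thresholds in panels D and E, the following Python sketch converts between IC50 and pIC50 and applies the two cut-offs (predicted yield ≥ 5%, predicted pIC50 ≥ 8). The function names, the dictionary keys, and the example molecules are hypothetical and are not taken from the study.

```python
import math

YIELD_THRESHOLD_PCT = 5.0  # panel D: yield >= 5% counts as a positive outcome
PIC50_THRESHOLD = 8.0      # panel E: potency cut-off


def pic50_from_ic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 given in mol/L."""
    return -math.log10(ic50_molar)


def ic50_from_pic50(pic50):
    """Inverse conversion; returns IC50 in mol/L."""
    return 10.0 ** (-pic50)


def is_positive_reaction(pred_yield_pct):
    """Binary reaction outcome derived from the predicted yield."""
    return pred_yield_pct >= YIELD_THRESHOLD_PCT


def select_candidates(molecules):
    """Keep molecules passing both the yield and the potency threshold."""
    return [
        m for m in molecules
        if is_positive_reaction(m["pred_yield_pct"])
        and m["pred_pic50"] >= PIC50_THRESHOLD
    ]


# Made-up predictions, for illustration only.
molecules = [
    {"name": "mol_a", "pred_yield_pct": 12.0, "pred_pic50": 8.4},
    {"name": "mol_b", "pred_yield_pct": 3.0,  "pred_pic50": 9.1},  # yield too low
    {"name": "mol_c", "pred_yield_pct": 30.0, "pred_pic50": 6.5},  # potency too low
]
selected = select_candidates(molecules)
print([m["name"] for m in selected])
print(pic50_from_ic50(10e-9))  # IC50 of 10 nM, expressed in mol/L
```

An IC50 of 10 nM (10⁻⁸ mol/L) corresponds to a pIC50 of 8, so the panel E cut-off selects molecules predicted to be at least that potent.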