Fig. 2: Bioactivity prediction.
From: Leveraging molecular structure and bioactivity with chemical language models for de novo drug design

a A CLM for molecule generation iteratively predicts the next character of a SMILES string given the preceding characters (“autoregressive” approach). b An E-CLM (a CLM pretrained with the ELECTRA method) is trained on corrupted SMILES strings to predict, for each character, whether it is the original (correct) character or a corrupted (substituted) one. c Activity distribution of the PI3Kγ ligands. Compounds with an annotated pIC50 ≤ 4.0 were considered “inactive”, and a pIC50 value of 6.5 was used to separate the “moderately active” from the “highly active” compounds. d The molecular structures of the fine-tuning set (in the form of SMILES strings) were used to focus the CLM (pretrained on the US patent database) on the chemical space of the target of interest (PI3Kγ). To account for the uncertainty in the predictions, we employed an ensemble of 100 models to rank the generated molecules by the number of “votes”.
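
The activity thresholds in panel c and the vote-based ranking in panel d can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `activity_class` and `rank_by_votes` are hypothetical, each ensemble member is assumed to be a callable that returns `True` when it predicts a molecule to be highly active (the caption does not define what a "vote" is), and assigning the boundary value pIC50 = 6.5 to the "highly active" class is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

INACTIVE_MAX = 4.0     # from the caption: pIC50 <= 4.0 is "inactive"
HIGH_ACTIVE_MIN = 6.5  # from the caption: separates "moderately" from "highly" active


def activity_class(pic50: float) -> str:
    """Map a pIC50 value to one of the three activity classes of panel c."""
    if pic50 <= INACTIVE_MAX:
        return "inactive"
    # Assumption: the boundary value 6.5 itself counts as "highly active".
    if pic50 >= HIGH_ACTIVE_MIN:
        return "highly active"
    return "moderately active"


def rank_by_votes(
    smiles_list: Sequence[str],
    models: Sequence[Callable[[str], bool]],
) -> List[Tuple[str, int]]:
    """Rank molecules by how many ensemble members vote for them.

    Assumption: a "vote" is one model predicting the molecule to be
    highly active. Ties keep their input order (sorted() is stable).
    """
    votes = {smi: sum(1 for model in models if model(smi)) for smi in smiles_list}
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for trained classifiers; a real ensemble would hold 100 models.
    toy_models = [lambda s: "N" in s, lambda s: len(s) > 3, lambda s: True]
    for smi, n_votes in rank_by_votes(["CCO", "CCNCC", "c1ccccc1"], toy_models):
        print(f"{smi}\t{n_votes} votes")
```

Counting votes rather than averaging predicted probabilities keeps the ranking insensitive to how well each individual model is calibrated, which is one plausible reason to prefer it when the aim is to account for prediction uncertainty.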