Introduction

Machine learning (ML) methods are beginning to contribute to, and sometimes even disrupt, the traditional way of conducting drug discovery programs1,2,3,4,5,6,7. While the predominant focus remains on ML methods following the “property prediction” or “design first then predict” paradigms, there is growing attention towards generative models (GMs) operating under the inverse paradigm of “describe first then design”. Property prediction models are limited by the need to exhaustively evaluate molecular libraries, which hinders exploration of extensive and diverse chemical spaces8. In contrast, GMs can generate unseen molecules with preferred properties8,9,10,11 without the costly evaluation of limited libraries.

GMs aim to learn underlying patterns in datasets and use this knowledge to produce unseen data12. When trained on datasets of molecules and their properties, GMs can generate novel molecules with tailored characteristics, such as designated structural13,14,15,16,17,18,19 or physico-chemical properties14,15,16,18,20,21,22,23, or improved affinity to a target13,14,15,16,17,18,19,22,24,25,26,27,28,29,30,31,32,33,34,35. However, molecular GMs face several challenges: (1) Insufficient target engagement due to limited target-specific data29,36, affecting the accuracy of the coupled data-driven affinity predictors13,17,18,24. (2) Lack of synthetic accessibility (SA) in generated molecules. Previous GMs addressed this by confining the generation to the vicinity of the training dataset, selected for its good SA18,19,25,26,35, by using estimators via reinforcement learning (RL)14,16,20,37, or by employing sampling methods based on reactive building blocks17,38. However, these approaches limit the novelty of the generated molecules. (3) The applicability domain problem, which refers to the capacity of GMs to generalize to new data outside the training space. Despite progress, few studies have applied GMs to molecular discovery campaigns with experimental testing14,19,22,26,27,28,30,34,38.

Molecular GMs employ various architectures, each with distinct advantages and limitations. RL approaches22,26,36 enable goal-directed generation but often struggle with sparse rewards30, particularly in affinity optimization, where there is a strong imbalance towards inactive molecules. Generative Adversarial Networks, while capable of producing high yields of chemically valid molecules, often face issues like mode collapse or training instability39. Autoregressive transformer models32,40 leverage large pre‐trained chemical language models41,42 to capture long‐range dependencies, yet their sequential decoding makes training and sampling slower. Diffusion‐based GMs43,44,45 have shown exceptional sample diversity and high‐quality chemically rich outputs by iteratively denoising random noise into valid molecular graphs, though this requires considerable sampling steps and incurs significant computational overhead.

While no perfect framework exists for molecular GMs, one might be inclined to use a Variational Autoencoder (VAE). It consists of two networks: an encoder that maps the input molecular representation to a probability distribution in a lower-dimensional latent space, and a decoder that reconstructs the original molecular representation from this latent space46. The VAE’s continuous and structured latent space enables smooth and controlled interpolation of samples47, facilitating the generation of molecules with specific properties. In addition, VAEs offer a useful balance of rapid, parallelizable sampling, an interpretable latent space, and robust, scalable training that performs well even in low-data regimes. This combination makes VAEs a natural choice for integration with active learning (AL) cycles where speed, stability, and directed exploration are all critical.
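The encoder-to-latent step described above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the common VAE formulation but is not confirmed here as the authors' exact setup; the function names are hypothetical.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def interpolate(z_a, z_b, t):
    """Linear interpolation between two latent points, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Example: sample one latent vector from an assumed 2-dimensional posterior.
rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
```

Decoding points produced by interpolate would, under these assumptions, give the smooth transitions between molecules that the continuous latent space is meant to provide.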

AL in drug discovery is an iterative feedback process that prioritizes experimental or computational evaluation of molecules based on model-driven uncertainty or diversity criteria48,49, thereby maximizing information gain while minimizing resource use50. It has been shown to significantly improve the discovery of synergistic drug combinations, achieving 5–10× higher hit rates than random selection by iteratively refining predictive models with newly assayed data51. In affinity-driven campaigns, deep batch AL methods select compound batches predicted to be high-value binders, reducing the number of docking or ADMET assays needed to identify top candidates52. Human-in-the-loop implementations further integrate expert input with QSAR or QSPR models to guide de novo library optimization and property‐based filtering, accelerating hit finding in low-data regimes53.

In this study, we developed a molecular GM workflow featuring a VAE with two nested AL cycles to overcome GMs' limitations. Traditionally, AL in computational drug discovery involves training a predictive model on an initial set of molecules. The model is then iteratively retrained by selecting the most informative molecules from a fixed library based on a defined protocol, progressively improving accuracy while minimizing the number of assays. Our work extends this paradigm by embedding a generative VAE directly within the AL cycles: instead of selecting only from existing molecules, the VAE proposes entirely new molecules guided by chemoinformatics oracles (drug-likeness, synthetic-accessibility, and variability filters), which are then evaluated by molecular modeling (MM) physics-based oracles (docking scores). Evaluated molecules meeting predefined criteria are used to fine-tune the VAE, creating a self-improving cycle that simultaneously explores novel regions of chemical space while focusing on molecules with higher predicted affinity. This VAE-AL GM workflow aimed to optimize target engagement by iteratively guiding the generation with MM’s physics-based predictions. These predictions offer greater reliability than data-driven methods, especially in low-data regimes8. AL also aimed to enhance predictive metrics such as SA, promoting the generation of more synthesizable molecules. By promoting dissimilarity from the training data, AL aimed to improve generalization, enabling the generation of genuinely novel molecules. In summary, our VAE-AL GM workflow aims to generate drug-like, synthesizable molecules with high novelty, diversity, and excellent docking scores. Additionally, it refines virtual hits through Monte Carlo simulations with PELE54,55 to improve docking scores and poses, thereby enabling better candidate selection. Selected candidates are further validated through absolute binding free energy (ABFE) simulations and bioassays.

We tested our VAE-AL GM workflow on two targets with different amounts of training data. The first target, cyclin-dependent kinase 2 (CDK2), regulates cell progression and is a potential therapeutic target for certain tumors56,57. Despite over 10,000 disclosed CDK2 inhibitors, several in clinical trials56,58,59, a CDK2-selective inhibitor remains undiscovered. The second target, Kirsten rat sarcoma viral oncogene homolog (KRAS), is a well-known oncogene60 associated with fatal cancers, including pancreatic61, lung62, and colorectal63. The discovery of the KRAS SII allosteric site64 enabled the development of covalent inhibitors targeting KRASG12C, offering promise for selective KRAS inhibition65,66. However, most of these inhibitors are based on a single scaffold initially disclosed by Amgen67. Recent studies suggested that SII selective inhibition could also be effective for KRASG12D68, leveraging a salt bridge with D12. Mirati’s non-covalent inhibitor MRTX1133 proved this by effectively targeting both the inactive and active states of KRASG12D, inhibiting the KRAS pathway69. From a generative AI perspective, we chose these targets to evaluate our GM workflow on a densely populated patent space (CDK2) and a sparsely populated chemical space (KRAS). After several generation cycles, the workflow successfully generated diverse, drug-like molecules with excellent docking scores and predicted SA for each target. For CDK2, 10 molecules were selected for synthesis, resulting in 6 successful syntheses and 3 additional analogs. Of these, 8 showed in vitro activity against CDK2, with 1 reaching nanomolar potency. Building on the reliable performance of ABFE demonstrated in the CDK2 case, we identified 4 molecules with potential activity against KRAS.

Results

Designing an optimal GM workflow

The designed molecular GM workflow follows a structured pipeline for generating molecules with desired properties (Fig. 1). Key steps include:

  1. Data representation: Training molecules are represented as SMILES, tokenized, and converted into one-hot encoding vectors before input into the VAE.

  2. Initial training: The VAE is initially trained on a general training set to learn how to generate viable chemical molecules. It is then fine-tuned on a target-specific training set (initial-specific training set) to learn how to generate molecules with increased target engagement.

  3. Molecule generation: After the initial training, the VAE is sampled to yield new molecules.

  4. Inner AL cycles: In the first inner AL cycle, chemically valid generated molecules (hereafter referred to as generated molecules) are evaluated for druggability, SA, and similarity to the initial-specific training set using chemoinformatic predictors as a property oracle. Molecules meeting threshold criteria are added to a temporal-specific set. This set is used to fine-tune the VAE in subsequent training, prioritizing molecules with desired properties. Inner AL cycles continue iteratively through generation, evaluation, and fine-tuning. From the second AL cycle onwards, the similarity is assessed against the cumulative temporal-specific set.

  5. Outer AL cycle: An outer AL cycle begins after a set number of inner AL cycles. During the outer cycle, accumulated molecules in the temporal-specific set undergo docking simulations, serving as an affinity oracle. Molecules meeting docking score thresholds are transferred to the permanent-specific set, which is used to fine-tune the VAE. Outer AL cycles then proceed iteratively, with nested inner AL iterations, evaluation with docking, and fine-tuning. In the succeeding nested inner AL cycles, similarity is assessed against the permanent-specific set.

  6. Candidate selection: After a set number of outer AL cycles, stringent filtration and selection processes are applied to identify the most promising candidates from the generated molecules accumulated in the permanent-specific set. Specifically, intensive MM simulations, such as PELE54,55 are used to provide an in-depth evaluation of binding interactions and stability within protein–ligand complexes.
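The data representation step of the pipeline (SMILES, tokenization, one-hot encoding) can be sketched as follows. The tokenization rule shown here (bracket atoms, two-letter halogens, then single characters) is an assumption for illustration, since the exact tokenizer is not specified, and padding to a fixed length is omitted.

```python
import re

# Hypothetical tokenizer: bracket atoms first, then Br/Cl, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocab(smiles_list):
    """Map every token seen in the training SMILES to an integer index."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def one_hot(smiles, vocab):
    """Return one row per token, with a single 1 at the token's vocabulary index."""
    rows = []
    for tok in tokenize(smiles):
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        rows.append(row)
    return rows

vocab = build_vocab(["CCO", "CCl"])
encoded = one_hot("CCl", vocab)
```

Treating "Cl" and "Br" as single tokens avoids confusing chlorine with a carbon followed by an unrelated character, which is the usual reason for not splitting SMILES purely character by character.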

Fig. 1: GM workflow pipeline.

The GM workflow involves two nested iterative processes: a lower-level iteration (inner AL cycle) and a higher-level iteration (outer AL cycle). During inner AL cycles, generated molecules are evaluated based on Quantitative Estimate of Drug-likeness (QED), SA score, and Tanimoto similarity. Outer AL cycles use the Glide gscore for evaluation. The specific training set initially consists of molecules with known or predicted affinity to the target (initial-specific training set). Inner AL cycles enrich this set with molecules meeting QED, SA and similarity thresholds (temporal-specific set). In the outer AL cycles, accumulated molecules that meet the Glide gscore thresholds are retained in the specific training set (permanent-specific set), while those that do not are discarded. After candidate selection, generated molecules are validated using absolute binding free energy simulations (ABFE) and bioassays (see “Methods”).

The nested inner and outer AL cycles are designed to generate molecules with desired properties. Inner cycles drive chemical space exploration by promoting dissimilarity to the specific set, and they steer generation toward drug-like, synthesizable candidates by favoring molecules with good predicted druggability and SA. The outer AL cycles guide the generation toward molecules with higher predicted affinity by leveraging a physics-based MM method, such as docking.
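The control flow of the nested cycles can be summarized in a short sketch. All callables (generate, passes_inner, dock, fine_tune) are hypothetical placeholders for the VAE sampler, the chemoinformatic oracle, the docking oracle, and the fine-tuning step; the choice of similarity reference (permanent set plus the molecules accumulated so far) is one reading of the description above, not a confirmed implementation detail.

```python
def run_workflow(model, initial_set, n_outer, n_inner,
                 generate, passes_inner, dock, gscore_max, fine_tune):
    """Nested active-learning loop: inner property filter, outer docking filter."""
    permanent = list(initial_set)              # permanent-specific set
    for _ in range(n_outer):
        temporal = []                          # temporal-specific enrichment
        for _ in range(n_inner):
            reference = permanent + temporal   # assumed similarity reference
            for mol in generate(model):
                is_new = mol not in reference and mol not in temporal
                if is_new and passes_inner(mol, reference):
                    temporal.append(mol)
            # Inner cycle: fine-tune on molecules with the desired properties.
            model = fine_tune(model, permanent + temporal)
        # Outer cycle: keep only molecules whose docking score meets the threshold
        # (more negative is better, so "meets" means score <= gscore_max).
        accepted = [m for m in temporal if dock(m) <= gscore_max]
        permanent.extend(accepted)
        model = fine_tune(model, permanent)
    return model, permanent
```

Molecules that pass the inner filter but fail docking are dropped at the end of each outer cycle, so only docking-accepted molecules carry over as training data for the next one.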

Cyclin-dependent kinase 2 (CDK2)

We first tested our GM workflow on CDK2, chosen for its abundant data on known inhibitors and the need for novel inhibitors with unexplored CDK2 scaffolds.

Given the exploratory nature of this first test, we began by executing a single outer AL cycle comprising 16 inner AL cycles. For the initial-specific training, we used a set of 1061 experimental CDK2 inhibitors. The inner AL cycles evaluated molecules based on the following thresholds: QED values above 0.6, SA scores below 7.0, and similarity below 0.6. The first outer AL cycle generated 49,796 molecules, distributed unevenly across inner cycles due to the stochastic nature of generation (Fig. 2A and S1A; Supplementary Data Table S1). Of these, 30.3% met the inner AL cycle thresholds, enriching the temporal-specific set (Figs. 2B, C, and S1B; Supplementary Data Table S1). The two-dimensional molecular representation provided by the Uniform Manifold Approximation and Projection (UMAP) algorithm70 allowed us to visualize the exploratory nature of the GM workflow over the inner AL cycles (Fig. 2D). During the first five cycles, the workflow exhaustively explored the vicinity of the initial-specific set, while scouting nearby regions. From the sixth cycle onward, the workflow began exploring more distant regions, intensively generating molecules farther from the initial-specific set from cycles seven to sixteen.
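A minimal sketch of the inner-cycle acceptance test with the thresholds quoted above (QED above 0.6, SA score below 7.0, similarity below 0.6). It assumes that QED and SA values are precomputed, that fingerprints are given as sets of on-bit indices, and that similarity means the maximum Tanimoto coefficient against the reference set; the fingerprint type and function names are illustrative assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def passes_inner_thresholds(qed, sa_score, fingerprint, reference_fingerprints,
                            qed_min=0.6, sa_max=7.0, sim_max=0.6):
    """Accept a molecule only if it is drug-like, synthesizable, and dissimilar enough."""
    max_sim = max((tanimoto(fingerprint, ref) for ref in reference_fingerprints),
                  default=0.0)
    return qed > qed_min and sa_score < sa_max and max_sim < sim_max

# Example with toy fingerprints: similarity 0.5 to the only reference molecule.
accepted = passes_inner_thresholds(0.7, 3.0, {1, 2, 3}, [{2, 3, 4}])
```

Lowering sim_max, as done in later cycles, makes the filter stricter because a candidate must then be less similar to every reference molecule.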

Fig. 2: Inner AL cycle decomposition for the first outer AL cycle of CDK2.

A Generated molecules at each inner AL cycle. B Molecules meeting QED, SA and similarity thresholds at each inner AL cycle. C Cumulative plot of the temporal-specific set enrichment over the inner AL cycles. D UMAP illustrating the enrichment of the temporal-specific set over the inner AL cycles. The markers’ styles and colors of molecules from different inner AL cycles are indicated in the legend. The initial-specific set is represented as ‘initial’ and the corresponding enrichments at each inner AL cycle as “innerX”.

The first outer AL cycle concluded with a docking evaluation, filtering molecules in the temporal-specific set using Glide docking71 and applying a Glide gscore threshold of −8.0 kcal·mol−1 or lower. Only 1.8% of the generated molecules met this threshold and were transferred to the permanent-specific set for the next outer AL cycle.

In the second outer AL cycle, the GM workflow generated 24,766 molecules, with 48.6% meeting the inner AL cycle thresholds and 4.8% the outer AL cycle thresholds (Supplementary Data Table S1 and Fig. S1A). To further enhance chemical space exploration, we lowered the similarity threshold to 0.4 for the third and fourth outer AL cycles. The third cycle began from the accumulated permanent-specific set, comprising experimental CDK2 inhibitors and previously generated molecules with predicted affinity. In contrast, the fourth cycle restarted from the initial-specific set of solely experimental CDK2 inhibitors, providing an independent starting point. As expected, the stringent thresholding significantly reduced the percentage of molecules accepted by the inner AL cycles, with only 7.8% and 7.3% for the third and fourth cycles, respectively. Notably, 8.1% and 6.1% of these were also accepted by the outer AL Glide gscore threshold, comparable to the 5.8% and 9.9% in the first and second cycles, indicating that, despite exploring more diverse chemical space, we still generated molecules meeting the affinity criteria (Supplementary Data Table S1). In the fifth outer AL cycle, we added all molecules accepted by previous outer cycles into the permanent-specific set and reverted to a similarity threshold of 0.6. This generated 36,465 molecules, with 51.1% meeting the inner cycle thresholds and 6.3% the outer cycle thresholds (Fig. 3A and S1B; Supplementary Data Table S1). The stringent third and fourth outer AL cycles boosted molecule generation in the fifth cycle, leading to the generation of diverse molecules with high affinity for CDK2. The affinity improvement is evident in the Glide gscore range of −8.0 to −11.5 kcal·mol−1 (Fig. 3C, D), with 4627 molecules generated, a 21-fold increase from the initial-specific set. Of these, 3142 had a maximum similarity below 0.3 to the initial-specific set. Notably, 28 molecules had gscores below −11.5 kcal·mol−1, compared to just 3 in the initial-specific set, with 11 showing a maximum similarity below 0.3.
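The acceptance rates in this paragraph mix two denominators: percentages of all generated molecules, and percentages of the molecules that already passed the inner thresholds. The short sketch below, which uses only the rounded percentages quoted in the text, shows one plausible reading in which the second kind is obtained by dividing the first by the inner acceptance rate. The results are approximate because the inputs are rounded.

```python
def conditional_rate(outer_pct_of_generated, inner_pct_of_generated):
    """Percentage of inner-accepted molecules that also pass the docking threshold."""
    return 100.0 * outer_pct_of_generated / inner_pct_of_generated

# (label, % of generated accepted by outer cycle, % of generated accepted by inner cycles)
cycles = [("outer1", 1.8, 30.3), ("outer2", 4.8, 48.6)]
for label, outer_pct, inner_pct in cycles:
    print(label, round(conditional_rate(outer_pct, inner_pct), 1))
```

With these rounded inputs the sketch gives roughly 5.9% and 9.9%, close to the 5.8% and 9.9% quoted for the first and second cycles; the small difference for the first cycle is consistent with rounding of the underlying counts.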

Fig. 3: Generated molecules over five outer AL cycles for CDK2.

A Cumulative plot of the specific set enrichment over the inner/outer AL cycles. Vertical dotted lines mark the end of each outer AL cycle, dividing the plot into separate outer cycles. Red dots represent inner AL cycles using the CDK2 initial-specific set. Blue dotted lines indicate the integration of permanent-specific sets from the third and fourth outer AL cycles by the fifth outer AL cycle. B UMAP illustrating the generated molecules over the outer AL cycles, with marker styles and colors indicating different cycles as shown in the legend. The initial-specific set is labeled “initial” and outer AL cycles as “outerX”. C Glide gscore histograms of docked molecules across outer AL cycles compared to the Glide gscore histogram of the initial-specific set of CDK2. A red vertical line marks the −8.0 kcal·mol−1 gscore threshold. D Zoomed-in view of the best scoring molecules in (C).

The UMAP of the outer AL cycles (Figs. 3B and S2) illustrates the generative process. Molecules from the first outer cycle populated the space near the initial-specific set and sparsely spread out. Then, the second outer cycle expanded into the sparsely covered right-central region, moving further from the initial-specific set. The third outer cycle remained near previously densely explored regions despite the reduced similarity threshold. The fourth outer cycle avoided the initial-specific set’s vicinity and explored a poorly populated central region. Finally, the fifth outer cycle explored less populated areas of the left-central region, leveraging the variability introduced particularly during the fourth cycle. Figure S3A highlights the GM exploration beyond the initial-specific set, while Fig. S3B shows that outer AL-accepted molecules are evenly distributed across the newly explored space.

During the candidate selection phase, molecules in the permanent-specific set were stringently filtered using a Glide gscore threshold of −10 kcal·mol−1 and a similarity threshold of 0.3, resulting in the retention of 324 molecules (Fig. 4A). These molecules were evenly distributed throughout the newly explored chemical space (Fig. S3C). Next, we conducted more robust MM simulations, previously validated (see “Methods: PELE Validation”), to better discriminate docking false positives. Specifically, we used the all-atom Monte Carlo PELE software (see “Methods: PELE rescoring”) to refine the docking poses and scores of the 324 prefiltered CDK2 molecules.

Fig. 4: CDK2 candidate selection and validation.

A Scatter plot of CDK2 generated molecules with a Glide gscore below −8 kcal·mol−1 and a maximum similarity to the initial-specific set below 0.5. Red dotted lines indicate the stringent thresholds of −10 kcal·mol−1 for gscore and 0.3 for similarity used during the candidate selection. B Table with the MM affinity predictions and maximum similarity to the initial-specific set molecules for the synthesized molecules. C Structures of synthesized molecules. “A-” denotes analog; (R) and (S) indicate right- and left-handed enantiomers, respectively. D Eight of the nine synthesized molecules presented in vitro activity against CDK2 with IC₅₀ values below 50 μM. Outer5_105 showed potent and complete inhibition of CDK2 with an IC₅₀ of 71 nM. JNJ-770662197 was used as a control compound.

After refinement, we visually inspected the poses to identify molecules with promising binding modes and unique kinase hinge regions, thus aiming to identify atypical CDK2 scaffolds (Fig. S4). We selected ten molecules for two parallel validations: synthesis and bioassays, and ABFE simulations72 (Supplementary Data Table S2).

Synthesis and bioassays

Six of the ten molecules, along with two analogs and one chiral compound, were successfully synthesized (see Supplementary “Methods: Compound Synthesis”). The inhibitory effects of these nine molecules were experimentally tested (see “Methods: Inhibitory assays”). Eight presented in vitro activity with IC₅₀ values below 50 μM. Of these, four have IC₅₀ values between 5 and 10 μM, and one showed an IC₅₀ of 71 nM (Fig. 4B–D, Supplementary Data Table S3).

ABFE simulations

Known inhibitors for a CDK2 structure (PDB: 3BHV) were used to benchmark ABFE calculations’ performance with the CDK2 system (see “Methods: ABFE simulations” and Supplementary Data Table S3)73. Five of the eight molecules with in vitro activity presented an ABFE-predicted Kd below 50 μM, and the molecule without in vitro activity presented a predicted Kd in the millimolar range (Supplementary Data Table S3). Notably, the correlation coefficient between the ABFE ΔG and pIC50 from inhibitory assays was −0.75 (−0.88 when excluding the two analogs) (Fig. S5).
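For readers relating the free energies, Kd values, and pIC50 values mentioned above, the following sketch shows the standard thermodynamic conversions. It assumes a 1 M standard state and a temperature of 298.15 K; the simulation temperature is not stated in this section, so that value is an assumption, as are the function names.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def kd_from_dg(dg_kcal_mol, temperature_k=298.15):
    """Dissociation constant in mol/L from a binding free energy: Kd = exp(dG / RT)."""
    return math.exp(dg_kcal_mol / (R_KCAL * temperature_k))

def dg_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant: dG = RT ln(Kd)."""
    return R_KCAL * temperature_k * math.log(kd_molar)

def pic50(ic50_molar):
    """pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L."""
    return -math.log10(ic50_molar)

dg_at_50_um = dg_from_kd(50e-6)   # free energy corresponding to Kd = 50 uM
p_best = pic50(71e-9)             # pIC50 of a 71 nM inhibitor
```

Under these assumptions, a predicted Kd of 50 μM corresponds to a ΔG of about −5.9 kcal·mol−1, and an IC₅₀ of 71 nM to a pIC50 of about 7.15. Because more negative ΔG means tighter binding while higher pIC50 means higher potency, agreement between the two shows up as a negative correlation coefficient.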

To ensure the novelty of the tested compounds, we searched for them in the ENAMINE Real Space 76B74 using the BioSolveIT tool infiniSee75. None of the 10 compounds selected for synthesis were found in this space. For 6, no analogs with a similarity above 0.5 were identified. For the remaining 4, their closest analogs had similarities between 0.54 and 0.75, featuring similar, though not identical, scaffolds (Supplementary Data Table S4 and Fig. S6).

Kirsten rat sarcoma viral oncogene homolog (KRAS)

We tested the capacity of our GM workflow to generate molecules inhibiting KRASG12D through its SII allosteric site. Unlike CDK2, we encountered a scarcity of known KRASG12D SII inhibitors, collecting only 73 molecules with known affinity to KRAS SII. These were mainly developed using structure-based strategies with few scaffolds69,76, resulting in limited molecular variability. To anticipate potential issues from this small, homogeneous set, we created an additional set of 1891 molecules with unknown experimental affinity, obtained via high-throughput virtual screening (HTVS) of a subset of Enamine molecules (see “Methods: KRASG12D initial-specific training sets”), referred to as the KRAS unknown specific set.

We conducted two parallel generative processes: one using the known specific set (known generative process) and another using the unknown specific set (unknown generative process). Each involved 4 outer AL cycles with 15 inner AL cycles. We set QED, SA, and similarity thresholds of 0.6, 7.0, and 0.6 for the inner AL cycles, and a Glide gscore threshold of −8.0 kcal·mol−1 for the outer AL cycles.

The average percentage of molecules accepted by the outer AL cycles was similar for the KRAS known generation (2.3 ± 1.3%) and CDK2 (2.8 ± 2.3%) but significantly higher for the KRAS unknown generation (15.8 ± 2.9%) (Supplementary Data Tables S1 and S5). The higher molecular variability in the KRAS unknown specific set (see Fig. S7), derived from the HTVS, compensated for the lack of experimental affinity data, resulting in a higher acceptance rate.

The UMAP in Fig. 5B separates molecules generated from KRAS’s known and unknown generative processes (known and unknown molecules, respectively) into two distinct clusters, reflecting two independent processes that produce two unrelated molecular sets. To understand the formation of the two clusters, we visually inspected their molecules. Most known molecules displayed anomalous chemical structures, such as broken bicycles or rings with seven or more members (Fig. S8). These artifacts likely stem from two positively charged bicycles decorating most of the known specific set’s scaffolds, such as those in MRTX1133 (Fig. 6E). We inferred that our GM workflow struggled to interpret these structures, leading to the generation of non-natural chemical structures. As the unknown molecules did not exhibit this issue, subsequent analysis focused on them. The unknown molecules showed significant affinity improvements compared to their specific set (Fig. 5C, D). While only 1185 molecules in the unknown specific set had a Glide gscore below −8.0 kcal·mol−1, we generated 23,488 such molecules, marking a 20-fold increase, including 11,195 with maximum similarity to the unknown specific set below 0.3. Notably, 125 unknown molecules had gscores below −10.0 kcal·mol−1, 64 of which had a maximum similarity below 0.3, compared to only 1 in the unknown specific set.

Fig. 5: KRASG12D generated molecules over four outer AL cycles for the known and unknown generative processes.

A Cumulative plots of the known and unknown specific set enrichment over their inner/outer AL cycles. Vertical dotted lines mark the end of each outer AL cycle, dividing the plot into separate outer cycles. Red dots represent inner AL cycles using the KRAS known or unknown specific sets. B UMAP illustrating the generated molecules over the outer AL cycles for both generation processes. Marker styles and colors of molecules from different outer AL cycles and generative processes are indicated in the legend. The two distinct generative processes are labeled as “known” and “unknown”, the initial-specific set is labeled as “initial” and the multiple outer AL cycles as “outerX”. C Glide gscore histograms of docked molecules across the outer AL cycles of the process initiated with the unknown specific set compared to the Glide gscore histogram of the unknown specific set of KRASG12D. D Zoomed-in view of the right tail of the histograms in panel C.

Fig. 6: KRASG12D candidate selection and validation of unknown molecules.

A Scatter plot of KRASG12D unknown molecules with a Glide gscore below −8 kcal·mol−1 and a maximum similarity to the initial-specific set below 0.30. Red dotted lines indicate the stringent thresholds of −9 kcal·mol−1 for gscore and 0.25 for similarity used during candidate selection. B Linear regression and correlation coefficient between Glide gscore and ABFE ΔG (left) and between PELE BFE and ABFE ΔG (right) for the 19 candidate drugs selected after PELE rescoring. C Table showing MM affinity predictions and maximum similarity to the unknown initial-specific set molecules for molecules with potential activity against KRAS. D Structures of molecules with potential activity against KRAS by ABFE simulations. E MRTX1133, Mirati’s KRASG12D inhibitor currently in clinical trials.

During candidate selection, we applied strict filters for similarity (0.25) and Glide gscore (−9 kcal·mol−1), retaining 279 unknown molecules (Fig. 6A) for PELE refinement. Noting the formation of salt bridges between KRAS ASP12 and GLU62 with KRAS inhibitors, we scrutinized the net charge and the estimated PELE binding free energy (PELE BFE) of the retained molecules (Supplementary Data Table S6). As expected, PELE BFE favored molecules with higher net charges. Consequently, we discarded those with a net charge of 3 or more due to concerns about stability, solubility, and permeability. After visually inspecting the remaining molecules, we selected 19 for further validation. Given the high correlation between ABFE ΔG and pIC50 observed in CDK2-synthesized molecules (Fig. S5), we employed ABFE simulations as an in silico validation strategy for KRAS unknown molecules.

ABFE simulations

We benchmarked KRAS ABFE simulations using four KRAS inhibitors with experimental data for the KRAS structure (PDB: 7RPZ) (see “Methods: ABFE simulations” and Supplementary Data Table S7). PELE BFE correlated well with ABFE ΔG (correlation coefficient of 0.64), while Glide gscore did not (correlation coefficient of −0.10) (Fig. 6C). This supports using PELE to refine Glide poses and ABFE simulations to validate PELE predictions. Four of the 19 tested molecules were predicted to have a Kd < 15 μM (Fig. 6B, D and Supplementary Data Table S7).
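The correlation coefficients quoted above compare one set of predicted scores with another. The section does not say which coefficient was used; the sketch below assumes the Pearson coefficient and uses placeholder numbers that are purely illustrative and not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only: hypothetical rescoring energies vs. hypothetical ABFE dG.
rescoring_scores = [-60.0, -55.0, -70.0, -48.0]
abfe_dg = [-7.1, -6.0, -8.3, -6.4]
r = pearson_r(rescoring_scores, abfe_dg)
```

A coefficient near 1 would indicate that the cheaper score ranks molecules much like ABFE does, while a value near 0, as reported for the Glide gscore, indicates little linear relationship.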

We searched for the four molecules with a Kd < 15 μM in the ENAMINE Real Space 76B, but none of them were present. For three, no analogs with a similarity above 0.4 were identified. For the remaining one, the closest analog had a similarity of 0.72 and featured a similar, but not identical, scaffold (Supplementary Data Table S8 and Fig. S9).

Discussion

We developed a GM workflow consisting of a VAE with two nested AL cycles powered by chemoinformatics and MM predictors. Inner AL cycles facilitated the generation of drug-like molecules with good synthetic properties and promoted molecular diversity. Outer AL cycles contributed to generating virtual hits with improved target engagement. The final selection phase allowed for a rational selection of candidate drugs, ultimately validated by in silico and in vitro assays.

We tested our GM workflow on two model systems: CDK2 and KRAS. CDK2 was selected for its extensive experimental data and the need for novel inhibitors with unexplored scaffolds. We aimed to bypass typical hinge binder scaffolds of kinases, potentially generating novel CDK2 inhibitors. After five outer AL cycles, we generated a diverse set of molecules with exceptionally good Glide gscores. Following refinement, we selected 10 candidate drugs with novel hinge regions, subsequently validated through synthesis, bioassays, and ABFE simulations. Six were successfully synthesized, along with two analogs and one chiral compound, demonstrating our workflow’s ability to overcome the SA problem. Note that the remaining four could, in principle, be synthesized but required too many synthetic steps, as stated by Enamine Ltd. Most synthesized molecules (8/9) showed in vitro activity against CDK2, with two in the single-digit micromolar range and one in the nanomolar range. This demonstrated the capacity of our workflow to generate molecules with experimentally confirmed target engagement.

The KRAS study tested our GM workflow in a low-data regime. Moreover, KRASG12D-specific molecules were rather complex. Using these molecules as a specific set led to generating anomalous chemical structures. To address this, we used the results of an HTVS as specific training (thus employing virtual data), resulting in the generation of a diverse set of molecules with excellent Glide gscores. These were further refined, yielding 19 candidate drugs. Based on the strong correlation between experimental assays and ABFE in the CDK2 study, we validated the KRAS candidates using ABFE simulations, identifying four molecules with predicted Kd < 15 μM. KRAS generative processes validated the use of virtual data for specific training, a strategy easily applicable to almost any target.

Although one might argue that using an AL protocol with additional modeling tools could propagate errors, our results show that the current procedure allows the GM workflow to generate molecules with predicted tight binding to protein targets, leading to validated candidates. In fact, the nested AL cycles enable rapid exploration of a molecular space with high in silico predicted scores, where we expect a reduction of false positives. Moreover, when correctly implemented (dynamic adjustment of model parameters and frequent evaluation of generated molecules), our GM workflow not only generates drug-like molecules with high docking scores but also ensures scaffold diversity and good SA, essential for early-phase hit screening.

Methods

Training sets

General training set

The general training set comprises 247,119 SMILES selected from over 2.7 million molecules in the ChEMBL 30 molecules database77 by applying drug-like filters.

CDK2 initial-specific training set

The CDK2 initial-specific training set consists of 1061 distinct molecules experimentally tested for inhibitory potency against CDK2, with IC50 values ranging from 0.2 to 100 μM. These molecules were selected from ChEMBL77 and PDBbind78 (datasets curated in July 2022). The final selection was diversified for scaffold content.

KRASG12D initial-specific training sets

The KRASG12D known initial-specific training set consists of 60 KRASG12D inhibitors and 13 KRASG12C inhibitors with trimmed warheads. The inclusion of KRASG12C inhibitors was based on their binding to the SII allosteric pocket, the same as the KRASG12D inhibitors. Of the 60 KRASG12D inhibitors, 10 were obtained from PDB79 by applying a positional filter towards ASP12 to the ligands present in KRASG12D structures. The remaining 50 inhibitors were from chemical databases: 31 from ChEMBL77, 14 from PubChem80, and 5 from MedChemExpress81. Of the 13 KRASG12C inhibitors, 7 were obtained from PDB using a positional filter towards CYS12 to the ligands in KRASG12C structures. The last 5 were ligands found in the literature82 that are either clinically approved or in different clinical phases.

The KRASG12D unknown initial-specific training set is composed of 1891 molecules. These were obtained from an HTVS of a representative 500 K subset of the Enamine REAL collection74, using Glide HTVS. We filtered this subset to include only molecules having 25–32 HAC (heavy atom count), 2–3 aromatic rings, and a single basic amine, chosen to exploit the binding specificity introduced by the G12D mutation in KRASG12D. To perform the HTVS, we used a grid created with the Mirati MRTX1133 molecule and two constraints: (1) a distance constraint between the molecule's basic amine and G12D, and (2) a hydrophobic aromatic constraint with the hydrophobic subcavity of SII. The top-ranked compounds based on docking scores were selected to form the KRASG12D initial-specific training set.

Target structure selection

The selection of target structures for the MM simulations of generated molecules was as follows:

  1. Gathering and processing structures: we collected target crystallographic structures from the PDB, filtering by resolution (<2.5 Å) and source organism (Homo sapiens). Water, ligands, and ions were removed, and structures were split into chains. Chains outside the residue range of 280–300 for CDK2 and 155–175 for KRASG12D were discarded. The remaining structures were superimposed onto a reference structure using TMalign83 and prepared with Schrödinger’s Protein Preparation Wizard (PrepWizard)84, which added missing hydrogen atoms with PROPKA85 (pH 7.0) and optimized hydrogen bonds (HBs) using ProtAssign.

  2. Clustering candidate structures by binding site occupancy: we calculated the volume and occupancy of the ATP-binding site for CDK2 structures and the SII site for KRASG12D structures using Schrödinger’s SiteMap86. With the occupancy files, we generated a pairwise volume overlapping matrix for each target using the trajectory_binding_site_volumes.py script from SiteMap. Hierarchical clustering was performed on the volume overlapping matrices using the Seaborn Python library87, clustering target structures based on the shape of their binding site. Next, we focused on the cluster of interest for each target (e.g., KRASG12D in its inactive form) and identified the structure with the highest average overlapping volume as the cluster center. The top 10 structures with the highest overlap to this center were further evaluated based on resolution, completeness, density map quality, and ligand presence to select the final target structure.
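The cluster-center step can be sketched as follows. This is a minimal illustration, assuming the pairwise overlap matrix is available as a nested list and the cluster of interest as a list of indices; the function name, the toy matrix, and the choice to average only over members of the same cluster are assumptions and do not reflect the interface of the SiteMap script.

```python
def cluster_center(overlap, members, top_n=10):
    """Return (center, ranked) for one cluster.

    overlap : square, symmetric matrix of pairwise overlapping volumes
    members : indices of the structures in the cluster of interest
              (at least two are expected)
    center  : member with the highest average overlap to the other members
    ranked  : up to top_n other members, sorted by overlap with the center
    """
    def mean_overlap(i):
        others = [j for j in members if j != i]
        return sum(overlap[i][j] for j in others) / len(others)

    center = max(members, key=mean_overlap)
    ranked = sorted((j for j in members if j != center),
                    key=lambda j: overlap[center][j], reverse=True)
    return center, ranked[:top_n]


# Toy 4x4 overlap matrix with hypothetical volumes.
toy = [
    [500, 400, 350, 100],
    [400, 500, 380, 120],
    [350, 380, 500, 110],
    [100, 120, 110, 500],
]
center, neighbors = cluster_center(toy, members=[0, 1, 2])
```

The ranked neighbors are the structures that would then be inspected manually for resolution, completeness, density map quality, and ligand presence.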

GM workflow

VAE

The VAE architecture comprises an encoder, a latent space, and a decoder. The input is represented as a 3D tensor resulting from the one-hot encoding of the SMILES sequences. The encoder feeds the input into a Long Short-Term Memory (LSTM) layer followed by a fully connected layer with 256 units. This layer further processes the encoded sequence information and reduces its dimensionality into a 128-dimensional bottleneck layer. This bottleneck is the latent space, in which each input is compressed into a latent vector. The decoder reconstructs the sequences, expanding the latent vector through an LSTM layer followed by a fully connected layer with 256 units. The activation function used is ReLU. Finally, a softmax activation layer is applied to produce the output sequence.
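To illustrate the input representation only, the sketch below one-hot encodes a batch of SMILES strings into a (batch, sequence length, vocabulary) structure. Character-level tokenization, the padding symbol, and the function name are assumptions made for illustration; the text does not specify the tokenizer, and multi-character tokens such as Cl or Br would need dedicated handling.

```python
def one_hot_smiles(smiles_list, max_len=None):
    """Encode SMILES strings as a (batch, max_len, vocab) nested list.

    Character-level tokens are an assumption; a padding symbol ' ' is
    added so that all sequences share the same length.
    """
    pad = " "
    vocab = [pad] + sorted({ch for s in smiles_list for ch in s})
    index = {ch: i for i, ch in enumerate(vocab)}
    if max_len is None:
        max_len = max(len(s) for s in smiles_list)

    tensor = []
    for s in smiles_list:
        padded = s[:max_len].ljust(max_len, pad)
        rows = []
        for ch in padded:
            row = [0] * len(vocab)
            row[index[ch]] = 1  # exactly one active position per token
            rows.append(row)
        tensor.append(rows)
    return tensor, vocab


tensor, vocab = one_hot_smiles(["CCO", "c1ccccc1"])
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

The three dimensions of `shape` correspond to the batch size, the padded sequence length, and the vocabulary size.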

Nested AL

In our workflow, AL is implemented through a tightly coupled generate, filter, and fine-tuning cycle that iteratively guides the VAE toward more desirable regions of chemical space.

First, the inner AL cycle samples a large batch of molecules from the VAE’s latent space. These raw proposals are then subjected to chemoinformatic oracles: QED and SA thresholds, and a Tanimoto similarity cutoff against the initial-specific set. Molecules meeting all thresholds are accumulated in a temporal-specific set, while others are discarded. This set focuses the next fine-tuning step on scaffolds that combine novelty, synthesizability, and basic drug-like properties.

Second, after a predefined number of inner cycles, an outer cycle begins, where the accumulated molecules in the temporal-specific set are docked to the target protein, and only those meeting a predefined predicted affinity threshold are retained. These predicted binders are appended to a permanent-specific set used to fine-tune the VAE, biasing its latent space toward chemistries predicted to be more potent binders. The workflow then restarts with another round of inner cycles and continues for a predefined number of outer cycles.

Crucially, both “pass” and “fail” generated molecules, at each cycle and threshold, feed back into the model: negative examples inform it where not to sample, while positives reinforce productive generative directions. This sampling and evaluation process, where every generation cycle is immediately judged and used to reshape the generator, embodies the active-learning core of our design, i.e., the model continuously learns from its own outputs, converging on richer, more potent, and practically accessible drug candidates over successive cycles.
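The control flow of the nested cycles can be summarized with the skeleton below. It is a sketch under stated assumptions, not the authors' implementation: all four callables are hypothetical stand-ins for the VAE sampler, the chemoinformatic filters, the docking run, and the fine-tuning step, and the choices to reset the temporal-specific set at each outer cycle and to dock only newly generated molecules are assumptions.

```python
import random


def nested_active_learning(sample, passes_filters, dock_score, fine_tune,
                           initial_specific, n_outer=2, n_inner=3,
                           score_threshold=-8.0):
    """Skeleton of the nested loop; all callables are hypothetical stand-ins.

    sample()          -> list of generated molecules (e.g. SMILES)
    passes_filters(m) -> True if m meets the QED, SA and similarity thresholds
    dock_score(m)     -> docking score (more negative = better)
    fine_tune(mols)   -> update the generator on a list of molecules
    """
    permanent = []
    for _ in range(n_outer):
        temporal = list(initial_specific)
        for _ in range(n_inner):
            # Inner cycle: keep only new molecules that pass the cheap filters.
            for m in sample():
                if passes_filters(m) and m not in temporal:
                    temporal.append(m)
            fine_tune(temporal)
        # Outer cycle: dock the accumulated molecules, keep predicted binders.
        for m in temporal:
            if m in initial_specific or m in permanent:
                continue
            if dock_score(m) <= score_threshold:
                permanent.append(m)
        fine_tune(permanent)
    return permanent


# Toy run with stub oracles: "molN" passes the filters if N is even and
# receives a docking score of -N.
rng = random.Random(0)
calls = []
result = nested_active_learning(
    sample=lambda: [f"mol{rng.randrange(20)}" for _ in range(5)],
    passes_filters=lambda m: int(m[3:]) % 2 == 0,
    dock_score=lambda m: -float(m[3:]),
    fine_tune=lambda mols: calls.append(len(mols)),
    initial_specific=["seed"],
)
```

In this toy run, `calls` records one fine-tuning step per inner cycle plus one per outer cycle, which mirrors the two levels of feedback described above.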

Temporal-specific set

This set is built over inner AL cycle iterations. After sampling a batch of molecules from the VAE, we apply the chemoinformatic filters (QED, SA, Tanimoto similarity) and retain only those that pass. These selected molecules, along with those from the initial-specific set, are accumulated to form the temporal-specific set, which is then used to fine-tune the VAE’s weights. Since compounds that fail the filters are not carried over, the set captures only the most promising, yet still exploratory, scaffolds for that cycle.

Permanent-specific set

This set is built over outer AL cycle iterations. Molecules in the temporal-specific set that clear the outer loop, i.e., those that pass docking thresholds, are appended to the permanent-specific set, which accumulates across all cycles. This growing pool of predicted binders serves two functions: (i) it provides a steadily enriched training corpus that biases the VAE toward regions of chemical space with improved target engagement over time, and (ii) it constitutes the library of top‐candidate chemistries that we carry forward into final in silico and in vitro validation. Because no compound that fails the affinity test is added here, the permanent-specific set enforces a long‐term memory of quality: once a molecule makes the cut, it influences all future generations, pushing the model ever closer to potent, synthesizable hits.

Inner AL cycle

The QED score was calculated using the Chem module of the RDKit Python library88. QED score falls within a range of 0 to 1, where 0 implies that all druglike properties are unfavorable, and 1 that all properties are favorable89. The molecular similarity was calculated with the Tanimoto metric (0 indicating no similarity, 1 indicating identical molecules)90 and the Morgan4 fingerprint91, both computed using the DataStructs module of RDKit. The SA score, ranging from 1 (easy to synthesize) to 10 (difficult to synthesize), was obtained using the SAScore module from the Scopy library37.
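As a worked illustration of the similarity metric and of how the three oracles combine, the sketch below computes the Tanimoto coefficient on fingerprints represented as sets of on-bit indices, without relying on RDKit. The cutoff values are placeholders, and treating the similarity cutoff as an upper bound (to favor novelty with respect to the initial-specific set) is an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(candidate, reference_set):
    """Highest Tanimoto similarity of a candidate to any reference fingerprint."""
    return max(tanimoto(candidate, ref) for ref in reference_set)


def passes_inner_filters(qed, sa, similarity,
                         qed_min=0.5, sa_max=5.0, sim_max=0.6):
    """Combine the three oracles; the default cutoffs are placeholders only."""
    return qed >= qed_min and sa <= sa_max and similarity <= sim_max


# Two toy fingerprints sharing 2 of 4 distinct on-bits.
sim = tanimoto({1, 2, 3}, {2, 3, 4})
```

Here `sim` is the number of shared on-bits divided by the number of distinct on-bits across both fingerprints.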

Outer AL cycle

The docking score was calculated using the Schrödinger software Glide71. Glide gscore was employed as the docking metric (see “Methods: Glide docking” for more details on the docking protocol). The gscore threshold of −8 kcal·mol−1 for the outer AL cycle of CDK2 generative processes was set by averaging the gscores of CDK2 initial-specific set molecules with an IC50 below 1 µM. The gscore threshold of −8 kcal·mol−1 for the outer AL cycle of KRAS generative processes was set at the gscore marking the 10th percentile of the unknown HTVS.
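The two ways of deriving a threshold can be sketched as follows. The function names and toy values are illustrative, and reading "10th percentile" as the score bounding the best-scoring 10% of the screen, with a nearest-rank rule, is an assumption.

```python
from statistics import mean


def threshold_from_actives(records, ic50_cutoff_um=1.0):
    """Mean gscore of molecules whose IC50 (in µM) is below the cutoff.

    records is a list of (gscore, ic50_um) pairs.
    """
    return mean(g for g, ic50 in records if ic50 < ic50_cutoff_um)


def threshold_from_percentile(gscores, fraction=0.10):
    """gscore bounding the best-scoring `fraction` of a screen.

    Uses a simple nearest-rank rule on scores sorted from best (most
    negative) to worst; other percentile conventions give slightly
    different values.
    """
    ranked = sorted(gscores)
    k = max(1, round(fraction * len(ranked)))
    return ranked[k - 1]


# Toy data: two potent molecules and one weak one.
cdk2_like = threshold_from_actives([(-9.0, 0.5), (-7.0, 0.2), (-5.0, 50.0)])
# Toy screen of 20 scores from -1 to -20.
kras_like = threshold_from_percentile([-float(i) for i in range(1, 21)])
```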

Molecular modeling

Glide docking

The Glide protocol used to compute the docking score during the outer AL cycle depends on the system:

CDK2

We assessed the predictive capability of various Glide protocols to distinguish between active and inactive CDK2 molecules. We employed the ROC curve method and its area under the curve (AUC) to compare the different protocols’ predictive power. IC50 values of initial-specific set molecules were used as the real classifier, and the protocol’s Glide gscores as predicted classifiers. An IC50 threshold of 10 µM was fixed to distinguish between experimentally active and inactive molecules, while the gscore threshold varied within a specified range, classifying molecules as computationally active or inactive. The protocol with the highest AUC, indicating the best predictive power, was the XP Glide with an HB constraint to LEU83 of CDK2.
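A minimal sketch of this comparison metric is given below. Rather than sweeping the gscore threshold explicitly, it uses the equivalent rank formulation of the AUC: the probability that a randomly chosen active scores better than a randomly chosen inactive. Labeling molecules with IC50 strictly below the cutoff as active is an assumption, as are the function name and toy data.

```python
def roc_auc(ic50_um, gscores, ic50_cutoff_um=10.0):
    """AUC for separating actives (IC50 < cutoff) from inactives by gscore.

    Computed as the probability that a randomly chosen active has a more
    negative gscore than a randomly chosen inactive (ties count one half),
    which equals the area under the ROC curve.
    """
    actives = [g for ic, g in zip(ic50_um, gscores) if ic < ic50_cutoff_um]
    inactives = [g for ic, g in zip(ic50_um, gscores) if ic >= ic50_cutoff_um]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive molecule")
    wins = sum(1.0 if a < i else 0.5 if a == i else 0.0
               for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


# Toy data: two actives (0.5 and 2 µM) and two inactives (50 and 80 µM).
auc = roc_auc([0.5, 2.0, 50.0, 80.0], [-9.0, -7.0, -8.0, -5.0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so protocols can be compared directly on this value.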

KRASG12D

For KRASG12D, we were unable to evaluate different Glide protocols as we did with CDK2 due to limited experimental data. Specifically, from the KRAS initial-specific set, we excluded KRASG12C covalent inhibitors (with warheads removed), compounds with Kd instead of IC₅₀, and crystal structures lacking affinity data. The remaining IC₅₀-reported molecules had a narrow activity range and belonged to a few closely related congeneric series developed through structure-based design, limiting both chemical diversity and affinity variability. Given these constraints, a correlation analysis like that shown for CDK2 in Fig. S10 was not feasible. Instead, we assessed multiple constraint combinations based on the known binding mode of the MRTX1133 molecule (PDB 7RPZ) and selected the one that best reproduced the original co-crystal structure pose. Additionally, some known inhibitors of KRASG12D, mostly MRTX1133 derivatives68,69, were docked using the same protocol to validate the methodology. We chose an SP Glide protocol with two constraints: an HB constraint with ASP12 and a positional hydrophobic constraint centered on the naphthalene binding subcavity.

PELE rescoring

We used the PELE software92 during the candidate selection phase to enhance the granularity of the docking poses and scores of the generated molecules. For each molecule, we conducted a constrained PELE rescoring simulation starting from the Glide-docked structures. This involved a local adaptive PELE exploration93 with 32 cores over 20 epochs, each with 12 Monte Carlo steps. Glide docking HB constraints were maintained during the PELE simulation: LEU83 for CDK2, and ASP12 for KRASG12D. Each simulation took approximately three wall clock hours on the MareNostrum IV supercomputer at the Barcelona Supercomputing Center.

As a PELE rescoring metric, we estimated PELE BFE by calculating the expected value of the PELE Binding Energies (BE) observed during the simulation:

$$\mathrm{BFE}=\langle \mathrm{BE}\rangle ={\sum}_{i}^{N} \, {p}_{i}\,{\mathrm{BE}}_{i}$$
(1)

where N is the total number of PELE conformations, BEi the PELE BE at each PELE conformation i, and pi the probability of the BEi state of energy given by the Boltzmann distribution:

$${p}_{i}=\frac{{e}^{-T{E}_{i}/{K}_{b}T}}{Q}$$
(2)

where TEi is the total energy of the BEi state of energy, T the absolute temperature (set at 298 K), and Kb the Boltzmann constant. Finally, Q is defined as:

$$Q={\sum}_{i}^{N} \, {e}^{-\frac{T{E}_{i}}{{K}_{b}T}}$$
(3)
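The estimator can be sketched in a few lines. This is an illustrative implementation, assuming energies in kcal·mol−1 and weights proportional to exp(−TEi/KbT) that are normalized to sum to one; the function name and toy values are hypothetical. The total energies are shifted by their minimum before exponentiation, which leaves the normalized weights unchanged and avoids numerical overflow.

```python
import math

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_average(binding_energies, total_energies, temperature=298.0):
    """Boltzmann-weighted mean of binding energies.

    Weights are proportional to exp(-TE_i / (Kb*T)) and normalized to one.
    Energies are assumed to be in kcal/mol.
    """
    kt = KB_KCAL * temperature
    # Shift by the minimum total energy so the exponentials do not overflow;
    # the shift cancels out after normalization.
    te_min = min(total_energies)
    weights = [math.exp(-(te - te_min) / kt) for te in total_energies]
    q = sum(weights)
    return sum(w * be for w, be in zip(weights, binding_energies)) / q


# Two conformations with equal total energy contribute equally.
bfe_equal = boltzmann_average([-40.0, -60.0], [-1000.0, -1000.0])
# A conformation 100 kcal/mol lower in total energy dominates the average.
bfe_dominated = boltzmann_average([-40.0, -60.0], [-1000.0, -900.0])
```

Because the weights depend on the total energy rather than on the binding energy, the estimate is driven by the lowest-energy conformations sampled during the simulation.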

PELE’s validation

To validate PELE’s rescoring, we tested it on a subset of CDK2 inhibitors from its initial-specific set. We assessed PELE’s accuracy improvement by comparing the Glide gscores and PELE BFE against experimental affinities. The correlation coefficient between experimental and predicted values increased from 0.56 (Glide gscore) to 0.67 (PELE BFE) (Fig. S10).

Absolute binding free energy simulations

PELE-refined poses underwent a short MD simulation with the same force field and explicit solvent as the subsequent free energy calculations. Only docked ligands that remained stable during the MD simulation advanced to ABFE simulation, performed under the same conditions as the X-ray control ligands. ABFE simulations were carried out using a double decoupling scheme as described by Boresch et al.72 and in ref. 94. Briefly, starting from the physical ligand in water, the van der Waals (vdw) and electrostatic interactions within the ligand and between the ligand and water are slowly turned off until the ligand becomes a dummy; second, the dummy ligand is attached to the protein binding pocket through a set of cross-link restraints similar to those proposed by Boresch et al.; third, the intra-ligand and ligand-environment vdw and electrostatic interactions for the restrained ligand are slowly turned on in the protein binding pocket, and the cross-link restraints are relaxed after that. Prior to running ABFE on generated ligands, we benchmarked ABFE simulations on an existing set of CDK2 and KRAS control ligands (Supplementary Data Tables S3 and S7).

ABFE predicted Kd was computed as,

$${K}_{d}={e}^{-\frac{\Delta G}{{RT}}}$$
(4)

derived from,

$$\Delta G=-{RT}{{\mathrm{ln}}}{K}_{d}$$
(5)

where ΔG is the ABFE ΔG in J·mol−1, R is the gas constant (8.3145 J·mol−1·K−1), and T the temperature (298 K).
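Equations 4 and 5 translate directly into code. The sketch below follows the equations exactly as written, so a Kd below 1 M corresponds to a positive ΔG in this sign convention; the 1 M standard state, the function names, and the kcal-to-joule helper are assumptions added for illustration.

```python
import math

R = 8.3145   # gas constant, J/(mol*K)
T = 298.0    # temperature, K


def kd_from_dg(dg_j_per_mol):
    """Eq. 4 as written: Kd = exp(-dG / RT), with Kd in mol/L."""
    return math.exp(-dg_j_per_mol / (R * T))


def dg_from_kd(kd_molar):
    """Eq. 5 as written: dG = -RT ln(Kd)."""
    return -R * T * math.log(kd_molar)


def kcal_to_joule(dg_kcal_per_mol):
    """Convert kcal/mol to J/mol (1 kcal = 4184 J)."""
    return dg_kcal_per_mol * 4184.0


# Free energy corresponding to a 15 µM dissociation constant, and back.
dg_15um = dg_from_kd(15e-6)
kd_back = kd_from_dg(dg_15um)
```

Free energies reported in kcal·mol−1 need to be converted with `kcal_to_joule` before applying Eq. 4 with R expressed in joules.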

Inhibitory assays

The following fluorescence-based assays were performed by Axxam. The LANCE Ultra Kinase Assay from Revvity was used for testing the synthesized molecules. Briefly, the assay uses Europium-labeled anti-phospho-Ser/Thr antibody and the following alternative ULight™-labeled peptide substrates: (a) eIF4E-Binding Protein 1 (4E-BP), STTPGGTLFSTTPG, or (b) Myelin Basic Protein (MBP), CFFKNIVTPRTPPPSQGK. When peptide substrates are phosphorylated by CycE1-CDK2, the phosphorylated site is recognized by the Eu-labeled anti-phospho-Ser/Thr Ab. Upon excitation of the Europium donor fluorophore at 337 nm, energy is transferred to the ULight™ acceptor dye on the substrate, resulting in the emission of light at 665 nm. The fluorescent signal is proportional to the level of ULight™ peptide phosphorylation and to the kinase activity. All molecules were tested as an 8-point dose-response with the top dose at either 32 μM or 100 μM and 1.5× dilutions towards 0 μM.

Assays were conducted in a reaction buffer containing 50 mM HEPES (pH 7.5), 1 mM EGTA, 10 mM MgCl₂, 2 mM DTT, 0.001% Tween-20, and 0.001% fatty acid-free BSA. All kinase reaction reagents, such as LANCE Ultra ULight Phospho-4EBP1 (Thr37/46) Peptide Substrate (substrate 1), LANCE Ultra ULight Total MBP Peptide Substrate (substrate 2), and CycE1-CDK2 enzyme, were prepared in 1× reaction buffer. All the concentrations of the reaction components refer to a 10 µL assay volume. The assay was assembled directly in ProxiPlate-384 small volume plates (Revvity #6008230) as follows: 2.5 µL of CycE1-CDK2 (5 nM for substrate 1, and 0.15 nM for substrate 2) were preincubated for 30 min at room temperature with 5 µL of the inhibitor or 0.5% DMSO. After preincubation, 2.5 µL of either substrate 1 or substrate 2 (50 nM) and ATP (150 µM) were added to initiate the kinase reaction and incubated for 60 min at room temperature. The enzymatic reaction was stopped by the addition of 10 µL of EDTA (10 mM) pre-mixed with the detection reagents (0.5 nM of either antibody 1 or antibody 2 in LANCE™ Detection Buffer) and incubated for 60 min. The TR-FRET signal was measured on a PHERAstar FSX (BMG Labtech) with the HTRF 337/665/620 nm optic module and a laser lamp as light source. All data were generated by testing each experimental condition in quadruplicate.

TR-FRET data were expressed as the ratio of fluorescence emissions at 665/620 nm × 10,000. Data were normalized to the Neutral Control (enzymatic reaction + 0.5% DMSO) and the Inhibitor Control (IC100 of JNJ-7706621, 1 µM).
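The data reduction can be sketched as follows. The ratio follows the definition above; the linear two-point normalization, in which the neutral control maps to 100% activity and the inhibitor control to 0%, is an assumption, since the text states only which controls were used. Function names and toy readings are hypothetical.

```python
def tr_fret_ratio(em_665, em_620):
    """Ratio of acceptor to donor emission, scaled by 10,000."""
    return em_665 / em_620 * 10000


def percent_activity(ratio, neutral_mean, inhibitor_mean):
    """Scale a ratio so the neutral control is 100% and the inhibitor control 0%.

    This linear two-point normalization is an assumption; the text only
    states which controls were used.
    """
    return 100.0 * (ratio - inhibitor_mean) / (neutral_mean - inhibitor_mean)


# Toy well: raw emissions and plate-level control means (hypothetical values).
ratio = tr_fret_ratio(5000, 10000)
activity = percent_activity(ratio, neutral_mean=8000.0, inhibitor_mean=2000.0)
```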

Data analysis and visualization

All molecular visualizations (Figs. 4C, 6D, S4, S6, S8, and S9) were rendered with MarvinSketch95. UMAPs (Figs. 2D, 3B, 5B, S2, S3, and S7) were generated using the umap-learn Python package70, with Morgan4 as the molecular fingerprint91, and the Hamming Distance as the metric96.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.