Accelerating discovery of bioactive ligands with pharmacophore-informed generative models

Xie, Weixin; Zhang, Jianhang; Xie, Qin; Gong, Chaojun; Ren, Yuhao; Xie, Jin; Sun, Qi; Xu, Youjun; Lai, Luhua; Pei, Jianfeng

doi:10.1038/s41467-025-56349-0

Article
Open access
Published: 10 March 2025

Nature Communications volume 16, Article number: 2391 (2025) Cite this article

An Author Correction to this article was published on 07 April 2025

This article has been updated

Abstract

Deep generative models have advanced drug discovery but often generate compounds with limited structural novelty, providing constrained inspiration for medicinal chemists. To address this, we develop TransPharmer, a generative model that integrates ligand-based interpretable pharmacophore fingerprints with a generative pre-training transformer (GPT)-based framework for de novo molecule generation. TransPharmer excels in unconditioned distribution learning, de novo generation, and scaffold elaboration under pharmacophoric constraints. Its unique exploration mode could enhance scaffold hopping, producing structurally distinct but pharmaceutically related compounds. Its efficacy is validated through two case studies involving the dopamine receptor D2 (DRD2) and polo-like kinase 1 (PLK1). Notably, three out of four synthesized PLK1-targeting compounds show submicromolar activities, with the most potent, IIP0943, exhibiting a potency of 5.1 nM. Featuring a new 4-(benzo[b]thiophen-7-yloxy)pyrimidine scaffold, IIP0943 also has high PLK1 selectivity and submicromolar inhibitory activity in HCT116 cell proliferation. TransPharmer offers a promising tool for discovering structurally novel and bioactive ligands.

Introduction

Identifying compounds with bioactivity against desired targets has been one of the important objectives for rational drug discovery. Deep learning-based generative models have emerged as currently predominant methodologies, demonstrating their efficacy in advancing towards this objective^{1,2,3,4,5,6,7,8,9,10,11,12,13}. One well-known instance is that scientists at Insilico Medicine successfully employed their generative model, GENTRL, to uncover nanomolar inhibitors for the DDR1 kinase within a short timeline¹. Beyond GENTRL, researchers exhibit a fervent interest in exploring the potential of molecular generative models through investigations of diverse combinations of model components, including architectures^{14,15,16,17,18,19}, molecular representations^20,21,22, and optimization algorithms^23,24,25,26.

Effective as generative models are, their efficiency raises new concerns: how does the creativity of generative models compare to that of humans? Can the designs generated by these models inspire human experts? In 2018, Bush et al. conducted an interesting experiment—a Turing test involving three molecular generators²⁷, including RG2Smi, a deep learning-based generative model²⁸. They found that it was hard for RG2Smi to propose molecular designs that align with those of human medicinal chemists or gain acceptance from them. Moreover, the novelty of the bioactive compounds generated automatically has constantly been under debate^29,30,31. Moret et al. fine-tuned their chemical language models (CLMs) using 46 highly active PI3Kγ inhibitors before employing them to generate new inhibitors against PI3Kγ kinase. The chemical structures of the most potent ligands designed or inspired by CLMs, namely compounds 18 and 22, exhibit a high degree of similarity to known PI3Kγ inhibitors². Other studies that applied transfer learning to bias molecular generators toward specific protein targets often encounter varying degrees of novelty issues with the bioactive compounds generated^3,4,5,6. These results underscore the urgent need for a deep understanding of the “correct recipes” for generative models to produce compounds that are bioactive while novel enough, in order to serve as useful copilots for human medicinal chemists.

Pharmacophore-informed generative models present alternative approaches to promote this understanding. The pharmacophore model, rooted in pharmaceutical features, offers a coarse-grained solution for molecular representation, facilitating scaffold hopping among chemically diverse ligands^32,33. Furthermore, pharmacophore serves as a bridge linking molecular structure and bioactivity. Given these advantages, there has been a recent surge in interest regarding the utilization of pan-pharmacophore features for molecular generation^34,35,36. For instance, Imrie et al. introduced DEVELOP, a pharmacophore-aware generative model employing 3D grids to represent target pharmacophores, for linker design and scaffold elaboration³⁴. Their results demonstrated that generative models can leverage pharmacophoric information to produce molecules with distinct structures that maintain crucial non-bond interactions with receptors. Similarly, LigDream encodes and decodes 3D voxels representing five common types of pharmacophore features for de novo molecular design³⁵. Other pan-pharmacophore features have been incorporated into generative models, including condition vectors indicating the shortest bond distances, as well as the presence, absence, or exact quantities of specific pharmacophoric features^37,38. Recently, Zhu et al. introduced PGMG, which employs a fully connected graph containing selected pharmacophore features of a reference compound³⁹. PGMG was able to generate drug-like molecules with superior docking scores compared to known bioactive ligands and showcased its capability of scaffold hopping from an initial EGFR inhibitor. However, it is noteworthy that most novel molecules generated by pharmacophore-based generative models have not yet undergone wet lab experimental testing to validate this methodology.

In this study, we present TransPharmer as an innovative pharmacophore-aware generative model, which employs ligand-based pharmacophore kernels to achieve structural abstraction while preserving fine-grained topological information. The ligand-based pharmacophore kernels are similar to those used in the previous studies for ligand-based virtual screening^40,41. Our pharmacophore kernels are encoded into multi-scale and interpretable fingerprints, serving as prompts for TransPharmer. The architecture of TransPharmer is reminiscent of a generative pre-training transformer (GPT)⁴², as illustrated in Fig. 1, establishing a connection between pharmacophores and molecular structures represented by the simplified molecular-input line-entry system (SMILES)⁴³. We posit that equipping GPT with pharmacophore knowledge enables the model to focus on the pharmaceutical aspects of the chemical structures and generate drug-like molecules. During our evaluation, TransPharmer demonstrated superior performance compared to other baseline models in tasks involving de novo generation and scaffold elaboration under pharmacophoric constraints. We also highlight TransPharmer’s distinct mode in probing the local chemical landscape surrounding a reference compound, rendering it highly suitable for scaffold-hopping tasks in drug discovery. We further validate the capability of TransPharmer to produce innovative and bioactive ligands through two case studies involving DRD2 and PLK1. Notably, we experimentally tested four generated compounds targeting PLK1, which feature a new series of scaffolds. Among these, three out of four compounds exhibit inhibitory activity below 1 μM, with the most potent one, IIP0943, demonstrating a potency of 5.1 nM (4.8 nM for the reference PLK1 inhibitor). Furthermore, IIP0943 exhibits high selectivity for PLK1 compared to other Plks and submicromolar activity in cell proliferation against the HCT116 cell line. TransPharmer thus represents a pharmacophore-based generative model successfully executing scaffold hopping to produce unique compounds with potent bioactivity. The 4-(benzo[b]thiophen-7-yloxy)pyrimidine scaffold of IIP0943 may offer new insights for obtaining improved PLK1 inhibitors.

**Fig. 1: The schematic diagram of TransPharmer architecture.**

Results

In this work, we developed a pharmacophore-based generative model named TransPharmer, which leverages the topological pharmacophore fingerprints of given ligands to guide molecule generation. The workflow and model setups are shown in Fig. 1 and detailed in subsections “Pharmacophore features and fingerprint extraction” and “Model architecture”.

We observed that the pharmacophore fingerprints employed in our study have the potential to establish connections between structurally distinct ligands that exhibit activity towards the same target. Additionally, these fingerprints demonstrate a notable relationship with bioactivity, allowing for the distinction between active and inactive ligands (Supplementary Notes subsection “Pharmacophore fingerprints as fuzzy and interpretable representations”). The unconditional version of TransPharmer demonstrates accurate modeling of chemical space, achieving the top rank among established methods in overall performance in the GuacaMol benchmark⁴⁴, and achieving top 2 ranks in six out of fifteen metrics benchmarked in MOSES⁴⁵ (Supplementary Notes subsection “Benchmarking the unconditional TransPharmer and other evaluations”).

The results in this section are organized as follows: first, we evaluate the performance of TransPharmer on two tasks involving pharmacophore-constrained molecule generation. Secondly, we compare the unique mode of chemical space exploration of TransPharmer with a previous method based on structure mutations. Thirdly, we demonstrate the capability of TransPharmer to generate active ligands through a retrospective case study of recalling known DRD2 actives distinct from the ones seen during training. Lastly, we highlight the potential of TransPharmer in a prospective case study for discovering potent and highly selective PLK1 inhibitors with scaffolds different from previous ones.

Pharmacophore-constrained molecule generation

One of the central objectives for pharmacophore-conditioned generative models is to generate molecules conforming to the desired pharmacophores, which entails two aspects. Firstly, basic attributes of the pharmacophores of the generated molecules should match those of the target, such as the number of individual pharmacophoric features. Generating molecules with the requisite number of pharmacophoric features has been an essential objective^37,38,46. Here, we computed the averaged difference in the number of individual pharmacophoric features of generated molecules with respect to the target pharmacophores (referred to as D_count, see definition in Section “Evaluation metrics”). Secondly, the targeted pharmacophore and the generated molecule’s pharmacophore should have a high degree of overall similarity. Similar to measuring molecular similarity using fingerprints such as Morgan fingerprints⁴⁷, pharmacophoric similarity can be calculated by computing the Tanimoto coefficient of two pharmacophoric fingerprints. Here, we adopt ErG fingerprints implemented in RDKit⁴⁸ to measure pharmacophoric similarity (referred to as S_pharma, see definition in Section “Evaluation metrics”), rather than the fingerprints our models are conditioned on, to avoid artificially favoring our models. ErG fingerprints are another type of pharmacophoric fingerprint, introduced by researchers at Lilly, and have demonstrated potential applications for scaffold hopping³². ErG fingerprints show a discernible correlation with the pharmacophoric fingerprints utilized in TransPharmer (Supplementary Fig. 1).
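The two metrics described above can be illustrated with a minimal sketch. It assumes binary fingerprints given as sets of on-bit indices and per-feature counts given as dictionaries; the function names and the exact form of the count deviation (mean summed absolute difference) are illustrative assumptions, since the precise definitions of D_count and S_pharma live in the “Evaluation metrics” section and are not reproduced here.

```python
from typing import Dict, Iterable, Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints (sets of on-bit indices)."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def count_deviation(target: Dict[str, int],
                    generated: Iterable[Dict[str, int]]) -> float:
    """Assumed D_count-like score: mean over generated molecules of the
    summed absolute difference in per-feature counts versus the target."""
    molecules = list(generated)
    total = 0
    for counts in molecules:
        features = set(target) | set(counts)
        total += sum(abs(counts.get(f, 0) - target.get(f, 0)) for f in features)
    return total / len(molecules)
```

A lower count deviation and a higher Tanimoto value both indicate closer agreement with the target pharmacophore. Real-valued fingerprints would need a continuous generalization of the Tanimoto coefficient, which this sketch does not cover.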

We compare our models with LigDream³⁵, PGMG³⁹, and DEVELOP³⁴ as baselines in both tasks of de novo generation and scaffold elaboration. As pharmacophoric feature counts have been utilized as explicit controls over the generated molecules^37,38,46, we establish another baseline by training a “TransPharmer-count” model that only accepts the requirement of desired amounts of individual features. Furthermore, to investigate the effect of the length of pharmacophoric fingerprints used in our model, three variants of TransPharmer were examined: “TransPharmer-72bit”, “TransPharmer-108bit” and “TransPharmer-1032bit”. These variants are conditioned on 72-bit, 108-bit, and 1032-bit pharmacophoric fingerprints, respectively.

For the de novo generation task (Table 1), TransPharmer models outperform the baseline models by generating molecules with higher pharmacophoric similarity. Notably, the TransPharmer-count model achieves the lowest deviation in feature counts, while the TransPharmer-1032bit model ranks second lowest. PGMG is not directly comparable with the other methods, since it is primarily designed to align with a specific subset of pharmacophore features (specifically, 3–7 features), whereas models such as TransPharmer aim to generate molecules that satisfy the entire set of pharmacophore features of a reference compound. We therefore re-evaluated TransPharmer using the match score utilized in PGMG and found that the match scores achieved by TransPharmer are close to those of PGMG, with the smallest difference being less than 10% of PGMG’s score (Supplementary Table 1). Meanwhile, it is worth noting that PGMG is sensitive to the maximum number of input pharmacophore features specified by users, which leads to a notable deviation in molecular sizes compared to the desired targets, particularly when reference compounds possess flexible conformations (Supplementary Table 2). Further discussion can be found in the Supplementary Notes.

Table 1 Results of the pharmacophore-constrained de novo generation task and scaffold elaboration task

In the scaffold elaboration task, the four TransPharmer models generated molecules with substantially higher pharmacophoric similarity than those of DEVELOP. DEVELOP appears to have limitations in adhering to the provided pharmacophore conditions, often generating molecules unrelated to and much larger than the reference compound (Supplementary Tables 3 and 4). Among the four TransPharmer models evaluated, the TransPharmer-1032bit model achieves the highest similarity score. The TransPharmer-count model is slightly better than the TransPharmer-72bit model in mean pharmacophoric similarity, but its variance is larger. The deviation of feature counts is similar to that in the de novo generation task.

These findings suggest the benefits of employing pharmacophoric fingerprints that explicitly encode the topology of pharmacophores. In comparison to the 3D voxels of pharmacophoric points encoded by convolutional layers in LigDream or DEVELOP, pharmacophoric fingerprints offer more distinct instructions for molecule generation and may avoid ambiguous guidance resulting from insufficient training of the convolutional neural networks. When compared to simplified condition vectors like feature counts, pharmacophoric fingerprints encompass comprehensive information regarding the topology of pharmacophores, thereby providing superior guidance. In contrast to the use of pharmacophore graphs of selected features in PGMG, TransPharmer exhibits superior control over global molecular properties, such as molecular weight and the number of heavy atoms, resulting in improved sampling efficiency. An ablation study further showed that the incorporation of topological distance information and feature combinations into the pharmacophore fingerprint substantially contributes to TransPharmer’s overall performance (Supplementary Notes “Ablation study” and Supplementary Table 15). For the de novo generation task, removing the topological distance information decreased the pharmacophore similarity score from 0.50 to 0.38, and removing both the topological distance information and feature combinations further decreased it to 0.31. For the scaffold elaboration task, removing both the topological distance information and feature combinations decreased the pharmacophore similarity score from 0.70 to 0.55.

Our analysis also reveals that TransPharmer models with longer pharmacophoric fingerprints consistently generate molecules with higher similarity to the target pharmacophore (Table 1), which conforms to our motivation to obtain fine-grained representations of pharmacophore. Moreover, these models generated molecules that were more similar to the conditioning compound in terms of topological structure and had a relatively higher repetition rate (Supplementary Table 5). Depending on specific needs, the flexibility of TransPharmer allows users to choose which model is most suitable for their intended applications. Overall, the excellent performance and flexibility of TransPharmer make it a viable option for a wide range of scenarios such as novel hits discovery or lead optimization.

Exploring local chemical space

Efficiently exploring the vast chemical space remains a challenging task in drug discovery. One common approach is to start with a few compounds and search their neighborhood but the exploring direction can be quite arbitrary. Molecular similarity-constrained exploration/optimization is one of the widely adopted ways to identify compounds with the desired similarity level to the starting compound^49,50,51,52. In the previous section, we demonstrated that TransPharmer can efficiently explore the local chemical space in a pharmacophore-constrained fashion. Here, we compare the exploring mode of TransPharmer with those of molecular similarity-constrained methods, using a specific starting compound as a showcase, and illustrate the significance of this exploring mode in drug discovery.

We used Onvansertib, a known inhibitor of PLK1⁵³, as the starting compound to provide a target pharmacophore. STONED⁵¹, which can perform molecular similarity-constrained exploration by altering the given compound structure, was used for comparison. STONED can rapidly traverse the target neighborhood in the chemical space by mutating the characters of the SELFIES string of the starting compound. Apart from the default setting, STONED can be tuned to produce highly similar structures to the starting compound by restricting the mutation area to the terminal 10% interval of the SELFIES string.⁵¹ STONED in the default and tuned settings are referred to as “STONED” and “STONED-focused”, respectively, and the details of each setting can be found in Section “The settings of compared methods”. We evaluated five models, including STONED in two settings and three TransPharmer models (72-bit, 108-bit, and 1032-bit), by sampling 10,000 non-duplicate chemical structures and obtaining their pharmacophoric similarity and molecular similarity distributions with respect to the starting compound. The molecular similarity is given by the Tanimoto coefficient of Morgan fingerprints with a radius of 2 implemented by RDKit⁴⁸.

Figure 2 shows that the molecular and pharmacophoric similarity scores of the generated molecules from STONED tend to approach the same ends of the scoring range, while those of TransPharmer-72bit are distributed near the opposite sides (Fig. 2b, e). In other words, molecules generated by TransPharmer-72bit can be topologically dissimilar but pharmacophorically similar to the starting compound, whereas molecules from STONED are either similar in both molecular structure and pharmacophore to the starting compound, or dissimilar in both aspects (see some examples in Fig. 2g). TransPharmer can also produce structurally and pharmacophorically similar structures by using more fine-grained fingerprints (Fig. 2c, f).

**Fig. 2: Chemical space exploration around Onvansertib.**

Plotting the averaged scores of each model in the local chemical space spanned by the two similarity axes (Fig. 2d) illustrates that TransPharmer and STONED explore the chemical space in different directions and regions. Molecular similarity-constrained methods like STONED traverse along the diagonal, while pharmacophore-constrained methods like TransPharmer traverse along a line that is close to horizontal. In addition to providing new directions to explore, TransPharmer models have a unique potential to discover structurally distinct molecules while maintaining high pharmacophoric similarity (at the bottom right corner in Fig. 2d), which is essential for molecular optimization in practice, such as scaffold hopping.
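To make the two-axis picture concrete, the sketch below labels a generated molecule by its region in the (pharmacophoric similarity, molecular similarity) plane and computes a model's averaged position. The cutoff of 0.5 and the region labels are illustrative assumptions, not values taken from the study.

```python
from statistics import mean
from typing import List, Tuple


def exploration_region(pharma_sim: float, mol_sim: float,
                       cutoff: float = 0.5) -> str:
    """Label a molecule by where it falls relative to the starting compound.
    The cutoff is a hypothetical choice for illustration only."""
    high_pharma = pharma_sim >= cutoff
    high_mol = mol_sim >= cutoff
    if high_pharma and not high_mol:
        return "scaffold-hop candidate"  # same pharmacophore, new structure
    if high_pharma and high_mol:
        return "close analogue"
    if not high_pharma and not high_mol:
        return "unrelated"
    return "similar structure, different pharmacophore"


def centroid(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Averaged (pharmacophoric, molecular) similarity of a set of molecules."""
    return (mean(p for p, _ in points), mean(m for _, m in points))
```

Under this labeling, a similarity-constrained method would mostly populate the "close analogue" and "unrelated" regions, while a pharmacophore-constrained method would be expected to populate the "scaffold-hop candidate" region.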

Case study of DRD2

DRD2 is a well-studied target for which many active compounds have been reported. Although ligands with known bioactivities exist, the pursuit of novel ligands with improved characteristics, such as better binding affinity or ADME/T properties, remains ongoing. Therefore, it is essential for generative models to be able to discover active ligands with novel structures, unrestricted by previously observed ligands.

A retrospective experiment was conducted to assess TransPharmer’s ability to discover distinct and active ligands. Known DRD2 active ligands were divided into two subsets using scaffold clustering (see Section “Settings for DRD2 recall experiment”), with an average molecular similarity across these subsets of around 0.2. One subset is visible to TransPharmer during training, while the other subset is excluded from the training set. Upon completion of the training, active ligands in the training set were encoded into 72-bit pharmacophoric fingerprints and used by TransPharmer as “active conditions” for molecule generation. The retrieval of the reserved active ligands was then examined. This experimental setup mimics a common but challenging scenario in drug discovery: uncovering bioactive ligands with novel scaffold series given the known active ligands. For comparison, another set of molecules from the training set that are unrelated to DRD2 was used as “baseline conditions” by TransPharmer (Fig. 3b). The comparison between using DRD2 actives as conditions (active conditions) and using baseline conditions aims to demonstrate the difficulty of this task and the consistency of TransPharmer.

**Fig. 3: The retrospective experiment for DRD2.**

The performance of TransPharmer to retrieve active ligands in the reserved subset was evaluated in two aspects. Firstly, the recall rate was calculated for all generated molecules, demonstrating the maximum potential of TransPharmer to discover active ligands under ideal conditions. However, considering the limited budgets for experimental testing, in reality, the precision of generative models is also important. In this context, we assessed the (apparent) precision of TransPharmer by enumerating active ligands found within a smaller set of repeatedly generated molecules, specifically 4000 molecules in this experiment. These molecules were generated with a higher sampling probability, indicating a greater confidence for TransPharmer to produce them during the initial sampling phase. The precision is apparent because we only search for known active ligands within the generated set, and the remaining portion likely contains potentially active ligands. Note that 4000 was chosen to be comparable to the number of active ligands unseen by TransPharmer.

Our model rediscovered 4.95% of the active ligands in the unseen subset with sufficient sampling when conditioned on seen active ligands, compared to 0.88% when using baseline conditions (Fig. 3c). If generated molecules highly similar (Tanimoto similarity over 0.8) to any of the active ligands in the unseen subset are considered successful recalls as well, a recall rate of up to 12.1% is observed, consistently higher than that of using baseline conditions (3.2%; Supplementary Table 6). As for the precision number, up to 15 active ligands in the unseen subset were recalled with a molecular similarity requirement of ≥0.8, which is 7-fold higher than that of using baseline conditions (Supplementary Table 6). Upon inspecting some recalled active ligands and their most similar counterparts in known active ligands, we observed that in some cases TransPharmer appeared to take shortcuts, such as borrowing subgraphs from seen molecules or making modifications based on them. However, TransPharmer was also able to rediscover ligands that are structurally distinct from any active ligands it had seen (Fig. 3d).
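The exact and similarity-relaxed recall rates reported above can be sketched as follows. The sketch assumes molecules are identified by canonical SMILES strings and that a pairwise similarity function is supplied by the caller (for example, a Tanimoto similarity on Morgan fingerprints); the function names are hypothetical, and the `>=` comparison is one reading of the 0.8 threshold.

```python
from typing import Callable, Set


def exact_recall(generated: Set[str], held_out_actives: Set[str]) -> float:
    """Fraction of held-out actives that were generated exactly."""
    return len(generated & held_out_actives) / len(held_out_actives)


def relaxed_recall(generated: Set[str],
                   held_out_actives: Set[str],
                   similarity: Callable[[str, str], float],
                   threshold: float = 0.8) -> float:
    """Fraction of held-out actives with at least one generated molecule
    at or above the similarity threshold (exact matches count, provided
    similarity(x, x) == 1)."""
    hits = sum(
        1 for active in held_out_actives
        if any(similarity(g, active) >= threshold for g in generated)
    )
    return hits / len(held_out_actives)
```

The apparent precision described earlier would apply the same hit counting to the smaller set of repeatedly generated molecules instead of the full generated set.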

As previously stated, our search is limited to known active ligands within the generated set and the remaining portion likely contains potentially active ligands. This approach provides a conservative estimate of the proportion of generated molecules that exhibit activity towards DRD2. In order to obtain a more precise estimation, we conducted a virtual screening experiment following the DeepDrugCoder⁵⁴. A DRD2 predictive model was established to predict the probability of a generated molecule exhibiting bioactivity towards DRD2 (more details in Section “DRD2 QSAR model”). We then randomly sampled 100 known active compounds from the reserved test set of the QSAR model. These compounds were used as conditions by TransPharmer to sample 256 times per active compound. The fraction of 25,600 generated SMILES strings that are valid, unique, and predicted to be active (with a predicted probability ≥ 0.5) was then computed to compare with the results of DeepDrugCoder.

We found that 27% of the generated molecules were predicted as actives, while DeepDrugCoder’s physchem-based (PCB) model reported a fraction of 54%, and the fingerprint-based (FPB) model reported 19%. Since the PCB model was trained with the additional information about prior predicted bioactivity from the same QSAR model, the high ratio of molecules predicted to be active is not surprising. On the other hand, our model outperforms the FPB model by over 40% in relative terms (27% versus 19%), emphasizing the importance of using pharmacophoric information to identify active compounds. For molecules predicted to be active but not previously identified as DRD2 actives, we assessed structural similarities to their nearest DRD2 active neighbor. The similarity score distribution peaks around 0.4, with 43% of the molecules having a similarity score below 0.4, a commonly used threshold for classifying dissimilar compounds (Supplementary Fig. 2). This suggests a high degree of structural novelty among the generated molecules compared to known DRD2 actives. Overall, these findings highlight TransPharmer’s capability to both rediscover known active ligands and to generate structurally distinct compounds with potential bioactivity.
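The bookkeeping behind these percentages can be sketched as below. The validity check and the QSAR predictor are passed in as callables because their implementations are not described in this section; deduplicating on the raw string (rather than a canonical form) and the function names are assumptions made for illustration.

```python
from typing import Callable, Sequence


def active_fraction(samples: Sequence[str],
                    is_valid: Callable[[str], bool],
                    predict_proba: Callable[[str], float],
                    threshold: float = 0.5) -> float:
    """Fraction of ALL sampled strings that are valid, unique, and
    predicted active (probability >= threshold)."""
    seen = set()
    n_active = 0
    for smiles in samples:
        if not is_valid(smiles) or smiles in seen:
            continue
        seen.add(smiles)
        if predict_proba(smiles) >= threshold:
            n_active += 1
    return n_active / len(samples)


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of one fraction over another."""
    return (ours - baseline) / baseline


# 100 conditions x 256 samples each gives the 25,600 strings in the text.
N_SAMPLES = 100 * 256
# 27% versus 19% is a relative gain of about 42%.
GAIN_OVER_FPB = relative_gain(0.27, 0.19)
```

Because the denominator is the full sample count, invalid and duplicated strings lower the reported fraction rather than being ignored.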

Case study of PLK1

PLK1 plays a key role in mitosis progression and has been implicated in various cellular pathways^{55,56,57,58,59}. Targeting PLK1 has emerged as a promising therapeutic strategy for cancer treatment, as the overexpression of PLK1 has been associated with tumor development and progression^60,61. In this section, we exemplify the application of TransPharmer in the generation of distinct and active PLK1 inhibitors using the topological pharmacophore fingerprint derived from Onvansertib, a potent and selective inhibitor of PLK1 currently undergoing clinical trials (e.g., ClinicalTrials.gov identifier NCT03829410).

One million samples were generated by TransPharmer conditioned on the 72-bit pharmacophore fingerprint of Onvansertib with a low sampling temperature of 0.7. After the removal of invalid SMILES strings and duplicated molecules, a total of 178,103 unique molecules were obtained. To gain insights into the chemical space covered by the training set, the generated molecules (both conditionally and unconditionally), and the known PLK1 active ligands, a t-distributed stochastic neighbor embedding (t-SNE) plot was generated. As shown in Fig. 4a, TransPharmer shifted from the broader chemical space of the training set to the localized chemical space surrounding Onvansertib, which appears as an “outlier”, drifting apart from other PLK1 active ligands and the majority of training molecules. The similarity distributions between the generated molecules and Onvansertib also confirm the bias of TransPharmer towards the target pharmacophore, with a median pharmacophoric similarity of 0.92, and demonstrate the capability of TransPharmer to explore distinct structures, with a median molecular similarity of 0.28 (Fig. 4b).
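The effect of a temperature below 1 can be illustrated with a small sketch of temperature-scaled sampling over next-token logits. This is the standard formulation and is assumed here; how TransPharmer applies the temperature internally is not described in this section, and the function names are illustrative.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float],
                             temperature: float = 0.7) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature.
    Temperatures below 1 sharpen the distribution toward likely tokens."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_token(logits: List[float], temperature: float = 0.7,
                 rng=random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Sharper distributions make repeated outputs more likely, which is consistent with roughly 17.8% of the one million samples (178,103) surviving the validity and uniqueness filters.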

**Fig. 4: Generation of a virtual compound library against PLK1 using TransPharmer and compound prioritization.**

We then carried out virtual screening against the generated compound library to identify drug-like hit compounds targeting PLK1. First, molecules exhibiting pharmacophoric similarity to Onvansertib below 0.85 were eliminated. Second, Lipinski’s rule of five⁶² with the maximum allowed molecular weight set to 1000, and medicinal chemistry filters⁴⁵ were applied to retrieve drug-like generated molecules. Third, molecules containing the same pyrazolo-quinazoline core as Onvansertib were removed. While TransPharmer often produces distinct structures, it also tends to reproduce the moieties of the reference compound that best satisfy the conditional pharmacophore fingerprint. The remaining compounds were then docked into the ATP-binding pocket of PLK1 using Glide in standard precision mode⁶³. Polar interactions (hydrogen bonds or salt bridges) between ligands and residues Lys82, Cys133, Glu140, and Asp194 were examined using PLIP⁶⁴. Compounds with a docking score better than -9.0 kcal/mol and forming polar interactions with at least two key residues (where hinge region residue Cys133 is requisite) were kept. These molecules then underwent a two-step clustering process. First, identical Bemis-Murcko scaffolds⁶⁵ were grouped; second, the scaffolds were clustered using the Butina algorithm⁶⁶, efficiently implemented in chemfp⁶⁷, with Morgan fingerprints⁴⁷ (radius 2, 2048-bit) as molecular descriptors and a distance threshold of 0.1. Then, 2300 representative members from each cluster with the best docking score and ligand efficiency (docking score divided by molecular weight) were selected and docked into PLK1 again using Glide in extra precision mode⁶⁸. Upon completion of the docking, the molecules were ranked based on their overall performance, considering docking score, ligand efficiency, and the binding mode of the No.1 pose.
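The numeric part of this screening cascade can be sketched as a simple filter over precomputed values. The `Candidate` record and its field names are hypothetical; the thresholds (pharmacophoric similarity of at least 0.85, docking score better than -9.0 kcal/mol, polar contacts with at least two key residues including Cys133) come from the text. The drug-likeness filters, core removal, clustering, and the second docking round are omitted, and the sort order is an assumption, since the text ranks by "overall performance".

```python
from dataclasses import dataclass
from typing import FrozenSet, List

KEY_RESIDUES = frozenset({"Lys82", "Cys133", "Glu140", "Asp194"})


@dataclass(frozen=True)
class Candidate:
    name: str
    pharma_sim: float               # similarity to the reference compound
    docking_score: float            # kcal/mol, more negative is better
    mol_weight: float
    polar_contacts: FrozenSet[str]  # residues with H-bonds or salt bridges


def passes_filters(c: Candidate) -> bool:
    """Apply the similarity, docking, and key-residue criteria."""
    contacts = c.polar_contacts & KEY_RESIDUES
    return (c.pharma_sim >= 0.85
            and c.docking_score < -9.0
            and "Cys133" in contacts
            and len(contacts) >= 2)


def ligand_efficiency(c: Candidate) -> float:
    """Docking score divided by molecular weight, as defined in the text."""
    return c.docking_score / c.mol_weight


def shortlist(candidates: List[Candidate]) -> List[Candidate]:
    """Keep passing candidates, best docking score first (assumed order),
    with ligand efficiency as the tie-breaker."""
    kept = [c for c in candidates if passes_filters(c)]
    return sorted(kept, key=lambda c: (c.docking_score, ligand_efficiency(c)))
```

Contacts with residues outside the four key residues are ignored by the intersection, so a ligand cannot pass on non-key interactions alone.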

We systematically inspected the top-ranked generated molecules and selected 42 candidate compounds taking into account factors such as synthesizability, novelty, and the diversity of the generated compounds. The comprehensive listing of the molecular structure for these 42 compounds is available in the Supplementary Information (Supplementary Figs. 3 and 4). These compounds were classified into five groups based on their core fragments that potentially bind to the hinge region of the kinase domain of PLK1 (Fig. 4c). A detailed examination of the known PLK1 inhibitors sharing these cores revealed that the majority of them exhibit low bioactivities, with the exception of ligands featuring core 2, displaying moderate to high bioactivities. Notably, 2/3 (28 out of 42) of our generated compounds carry core 1, whereas only one known active ligand features this core. This underscores the novelty of the new scaffolds containing core 1 as potential PLK1 inhibitors. Subsequently, these 42 compounds underwent binding free energy estimation using MM/GBSA^69,70,71,72 and evaluation of binding stability through 100 ns MD simulation. Among them, four compounds were selected based on their estimated binding free energies and consistent binding behavior within the pocket, of which three compounds carried core 1 while one compound featured core 2.

These four compounds, namely lig-3, lig-182, lig-524, and lig-886, were subjected to chemical synthesis. Several minor modifications were made to the generated structures due to the intricacies of chemical synthesis and the need to address potential metabolic instability. The finally synthesized structures largely adhered to the designed structures by TransPharmer (Fig. 4d). To clarify, these synthesized structures are referred to as IIP0942, IIP0943, IIP0944, and IIP0945, corresponding to the original lig-3, lig-182, lig-886 and lig-524, respectively. The chemical synthesis route of IIP0943 is shown in Fig. 5a; detailed chemical syntheses of all identified compounds are presented in the Supplementary Methods. The synthesis process for IIP0943 starts with the reaction of 3-methoxy-2-nitrobenzaldehyde (943-0) and methyl 2-mercaptoacetate to yield 943-1. Subsequent removal of the methyl group resulted in 943-2. The reaction of 943-2 with 2,4-dichloro-5-methylpyrimidine produced 943-3. The formation of intermediate B occurred through a Buchwald-Hartwig amination reaction between 5-bromo-2-methoxyaniline and 1-methylpiperazine. The intermediate B was then combined with 943-3 in another Buchwald-Hartwig amination reaction, leading to the formation of ester 943-4. Subsequent treatment with ammonia/methanol resulted in the production of the final compound, IIP0943.

**Fig. 5: The synthetic route and the enzymatic inhibition activities of the designed compounds.**

The obtained compounds were then tested for their inhibitory activities against PLK1 kinase. Out of the four tested compounds, three show activities with half maximal inhibitory concentration (IC₅₀) less than 1 μM (Fig. 5b, c). Notably, IIP0943 emerges as the most potent among them, with an IC₅₀ value of 5.1 ± 1.7 nM against PLK1, while Onvansertib exhibits an IC₅₀ of 4.8 ± 0.7 nM. The confidence intervals of IC₅₀ values can be found in Supplementary Table 7. To investigate the selectivity of these compounds, the IC₅₀ values against other Plks and FAK kinase were determined for the two most potent compounds, namely IIP0943 and IIP0942. The inclusion of FAK kinase was prompted by the identification of a potent FAK inhibitor, BI-4464, which exhibits structural similarity to and forms a comparable binding pose to IIP0943 (PDB ID: 6I8Z). This similarity was revealed during our molecular novelty assessment, where we searched for analogous compounds to IIP0943 in the ChEMBL database (Section “Molecular novelty assessment” and Supplementary Fig. 8).

The results indicate that both IIP0942 and IIP0943 exhibit excellent selectivity towards PLK1 within the PLK family (Table 2). IIP0943 also shows moderate inhibition against FAK, with an IC₅₀ of 264 ± 32 nM, which is over 50-fold less potent than its inhibitory effect against PLK1. IIP0942 also exhibits an IC₅₀ of 87.4 ± 11.1 nM against FAK, with an over two-fold selectivity for PLK1. These compounds were further tested on the HCT116 cell line and IIP0943 showed cell proliferation inhibition with an IC₅₀ of 0.22 ± 0.003 μM (Fig. 6a and Supplementary Fig. 5).

Table 2 Enzymatic activity of IIP0942, IIP0943, and Onvansertib against PLK1/2/3 and FAK, respectively


**Fig. 6: The cellular inhibition activity and the docking pose of the generated compound IIP0943.**

To understand the potency and selectivity of IIP0943, we docked it into the ATP-binding pocket of PLK1. The 4-(benzo[b]thiophen-7-yloxy)pyrimidine core of IIP0943 resides between Cys67 and Phe183 (not depicted due to space constraints). The 5-methyl group in the 2-aminopyrimidine is accommodated by a hydrophobic pocket formed by Ala80, Val114, and Leu130 (Fig. 6c). Four hydrogen bonds are formed: the 2-aminopyrimidine moiety forms two hydrogen bonds with the backbone NH and C=O groups of the hinge region Cys133; the amide group forms one hydrogen bond with the side chain of Lys82, and another hydrogen bond with Asp194 in the DFG motif. Under physiological conditions, the 4-methylpiperazino moiety becomes protonated, forming a salt bridge with Glu140. This interaction is believed to contribute to the discernible selectivity for PLK1 over PLK2 and PLK3, since the same type of interaction is hampered in both PLK2 and PLK3, where Glu140 is replaced by histidine⁵³,⁷³,⁷⁴. The superposition of the docking pose of IIP0943 and the crystal pose of Onvansertib revealed a noteworthy distinction in the orientation of IIP0943’s benzo[b]thiophene-2-carboxamide moiety (Fig. 6d). This moiety points towards residues in the back cleft from a different angle, which could potentially be compensated by the flexible side chain of Lys82.

Finally, we performed a comprehensive novelty assessment of the identified hit compounds. By searching for structural analogs in public databases, we confirmed the novelty of the designed PLK1 inhibitors, particularly IIP0943, across three levels: within known PLK1 active compounds, within reported bioactive ligands, and within patented molecules (refer to Supplementary Notes “Molecular novelty assessment of the discovered hits”). This evaluation highlights the capability of TransPharmer to make meaningful contributions to real-world drug discovery efforts.

Discussion

More on chemical space exploration

Yoshimori and colleagues discussed the distinction between structure- and pharmacophore-steered molecular generation in their reinforcement learning-based approach⁷⁵. They compared two agent networks, each guided by rewards based on either molecular similarity or pharmacophoric similarity to a known ligand associated with the target of interest. One notable observation was that the agent guided by the molecular similarity reward successfully generated a larger number of molecules exhibiting topological similarities to the reference ligand, but essentially failed to produce any molecules with a satisfactory pharmacophoric score. This finding implies the inherent limitations of molecular similarity-constrained methods when it comes to exploring the local chemical space.

In our study, we discovered that methods focused on generating structurally analogous compounds could yield molecules that share similarities in both topological structure and ligand pharmacophores. This finding is rational since the concept of ligand pharmacophore is rooted in molecular structure. Moreover, we made an intriguing observation that these two modes of exploration can be complementary, covering distinct regions within the local chemical space. They can also overlap when a fine-grained pharmacophoric representation is employed along with a high molecular similarity cutoff.

Potential biases in the case study of PLK1 inhibitors design

We think the following aspects might introduce biases that could affect the current results in the case study of designing PLK1 inhibitors.

During molecule generation

(a) Input pharmacophore fingerprint/reference compound. Since TransPharmer is a conditional generative model, the choice of input condition (pharmacophore fingerprint of the reference ligand) could be the largest source of bias in this work. We selected Onvansertib as the reference ligand for its potency and high selectivity towards PLK1, as well as its recent activity in clinical trials. We also used its follow-up derivatives (compounds 13 and 25)⁷⁶ as inputs for TransPharmer to generate compounds in our in-house tests and observed slight variations in chemical space coverage (visualized by t-SNE plots). We surmise that using pharmacophore fingerprints from structurally dissimilar PLK1 inhibitors would result in significant differences in generated compounds. (b) Model hyperparameters. One key hyperparameter is the sampling temperature (t). This parameter re-weights the multinomial distribution of each token in generated SMILES strings, with lower temperatures increasing the relative probability of the top-ranked tokens. In our tests, a higher sampling temperature (t = 1.2) improved diversity but also significantly reduced the performance of top-ranked compounds compared to a default lower temperature (t = 0.7). We expect that intermediate temperatures could improve diversity without losing much performance. We kept the sampling temperature at 0.7 in our work, but users may explore this hyperparameter further in their own studies by adjusting it in the configuration file.
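
The effect of the sampling temperature can be illustrated with a short sketch. It assumes the usual formulation in which next-token logits are divided by t before the softmax; the function names and the four-token logits are hypothetical and are not taken from the TransPharmer code.

```python
import math
import random


def temperature_probs(logits, t):
    """Softmax of logits / t; lower t concentrates mass on top-ranked tokens."""
    if t <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, t, rng=random):
    """Draw one token index from the temperature-scaled multinomial."""
    probs = temperature_probs(logits, t)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
p_low = temperature_probs(logits, 0.7)   # sharper distribution
p_high = temperature_probs(logits, 1.2)  # flatter distribution
```

With these logits, the top-ranked token receives more probability at t = 0.7 than at t = 1.2, and the lowest-ranked token receives less, which matches the diversity versus performance trade-off described above.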

During virtual screening

(a) Novelty filters. Although TransPharmer can generate novel compounds, chemical structures that are highly similar to the reference compound also appeared in the generated set as they easily fulfilled the target pharmacophore condition. To avoid the overrepresentation of these similar compounds among top-ranked compounds, we used SMARTS patterns to retain structurally novel compounds. In the PLK1 case study, we used the pattern c12ncncc1-CC-[n,c]3:[n,c]:[n,c]:[n,c]:[n,c]32 to filter out molecules with scaffolds similar to the pyrazolo-quinazoline core of Onvansertib. This novelty filter was very effective, but a different SMARTS pattern may significantly impact the results. (b) Target binding mode. Since we used Onvansertib to provide the input pharmacophore fingerprint, we focused on four polar interactions between Onvansertib and PLK1: hydrogen bonds with Cys133, Lys82, and Asp194, and a salt bridge with Glu140. Generated compounds were scored based on the occurrence of these interactions in their docked complex with PLK1, with a mandatory requirement for forming hydrogen bonds with Cys133. Different scoring criteria might yield different outcomes.

During manual inspection

After the virtual screening, a ranked list of 2300 generated compounds was cherry-picked to retain 42 promising compounds for further evaluation. We specifically focused on some aspects of Onvansertib during our visual inspection and cherry-picking. These may constitute potential biases in the case study of PLK1 as well. (a) Core region. One objective of the PLK1 case study was to identify compounds with distinct scaffolds from the pyrazolo-quinazoline core moiety of Onvansertib, which is patented (such as WO2008074788). During the manual inspection of the generated compounds, we prioritized novel scaffolds while tolerating those containing Onvansertib’s 1-phenylpiperazine moiety. (b) 3D shape. When inspecting the docking poses, we tended to select compounds adopting a similar U-shape pose to Onvansertib (PDB ID: 2YAC), although we also considered promising molecules with different binding poses (e.g., L-shape).

Model controllability and interpretability

We conducted three conditional tests to examine the controllability of our TransPharmer model under different conditions: (i) setting all 72-dimensional bits to 0; (ii) setting all 72-dimensional bits to 1; (iii) setting one bit to 1 while keeping the other bits at 0, thereby generating 72 single-activated-bit fingerprints. These fingerprints were then used as condition vectors to generate molecules. As expected, the outcomes of conditions (i) and (ii) were random. For condition (iii), we collected the frequency of each bit under the generation conditions. The resulting heatmap is shown in Supplementary Fig. 10. According to the results, our model can be effectively controlled by single-activated-bit conditions, producing molecules that are enriched in features corresponding to the high-frequency activated bits. The low-frequency activated bits, however, exhibited a random activation pattern, indicating that the model struggles to learn these less frequent features, which correspond to relatively rare structural motifs, such as positively or negatively charged groups, zinc-ion-binding moieties, and their combinations. This limitation could potentially be addressed by further fine-tuning with relevant datasets.

To further investigate what the models have learned, we analyzed the attention maps across all transformer blocks and attention heads. Though most attention maps appeared sparse, particularly in higher layers (Supplementary Fig. 6a), we did observe meaningful patterns in some densely activated maps. For example, in Supplementary Fig. 6b, the oxygen atom of the ligand, which is the only hydrogen bond acceptor, activates all the corresponding acceptor-related bits in the pharmacophore fingerprint. However, these observations were limited and did not capture all the relationships between each fingerprint bit and the corresponding molecular features as one might expect.

These may be caused by the following reasons: (1) The attention maps may not capture all parts of a generated compound. The linker atoms between pharmacophore features are not expected to activate attention, which could explain the sparsity of the attention maps. The knowledge required to generate proper linkers likely resides in the auto-regressive probabilistic distribution learned by the decoder, but this information cannot be revealed by attention maps. (2) The information from different positions becomes increasingly mixed in higher layers. This accounts for the distinctions observed in the attention maps in the first couple of layers, while in the higher layers, the attention weights tend to be more uniform. These observations are consistent with findings from previous studies and could potentially be addressed by techniques such as attention rollout or attention flow⁷⁷. However, currently, there are no universal methods for analyzing attention maps. Even in well-defined natural language processing tasks, a technique that works well for one task may fail for another⁷⁷. Therefore, fully understanding how generative models learn requires further study.

On 3D pharmacophores

The complementary nature of protein-ligand interactions as 3D spatial pharmacophores is widely recognized. Ligand-based pharmacophores are analogous to a particular kind of negative image of the binding site. Given a predetermined 2D pharmacophore fingerprint, the compounds generated by our model may satisfy the actual 3D binding pharmacophores. Indeed, by listing all potential pharmacophore topologies within the measured Euclidean distances of two or three points, the 3D spatial pharmacophores may be transformed into a variety of 2D pharmacophore topologies. We think that manually designing a desirable pharmacophore topology is feasible. A set of appropriate pharmacophore topologies can then be discretized into bit fingerprints that serve as conditions to steer molecular generation. Although ligand-based 3D pharmacophores offer one option to guide the generative model, it is still challenging to guarantee that active ligand conformations are generated. To get around this issue, one might predict active ligand conformations using other deep-learning models⁷⁸.

Toward more universal generative models

The pharmacophoric fingerprints utilized in TransPharmer serve as valuable prompts, enabling the model to seamlessly transition between designing ligands for different targets without requiring additional fine-tuning. This capability was demonstrated in the case studies on DRD2 and PLK1, showing that TransPharmer can be readily applied in diverse scenarios.

Compared to other molecular generative models based on GPT-like architectures¹⁸, TransPharmer offers two primary contributions. By prompting with pharmacophoric fingerprints, TransPharmer incorporates prior knowledge into the generation of pharmaceutically relevant compounds, thereby aligning more closely with the goals of medicinal chemists. This approach also paves the way to the development of extensive pharmaceutical generative models that integrate multimodal knowledge alongside basic chemical principles derived from molecular structures⁷⁹. Additionally, TransPharmer leverages the structural hopping properties of pharmacophores to aid in discovering novel compounds with bioactivity against the same pharmaceutical targets.

Nonetheless, several directions can be explored in the future to enhance the model’s versatility and general applicability. First, additional generation modes, such as fragment-linking, should be incorporated alongside de novo generation and scaffold elaboration. Advances in unordered chemical language modeling can directly support these functionalities⁸⁰. Second, generative models that produce easily synthesizable molecules are preferable, as they can accelerate the timeline for wet lab experimental validation⁸¹. Finally, multi-objective optimization should be integrated into the generative process to support more efficient design. Recent advances, such as integrating Pareto optimization with generative models, may help identify novel compounds with a balanced profile⁸².

Methods

Pharmacophore features and fingerprint extraction

The molecular graph is first converted into a fully connected graph of pharmacophore features using the definition of ligand-based pharmacophores from RDKit v2021.9 (BaseFeatures.fdef)⁸³. This definition encompasses eight types of pharmacophore features, including hydrogen-bond acceptors and donors, aromatic rings, moieties possessing positive or negative ionizability, hydrophobic entities, and those associated with Zn ion binding. Detailed patterns for each type are presented in Supplementary Table 8. To derive the pharmacophore fingerprints utilized in TransPharmer, we obtained two-point and three-point combinations of pharmacophore features, as well as the shortest topological distances between each feature pair. The topological distances were discretized into 2-bin (short distances in [0, 3) and long distances in [3, 8)) or 3-bin (short distances in [0, 2), medium distances in [2, 5), and long distances in [5, 8)) signals. When a distance falls within a specific range, the corresponding bit is set to 1, otherwise 0; if the distance exceeds the maximum considered distance, 8 in this study, there will be null signals (00 or 000). For the two-point pharmacophoric features with 2-bin and 3-bin discretization schemes, the lengths of the binary pharmacophore fingerprints obtained are 72 and 108, respectively. For the combination of two- and three-point pharmacophoric features with the 2-bin scheme, the length of the fingerprint is 1032. The fingerprint extraction process was built based on the 2D pharmacophore fingerprint modules implemented in RDKit⁴⁸.
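
The two-point fingerprint lengths quoted above follow from simple counting, which the sketch below reproduces. It assumes that feature pairs are unordered and may repeat the same type, and that every distance bin is half-open, [lo, hi); the function names are hypothetical and this is not the RDKit implementation.

```python
from itertools import combinations_with_replacement

N_FEATURE_TYPES = 8                # feature types stated in the text
BINS_2 = [(0, 3), (3, 8)]          # short, long
BINS_3 = [(0, 2), (2, 5), (5, 8)]  # short, medium, long


def two_point_length(n_types, bins):
    """Number of bits: unordered feature-type pairs times distance bins."""
    n_pairs = sum(1 for _ in combinations_with_replacement(range(n_types), 2))
    return n_pairs * len(bins)


def bin_signal(distance, bins):
    """Bit pattern for one topological distance; all zeros (the null
    signal) when the distance lies beyond the last bin."""
    return [1 if lo <= distance < hi else 0 for lo, hi in bins]


len_2bin = two_point_length(N_FEATURE_TYPES, BINS_2)  # 36 pairs * 2 bins
len_3bin = two_point_length(N_FEATURE_TYPES, BINS_3)  # 36 pairs * 3 bins
```

Eight feature types give 36 unordered pairs, so the 2-bin and 3-bin schemes yield 72 and 108 bits, in agreement with the lengths reported in the text.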

Model architecture

As illustrated in Fig. 1 and Supplementary Fig. 11, TransPharmer receives the pairings of a SMILES string and its extracted pharmacophore fingerprint as two-channel input during training. After segmental encoding and positional encoding, these input embeddings are fed into the Transformer decoder with multi-head self-and-mixed attention blocks to decode the SMILES tokens in the next position. The segmental encoding aims to distinguish between tokens and conditions by using explicit label vectors (0s for pharmacophore fingerprint and 1s for SMILES token). The positional encoding adopts a rotary positional encoding⁸⁴ by multiplying the embedding vectors by the rotation matrix as follows,

$${{Attention}}{(Q,K,V)}_{m}=\frac{{\sum }_{n=1}^{N}{({R}_{{{\Theta }},m}^{d}\phi ({q}_{m}))}^{T}({R}_{{{\Theta }},n}^{d}\varphi ({k}_{n})){v}_{n}}{\mathop{\sum }_{n=1}^{N}\phi {({q}_{m})}^{T}\varphi ({k}_{n})}$$

(1)

where φ(*) and ϕ(*) are usually non-negative functions, and ${R}_{{{\Theta }},m}^{d}$ and ${R}_{{{\Theta }},n}^{d}$ are rotation matrices. This positional encoding was demonstrated to be more compatible with the linear operation in the attention block and to converge faster during training⁸⁴. A slim version of the GPT-3 model⁴² is utilized for the multi-head Transformer decoder. Self-and-mixed attention blocks are adopted for adequate information exchange in order to learn implicit associations. After processing by the Transformer decoder, the final layer outputs the probabilities of the next SMILES tokens using linear transformation and softmax operations. The hyperparameters of TransPharmer are shown in Supplementary Table 9.
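
A small numerical sketch can make the role of the rotation matrices concrete. It assumes the standard rotary scheme in which consecutive pairs of embedding dimensions are rotated by an angle proportional to the token position, with frequencies base^(-2i/d) and base = 10000; these defaults, the function names, and the example vectors are assumptions for illustration, not values confirmed for TransPharmer.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate each consecutive (even, odd) pair of `vec` by
    position * theta_i, where theta_i = base ** (-2 * i / d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(position * theta), math.sin(position * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3 positions), different absolute positions.
score_a = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))
score_b = dot(rotate_pairs(q, 13), rotate_pairs(k, 10))
```

The two scores agree up to floating-point error, showing that the rotated query-key product in the numerator of Eq. (1) depends only on the relative position of the two tokens.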

Data set setup

We use the GuacaMol dataset⁴⁴, which is derived from the ChEMBL24 database and is composed of about 1.6 million unique compounds. The sizes of the training, validation, and testing sets are 1,273,104 (80%), 79,562 (5%), and 238,681 (15%), respectively, for model development and evaluation, following the data splitting of GuacaMol. All TransPharmer models in the pharmacophore-constrained molecule generation tasks were trained on the GuacaMol dataset.

A 108-token vocabulary was first constructed from the SMILES strings from the GuacaMol dataset, which contains ’#’, ’%10’, ’%11’, ’%12’, ’(’, ’)’, ’-’, ’1’, ’2’, ’3’, ’4’, ’5’, ’6’, ’7’, ’8’, ’9’, ’<’, ’=’, ’B’, ’Br’, ’C’, ’Cl’, ’F’, ’I’, ’N’, ’O’, ’P’, ’S’, ’[B-]’, ’[BH-]’, ’[BH2-]’, ’[BH3-]’, ’[B]’, ’[Br-]’, ’[Br+2]’, ’[C+]’, ’[C-]’, ’[CH+]’, ’[CH-]’, ’[CH2+]’, ’[CH2]’, ’[CH]’, ’[Cl+]’, ’[Cl-]’, ’[Cl+3]’, ’[Cl+2]’, ’[F-]’, ’[F+]’, ’[H]’, ’[I+]’, ’[I+2]’, ’[I+3]’, ’[IH2]’, ’[IH]’, ’[I-]’, ’[N+]’, ’[N-]’, ’[NH+]’, ’[NH-]’, ’[NH2+]’, ’[NH3+]’, ’[N]’, ’[O+]’, ’[O-]’, ’[OH+]’, ’[O]’, ’[P-]’, ’[P+]’, ’[PH+]’, ’[PH2+]’, ’[PH]’, ’[S+]’, ’[S-]’, ’[SH+]’, ’[SH]’, ’[Se-]’, ’[Se+]’, ’[SeH+]’, ’[SeH]’, ’[Se]’, ’[SeH2]’, ’[Si-]’, ’[SiH-]’, ’[SiH2]’, ’[SiH]’, ’[Si]’, ’[SH-]’, ’[b-]’, ’[bH-]’, ’[c+]’, ’[c-]’, ’[cH+]’, ’[cH-]’, ’[n+]’, ’[n-]’, ’[nH+]’, ’[nH]’, ’[o+]’, ’[s+]’, ’[sH+]’, ’[se+]’, ’[se]’, ’b’, ’c’, ’n’, ’o’, ’p’ and ’s’. After removing less frequent tokens (including ’[Br+2]’,’[Br-]’,’[Cl+2]’,’[Cl+3]’,’[Cl+]’,’[Cl-]’,’[F-]’,’[I+2]’,’[I+3]’,’[I-]’,’[P-]’,’[SH-]’,’[Se-]’,’[SeH2]’), a 94-token vocabulary is used to process SMILES strings from different sources. Those containing tokens outside the vocabulary were removed.
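
The vocabulary filter described above can be sketched with a regular-expression tokenizer. The pattern below is an assumption about how tokens are split (bracket atoms, two-digit ring closures, two-letter halogens, then single characters), and `VOCAB` is a small illustrative subset rather than the 94-token vocabulary.

```python
import re

# Bracket atoms, %NN ring closures, Br/Cl, then any other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|%\d{2}|Br|Cl|.")

# Illustrative subset only; the real vocabulary has 94 tokens.
VOCAB = {"C", "N", "O", "c", "n", "(", ")", "=", "1", "2",
         "Cl", "[nH]", "[O-]", "[N+]"}


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def in_vocabulary(smiles, vocab=VOCAB):
    """True if every token of the SMILES string belongs to the vocabulary."""
    return all(token in vocab for token in tokenize(smiles))


pyrrole_tokens = tokenize("c1cc[nH]c1")
kept = [s for s in ["CC(=O)Nc1ccccc1", "C[Cl-]"] if in_vocabulary(s)]
```

Here the second string is dropped because `[Cl-]` is one of the rare tokens removed from the vocabulary.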

8323 DRD2 actives were collected from ExCAPE-DB⁸⁵, and 7939 were left after the elimination of invalid SMILES strings (those that cannot be parsed by RDKit) and duplicate structures (those sharing the same canonical SMILES string). Over 40,000 DRD2 inactives were also downloaded from ExCAPE-DB, and 7939 of them were randomly sampled for visualization and comparison with actives. TransPharmer in the recall experiment of DRD2 actives was retrained on the merged dataset of GuacaMol and DRD2 actives, as described in Section “Settings for DRD2 recall experiment”.

3873 entries of PLK1 actives were also collected from the ExCAPE-DB database and all of them have valid and non-duplicate SMILES strings. Each entry contains the molecular structure in SMILES format and the bioactivity record in the uniform expression of “pXC50” (e.g., pIC50 or pEC50. pIC50 = 9 corresponds to an IC₅₀ value of 1 nM).
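
The pXC50 scale is the negative base-10 logarithm of the molar concentration, so the conversion to and from nanomolar values is a one-liner. The helper names below are hypothetical.

```python
import math


def pxc50_to_nM(pxc50):
    """Convert pXC50 (-log10 of the molar concentration) to nM."""
    return 10 ** (9 - pxc50)


def nM_to_pxc50(conc_nM):
    """Convert a concentration in nM to pXC50."""
    return 9 - math.log10(conc_nM)


one_nM = pxc50_to_nM(9)      # pIC50 = 9 corresponds to 1 nM
one_uM = pxc50_to_nM(6)      # pXC50 = 6 corresponds to 1000 nM (1 uM)
```

On this scale, a pXC50 of 6 marks the 1 μM boundary.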

Settings for pharmacophore-constrained molecule generation

Three-hundred compounds (referred to as conditioning compounds) were randomly selected from the reserved test set, and each model used their pharmacophoric information to guide the de novo generation of novel molecules. For the task of scaffold elaboration, each conditioning compound is fragmented into two parts by breaking a random acyclic single bond between two non-hydrogen atoms. One fragment of the conditioning compound is chosen arbitrarily as the core or starting fragment, while the other becomes a reference elaboration. Using the core fragments as starting points, each model performs scaffold elaboration guided by the pharmacophoric information of the reference fragments. For both tasks, each model attempts to generate 600 molecules for every conditioning compound, and invalid and duplicate molecules are filtered out before further evaluation. Detailed parameter settings for the three external baseline models can be found in the Section “The settings of compared methods”.

The settings of compared methods

LigDream³⁵

LigDream can generate novel molecules guided by the three-dimensional shape and pharmacophoric features of a reference compound. LigDream contains a shape variational autoencoder (VAE), which encodes a voxelized 3D molecular structure into its latent code and reconstructs the voxelized compound representation from it, and a shape captioning recurrent neural network (RNN), which decodes the voxelized representation to the SMILES of a specific molecule. The LigDream authors found that the VAE reparametrization factor λ and the RNN probabilistic sampling can provide different sources of sampling variability. In this study, we set the reparametrization factor λ to 1.0 and turned off the RNN probabilistic sampling as suggested. The model weight was obtained from their public repository (https://github.com/compsciencelab/ligdream).

PGMG³⁹

PGMG receives a fully connected graph containing selected pharmacophore features. This graph is encoded using a Gated Graph Convolutional Network to obtain an embedding vector, which is subsequently decoded into SMILES strings using transformer encoder-decoder blocks. In this study, following the training process of PGMG, a pharmacophore hypothesis was constructed by randomly selecting 3–7 pharmacophore features for each test case molecule, and the 3D coordinates for each feature were obtained from the molecular conformation embedded using the ETKDG⁸⁶ method, as implemented in RDKit. We utilized the pretrained PGMG (accessible at https://github.com/CSUBioGroup/PGMG) to generate 600 samples for each pharmacophore hypothesis. For other exploration settings and evaluation of PGMG, please refer to Table S2 in the supplementary materials.

DEVELOP³⁴

DEVELOP integrates pharmacophoric information of the regions to be explored into the process of fragment linking or scaffold elaboration and has shown broad potential in scenarios such as PROTAC design or R-group optimization. We utilized the scripts provided by the authors of DEVELOP to prepare the pharmacophore information and index files and perform the required preprocessing for our testing data. We loaded the pretrained model weights (accessible at https://github.com/oxpig/DEVELOP) and adopted the default parameters during generation following the instructions of the setting used to generate molecules with the same number of atoms as the reference molecule.

STONED⁵¹

Superfast traversal, optimization, novelty, exploration, and discovery (STONED) is an algorithm that can perform local chemical subspace exploration around a target molecule and other functionalities. STONED achieves these by modifying the SELFIES²¹ string representation of the reference molecules. The amount and location of the modified characters have different effects on the similarity between the resultant structures and the original ones. Basically, restricting the amount or the location of the SELFIES changes to either the initial or the terminal region yields similar mutated structures⁵¹.

In this study, we utilized this feature of STONED to explore the local chemical space around a target molecule in two different modes. The first one is the default mode (“STONED”) which allows the mutation positions to be chosen randomly and the number of mutations up to 5. The “STONED” mode produces both similar and dissimilar structures to the starting molecule. The second is the “STONED-focused” mode that allows only 1 modification and restricts the mutation position to the terminal 10% of the SELFIES. The “STONED-focused” mode was set up intentionally to produce highly similar mutated structures to the target one. For the “STONED” mode, we sampled 10,000 times for each molecule, while for the “STONED-focused” mode we sampled 100,000 times because the mutated structures have a high probability of repeating themselves. Only the non-duplicate parts of the generated molecules were retained by examining their canonical SMILES strings output by RDKit⁴⁸.

Evaluation metrics

In this study, we use S_struct to represent the molecular similarity score which is measured by the Tanimoto coefficient of 2048-bit Morgan circular fingerprints with a radius of 2, and use S_pharma to represent the pharmacophoric similarity score which is measured by the Tanimoto coefficient of ErG fingerprints³² implemented by RDKit. Molecules will go through charge neutralization before similarity scoring.
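
Both similarity scores rest on the Tanimoto coefficient. The sketch below shows the binary form, suited to bit fingerprints such as Morgan fingerprints, and the generalized form for real-valued vectors. Which form was applied to the ErG fingerprints is not stated in the text, so the second function is included as an assumption; all names and inputs are illustrative.

```python
def tanimoto_bits(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def tanimoto_vectors(x, y):
    """Generalized Tanimoto coefficient for real-valued vectors:
    x.y / (|x|^2 + |y|^2 - x.y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    xy = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - xy
    return xy / denom if denom else 0.0


s_bits = tanimoto_bits({1, 2, 3}, {2, 3, 4})           # 2 shared of 4 total
s_vec = tanimoto_vectors([1, 1, 0], [0, 1, 1])         # same value as the set form
s_set_equiv = tanimoto_bits({0, 1}, {1, 2})
```

For 0/1 vectors the generalized form reduces to the binary one, which is why the last two values coincide.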

The deviation of pharmacophoric feature counts D_count is formulated as follows:

$${D}_{{{\rm{count}}}}=\frac{1}{N}\mathop{\sum }_{i}^{N}{\sum }_{j}^{m}| {n}_{j}^{i}-{n}_{j}^{{{\rm{ref}}}}|$$

(2)

where ${n}_{j}^{i}$ is the number of jth pharmacophoric features in ith generated molecule, ${n}_{j}^{{{\rm{ref}}}}$ is the number of jth pharmacophoric feature in the reference molecule, m is the total number of pharmacophoric feature types (m = 8 in this study) and N is the total number of generated molecules.
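
Eq. (2) can be written out directly, as in the sketch below. The feature counts are made up and the function name is hypothetical; each molecule is represented by a list of m = 8 counts in a fixed feature order.

```python
def d_count(generated_counts, reference_counts):
    """Mean over generated molecules of the summed absolute deviation
    of per-type pharmacophoric feature counts from the reference."""
    if not generated_counts:
        raise ValueError("need at least one generated molecule")
    total = 0
    for counts in generated_counts:
        if len(counts) != len(reference_counts):
            raise ValueError("feature count vectors must have equal length")
        total += sum(abs(n - r) for n, r in zip(counts, reference_counts))
    return total / len(generated_counts)


# Made-up counts for m = 8 feature types.
reference = [2, 1, 1, 0, 0, 3, 0, 0]
generated = [
    [2, 1, 1, 0, 0, 3, 0, 0],  # identical to the reference: deviation 0
    [1, 1, 2, 0, 0, 3, 0, 0],  # deviation 2
    [2, 2, 1, 1, 0, 1, 0, 0],  # deviation 4
]
score = d_count(generated, reference)  # (0 + 2 + 4) / 3
```

A lower value means that the generated molecules reproduce the feature counts of the reference more closely.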

The recall rate in the recall experiment of DRD2 actives is formulated as follows:

$${{\rm{Recall}}}=\frac{{{\#}} {{\rm{Known}}}\,{{\rm{actives}}}\,{{\rm{in}}}\,{{\rm{the}}}\,{{\rm{generated}}}\,{{\rm{set}}}}{{{\#}} {{\rm{Known}}}\,{{\rm{actives}}}\,{{\rm{in}}}\,{{\rm{the}}}\,{{\rm{reserved}}}\,{{\rm{set}}}}$$

(3)

The apparent precision is the number of known actives unseen by TransPharmer found within the set of 4000 generated molecules. These molecules were sampled more than once during the generation process and continuously joined the generated set until the size of 4000 was reached.

Settings for DRD2 recall experiment

The Bemis-Murcko scaffolds of the 7939 DRD2 actives were extracted and clustered using Butina algorithm⁶⁶ implemented in RDKit⁴⁸, with Morgan fingerprints⁴⁷ (radius 2, 2048-bit) as molecular descriptors and a distance threshold of 0.4. Then, scaffold clusters were sorted by size in descending order and 3717 ligands with scaffolds in the odd-indexed clusters were added into the training set of TransPharmer, while 4222 ligands with scaffolds in the even-indexed clusters were actives to be recalled. During generation, active ligands in the training set were encoded into 72-bit pharmacophoric fingerprints and used as prompts of TransPharmer to generate 1000 SMILES per condition, yielding a total of 3,717,000 generated samples. 3717 unrelated molecules to DRD2 randomly drawn from the training set (Fig. 3b) were also encoded into 72-bit pharmacophoric fingerprints for TransPharmer to generate 1000 SMILES per condition.
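
The parity-based split of scaffold clusters can be sketched as follows. The text does not state whether cluster indices start at 0 or 1, nor how ties in cluster size are ordered, so `first_index` is exposed as a parameter and its default of 0 is an assumption; the function name and the toy clusters are hypothetical.

```python
def split_by_cluster_parity(clusters, first_index=0):
    """Sort clusters by size (largest first) and send the members of
    odd-indexed clusters to the training set, even-indexed to the recall set."""
    ordered = sorted(clusters, key=len, reverse=True)
    train, to_recall = [], []
    for index, cluster in enumerate(ordered, start=first_index):
        target = train if index % 2 == 1 else to_recall
        target.extend(cluster)
    return train, to_recall


# Toy clusters of ligand identifiers.
clusters = [["a", "b", "c"], ["d"], ["e", "f"]]
train, to_recall = split_by_cluster_parity(clusters)

# Number of samples generated from the training actives, as in the text.
n_samples = 3717 * 1000
```

Because whole clusters are assigned to one side, no scaffold cluster is shared between the prompts and the actives to be recalled.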

DRD2 QSAR model

A classification model employing a Support Vector Machine (SVM) for the prediction of bioactivity was developed following DeepDrugCoder⁵⁴. The standard implementation of SVM from the scikit-learn v0.20.3⁴⁷ Python package was used, with the radial basis function as the kernel function. The model was trained to discriminate active compounds from inactive ones based on their 2048-bit, radius-2 Morgan fingerprint representations. Model weights and optimized hyperparameters were loaded from https://github.com/pcko1/Deep-Drug-Coder/tree/master/models. The model outputs the probability of a compound being active against DRD2.

t-distributed stochastic neighbor embedding (t-SNE)

To visualize the chemical space encompassed by the generated molecules, the training set, and the known PLK1 active ligands, we constructed t-SNE plots. The 108-bit pharmacophore fingerprints were used as the molecular descriptors. The perplexity parameter was set to 50. A subset of known PLK1 active ligands was constructed by removing ligands with a pXC50 value lower than 6. Employing the Barnes-Hut implementation of the t-SNE algorithm⁸⁷, we obtained two-dimensional representations for 574 PLK1 active ligands and for 5,000 randomly selected molecules each from the training set and the two generated sets (conditional and unconditional).

Molecular docking

The receptor structure was taken from the Protein Data Bank (PDB)⁸⁸ (PDB ID: 2YAC) and prepared using the Schrödinger Protein Preparation Wizard⁸⁹ with default parameters, i.e., we added hydrogens, protonated non-residue molecules at pH 7 ± 2 using Epik⁹⁰, removed waters, ions and crystallization artifacts (e.g., tartaric acid), optimized hydrogen bond assignment at pH 7 using PROPKA⁹¹ and minimized the structure using the OPLS3e force field⁹². A grid was defined using the centroid of the co-crystallized ligand Onvansertib as the center. Before the docking procedure, the generated ligands were prepared using LigPrep⁹³ to enumerate unspecified stereocentres, tautomers, and protonation states and perform minimization using the OPLS3e force field. Each molecule along with any respective variants was then docked using Glide⁶³. We performed a redocking of Onvansertib into the ATP pocket of PLK1 to validate our docking protocol. Onvansertib was favorably scored by Glide with docking scores lower than −11 kcal/mol, and the RMSD between its No.1 docking pose and the co-crystal pose is less than 0.5 Å.

Molecular dynamics (MD) simulation

MD simulations were carried out on the systems of PLK1 in complex with generated ligands. The systems were first minimized through steepest descent minimization until the termination condition, i.e., the maximum force below 10.0 kJ/mol, was satisfied. After minimization, the systems were heated to 300 K over 100 picoseconds (ps) using the NVT ensemble with a restraint of 1000 kJ/mol nm⁻² on both the kinase and ligands, followed by an additional 100 ps of NVT equilibration with a restraint solely on the protein. Next, 100 ps of NPT equilibration was conducted. Finally, either a 4-nanosecond (ns) production run for estimating binding free energy or a 100 ns run for evaluating binding stability was conducted. Long-range electrostatics were treated with the particle mesh Ewald (PME) method, with a cutoff of 12 Å applied uniformly across all the MD simulations. All bonds between hydrogen and heavy atoms were constrained by the LINCS method, and simulations were executed with a time step of 2 femtoseconds. Temperature coupling used the V-rescale method. To assess the stability of the simulated systems, the root-mean-square deviation (RMSD) was computed over the last 20 ns of the trajectory after aligning the protein structures. We validated our MD simulation protocol by carrying out a 100 ns run for the system of PLK1 in complex with Onvansertib. The RMSD over the last 20 ns was 1.71 Å, indicating high binding stability of Onvansertib in the ATP pocket of PLK1.
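
As a reminder of what the RMSD stability measure computes, here is a minimal sketch that assumes the frames have already been superposed on the reference. The function names and toy coordinates are hypothetical; which atoms enter the calculation and which frame serves as the reference are not specified here and would need to follow the actual analysis.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate lists.

    Each argument is a list of (x, y, z) tuples in the same atom order.
    Assumes the two structures have already been superposed (aligned).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def mean_rmsd(reference, frames):
    """Average RMSD of several trajectory frames against one reference."""
    return sum(rmsd(reference, frame) for frame in frames) / len(frames)

# Toy coordinates in angstroms (illustrative values, not simulation output).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
frames = [
    [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)],  # every atom shifted by 1 angstrom
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # identical to the reference
]
average = mean_rmsd(reference, frames)
```

In practice a trajectory analysis package would perform the alignment and atom selection; the sketch only shows the arithmetic behind a reported average such as the value for the last 20 ns.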

Molecular mechanics with generalized Born and surface area solvation (MM/GBSA)

The MM/GBSA calculations were conducted employing gmx_MMPBSA v1.6⁹⁴, a tool derived from AMBER’s MMPBSA.py. The GBOBC2 (igb = 5) model was utilized in this study, with a salt concentration set at 0.15 M. For the kinase, the ff14SB force field was employed, while the General Amber Force Field was applied to the generated ligands. Other default parameters for MM/GBSA calculations were applied.
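
For orientation, an end-state MM/GBSA estimate is conventionally assembled as shown here. This is the textbook decomposition rather than a statement of the exact terms reported in this study. In particular, the absence of an explicit entropy term is an assumption, and the angle brackets denote averages over trajectory frames.

$$\Delta G_{\mathrm{bind}}=\langle G_{\mathrm{complex}}\rangle-\langle G_{\mathrm{receptor}}\rangle-\langle G_{\mathrm{ligand}}\rangle,\qquad G=E_{\mathrm{MM}}+G_{\mathrm{GB}}+G_{\mathrm{SA}},$$

where E_MM is the molecular mechanics energy, G_GB is the generalized Born polar solvation term, and G_SA is the surface-area-based nonpolar solvation term.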

Chemical synthesis

We assessed the synthesizability using the Synthetic Accessibility (SA) score⁹⁵ to estimate the ease of synthesis of the designed compounds, supplemented by manual inspection by medicinal chemistry experts. The primary synthetic data are available in the Supplementary Methods.

In vitro kinase activity assays

In vitro kinase activity assays were conducted through ADP-Glo assay services provided by Conradbio (Conradbio, China). The protocol for the PLK1 assay is described as follows (protocols for other kinases are very similar). Enzyme, substrate, ATP, and compounds were diluted in a Kinase Buffer composed of 40 mM Tris (pH 7.5), 20 mM MgCl₂, 0.1 mg/ml BSA, and 50 μM DTT. In a 384-well low-volume plate, 1 μl of the compound or 5% dimethyl sulfoxide (DMSO), 2 μl of PLK1 enzyme (15 ng/well), and 2 μl of substrate/ATP mix (final concentration: 20 μM ATP, 0.2 μg/μl Casein) were added to each well. The plate was then incubated at 25 °C for 60 min to allow for kinase activity. Following the enzymatic reaction, 5 μl of ADP-Glo™ Reagent was added to each well, and the plate was incubated at 25 °C for an additional 40 min. Subsequently, 10 μl of Kinase Detection Reagent was added, and the plate was incubated for a final 30 min at 25 °C. Luminescence was recorded with an integration time of 0.5 s.

IC50 values were calculated using Prism 8 by fitting the following equation:

$$Y=\mathrm{Bottom}+\frac{\mathrm{Top}-\mathrm{Bottom}}{1+10^{(\log \mathrm{IC}_{50}-X)\times \mathrm{HillSlope}}},$$

(4)

where X is the logarithm of the compound concentration, Y is the response, and Top and Bottom are the responses of the controls. Each assay was repeated at least three times, and we computed the mean and standard deviation of the values.
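
The dose-response model can be written as a short function. This sketch uses the standard variable-slope form in which HillSlope multiplies (logIC50 − X) inside the exponent; the function name and the parameter values are illustrative assumptions, not fitted data from this study.

```python
def four_param_logistic(x, bottom, top, log_ic50, hill_slope):
    """Variable-slope dose-response curve.

    x and log_ic50 are base-10 logarithms of concentration. With a negative
    hill_slope the response falls from `top` to `bottom` as x increases,
    which is the usual shape for an inhibitor.
    """
    return bottom + (top - bottom) / (1.0 + 10.0 ** ((log_ic50 - x) * hill_slope))

# Illustrative parameters: 0-100% response, IC50 of 10 nM, slope of -1.
bottom, top, log_ic50, hill_slope = 0.0, 100.0, -8.0, -1.0

at_ic50 = four_param_logistic(log_ic50, bottom, top, log_ic50, hill_slope)
high_dose = four_param_logistic(log_ic50 + 3.0, bottom, top, log_ic50, hill_slope)
low_dose = four_param_logistic(log_ic50 - 3.0, bottom, top, log_ic50, hill_slope)
```

At X = logIC50 the exponent vanishes and the response sits halfway between Top and Bottom, which is the defining property of the IC50 in this parameterization.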

Cell viability assays

Cell viability assays were conducted through CellTiter-Glo assay services provided by Conradbio (Conradbio, China). The protocol is briefly described as follows. First, when the cell confluence reaches 80%, cells are collected and counted. The cell suspension is then diluted, and 80 μl of the suspension is seeded into each well of a 96-well U-bottom plate. The plate is placed in a 37 °C, 5% CO₂ incubator for optimal cell growth. After 24 h of incubation, a 20 μl aliquot of a diluted compound solution is added to specific wells on the plate; 0.5% dimethyl sulfoxide (DMSO) is used as a negative control. Following the compound addition, the plate is returned to the incubator for an additional day. Upon completion of the incubation period, the CellTiter-Glo (CTG) assay (Promega) is performed according to the manufacturer’s manual. This assay measures cell viability based on luminescence, providing insights into the impact of the compounds on cellular health. Finally, the results of the CTG assay are analyzed using Prism 8.

Molecular novelty assessment

A molecular novelty assessment of the designed compounds exhibiting IC₅₀ below 1 μM, namely IIP0942, IIP0943, and IIP0945, was performed within the ExCAPE-DB and ChEMBL databases and using SciFinder. The settings and results can be found in the Supplementary Notes.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Source data are provided with this paper as a Source Data file. Supplementary data in this study are provided in the Supplementary Information. The generated molecules in the case studies of DRD2 and PLK1 in this study have been deposited at Zenodo via https://doi.org/10.5281/zenodo.14227821⁹⁶. The crystal structure of PLK1 used in this study is available in the RCSB PDB database under accession code 2YAC [https://doi.org/10.2210/pdb2YAC/pdb].

Code availability

The source code of TransPharmer is available at https://github.com/iipharma/transpharmer-repo and Zenodo (https://doi.org/10.5281/zenodo.14228119)⁹⁷.

Change history

07 April 2025
A Correction to this paper has been published: https://doi.org/10.1038/s41467-025-58701-w

References

Zhavoronkov, A. et al. Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid x receptor modulators. Commun. Chem. 1, 68 (2018).
Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Sci. Adv. 7, eabg3338 (2021).
Korshunova, M. et al. Generative and reinforcement learning approaches for the automated de novo design of bioactive compounds. Commun. Chem. 5, 129 (2022).
Li, Y. et al. Generative deep learning enables the discovery of a potent and selective ripk1 inhibitor. Nat. Commun. 13, 6891 (2022).
Ren, F. et al. Alphafold accelerates artificial intelligence powered drug discovery: efficient discovery of a novel cdk20 small molecule inhibitor. Chem. Sci. 14, 1443–1452 (2023).
Ballarotto, M. et al. De novo design of nurr1 agonists via fragment-augmented generative deep learning in low-data regime. J. Med. Chem. 66, 8170–8177 (2023).
Chenthamarakshan, V. et al. Accelerating drug target inhibitor discovery with a deep generative foundation model. Sci. Adv. 9, eadg7865 (2023).
Zhu, W. et al. Discovery of novel and selective sik2 inhibitors by the application of alphafold structures and generative models. Bioorg. Med. Chem. 91, 117414 (2023).
Li, Y. et al. Discovery of potent, selective, and orally bioavailable small-molecule inhibitors of cdk8 for the treatment of cancer. J. Med. Chem. 66, 5439–5452 (2023).
Tan, X. et al. Automated design and optimization of multitarget schizophrenia drug candidates by deep learning. Eur. J. Med. Chem. 204, 112572 (2020).
Xie, W., Wang, F., Li, Y., Lai, L. & Pei, J. Advances and challenges in de novo drug design using three-dimensional deep generative models. J. Chem. Inf. Model. 62, 2269–2279 (2022).
Flam-Shepherd, D., Zhu, K. & Aspuru-Guzik, A. Language models can learn complex molecular distributions. Nat. Commun. 13, 3293 (2022).
Li, Y. et al. Deepscaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning. J. Chem. Inf. Model. 60, 77–91 (2019).
Wang, M. et al. Relation: a deep generative model for structure-based de novo drug design. J. Med. Chem. 65, 9478–9492 (2022).
Kuznetsov, M. & Polykovskiy, D. Molgrow: a graph normalizing flow for hierarchical molecular generation. Proc. AAAI Conf. Artif. Intell. 35, 8226–8234 (2021).
Bagal, V., Aggarwal, R., Vinod, P. & Priyakumar, U. D. Molgpt: Molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 62, 2064–2076 (2021).
Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Comput. Sci. 3, 1–11 (2023).
O’Boyle, N. & Dalke, A. Deepsmiles: an adaptation of smiles for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/CHEMRXIV.7097960.V1 (2018).
Krenn, M., Häse, F., Nigam, A. K., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Zhou, G. et al. Uni-mol: a universal 3d molecular representation learning framework (2023).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminformatics 9, 1–14 (2017).
Blaschke, T. et al. Reinvent 2.0: an ai tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).
Wang, J. et al. Chemistga: a chemical synthesizable accessible molecular generation algorithm for real-world drug discovery. J. Med. Chem. 65, 12482–12496 (2022).
Li, Y., Pei, J. & Lai, L. Structure-based de novo drug design using 3d deep generative models. Chem. Sci. 12, 13664–13675 (2021).
Bush, J. T. et al. A turing test for molecular generators. J. Med. Chem. 63, 11964–11971 (2020).
Pogány, P., Arad, N., Genway, S. & Pickett, S. D. De novo molecule design by translating from reduced graphs to smiles. J. Chem. Inf. Model. 59, 1136–1146 (2018).
Walters, W. P. & Murcko, M. Assessing the impact of generative ai on medicinal chemistry. Nat. Biotechnol. 38, 143–145 (2020).
Zhavoronkov, A. & Aspuru-Guzik, A. Reply to ‘assessing the impact of generative ai on medicinal chemistry’. Nat. Biotechnol. 38, 146–146 (2020).
Ivanenkov, Y. et al. The hitchhiker’s guide to deep learning driven generative chemistry. ACS Med. Chem. Lett. 14, 901–915 (2023).
Stiefl, N., Watson, I. A., Baumann, K. & Zaliani, A. Erg: 2d pharmacophore descriptions for scaffold hopping. J. Chem. Inf. Model. 46, 208–220 (2006).
Reutlinger, M. et al. Chemically advanced template search (cats) for scaffold-hopping and prospective target prediction for ‘orphan’molecules. Mol. Inform. 32, 133 (2013).
Imrie, F., Hadfield, T. E., Bradley, A. R. & Deane, C. M. Deep generative design with 3d pharmacophoric constraints. Chem. Sci. 12, 14577–14589 (2021).
Skalic, M., Jiménez, J., Sabbadin, D. & De Fabritiis, G. Shape-based generative modeling for de novo drug design. J. Chem. Inf. Model. 59, 1205–1214 (2019).
Skalic, M., Sabbadin, D., Sattarov, B., Sciabola, S. & De Fabritiis, G. From target to drug: generative modeling for the multimodal structure-based ligand design. Mol. Pharm. 16, 4282–4291 (2019).
Yang, Y. et al. Syntalinker: automatic fragment linking with deep conditional transformer neural networks. Chem. Sci. 11, 8312–8322 (2020).
Hadfield, T. E., Imrie, F., Merritt, A., Birchall, K. & Deane, C. M. Incorporating target-specific pharmacophoric information into deep generative models for fragment elaboration. J. Chem. Inf. Model. 62, 2280–2292 (2022).
Zhu, H., Zhou, R., Cao, D., Tang, J. & Li, M. A pharmacophore-guided deep learning approach for bioactive molecular generation. Nat. Commun. 14, 6234 (2023).
Mahé, P., Ralaivola, L., Stoven, V. & Vert, J.-P. The pharmacophore kernel for virtual screening with support vector machines. J. Chem. Inf. Model. 46, 2003–2014 (2006).
Capecchi, A., Probst, D. & Reymond, J.-L. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminformatics 12, 1–15 (2020).
Radford, A. & Narasimhan, K. Improving language understanding by generative pre-training (2018).
Weininger, D. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. Guacamol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Polykovskiy, D. et al. Molecular sets (moses): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).
Skalic, M., Varela-Rial, A., Jiménez, J., Martínez-Rosell, G. & De Fabritiis, G. Ligvoxel: inpainting binding pockets using 3d-convolutional neural networks. Bioinformatics 35, 243–250 (2019).
Morgan, H. L. The generation of a unique machine description for chemical structures—a technique developed at chemical abstracts service. J. Chem. Documentation 5, 107–113 (1965).
Rdkit: Open-source cheminformatics. https://www.rdkit.org.
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2323–2332 (PMLR, 2018).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, 4839–4848 (PMLR, 2020).
Nigam, A., Pollice, R., Krenn, M., Gomes, G. D. P. & Aspuru-Guzik, A. Beyond generative models: Superfast traversal, optimization, novelty, exploration and discovery (stoned) algorithm for molecules using selfies. Chem. Sci. 12, 7079–7090 (2021).
He, J. et al. Transformer-based molecular optimization beyond matched molecular pairs. J. Cheminformatics 14, 18 (2022).
Beria, I. et al. Nms-p937, a 4,5-dihydro-1h-pyrazolo[4,3-h]quinazoline derivative as potent and selective polo-like kinase 1 inhibitor. Bioorg. Med. Chem. Lett. 21, 2969–2974 (2011).
Kotsias, P.-C. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020).
Schmucker, S. & Sumara, I. Molecular dynamics of plk1 during mitosis. Mol. Cell. Oncol. 1, e954507 (2014).
Luo, J. et al. A genome-wide rnai screen identifies multiple synthetic lethal interactions with the ras oncogene. Cell 137, 835–848 (2009).
Yu, C. et al. Genome-wide crispr-cas9 knockout screening identifies grb7 as a driver for mek inhibitor resistance in kras mutant colon cancer. Oncogene 41, 191–203 (2022).
Smits, V. A. et al. Polo-like kinase-1 is a target of the dna damage checkpoint. Nat. Cell Biol. 2, 672–676 (2000).
Li, Z. et al. Polo-like kinase 1 (plk1) overexpression enhances ionizing radiation-induced cancer formation in mice. J. Biol. Chem. 292, 17461–17472 (2017).
Chiappa, M. et al. Present and future perspective on plk1 inhibition in cancer treatment. Front. Oncol. 12, 903016 (2022).
Zhang, J., Zhang, L., Wang, J., Ouyang, L. & Wang, Y. Polo-like kinase 1 inhibitors in human cancer therapy: development and therapeutic potential. J. Med. Chem. 65, 10133–10160 (2022).
Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 23, 3–25 (1997).
Halgren, T. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 2. enrichment factors in database screening. J. Med. Chem. 47, 1750–1759 (2004).
Salentin, S., Schreiber, S., Haupt, V. J., Adasme, M. F. & Schroeder, M. Plip: fully automated protein–ligand interaction profiler. Nucleic Acids Res. 43, W443–W447 (2015).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Butina, D. Unsupervised data base clustering based on daylight’s fingerprint and tanimoto similarity: a fast and automated way to cluster small and large data sets. J. Chem. Inf. Comput. Sci. 39, 747–750 (1999).
Dalke, A. The chemfp project. J. Cheminformatics 11, 1–21 (2019).
Friesner, R. A. et al. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein- ligand complexes. J. Med. Chem. 49, 6177–6196 (2006).
Feig, M. et al. Performance comparison of generalized born and poisson methods in the calculation of electrostatic solvation energies for protein structures. J. Comput. Chem. 25, 265–284 (2004).
Onufriev, A., Case, D. A. & Bashford, D. Effective born radii in the generalized born approximation: the importance of being perfect. J. Comput. Chem. 23, 1297–1304 (2002).
Bashford, D. & Case, D. A. Generalized born models of macromolecular solvation effects. Annu. Rev. Phys. Chem. 51, 129–152 (2000).
Onufriev, A., Bashford, D. & Case, D. A. Modification of the generalized born model suitable for macromolecules. J. Phys. Chem. B 104, 3712–3720 (2000).
Emmitte, K. A. et al. Design of potent thiophene inhibitors of polo-like kinase 1 with improved solubility and reduced protein binding. Bioorg. Med. Chem. Lett. 19, 1694–1697 (2009).
Beria, I. et al. Identification of 4, 5-dihydro-1 h-pyrazolo [4, 3-h] quinazoline derivatives as a new class of orally and selective polo-like kinase 1 inhibitors. J. Med. Chem. 53, 3532–3551 (2010).
Yoshimori, A., Kawasaki, E., Kanai, C. & Tasaka, T. Strategies for design of molecular structures with a desired pharmacophore using deep reinforcement learning. Chem. Pharm. Bull. 68, 227–233 (2020).
Caruso, M. et al. 5-(2-amino-pyrimidin-4-yl)-1h-pyrrole and 2-(2-amino-pyrimidin-4-yl)-1, 5, 6, 7-tetrahydro-pyrrolo [3, 2-c] pyridin-4-one derivatives as new classes of selective and orally available polo-like kinase 1 inhibitors. Bioorg. Med. Chem. Lett. 22, 96–101 (2012).
Abnar, S. & Zuidema, W. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4190–4197 (Association for Computational Linguistics, 2020).
Stärk, H., Ganea, O., Pattanaik, L., Barzilay, R. & Jaakkola, T. Equibind: Geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning, 20503–20521 (PMLR, 2022).
Fang, Y. et al. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nat. Mach. Intell. 5, 542–553 (2023).
Yue, J. et al. Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language. Chem. Sci. 15, 13727–13740 (2024).
Swanson, K. et al. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nat. Mach. Intell. 6, 338–353 (2024).
Abeer, A. N., Urban, N. M., Weil, M. R., Alexander, F. J. & Yoon, B.-J. Multi-objective latent space optimization of generative molecular design models. Patterns 5, 101042 (2024).
Landrum, G. A., Penzotti, J. E. & Putta, S. Feature-map vectors: a new class of informative descriptors for computational drug discovery. J. Comput. Aided Mol. Des. 20, 751–762 (2006).
Su, J., Lu, Y., Pan, S., Wen, B. & Liu, Y. Roformer: enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).
Sun, J. et al. Excape-db: an integrated large scale dataset facilitating big data analysis in chemogenomics. J. Cheminformatics 9, 1–9 (2017).
Riniker, S. & Landrum, G. A. Better informed distance geometry: using what we know to improve conformation generation. J. Chem. Inf. Model. 55, 2562–2574 (2015).
Van Der Maaten, L. Accelerating t-sne using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
Madhavi Sastry, G., Adzhigirey, M., Day, T., Annabhimoju, R. & Sherman, W. Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J. Comput. Aided Mol. Des. 27, 221–234 (2013).
Shelley, J. C. et al. Epik: a software program for pk a prediction and protonation state generation for drug-like molecules. J. Comput. Aided Mol. Des. 21, 681–691 (2007).
Søndergaard, C. R., Olsson, M. H., Rostkowski, M. & Jensen, J. H. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of p k a values. J. Chem. Theory Comput. 7, 2284–2295 (2011).
Roos, K. et al. Opls3e: Extending force field coverage for drug-like small molecules. J. Chem. Theory Comput. 15, 1863–1874 (2019).
Schrödinger, LLC, New York, NY. Ligprep.
Valdés-Tresanco, M. S., Valdés-Tresanco, M. E., Valiente, P. A. & Moreno, E. gmx_mmpbsa: a new tool to perform end-state free energy calculations with gromacs. J. Chem. Theory Comput. 17, 6281–6291 (2021).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminformatics 1, 1–11 (2009).
Xie, W. et al. Accelerating discovery of bioactive ligands with pharmacophore informed generative models. supplementary data for transpharmer. Zenodo https://doi.org/10.5281/zenodo.14227821 (2024).
Xie, W. et al. Accelerating discovery of bioactive ligands with pharmacophore informed generative models. transpharmer-repo. Zenodo https://doi.org/10.5281/zenodo.14228119 (2024).

Acknowledgements

We thank Dr. Guangwei He for insightful discussions on experimental design and are grateful to his team at HIPI for their support during experiments. We also acknowledge Dr. Chu, Dr. Xu, and Minghan He for their assistance with statistical analysis and design. This work was supported in part by the National Key R&D Program of China (grant 2023YFF1205103), the National Natural Science Foundation of China (grants 220330010), and the Chinese Academy of Medical Sciences (grant 2021-I2M-5-014). Additionally, this work received support from Anhui’s Plans for Major Provincial Science & Technology Projects (Grant 202303a07020009).

Author information

These authors contributed equally: Weixin Xie, Jianhang Zhang.

Authors and Affiliations

Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
Weixin Xie, Jin Xie, Luhua Lai & Jianfeng Pei
Infinite Intelligence Pharma, Beijing, China
Jianhang Zhang, Qin Xie, Chaojun Gong & Youjun Xu
BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, China
Yuhao Ren, Qi Sun & Luhua Lai
Peking University Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Chengdu, China
Qi Sun & Luhua Lai
Research Unit of Drug Design Method, Chinese Academy of Medical Sciences, Beijing, China
Qi Sun, Luhua Lai & Jianfeng Pei

Contributions

Y.X. devised the ideas. J.Z. and W.X. implemented the deep learning model and performed the model training. W.X., J.Z., and Y.X. discussed and analyzed the data. W.X., Q.X., and C.G. conducted the compound screening and prioritization. Y.R., J.X., and Q.S. performed chemical structure characterization and analyzed data. Y.X., L.L., and J.P. supervised the project. W.X., J.Z., Y.X., L.L., and J.P. wrote the manuscript.

Corresponding authors

Correspondence to Youjun Xu, Luhua Lai or Jianfeng Pei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Tunca Dogan, Daniela Schuster, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xie, W., Zhang, J., Xie, Q. et al. Accelerating discovery of bioactive ligands with pharmacophore-informed generative models. Nat Commun 16, 2391 (2025). https://doi.org/10.1038/s41467-025-56349-0

Received: 25 January 2024
Accepted: 13 January 2025
Published: 10 March 2025
DOI: https://doi.org/10.1038/s41467-025-56349-0
