Molecular deep learning at the edge of chemical space

van Tilborg, Derek; Rossen, Luke; Grisoni, Francesca

doi:10.1038/s42256-026-01216-w

Download PDF

Article
Open access
Published: 22 April 2026

Molecular deep learning at the edge of chemical space

Nature Machine Intelligence volume 8, pages 575–587 (2026) Cite this article

19k Accesses
8 Altmetric
Metrics details

Subjects

Abstract

Molecular machine learning models often fail to generalize beyond the chemical space of their training data, limiting their ability to reliably perform predictions on structurally novel bioactive molecules. Here, to advance the ability of machine learning to go beyond the ‘edge’ of their training chemical space, we introduce a joint modelling approach that combines molecular property prediction with molecular reconstruction. This approach allows the introduction of unfamiliarity, a reconstruction-based metric that enables the estimation of model generalizability. Via a systematic analysis spanning more than 30 bioactivity datasets, we demonstrate that unfamiliarity not only effectively identifies out-of-distribution molecules but also serves as a reliable predictor of classifier performance. Even when faced with the presence of strong distribution shifts on large-scale molecular libraries, unfamiliarity yields robust and meaningful molecular insights that go unnoticed by traditional methods. Finally, we experimentally validate unfamiliarity-based molecule screening in the wet lab for two clinically relevant kinases, discovering seven compounds with low micromolar potency and limited similarity to training molecules. This demonstrates that unfamiliarity can extend the reach of machine learning beyond the edge of the charted chemical space, advancing the discovery of diverse and structurally novel molecules.

Coverage bias in small molecule machine learning

Article Open access 09 January 2025

Chemical language models enable navigation in sparsely populated chemical space

Article 19 July 2021

Multi-channel learning for integrating structural hierarchies into context-dependent molecular representation

Article Open access 06 January 2025

Main

Molecular machine learning is rapidly gaining traction in early drug discovery^1,2,3,4,5. One key objective is identifying novel bioactive molecules (‘hits’) on one or more pharmacological targets⁶. In this context, finding structurally novel hit molecules is crucial for addressing unmet therapeutic needs^7,8, ensuring commercial viability⁹ and overcoming drug resistance^10,11. However, moving beyond the structural features of the training molecules (for example, to identify novel bioactive molecular cores) poses a substantial challenge for machine learning models, which often fail when applied to out-of-distribution (OOD) molecules^12,13,14. This is especially true for discrete data, such as molecules, which can quickly deviate from the data distribution learned during model training. This is further exacerbated by the scarcity of structurally diverse molecular data with high-quality experimental annotations^12,13,14,15, due to the costly and time-consuming nature of biochemical experiments. As a result, training sets typically contain only hundreds of molecules, while libraries used for screening may contain billions of existing, but previously unseen, chemicals to be predicted¹⁶. Dealing with the resulting distribution shifts makes the discovery of structurally novel hit molecules with machine learning a herculean task. In this regard, quantifying how reliable predictions are beyond the ‘edge’ of the explored chemical space holds enormous promise.

Ensuring prediction reliability in prospective hit-screening campaigns has been an active topic of research^17,18. A well-established approach involves defining an applicability domain^19,20, which delimits the chemical space of reliable predictions, most often via a threshold on molecular similarity to the training data²¹ (Fig. 1a). However, this method does not incorporate the information learned by the model, and, due to its similarity-based definition, hampers the discovery of structurally novel molecules. Another widely used approach is based on uncertainty estimation^22,23, which leverages the model’s prediction confidence, often through probabilistic modelling techniques. While uncertainty estimation allows the consideration of structurally novel molecules in principle, it may provide overconfident predictions when confronted with OOD samples^14,24,25. Hence, being able to make reliable predictions on OOD molecules remains one of the core challenges of molecular machine learning in drug discovery.

**Fig. 1: Estimating unfamiliarity of molecular data using joint modelling.**

Here we offer a fresh perspective on how to better navigate the ‘edge’ of chemical space with deep learning, while accounting for prediction reliability on OOD molecules. To achieve this, we leverage recent advances in generative deep learning for de novo molecule design^{2,26,27,28,29}, in particular autoencoders^27,30,31. Autoencoders can be trained to encode molecular structures into a lower-dimensional latent space, and subsequently decode them back to their original form^27,30. In this work, through joint molecular modelling (Fig. 1b), we simultaneously train deep learning models to predict molecular properties (for example, bioactivity) and reconstruct the input molecule in a semi-supervised^32,33 manner, that is, by learning from a combination of labelled and unlabelled molecular data.

Our joint learning approach breaks with the well-established application of leveraging a self-supervised learning task for generative chemistry²⁷ or for predictive performance improvement^{33,34,35,36,37}, by using reconstruction capabilities as a direct proxy for OOD estimation^38,39,40,41. Specifically, we hypothesize that poorly reconstructed molecules are less familiar to the model, indicating that they fall outside the distribution learned from the training data. Building on this hypothesis, we introduce a metric, termed unfamiliarity, which captures a model’s reconstruction ability and is proposed to quantify how much a molecule deviates from the training distribution.

In this systematic study, spanning 33 experimentally labelled molecular datasets, we show not only that the introduced unfamiliarity metric is a robust indicator of molecular distribution shifts, but also that it strongly correlates with classifier performance. The capacity of unfamiliarity to identify structurally diverse and bioactive molecule is further validated in the wet lab, discovering several compounds with low micromolar activity on two kinase proteins.

Ultimately, the introduced concept of molecular unfamiliarity provides a principled approach to estimating model generalizability, even in the presence of molecular distribution shifts. Our approach offers a fresh perspective on estimating prediction reliability, complementing established concepts such as the applicability domain^19,20 and uncertainty estimation^22,23—guiding the discovery of structurally novel molecules in a more precise and informed manner.

Results

In what follows, we will elaborate on our joint molecular model (JMM), introduce the unfamiliarity metric and demonstrate its ability to quantify molecular distribution shifts. Next, we relate molecular property prediction to distribution shifts and leverage this relationship to estimate prediction reliability. Finally, we apply the unfamiliarity metric to prioritize molecules in a virtual screening case study, followed by experimental validation in the wet lab.

Unfamiliarity and joint molecular modelling

The JMM (Fig. 1b) is based on a semi-supervised^32,33 autoencoder, building on seminal work in molecular generative modelling²⁷. First, molecules are represented as Simplified Molecular Input Line Entry System (SMILES) strings⁴², which encode molecular topology and atom/bond types in a textual format. SMILES strings are encoded into a compressed latent vector (z), using a one-dimensional convolutional neural network⁴³, which was shown to capture elements of bioactivity effectively^27,44. The z vector is then decoded back into the input representation using a recurrent neural network with long short-term memory (LSTM)^26,28,29, in a self-supervised manner. This encoder-decoder was pretrained on ~1.2 M unlabelled molecular structures from ChEMBL⁴⁵, a dataset large enough to learn the ‘grammar’ of SMILES strings⁴⁶. The model was then finetuned using each labelled molecule, by passing the same latent representation (z) to an approximate Bayesian classifier⁴⁷, which predicts molecular properties and simultaneously estimates prediction uncertainty (Fig. 1b). Reconstruction and property prediction were trained jointly (Methods; equation (13)), to ensure that the shared latent space captures relevant information for both tasks.

The performance of molecular reconstruction was quantified via the reconstruction loss. This was computed as the total negative log-likelihood loss of all SMILES string tokens, normalized by SMILES token length:

$${{\mathscr{L}}}_{\mathrm{reconstruction}}\left(x\right)=-\frac{1}{\left|x\right|}\mathop{\sum }\limits_{i\in x}\log p\left({t}_{i}|{t}_{ < i}\right),$$

(1)

where t_i is the ith non-padding element (‘token’) in the input SMILES string (x), and |x| is the number of non-padding tokens in the sequence. Here, $p({t}_{i}|{t}_{ < i})$ denotes the probability assigned by the decoder to the next token t_i given all preceding tokens in the sequence. From the reconstruction loss, we obtain the unfamiliarity metric (${\mathbb{U}}$), as follows:

$${\mathbb{U}}\left({\rm{x}}\right)=\log {{\mathscr{L}}}_{{\rm{reconstruction}}}\left(x\right).$$

(2)

The unfamiliarity metric depends on vocabulary size V and theoretically ranges from near −∞ to log(log(V)). The lower the reconstruction loss, the lower the unfamiliarity, and vice versa. In our case, a model with V = 35 would have an upper ${\mathbb{U}}$ limit of ~1.27 under ideal circumstances with uniformly predicted token probabilities. However, in practice, this upper bound can be exceeded when a model assigns extremely low probabilities to certain tokens.

Detecting molecular distribution shifts

To investigate whether the unfamiliarity score reflects molecular distribution shifts, we collected 33 experimentally annotated datasets from literature^48,49,50, spanning various biological properties and sizes (Supplementary Table 1). These datasets were split into groups of in-distribution and OOD molecules. First, we performed spectral clustering (Methods) on molecular cyclic skeletons⁵¹ (core ring systems without exocyclic substituents), which ensured that structurally similar molecules were consistently grouped together (Fig. 1c). Based on cluster distances in each dataset, the most distant clusters (representing approximately 25% of all molecules) were used as an OOD test set (test_OOD). The remaining molecules were split into an in-distribution test set (test_ID, ~25% of total) and a training set (train_ID, ~50%; Supplementary Fig. 1).

To confirm that the molecules in the test_OOD set originate from different data distributions than the other two sets, we used three approaches to capture molecular similarity⁵²:

(1)
Similarity of extended connectivity fingerprints⁵³ (ECFPs), which capture the presence of atom-centred substructures. ECFPs were computed on molecular cyclic skeletons, and similarity was quantified via the Tanimoto coefficient⁵⁴.
(2)
Overlap of structural cores⁵⁵, computed as the maximal common substructure (MCS) fraction between molecular graphs (Methods).
(3)
Pharmacophore similarity, computed on Chemically Advanced Template Search (CATS) descriptors⁵⁶, using the cosine similarity.

Across all similarity metrics, molecules in the test_OOD set exhibited statistically significant differences from the molecules in the training set and test_ID, whereas no differences were found between these latter two (Fig. 2a–c, paired, two-sided, Wilcoxon signed-rank test, α = 0.05).

**Fig. 2: Detecting induced molecular distribution shifts with the unfamiliarity score.**

To benchmark the effect of such a distribution shift on predictive performance, we trained three well-established⁴⁸ molecular property prediction models using molecular descriptors: a random forest (RF) and a multilayer perceptron (MLP) combined with ECFPs, as well as a RF model combined with CATS pharmacophore descriptors. As expected, the performance of these baselines consistently degraded on test_OOD compared with test_ID (Fig. 2d, paired, two-sided, Wilcoxon signed-rank test, P < 0.05) as measured via balanced accuracy (Methods; equation (15)). Such performance degradation is a hallmark of OOD data⁵⁷.

Having established a baseline, we evaluated the classification performance of the proposed JMM (Fig. 1b) on each dataset. The JMM achieved a balanced accuracy of 0.75 ± 0.02 on test_ID molecules, which was slightly lower than the balanced accuracy of 0.78 ± 0.02 achieved by the ECFP-based model (Fig. 2d, P = 3.7 × 10⁻⁵, paired, two-sided, Wilcoxon signed-rank test). Such a small performance gap suggests limited practical differences, and it aligns with existing literature on the performance of SMILES-based versus ECFP-based models⁴⁴. Notably, training the JMM with or without the reconstructive decoder had no effect on the performance (test_ID: P = 0.499, test_OOD: P = 0.594; paired, two-sided, Wilcoxon signed-rank test). These results indicated that the performance of the JMM is in line with literature standards, despite the addition of the molecular reconstruction task. In other words, the decoder enables unfamiliarity estimation without imposing a performance penalty on the classifier.

Next, we aimed to verify the hypothesis that molecules well-represented in the training data (or test_ID) are reconstructed better by the JMM than ‘unfamiliar’ OOD molecules^38,39,40,41. To this end, we inferred ${\mathbb{U}}(x)$ on all labelled molecules in the respective test_ID and test_OOD sets for every trained JMM. As expected, test_OOD molecules received significantly higher unfamiliarity scores than test_ID molecules (two-sided Kolmogorov–Smirnov (KS) test, P < 0.05; Fig. 2e), as was especially visible on a dataset basis (Fig. 2f). Importantly, these differences in unfamiliarity were not driven by factors like SMILES string length or complexity, branching, molecular graph complexity, molecular weight or the number of functional groups (Supplementary Fig. 2). These results suggest that the model’s ability to reconstruct molecules depends not on molecular complexity, but rather on its ‘proximity’ to the training data distribution.

For both molecules in test_ID and test_OOD, this was further confirmed by the direct relationship between a molecule’s unfamiliarity and its distance to the training data, regardless of the data splits (Supplementary Fig. 3), for example, in terms of structural core overlap (Fig. 2g,h). Moreover, we found that, in general, there is a moderate-to-strong relationship between unfamiliarity and structural distance to the training data (Table 1). In this context, unfamiliarity correlates well to multiple and complementary similarity metrics, suggesting that it provides a generalizable, model-driven perspective on learned data distributions.

Table 1 Correlation of reliability metrics to train set similarity

Full size table

Unfamiliarity and bioactivity prediction

Because molecular reconstruction and molecular property prediction depend on the same learned latent representation, we tested how informative ${\mathbb{U}}(x)$ is for assessing the predictive capabilities of the model. We compared our approach with other well-established measures of reliability, namely:

(1)
Similarity to training set molecules (data-driven), measured as the average pharmacophore similarity, cyclic skeleton similarity, or molecular core overlap.
(2)
Embedding distance (model-driven), defined as the average Mahalanobis distance⁵⁸ of a molecule’s embedding z (Fig. 1b) to the learned embeddings of the training set.
(3)
Prediction uncertainty (model-driven), based on approximate Bayesian modelling of the classifier (Methods; equation (12)).

We found that all tested reliability metrics are indicative of a model’s predictive performance, considering balanced accuracy, hit rate and precision (Supplementary Fig. 4). Prediction uncertainty and unfamiliarity have a moderate to strong correlation to model performance (Table 2). In other words, when uncertainty and/or unfamiliarity are high (on a dataset level), models make more erroneous bioactivity predictions, whereas molecules for which these metrics are low, are generally predicted accurately (Supplementary Fig. 4). Interestingly, unfamiliarity and prediction uncertainty seem to be unrelated as metrics themselves (Supplementary Fig. 5, Spearman correlation, r = 0.10 ± 0.05). This aspect indicates that both metrics capture complementary information about prediction reliability and motivates the introduction of unfamiliarity alongside uncertainty for molecular machine learning.

Table 2 Correlation of reliability metrics to model performance

Full size table

‘Model-driven’ reliability metrics outperformed all ‘data-driven’ methods that calculate a molecule’s distance to the training set using predefined descriptors and metrics. This indicates that models can extract relevant information from their training data that cannot be captured through predefined measures of molecular similarity alone.

Interestingly, the unfamiliarity score proved to be a considerably better reliability metric than the embedding distance to the training embeddings, even though both capture elements of structural similarity learned by the JMM (Table 1). In other words, the ability of a model to reconstruct a molecule from an internal representation provides more insight into prediction reliability than the ‘internal’ molecular embeddings directly used by the classifier. This suggests that reconstructing a molecule from an embedding is not only highly informative for prediction reliability, but also for the quality of the embedding itself. These findings corroborate that simple embedding distance metrics fail to capture the ‘chemical nuances’ that affect task-specific outcomes and that embedding quality is better assessed through downstream tasks than by proximity to training embeddings, as previously suggested for computer vision⁵⁹.

Furthermore, our results highlight the distinctions between embedding distance, unfamiliarity and prediction uncertainty. While both unfamiliarity and prediction uncertainty correlate with performance independently, the latter does not strongly relate to the structural properties of molecules (Table 1). By contrast, unfamiliarity captures both molecular similarity and predictive performance simultaneously. In other words, embedding distance primarily reflects p(x), uncertainty relates to p(y|x), and unfamiliarity integrates both, providing insight into p(y, x|x). This suggests unfamiliarity as a holistic metric of prediction reliability, effectively linking structural information to prediction confidence.

Virtual hit screening

To further explore the use of unfamiliarity to navigate chemical space, we extended our analysis to large-scale screening libraries to mimic a realistic virtual screening scenario. Although these libraries are unlabelled (and therefore do not allow performance evaluation), they contain a larger and more diverse set of molecules, potentially revealing additional differences compared with the smaller test sets analysed previously.

Using a combined set of 1.4 M molecules from three reliable commercial screening libraries (Asinex⁶⁰, Specs⁶¹ and Enamine⁶²; Supplementary Table 4), we performed inference with all 33 trained models. The screening molecules showed a lower structural overlap with the training sets than the test_OOD molecules (Fig. 3a and Supplementary Fig. 6a). Based on previous results (Fig. 2d), this indicates that we can also expect a corresponding performance drop on the screening molecules. Still, estimated prediction uncertainty did not highlight meaningful differences between screening molecules and test_ID molecules (Fig. 3b; KS statistic D = 0.181, indicating a limited effect size). Prediction uncertainty alone would suggest that the screening libraries fall within a model’s operating limits. Unfamiliarity scores, meanwhile, reveal strong distribution shifts, both overall (Fig. 3c; KS statistic D = 0.999), and especially on a dataset basis (Supplementary Fig. 6).

**Fig. 3: Virtual screening on 1.4M commercially available molecules.**

The estimated prediction uncertainty on this large-scale library does not directly correlate with a molecule’s structural similarity to the training data (Supplementary Fig. 7a, Spearman correlation, r = −0.04 ± 0.02). Unfamiliarity, meanwhile, shows a moderate relationship to structural similarity (Supplementary Fig. 7b, Spearman correlation, r = 0.21 ± 0.02). Finally, prediction uncertainty and unfamiliarity remain independent (Fig. 3d, Spearman correlation, r = −0.03 ± 0.03). Our findings confirm that the observed trends in small datasets remain robust across large screening libraries. Moreover, they highlight the ability of unfamiliarity to detect OOD shifts that routinely used metrics might fail to capture.

Uncertainty and unfamiliarity show stark differences across the molecular library, both from a pan-pharmacological angle (Fig. 3d) and for the well-studied serine/threonine-protein kinase PIM1 as a highlighted case (Fig. 3e). Globally, molecules with high ${\mathbb{U}}\left(x\right)$ scores are ‘structurally atypical’ (for example, molecules i and ii, Fig. 3d). Molecules with low ${\mathbb{U}}\left(x\right)$ scores, meanwhile, display key characteristics of bioactive molecules, for example, steroid-like structures⁶³ and bioactive cores⁶⁴ (molecules iii and iv, Fig. 3d). Similar trends are observed across individual protein targets, as exemplified by PIM1 (Fig. 3e). Here, molecules with low ${\mathbb{U}}\left(x\right)$ contain well-known pharmacophores (for example, pyrimidinone cores⁶⁵).

Finally, we found no relationship between unfamiliarity and the quantitative estimate of drug-likeness⁶⁶ (Spearman correlation: r = −0.04 ± 0.03) or synthetic accessibility⁶⁷ (Spearman correlation: r = 0.01 ± 0.02), demonstrating that low unfamiliarity scores do not simply capture drug-likeness. Importantly, structural insights are not captured by uncertainty estimation. This might indicate that uncertainty estimation is unreliable on strongly OOD data, as has been suggested previously^14,24,25.

Prospective virtual screening

To validate the proposed approach prospectively under real-world conditions, we screened a commercial compound library comprising approximately 180,000 drug-like molecules (from Specs⁶¹; Methods) to identify inhibitors of two pharmacologically relevant kinase targets: PIM1 and cyclin-dependent kinase 1 (CDK1). Notably, CDK1 data were not used elsewhere in this study, serving as a fully independent test case.

For each target, we trained a JMM on all available data (1,443 training molecules for PIM1 and 312 for CDK1) and predicted bioactivity across the screening library. Molecules were then ranked by their distance to the so-called utopia point⁶⁸—the geometric optimum balancing multiple objectives (Methods). Using the predicted bioactivity ${\mathbb{E}}(y|x)$, prediction uncertainty ${{{\mathbb{H}}}}(y|x)$ and unfamiliarity ${\mathbb{U}}(x)$ as three complementary objectives, we selected the ten best molecules with three alternative trade-offs to illustrate how uncertainty and unfamiliarity behave under distribution shifts (Fig. 4a,d):

(A)
High ${\mathbb{E}}\left(y,|,x\right),\,$High ${{{\mathbb{H}}}}(y|x)$, and low ${\mathbb{U}}(x)$.
(B)
High ${\mathbb{E}}\left(y,|,x\right),\,$Low ${{{\mathbb{H}}}}(y|x)$, and low ${\mathbb{U}}(x)$.
(C)
High ${\mathbb{E}}\left(y,|,x\right),\,$Low ${{{\mathbb{H}}}}(y|x)$, and high ${\mathbb{U}}(x)$.

**Fig. 4: Experimental screening for PIM1 and CDK1.**

To limit structural overlap with the training data and among selected molecules, compounds with a Tanimoto coefficient (on ECFPs) ≥0.70 to either the training set or other selected compounds were excluded, thereby further challenging the models beyond their training distribution.

All 60 prioritized molecules (Supplementary Figs. 9 and 10) were structurally distant from their respective training sets, with maximal Tanimoto coefficients (on ECFPs) of 0.28 ± 0.05 for PIM1 and 0.28 ± 0.06 for CDK1. Molecules selected via the high unfamiliarity strategy were particularly atypical for kinase inhibitors (Supplementary Figs. 9 and 10), aligning with previous findings (Fig. 3d,e). After experimentally testing all compounds at a single concentration of 10 µM (Fig. 4b and Supplementary Fig. 8), we identified four initial hits (>50% protein inhibition) and six weak hits (>25% inhibition) for PIM1. For CDK1, we found one initial hit and five weak hits.

The six most active compounds per target were further characterized to determine dose–response curves and the corresponding half maximal inhibitory concentration (IC₅₀). For PIM1, all six compounds showed dose-dependent inhibition (Supplementary Fig. 11), although none achieved full inhibition within the tested concentration range (1 nM–10 µM). Compounds 4, 10 and 25 exhibited clear sigmoidal partial inhibition curves, with low/sub micromolar potency (IC₅₀ 1.5 ± 0.4 µM, IC₅₀ 0.92 ± 0.97 µM and IC₅₀ 0.87 ± 0.5 µM, respectively). Compounds 9 and 18 demonstrated weaker inhibition with incomplete curve resolution; for these, we report upper-bound potency estimates (IC₅₀ 5.9 µM and IC₅₀ 7.5 µM, respectively).

For CDK1, inhibition was generally less pronounced (Fig. 4d and Supplementary Fig. 12), possibly due to the smaller training set (312 molecules). Only compound 40 yielded a complete dose–response curve, with partial inhibition (IC₅₀ 2.9 ± 0.75 µM). Compound 33 showed modest activity with an upper-bound IC₅₀ ~3.4 µM. The remaining compounds (31, 48, 53 and 59) showed insufficient inhibition to estimate an IC₅₀.

Overall, our screening experiment achieved hit rates of approximately 17% for PIM1 (5 of 30 compounds with clear dose–response inhibition) and 7% for CDK1 (2 of 30), with several additional weak actives. These hit rates exceed those typically reported for traditional kinase-focused screening campaigns, which often range from 0.1% to 5% (refs. ^69,70,71,72). Notably, all identified hits had a maximal substructure similarity to the training molecules below 38% (measured as the Tanimoto coefficient on ECFPs), and emerged from purely prospective, machine learning-guided selection.

While the limited number of compounds tested per strategy and the absence of a full 2 × 2 factorial design prevents us from drawing statistically robust conclusions and cleanly separating main effects, five of the seven compounds with low micromolar potency originated from selection method A (low unfamiliarity, high uncertainty). By contrast, selecting molecules with low prediction uncertainty (methods B and C) did not yield a clear advantage, consistent with earlier results indicating that uncertainty is not a reliable performance signal under distribution shifts (Fig. 3a–c). Although preliminary, these prospective results provide practical evidence that the proposed approach can identify novel bioactive matter and support unfamiliarity as a useful handle for navigating chemical space under distribution shifts.

Discussion

This study introduced unfamiliarity—a metric that captures a molecule’s distance from the data distribution learned by a deep learning model. Unfamiliarity is computed via a joint modelling approach, trained to simultaneously perform molecular reconstruction and property prediction. By capturing the model error in reconstructing previously unseen molecules, unfamiliarity quantifies molecular distribution shifts. Our results demonstrate that unfamiliarity is a reliable and powerful indicator of a model’s prediction reliability when applied to new molecules. As a classification reliability metric, unfamiliarity is as informative as prediction uncertainty estimated via approximate Bayesian modelling, yet the two remain independent. Unlike prediction uncertainty, however, unfamiliarity also captures a molecule’s structural distance from the learned data distribution. Notably, in a large-scale virtual screening campaign, unfamiliarity provided far more meaningful molecular insights than uncertainty estimation when faced with strong distribution shifts. The complementarity of unfamiliarity and prediction uncertainty as reliability metrics highlights unfamiliarity as a valuable tool for molecular machine learning.

The prospective validation further underscores the usefulness of the introduced unfamiliarity metric to complement uncertainty-based molecule prioritization. Despite using small training sets and screening only a handful of compounds per target, we identified multiple molecules with low micromolar activity. Prediction uncertainty, meanwhile, did not seem indicative of performance when operating outside the training support^14,24,25. This suggests that unfamiliarity-aware selection can enable informed and precise exploration of the chemical space, unlocking new opportunities for discovering novel molecular candidates.

Our findings advocate for the adoption of unfamiliarity over traditional, similarity-based methods for applicability domain definition. Moreover, because unfamiliarity bears promise to reveal distribution shifts that would be undetected through molecular similarity or underestimated by uncertainty-based approaches, we recommend its adoption when screening large-scale molecular libraries. Crucially, because the unfamiliarity landscape reveals gaps in a model’s learned distribution, its applications could assist in guiding reinforcement learning applications⁷³ or extend beyond ‘one-shot’ virtual screening to iterative approaches such as active learning^74,75 to guide molecule acquisition. Ultimately, our study highlights the advantages of joint modelling not only in de novo design applications²⁷ but also in capturing structure–activity relationships. The information provided by joint modelling and unfamiliarity is expected to drive the development of more reliable and generalizable models—accelerating the exploration of novel regions in the chemical space with greater confidence and precision.

Methods

Data preprocessing and analysis

Molecular representation and description

Each molecular structure in this study was represented as a SMILES⁴² string. CATS⁵⁶ descriptors and 2048-bit ECFPs⁵³ (using a radius of 2) were computed for each molecule.

Data collection and curation

Thirty-three labelled datasets of molecular structures with their corresponding experimental target property were used:

LIT-PCBA⁴⁹ (3 targets). The ESR1 (antagonism), TP53 and PPARγ bioactivity datasets were downloaded from LIT-PCBA⁴⁹ (accessed in August 2023 at https://drugdesign.unistra.fr/LIT-PCBA).
The Ames mutagenicity dataset⁵⁰ was downloaded from http://pubs.acs.org.
MoleculeACE⁴⁸ (29 targets). Bioactivity datasets were downloaded from https://github.com/molML/MoleculeACE.

Moreover, small molecules were collected from ChEMBLv33⁴⁵ for model pretraining. Because molecular structures from ChEMBL were used for pretraining, molecules with a Bemis–Murcko scaffold⁷⁶ similar to any such scaffold in the labelled datasets (Tanimoto similarity coefficient⁵⁴ on EFCPs larger than 0.7) were removed. This included molecules without a Bemis–Murcko scaffold (that is, containing no rings).

For prospective virtual screening, the most recent CDK1 (CHEMBL308) data were fetched from ChEMBL v35⁴⁵ (accessed in April 2025) as an additional and independent dataset. Raw data were processed in accordance with previous work⁴⁸.

All bioactivity endpoints of the MoleculeACE⁴⁸ and CDK1 datasets were converted from continuous regression labels into binary classification labels. Molecules with an EC₅₀ (half maximal effective concentration) or K_i (inhibitory constant) of 100 nM or lower were labelled as bioactive, whereas less potent molecules were labelled as inactive. For the LIT-PCBA⁴⁹ and Ames mutagenicity dataset⁵⁰, their original binary classification labels were used.

For all datasets, SMILES strings were preprocessed using RDKit v. 2024.3.3⁷⁷. For each SMILES string, stereochemistry tokens as well as salts and solvents (Supplementary Table 2) were removed. Each molecule was sanitized, neutralized using predefined neutralization reactions (Supplementary Table 3), and its SMILES string was canonicalized. Disconnected structures and molecules that contained formal charges, contained complex ring systems (SMILES strings with a ring index of 9 or higher), non-standard isotopes or any atoms other than Cl, Br, H, C, N, O, F, S and I were removed. Molecules were removed if they contained more than 100 tokens in their canonical SMILES string or if they could not be featurized into CATS descriptors and/or ECFP fingerprints. Sizes of datasets before and after data curation are presented in Supplementary Table 1.

Virtual screening library (retrospective)

Molecules were gathered from three commercial screening libraries:

Asinex⁶⁰ screening libraries: downloaded from https://www.asinex.com/screening-libraries-(all-libraries) (accessed in March 2025).
Specs⁶¹: downloaded from https://www.specs.net (accessed in March 2025).
Enamine hit locator⁶²: downloaded from https://enamine.net/compound-libraries/diversity-libraries/hit-locator-library-460 (accessed in March 2025).

All molecules were processed in the same way as the training datasets. Finally, all unique molecules were aggregated.

Virtual screening library (prospective)

The most recent Specs library was downloaded from https://www.specs.net/index.php?view=databases&page=download (accessed in April 2025). All molecules were processed in the same way as the training datasets. To ensure that molecules were compatible with our experimental assay (likely soluble in 1% dimethylsulfoxide (DMSO)), physicochemical rules were enforced based on relaxed rule-of-five and Veber criteria:

(1)
Molecular weight between 200 g mol⁻¹ and 600 g mol⁻¹.
(2)
logP lower than 6.
(3)
Total polar surface area between 20 Å² and 140 Å².
(4)
Number of hydrogen bond donors lower than six.
(5)
Number of rotatable bonds lower than ten.
(6)
A maximum of two rule-of-five violations.

In addition, molecules with a terminal enone, isocyanate, quinone, aromatic nitro groups, azide or epoxide groups were removed by using SMILES Arbitrary Target Specification (SMARTS) patterns to prevent assay interference. Furthermore, to prevent the selection of trivially simple molecules and enrich the general-purpose Specs library for kinase-relevant chemical space, several general criteria were enforced:

(1)
A molecule must have an ATP mimetic core, that is, at least one heteroatom in a ring, or a fused carbocyclic system.
(2)
A molecule must have a polar anchor to ensure solubility or solvent interaction.
(3)
A molecule must have enough hydrophobic mass or planarity to fill the kinase ATP pocket (for example, gatekeeper region).
(4)
A molecule must have a directional H-bond donor/acceptor to ensure potential for hinge interaction.
(5)
A molecule must have two or more rings.

These rules are intentionally permissive for new chemotypes, and 98.6% of all kinase inhibitors that ever went into clinical trials⁷⁸ pass these filters. Finally, molecules with a Tanimoto similarity on ECFPs >0.7 to any molecule in the respective target’s data (PIM1, and CDK1) were removed. This left 185,298 and 185,336 screening molecules for PIM1 and CDK1, respectively.

Molecular cyclic skeletons

Cyclic skeletons⁵¹ (core ring systems without exocyclic substituents) were extracted from molecules to serve as the most fundamental molecular scaffold representation. From each molecule, Bemis–Murcko scaffolds⁷⁶ were obtained, removing peripheral substituents. Remaining double-bonded exocyclic substituents were then removed, and all atoms and bonds were made generic to obtain the final cyclic skeleton.

Data splitting

The curated molecular structures from ChEMBL were split into a training (80%, n = 1,230,041), a test set (10%, n = 153,755) and a validation (10%, n = 153,755) set, using random splitting. All labelled datasets were split into a training set (~50%), test set (~25%) and OOD set (~25%). To determine the OOD molecules, spectral clustering was performed on unique cyclic skeletons (see below). The molecules corresponding to the scaffolds in the clusters with the lowest mean cluster similarity to all other clusters that constituted approximately 25% of the total dataset size were taken as the OOD set. The remaining molecules (that is, the ~75% most similar molecules) were split randomly in a train and test set, with the test set being equal in size to the OOD set. An overview of all datasets is presented in Supplementary Table 1.

Spectral clustering

Spectral clustering was performed on a molecular similarity matrix A using Sci-kit learn⁷⁹. A is an n × n matrix where each element A_ij is the Tanimoto coefficient on ECFPs T_ij between two molecular structures. By using a molecular similarity matrix directly, we bypass the complex, high-dimensional and non-Euclidean nature of molecular structures. From this affinity matrix, the symmetrically normalized Laplacian was constructed as follows:

$${L}_{\mathrm{sym}}=I-{D}^{-1/2}A{D}^{-1/2},$$

(3)

where I is the identity matrix and D is the degree matrix. Subsequently, eigenvalue decomposition was performed:

$${L}_{\mathrm{sym}}={\bf{U}}\Lambda {{\bf{U}}}^{T}{.}$$

(4)

To determine the number of clusters k for spectral clustering, the eigenvalues λ₁, λ₂, …, λ_n were sorted in ascending order and the elbow (the point of maximal curvature) of the resulting sequence was estimated using the kneed algorithm⁸⁰. Finally, the spectral embeddings of the data were clustered by taking the top k (smallest) eigenvectors U_k, normalizing U_k to unit length and performing k-means clustering on all rows (u_k).

Chemical space visualization

To visualize each labelled dataset, molecules were first encoded as ECFPs. The resulting binary ECFPs were reduced to 100 components using truncated singular value decomposition and embedded into a two-dimensional space using t-distributed stochastic neighbour embedding with a perplexity value of 30 and default settings.

MCS fraction

To compute molecular core similarity, we computed the MCS fraction⁵⁵ between a molecule M_a and a reference molecule M_b as

$$\mathrm{MCS}\,\mathrm{fraction}=\frac{\left|\mathrm{MCS}\left({M}_{a},\,{M}_{b}\right)\right|}{\left|{M}_{a}\right|},$$

(5)

where |M_a| is the number of atoms in M_a and MCS is the maximal common substructure between the two molecules M_a and M_b, as determined by the FMCS algorithm⁷ in RDKit. A high MCS fraction indicates that a molecule shares a significant portion of their overall core structure with a reference molecule. This implementation is asymmetric.

Molecular complexity

To quantify molecular complexity we compute the well-established Bertz complexity⁸¹ and Böttcher complexity⁸². In addition, to align complexity measures with the task of reconstructing SMILES strings, we also estimated the complexity of the molecular graph directly⁸³ as

$${C}_{\mathrm{graph}}=V{\log }_{2}V-\mathop{\sum }\limits_{i\in {\mathscr{G}}}{V}_{i}{\log }_{2}{p}_{i},$$

(6)

where G represents the molecular graph, V is the total number of elements in the graph and V_i is the number of the elements in the ith set of elements. In a similar fashion, we also estimated the complexity of a SMILES string, via their entropy computed as

$${C}_{\mathrm{SMILES}}=-\mathop{\sum }\limits_{i\in S}{p}_{i}{\log }_{2}{p}_{i},$$

(7)

where S represents the set of unique tokens in a SMILES string, and p_i is the probability of the ith token occurring in S. Tokens representing the start, end of sequence, and padding were not considered. Moreover, for each molecule, we counted the number of SMILES tokens (excluding padding), the number of SMILES string branches (that is, ‘(‘ tokens), and the presence of 50 unique molecular patterns⁷⁴.

Virtual screening

Top-k molecules were selected in a multi-objective manner by selecting the k molecules closest to the utopia point⁶⁸. The distance to this ideal point can be calculated as

$${d}_{\mathrm{utopia}}=\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({\mathrm{norm}}_{i}\right)}^{2}},$$

(8)

where norm_i is the normalized objective that is either maximized (for example, predicted bioactivity) or minimized (for example, prediction uncertainty):

$$\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{max}]}=\frac{{x}_{\text{max}}-{x}_{i}}{{x}_{\text{max}}-{x}_{\text{min}}}{\rm{or}}\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{min}]}=\frac{{x}_{i}-{x}_{\text{min}}}{{x}_{\text{max}}-{x}_{\text{min}}}.$$

(9)

Machine learning

Encoder

Canonical SMILES strings were encoded by a one-dimensional convolutional neural network. SMILES string character tokens were embedded using a randomly initialized trainable embedding layer of size 128. Several one-dimensional convolutional layers were used with a stride of 1 and no padding. Each layer was followed by a ReLU activation, standard max pooling with a kernel size equal to that of the convolutional layers, and dropout. Both convolutional and pooling layers used a stride of 1 and no padding. The final output was flattened and compressed to a latent vector (z) of size 128 using a fully connected layer. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of convolutional layers [2, 3], filter size [256, 512], kernel size [6, 8], weight decay on convolutional neural network weights [0, 1 × 10⁻⁴] and dropout [0, 0.1].

Decoder

Encoded latent molecular representations in z were reconstructed back to SMILES strings using a conditioned LSTM model. A randomly initialized trainable embedding layer of 128 neurons was used to embed SMILES string character tokens. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of LSTM layers (n_layers) [2, 3] and the LSTM hidden size (${\mathrm{size}}_{\mathrm{layers}}$) [256, 512]. Models were trained autoregressively without teacher forcing using next token prediction based on the tokens predicted in the previous steps rather than the ground truth tokens. A reconstruction loss normalized for sequence length was used (equation (1)). To condition the model, the LSTM hidden state h₀ was initialized with z for every molecule. To correctly match all dimensions of h₀ (${n}_{\mathrm{layers}},{\mathrm{size}}_{\mathrm{layers}}$), z was first transformed to ${n}_{\mathrm{layers}}\,\times {\mathrm{size}}_{\mathrm{layers}}$ with a fully connected layer, after which it was reshaped into ${{\rm{n}}}_{\mathrm{layers}}$ chunks of ${\mathrm{size}}_{\mathrm{layers}}$.

Approximate Bayesian classifier

Labels were predicted from either ECFPs or SMILES strings encoded into latent molecular representationsz. A MLP was used with several fully connected layers and an output layer consisting of two neurons. The number of MLP layers [2, 3] and the MLP hidden size [1,024, 2,048] were optimized (see ‘Hyperparameter optimization’ section). To estimate prediction uncertainty, anchored ensembling⁴⁷ was implemented on the MLP as in our previous work⁷⁴. We used an ensemble of M = 10 models. For each model, $m\in [1\ldots M]$, we regularized its parameters θ_m with a set of ‘anchored’ parameters θ_anchor,m that prevent different models in the ensemble to converge to the same parameter space. Each model is initiated with distinct θ_anchor, which is controlled by different random seeds. The classification loss in our implementation is defined as

$${{\mathscr{L}}}_{\mathrm{MLP}}=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}\mathop{\underbrace{\log {p}_{m}\left(\,{y|x}\right)s}}\limits_{\text{prediction}}+\mathop{\underbrace{\lambda {{||}{\theta }^{m}-{\theta }_{\mathrm{anchor}}^{m}{||}}^{2}}}\limits_{\mathrm{anchoring}},\,$$

(10)

where λ is a regularization coefficient (set to 3 × 10⁻⁴). To estimate the expected value ${\mathbb{E}}$ of each molecule x, we take the mean prediction over the whole ensemble, as follows:

$${\mathbb{E}}\left(\,{y|x}\right)=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,y,|,x\right).$$

(11)

Similarly, we estimate prediction uncertainty for each molecule $x$ as the mean entropy ${\mathbb{H}}$ over all models in the ensemble:

$${{{\mathbb{H}}}}\left({y|x}\right)=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,{y|x}\right)\log {p}_{m}\left(\,{y|x}\right).$$

(12)

JMM

Canonical SMILES strings were encoded into compressed latent molecular representations ${\bf{z}}$ using the encoder model described above. Subsequently, ${\bf{z}}$ was used to perform both molecular property prediction with a classifier and molecular reconstruction using the ${\bf{z}}$-conditioned decoder model²⁷. The model was optimized in a joint fashion using the following weighted composite loss function:

$${{\mathscr{L}}}_{\mathrm{JMM}}={{\mathscr{L}}}_{\mathrm{reconstruction}}+\gamma \times {{\mathscr{L}}}_{\mathrm{MLP}},$$

(13)

where the scalar γ was set at 0.1 based on preliminary experiments. A regular autoencoder was used as preliminary experiments showed no performance benefits of the more complex variational autoencoder⁸⁴ architecture.

RF

An RF classifier was trained on either ECFPs or CATS⁵⁶ descriptors. The following hyperparameters were optimized (see ‘Hyperparameter optimization’): the number of trees [100, 250, 500, 1,000], the maximal tree depth [10, 20, 30, ∞] and the minimal samples per split [2, 5, 10].

Model training

Autoencoders

Encoder–decoder models were (pre)trained to reconstruct SMILES strings of general drug-like molecules from ChEMBL using the Adam optimizer. Mini batches of 256 random molecules were sampled from the training data for 1,000,000 steps using uniform sampling. Gradients were clipped with a max norm of 5. Early stopping with a patience of up to 20 evaluation checkpoints was implemented by monitoring validation loss every 10,000 steps. The model checkpoint with the best validation loss was used.

Classifiers

Classifiers using ECFPs or SMILES strings as input were trained for molecular property prediction on each of the labelled dataset using a similar setup to the autoencoders. However, tenfold Monte Carlo cross-validation was used with 10% validation splits instead of one predefined data split. Mini batches of 64 were resampled during training based on the occurrence of their class with

$${P}_{c}=1-\frac{{n}_{c}}{N},$$

(14)

where P_c is the probability of sampling class c, n_c is the number of samples of class c, and N is the total number of samples. Models were trained for 5,000 steps with an early stopping patience of 10 evaluation checkpoints, performed every 10 steps. The model checkpoint with the best validation loss was used. For the RF control models, molecules were weighted inversely proportionally to their class frequency to mitigate class imbalance during training.

JMMs

Joint models, each consisting of a SMILES string encoder, a decoder and a classifier (Fig. 1b), were initialized with pretrained weights. The SMILES string encoders and classifiers used weights from models trained on the labelled datasets. For the decoder, decoder weights were used from an autoencoder pretrained on ChEMBL. Using a mini batch size of 64, the joint models were finetuned for 10,000 steps with an early stopping patience of 50 evaluation checkpoints, performed every 20 steps. A learning rate of 3 × 10⁻⁶ was used for the encoder and classifier, whereas a learning rate of 3 × 10⁻⁷ was used for the decoder. No weight decay was applied.

Hyperparameter optimization

Hyperparameters, as specified previously, were optimized for all autoencoders, classifiers (using ECFPs or SMILES strings), and RF control models using a simple grid search. Tenfold Monte Carlo cross-validation was used, repeatedly using 10% of the training data as a validation split. The hyperparameters of the model with the best mean validation loss was used.

Model evaluation

Predictions were evaluated according to the following performance metrics:

$$\text{Balanced accuracy}=\frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}+\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right),$$

(15)

$$\text{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$

(16)

where TP is the number of true (that is, correctly predicted) positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. In addition, the hit rate (true positive rate) was determined for virtual screening experiments as

$$\text{Hit rate}=\frac{\mathrm{TP}}{P},$$

(17)

and enrichment factor as

$$\text{Enrichment factor}=\frac{{\mathrm{TP}}_{k}}{P/N}.$$

(18)

Here, TP_k is the number of correctly identified positives in the subset of k prioritized molecules, P is the total number of positives in the full dataset, and N is the total number of molecules in the full dataset.

Biological characterization

Sixty screening compounds were purchased from Specs Compound Handling B.V. and dissolved at 10 mM in 100% DMSO.

To screen for bioactivity, a point screening was first performed at a concentration of 10 µM (in 1% DMSO) in technical triplicates using the ADP-Glo Kinase Assay platform from Promega using the Chemi-Verse PIM1 Kinase Assay Kit and the Chemi-Verse CDK1/CyclinA2 Kinase Assay Kit from BPS Bioscience in Costar flat white 96-well plates. AZD1208 (CAS 1204144-28-4) and dinaciclib (CAS 779353-01-4), purchased from TargetMol Chemicals, were used as positive controls for PIM1 and CDK1, respectively. Bioactivity was measured as the area under the curve of an 18-step luminescence scan between 398 nm and 653 nm with an integration time of 1 s and a settle time of 100 ms, normalized for the signal in buffer-only wells.

For each target protein, the six compounds with the highest bioactivity at 10 µM were followed up with an 8-point dose–response curve using the same assay. Screening compounds were measured in technical triplicates from 10 µM to 0.0046 µM, whereas reference compounds were measured in duplicate.

Hardware and training set-up

All computational experiments were performed on a Lenovo ThinkSystem SD650-N v2 server equipped with Intel Xeon Platinum 8360Y central processing units and NVIDIA A100 (40 GB) graphics processing units. Up to five models were trained in parallel on a single graphics processing unit.

Software and code

All code was implemented in Python (v. 3.12). Deep learning models were implemented using PyTorch (v. 2.3.0)⁸⁵. Traditional machine learning models and clustering was implemented using Sci-kit learn v.1.4.0 (ref. ⁷⁹). All molecular data were handled using RDKit (v.2024.3.3)⁷⁷. For data visualization, R (v.4.3.0) and the R package ggplot2 (v.3.4.2) were used along with Adobe Illustrator.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All processed datasets and results are available via Zenodo at https://doi.org/10.5281/zenodo.14865513 (ref. ⁸⁶), except for data derived from commercial screening libraries due to licensing restrictions. Source data are provided with this paper. These data are also available via figshare at https://doi.org/10.6084/m9.figshare.30665201.v1 (ref. ⁸⁷). Source data are provided with this paper.

Code availability

Code is available via figshare at https://doi.org/10.6084/m9.figshare.30665201.v1 (ref. ⁸⁷). The Python code to replicate and extend our study, alongside the R scripts to visualize all results, is available via GitHub at https://github.com/molML/JointMolecularModel and via Zenodo at https://doi.org/10.5281/zenodo.18846066 (ref. ⁸⁸).

References

Gawehn, E., Hiss, J. A. & Schneider, G. Deep learning in drug discovery. Mol. Inform. 35, 3–14 (2016).
Article Google Scholar
Jiménez-Luna, J., Grisoni, F., Weskamp, N. & Schneider, G. Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opin. Drug Discov. 16, 949–959 (2021).
Article Google Scholar
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Article Google Scholar
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
Article Google Scholar
Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).
Article Google Scholar
Walters, W. P., Stahl, M. T. & Murcko, M. A. Virtual screening—an overview. Drug Discov. Today 3, 160–178 (1998).
Article Google Scholar
Yera, E. R., Cleves, A. E. & Jain, A. N. Chemical structural novelty: on-targets and off-targets. J. Med. Chem. 54, 6771–6785 (2011).
Article Google Scholar
Wills, T. J. & Lipkus, A. H. Structural approach to assessing the innovativeness of new drugs finds accelerating rate of innovation. ACS Med. Chem. Lett. 11, 2114–2119 (2020).
Article Google Scholar
Shimizu, Y. et al. AI-driven molecular generation of not-patented pharmaceutical compounds using world open patent data. J. Cheminform 15, 120 (2023).
Article Google Scholar
Atanasov, A. G. et al. Natural products in drug discovery: advances and opportunities. Nat. Rev. Drug Discov. 20, 200–216 (2021).
Article Google Scholar
Wright, G. D. The antibiotic resistome: the nexus of chemical and genetic diversity. Nat. Rev. Microbiol. 5, 175–186 (2007).
Article Google Scholar
Ji, Y. et al. DrugOOD: out-of-distribution dataset curator and benchmark for AI-aided drug discovery—a focus on affinity prediction problems with noise annotations. In Proc. AAAI Conference on Artificial Intelligence Vol. 37 8023–8031 (AAAI, 2023).
Dias, A. L., Bustillo, L. & Rodrigues, T. Limitations of representation learning in small molecule property prediction. Nat. Commun. 14, 6394 (2023).
Article Google Scholar
Tossou, P., Wognum, C., Craig, M., Mary, H. & Noutahi, E. Real-world molecular out-of-distribution: specification and investigation. J. Chem. Inf. Model. 64, 697–711 (2024).
Article Google Scholar
Martinez-Mayorga, K. et al. The pursuit of accurate predictive models of the bioactivity of small molecules. Chem. Sci. 15, 1938–1952 (2024).
Article Google Scholar
Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23, 101681 (2020).
Article Google Scholar
Nigam, A. et al. Assigning confidence to molecular property prediction. Expert Opin. Drug Discov. 16, 1009–1023 (2021).
Article Google Scholar
Nada, H., Meanwell, N. A. & Gabr, M. T. Virtual screening: hope, hype, and the fine line in between. Expert. Opin. Drug Discov. 20, 145–162 (2025).
Article Google Scholar
Hanser, T., Barber, C., Marchaland, J. F. & Werner, S. Applicability domain: towards a more formal definition. SAR QSAR Environ. Res. 27, 865–881 (2016).
Article Google Scholar
Mathea, M., Klingspohn, W. & Baumann, K. Chemoinformatic classification methods and their applicability domain. Mol. Inform. 35, 160–180 (2016).
Article Google Scholar
Wang, Z. & Chen, J. Applicability domain characterization for machine learning QSAR models. In Proc. Machine Learning and Deep Learning in Computational Toxicology (ed. Hong, H.) 323–353 (Springer, 2023).
Wang, D. et al. Learning with uncertainty to accelerate the discovery of histone lysine-specific demethylase 1A (KDM1A/LSD1) Inhibitors. Brief. Bioinform. 24, bbac592 (2023).
Article Google Scholar
Soleimany, A. P. et al. Evidential deep learning for guided molecular property prediction and discovery. ACS Cent. Sci. 7, 1356–1367 (2021).
Article Google Scholar
Ovadia, Y. et al. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Proc. 33rd Conference on Neural Information Processing Systems (Curran Associates, 2019).
Scalia, G., Grambow, C. A., Pernici, B., Li, Y.-P. & Green, W. H. Evaluating scalable uncertainty estimation methods for deep learning-based molecular property prediction. J. Chem. Inf. Model. 60, 2697–2717 (2020).
Article Google Scholar
Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Sci. Adv. 7, eabg3338 (2021).
Article Google Scholar
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Article Google Scholar
Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37, 1700153 (2018).
Article Google Scholar
Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 1–9 (2018).
Article Google Scholar
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
Article Google Scholar
Iovanac, N. C. & Savoie, B. M. Simpler is better: how linear prediction tasks improve transfer learning in chemical autoencoders. J. Phys. Chem. A 124, 3679–3685 (2020).
Article Google Scholar
Kingma, D. P., Mohamed, S., Rezende, D. J. & Welling, M. Semi-supervised learning with deep generative models. Adv. Neural Inf. Process. Syst. 27, 3581–3589 (2014).
Google Scholar
Doersch, C. & Zisserman, A. Multi-task self-supervised visual learning. In Proc. IEEE International Conference on Computer Vision (IEEE, 2017).
Assran, M. et al. Self-supervised learning from images with a joint-embedding predictive architecture. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15619–15629 (IEEE, 2023).
Kim, D., Yoo, Y., Park, S., Kim, J. & Lee, J. SelfReg: self-supervised contrastive regularization for domain generalization. In Proc. IEEE/CVF International Conference on Computer Vision 9619–9628 (IEEE, 2021).
Sun, Y. et al. Test-time training with self-supervision for generalization under distribution shifts. In Proc. 37th International Conference on Machine Learning 9229–9248 (PMLR, 2020).
Albuquerque, I., Naik, N., Li, J., Keskar, N. & Socher, R. Improving out-of-distribution generalization via multi-task self-supervised pretraining. Preprint at https://arxiv.org/abs/2003.13525 (2020).
Pimentel, M. A. F., Clifton, D. A., Clifton, L. & Tarassenko, L. A review of novelty detection. Signal Process. 99, 215–249 (2014).
Article Google Scholar
Zhou, Y. Rethinking reconstruction autoencoder-based out-of-distribution detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 7379–7387 (IEEE, 2022).
Chalapathy, R. & Chawla, S. Deep learning for anomaly detection: a survey. Preprint at https://arxiv.org/abs/1901.03407 (2019).
Ruff, L. et al. Deep semi-anomaly detection. In Proc. International Conference on Learning Representations (OpenReview.net, 2020).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Article Google Scholar
Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). In Proc. International Conference on Learning Representations Workshop (Curran Associates, 2016).
Özçelik, R. & Grisoni, F. A hitchhiker’s guide to deep chemical language processing for bioactivity prediction. Digit. Discov. 4, 316–325 (2025).
Article Google Scholar
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Article Google Scholar
Skinnider, M. A., Stacey, R. G., Wishart, D. S. & Foster, L. J. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759–770 (2021).
Article Google Scholar
Pearce, T., Leibfried, F. & Brintrup, A. Uncertainty in neural networks: approximately Bayesian ensembling. In Proc. 23rd International Conference on Artificial Intelligence and Statistics 234–244 (PMLR, 2020).
van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).
Article Google Scholar
Tran-Nguyen, V.-K., Jacquemard, C. & Rognan, D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J. Chem. Inf. Model. 60, 4263–4273 (2020).
Article Google Scholar
Hansen, K. et al. Benchmark data set for in silico prediction of Ames mutagenicity. J. Chem. Inf. Model. 49, 2077–2081 (2009).
Article Google Scholar
Manelfi, C. et al. ‘Molecular anatomy’: a new multi-dimensional hierarchical scaffold analysis tool. J. Cheminform 13, 54 (2021).
Article Google Scholar
Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry. J. Med. Chem. 57, 3186–3204 (2014).
Article Google Scholar
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Article Google Scholar
Cereto-Massagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).
Article Google Scholar
Rossen, L., Sirockin, F., Schneider, N. & Grisoni, F. Scaffold hopping with generative reinforcement learning. J. Chem. Inf. Model. 65, 6513–6525 (2025).
Article Google Scholar
Reutlinger, M. et al. Chemically Advanced Template Search (CATS) for scaffold-hopping and prospective target prediction for ‘orphan’ molecules. Mol. Inform. 32, 133–138 (2013).
Article Google Scholar
Fooladi, H., Vu, T. N. L., Mathea, M. & Kirchmair, J. Evaluating machine learning models for molecular property prediction: performance and robustness on out-of-distribution data. J. Chem. Inf. Model. 65, 9871–9891 (2025).
Article Google Scholar
Lee, K., Lee, K., Lee, H. & Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Adv. Neural Inf. Process. Syst. 31, 7167–7177 (2018).
Google Scholar
Boix-Adsera, E., Lawrence, H., Stepaniants, G. & Rigollet, P. GULP: a prediction-based metric between representations. Adv. Neural Inf. Process. Syst. 35, 7115–7127 (2022).
Article Google Scholar
Asinex screening library. Asinex https://www.asinex.com/screening-libraries-(all-libraries) (2025).
Specs screening library. Specs https://www.specs.net, (2025).
Enamine Hit Locator. Enamine https://enamine.net/compound-libraries/diversity-libraries/hit-locator-library-460, (2025).
Welsch, M. E., Snyder, S. A. & Stockwell, B. R. Privileged scaffolds for library design and drug discovery. Curr. Opin. Chem. Biol. 14, 347–361 (2010).
Article Google Scholar
Roskoski, R. Properties of FDA-approved small molecule protein kinase inhibitors: a 2024 update. Pharmacol. Res. 200, 107059 (2024).
Article Google Scholar
Wu, P., Nielsen, T. E. & Clausen, M. H. FDA-approved small-molecule kinase inhibitors. Trends Pharmacol. Sci. 36, 422–439 (2015).
Article Google Scholar
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Article Google Scholar
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Article Google Scholar
Gunantara, N. A review of multi-objective optimization: methods and its applications. Cogent Eng. 5, 1502242 (2018).
Article Google Scholar
Liao, G. et al. Identification of small-molecule inhibitors of human inositol hexakisphosphate kinases by high-throughput screening. ACS Pharmacol. Transl. Sci. 4, 780–789 (2021).
Article Google Scholar
Imamura, R. M. et al. Inexpensive high-throughput screening of kinase inhibitors using one-step enzyme-coupled fluorescence assay for ADP detection. SLAS Discov. 24, 284–294 (2019).
Article Google Scholar
Puhl-Rubio, A. C. et al. Use of protein kinase–focused compound libraries for the discovery of new inositol phosphate kinase inhibitors. SLAS Discov. 23, 982–988 (2018).
Article Google Scholar
Mezna, M. et al. Development of a high-throughput screening method for LIM kinase 1 using a luciferase-based assay of ATP consumption. SLAS Discov. 17, 460–468 (2012).
Article Google Scholar
Yoshizawa, T. et al. A data-driven generative strategy to avoid reward hacking in multi-objective molecular design. Nat. Commun. 16, 2409 (2025).
Article Google Scholar
van Tilborg, D. & Grisoni, F. Traversing chemical space with active deep learning for low-data drug discovery. Nat. Comput. Sci. 4, 786–796 (2024).
Article Google Scholar
Schmidt, S., Schenk, L., Schwinn, L. & Günnemann, S. Joint out-of-distribution filtering and data discovery active learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 25677–25687 (IEEE, 2025).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996.
Article Google Scholar
Landrum, G. RDKit: Open-Source Cheminformatics. RDKit https://www.rdkit.org (2006).
Carles, F., Bourg, S., Meyer, C. & Bonnet, P. PKIDB: a curated, annotated and updated database of protein kinase inhibitors in clinical trials. Molecules 23, 908 (2018).
Article Google Scholar
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Satopaa, V., Albrecht, J., Irwin, D. & Raghavan, B. Finding a ‘kneedle’ in a haystack: detecting knee points in system behavior. In Proc. 31st International Conference on Distributed Computing Systems Workshops (IEEE, 2011).
Bertz, S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 103, 3599–3601 (1981).
Article Google Scholar
Böttcher, T. An additive definition of molecular complexity. J. Chem. Inf. Model. 56, 462–470 (2016).
Article Google Scholar
Bonchev, D., Mekenyan, O. V. & Trinajstić, N. Isomer discrimination by topological information approach. J. Comput. Chem. 2, 127–148 (1981).
Article Google Scholar
Kipf, T. N. & Welling, M. Variational graph auto-encoders. In Proc. NeurIPS Workshop on Bayesian Deep Learning (Curran Associates, 2016).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019).
Google Scholar
van Tilborg, D., Rossen, L. & Grisoni, F. Zenodo repository (data). Zenodo https://doi.org/10.5281/zenodo.14865513 (2026).
van Tilborg, D., Rossen, L. & Grisoni, F. Data for figure 2–4 for the paper “Molecular deep learning at the edge of chemical space”. figshare https://doi.org/10.6084/m9.figshare.30665201.v1 (2025).
van Tilborg, D., Rossen, L. & Grisoni, F. Zenodo repository (code). Zenodo https://doi.org/10.5281/zenodo.18846066 (2026).

Download references

Acknowledgements

This research was funded by the European Union (ERC, ReMINDER, 101077879, to F.G.). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. We acknowledge support from the Federation of European Biochemical Societies (FEBS) Excellence Award (to F.G.), the Irene Curie Fellowship, the Centre for Living Technologies and SURF (NWO grant EINF-5379 to D.v.T.). We thank A. Gardin for brainstorming on figures, R. Özçelik for technical feedback, G. Landrum for fruitful scientific discussions, S. Sueron for suggestions on the experimental work and E. R. Starr for technical feedback and proofreading the paper.

Author information

Authors and Affiliations

Institute for Complex Molecular Systems, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands
Derek van Tilborg, Luke Rossen & Francesca Grisoni
Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, the Netherlands
Derek van Tilborg & Francesca Grisoni

Authors

Derek van Tilborg
View author publications
Search author on:PubMed Google Scholar
Luke Rossen
View author publications
Search author on:PubMed Google Scholar
Francesca Grisoni
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization: D.v.T. and F.G.; methodology: D.v.T. and F.G.; data curation: D.v.T.; computational experiments: D.v.T.; biochemical experiments: D.v.T.; formal analysis: D.v.T.; software: D.v.T. with contributions from L.R.; visualization: D.v.T.; writing—original draft: D.v.T.; writing—review and editing: all authors.

Corresponding author

Correspondence to Francesca Grisoni.

Ethics declarations

Competing interests

The authors declare no competing interest.

Peer review

Peer review information

Nature Machine Intelligence thanks Nessa Carson, Haicang Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF )

Supplementary Tables 1–4 and Figs. 1–12.

Reporting Summary (download PDF )

Peer Review File (download PDF )

Source data

Source Data Fig. 2 (download ZIP )

Processed results to plot Fig. 2.

Source Data Fig. 3 (download ZIP )

Processed results to plot Fig. 3.

Source Data Fig. 4 (download ZIP )

Processed results to plot Fig. 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

van Tilborg, D., Rossen, L. & Grisoni, F. Molecular deep learning at the edge of chemical space. Nat Mach Intell 8, 575–587 (2026). https://doi.org/10.1038/s42256-026-01216-w

Download citation

Received: 01 August 2025
Accepted: 06 March 2026
Published: 22 April 2026
Version of record: 22 April 2026
Issue date: April 2026
DOI: https://doi.org/10.1038/s42256-026-01216-w