Main

Molecular machine learning is rapidly gaining traction in early drug discovery1,2,3,4,5. One key objective is identifying novel bioactive molecules (‘hits’) on one or more pharmacological targets6. In this context, finding structurally novel hit molecules is crucial for addressing unmet therapeutic needs7,8, ensuring commercial viability9 and overcoming drug resistance10,11. However, moving beyond the structural features of the training molecules (for example, to identify novel bioactive molecular cores) poses a substantial challenge for machine learning models, which often fail when applied to out-of-distribution (OOD) molecules12,13,14. This is especially true for discrete data, such as molecules, which can quickly deviate from the data distribution learned during model training. This is further exacerbated by the scarcity of structurally diverse molecular data with high-quality experimental annotations12,13,14,15, due to the costly and time-consuming nature of biochemical experiments. As a result, training sets typically contain only hundreds of molecules, while libraries used for screening may contain billions of existing, but previously unseen, chemicals to be predicted16. Dealing with the resulting distribution shifts makes the discovery of structurally novel hit molecules with machine learning a herculean task. In this regard, quantifying how reliable predictions are beyond the ‘edge’ of the explored chemical space holds enormous promise.

Ensuring prediction reliability in prospective hit-screening campaigns has been an active topic of research17,18. A well-established approach involves defining an applicability domain19,20, which delimits the chemical space of reliable predictions, most often via a threshold on molecular similarity to the training data21 (Fig. 1a). However, this method does not incorporate the information learned by the model, and, due to its similarity-based definition, hampers the discovery of structurally novel molecules. Another widely used approach is based on uncertainty estimation22,23, which leverages the model’s prediction confidence, often through probabilistic modelling techniques. While uncertainty estimation allows the consideration of structurally novel molecules in principle, it may provide overconfident predictions when confronted with OOD samples14,24,25. Hence, being able to make reliable predictions on OOD molecules remains one of the core challenges of molecular machine learning in drug discovery.

Fig. 1: Estimating unfamiliarity of molecular data using joint modelling.
Fig. 1: Estimating unfamiliarity of molecular data using joint modelling.The alternative text for this image may have been generated using AI.
Full size image

a, Conceptual representation of the applicability domain. Molecules close to the training data in chemical space are within a models’ applicability domain. Molecules outside of this boundary are considered OOD. b, The architecture of the JMM estimates how ‘unfamiliar’ a molecule is to the model through its reconstruction loss. c, Inducing molecular distribution shifts by separating molecular data into in-distribution and OOD groups through spectral clustering. Results for the Orexin receptor 2 (OX2R) dataset are shown. CNN, convolutional neural network. RNN, recurrent neural network.

Here we offer a fresh perspective on how to better navigate the ‘edge’ of chemical space with deep learning, while accounting for prediction reliability on OOD molecules. To achieve this, we leverage recent advances in generative deep learning for de novo molecule design2,26,27,28,29, in particular autoencoders27,30,31. Autoencoders can be trained to encode molecular structures into a lower-dimensional latent space, and subsequently decode them back to their original form27,30. In this work, through joint molecular modelling (Fig. 1b), we simultaneously train deep learning models to predict molecular properties (for example, bioactivity) and reconstruct the input molecule in a semi-supervised32,33 manner, that is, by learning from a combination of labelled and unlabelled molecular data.

Our joint learning approach breaks with the well-established application of leveraging a self-supervised learning task for generative chemistry27 or for predictive performance improvement33,34,35,36,37, by using reconstruction capabilities as a direct proxy for OOD estimation38,39,40,41. Specifically, we hypothesize that poorly reconstructed molecules are less familiar to the model, indicating that they fall outside the distribution learned from the training data. Building on this hypothesis, we introduce a metric, termed unfamiliarity, which captures a model’s reconstruction ability and is proposed to quantify how much a molecule deviates from the training distribution.

In this systematic study, spanning 33 experimentally labelled molecular datasets, we show not only that the introduced unfamiliarity metric is a robust indicator of molecular distribution shifts, but also that it strongly correlates with classifier performance. The capacity of unfamiliarity to identify structurally diverse and bioactive molecule is further validated in the wet lab, discovering several compounds with low micromolar activity on two kinase proteins.

Ultimately, the introduced concept of molecular unfamiliarity provides a principled approach to estimating model generalizability, even in the presence of molecular distribution shifts. Our approach offers a fresh perspective on estimating prediction reliability, complementing established concepts such as the applicability domain19,20 and uncertainty estimation22,23—guiding the discovery of structurally novel molecules in a more precise and informed manner.

Results

In what follows, we will elaborate on our joint molecular model (JMM), introduce the unfamiliarity metric and demonstrate its ability to quantify molecular distribution shifts. Next, we relate molecular property prediction to distribution shifts and leverage this relationship to estimate prediction reliability. Finally, we apply the unfamiliarity metric to prioritize molecules in a virtual screening case study, followed by experimental validation in the wet lab.

Unfamiliarity and joint molecular modelling

The JMM (Fig. 1b) is based on a semi-supervised32,33 autoencoder, building on seminal work in molecular generative modelling27. First, molecules are represented as Simplified Molecular Input Line Entry System (SMILES) strings42, which encode molecular topology and atom/bond types in a textual format. SMILES strings are encoded into a compressed latent vector (z), using a one-dimensional convolutional neural network43, which was shown to capture elements of bioactivity effectively27,44. The z vector is then decoded back into the input representation using a recurrent neural network with long short-term memory (LSTM)26,28,29, in a self-supervised manner. This encoder-decoder was pretrained on ~1.2 M unlabelled molecular structures from ChEMBL45, a dataset large enough to learn the ‘grammar’ of SMILES strings46. The model was then finetuned using each labelled molecule, by passing the same latent representation (z) to an approximate Bayesian classifier47, which predicts molecular properties and simultaneously estimates prediction uncertainty (Fig. 1b). Reconstruction and property prediction were trained jointly (Methods; equation (13)), to ensure that the shared latent space captures relevant information for both tasks.

The performance of molecular reconstruction was quantified via the reconstruction loss. This was computed as the total negative log-likelihood loss of all SMILES string tokens, normalized by SMILES token length:

$${{\mathscr{L}}}_{\mathrm{reconstruction}}\left(x\right)=-\frac{1}{\left|x\right|}\mathop{\sum }\limits_{i\in x}\log p\left({t}_{i}|{t}_{ < i}\right),$$
(1)

where ti is the ith non-padding element (‘token’) in the input SMILES string (x), and |x| is the number of non-padding tokens in the sequence. Here, \(p({t}_{i}|{t}_{ < i})\) denotes the probability assigned by the decoder to the next token ti given all preceding tokens in the sequence. From the reconstruction loss, we obtain the unfamiliarity metric (\({\mathbb{U}}\)), as follows:

$${\mathbb{U}}\left({\rm{x}}\right)=\log {{\mathscr{L}}}_{{\rm{reconstruction}}}\left(x\right).$$
(2)

The unfamiliarity metric depends on vocabulary size V and theoretically ranges from near −∞ to log(log(V)). The lower the reconstruction loss, the lower the unfamiliarity, and vice versa. In our case, a model with V = 35 would have an upper \({\mathbb{U}}\) limit of ~1.27 under ideal circumstances with uniformly predicted token probabilities. However, in practice, this upper bound can be exceeded when a model assigns extremely low probabilities to certain tokens.

Detecting molecular distribution shifts

To investigate whether the unfamiliarity score reflects molecular distribution shifts, we collected 33 experimentally annotated datasets from literature48,49,50, spanning various biological properties and sizes (Supplementary Table 1). These datasets were split into groups of in-distribution and OOD molecules. First, we performed spectral clustering (Methods) on molecular cyclic skeletons51 (core ring systems without exocyclic substituents), which ensured that structurally similar molecules were consistently grouped together (Fig. 1c). Based on cluster distances in each dataset, the most distant clusters (representing approximately 25% of all molecules) were used as an OOD test set (testOOD). The remaining molecules were split into an in-distribution test set (testID, ~25% of total) and a training set (trainID, ~50%; Supplementary Fig. 1).

To confirm that the molecules in the testOOD set originate from different data distributions than the other two sets, we used three approaches to capture molecular similarity52:

  1. (1)

    Similarity of extended connectivity fingerprints53 (ECFPs), which capture the presence of atom-centred substructures. ECFPs were computed on molecular cyclic skeletons, and similarity was quantified via the Tanimoto coefficient54.

  2. (2)

    Overlap of structural cores55, computed as the maximal common substructure (MCS) fraction between molecular graphs (Methods).

  3. (3)

    Pharmacophore similarity, computed on Chemically Advanced Template Search (CATS) descriptors56, using the cosine similarity.

Across all similarity metrics, molecules in the testOOD set exhibited statistically significant differences from the molecules in the training set and testID, whereas no differences were found between these latter two (Fig. 2a–c, paired, two-sided, Wilcoxon signed-rank test, α = 0.05).

Fig. 2: Detecting induced molecular distribution shifts with the unfamiliarity score.
Fig. 2: Detecting induced molecular distribution shifts with the unfamiliarity score.The alternative text for this image may have been generated using AI.
Full size image

a, Mean scaffold similarity of data splits in the labelled datasets to their train set (n = 33 datasets). Similarity is calculated as the Tanimoto coefficient between ECFPs of Bemis–Murcko scaffolds. Every point in the box plot represents a dataset. b, Mean MCS fraction (MCSF) between data splits in the labelled sets to their train set (n = 33 datasets). c, Mean pharmacophore similarity (CATS12) between data splits in the labelled sets to their train set (n = 33 datasets). d, Predictive performance on the bioactivity finetuning sets (n = 33 datasets). From left to right, RF models using CATS descriptors, RF using ECFPs, MLP using ECFPs, MLP using a SMILES string encoder, and JMM using the same SMILES string encoder. e, Distribution of the JMM’s unfamiliarity score for all molecules in testID and testOOD across all labelled datasets (n = 14,081 molecules per split). Box plots in ae show the median (centre line), 25th and 75th percentiles (box bounds) and 1.5× interquartile range (IQR) (whiskers). f, Distributions of the JMM’s unfamiliarity score in testID and testOOD per datasets (dataset acronyms are specified in Supplementary Table 1). Statistically significant differences (P < 0.05) in af are denoted with an asterisk, determined by paired, two-sided, Wilcoxon signed-rank tests (ad) and two-sided KS tests (e and f). Non-significant differences are denoted as n.s. g, Relationship between the JMM’s unfamiliarity score and the mean MCSF similarity of all molecules in the labelled datasets (n = 14,081 molecules per split) to their respective train set. h, Relationship between binned unfamiliarity and MCSF similarities to the respective train set. Unfamiliarity values were binned per dataset. Points represent the mean over all datasets (n = 33), and error bars represent the standard error. All model-derived scores represent means over 10-fold Monte Carlo cross-validation (10% validation samples).

Source data

To benchmark the effect of such a distribution shift on predictive performance, we trained three well-established48 molecular property prediction models using molecular descriptors: a random forest (RF) and a multilayer perceptron (MLP) combined with ECFPs, as well as a RF model combined with CATS pharmacophore descriptors. As expected, the performance of these baselines consistently degraded on testOOD compared with testID (Fig. 2d, paired, two-sided, Wilcoxon signed-rank test, P < 0.05) as measured via balanced accuracy (Methods; equation (15)). Such performance degradation is a hallmark of OOD data57.

Having established a baseline, we evaluated the classification performance of the proposed JMM (Fig. 1b) on each dataset. The JMM achieved a balanced accuracy of 0.75 ± 0.02 on testID molecules, which was slightly lower than the balanced accuracy of 0.78 ± 0.02 achieved by the ECFP-based model (Fig. 2d, P = 3.7 × 10−5, paired, two-sided, Wilcoxon signed-rank test). Such a small performance gap suggests limited practical differences, and it aligns with existing literature on the performance of SMILES-based versus ECFP-based models44. Notably, training the JMM with or without the reconstructive decoder had no effect on the performance (testID: P = 0.499, testOOD: P = 0.594; paired, two-sided, Wilcoxon signed-rank test). These results indicated that the performance of the JMM is in line with literature standards, despite the addition of the molecular reconstruction task. In other words, the decoder enables unfamiliarity estimation without imposing a performance penalty on the classifier.

Next, we aimed to verify the hypothesis that molecules well-represented in the training data (or testID) are reconstructed better by the JMM than ‘unfamiliar’ OOD molecules38,39,40,41. To this end, we inferred \({\mathbb{U}}(x)\) on all labelled molecules in the respective testID and testOOD sets for every trained JMM. As expected, testOOD molecules received significantly higher unfamiliarity scores than testID molecules (two-sided Kolmogorov–Smirnov (KS) test, P < 0.05; Fig. 2e), as was especially visible on a dataset basis (Fig. 2f). Importantly, these differences in unfamiliarity were not driven by factors like SMILES string length or complexity, branching, molecular graph complexity, molecular weight or the number of functional groups (Supplementary Fig. 2). These results suggest that the model’s ability to reconstruct molecules depends not on molecular complexity, but rather on its ‘proximity’ to the training data distribution.

For both molecules in testID and testOOD, this was further confirmed by the direct relationship between a molecule’s unfamiliarity and its distance to the training data, regardless of the data splits (Supplementary Fig. 3), for example, in terms of structural core overlap (Fig. 2g,h). Moreover, we found that, in general, there is a moderate-to-strong relationship between unfamiliarity and structural distance to the training data (Table 1). In this context, unfamiliarity correlates well to multiple and complementary similarity metrics, suggesting that it provides a generalizable, model-driven perspective on learned data distributions.

Table 1 Correlation of reliability metrics to train set similarity

Unfamiliarity and bioactivity prediction

Because molecular reconstruction and molecular property prediction depend on the same learned latent representation, we tested how informative \({\mathbb{U}}(x)\) is for assessing the predictive capabilities of the model. We compared our approach with other well-established measures of reliability, namely:

  1. (1)

    Similarity to training set molecules (data-driven), measured as the average pharmacophore similarity, cyclic skeleton similarity, or molecular core overlap.

  2. (2)

    Embedding distance (model-driven), defined as the average Mahalanobis distance58 of a molecule’s embedding z (Fig. 1b) to the learned embeddings of the training set.

  3. (3)

    Prediction uncertainty (model-driven), based on approximate Bayesian modelling of the classifier (Methods; equation (12)).

We found that all tested reliability metrics are indicative of a model’s predictive performance, considering balanced accuracy, hit rate and precision (Supplementary Fig. 4). Prediction uncertainty and unfamiliarity have a moderate to strong correlation to model performance (Table 2). In other words, when uncertainty and/or unfamiliarity are high (on a dataset level), models make more erroneous bioactivity predictions, whereas molecules for which these metrics are low, are generally predicted accurately (Supplementary Fig. 4). Interestingly, unfamiliarity and prediction uncertainty seem to be unrelated as metrics themselves (Supplementary Fig. 5, Spearman correlation, r = 0.10 ± 0.05). This aspect indicates that both metrics capture complementary information about prediction reliability and motivates the introduction of unfamiliarity alongside uncertainty for molecular machine learning.

Table 2 Correlation of reliability metrics to model performance

‘Model-driven’ reliability metrics outperformed all ‘data-driven’ methods that calculate a molecule’s distance to the training set using predefined descriptors and metrics. This indicates that models can extract relevant information from their training data that cannot be captured through predefined measures of molecular similarity alone.

Interestingly, the unfamiliarity score proved to be a considerably better reliability metric than the embedding distance to the training embeddings, even though both capture elements of structural similarity learned by the JMM (Table 1). In other words, the ability of a model to reconstruct a molecule from an internal representation provides more insight into prediction reliability than the ‘internal’ molecular embeddings directly used by the classifier. This suggests that reconstructing a molecule from an embedding is not only highly informative for prediction reliability, but also for the quality of the embedding itself. These findings corroborate that simple embedding distance metrics fail to capture the ‘chemical nuances’ that affect task-specific outcomes and that embedding quality is better assessed through downstream tasks than by proximity to training embeddings, as previously suggested for computer vision59.

Furthermore, our results highlight the distinctions between embedding distance, unfamiliarity and prediction uncertainty. While both unfamiliarity and prediction uncertainty correlate with performance independently, the latter does not strongly relate to the structural properties of molecules (Table 1). By contrast, unfamiliarity captures both molecular similarity and predictive performance simultaneously. In other words, embedding distance primarily reflects p(x), uncertainty relates to p(y|x), and unfamiliarity integrates both, providing insight into p(y, x|x). This suggests unfamiliarity as a holistic metric of prediction reliability, effectively linking structural information to prediction confidence.

Virtual hit screening

To further explore the use of unfamiliarity to navigate chemical space, we extended our analysis to large-scale screening libraries to mimic a realistic virtual screening scenario. Although these libraries are unlabelled (and therefore do not allow performance evaluation), they contain a larger and more diverse set of molecules, potentially revealing additional differences compared with the smaller test sets analysed previously.

Using a combined set of 1.4 M molecules from three reliable commercial screening libraries (Asinex60, Specs61 and Enamine62; Supplementary Table 4), we performed inference with all 33 trained models. The screening molecules showed a lower structural overlap with the training sets than the testOOD molecules (Fig. 3a and Supplementary Fig. 6a). Based on previous results (Fig. 2d), this indicates that we can also expect a corresponding performance drop on the screening molecules. Still, estimated prediction uncertainty did not highlight meaningful differences between screening molecules and testID molecules (Fig. 3b; KS statistic D = 0.181, indicating a limited effect size). Prediction uncertainty alone would suggest that the screening libraries fall within a model’s operating limits. Unfamiliarity scores, meanwhile, reveal strong distribution shifts, both overall (Fig. 3c; KS statistic D = 0.999), and especially on a dataset basis (Supplementary Fig. 6).

Fig. 3: Virtual screening on 1.4M commercially available molecules.
Fig. 3: Virtual screening on 1.4M commercially available molecules.The alternative text for this image may have been generated using AI.
Full size image

All model-derived scores represent means over 10-fold Monte Carlo cross-validation (10% validation samples). a, Distributions of the mean Tanimoto similarity on ECFPs to each respective training set of molecules from testID (n = 14,081), testOOD (n = 14,081) and the combined screening libraries (n = 46,048,926). Results of all 33 drug targets are combined. b, Distributions of estimated prediction uncertainty \({{{\mathbb{H}}}}(y|x)\) for all molecules in testID, testOOD and the combined screening libraries. c, Distributions of unfamiliarity scores \({\mathbb{U}}\left(x\right)\) for all molecules in testID, testOOD and the combined screening libraries. Statistically significant differences (P < 0.001) are denoted with an asterisk, determined by two-sided KS tests (a-c). KS test statistics (D) are as follows. a: Library versus testID, D = 0.499; testOOD versus testID, D = 0.279; b: library versus testID, D = 0.181; testOOD versus testID, D = 0.155; c: library versus testID, D = 0.999; testOOD versus testID, D = 0.368. d, Relationship between uncertainty and unfamiliarity for all molecules in the screening libraries (n = 1,395,422), averaged over all 33 drug targets. The mean Spearman correlation is reported ± s.e.m. Four molecules predicted as generally bioactive across all drug targets are annotated, each close to a utopia point (for example, lowest uncertainty and lowest unfamiliarity for molecule iii; Methods). e, Relationship between uncertainty and unfamiliarity for all molecules in the screening libraries, specifically for serine/threonine-protein kinase PIM1 (n = 1,395,422). Four molecules predicted as bioactive for PIM1 are annotated, each close to a utopia point.

Source data

The estimated prediction uncertainty on this large-scale library does not directly correlate with a molecule’s structural similarity to the training data (Supplementary Fig. 7a, Spearman correlation, r = −0.04 ± 0.02). Unfamiliarity, meanwhile, shows a moderate relationship to structural similarity (Supplementary Fig. 7b, Spearman correlation, r = 0.21 ± 0.02). Finally, prediction uncertainty and unfamiliarity remain independent (Fig. 3d, Spearman correlation, r = −0.03 ± 0.03). Our findings confirm that the observed trends in small datasets remain robust across large screening libraries. Moreover, they highlight the ability of unfamiliarity to detect OOD shifts that routinely used metrics might fail to capture.

Uncertainty and unfamiliarity show stark differences across the molecular library, both from a pan-pharmacological angle (Fig. 3d) and for the well-studied serine/threonine-protein kinase PIM1 as a highlighted case (Fig. 3e). Globally, molecules with high \({\mathbb{U}}\left(x\right)\) scores are ‘structurally atypical’ (for example, molecules i and ii, Fig. 3d). Molecules with low \({\mathbb{U}}\left(x\right)\) scores, meanwhile, display key characteristics of bioactive molecules, for example, steroid-like structures63 and bioactive cores64 (molecules iii and iv, Fig. 3d). Similar trends are observed across individual protein targets, as exemplified by PIM1 (Fig. 3e). Here, molecules with low \({\mathbb{U}}\left(x\right)\) contain well-known pharmacophores (for example, pyrimidinone cores65).

Finally, we found no relationship between unfamiliarity and the quantitative estimate of drug-likeness66 (Spearman correlation: r = −0.04 ± 0.03) or synthetic accessibility67 (Spearman correlation: r = 0.01 ± 0.02), demonstrating that low unfamiliarity scores do not simply capture drug-likeness. Importantly, structural insights are not captured by uncertainty estimation. This might indicate that uncertainty estimation is unreliable on strongly OOD data, as has been suggested previously14,24,25.

Prospective virtual screening

To validate the proposed approach prospectively under real-world conditions, we screened a commercial compound library comprising approximately 180,000 drug-like molecules (from Specs61; Methods) to identify inhibitors of two pharmacologically relevant kinase targets: PIM1 and cyclin-dependent kinase 1 (CDK1). Notably, CDK1 data were not used elsewhere in this study, serving as a fully independent test case.

For each target, we trained a JMM on all available data (1,443 training molecules for PIM1 and 312 for CDK1) and predicted bioactivity across the screening library. Molecules were then ranked by their distance to the so-called utopia point68—the geometric optimum balancing multiple objectives (Methods). Using the predicted bioactivity \({\mathbb{E}}(y|x)\), prediction uncertainty \({{{\mathbb{H}}}}(y|x)\) and unfamiliarity \({\mathbb{U}}(x)\) as three complementary objectives, we selected the ten best molecules with three alternative trade-offs to illustrate how uncertainty and unfamiliarity behave under distribution shifts (Fig. 4a,d):

  1. (A)

    High \({\mathbb{E}}\left(y,|,x\right),\,\)High \({{{\mathbb{H}}}}(y|x)\), and low \({\mathbb{U}}(x)\).

  2. (B)

    High \({\mathbb{E}}\left(y,|,x\right),\,\)Low \({{{\mathbb{H}}}}(y|x)\), and low \({\mathbb{U}}(x)\).

  3. (C)

    High \({\mathbb{E}}\left(y,|,x\right),\,\)Low \({{{\mathbb{H}}}}(y|x)\), and high \({\mathbb{U}}(x)\).

Fig. 4: Experimental screening for PIM1 and CDK1.
Fig. 4: Experimental screening for PIM1 and CDK1.The alternative text for this image may have been generated using AI.
Full size image

Compound selection was based on unfamiliarity scores and uncertainty estimates averaged over 10-fold Monte Carlo cross-validation (10% validation samples). All kinase activity measurements represent the mean of three technical replicates, except for positive control compounds (AZD1208 and dinaciclib). a, The ten most promising PIM1 inhibitors were selected from a library of ~180,000 compounds using three combinations of uncertainty and unfamiliarity (A: most uncertain and least unfamiliar B: least uncertain and least unfamiliar; C: least uncertain and most unfamiliar). b, Measured PIM1 activity across selected molecules at 10 µM and their maximum Tanimoto similarity (on ECFPs) to PIM1 training molecules. Lower PIM1 activity means stronger inhibition. c, Box plot of measured PIM1 activity (n = 10 molecules per method). The solid line represents PIM1 activity without any screening compound, while the dashed line represents PIM1 activity with a potent control inhibitor (AZD1208). Statistically significant differences (α = 0.05) are denoted with an asterisk, and were determined by paired, two-sided, Wilcoxon signed-rank tests. Box plots show the median (centre line), 25th and 75th percentiles (box bounds) and 1.5× IQR (whiskers). d, The ten most promising CDK1 inhibitors for each selection method. e, Measured CDK1 activity across selected molecules at 10 µM and their maximum Tanimoto similarity (on ECFPs) to CDK1 training molecules. f, Box plot of measured CDK1 activity (n = 10 molecules per method). The solid line represents CDK1 activity without any screening compound, while the dashed line represents CDK1 activity with a potent control inhibitor (dinaciclib). Box plots show the median (centre line), 25th and 75th percentiles (box bounds) and 1.5× IQR (whiskers). Statistically significant differences (α = 0.05) are denoted with an asterisk, and were determined by paired, two-sided, Wilcoxon signed-rank tests. g, Measured protein activity across all 60 screened compounds at 10 µM (method A: 1–10 and 31–40; method B: 11–20 and 41–50; method C: 21–30 and 51–60). Selected compounds (identifier highlighted in boldface) displayed in panel h) were further characterized for their dose–response curve. h, Structures and determined IC50 of the six most promising compounds for PIM1 and CDK1. IC50 values that could not be determined are denoted as NA.

Source data

To limit structural overlap with the training data and among selected molecules, compounds with a Tanimoto coefficient (on ECFPs) ≥0.70 to either the training set or other selected compounds were excluded, thereby further challenging the models beyond their training distribution.

All 60 prioritized molecules (Supplementary Figs. 9 and 10) were structurally distant from their respective training sets, with maximal Tanimoto coefficients (on ECFPs) of 0.28 ± 0.05 for PIM1 and 0.28 ± 0.06 for CDK1. Molecules selected via the high unfamiliarity strategy were particularly atypical for kinase inhibitors (Supplementary Figs. 9 and 10), aligning with previous findings (Fig. 3d,e). After experimentally testing all compounds at a single concentration of 10 µM (Fig. 4b and Supplementary Fig. 8), we identified four initial hits (>50% protein inhibition) and six weak hits (>25% inhibition) for PIM1. For CDK1, we found one initial hit and five weak hits.

The six most active compounds per target were further characterized to determine dose–response curves and the corresponding half maximal inhibitory concentration (IC50). For PIM1, all six compounds showed dose-dependent inhibition (Supplementary Fig. 11), although none achieved full inhibition within the tested concentration range (1 nM–10 µM). Compounds 4, 10 and 25 exhibited clear sigmoidal partial inhibition curves, with low/sub micromolar potency (IC50 1.5 ± 0.4 µM, IC50 0.92 ± 0.97 µM and IC50 0.87 ± 0.5 µM, respectively). Compounds 9 and 18 demonstrated weaker inhibition with incomplete curve resolution; for these, we report upper-bound potency estimates (IC50 5.9 µM and IC50 7.5 µM, respectively).

For CDK1, inhibition was generally less pronounced (Fig. 4d and Supplementary Fig. 12), possibly due to the smaller training set (312 molecules). Only compound 40 yielded a complete dose–response curve, with partial inhibition (IC50 2.9 ± 0.75 µM). Compound 33 showed modest activity with an upper-bound IC50 ~3.4 µM. The remaining compounds (31, 48, 53 and 59) showed insufficient inhibition to estimate an IC50.

Overall, our screening experiment achieved hit rates of approximately 17% for PIM1 (5 of 30 compounds with clear dose–response inhibition) and 7% for CDK1 (2 of 30), with several additional weak actives. These hit rates exceed those typically reported for traditional kinase-focused screening campaigns, which often range from 0.1% to 5% (refs. 69,70,71,72). Notably, all identified hits had a maximal substructure similarity to the training molecules below 38% (measured as the Tanimoto coefficient on ECFPs), and emerged from purely prospective, machine learning-guided selection.

While the limited number of compounds tested per strategy and the absence of a full 2 × 2 factorial design prevents us from drawing statistically robust conclusions and cleanly separating main effects, five of the seven compounds with low micromolar potency originated from selection method A (low unfamiliarity, high uncertainty). By contrast, selecting molecules with low prediction uncertainty (methods B and C) did not yield a clear advantage, consistent with earlier results indicating that uncertainty is not a reliable performance signal under distribution shifts (Fig. 3a–c). Although preliminary, these prospective results provide practical evidence that the proposed approach can identify novel bioactive matter and support unfamiliarity as a useful handle for navigating chemical space under distribution shifts.

Discussion

This study introduced unfamiliarity—a metric that captures a molecule’s distance from the data distribution learned by a deep learning model. Unfamiliarity is computed via a joint modelling approach, trained to simultaneously perform molecular reconstruction and property prediction. By capturing the model error in reconstructing previously unseen molecules, unfamiliarity quantifies molecular distribution shifts. Our results demonstrate that unfamiliarity is a reliable and powerful indicator of a model’s prediction reliability when applied to new molecules. As a classification reliability metric, unfamiliarity is as informative as prediction uncertainty estimated via approximate Bayesian modelling, yet the two remain independent. Unlike prediction uncertainty, however, unfamiliarity also captures a molecule’s structural distance from the learned data distribution. Notably, in a large-scale virtual screening campaign, unfamiliarity provided far more meaningful molecular insights than uncertainty estimation when faced with strong distribution shifts. The complementarity of unfamiliarity and prediction uncertainty as reliability metrics highlights unfamiliarity as a valuable tool for molecular machine learning.

The prospective validation further underscores the usefulness of the introduced unfamiliarity metric to complement uncertainty-based molecule prioritization. Despite using small training sets and screening only a handful of compounds per target, we identified multiple molecules with low micromolar activity. Prediction uncertainty, meanwhile, did not seem indicative of performance when operating outside the training support14,24,25. This suggests that unfamiliarity-aware selection can enable informed and precise exploration of the chemical space, unlocking new opportunities for discovering novel molecular candidates.

Our findings advocate for the adoption of unfamiliarity over traditional, similarity-based methods for applicability domain definition. Moreover, because unfamiliarity bears promise to reveal distribution shifts that would be undetected through molecular similarity or underestimated by uncertainty-based approaches, we recommend its adoption when screening large-scale molecular libraries. Crucially, because the unfamiliarity landscape reveals gaps in a model’s learned distribution, its applications could assist in guiding reinforcement learning applications73 or extend beyond ‘one-shot’ virtual screening to iterative approaches such as active learning74,75 to guide molecule acquisition. Ultimately, our study highlights the advantages of joint modelling not only in de novo design applications27 but also in capturing structure–activity relationships. The information provided by joint modelling and unfamiliarity is expected to drive the development of more reliable and generalizable models—accelerating the exploration of novel regions in the chemical space with greater confidence and precision.

Methods

Data preprocessing and analysis

Molecular representation and description

Each molecular structure in this study was represented as a SMILES42 string. CATS56 descriptors and 2048-bit ECFPs53 (using a radius of 2) were computed for each molecule.

Data collection and curation

Thirty-three labelled datasets of molecular structures with their corresponding experimental target property were used:

Moreover, small molecules were collected from ChEMBLv3345 for model pretraining. Because molecular structures from ChEMBL were used for pretraining, molecules with a Bemis–Murcko scaffold76 similar to any such scaffold in the labelled datasets (Tanimoto similarity coefficient54 on EFCPs larger than 0.7) were removed. This included molecules without a Bemis–Murcko scaffold (that is, containing no rings).

For prospective virtual screening, the most recent CDK1 (CHEMBL308) data were fetched from ChEMBL v3545 (accessed in April 2025) as an additional and independent dataset. Raw data were processed in accordance with previous work48.

All bioactivity endpoints of the MoleculeACE48 and CDK1 datasets were converted from continuous regression labels into binary classification labels. Molecules with an EC50 (half maximal effective concentration) or Ki (inhibitory constant) of 100 nM or lower were labelled as bioactive, whereas less potent molecules were labelled as inactive. For the LIT-PCBA49 and Ames mutagenicity dataset50, their original binary classification labels were used.

For all datasets, SMILES strings were preprocessed using RDKit v. 2024.3.377. For each SMILES string, stereochemistry tokens as well as salts and solvents (Supplementary Table 2) were removed. Each molecule was sanitized, neutralized using predefined neutralization reactions (Supplementary Table 3), and its SMILES string was canonicalized. Disconnected structures and molecules that contained formal charges, contained complex ring systems (SMILES strings with a ring index of 9 or higher), non-standard isotopes or any atoms other than Cl, Br, H, C, N, O, F, S and I were removed. Molecules were removed if they contained more than 100 tokens in their canonical SMILES string or if they could not be featurized into CATS descriptors and/or ECFP fingerprints. Sizes of datasets before and after data curation are presented in Supplementary Table 1.

Virtual screening library (retrospective)

Molecules were gathered from three commercial screening libraries:

All molecules were processed in the same way as the training datasets. Finally, all unique molecules were aggregated.

Virtual screening library (prospective)

The most recent Specs library was downloaded from https://www.specs.net/index.php?view=databases&page=download (accessed in April 2025). All molecules were processed in the same way as the training datasets. To ensure that molecules were compatible with our experimental assay (likely soluble in 1% dimethylsulfoxide (DMSO)), physicochemical rules were enforced based on relaxed rule-of-five and Veber criteria:

  1. (1)

    Molecular weight between 200 g mol−1 and 600 g mol−1.

  2. (2)

    logP lower than 6.

  3. (3)

    Total polar surface area between 20 Å2 and 140 Å2.

  4. (4)

    Number of hydrogen bond donors lower than six.

  5. (5)

    Number of rotatable bonds lower than ten.

  6. (6)

    A maximum of two rule-of-five violations.

In addition, molecules with a terminal enone, isocyanate, quinone, aromatic nitro groups, azide or epoxide groups were removed by using SMILES Arbitrary Target Specification (SMARTS) patterns to prevent assay interference. Furthermore, to prevent the selection of trivially simple molecules and enrich the general-purpose Specs library for kinase-relevant chemical space, several general criteria were enforced:

  1. (1)

    A molecule must have an ATP mimetic core, that is, at least one heteroatom in a ring, or a fused carbocyclic system.

  2. (2)

    A molecule must have a polar anchor to ensure solubility or solvent interaction.

  3. (3)

    A molecule must have enough hydrophobic mass or planarity to fill the kinase ATP pocket (for example, gatekeeper region).

  4. (4)

    A molecule must have a directional H-bond donor/acceptor to ensure potential for hinge interaction.

  5. (5)

    A molecule must have two or more rings.

These rules are intentionally permissive for new chemotypes, and 98.6% of all kinase inhibitors that ever went into clinical trials78 pass these filters. Finally, molecules with a Tanimoto similarity on ECFPs >0.7 to any molecule in the respective target’s data (PIM1, and CDK1) were removed. This left 185,298 and 185,336 screening molecules for PIM1 and CDK1, respectively.

Molecular cyclic skeletons

Cyclic skeletons51 (core ring systems without exocyclic substituents) were extracted from molecules to serve as the most fundamental molecular scaffold representation. From each molecule, Bemis–Murcko scaffolds76 were obtained, removing peripheral substituents. Remaining double-bonded exocyclic substituents were then removed, and all atoms and bonds were made generic to obtain the final cyclic skeleton.

Data splitting

The curated molecular structures from ChEMBL were split into a training (80%, n = 1,230,041), a test set (10%, n = 153,755) and a validation (10%, n = 153,755) set, using random splitting. All labelled datasets were split into a training set (~50%), test set (~25%) and OOD set (~25%). To determine the OOD molecules, spectral clustering was performed on unique cyclic skeletons (see below). The molecules corresponding to the scaffolds in the clusters with the lowest mean cluster similarity to all other clusters that constituted approximately 25% of the total dataset size were taken as the OOD set. The remaining molecules (that is, the ~75% most similar molecules) were split randomly in a train and test set, with the test set being equal in size to the OOD set. An overview of all datasets is presented in Supplementary Table 1.

Spectral clustering

Spectral clustering was performed on a molecular similarity matrix A using Sci-kit learn79. A is an n × n matrix where each element Aij is the Tanimoto coefficient on ECFPs Tij between two molecular structures. By using a molecular similarity matrix directly, we bypass the complex, high-dimensional and non-Euclidean nature of molecular structures. From this affinity matrix, the symmetrically normalized Laplacian was constructed as follows:

$${L}_{\mathrm{sym}}=I-{D}^{-1/2}A{D}^{-1/2},$$
(3)

where I is the identity matrix and D is the degree matrix. Subsequently, eigenvalue decomposition was performed:

$${L}_{\mathrm{sym}}={\bf{U}}\Lambda {{\bf{U}}}^{T}{.}$$
(4)

To determine the number of clusters k for spectral clustering, the eigenvalues λ1, λ2, …, λn were sorted in ascending order and the elbow (the point of maximal curvature) of the resulting sequence was estimated using the kneed algorithm80. Finally, the spectral embeddings of the data were clustered by taking the top k (smallest) eigenvectors Uk, normalizing Uk to unit length and performing k-means clustering on all rows (uk).

Chemical space visualization

To visualize each labelled dataset, molecules were first encoded as ECFPs. The resulting binary ECFPs were reduced to 100 components using truncated singular value decomposition and embedded into a two-dimensional space using t-distributed stochastic neighbour embedding with a perplexity value of 30 and default settings.

MCS fraction

To compute molecular core similarity, we computed the MCS fraction55 between a molecule Ma and a reference molecule Mb as

$$\mathrm{MCS}\,\mathrm{fraction}=\frac{\left|\mathrm{MCS}\left({M}_{a},\,{M}_{b}\right)\right|}{\left|{M}_{a}\right|},$$
(5)

where |Ma| is the number of atoms in Ma and MCS is the maximal common substructure between the two molecules Ma and Mb, as determined by the FMCS algorithm7 in RDKit. A high MCS fraction indicates that a molecule shares a significant portion of their overall core structure with a reference molecule. This implementation is asymmetric.

Molecular complexity

To quantify molecular complexity we compute the well-established Bertz complexity81 and Böttcher complexity82. In addition, to align complexity measures with the task of reconstructing SMILES strings, we also estimated the complexity of the molecular graph directly83 as

$${C}_{\mathrm{graph}}=V{\log }_{2}V-\mathop{\sum }\limits_{i\in {\mathscr{G}}}{V}_{i}{\log }_{2}{p}_{i},$$
(6)

where G represents the molecular graph, V is the total number of elements in the graph and Vi is the number of the elements in the ith set of elements. In a similar fashion, we also estimated the complexity of a SMILES string, via their entropy computed as

$${C}_{\mathrm{SMILES}}=-\mathop{\sum }\limits_{i\in S}{p}_{i}{\log }_{2}{p}_{i},$$
(7)

where S represents the set of unique tokens in a SMILES string, and pi is the probability of the ith token occurring in S. Tokens representing the start, end of sequence, and padding were not considered. Moreover, for each molecule, we counted the number of SMILES tokens (excluding padding), the number of SMILES string branches (that is, ‘(‘ tokens), and the presence of 50 unique molecular patterns74.

Virtual screening

Top-k molecules were selected in a multi-objective manner by selecting the k molecules closest to the utopia point68. The distance to this ideal point can be calculated as

$${d}_{\mathrm{utopia}}=\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({\mathrm{norm}}_{i}\right)}^{2}},$$
(8)

where normi is the normalized objective that is either maximized (for example, predicted bioactivity) or minimized (for example, prediction uncertainty):

$$\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{max}]}=\frac{{x}_{\text{max}}-{x}_{i}}{{x}_{\text{max}}-{x}_{\text{min}}}{\rm{or}}\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{min}]}=\frac{{x}_{i}-{x}_{\text{min}}}{{x}_{\text{max}}-{x}_{\text{min}}}.$$
(9)

Machine learning

Encoder

Canonical SMILES strings were encoded by a one-dimensional convolutional neural network. SMILES string character tokens were embedded using a randomly initialized trainable embedding layer of size 128. Several one-dimensional convolutional layers were used with a stride of 1 and no padding. Each layer was followed by a ReLU activation, standard max pooling with a kernel size equal to that of the convolutional layers, and dropout. Both convolutional and pooling layers used a stride of 1 and no padding. The final output was flattened and compressed to a latent vector (z) of size 128 using a fully connected layer. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of convolutional layers [2, 3], filter size [256, 512], kernel size [6, 8], weight decay on convolutional neural network weights [0, 1 × 10−4] and dropout [0, 0.1].

Decoder

Encoded latent molecular representations in z were reconstructed back to SMILES strings using a conditioned LSTM model. A randomly initialized trainable embedding layer of 128 neurons was used to embed SMILES string character tokens. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of LSTM layers (nlayers) [2, 3] and the LSTM hidden size (\({\mathrm{size}}_{\mathrm{layers}}\)) [256, 512]. Models were trained autoregressively without teacher forcing using next token prediction based on the tokens predicted in the previous steps rather than the ground truth tokens. A reconstruction loss normalized for sequence length was used (equation (1)). To condition the model, the LSTM hidden state h0 was initialized with z for every molecule. To correctly match all dimensions of h0 (\({n}_{\mathrm{layers}},{\mathrm{size}}_{\mathrm{layers}}\)), z was first transformed to \({n}_{\mathrm{layers}}\,\times {\mathrm{size}}_{\mathrm{layers}}\) with a fully connected layer, after which it was reshaped into \({{\rm{n}}}_{\mathrm{layers}}\) chunks of \({\mathrm{size}}_{\mathrm{layers}}\).

Approximate Bayesian classifier

Labels were predicted from either ECFPs or SMILES strings encoded into latent molecular representationsz. A MLP was used with several fully connected layers and an output layer consisting of two neurons. The number of MLP layers [2, 3] and the MLP hidden size [1,024, 2,048] were optimized (see ‘Hyperparameter optimization’ section). To estimate prediction uncertainty, anchored ensembling47 was implemented on the MLP as in our previous work74. We used an ensemble of M = 10 models. For each model, \(m\in [1\ldots M]\), we regularized its parameters θm with a set of ‘anchored’ parameters θanchor,m that prevent different models in the ensemble to converge to the same parameter space. Each model is initiated with distinct θanchor, which is controlled by different random seeds. The classification loss in our implementation is defined as

$${{\mathscr{L}}}_{\mathrm{MLP}}=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}\mathop{\underbrace{\log {p}_{m}\left(\,{y|x}\right)s}}\limits_{\text{prediction}}+\mathop{\underbrace{\lambda {{||}{\theta }^{m}-{\theta }_{\mathrm{anchor}}^{m}{||}}^{2}}}\limits_{\mathrm{anchoring}},\,$$
(10)

where λ is a regularization coefficient (set to 3 × 10−4). To estimate the expected value \({\mathbb{E}}\) of each molecule x, we take the mean prediction over the whole ensemble, as follows:

$${\mathbb{E}}\left(\,{y|x}\right)=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,y,|,x\right).$$
(11)

Similarly, we estimate prediction uncertainty for each molecule \(x\) as the mean entropy \({\mathbb{H}}\) over all models in the ensemble:

$${{{\mathbb{H}}}}\left({y|x}\right)=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,{y|x}\right)\log {p}_{m}\left(\,{y|x}\right).$$
(12)

JMM

Canonical SMILES strings were encoded into compressed latent molecular representations \({\bf{z}}\) using the encoder model described above. Subsequently, \({\bf{z}}\) was used to perform both molecular property prediction with a classifier and molecular reconstruction using the \({\bf{z}}\)-conditioned decoder model27. The model was optimized in a joint fashion using the following weighted composite loss function:

$${{\mathscr{L}}}_{\mathrm{JMM}}={{\mathscr{L}}}_{\mathrm{reconstruction}}+\gamma \times {{\mathscr{L}}}_{\mathrm{MLP}},$$
(13)

where the scalar γ was set at 0.1 based on preliminary experiments. A regular autoencoder was used as preliminary experiments showed no performance benefits of the more complex variational autoencoder84 architecture.

RF

An RF classifier was trained on either ECFPs or CATS56 descriptors. The following hyperparameters were optimized (see ‘Hyperparameter optimization’): the number of trees [100, 250, 500, 1,000], the maximal tree depth [10, 20, 30, ∞] and the minimal samples per split [2, 5, 10].

Model training

Autoencoders

Encoder–decoder models were (pre)trained to reconstruct SMILES strings of general drug-like molecules from ChEMBL using the Adam optimizer. Mini batches of 256 random molecules were sampled from the training data for 1,000,000 steps using uniform sampling. Gradients were clipped with a max norm of 5. Early stopping with a patience of up to 20 evaluation checkpoints was implemented by monitoring validation loss every 10,000 steps. The model checkpoint with the best validation loss was used.

Classifiers

Classifiers using ECFPs or SMILES strings as input were trained for molecular property prediction on each of the labelled dataset using a similar setup to the autoencoders. However, tenfold Monte Carlo cross-validation was used with 10% validation splits instead of one predefined data split. Mini batches of 64 were resampled during training based on the occurrence of their class with

$${P}_{c}=1-\frac{{n}_{c}}{N},$$
(14)

where Pc is the probability of sampling class c, nc is the number of samples of class c, and N is the total number of samples. Models were trained for 5,000 steps with an early stopping patience of 10 evaluation checkpoints, performed every 10 steps. The model checkpoint with the best validation loss was used. For the RF control models, molecules were weighted inversely proportionally to their class frequency to mitigate class imbalance during training.

JMMs

Joint models, each consisting of a SMILES string encoder, a decoder and a classifier (Fig. 1b), were initialized with pretrained weights. The SMILES string encoders and classifiers used weights from models trained on the labelled datasets. For the decoder, decoder weights were used from an autoencoder pretrained on ChEMBL. Using a mini batch size of 64, the joint models were finetuned for 10,000 steps with an early stopping patience of 50 evaluation checkpoints, performed every 20 steps. A learning rate of 3 × 10−6 was used for the encoder and classifier, whereas a learning rate of 3 × 10−7 was used for the decoder. No weight decay was applied.

Hyperparameter optimization

Hyperparameters, as specified previously, were optimized for all autoencoders, classifiers (using ECFPs or SMILES strings), and RF control models using a simple grid search. Tenfold Monte Carlo cross-validation was used, repeatedly using 10% of the training data as a validation split. The hyperparameters of the model with the best mean validation loss was used.

Model evaluation

Predictions were evaluated according to the following performance metrics:

$$\text{Balanced accuracy}=\frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}+\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right),$$
(15)
$$\text{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$
(16)

where TP is the number of true (that is, correctly predicted) positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. In addition, the hit rate (true positive rate) was determined for virtual screening experiments as

$$\text{Hit rate}=\frac{\mathrm{TP}}{P},$$
(17)

and enrichment factor as

$$\text{Enrichment factor}=\frac{{\mathrm{TP}}_{k}}{P/N}.$$
(18)

Here, TPk is the number of correctly identified positives in the subset of k prioritized molecules, P is the total number of positives in the full dataset, and N is the total number of molecules in the full dataset.

Biological characterization

Sixty screening compounds were purchased from Specs Compound Handling B.V. and dissolved at 10 mM in 100% DMSO.

To screen for bioactivity, a point screening was first performed at a concentration of 10 µM (in 1% DMSO) in technical triplicates using the ADP-Glo Kinase Assay platform from Promega using the Chemi-Verse PIM1 Kinase Assay Kit and the Chemi-Verse CDK1/CyclinA2 Kinase Assay Kit from BPS Bioscience in Costar flat white 96-well plates. AZD1208 (CAS 1204144-28-4) and dinaciclib (CAS 779353-01-4), purchased from TargetMol Chemicals, were used as positive controls for PIM1 and CDK1, respectively. Bioactivity was measured as the area under the curve of an 18-step luminescence scan between 398 nm and 653 nm with an integration time of 1 s and a settle time of 100 ms, normalized for the signal in buffer-only wells.

For each target protein, the six compounds with the highest bioactivity at 10 µM were followed up with an 8-point dose–response curve using the same assay. Screening compounds were measured in technical triplicates from 10 µM to 0.0046 µM, whereas reference compounds were measured in duplicate.

Hardware and training set-up

All computational experiments were performed on a Lenovo ThinkSystem SD650-N v2 server equipped with Intel Xeon Platinum 8360Y central processing units and NVIDIA A100 (40 GB) graphics processing units. Up to five models were trained in parallel on a single graphics processing unit.

Software and code

All code was implemented in Python (v. 3.12). Deep learning models were implemented using PyTorch (v. 2.3.0)85. Traditional machine learning models and clustering was implemented using Sci-kit learn v.1.4.0 (ref. 79). All molecular data were handled using RDKit (v.2024.3.3)77. For data visualization, R (v.4.3.0) and the R package ggplot2 (v.3.4.2) were used along with Adobe Illustrator.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.