Abstract
Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leakage happens during a model’s training, the model risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the problem to find leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL’s impact on evaluating biomedical machine learning models.
Introduction
Supervised machine learning (ML) is one of the fastest-growing research fields, leading to advances in many computer and life science domains. Many bioinformatics fields benefit from using ML models, e.g., molecular property prediction1 and drug-target interaction prediction2.
For successful deployment of these ML models in real-world use cases, it is crucial that the reported performance estimates reliably represent model performance during inference. If the test set used for model evaluation does not represent the data used at inference time, the model can show inflated performance scores during testing, impeding successful model deployment in a real-world use case. Such misrepresentation can happen when the model uses information from the training set at test time, although this information is not available during inference. This phenomenon is called information leakage or data leakage3,4. Recent studies show that information leakage is a highly relevant problem in many subfields of ML-based research, leading to inflated performances and overoptimistic conclusions in biomedical ML research4,5,6 and beyond7.
The simplest form of information leakage is having the same samples in multiple folds of the data split. This is easy to control, and most common splitting techniques avoid this by removing duplicate data points. Another type of information leakage that is more complex to detect can occur when similarities between data points in the training and in the test sets are larger than similarities between data points in the training set and in the data that one intends to use during inference4. In such a case, an ML model is benchmarked on test data that is in-distribution with respect to the training data, although its intended use case is to yield reliable predictions also for out-of-distribution (OOD) data. Hence, a model may perform well on the test data by relying on similarity-based shortcuts that do not generalize to the intended real-world application scenario.
Especially for biomolecular data that exhibit complex dependency structures, one can easily fall into this trap by following a standard strategy in the ML community to randomly split a benchmarking dataset into training, validation, and test folds. For instance, it has been shown that this problem pervades the field of research on deep learning models to predict protein-protein interaction (PPI) from protein sequences8,9,10. While many of these models perform excellently when evaluated on the random data splits used in the original publications, performance often becomes close to random when evaluated on protein pairs with low homology to the training data that represents the desired use case to predict PPIs for poorly characterized proteins10. Another biomedical ML problem for which similar pitfalls have been described is the problem of predicting the deleteriousness of missense variants11,12. For this problem, it has been shown that when variants that are similar in that they affect the same protein are assigned to different splits, ML models can achieve excellent test performances by relying on protein-level shortcuts (e.g., a simple protein-level majority vote based on the variants in the training fold). Such models then generalize poorly to variants of sparsely annotated proteins and will systematically misclassify minority-class variants of proteins for which both deleterious and non-deleterious variants are seen at inference time11.
In this work, we address the problem of information leakage due to misleading evaluation on in-distribution data for models intended for deployment on OOD data. To this end, we developed DataSAIL, an algorithmic framework and tool to split datasets into multiple folds that allow us to realistically estimate model performance on OOD data. While such datasets have been curated for specific ML tasks on biomolecular data (e.g., PINDER13 and the gold standard dataset developed in10 for PPI prediction and PLINDER14 for protein-ligand interaction prediction), DataSAIL is generic and can be used to split any kind of data, as long as a similarity or distance measure for the contained data points is available.
We formulate the data splitting problem underlying DataSAIL as a constrained optimization problem, prove that this problem is NP-hard, and present a Python package that heuristically solves this problem using clustering and integer linear programming (ILP). Unlike existing tools and algorithms to compute data splits that reduce information leakage, DataSAIL can automatically compute splits for heterogeneous data of two different types and combines stratification with similarity-aware splitting (see Table 1 and section “Detailed description of related work” in the Supplementary Materials for comparison with existing tools). Moreover, DataSAIL is more versatile than existing approaches because it can be used out-of-the-box for various types of molecular data. We validate DataSAIL by showing how it can reduce leakage between training and test data for various ML models trained on both one- and two-dimensional biomolecular datasets.
Results
Data splits for supervised ML
In supervised ML, we are given a dataset \({{\mathcal{M}}}=\{({x}_{1},{y}_{1}),\ldots,({x}_{n},{y}_{n})\}\) of n samples with feature vectors xi ∈ X and labels yi ∈ Y, where X is a feature space and Y represents the space of labels. The goal is to learn a function fθ: X → Y that minimizes a loss function \({{\mathcal{L}}}({f}_{\theta }({{{\bf{x}}}}_{{{\bf{i}}}}),{y}_{i})\). This is achieved by selecting a hypothesis space \({{\mathcal{H}}}\) of candidate functions and fitting \({f}_{\theta }\in {{\mathcal{H}}}\) within the hypothesis space15. To develop a supervised ML model fθ, one needs to split \({{\mathcal{M}}}\) into three pairwise disjoint datasets: A training set \({{{\mathcal{M}}}}_{train}\) to learn the parameters θ, i.e., to select fθ from a fixed hypothesis space \({{\mathcal{H}}}\). A validation set \({{{\mathcal{M}}}}_{val}\) to optimize the hyper-parameters that determine the shape of \({{\mathcal{H}}}\) (e.g., number of hidden layers) or control the employed optimization strategy (e.g., learning rate or optimizer). And a test set \({{{\mathcal{M}}}}_{test}\) to assess the performance of the trained model on so far unseen data.
Our proposed method, DataSAIL, works for one-dimensional and two-dimensional datasets. In a one-dimensional dataset, one feature vector-output value pair (xi, yi) corresponds to one elementary data point; for example, a certain molecular property such as toxicity can be predicted for a single chemical compound (Fig. 1, 1D data). If the feature vector xi consists of two elementary data points (such as in drug-target interaction prediction, where xi represents a pair of a molecule and a protein target for which an interaction affinity yi should be predicted), we call this a two-dimensional dataset (Fig. 1, 2D data).
Fig. 1: The symbol “Y” indicates the presence of a measurement, while the phylogenetic trees next to the matrix visualizations illustrate similarities between samples. The figure showcases all splitting tasks and their interrelations. Samples assigned to training are highlighted with a blue background, validation samples are in yellow, and test samples are marked in red. Unassignable tiles are left white. Created in BioRender. Joeres, R. (2025) https://BioRender.com/w47j283.
Fig. 2: The input can be any type of data (with a special focus on biochemical data). DataSAIL then computes a pairwise distance or similarity matrix (a) and clusters the data into a constant number of clusters based on this matrix (b). These clusters are split into k folds, using off-the-shelf ILP solvers (c). From the partitioning of the clusters, DataSAIL infers the partitioning of the elementary data points (d). Created in BioRender. Joeres, R. (2025) https://BioRender.com/m81k197.
Importantly, in a two-dimensional dataset, the similarity between molecules can be defined over each dimension, e.g., along the drug and target dimensions. We define different splitting tasks with abbreviations based on whether they account for similarity-induced information leakage and the dimensions of the dataset (1 or 2). In identity-based splittings, the similarity between molecules is not considered, whereas in similarity-based splittings, it is accounted for. Those tasks are visualized in Fig. 1 and include identity-based one-dimensional splitting (I1), identity-based two-dimensional splitting (I2), similarity-based one-dimensional splitting (S1), similarity-based two-dimensional splitting (S2), and random interaction-based splitting (R). In two-dimensional data splitting, interactions may exist that cannot be assigned to any split without leaking information if the two interacting molecules are assigned to different folds. Therefore, interactions can get lost in two-dimensional data splitting (white tiles in panels I2 and S2 in Fig. 1).
The (k, R, C)-DataSAIL problem
In this section, we introduce (k, R, C)-DataSAIL, which formalizes the problem of splitting an R-dimensional dataset into k folds such that data leakage is minimized and C classes that are present in the data (e.g., confounders such as sex or the output labels yi if the space of labels Y is discrete) are distributed among the k folds such that each fold preserves the overall class distribution. Intuitively, we define (k, R, C)-DataSAIL as the problem of minimizing inter-split similarity while keeping similar class ratios across the splits. Although designed with biomedical applications in mind, our problem definition is generic and can be used for any dataset where a similarity or a distance measure is available for the contained data points. We present a generalized version of the problem for R-dimensional datasets, although our Python implementation only supports one- or two-dimensional input, i.e., R ≤ 2.
More formally, let \({{\mathcal{D}}}\) be the set of data points represented in \({{\mathcal{M}}}\) or a set of clusters defined over these data points (allowing clusters as elements of \({{\mathcal{D}}}\) will be important for our heuristic solver, as explained below). The data points/clusters \(x\in {{\mathcal{D}}}\) can have \(R\in {\mathbb{N}}\) different entity types t(x) ∈ [R]: = {1, …, R}. For instance, in a drug-target interaction scenario, we have R = 2, with the two entity types r ∈ {1, 2} corresponding to drugs and protein targets or clusters thereof. If \({{\mathcal{M}}}\) is a one-dimensional dataset, all data points/clusters have the same element type. Moreover, the data points or clusters \(x\in {{\mathcal{D}}}\) have cardinalities \(\kappa (x)\in {{\mathbb{N}}}_{\ge 1}\). For elementary data points, we always have κ(x) = 1; if \({{\mathcal{D}}}\) contains clusters, we may have κ(x) > 1. For each element type r ∈ [R], we write \({{{\mathcal{D}}}}_{t=r}:=\{x\in {{\mathcal{D}}}| t(x)=r\}\) and \({n}_{r}:={\sum}_{x\in {{{\mathcal{D}}}}_{t=r}}\kappa (x)\) to denote the set of all data elements of type r and their overall cardinality, respectively. Additionally, we assume that a similarity measure \(\,{\rm{sim}}\,:{{\mathcal{D}}}\times {{\mathcal{D}}}\to {\mathbb{R}}\) or a distance measure \(\,{\rm{dist}}\,:{{\mathcal{D}}}\times {{\mathcal{D}}}\to {\mathbb{R}}\) is available for \({{\mathcal{D}}}\) (see “Methods” for details on how to define sim and dist).
Information leakage has been defined qualitatively by Kaufman et al.3 and quantitatively by Elangovan et al.16.
This definition is incomplete, as only the biggest leak per test sample is considered and the validation set is ignored. In view of this, we define the leakage induced by a mapping \(\pi :{{\mathcal{D}}}\to \,[k]\) that splits \({{\mathcal{D}}}\) into k folds \({{{\mathcal{D}}}}_{i}^{\pi }:=\{x\in {{\mathcal{D}}}| \pi (x)=i\}\), i ∈ [k]: = {1, …, k}, as the total similarity

$$L(\pi ):=\sum _{x{x}^{{\prime} }\in \left({{\mathcal{D}}}\atop 2\right)}\kappa (x)\cdot \kappa ({x}^{{\prime} })\cdot {\rm{sim}}(x,{x}^{{\prime} })\cdot [\pi (x)\ne \pi ({x}^{{\prime} })]\qquad (2)$$

between data elements assigned to different folds. Here, \(\left[\cdot \right]:\{\perp,\top \}\to \{0,1\}\) is the Iverson bracket. The cardinalities are added as factors to our leakage function L to put a higher weight on similarities between larger clusters. Typically, we have k = 3 for data splitting in ML (\({{{\mathcal{D}}}}_{1}^{\pi }={{{\mathcal{D}}}}_{train}\), \({{{\mathcal{D}}}}_{2}^{\pi }={{{\mathcal{D}}}}_{val}\), \({{{\mathcal{D}}}}_{3}^{\pi }={{{\mathcal{D}}}}_{test}\)). We will define (k, R, C)-DataSAIL as the problem of minimizing L(π), given the two sets of constraints we introduce below.
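To make this concrete, the following Python sketch computes the leakage L(π) of Eq. (2) from a pairwise similarity matrix, a cardinality vector, and a fold assignment; the function and variable names are our own illustration and not part of DataSAIL's API.

```python
import numpy as np

def leakage(sim: np.ndarray, kappa: np.ndarray, pi: np.ndarray) -> float:
    """Total similarity between data elements assigned to different folds (Eq. (2)).

    sim   -- symmetric (n x n) matrix of pairwise similarities between data elements/clusters
    kappa -- length-n vector of cardinalities (all ones for elementary data points)
    pi    -- length-n vector of fold assignments in {0, ..., k-1}
    """
    total = 0.0
    n = len(pi)
    for x in range(n):
        for y in range(x + 1, n):            # unordered pairs only
            if pi[x] != pi[y]:               # Iverson bracket [pi(x) != pi(y)]
                total += kappa[x] * kappa[y] * sim[x, y]
    return total
```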
Let si ∈ (0, 1) with \({\sum }_{i=1}^{k}{s}_{i}=1\) be user-provided desired split fractions for the k folds \({{{\mathcal{D}}}}_{i}^{\pi }\) induced by π (e.g., s1 = 0.8, s2 = 0.1, s3 = 0.1 for splitting the data into 80% training, 10% validation, and 10% test data). As a first set of constraints on π, we require that, for all pairs (i, r) ∈ [k] × [R] of folds and entity types, π respects the split fractions si up to a relative error ϵ ∈ [0, 1) (Eq. (3)).
For elementary data points where κ(x) = 1, this means that the fraction of data points of type r (that is, the data points in \({{{\mathcal{D}}}}_{t=r}\)) that are assigned to split i (the data points in \({{{\mathcal{D}}}}_{i}^{\pi }\)) matches the desired split fractions up to a relative error ϵ.
In many ML applications, the data elements \(x\in {{\mathcal{D}}}\) may belong to one or multiple of C classes σ(x) ⊆ [C], and we would like to compute stratified splits where the desired split fractions si are respected for each class c ∈ [C]. To model this requirement, we add a constraint (Eq. (4))
for each triple (i, r, c) ∈ [k] × [R] × [C] of folds, entity types, and classes, where \({{{\mathcal{D}}}}_{t=r}^{\sigma=c}:=\{x\in {{{\mathcal{D}}}}_{t=r}| c\in \sigma (x)\}\) is the set of data elements of type r that belong to class c, \({n}_{r}^{c}:={\sum}_{x\in {{{\mathcal{D}}}}_{t=r}^{\sigma=c}}\kappa (x)\) is the overall cardinality of such data elements, and δ ∈ [0, 1] is an acceptable relative error. Two observations are important at this point:
- For all ϵ ≥ δ, the set of constraints specified in Eq. (4) implies the constraints from Eq. (3), which can thus be discarded if δ ≤ ϵ.
- When no class information is available (i.e., all data elements have the same “dummy class”, σ(x) = {1}), Eqs. (3) and (4) are equivalent up to the choices of ϵ and δ, and Eq. (4) can thus be discarded.
We can now define the (k, R, C)-DataSAIL problem: given \({{\mathcal{D}}}\), sim (or dist), κ, t, σ, the split fractions si, and the error bounds ϵ and δ, find a mapping \(\pi :{{\mathcal{D}}}\to [k]\) that minimizes the leakage L(π) defined in Eq. (2) subject to the constraints in Eqs. (3) and (4).
Theorem 1
The (k, R, C)-DataSAIL problem is NP-hard for all \(k\in {{\mathbb{N}}}_{\ge 2}\), \(R\in {{\mathbb{N}}}_{\ge 1}\), and \(C\in {{\mathbb{N}}}_{\ge 1}\).
The proof for Theorem 1 is contained in Section “Proof of Theorem 1”. To compute leakage-reduced data splits despite this hardness result, we developed a heuristic workflow that first assigns individual data points to a fixed number of clusters and then solves a constant-size instance of (k, R, C)-DataSAIL where the clusters are treated as data elements (Fig. 2, see “Methods” for details). The (k, R, C)-DataSAIL problem can be formulated as an ILP with \({{\mathcal{O}}}(| {{\mathcal{D}}}| \cdot k+| {{\mathcal{D}}}{| }^{2})\) variables and \({{\mathcal{O}}}(k\cdot R\cdot C+| {{\mathcal{D}}}{| }^{2}\cdot k)\) constraints. The formulation of the ILP is given in Section “An integer linear programming formulation of the (k, R, C)-DataSAIL problem”. Note that when \({{\mathcal{D}}}\) is a constant-size set of pre-computed clusters over the original dataset, our ILP has a constant number of variables and constraints and can hence be solved efficiently. To solve these constant-size instances, we use standard off-the-shelf ILP solvers.
Once a mapping π has been computed for \({{\mathcal{D}}}\), we can use it to compute a mapping for \({{\mathcal{M}}}\): First, we unpack π and assign each data element z contained in the cluster \(x\in {{\mathcal{D}}}\) to π(x), i.e., we define π(z): = π(x) for all z ∈ x (Fig. 2d). Then, we assign the feature vector-label pair \(({x}_{j},{y}_{j})\in {{\mathcal{M}}}\) to split i if and only if π(z) = i holds for all data points z represented by xj (e.g., a drug and a protein in the case of drug-target interaction prediction). Feature vector-label pairs (xj, yj) with conflicting assignments for different data points represented by xj are discarded. For instance, in a drug-target prediction scenario, it may happen that the drug and the protein jointly represented by xj are assigned to different splits by π (e.g., the drug is assigned to the training fold and the protein is assigned to the test fold). In this case, (xj, yj) is discarded.
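A minimal sketch of this assignment step for a two-dimensional (e.g., drug-target) dataset; the data structures and names below are illustrative and not taken from DataSAIL's code base.

```python
def assign_interactions(records, drug_fold, prot_fold):
    """Assign each (drug, protein, label) record to a fold, or discard it on conflict.

    records   -- iterable of (drug_id, protein_id, label) tuples
    drug_fold -- dict mapping drug ids to fold indices (from unpacking the cluster assignment)
    prot_fold -- dict mapping protein ids to fold indices
    """
    folds, discarded = {}, []
    for drug, prot, label in records:
        if drug_fold[drug] == prot_fold[prot]:
            folds.setdefault(drug_fold[drug], []).append((drug, prot, label))
        else:  # the two entities fall into different folds, so the pair is dropped
            discarded.append((drug, prot, label))
    return folds, discarded
```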
Splitting biomolecular datasets
First, we consider one-dimensional data. We trained and tested four baseline ML models (random forests17 (RF), support vector machines18 (SVM), gradient boosting19 (XGB), and multilayer perceptrons20 (MLP)) and the deep learning model D-MPNN21 for molecular property predictions on random and similarity-based data splits computed with DataSAIL (S1) and two competitors. We used two widely used datasets from the MoleculeNet collection22 (QM8: regression problem, upper panels in Fig. 3; Tox21: classification problem, lower panels; Supplementary Fig. 1 shows further results for additional competitors and further one-dimensional datasets from MoleculeNet). As expected, splitting with DataSAIL leads to a better separation of training and test samples. In particular, on both datasets, DataSAIL’s data splits exhibit the lowest leakage L(π) among all compared data splits (see rightmost bar-plots in Fig. 3c, f). The other tools that aim to reduce information leakage—LoHi and DeepChem’s fingerprint-based splitting—only partly achieve this goal, thus leading to larger values of L(π). Overall, we observe that smaller values of L(π) are associated with larger drops in test performance in comparison to random splits (see Fig. 3c, f and Supplementary Fig. 2). This indicates that minimizing L(π) as implemented in DataSAIL indeed leads to harder splits and also shows that the ML models benchmarked here struggle to generalize to molecules with low Tanimoto similarity23,24,25 (the similarity measure we used to compute the values of L(π) reported in Fig. 3) with respect to the training data. For the deep learning model D-MPNN, these results are in line with findings reported in the original publication21, where the authors had shown that D-MPNN performs substantially worse on scaffold-based splits (which, like DataSAIL’s S1 splits, rely on molecular similarity) than on random splits (see Supplementary Table 1). In all figures, we depict the scaled L(π) as defined in Eq. (20).
Fig. 3: We show QM8 (a–c, upper panels) and Tox21 (d–f, lower panels) from the MoleculeNet benchmark collection. a, d show the t-SNE embeddings for the random split and b, e for DataSAIL's S1 split. c, f show ML model performances and information leakages for the different splits; performance is quantified using the mean absolute error (MAE, lower is better) for QM8 and the area under the receiver operating characteristic curve (ROC-AUC, higher is better) for Tox21.
Then, we consider two-dimensional data by splitting the LP-PDBBind dataset that contains binding affinities between 15,477 drugs and 12,718 protein targets (Fig. 4). We compared DataSAIL’s I2 and S2 splitting to I1 and S1 splitting for both drugs and targets, as well as to DeepChem’s fingerprint-based splitting26, LoHi27, and GraphPart28 (comparisons to additional splitting algorithms are shown in Supplementary Figs. 3 and 4). As for the one-dimensional data, splits computed by DataSAIL exhibit consistently low L(π) values (Fig. 4c, f, i), with the S2 splits performing particularly well. Moreover, splits with low L(π) values again lead to substantial drops in performance in comparison to I1 baselines which split the data randomly across the drug (Fig. 4c) or protein (Fig. 4f) axis. Another interesting observation is that, for all ML models, test performances are substantially worse for DataSAIL’s S2 splits than for all other tested data splits (Fig. 4i), showing that the tested binding affinity prediction models do not generalize well to scenarios where neither the drugs nor the proteins seen at inference time are similar to drugs and proteins contained in the training data. To our knowledge, DataSAIL is the only tool with out-of-the-box support for a splitting strategy that allows for the estimation of generalization capability in such scenarios. Strikingly, a comparison with the dataset-specific splits in the protein-ligand dataset PLINDER (Supplementary Table 2) shows that, in terms of L(π), DataSAIL’s automatically computed splits are competitive with splits curated for specific datasets.
Fig. 4: a, d show the t-SNE embeddings for random splits (I1), b, e for one-dimensional similarity-based splits (S1), and g, h for two-dimensional similarity-based splits (S2). a, b, g show t-SNE embeddings of the drugs' ECFP4 fingerprints; t-SNE embeddings of the proteins' ESM2-t12 embeddings53 are visualized in (d, e, h) (gray dots visualize data points that had to be dropped for the two-dimensional splits). c, f, i show ML model performances, measured via the root mean squared error (RMSE, lower is better), and information leakages for the different splits.
Another improvement of DataSAIL over existing methods is the combination of stratified splitting with information leakage minimization. To show the effect of DataSAIL in this setting, we use the SR-ARE subchallenge from Tox21, for which 6889 active and 942 inactive small molecules exist in the dataset. Here, DataSAIL is not compared to a fully random split but to the classical, similarity-unaware stratified split. We investigate the effect of additionally introducing similarity-awareness and observe that the corresponding DataSAIL splits reduce information leakage considerably (Fig. 5). Again, the comparison of L(π) on the right of Fig. 5c shows that DataSAIL computes splits with reduced information leakage between the folds in comparison to the classical method, and again we observe that these splits pose harder generalization tasks, leading to consistent drops in performance across all tested ML models.
Effect of solvers and hyper-parameters, scalability
An important parameter of DataSAIL is the number of clusters K used to construct the constant-size (k, R, C)-DataSAIL instance to be fed into the ILP solver. The first row of Fig. 6 shows how K affects the quality of the splits (panel a) and the runtime of DataSAIL (panel b). We clustered the Tox21 dataset into various numbers of clusters using Tanimoto similarities of ECFPs. We then fed the resulting (k, R, C)-DataSAIL instances into the ILP solvers GUROBI, MOSEK, and SCIP and set a time limit of 2 h per solver. Interestingly, the quality of the splits does not improve for K > 150 and is already good for K ≈ 50, showing that a rather small number of clusters is sufficient for obtaining leakage-reduced data splits with DataSAIL. In terms of quality, the tested ILP solvers perform very similarly. GUROBI is the fastest solver.
Figure 6c shows how the quality of the splits depends on the acceptable relative errors ϵ and δ, tested for the SR-ARE subchallenge from Tox21 with quality quantified following Eq. (2). The classes of this dataset were the binary labels of the SR-ARE subchallenge; accordingly, positive and negative samples were balanced across the splits. We observe that the quality mainly depends on ϵ, which controls how close the obtained split fractions have to be to the user-requested split fractions si. Contrary to our expectations, we did not identify a dependency on δ. However, this is only a small example, and general trends may differ as datasets can vary greatly.
Because MoleculeNet offers a variety of datasets with different sizes, structures, and similarities, we use it to benchmark the runtime of the various splitting techniques from DataSAIL, LoHi27, and DeepChem26 (Fig. 6d). As expected, the larger the dataset, the longer the algorithms take to compute their splits. While DataSAIL is the slowest algorithm, it shows benign scaling behavior and terminated for all datasets within a reasonable amount of time. In contrast, LoHi did not produce results for the MUV dataset within 12 h.
Discussion
Similarity is an often overlooked source of information leakage that is especially relevant when ML models are developed to be used on data with a distribution shift during inference. In this work, we present DataSAIL, a computational workflow and a tool to minimize similarity-induced information leakage when splitting data for ML model training and testing. DataSAIL provides better OOD data splits than state-of-the-art tools. We provide a formal definition of the underlying optimization problem, show that the problem is NP-hard, present a scalable heuristic, and empirically show that our heuristic can compute high-quality leakage-reduced data splits in a reasonable time, making DataSAIL a Swiss army knife for data splitting. DataSAIL can split one-dimensional and two-dimensional data and biochemical data of various types (small molecules, protein sequences, DNA and RNA sequences, genomes, and longer contigs). Our framework can also easily accommodate other data types, provided that a similarity or distance measure between the data points is available.
A limitation of our implementation of DataSAIL is that it only supports R ≤ 2 entity types, although our theoretical framework applies to arbitrary R-dimensional data. In future work, we plan to extend the DataSAIL implementation to arbitrary-dimensional data. Another current limitation is the clustering step (Fig. 2b), where our implementation relies on spectral or agglomerative clustering and does not support custom clustering algorithms that may be more appropriate for specific data types. DataSAIL’s implementation also cannot handle similarities between entities of different types, although the theoretical framework allows for that. Moreover, splitting two-dimensional data with the current version of DataSAIL can lead to the loss of some feature vector-label tuples when the two elementary data points represented by the feature vector are assigned to different splits. This problem could be mitigated by adding a data loss penalization term to the objective function minimized by DataSAIL. Expanding the implementation to the theoretical limits would improve the versatility of the Python package but also increase the number of variables to deal with. Furthermore, DataSAIL uses off-the-shelf ILP solvers that naturally have a high overhead because they are applicable to multiple settings. Tailoring a solver to DataSAIL’s specific needs may thus improve performance and runtime considerably.
Finally, it is important to stress that testing models on challenging OOD splits as computed by DataSAIL is not appropriate in every ML development setting: The leakage function L(π) minimized by DataSAIL becomes small when \(\,{\rm{sim}}\,(x,{x}^{{\prime} })\) is small for data elements x and \({x}^{{\prime} }\) that π assigns to different splits. Since DataSAIL allows the user to select from various pre-implemented similarity functions sim and is open to custom similarity functions, the user has full control over the behavior of L(π). Which choice of sim is most appropriate depends on the intended deployment scenario for the evaluated ML model. In particular, if the inference-time data is expected to be similar to the training data with respect to the similarity function sim selected by the user, evaluating an ML model on the data splits computed by DataSAIL will lead to overly pessimistic results. When using DataSAIL to compute splits for evaluating an ML model that is intended to generalize to OOD data, model evaluators hence have to ensure that the selected similarity function sim indeed captures the intended generalization task. Given an appropriate choice of sim, a positive correlation between L(π) and performance then indicates that the tested ML models struggle to generalize to the OOD scenarios modeled by sim.
Related to this, it may happen that the selected similarity function sim is correlated with the response variable to be predicted by the ML model (e.g., in a binary classification problem, it could happen that \(\,{\rm{sim}}\,(x,{x}^{{\prime} })\) is substantially larger for data points x and \({x}^{{\prime} }\) that fall into the same class than for data points that fall into different classes). In such scenarios, it is crucial that the user runs DataSAIL with the stratification constraint (4), where the classes C are defined according to the response variable. Without such a constraint, DataSAIL would compute splits that are highly imbalanced with respect to the response variable, which may again lead to overly pessimistic performance estimates.
Methods
Proof of Theorem 1
We show that the (k, R, C)-DataSAIL problem is NP-hard for all fixed constants \(k\in {{\mathbb{N}}}_{\ge 2}\) (number of folds), \(R\in {{\mathbb{N}}}_{\ge 1}\) (number of entity types), and \(C\in {{\mathbb{N}}}_{\ge 1}\) (number of classes). We proceed in three steps:
- Step 1: We show that there is a polynomial-time reduction from (k, R, 1)-DataSAIL to (k, R, C)-DataSAIL for arbitrary fixed C ≥ 2.
- Step 2: We show that there is a polynomial-time reduction from (k, 1, 1)-DataSAIL to (k, R, 1)-DataSAIL for arbitrary fixed R ≥ 2.
- Step 3: We show that (k, 1, 1)-DataSAIL is NP-hard via a polynomial-time reduction from the minimum k-section problem, which is known to be NP-hard.
Step 1 is straightforward: Given an instance \({I}_{k,R,1}=({{\mathcal{D}}},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, R, 1)-DataSAIL (we can ignore the δ if C = 1), we construct an instance Ik,R,C of (k, R, C)-DataSAIL by arbitrarily assigning the data elements \(x\in {{\mathcal{D}}}\) to C classes and setting δ: = 1. Then, Eq. (4) is vacuous, implying that each \(\pi :{{\mathcal{D}}}\to [k]\) is a solution to Ik,R,1 if and only if it is a solution to Ik,R,C.
For Step 2, let \({I}_{k,1,1}=({{{\mathcal{D}}}}_{1},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) be an instance of (k, 1, 1)-DataSAIL. We now construct an instance \({I}_{k,R,1}=({{{\mathcal{D}}}}^{{\prime} },{{\rm{sim}}}^{{\prime} },{\kappa }^{{\prime} },{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, R, 1)-DataSAIL as follows: \({{{\mathcal{D}}}}^{{\prime} }\) contains \({{{\mathcal{D}}}}_{1}\) and R − 1 additional copies \({{{\mathcal{D}}}}_{r}\), r = 2, …, R. Let xr denote the copy in \({{{\mathcal{D}}}}_{r}\) of the data element \({x}_{1}\in {{{\mathcal{D}}}}_{1}\). We define \({\kappa }^{{\prime} }({x}_{r}):=\kappa ({x}_{1})\) for all copies. For all pairs of data elements \(({x}_{r},{x}_{{r}^{{\prime} }}^{{\prime} })\in {{{\mathcal{D}}}}^{{\prime} }\times {{{\mathcal{D}}}}^{{\prime} }\), we define
where M is some large enough constant (\(M:={R}^{2}\cdot {\sum}_{{x}_{1}{x}_{1}^{{\prime} }\in \left({{{\mathcal{D}}}}_{1}\atop 2\right)}\,{{\rm{sim}}}\,({x}_{1},{x}_{1}^{{\prime} })\) suffices). That is, similarities between different copies of the same data element are set to a very high value M, and all other similarities are inherited from the (k, 1, 1)-DataSAIL instance Ik,1,1.
Given an optimal solution π1 for Ik,1,1, we can always define an induced solution πR for Ik,R,1 as πR(xr): = π1(x1). Eq. (3) continues to hold because we have \({\kappa }^{{\prime} }({x}_{r})=\kappa ({x}_{1})\) for all r ∈ [R] and \({x}_{1}\in {{{\mathcal{D}}}}_{1}\). For each edge \({x}_{1}{x}_{1}^{{\prime} }\) contained in the cut induced by π1, there are \(\left(\begin{array}{c}R\\ 2\end{array}\right)+R\) copies contained in the cut induced by πR, all of which have weight \(\,{{\rm{sim}}}\,({x}_{1},{x}_{1}^{{\prime} })\cdot \kappa ({x}_{1})\cdot \kappa ({x}_{1}^{{\prime} })\): R copies of the form \({x}_{r}{x}_{r}^{{\prime} }\) and \(\left(\begin{array}{c}R\\ 2\end{array}\right)\) copies of the form \({x}_{r}{x}_{{r}^{{\prime} }}^{{\prime} }\) with \(r\ne {r}^{{\prime} }\). Moreover, the cut contains no other edges since, by definition of πR, all copies of the same node end up in the same split. Hence, we have
where OPT1 and OPTR denote the optima of Ik,1,1 and Ik,R,1, respectively.
Conversely, let \({\pi }_{R}^{{\prime} }\) be an optimal solution for Ik,R,1. Then \({\pi }_{R}^{{\prime} }\) puts all copies of the same data elements into the same folds, since otherwise, we would have \(L({\pi }_{R}^{{\prime} })\ge M > L({\pi }_{R})\), contradicting the optimality of \({\pi }_{R}^{{\prime} }\). For all \({x}_{1}\in {{{\mathcal{D}}}}_{1}\), we now define \({\pi }_{1}^{{\prime} }({x}_{1}):={\pi }_{R}^{{\prime} }({x}_{1})\). By counting edge copies as above, we obtain:
By combining the chains of inequalities in Eqs. (8) and (9), we obtain that \({\pi }_{1}^{{\prime} }\) is optimal for Ik,1,1. This concludes Step 2 of our proof.
For Step 3, we have to show that (k, 1, 1)-DataSAIL is NP-hard for all constants \(k\in {{\mathbb{N}}}_{\ge 2}\). This can be done via a reduction from the minimum k-section problem. Given a graph G = (V, E) and a constant \(k\in {{\mathbb{N}}}_{\ge 2}\), the minimum k-section problem asks to find a partition π: V → [k] that splits V into k folds such that \({\sum}_{uv\in E}[\pi (u)\ne \pi (v)]\) is minimized and
holds for all i ∈ [k] (all folds have the same size). This problem is NP-hard, even when restricting to balanced instances with ∣V∣ = k ⋅ C for some \(C\in {{\mathbb{N}}}_{\ge 1}\)29, where the constraint in Eq. (10) simplifies to
Given a balanced instance (V, E, k) of the minimum k-section problem, we now define an instance \({I}_{k,1,1}=({{\mathcal{D}}},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, 1, 1)-DataSAIL by setting \({{\mathcal{D}}}:=V\), κ(x): = 1 for all \(x\in {{\mathcal{D}}}\), \(\,{\rm{sim}}\,(x,{x}^{{\prime} }):=[x{x}^{{\prime} }\in E]\) for all \((x,{x}^{{\prime} })\in {{\mathcal{D}}}\times {{\mathcal{D}}}\), si: = 1/k for all i ∈ [k], and ϵ: = 0. Clearly, any solution π to (V, E, k) also solves Ik,1,1 and vice versa. Moreover, we have \(L(\pi )={\sum}_{uv\in E}[\pi (u)\ne \pi (v)]\) by construction of sim. Consequently, solving (V, E, k) is equivalent to solving Ik,1,1, which completes the proof.
An integer linear programming formulation of the (k, R, C)-DataSAIL problem
Our formulation contains binary variables ξx,i for all \((x,i)\in {{\mathcal{D}}}\times [k]\) that encode whether the data element x is assigned to fold i. Moreover, it contains binary variables \({\zeta }_{x{x}^{{\prime} }}\) for all unordered pairs of data elements \(x{x}^{{\prime} }\in \left(\begin{array}{c}{{\mathcal{D}}}\\ 2\end{array}\right)\), which are defined such that \({\zeta }_{x,{x}^{{\prime} }}=1\) if and only if x and \({x}^{{\prime} }\) are assigned to different folds.
Constraint (13) ensures that ξ encodes a partition πξ. Constraints (14) and (15) ensure that πξ respects the constraints from Eqs. (3) and (4), respectively. Constraint (16) ensures that

$${\zeta }_{x{x}^{{\prime} }}=[{\pi }_{{{\boldsymbol{\xi }}}}(x)\ne {\pi }_{{{\boldsymbol{\xi }}}}({x}^{{\prime} })]\qquad (19)$$

holds for all pairs \(x{x}^{{\prime} }\in \left({{\mathcal{D}}}\atop 2\right)\), which implies that the objective minimized in (12) equals L(πξ). This, in turn, implies that the ILP given in Eqs. (12) to (18) is equivalent to (k, R, C)-DataSAIL. To see why (16) implies (19), note that the right-hand side of (16) is 0 for all i ∈ [k] if \({\pi }_{{{\boldsymbol{\xi }}}}(x)={\pi }_{{{\boldsymbol{\xi }}}}({x}^{{\prime} })\). Otherwise, the right-hand side of (16) is 1 for the unique fold i that contains x but not \({x}^{{\prime} }\). Since we minimize over ζ with non-negative coefficients in the objective, these considerations imply (19).
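As an illustration, the following CVXPY sketch encodes the single-type, single-class case (k, 1, 1) over a set of clusters. It mirrors the structure of the ILP above (assignment variables ξ, cut indicators ζ, fold-size constraints, and the leakage objective), but it is a simplified reading rather than DataSAIL's actual implementation; in particular, the fold-size constraint is one plausible interpretation of the relative-error requirement in Eq. (3), and the function name and arguments are our own.

```python
import cvxpy as cp
import numpy as np

def split_clusters(sim, kappa, s, eps=0.05):
    """Sketch of the (k, 1, 1)-DataSAIL ILP: assign n clusters to k folds.

    sim   -- symmetric (n x n) inter-cluster similarity matrix
    kappa -- length-n array of cluster cardinalities
    s     -- desired split fractions, e.g. [0.8, 0.1, 0.1]
    eps   -- acceptable relative error on the fold sizes
    """
    kappa = np.asarray(kappa, dtype=float)
    n, k = len(kappa), len(s)
    xi = cp.Variable((n, k), boolean=True)    # xi[x, i] = 1 iff cluster x is assigned to fold i
    zeta = cp.Variable((n, n), boolean=True)  # zeta[x, y] = 1 iff x and y land in different folds
                                              # (only the upper triangle is used below)
    constraints = [cp.sum(xi, axis=1) == 1]   # every cluster goes to exactly one fold
    total = kappa.sum()
    for i in range(k):                        # fold sizes respect s_i up to relative error eps
        fold_size = kappa @ xi[:, i]
        constraints += [fold_size >= (1 - eps) * s[i] * total,
                        fold_size <= (1 + eps) * s[i] * total]

    objective_terms = []
    for x in range(n):
        for y in range(x + 1, n):
            for i in range(k):                # forces zeta[x, y] = 1 whenever x and y differ on fold i
                constraints.append(zeta[x, y] >= xi[x, i] - xi[y, i])
            objective_terms.append(kappa[x] * kappa[y] * sim[x, y] * zeta[x, y])

    problem = cp.Problem(cp.Minimize(sum(objective_terms)), constraints)
    problem.solve()                           # requires a MIP-capable solver, e.g. GUROBI, MOSEK, or SCIP
    return np.argmax(xi.value, axis=1)        # fold index for every cluster
```

Because the instance is built over a constant number of pre-computed clusters, even the quadratic number of ζ variables and the associated constraints remain manageable for off-the-shelf solvers.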
Implementation details
We here provide details on the workflow of the heuristic implemented in the DataSAIL Python package and visualized in Fig. 2. In the first step (Fig. 2a), the user can choose among several algorithms to compute similarities or distances for different data types, or provide a custom matrix (see Table 2). All distances or similarities are scaled to [0, 1].
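For example, for small molecules given as SMILES strings, the Tanimoto similarity over ECFP fingerprints (one of the measures used in this work) can be computed with RDKit roughly as follows; the helper is an illustration, not DataSAIL's internal code, and assumes all SMILES are valid.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_matrix(smiles, radius=2, n_bits=1024):
    """Pairwise Tanimoto similarities of ECFP fingerprints; values already lie in [0, 1]."""
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    n = len(fps)
    sim = np.eye(n)
    for i in range(n - 1):
        # compare fingerprint i against all later fingerprints in one call
        sim[i, i + 1:] = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    return sim + np.triu(sim, 1).T  # mirror the upper triangle to obtain a symmetric matrix
```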
Subsequently (Fig. 2b), the input dataset \({{\mathcal{D}}}\) is clustered into K clusters, where K is a constant that the user can adjust. This is done separately for the two data types in two-dimensional datasets, leading to 2K clusters in total. To cluster similarities, DataSAIL uses spectral clustering30; for distances, agglomerative clustering is used31 (as implemented in scikit-learn32).
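Concretely, this clustering step corresponds to scikit-learn calls of the following form (a sketch assuming precomputed, [0, 1]-scaled matrices; the function name and defaults are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

def cluster_matrix(K, sim=None, dist=None, random_state=42):
    """Cluster data points into K clusters from a precomputed similarity OR distance matrix."""
    if sim is not None:
        # spectral clustering interprets the precomputed matrix as an affinity (similarity)
        return SpectralClustering(n_clusters=K, affinity="precomputed",
                                  random_state=random_state).fit_predict(sim)
    # agglomerative clustering works on precomputed distances (scikit-learn >= 1.2 uses `metric`)
    return AgglomerativeClustering(n_clusters=K, metric="precomputed",
                                   linkage="average").fit_predict(dist)
```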
Using the resulting set of clusters \({{\mathcal{C}}}\), DataSAIL then constructs a problem instance of size K (or 2K for two-dimensional datasets) as follows (Fig. 2c): The clusters \(A\in {{\mathcal{C}}}\) act as data elements, and inter-cluster similarities or distances \({{\rm{sim}}}_{{{\mathcal{C}}}},{{\rm{dist}}}_{{{\mathcal{C}}}}:{{\mathcal{C}}}\times {{\mathcal{C}}}\to {\mathbb{R}}\) are computed using average-, single-, or complete-linkage, depending on the user's choice (if distances are provided, the cluster distances are transformed to cluster similarities as \({{\rm{sim}}}_{{{\mathcal{C}}}}:=1-{{\rm{dist}}}_{{{\mathcal{C}}}}\)). Similarities between entities of different types are currently not supported, i.e., DataSAIL assumes \({{\rm{sim}}}_{{{\mathcal{C}}}}(A,{A}^{{\prime} })=0\) if A and \({A}^{{\prime} }\) are clusters of data elements of different types. Cardinalities are defined as \({\kappa }_{{{\mathcal{C}}}}(A):={\sum}_{x\in A}\kappa (x)\). The remaining parameters (number of folds k, type and class assignments t and σ, desired relative fold sizes si, error margins ϵ and δ) are inherited from the input provided by the user. This constant-size instance is then solved by feeding its ILP formulation (Section “An integer linear programming formulation of the (k, R, C)-DataSAIL problem”) into CVXPY33,34,35—a Python package for convex optimization that provides a unified interface for multiple solvers such as GUROBI36, MOSEK37, or SCIP38.
The ILP solvers return a partition \({\pi }_{{{\mathcal{C}}}}:{{\mathcal{C}}}\to [k]\) of the set of clusters \({{\mathcal{C}}}\). In the last step (Fig. 2d), this cluster partition is unpacked into a partition \(\pi :{{\mathcal{D}}}\to [k]\) of the original data points by setting \(\pi (x):={\pi }_{{{\mathcal{C}}}}(A)\) for each \(A\in {{\mathcal{C}}}\) and each x ∈ A. Note that, by definition of \({\kappa }_{{{\mathcal{C}}}}\), the fact that \({\pi }_{{{\mathcal{C}}}}\) respects the constraints specified in Eqs. (3) and (4) implies that the same constraints are also respected by π.
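In code, this unpacking is a simple dictionary expansion; a minimal sketch with illustrative names:

```python
def unpack(cluster_fold, cluster_members):
    """Propagate each cluster's fold to every data point it contains.

    cluster_fold    -- dict mapping cluster id -> fold index (the ILP solution)
    cluster_members -- dict mapping cluster id -> list of data point ids
    """
    return {x: cluster_fold[c] for c, members in cluster_members.items() for x in members}
```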
Datasets and machine learning models
For splitting one-dimensional data following (k, 1, 1)-DataSAIL, we use the MoleculeNet collection of benchmark datasets with different measured biochemical properties (e.g., toxicity or water solubility) that are to be predicted via regression or classification22. Since this benchmark contains multiple datasets from different sources, the performance metrics differ between datasets.
An application case for the (k, 2, 1)-DataSAIL problem is the LP-PDBBind dataset, comprising experimentally measured binding affinities of 19,443 protein-ligand complexes39. To demonstrate the effect of information leakage in stratified splitting, we use the stress response-antioxidant response element (SR-ARE) subchallenge from Tox2140 as an instance of the (k, 1, 2)-DataSAIL problem, where the two classes are active and inactive small molecules in this pathway.
Four classical ML models were trained for all datasets: RF17, SVM18, XGB19, and vanilla multi-layer perceptrons (MLP)20. For the one-dimensional datasets, we additionally trained a directed message-passing graph neural network (D-MPNN)21. For the two-dimensional dataset LP-PDBBind, we additionally trained DeepDTA, a deep learning model comprising CNN-encoders for proteins and ligands and an MLP predictor based on the encoder outputs41. Both deep learning models were selected because they showed top performance in their respective tasks42,43, do not rely on pre-training (which introduces a new aspect into OOD performance evaluation), and are reasonably easy to use. All models’ exact training setups and parameterizations are described in Section “Training of supervised machine learning models”.
Validation protocol
We empirically investigated how DataSAIL improves estimating the performance of ML models on unseen data in two ML tasks: molecular property prediction and drug-target interaction prediction. Here, we compare DataSAIL’s similarity-based splitting to random splitting (identity-based splits fulfilling Eq. (3)), fingerprint-based splitting, and LoHi. An extensive comparison against other methods for leakage-reduced data splitting mentioned above is provided in Supplementary Section S1. Details on the hyper-parameters of DataSAIL used for the experiments are given in Section “Hyper-parameter choices”.
Using the compared data splitting approaches, we split the benchmark datasets into 80% training and 20% test data (i.e., we set k = 2, s1 = 0.8, and s2 = 0.2 for our experiments). We then trained five ML models on the training sets and evaluated their performances on the test sets. We did not need a validation set because we did not tune hyper-parameters. All test performances were averaged over five splittings of the datasets, with shuffling of the dataset between splittings to increase variability. Whenever random data splitting yields consistently better test performances than splitting with DataSAIL, there is evidence for similarity-induced information leakage that can be avoided by similarity-based splitting as implemented in DataSAIL. To better visualize the reduction of information leakage by DataSAIL, we show the average L(π) of each splitting algorithm on the right of the performances. For better interpretability and comparability, we report a scaled L(π), obtained by scaling L(π) as defined in Eq. (2) to the interval [0, 1] (Eq. (20)).
Training of supervised machine learning models
We trained six different models, four of them (RFs, SVMs, XGB, and MLPs) based on the implementation in the scikit-learn v1.3.2 package. The SVMs and XGB were wrapped in the MultiOutput framework to make them applicable to multi-target learning in a one-versus-all fashion. Following Deng et al.1, we train the random forests as ensembles of 500 trees, the SVMs with linear kernels, and XGB with default parameters. For the MLPs, we use 3 hidden layers with sizes 512, 256, and 64 and train them for 200 epochs. Otherwise, all models are trained with default parameters and, for reproducibility, with random_state = 42. The training for these four models was conducted on standard CPUs. The input to all four models for molecular property prediction is a Morgan (ECFP4) fingerprint with radius 2, hashed to a bit vector of size 1024. When training them on LP-PDBBind splits, we concatenate the drug's Morgan fingerprint (radius 2, hashed to a bit vector of size 480) with the ESM-2 embedding of the target. We use the 12-layer ESM-2 model, producing a 480-dimensional protein embedding, obtained from the fair-esm v2.0.0 Python package.
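For reference, a sketch of the featurization and baseline configurations described above; the exact preprocessing and wrapper choices in the authors' experiment code may differ, and "XGB" is realized here with scikit-learn's gradient boosting, consistent with the statement that all four baselines use scikit-learn implementations.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def ecfp4_features(smiles, n_bits=1024):
    """Morgan (ECFP4) fingerprints with radius 2, hashed to n_bits-dimensional bit vectors."""
    rows = []
    for smi in smiles:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)   # write the bit vector into arr in place
        rows.append(arr)
    return np.vstack(rows)

# Baseline classifiers for the (multi-task) Tox21 setting, roughly as configured in the text.
baselines = {
    "RF": RandomForestClassifier(n_estimators=500, random_state=42),
    "SVM": MultiOutputClassifier(SVC(kernel="linear", random_state=42)),
    "XGB": MultiOutputClassifier(GradientBoostingClassifier(random_state=42)),
    "MLP": MLPClassifier(hidden_layer_sizes=(512, 256, 64), max_iter=200, random_state=42),
}
```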
The fifth model, D-MPNN, was taken from ChemProp v1.6.144. As this is a graph neural network, the input is the SMILES string of a molecule. This model was trained with default parameters for 50 epochs on an NVIDIA RTX 3090 with 24 GB GPU RAM. The sixth model is DeepDTA; we used the implementation from the LP-PDBBind GitHub repository (https://github.com/THGLab/LP-PDBBind/). DeepDTA is a state-of-the-art model for drug-target interaction prediction based on two CNN encoders for SMILES and amino acid sequence input. It was trained for 50 epochs with kernel size 8 in both encoders on the same NVIDIA RTX 3090 GPU.
We used three solvers, GUROBI v11.0.0, MOSEK v10.1.21, and SCIP v7.0.3, all retrieved through conda. For GUROBI and MOSEK, we obtained academic licenses from their respective platforms.
Hyper-parameter choices
Table 3 summarizes the hyper-parameters and configurations of DataSAIL used to obtain the results reported in this paper. Except for the results reported in Fig. 6a, b, where varying ILP solvers were used and a time limit of 2 h was imposed, all splits were computed with GUROBI and a time limit of 1000 s.
Data availability
The data from MoleculeNet22 was fetched through the Python package deepchem v2.7.1. Links to download the individual datasets are available at https://moleculenet.org/datasets-1. The data from LP-PDBBind39 was taken from their GitHub repository (https://github.com/THGLab/LP-PDBBind/). The data for PLINDER14 was downloaded from the Google Cloud Storage (https://console.cloud.google.com/storage/browser/plinder). The data was taken from v2, while the splits were determined from the v0 files. It is important to mention that, despite the difference in versions, the benchmark is backward compatible, i.e., data is extended but not altered. Therefore, the v0 splits can be extracted from the v2 data.
Code availability
All code for DataSAIL and the experiments is available on GitHub at https://github.com/kalininalab/DataSAIL. The code for the experiments is provided in the experiments subfolder. Furthermore, the code is deposited at Zenodo45.
References
Deng, J. et al. A systematic study of key elements underlying molecular property prediction. Nat. Commun. 14, 6395 (2023).
Chatterjee, A. et al. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat. Commun. 14, 1989 (2023).
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).
Bernett, J. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat. Methods 21, 1444–1453 (2024).
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).
Tossou, P., Wognum, C., Craig, M., Mary, H. & Noutahi, E. Real-world molecular out-of-distribution: specification and investigation. J. Chem. Inf. Model. 64, 697–711 (2024).
Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in ML-based science. Patterns 4, 100804 (2023).
Park, Y. & Marcotte, E. M. Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9, 1134–1136 (2012).
Hamp, T. & Rost, B. More challenges for machine-learning protein interactions. Bioinformatics 31, 1521–1525 (2015).
Bernett, J., Blumenthal, D. B. & List, M. Cracking the black box of deep sequence-based protein-protein interaction prediction. Brief. Bioinform. 25, bbae076 (2024).
Grimm, D. G. et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 36, 513–523 (2015).
Notin, P. et al. ProteinGym: large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems 36, 64331–64379 (2023).
Kovtun, D. et al. PINDER: The protein interaction dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603980 (2024).
Durairaj, J. et al. PLINDER: The protein-ligand interactions dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603955 (2024).
Cucker, F. & Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 39, 1–49 (2002).
Elangovan, A., He, J. & Verspoor, K. Memorization vs. generalization: quantifying data leakage in NLP performance evaluation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 16, 1325–1335 (2021).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Vapnik, V. N. The Nature Of Statistical Learning Theory (Springer Science & Business Media, 1999).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University (1974).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Tanimoto, T. T. An elementary mathematical theory of classification and prediction. Automatic Information Organization and Retrieval (McGraw-Hill, 1968).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019)
Steshin, S. Lo-Hi: Practical ML Drug Discovery Benchmark. Preprint at https://arXiv.org/abs/2310.06399 (2023).
Teufel, F. et al. GraphPart: homology partitioning for biological sequence analysis. NAR Genom. Bioinform. 5, lqad088 (2023).
Schmidt, T. J. On the Minimum Bisection Problem in Tree-Like and Planar Graphs. PhD thesis, Technical University of Munich (2017). Available from: https://mediatum.ub.tum.de/doc/1338548/404979.pdf.
Shi, J. & Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 888–905 (2000).
Jain A. K. & Dubes R. C. Algorithms For Clustering Data (Prentice-Hall, Inc., 1988).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Diamond, S. & Boyd, S. CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17, 1–5 (2016).
Agrawal, A., Verschueren, R., Diamond, S. & Boyd, S. A rewriting system for convex optimization problems. J. Control Decis. 5, 42–60 (2018).
Agrawal, A. & Boyd, S. Disciplined quasiconvex programming. Optim. Lett. 14, 1643–1657 (2020).
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. Available from https://www.gurobi.com.
MOSEK ApS. MOSEK Optimizer API for Python. Available from: https://docs.mosek.com/latest/pythonapi/index.html (2023).
Bestuzheva, K. et al. Enabling Research through the SCIP Optimization Suite 8.0. ACM Trans. Math. Softw. 49, 1–21 (2023).
Li, J. et al. Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. Preprint at https://arXiv.org/abs/2308.09639 (2023).
National Center for Advancing Translational Sciences. The Tox21 data challenge 2014. Available from: https://tripod.nih.gov/tox21/challenge/data.jsp (2014).
Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).
PWC. PapersWithCode.com. Accessed 1 February 2024. Available from: https://paperswithcode.com/paper/deepdta-deep-drug-target-binding-affinity.
PWC. PapersWithCode.com. Accessed: 1 February 2024. Available from: https://paperswithcode.com/paper/are-learned-molecular-representations-ready.
Heid, E. et al. Chemprop: a machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17 (2023).
Joeres, R., Blumenthal, D. B. & Kalinina, O.V. DataSAIL. Zenodo (2024). Available at https://doi.org/10.5281/zenodo.13938602.
Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nat. Chem. Biol. 18, 1033–1036 (2022).
Burns, J. W., Spiekermann, K. A., Bhattacharjee, H., Vlachos, D. G. & Green, W. H. Machine learning validation via rational dataset sampling with astartes. J. Open Source Softw. 8, 5996 (2023).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–14 (2016).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Acknowledgements
R.J. and O.V.K. thank Ilya Senatorov, Alexander Gress, and Anne Tolkmitt for fruitful discussions and the members of the Kalinina lab for testing the package. R.J. thanks Daniel Bojar for the opportunity to continue working on DataSAIL during his stay at the BojarLab, University of Gothenburg. R.J. was supported by the HelmholtzAI project XAI-Graph, the Knut and Alice Wallenberg Foundation, and the University of Gothenburg. D.B.B. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, grant no. 516188180), by the German Federal Ministry of Education and Research (BMBF, grant no. 031L0309A and 01KD2419A), and by the Klaus Tschira Foundation (grant no. 00.003.2024). O.V.K. acknowledges financial support from the Klaus Faber Foundation.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
O.V.K. conceived the project. R.J. implemented the Python framework and carried out all experiments. D.B.B. conceived the theory and proved the NP-hardness. D.B.B. and O.V.K. jointly supervised the work. All authors contributed equally to writing and reviewing the manuscript.
Ethics declarations
Competing interests
D.B.B. consults for BioVariance. The other authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Matthew Rosenblatt, Simon Steshin and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Joeres, R., Blumenthal, D.B. & Kalinina, O.V. Data splitting to avoid information leakage with DataSAIL. Nat Commun 16, 3337 (2025). https://doi.org/10.1038/s41467-025-58606-8