Abstract
Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leakage happens during a model’s training, the model risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the problem to find leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL’s impact on evaluating biomedical machine learning models.
Introduction
Supervised machine learning (ML) is one of the fastest-growing research fields, leading to advances in many computer and life science domains. Many bioinformatics fields benefit from using ML models, e.g., molecular property prediction1 and drug-target interaction prediction2.
For successful deployment of these ML models in real-world use cases, it is crucial that the reported performance estimates reliably represent model performance during inference. If the test set used for model evaluation does not represent the data used at inference time, the model can show inflated performance scores during testing, impeding successful model deployment in a real-world use case. Such misrepresentation can happen when the model uses information from the training set at test time, although this information is not available during inference. This phenomenon is called information leakage or data leakage3,4. Recent studies show that information leakage is a highly relevant problem in many subfields of ML-based research, leading to inflated performances and overoptimistic conclusions in biomedical ML research4,5,6 and beyond7.
The simplest form of information leakage is having the same samples in multiple folds of the data split. This is easy to control, and most common splitting techniques avoid this by removing duplicate data points. Another type of information leakage that is more complex to detect can occur when similarities between data points in the training and in the test sets are larger than similarities between data points in the training set and in the data that one intends to use during inference4. In such a case, an ML model is benchmarked on test data that is in-distribution with respect to the training data, although its intended use case is to yield reliable predictions also for out-of-distribution (OOD) data. Hence, a model may perform well on the test data by relying on similarity-based shortcuts that do not generalize to the intended real-world application scenario.
Especially for biomolecular data that exhibit complex dependency structures, one can easily fall into this trap by following a standard strategy in the ML community to randomly split a benchmarking dataset into training, validation, and test folds. For instance, it has been shown that this problem pervades the field of research on deep learning models to predict protein-protein interaction (PPI) from protein sequences8,9,10. While many of these models perform excellently when evaluated on the random data splits used in the original publications, performance often becomes close to random when evaluated on protein pairs with low homology to the training data that represents the desired use case to predict PPIs for poorly characterized proteins10. Another biomedical ML problem for which similar pitfalls have been described is the problem of predicting the deleteriousness of missense variants11,12. For this problem, it has been shown that when variants that are similar in that they affect the same protein are assigned to different splits, ML models can achieve excellent test performances by relying on protein-level shortcuts (e.g., a simple protein-level majority vote based on the variants in the training fold). Such models then generalize poorly to variants of sparsely annotated proteins and will systematically misclassify minority-class variants of proteins for which both deleterious and non-deleterious variants are seen at inference time11.
In this work, we address the problem of information leakage due to misleading evaluation on in-distribution data for models intended for deployment on OOD data. To this end, we developed DataSAIL, an algorithmic framework and tool to split datasets into multiple folds that allow us to realistically estimate model performance on OOD data. While such datasets have been curated for specific ML tasks on biomolecular data (e.g., PINDER13 and the gold standard dataset developed in10 for PPI prediction and PLINDER14 for protein-ligand interaction prediction), DataSAIL is generic and can be used to split any kind of data, as long as a similarity or distance measure for the contained data points is available.
We formulate the data splitting problem underlying DataSAIL as a constrained optimization problem, prove that this problem is NP-hard, and present a Python package that heuristically solves this problem using clustering and integer linear programming (ILP). Unlike existing tools and algorithms to compute data splits that reduce information leakage, DataSAIL can automatically compute splits for heterogeneous data of two different types and combines stratification with similarity-aware splitting (see Table 1 and section “Detailed description of related work” in the Supplementary Materials for comparison with existing tools). Moreover, DataSAIL is more versatile than existing approaches because it can be used out-of-the-box for various types of molecular data. We validate DataSAIL by showing how it can reduce leakage between training and test data for various ML models trained on both one- and two-dimensional biomolecular datasets.
Results
Data splits for supervised ML
In supervised ML, we are given a dataset \({{\mathcal{M}}}=\{({x}_{1},{y}_{1}),\ldots,({x}_{n},{y}_{n})\}\) of n samples with feature vectors xi ∈ X and labels yi ∈ Y, where X is a feature space and Y represents the space of labels. The goal is to learn a function fθ: X → Y that minimizes a loss function \({{\mathcal{L}}}({f}_{\theta }({{{\bf{x}}}}_{{{\bf{i}}}}),{y}_{i})\). This is achieved by selecting a hypothesis space \({{\mathcal{H}}}\) of candidate functions and fitting \({f}_{\theta }\in {{\mathcal{H}}}\) within the hypothesis space15. To develop a supervised ML model fθ, one needs to split \({{\mathcal{M}}}\) into three pairwise disjoint datasets: A training set \({{{\mathcal{M}}}}_{train}\) to learn the parameters θ, i.e., to select fθ from a fixed hypothesis space \({{\mathcal{H}}}\). A validation set \({{{\mathcal{M}}}}_{val}\) to optimize the hyper-parameters that determine the shape of \({{\mathcal{H}}}\) (e.g., number of hidden layers) or control the employed optimization strategy (e.g., learning rate or optimizer). And a test set \({{{\mathcal{M}}}}_{test}\) to assess the performance of the trained model on so far unseen data.
Our proposed method, DataSAIL, works for one-dimensional and two-dimensional datasets. In a one-dimensional dataset, one feature vector-output value pair (xi, yi) corresponds to one elementary data point; for example, a certain molecular property such as toxicity can be predicted for a single chemical compound (Fig. 1, 1D data). If the feature vector xi consists of two elementary data points (such as in drug-target interaction prediction, where xi represents a pair of a molecule and a protein target for which an interaction affinity yi should be predicted), we call this a two-dimensional dataset (Fig. 1, 2D data).
Fig. 1: The symbol “Y” indicates the presence of a measurement, while the phylogenetic trees next to the matrix visualizations illustrate similarities between samples. The figure showcases all splitting tasks and their interrelations. Samples assigned to training are highlighted with a blue background, validation samples are in yellow, and test samples are marked in red. Unassignable tiles are left white. Created in BioRender. Joeres, R. (2025) https://BioRender.com/w47j283.
Fig. 2: The input can be any type of data (with a special focus on biochemical data). DataSAIL then computes a pairwise distance or similarity matrix (a) and clusters the data into a constant number of clusters based on this matrix (b). These clusters are split into k folds, using off-the-shelf ILP solvers (c). From the partitioning of the clusters, DataSAIL infers the partitioning of the elementary data points (d). Created in BioRender. Joeres, R. (2025) https://BioRender.com/m81k197.
Importantly, in a two-dimensional dataset, the similarity between molecules can be defined over each dimension, e.g., along the drug and target dimensions. We define different splitting tasks with abbreviations based on whether they account for similarity-induced information leakage and the dimensions of the dataset (1 or 2). In identity-based splittings, the similarity between molecules is not considered, whereas in similarity-based splittings, it is accounted for. Those tasks are visualized in Fig. 1 and include identity-based one-dimensional splitting (I1), identity-based two-dimensional splitting (I2), similarity-based one-dimensional splitting (S1), similarity-based two-dimensional splitting (S2), and random interaction-based splitting (R). In two-dimensional data splitting, interactions may exist that cannot be assigned to any split without leaking information if the two interacting molecules are assigned to different folds. Therefore, interactions can get lost in two-dimensional data splitting (white tiles in panels I2 and S2 in Fig. 1).
The (k, R, C)-DataSAIL problem
In this section, we introduce (k, R, C)-DataSAIL, which formalizes the problem of splitting an R-dimensional dataset into k folds such that data leakage is minimized and C classes that are present in the data (e.g., confounders such as sex or the output labels yi if the space of labels Y is discrete) are distributed among the k folds such that each fold preserves the overall class distribution. Intuitively, we define (k, R, C)-DataSAIL as the problem of minimizing inter-split similarity while keeping similar class ratios across the splits. Although designed with biomedical applications in mind, our problem definition is generic and can be used for any dataset where a similarity or a distance measure is available for the contained data points. We present a generalized version of the problem for R-dimensional datasets, although our Python implementation only supports one- or two-dimensional input, i.e., R ≤ 2.
More formally, let \({{\mathcal{D}}}\) be the set of data points represented in \({{\mathcal{M}}}\) or a set of clusters defined over these data points (allowing clusters as elements of \({{\mathcal{D}}}\) will be important for our heuristic solver, as explained below). The data points/clusters \(x\in {{\mathcal{D}}}\) can have \(R\in {\mathbb{N}}\) different entity types t(x) ∈ [R]: = {1, …, R}. For instance, in a drug-target interaction scenario, we have R = 2, with the two entity types r ∈ {1, 2} corresponding to drugs and protein targets or clusters thereof. If \({{\mathcal{M}}}\) is a one-dimensional dataset, all data points/clusters have the same element type. Moreover, the data points or clusters \(x\in {{\mathcal{D}}}\) have cardinalities \(\kappa (x)\in {{\mathbb{N}}}_{\ge 1}\). For elementary data points, we always have κ(x) = 1; if \({{\mathcal{D}}}\) contains clusters, we may have κ(x) > 1. For each element type r ∈ [R], we write \({{{\mathcal{D}}}}_{t=r}:=\{x\in {{\mathcal{D}}}| t(x)=r\}\) and \({n}_{r}:={\sum}_{x\in {{{\mathcal{D}}}}_{t=r}}\kappa (x)\) to denote the set of all data elements of type r and their overall cardinality, respectively. Additionally, we assume that a similarity measure \(\,{\rm{sim}}\,:{{\mathcal{D}}}\times {{\mathcal{D}}}\to {\mathbb{R}}\) or a distance measure \(\,{\rm{dist}}\,:{{\mathcal{D}}}\times {{\mathcal{D}}}\to {\mathbb{R}}\) is available for \({{\mathcal{D}}}\) (see “Methods” for details on how to define sim and dist).
Information leakage has been defined qualitatively by Kaufman et al.3 and quantitatively by Elangovan et al.16.
This definition is incomplete, as only the biggest leak per test sample is considered and the validation set is ignored. In view of this, we define the leakage induced by a mapping \(\pi :{{\mathcal{D}}}\to \,[k]\) that splits \({{\mathcal{D}}}\) into k folds \({{{\mathcal{D}}}}_{i}^{\pi }:=\{x\in {{\mathcal{D}}}| \pi (x)=i\}\), i ∈ [k]: = {1, …, k}, as the total similarity

$$L(\pi ):=\sum _{x{x}^{{\prime} }\in \left({{\mathcal{D}}}\atop 2\right)}\kappa (x)\cdot \kappa ({x}^{{\prime} })\cdot {\rm{sim}}(x,{x}^{{\prime} })\cdot [\pi (x)\ne \pi ({x}^{{\prime} })]\qquad (2)$$

between data elements assigned to different folds. Here, \(\left[\cdot \right]:\{\perp,\top \}\to \{0,1\}\) is the Iverson bracket. The cardinalities are added as factors to our leakage function L to put a higher weight on similarities between larger clusters. Typically, we have k = 3 for data splitting in ML (\({{{\mathcal{D}}}}_{1}^{\pi }={{{\mathcal{D}}}}_{train}\), \({{{\mathcal{D}}}}_{2}^{\pi }={{{\mathcal{D}}}}_{val}\), \({{{\mathcal{D}}}}_{3}^{\pi }={{{\mathcal{D}}}}_{test}\)). We will define (k, R, C)-DataSAIL as the problem of minimizing L(π), given the two sets of constraints we introduce below.
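To make this concrete, the following Python sketch computes the leakage L(π) of Eq. (2) from a pairwise similarity matrix, a cardinality vector, and a fold assignment; the function and variable names are our own illustration and not part of DataSAIL's API.

```python
import numpy as np

def leakage(sim: np.ndarray, kappa: np.ndarray, pi: np.ndarray) -> float:
    """Total similarity between data elements assigned to different folds (Eq. (2)).

    sim   -- symmetric (n x n) matrix of pairwise similarities between data elements/clusters
    kappa -- length-n vector of cardinalities (all ones for elementary data points)
    pi    -- length-n vector of fold assignments in {0, ..., k-1}
    """
    total = 0.0
    n = len(pi)
    for x in range(n):
        for y in range(x + 1, n):            # unordered pairs only
            if pi[x] != pi[y]:               # Iverson bracket [pi(x) != pi(y)]
                total += kappa[x] * kappa[y] * sim[x, y]
    return total
```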
Let si ∈ (0, 1) with \({\sum }_{i=1}^{k}{s}_{i}=1\) be user-provided desired split fractions for the k folds \({{{\mathcal{D}}}}_{i}^{\pi }\) induced by π (e.g., s1 = 0.8, s2 = 0.1, s3 = 0.1 for splitting the data into 80% training, 10% validation, and 10% test data). As a first set of constraints on π, we require that, for all pairs (i, r) ∈ [k] × [R] of folds and entity types, π respects the split fractions si up to a relative error ϵ ∈ [0, 1) (Eq. (3)).
For elementary data points where κ(x) = 1, this means that the fraction of data points of type r (that is, the data points in \({{{\mathcal{D}}}}_{t=r}\)) that are assigned to split i (the data points in \({{{\mathcal{D}}}}_{i}^{\pi }\)) matches the desired split fractions up to a relative error ϵ.
In many ML applications, the data elements \(x\in {{\mathcal{D}}}\) may belong to one or multiple of C classes σ(x) ⊆ [C], and we would like to compute stratified splits where the desired split fractions si are respected for each class c ∈ [C]. To model this requirement, we add a constraint (Eq. (4))
for each triple (i, r, c) ∈ [k] × [R] × [C] of folds, entity types, and classes, where \({{{\mathcal{D}}}}_{t=r}^{\sigma=c}:=\{x\in {{{\mathcal{D}}}}_{t=r}| c\in \sigma (x)\}\) is the set of data elements of type r that belong to class c, \({n}_{r}^{c}:={\sum}_{x\in {{{\mathcal{D}}}}_{t=r}^{\sigma=c}}\kappa (x)\) is the overall cardinality of such data elements, and δ ∈ [0, 1] is an acceptable relative error. Two observations are important at this point:
- For all ϵ ≥ δ, the set of constraints specified in Eq. (4) implies the constraints from Eq. (3), which can thus be discarded if δ ≤ ϵ.
- When no class information is available (i.e., all data elements have the same “dummy class”, σ(x) = {1}), Eqs. (3) and (4) are equivalent up to the choices of ϵ and δ, and Eq. (4) can thus be discarded.
We can now define the (k, R, C)-DataSAIL problem: given \({{\mathcal{D}}}\), sim (or dist), κ, t, σ, the split fractions si, and the error bounds ϵ and δ, find a mapping \(\pi :{{\mathcal{D}}}\to [k]\) that minimizes the leakage L(π) defined in Eq. (2) subject to the constraints in Eqs. (3) and (4).
Theorem 1
The (k, R, C)-DataSAIL problem is NP-hard for all \(k\in {{\mathbb{N}}}_{\ge 2}\), \(R\in {{\mathbb{N}}}_{\ge 1}\), and \(C\in {{\mathbb{N}}}_{\ge 1}\).
The proof for Theorem 1 is contained in Section “Proof of Theorem 1”. To compute leakage-reduced data splits despite this hardness result, we developed a heuristic workflow that first assigns individual data points to a fixed number of clusters and then solves a constant-size instance of (k, R, C)-DataSAIL where the clusters are treated as data elements (Fig. 2, see “Methods” for details). The (k, R, C)-DataSAIL problem can be formulated as an ILP with \({{\mathcal{O}}}(| {{\mathcal{D}}}| \cdot k+| {{\mathcal{D}}}{| }^{2})\) variables and \({{\mathcal{O}}}(k\cdot R\cdot C+| {{\mathcal{D}}}{| }^{2}\cdot k)\) constraints. The formulation of the ILP is given in Section “An integer linear programming formulation of the (k, R, C)-DataSAIL problem”. Note that when \({{\mathcal{D}}}\) is a constant-size set of pre-computed clusters over the original dataset, our ILP has a constant number of variables and constraints and can hence be solved efficiently. To solve these constant-size instances, we use standard off-the-shelf ILP solvers.
Once a mapping π has been computed for \({{\mathcal{D}}}\), we can use it to compute a mapping for \({{\mathcal{M}}}\): First, we unpack π and assign each data element z contained in the cluster \(x\in {{\mathcal{D}}}\) to π(x), i.e., we define π(z): = π(x) for all z ∈ x (Fig. 2d). Then, we assign the feature vector-label pair \(({x}_{j},{y}_{j})\in {{\mathcal{M}}}\) to split i if and only if π(z) = i holds for all data points z represented by xj (e.g., a drug and a protein in the case of drug-target interaction prediction). Feature vector-label pairs (xj, yj) with conflicting assignments for different data points represented by xj are discarded. For instance, in a drug-target prediction scenario, it may happen that the drug and the protein jointly represented by xj are assigned to different splits by π (e.g., the drug is assigned to the training fold and the protein is assigned to the test fold). In this case, (xj, yj) is discarded.
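A minimal sketch of this assignment step for a two-dimensional (e.g., drug-target) dataset; the data structures and names below are illustrative and not taken from DataSAIL's code base.

```python
def assign_interactions(records, drug_fold, prot_fold):
    """Assign each (drug, protein, label) record to a fold, or discard it on conflict.

    records   -- iterable of (drug_id, protein_id, label) tuples
    drug_fold -- dict mapping drug ids to fold indices (from unpacking the cluster assignment)
    prot_fold -- dict mapping protein ids to fold indices
    """
    folds, discarded = {}, []
    for drug, prot, label in records:
        if drug_fold[drug] == prot_fold[prot]:
            folds.setdefault(drug_fold[drug], []).append((drug, prot, label))
        else:  # the two entities fall into different folds, so the pair is dropped
            discarded.append((drug, prot, label))
    return folds, discarded
```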
Splitting biomolecular datasets
First, we consider one-dimensional data. We trained and tested four baseline ML models (random forests17 (RF), support vector machines18 (SVM), gradient boosting19 (XGB), and multilayer perceptrons20 (MLP)) and the deep learning model D-MPNN21 for molecular property predictions on random and similarity-based data splits computed with DataSAIL (S1) and two competitors. We used two widely used datasets from the MoleculeNet collection22 (QM8: regression problem, upper panels in Fig. 3; Tox21: classification problem, lower panels; Supplementary Fig. 1 shows further results for additional competitors and further one-dimensional datasets from MoleculeNet). As expected, splitting with DataSAIL leads to a better separation of training and test samples. In particular, on both datasets, DataSAIL’s data splits exhibit the lowest leakage L(π) among all compared data splits (see rightmost bar-plots in Fig. 3c, f). The other tools that aim to reduce information leakage—LoHi and DeepChem’s fingerprint-based splitting—only partly achieve this goal, thus leading to larger values of L(π). Overall, we observe that smaller values of L(π) are associated with larger drops in test performance in comparison to random splits (see Fig. 3c, f and Supplementary Fig. 2). This indicates that minimizing L(π) as implemented in DataSAIL indeed leads to harder splits and also shows that the ML models benchmarked here struggle to generalize to molecules with low Tanimoto similarity23,24,25 (the similarity measure we used to compute the values of L(π) reported in Fig. 3) with respect to the training data. For the deep learning model D-MPNN, these results are in line with findings reported in the original publication21, where the authors had shown that D-MPNN performs substantially worse on scaffold-based splits (which, like DataSAIL’s S1 splits, rely on molecular similarity) than on random splits (see Supplementary Table 1). In all figures, we depict the scaled L(π) as defined in Eq. (20).
Fig. 3: We show QM8 (a–c, upper panels) and Tox21 (d–f, lower panels) from the MoleculeNet benchmark collection. a, d show the t-SNE embeddings for the random split and b, e for DataSAIL's S1 split. c, f show ML model performances and information leakages for the different splits; performance is quantified using the mean absolute error (MAE, lower is better) for QM8 and the area under the receiver operating characteristic curve (ROC-AUC, higher is better) for Tox21.
Then, we consider two-dimensional data by splitting the LP-PDBBind dataset that contains binding affinities between 15,477 drugs and 12,718 protein targets (Fig. 4). We compared DataSAIL’s I2 and S2 splitting to I1 and S1 splitting for both drugs and targets, as well as to DeepChem’s fingerprint-based splitting26, LoHi27, and GraphPart28 (comparisons to additional splitting algorithms are shown in Supplementary Figs. 3 and 4). As for the one-dimensional data, splits computed by DataSAIL exhibit consistently low L(π) values (Fig. 4c, f, i), with the S2 splits performing particularly well. Moreover, splits with low L(π) values again lead to substantial drops in performance in comparison to I1 baselines which split the data randomly across the drug (Fig. 4c) or protein (Fig. 4f) axis. Another interesting observation is that, for all ML models, test performances are substantially worse for DataSAIL’s S2 splits than for all other tested data splits (Fig. 4i), showing that the tested binding affinity prediction models do not generalize well to scenarios where neither the drugs nor the proteins seen at inference time are similar to drugs and proteins contained in the training data. To our knowledge, DataSAIL is the only tool with out-of-the-box support for a splitting strategy that allows for the estimation of generalization capability in such scenarios. Strikingly, a comparison with the dataset-specific splits in the protein-ligand dataset PLINDER (Supplementary Table 2) shows that, in terms of L(π), DataSAIL’s automatically computed splits are competitive with splits curated for specific datasets.
Fig. 4: a, d show the t-SNE embeddings for random splits (I1), b, e for one-dimensional similarity-based splits (S1), and g, h for two-dimensional similarity-based splits (S2). a, b, g show t-SNE embeddings of the drugs' ECFP4 fingerprints; t-SNE embeddings of the proteins' ESM2-t12 embeddings53 are visualized in (d, e, h) (gray dots visualize data points that had to be dropped for the two-dimensional splits). c, f, i show ML model performances, measured via the root mean squared error (RMSE, lower is better), and information leakages for the different splits.
Another improvement of DataSAIL over existing methods is the combination of stratified splitting with information leakage minimization. To show the effect of DataSAIL in this setting, we use the SR-ARE subchallenge from Tox21, for which 6889 active and 942 inactive small molecules exist in the dataset. Here, DataSAIL is not compared to a fully random split but to the classical, similarity-unaware stratified split. We investigate the effect of additionally introducing similarity-awareness and observe that the corresponding DataSAIL splits reduce information leakage considerably (Fig. 5). Again, the comparison of L(π) on the right of Fig. 5c shows that DataSAIL computes splits with reduced information leakage between the folds in comparison to the classical method, and again we observe that these splits pose harder generalization tasks, leading to consistent drops in performance across all tested ML models.
Effect of solvers and hyper-parameters, scalability
An important parameter of DataSAIL is the number of clusters K used to construct the constant-size (k, R, C)-DataSAIL instance to be fed into the ILP solver. The first row of Fig. 6 shows how K affects the quality of the splits (panel a) and the runtime of DataSAIL (panel b). We clustered the Tox21 dataset into various numbers of clusters using Tanimoto similarities of ECFPs. We then fed the resulting (k, R, C)-DataSAIL instances into the ILP solvers GUROBI, MOSEK, and SCIP and set a time limit of 2 h per solver. Interestingly, the quality of the splits does not improve for K > 150 and is already good for K ≈ 50, showing that a rather small number of clusters is sufficient for obtaining leakage-reduced data splits with DataSAIL. In terms of quality, the tested ILP solvers perform very similarly. GUROBI is the fastest solver.
Figure 6c shows how the quality of the splits depends on the acceptable relative errors ϵ and δ, tested for the SR-ARE subchallenge from Tox21 with quality quantified following Eq. (2). The classes of this dataset were the binary labels of the SR-ARE subchallenge; accordingly, positive and negative samples were balanced across the splits. We observe that the quality mainly depends on ϵ, which controls how close the obtained split fractions have to be to the user-requested split fractions si. Contrary to our expectations, we did not identify a dependency on δ. However, this is only a small example, and general trends may differ as datasets can vary greatly.
Because MoleculeNet offers a variety of datasets with different sizes, structures, and similarities, we use it to benchmark the runtime of the various splitting techniques from DataSAIL, LoHi27, and DeepChem26 (Fig. 6d). As expected, the larger the dataset, the longer the algorithms take to compute their splits. While DataSAIL is the slowest algorithm, it shows benign scaling behavior and terminated for all datasets within a reasonable amount of time. In contrast, LoHi did not produce results for the MUV dataset within 12 h.
Discussion
Similarity is an often overlooked source of information leakage that is especially relevant when ML models are developed to be used on data with a distribution shift during inference. In this work, we present DataSAIL, a computational workflow and a tool to minimize similarity-induced information leakage when splitting data for ML model training and testing. DataSAIL provides better OOD data splits than state-of-the-art tools. We provide a formal definition of the underlying optimization problem, show that the problem is NP-hard, present a scalable heuristic, and empirically show that our heuristic can compute high-quality leakage-reduced data splits in a reasonable time, making DataSAIL a Swiss army knife for data splitting. DataSAIL can split one-dimensional and two-dimensional data and biochemical data of various types (small molecules, protein sequences, DNA and RNA sequences, genomes, and longer contigs). Our framework can also easily accommodate other data types, provided that a similarity or distance measure between the data points is available.
A limitation of our implementation of DataSAIL is that it only supports R ≤ 2 entity types, although our theoretical framework applies to arbitrary R-dimensional data. In future work, we plan to extend the DataSAIL implementation to arbitrary-dimensional data. Another current limitation is the clustering step (Fig. 2b), where our implementation relies on spectral or agglomerative clustering and does not support custom clustering algorithms that may be more appropriate for specific data types. DataSAIL’s implementation also cannot handle similarities between entities of different types, although the theoretical framework allows for that. Moreover, splitting two-dimensional data with the current version of DataSAIL can lead to the loss of some feature vector-label tuples when the two elementary data points represented by the feature vector are assigned to different splits. This problem could be mitigated by adding a data loss penalization term to the objective function minimized by DataSAIL. Expanding the implementation to the theoretical limits would improve the versatility of the Python package but also increase the number of variables to deal with. Furthermore, DataSAIL uses off-the-shelf ILP solvers that naturally have a high overhead because they are applicable to multiple settings. Tailoring a solver to DataSAIL’s specific needs may thus improve performance and runtime considerably.
Finally, it is important to stress that testing models on challenging OOD splits as computed by DataSAIL is not appropriate in every ML development setting: The leakage function L(π) minimized by DataSAIL becomes small when \(\,{\rm{sim}}\,(x,{x}^{{\prime} })\) is small for data elements x and \({x}^{{\prime} }\) that π assigns to different splits. Since DataSAIL allows the user to select from various pre-implemented similarity functions sim and is open to custom similarity functions, the user has full control over the behavior of L(π). Which choice of sim is most appropriate depends on the intended deployment scenario for the evaluated ML model. In particular, if the inference-time data is expected to be similar to the training data with respect to the similarity function sim selected by the user, evaluating an ML model on the data splits computed by DataSAIL will lead to overly pessimistic results. When using DataSAIL to compute splits for evaluating an ML model that is intended to generalize to OOD data, model evaluators hence have to ensure that the selected similarity function sim indeed captures the intended generalization task. Given an appropriate choice of sim, a positive correlation between L(π) and performance then indicates that the tested ML models struggle to generalize to the OOD scenarios modeled by sim.
Related to this, it may happen that the selected similarity function sim is correlated with the response variable to be predicted by the ML model (e.g., in a binary classification problem, it could happen that \(\,{\rm{sim}}\,(x,{x}^{{\prime} })\) is substantially larger for data points x and \({x}^{{\prime} }\) that fall into the same class than for data points that fall into different classes). In such scenarios, it is crucial that the user runs DataSAIL with the stratification constraint (4), where the classes C are defined according to the response variable. Without such a constraint, DataSAIL would compute splits that are highly imbalanced with respect to the response variable, which may again lead to overly pessimistic performance estimates.
Methods
Proof of Theorem 1
We show that the (k, R, C)-DataSAIL problem is NP-hard for all fixed constants \(k\in {{\mathbb{N}}}_{\ge 2}\) (number of folds), \(R\in {{\mathbb{N}}}_{\ge 1}\) (number of entity types), and \(C\in {{\mathbb{N}}}_{\ge 1}\) (number of classes). We proceed in three steps:
- Step 1: We show that there is a polynomial-time reduction from (k, R, 1)-DataSAIL to (k, R, C)-DataSAIL for arbitrary fixed C ≥ 2.
- Step 2: We show that there is a polynomial-time reduction from (k, 1, 1)-DataSAIL to (k, R, 1)-DataSAIL for arbitrary fixed R ≥ 2.
- Step 3: We show that (k, 1, 1)-DataSAIL is NP-hard via a polynomial-time reduction from the minimum k-section problem, which is known to be NP-hard.
Step 1 is straightforward: Given an instance \({I}_{k,R,1}=({{\mathcal{D}}},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, R, 1)-DataSAIL (we can ignore the δ if C = 1), we construct an instance Ik,R,C of (k, R, C)-DataSAIL by arbitrarily assigning the data elements \(x\in {{\mathcal{D}}}\) to C classes and setting δ: = 1. Then, Eq. (4) is vacuous, implying that each \(\pi :{{\mathcal{D}}}\to [k]\) is a solution to Ik,R,1 if and only if it is a solution to Ik,R,C.
For Step 2, let \({I}_{k,1,1}=({{{\mathcal{D}}}}_{1},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) be an instance of (k, 1, 1)-DataSAIL. We now construct an instance \({I}_{k,R,1}=({{{\mathcal{D}}}}^{{\prime} },{{\rm{sim}}}^{{\prime} },{\kappa }^{{\prime} },{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, R, 1)-DataSAIL as follows: \({{{\mathcal{D}}}}^{{\prime} }\) contains \({{{\mathcal{D}}}}_{1}\) and R − 1 additional copies \({{{\mathcal{D}}}}_{r}\), r = 2, …, R. Let xr denote the copy in \({{{\mathcal{D}}}}_{r}\) of the data element \({x}_{1}\in {{{\mathcal{D}}}}_{1}\). We define \({\kappa }^{{\prime} }({x}_{r}):=\kappa ({x}_{1})\) for all copies. For all pairs of data elements \(({x}_{r},{x}_{{r}^{{\prime} }}^{{\prime} })\in {{{\mathcal{D}}}}^{{\prime} }\times {{{\mathcal{D}}}}^{{\prime} }\), we define
where M is some large enough constant (\(M:={R}^{2}\cdot {\sum}_{{x}_{1}{x}_{1}^{{\prime} }\in \left({{{\mathcal{D}}}}_{1}\atop 2\right)}\,{{\rm{sim}}}\,({x}_{1},{x}_{1}^{{\prime} })\) suffices). That is, similarities between different copies of the same data element are set to a very high value M, and all other similarities are inherited from the (k, 1, 1)-DataSAIL instance Ik,1,1.
Given an optimal solution π1 for Ik,1,1, we can always define an induced solution πR for Ik,R,1 as πR(xr): = π1(x1). Eq. (3) continues to hold because we have \({\kappa }^{{\prime} }({x}_{r})=\kappa ({x}_{1})\) for all r ∈ [R] and \({x}_{1}\in {{{\mathcal{D}}}}_{1}\). For each edge \({x}_{1}{x}_{1}^{{\prime} }\) contained in the cut induced by π1, there are \(\left(\begin{array}{c}R\\ 2\end{array}\right)+R\) copies contained in the cut induced by πR, all of which have weight \(\,{{\rm{sim}}}\,({x}_{1},{x}_{1}^{{\prime} })\cdot \kappa ({x}_{1})\cdot \kappa ({x}_{1}^{{\prime} })\): R copies of the form \({x}_{r}{x}_{r}^{{\prime} }\) and \(\left(\begin{array}{c}R\\ 2\end{array}\right)\) copies of the form \({x}_{r}{x}_{{r}^{{\prime} }}^{{\prime} }\) with \(r\ne {r}^{{\prime} }\). Moreover, the cut contains no other edges since, by definition of πR, all copies of the same node end up in the same split. Hence, we have
where OPT1 and OPTR denote the optima of Ik,1,1 and Ik,R,1, respectively.
Conversely, let \({\pi }_{R}^{{\prime} }\) be an optimal solution for Ik,R,1. Then \({\pi }_{R}^{{\prime} }\) puts all copies of the same data elements into the same folds, since otherwise, we would have \(L({\pi }_{R}^{{\prime} })\ge M > L({\pi }_{R})\), contradicting the optimality of \({\pi }_{R}^{{\prime} }\). For all \({x}_{1}\in {{{\mathcal{D}}}}_{1}\), we now define \({\pi }_{1}^{{\prime} }({x}_{1}):={\pi }_{R}^{{\prime} }({x}_{1})\). By counting edge copies as above, we obtain:
By combining the chains of inequalities in Eqs. (8) and (9), we obtain that \({\pi }_{1}^{{\prime} }\) is optimal for Ik,1,1. This concludes Step 2 of our proof.
For Step 3, we have to show that (k, 1, 1)-DataSAIL is NP-hard for all constants \(k\in {{\mathbb{N}}}_{\ge 2}\). This can be done via a reduction from the minimum k-section problem. Given a graph G = (V, E) and a constant \(k\in {{\mathbb{N}}}_{\ge 2}\), the minimum k-section problem asks to find a partition π: V → [k] that splits V into k folds such that \({\sum}_{uv\in E}[\pi (u)\ne \pi (v)]\) is minimized and
holds for all i ∈ [k] (all folds have the same size). This problem is NP-hard, even when restricting to balanced instances with ∣V∣ = k ⋅ C for some \(C\in {{\mathbb{N}}}_{\ge 1}\)29, where the constraint in Eq. (10) simplifies to
Given a balanced instance (V, E, k) of the minimum k-section problem, we now define an instance \({I}_{k,1,1}=({{\mathcal{D}}},\,{\rm{sim}}\,,\kappa,{\{{s}_{i}\}}_{i=1}^{k},\epsilon )\) of (k, 1, 1)-DataSAIL by setting \({{\mathcal{D}}}:=V\), κ(x): = 1 for all \(x\in {{\mathcal{D}}}\), \(\,{\rm{sim}}\,(x,{x}^{{\prime} }):=[x{x}^{{\prime} }\in E]\) for all \((x,{x}^{{\prime} })\in {{\mathcal{D}}}\times {{\mathcal{D}}}\), si: = 1/k for all i ∈ [k], and ϵ: = 0. Clearly, any solution π to (V, E, k) also solves Ik,1,1 and vice versa. Moreover, we have \(L(\pi )={\sum}_{uv\in E}[\pi (u)\ne \pi (v)]\) by construction of sim. Consequently, solving (V, E, k) is equivalent to solving Ik,1,1, which completes the proof.
An integer linear programming formulation of the (k, R, C)-DataSAIL problem
Our formulation contains binary variables ξx,i for all \((x,i)\in {{\mathcal{D}}}\times [k]\) that encode whether the data element x is assigned to fold i. Moreover, it contains binary variables \({\zeta }_{x{x}^{{\prime} }}\) for all unordered pairs of data elements \(x{x}^{{\prime} }\in \left(\begin{array}{c}{{\mathcal{D}}}\\ 2\end{array}\right)\), which are defined such that \({\zeta }_{x,{x}^{{\prime} }}=1\) if and only if x and \({x}^{{\prime} }\) are assigned to different folds.
Constraint (13) ensures that ξ encodes a partition πξ. Constraints (14) and (15) ensure that πξ respects the constraints from Eqs. (3) and (4), respectively. Constraint (16) ensures that

$${\zeta }_{x{x}^{{\prime} }}=[{\pi }_{{{\boldsymbol{\xi }}}}(x)\ne {\pi }_{{{\boldsymbol{\xi }}}}({x}^{{\prime} })]\qquad (19)$$

holds for all pairs \(x{x}^{{\prime} }\in \left({{\mathcal{D}}}\atop 2\right)\), which implies that the objective minimized in (12) equals L(πξ). This, in turn, implies that the ILP given in Eqs. (12) to (18) is equivalent to (k, R, C)-DataSAIL. To see why (16) implies (19), note that the right-hand side of (16) is 0 for all i ∈ [k] if \({\pi }_{{{\boldsymbol{\xi }}}}(x)={\pi }_{{{\boldsymbol{\xi }}}}({x}^{{\prime} })\). Otherwise, the right-hand side of (16) is 1 for the unique fold i that contains x but not \({x}^{{\prime} }\). Since we minimize over ζ with non-negative coefficients in the objective, these considerations imply (19).
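As an illustration, the following CVXPY sketch encodes the single-type, single-class case (k, 1, 1) over a set of clusters. It mirrors the structure of the ILP above (assignment variables ξ, cut indicators ζ, fold-size constraints, and the leakage objective), but it is a simplified reading rather than DataSAIL's actual implementation; in particular, the fold-size constraint is one plausible interpretation of the relative-error requirement in Eq. (3), and the function name and arguments are our own.

```python
import cvxpy as cp
import numpy as np

def split_clusters(sim, kappa, s, eps=0.05):
    """Sketch of the (k, 1, 1)-DataSAIL ILP: assign n clusters to k folds.

    sim   -- symmetric (n x n) inter-cluster similarity matrix
    kappa -- length-n array of cluster cardinalities
    s     -- desired split fractions, e.g. [0.8, 0.1, 0.1]
    eps   -- acceptable relative error on the fold sizes
    """
    kappa = np.asarray(kappa, dtype=float)
    n, k = len(kappa), len(s)
    xi = cp.Variable((n, k), boolean=True)    # xi[x, i] = 1 iff cluster x is assigned to fold i
    zeta = cp.Variable((n, n), boolean=True)  # zeta[x, y] = 1 iff x and y land in different folds
                                              # (only the upper triangle is used below)
    constraints = [cp.sum(xi, axis=1) == 1]   # every cluster goes to exactly one fold
    total = kappa.sum()
    for i in range(k):                        # fold sizes respect s_i up to relative error eps
        fold_size = kappa @ xi[:, i]
        constraints += [fold_size >= (1 - eps) * s[i] * total,
                        fold_size <= (1 + eps) * s[i] * total]

    objective_terms = []
    for x in range(n):
        for y in range(x + 1, n):
            for i in range(k):                # forces zeta[x, y] = 1 whenever x and y differ on fold i
                constraints.append(zeta[x, y] >= xi[x, i] - xi[y, i])
            objective_terms.append(kappa[x] * kappa[y] * sim[x, y] * zeta[x, y])

    problem = cp.Problem(cp.Minimize(sum(objective_terms)), constraints)
    problem.solve()                           # requires a MIP-capable solver, e.g. GUROBI, MOSEK, or SCIP
    return np.argmax(xi.value, axis=1)        # fold index for every cluster
```

Because the instance is built over a constant number of pre-computed clusters, even the quadratic number of ζ variables and the associated constraints remain manageable for off-the-shelf solvers.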
Implementation details
We here provide details on the workflow of the heuristic implemented in the DataSAIL Python package and visualized in Fig. 2. In the first step (Fig. 2a), the user can choose among several algorithms to compute similarities or distances for different data types, or provide a custom matrix (see Table 2). All distances or similarities are scaled to [0, 1].
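For example, for small molecules given as SMILES strings, the Tanimoto similarity over ECFP fingerprints (one of the measures used in this work) can be computed with RDKit roughly as follows; the helper is an illustration, not DataSAIL's internal code, and assumes all SMILES are valid.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_matrix(smiles, radius=2, n_bits=1024):
    """Pairwise Tanimoto similarities of ECFP fingerprints; values already lie in [0, 1]."""
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    n = len(fps)
    sim = np.eye(n)
    for i in range(n - 1):
        # compare fingerprint i against all later fingerprints in one call
        sim[i, i + 1:] = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    return sim + np.triu(sim, 1).T  # mirror the upper triangle to obtain a symmetric matrix
```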
Subsequently (Fig. 2b), the input dataset \({{\mathcal{D}}}\) is clustered into K clusters, where K is a constant that the user can adjust. This is done separately for the two data types in two-dimensional datasets, leading to 2K clusters in total. To cluster similarities, DataSAIL uses spectral clustering30; for distances, agglomerative clustering is used31 (as implemented in scikit-learn32).
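Concretely, this clustering step corresponds to scikit-learn calls of the following form (a sketch assuming precomputed, [0, 1]-scaled matrices; the function name and defaults are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

def cluster_matrix(K, sim=None, dist=None, random_state=42):
    """Cluster data points into K clusters from a precomputed similarity OR distance matrix."""
    if sim is not None:
        # spectral clustering interprets the precomputed matrix as an affinity (similarity)
        return SpectralClustering(n_clusters=K, affinity="precomputed",
                                  random_state=random_state).fit_predict(sim)
    # agglomerative clustering works on precomputed distances (scikit-learn >= 1.2 uses `metric`)
    return AgglomerativeClustering(n_clusters=K, metric="precomputed",
                                   linkage="average").fit_predict(dist)
```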
Using the resulting set of clusters \({{\mathcal{C}}}\), DataSAIL then constructs a problem instance of size K (or 2K for two-dimensional datasets) as follows (Fig. 2c): The clusters \(A\in {{\mathcal{C}}}\) act as data elements, and inter-cluster similarities or distances \({{\rm{sim}}}_{{{\mathcal{C}}}},{{\rm{dist}}}_{{{\mathcal{C}}}}:{{\mathcal{C}}}\times {{\mathcal{C}}}\to {\mathbb{R}}\) are computed using average-, single-, or complete-linkage, depending on the user's choice (if distances are provided, the cluster distances are transformed to cluster similarities as \({{\rm{sim}}}_{{{\mathcal{C}}}}:=1-{{\rm{dist}}}_{{{\mathcal{C}}}}\)). Similarities between entities of different types are currently not supported, i.e., DataSAIL assumes \({{\rm{sim}}}_{{{\mathcal{C}}}}(A,{A}^{{\prime} })=0\) if A and \({A}^{{\prime} }\) are clusters of data elements of different types. Cardinalities are defined as \({\kappa }_{{{\mathcal{C}}}}(A):={\sum}_{x\in A}\kappa (x)\). The remaining parameters (number of folds k, type and class assignments t and σ, desired relative fold sizes si, error margins ϵ and δ) are inherited from the input provided by the user. This constant-size instance is then solved by feeding its ILP formulation (Section “An integer linear programming formulation of the (k, R, C)-DataSAIL problem”) into CVXPY33,34,35—a Python package for convex optimization that provides a unified interface for multiple solvers such as GUROBI36, MOSEK37, or SCIP38.
The ILP solvers return a partition \({\pi }_{{{\mathcal{C}}}}:{{\mathcal{C}}}\to [k]\) of the set of clusters \({{\mathcal{C}}}\). In the last step (Fig. 2d), this cluster partition is unpacked into a partition \(\pi :{{\mathcal{D}}}\to [k]\) of the original data points by setting \(\pi (x):={\pi }_{{{\mathcal{C}}}}(A)\) for each \(A\in {{\mathcal{C}}}\) and each x ∈ A. Note that, by definition of \({\kappa }_{{{\mathcal{C}}}}\), the fact that \({\pi }_{{{\mathcal{C}}}}\) respects the constraints specified in Eqs. (3) and (4) implies that the same constraints are also respected by π.
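In code, this unpacking is a simple dictionary expansion; a minimal sketch with illustrative names:

```python
def unpack(cluster_fold, cluster_members):
    """Propagate each cluster's fold to every data point it contains.

    cluster_fold    -- dict mapping cluster id -> fold index (the ILP solution)
    cluster_members -- dict mapping cluster id -> list of data point ids
    """
    return {x: cluster_fold[c] for c, members in cluster_members.items() for x in members}
```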
Datasets and machine learning models
For splitting one-dimensional data following (k, 1, 1)-DataSAIL, we use the MoleculeNet collection of benchmark datasets with different measured biochemical properties (e.g., toxicity or water solubility) that are to be predicted via regression or classification22. Since this benchmark contains multiple datasets from different sources, the performance metrics differ between datasets.
An application case for the (k, 2, 1)-DataSAIL problem is the LP-PDBBind dataset, comprising experimentally measured binding affinities of 19,443 protein-ligand complexes39. To demonstrate the effect of information leakage in stratified splitting, we use the stress response-antioxidant response element (SR-ARE) subchallenge from Tox2140 as an instance of the (k, 1, 2)-DataSAIL problem, where the two classes are active and inactive small molecules in this pathway.
Four classical ML models were trained for all datasets: RF17, SVM18, XGB19, and vanilla multi-layer perceptrons (MLP)20. For the one-dimensional datasets, we additionally trained a directed message-passing graph neural network (D-MPNN)21. For the two-dimensional dataset LP-PDBBind, we additionally trained DeepDTA, a deep learning model comprising CNN-encoders for proteins and ligands and an MLP predictor based on the encoder outputs41. Both deep learning models were selected because they showed top performance in their respective tasks42,43, do not rely on pre-training (which introduces a new aspect into OOD performance evaluation), and are reasonably easy to use. All models’ exact training setups and parameterizations are described in Section “Training of supervised machine learning models”.
Validation protocol
We empirically investigated how DataSAIL improves estimating the performance of ML models on unseen data in two ML tasks: molecular property prediction and drug-target interaction prediction. Here, we compare DataSAIL’s similarity-based splitting to random splitting (identity-based splits fulfilling Eq. (3)), fingerprint-based splitting, and LoHi. An extensive comparison against other methods for leakage-reduced data splitting mentioned above is provided in Supplementary Section S1. Details on the hyper-parameters of DataSAIL used for the experiments are given in Section “Hyper-parameter choices”.
Using the compared data splitting approaches, we split the benchmark datasets into 80% training and 20% test data (i.e., we set k = 2, s1 = 0.8, and s2 = 0.2 for our experiments). We then trained five ML models on the training sets and evaluated their performances on the test sets. We did not need a validation set because we did not tune hyper-parameters. All test performances were averaged over five splittings of the datasets, with shuffling of the dataset between splittings to increase variability. Whenever random data splitting yields consistently better test performances than splitting with DataSAIL, there is evidence for similarity-induced information leakage that can be avoided by similarity-based splitting as implemented in DataSAIL. To better visualize the reduction of information leakage by DataSAIL, we show the average L(π) of each splitting algorithm on the right of the performances. For better interpretability and comparability, we report a scaled L(π), obtained by scaling L(π) as defined in Eq. (2) to the interval [0, 1] (Eq. (20)).
Training of supervised machine learning models
We trained six different models, four of them (RFs, SVMs, XGB, and MLPs) based on the implementation in the scikit-learn v1.3.2 package. The SVMs and XGB were wrapped in the MultiOutput framework to make them applicable to multi-target learning in a one-versus-all fashion. Following Deng et al.1, we train the random forests as ensembles of 500 trees, the SVMs with linear kernels, and XGB with default parameters. For the MLPs, we use 3 hidden layers with sizes 512, 256, and 64 and train them for 200 epochs. Otherwise, all models are trained with default parameters and, for reproducibility, with random_state = 42. The training for these four models was conducted on standard CPUs. The input to all four models for molecular property prediction is a Morgan (ECFP4) fingerprint with radius 2, hashed to a bit vector of size 1024. When training them on LP-PDBBind splits, we concatenate the drug's Morgan fingerprint (radius 2, hashed to a bit vector of size 480) with the ESM-2 embedding of the target. We use the 12-layer ESM-2 model, producing a 480-dimensional protein embedding, obtained from the fair-esm v2.0.0 Python package.
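For reference, a sketch of the featurization and baseline configurations described above; the exact preprocessing and wrapper choices in the authors' experiment code may differ, and "XGB" is realized here with scikit-learn's gradient boosting, consistent with the statement that all four baselines use scikit-learn implementations.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def ecfp4_features(smiles, n_bits=1024):
    """Morgan (ECFP4) fingerprints with radius 2, hashed to n_bits-dimensional bit vectors."""
    rows = []
    for smi in smiles:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)   # write the bit vector into arr in place
        rows.append(arr)
    return np.vstack(rows)

# Baseline classifiers for the (multi-task) Tox21 setting, roughly as configured in the text.
baselines = {
    "RF": RandomForestClassifier(n_estimators=500, random_state=42),
    "SVM": MultiOutputClassifier(SVC(kernel="linear", random_state=42)),
    "XGB": MultiOutputClassifier(GradientBoostingClassifier(random_state=42)),
    "MLP": MLPClassifier(hidden_layer_sizes=(512, 256, 64), max_iter=200, random_state=42),
}
```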
The fifth model, D-MPNN, was taken from ChemProp v1.6.144. As this is a graph neural network, the input is the SMILES string of a molecule. This model was trained with default parameters for 50 epochs on an NVIDIA RTX 3090 with 24 GB GPU RAM. The sixth model is DeepDTA; we used the implementation from the LP-PDBBind GitHub repository (https://github.com/THGLab/LP-PDBBind/). DeepDTA is a state-of-the-art model for drug-target interaction prediction based on two CNN encoders for SMILES and amino acid sequence input. It was trained for 50 epochs with kernel size 8 in both encoders on the same NVIDIA RTX 3090 GPU.
We used three solvers, GUROBI v11.0.0, MOSEK v10.1.21, and SCIP v7.0.3, all retrieved through conda. For GUROBI and MOSEK, we obtained academic licenses from their respective platforms.
Hyper-parameter choices
Table 3 summarizes the hyper-parameters and configurations of DataSAIL used to obtain the results reported in this paper. Except for the results reported in Fig. 6a, b, where varying ILP solvers were used and a time limit of 2 h was imposed, all splits were computed with GUROBI and a time limit of 1000 s.
Data availability
The data from MoleculeNet22 was fetched through the Python package deepchem v2.7.1. Links to download the individual datasets are available at https://moleculenet.org/datasets-1. The data from LP-PDBBind39 was taken from their GitHub repository (https://github.com/THGLab/LP-PDBBind/). The data for PLINDER14 was downloaded from the Google Cloud Storage (https://console.cloud.google.com/storage/browser/plinder). The data was taken from v2, while the splits were determined from the v0 files. It is important to mention that, despite the difference in versions, the benchmark is backward compatible, i.e., data is extended but not altered. Therefore, the v0 splits can be extracted from the v2 data.
Code availability
All code for DataSAIL and the experiments is available on GitHub at https://github.com/kalininalab/DataSAIL. The code for the experiments is provided in the experiments subfolder. Furthermore, the code is deposited at Zenodo45.
References
Deng, J. et al. A systematic study of key elements underlying molecular property prediction. Nat. Commun. 14, 6395 (2023).
Chatterjee, A. et al. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat. Commun. 14, 1989 (2023).
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).
Bernett, J. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat. Methods 21, 1444–1453 (2024).
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).
Tossou, P., Wognum, C., Craig, M., Mary, H. & Noutahi, E. Real-world molecular out-of-distribution: specification and investigation. J. Chem. Inf. Model. 64, 697–711 (2024).
Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in ML-based science. Patterns 4, 100804 (2023).
Park, Y. & Marcotte, E. M. Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9, 1134–1136 (2012).
Hamp, T. & Rost, B. More challenges for machine-learning protein interactions. Bioinformatics 31, 1521–1525 (2015).
Bernett, J., Blumenthal, D. B. & List, M. Cracking the black box of deep sequence-based protein-protein interaction prediction. Brief. Bioinform. 25, bbae076 (2024).
Grimm, D. G. et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 36, 513–523 (2015).
Notin, P. et al. ProteinGym: large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems 36, 64331–64379 (2023).
Kovtun, D. et al. PINDER: The protein interaction dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603980 (2024).
Durairaj, J. et al. PLINDER: The protein-ligand interactions dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603955 (2024).
Cucker, F. & Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 39, 1–49 (2002).
Elangovan, A., He, J. & Verspoor, K. Memorization vs. generalization: quantifying data leakage in NLP performance evaluation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 16, 1325–1335 (2021).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Vapnik, V. N. The Nature Of Statistical Learning Theory (Springer Science & Business Media, 1999).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University (1974).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Tanimoto, T. T. An elementary mathematical theory of classification and prediction. Automatic Information Organization and Retrieval (McGraw-Hill, 1968).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019)
Steshin, S. Lo-Hi: Practical ML Drug Discovery Benchmark. Preprint at https://arXiv.org/abs/2310.06399 (2023).
Teufel, F. et al. GraphPart: homology partitioning for biological sequence analysis. NAR Genom. Bioinform. 5, lqad088 (2023).
Schmidt, T. J. On the Minimum Bisection Problem in Tree-Like and Planar Graphs. PhD thesis, Technical University of Munich (2017). Available from: https://mediatum.ub.tum.de/doc/1338548/404979.pdf.
Shi, J. & Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 888–905 (2000).
Jain A. K. & Dubes R. C. Algorithms For Clustering Data (Prentice-Hall, Inc., 1988).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Diamond, S. & Boyd, S. CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17, 1–5 (2016).
Agrawal, A., Verschueren, R., Diamond, S. & Boyd, S. A rewriting system for convex optimization problems. J. Control Decis. 5, 42–60 (2018).
Agrawal, A. & Boyd, S. Disciplined quasiconvex programming. Optim. Lett. 14, 1643–1657 (2020).
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. Available from https://www.gurobi.com.
MOSEK ApS. MOSEK Optimizer API for Python. Available from: https://docs.mosek.com/latest/pythonapi/index.html (2023).
Bestuzheva, K. et al. Enabling Research through the SCIP Optimization Suite 8.0. ACM Trans. Math. Softw. 49, 1–21 (2023).
Li, J. et al. Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. Preprint at https://arXiv.org/abs/2308.09639 (2023).
National Center for Advancing Translational Sciences. The Tox21 data challenge 2014. Available from: https://tripod.nih.gov/tox21/challenge/data.jsp (2014).
Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).
PWC. PapersWithCode.com. Accessed 1 February 2024. Available from: https://paperswithcode.com/paper/deepdta-deep-drug-target-binding-affinity.
PWC. PapersWithCode.com. Accessed: 1 February 2024. Available from: https://paperswithcode.com/paper/are-learned-molecular-representations-ready.
Heid, E. et al. Chemprop: a machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17 (2023).
Joeres, R., Blumenthal, D. B. & Kalinina, O.V. DataSAIL. Zenodo (2024). Available at https://doi.org/10.5281/zenodo.13938602.
Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nat. Chem. Biol. 18, 1033–1036 (2022).
Burns, J. W., Spiekermann, K. A., Bhattacharjee, H., Vlachos, D. G. & Green, W. H. Machine learning validation via rational dataset sampling with astartes. J. Open Source Softw. 8, 5996 (2023).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–14 (2016).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Acknowledgements
R.J. and O.V.K. thank Ilya Senatorov, Alexander Gress, and Anne Tolkmitt for fruitful discussions and the members of the Kalinina lab for testing the package. R.J. thanks Daniel Bojar for the opportunity to continue working on DataSAIL during his stay at the BojarLab, University of Gothenburg. R.J. was supported by the HelmholtzAI project XAI-Graph, the Knut and Alice Wallenberg Foundation, and the University of Gothenburg. D.B.B. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, grant no. 516188180), by the German Federal Ministry of Education and Research (BMBF, grant no. 031L0309A and 01KD2419A), and by the Klaus Tschira Foundation (grant no. 00.003.2024). O.V.K. acknowledges financial support from the Klaus Faber Foundation.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
O.V.K. conceived the project. R.J. implemented the Python framework and carried out all experiments. D.B.B. conceived the theory and proved the NP-hardness. D.B.B. and O.V.K. jointly supervised the work. All authors contributed equally to writing and reviewing the manuscript.
Ethics declarations
Competing interests
D.B.B. consults for BioVariance. The other authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Matthew Rosenblatt, Simon Steshin and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Joeres, R., Blumenthal, D.B. & Kalinina, O.V. Data splitting to avoid information leakage with DataSAIL. Nat Commun 16, 3337 (2025). https://doi.org/10.1038/s41467-025-58606-8