Data leakage undermines the reliability of machine learning model evaluations, particularly for biological data. Here, the authors present a data splitting approach that minimizes information leakage between splits and enables more accurate assessment of model performance on out-of-distribution data.
- Roman Joeres
- David B. Blumenthal
- Olga V. Kalinina