Data-driven federated learning in drug discovery with knowledge distillation

Abstract

A main challenge for artificial intelligence in scientific research is ensuring access to sufficient, high-quality data for the development of impactful models. Despite the abundance of public data, the most valuable knowledge often remains embedded within confidential corporate data silos. Although industries are increasingly open to sharing non-competitive insights, such collaboration is often constrained by the confidentiality of the underlying data. Federated learning makes it possible to share knowledge without compromising data privacy, but it has notable limitations. Here, we introduce FLuID (federated learning using information distillation), a data-centric application of federated distillation tailored to drug discovery that aims to preserve data privacy. We validate FLuID in two experiments: first using public data to simulate a virtual consortium, and second in a real-world research collaboration between eight pharmaceutical companies. Although the alignment of the models with the partner-specific domain remains challenging, the data-driven nature of FLuID offers several avenues to mitigate domain shift. FLuID fosters knowledge sharing among pharmaceutical organizations, paving the way for a new generation of models with enhanced performance and an expanded applicability domain in biological activity predictions.
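
For readers unfamiliar with federated distillation, the minimal Python sketch below illustrates the general pattern the abstract refers to: each partner trains a "teacher" model on its private data, every teacher labels a shared public transfer set, and a "student" model is trained on the aggregated labels. The synthetic data, random-forest models and plain averaging rule are illustrative assumptions for this sketch only; they are not the FLuID implementation described in the paper.

    # Conceptual sketch of federated distillation (illustrative only, not the FLuID protocol).
    # Assumptions: synthetic binary "fingerprint" data, random-forest teachers and student,
    # and simple averaging of teacher probabilities as the aggregation rule.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def make_private_set(n=500, d=64):
        # Simulate one partner's private data: binary descriptors and a toy activity label.
        X = rng.integers(0, 2, size=(n, d)).astype(float)
        y = (X[:, :8].sum(axis=1) > 4).astype(int)
        return X, y

    # 1. Each partner trains a teacher locally; private data never leave the partner.
    teachers = []
    for _ in range(3):
        X_priv, y_priv = make_private_set()
        teachers.append(RandomForestClassifier(n_estimators=100, random_state=0).fit(X_priv, y_priv))

    # 2. Every teacher labels a shared public transfer set; only predictions are exchanged.
    X_transfer = rng.integers(0, 2, size=(2000, 64)).astype(float)
    soft_labels = np.mean([t.predict_proba(X_transfer)[:, 1] for t in teachers], axis=0)

    # 3. A student model distils the consensus knowledge from the aggregated labels.
    student = RandomForestClassifier(n_estimators=100, random_state=0)
    student.fit(X_transfer, (soft_labels > 0.5).astype(int))

The distillation step is what allows knowledge to be shared while each partner's private training data remain behind its firewall.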

Fig. 1: FL schemes.
Fig. 2: FLuID methodology.
Fig. 3: Concept validation and teacher–student comparison for a simulated consortium.
Fig. 4: Teacher and hybrid model comparison.
Fig. 5: Concept validation in an industrial setup.
Fig. 6: Public proof-of-concept datasets.

Data availability

All data for the public simulation experiment are held within the GitHub repository (https://github.com/LhasaLimited/FLuID_POC) in the data folder. A release of the repository is also available from Zenodo (ref. 68; https://zenodo.org/records/14531198).

Code availability

All code necessary to run the public portion of the experiment can be found in the GitHub repository (https://github.com/LhasaLimited/FLuID_POC) and is also available from Zenodo (ref. 68; https://zenodo.org/records/14531198). Opening and running the Jupyter notebook executes the entire experiment using the included data. The code is licensed under the GPLv3 licence.
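
The notebook can also be executed non-interactively. The snippet below is a minimal sketch using nbformat and nbclient; the notebook filename is a placeholder, not one taken from the repository.

    # Hypothetical non-interactive run of the repository's notebook (filename is a placeholder).
    import nbformat
    from nbclient import NotebookClient

    nb = nbformat.read("FLuID_POC.ipynb", as_version=4)   # placeholder path; adjust to the repository
    NotebookClient(nb, timeout=None).execute()            # runs every cell in order
    nbformat.write(nb, "FLuID_POC_executed.ipynb")        # saves the executed notebook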

References

  1. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).

  2. Zhou, W. et al. Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images. Nat. Commun. 12, 1259 (2021).

  3. Topaloglu, M. Y., Morrell, E. M., Rajendran, S. & Topaloglu, U. In the pursuit of privacy: the promises and predicaments of federated learning in healthcare. Front. Artif. Intell. 4, 746497 (2021).

  4. Brauneck, A. et al. Federated machine learning in data-protection-compliant research. Nat. Mach. Intell. 5, 2–4 (2023).

  5. Bak, M. et al. Federated learning is not a cure-all for data ethics. Nat. Mach. Intell. 6, 370–372 (2024).

  6. Zhu, H., Xu, J., Liu, S. & Jin, Y. Federated learning on non-IID data: a survey. Neurocomputing 465, 371–390 (2021).

  7. McMahan, B., Moore, E., Ramage, D., Hampson, S. & Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proc. 20th International Conference on Artificial Intelligence and Statistics Vol. 54, 1273–1282 (PMLR, 2017).

  8. Zhou, J. et al. A survey on federated learning and its applications for accelerating industrial internet of things. Preprint at https://doi.org/10.48550/arXiv.2104.10501 (2021).

  9. Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Comput. Ind. Eng. 149, 106854 (2020).

  10. Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37, 50–60 (2020).

  11. Yin, X., Zhu, Y. & Hu, J. A comprehensive survey of privacy-preserving federated learning: a taxonomy, review, and future directions. ACM Comput. Surv. 54, 131:1–131:36 (2021).

  12. Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 14, 1–210 (2021).

  13. Liu, J. et al. From distributed machine learning to federated learning: a survey. Knowl. Inf. Syst. 64, 885–917 (2022).

  14. Konečný, J., McMahan, H. B., Ramage, D. & Richtárik, P. Federated optimization: distributed machine learning for on-device intelligence. Preprint at https://doi.org/10.48550/arXiv.1610.02527 (2016).

  15. Abadi, M. et al. Deep learning with differential privacy. In Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security 308–318 (ACM, 2016).

  16. Dwork, C. Differential privacy: a survey of results. In Proc. International Conference on Theory and Applications of Models of Computation (eds Agrawal, M. et al.) 1–19 (Springer, 2008).

  17. Long, G., Tan, Y., Jiang, J. & Zhang, C. in Federated Learning: Privacy and Incentive (eds Yang, Q. et al.) 240–254 (Springer, 2020).

  18. Rieke, N. et al. The future of digital health with federated learning. npj Digit. Med. 3, 119 (2020).

  19. Choudhury, O. et al. Predicting adverse drug reactions on distributed health data using federated learning. AMIA Annu. Symp. Proc. 2019, 313–322 (2020).

  20. Nguyen, D. C. et al. Federated learning for smart healthcare: a survey. ACM Comput. Surv. 55, 1–37 (2022).

  21. Xiong, Z. et al. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches. Sci. China Life Sci. 65, 529–539 (2022).

  22. Manu, D. et al. FL-DISCO: federated generative adversarial network for graph-based molecule drug discovery: special session paper. In Proc. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) 1–7 (IEEE, 2021).

  23. Naz, S., Phan, K. T. & Chen, Y.-P. P. A comprehensive review of federated learning for COVID-19 detection. Int. J. Intell. Syst. 37, 2371–2392 (2022).

  24. Goldsmith, M. R. et al. in Crop Protection Products for Sustainable Agriculture (eds Rauzan, B. M. & Lorsbach, B. A.) Vol. 1390, 181–200 (American Chemical Society, 2021).

  25. Heyndrickx, W. et al. MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. J. Chem. Inf. Model. 64, 2331–2344 (2024).

  26. Hanser, T. Federated learning for molecular discovery. Curr. Opin. Struct. Biol. 79, 102545 (2023).

  27. Konečný, J. et al. Federated learning: strategies for improving communication efficiency. Preprint at https://doi.org/10.48550/arXiv.1610.05492 (2017).

  28. Wu, C., Wu, F., Lyu, L., Huang, Y. & Xie, X. Communication-efficient federated learning via knowledge distillation. Nat. Commun. 13, 2032 (2022).

  29. Zhu, X. Semi-Supervised Learning Literature Survey (Univ. Wisconsin, 2005); https://minds.wisconsin.edu/handle/1793/60444

  30. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at http://arxiv.org/abs/1503.02531 (2015).

  31. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I. & Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. Preprint at http://arxiv.org/abs/1610.05755 (2016).

  32. Papernot, N. et al. Scalable private learning with PATE. Preprint at http://arxiv.org/abs/1802.08908 (2018).

  33. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).

  34. Dietterich, T. G. Ensemble methods in machine learning. In Proc. International Workshop on Multiple Classifier Systems (eds Kittler, J. & Roli, F.) 1–15 (Springer, 2000).

  35. Li, L. et al. Federated distillation: a survey. Preprint at https://doi.org/10.48550/arXiv.2404.08564 (2024).

  36. Eldar, Y. C. et al. in Machine Learning and Wireless Communications (eds Goldsmith, A. et al.) 457–485 (Cambridge Univ. Press, 2022).

  37. Li, D. & Wang, J. FedMD: heterogenous federated learning via model distillation. Preprint at https://doi.org/10.48550/arXiv.1910.03581 (2019).

  38. Itahara, S., Nishio, T., Koda, Y., Morikura, M. & Yamamoto, K. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. IEEE Trans. Mob. Comput. 22, 191–205 (2023).

  39. Sattler, F., Marban, A., Rischke, R. & Samek, W. Communication-efficient federated distillation. Preprint at http://arxiv.org/abs/2012.00632 (2020).

  40. Sui, D. et al. FedED: federated learning via ensemble distillation for medical relation extraction. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 2118–2128 (Association for Computational Linguistics, 2020).

  41. Han, S. et al. FedX: unsupervised federated learning with cross knowledge distillation. In Proc. European Conference on Computer Vision (eds Avidan, S. et al.) 691–707 (Springer, 2022).

  42. Jeong, E. et al. Communication-efficient on-device machine learning: federated distillation and augmentation under non-IID private data. Preprint at http://arxiv.org/abs/1811.11479 (2023).

  43. Choquette-Choo, C. A. et al. CaPC learning: confidential and private collaborative learning. Preprint at https://doi.org/10.48550/arXiv.2102.05188 (2021).

  44. PyGrid: a peer-to-peer platform for private data science and federated learning. OpenMined Blog https://blog.openmined.org/what-is-pygrid-demo/ (2020).

  45. FLuID POC platform. GitHub https://github.com/LhasaLimited/FLuID_POC (2023).

  46. Hancox, J. C., McPate, M. J., El Harchi, A. & Zhang, Y. H. The hERG potassium channel and hERG screening for drug-induced torsades de pointes. Pharmacol. Ther. 119, 118–132 (2008).

  47. Wolford, B. What is GDPR, the EU’s new data protection law? GDPR.eu https://gdpr.eu/what-is-gdpr/ (2018).

  48. Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In Proc. IEEE Symposium on Security and Privacy (SP) 3–18 (IEEE, 2017).

  49. Raipuria, G., Bonthu, S. & Singhal, N. Noise robust training of segmentation model using knowledge distillation. In Proc. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021 (eds Del Bimbo, A. et al.) 97–104 (Springer, 2021).

  50. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).

  51. Bassani, D., Brigo, A. & Andrews-Morger, A. Federated learning in computational toxicology: an industrial perspective on the Effiris Hackathon. Chem. Res. Toxicol. 36, 1503–1517 (2023).

  52. Kim, S. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395 (2021).

  53. Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminformatics 7, 20 (2015).

  54. Glen, R. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs Investig. Drugs J. 9, 199–204 (2006).

  55. Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J. & Osman, J. Parameter based methods for compound selection from chemical databases. Quant. Struct. Act. Relatsh. 15, 285–289 (1996).

  56. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  57. Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry. J. Med. Chem. 57, 3186–3204 (2014).

  58. Maggiora, G. M. Concepts and Applications of Molecular Similarity (eds Johnson, M. A. & Maggiora, G. M.) (John Wiley & Sons, 1990).

  59. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

  60. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).

  61. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. Math. Stat. Probab. (eds Le Cam, L. M. & Neyman, J.) Vol. 1, 281–298 (1967).

  62. Siramshetty, V. B., Chen, Q., Devarakonda, P. & Preissner, R. The catch-22 of predicting hERG blockade using publicly accessible bioactivity data. J. Chem. Inf. Model. 58, 1224–1233 (2018).

  63. Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition Vol. 1, 278–282 (1995).

  64. Hanser, T., Barber, C., Marchaland, J. F. & Werner, S. Applicability domain: towards a more formal definition. SAR QSAR Environ. Res. https://doi.org/10.1080/1062936X.2016.1250229 (2016).

  65. Hanser, T. et al. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. J. Cheminformatics 6, 21 (2014).

  66. Carhart, R., Smith, D. H. & Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. https://doi.org/10.1021/ci00046a002 (1985).

  67. Hanser, T., Steinmetz, F. P., Plante, J., Rippmann, F. & Krier, M. Avoiding hERG-liability in drug design via synergetic combinations of different (Q)SAR methodologies and data sources: a case study in an industrial setting. J. Cheminformatics 11, 9 (2019).

  68. Hanser, T., Werner, S. & Plante, J. FLuID POC: a simulation platform for federated distillation. Zenodo https://doi.org/10.5281/zenodo.14531198 (2024).

  69. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).


Acknowledgements

The Lhasa Limited authors acknowledge the invaluable collaborative effort, scientific discussions and knowledge sharing made possible by all the industrial partners in this project.

Author information

Contributions

E.A., A.A., L.T.A., R.J.B., A.B., A.D., S.G., N.G., D.K., L.K., W.M., F.R., Y.S., F.S., A.W., J.W. and T.Y.: data access and preparation, domain expertise and continuous input and knowledge sharing. T.H., J.-F.M., J.P., R.v.D. and S.W.: methodology design and implementation, experiment orchestration, virtual simulation platform development, data analytics and paper preparation. C.B. and L.J.: managing the partner–Lhasa relationships.

Corresponding author

Correspondence to Thierry Hanser.

Ethics declarations

Competing interests

The authors declare no competing interests, except for R.J.B., who is an employee and shareholder of Sanofi, a pharmaceutical R&D company that may benefit from the outcome of this research.

Peer review

Peer review information

Nature Machine Intelligence thanks Alissa Brauneck, Gabriele Buchholtz, Stuart McLennan and Umit Topaloglu for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hanser, T., Ahlberg, E., Amberg, A. et al. Data-driven federated learning in drug discovery with knowledge distillation. Nat Mach Intell 7, 423–436 (2025). https://doi.org/10.1038/s42256-025-00991-2
