Multiscale topology-enabled structure-to-sequence transformer for protein–ligand interaction predictions

Abstract

Despite the success of pretrained natural language processing (NLP) models in various fields, their application in computational biology has been hindered by their reliance on biological sequences, which ignores vital three-dimensional (3D) structural information that is incompatible with the sequential architecture of NLP models. Here we present a topological transformer (TopoFormer), built by integrating NLP models with a multiscale topology technique, the persistent topological hyperdigraph Laplacian (PTHL). PTHL systematically converts intricate 3D protein–ligand complexes at various spatial scales into NLP-admissible sequences of topological invariants and homotopic shapes, capturing essential interactions across those scales. TopoFormer achieves high scoring accuracy and strong performance in ranking, docking and screening tasks on several benchmark datasets. The approach can also be used to convert general high-dimensional structured data into NLP-compatible sequences, paving the way for broader NLP-based research.
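
To make the structure-to-sequence idea concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the PTHL construction: it substitutes an ordinary graph Laplacian on protein–ligand contacts for the hyperdigraph Laplacian, and the function name `laplacian_sequence`, the feature choice and the scale values are all hypothetical. It only illustrates how sweeping a distance cutoff can turn a 3D complex into an ordered sequence of topological and spectral features, one vector per scale.

```python
import numpy as np


def laplacian_sequence(protein_xyz, ligand_xyz, scales, tol=1e-8):
    """Return one feature vector per scale (hypothetical helper).

    Each vector is [betti0, min, mean, max], where betti0 is the number of
    zero eigenvalues of the contact-graph Laplacian (connected components)
    and the remaining entries summarize the nonzero spectrum.
    """
    protein_xyz = np.asarray(protein_xyz, dtype=float)
    ligand_xyz = np.asarray(ligand_xyz, dtype=float)
    n_p, n_l = len(protein_xyz), len(ligand_xyz)

    # Pairwise protein-ligand distances, shape (n_p, n_l).
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    tokens = []
    for r in scales:
        # Bipartite contact graph: edges only between protein and ligand atoms.
        contact = (dist <= r).astype(float)
        adj = np.zeros((n_p + n_l, n_p + n_l))
        adj[:n_p, n_p:] = contact
        adj[n_p:, :n_p] = contact.T

        # Combinatorial graph Laplacian L = D - A and its real spectrum.
        lap = np.diag(adj.sum(axis=1)) - adj
        eig = np.linalg.eigvalsh(lap)

        is_zero = np.abs(eig) < tol
        betti0 = int(is_zero.sum())
        nonzero = eig[~is_zero]
        if nonzero.size:
            stats = [nonzero.min(), nonzero.mean(), nonzero.max()]
        else:
            stats = [0.0, 0.0, 0.0]
        tokens.append([float(betti0), *stats])

    # Shape (len(scales), 4): an ordered sequence a transformer could consume.
    return np.array(tokens)


if __name__ == "__main__":
    protein = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
    ligand = [[1.0, 0.0, 0.0]]
    print(laplacian_sequence(protein, ligand, scales=[0.5, 2.0, 20.0]))
```

The zero-eigenvalue count plays the role of a topological invariant, while the nonzero eigenvalues carry geometric information that changes even when the topology does not. That is the general motivation for Laplacian-based descriptors, shown here in its simplest graph form.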

Fig. 1: Schematic illustration of the overall TopoFormer model.
Fig. 2: Performance of TopoFormer on scoring and ranking tasks.
Fig. 3: Performance of TopoFormer on docking and screening tasks.
Fig. 4: Illustration of the concepts related to topological sequence embedding.

Data availability

The training dataset used in this study comprises protein–ligand complexes drawn from several PDBbind datasets: CASF-2007, CASF-2013, CASF-2016 and PDBbind v.2020. Redundant entries were removed during curation, leaving a total of 19,513 non-overlapping complexes. All data used in this study can be downloaded from the official PDBbind website: http://www.pdbbind.org.cn/index.php. We also provide additional resources at https://github.com/WeilabMSU/TopoFormer, including the topological embedded features used in both TopoFormer and TopoFormers, sequence-based features derived from the Transformer-CPZ (ref. 28) and ESM (ref. 33) models, and all additional generated poses with their associated scores, which were used for the docking and screening tasks. Instructions for accessing the poses are also available via Zenodo at https://doi.org/10.5281/zenodo.10892799 (ref. 66).
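
As an illustration of the kind of redundancy removal described above, the sketch below merges several collections of PDB identifiers and keeps each complex once. It is an assumption about how such a step could look, not the authors' curation pipeline: the function name `merge_unique_complexes`, the case-insensitive matching rule and the toy identifiers are all hypothetical.

```python
def merge_unique_complexes(datasets):
    """Merge {dataset_name: iterable of PDB IDs} into one non-overlapping set.

    Returns a dict mapping each normalized PDB ID to the first dataset in
    which it appeared. IDs are compared case-insensitively after stripping
    whitespace (an assumed normalization rule).
    """
    first_seen = {}
    for name, pdb_ids in datasets.items():
        for pdb_id in pdb_ids:
            key = pdb_id.strip().lower()
            # setdefault keeps the earliest dataset and ignores later duplicates.
            first_seen.setdefault(key, name)
    return first_seen


if __name__ == "__main__":
    # Toy identifiers for illustration only; not taken from PDBbind.
    toy = {
        "set_a": ["1ABC", "2xyz"],
        "set_b": ["1abc ", "3def"],
    }
    merged = merge_unique_complexes(toy)
    print(len(merged), merged)
```

Recording the dataset of first appearance makes it easy to check afterwards that no benchmark test complex also appears in the training pool.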

Code availability

All source code and models are publicly available via Zenodo at https://doi.org/10.5281/zenodo.10892799 (ref. 66).

References

  1. Fleming, N. How artificial intelligence is changing drug discovery. Nature 557, S55–S57 (2018).

  2. Lyu, J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229 (2019).

  3. Kitchen, D. B., Decornez, H., Furr, J. R. & Bajorath, J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat. Rev. Drug Discov. 3, 935–949 (2004).

  4. Pinzi, L. & Rastelli, G. Molecular docking: shifting paradigms in drug discovery. Int. J. Mol. Sci. 20, 4331 (2019).

  5. Pagadala, N. S., Syed, K. & Tuszynski, J. Software for molecular docking: a review. Biophys. Rev. 9, 91–102 (2017).

  6. Wang, L. et al. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. J. Am. Chem. Soc. 137, 2695–2703 (2015).

  7. Sliwoski, G., Kothiwale, S., Meiler, J. & Lowe, E. W. Computational methods in drug discovery. Pharmacol. Rev. 66, 334–395 (2014).

  8. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  9. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

  10. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  11. Song, Y. & Wang, L. Multiobjective tree-based reinforcement learning for estimating tolerant dynamic treatment regimes. Biometrics 80, ujad017 (2024).

  12. Luo, J., Wei, W., Waldispühl, J. & Moitessier, N. Challenges and current status of computational methods for docking small molecules to nucleic acids. Eur. J. Med. Chem. 168, 414–425 (2019).

  13. Lo, Y.-C., Rensi, S. E., Torng, W. & Altman, R. B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 23, 1538–1546 (2018).

  14. The Atomwise AIMS Program. AI is a viable alternative to high throughput screening: a 318-target study. Sci. Rep. 14, 7526 (2024).

  15. Gómez-Sacristán, P., Simeon, S., Tran-Nguyen, V.-K., Patil, S. & Ballester, P. J. Inactive-enriched machine-learning models exploiting patent data improve structure-based virtual screening for PDL1 dimerizers. J. Adv. Res. (in the press); https://doi.org/10.1016/j.jare.2024.01.024

  16. Hu, X. et al. Discovery of novel non-steroidal selective glucocorticoid receptor modulators by structure- and IGN-based virtual screening, structural optimization, and biological evaluation. Eur. J. Med. Chem. 237, 114382 (2022).

  17. Vaswani, A. et al. Attention is all you need. In NIPS'17: Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates, 2017).

  18. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. B. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (Association for Computational Linguistics, 2019).

  19. Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).

  20. Singh, R., Sledzieski, S., Bryson, B., Cowen, L. & Berger, B. Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proc. Natl Acad. Sci. USA 120, e2220778120 (2023).

  21. Saar, K. L. et al. Turning high-throughput structural biology into predictive inhibitor design. Proc. Natl Acad. Sci. USA 120, e2214168120 (2023).

  22. Cang, Z., Mu, L. & Wei, G.-W. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput. Biol. 14, e1005929 (2018).

  23. Nguyen, D. D., Cang, Z. & Wei, G.-W. A review of mathematical representations of biomolecular data. Phys. Chem. Chem. Phys. 22, 4343–4367 (2020).

  24. Wang, R., Nguyen, D. D. & Wei, G.-W. Persistent spectral graph. Int. J. Numer. Methods Biomed. Eng. 36, e3376 (2020).

  25. Meng, Z. & Xia, K. Persistent spectral–based machine learning (PerSpect ML) for protein-ligand binding affinity prediction. Sci. Adv. 7, eabc5329 (2021).

  26. Chen, D., Liu, J., Wu, J. & Wei, G.-W. Persistent hyperdigraph homology and persistent hyperdigraph Laplacians. Found. Data Sci. 5, 558–588 (2023).

  27. Zomorodian, A. & Carlsson, G. Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005).

  28. Chen, D., Zheng, J., Wei, G.-W. & Pan, F. Extracting predictive representations from hundreds of millions of molecules. J. Phys. Chem. Lett. 12, 10793–10801 (2021).

  29. Ruff, K. M. & Pappu, R. V. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 433, 167208 (2021).

  30. Li, Y., Han, L., Liu, Z. & Wang, R. Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. J. Chem. Inf. Model. 54, 1717–1736 (2014).

  31. Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 49, 1079–1093 (2009).

  32. Su, M. et al. Comparative assessment of scoring functions: the CASF-2016 update. J. Chem. Inf. Model. 59, 895–913 (2018).

  33. Trull, T. J. & Ebner-Priemer, U. W. Using experience sampling methods/ecological momentary assessment (ESM/EMA) in clinical assessment and clinical research: introduction to the special section. Psychol. Assess. 21, 457–462 (2009).

  34. Karlov, D. S., Sosnin, S., Fedorov, M. V. & Popov, P. graphDelta: MPNN scoring function for the affinity prediction of protein–ligand complexes. ACS Omega 5, 5150–5159 (2020).

  35. Sánchez-Cruz, N., Medina-Franco, J., Mestres, J. & Barril, X. Extended connectivity interaction features: improving binding affinity prediction through chemical description. Bioinformatics 37, 1376–1382 (2021).

  36. Wang, Z. et al. OnionNet-2: a convolutional neural network model for predicting protein-ligand binding affinity based on residue-atom contacting shells. Front. Chem. 9, 753002 (2021).

  37. Rezaei, M. A., Li, Y., Wu, D., Li, X. & Li, C. Deep learning in drug design: protein-ligand binding affinity prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 407–417 (2020).

  38. Wang, S. et al. SE-OnionNet: a convolution neural network for protein–ligand binding affinity prediction. Front. Genet. 11, 607824 (2021).

  39. Jones, D. et al. Improved protein–ligand binding affinity prediction with structure-based deep fusion inference. J. Chem. Inf. Model. 61, 1583–1592 (2021).

  40. Boyles, F., Deane, C. M. & Morris, G. M. Learning from the ligand: using ligand-based features to improve binding affinity prediction. Bioinformatics 36, 758–764 (2020).

  41. Liu, Z. et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31, 405–412 (2015).

  42. Wang, C. & Zhang, Y. Improving scoring-docking-screening powers of protein–ligand scoring functions using random forest. J. Comput. Chem. 38, 169–177 (2017).

  43. Gentile, F. et al. Automated discovery of noncovalent inhibitors of SARS-CoV-2 main protease by consensus deep docking of 40 billion small molecules. Chem. Sci. 12, 15960–15974 (2021).

  44. Méndez-Lucio, O., Ahmad, M., del Rio-Chanona, E. A. & Wegner, J. K. A geometric deep learning approach to predict binding conformations of bioactive molecules. Nat. Mach. Intell. 3, 1033–1039 (2021).

  45. Zheng, L. et al. Improving protein–ligand docking and screening accuracies by incorporating a scoring function correction term. Brief. Bioinform. 23, bbac051 (2022).

  46. Bao, J., He, X. & Zhang, J. Z. H. DeepBSP—a machine learning method for accurate prediction of protein–ligand docking structures. J. Chem. Inf. Model. 61, 2231–2240 (2021).

  47. Shen, C. et al. Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer. J. Med. Chem. 65, 10691–10706 (2022).

  48. Nguyen, D. D. & Wei, G.-W. AGL-Score: algebraic graph learning score for protein–ligand binding scoring, ranking, docking, and screening. J. Chem. Inf. Model. 59, 3291–3304 (2019).

  49. Liu, X., Feng, H., Wu, J. & Xia, K. Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction. PLoS Comput. Biol. 18, e1009943 (2022).

  50. Tran-Nguyen, V.-K., Junaid, M., Simeon, S. & Ballester, P. J. A practical guide to machine-learning scoring for structure-based virtual screening. Nat. Protoc. 18, 3460–3511 (2023).

  51. Moon, S., Zhung, W., Yang, S., Lim, J. & Kim, W. Y. PIGNet: a physics-informed deep learning model toward generalized drug–target interaction predictions. Chem. Sci. 13, 3661–3673 (2022).

  52. Tran-Nguyen, V.-K., Bret, G. & Rognan, D. True accuracy of fast scoring functions to predict high-throughput screening data from docking poses: the simpler the better. J. Chem. Inf. Model. 61, 2788–2797 (2021).

  53. Tran-Nguyen, V.-K. & Ballester, P. J. Beware of simple methods for structure-based virtual screening: the critical importance of broader comparisons. J. Chem. Inf. Model. 63, 1401–1405 (2023).

  54. Tran-Nguyen, V.-K., Simeon, S., Junaid, M. & Ballester, P. J. Structure-based virtual screening for PDL1 dimerizers: evaluating generic scoring functions. Curr. Res. Struct. Biol. 4, 206–210 (2022).

  55. Shen, C. et al. A generalized protein–ligand scoring framework with balanced scoring, docking, ranking and screening powers. Chem. Sci. 14, 8129–8146 (2023).

  56. Jones, G., Willett, P., Glen, R. C., Leach, A. R. & Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 267, 727–748 (1997).

  57. Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).

  58. Tran-Nguyen, V.-K., Jacquemard, C. & Rognan, D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J. Chem. Inf. Model. 60, 4263–4273 (2020).

  59. Horak, D. & Jost, J. Spectra of combinatorial Laplace operators on simplicial complexes. Adv. Math. 244, 303–336 (2013).

  60. Eckmann, B. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv. 17, 240–255 (1944).

  61. Chen, J., Zhao, R., Tong, Y. & Wei, G.-W. Evolutionary de Rham-Hodge method. Discrete Continuous Dyn. Syst. Ser. B. 26, 3785–3821 (2021).

  62. Mémoli, F., Wan, Z. & Wang, Y. Persistent Laplacians: properties, algorithms and implications. SIAM J. Math. Data Sci. 4, 858–884 (2022).

  63. Edelsbrunner, H., Letscher, D. & Zomorodian, A. Topological persistence and simplification. Discrete Comput. Geom. 28, 511–533 (2002).

  64. Liu, J., Li, J. & Wu, J. The algebraic stability for persistent Laplacians. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.03902 (2023).

  65. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 15979–15988 (IEEE, 2022).

  66. Chen, D. WeilabMSU/TopoFormer: TopoFormer. Zenodo https://doi.org/10.5281/zenodo.10892799 (2024).

  67. Sunseri, J. & Koes, D. R. Virtual screening with Gnina 1.0. Molecules 26, 7369 (2021).

  68. Yang, C. & Zhang, Y. Delta machine learning to improve scoring-ranking-screening performances of protein–ligand scoring functions. J. Chem. Inf. Model. 62, 2696–2712 (2022).

  69. Wójcikowski, M., Kukiełka, M., Stepniewska-Dziubinska, M. M. & Siedlecki, P. Development of a protein–ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics 35, 1334–1341 (2019).

  70. Stepniewska-Dziubinska, M. M., Zielenkiewicz, P. & Siedlecki, P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics 34, 3666–3674 (2018).

Acknowledgements

This work was supported in part by NIH grant nos. R01GM126189, R01AI164266 and R35GM148196, National Science Foundation grant nos. DMS2052983 and IIS-1900473, Michigan State University Research Foundation, and Bristol-Myers Squibb grant no. 65109. The work of J.L. was performed while visiting Michigan State University.

Author information

Contributions

D.C. designed the project, modified the method, wrote the code, performed computational studies, wrote the first draft and revised the manuscript. J.L. wrote the methods section and revised the manuscript. G.-W.W. conceptualized and supervised the project, acquired funding and revised the manuscript.

Corresponding authors

Correspondence to Jian Liu or Guo-Wei Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Emil Alexov, Pedro Ballester and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Tables 1–7, Evaluation metrics, Hyperparameter selection and optimization, Topological objects, Vietoris–Rips hyperdigraph and alpha hyperdigraph.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, D., Liu, J. & Wei, G.-W. Multiscale topology-enabled structure-to-sequence transformer for protein–ligand interaction predictions. Nat. Mach. Intell. 6, 799–810 (2024). https://doi.org/10.1038/s42256-024-00855-1

  • DOI: https://doi.org/10.1038/s42256-024-00855-1
