Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking

Gentile, Francesco; Yaacoub, Jean Charle; Gleave, James; Fernandez, Michael; Ton, Anh-Tien; Ban, Fuqiang; Stern, Abraham; Cherkasov, Artem

doi:10.1038/s41596-021-00659-2

Protocol
Published: 04 February 2022

Francesco Gentile¹ (ORCID: orcid.org/0000-0001-8299-1976),
Jean Charle Yaacoub¹ (equal contribution),
James Gleave¹ (equal contribution),
Michael Fernandez¹,
Anh-Tien Ton¹ (ORCID: orcid.org/0000-0001-7418-6563),
Fuqiang Ban¹,
Abraham Stern² &
Artem Cherkasov¹

Nature Protocols volume 17, pages 672–697 (2022)

This article has been updated

Abstract

With the recent explosion of chemical libraries beyond a billion molecules, more efficient virtual screening approaches are needed. The Deep Docking (DD) platform enables up to 100-fold acceleration of structure-based virtual screening by docking only a subset of a chemical library, iteratively synchronized with a ligand-based prediction of the remaining docking scores. This method results in hundreds- to thousands-fold virtual hit enrichment (without significant loss of potential drug candidates) and hence enables the screening of billion molecule–sized chemical libraries without using extraordinary computational resources. Herein, we present and discuss the generalized DD protocol that has been proven successful in various computer-aided drug discovery (CADD) campaigns and can be applied in conjunction with any conventional docking program. The protocol encompasses eight consecutive stages: molecular library preparation, receptor preparation, random sampling of a library, ligand preparation, molecular docking, model training, model inference and the residual docking. The standard DD workflow enables iterative application of stages 3–7 with continuous augmentation of the training set, and the number of such iterations can be adjusted by the user. A predefined recall value allows for control of the percentage of top-scoring molecules that are retained by DD and can be adjusted to control the library size reduction. The procedure takes 1–2 weeks (depending on the available resources) and can be completely automated on computing clusters managed by job schedulers. This open-source protocol, at https://github.com/jamesgleave/DD_protocol, can be readily deployed by CADD researchers and can significantly accelerate the effective exploration of ultra-large portions of a chemical space.
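
The role of the predefined recall value can be illustrated with a minimal sketch. It assumes that each molecule in a docked validation set carries a model-predicted probability of being a virtual hit and a true hit label derived from its docking score; the cutoff is then the highest probability that still retains the requested fraction of true hits. The function names `threshold_for_recall` and `retained_fraction` and the toy data are hypothetical and are not taken from the DD code, which may compute the cutoff differently.

```python
import math


def threshold_for_recall(probs, is_hit, recall):
    """Highest probability cutoff that keeps at least `recall` of the true hits.

    probs  : predicted hit probabilities for validation molecules (assumed input)
    is_hit : booleans, True if the docking score marks the molecule as a virtual hit
    recall : target fraction of true hits to retain, in (0, 1]
    """
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    hit_probs = sorted((p for p, h in zip(probs, is_hit) if h), reverse=True)
    if not hit_probs:
        raise ValueError("validation set contains no virtual hits")
    # Small epsilon guards against floating-point overshoot in recall * n.
    n_keep = max(1, math.ceil(recall * len(hit_probs) - 1e-9))
    return hit_probs[n_keep - 1]


def retained_fraction(probs, cutoff):
    """Fraction of molecules predicted at or above the cutoff (not discarded)."""
    return sum(p >= cutoff for p in probs) / len(probs)


# Toy validation set: 10 molecules, 5 of which are true virtual hits.
val_probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
val_hits = [True, True, True, False, True, False, False, True, False, False]

for target in (0.8, 1.0):
    cutoff = threshold_for_recall(val_probs, val_hits, target)
    kept = retained_fraction(val_probs, cutoff)
    print(f"recall={target:.1f} cutoff={cutoff:.2f} retained={kept:.0%}")
```

In this toy case, lowering the recall target from 1.0 to 0.8 raises the cutoff and shrinks the retained portion of the set, which mirrors the trade-off between hit retention and library size reduction described in the abstract.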

**Fig. 1: AI-accelerated DD approach versus regular docking.**

**Fig. 2: Effect of varying training size and number of iterations on the number of remaining molecules (molecules that are classified as virtual hits, hence not discarded) for screening ZINC20 against the dimerization site of androgen receptor (PDB ID: 1R4I³⁹).**

**Fig. 3: Chemical library preparation for DD.**

**Fig. 5: Iterative model improvement during DD iterations (virtual screening of ZINC20 library against the active site of SARS-CoV-2 papain-like protease (PDB ID: 7LBR⁵⁶) using Glide SP).**

Data availability

The prepared version of ZINC20 can be freely obtained from https://files.docking.org/zinc20-ML/. The example iteration and the source data for Figs. 2 and 5 are freely available from the Federated Research Data Repository (https://doi.org/10.20383/102.0489).

Code availability

The DD code is freely available at https://github.com/jamesgleave/DD_protocol.

Change history

27 March 2024
In Step 34, the text “Run the procedure from Steps 16–31” originally read “14–31”. This has now been amended in the HTML and PDF versions of the article.

References

Lyu, J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229 (2019).
Gorgulla, C. et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature 580, 663–668 (2020).
Stein, R. M. et al. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 579, 609–614 (2020).
Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23, 101681 (2020).
Acharya, A. et al. Supercomputer-based ensemble docking drug discovery pipeline with application to Covid-19. J. Chem. Inf. Model. 60, 5832–5852 (2020).
Shoichet, B. K. Virtual screening of chemical libraries. Nature 432, 862–865 (2004).
Cherkasov, A., Ban, F., Li, Y., Fallahi, M. & Hammond, G. L. Progressive docking: a hybrid QSAR/docking approach for accelerating in silico high throughput screening. J. Med. Chem. 49, 7466–7478 (2006).
Svensson, F., Norinder, U. & Bender, A. Improving screening efficiency through iterative screening using docking and conformal prediction. J. Chem. Inf. Model. 57, 439–444 (2017).
Ahmed, L. et al. Efficient iterative virtual screening with Apache Spark and conformal prediction. J. Cheminform. 10, 8 (2018).
Gentile, F. et al. Deep Docking: a deep learning platform for augmentation of structure based drug discovery. ACS Cent. Sci. 6, 939–949 (2020).
Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).
McGann, M. FRED pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 51, 578–596 (2011).
Friesner, R. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739–1749 (2004).
Ton, A.-T., Gentile, F., Hsing, M., Ban, F. & Cherkasov, A. Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds. Mol. Inf. 39, e2000028 (2020).
Muratov, E. N. et al. A critical overview of computational approaches employed for COVID-19 drug discovery. Chem. Soc. Rev. 50, 9121–9151 (2021).
Gentile, F., Ton, A.-T., Mslati, H., Ban, F. & Cherkasov, A. Discovery of SARS-CoV-2 main protease inhibitors through Deep Docking of 1.36 billion compounds. in 26th Congress of the European Society of Biomechanics (European Society of Biomechanics, 2021).
Rossetti, G. G. et al. Identification of low micromolar SARS-CoV-2 Mpro inhibitors from hits identified by in silico screens. Preprint at bioRxiv https://doi.org/10.1101/2020.12.03.409441 (2020).
Jastrzębski, S. et al. Emulating docking results using a deep neural network: a new perspective for virtual screening. J. Chem. Inf. Model. 60, 4246–4262 (2020).
Al Saadi, A. et al. IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads. in ACM International Conference Proceeding Series (Association for Computing Machinery, 2021); https://doi.org/10.1145/3472456.3473524
Berenger, F., Kumar, A., Zhang, K. Y. J. & Yamanishi, Y. Lean-docking: exploiting ligands’ predicted docking scores to accelerate molecular docking. J. Chem. Inf. Model. 61, 2341–2352 (2021).
Graff, D. E., Shakhnovich, E. I. & Coley, C. W. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem. Sci. 12, 7866–7881 (2021).
Yang, Y. et al. Efficient exploration of chemical space with docking and deep-learning. Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/60c755bf842e65adc6db4393 (2021).
Sessions, Z. et al. Recent progress on cheminformatics approaches to epigenetic drug discovery. Drug Discov. Today 25, 2268–2276 (2020).
Coley, C. W. Defining and exploring chemical spaces. Trends Chem. 3, 133–145 (2021).
Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
Enamine. REAL Database https://enamine.net/library-synthesis/real-compounds/real-database# (2021).
Enamine. REAL Space https://enamine.net/compound-collections/real-compounds/real-space-navigator (2021).
Hawkins, P. C. D., Skillman, A. G., Warren, G. L., Ellingson, B. A. & Stahl, M. T. Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 50, 572–584 (2010).
The RDKit Documentation—The RDKit 2020.03.1 Documentation. https://www.rdkit.org/docs/ (2020).
QUACPAC 2.0.2.2. (OpenEye Scientific Software, 2019).
O’Boyle, N. M. et al. Open Babel: an open chemical toolbox. J. Cheminform. 3, 33 (2011).
Kochev, N. T., Paskaleva, V. H. & Jeliazkova, N. Ambit-Tautomer: an open source tool for tautomer generation. Mol. Inf. 32, 481–504 (2013).
Morgan, H. L. The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107–113 (1965).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Extended Connectivity Fingerprint ECFP https://docs.chemaxon.com/display/docs/extended-connectivity-fingerprint-ecfp.md (ChemAxon, 2021).
Maestro v9.3. (Schrödinger, 2019).
Molecular Operating Environment 2019 (Chemical Computing Group, 2019).
Moustakas, D. T. et al. Development and validation of a modular, extensible docking program: DOCK 5. J. Comput. Aided Mol. Des. 20, 601–619 (2006).
Shaffer, P. L., Jivan, A., Dollins, D. E., Claessens, F. & Gewirth, D. T. Structural basis of androgen receptor binding to selective androgen response elements. Proc. Natl Acad. Sci. USA 101, 4758–4763 (2004).
Santos-Martins, D. et al. Accelerating AutoDock4 with GPUs and gradient-based local search. J. Chem. Theory Comput. 17, 1060–1073 (2021).
Alhossary, A., Handoko, S. D., Mu, Y. & Kwoh, C.-K. Fast, accurate, and reliable molecular docking with QuickVina 2. Bioinformatics 31, 2214–2216 (2015).
Abagyan, R., Totrov, M. & Kuznetsov, D. ICM—a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation. J. Comput. Chem. 15, 488–506 (1994).
Neves, M. A. C., Totrov, M. & Abagyan, R. Docking and scoring with ICM: the benchmarking results and strategies for improvement. J. Comput. Aided Mol. Des. 26, 675–686 (2012).
Giga Docking Guide—Orion Programming Guide. 1.0 documentation https://docs.eyesopen.com/orion-developer/2020-2-1/modules/large-scale-floes/docs/source/giga_docking_guide.html (OpenEye Software, 2020).
LeGrand, S. et al. GPU-accelerated drug discovery with docking on the Summit supercomputer: porting, optimization, and application to COVID-19 research. Preprint at https://arxiv.org/abs/2007.03678 (2020).
Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).
Jorgensen, W. L. The many roles of computation in drug discovery. Science 303, 1813–1818 (2004).
OEDOCKING v3.3.0.3 (OpenEye Scientific Software, 2021).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016 265–283 (The USENIX Association, 2016).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2012).
Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Morris, G. M. et al. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 19, 1639–1662 (1998).
Melo, F. Area under the ROC curve. in Encyclopedia of Systems Biology (eds. Dubitzky, W. et al.) 38–39 (Springer, 2013).
Hur, E. et al. Recognition and accommodation at the androgen receptor coactivator binding interface. PLoS Biol. 2, E274 (2004).
Melo, F. Receiver operating characteristic (ROC) curve. in Encyclopedia of Systems Biology (eds. Dubitzky, W. et al.) 1818–1823 (Springer, 2013).
Shen, Z. et al. Design of SARS-CoV-2 PLpro inhibitors for COVID-19 antiviral therapy leveraging binding cooperativity. J. Med. Chem. https://doi.org/10.1021/acs.jmedchem.1c01307 (2021).

Acknowledgements

F.G. is supported by fellowships from the Canadian Institutes for Health Research (MFE-171324), the Michael Smith Foundation for Health Research/VCHRI & VGH UBC Hospital Foundation (RT-2020-0408) and the Ermenegildo Zegna Foundation. F.B. is supported by a UBC Data Science Institute fellowship. We thank J. Irwin for his support in sharing the DD-prepared version of the ZINC20 library.

Author information

These authors contributed equally: Jean Charle Yaacoub, James Gleave.

Authors and Affiliations

Vancouver Prostate Centre, Department of Urologic Sciences, The University of British Columbia, Vancouver, BC, Canada
Francesco Gentile, Jean Charle Yaacoub, James Gleave, Michael Fernandez, Anh-Tien Ton, Fuqiang Ban & Artem Cherkasov
NVIDIA Corporation, Santa Clara, CA, USA
Abraham Stern

Contributions

F.G. and A.C. conceived the work. F.G. wrote the manuscript with the help of M.F., J.C.Y., J.G., A.-T.T. and F.B. F.G. developed the protocol, with the help of J.C.Y., J.G. and A.S. J.C.Y., J.G. and F.G. wrote the current version of the code. A.-T.T. and M.F. provided support with critical evaluation and tested user-friendliness of the protocol. A.S. contributed to discussing and revising the protocol. A.C. supervised experiments and edited the manuscript.

Corresponding author

Correspondence to Artem Cherkasov.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Protocols thanks John Karanicolas and Ying Yang for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1 and 2, Supplementary Figs. 1 and 2 and Supplementary References.

Reporting Summary

Supplementary Table 3

evaluation.csv file obtained from evaluating different training sizes in one DD iteration, screening the ZINC20 library against the AR dimerization site (PDB ID: 1R4I; ref. 40) using Glide SP for docking and a recall of 0.90. Validation and test sets comprised 700,000 molecules each.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gentile, F., Yaacoub, J.C., Gleave, J. et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nat Protoc 17, 672–697 (2022). https://doi.org/10.1038/s41596-021-00659-2

Received: 11 June 2021
Accepted: 08 November 2021
Published: 04 February 2022
Version of record: 04 February 2022
Issue date: March 2022
DOI: https://doi.org/10.1038/s41596-021-00659-2
