Abstract
To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery. However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. Here we used an information extraction pipeline that won first place in the LitCoin Natural Language Processing Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and substantially exceeds the content of manually curated public databases. To enhance the KG’s comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG enables rigorous performance evaluation of automated knowledge discovery, which was infeasible in previous studies. We designed an interpretable, probability-based inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified around 1,200 candidate drugs in the first 4 months; one-third of the drugs identified in the first 2 months were later supported by clinical trials or PubMed publications. Such outcomes are very difficult to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
Data availability
The datasets used in this study are available on the GitHub repository at https://github.com/myinsilicom/iKraph (ref. 97). Due to size limitations, additional large datasets can be accessed via Zenodo at https://doi.org/10.5281/ZENODO.14846820 (ref. 98). We used the BioRED dataset to train our NER and relation extraction models, and the BioRED dataset can be accessed through https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/. The complete KG is hosted on the cloud-based platform: https://www.biokde.com. The downloadable version of the complete iKraph can be accessed via Zenodo at https://doi.org/10.5281/ZENODO.14846820 (ref. 98). Source data are provided with this paper.
Code availability
The code and datasets generated during this study can be found via the GitHub repository at https://github.com/myinsilicom/iKraph (ref. 97).
References
Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. npj Syst. Biol. Appl. 7, 29 (2021).
Li, L. et al. Real-world data medical knowledge graph: construction and applications. Artif. Intell. Med. 103, 101817 (2020).
Yu, S. et al. BIOS: an algorithmically generated biomedical knowledge graph. Preprint at https://arxiv.org/abs/2203.09975 (2022).
Nicholson, D. N. & Greene, C. S. Constructing knowledge graphs and their biomedical applications. Comput. Struct. Biotechnol. J. 18, 1414–1428 (2020).
Gao, Z., Ding, P. & Xu, R. KG-Predict: a knowledge graph computational framework for drug repurposing. J. Biomed. Inform. 132, 104133 (2022).
Li, N. et al. KGHC: a knowledge graph for hepatocellular carcinoma. BMC Med. Inf. Decis. Making 20, 135 (2020).
Ernst, P., Siu, A. & Weikum, G. KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinf. 16, 157 (2015).
Zheng, S. et al. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Briefings Bioinform. 22, bbaa344 (2021).
Petasis, G. et al. Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proc. 39th Annual Meeting on Association for Computational Linguistics: ACL ’01 426–433 (Association for Computational Linguistics, 2001).
Kim, J.-H. & Woodland, P.C. A rule-based named entity recognition system for speech input. In Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000) (eds Yuan, B. et al.) 528–531 (International Speech Communication Association, 2000); https://doi.org/10.21437/ICSLP.2000-131
Miyao, Y., Sagae, K., Sætre, R., Matsuzaki, T. & Tsujii, J. Evaluating contributions of natural language parsers to protein–protein interaction extraction. Bioinformatics 25, 394–400 (2009).
Lee, J., Kim, S., Lee, S., Lee, K. & Kang, J. On the efficacy of per-relation basis performance evaluation for PPI extraction and a high-precision rule-based approach. BMC Med. Inf. Decis. Making 13, S7 (2013).
Raja, K., Subramani, S. & Natarajan, J. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature. Database 2013, bas052 (2013).
Kim, J.-H., Kang, I.-H. & Choi, K.-S. Unsupervised named entity classification models and their ensembles. In Proc. 19th International Conference on Computational Linguistics (COLING 2002) (eds Tseng, S.-C. et al.) 1–7 (Association for Computational Linguistics, 2002); https://doi.org/10.3115/1072228.1072316
Li, L., Zhou, R. & Huang, D. Two-phase biomedical named entity recognition using CRFs. Comput. Biol. Chem. 33, 334–338 (2009).
Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. & Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. 6, e1000837 (2010).
Bui, Q.-C., Katrenko, S. & Sloot, P. M. A. A hybrid approach to extract protein–protein interactions. Bioinformatics 27, 259–265 (2011).
Patra, R. & Saha, S. K. A kernel-based approach for biomedical named entity recognition. Sci. World J. 2013, 950796 (2013).
Hong, L. et al. A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories. Nat. Mach. Intell. 2, 347–355 (2020).
Zhang, H.-T., Huang, M.-L. & Zhu, X.-Y. A unified active learning framework for biomedical relation extraction. J. Comput. Sci. Technol. 27, 1302–1313 (2012).
Yu, K. et al. Automatic extraction of protein-protein interactions using grammatical relationship graph. BMC Med. Inf. Decis. Making 18, 42 (2018).
Chowdhary, R., Zhang, J. & Liu, J. S. Bayesian inference of protein–protein interactions from biological literature. Bioinformatics 25, 1536–1542 (2009).
Corbett, P. & Copestake, A. Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinf. 9, S4 (2008).
Lung, P.-Y., He, Z., Zhao, T., Yu, D. & Zhang, J. Extracting chemical–protein interactions from literature using sentence structure analysis and feature engineering. Database 2019, bay138 (2019).
Bell, L., Chowdhary, R., Liu, J. S., Niu, X. & Zhang, J. Integrated bio-entity network: a system for biological knowledge discovery. PLoS ONE 6, e21474 (2011).
Kim, S., Yoon, J. & Yang, J. Kernel approaches for genic interaction extraction. Bioinformatics 24, 118–126 (2008).
Bell, L., Zhang, J., & Niu, X. Mixture of logistic models and an ensemble approach for protein-protein interaction extraction. In Proc. 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine (eds Grossman, R. et al.) 371–375 (Association for Computing Machinery, 2011); https://doi.org/10.1145/2147805.2147853
Florian, R., Ittycheriah, A., Jing, H. & Zhang, T. Named entity recognition through classifier combination. In Proc. 7th Conf. Natural Language Learning at HLT-NAACL 2003 (CoNLL ’03) (eds Daelemans, W. et al.) 168–171 (Association for Computational Linguistics, 2003).
Leaman, R., Wei, C.-H. & Lu, Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform. 7, S3 (2015).
Qu, J. et al. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 21, 773 (2020).
Nguyen, T. H. & Grishman, R. Relation extraction: perspective from convolutional neural networks. In Proc. 1st Workshop on Vector Space Modeling for Natural Language Processing (eds Blunsom, P. et al.) 39–48 (Association for Computational Linguistics, 2015).
He, D., Zhang, H., Hao, W., Zhang, R. & Cheng, K. A customized attention-based long short-term memory network for distant supervised relation extraction. Neural Comput. 29, 1964–1985 (2017).
Li, F., Zhang, M., Fu, G. & Ji, D. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinf. 18, 198 (2017).
Crichton, G., Pyysalo, S., Chiu, B. & Korhonen, A. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf. 18, 368 (2017).
Luo, L. et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34, 1381–1388 (2018).
Guo, Z., Zhang, Y. & Lu, W. Attention guided graph convolutional networks for relation extraction. In Proc. 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A. et al.) 241–251 (Association for Computational Linguistics, 2019).
Gridach, M. Character-level neural network for biomedical named entity recognition. J. Biomed. Inform. 70, 85–91 (2017).
Lim, S. & Kang, J. Chemical–gene relation extraction using recursive neural network. Database 2018, bay060 (2018).
Gu, J., Sun, F., Qian, L. & Zhou, G. Chemical-induced disease relation extraction via convolutional neural network. Database 2017, bax024 (2017).
Habibi, M., Weber, L., Neves, M., Wiegandt, D. L. & Leser, U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33, i37–i48 (2017).
Liu, S. et al. Extracting chemical–protein relations using attention-based neural networks. Database 2018, bay102 (2018).
Wu, H. & Huang, J. Joint entity and relation extraction network with enhanced explicit and implicit semantic information. Appl. Sci. 12, 6231 (2022).
Akbik, A., Bergmann, T. & Vollgraf, R. Pooled contextualized embeddings for named entity recognition. In Proc. 2019 Conference of the North (eds Burstein, J. et al.) 724–728 (Association for Computational Linguistics, 2019).
Eberts, M. & Ulges, A. Span-based Joint Entity and Relation Extraction with Transformer Pre-Training (IOS, 2019).
Zhuang, L., Lin, W., Ya, S. & Zhao, J. A robustly optimized BERT pre-training approach with post-training. In Proc. 20th Chinese Natl. Conf. Computational Linguistics (eds Li, S. et al.) 1218–1227 (Chinese Information Processing Society of China, 2021); https://aclanthology.org/2021.ccl-1.108/
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT 2019 4171–4186 (Association for Computational Linguistics, 2019).
Nguyen, D. Q., Vu, T. & Nguyen, A. T. BERTweet: a pre-trained language model for English Tweets. In Proc. 2020 Conf. Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 9–14 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.emnlp-demos.2
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2019).
Liang, C. et al. BOND: BERT-assisted open-domain named entity recognition with distant supervision. In Proc. 26th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining (KDD ’20) (eds Gupta, R. et al.) 1054–1064 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403149
Wadden, D., Wennberg, U., Luan, Y. & Hajishirzi, H. Entity, relation, and event extraction with contextualized span representations. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 5784–5789 (Association for Computational Linguistics, 2019).
Zhang, Z. et al. ERNIE: enhanced language representation with informative entities. In Proc. 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A. et al.) 1441–1451 (Association for Computational Linguistics, 2019).
Chang, H., Xu, H., van Genabith, J., Xiong, D. & Zan, H. JoinER-BART: joint entity and relation extraction with constrained decoding, representation reuse and fusion. IEEE/ACM Trans. Audio Speech Lang. Proc. https://doi.org/10.1109/TASLP.2023.3310879 (2023).
Yamada, I., Asai, A., Shindo, H., Takeda, H. & Matsumoto, Y. LUKE: deep contextualized entity representations with entity-aware self-attention. In Proc. the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 6442–6454 (Association for Computational Linguistics, 2020).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 3613–3618 (Association for Computational Linguistics, 2019).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. OpenAI https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
Brown, T. B. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) Vol. 33, 1877–1901 (Curran Associates Inc., 2020).
Wei, X. et al. Zero-shot information extraction via chatting with ChatGPT. Preprint at https://arxiv.org/abs/2302.10205 (2023).
Pan, J. Z. et al. Large language models and knowledge graphs: opportunities and challenges. Trans. Graph Data Knowl. 1, 2:1–2:38 (2023).
Zhu, Y. et al. LLMs for knowledge graph construction and reasoning: recent capabilities and future opportunities. World Wide Web 27, 58 (2023).
Kandpal, N., Deng, H., Roberts, A., Wallace, E. & Raffel, C. Large language models struggle to learn long-tail knowledge. In Proc. 40th Int. Conf. Machine Learning (ICML 2023) (eds Krause, A. et al.) Vol. 202, 15708–15719 (PMLR, 2023); https://proceedings.mlr.press/v202/kandpal23a.html
Li, T., Hosseini, M. J., Weber, S. & Steedman, M. Language models are poor learners of directional inference. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 903–921 (Association for Computational Linguistics, 2022).
Elazar, Y. et al. Measuring and improving consistency in pretrained language models. Trans. Assoc. Comput. Ling. 9, 1012–1031 (2021).
Heinzerling, B. & Inui, K. Language models as knowledge bases: on entity representations, storage capacity, and paraphrased queries. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (eds Merlo, P. et al.) 1772–1791 (Association for Computational Linguistics, 2021).
Zheng, Q., Guo, K. & Xu, L. A large-scale Chinese patent dataset for information extraction. Syst. Sci. Control Eng. 12, 2365328 (2024).
Stoica, G., Platanios, E. A. & Poczos, B. Re-TACRED: addressing shortcomings of the TACRED dataset. In Proc. AAAI Conf. Artif. Intell. Vol. 35, 13843–13850 (2021); https://doi.org/10.1609/aaai.v35i15.17631
Luan, Y., He, L., Ostendorf, M. & Hajishirzi, H. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E. et al.) 3219–3232 (Association for Computational Linguistics, 2018).
Wouters, O. J., McKee, M. & Luyten, J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323, 844 (2020).
Lovering, F., Bikker, J. & Humblet, C. Escape from flatland: increasing saturation as an approach to improving clinical success. J. Med. Chem. 52, 6752–6756 (2009).
Cui, L. et al. DETERRENT: knowledge guided graph attention network for detecting healthcare misinformation. In Proc. 26th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining (KDD ’20) (eds Gupta, R. et al.) 492–502 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403092
Mohamed, S. K., Nounu, A. & Nováček, V. Biological applications of knowledge graph embedding models. Briefings Bioinform. 22, 1679–1693 (2021).
Wang, C., Yu, H. & Wan, F. Information retrieval technology based on knowledge graph. In Proc. 3rd Int. Conf. Advances in Materials, Mechatronics and Civil Engineering (ICAMMCE 2018) 291–296 (Atlantis Press, 2018); https://doi.org/10.2991/icammce-18.2018.65
Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6, e26726 (2017).
Azuaje, F. Drug interaction networks: an introduction to translational and clinical applications. Cardiovasc. Res. 97, 631–641 (2013).
Ye, H., Liu, Q. & Wei, J. Construction of drug network based on side effects and its application for drug repositioning. PLoS ONE 9, e87864 (2014).
Chen, H., Zhang, H., Zhang, Z., Cao, Y. & Tang, W. Network-based inference methods for drug repositioning. Comput. Math. Methods Med. 2015, 130620 (2015).
Luo, Y. et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 8, 573 (2017).
Islamaj, R., Lai, P.-T., Wei, C.-H., Luo, L. & Lu, Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Zenodo https://doi.org/10.5281/ZENODO.10351131 (2023).
Luo, L., Lai, P.-T., Wei, C.-H., Arighi, C. N. & Lu, Z. BioRED: a rich biomedical relation extraction dataset. Briefings Bioinform. 23, bbac282 (2022).
Ahmed, F. et al. SperoPredictor: an integrated machine learning and molecular docking-based drug repurposing framework with use case of COVID-19. Front. Public Health 10, 902123 (2022).
Ahmed, F. et al. A comprehensive review of artificial intelligence and network based approaches to drug repurposing in Covid-19. Biomed. Pharmacother. 153, 113350 (2022).
Zhou, Y. et al. Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2. Cell Disc. 6, 14 (2020).
Aghdam, R., Habibi, M. & Taheri, G. Using informative features in machine learning based method for COVID-19 drug repurposing. J. Cheminform. 13, 70 (2021).
Belikov, A. V., Rzhetsky, A. & Evans, J. Prediction of robust scientific facts from literature. Nat. Mach. Intell. 4, 445–454 (2022).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare 3, 1–23 (2022).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conf. Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. Natural Language Processing (EMNLP-IJCNLP) 3982–3992 (Association for Computational Linguistics, 2019); https://doi.org/10.18653/v1/D19-1410
Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at http://arxiv.org/abs/1907.11692 (2019).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. 18th BioNLP Workshop and Shared Task (eds Demner-Fushman, D. et al.) 58–65 (Association for Computational Linguistics, 2019).
Alsentzer, E. et al. Publicly available clinical BERT embeddings. In Proc. 2nd Clinical Natural Language Processing Workshop (eds Rumshisky, A. et al.) 72–78 (Association for Computational Linguistics, 2019).
Sohn, S., Comeau, D. C., Kim, W. & Wilbur, W. J. Abbreviation definition identification based on automatic precision estimates. BMC Bioinf. 9, 402 (2008).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 10, 67 (2023).
Zhou, Y. et al. TTD: Therapeutic Target Database describing target druggability information. Nucleic Acids Res. 52, D1465–D1477 (2023).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Wilks, C. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 323 (2021).
Zhang, Y. et al. myinsilicom/iKraph: 1.0.0. Zenodo https://doi.org/10.5281/ZENODO.14577964 (2024).
Zhang, Y. et al. iKraph: a comprehensive, large-scale biomedical knowledge graph for AI-powered, data-driven biomedical research. Zenodo https://doi.org/10.5281/ZENODO.14846820 (2025).
Acknowledgements
We thank the LitCoin NLP Challenge and BioCreative Challenge VIII BioRED track organizers for generating the valuable challenge dataset, which made this work possible. This research was partially supported by the NIH under grant no. R21LM014277 (J. Zhang), contract 75N91024C00007 (J. Zhang) and contract 75N93024C00034 (J. Zhang); by the National Science Foundation under grant nos. 2335357 (J. Zhang) and 2403911 (J. Zhang); and by the National Cancer Institute, NIH, under Prime Contract No. 75N91019D00024, Task Order No. 75N91024F00030 (J. Zhang). The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government. The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the paper.
Author information
Authors and Affiliations
Contributions
Y.Z., X.S., F.P., K.L., S.T., A.E., Q.H., W.W., Jianan Wang and Jian Wang collected data and developed models and pipelines. Y.Z., F.P. and J. Zhang analysed the data and developed methods. D.S., H.C., J. Zhou, E.Z., B.L., T.Z. and J. Zhang developed the iExplore platform interface. K.Y. and J. Zhang conceptualized and designed the study. Y.Z., F.P., K.Y. and J. Zhang wrote the paper. X.Q., T.Z. and P.Z. provided consultation and paper revision. J. Zhang supervised the study and is the corresponding author.
Corresponding author
Ethics declarations
Competing interests
J. Zhang and T.Z. are owners of Insilicom LLC. The other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks James Evans and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Methods sections 1–5, Discussion section 6, Figs. 1–3, Tables 1–4 and Box 1.
Source data
Source Data Fig. 1
Data used to generate the Venn plot, pie plot and line plot.
Source Data Fig. 2
Data used to generate the plot on repurposed drugs for COVID-19.
Source Data Fig. 3
Data used to generate the plot on repurposed drugs for cystic fibrosis.
Source Data Fig. 4
Data used to generate the plot for the drug repurposing study on the top ten common drugs and ten diseases.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Sui, X., Pan, F. et al. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nat Mach Intell 7, 602–614 (2025). https://doi.org/10.1038/s42256-025-01014-w