Drug repositioning model based on knowledge graph embedding

He, Shufang; Zhao, Xiaoyu

doi:10.1038/s41598-025-95372-5

Article
Open access
Published: 25 March 2025

Shufang He¹ & Xiaoyu Zhao¹

Scientific Reports volume 15, Article number: 10298 (2025)

Abstract

Drug repositioning utilizes existing drugs for new therapeutic applications, driven by the rapid increase in disease and drug-related data. However, organizing knowledge in this field and integrating the complex and scattered data from multiple systems into a cohesive knowledge network have become urgent problems to address. In this paper, we propose a drug repositioning model based on knowledge graph embedding. The model employs multivariate relational data to embed entities and relationships in a low-dimensional vector space. It also introduces the attention mechanism into translation and bilinear models, forming the new models Attranse, Attdismult, and Attrescal. The model's feature extraction does not rely on a single approach; instead, it integrates multiple models and combines their screening results to enhance drug screening quality. The model's effectiveness was validated using COVID-19 data, yielding results consistent with 7 clinically approved drugs for COVID-19 treatment, indicating high accuracy in identifying new drug indications. The successful application of this model to COVID-19 suggests its potential for broader use in emerging infectious diseases and complex conditions, providing valuable insights for future drug development.

Introduction

Background

Research on complex diseases, such as cancer, diabetes, and cardiovascular diseases, has always been a significant and challenging issue due to the interplay of genetic and environmental factors¹. Although genomics and the life sciences have developed rapidly over the past few decades, new drug research remains time-consuming, costly, and characterized by a low success rate. Consequently, investigations into new drugs have begun to stagnate. According to conservative estimates, developing a new drug typically takes 10–15 years, with research and development costs nearly doubling to $2 billion, while the return on investment in research and development decreased from 10% in 2010 to 2% in 2019²,³. Phase I to Phase III clinical trials are essential in drug development, typically lasting 3–7 years. Phase I trials usually last about 1 year, Phase II trials typically last 2 years, and Phase III trials may last 4 to 5 years. Due to the unpredictable side effects of drugs with novel structures, approximately 90% of experimental drugs fail Phase I clinical trials⁴, and 50% fail to reach the market in Phase III due to poor efficacy⁵,⁶. Some drugs that pass Phase III may still be withdrawn during Phase IV market surveillance. Therefore, the development and discovery of new drugs face significant challenges. In this context, drug repositioning becomes particularly important and urgent. Compared to innovator drugs, drug repositioning offers significant advantages in research time, funding, and success rates, as illustrated in Fig. 1. As a drug development strategy, drug repositioning presents clear cost-benefit advantages, significantly reducing the costs and risks associated with new drug development while shortening the time frame between discovery and clinical availability⁷.

Because developing new drugs for intractable diseases is complex and lengthy, the application of Artificial Intelligence (AI) technology has become increasingly important⁸,⁹,¹⁰,¹¹,¹²,¹³,¹⁴. The application of AI in drug discovery relies on computer-aided drug design, which combines extensive chemical and biological data to establish high-quality machine learning models. This approach guides the discovery and optimization of compounds during target screening, molecular structure and chemical space analysis, ligand-receptor interaction simulation, and three-dimensional quantitative structure-activity relationship analysis of drugs¹⁵,¹⁶. In drug repositioning research, graph neural network-based methods effectively handle complex graph-structured data, such as drug-target interaction networks and biomolecule networks, capturing potential relationships between drugs and targets while demonstrating strong generalization ability. Knowledge graph-based methods using multi-source data fusion can integrate biological information from various sources, enhance data comprehensiveness, and improve the accuracy of drug repositioning. Machine learning-based methods can quickly process and analyze large-scale datasets. However, several problems arise during the research process:

Insufficient semantic features in entity and relationship representation: The semantic features of entity and relationship representations in datasets are insufficient. Traditional word embedding technology is designed for general natural language, making it less practical for knowledge graph embedding, while traditional translation models embed only the knowledge graph structure and exhibit weak semantic features. Integrating text semantic information with deep learning methods, such as convolutional neural networks, also faces challenges related to difficult data acquisition and high complexity.
Currently, the efficiency of knowledge fusion technology is low: Knowledge graphs involve massive amounts of data and complex technologies. As a key component, knowledge fusion is still not mature enough in both theoretical research and practical application. Tasks such as knowledge embedding and entity alignment do not meet the efficiency and accuracy requirements for large-scale data processing in today’s industry. Given the sparse nature of knowledge graphs and the semantic complexity of natural language, improving the efficiency of knowledge fusion remains a priority in big data intelligence.
Quality verification of screened drug candidates: Even superior models must verify and evaluate results, which serves as a standard for measuring model quality. Determining effective methods to verify this quality has become a challenging issue.
Large drug sample sets: The dataset of old drugs is extensive, with many similarities among different drugs. Drug screening results may recommend too many drugs, limiting the range of older drugs available for repositioning.

This study presents several innovations and research initiatives to address the difficulties mentioned above:

Introduction of attention mechanisms in translation and bilinear models: The translation model TransE is concise and effective; however, it has limitations in handling complex relations. It is limited to the knowledge graph structure, and its semantic features for entity and relation representation are weak. The bilinear model computes the latent semantic plausibility of entities and relations in vector space. The proposed approach employs a global attention mechanism to assign attribute node features to entity representations and a self-attention mechanism to process label word order and relational representations, integrating text features. This addresses the shortcoming of insufficient semantic features in entity and relation representations and enhances the quality of knowledge embeddings.
Integrating multiple models: In this research, several models with attention mechanisms are integrated, and their screening results are combined to improve the quality of drug screening.
Quality verification of screened candidate drugs: Scores are predicted and ranked for all possible combinations of (drug, treatment, virus) triplets. The top-ranked drugs are then compared with those currently undergoing clinical trials to identify several high-scoring candidate drugs.
Feature extraction that does not depend on a single model: Accuracy is improved by not relying solely on one model for feature extraction. Further screening of the integrated results then selects the candidates supported by the different models.

In this paper, a method based on knowledge graph embedding is proposed to rapidly identify the potential efficacy of known drugs. It uses COVID-19 as an example to verify the effectiveness of the method and minimize the conversion gap between preclinical test results and clinical results. Since the knowledge graph has been successfully established, it can be easily expanded in the future to cover various categories, symptoms, disease genetic characteristics, and more. This will enable broader drug repositioning screening for both infectious and non-infectious diseases¹⁷.

Related work

Attention mechanism

The application of the attention mechanism is very extensive. In medicine, the attention mechanism is frequently used for analyzing and processing medical images. Feng Y. et al. proposed a novel medical image segmentation network, PAMSNet, which enhances image details using efficient pyramid attention and channel-space attention modules¹⁸. Di J. et al. proposed an image fusion method utilizing an improved attention mechanism and a decomposition network, achieving better texture preservation and sharper edge contours¹⁹. The attention mechanism also plays a significant role in drug repositioning. Zu J. et al. proposed a drug repositioning method that combines word vector representation with the attention mechanism, enhancing the classification accuracy of drug-target protein interaction predictions²⁰. Tang X. F. et al. proposed a drug repositioning method based on a bilinear attention network, introducing a layer attention mechanism to combine embeddings from different graph convolutional layers, resulting in more expressive representations of drugs and diseases²¹.

Knowledge graph

In 2012, Google proposed the Google Knowledge Graph, from which the term “knowledge graph” is derived, and improved search engine performance through this technology. A knowledge graph is a graph-based data structure composed of points and edges. Each point represents an “entity,” and each edge represents a “relationship” between entities. In essence, a knowledge graph is a relational network that connects various types of information, facilitating better queries of complex associated information. Understanding user intent at a semantic level allows for problem analysis from a relational perspective. Knowledge Graph Embedding is a technique that converts entities and relationships in high-dimensional, sparse knowledge graph data into low-dimensional, dense vector representations, enabling the computation and inference of semantic relationships in a vector space. Graph embedding has shown significant potential in elucidating molecular mechanisms and predicting the biological activities of repurposed drugs for various diseases. Graph embedding can simulate and analyze interactions between drug molecules and biological targets, identify key drug-target interaction patterns, and enhance understanding of how these interactions affect disease occurrence, development, and treatment. It can accurately predict the biological activities of existing drugs for other diseases. These predictions not only accelerate drug discovery but also provide new insights and directions for clinical treatment, especially in the face of urgent public health challenges or diseases that are under-researched.

With the development of AI technology, knowledge graphs have significant applications in the medical fields of diagnosis and treatment, drug research and development, and knowledge management. Guo Z. Q. et al. developed an intelligent question-and-answer platform for proprietary Chinese medicines using technologies such as knowledge graphs, natural language processing for multi-label text classification, named entity recognition, and speech recognition. This platform quickly and accurately queries relevant information based on users’ questions and presents a relevant knowledge graph to assist users in understanding proprietary Chinese medicines²². Wu D. et al. developed an automatic question-and-answer system for cardiovascular diseases based on a cardiovascular disease knowledge graph to effectively answer users’ questions regarding the diagnosis of symptoms and drug recommendations²³. Remy C. et al. built a knowledge graph based on patient symptoms to help doctors quickly identify potential rare diseases²⁴. Weng H. et al. designed a framework for constructing and applying a Traditional Chinese Medicine (TCM) knowledge base based on representation learning, which is effective in knowledge discovery and aiding decision-making in diagnosis and treatment²⁵. Fu Z. X. et al. proposed a multi-link predictive reasoning algorithm based on rules and a Markov Logic Network (MLN) with a research background in TCM Visceral Syndrome Differentiation, which provides auxiliary diagnosis for clinical practice in TCM²⁶. Li J. et al. used knowledge graph technology to connect scattered information related to the plague, creating a comprehensive knowledge graph focused on this disease. This graph facilitates in-depth exploration of the complex pathogenesis and potential treatment methods for the plague. They also identified the value of Coptis, Rhubarb, and other varieties of traditional Chinese medicine, as well as moxibustion therapy, for the treatment and prevention of plague²⁷. 
Lu Y. W. et al. extracted biomedical knowledge, constructed a gastric cancer knowledge graph, and used knowledge embedding vectors to predict that nine drugs could treat gastric cancer, thereby verifying the medical application value of the knowledge graph²⁸. Lyu Y. H. et al. used the tools SemRep and Metamap, based on the Unified Medical Language System (UMLS), to obtain autism drug entity triplets and construct an autism drug entity knowledge graph. Based on the knowledge graph, 27 potential drugs for autism were screened using three semantic paths, providing a theoretical and methodological basis for drug repositioning²⁹. Fan M. et al. proposed a storage structure and visualization technology based on knowledge graphs to visualize the relevant attributes of Tibetan medicine. This approach aims to enhance the integration of Tibetan medicine knowledge and improve data correlation, ultimately better exploring the potential value of Tibetan medicine data³⁰. Ouzounis S. et al. proposed a data-driven approach to facilitate the reuse of diabetes drugs by integrating heterogeneous biomedical data into a unified knowledge graph³¹. Xi C. C. combined knowledge graph with qualitative and quantitative research methods, and compared the research of Sun Yikui, Zhao Xianke, and Zhang Jiebin through grounded theory and data mining, greatly improving the efficiency of analysis³². Wang Q. et al. constructed 643 medical records related to the treatment of coronary heart disease, featuring 144 renowned TCM doctors, to provide a methodological reference for the inheritance of their experiences³³. Xiong W. P. et al. extracted information on Chinese Patent Medicine (CPM), diseases, symptoms, and other relevant data from electronic medical records, constructed a knowledge graph, and based on this graph, developed a rule base for CPM monitoring, ensuring comprehensive application of CPM to guarantee drug safety³⁴.

Research content

Since the safety of marketed drugs has been clinically verified, their pharmacokinetic characteristics have been clearly defined, and their production processes, quality standards, and dosage forms are complete, the amount of preclinical research needed is significantly reduced compared to developing drugs from scratch. Only the main pharmacodynamics of new indications need clarification. In the clinical research stage, if the drug dosage form and administration mode are the same as for the original indication, and if the dosage and administration time are less than or equal to those of the original indication, the clinical study for the new indication will directly enter Phase IIb, generally not requiring Phase I and Phase IIa clinical trials. Compared to traditional new drug development, drug repositioning can effectively shorten the drug research and development cycle, reduce costs, and avoid risks, making it a very promising drug development strategy. The comparison of the development processes of an innovator drug and drug repositioning is shown in Fig. 2.

This research primarily comprises the construction of a knowledge graph, the design of its embedding model, and the verification of the model's predictions.

Data and algorithms offer new opportunities for constructing knowledge graphs. Knowledge graphs serve as the foundational support for AI, making them essential. The Drug Repurposing Knowledge Graph (DRKG) is a large-scale drug repositioning knowledge graph created collaboratively by the Amazon Shanghai AI Laboratory, Amazon AI North America, the University of Minnesota, Ohio State University, and Hunan University. It holds significant reference and application value. This article constructs a knowledge graph using DRKG, which includes DrugBank and 24 million publicly available publications. A new knowledge graph is created by assembling six coronaviruses as comprehensive nodes for COVID-19. We randomly divide the dataset into training, validation, and testing sets in a ratio of 9:0.5:0.5.
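The 9:0.5:0.5 random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_triples`, the fixed seed, and the toy identifiers are assumptions made for the example and are not taken from the authors' code.

```python
import random

def split_triples(triples, ratios=(0.9, 0.05, 0.05), seed=0):
    """Shuffle triples and cut them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(triples)
    # A seeded generator keeps the split reproducible (seed value is arbitrary).
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Toy (head, relation, tail) triples; the identifiers are made up.
toy = [(f"Compound::{i}", "treats", "Disease::CoV") for i in range(200)]
train, valid, test = split_triples(toy)
print(len(train), len(valid), len(test))
```

Because the three slices are taken from one shuffled list, every triplet lands in exactly one subset.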

With the development of Internet technology, data has grown explosively. However, because Internet content is multi-source and heterogeneous and its organizational structure is loose, using this information efficiently has become difficult. Knowledge graphs aim to transform massive amounts of unstructured or semi-structured data into standardized, unified, reliable, and effective structured knowledge. This transformation creates a highly interconnected semantic web to support data mining and intelligent services³⁵,³⁶,³⁷. A knowledge graph describes various entities, concepts, and their relationships in the real world, and can essentially be seen as a directed graph-structured network. In the graph, nodes represent entities or concepts, while edges represent the relationships between entities (or between entities and concepts). Based on this highly structured knowledge, diverse data mining applications and intelligent services can be developed. Knowledge Embedding (KE), also known as Knowledge Representation Learning (KRL), aims to learn vector representations of entities and relations³⁸, transforming the symbolic form of knowledge into computable real-valued vectors. It is the core technology in the entire knowledge graph construction process and often provides vector input as part of knowledge fusion. In this study, the translation model TransE and the bilinear models DistMult and Rescal are selected. These models are combined with attention mechanisms to create Attranse, Attdismult, and Attrescal, which are trained on the knowledge graph constructed in this research³⁹.

In the final validation stage, the model’s effectiveness can be demonstrated through prediction scores, cross-comparison of multiple model scores, comparative analysis with COVID-19 clinical drugs, drug-gene characteristics, and HCoV-induced enrichment analysis.

The model flowchart of this research is shown in Fig. 3.

Methods

Knowledge graph embedding model

The embedding of the knowledge graph utilizes machine learning to represent the semantic information of research objects as dense low-dimensional vectors. This effectively addresses data sparsity and enhances knowledge fusion and reasoning performance. Embedding models take into account the interactions among entities and the computational cost. They represent entities with vectors, perform matrix transformations on these vectors or their relationships, and define evaluation functions to measure the correlation between entities. Graph embedding represents complex biochemical mechanisms as relationships in a low-dimensional vector space. Using graph embedding, the model predicts new drug action mechanisms based on known drug-treatment-virus interactions. Embedding vectors help discover potential drug roles in unexplored biochemical mechanisms, thus facilitating drug repositioning.

Attention mechanism model

The attention mechanism enables the model to learn how to allocate its attention by weighting input signals. The primary purpose of the attention mechanism is to score the various dimensions of input and weight features based on these scores, highlighting the impact of important features on downstream models.

In this study, a global attention mechanism is applied to head and tail entities, while a self-attention mechanism is utilized for relationships. In the self-attention mechanism, each element of a sequence makes attention calculations with all other elements, capturing the influence relationships between them without any additional information. Its effectiveness has been demonstrated in machine reading, text summarization, and image annotation.
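To make the two attention variants concrete, the sketch below uses the standard softmax-weighted dot-product form. The text does not give the exact attention equations, so this form is an assumption, as are the function names and the omission of learned projection matrices; it should be read as an illustration of the general idea rather than the authors' architecture.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Each vector attends to every vector in the sequence (itself included).

    Queries, keys and values are the raw inputs; no learned projections.
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        )
    return outputs

def global_attention(query, attributes):
    """Pool attribute vectors, weighted by their similarity to one query."""
    weights = softmax([dot(query, a) for a in attributes])
    d = len(query)
    pooled = [sum(w * a[i] for w, a in zip(weights, attributes)) for i in range(d)]
    return pooled, weights

# Toy inputs: one entity vector with two attribute vectors, and a
# two-word relationship label represented by two word vectors.
entity = [1.0, 0.0]
attrs = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = global_attention(entity, attrs)
label_words = [[1.0, 0.0], [1.0, 0.0]]
contextual = self_attention(label_words)
```

In this toy case the attribute that is more similar to the entity vector receives the larger weight, which is the weighting behaviour the section describes.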

Translation model

The translation model is a key method of knowledge embedding, with TransE being the most popular and representative model. TransE is simple and effective, performs well on large-scale knowledge graphs, and is the most widely studied translation model. It is a distance-based knowledge graph embedding method that assumes relationships between entities can be represented as translations in vector space. The model learns entity and relationship representations through translation operations and distance metrics, where the distance function measures the difference between the predicted and true vectors. It interprets the relationship in a knowledge graph triplet as the translation operation from the head entity to the tail entity in the embedding space. For a triplet (h, r, t), the head entity vector plus the relationship vector should be as close as possible to the tail entity vector; that is, h + r ≈ t, as shown in Fig. 4.

The triplet evaluation function is defined as formula (1):

$$f_{{transe}} (h,r,t) = \left\| {h + r - t} \right\|_{2}^{2}$$

(1)

where h represents the head entity vector, r represents the relationship vector, and t represents the tail entity vector.
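As a worked illustration of formula (1), the following sketch evaluates the squared L2 distance for a triplet that satisfies h + r = t exactly and for a corrupted triplet. The vectors are toy values chosen for the example, not trained embeddings.

```python
def transe_score(h, r, t):
    """Squared L2 norm of h + r - t, as in formula (1).

    A lower value means the triplet fits the translation assumption better.
    """
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))

h = [1.0, 0.0]
r = [0.0, 1.0]
t_true = [1.0, 1.0]     # h + r lands exactly on this tail
t_corrupt = [0.0, 0.0]  # a wrong tail entity

print(transe_score(h, r, t_true))     # 0.0
print(transe_score(h, r, t_corrupt))  # 2.0
```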

Bilinear model

1. Rescal

Rescal is a matrix factorization-based bilinear model that uses tensor decomposition to learn low-dimensional representations of entities and relationships in knowledge graphs. It represents the knowledge graph as a third-order tensor, T. The first order represents the head entity, the second order represents the relationship, and the third order represents the tail entity. If the triplet (h, r, t) exists in the knowledge graph, and the indices of h, r, and t in each order are i, j, and k, then T_{i, j, k} = 1; otherwise, T_{i, j, k} = 0. Rescal obtains the representations of entities and relationships in vector space through tensor decomposition. Each entity is represented as a vector that captures its implicit semantics, and each relationship is represented as a matrix that models the interactions between the dimensions of the head and tail entities. Figure 5 illustrates the relationship diagram.

The evaluation function of Rescal for triplet (h, r, t) is defined as follows:

$$f_{{Rescal}} (h,r,t) = h^{T} M_{r} t$$

(2)

Here, h, t ∈ R^k represent the vectors of the head and tail entities, while M_r ∈ R^{k × k} is the matrix corresponding to the relationship r. The evaluation function calculates the semantic correlation of h and t under the relationship r using this bilinear method.

The RESCAL model learns embedded representations of entities and relationships by optimizing algorithms during training to minimize the error between predictions and actual triplets. After training, the learned embedding vectors can be used for tasks such as relationship inference, entity classification, and link prediction in knowledge graphs.
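The bilinear score of formula (2) can be written out directly as a double sum over the entries of M_r. The sketch below uses small hand-picked values (assumptions for illustration only) and also shows that a full relationship matrix can score (h, r, t) and (t, r, h) differently.

```python
def rescal_score(h, M_r, t):
    """Bilinear form h^T M_r t from formula (2)."""
    return sum(
        h[i] * M_r[i][j] * t[j]
        for i in range(len(h))
        for j in range(len(t))
    )

h = [1.0, 2.0]
t = [3.0, 4.0]

M_sym = [[0.0, 1.0],
         [1.0, 0.0]]
print(rescal_score(h, M_sym, t))   # 1*4 + 2*3 = 10.0

# An asymmetric matrix gives direction-dependent scores.
M_asym = [[0.0, 1.0],
          [0.0, 0.0]]
print(rescal_score(h, M_asym, t))  # 1*1*4 = 4.0
print(rescal_score(t, M_asym, h))  # 3*1*2 = 6.0
```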

2. DistMult

The core idea of DistMult is to model the interaction between entities and relationships using a multiplicative (inner-product) form, rather than simply adding or concatenating them. This approach makes the model more flexible and better able to capture multiple correlations between entities and relationships. DistMult builds on Rescal and simplifies it by restricting the relationship matrix M_r to a diagonal matrix. For the triplet (h, r, t), DistMult represents the head and tail entities as vectors h, t ∈ R^k. The relationship diagram is illustrated in Fig. 6.

The evaluation function is defined as:

$$f_{{DistMult}} (h,r,t) = h^{T} \,\mathrm{diag}(M_{r})\, t$$

(3)

Since the matrix M_r is diagonal, the evaluation function captures correlations only between matching dimensions of h and t, significantly reducing the number of parameters compared to Rescal.
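The effect of the diagonal restriction in formula (3) can be checked numerically. In the sketch below (toy values, assumed for illustration), the relationship is stored as the k diagonal entries only, so it needs k parameters instead of the k × k used by a full bilinear matrix; the score is also symmetric in h and t, a well-known property of this form.

```python
def distmult_score(h, r_diag, t):
    """h^T diag(M_r) t from formula (3), with M_r stored as its diagonal."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r_diag, t))

def bilinear_score(h, M_r, t):
    """Full bilinear form, used here only to cross-check the diagonal case."""
    return sum(
        h[i] * M_r[i][j] * t[j]
        for i in range(len(h))
        for j in range(len(t))
    )

h = [1.0, 2.0]
t = [3.0, 4.0]
r_diag = [0.5, 2.0]
M_full = [[0.5, 0.0],
          [0.0, 2.0]]  # the same relationship written as a full matrix

score = distmult_score(h, r_diag, t)  # 1*0.5*3 + 2*2*4 = 17.5
k = len(h)
params_distmult = k     # one weight per dimension
params_rescal = k * k   # one weight per pair of dimensions
```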

The combination of attention mechanism and knowledge graph embedding

For the representation of entities and relationships in the knowledge graph, the model considers two kinds of features. The first is their own semantic features: for entity representation, the model uses a global attention mechanism to incorporate attribute features, and for relationship representation, it employs a self-attention mechanism to extract semantic features from relationship label words. The second is the structural features of the triplet (h, r, t).

The entity attributes and their respective characteristics are determined by the DRKG knowledge graph, which includes 13 entity types, totaling 97,238 entities, and 5,874,261 triplets belonging to 107 relationship types. Before training, we collect a list of coronavirus (CoV) diseases in DRKG and treat all of them as targets. We assemble six types of coronaviruses (including SARS-CoV, MERS-CoV, HCoV-229E, and HCoV-NL63) into a comprehensive coronavirus node and reconnect the links between genes and drugs. The generated knowledge graph contains four types of entities (drugs, genes, diseases, and drug-related information), along with 39 relationships, 145,179 nodes, and 15,018,067 edges.
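One way to picture the assembly of several coronavirus entities into a single comprehensive node is as a relabelling of triplets followed by de-duplication. The sketch below makes that assumption explicit; the helper `merge_nodes`, the merged node name, and all entity identifiers are hypothetical placeholders and are not DRKG identifiers or the authors' procedure.

```python
def merge_nodes(triples, members, merged_name):
    """Relabel every member entity as merged_name and drop duplicate triples."""
    members = set(members)

    def relabel(entity):
        return merged_name if entity in members else entity

    # A set removes triples that become identical after relabelling.
    merged = {(relabel(h), r, relabel(t)) for h, r, t in triples}
    return sorted(merged)

# Hypothetical toy triples with two placeholder coronavirus entities.
toy_triples = [
    ("Gene::G1", "interacts", "Disease::CoV-A"),
    ("Gene::G1", "interacts", "Disease::CoV-B"),
    ("Compound::X", "targets", "Gene::G1"),
]
merged = merge_nodes(
    toy_triples,
    members=["Disease::CoV-A", "Disease::CoV-B"],
    merged_name="Disease::CoV",
)
for triple in merged:
    print(triple)
```

After merging, the gene that was linked to either virus is linked once to the combined node, so drugs acting on that gene become connected to the combined node through it.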

In the embedding space, the head and tail entities obtain their attribute features through the global attention mechanism and the relationship obtains its semantic features through the self-attention mechanism, after which the translation operation is performed. We refer to this network as Attranse, as shown in Fig. 7.

The bilinear models, Rescal and DistMult, derive latent semantics from a vector representation of each entity, and represent each relationship as a matrix that models the pairwise interactions between latent factors. As in Attranse, the head and tail entities acquire new attribute features through global attention, while the relationship label words yield a semantic representation of the relationship through a self-attention mechanism. Each relationship is represented by a matrix, as illustrated in Fig. 8. Finally, the matrix multiplication operation is performed. The two networks are referred to as Attrescal and Attdismult, respectively.

Verification

The COVID-19 entity and database are input into the knowledge graph embedding models described above, and the score of each drug is obtained using the evaluation functions of Attranse, Attrescal, and Attdismult. The top 100 drugs predicted by each model are then cross-compared to yield the final results. The clinical data of COVID-19 trials are obtained from https://covid19-trials.com/. Comparing the top 100 drugs predicted by the models with those used in COVID-19 clinical treatment identifies the current clinical drugs on this list, demonstrating the effectiveness of the drug repositioning method.
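The combination of per-model rankings can be sketched as follows. The text does not state exactly how the three top-100 lists are combined, so taking their intersection and ordering it by summed rank is an assumption, as are the function names and the toy scores. The `higher_is_better` flag is included because a distance-style score such as formula (1) ranks lower values first, while the bilinear scores rank higher values first.

```python
def top_k(scores, k, higher_is_better=True):
    """Return the k best drug names from a {drug: score} mapping."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return ranked[:k]

def consensus(rankings):
    """Drugs present in every ranking, ordered by summed rank position."""
    common = set(rankings[0]).intersection(*rankings[1:])
    # The drug name is a tie-breaker so the order is deterministic.
    return sorted(common, key=lambda d: (sum(r.index(d) for r in rankings), d))

# Toy scores for four hypothetical drugs A-D under three models.
distance_scores = {"A": 0.2, "B": 1.5, "C": 0.4, "D": 3.0}  # lower is better
bilinear_scores_1 = {"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.7}
bilinear_scores_2 = {"A": 0.5, "B": 0.6, "C": 0.9, "D": 0.1}

rankings = [
    top_k(distance_scores, 3, higher_is_better=False),
    top_k(bilinear_scores_1, 3),
    top_k(bilinear_scores_2, 3),
]
candidates = consensus(rankings)
print(candidates)
```

With k = 100 and real model scores, the same two helpers would yield the drugs that all three models place in their top 100.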

Additionally, this paper further employs enrichment analysis of drug gene characteristics in human cell lines, along with the transcriptomic and proteomic data induced by SARS-CoV, to more effectively validate the best candidate drugs.

First, three datasets of differential gene expression in human cell lines infected with HCoV were collected from the Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/). Specifically, two SARS-CoV transcriptome datasets were used: one from the peripheral blood of SARS-CoV-infected patients (GSE1739) and another from Calu-3 cells infected with SARS-CoV (GSE33267). A transcriptome dataset from Calu-3 cells infected with MERS-CoV (GSE122876) was also selected. Additionally, a proteome dataset specific to SARS-CoV-2 was collected from https://biochem2.com/index.php/22ibcii/pqc/130-frontpage-pqc#coronavirus. Genes and proteins with P-values less than 0.01 are defined as differentially expressed. Differential gene expression in cells treated with various drugs was retrieved from the Connectivity Map (CMap) database and used as the gene profile of each drug. The Enrichment Score (ES) for each CoV dataset is calculated as follows:

$$ES = \left\{ {\begin{array}{*{20}l} {ES_{{up}} - ES_{{down}} ,} \hfill & {\text{sgn} \left( {ES_{{up}} } \right) \ne \text{sgn} \left( {ES_{{down}} } \right)} \hfill \\ {0,} \hfill & {else} \hfill \\ \end{array} } \right.$$

(4)

ES_up and ES_down are the enrichment scores calculated for the up-regulated and down-regulated genes of the CoV gene signature dataset, respectively. They are derived from the quantities a_up/down and b_up/down, which are calculated as follows:

$$a = \mathop {\max }\limits_{{1 \le j \le s}} \left( {\frac{j}{s} - \frac{{v\left( j \right)}}{r}} \right)$$

(5)

$$b = \mathop {\max }\limits_{{1 \le j \le s}} \left( {\frac{{v\left( j \right)}}{r} - \frac{{j - 1}}{s}} \right)$$

(6)

where j = 1, 2, …, s indexes the genes in the HCoV dataset, sorted in ascending order of their position in the gene profile of the drug under evaluation. The position of gene j is denoted by v(j), with 1 ≤ v(j) ≤ r, where r is the number of genes in the CMap database (12,849). If a_up/down > b_up/down, then ES_up/down = a_up/down; if a_up/down < b_up/down, then ES_up/down = −b_up/down. To assess the significance of the ES, the calculation was repeated 100 times with randomly generated gene lists containing the same numbers of up-regulated and down-regulated genes as the CoV dataset. A drug is considered to have a significant enrichment effect if ES > 0 and P < 0.05.
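Formulas (4) to (6) can be traced with a small numeric example. The sketch below assumes a toy profile of r = 10 ranked genes instead of the 12,849 CMap genes, and it resolves the case a = b (which the text leaves unspecified) by returning −b; the function names are hypothetical, and the significance test with random gene lists is not included.

```python
def set_score(positions, r):
    """ES_up or ES_down for one gene set, from formulas (5) and (6).

    positions: positions v(j) of the set's genes in the drug profile (1..r).
    """
    v = sorted(positions)  # ascending order, as the definition requires
    s = len(v)
    a = max(j / s - v[j - 1] / r for j in range(1, s + 1))
    b = max(v[j - 1] / r - (j - 1) / s for j in range(1, s + 1))
    # Assumption: when a == b the text gives no rule; -b is returned here.
    return a if a > b else -b

def sign(x):
    return (x > 0) - (x < 0)

def enrichment_score(up_positions, down_positions, r):
    """Formula (4): ES_up - ES_down if the two signs differ, else 0."""
    es_up = set_score(up_positions, r)
    es_down = set_score(down_positions, r)
    if sign(es_up) != sign(es_down):
        return es_up - es_down
    return 0.0

r = 10
# Up-regulated genes at one end of the profile, down-regulated at the other.
es_opposite = enrichment_score([1, 2, 3], [8, 9, 10], r)
# Both gene sets at the same end: the signs agree, so the score is zero.
es_same_end = enrichment_score([1, 2, 3], [1, 2, 4], r)
```

In the first case ES_up = 0.7 and ES_down = −0.8, giving ES = 1.5; in the second both set scores are positive, so formula (4) returns 0.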

Experimental results

Experimental environment

Experimental results are influenced by the experimental environment, so its parameters are provided below. In this work, the PyTorch deep learning framework was used to train the model. The specific details of the experimental environment are presented in Table 1.

Table 1 Training environment parameters.

Dataset

DRKG is a comprehensive knowledge graph in biomedicine, encompassing six main data aspects: human genes, compounds, biological processes, drug side effects, diseases, and symptoms. DRKG extracts data from six large-scale open medical databases (DrugBank, Hetionet, GNBR, String, IntAct, and DGIdb), as well as recent medical literature related to COVID-19, and standardizes this information. The DRKG knowledge graph contains 97,238 entities across 13 entity types and 5,874,261 triplets across 107 relationship types. These 107 relationship types describe the interactions between 17 entity type pairs, with multiple interaction types possible for the same entity pair, as shown in Fig. 9.

The medical knowledge graph is the cornerstone of smart medicine⁴⁰. However, existing knowledge graph construction technology in the medical field generally suffers from low efficiency, numerous restrictions, and poor scalability. Medical data are cross-lingual, highly specialized, and structurally complex. To accommodate these characteristics, the ontology representation depicts knowledge as a network in which associated nodes (entities) are linked by triples of the form (entity 1, relationship, entity 2)⁴¹. The number of nodes in the knowledge graph affects the structural complexity of the network as well as the efficiency and difficulty of reasoning.

In this research, the knowledge graph is constructed from DRKG. DrugBank combines the structural and pharmacological data of drug molecules, including biotech drugs, with the protein sequences, structures, and modes of action of their targets. It also integrates information on chemical structure, pharmacological effects, protein targets, physiological pathways, and drug interactions, and it links to the PDB and KEGG databases for more detailed drug information. Drugs with a molecular weight greater than 230 daltons that also appear in GNBR are selected from DrugBank. For these drugs, the knowledge graph includes relationships covering drug interactions, side effects, ATC codes, mechanisms of action, pharmacodynamics, and toxicity. It also incorporates relationships between coronaviruses and genes discovered in knowledge graph experiments, for example those involving ‘Disease::SARS-CoV2 E’, ‘Disease::SARS-CoV2 Spike’, ‘Disease::SARS-CoV2 nsp1’, and ‘Disease::SARS-CoV2 orf10’. Six types of coronaviruses (including SARS-CoV, MERS-CoV, HCoV-229E, and HCoV-NL63) are assembled into a comprehensive coronavirus (CoV) node, and the links between genes and drugs are reconnected accordingly. The resulting knowledge graph contains four types of entities (drugs, genes, diseases, and drug-related information), 39 relationships, 145,179 nodes, and 15,018,067 edges.
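
Assembling several coronavirus nodes into one comprehensive CoV node amounts to relabelling the member nodes and removing the duplicate edges that result. The following sketch shows one way to do this on a list of triples; the node labels `Disease::SARS-CoV`, `Disease::MERS-CoV`, and `Disease::CoV`, the relation names, and the function name are hypothetical, and the authors' actual procedure may differ.

```python
def merge_nodes(triples, members, merged_name):
    """Redirect every edge touching a member node to one merged node.

    Edges that become identical after the merge are kept only once.
    """
    members = set(members)
    merged = set()
    for head, relation, tail in triples:
        head = merged_name if head in members else head
        tail = merged_name if tail in members else tail
        merged.add((head, relation, tail))
    return sorted(merged)


# Hypothetical toy graph: two virus nodes that share a gene neighbour.
toy_graph = [
    ("Gene::gene_x", "interacts", "Disease::SARS-CoV"),
    ("Gene::gene_x", "interacts", "Disease::MERS-CoV"),
    ("Compound::drug_a", "targets", "Gene::gene_x"),
]
cov_graph = merge_nodes(
    toy_graph,
    members=["Disease::SARS-CoV", "Disease::MERS-CoV"],
    merged_name="Disease::CoV",
)
```

After the merge, the two gene-virus edges collapse into a single edge to `Disease::CoV`, so the drug `Compound::drug_a` reaches the merged node through `Gene::gene_x` regardless of which virus the original evidence came from.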

Experimental results

The model is deployed in a web environment and presents the final results as a table of the 100 top-scoring drugs. For validation, these 100 drugs were compared with the drugs currently used clinically to treat COVID-19; drugs in the intersection were marked 1 and the remaining drugs 0. All raw prediction scores fall between 0 and 1, with a lower score indicating a better prediction. To improve clarity, the raw scores were transformed by subtracting 10 times the raw score from 100, which maps them to the range 90–100, where a higher score indicates a better predicted drug effect. Seven of the drugs predicted by the model were identical to clinical drugs for treating COVID-19, as shown in Table 2.
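
The score transformation and the 1/0 marking described above can be written out as follows. This is a minimal sketch: the function names, drug names, and raw scores are hypothetical placeholders, not results from Table 2.

```python
def transform_score(raw):
    """Map a raw score in [0, 1] (lower is better) to 100 - 10 * raw.

    The result lies in [90, 100], where higher is better.
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError("raw score must lie in [0, 1]")
    return 100.0 - 10.0 * raw


def label_top_k(raw_scores, clinical_drugs, k=100):
    """Rank drugs by raw score and mark those that are clinical drugs.

    Returns (drug, transformed score, flag) tuples, where flag is 1 for a
    drug in clinical_drugs and 0 otherwise.
    """
    ranked = sorted(raw_scores.items(), key=lambda item: item[1])[:k]
    return [
        (drug, transform_score(raw), 1 if drug in clinical_drugs else 0)
        for drug, raw in ranked
    ]


# Hypothetical raw scores and clinical list, for illustration only.
raw_scores = {"drug_a": 0.0, "drug_b": 0.5, "drug_c": 0.25}
clinical = {"drug_a", "drug_c"}
table = label_top_k(raw_scores, clinical, k=100)
```

Because the transformation is decreasing in the raw score, the ranking of drugs is unchanged; only the direction of the scale is reversed so that larger values read as better predictions.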

Table 2 Drug prediction results.

Both the sixth and seventh trial editions of the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia indicate that ribavirin can be used for antiviral treatment. Results from the RECOVERY trial in the United Kingdom show that dexamethasone can reduce mortality in patients on invasive ventilation. According to a team at Imperial College London and ICNARC, patients who received dexamethasone, hydrocortisone, or methylprednisolone had an estimated 20% reduction in their risk of death. Thai doctors reported good results in clinical treatment with oseltamivir, an anti-influenza drug, combined with lopinavir and ritonavir, which were initially used to treat AIDS. A trial of colchicine conducted by a medical team at Attikon Hospital in Athens, Greece, found it effective in treating COVID-19 patients, particularly those with severe symptoms. Professor Xia Jinglin of Zhongshan Hospital, affiliated with Fudan University, used thalidomide to treat severe COVID-19, and the results indicated that it was effective. Because the clinical manifestations of severe COVID-19 pneumonia resemble iron overload, deferoxamine may serve as a promising supportive treatment for complications of COVID-19 pneumonia, and studies of its use against COVID-19 are ongoing⁴². In summary, the drug predictions agree with clinical evidence, which supports their reliability. The model can therefore help identify drugs with potential therapeutic effects, support drug development, and reduce the time and cost involved.

Discussion

Research in drug repositioning can be divided into two categories according to method. The first is drug knowledge discovery based on experimental data; the second is drug knowledge mining based on scientific data. The former relies on clinical trials and focuses on the interaction between drug molecules and cell receptors, uncovering potential drug effects by establishing clinical models⁴³,⁴⁴. The latter is based on computer technology: it takes the correlations among scientific data as its research object and discovers drug knowledge by constructing computational data models⁴⁵,⁴⁶. Compared with clinical experimental methods, AI methods applied to drug repositioning can leverage vast amounts of data and strong computing capabilities to analyze the relationship between drugs and diseases from multiple perspectives. This enhances the efficiency and success rate of drug screening, improves efficacy and safety predictions, shortens the research and development cycle, and reduces costs.

In this study, we introduced global attention mechanisms into three models (Attranse, Attrescal, and Attdismult) to incorporate attribute features into entities, and used the self-attention mechanism to provide semantic features for relation labels, thereby constructing a drug repositioning model based on knowledge graph embedding. As a case study, the predicted drugs related to COVID-19 were analyzed and validated. A total of 8,104 drugs were scored, 32 of which are COVID-19 clinical drugs. These drugs have a molecular weight greater than 230 daltons and were selected from FDA-approved drugs in DrugBank. Among the three traditional models, TransE achieved the best results, with three clinical drugs ranked in the top 10: ribavirin (first), dexamethasone (fifth), and colchicine (ninth). With the proposed model, seven clinical drugs were ranked in the top 10, as shown in Table 2. The experimental results thus show a 133% improvement in prediction accuracy over traditional methods (seven versus three clinical drugs in the top 10), supporting the effectiveness of this study. Based on the results and validation analysis, this research method provides a theoretical basis for drug repositioning, offers new ideas for traditional drug discovery, and supports decision-making for future clinical experiments and research.
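
The 133% figure is not derived explicitly in the text. One reading that is consistent with the counts reported above, offered here as an assumption, takes the number of clinical drugs ranked in the top 10 as the accuracy measure, with 3 for the best traditional model and 7 for the proposed model:

$$\frac{7 - 3}{3} = \frac{4}{3} \approx 1.33 = 133\%$$

Under this reading the improvement is relative to the best traditional baseline (TransE), so it describes top-10 recovery rather than accuracy over all 8,104 scored drugs.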

The research has significant potential for expansion. So far, experimental verification has been conducted only on COVID-19 data. Future research can explore potential drugs for other diseases, such as Alzheimer’s disease and cancer, using the methods proposed in this paper or improved versions of them. This may require collaboration with clinical professionals to obtain safer and more reliable drugs.

Conclusion

Traditional drug development requires substantial funding, long research and development periods, and continuous advances in research technology, yet most candidate drugs ultimately fail to reach the market because of safety issues or unsatisfactory efficacy. In contrast, drug repositioning explores potential new indications for existing drugs, offering a more cost-effective, faster, and lower-risk pathway for drug development. This study addresses drug repositioning by incorporating attention mechanisms into translation and bilinear models such as TransE, RESCAL, and DistMult, which improves the representation of entities and relationships in knowledge graphs and the quality of knowledge embedding. Combining multiple attention-based models overcomes the limitations of a single model and enhances the robustness and accuracy of predictions. The experimental results, particularly the drug predictions related to COVID-19, validate the proposed method: among the top 10 predicted drugs, seven align with those already in clinical trials. Compared to traditional methods, the model improves drug prediction accuracy by 133%, demonstrating its effectiveness and practicality. This result confirms the potential of the method for drug repositioning and provides a foundation for future clinical trials.

The main contributions of this study are as follows:

(1) Introduced attention mechanisms into translation and bilinear models to enhance semantic representation.
(2) Integrated multiple models to improve prediction quality and the drug screening process.
(3) Validated candidate drugs against current clinical trial drugs, supporting the effectiveness of the proposed method.
(4) Developed a drug repositioning framework based on knowledge graph embedding, applicable to various diseases.
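
For reference, the base scoring functions of the translation and bilinear models named in contribution (1) can be sketched as follows. This is a minimal illustration of the standard published forms of TransE, DistMult, and RESCAL, not of the attention-augmented variants proposed in this study. The function names and toy vectors are assumptions, and the L2 norm is used for TransE although the L1 norm is an equally common choice.

```python
import math


def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult: bilinear score with a diagonal relation matrix diag(r)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rescal_score(h, M, t):
    """RESCAL: bilinear score h^T M t with a full relation matrix M."""
    return sum(
        h[i] * M[i][j] * t[j] for i in range(len(h)) for j in range(len(t))
    )


# Toy two-dimensional embeddings, for illustration only.
h, r, t = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
M = [[3.0, 0.0], [0.0, 4.0]]  # diagonal M reproduces DistMult with this r
```

With a diagonal relation matrix, RESCAL reduces to DistMult, which is why the two bilinear scores agree on the toy vectors; a full matrix lets RESCAL model asymmetric relations at the cost of more parameters per relation.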

Although the current research focuses on COVID-19, there are several promising ways to expand it:

(1) Application to other diseases: The method proposed in this study can be extended to explore potential drugs for treating diseases such as Alzheimer’s disease, cancer, and viral infections. Future research should adapt these models to address the unique challenges posed by different diseases.
(2) Model enhancement: Future work could explore improvements to attention mechanisms, such as more complex variants of self-attention or combining them with other neural network architectures. These improvements could lead to more accurate predictions and better handling of complex drug-disease relationships.
(3) Clinical validation collaboration: Future research should prioritize collaboration with clinical professionals to assess the safety and efficacy of drugs. These partnerships are essential for translating model predictions into real-world applications and clinical trials.
(4) Expanding data sources: To further validate and improve the model’s reliability, future work could incorporate a wider range of data sources, including patient-specific data and other drug-related information. This will enhance the robustness and scalability of the method.

In summary, this study presents a promising approach for drug repositioning, emphasizing the use of attention mechanisms in knowledge graph embedding. This research has significant potential to contribute to personalized medicine and drug discovery by extending the method to other diseases and collaborating with clinical experts.

Data availability

All data generated or analysed during this study are included in this published article.

References

1. Schork, N. J. Genetics of complex disease: Approaches, problems, and solutions. Am. J. Respir. Crit. Care Med. 156 (4), S103–S109 (1997).
2. Steedman, M. & Taylor, K. Measuring the return from pharmaceutical innovation. Deloitte Center for Healthcare Solutions, Deloitte (2019). https://www2.deloitte.com/us/en/pages/life-sciences-and-health-care/articles/measuring-return-from-pharmaceutical-innovation.html
3. Di, M., Joseph, A., Henry, G., Grabowski & Ronald, W. H. Innovation in the pharmaceutical industry: New estimates of R&D costs. Journal of Health Economics. North-Holland, February 12 (2016).
4. Tai, Y. F., Wang, S., Zhang, F. L., Yu, Q. & He, P. F. The main topics and development context of drug repositioning research in China. Chin. J. New Drugs 29 (22), 2541–2551 (2020).
5. Joachim, E. V., Mathias, S. M. M. & Soeren, D. Robert, P. PROMISCUOUS: A database for network-based drug-repositioning. Nucleic Acids Res. 39 (Database issue), D1060–D1066 (2011).
6. Novac, N. Challenges and opportunities of drug repositioning. Trends Pharmacol. Sci. 34 (5), 267–272 (2013).
7. F, S. A. Old drugs, new tricks. BMJ (Clinical research ed.) 342 (7793), d741 (2011).
8. Shan, S. Z., Wen, B., Qiao, T. C. & Shan, G. C. Application and analysis of problems of artificial intelligence in drug repurposing for Corona virus disease 2019 (COVID-19). Chin. J. Pharmacol. Toxicol. 38 (04), 294–303 (2024).
9. Yang, S. M., Yu, J., Hou, W. B., Zhao, Q. & Li, Y. L. Research progress on artificial intelligence algorithms for drug discovery. Drugs Clin. 38 (12), 3150–3160 (2023).
10. Yu, Z. H. et al. Artificial intelligence-based drug development: Current progress and future challenges. J. China Pharm. Univ. 54 (3), 282–293 (2023).
11. Li, S. X. et al. Research progress in application of artificial intelligence during process of drug discovery. Drug Evaluat. Res. 46 (9), 2030–2036 (2023).
12. Vishakha, S., Kumar, S. S. & Ritesh, S. A novel framework based on explainable AI and genetic algorithms for designing neurological medicines. Sci. Rep. 14 (1), 12807–12807 (2024).
13. Barberis, A., Aerts, H. J. W. L. & Buffa, F. M. Robustness and reproducibility for AI learning in biomedical sciences: RENOIR. Sci. Rep. 14 (1), 1933–1933 (2024).
14. Chayna, S. et al. Artificial intelligence and machine learning technology driven modern drug discovery and development. Int. J. Mol. Sci. 24 (3), 2026–2026 (2023).
15. E, G. D. et al. A SARS-CoV-2-Human protein-protein interaction map reveals drug targets and potential drug-repurposing. bioRxiv: the preprint server for biology (2020).
16. Ashique, S. et al. Application of artificial intelligence (AI) to control COVID-19 pandemic: Current status and future prospects. Heliyon 10 (4), e25754 (2024).
17. Luo, H. M. et al. Drug-drug interactions prediction based on deep learning and knowledge graph: A review. iScience 27 (3), 109148 (2024).
18. Feng, Y. et al. A medical image segmentation network based on spatial pyramid and attention mechanism. Biomed. Signal Process. Control 94, 106285 (2024).
19. Di, J. et al. A multimodal medical image fusion method based on an attention mechanism and MobileNetV3. Biomed. Signal Process. Control 96 (PB), 106561 (2024).
20. Zu, J., Qian, J. T., Wang, Y. & Gu, Y. X. Drug repositioning method and system based on word vector representation and attention mechanism. Shanxi Province: CN202210582908.2, 2024-04-19.
21. Tang, X. F. et al. A drug repositioning method and system based on bilinear attention network. Hubei Province: CN202311259470.5, 2023–2011.
22. Guo, Z. Q., Tan, Z. F., Wang, J. J. & Ye, Q. Intelligent Q&A of proprietary Chinese medicine based on knowledge graph. Comput. Inform. Technol. 31 (4), 52–57 (2023).
23. Wu, D. & Zhou, Z. J. Intelligent question answering system for cardiovascular diseases based on knowledge graph. Softw. Guide 21 (3), 160–164 (2022).
24. Remy, C. et al. LORD: a phenotype-genotype semantically integrated biomedical data tool to support rare disease diagnosis coding in health information systems. AMIA Annual Symposium proceedings. AMIA Symposium 2015, 434–440 (2015).
25. Weng, H., Chen, J. L., Ou, A. & Lao, Y. R. Leveraging representation learning for the construction and application of a knowledge graph for traditional Chinese medicine: Framework development study. JMIR Med. Inf. 10 (9), e38414–e38414 (2022).
26. Fu, Z. X. et al. Research on viscera syndrome differentiation in TCM based on multiple link prediction reasoning. Chin. J. Inform. Tradit. Chin. Med. 30 (4), 18–24 (2023).
27. Li, J., Gao, J., Feng, B. Y. & Jing, Y. PlagueKD: A knowledge graph-based plague knowledge database. Database J. Biol. Databases Curat. 2022 (2022).
28. Lu, Y. W. Construction of gastric cancer knowledge graph and drug discovery application. Soochow Univ. (2023).
29. Lyu, Y. H., Zhao, H. X., Li, Q., Liang, A. X. & Yu, Q. Drug knowledge discovery for autism spectrum disorders based on SPO predications. Chin. Nurs. Res. 38 (5), 796–804 (2024).
30. Fan, M. & Gao, Y. Visualization construction of knowledge graph based on Tibetan medicine data. Mod. Comput. 29 (24), 64–68 (2023).
31. Ouzounis, S. et al. Data-driven drug repurposing in diabetes mellitus through an enhanced knowledge graph. Eng. Proc. 50 (1), 9 (2023).
32. Xi, C. C. The Kidney Life Gate Theory of the Warming and Supplementing School in Ming Dynasty and its Inheritance and Development To Huangdi Neijing [D] (Beijing University of Chinese Medicine, 2021).
33. Wang, Q., Dai, G. H., Guan, H. & Gao, W. L. Construction and application of clinical experience knowledge graph for renowned TCM doctors in treating coronary heart disease. Chin. J. Inform. Tradit. Chin. Med. 31 (3), 64–70 (2024).
34. Xiong, W. P. et al. Design and evaluation of a prescription drug monitoring program for Chinese patent medicine based on knowledge graph. Evid.-Based Complement. Altern. Med. 2021, 9970063 (2021).
35. Wang, Q., Mao, Z. D., Wang, B. & Guo, L. Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29 (12), 2724–2743 (2017).
36. Dai, J. J., Shi, L. J., Huang, X. Q. & Chen, M. Y. Construction and application of diseases and drugs correlation knowledge map based on Spark. Chin. J. Health Inf. Manag. 19 (6), 931–938 (2022).
37. Yuan, K. Q., Deng, Y., Chen, D. Y., Zhang, B. & Lei, K. Construction techniques and research development of medical knowledge graph. Appl. Res. Comput. 35 (7), 1929–1936 (2018).
38. Yang, B. S., Yih, W. T., He, X. D., Gao, J. F. & Deng, L. Embedding entities and relations for learning and inference in knowledge bases. CoRR (2014).
39. Zheng, D. et al. DGL-KE: training knowledge graph embeddings at scale. 739–748 (2020).
40. Xiu, X. L., Wu, S. Z., Cui, J. W., Wu, J. M. & Qian, Q. Advances in studies on construction of medical knowledge graphs. Chin. J. Med. Libr. Inform. Sci. 27 (10), 33–39 (2018).
41. Liu, Z. Y., Sun, M. S., Lin, Y. K. & Xie, R. B. Knowledge representation learning: A review. J. Comput. Res. Dev. 53 (2), 247–261 (2016).
42. Ghasemiyeh, P. & Samani, S. M. Iron chelating agents: Promising supportive therapies in severe cases of COVID-19? Trends Pharm. Sci. 6 (2), 65–66 (2020).
43. Qi, X. et al. Exploration of therapeutic drugs for gastric cancer using drug repositioning strategy. J. Univ. Electron. Sci. Technol. China 52 (5), 659–666 (2023).
44. Lu, W. Q. et al. Drug repurposing of histone deacetylase inhibitors that alleviate neutrophilic inflammation in acute lung injury and idiopathic pulmonary fibrosis via inhibiting leukotriene A4 hydrolase and blocking LTB4 biosynthesis. J. Med. Chem. 60 (5), 1817–1828 (2017).
45. Peng, C., Hu, Y. X., Chen, L. F., Ye, Z. X. & Tian, G. A. Review on in-silico repositioning algorithms of drugs and chemical compounds. Progr. Pharm. Sci. 44 (1), 4–9 (2020).
46. Chen, F., Yang, C. R., Zhang, Z., Chen, F. & Liu, X. Establishment of a high-throughput screening platform based on drug repurposing targeting alpha-1-acid glycoprotein and discovery of potential weight loss drugs. J. Pharm. Pract. Serv. 42 (3), 114–120 (2024).

Author information

Authors and Affiliations

School of Intelligence Technology, Geely University of China, No. 123, Section 2 of Chengjian Avenue, Eastern New Area, Chengdu, 641423, Sichuan, China
Shufang He & Xiaoyu Zhao

Contributions

S.H. conceptualized the article, wrote the main manuscript text, conducted the literature review and analysis, collected and organized the data, and prepared the figures and tables. X.Z. offered valuable advice throughout the writing process.

Corresponding author

Correspondence to Shufang He.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

He, S., Zhao, X. Drug repositioning model based on knowledge graph embedding. Sci Rep 15, 10298 (2025). https://doi.org/10.1038/s41598-025-95372-5

Received: 25 December 2024
Accepted: 20 March 2025
Published: 25 March 2025
DOI: https://doi.org/10.1038/s41598-025-95372-5
