Introduction

Knowledge Graphs (KGs) store extensive real-world knowledge in the form of facts. These facts are captured as triples of entities and their various relations, and are widely used in scenarios such as search engines1, recommender systems2, and intelligent question-answering3. However, real-world factual information is often complex and changes dynamically over time, and this time-sensitive nature seriously degrades the performance of traditional KGs. Therefore, temporal knowledge graphs (TKGs) were proposed as a solution.

TKGs inherit the advantages of traditional KGs and also introduce a temporal dimension that allows them to represent dynamically changing facts. TKGs are made up of KG snapshots that include timestamped facts, each represented as a quadruple (Subject Entity, Relation, Object Entity, Timestamp). For example, Fig. 1 shows an example of a TKG, where each subgraph is a static KG, and the quadruple (Donald Trump, Win, US election, 2024) indicates that Donald Trump won the US election in 2024. Due to their ability to capture dynamic properties that evolve over time, TKGs have shown promising applications in areas such as future event prediction4, medical assistance systems5 and personalized service formulation6.

Fig. 1

An example of a TKG, including two subgraphs of the KG with timestamps 2018 and 2024. Portrait images of political leaders, flags, and icons were obtained from Wikimedia Commons under CC BY licenses or from public domain resources.

Temporal knowledge graph reasoning (TKGR) can be divided into two types of methods: interpolation and extrapolation. Interpolation reasoning targets missing facts within the time range [\(t_0\),\(t_T\)], filling the gaps in historical time periods, while extrapolation reasoning focuses on t > \(t_T\), predicting future facts from past and current knowledge. Because extrapolation is the more difficult setting, our model adopts the extrapolation reasoning method.

The field of TKGR has seen considerable advancements in recent years. For example, EvoExplore7 effectively captures the evolutionary characteristics of short-term events through a temporal point process with a hierarchical attention mechanism and a soft motif module; DPCL-Diff8 generates new events through graph node diffusion for sparse historical data and uses dual-domain periodic contrastive learning to distinguish periodic and non-periodic events; LogCL9 introduces an entity-aware attention mechanism to better integrate local and global historical information through contrastive learning; CDRGN-SDE10 uses stochastic differential equations to capture the nonlinear dynamics of temporal change and adapt to heterogeneous characteristics. However, these methods still face the following three challenges, which reduce their performance:

How to capture long-distance dependencies in a sparse data environment. Existing methods usually assume that data is continuous and dense, ignoring the impact of sparsity on long-distance dependency modeling. However, in practice, TKG data is often unevenly distributed, and the number of triples in certain time periods is sparse or even missing, making it difficult for the model to effectively learn latent dependencies across time spans11,12. At the same time, long-distance dependencies are time-lagged and hidden. For example, future events may depend on key events at earlier points in time, but these events are easily weakened in sparse data. Therefore, designing a model that can capture long-distance dependencies in a sparse environment is crucial for enhancing TKGR performance.

How to effectively handle noise interference. Existing methods usually assume that the input data is noise-free, ignoring the erroneous, incomplete, or conflicting information that is common in actual TKGs. Such noise causes the model to learn unreliable embedding representations, reducing the accuracy and robustness of reasoning12. Therefore, it is crucial to design methods that can effectively deal with noise interference and improve model robustness.

How to model latent temporal relationships. Although existing methods take the dynamic characteristics of time into account, they remain insufficient at capturing the latent connections between temporal relationships13,14. For example, to predict (US, ?, North Korea, 2025), it is possible to learn from the 2018 KG subgraph in Fig. 1 that Donald Trump improved US-North Korea relations when he took office as president, and from the 2024 subgraph that Trump was re-elected president, and thus infer that US-North Korea relations may ease in 2025. Therefore, how to model the latent temporal connections in the TKG to improve prediction accuracy is of great research value.

To address the above three challenges, this paper proposes a framework based on dual-gate and noise-aware contrastive learning (DNCL) to improve the accuracy and robustness of TKGR. Specifically, the DNCL model consists of three modules. The multi-dimensional gated update module dynamically extracts key information from the horizontal and vertical dimensions by introducing a dual-gate mechanism of selection and update gates, suppresses redundant features, and captures long-distance dependency information between entities and relations through row and column slicing operations, ensuring that key information is retained in sparse data. The noise-aware adversarial modeling module generates diverse noise and enhances the model's ability to identify noise through an adversarial training mechanism with a noise generator and a noise discriminator, thereby improving robustness. The multi-layer embedding contrastive learning module combines intra-layer and inter-layer contrastive learning strategies to capture short-term dynamics and global temporal relationships, respectively, and enhances the representation capability of the model. This dual strategy helps the model better capture the latent relationships in the TKG and improves its ability to model temporal relationships.

Our paper has the following contributions:

  • Innovatively integrates a multi-dimensional gated mechanism to improve long-distance dependency modeling: A multi-dimensional gated update module is proposed, which for the first time optimizes the long-distance dependency capture capability on TKGs through a dual-gate selection strategy, significantly alleviating the problem of information sparsity.

  • Introducing noise-aware adversarial modeling to improve model robustness: A noise-aware adversarial modeling module is designed to significantly enhance noise resistance by generating and discriminating noise for adversarial training so that the model maintains efficient inference performance despite noise interference.

  • Multi-layer embedding contrastive learning for deep temporal relationship modeling: A multi-layer embedding contrastive learning module is proposed to comprehensively mine the latent connections of temporal relationships by combining intra-layer and inter-layer contrastive learning strategies, which significantly improves the inference performance of the model.

  • Excellent performance verification: Experimental results on four public benchmark datasets show that DNCL is significantly ahead of existing models on multiple core metrics. In particular, Hits@1 on the ICEWS14, ICEWS05-15, and ICEWS18 datasets increased by 6.91%, 4.31%, and 5.30%, respectively, demonstrating strong reasoning capabilities.

The structure of this paper is as follows: Sec. Related work reviews related work and emphasizes the novelty of our model compared to existing methods; Sec. Preliminaries outlines fundamental concepts and notations; Sec. The proposed model presents the DNCL model; Sec. Experiments discusses the experimental results; and Sec. Conclusion concludes the study.

Related work

This section reviews knowledge graph reasoning methods, categorized into static and temporal methods according to whether they incorporate time-dimension information.

Static knowledge graph reasoning

Static KG reasoning enhances structural information in static scenarios by exploring latent associations between entities and relations, and can be grouped into four categories of methods15: methods based on translation, logical rules, multi-source information, and neural networks. Translation-based methods, such as TransE16, model entities and relations in a low-dimensional vector space with a simple and efficient objective function and negative sampling. Subsequent models such as TransH17 and TransR18 further optimize the interaction modeling of entities and relations. Logical rule-based methods, such as RLvLR19, combine embedding techniques and improved sampling strategies to efficiently learn first-order rules, while Neural-LP20 achieves end-to-end logical rule reasoning by combining neural networks with logical operations. Multi-source information-based methods, such as IterE21, improve the representation of sparse entities through iterative learning, and MKRL22 uses convolutional networks and attention mechanisms to enhance knowledge representation learning. Neural network-based methods, such as ConvE23, use multi-layer convolutional networks to model complex relationships; Conv-TransE24 uses convolution and structure awareness to enhance knowledge graph embedding; and CompGCN25 processes multi-relation graphs through graph convolutional networks. Although these methods have made important progress in static KG reasoning, they remain limited because they do not introduce the time dimension.

Temporal knowledge graph reasoning

TKGR methods can be divided into interpolation and extrapolation according to the inference time. Interpolation reasoning aims to complete missing facts in the past26. For example, TTransE27 models knowledge evolution through the time dimension based on TransE16. TA-DistMult26 uses the traditional DistMult28 as a basis and combines recurrent neural networks and latent decomposition to achieve embedding learning. TNTComplEx29 extends ComplEx30 and optimizes link prediction under time constraints through regularization and 4th-order tensor decomposition. HyTE31 maps timestamps to hyperplanes and explicitly integrates them into the entity-relation space. DE-SimlE32 uses multi-layer convolutional networks to efficiently model entity relationships. However, these methods cannot infer future facts and are thus unsuitable for prediction tasks.

In the field of visual question answering, R-VQA33 uses balanced datasets to reduce language prior bias, and ESC-Net34 combines spatial and channel attention to enhance visual features, but they mainly focus on static images and lack the ability to model temporal information. ENVQA35 combines dual and triple attention mechanisms to focus on both local and global features, while QSFVQA36 reduces computational overhead by isolating question types. These methods provide new inspiration for KG reasoning, but still have difficulty in capturing temporal correlations and cannot adapt to the characteristics of knowledge evolution over time, which is the core requirement of the TKGR task.

This study focuses on extrapolative reasoning, that is, predicting future facts. Know-Evolve37 models temporal evolution patterns through deep evolving knowledge networks, but has difficulty capturing long-distance dependencies. RE-NET38 uses an autoregressive architecture and a recurrent event encoder for multi-step reasoning. xERTE39 achieves explainable link prediction using subgraph reasoning and temporal attention mechanism. RE-GCN40 combines recursive graph convolutions and static graph constraints to learn entity relationship evolution. TiRGN41 captures dynamic characteristics through local and global patterns. HiSMatch42 combines dual-structure encoders to match historical trends. RETIA43 jointly models dynamic relationships with twin hyper-relation subgraphs and interaction modules. BH-TDEN44 uses a Bayesian hyper-network to model temporal uncertainty. Path reasoning methods such as TiPNN45 reason through query paths, but perform poorly in sparse data environments. In addition, in terms of modeling the continuity of time, TANGO13 uses neural ordinary differential equations to capture continuous dynamics. TARGAT14 jointly models time and relationships through time-aware matrices, but they still have shortcomings in modeling latent temporal relationships. ERSP46 uses explicit similarity metrics to extract features between entity relationships and static attributes, but ignores implicit semantic associations. \(\text {L}^2\text {TKG}\)47 combines explicit features and latent relationship learning, but lacks dynamic global modeling of temporal dependencies. In recent years, the application of contrastive learning in TKGs has received widespread attention, providing a reference for this article. For example, CENET48 combines historical and non-historical dependency relationships, and LogCL9 fuses local and global information through contrastive learning to cope with noise interference, but they all lack a dedicated noise processing mechanism, which limits performance improvement.

To overcome the problems of the above-mentioned existing models, this paper proposes the DNCL model, which combines a noise-aware mechanism with contrastive learning and adopts a multi-dimensional gated update module. It effectively addresses the challenges faced by existing models, namely the difficulty of capturing long-distance dependencies in information-sparse environments, noise interference, and the difficulty of modeling latent connections between temporal relationships.

Preliminaries

In this section, the background knowledge of the TKG, the TKGR task, and contrastive learning is explained. The main mathematical symbols used in the DNCL model and their meanings are detailed in Table 1.

Table 1 Primary notations and descriptions.

Background of temporal knowledge graph

Definition 1

(Temporal knowledge graph). A TKG, denoted as \(\mathscr {G}\), is a form of knowledge representation that integrates temporal information directly into the entity-relationship framework. It comprises a sequence of KG snapshots, expressed as \(\mathscr {G} = \{\mathscr {G}_{1}, \mathscr {G}_{2}, \ldots , \mathscr {G}_{T}, \ldots \}\). Each snapshot \(\mathscr {G}_{t}\) at a given timestamp t is defined as \(\mathscr {G}_{t} = (\mathscr {N}, \mathscr {R}, \mathscr {M}_{t})\), where \(\mathscr {N}\) represents the set of entities, \(\mathscr {R}\) denotes the set of relations, and \(\mathscr {M}_{t}\) is the set of facts corresponding to timestamp t. Each fact in \(\mathscr {M}_{t}\) is a quadruple \((n_s, r, n_o, t)\), where \(n_s \in \mathscr {N}\) is the subject entity, \(n_o \in \mathscr {N}\) is the object entity, \(r \in \mathscr {R}\) specifies the relationship between entities at time t, and \(t \in \mathscr {T}\) indicates that t belongs to the set of timestamps \(\mathscr {T}\).

Formulation of the temporal knowledge graph entity reasoning task

Definition 2

(Temporal knowledge graph reasoning.) TKGR aims to use historical knowledge graph sequences \(\mathscr {G}=\{\mathscr {G}_{1},\mathscr {G}_{2},...,\mathscr {G}_{t}\}\) to predict missing object entities \((n_s,r,?,t+1)\) or subject entities \((?,r,n_o,t+1)\) in future queries. For each quadruple \((n_s,r,n_o,t)\), this paper adds an inverse-relation quadruple \((n_o,r^{-1},n_s,t)\) to the dataset. Therefore, when predicting the subject entity \((?, r, n_o, t+1)\), the task can be reformulated as an object entity prediction query \((n_o, r^{-1}, ?, t+1)\). Consequently, this study primarily concentrates on predicting the object entity.

Contrastive learning

Contrastive learning is a self-supervised approach that enhances the representation ability of a model by constructing positive and negative sample pairs49. Due to its excellent performance in improving representation discrimination and model robustness, it has been widely used in the field of KGs in recent years. Its core idea is to optimize a contrastive loss function so that different views of the same instance are closer in the embedding space, while representations of different instances are pushed farther apart, thereby improving the discrimination ability of the model49. In self-supervised contrastive learning, data augmentation is performed on a mini-batch of N randomly selected instances to generate a positive sample pair for each instance, with all other instances treated as negative samples. For a given positive sample pair (p, q), the contrastive loss function is defined as follows:

$$\begin{aligned}&\mathscr {L}_{p,q}=-\log \frac{\exp (\textbf{y}_p\cdot \textbf{y}_q/\tau )}{\sum _{s=1,s\ne p}^{2N}\exp (\textbf{y}_p\cdot \textbf{y}_s/\tau )}, \end{aligned}$$
(1)

where 2N represents the total number of samples in the mini-batch, comprising the original N samples and an additional N samples generated through data augmentation; \(\textbf{y}_p\), \(\textbf{y}_q\), and \(\textbf{y}_s\) represent the embedding vectors of samples p, q, and s; \(\tau\) represents the temperature parameter; and \(\cdot\) represents the dot product operation.
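
As a concrete illustration, the snippet below is a minimal PyTorch sketch of this loss for a single positive pair; the tensor layout (a (2N, d) matrix of embeddings) and the helper name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(y: torch.Tensor, p: int, q: int, tau: float = 0.1) -> torch.Tensor:
    """Eq. (1): loss for the positive pair (p, q) within a mini-batch of 2N embeddings."""
    sims = y[p] @ y.t() / tau                   # dot products y_p . y_s / tau for all s
    mask = torch.ones(y.size(0), dtype=torch.bool)
    mask[p] = False                             # exclude s == p from the denominator
    return -torch.log(torch.exp(sims[q]) / torch.exp(sims[mask]).sum())

# Example: 2N = 8 samples with 16-dimensional normalized embeddings.
y = F.normalize(torch.randn(8, 16), dim=-1)
loss = pairwise_contrastive_loss(y, p=0, q=4)
```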

The proposed model

This section first gives a comprehensive overview of the proposed DNCL model, then elaborates on the three core modules of the model and their specific design, and introduces the model training and inference process in detail.

Model overview

Figure 2 shows the overall framework of the DNCL model, which mainly consists of three parts: a multi-dimensional gated update module, a noise-aware adversarial modeling module, and a multi-layer embedding contrastive learning module. Specifically, the model first dynamically fuses the entity and relation historical embedding matrices input by TKG through the dual-gate mechanism of the multi-dimensional gated update module, flexibly extracts key information from historical time series data, and suppresses redundant or irrelevant information, thereby effectively alleviating the information sparsity problem and enhancing the modeling ability of long-distance dependencies. Subsequently, the noise-aware adversarial modeling module uses the adversarial training mechanism constructed by the noise generator and the discriminator to enhance the robustness of the model in a noisy environment and effectively reduce the negative impact of noise on the reasoning results. Moreover, the multi-layer embedding contrastive learning module further improves the semantic expression ability of the embedding representation by combining intra-layer and inter-layer contrastive learning strategies, and better captures the potential complex temporal relationships in the time series knowledge graph. The following sections will discuss the design and implementation details of each module in detail.

Fig. 2

An overall illustrative diagram of the proposed DNCL model. The model consists of three main modules: the multi-dimensional gated update module, the noise-aware adversarial modeling module, and the multi-layer embedding contrastive learning module.

Multi-dimensional gated update module

Capturing long-distance dependencies in a sparse data environment is one of the core challenges of TKGR. Since events in KGs are sparse and unevenly distributed, long-distance historical information may be missing or not updated. Traditional recurrent neural networks50 are affected by gradient vanishing and are difficult to effectively transmit long-distance information. In addition, they lack a filtering mechanism and are prone to introducing irrelevant information, which reduces the reasoning effect. Although gate mechanisms51 such as GRU and LSTM can alleviate gradient vanishing, their fixed state update method causes the model to tend to focus on recent information and ignore long-distance dependencies. Especially in an environment with a large time span and uneven data, it is difficult for the model to stably capture key historical information, thus affecting the prediction performance. For example, when predicting the diplomatic relations between the United States and Iran in 2025 in the ICEWS14 dataset, the model should not only consider recent events, such as the economic sanctions in 2023, but more importantly, distant historical events, such as the nuclear agreement in 2014. These distant events may have a more profound impact on future relations. However, traditional gate models tend to focus on recent information, making it difficult to effectively identify and retain key long-distance dependencies, reducing the accuracy of reasoning.

To effectively address the above challenges, we propose a multi-dimensional gated update module, which introduces a dual-gate mechanism consisting of a selection gate and an update gate in the horizontal and vertical dimensions to enhance the modeling capability of long-distance dependencies in sparse environments. The selection gate models the long-distance dependencies of entities and the evolutionary patterns of relations through row and column slices, dynamically calculates the relevance of historical events to the current prediction task, accurately screens key events, and suppresses short-term noise interference. For example, when predicting US-Iran relations in 2025, the selection gate can dynamically give a higher weight to the 2014 nuclear agreement while reducing the impact of the short-term economic sanctions of 2023. On this basis, the update gate further integrates the filtered long-distance historical information with the current embedding state to ensure that key long-distance information is not overwritten by short-distance events, thereby improving the stability and accuracy of long-distance reasoning. In addition, this module combines an attention mechanism to dynamically compute a weighted embedding matrix, strengthening the model's ability to robustly model long-distance historical events in complex, sparse data environments.

The module selects and updates the historical entity and relation embedding matrices in the horizontal and vertical dimensions, respectively. In order to accumulate historical information from scratch in a sparse environment, the module first initializes the input historical entity and relation embedding matrices as zero tensors \(\textbf{H}_0,\textbf{R}_0=\textbf{0}\). Then, the time weights of the current time step are dynamically generated to capture the periodic or dynamic changes of the time series and concatenated with the input information of the current time step to complete the dynamic modeling of time. The specific process is as follows:

$$\begin{aligned}&\textbf{h}_t=cos(\textbf{W}_t\cdot t+\tilde{\textbf{b}_t}), \end{aligned}$$
(2)
$$\begin{aligned}&\quad \textbf{X}_t^{\prime }=concat(\textbf{X}_t,\textbf{h}_t), \end{aligned}$$
(3)

where \(\textbf{h}_t\) represents the dynamic time weight at timestamp t; \(\textbf{X}_t\) represents the input information embedding of the current time step; and \(\textbf{X}_t^{\prime }\) is the concatenated input embedding, which is used for the fusion of subsequent gated mechanisms; \(\textbf{W}_t\in \mathbb {R}^{d_h\times 1}\) represents the time step weight matrix, \(d_h\) represents the embedding dimension; and \(\tilde{\textbf{b}}_t\in \mathbb {R}^{d_h}\) is the learnable parameter time bias vector.
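
For clarity, a minimal PyTorch sketch of Eqs. (2)-(3) is given below; the module name and the choice to broadcast the time weight across all rows of \(\textbf{X}_t\) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TimeAwareInput(nn.Module):
    """Cosine time encoding (Eq. 2) concatenated with the current input (Eq. 3)."""
    def __init__(self, d_h: int):
        super().__init__()
        self.w_t = nn.Parameter(torch.randn(d_h, 1))   # W_t in R^{d_h x 1}
        self.b_t = nn.Parameter(torch.zeros(d_h))      # learnable time bias vector

    def forward(self, x_t: torch.Tensor, t: float) -> torch.Tensor:
        h_t = torch.cos(self.w_t.squeeze(-1) * t + self.b_t)   # Eq. (2)
        h_t = h_t.expand(x_t.size(0), -1)                      # broadcast over all rows of X_t
        return torch.cat([x_t, h_t], dim=-1)                   # Eq. (3)

# Example: 100 entities with embedding dimension d_h = 200 at timestamp t = 5.
x_t_prime = TimeAwareInput(200)(torch.randn(100, 200), t=5)    # shape (100, 400)
```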

Subsequently, the historical information is fused with the current information through the selection gate in the dual-gate mechanism and updated. The process first concatenates the multi-dimensional input information and then generates the gated input through a linear transformation. The specific formulas are as follows:

$$\begin{aligned}&\textbf{Z}_t=[\textbf{H}_{t-1};\textbf{R}_{t-1};\textbf{X}_t^{\prime }], \end{aligned}$$
(4)
$$\begin{aligned}&\quad \textbf{G}_t=\textbf{W}\cdot \textbf{Z}_t+\tilde{\textbf{b}}, \end{aligned}$$
(5)

where \(\textbf{H}_{t-1}\) and \(\textbf{R}_{t-1}\) represent the historical entity and relation embedding matrices at timestamp \(t-1\), respectively; \([\,\cdot \,;\,\cdot \,]\) represents the tensor concatenation operation; \(\textbf{Z}_t\) represents the multi-dimensional input of the current time step, which combines the historical entity embedding, the historical relation embedding, and the current time input embedding; \(\textbf{G}_t\) represents the gated input matrix; and \(\textbf{W}\) and \(\tilde{\textbf{b}}\) are learnable parameters, namely the linear transformation weight matrix and bias vector. After that, the gated input is sliced by the split operation of the selection gate and activated using the Sigmoid52 and Tanh53 functions. The specific formulas are as follows:

$$\begin{aligned}&\textbf{G}_t^\sigma =\sigma (split(\textbf{G}_\textbf{t},4\cdot d_h)), \end{aligned}$$
(6)
$$\begin{aligned}&\quad \textbf{G}_t^{tanh}=tanh(split(\textbf{G}_\textbf{t},2\cdot d_h)), \end{aligned}$$
(7)

where \(split(\cdot )\) represents the slicing of the matrix; \(\sigma (\cdot )\) represents the Sigmoid activation function; \(tanh(\cdot )\) represents the Tanh activation function; \(\textbf{G}_t^\sigma\) represents the part that generates the update and output gates; and \(\textbf{G}_{t}^{\tanh }\) represents the part that generates the input gates. Next, the gate matrices are obtained by partitioning \(\textbf{G}_t^\sigma\) and \(\textbf{G}_{t}^{\tanh }\), as follows:

$$\begin{aligned}&\textbf{U}_r,\textbf{O}_r,\textbf{U}_c,\textbf{O}_c=chunk(\textbf{G}_t^\sigma ,4), \end{aligned}$$
(8)
$$\begin{aligned}&\quad \textbf{I}_r,\textbf{I}_c=chunk(\textbf{G}_t^{\tanh },2), \end{aligned}$$
(9)

where \(chunk(\cdot )\) represents matrix segmentation; \(\textbf{U}_r\in \mathbb {R}^{\left| \mathscr {N}\right| \times d_h}\) and \(\textbf{U}_c\in \mathbb {R}^{\left| \mathscr {R}\right| \times d_h}\) represent the update gates of entities and relations, respectively, which control the fusion ratio of historical and current information, where \(\left| \mathscr {N}\right|\) and \(\left| \mathscr {R}\right|\) are the numbers of entities and relations; \(\textbf{O}_r\in \mathbb {R}^{\left| \mathscr {N}\right| \times d_h}\) and \(\textbf{O}_c\in \mathbb {R}^{|\mathscr {R}|\times d_h}\) represent the output gates of entities and relations, which adjust the output amplitude; and \(\textbf{I}_r\in \mathbb {R}^{\left| \mathscr {N}\right| \times d_h}\) and \(\textbf{I}_{c}\in \mathbb {R}^{|\mathscr {R}|\times d_{h}}\) represent the input gates of entities and relations, which apply a nonlinear activation to the current input. The update gates obtained by segmentation then update the historical embedding matrices of entities and relations, effectively suppressing irrelevant or redundant information and retaining key information. The update process is as follows:

$$\begin{aligned}&\textbf{H}_t=tanh((1-\textbf{U}_r)\odot \textbf{H}_{t-1}+\textbf{U}_r\odot \textbf{I}_r)\odot \textbf{O}_r, \end{aligned}$$
(10)
$$\begin{aligned}&\quad \textbf{R}_t=tanh((1-\textbf{U}_c)\odot \textbf{R}_{t-1}+\textbf{U}_c\odot \textbf{I}_c)\odot \textbf{O}_c, \end{aligned}$$
(11)

where \(\textbf{H}_t\) and \(\textbf{R}_t\) represent the updated entity and relation embedding matrices at timestamp t, respectively, and \(\odot\) represents element-wise multiplication. In this process, as the time step t increases, the cumulative products of \((1-\textbf{U}_r)\) and \((1-\textbf{U}_c)\) avoid exponential decay of the gradient and thus preserve long-distance dependency information.
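
The following PyTorch sketch summarizes the gating pipeline of Eqs. (4)-(11). For readability it assumes the entity, relation, and time-aware input matrices are row-aligned and that the linear projection in Eq. (5) produces 6·\(d_h\) features (4·\(d_h\) for the sigmoid gates and 2·\(d_h\) for the tanh gates); these layout choices are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DualGateUpdate(nn.Module):
    """Selection/update gating over entity and relation embeddings, Eqs. (4)-(11)."""
    def __init__(self, d_in: int, d_h: int):
        super().__init__()
        self.proj = nn.Linear(d_in, 6 * d_h)     # Eq. (5): W . Z_t + b
        self.d_h = d_h

    def forward(self, h_prev, r_prev, x_t_prime):
        z_t = torch.cat([h_prev, r_prev, x_t_prime], dim=-1)      # Eq. (4)
        g_t = self.proj(z_t)                                      # Eq. (5)
        g_sigma = torch.sigmoid(g_t[..., :4 * self.d_h])          # Eq. (6)
        g_tanh = torch.tanh(g_t[..., 4 * self.d_h:])              # Eq. (7)
        u_r, o_r, u_c, o_c = g_sigma.chunk(4, dim=-1)             # Eq. (8)
        i_r, i_c = g_tanh.chunk(2, dim=-1)                        # Eq. (9)
        h_t = torch.tanh((1 - u_r) * h_prev + u_r * i_r) * o_r    # Eq. (10)
        r_t = torch.tanh((1 - u_c) * r_prev + u_c * i_c) * o_c    # Eq. (11)
        return h_t, r_t

# Example: zero-initialized histories and a time-aware input of width 2*d_h from Eq. (3).
d_h = 200
gate = DualGateUpdate(d_in=4 * d_h, d_h=d_h)
h_t, r_t = gate(torch.zeros(50, d_h), torch.zeros(50, d_h), torch.randn(50, 2 * d_h))
```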

In order to capture the global long-distance dependency information of entities and relations in the time series, the embeddings from all time steps are accumulated to generate the global entity and relation embedding representations. The accumulation process is as follows:

$$\begin{aligned}&\textbf{H}^{mdgu}=\sum _{t=1}^{m}\textbf{H}_t, \end{aligned}$$
(12)
$$\begin{aligned}&\quad \textbf{R}^{mdgu}=\sum _{t=1}^{m}\textbf{R}_t, \end{aligned}$$
(13)

where m is the total length of historical time steps, and \(\textbf{H}^{mdgu}\), \(\textbf{R}^{mdgu}\) represent the entity and relation embedding matrices accumulated at all time steps, respectively.

Moreover, in order to strengthen long-distance dependency modeling, an attention-weighted embedding matrix is computed by an attention mechanism along the horizontal dimension based on the updated entity embedding matrix. The calculation formula is as follows:

$$\begin{aligned}&\textbf{A}_t=Softmax\left( \frac{\textbf{Q}_t\cdot \textbf{K}_t^\top }{\sqrt{d_k}}\right) =Softmax\left( \frac{(\textbf{W}_q\cdot \textbf{H}_{t})\cdot (\textbf{W}_k\cdot \textbf{H}_{t})^\top }{\sqrt{d_k}}\right) , \end{aligned}$$
(14)
$$\begin{aligned}&\quad \textbf{H}_t^{att}=\textbf{A}_t\cdot \textbf{V}_t=\textbf{A}_t\cdot (\textbf{W}_v\cdot \textbf{H}_{t}), \end{aligned}$$
(15)

where \(Softmax(\cdot )\) normalizes the attention weights; \(\textbf{Q}_t\), \(\textbf{K}_t\) and \(\textbf{V}_t\) represent the query, key, and value matrices, respectively; \(\textbf{Q}_t\cdot \textbf{K}_t^\top\) gives the similarity scores between the query and the key, indicating the attention distribution among different entities; \(\textbf{A}_t\) represents the attention weight matrix; \(\textbf{H}_t^{att}\) represents the embedding matrix after attention weighting; \(\textbf{W}_q,\textbf{W}_k,\textbf{W}_v\in \mathbb {R}^{d_h\times d_k}\) are trainable parameters, representing the weight matrices of the query, key, and value, respectively; and \(\sqrt{d_k}\) is the scaling factor that prevents the inner product values from becoming too large and degrading the Softmax normalization. In addition, the final embedding matrix \(\textbf{H}_t^{final}\) is the weighted sum of the attention-weighted embedding matrix and the entity embedding matrix of the current time step. The attention-weighted embedding captures global information, while the entity embedding at the current time step preserves local dynamic features. The specific calculation formula is as follows:

$$\begin{aligned} \textbf{H}_t^{final}=LayerNorm\left( \beta \cdot \textbf{H}_t^{att}+(1-\beta )\cdot \textbf{H}_t\right) , \end{aligned}$$
(16)

where \(LayerNorm(\cdot )\) represents the normalization of the fused embedding to stabilize the training process, and \(\beta\) is the fusion weight coefficient, which is used to control the balance between the attention embedding and the current embedding.
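
A compact sketch of Eqs. (14)-(16) is shown below. The value projection is kept at dimension \(d_h\) so that the residual fusion in Eq. (16) is dimensionally consistent, and \(\beta\) is treated as a fixed scalar; both are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Self-attention over entity rows (Eqs. 14-15) fused with the current embedding (Eq. 16)."""
    def __init__(self, d_h: int, d_k: int, beta: float = 0.5):
        super().__init__()
        self.w_q = nn.Linear(d_h, d_k, bias=False)   # W_q
        self.w_k = nn.Linear(d_h, d_k, bias=False)   # W_k
        self.w_v = nn.Linear(d_h, d_h, bias=False)   # W_v (output kept at d_h for the fusion)
        self.norm = nn.LayerNorm(d_h)
        self.beta = beta

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(h_t), self.w_k(h_t), self.w_v(h_t)
        a_t = F.softmax(q @ k.t() / (k.size(-1) ** 0.5), dim=-1)       # Eq. (14)
        h_att = a_t @ v                                                # Eq. (15)
        return self.norm(self.beta * h_att + (1 - self.beta) * h_t)   # Eq. (16)

h_final = AttentionFusion(d_h=200, d_k=64)(torch.randn(50, 200))
```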

Noise-aware adversarial modeling module

In the task of TKGR, data usually contains erroneous, missing or conflicting information, which causes the model to learn unreliable representations and reduce the accuracy and robustness of reasoning. Therefore, a mechanism that can perceive and resist noise interference is needed to improve the generalization ability and stability of the model in complex environments. To this end, we propose a noise-aware adversarial modeling module. This module consists of a noise generator and a noise discriminator, which aims to improve the robustness and generalization ability of the model in complex scenarios. The specific structures of the noise generator and the noise discriminator are shown in Fig. 3. The noise generator introduces random noise to simulate the interference in real scenarios and increase the uncertainty of the noise, thereby expanding the distribution range of the data and helping the model adapt to diverse input conditions. The noise discriminator takes guidance as the core, identifies and retains effective noise that is beneficial to model learning, and filters out meaningless interference to ensure the quality and pertinence of the generated noise. Through the adversarial modeling mechanism, the embedding can more accurately capture the noise characteristics, making the model more robust in dealing with complex noise environments, while greatly improving its reasoning ability for unseen data50. The specific implementation process of the noise generator, the noise discriminator and the noise-aware adversarial training is introduced in detail below.

Fig. 3

The architecture of the noise generator and the noise discriminator. \(d_i\), \(d_h\), \(d_o\), and 1 represent the number of neurons in each layer respectively; ReLU, Sigmoid, and Normalization respectively indicate that the layer uses the ReLU activation function, Sigmoid activation function, and normalization operation.

Noise generator

The noise generator adopts a two-layer fully connected neural network structure. It first draws random noise from a uniform distribution, \(\textbf{p}_\textrm{rand}\sim Uniform(0,1)\), as input with dimension \(d_i\). The first fully connected layer maps the input dimension \(d_i\) to the hidden dimension \(d_h\) and applies ReLU as the activation function54. The second fully connected layer maps the hidden dimension \(d_h\) to the output dimension \(d_o\), and the adversarial noise is finally generated through an L2 normalization operation. The specific generation process is as follows:

$$\begin{aligned} \parallel \delta \parallel _p=Normalize(\textbf{W}_2\cdot ReLU(\textbf{W}_1\cdot \textbf{p}_\textrm{rand}+\tilde{\textbf{b}}_1)+\tilde{\textbf{b}}_2), \end{aligned}$$
(17)

where \(\parallel \delta \parallel _p\) represents the adversarial noise generated by the generator; \(ReLU(\cdot )\) denotes the activation function, which introduces nonlinearity; \(Normalize(\cdot )\) denotes the normalization function; \(\textbf{W}_1\in \mathbb {R}^{d_\textrm{h}\times d_\textrm{i}}\) and \(\textbf{W}_2\in \mathbb {R}^{d_\textrm{o}\times d_\textrm{h}}\) are the fully connected layer weight matrices of the generator, where \(d_i\) and \(d_o\) represent the input and output dimensions of the noise generator, respectively; and \(\tilde{\textbf{b}}_1\in \mathbb {R}^{d_\textrm{h}}\) and \(\tilde{\textbf{b}}_2\in \mathbb {R}^{d_\textrm{o}}\) represent the bias vectors of the generator.

Noise discriminator

The structure of the noise discriminator also includes two layers of fully connected neural networks. Its input is a noise vector \(\parallel \delta \parallel _p\) of dimension \(d_i\). In the first fully connected layer, the input dimension \(d_i\) is transformed into \(d_h\) through full connection, and ReLU is applied as the activation function. Then, the hidden dimension \(d_h\) is further mapped to a scalar output through the second fully connected layer, and the effectiveness score of the noise is calculated through the Sigmoid activation function to determine whether the input adversarial noise is effective. The calculation formula is as follows:

$$\begin{aligned} v=sigmoid(\textbf{W}_4\cdot {ReLU}(\textbf{W}_3\cdot \parallel \delta \parallel _p+\tilde{\textbf{b}}_3)+\tilde{\textbf{b}}_4), \end{aligned}$$
(18)

where \(v\in [0,1]\) represents the validity score output by the discriminator; \(\textbf{W}_3\in \mathbb {R}^{d_\textrm{h}\times d_\textrm{i}}\) and \(\textbf{W}_4\in \mathbb {R}^{d_\textrm{o}\times d_\textrm{h}}\) represent the weight matrices of the fully connected layers of the discriminator, where \(d_i\) and \(d_o\) represent the input and output dimensions of the noise discriminator, respectively; and \(\tilde{\textbf{b}}_3\in \mathbb {R}^{d_\textrm{h}}\) and \(\tilde{\textbf{b}}_4\in \mathbb {R}^{d_\textrm{o}}\) represent the bias vectors of the discriminator.
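
The two networks can be sketched in PyTorch as follows; the layer sizes and the pairing of the discriminator's input dimension with the generator's output dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseGenerator(nn.Module):
    """Eq. (17): two fully connected layers with ReLU, followed by L2 normalization."""
    def __init__(self, d_i: int, d_h: int, d_o: int):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_i, d_h), nn.Linear(d_h, d_o)

    def forward(self, batch_size: int) -> torch.Tensor:
        p_rand = torch.rand(batch_size, self.fc1.in_features)    # Uniform(0, 1) input noise
        return F.normalize(self.fc2(F.relu(self.fc1(p_rand))), p=2, dim=-1)

class NoiseDiscriminator(nn.Module):
    """Eq. (18): maps a noise vector to a validity score v in [0, 1]."""
    def __init__(self, d_i: int, d_h: int):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_i, d_h), nn.Linear(d_h, 1)

    def forward(self, delta: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc2(F.relu(self.fc1(delta))))

gen = NoiseGenerator(d_i=64, d_h=128, d_o=200)
disc = NoiseDiscriminator(d_i=200, d_h=128)
delta = gen(batch_size=50)      # (50, 200) adversarial noise
v = disc(delta)                 # (50, 1) validity scores
```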

Noise-aware adversarial learning

This paper adopts an alternating optimization strategy similar to the classic generative adversarial network55 to train the noise generator and the discriminator. In each training iteration, the parameters of the discriminator are first fixed, and the generator is optimized so that the noise it generates deceives the discriminator as much as possible, that is, the following loss function is minimized:

$$\begin{aligned} L_G=-\mathbb {E}_{\parallel \delta \parallel _p\sim p_g(\parallel \delta \parallel _p)}[\log (D(\parallel \delta \parallel _p))], \end{aligned}$$
(19)

where \(L_{G}\) represents the loss function of the generator; \(\mathbb {E}_{\parallel \delta \parallel _p\sim p_g(\parallel \delta \parallel _p)}\) represents the expectation over noise samples \(\parallel \delta \parallel _p\) produced by the generator; \(p_g(\parallel \delta \parallel _p)\) represents the probability distribution of the generated noise; and \(D(\parallel \delta \parallel _p)\) represents the probability the discriminator assigns to the noise \(\parallel \delta \parallel _p\) being valid. Then the parameters of the generator are fixed and the discriminator is optimized so that it can more accurately distinguish between real noise and generated noise, that is, minimize the following loss:

$$\begin{aligned} L_D=-\mathbb {E}_{\parallel \delta \parallel _p\sim p_{r}}[\log (D(\parallel \delta \parallel _p))]-\mathbb {E}_{\parallel \delta \parallel _p\sim p_{g}}[\log (1-D(\parallel \delta \parallel _p))], \end{aligned}$$
(20)

where \(L_D\) represents the loss function of the discriminator; \(\mathbb {E}_{\parallel \delta \parallel _p\sim p_{r}}\) represents the expectation over real noise samples; and \(p_{r}(\parallel \delta \parallel _p)\) represents the probability distribution of the real noise. The generator and discriminator are optimized alternately, once each per iteration, until the loss functions converge.
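
A minimal alternating-optimization loop for Eqs. (19)-(20) is sketched below, reusing the `gen` and `disc` modules from the previous sketch; the optimizer choice, learning rates, and the stand-in `real_noise` samples are assumptions for illustration.

```python
import torch

opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

for _ in range(100):                                        # adversarial iterations
    # Step 1: fix the discriminator and update the generator (Eq. 19).
    fake = gen(batch_size=50)
    loss_g = -torch.log(disc(fake) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Step 2: fix the generator and update the discriminator (Eq. 20).
    real_noise = torch.randn(50, fake.size(-1))             # stand-in for "real" noise samples
    loss_d = (-torch.log(disc(real_noise) + 1e-8)
              - torch.log(1 - disc(gen(batch_size=50).detach()) + 1e-8)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```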

Based on the adversarial network strategy, the random noise is fused with the adversarial noise produced by the generator, weighted by the validity score output by the discriminator. The fused noise is used to update the entity and relation embeddings, producing entity and relation embedding matrices with adversarial noise. The specific update process is as follows:

$$\begin{aligned}&\textbf{H}_t^{noise}=\textbf{H}_t+\epsilon \varvec{\cdot }{(v\cdot \parallel \delta \parallel _p+(1-v)\cdot \textbf{p}_{\textrm{rand}})}, \end{aligned}$$
(21)
$$\begin{aligned}&\quad \textbf{R}_t^{noise}=\textbf{R}_t+\epsilon \varvec{\cdot }{(v\cdot \parallel \delta \parallel _p+(1-v)\cdot \textbf{p}_{\textrm{rand}})}, \end{aligned}$$
(22)

where \(v\cdot \Vert \delta \Vert _p+(1-v)\cdot \textbf{p}_{\textrm{rand}}\) represents the adversarial noise after the adversarial strategy fusion; \(\epsilon\) represents the noise intensity control parameter; moreover \(\textbf{H}_{t}^{noise}\) and \(\textbf{R}_{t}^{noise}\) represent the entity and relation embedding matrices after adding adversarial noise, respectively.

Multi-layer embedding contrastive learning module

In the multi-layer embedding contrastive learning module, considering the sparsity and time sensitivity of TKGs data, we did not use traditional explicit data augmentation methods such as random perturbations to avoid destroying the integrity of temporal information. Instead, we designed an implicit enhancement strategy based on temporal features to improve data diversity through adversarial noise and perturbation embedding. In terms of sampling strategy, combined with the dynamic characteristics of TKGs, positive and negative samples are constructed based on embedding perturbations in the time dimension and embedding comparisons across layers and time steps to better capture the potential dynamic relationships in time series data.

The multi-layer embedding contrastive learning module captures the dynamic changes of short-term and global time through two contrastive learning strategies: intra-layer and inter-layer, respectively, to enhance the representation ability of the model, and better mine the potential relationships in TKG, thus improving the model’s ability to model temporal relationships. The calculation formula of the global entity and relation embedding matrix with adversarial noise is as follows:

$$\begin{aligned}&\textbf{H}^\textrm{global}=\frac{1}{\left| \mathscr {T}\right| }\sum _{t=1}^{\left| \mathscr {T}\right| }\textbf{H}_t^\textrm{noise}, \end{aligned}$$
(23)
$$\begin{aligned}&\quad \textbf{R}^\textrm{global}=\frac{1}{\left| \mathscr {T}\right| }\sum _{t=1}^{\left| \mathscr {T}\right| }\textbf{R}_t^\textrm{noise}, \end{aligned}$$
(24)

where \(\left| \mathscr {T}\right|\) represents the total number of timestamps in TKG. In our work, all embedding matrices involved in contrastive learning are first normalized to ensure fair similarity comparisons between embeddings at different layers. The following provides a detailed introduction to the specific implementation of the intra-layer and inter-layer contrastive learning methods.

Intra-layer contrastive learning

Intra-layer contrastive learning aims to enhance the ability of the model to model temporal latent relationships by optimizing the entity and relation embeddings within the same layer. Specifically, within the same layer, entity and relation embeddings with adversarial noise are used to calculate the cosine similarity between them and the embeddings from the previous time step, which also contain adversarial noise, for contrastive learning. By leveraging the distribution relationship between positive and negative samples, the embedding representations are further refined, enabling the model to more accurately capture the dynamic changes in time series data. Contrastive learning analysis is initially conducted on entity embeddings with adversarial noise. The process of calculating the positive sample score is as follows:

$$\begin{aligned} S_{intra-pos}^H=ReLU\left( \frac{\textbf{H}_{t,l}^{noise}\cdot (\textbf{H}_{t-1,l}^{noise})^\top }{\tau }\right) , \end{aligned}$$
(25)

where l represents the current layer number; \(\textbf{H}_{t,l}^{noise}\) and \(\textbf{H}_{t-1,l}^{noise}\) represent the entity embedding matrix with adversarial noise in the current time step and the previous time step in the \(l^{th}\) layer respectively; and \(S_{intra-pos}^H\) represents the entity positive sample score compared within the layer. The process of calculating the negative sample score is as follows:

$$\begin{aligned} S_{intra-neg}^H=ReLU\left( \frac{\textbf{H}_{t,l}^{noise}\cdot (\textbf{H}_l^{global})^\top }{\tau }\right) , \end{aligned}$$
(26)

where \(\textbf{H}_l^{global}\) represents the global entity embedding matrix with adversarial noise at the \(l^{th}\) layer, and \(S_{intra-neg}^H\) represents the negative sample score of the intra-layer contrast. Therefore, the intra-layer contrastive learning loss of the entity embedding with adversarial noise can be calculated using the positive and negative sample scores. The specific calculation process is as follows:

$$\begin{aligned} \mathscr {L}_{intra-H}=-\sum _{i=1}^{|\mathscr {N}_t|}\log \frac{S_{intra-pos,i}^H}{\sum _{k=1}^{|\mathscr {N}_t|}S_{intra-neg,k}^H}, \end{aligned}$$
(27)

where i and k represent a positive and negative sample index selected in the set, respectively, and \(|\mathscr {N}_t|\) represents the total number of entities at timestamp t. Similarly, we can use the above method to perform contrastive learning on relation embedding with adversarial noise, thereby obtaining the intra-layer contrastive learning loss of relation embedding with adversarial noise \(\mathscr {L}_{intra-R}\).
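
Under one reading of Eqs. (25)-(27), in which the positive score for entity i is the diagonal similarity with its previous-step embedding and the negative scores are summed per row, the entity-side intra-layer loss can be sketched as follows; the normalization step and the small epsilon guard are assumptions added for numerical safety.

```python
import torch
import torch.nn.functional as F

def intra_layer_entity_loss(h_t_l, h_prev_l, h_global_l, tau: float = 0.03) -> torch.Tensor:
    """Entity-side intra-layer contrastive loss, Eqs. (25)-(27)."""
    # Normalize so the scaled dot products behave like cosine similarities.
    h_t_l, h_prev_l, h_global_l = (F.normalize(x, dim=-1) for x in (h_t_l, h_prev_l, h_global_l))
    s_pos = F.relu(h_t_l @ h_prev_l.t() / tau)       # Eq. (25)
    s_neg = F.relu(h_t_l @ h_global_l.t() / tau)     # Eq. (26)
    eps = 1e-8                                       # guard against log(0)
    return -torch.log((s_pos.diagonal() + eps) / (s_neg.sum(dim=-1) + eps)).sum()   # Eq. (27)

loss = intra_layer_entity_loss(torch.randn(50, 200), torch.randn(50, 200), torch.randn(50, 200))
```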

Inter-layer contrastive learning

Inter-layer contrastive learning aims to capture multi-scale semantic information in the time series by comparing entity and relation embeddings with adversarial noise at different layers (such as the bottom and top layers), and enhances the model's sensitivity to changes in latent temporal information, thereby addressing the problem of capturing latent temporal relationships in the TKG. Specifically, the strategy first calculates the cosine similarity between the current-layer embedding and the last-layer embedding to obtain the positive sample score, and then calculates the similarity between the current-layer embedding and the global embedding to obtain the negative sample score. Next, taking entity embeddings with adversarial noise as an example, the positive sample score is calculated as follows:

$$\begin{aligned} S_{inter-pos}^H=ReLU\left( \frac{\textbf{H}_{t,l}^{noise} \cdot (\textbf{H}_{t-1,L}^{noise})^\top }{\tau }\right) , \end{aligned}$$
(28)

where L represents the total number of layers and also denotes the last layer; \(\textbf{H}_{t-1,L}^{noise}\) represents the entity embedding matrix with adversarial noise at the previous time step in the last layer L; and \(S_{inter-pos}^H\) represents the entity positive sample score of the inter-layer comparison. The process of calculating the negative sample score is as follows:

$$\begin{aligned} S_{inter-neg}^H=ReLU\left( \frac{\textbf{H}_{t,l}^{noise}\cdot (\textbf{H}_L^{global})^\top }{\tau }\right) , \end{aligned}$$
(29)

where \(\textbf{H}_L^{global}\) represents the global entity embedding matrix with adversarial noise in the last layer L, and \(S_{inter-neg}^H\) represents the negative sample score of the inter-layer contrast. Then, the inter-layer contrastive learning loss with adversarial noise entity embedding is calculated based on the distribution relationship between positive and negative samples:

$$\begin{aligned} \mathscr {L}_{inter-H}=-\sum _{i=1}^{|\mathscr {N}_t|}\log \frac{S_{inter-pos,i}^H}{\sum _{k=1}^{|\mathscr {N}_t|}S_{inter-neg,k}^H}. \end{aligned}$$
(30)

Similarly, the same procedure is applied to the relation embeddings with adversarial noise to obtain the inter-layer contrastive learning loss \(\mathscr {L}_{inter-R}\). The overall contrastive loss is then defined as:

$$\begin{aligned} \mathscr {L}_{cl}=\lambda (\mathscr {L}_{intra-H}+\mathscr {L}_{intra-R} +\mathscr {L}_\text {inter-H}+\mathscr {L}_\text {inter-R}), \end{aligned}$$
(31)

where \(\lambda\) represents the weight factor, which is a hyperparameter used to adjust the contribution of contrastive learning loss in the overall loss.

Model training and inference

Previous studies25 have shown that the combination of convolutional scoring functions and graph convolutional networks performs well in KG reasoning. At the same time, ConvTransE24 has become a widely used scoring function due to its excellent performance in TKGR tasks. Therefore, our study chooses ConvTransE as the decoder to achieve entity prediction. The calculation formula of its entity prediction score \(\textbf{P}_{score}^{\mathscr {N}}\) is as follows:

$$\begin{aligned} \textbf{P}_{score}^\mathscr {N}=\sigma (\textbf{H}^{mdgu}\cdot {ConvTransE}(\textbf{n}_{s,t},\textbf{r}_t)+\textbf{H}_t^{final}\cdot {ConvTransE}(\textbf{n}_{s,t},\textbf{r}_t)), \end{aligned}$$
(32)

where \(\textbf{n}_{s,t}\) and \(\textbf{r}_t\) represent the embeddings of the subject entity \(n_s\) and relation r taken from \(\textbf{H}^{mdgu}\) and \(\textbf{H}_t^{final}\). The model performs a multi-label learning task, where each label represents a possible entity or relation. Next, the entity prediction loss function \(\mathscr {L}_{\mathscr {N}}\) is defined as:

$$\begin{aligned} \mathscr {L}_{\mathscr {N}}=\sum _{t=1}^{\left| \mathscr {T}\right| }\sum _{(n_s,r,n_o,t)\in \mathscr {M}_t}\sum _{n\in \mathscr {N}}z_t^n\log \textbf{P}_{score}^\mathscr {N}(n_s,r,n_o,t), \end{aligned}$$
(33)

where \(\textbf{P}_{score}^\mathscr {N}(n_s,r,n_o,t)\) is the predicted probability score of the entity, and \(\textbf{z}_t^n\in \mathbb {R}^{|{\mathscr {N}}|}\) is the label vector whose element is 1 if the corresponding fact occurs and 0 otherwise. Therefore, the final loss function is defined as:

$$\begin{aligned} \mathscr {L}=\mathscr {L}_{\mathscr {N}}+\mathscr {L}_{cl}. \end{aligned}$$
(34)
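
A schematic view of how the decoder scores and the training objective fit together is given below. The two `ConvTransE` calls are represented by pre-computed query vectors, and the loss follows the multi-label cross-entropy form of Eq. (33); both simplifications are assumptions for illustration, not the authors' decoder code.

```python
import torch

def entity_scores(h_mdgu, h_final, query_global, query_local):
    """Eq. (32): combine global and current-step decoder outputs and apply the sigmoid."""
    return torch.sigmoid(query_global @ h_mdgu.t() + query_local @ h_final.t())

def entity_loss(scores, labels):
    """Eq. (33): multi-label cross-entropy over the candidate entities."""
    return -(labels * torch.log(scores + 1e-8)).sum()

# Example: 4 queries scored against 50 candidate entities with d_h = 200.
scores = entity_scores(torch.randn(50, 200), torch.randn(50, 200),
                       torch.randn(4, 200), torch.randn(4, 200))
labels = torch.zeros(4, 50)
labels[torch.arange(4), torch.randint(0, 50, (4,))] = 1.0
total_loss = entity_loss(scores, labels)    # add the contrastive loss L_cl to obtain Eq. (34)
```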

Table 2 summarizes the key equations in the DNCL model and their purpose. In addition, the specific DNCL model reasoning process is shown in Algorithm 1.

Table 2 Summary of key equations and their purpose.
Algorithm 1

Reasoning process of DNCL

Complexity analysis

This section analyzes the complexity of the DNCL model in terms of its three modules. For the multi-dimensional gated update module, the time complexity is \(O(m\left| \mathscr {N}\right| d_h)\), where m is the time series length, \(\left| \mathscr {N}\right|\) is the number of entities, and \(d_h\) is the embedding dimension. The row and column slicing strategy flexibly selects key information, reducing global operation overhead and improving computational efficiency. For the noise-aware adversarial modeling module, the time complexity of adversarial training is \(O(n\left| \mathscr {N}\right| d_h)\), where n is the number of iterations of adversarial training. The noise generation and discrimination operations reduce noise interference and significantly improve model robustness at low cost. For the multi-layer embedding contrastive learning module, the combined time complexity of the intra-layer and inter-layer contrastive losses is \(O(L\left| \mathscr {N}\right| ^2d_h+L^2\left| \mathscr {N}\right| ^2d_h)\), where L is the number of model layers. Multi-level semantics are captured through intra-layer and inter-layer contrast, combined with noise embedding optimization, to improve the modeling of complex dynamic relationships. Therefore, the overall time complexity of the DNCL model, \(O(m\left| \mathscr {N}\right| d_h+n\left| \mathscr {N}\right| d_h+L\left| \mathscr {N}\right| ^2d_h+L^2\left| \mathscr {N}\right| ^2d_h)\), is the sum of the complexities of the three modules.

Experiments

This section evaluates the performance of the DNCL model through extensive experiments conducted on four datasets. A detailed analysis of the experimental results is provided below.

Datasets

This paper performs comprehensive experiments on four representative TKG datasets to evaluate the performance of our model. These datasets include: ICEWS1426, ICEWS1839, ICEWS05-1526, and GDELT56. The ICEWS14 dataset, sourced from the Integrated Crisis Early Warning System (ICEWS), primarily documents political events that occurred in 2014. ICEWS18, also from ICEWS, covers event data from January 1 to October 31, 2018, providing event records closer to the present. ICEWS05-15 is a long-term version of the ICEWS dataset, covering event information from 2005 to 2015, reflecting dynamic changes over a longer time span. GDELT is a global event dataset that records a variety of global events and covers a wide range. The detailed statistics of the dataset are listed in Table 3.

For dataset preprocessing, we first removed duplicate facts and samples with missing timestamps to ensure the accuracy and consistency of the training data. Then, to unify the time scale, we normalized all timestamps and mapped them to the interval [0,1], improving the comparability of data across different time ranges and the temporal generalization ability of the model. In addition, to preserve temporal causality within each dataset, we strictly divided all datasets into training, validation, and test sets in chronological order with proportions of 80%, 10%, and 10%, respectively, and ensured that the static knowledge graph information in all datasets was fully integrated to provide more stable entity and relation representations. To ensure consistency of the model input, we uniquely mapped all entity and relation IDs so that the same entity or relation keeps the same numerical representation across the entire dataset. Finally, in terms of the negative sampling strategy, we randomly sample a certain number of negative samples \((n_s^\prime ,r,n_o,t)\) or \((n_s,r,n_o^\prime ,t)\) for each positive sample \((n_s,r,n_o,t)\) to enhance the generalization and robustness of the model, where \(n_s^\prime\) and \(n_o^\prime\) represent corrupted subject and object entities, respectively.
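
The preprocessing steps described above can be summarized in the following hedged sketch; the function names, the tuple layout of quadruples, and the simple uniform negative sampler are assumptions for illustration.

```python
import random

def preprocess(quads):
    """Deduplicate, normalize timestamps to [0, 1], and split chronologically 80/10/10."""
    quads = sorted(set(quads), key=lambda q: q[3])
    t_min, t_max = quads[0][3], quads[-1][3]
    span = (t_max - t_min) or 1
    scaled = [(s, r, o, (t - t_min) / span) for s, r, o, t in quads]
    n = len(scaled)
    return scaled[:int(0.8 * n)], scaled[int(0.8 * n):int(0.9 * n)], scaled[int(0.9 * n):]

def corrupt(quad, entities):
    """Negative sampling: replace the subject or object entity with a random entity."""
    s, r, o, t = quad
    wrong = random.choice(entities)
    return (wrong, r, o, t) if random.random() < 0.5 else (s, r, wrong, t)
```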

Table 3 Statistics of the TKG datasets (\(\mathscr {D}_{train}\), \(\mathscr {D}_{valid}\), and \(\mathscr {D}_{test}\) are the numbers of facts in training, validation, and test sets; \(\Delta \textrm{t}\) represents time interval).

Evaluation metrics

To thoroughly assess the performance of the DNCL model on the TKGR task, we use mean reciprocal rank (MRR) and Hits@k (k=1,3,10) as evaluation metrics. MRR measures the ability of the model to rank the correct answer at the top, while Hits@k evaluates whether the correct answer appears within the top k results; higher values indicate stronger accuracy and robustness. In addition, our study uses a time-aware filtered setting for evaluation. Compared with traditional methods39,40, this setting only removes valid facts that occur at the same time as the query facts, which is more in line with the actual needs of temporal reasoning and improves the accuracy of the evaluation.

Baseline models

This paper presents comparative experiments of the DNCL model with 20 recent or classic KG reasoning methods, thoroughly validating its effectiveness on the TKGR task. The baseline models we consider include static KG reasoning models, interpolation TKG reasoning models, and extrapolation TKG reasoning models. The static KG reasoning models mainly include DistMult28, ComplEx30, ConvE23, Conv-TransE24 and RotatE57. The interpolation TKG reasoning models mainly include TTransE27, TA-DistMult26, DE-SimlE32 and TNTComplEx29. The extrapolation TKG reasoning models mainly include RE-NET38, xERTE39, TANGO13, RE-GCN40, TiRGN41, HiSMatch42, RETIA43, CENET48, \(\text {L}^2\text {TKG}\)47, LogCL9 and BH-TDEN44.

Implementation details

In training, we set the batch size to four times the number of samples per timestamp. We use the Adam optimizer58 with an initial learning rate of 0.001, leveraging its adaptive learning rate mechanism to improve training efficiency and convergence stability. All trainable parameters are initialized with Xavier uniform initialization, while entity and relation embeddings are initialized to zero tensors at the beginning of training to ensure a consistent initial state. To mitigate overfitting, we apply L2 regularization with a weight decay coefficient of \(1e^{-5}\) and a dropout rate of 0.2 in each network layer. We also implement an early stopping strategy that terminates training if the validation MRR does not improve for 10 consecutive epochs. For hyperparameter tuning, we perform a grid search. Specifically, the embedding dimension \(d_h\) is selected from \(\{100,200,300,400,500\}\), with 200 performing best on all datasets. The history KG length m is selected from \(\{1,2,3,...,10\}\); the optimal sequence lengths for ICEWS14, ICEWS05-15, ICEWS18 and GDELT are 7, 9, 4 and 2, respectively. The temperature coefficient \(\tau\) is optimized in the range \(\{0.01,0.03,...,0.09,0.1,0.3,...,0.9\}\), with an optimal value of 0.03 for ICEWS14 and ICEWS18 and 0.07 for ICEWS05-15 and GDELT. The noise intensity control parameter \(\epsilon\) is uniformly set to 0.15 to ensure robustness under noisy conditions. In the decoder, we use 50 convolution kernels with a kernel size of 2\(\times\)3 and a dropout rate of 0.2. In addition, static KG information is fully integrated into each dataset, with the prediction weight set to 0.9. All experiments were performed on a system equipped with 3 NVIDIA Tesla V100 GPUs to facilitate faster processing and large-scale experiments.

Results of prediction

The performance results of our proposed DNCL model, along with comparison models, for entity reasoning prediction are presented in Table 4. As shown, the DNCL model outperforms current state-of-the-art methods on most metrics across the four datasets, demonstrating its effectiveness in tackling TKGR tasks. Specifically, compared with the second-best results in Table 4, the DNCL model improves MRR by 4.73%, 2.77% and 4.93% on the ICEWS14, ICEWS05-15 and ICEWS18 datasets, respectively. For the Hits@1/10 metrics, DNCL improves on all four datasets: the largest Hits@1 improvement is 6.91% on the ICEWS14 dataset, and the largest Hits@10 improvement is 3.93% on the ICEWS18 dataset. The GDELT dataset has significant noise and sparsity characteristics, which pose complex challenges for most models and limit their performance. However, the DNCL model addresses these challenges by mitigating the impact of noise through noise-aware modeling and screening out key information through the multi-dimensional gate mechanism, thereby improving its performance to a certain extent.

Table 4 Comparison results of the overall performance (percentage) of different methods on ICEWS14, ICEWS05-15, ICEWS18 and GDELT.

DNCL outperforms all static models (shown in the first part of Table 4) as well as the interpolation-based temporal models (shown in the second part of Table 4). This can be attributed to the fact that static models overlook temporal information and fail to capture the dynamic changes of entities and relations, which limits their effectiveness in time series tasks. Moreover, although interpolation models focus on completing missing historical facts, they do not model the evolution of entities and relations over time and therefore lack the ability to predict future unseen facts. In contrast, extrapolation models better address the reasoning challenges in temporal knowledge graphs by modeling the dynamic changes along the time dimension.

Compared with state-of-the-art extrapolation models, although both CENET and LogCL use contrastive learning to enhance performance, they fall short of DNCL because they neither model noise separately nor consider potential relations along the time dimension. Although \(\text {L}^2\text {TKG}\) uses structure encoders and latent relation learning modules to mine potential relations within and across timestamps, it relies mainly on the static graph structure of historical data, does not fully incorporate global temporal dependencies, and lacks modeling of the dynamic evolution of entities and relations, which limits its performance. In contrast, DNCL filters out key information through the multi-dimensional gated mechanism to address long-distance dependencies in information-sparse environments, improves robustness in complex noise environments through noise-aware adversarial modeling, and, by combining multi-layer embedding contrastive learning, captures the latent temporal information in TKGs more comprehensively and dynamically, improving performance across multiple tasks.

Ablation study

To investigate the contribution of each component in the DNCL model to its overall performance, we performed ablation experiments on the ICEWS14, ICEWS05-15, ICEWS18, and GDELT datasets, using MRR and Hits@1/3/10 as evaluation metrics. The results are summarized in Table 5.
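For reference, the following minimal sketch shows how MRR and Hits@k are conventionally computed from the rank of the ground-truth entity among all candidates (rank 1 is best); the example ranks are illustrative and not taken from our experiments.

```python
def mrr(ranks):
    """Mean reciprocal rank over a list of ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(k, ranks):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 15, 1, 7]                      # illustrative ranks, one per test query
print(f"MRR={mrr(ranks):.4f}, "
      f"Hits@1={hits_at(1, ranks):.4f}, "
      f"Hits@3={hits_at(3, ranks):.4f}, "
      f"Hits@10={hits_at(10, ranks):.4f}")
```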

Table 5 Ablation experiment results (percentage) of the DNCL model. ‘w/o DG’ denotes the DNCL model without the dual-gate mechanism, ‘w/o AN’ without the adversarial noise modeling mechanism, ‘w/o intra’ without intra-layer contrastive learning, ‘w/o inter’ without inter-layer contrastive learning, and ‘w/o CL’ without the multi-layer embedding contrastive learning strategy.

Impact of multi-dimensional gated update module

The multi-dimensional gated update module extracts key information and suppresses redundant noise through a dual-gate mechanism, addressing the difficulty of capturing long-distance dependencies under sparse information. Taking ICEWS14 as an example, removing the dual-gate mechanism (w/o DG) reduces the MRR from 51.18 to 50.05%. Although this is only a 1.13% decrease, given that ICEWS14 is relatively rich and structurally stable, the drop still demonstrates the effectiveness of the dual-gate mechanism in information selection. On the more challenging and noisier GDELT dataset, removing the dual-gate mechanism reduces the MRR from 23.59 to 21.69%, a relatively larger decrease. This indicates that in scenarios with sparser information and more pronounced noise, the contribution of the dual-gate mechanism to information screening and long-distance dependency capture becomes more prominent.
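To make the idea concrete, the sketch below shows a generic dual-gate update in which one gate controls how much historical state is retained and the other filters the incoming representation. The class and gate names (DualGateUpdate, keep_gate, select_gate) are our own illustrative choices; this is a generic gated fusion and does not reproduce the exact formulation of the multi-dimensional gated update module.

```python
import torch
import torch.nn as nn

class DualGateUpdate(nn.Module):
    def __init__(self, d_h=200):
        super().__init__()
        self.keep_gate = nn.Linear(2 * d_h, d_h)     # how much historical state to keep
        self.select_gate = nn.Linear(2 * d_h, d_h)   # which new features to let through

    def forward(self, h_prev, x_new):
        z = torch.cat([h_prev, x_new], dim=-1)
        keep = torch.sigmoid(self.keep_gate(z))
        select = torch.sigmoid(self.select_gate(z))
        return keep * h_prev + select * x_new        # gated fusion of history and new input

h_prev = torch.randn(4, 200)                         # previous entity states (toy data)
x_new = torch.randn(4, 200)                          # aggregated features at the new timestamp
print(DualGateUpdate()(h_prev, x_new).shape)         # torch.Size([4, 200])
```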

Impact of noise-aware adversarial modeling module

The noise-aware adversarial modeling module enhances the robustness of the model against noise interference through adversarial training. Whereas the multi-dimensional gated update module contributes mainly through information refinement, the noise-aware adversarial modeling module is oriented more directly toward the noise distribution. Experimentally, on relatively clean or low-noise datasets such as ICEWS14, the MRR drops from 51.18 to 50.27% when the adversarial noise modeling mechanism is removed (w/o AN), a decrease of 0.91%; on the GDELT dataset, where noise is more significant, the MRR drops from 23.59 to 21.11%, a decrease of 2.48%, which is markedly larger than the drop on ICEWS14. This demonstrates the sensitivity and adaptability of the noise-aware adversarial modeling module to noisy environments.
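As a simplified illustration of adversarial training on representations, the sketch below perturbs a toy input in the direction that increases the loss and then trains on both the clean and perturbed views. The paper’s module is built from a dedicated noise generator and discriminator, so this gradient-sign perturbation is only a generic stand-in for the underlying idea; epsilon here reuses the noise-intensity value from the implementation details, and all data are synthetic.

```python
import torch
import torch.nn as nn

d_h, n_classes, epsilon = 200, 50, 0.15               # epsilon mirrors the noise-intensity setting
clf = nn.Linear(d_h, n_classes)
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)

x = torch.randn(32, d_h, requires_grad=True)          # toy entity representations
y = torch.randint(0, n_classes, (32,))

clean_loss = nn.functional.cross_entropy(clf(x), y)
grad, = torch.autograd.grad(clean_loss, x, retain_graph=True)
x_adv = x + epsilon * grad.sign()                     # worst-case perturbation of intensity epsilon

adv_loss = nn.functional.cross_entropy(clf(x_adv.detach()), y)
loss = clean_loss + adv_loss                          # fit both clean and perturbed views
opt.zero_grad()
loss.backward()
opt.step()
```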

In summary, the noticeable drop in the overall performance of the DNCL model after removing each module shows that every module plays an indispensable role in improving performance. Each module targets a key challenge in the TKGR task and plays a core role in capturing long-distance dependencies under information sparsity, handling noise interference, and modeling potential temporal relationships. This further verifies the rationality and necessity of the module design and shows that the synergy between modules is crucial to the overall performance of the model.

Impact of multi-layer embedding contrastive learning module

The core feature of the multi-layer embedding contrastive learning module is to combine intra-layer and inter-layer contrastive learning strategies to mine richer feature representations and improve the model’s temporal relationship modeling capability. The ablation results show that removing the entire multi-layer embedding contrastive learning (w/o CL) leads to the most significant performance drop; for example, on the ICEWS14 dataset, the MRR drops from 51.18 to 47.50%, a decrease of 3.68%, which demonstrates that multi-layer embedding contrastive learning plays a key role in optimizing and enriching the embedding space. This trend remains consistent across multiple datasets, indicating the broad applicability of the module under different temporal distributions.

Further analysis shows that although the performance degradation from removing intra-layer contrastive learning (w/o intra) or inter-layer contrastive learning (w/o inter) alone is relatively small, each removal still has a noticeable impact on overall performance. Specifically, removing intra-layer contrastive learning weakens the model’s ability to represent short-term dependency information and reduces entity distinguishability in the same-layer embedding space, while removing inter-layer contrastive learning impairs the capture of global temporal relationships, resulting in unstable information transfer between layers.
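For intuition, the following sketch shows an InfoNCE-style contrastive objective with temperature \(\tau\), the generic form underlying losses of this kind; the exact construction of positive and negative pairs for the intra-layer and inter-layer terms in DNCL is not reproduced here, and the tensors below are random placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, tau=0.03):
    """anchor, positive: (N, d) paired views; other items in the batch act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / tau                   # cosine similarities scaled by temperature tau
    targets = torch.arange(a.size(0))          # the i-th anchor matches the i-th positive
    return F.cross_entropy(logits, targets)

intra = info_nce(torch.randn(16, 200), torch.randn(16, 200))   # e.g. same-layer views
inter = info_nce(torch.randn(16, 200), torch.randn(16, 200))   # e.g. cross-layer views
print((intra + inter).item())
```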

In summary, the multi-layer embedding contrastive learning module not only improves the model’s ability to explicitly model latent relationships in the temporal dimension, but also provides a more robust underlying representation space for the multi-dimensional gated update and noise-aware adversarial modeling modules. It also verifies that the synergy of intra-layer and inter-layer contrastive learning is a key factor in improving model performance.

Comparison of prediction time

Since the RE-NET38 and xERTE39 models are classic approaches to complex subgraph reasoning and sequential dependency modeling, respectively, they provide suitable baselines for validating the computational efficiency that the DNCL model achieves through its modular design. Therefore, to evaluate the computational efficiency of DNCL, this paper compared it with RE-NET and xERTE on entity reasoning tasks under identical conditions. Specifically, we measured the running time of each model on the test sets of the four datasets and strictly ensured that all experiments were conducted under the same environment and parameter settings to guarantee fairness. The comparison results are shown in Fig. 4.
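A minimal sketch of the kind of measurement used here is shown below: each model’s wall-clock time over the full test set is recorded with gradients disabled under identical conditions. The toy model and batches are placeholders; only the timing procedure itself is illustrated.

```python
import time
import torch

@torch.no_grad()
def timed_inference(model, test_batches):
    """Wall-clock time of one full pass over the test batches, gradients disabled."""
    model.eval()
    start = time.perf_counter()
    for batch in test_batches:
        model(batch)                                       # forward pass only
    return time.perf_counter() - start

toy_model = torch.nn.Linear(200, 50)                       # placeholder scoring model
toy_batches = [torch.randn(256, 200) for _ in range(10)]   # placeholder test batches
print(f"{timed_inference(toy_model, toy_batches):.4f} s")
```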

Fig. 4
figure 4

Comparative analysis of runtime (seconds).

Experimental results show that the running time of the DNCL model on the different datasets is significantly lower than that of RE-NET and xERTE. Moreover, even on a highly dynamic and large-scale event record dataset such as GDELT, DNCL maintains excellent inference efficiency and significantly reduces computational overhead. This efficiency is due to the modular design of the DNCL model, in particular the synergy between the dual-gate mechanism and the multi-layer contrastive learning module, which effectively reduces redundant computation. At the same time, noise-aware adversarial modeling further improves computational efficiency by explicitly modeling noise, shortening inference time while maintaining high inference accuracy and robustness. In contrast, xERTE’s complex subgraph expansion and pruning operations and RE-NET’s stepwise autoregressive inference framework both incur additional computational overhead. Therefore, DNCL shows a clear advantage in computational efficiency through its concise, efficient design and its contrastive learning strategy.

Comparative analysis with non-contrastive TKGR models

\(\text {L}^2\text {TKG}\)47 combines explicit and implicit features to model temporal relationships but lacks global temporal information integration, making it difficult to capture global information; HiSMatch42 uses a dual-structure encoder to match historical patterns but relies mainly on static historical information and adapts poorly to dynamic evolution; RE-GCN40 models temporal relationships with recursive graph convolutional networks but is susceptible to noise interference in complex temporal evolution environments and struggles to model fine-grained temporal dynamics. In addition, none of these models uses contrastive learning to optimize temporal relationship modeling, so they allow us to more effectively verify the advantage of DNCL’s multi-layer embedding contrastive learning module in modeling potential temporal relationships over both short-term and global time ranges. Therefore, we selected \(\text {L}^2\text {TKG}\), HiSMatch, and RE-GCN as the non-contrastive comparison models.

As shown in Fig. 5, the MRR of DNCL on the four benchmark datasets is significantly higher than that of the non-contrastive learning models \(\text {L}^2\text {TKG}\), HiSMatch, and RE-GCN, highlighting the value of the contrastive learning module. The multi-layer embedding contrastive learning module captures short-term dynamic features through intra-layer contrastive learning and aggregates global temporal relationships through inter-layer contrastive learning, thereby attending to fine-grained changes at a given time while integrating richer historical information across time, and comprehensively improving the model’s ability to model potential temporal relationships. On ICEWS14 and ICEWS05-15, where the data are relatively stable, intra-layer contrastive learning effectively captures short-term dynamic changes, while inter-layer contrastive learning aggregates more context over global time, enhancing reasoning coherence. On the noisier ICEWS18 and GDELT, the dual contrastive strategy not only suppresses local noise interference and preserves precision within the local time window, but also mines the deeper evolution patterns of entities and relations and the potential relationships in TKGs, improving robustness to random distortion and information loss. Overall, the complementary integration of these two strategies enables DNCL to achieve better reasoning results than traditional non-contrastive models under different levels of data complexity.

Fig. 5
figure 5

Performance comparison of DNCL and the non-contrastive learning models on the MRR (percentage) metric.

Analysis of embedded dimensions \(d_h\)

The ICEWS14 and ICEWS18 datasets have balanced structures and diverse event types; their moderate data size, dynamics, and noise level make it easier to observe the typical effect of the embedding dimension \(d_h\) on model performance. In contrast, the more complex data distributions and event characteristics of ICEWS05-15 and GDELT may introduce interfering factors that obscure the impact of \(d_h\). Therefore, to further study the impact of different \(d_h\) on model performance, this study keeps the other optimal hyperparameters unchanged and conducts experiments on the ICEWS14 and ICEWS18 datasets with embedding dimensions \(d_h\in \{100,200,300,400,500\}\).

As shown in Fig. 6, increasing \(d_h\) improves the model’s performance on the two datasets within a certain range, but the improvement is not unlimited. At a low dimension such as 100, performance is insufficient, mainly because a low dimension cannot fully capture the complex characteristics of the data. As the dimension increases, the overall metrics improve. However, when \(d_h\) is further increased to 400 or 500, the performance gains stagnate or even reverse. This shows that an appropriate \(d_h\) is crucial to model performance, whereas an overly large \(d_h\) adds computational complexity and increases the risk of overfitting. Therefore, a reasonable choice of \(d_h\) not only improves model performance but also strikes a balance between performance and efficiency.

Fig. 6
figure 6

Performance (percentage) of different embedding dimensions \(d_h\) on ICEWS14 and ICEWS18 datasets.

Analysis of temperature coefficient \(\tau\)

As described in the previous section, this study conducted experiments with different temperature coefficients \(\tau\) on the ICEWS14 and ICEWS18 datasets while keeping the optimal configuration of the other hyperparameters unchanged. As shown in Fig. 7, on both datasets the overall performance of the model on the MRR and Hits@3 metrics remains relatively stable as \(\tau\) varies, indicating that the model is robust and adaptable to the setting of \(\tau\). However, the figure also shows that performance fluctuates slightly with \(\tau\), rising or falling by small amounts. This fluctuation reflects that the influence of \(\tau\) on the model is not strictly linear: different values adjust the distribution of embeddings to different degrees and thus affect the effectiveness of contrastive learning. In addition, different datasets exhibit different sensitivities to \(\tau\); for example, the model appears more sensitive to changes in \(\tau\) on ICEWS18, while showing stronger stability on ICEWS14. This indicates that the choice of \(\tau\) not only affects performance but is also closely related to the characteristics of the dataset. Therefore, tuning \(\tau\) appropriately is important for balancing the robustness and adaptability of the model and improving TKGR performance.

Fig. 7
figure 7

Performance (percentage) of different temperature coefficients \(\tau\) on ICEWS14 and ICEWS18 datasets.

Analysis of long-distance dependencies problem

ICEWS14 and GDELT form a sharp contrast in data complexity: ICEWS14 has a clear structure from which long-distance dependency features are easy to extract, whereas GDELT is complex and contains significant data perturbations, testing the robustness of the model under difficult conditions. Therefore, to further explore the performance of the DNCL model in capturing long-distance dependencies, this paper conducted experiments on the ICEWS14 and GDELT datasets with history lengths m ranging from 1 to 10, while keeping the other optimal hyperparameters unchanged. This allows us to evaluate DNCL’s long-distance modeling capability in time series environments ranging from simple to complex, providing strong support for applying the model in realistic and diverse scenarios.
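The role of the history length m can be illustrated with a small sketch that selects the m most recent KG snapshots preceding a query time t as the model’s historical input; the snapshot dictionary and the facts below are hypothetical examples, not data from ICEWS14 or GDELT.

```python
def history_window(snapshots, t, m):
    """Return the facts of the m timestamps immediately preceding time t."""
    past = sorted(ts for ts in snapshots if ts < t)
    return [snapshots[ts] for ts in past[-m:]]

snapshots = {
    1: [("A", "meet", "B")],
    2: [("B", "visit", "C")],
    3: [("A", "criticize", "C"), ("C", "sanction", "A")],
    4: [("B", "support", "A")],
}
print(history_window(snapshots, t=4, m=2))   # uses the snapshots at timestamps 2 and 3 only
```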

As shown in Fig. 8, on the ICEWS14 dataset, the MRR and Hits@3 metrics remain relatively stable and high as m increases. This shows that regardless of whether the history length is short or long, the DNCL model can effectively capture and utilize long-distance temporal information, thereby maintaining excellent reasoning performance under long historical dependencies. The GDELT dataset, with its more complex data, more event types, and more pronounced noise interference, places higher demands on modeling long-distance temporal dependencies. Although performance fluctuates slightly with the history length on this dataset, the DNCL model still maintains relatively good performance over a longer historical range. This result shows that DNCL not only captures long-distance dependencies robustly on the clearer and more orderly ICEWS14 dataset, but also copes with the complex and changeable long-distance relations in the more challenging GDELT dataset, achieving effective modeling of temporal information.

Fig. 8
figure 8

Performance (percentage) of different history length m on ICEWS14 and GDELT datasets.

To further evaluate the ability of the DNCL model to capture long-distance dependencies, we analyze performance under different history lengths m. As shown in Table 6, on the ICEWS14 dataset (24-hour granularity), as the history length m increases from 1 to 7, the model performance improves steadily, with Hit@1 increasing from 39.23 to 40.37% and MRR rising from 50.27 to 51.18%. Beyond that, performance decreases slightly, with MRR dropping from 51.18 to 50.25%, a drop of only about 0.93%, and the overall performance remains high. On the GDELT dataset (15-minute granularity), the model performs best when the history length m increases from 1 to 2 (Hit@1 of 14.78%, MRR of 23.59%), after which performance fluctuates considerably. This indicates that on a dataset with finer time granularity such as GDELT, an overly long history may introduce redundant information; nevertheless, the model remains stable and robust over short time spans. The overall trend shows that DNCL can effectively utilize and filter key information across multiple historical time steps, demonstrating its ability to model long-distance dependencies. In particular, on the ICEWS14 dataset, where many events have long dependency spans (from days to months or even years), the performance improvement is more pronounced, further verifying the effectiveness of the proposed multi-dimensional gated update module in capturing long-distance dependencies.

Table 6 MRR and Hit@1 scores (percentage) with different history lengths m on ICEWS14 and GDELT datasets. The best result among the different history lengths is highlighted in bold.

Analysis of noise interference problem

Analysis of different noise intensities

The data distribution characteristics and complexity of the ICEWS05-15 and GDELT datasets differ significantly from those of ICEWS14 and ICEWS18, introducing additional confounding factors into an analysis of noise perturbations and making it harder to assess the performance of DNCL along the noise intensity dimension directly and clearly. Therefore, to further verify the effectiveness of the DNCL model in reducing noise interference, this paper introduced random noise into the ICEWS14 and ICEWS18 datasets and set different noise intensity control parameters \(\epsilon\) for experimental analysis.
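One plausible way to realize such a perturbation is sketched below: a randomly chosen fraction \(\epsilon\) of the quadruples has its object entity replaced at random. This corruption scheme is our own illustrative assumption and is not necessarily the exact form of random noise used in these experiments; the quadruples and entity list are toy examples.

```python
import random

def corrupt_quadruples(quads, entities, epsilon, seed=0):
    """Replace the object of roughly a fraction epsilon of the (s, r, o, t) facts."""
    rng = random.Random(seed)
    noisy = []
    for (s, r, o, t) in quads:
        if rng.random() < epsilon:                     # corrupt with probability epsilon
            o = rng.choice([e for e in entities if e != o])
        noisy.append((s, r, o, t))
    return noisy

quads = [("A", "meet", "B", 1), ("B", "visit", "C", 2), ("C", "sanction", "A", 3)]
print(corrupt_quadruples(quads, entities=["A", "B", "C", "D"], epsilon=0.5))
```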

As shown in Fig. 9, for both the more ordered ICEWS14 dataset and the more complex ICEWS18 dataset, DNCL maintains high MRR and Hits@3 when \(\epsilon =0\), i.e., on a completely clean dataset, proving that it predicts well under noise-free conditions. As \(\epsilon\) gradually increases from 0.05 to 0.5, both metrics remain stable, with no obvious performance collapse or sharp decline. This demonstrates that although the potential noise in the data grows with \(\epsilon\), DNCL can still effectively suppress noise interference through the noise-aware adversarial modeling module and keep the inference results stable. In addition, these results show that DNCL can extract key features and filter out meaningless information both in a clean environment and under gradually increasing noise, exhibiting high adaptability and robustness to different complexities and noise distributions. This advantage makes it more widely applicable in dynamic and diverse time series environments.

Fig. 9
figure 9

Performance (percentage) of different noise intensity control parameters \(\epsilon\) on ICEWS14 and ICEWS18 datasets.

Although the noise in this study is randomly generated, the noise-aware adversarial modeling module proposed in DNCL aims to improve the robustness of the model to real-world structured noise such as conflicting facts and incomplete information. By introducing adversarial perturbations during training, the module effectively distinguishes reliable from unreliable information and keeps the embedded representations stable in noisy environments. Previous studies59 have shown that adversarial training significantly improves the generalization and robustness of a model. In DNCL, adversarial noise not only enhances the model’s adaptability to random noise but also improves its resistance to structured noise, because adversarial training continuously simulates the inconsistent perturbations found in reality, prompting the model to learn more robust entity and relation embeddings that filter out contradictory facts or time-misaligned information. Therefore, even though the experiments do not explicitly introduce structured or contradictory noise, DNCL still generalizes well and can adapt to the complex noise environments of the real world.

Comparative analysis with alternative adversarial learning methods

The Wasserstein adversarial training framework60 has become a representative method because it generates high-quality negative samples in knowledge graph embedding tasks and effectively alleviates the gradient vanishing problem of traditional generative adversarial networks. However, in practical applications this method usually ignores the complex and diverse noise present in natural data and relies mainly on artificially generated negative samples to improve the discriminability of representations. In contrast, the noise-aware adversarial modeling module we propose directly targets the real noise environment: by introducing discrimination and fusion strategies into the noise generation process, it adaptively identifies and suppresses interfering information, thereby significantly improving the robustness of the model. To further verify the effectiveness of this module, we compared the performance of the DNCL model with models based on the Wasserstein adversarial training method on three datasets: ICEWS14, ICEWS05-15, and GDELT.

As shown in Fig. 10, on the three datasets the performance of DNCL is generally better than that of the F-DE_SimplE and F-BoxTE models60 based on the Wasserstein adversarial learning framework. Specifically, on the ICEWS14 and ICEWS05-15 datasets, which have clearer structures and relatively sparse data, the DNCL model shows stronger temporal relationship capture and long-distance dependency modeling capabilities; on the noisier GDELT dataset, the DNCL model still maintains its advantage, verifying the robustness of its noise-aware mechanism in real, complex environments. Therefore, although Wasserstein adversarial learning can improve the quality of negative sample generation, it remains insufficient in the face of the complex, diverse, and continuously evolving noise interference of real scenarios. The noise-aware adversarial modeling module proposed in this paper addresses these challenges and, through the adversarial training mechanism built from a noise generator and a noise discriminator, significantly improves the temporal reasoning performance and stability of the model across different scales and noise levels.

Fig. 10
figure 10

Performance (percentage) comparison of DNCL with models trained using the Wasserstein adversarial learning framework on the ICEWS14, ICEWS05-15, and GDELT datasets. The results of F-DE_SimplE and F-BoxTE are from60.

Conclusion

In this paper, a novel dual-gate and noise-aware contrastive learning framework (DNCL) is proposed to address key issues in TKGR. Specifically, the multi-dimensional gated update module effectively models long-distance dependencies through a dual-gate mechanism; the noise-aware adversarial modeling module improves noise resistance through adversarial training; and the multi-layer embedding contrastive learning module strengthens the representation of latent temporal relationships through intra-layer and inter-layer contrastive learning. Experimental results show that DNCL improves the Hit@1 metric by 6.91%, 4.31%, and 5.30% on the ICEWS14, ICEWS18, and ICEWS05-15 datasets, respectively. Ablation experiments and parameter analyses further verify the effectiveness and robustness of each module. DNCL also shows excellent adaptability and robustness in complex scenarios, providing an efficient solution for TKGR tasks and a new direction for research on more complex temporal reasoning.

As applications of TKGs expand, future research could explore the following directions. First, investigate how to combine multimodal information such as text, images, and sensor data in the DNCL model to support temporal reasoning with more comprehensive input. Second, explore how to optimize the DNCL model structure and algorithms to cope with ultra-large-scale, ultra-sparse, and strongly noisy data. In addition, it is worth studying how to introduce emerging methods such as large language models, meta-learning, and incremental learning to enhance the ability to adapt to new events and new domains.