Abstract
Graph Neural Networks (GNNs) serve as a powerful framework for representation learning on graph-structured data, capturing node information by recursively aggregating and transforming the representations of neighboring nodes. Graph topology plays an important role in learning graph representations and strongly affects the performance of GNNs. However, current methods fail to adequately integrate topological information into graph representation learning. To better leverage topological information and enhance representation capabilities, we propose the Graph Topology Attention Networks (GTAT). Specifically, GTAT first extracts topology features from the graph’s structure and encodes them into topology representations. The node and topology representations are then fed into cross attention GNN layers for interaction. This integration allows the model to dynamically adjust the influence of node features and topological information, thus improving the expressiveness of nodes. Experimental results on various graph benchmark datasets demonstrate that GTAT outperforms recent state-of-the-art methods. Further analysis reveals GTAT’s capability to mitigate the over-smoothing issue and its increased robustness against noisy data.
Introduction
Graph-structured data maps out intricate relations between various entities around the world, from the vast expanses of social networks1 to the dense construction of knowledge graphs2, the intricate patterns of molecular structures3, and even the 3D topologies of manifolds4. This data structure plays an essential part in modeling complex relationships. Graph Neural Networks (GNNs) and their variants are efficient tools for exploring graph-structured data, utilizing node features and graph structure to address challenges in network analysis. This capability makes GNNs widely applicable across various domains, including deciphering molecular structures5, navigating social networks6, formulating product suggestions7, and dissecting software programs8.
Convolution techniques in computer vision9,10 have been applied to graph-structured data, promoting advances in GNNs. Based on different definitions of convolution, GNNs fall into two categories: spectral-domain11 and spatial-domain12,13,14. Spectral-domain GNNs define graph convolution through the lens of graph signal processing, based on the principle that convolving two signals in the spatial domain is equivalent to multiplying their Fourier transforms in the frequency domain. This concept originates from Bruna’s work11, with subsequent advancements and refinements in notable works such as ChebNet15, CayleyNet16, and GCN17. Spatial-domain GNNs perform convolution directly on the representations of each node and its neighbors to update states, and exhibit a wide variety of variants according to different strategies for aggregating and integrating neighboring information. In particular, the Graph Attention Network (GAT)18 stands out owing to its attention-based neighborhood aggregation, which enables nodes to weigh the significance of neighboring information during their feature update process. Building upon this, GAT219 introduces dynamic attention, demonstrating more robust and expressive capabilities.
While these methods make use of basic topological information, such as node degrees or edges, during message passing, they do not explicitly incorporate richer topological features. This limitation prevents GNNs from fully leveraging the inherent properties of the graph structure, which are crucial for understanding graph-structured data. For instance, in social networks20, the topological structure can reveal community patterns, influential entities, and the dynamics of information flow. In chemical informatics21, the molecular topology directly influences the chemical properties and reactivity of molecules. In biological networks22, analyzing topological differences helps in understanding cellular functions and disease mechanisms. To address this limitation, some GNNs23,24,25 leverage topological information by adjusting factors such as message passing weights or choosing specific nodes for information propagation. The works of You and Tian26,27 attempt to enhance node expressiveness by concatenating the extracted topological information with node representations. However, node representations and topology representations are essentially two different modalities. As Wang and Baltrušaitis28,29 indicate, simply concatenating data from different modalities while ignoring the interactions between them may hinder the network from effectively learning useful information from each modality.
Motivated by the above issues, we propose Graph Topology Attention Networks (GTAT) to address the inadequate utilization of topological information and the limitations of a unimodal configuration. Specifically, GTAT first extracts topology features from the graph’s structure and encodes them into topology representations. We take the influence of each node’s local topology into account by encoding the topology information as an additional input to the model. We then compute two types of attention scores and use a cross attention mechanism to process both the node representations and the extracted topology features. This integration enables topology features to be incorporated into node representations and ensures that the relationships in the graph are effectively captured, yielding a more robust and expressive graph model.
The contributions of this paper are summarized as follows:
- We propose a novel graph neural network framework, GTAT, which enhances the utilization of topological information for processing graph-structured data. In this framework, we treat node feature representations and extracted topology representations as two separate modalities, which are then fed into the GNN layers.
- We explore the feasibility of applying the cross attention mechanism in GNNs. Our approach calculates attention scores for both node feature representations and node topology representations, and then employs a cross attention mechanism to integrate these two sets of representations. This integration allows the model to dynamically adjust the influence of node features and topological information, enhancing its representation capability.
- Experimental results on nine diverse datasets demonstrate that our model outperforms state-of-the-art models on classification tasks. Further analysis involving variations in model depth and noise levels reveals GTAT’s capability to mitigate the over-smoothing issue and its increased robustness against noisy data. These results highlight that GTAT can serve as a general architecture applicable to different scenarios.
Related work
Graph neural networks
Different GNNs employ various aggregation schemes for a node to aggregate messages from its neighbors. GCN17 utilizes a layer-wise propagation technique, employing a localized first-order approximation of spectral graph convolutions to encode representations. SAGE30 learns a function to generate embeddings from a node’s local neighborhood, enabling predictions on previously unseen data. SGC31 simplifies training by reducing the number of non-linear layers and merging multiple layers of graph convolution into a single linear transformation. FAGCN32 optimizes neighborhood information aggregation by analyzing the spectral properties of graphs, employing different strategies for handling high-frequency and low-frequency signals. The attention mechanism33 empowers GATs to selectively focus on significant neighborhood information while updating node representations, pioneering a new approach to graph representation learning. GAT18 employs a self-attention mechanism that calculates attention coefficients for each neighbor of a node and utilizes them to weight the corresponding neighbor features during aggregation, allowing GAT to assign larger weights to more relevant neighbors. GAT219 employs a dynamic attention mechanism to enhance the model’s expressive ability, accommodating scenarios where different keys possess varying degrees of relevance to different queries.
GNNs with topology
Leveraging graph topology has become increasingly popular in graph representation learning. mGCMN34 incorporates motif-induced adjacency matrices into its message passing framework, adjusting weights to capture complex neighborhood structures. TAGCN35 slides a set of fixed-size learnable filters over the graph, where each filter adapts to the local topology. P-GNNs26 sample multiple sets of anchor nodes and apply a distance-weighted aggregation scheme to differentiate nodes’ positional information. SubGNN36 learns disentangled representations of subgraphs by using a routing mechanism to handle subgraph internal topology, position, and connectivity, enhancing performance on subgraph prediction tasks. To learn deep embeddings on high-order graph-structured data, Hyper-Conv37 extends traditional graphs by permitting edges to connect any number of vertices, thus altering the aggregation methods among nodes. Given the importance of topological information, we extract and encode it to enhance the model’s representation ability.
Cross attention mechanism
The concept of the cross attention mechanism was first proposed in the Transformer model38. Cross attention bridges two distinct sequences from diverse modalities such as text, sound, or images, providing a flexible framework that allows for interactions between different modalities39,40 and enhancing mutual understanding. Exploiting this concept, the Perceiver model41 processes input byte arrays by alternating between cross attention and latent self-attention blocks. Meta’s Segment Anything Model42 leverages cross attention to connect prompts and image information, fostering enhanced interactions and richer embeddings. MMCA43 uses a cross attention module to generate cross attention maps for each pair of class feature and query sample feature, making the extracted features more discriminative. Recently, some works44,45 have also adopted cross attention mechanisms in graph-related tasks. However, most of these studies focus on using cross attention to facilitate interactions between graph modules and non-graph modules. In this study, we employ the cross attention mechanism to enable modality interaction within the graph module itself, without requiring the assistance of non-GNN modules, allowing for more efficient and intrinsic interactions within the graph structure.
Method
Framework
As illustrated in Figure 1, our framework begins with topology feature extraction (TFE) for each node. After obtaining the set of topology representations, we apply Graph Cross Attention (GCA) layers to update the node feature representations and topology representations. Lastly, the model utilizes the node feature representations from the final layer to predict node classifications. Our methodology presents an innovative fusion of the original feature representations and the topology representations, utilizing a cross attention mechanism on graphs to enhance the expressive capability of each node. The following sections elaborate on our approach.
GTAT framework. Given a graph \(\mathcal {G}\) with \(N\) nodes, along with a set of node feature representations H, we first obtain the GDV of these nodes through the TFE. Subsequently, we use an MLP to transform the GDVs into a set of topology representations T. Each GTAT layer receives \(\mathcal {G}\) and these two sets of representations as input, then transforms them and outputs two updated sets of representations. Finally, based on the set of node feature representations, our model outputs the predicted node classifications.
Topology feature extraction
To extract the information inherent in the graph structure, we obtain the topology representations based on the graphlet degree vector (GDV)22,46 of each node. The GDV is a count vector that represents the distribution of a node over specific orbits of graphlets. Graphlets, defined as small connected non-isomorphic induced subgraphs within a graph, succinctly capture the neighboring structure of each node in the network. An orbit can be thought of as a unique position or role a node can occupy within a graphlet. For instance, each node in a triangle (a three-node graphlet) has the same role, so they belong to the same orbit. The GDV counts how many times a node participates in each orbit across the distinct graphlets in its local neighborhood, delivering a measure of the node’s local network topology and enhancing the model’s understanding of the graph structure.
Figure 2 shows all four orbits for graphlets with up to three nodes and the GDV calculation for node \(\nu\). In fact, there are 15 distinct orbit types for graphlets with up to four nodes, and 73 types for graphlets with up to five nodes. We utilize the Orbit Counting Algorithm (OCRA)47 to compute the GDVs of nodes within a network. OCRA offers a combinatorial method for enumerating graphlet and orbit signatures of network nodes, reducing the computational complexity of graphlet counting. The time complexities for computing GDVs of these two dimensionalities are \(O\left( n \cdot d^3\right)\) and \(O\left( n \cdot d^4\right)\), respectively, where n is the number of nodes and d is the maximum node degree.
Top: the four orbits, shown in different colors. Bottom: the computation of the GDV for node \(\nu\) in graph \(\mathcal {G}\). This diagram illustrates all instances where node \(\nu\) appears in the four distinct orbits. Correspondingly, the GDV of \(\nu\) is [2, 1, 0, 1], reflecting the number of appearances of \(\nu\) in these orbits.
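As a minimal illustration of how such orbit counts arise, the sketch below counts the four orbits of graphlets with up to three nodes for a single node using NetworkX. The toy graph and the function are purely illustrative assumptions; the paper's pipeline instead relies on OCRA over orbits of up to five nodes.

```python
import networkx as nx
from itertools import combinations

def gdv_3node(graph, v):
    """Count the four orbits of graphlets with up to three nodes for node v.
    Orbit 0: endpoint of an edge; orbit 1: endpoint of an induced 3-node path;
    orbit 2: center of an induced 3-node path; orbit 3: member of a triangle."""
    nbrs = set(graph.neighbors(v))
    o0 = len(nbrs)                                    # orbit 0 equals the node degree
    o1 = o2 = o3 = 0
    for u, w in combinations(nbrs, 2):                # v together with two of its neighbors
        if graph.has_edge(u, w):
            o3 += 1                                   # {v, u, w} forms a triangle
        else:
            o2 += 1                                   # v is the center of the induced path u-v-w
    for u in nbrs:                                    # paths v-u-w with w adjacent to u but not to v
        o1 += len(set(graph.neighbors(u)) - nbrs - {v})
    return [o0, o1, o2, o3]

# Hypothetical toy graph (not the graph from Figure 2).
g = nx.Graph([(0, 1), (1, 2), (0, 2), (0, 3)])
print(gdv_3node(g, 0))   # -> [3, 0, 2, 1]
```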
Building on the aforementioned approach, this study employs the GDV as the extracted node topology feature. The dimensionality of each node’s GDV corresponds to the number of orbits, representing its topological characteristics. These GDVs, after being normalized and processed through a multilayer perceptron (MLP)48, serve as the topology representations input to the network. To balance computational efficiency and prediction accuracy, we employ the 73-dimensional GDV. The comparative experiments are shown in Section 4.
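A minimal sketch of this encoding step is given below; the feature-wise normalization, hidden width, and node count are illustrative assumptions rather than the paper’s settings.

```python
import torch
import torch.nn as nn

num_nodes, gdv_dim, hidden = 2708, 73, 64
gdv = torch.rand(num_nodes, gdv_dim)                      # placeholder for the 73-dim GDVs
gdv = (gdv - gdv.mean(0)) / (gdv.std(0) + 1e-8)           # simple feature-wise normalization
mlp = nn.Sequential(nn.Linear(gdv_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
T = mlp(gdv)                                              # topology representations fed to the GCA layers
```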
Graph cross attention layer
After obtaining the topology representations, our approach computes two types of attention: the feature attention and a novel topology attention, thereby implementing a cross attention mechanism on graphs. The structure of the GCA layer is depicted in Figure 3.
Our GCA layer receives a set of node feature representations, \({H}_l=\left\{ {h}_1, {h}_2, \ldots , {h}_N\right\}\), and a set of topology representations, \({T}_l=\left\{ {t}_1, {t}_2, \ldots , {t}_N\right\}\), where N is the number of nodes at layer l. Following the methodology in GAT, we calculate the feature attention score between feature representations of nodes and their corresponding neighbors:
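Assuming the standard GAT static attention that this step follows (the LeakyReLU nonlinearity is part of that assumption), the score takes the form

\(e_{ij}=\text {LeakyReLU}\left( {a}^{\top }\left[ {W} {h}_i \,\Vert \, {W} {h}_j\right] \right)\)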
where \({h}_i\) and \({h}_j\) are the feature representations of nodes \(i\) and \(j\), while W and a represent a weight matrix and a shared parameter vector, respectively. This calculation embodies the inherent attributes of the nodes and assigns larger weights to more relevant neighbors.
The structure of the GCA layer. The inputs are a set of node feature representations, \({H}_l \in \mathbb {R}^{N\times F_1}\), and a set of node topology representations, \({T}_l \in \mathbb {R}^{N\times F_2}\), where N is the number of nodes at layer l. After computing two attention matrices, denoted as \(\alpha\) and \(\beta\), we employ a message passing (M.P.) mechanism to obtain the new representations \({H}_{l+1} \in \mathbb {R}^{N\times F_3}\) and \({T}_{l+1} \in \mathbb {R}^{N\times F_2}\).
Furthermore, we introduce a new form of attention score, topology attention score. This score is calculated between topology representations of nodes and their corresponding neighbors:
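A plausible form of this score, mirroring the feature attention above but applied to the topology representations (whether a weight matrix is applied to \(t\) before concatenation is an assumption), is

\(e^{t}_{ij}=\text {LeakyReLU}\left( {a}_t^{\top }\left[ {t}_i \,\Vert \, {t}_j\right] \right)\)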
where \({t}_i\) and \({t}_j\) are the topology representations of node \(i\) and node \(j\), with \({a}_t\) being a shared parameter vector. The feature attention scores and the topology attention scores are then normalized over each node’s neighborhood:
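Assuming the standard softmax normalization used in GAT, the normalized coefficients are

\(\alpha _{ij}=\frac{\exp \left( e_{ij}\right) }{\sum _{k \in \mathcal {N}_i} \exp \left( e_{ik}\right) }, \qquad \beta _{ij}=\frac{\exp \left( e^{t}_{ij}\right) }{\sum _{k \in \mathcal {N}_i} \exp \left( e^{t}_{ik}\right) }\)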
where \(\alpha _{i j}\) is the feature attention coefficient between node \(i\) and node \(j\), and \(\beta _{i j}\) serves as the topology attention coefficient, enabling the model to capture the local substructure of each node in the network. Additionally, \(\mathcal {N}_{i}\) represents the set of neighbors of node i, and it can be defined as follows:
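A natural reading of this definition (whether node \(i\) itself is included in \(\mathcal {N}_{i}\) through a self-loop is an assumption) is

\(\mathcal {N}_{i}=\left\{ j \in \mathcal {V} \mid (i, j) \in \mathcal {E}\right\}\)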
where \(\mathcal {V}\) represents the set of nodes in the graph, and \(\mathcal {E}\) represents the set of edges.
Following the two attention computations, we implement the cross attention mechanism, which intertwines the node feature representations and the topology representations. The node feature representation is updated with the computed topology attention coefficients as:
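Based on the surrounding definitions, a plausible form of this update is

\({h}_i^{\prime }=\sigma \left( \sum _{j \in \mathcal {N}_i} \beta _{ij} {W} {h}_j\right)\)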
where \(\sigma\) is a nonlinearity and \({W} \in \mathbb {R}^{F_3\times F_1}\) represents a weight matrix. Simultaneously, the topology representation is updated with the calculated feature attention coefficients:
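Analogously, a plausible form of the topology update (whether a weight matrix is applied to the topology representations is an assumption, since \({T}_{l+1}\) keeps the dimension \(F_2\)) is

\({t}_i^{\prime }=\sigma \left( \sum _{j \in \mathcal {N}_i} \alpha _{ij} {t}_j\right)\)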
Finally, the layer outputs a new set of node feature representations, \({H}_{l+1}=\left\{ {h}_1^{\prime }, {h}_2^{\prime }, \ldots , {h}_N^{\prime }\right\}\), and a set of topology representations, \({T}_{l+1}=\left\{ {t}_1^{\prime }, {t}_2^{\prime }, \ldots , {t}_N^{\prime }\right\}\).
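The following is a minimal, dense PyTorch sketch of a single-head GCA layer under the assumptions above. The class name, the ELU nonlinearity, the dense adjacency matrix with self-loops, and the absence of a weight matrix on the topology branch are illustrative choices, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCALayer(nn.Module):
    """Sketch of one Graph Cross Attention layer: node features H are aggregated
    with the topology attention coefficients beta, while topology representations
    T are aggregated with the feature attention coefficients alpha."""

    def __init__(self, f1: int, f2: int, f3: int):
        super().__init__()
        self.f2, self.f3 = f2, f3
        self.W = nn.Linear(f1, f3, bias=False)              # weight matrix W (F3 x F1)
        self.a = nn.Parameter(torch.randn(2 * f3) * 0.1)    # shared vector a (feature attention)
        self.a_t = nn.Parameter(torch.randn(2 * f2) * 0.1)  # shared vector a_t (topology attention)

    @staticmethod
    def _neighbour_softmax(scores, adj):
        # restrict attention to graph neighbours; adj is assumed to contain self-loops
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)

    def forward(self, H, T, adj):
        Wh = self.W(H)                                                  # [N, F3]
        # GAT-style scores a^T [Wh_i || Wh_j], split into source/destination parts
        e = F.leaky_relu((Wh @ self.a[:self.f3]).unsqueeze(1)
                         + (Wh @ self.a[self.f3:]).unsqueeze(0))        # [N, N]
        alpha = self._neighbour_softmax(e, adj)                         # feature attention
        e_t = F.leaky_relu((T @ self.a_t[:self.f2]).unsqueeze(1)
                           + (T @ self.a_t[self.f2:]).unsqueeze(0))     # [N, N]
        beta = self._neighbour_softmax(e_t, adj)                        # topology attention
        # Cross attention: swap the coefficients between the two streams.
        H_new = F.elu(beta @ Wh)                                        # [N, F3]
        T_new = F.elu(alpha @ T)                                        # [N, F2]
        return H_new, T_new

# Toy usage with a random symmetric adjacency matrix plus self-loops.
N, F1, F2, F3 = 5, 16, 8, 32
adj = (torch.rand(N, N) > 0.5).float()
adj = ((adj + adj.t() + torch.eye(N)) > 0).float()
H1, T1 = GCALayer(F1, F2, F3)(torch.rand(N, F1), torch.rand(N, F2), adj)
```

In a full model, several such layers would be stacked and combined with multi-head attention, as in GAT.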
It is worth mentioning that the dynamic attention mechanism introduced in GAT2 also performs well across various tasks. The dynamic attention in GAT2 diverges from GAT’s static counterpart by adjusting its weights based on the query, thus accommodating scenarios where different keys possess varying degrees of relevance to different queries. The dynamic attention calculation in GAT2 is formulated as follows:
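For reference, the form of this dynamic attention, as defined in the GAT2 paper, is

\(e\left( {h}_i, {h}_j\right) ={a}^{\top } \text {LeakyReLU}\left( {W}\left[ {h}_i \,\Vert \, {h}_j\right] \right)\)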
To equip our model with dynamic attention, we further propose another version: GTAT2. In GTAT2, we employ the dynamic attention mechanism utilized in GAT2 for the computation of the two attention scores, as shown in Equations 8 and 9:
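Plausible forms of these two scores, assuming the GAT2 parametrization is applied separately to the feature and topology streams (with \({W}_t\) a hypothetical weight matrix for the topology stream), are

\(e_{ij}={a}^{\top } \text {LeakyReLU}\left( {W}\left[ {h}_i \,\Vert \, {h}_j\right] \right) , \qquad e^{t}_{ij}={a}_t^{\top } \text {LeakyReLU}\left( {W}_t\left[ {t}_i \,\Vert \, {t}_j\right] \right)\)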
Both the node feature and topology representations in GTAT2 are updated similarly to those in GTAT. Experiments and analysis on GTAT and GTAT2 are conducted subsequently.
This cross interaction between the node and topology representations allows the model to capture both intrinsic node attributes and topological relations, thereby significantly augmenting its prediction accuracy.
Experiments
Datasets
In our experiments, we use nine commonly used benchmark datasets, namely three citation network datasets (i.e., Cora, Citeseer, and PubMed)49, two Amazon co-purchase datasets (i.e., Computers and Photo)50, two coauthorship datasets (i.e., Physics and CS), one Wikipedia-based dataset (i.e., WikiCS)51, and one arXiv paper dataset (i.e., Arxiv)52. Statistics for all datasets can be found in Table 1. All resources we used are from the PyTorch Geometric library53.
Experimental setup
All experiments are implemented in PyTorch and conducted on a server with two NVIDIA GeForce RTX 4090 GPUs (24 GB memory each). We conduct 20 runs and report the mean values alongside the standard deviation. The hyper-parameter search space encompasses: hidden size options of \({\left\{ 8, 16, 32, 64 \right\} }\), learning rate choices of \({\left\{ 0.01, 0.005 \right\} }\), dropout values of \({\left\{ 0.4, 0.6 \right\} }\), weight decay options of \({\left\{ 1E-3, 5E-4 \right\} }\), and a selection of attention heads from \({\left\{ 1, 2, 4, 8 \right\} }\) for models using an attention mechanism. We hold the number of layers constant at 2. All methods utilize an early stopping strategy54 based on validation loss, with a patience of 100, and all are trained using a full-batch approach. In all cases, we randomly select 20 and 30 nodes per class for training and validation, respectively, and the remaining nodes are used for testing. We use the NLL loss as the loss function for the model:
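A standard form of this loss over the \(N\) labeled training nodes (whether the sum is averaged over nodes is an assumption) is

\(\mathcal {L}=-\frac{1}{N} \sum _{i=1}^{N} \sum _{c=1}^{C} y_{i, c} \log \hat{y}_{i, c}\)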
where C is the number of classes in the classification task, \(\hat{y}_{i, c}\) is the predicted probability of sample \(i\) belonging to class \(c\), and \(y_{i, c}\) is the ground truth label. We utilize the Adam optimizer55 to minimize the loss function and optimize the parameters of these models.
Node classification results
The comparative methods in our study involve nine different algorithms: GCN17, GraphSAGE (SAGE)30, SGC31, FAGCN32, GAT18, GAT219, Hyper-Conv37, mGCMN34 and Dir-GNN56.
Table 2 shows the average accuracy and standard deviation of the different models. GTAT or GTAT2 achieves the best results on all but two datasets. Compared to GATs, GTATs show better performance across all datasets owing to the extracted topology features and the cross attention mechanism. Specifically, GTAT achieves an average accuracy improvement of 0.53% across the nine datasets compared to GAT, and GTAT2 outperforms GAT2 with an accuracy improvement of 0.48%. Compared to Hyper-Conv and mGCMN, which also utilize topological information, our model demonstrates better accuracy. While Hyper-Conv and mGCMN merely adjust the message-passing pathways or weights based on the extracted topological structure, our method receives the extracted topology features as an additional modality. This mechanism enables GTATs to model the impact of the topological structure on node representations, contributing to more accurate and reliable predictions. Compared to the earlier SGC, GCN, and SAGE models, the GTATs also exhibit superior performance.
FAGCN’s effectiveness on the Physics and CS datasets, where the node features have high dimensionality, can be attributed to its adaptive integration of low-frequency and high-frequency signals from the raw features. However, GTATs outperform FAGCN on the other seven datasets. Particularly on the Arxiv dataset, which has low node feature dimensionality, GTAT outperforms FAGCN by 4.25%, highlighting GTATs’ capability to achieve higher accuracy with limited node features.
In summary, our GTAT models demonstrate outstanding performance across all nine datasets spanning four distinct data types, showcasing their broad applicability in handling diverse graph-structured data.
Effectiveness of cross attention
To further explore the impact of the cross attention mechanism embedded in our model, we conduct a series of experiments based on GATs with two different configurations: (1) GATs+A, which updates both the node feature representations H and the topology representations T using the topology attention coefficients \(\beta\); and (2) GATs+B, which updates only the node feature representations H based on the topology attention coefficients \(\beta\), while the topology representations T remain constant. As shown in Table 3, our method achieves the best performance across most datasets, with Computers being the exception. These results support the importance of exploiting both node feature and topology representations through our cross attention mechanism to attain optimal performance.
Over-smoothing analysis
A critical challenge in GNNs is the over-smoothing issue57, which limits the number of layers that can be effectively stacked. As the number of layers increases, node representations become increasingly indistinguishable, causing model performance to drop sharply.
To verify whether topology representations and cross attention can alleviate the over-smoothing issue, we select four different types of datasets and compare the performance of GTATs and GATs at varying depths. As shown in Figure 4, there are few significant differences between the models at shallow depths. However, as the depth increases, the GTATs demonstrate more stable performance, avoiding the drastic decline observed in GATs.
Figure 5 displays the t-SNE58 plots of the node representations produced by 20-layer GAT and GTAT models on the Physics dataset. The t-SNE plot provides a visual description of high-dimensional data by projecting it into 2D space, aiding the identification of relevant patterns. From this visualization, it is evident that GTAT achieves clearer node clustering than GAT. In addition, Figure 6 shows the node classification accuracy curves and loss curves of GATs and our proposed GTATs. It can be seen that GTATs converge more quickly and stably while achieving better accuracy.
Over-smoothing occurs when node representations become increasingly similar, rendering the model incapable of effectively distinguishing between different nodes. To quantify the similarity between node representations, we select the Dirichlet energy (\(E_D\))59 as our metric:
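A common form of this metric, assumed here (the exact normalization, e.g. whether each undirected edge is counted once or twice, may differ from the paper’s), is

\(E_D=\frac{1}{n_e} \sum _{i} \sum _{j} A_{ij}\left\| {h}_i-{h}_j\right\| _2^{2}\)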
where \(n_e\) denotes the total number of edges, \({h}_i\) represents the representation of node i, and \(A_{ij}\) is the corresponding element in the adjacency matrix. A higher \(E_D\) indicates greater dissimilarity between node representations. Figure 7 shows that the Dirichlet energy at each layer of the GTATs is exponentially higher than that of the GATs, indicating that GTATs better preserve the distinctiveness of node embeddings even as the depth increases.
GTATs’ better performance at deep layers can be attributed to the topology attention in our model architecture, which establishes the relationships between nodes from the perspective of the topology they inhabit. Topology attention enhances the distinctiveness of node feature representations, thereby improving the expressiveness of the model.
Robustness analysis
Better robustness indicates stronger stability of the model when facing noisy data. To evaluate the robustness of the GTATs, we conduct experiments on four different types of datasets and compare the performance of GTATs and GATs under a random feature attack (RFA). RFA19 intentionally corrupts node features in the graph to evaluate each model’s ability to withstand perturbations caused by feature attacks. In particular, the attack is implemented by randomly modifying the nodes’ features according to a noise ratio \(0 \le p \le 1\). For node \(i\), its representation is modified as follows:
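One plausible form of this perturbation (the exact way the noise ratio scales or mixes the noise with the original features is an assumption) is

\(\tilde{{h}}_i={h}_i+p \cdot noise\)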
where noise is a vector sampled from a Gaussian distribution, \(\mathcal {N}\), with mean zero and variance one.
Figure 8 shows the node classification accuracy on the four datasets as a function of the noise ratio p. As p increases, the accuracy of all models decreases, as expected. However, GTATs show a milder degradation in accuracy compared to GATs, which exhibit a steeper descent. The experimental results show that GATs, relying solely on node representations, have difficulty adapting to increased noise levels and suffer more pronounced performance declines. GTATs’ resilience to noise can be attributed to the extracted topology representations and the cross attention mechanism, both of which allow GTATs to maintain better differentiation and stability of node features under RFA. These results clearly demonstrate the robustness of GTATs over GATs in noisy settings.
Efficiency analysis
Similar to other deep learning models, GTAT may need to be deployed on small devices. To compare the scale of the GNN models, we carry out an analysis of the model parameter counts and their performance across three datasets of varying sizes. For a fair comparison, all models in this study adhere to the same hyperparameters: 2 attention heads, a hidden layer of 64 dimensions, a dropout rate of 0.6, a learning rate of 0.01, and a weight decay of 0.001. As shown in Table 4, GTATs incur only a slight increase in parameter counts compared to GATs, yet their performance is notably better. In contrast to GATs, GTATs additionally employ an MLP to convert the GDV into topology representations and \({a}_t\) to calculate the topology attention.
In general, the more orbits that are counted, the more local topological information a node can obtain. GTATs may benefit from richer topology information, but face a heavier computational burden. To understand the influence of different numbers of orbits on model predictions, we conduct experiments across three distinct dataset scales and measure the time required by OCRA to compute their GDVs. In this study, GTATs_4 denote the models that utilize orbits with up to four nodes, and GTATs_5 denote the versions that utilize orbits with up to five nodes. The results in Table 5 show that orbits with up to five nodes, while taking more time to compute than those with up to four nodes, enhance the accuracy of the predictions. Due to the lack of a more efficient algorithm, employing orbits with up to six nodes, while potentially increasing accuracy, would significantly increase the computational time, especially for larger and denser networks. To balance computational efficiency with accuracy gains, this paper adopts the 73 distinct orbits with up to five nodes as the nodes’ topology features.
Conclusion
In this paper, we introduce GTAT, an innovative framework designed to harness the topological potential of graph-structured data. GTAT distinctively merges node and topology features through a cross attention mechanism, enhancing node representations and capturing graph structure information. Experimental results indicate that our approach outperforms existing state-of-the-art models on classification tasks. Moreover, the performance of GTAT under variations in depth and noise suggests that its topology representations combined with the cross attention mechanism not only alleviate the over-smoothing issue but also enhance the model’s robustness. Future work will focus on refining GTAT and exploring its potential applications in diverse contexts.
Data availability
Codes are available at https://github.com/kouzheng/GTAT.
References
Majeed, A. & Rauf, I. Graph theory: A comprehensive survey about graph theory applications in computer science and social networks. Inventions 5, 10 (2020).
Ji, S., Pan, S., Cambria, E., Marttinen, P. & Philip, S. Y. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE transactions on neural networks and learning systems 33, 494–514 (2021).
Qian, Y. et al. Molscribe: robust molecular structure recognition with image-to-graph generation. Journal of Chemical Information and Modeling 63, 1925–1934 (2023).
Chen, S. et al. Deep unsupervised learning of 3d point clouds via graph topology inference and filtering. IEEE transactions on image processing 29, 3183–3198 (2019).
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems 28 (2015).
Fan, W. et al. Graph neural networks for social recommendation. In The world wide web conference, 417–426 (2019).
Ying, R. et al. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 974–983 (2018).
Allamanis, M., Brockschmidt, M. & Khademi, M. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 (2017).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Su, Y. et al. Nano scale instance-based learning using non-specific hybridization of dna sequences. Communications Engineering 2, 87 (2023).
Bruna, J., Zaremba, W., Szlam, A. & LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations (2019).
Monti, F. et al. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5115–5124 (2017).
Ghorvei, M., Kavianpour, M., Beheshti, M. T. & Ramezani, A. Spatial graph convolutional neural network via structured subdomain adaptation and domain adversarial learning for bearing fault diagnosis. Neurocomputing 517, 44–61 (2023).
Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29 (2016).
Levie, R., Monti, F., Bresson, X. & Bronstein, M. M. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67, 97–109 (2018).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR) (2017).
Veličković, P. et al. Graph Attention Networks. International Conference on Learning Representations (ICLR) (2018). Accepted as poster.
Brody, S., Alon, U. & Yahav, E. How attentive are graph attention networks? In International Conference on Learning Representations (ICLR) (2022).
Momennejad, I. Collective minds: social network topology shapes collective cognition. Philosophical Transactions of the Royal Society B 377, 20200315 (2022).
Smith, A. D., Dłotko, P. & Zavala, V. M. Topological data analysis: concepts, computation, and applications in chemical engineering. Computers & Chemical Engineering 146, 107202 (2021).
Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, e177–e183 (2007).
Feng, Y., You, H., Zhang, Z., Ji, R. & Gao, Y. Hypergraph neural networks. In Proceedings of the AAAI conference on artificial intelligence 33, 3558–3565 (2019).
Sankar, A., Zhang, X. & Chang, K. C.-C. Motif-based convolutional neural network on graphs. arXiv preprint arXiv:1711.05697 (2017).
Zhao, Q., Ye, Z., Chen, C. & Wang, Y. Persistence enhanced graph neural network. In International Conference on Artificial Intelligence and Statistics, 2896–2906 (PMLR, 2020).
You, J., Ying, R. & Leskovec, J. Position-aware graph neural networks. In International conference on machine learning, 7134–7143 (PMLR, 2019).
Tian, Y., Zhang, C., Guo, Z., Zhang, X. & Chawla, N. Learning mlps on graphs: A unified view of effectiveness, robustness, and efficiency. In The Eleventh International Conference on Learning Representations (2022).
Wang, X., Wang, X., Jiang, B., Tang, J. & Luo, B. Mutualformer: Multi-modal representation learning via cross-diffusion attention. International Journal of Computer Vision 1–22 (2024).
Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 423–443 (2018).
Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017).
Wu, F. et al. Simplifying graph convolutional networks. In International conference on machine learning, 6861–6871 (PMLR, 2019).
Bo, D., Wang, X., Shi, C. & Shen, H. Beyond low-frequency information in graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence 35, 3950–3957 (2021).
Lin, W. et al. Limit and screen sequences with high degree of secondary structures in dna storage by deep learning method. Computers in Biology and Medicine 166, 107548 (2023).
Li, X., Wei, W., Feng, X., Liu, X. & Zheng, Z. Representation learning of graphs using graph convolutional multilayer networks based on motifs. Neurocomputing 464, 218–226 (2021).
Du, J., Zhang, S., Wu, G., Moura, J. M. & Kar, S. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370 (2017).
Alsentzer, E., Finlayson, S., Li, M. & Zitnik, M. Subgraph neural networks. Advances in Neural Information Processing Systems 33, 8017–8029 (2020).
Bai, S., Zhang, F. & Torr, P. H. Hypergraph convolution and hypergraph attention. Pattern Recognition 110, 107637 (2021).
Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
Huang, Z. et al. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 603–612 (2019).
Chen, Z. et al. Alien: Attention-guided cross-resolution collaborative network for 3d gastric cancer segmentation in ct images. Biomedical Signal Processing and Control 96, 106500 (2024).
Jaegle, A. et al. Perceiver: General perception with iterative attention. In International conference on machine learning, 4651–4664 (PMLR, 2021).
Kirillov, A. et al. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
Wei, X., Zhang, T., Li, Y., Zhang, Y. & Wu, F. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10941–10950 (2020).
Huang, W., Wu, J., Song, W. & Wang, Z. Cross attention fusion for knowledge graph optimized recommendation. Applied Intelligence 1–10 (2022).
Cai, W. & Wei, Z. Remote sensing image classification based on a cross-attention mechanism and graph convolution. IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2020).
Milenković, T., Ng, W. L., Hayes, W. & Pržulj, N. Optimal network alignment with graphlet degree vectors. Cancer informatics 9, CIN–S4744 (2010).
Hočevar, T. & Demšar, J. A combinatorial approach to graphlet counting. Bioinformatics 30, 559–565 (2014).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. nature 323, 533–536 (1986).
Yang, Z., Cohen, W. & Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, 40–48 (PMLR, 2016).
Shchur, O., Mumme, M., Bojchevski, A. & Günnemann, S. Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).
Mernyei, P. & Cangea, C. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901 (2020).
Hu, W. et al. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33, 22118–22133 (2020).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (2019).
Prechelt, L. Early stopping-but when? In Neural Networks: Tricks of the trade, 55–69 (Springer, 2002).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Rossi, E. et al. Edge directionality improves learning on heterophilic graphs. In Learning on Graphs Conference, 25–1 (PMLR, 2024).
Rusch, T. K., Bronstein, M. M. & Mishra, S. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993 (2023).
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. Journal of machine learning research 9 (2008).
Cai, C. & Wang, Y. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318 (2020).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62172114, 62473104) and by Science and Technology Projects in Guangzhou (2023A03J0113). Our heartfelt thanks go out to Yanqing Su, Zhihong Chen and Minjia Huangfu for their unique companionship and invaluable discussions during this project.
Author information
Contributions
J: Conceptualization, Methodology, Programing, Visualization and Original Draft Preparation. Q: Visualization, Investigation, Writing-Review and Editing. Y: Visualization and Formal Analysis. B: Investigation and Validation. X: Resources, Supervision. Z (Corresponding Author): Project Administration, Conceptualization, Writing-Review and Editing.
Ethics declarations
Ethical approval
This study did not involve human or animal subjects, and thus, no ethical approval was required.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shen, J., Ain, Q.T., Liu, Y. et al. GTAT: empowering graph neural networks with cross attention. Sci Rep 15, 4760 (2025). https://doi.org/10.1038/s41598-025-88993-3