Introduction

In recent years, machine learning and deep learning have been extensively applied to research in the biological field1,2,3,4,5,6, demonstrating remarkable performance especially in cancer-related analysis and prognosis. In medical image segmentation7,8,9, the classic methods are a series of U-shaped CNN networks10,11 represented by U-Net12,13, whose input and output feature maps match the width and height of the image. By leveraging the inductive bias14 of CNNs, U-Net extracts local image features layer by layer to output segmentation results at the same resolution as the input. In recent years, however, the Transformer15 has achieved remarkable success in computer vision. Some works have harnessed the powerful global modeling capability of the Transformer, achieving impressive results in fields such as healthcare, microbiology, and industrial defect detection16,17,18,19,20,21,22. Others have introduced the Transformer into medical image segmentation, such as TransUnet23 and Swin Unet24; by exploiting the Transformer's global dense modeling capability, these works take medical image segmentation to a new level. Nevertheless, both classic CNNs and the Transformer treat the image as a grid or a regularly arranged sequence, which lacks flexibility and makes it difficult to capture human tumor regions with complex topological structures25 in medical images.

We critically examine the suitability of Transformers and CNNs for medical image segmentation, where the regular grid-based modeling of both architectures imposes a low performance ceiling. Even after large-scale pre-training26,27,28, it is still difficult for Transformers to learn high-level complex features. We therefore focus on GNN architectures29,30,31,32, which naturally accommodate complex topological structures. For example, research33 demonstrates that Graph Neural Networks (GNNs) are capable of unraveling and reallocating heterogeneous information within both the topological space and the attribute space. Some successful works have introduced the GNN architecture into the field of medical images. For instance, SG-Fusion34 constructs a hybrid Transformer-GNN model that significantly enhances prognostic indicators for gliomas. Another work combining Transformer and GNN35 achieves state-of-the-art qualitative and quantitative results in short-axis PET image quality enhancement. We find that GNNs can also serve as backbone architectures for medical image segmentation, emerging as strong competitors to CNNs and Transformers, and we propose U-GNN, which uses the powerful topological modeling ability of GNNs to fit the complex human structures to be segmented, such as tumors and organs. To the best of our knowledge, U-GNN is the first U-shaped architecture built entirely on GNNs, pioneering the introduction of graph neural networks into medical image segmentation.

We retain the backbone structure of U-Net, which has a natural advantage for medical image segmentation: the proposed U-GNN is also macroscopically U-shaped, and the feature size of the network's input and output layers equals the size of the image. In particular, the U-shaped network of U-GNN is stacked from a series of Vision GNN blocks. In each Vision GNN block, the image features are first divided into multiple image-patch nodes. These nodes then undergo graph structure learning36,37,38 to determine which node pairs should be directly connected, thereby establishing the graph topology. Unlike previous naive approaches, we propose the concept of the multi-order similarity receptive field39,40: the first-order similarity between two nodes alone is not sufficient to represent complex feature relationships, and the similarity of the surrounding nodes must also be considered, which is particularly important when large areas of a medical image share a black-and-white structure. We call this novel graph construction method multi-order similarity graph construction; it accurately describes the relationships between image nodes. After the graph is constructed, we propose a global-local combined node information aggregation41,42 method, called multi-order information aggregation, which fully enables information interaction between nodes and updates each node from its neighbors' features.

We conduct standard experiments on multi-organ43 and cardiac segmentation datasets44. The experimental results fully demonstrate that the proposed U-GNN method exhibits excellent segmentation accuracy and robust generalization ability. Overall, the contributions of our proposed U-GNN are mainly reflected in the following:

  • We construct a U-shaped neural network based on graph neural networks for the first time and apply it to medical image segmentation. This innovative attempt is unprecedented in this field.

  • Aiming at the characteristics of medical image segmentation, we propose a multi-order similarity graph construction method, which can describe complex scenarios that the Transformer cannot and accurately constructs a graph at the image-feature level.

  • We propose a multi-order information aggregation method, which fully enables information interaction between nodes and updates the graph at the image-feature level.

  • Extensive experiments demonstrate that the proposed U-GNN far surpasses previous methods based on CNNs or Transformers.

Methods

U-GNN overall structure

U-GNN presents a U-shaped structure as a whole, as illustrated in Fig. 1. The size of the input image is consistent with that of the output segmented image. Breaking it down, U-GNN can be regarded as a left-right symmetric encoder45-latent space46-decoder47 structure, and all these three structures are stacked by the most fundamental Vision GNN blocks.

In the encoder stage, input image features with dimension \(C\) and resolution \(H\times W\) are fed into a layer stacked from three consecutive Vision GNN modules for multi-scale representation learning. Within each layer of the network, the Vision GNN blocks are of equal size, and as the image features are updated, both the feature dimension and the resolution remain unchanged. Between layers, a downsampling operation halves the width and height of the image; to compensate for the resulting information loss, the feature dimension is doubled at the same time. This process is repeated three times in the encoder, so that the width and height of the medical image are ultimately compressed to one-eighth of their original size.

In the latent space layer, the image nodes pass through \(n_4\) Vision GNN blocks. The characteristic of this stage is that the number of nodes is relatively small (1/64 of that in the input layer), but the feature dimension of the nodes is large, which is suitable for learning some high-level and abstract image features.

The decoder mirrors the encoder and is likewise constructed from Vision GNN blocks. The difference is that between decoder layers the deep image features are upsampled: feature maps of adjacent dimensions are reshaped into feature maps of higher resolution, achieving 2x upsampling, while the feature dimension is correspondingly reduced to half of its original size. This process is repeated three times in the decoder, and the medical-image features are finally restored to the same size as the input, yielding the final segmentation result map.

In addition, it should be noted that, similar to U-Net, U-GNN introduces the skip connection mechanism with the aim of organically combining the multi-scale features extracted by the encoder with the upsampled features. Specifically, the features of the shallow layers (Layers 1, 2, and 3) are concatenated with the features of the deep layers (Layers 5, 6, and 7) through skip connections and participate in the feature learning of the deep layers together. This can effectively alleviate the problem of spatial information loss caused by the downsampling process, especially in terms of supplementing high-frequency details. This is because empirical evidence shows that the model tends to learn high-frequency detail information in the shallow layers and low-frequency abstract overall information in the deep layers.
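To make the stage arithmetic above concrete, the following is a minimal PyTorch sketch of the U-shaped skeleton, assuming a 224×224 input with patch size 4, three encoder stages that halve resolution and double channels, a bottleneck at 1/8 resolution, and three symmetric decoder stages fused with skips by concatenation. `VisionGNNBlock` is a stand-in (the graph operations are detailed in the following subsections); using one block per stage, the channel widths, and the patch-merging/expanding projections are our illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionGNNBlock(nn.Module):
    """Stand-in for the Vision GNN block; graph construction and aggregation
    are omitted here and described in the following subsections."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
    def forward(self, x):                  # x: (B, H, W, C) patch-node features
        return x + self.mlp(self.norm(x))

def downsample(x, proj):                   # halve H, W; proj: 4C -> 2C doubles channels
    B, H, W, C = x.shape
    x = x.reshape(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
    return proj(x.reshape(B, H // 2, W // 2, 4 * C))

def upsample(x, proj):                     # double H, W; proj: C -> 2C, so channels halve
    B, H, W, C = x.shape
    x = proj(x).reshape(B, H, W, 2, 2, C // 2).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, 2 * H, 2 * W, C // 2)

class UGNNSketch(nn.Module):
    def __init__(self, in_ch=3, num_classes=9, dim=96):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)   # 4x4 patch embedding
        self.enc = nn.ModuleList(VisionGNNBlock(d) for d in [dim, 2 * dim, 4 * dim])
        self.down = nn.ModuleList(nn.Linear(4 * d, 2 * d) for d in [dim, 2 * dim, 4 * dim])
        self.bottleneck = VisionGNNBlock(8 * dim)                    # latent space, 1/8 resolution
        self.up = nn.ModuleList(nn.Linear(d, 2 * d) for d in [8 * dim, 4 * dim, 2 * dim])
        self.fuse = nn.ModuleList(nn.Linear(2 * d, d) for d in [4 * dim, 2 * dim, dim])
        self.dec = nn.ModuleList(VisionGNNBlock(d) for d in [4 * dim, 2 * dim, dim])
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)
    def forward(self, img):                # img: (B, in_ch, 224, 224)
        x = self.stem(img).permute(0, 2, 3, 1)          # (B, 56, 56, dim)
        skips = []
        for blk, dn in zip(self.enc, self.down):
            x = blk(x)
            skips.append(x)                              # layers 1-3 feed the skips
            x = downsample(x, dn)
        x = self.bottleneck(x)                           # (B, 7, 7, 8*dim)
        for blk, up, fs, skip in zip(self.dec, self.up, self.fuse, reversed(skips)):
            x = upsample(x, up)
            x = blk(fs(torch.cat([x, skip], dim=-1)))    # skip connection by concatenation
        logits = self.head(x.permute(0, 3, 1, 2))        # (B, num_classes, 56, 56)
        return F.interpolate(logits, scale_factor=4)     # restore the input resolution
```

For a 224×224 input this yields node grids of 56², 28², 14², and 7², consistent with the 1/64 node count in the latent space noted above.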

Fig. 1

The overall architecture of U-GNN encompasses components such as an encoder, a bottleneck layer, a decoder, and skip connections. Among them, the construction of the encoder, the bottleneck layer, and the decoder is carried out based on the Vision GNN block.

Multi-order similarity graph construction

The U-GNN is constructed by stacking the basic components of the Vision GNN block. Figure 2 shows the specific structure of a single Vision GNN block. Each Vision GNN block consists of a LayerNorm (LN) layer, a parallel Local Graph and Global Graph structure, Node Aggregation, a residual connection, and a non-linear MLP. The whole process can be expressed mathematically as follows:

$$\begin{aligned} \hat{x}^l&=\text {NodeEmbedding}(x^l)\\ \hat{x}^{l + 1}&=F(G(\hat{x}^l))+\hat{x}^l\\ x^{l + 1}&=\text {MLP}(\text {LN}(\hat{x}^{l + 1}))+\hat{x}^{l + 1} \end{aligned}$$
(1)

where \(x^l\) is the input image feature of the \(l\)-th layer, which is first embedded into the node sequence \(\hat{x}^l\); \(G\) is the graph structure learning function that constructs a graph over the image; \(F\) is the function for node information aggregation and update after graph construction; and \(\text {MLP}\) is the feed-forward network, applied after layer normalization \(\text {LN}\).
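Read as code, Eq. (1) amounts to the following sketch; `G` and `F` are the graph-construction and aggregation functions defined in the remainder of this section, and default to identity placeholders here.

```python
import torch.nn as nn

class VisionGNNBlockEq1(nn.Module):
    """One Vision GNN block per Eq. (1): node embedding, a graph update with a
    residual connection, then a LayerNorm + MLP with a second residual."""
    def __init__(self, dim, G=None, F=None):
        super().__init__()
        self.embed = nn.Linear(dim, dim)               # NodeEmbedding
        self.G = G or nn.Identity()                    # graph structure learning
        self.F = F or nn.Identity()                    # node aggregation and update
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
    def forward(self, x):                              # x: (B, N, C) patch tokens
        x_hat = self.embed(x)                          # \hat{x}^l
        x_hat = self.F(self.G(x_hat)) + x_hat          # \hat{x}^{l+1}
        return self.mlp(self.norm(x_hat)) + x_hat      # x^{l+1}
```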

Fig. 2

Vision GNN block.

Before introducing the graph construction method G, it’s imperative to introduce the concept of the multi-order similarity receptive field. Let \(\textbf{X} = [\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_N]\) be a token sequence, where \(\textbf{x}_i \in \mathbb {R}^d\) and N is the number of graph nodes. For a node \(\textbf{x}_i\), its neighborhood \(\mathcal {N}(i)\) with radius r is \(\mathcal {N}(i)=\{j: |j-i|\le r, j\in \{1,\ldots ,N\}\}\). Define the aggregation of neighborhood \(\mathcal {N}(i)\) as \(\textbf{m}_i=\frac{1}{|\mathcal {N}(i)|}\sum _{k\in \mathcal {N}(i)}\textbf{x}_k\). The multi-order similarity receptive field SRF(ij) between nodes \(\textbf{x}_i\) and \(\textbf{x}_j\) is defined by the cosine similarity of their neighborhood aggregations:

$$\begin{aligned} SRF(i, j)=\frac{\textbf{m}_i^T\textbf{m}_j}{\Vert \textbf{m}_i\Vert \Vert \textbf{m}_j\Vert } \end{aligned}$$
(2)
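A direct, unoptimized PyTorch rendering of Eq. (2) under the one-dimensional neighborhood definition above (the radius `r` and the per-token loop are purely illustrative):

```python
import torch
import torch.nn.functional as F

def srf_matrix(x, r=1):
    """Multi-order similarity receptive field, Eq. (2).

    x: (N, d) token sequence; r: neighborhood radius along the sequence.
    Returns an (N, N) matrix with SRF(i, j) = cos(m_i, m_j), where m_i is the
    mean of the tokens in the neighborhood N(i) = {j : |j - i| <= r}.
    """
    N = x.shape[0]
    m = torch.stack([x[max(0, i - r): i + r + 1].mean(dim=0) for i in range(N)])
    m = F.normalize(m, dim=-1)       # unit norm, so the dot product is the cosine
    return m @ m.T
```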

The multi-order similarity receptive field is of paramount importance in medical image segmentation. As illustrated in Fig. 3, medical images typically feature a monotonous background of black, white, and gray tones. In such cases, the inherent first-order similarity between different image patches often fails to represent their true relationship. For instance, Fig. 3 presents two white nodes, one originating from a tumor region and the other from healthy tissue; their individual features nonetheless exhibit a high degree of similarity, since both are white pixel patches, and traditional similarity calculations would characterize them as highly similar. If the multi-order similarity receptive field is introduced instead, the similarity of the surrounding pixel patches is taken into account as well, rather than solely that of the two patches themselves. The disparity between them then becomes evident, clearly indicating that one comes from the tumor region and the other from healthy tissue.

Fig. 3

Illustration of first-order and multi-order similarity: First-order similarity captures node connections based on shallow features like color and texture, while multi-order similarity also considers the similarity of neighboring nodes, providing stronger semantic relevance.

As illustrated in Fig. 1, we divide the channel-wise image features into two branches: one participates in the global graph, while the other is used for the local graph. In the global graph, each image-patch node computes its first-order similarity with all other nodes, and the top eight most similar nodes are selected as neighbors; this reflects the first-order similarity between the current node and its most relevant peers, as previously described. In contrast, the local graph connects each node only to its immediate or second-order spatial neighbors, enabling effective aggregation of local context.
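The two branches can be sketched as follows: the global branch links each node to its top eight most similar peers by cosine similarity, while the local branch connects spatially adjacent nodes on the patch grid (the `radius` parameter and the dense similarity matrix are our simplifications):

```python
import torch
import torch.nn.functional as F

def global_graph(x, k=8):
    """Global branch: top-k most similar nodes as neighbors.
    x: (B, N, d) node features; returns neighbor indices of shape (B, N, k)."""
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.transpose(1, 2)                       # (B, N, N) cosine similarities
    eye = torch.eye(x.shape[1], dtype=torch.bool, device=x.device)
    sim = sim.masked_fill(eye, float("-inf"))           # exclude self-loops
    return sim.topk(k, dim=-1).indices

def local_graph(h, w, radius=1):
    """Local branch: spatial neighbors within `radius` on the h x w patch grid.
    Returns, for each node, the flat indices of its spatial neighbors."""
    nbrs = []
    for i in range(h):
        for j in range(w):
            nbrs.append([di * w + dj
                         for di in range(max(0, i - radius), min(h, i + radius + 1))
                         for dj in range(max(0, j - radius), min(w, j + radius + 1))
                         if (di, dj) != (i, j)])
    return nbrs
```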

Importantly, we introduce multi-order similarity through a channel allocation mechanism without incurring any additional computational overhead. In the shallow layers of the network, more channels are allocated to the local graph and fewer to the global graph; in the deeper layers this allocation is reversed, with more channels assigned to the global graph. Consequently, nodes that initially participate in the local graph at shallow levels are naturally incorporated into the global graph at deeper layers. Since these nodes have already aggregated local information, their first-order similarity computations in deeper layers are essentially equivalent to higher-order similarity computations with a larger receptive field. This allows the model to better represent complex and irregular objects.

Notably, this channel allocation strategy introduces multi-order similarity at no extra computational cost: the channels assigned to each branch would be processed in any case, so the allocation merely redistributes existing computation between the local and global branches rather than adding any.
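A sketch of such a depth-dependent split (the linear schedule and its endpoint fractions are illustrative assumptions, not the paper's exact schedule):

```python
def split_channels(x, layer_idx, num_layers, lo=0.25, hi=0.75):
    """Route a growing fraction of channels to the global branch with depth.

    Every channel is processed by exactly one branch either way, so the split
    redistributes computation rather than adding any.
    x: (..., C) features; returns (global part, local part) along channels.
    """
    t = layer_idx / max(1, num_layers - 1)   # 0 at the shallowest layer, 1 at the deepest
    c_global = int(x.shape[-1] * (lo + t * (hi - lo)))
    return x[..., :c_global], x[..., c_global:]
```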

Multi-order node information aggregation

After the construction of the graph is completed, it’s crucial to design an effective node information update mechanism. Similar to Convolutional Neural Networks (CNNs), when a Graph Neural Network (GNN) updates node information at the k-th layer, the process can be abstractly described in the following way:

$$\begin{aligned} x_i^{k + 1}=\psi \left( x_i^k,g\left( \left\{ x_j^k\mid x_j^k\in M\left( x_i^k\right) \right\} ,x_i^k,U_1\right) ,U_2\right) \end{aligned}$$
(3)

where \(M\left( x_i^k\right)\) represents the set of neighbors of node \(i\) at the k-th layer; \(g\) represents the node feature aggregation function, and \(\psi\) represents the node feature update function; \(U_1\) and \(U_2\) are both learnable matrices. In the context of U-GNN, we have proposed the following node update formula:

$$\begin{aligned} x_i^{l + 1}=W\cdot \text {Concat}\left[ x_i^l,\max _{j\in N(i)}\left( x_j^l-x_i^l\right) ,\text {linear}\left( \text {mean}_{j\in N(i)}\left( x_j^l\right) \right) \right] \end{aligned}$$
(4)

where \(x_i^l\) is the feature vector of node \(i\) at layer \(l\), \(x_j^l\) ranges over the feature vectors of the neighbors \(N(i)\) of node \(i\) at layer \(l\), \(W\) is a learnable weight matrix, \(\text {Concat}\) is the concatenation operation, \(\max\) computes the element-wise maximum over the neighbors, \(\text {mean}\) computes the element-wise average, and \(\text {linear}\) is a linear transformation.

The first term in the concatenation is \(x_i^l\), which ensures that the updated node feature \(x_i^{l + 1}\) retains the original information of the node itself. In graph neural networks, the self-information of a node is crucial as it represents the inherent characteristics of the node. By including \(x_i^l\), we prevent the loss of important node-specific information during the update process. The second term \(\max \left( x_j^l-x_i^l\right)\) captures the maximum difference between the node \(i\) and its neighbors. It helps the model to distinguish the node from its neighbors. If a node has a large difference from its neighbors in some feature dimensions, this term will highlight those differences. The third term \(\text {linear}\left( \text {mean}\left( x_j^l\right) \right)\) aggregates the information from the neighbors of node \(i\). The mean operation \(\text {mean}\left( x_j^l\right)\) computes the average feature vector of the neighbors, which gives a summary of the collective characteristics of the neighborhood. The linear transformation \(\text {linear}\) further maps this aggregated information to a suitable feature space. Let \(\text {mean}\left( x_j^l\right) =\frac{1}{|N(i)|}\sum _{j\in N(i)}x_j^l\), and \(\text {linear}(z)=Az + B\) where \(A\) is a matrix and \(B\) is a bias vector. This aggregation helps the node to incorporate the overall information of its neighborhood, which is essential for learning the global structure of the graph.
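Eq. (4) translates directly into the following sketch, given neighbor indices from the graph-construction step (the tensor shapes are our assumptions):

```python
import torch
import torch.nn as nn

class MultiOrderAggregation(nn.Module):
    """Node update of Eq. (4): concatenate the node's own feature, the
    element-wise max of neighbor differences, and a linear map of the
    neighborhood mean, then project with the learnable matrix W."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # 'linear' in Eq. (4)
        self.W = nn.Linear(3 * dim, dim)    # W applied to the concatenation
    def forward(self, x, nbr_idx):
        # x: (B, N, d) node features; nbr_idx: (B, N, k) neighbor indices
        B, N, d = x.shape
        idx = nbr_idx.unsqueeze(-1).expand(-1, -1, -1, d)          # (B, N, k, d)
        nbrs = torch.gather(x.unsqueeze(1).expand(B, N, N, d), 2, idx)
        max_rel = (nbrs - x.unsqueeze(2)).amax(dim=2)              # max_j (x_j - x_i)
        mean_nb = self.linear(nbrs.mean(dim=2))                    # linear(mean_j x_j)
        return self.W(torch.cat([x, max_rel, mean_nb], dim=-1))
```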

Results

Experiment setting

The U-GNN is primarily experimentally validated on two mainstream public medical image segmentation datasets, namely Synapse and ACDC. The Synapse multi-organ segmentation dataset encompasses 30 cases, comprising a total of 3779 abdominal clinical axial CT images. Adhering to the sample partitioning method described in the reference literature, 18 samples are allocated to the training set, and the remaining 12 samples are included in the test set. When evaluating the model’s performance, the average Dice Similarity Coefficient (DSC) and the average Hausdorff Distance (HD) are selected as quantitative evaluation metrics. A comprehensive and systematic assessment is conducted on eight abdominal organs, including the aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach.
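For reference, the DSC is the standard overlap measure; a minimal per-class computation over label maps might look like the sketch below (the exact volume-wise evaluation protocol follows the referenced literature):

```python
import torch

def mean_dice(pred, target, num_classes, eps=1e-6):
    """Mean Dice Similarity Coefficient over foreground classes.
    pred, target: integer label maps of identical shape."""
    scores = []
    for c in range(1, num_classes):            # class 0 is background
        p, t = pred == c, target == c
        inter = (p & t).sum().float()
        denom = p.sum().float() + t.sum().float()
        scores.append((2 * inter + eps) / (denom + eps))
    return torch.stack(scores).mean()
```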

The Automatic Cardiac Diagnosis Challenge (ACDC) dataset consists of MRI scans collected from different patients. For each patient's MR image, the left ventricle (LV), right ventricle (RV), and myocardium (MYO) are annotated. The dataset is divided into 70 training samples, 10 validation samples, and 20 test samples. Following the evaluation protocol of the reference literature, on this dataset we evaluate the compared methods using the average DSC (Dice Similarity Coefficient) alone.

The U-GNN is implemented in Python 3.8 with PyTorch 2.1.0. To enhance data diversity, data augmentation techniques such as flipping and rotation are applied to all training instances. The input image size is set to 224×224 with a patch size of 4. Training is carried out on an Nvidia RTX 3090 GPU with 24 GB of memory, and the model parameters are initialized with weights pre-trained on the ImageNet dataset.
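A minimal sketch of the flip-and-rotation augmentation mentioned above (the particular transforms and probabilities are assumptions; the paper does not specify them):

```python
import random
import torch

def augment(image, label):
    """Randomly flip and rotate an image/label pair by multiples of 90 degrees."""
    if random.random() < 0.5:                            # random horizontal flip
        image, label = torch.flip(image, [-1]), torch.flip(label, [-1])
    k = random.randint(0, 3)                             # random 90-degree rotation
    return (torch.rot90(image, k, dims=(-2, -1)),
            torch.rot90(label, k, dims=(-2, -1)))
```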

Results on synapse dataset

Table 1 presents the comparison between our proposed U-GNN and previous state-of-the-art methods on the Synapse multi-organ CT dataset. The experimental data indicate that our Unet-like pure-GNN method achieves the best performance, with an average Dice Similarity Coefficient (DSC) of 84.03% and an average Hausdorff Distance (HD) of 17.72. Compared with TransUnet and the recent Swin Unet, our algorithm improves the DSC metric by approximately 6% and the HD metric by approximately 18%, which strongly demonstrates the method's excellent edge-prediction performance. Figure 4 shows the segmentation results of different methods on the Synapse multi-organ CT dataset. As the figure shows, CNN-based methods are prone to over-segmentation, likely due to the inherent locality of convolutional operations. In this study, we demonstrate that by parallelizing the local GNN and the global GNN along the channel dimension and combining them with a U-shaped architecture with skip connections, global and local information can be integrated more effectively, yielding more satisfactory segmentation results. To the best of our knowledge, this result represents the current state of the art in this field.

Fig. 4

The segmentation results of different methods on the Synapse.

Table 1 Segmentation accuracy of different methods on the Synapse multi-organ CT dataset. Abbreviations: Ao (Aorta), GB (Gallbladder), LK (Left Kidney), RK (Right Kidney), Li (Liver), Pan (Pancreas), Spl (Spleen), Sto (Stomach).
Table 2 Training speed and GPU memory occupancy.

Results on ACDC dataset

As with the Synapse dataset, we apply the proposed U-GNN to the ACDC dataset to achieve accurate medical image segmentation; the experimental results are summarized in Table 3. With Magnetic Resonance (MR) images as input, U-GNN again demonstrates excellent performance, achieving an average DSC of 91.53%. This result shows that the proposed method has good generalization ability and robustness, maintaining stable and efficient performance across different datasets and imaging modalities.

Table 3 Segmentation accuracy of different methods on the ACDC dataset.

Ablation study

To substantiate the efficacy of the proposed method, we conduct ablation experiments; Table 4 shows the outcomes. Evidently, on the Synapse dataset, the two proposed sub-methods, Multi-order Similarity Graph Construction and Multi-order Node Information Aggregation, each yield improvements to varying extents. Employed in conjunction, they form the final U-GNN, which attains state-of-the-art (SOTA) results in medical image segmentation.

Table 4 Ablation study of the proposed methods.

Furthermore, we conduct ablation experiments on the proposed multi-order similarity graph construction to investigate the optimal similarity receptive field size. As shown in Table 5, the worst performance is observed when only pairwise similarities between graph nodes are considered, without incorporating multi-order similarity, resulting in a DSC score of 79.74. Notably, as the similarity receptive field expands, the segmentation performance of U-GNN improves significantly, achieving the highest DSC score of 84.03 when a 3-order receptive field is employed.

Table 5 Ablation on the size of the multi-order similarity receptive field.

Visualization

To gain a deeper understanding of the underlying mechanism of U-GNN, we visualized the interactions between tumor image patches and other image nodes within the model. As shown in Fig. 5, U-GNN accurately identifies tumor patches at different locations as neighboring nodes. Despite some tumor features being distant or obscured, U-GNN effectively establishes their connections with high accuracy, highlighting its robust ability to capture long-range dependencies in tumor representations.

Fig. 5

U-GNN is capable of providing precise graph-level information for medical images, ensuring that image patches corresponding to tumor regions are closely connected and interact with each other.

Conclusion

This paper presents U-GNN, a novel U-shaped encoder-decoder architecture specifically designed for medical image segmentation. The proposed U-GNN is entirely built upon Graph Neural Networks (GNNs). To fully exploit the potential of GNNs in feature learning and information interaction, we employ the Vision GNN blocks as the fundamental unit for feature representation and long-range semantic information exchange. To capture the deep features of medical images, we introduce two innovative methods: Multi-order Similarity Graph Construction and Multi-order Node Information Aggregation. Extensive and in-depth experiments are conducted on multiple tasks, including multi-organ segmentation and cardiac segmentation. The experimental results clearly demonstrate that the proposed U-GNN exhibits superior performance and strong generalization capabilities. Among the current models in medical image segmentation, our approach achieves a significant performance advantage.