Introduction

Relation extraction aims to classify relationships between entities in a given text1, enabling effective information mining and supporting downstream tasks such as knowledge graph construction, question answering systems, and search engines. Existing relation extraction methods can be broadly categorized into two approaches: sequence-based classification and dependency-based classification2. Sequence-based extraction methods primarily focus on predicting relationships using vector transformations of word-level information within sentences. Representative models include Convolutional Neural Networks (CNN)3, Long Short-Term Memory Networks (LSTM)4, and pre-trained models such as BERT and GPT5,6. Although these character-level methods can capture relationships between entities within a sentence to some extent, they struggle with long-distance dependencies. As sentence length increases, relying solely on entity vector representations makes it difficult to accurately capture dependencies between distant entities. To address this issue, additional features such as radicals or syntactic information are often incorporated to enrich the representation of inter-entity relationships7. Dependency trees serve as a structured visualization of sentence semantics, providing additional features beyond word embeddings. They enable direct interactions between distant entities in a sentence, thereby improving relation extraction performance8,9.

However, directly incorporating dependency syntax graphs as supplementary features introduces challenges. Since these graphs are generated using external parsers, they often contain noisy information that may hinder effective entity interactions10. The sensitivity to noise increases as entity distance grows, negatively impacting classification accuracy. To mitigate this issue, fixed pruning strategies or subgraph sampling techniques are commonly used, such as extracting the shortest path between given entities11 or pruning dependency trees12. However, such static pruning and subgraph sampling approaches fail to dynamically adapt to sentence structures, leading to residual noise that affects long-distance entity interactions. To further improve relation extraction, this paper proposes a novel method that integrates semantic and syntactic features to handle noisy long-distance dependencies dynamically. Specifically, we introduce an adaptive pruning strategy for dependency syntax graphs and apply soft-threshold filtering to entity vector representations. This approach enhances long-distance dependency modeling while reducing noise from syntactic information, ultimately achieving superior relation extraction performance compared to baseline models.

The main contributions of this paper are as follows:

Long-distance dependency capture with SA-GCN: We propose a Self-Attention-based Graph Convolutional Network (SA-GCN) to dynamically update node representations within the dependency syntax graph. By leveraging a self-attention mechanism, our method effectively removes irrelevant nodes, enabling more precise long-distance dependency modeling in noisy environments.

Noise suppression with residual-dependency syntax graphs: We introduce a noise suppression mechanism based on a residual shrinkage network, which retains crucial information within the dependency syntax graph while filtering out irrelevant features. This dynamic noise removal process enhances the extraction of meaningful information and improves model robustness against noisy input.

Relation extraction optimization via predictive interpolation: We propose a predictive interpolation-based optimization strategy for integrating semantic and syntactic features. By combining contextual semantic representations from BERT with syntactic features from the dependency syntax graph, our approach achieves a deep synergy between the two information sources. Additionally, during training, a predictive interpolation loss function facilitates effective interaction between BERT and SA-GCN, significantly enhancing the model’s ability to capture long-distance dependencies.

Related work

Relation extraction methods can generally be categorized into two main approaches: sequence-based relation extraction and dependency-based extraction methods. Sequence-based relation extraction methods classify relationships between entities by leveraging word vector representations within sentences. Zeng et al.13 were the first to employ Convolutional Neural Networks (CNN) combined with entity position markers to extract word and token embeddings from sentences, followed by a relation classifier to determine entity relationships. Li et al.14 proposed an entity-aware attention mechanism integrated with a Long Short-Term Memory network (LSTM) to capture word representations, constructing latent entity-type vectors to encode contextual information and classify relations. Nathani et al.15 introduced an attention-based feature embedding approach using CNNs, capturing entity and relation features from neighboring entities and performing link prediction between missing entities. Although these methods achieve reasonable performance in relation classification, CNN- and LSTM-based models struggle to extract deep semantic features from sentences effectively. With the emergence of pre-trained models such as BERT16,17, these models have been widely adopted in various natural language processing (NLP) tasks. Wu et al.18 introduced the R-BERT model, which utilizes pre-trained embeddings to encode sentences while marking entity positions with special tokens, followed by classification for relation prediction. Hou et al.19 combined BERT with CNN for relation extraction in specialized domains, where BERT encodes word embeddings, and CNN extracts multi-scale features via pooling. However, simple pooling operations such as summation or averaging can result in information loss. Xu et al.20 observed that most existing models rely on neural network architectures but overlook the impact of key phrases on relation extraction. To address this, they proposed a BERT-based gated multi-window attention network (BERT-GMAN), which extracts sentence-level semantic features from BERT, builds a key phrase semantic network for multi-granularity phrase information, and applies feature fusion before classification. Shi et al.21 further enhanced BERT-based extraction by integrating lexical and syntactic features, such as part-of-speech tags and dependency trees, achieving superior extraction performance.

However, sequence-based methods heavily rely on word vector representations while neglecting the influence of syntactic dependencies within a sentence. Dependency-based extraction methods incorporate sentence- and syntax-level features to enrich entity representations. Since sentence dependencies can be naturally represented as graphs, these methods are often combined with Graph Neural Networks (GNNs). Hao et al.22 proposed a generative relation extraction approach that constructs dependency graphs between entities and utilizes GNNs to propagate information, enabling not only relation extraction but also multi-hop link prediction. However, the constructed dependency graphs may contain irrelevant information, which can negatively impact information propagation and reduce classification accuracy. To address this issue and improve relation extraction accuracy, Tian et al.23 introduced a dependency-driven method that employs an attention mechanism within GNNs to distinguish the importance of different word dependencies. Guo et al.24 proposed an attention-guided graph convolutional network (AGGCN) that applies soft pruning to automatically focus on relevant structures in relation extraction tasks, effectively filtering out irrelevant information. Xue et al.25 took a different approach by eliminating reliance on external parsers. They used a Gaussian generator to construct multi-view graphs directly from raw text, refining the graph structure through interaction between graph convolutions and DTWPool before final relation classification.

Although these methods refine dependency trees by assigning attention weights to different nodes, a major limitation is that they may still distort the information of entity nodes. In this work, we fully integrate sentence-level semantic information with dependency syntax trees by leveraging a Self-Attention-based Graph Convolutional Network (SA-GCN) to process entire dependency graphs. This approach enhances long-distance dependency modeling while utilizing a residual shrinkage mechanism to improve noise robustness. By effectively fusing semantic and syntactic information, our method achieves superior relation extraction performance.

Methods

This paper proposes a noisy long-distance dependency relation extraction method integrating semantic and syntactic features to address the challenges posed by long-distance dependencies and noise interference in entity relations. The proposed method consists of three core components: (1) contextual semantic feature extraction based on BERT, (2) long-distance dependency modeling incorporating syntactic features, and (3) relation extraction by integrating semantic and syntactic representations. The overall framework is illustrated in Fig. 1.

First, the input text is encoded using a pre-trained model (BERT), transforming it into corresponding token embeddings. Simultaneously, entity vectors are extracted based on entity index mappings. Then, a dependency syntax graph is constructed from both the textual input and an external parser, where nodes in the graph are derived from the BERT-encoded token embeddings. A self-attention-based graph convolutional network (SA-GCN) is employed to capture long-distance dependencies between entities effectively. Additionally, a residual shrinkage network is introduced to dynamically suppress noise in the syntactic graph, further enhancing the model’s robustness against noisy dependencies. Finally, the integrated semantic and syntactic feature vectors are passed through two multi-layer perceptrons (MLPs) to construct a relation prediction matrix. The final entity relation output is obtained using an activation function combined with a masking mechanism, improving the accuracy and robustness of relation extraction.

Fig. 1
figure 1

Method Flowchart.

Contextual semantic feature extraction based on BERT

This paper uses large-scale pre-trained models to extract the contextual semantic representations of entities in the text. The input data requires not only the text \(X=\{x_1, x_2, x_3, \ldots, x_n\}\), but also the entity position indexes \(index=\{(i_1, j_1), \cdots, (i_k, j_k)\}\), where \(i_k, j_k \in \{1, 2, \ldots, n\}, i_k \leqslant j_k\) represent the head and tail indices of the \(k\text{-th}\) entity in the text. The vectorization process involves encoding batch text data using a pre-trained BERT model to obtain output \(N \in {\text{R}}^{b \times n \times 768}\). Then, leveraging the entity position index information \(\{i, j\} \in (0, n)\), an averaging operation is performed to obtain the vectorized representation of the entity \(E \in {\text{R}}^{b \times k \times 768}\). Equation (1) represents the expression of an entity vector \({e_k}\) in E.

$${e_k}=\frac{{\sum\limits_{{({i_k},{j_k}) \in index}} {\operatorname{extract} (N;<{i_k},{j_k}>)} }}{{{j_k} - {i_k}}}$$
(1)

Where b represents the batch size. \({e_k} \in {{\text{R}}^{1 \times 768}}\) represents the entity vector indexed by k; i and j represent the head and tail positions of the entity, respectively; and \(\operatorname{extract}\) denotes the matrix slice taken at the specified positions.
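As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of the span-averaging step, assuming N is the BERT output of shape (b, n, 768) and index holds the per-sample \((i_k, j_k)\) spans; the function name and the batch handling are illustrative assumptions, not the paper’s implementation.

```python
import torch

def extract_entity_vectors(N: torch.Tensor, index: list) -> torch.Tensor:
    """Average BERT token embeddings over each entity span (Eq. 1).

    N     : (b, n, 768) token embeddings produced by BERT.
    index : per-sample list of (i_k, j_k) head/tail token indices.
    Returns E of shape (b, k, 768), assuming every sample has k entities.
    """
    batch_vectors = []
    for sample_idx, spans in enumerate(index):
        entity_vectors = []
        for i_k, j_k in spans:
            # slice the span [i_k, j_k) and average it, matching the
            # division by (j_k - i_k) in Eq. (1)
            span = N[sample_idx, i_k:j_k, :]
            entity_vectors.append(span.mean(dim=0))
        batch_vectors.append(torch.stack(entity_vectors))
    return torch.stack(batch_vectors)

# toy usage with random embeddings standing in for BERT output
N = torch.randn(2, 10, 768)
index = [[(1, 3), (5, 7)], [(0, 2), (4, 6)]]
E = extract_entity_vectors(N, index)   # shape (2, 2, 768)
```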

Long distance dependency relationship capture considering syntactic features

Compared to traditional vector extraction methods, the dependency syntax graph can significantly reflect the internal structure and syntactic information of a sentence. However, some irrelevant nodes (such as commas and periods) can introduce irrelevant information between entities during the message-passing process of the graph neural network, making it difficult for entity nodes to learn key features. To reduce the error propagation caused by reliance on the parser and enhance long-distance dependencies between entities, this paper proposes the use of a self-attention-based graph convolutional network (SA-GCN) after constructing the syntax graph. By combining the attention mechanism with convolutional operations, this approach sorts node scores and removes the influence of irrelevant nodes, effectively capturing long-distance dependencies between entities.

Construction of text dependency syntax graph based on external parser

The dependency syntax graph27 represents the dependency relationships between words in a natural language sentence, with each token being represented as a node and connected by directed edges (such as subject-predicate relations, verb-object relations, etc.). As a graphical structure, the dependency graph contains rich semantic and syntactic information. In this paper, the dependency syntax graph of the text is constructed using an external parser, with the aim of providing more syntactic features for the relationships between entities within the sentence, and effectively leveraging structural information to offer more entity-level semantic relations.

The parser automatically infers the dependency relationships between words based on syntactic rules and contextual information, thereby generating the corresponding dependency syntax graph28,29. The process of constructing the dependency syntax graph using a parser is illustrated in Fig. 2. First, entity information is input into an external parser, and tokenization rules are modified to prevent inconsistencies between the default tokenization results of the external parser and the entity information. Then, the text is tokenized by the external parser, obtaining the tokenization index \(cws\) and dependency relationships \(dep\). The tokenization index and semantic vector N are then used to generate the vector representations of the dependency syntax graph nodes. Meanwhile, the dependency relationships \(dep\) are utilized to construct the corresponding adjacency matrix.

Specifically, for text \(X=\{x_1, x_2, x_3, \ldots, x_n\}\), first, the parser’s tokenization rules are adjusted based on the entities in the text. Then, the tokenization and dependency parsing results are obtained accordingly, as shown in the following equation:

$$\left\{ \begin{gathered} parser=\operatorname{special} \_case\,(parser,entity) \hfill \\ cws,dep=\operatorname{parser} \,(X) \hfill \\ \end{gathered} \right.$$
(2)

Where, \(entity\) represents the entities in the text; \(\operatorname{special}\_case\) denotes the modified tokenization function; \(cws\) refers to the tokenization results; \(dep\) refers to the dependency relations; \(parser\) represents the external parser. To construct the dependency syntax graph, the tokenization result \(cws\) is combined with the word vector N to obtain the vector representation of the syntax graph nodes. The vector representation of the \(i\text{-th}\) node in the graph is given by Eq. (3).

$$H^{i} = \left\{ {\begin{array}{*{20}l} {\text{mean} ([N; < cws[i]_{a} ,cws[i]_{b} > ]),} & {{\text{if i}} \ne {\text{0}}} \\ {N_{{[CLS]}} {\text{ ,}}} & {{\text{if i = 0}}} \\ \end{array} } \right.$$
(3)

Where, N represents the vector generated by BERT; \(cws{[i]_a}\) and \(cws{[i]_b}\) denote the start and end indices of the \(i\text{-th}\) token in the text; \({H^i}\) denotes the embedding vector of the \(i\text{-th}\) node in the syntax graph; \({N_{[CLS]}}\) represents the CLS vector output by BERT, which is mapped to the “root” node of the external parser’s output; \(\operatorname{mean}([N;<u,v>])\) represents the summation of the \(u\text{-th}\) vector to the \(v\text{-th}\) vector in N, followed by averaging.

The adjacency matrix of the syntactic graph is derived from the dependency relations and can be expressed as Eq. (4):

$$A_{{\alpha ,\beta }} = \left\{ {\begin{array}{*{20}l} {1,} & {{\text{if dep[}}\alpha {\text{]}} = \beta } \\ {0,} & {\text{otherwise} } \\ \end{array} } \right.$$
(4)

Where, \({\text{dep[}}\alpha {\text{]}}\) represents the dependent parent node of node \(\alpha\); \({A_{\alpha ,\beta }}\) represents the value of row \(\alpha\) and column \(\beta\) of the adjacency matrix.
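To make Eqs. (2)–(4) concrete, the sketch below builds the tokenization spans, the dependency list \(dep\), and the adjacency matrix A, with spaCy standing in for the external parser. The paper does not name a specific parser, so the parser choice, the en_core_web_sm model, and the convention of reserving node 0 for the root/[CLS] node are assumptions for illustration. The node embeddings of Eq. (3) would then be obtained by averaging the BERT sub-token vectors over each returned span, analogous to the entity-averaging sketch above.

```python
import numpy as np
import spacy

def build_dependency_graph(text: str):
    """Build the adjacency matrix of the dependency syntax graph (Eq. 4)
    with spaCy as a stand-in external parser; node 0 is reserved for the
    [CLS]/root node, so token i maps to graph node i + 1."""
    nlp = spacy.load("en_core_web_sm")      # any dependency parser would do
    doc = nlp(text)
    n = len(doc) + 1                        # +1 for the root/[CLS] node
    A = np.zeros((n, n), dtype=np.float32)
    cws, dep = [], []
    for token in doc:
        cws.append((token.idx, token.idx + len(token)))   # character span of the token
        if token.head is token:             # the syntactic root attaches to node 0
            dep.append(0)
            A[token.i + 1, 0] = 1.0
        else:
            dep.append(token.head.i + 1)
            A[token.i + 1, token.head.i + 1] = 1.0
    return cws, dep, A

cws, dep, A = build_dependency_graph("The company acquired the startup in 2020 .")
```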

Fig. 2
figure 2

Construction of Dependency Syntax Graph.

Long distance dependency relationship capture method based on SA-GCN

GCN is a graph-structured deep learning model30 capable of preserving the topological structure and node characteristics within a graph, integrating both node and structural information. Its core idea is to update a node’s representation by aggregating the features of its neighboring nodes. The formula for the traditional GCN is shown in Eq. (5) as follows:

$$H_{\phi }^{l+1}{\text{ = }}{\tilde {D}^{ - \frac{1}{2}}}\tilde {A}{\tilde {D}^{ - \frac{1}{2}}}H_{\phi }^{l}{W^{l+1}}$$
(5)

Where, \(H_{\phi }^{l} \in {\text{R}}^{n \times d}\) represents the node vector at layer l; \(\tilde{D} \in {\text{R}}^{n \times n}\) is the degree matrix of \(\tilde{A} \in \{0,1\}^{n \times n}\); and \(W^{l+1}\) represents the weight matrix used to generate the node embeddings for the \(l+1\) layer.

To guide the model in removing the propagation of irrelevant node information, this paper proposes a self-attention graph convolutional network (SA-GCN) that assigns a weight to each node, capturing the long-distance dependency relationships between entities. As shown in Fig. 3, the process begins by averaging each node vector generated by the BERT pre-trained model to obtain the input vector for the module. Then, the node representations of the dependency graph and the adjacency matrix are fed into the graph convolutional network. They are multiplied by a learnable attention parameter matrix \({M_1}\) to obtain a node importance matrix that integrates both node and structural information. This is represented by the following Eq. (6):

$$\left\{ \begin{gathered} H_{\phi }^{1}={{\tilde {D}}^{ - \frac{1}{2}}}\tilde {A}{{\tilde {D}}^{ - \frac{1}{2}}}H_{\phi }^{0}{W_1} \hfill \\ scor{e_1}{\text{ = tanh(}}H_{\phi }^{1} \times {M_1}{\text{)}} \hfill \\ \end{gathered} \right.$$
(6)

Where, A is the adjacency matrix of the original dependency syntactic graph; \(\tilde {A} = A + I\) is the adjacency matrix with self-loops added; \({W_1}\) represents the learnable weight matrix of the first layer; \(I \in \{0,1\}^{n \times n}\) is the identity matrix; \(\tanh\) is the activation function; \({M_1} \in {\text{R}}^{d \times 1}\) is the attention matrix; \(scor{e_1} \in {\text{R}}^{n \times 1}\) is the node importance score of the first layer.

To discard irrelevant node features and enhance the feature representation of entity nodes, after sorting the sequence \(scor{e_1}\), the rows and columns of the adjacency matrix \({A^0}\) corresponding to nodes with lower scores are set to 0. This reduces the influence of these nodes during global message passing in the next layer. The specific process is shown in Eq. (7) as follows:

$$\left\{ \begin{gathered} ID{\text{ = top\_rank(}}score{\text{,[}}k \cdot {n_1}{\text{])}} \hfill \\ {A^1}{\text{ = change(}}{A^0}{\text{,}} ID{\text{)}} \hfill \\ \end{gathered} \right.$$
(7)

Where, \({\text{top\_rank}}\) represents the sorting function; k denotes the ratio of the current layer’s nodes to those in the previous layer; \({n_1}\) represents the number of nodes in the maximal connected subgraph after the first update of the adjacency matrix; \({\text{change}}\) is the adjacency matrix transformation function; \({A^1}\) represents the updated adjacency matrix after the first round of graph message passing.
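The following PyTorch sketch combines Eqs. (5)–(7) into one self-attention graph pooling step: normalized GCN propagation, tanh attention scoring with a learnable matrix M, and zeroing of the rows and columns of low-scoring nodes. The class name, the default keep ratio, and the simplification of using the current node count in place of \(n_1\) are assumptions for illustration, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionGraphPooling(nn.Module):
    """One SA-GCN step: GCN propagation (Eqs. 5/6), self-attention node
    scoring (Eq. 6), and pruning of low-scoring nodes from the adjacency
    matrix (Eq. 7). Details are an illustrative reading of the text."""

    def __init__(self, dim: int, keep_ratio: float = 0.8):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # GCN weight W_l
        self.M = nn.Linear(dim, 1, bias=False)     # attention matrix M_l
        self.keep_ratio = keep_ratio               # k in Eq. (7)

    def forward(self, H: torch.Tensor, A: torch.Tensor):
        # symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
        A_tilde = A + torch.eye(A.size(0), device=A.device)
        d = A_tilde.sum(dim=1)
        D_inv_sqrt = torch.diag(d.clamp(min=1e-12).pow(-0.5))
        H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ self.W(H)

        # node importance scores (Eq. 6) and top-k selection (Eq. 7)
        score = torch.tanh(self.M(H_next)).squeeze(-1)
        n_keep = max(1, int(self.keep_ratio * H.size(0)))
        keep_ids = torch.topk(score, n_keep).indices

        # zero out rows/columns of pruned nodes instead of deleting them,
        # so entity indices stay valid for later lookups
        mask = torch.zeros(A.size(0), device=A.device)
        mask[keep_ids] = 1.0
        A_next = A * mask.unsqueeze(0) * mask.unsqueeze(1)
        return H_next, A_next, score
```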

Similarly, the node embeddings output after the second round of message passing and the updated adjacency matrix are shown in Eq. (8) as follows:

$$\left\{ \begin{gathered} H_{\phi }^{2}{\text{= }}{{\tilde {D}}^{ - \frac{1}{2}}}{{\tilde {A}}^1}{{\tilde {D}}^{ - \frac{1}{2}}}H_{\phi }^{1}{W_2} \hfill \\ scor{e_2}=\tanh (H_{\phi }^{2} \times {M_2}) \hfill \\ {A^2}={\text{change}}({A^1},\operatorname{top} \_rank(scor{e_2},[k \cdot {n_2}])) \hfill \\ \end{gathered} \right.$$
(8)

Finally, the node output vectors after the third round of message passing are shown in Eq. (9) as follows:

$$H_{\phi }^{3}{\text{ = }}{\tilde {D}^{ - \frac{1}{2}}}{\tilde {A}^2}{\tilde {D}^{ - \frac{1}{2}}}H_{\phi }^{2}{W_3}$$
(9)

Where, \(H_{\phi }^{2}\) and \(H_{\phi }^{3}\) represent the output results after the 2nd and 3rd passes through the GCN network; \({M_2}\) represents the attention parameter matrix; \({W_2}\) and \({W_3}\) represent the parameter matrices of the 2nd and 3rd GCN networks; and \({n_2}\) represents the number of nodes in the maximal connected subgraph after the second message passing.

Fig. 3
figure 3

Graph Convolution Module Based on Self-Attention Pooling Layer.

Noise suppression method based on residual-dependency syntax graph

To suppress the influence of irrelevant node noise during graph message passing, this paper introduces a Residual Shrinkage (RS) network and proposes a noise suppression method based on a residual-dependency syntax graph. After the graph convolution, self-attention, and sorting operations, the filtered graph structure of each layer is obtained; Fig. 4 illustrates how the residual shrinkage network is embedded into graph message passing. The RS network takes the output of each convolution stage as input and applies soft thresholding with a learnable threshold to the internal node vectors, dynamically removing noise interference in the syntax graph and enhancing the extraction of valid information.

In the residual shrinkage network, the node vectors are first processed by taking their absolute values to make the features positive. After that, a global average pooling operation is performed to obtain the average value of each node vector. Next, two fully connected layers along with corresponding activation functions are used to learn the threshold features, which are then mapped to the corresponding dimensional vector, resulting in the threshold feature matrix x, as represented by formula (10):

$${x_i}=\operatorname{sigmoid} ({W_\delta }\operatorname{relu} ({W_\varepsilon }\operatorname{Pooling} (|H_{\phi }^{i}|)+{c_\varepsilon })+{c_\theta })$$
(10)

Where \(H_{\phi }^{i} \in {{\text{R}}^{n \times {\text{d}}}}\) is the node vector after convolution; \(\operatorname{Pooling}\) is the global average pooling function; \(W \in {{\text{R}}^{1 \times 1}}\) and \(c \in {{\text{R}}^{1 \times 1}}\) represent the learnable parameter matrices and biases of the linear layers, respectively; \(\operatorname{sigmoid}\) and \(\operatorname{relu}\) are the activation functions.

To prevent the issue of excessively large thresholds, the vector \({x_i}\) is multiplied element-wise with the absolute value of vector \(H_{\phi }^{i}\) to obtain the final set of thresholds \(\delta\). Then, the original node vector undergoes soft thresholding, as shown in Eq. (11):

$$h_{{r,\phi }}^{i}=\left\{ \begin{gathered} h_{\phi }^{i} - \delta ,{\text{ }}h_{\phi }^{i}>\delta \hfill \\ 0,{\text{ -}}\delta \leqslant h_{\phi }^{i} \leqslant \delta \hfill \\ h_{\phi }^{i}+\delta ,{\text{ }}h_{\phi }^{i}< - \delta \hfill \\ \end{gathered} \right.$$
(11)

Where, \(h_{\phi }^{i}\) and \(h_{{r,\phi }}^{i}\) represent the value of node vectors before and after the update within \(H_{\phi }^{i}\), respectively.
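A minimal PyTorch sketch of the residual shrinkage block described by Eqs. (10) and (11) is given below; the 1 × 1 linear layers follow the dimensions stated above, while the module name and the per-node threshold handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualShrinkage(nn.Module):
    """Residual shrinkage block: learn a per-node soft threshold (Eq. 10)
    and apply soft thresholding to the node vectors (Eq. 11)."""

    def __init__(self):
        super().__init__()
        # two scalar fully connected layers, matching W, c in R^{1x1}
        self.fc1 = nn.Linear(1, 1)
        self.fc2 = nn.Linear(1, 1)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # global average of |H| per node, then FC -> relu -> FC -> sigmoid (Eq. 10)
        abs_H = H.abs()
        pooled = abs_H.mean(dim=-1, keepdim=True)                 # (n, 1)
        x = torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled)))) # per-node scale in (0, 1)
        # thresholds delta = x * |H| keep the threshold bounded by the feature scale
        delta = x * abs_H
        # soft thresholding (Eq. 11): shrink each element toward zero by delta
        return torch.sign(H) * torch.clamp(abs_H - delta, min=0.0)

# toy usage on 20 node vectors of dimension 768
rs = ResidualShrinkage()
H_denoised = rs(torch.randn(20, 768))
```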

Fig. 4
figure 4

Residual-Dependency Syntax Graph Noise Suppression Method Flowchart.

The following pseudocode demonstrates an example of the batch-processing residual-dependency syntax graph noise suppression method:

Noise Suppression Method Based on Residual-Dependency Syntax Graph

\(Input\): In batch processing, sentence \(X=\{ {X_1},{X_2} \ldots,{X_b}\}\); each sentence corresponds to word vector \(\{ {E_1},{E_2}, \ldots,{E_b}\}\);

\(Output\): In batch processing, node vector H.

Initialize the dependency syntax graph

for \({X_i}\) in X do

\({H^i},{A^i} \leftarrow (\operatorname{externalParser}({X_i}),{E_i})\) # Construct the node embeddings and adjacency matrix A for each sentence.

\(H_{\phi }^{0} \leftarrow {H^i}\), \({A^0} \leftarrow {A^i}\) # Construct the dependency syntax graph for each batch.

end for

for m in \(conv\_num\) do # \(conv\_num=\left\{ {1,2,3} \right\}\)

\(H_{\phi }^{m},{A^m}=\operatorname{SelfAttentionGraphPooling}(H_{\phi }^{{m - 1}},{A^{m - 1}})\) # Graph convolution layer

if \(m \ne 1\) then

\(H_{{r,\phi }}^{m}=\operatorname{ResidualShrinkageNetwork}(H_{\phi }^{m})\) # Residual shrinkage

end if

end for

H = \(H_{\phi }^{1}+H_{{r,\phi }}^{2}+H_{{r,\phi }}^{3}\) # Fuse the three layer outputs (element-wise sum)

Syntax-semantic feature fusion for relation extraction

For each entity pair in the sentence, the entity’s node vector is found in the dependency syntax graph using the id index, and then concatenated with the BERT word embedding vector to obtain the final entity vector representation \({v_{a,b}}=[{e_{a,b}};{h_{a,b}}]\), where \({v_{a,b}}\) represents the feature vector of the entity pair with indices a and b, and \({h_{a,b}}\) and \({e_{a,b}}\) represent the node vector and word embedding vector, respectively. Then, two multi-layer perceptrons are used to map the entity vector \({v_{a,b}}\), and the feature information between entities is obtained by concatenating their outputs. A linear layer is then applied to map the entity feature information to relation classification. The specific calculation method is as follows:

$${M_{a,b}}=\operatorname{softmax} (W \cdot (\operatorname{MLP}_1({v_{a,b}}) * \operatorname{MLP}_2({v_{a,b}}))+c)$$
(12)

Where \({\text{MLP}}\) represents the multi-layer perceptron; \(*\) denotes full concatenation between vectors; \({M_{a,b}}\) represents the entity relationship classification value within the decoding matrix; \(W \in {{\text{R}}^{{\text{d}} \times n}}\) and \({\text{c}} \in {{\text{R}}^{{\text{d}} \times n}}\) are the parameters of the linear layers.
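The decoder of Eq. (12) can be sketched in PyTorch as follows, reading the “*” full-concatenation operator as concatenating the two MLP outputs before the final linear layer; the class name, hidden size, and the use of 31 relation classes (the DUIE 2.0 label count) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationDecoder(nn.Module):
    """Dual-MLP relation decoder sketch (Eq. 12): the fused entity-pair
    vector v_{a,b} = [e_{a,b}; h_{a,b}] is mapped by two MLPs, their
    outputs are concatenated, and a linear layer projects onto the
    relation set before a softmax."""

    def __init__(self, dim: int, num_relations: int, hidden: int = 256):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.classifier = nn.Linear(2 * hidden, num_relations)

    def forward(self, v_ab: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.mlp1(v_ab), self.mlp2(v_ab)], dim=-1)
        return torch.softmax(self.classifier(fused), dim=-1)

# v_{a,b}: BERT entity vector (768) concatenated with the graph node vector (768)
decoder = RelationDecoder(dim=2 * 768, num_relations=31)
probs = decoder(torch.randn(4, 2 * 768))   # 4 entity pairs -> (4, 31)
```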

Due to the inherent structural differences between the BERT model and the GCN, directly combining them may lead to incomplete convergence, thereby affecting the accuracy of model extraction. The output of the BERT model is typically high-dimensional vectors, while the input of the GCN is usually low-dimensional node features. This dimensionality mismatch can result in a sharp increase in computational complexity and may introduce redundant information. Additionally, the optimization objectives of BERT and the GCN are inconsistent: BERT focuses on capturing global semantic information, whereas the GCN emphasizes local structural information. This objective conflict further impacts the convergence of the model. In this paper, we employ an interpolation prediction method to enable the pre-trained model and the graph neural network to update simultaneously. Specifically, the output vectors N from the BERT model serve two purposes: one copy is used as input to the graph neural network, and the resulting vectors are classified and used to compute the cross-entropy loss against the true labels; the other is used for independent prediction, where entity vectors E are extracted directly from N, classified, and then used to compute the cross-entropy loss against the true labels. The two parts are connected by a weighting coefficient, as shown in Eq. (13):

$$loss=w \cdot {\operatorname{CEL} _{BERT-GCN}}({p_1},l)+(1 - w){\operatorname{CEL} _{BERT}}({p_2},l)$$
(13)

In this case, w represents the weight coefficient; when \(w=1\), the interpolation prediction method is not used. \(\operatorname{CEL}\) denotes the cross-entropy loss function; \({p_1}\) and \({p_2}\) are the prediction values output by the GCN and BERT, respectively; and l is the ground truth label.
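A compact sketch of the interpolation prediction loss in Eq. (13) follows, assuming p1 and p2 are the unnormalized class scores (logits) of the BERT-GCN and BERT-only branches; torch’s cross_entropy applies the softmax internally, and w = 0.8 follows the setting reported in the experimental setup.

```python
import torch
import torch.nn.functional as F

def interpolation_loss(p1: torch.Tensor, p2: torch.Tensor,
                       labels: torch.Tensor, w: float = 0.8) -> torch.Tensor:
    """Eq. (13): weighted sum of the BERT-GCN branch loss and the
    BERT-only branch loss. p1, p2: (batch, num_relations) logits;
    labels: (batch,) ground-truth relation indices."""
    loss_bert_gcn = F.cross_entropy(p1, labels)
    loss_bert = F.cross_entropy(p2, labels)
    return w * loss_bert_gcn + (1.0 - w) * loss_bert
```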

At the same time, the model faces a class imbalance problem when outputting label matrices: relationships between text entities are sparse, so the label matrices contain very few positive samples. To address this issue, we introduce entity types to restrict the relationship classification between entities. By predefining entity type relationships, we set a masking mechanism that enables the model to compute the loss value only for admissible entity relationships and specifies the classification vectors for unrelated entities. This is expressed as in formula (14).

$$\left\{ \begin{gathered} id\_mask = R < entity_{1} ,entity_{2} > \hfill \\ {\text{CEL}}_{{{\text{BERT}}}} \left( {p_{2} ,l} \right) = \text{CEL} (p_{2} \cdot id\_mask,l \cdot id\_mask) \hfill \\ \text{CEL} _{{{\text{BERT}} - {\text{GCN}}}} (p_{1} ,l) = \text{CEL} (p_{1} \cdot id\_mask,l \cdot id\_mask) \hfill \\ \end{gathered} \right.$$
(14)

In this context, \(id\_mask\) represents the mask between entities, R is the function used to judge the predefined entity types, and \(entity\) refers to the entity categories.
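The masked loss of Eq. (14) can be sketched as below; selecting the entity pairs whose id_mask is 1 has the same effect as multiplying predictions and labels by the mask, and the type-compatibility function R that produces id_mask is dataset-specific, so it is left abstract here.

```python
import torch
import torch.nn.functional as F

def masked_relation_loss(logits: torch.Tensor, labels: torch.Tensor,
                         id_mask: torch.Tensor) -> torch.Tensor:
    """Eq. (14) sketch: restrict the loss to entity pairs whose predefined
    types can participate in a relation. id_mask is a 0/1 vector over
    entity pairs produced by the type-compatibility judgment R."""
    keep = id_mask.bool()
    if keep.sum() == 0:                  # no admissible pairs in this batch
        return logits.sum() * 0.0        # differentiable zero loss
    return F.cross_entropy(logits[keep], labels[keep])
```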

Experiments and analysis

Datasets

The experiment uses two datasets for validation: 1) the Baidu DUIE 2.0 Relation Extraction Dataset32, which includes over 30,000 text samples, 130,000 triplet data, and 31 predefined relation types, with “unknown” representing no relationship between entities; 2) the SemEval 2010 Task 8 Dataset33, consisting of 8,000 training samples and 2,717 test samples, containing 10 relation types, with “Other” indicating no relationship between entities.

Experimental setup and metrics

This paper uses the Ubuntu 18 operating system, the Python 3.8 environment, and the PyTorch 1.9.0 + cu111 framework. The GPU used is an A100-80G. The model parameters are set as follows: the number of training epochs is 20 and 30 for the two datasets, respectively; the batch size is 32; the learning rate is 5e-3; the input text length limit is set to 256; the number of GCN layers is 3; the irrelevant node coefficient for SAG is 0.2; and the coefficient for interpolation prediction is 0.8. The pre-trained model used is bert-base-chinese, with a hidden vector size of 768 and 12 Transformer layers. The evaluation metrics are precision (P), recall (R), and F1 score, which are used to evaluate the performance of relation extraction.

Comparison and analysis of different models

To verify the effectiveness of the proposed method, the following baseline methods are selected for comparison:

R-BERT17: This method directly uses the position indices of entities in the sentence, utilizes BERT to extract entity vectors, and aggregates the CLS vector for relation classification.

BERT-LSTM21: Based on BERT, this method uses LSTM’s sequential characteristics to extract deeper features, aggregating vector information for classification.

BERT-GAT34: This approach constructs a graph based on entities and allocates different weights to the edges between entities using an attention mechanism. It then uses graph propagation to update entity vectors and finally decodes the relationship between the two entities.

GP-GNN22: This method uses Glove to extract vectors and combines them with positional information in an LSTM model to obtain entity vectors. It constructs a fully connected graph structure about entities and predicts the relationship between entities.

AGGCN24: This method converts the original dependency tree into a fully connected weighted graph. Based on GCN, it learns node representations using the correlation strength between nodes, ultimately obtaining the relationship between entities.

Based on the above methods, this paper evaluates the model’s relation extraction performance on the two datasets using F1 score, precision, and recall, as shown in Table 1.

Table 1 Comparison results.

Experimental results indicate that the proposed model outperforms most existing models in relation extraction tasks, achieving overall performances of 93.28% and 88.41% on the DUIE 2.0 and SemEval 2010 Task 8 datasets, respectively, demonstrating the effectiveness of the model. Models such as R-BERT and BERT-LSTM both utilize BERT pre-trained models for encoding and apply different operations to the encoded vectors for recognition. However, relying on single word vectors fails to fully capture the interactions between entities and misses other important features. Models like BERT-GAT, which incorporate graph neural networks, focus on the interaction between entities after encoding but lack guidance from the sentence’s inherent syntactic and semantic structures. In contrast, our model is based on BERT encoding and uses attention mechanisms combined with a residual shrinkage network to fully integrate dependency syntax graph information, thus enhancing the performance of relation extraction.

The impact of text length on dependency relations

To visually demonstrate how the proposed model enhances the ability to capture entity dependency relationships as text length increases, Fig. 5 compares BERT without a graph neural network and our model across different sentence lengths on the two datasets. In the upper part, the bar charts with F1, P, and R metrics use solid bars to represent the model without a graph neural network, while the remaining bars represent the proposed model. In the lower part, the line charts depict the variation of F1 scores with text length for each dataset: black lines represent the model without a graph neural network, whereas red lines represent our model. In the DUIE dataset, for sentence lengths within the range of (0, 50), the metrics show little difference. For sentence lengths in the range of (50, 100), the metrics show a slight increase. However, for sentence lengths in the range of (100, 256), the F1, Precision (P), and Recall (R) metrics improve by 1.34%, 1.7%, and 0.98%, respectively. In the SemEval 2010 dataset, for the same three length ranges, the F1 score significantly exceeds that of the previous model, with improvements of 0.19%, 1.11%, and 2.17%, respectively. This indicates that as sentence length increases, the proposed model shows a more significant improvement in relation extraction performance, demonstrating that incorporating graph neural networks based on dependency syntax graphs enhances the dependency relations between entities and leads to improved performance in relation extraction tasks.

Fig. 5
figure 5

The three metrics of the model at different sentence lengths.

Model noise suppression analysis

The method proposed in this paper primarily reduces the propagation of irrelevant node noise on top of the constructed dependency syntax graph. To visually demonstrate the impact of adding the SAG and RS modules on the model’s denoising effect, we conduct a denoising visualization on the SemEval 2010 Task 8 dataset. First, we train the network without any noise-reduction modules, so that the effect of noise on the model is left unmitigated. Then, we sequentially add the SAG and RS modules and use confusion matrices to observe more intuitively the impact of these modules on overall performance.

As shown in Fig. 6, the diagonal elements of the matrix represent the correctly predicted relationship categories, while the off-diagonal elements show the strength of the noise, with lighter colors indicating stronger noise. Compared to the other two confusion matrices, GCN has more noise outside the diagonal, while GCN-SAG-RS has the least. This indicates that the SAG module, which learns an importance score for each node through the attention mechanism and removes useless nodes, and the RS module, which learns soft thresholds through the residual shrinkage mechanism and updates the node vectors, together significantly improve the model’s performance.

Fig. 6
figure 6

Three confusion matrices.

Ablation study

To verify the effectiveness of the proposed method, eight ablation experiments were conducted on the two datasets, with the results shown in Table 2. Each experiment added a corresponding module to the previous one: using the BERT model for relation extraction (BERT), adding the Graph Convolutional Network (GCN), adding the Self-Attention Graph Pooling layer (SAG), and adding the Residual Shrinkage (RS) module. The experimental results show that, on the DUIE dataset, adding the dependency syntax tree alone resulted in a 0.39% decrease in the F1 score. This was due to excessive noise, which had a significant impact on the model’s overall performance. After applying the two noise reduction methods, the F1 scores improved by 0.12% and 0.23%, respectively. On the SemEval 2010 dataset, after adding the dependency syntax tree, the model’s overall performance improved by 0.16% with GCN graph propagation. Compared to the DUIE Chinese dataset, the noise from building the syntax tree had a smaller impact on the model in the English dataset. After applying the SAG and RS modules, the F1 scores improved by 0.11% and 0.15%, respectively.

Table 2 Ablation study.

To further analyze the impact of the cross-entropy loss function and the interpolation prediction loss function on the model training process and final performance, this paper trains our model with each of the two loss functions (interpolation prediction loss and cross-entropy loss). To ensure the rationality of the experiment, the second term is omitted in the model using the interpolation prediction loss, and the final model output is unified as \({\operatorname{CEL} _{BERT-GCN}}(p,l)\) for comparison. The experiment is conducted on the SemEval 2010 test dataset, and the results are shown in Fig. 7. The x-axis represents the training steps, while the left y-axis represents the training loss values. From the trend of the curves, it can be observed that the interpolation prediction loss outperforms the cross-entropy loss in terms of convergence speed and loss value updates. Moreover, the changes in the three key evaluation metrics, presented in Table 3, further validate the effectiveness of the interpolation prediction loss. This advantage is primarily attributed to its computational mechanism, which effectively enhances the information interaction between BERT and the GCN, optimizes feature representation learning, and stabilizes parameter updates across the different modules. Consequently, the model’s predictive performance is significantly improved.

Fig. 7
figure 7

The impact of two loss functions on model performance.

Table 3 The impact of different loss functions on model performance.

In addition, under the same experimental parameters, this paper analyzes the impact of different parameters on model performance. Table 4 shows the performance comparison of different parameters on the DUIE 2.0 dataset.

Table 4 The impact of different parameters on model performance.

From Table 4, it can be seen that, with other parameters being the same, the learning rate has a significant impact on the model’s performance. This is because a smaller learning rate makes it difficult for the model to reach optimal performance within the specified number of iterations. The text length, GCN layer count, and SAG parameter all have varying effects on the model’s classification performance. The model performs best when the GCN layer count is 3 and the SAG coefficient is 0.2.

Conclusions

This paper proposes a relation extraction model based on the BERT pre-trained model combined with an external syntactic parser and graph neural networks. By learning information from the sentence, it achieves the fusion of semantic and syntactic information, improving the accuracy of entity relationship classification. Furthermore, a graph convolutional network with a self-attention pooling layer and a residual shrinkage network are employed to reduce the influence of irrelevant information in the text. Additionally, an interpolation prediction method is used to calculate the loss and refine the interaction between BERT and the graph neural network. Experimental results demonstrate significant improvements of the proposed method over the baseline models. This study highlights the potential of combining BERT with graph neural networks and integrating semantic and syntactic information to enhance relation extraction performance, providing valuable insights and directions for future research.