Introduction

With the rapid development of mobile Internet technology, online social networks have become increasingly integrated into people’s lives, entertainment and work. The explosive growth of information has been accompanied by the spread of rumors. Because social media platforms have a large number of users, rumors can cause great harm to society. Automatically detecting online rumors at an early stage is therefore of great significance for text mining on social media.

Existing studies on automatic rumor detection mainly focused on designing effective feature extraction approaches with various sources of information, such as semantic features extracted from the contents1,2,3, social media propagation features of user profiles4,5,6, retweet propagation representations7,8,9 and so on. Recently, many studies have attempted to aggregate multi-source features to enhance rumor detection performance. Shu et al.9 developed a combined attention network to learn interpretable representations of the comments and their corresponding contents. Ma et al.10 presented a tree-based recursive neural network to capture the semantic information and propagation clues of source tweets for rumor detection.

Most previous rumor detection methods focus on only one part of the available information: contextual information, user profiles or patterns of propagation. In fact, both semantic information and propagation patterns are important for rumor detection on social media, as shown in Fig. 1. The above methods still have several limitations. First, most current detection methods11,12 pay attention only to contextual semantic information, while the impact of user communication behavior on social media is rarely considered. Second, many models13,14,15 attempt to integrate users’ social activities, such as user comments, retweets and user personal information, as auxiliary information for rumor detection. However, users may simply re-share the source story without leaving any comments16, which is not considered in most current studies. In addition, some studies have shown that rumors tend to be spread by users who lack personal information17,18. Therefore, social activities might actually be very useful for rumor detection. How to better represent the textual features in the presence of social activities is one of the major challenges for rumor detection.

Fig. 1

An example of data.

In this paper, we propose a multi-level interactive rumor detection approach based on heterogeneous graph reconstruction. We first design a graph autoencoder framework to represent the semantic graph and the user propagation graph, respectively. A multi-feature interactive fusion strategy with adaptive gated fusion is then adopted for rumor detection. The main contributions of our work can be summarized as follows:

  • A multi-level interactive heterogeneous graph reconstruction approach is presented for rumor detection on social media. Both semantic information and social propagation clues are leveraged to improve rumor detection performance.

  • A graph auto-encoder framework is designed to obtain the semantic representations and the user propagation representations by reconstructing heterogeneous graphs of social media. A parallel graph convolutional encoder based on GCN and GAT is proposed to represent semantic relations and propagation relations, and a variational graph auto-encoder (VGAE) is used for multi-graph reconstruction.

  • A multi-feature interactive fusion strategy is adopted with adaptive gated fusion to balance the fused global features and local features for rumor detection.

  • The experimental results show that our proposed approach outperforms the compared approaches and achieves state-of-the-art (SOTA) scores on two benchmark datasets. An ablation study and parameter experiments are given to further show the effectiveness of our proposed model.

Related work

Rumor detection

Rumor detection is one of the most challenging tasks in the text mining field and has attracted increasing attention over the past decades. Most current rumor detection approaches focus on feature fusion strategies based on textual information and social propagation clues. Early work focused on extracting features manually19. Some studies apply more effective features, such as user comments20 and the emotional attitude of posts21. However, due to the anonymous and noisy characteristics of large-scale social media data, over-reliance on manually extracted features is not conducive to rumor detection. Most current rumor detection approaches are therefore built on deep neural networks. Ma et al.22 used recurrent neural networks (RNN), gated recurrent units (GRU) and long short-term memory (LSTM) to learn text representations for rumor detection. Yu et al. utilized convolutional neural networks (CNN) to capture the high-level interactive features of comments for rumor detection. Social activity is another important factor for rumor detection, and some studies5,9,23 focus on exploring user social activities. Shu et al.24 built an interactive network to detect fake news by exploring the ternary relationship between publishers, news and user information. Liu et al.25 applied RNN and CNN networks to capture the changes of user characteristics along the propagation path.

Graph convolutional network

Graph Convolutional Network (GCN)26 is another popular neural network structure that is widely used for rumor detection. Several approaches have been proposed to detect fake news by exploring the topological structure of social media27,28,29. Bian et al.30 leveraged a bi-directional graph convolutional network to represent user characteristics by operating in the top-down and bottom-up directions for rumor detection. Lu et al.31 developed a graph-aware co-attention network to recognize fake news by utilizing both source tweets and their corresponding comments. Liu et al.32 designed a VGAE-based32,33 model to capture text, dissemination and structural information to enhance rumor detection performance. In order to fully utilize both global and local semantic information, Yuan et al. presented a novel global–local attention network to capture the local and global relationships among all source tweets, retweets, and users. However, previous studies have not fully considered the synthesis of semantic information and user-disseminated information.

Method

We consider two types of information in the rumor detection task, textual semantic representations and user propagation characteristics, and we build a heterogeneous multi-graph based on tweet-word-user relations. Let \(\text{G}=\left(\text{V},\text{E}\right)\) denote the heterogeneous tweet-word-user graph. The node set \(\text{V}=\left(\text{P},\text{W},\text{U}\right)\) consists of source tweets, tweet words and users, and the edge set \(\text{E}=\left({\text{E}}_{\text{pw}},{\text{E}}_{\text{ww}},{\text{E}}_{\text{pu}}\right)\) contains three types of relations: tweet-word edges, word-word edges and tweet-user edges. \({\text{E}}_{\text{pw}}\) describes the relationship that a tweet contains a word, \({\text{E}}_{\text{ww}}\) expresses the semantic relation between words, and \({\text{E}}_{\text{pu}}\) reflects the interaction between users and tweets. Here \(\text{P}=\left\{{\text{p}}_{1},{\text{p}}_{2},\dots ,{\text{p}}_{\text{m}}\right\}\) is the set of source tweets, \(\text{W}=\left\{{\text{w}}_{1},{\text{w}}_{2},\dots ,{\text{w}}_{\text{n}}\right\}\) is the set of tweet words, \(\text{U}=\left\{{\text{u}}_{1},{\text{u}}_{2},\dots ,{\text{u}}_{\text{o}}\right\}\) is the set of users, and \(\text{m}\) is the number of source tweets.

Moreover, each source tweet \({\text{p}}_{\text{i}}\) is associated with a ground-truth label \({\text{y}}_{\text{i}}\in \left\{\text{N},\text{F},\text{T},\text{U}\right\}\) (Non-rumor, False Rumor, True Rumor, and Unverified Rumor). Rumor detection aims to learn a classifier \(\text{f}:{\text{p}}_{\text{i}}\to {\text{y}}_{\text{i}}\) that predicts the label of a tweet based on text contents and user propagation clues.

$${\text{y}}_{\text{i}}=\text{f}\left[{\text{p}}_{\text{i}}|\left(\text{V},\text{E}\right)\right]$$
(1)

where \({p}_{i}\) and \({y}_{i}\) denote the i-th source tweet and its label, respectively.

This paper proposes a multi-level interactive rumor detection approach based on heterogeneous graph reconstruction (MLI-HGR). The architecture of our proposed approach is shown in Fig. 2. The MLI-HGR model consists of three parts: (1) the multiple graph convolutional encoder, (2) the multi-graph reconstruction decoder and (3) the multi-feature rumor detector. We describe each part of our proposed model in detail below.

Fig. 2

The architecture of our MLI-HGR model.

The multiple graph convolutional encoder

Multi-graph construction

We decompose the heterogeneous tweet-word-user graph into a tweet-word subgraph and a tweet-user subgraph to capture the global semantic features of the text content and the user dissemination information, respectively.

(1) Semantic graph construction.

The nodes in the tweet-word subgraph are the tweet and word nodes of the heterogeneous graph. The edges between tweets and words are the same as in the heterogeneous graph. The node features of the tweet-word subgraph are denoted as \({X}_{pw}=\left\{{x}_{{p}_{1}},{x}_{{p}_{2}},\dots ,{x}_{{p}_{\left|P\right|}},{x}_{{w}_{1}},{x}_{{w}_{2}},\dots ,{x}_{{w}_{\left|W\right|}}\right\},{x}_{{p}_{i}}\in {X}_{P},{x}_{{w}_{i}}\in {X}_{W}\), and the relations among the nodes of the tweet-word subgraph are represented by an adjacency matrix \({A}_{pw}\).

We build the edges \({E}_{pw}\) from word occurrences in source tweets. Formally, the weight of the edge between node \(i\) and node \(j\) is defined as:

$${A}_{pw\left(ij\right)}=\begin{cases}PMI\left(i,j\right), & i,j\text{ are words},\ PMI\left(i,j\right)>0\\ TF\text{-}ID{F}_{ij}, & i\text{ is a tweet},\ j\text{ is a word}\\ 1, & i=j\\ 0, & \text{otherwise}\end{cases}$$
(2)

The PMI value of a word pair i, j is computed as:

$$\left\{\begin{array}{c}PMI\left(i,j\right)=log\frac{p\left(i,j\right)}{p\left(i\right)p\left(j\right)}\\ p\left(i,j\right)=\frac{\#W\left(i,j\right)}{\#W}\\ p\left(i\right)=\frac{\#W\left(i\right)}{\#W}\end{array}\right.$$
(3)

where \(\#\text{W}\left(i,j\right)\) denotes the number of sliding windows that contain both word i and word j, \(\#\text{W}\) represents the number of sliding windows, and \(\#\text{W}\left(i\right)\) indicates the number of sliding windows that contain word i.

The weight of the edge between a source tweet node and a word node is the term frequency-inverse document frequency (TF-IDF) of the word in the source tweet, where the term frequency is the number of times the word appears in the source tweet and the inverse document frequency is the logarithmically scaled inverse fraction of the source tweets that contain the word. To utilize global word co-occurrence information, we slide a fixed-size window over all source tweets in the corpus to gather co-occurrence statistics, and we employ point-wise mutual information (PMI), a popular measure of word association, to calculate the weights of the edges \({E}_{ww}\).
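The edge weights of Eqs. (2) and (3) can be illustrated with a minimal sketch. It assumes whitespace-tokenized tweets, a hypothetical window size, raw counts as term frequency and an unsmoothed `log(N / df)` as inverse document frequency; the function names `sliding_windows`, `pmi_edges` and `tfidf_edges` are illustrative and not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def sliding_windows(tokens, size):
    """Return fixed-size windows; a tweet shorter than `size` is one window."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[k:k + size] for k in range(len(tokens) - size + 1)]


def pmi_edges(tweets, size=3):
    """Word-word edges: keep only pairs whose PMI is positive (Eqs. 2 and 3)."""
    windows = [w for tokens in tweets for w in sliding_windows(tokens, size)]
    n_windows = len(windows)
    word_count = Counter()   # #W(i): windows containing word i
    pair_count = Counter()   # #W(i, j): windows containing both i and j
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))
    edges = {}
    for (a, b), n_ab in pair_count.items():
        p_ab = n_ab / n_windows
        p_a = word_count[a] / n_windows
        p_b = word_count[b] / n_windows
        pmi = math.log(p_ab / (p_a * p_b))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


def tfidf_edges(tweets):
    """Tweet-word edges: raw term count times log(N / document frequency)."""
    n_tweets = len(tweets)
    doc_freq = Counter(word for tokens in tweets for word in set(tokens))
    edges = {}
    for index, tokens in enumerate(tweets):
        for word, count in Counter(tokens).items():
            edges[(index, word)] = count * math.log(n_tweets / doc_freq[word])
    return edges
```

With this unsmoothed variant, a word that occurs in every source tweet receives a TF-IDF weight of zero; a smoothed inverse document frequency would be an equally plausible choice.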

(2) Propagation graph construction.

The nodes in the tweet-user subgraph are the tweet and user nodes of the heterogeneous graph, and its edges are the edges between tweets and users in the heterogeneous graph. The node features of the tweet-user subgraph are denoted as \({X}_{pu}=\left\{{x}_{{p}_{1}},{x}_{{p}_{2}},\dots ,{x}_{{p}_{\left|P\right|}},{x}_{{u}_{1}},{x}_{{u}_{2}},\dots ,{x}_{{u}_{\left|U\right|}}\right\},{x}_{{p}_{i}}\in {X}_{P}^{,},{x}_{{u}_{i}}\in {X}_{U}^{,}\), where \({X}_{P}^{,}\) and \({X}_{U}^{,}\) are the node representations transformed by the transformation matrix. The relations among the nodes of the tweet-user subgraph are represented by an adjacency matrix \({A}_{pu}\).

We weight each edge in \({E}_{pu}\) by the reciprocal of the time that elapsed before the user retweeted or responded to a tweet related to the source tweet, so that earlier interactions receive larger weights. Formally, the weight of the edge between node \(i\) and node \(j\) is defined as:

$${A}_{pu\left(ij\right)}=\begin{cases}1/\left(t+1\right), & i\text{ is a tweet},\ j\text{ is a user}\\ 1, & i=j\\ 0, & \text{otherwise}\end{cases}$$
(4)

where \(t\) is the time elapsed before user \(j\) retweeted or replied to tweets related to source tweet \(i\).
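As a small illustration of Eq. (4), the sketch below fills a dense tweet-user adjacency matrix from a list of interactions. The layout (tweet nodes first, then user nodes), the symmetric filling and the function name `propagation_adjacency` are assumptions made for the example; the time unit of `t` is not specified in the text and is left to the caller.

```python
def propagation_adjacency(n_tweets, n_users, interactions):
    """Build A_pu from (tweet_index, user_index, elapsed_time) triples.

    Nodes are ordered as all tweets followed by all users (an assumption).
    """
    size = n_tweets + n_users
    adjacency = [[0.0] * size for _ in range(size)]
    for k in range(size):
        adjacency[k][k] = 1.0                 # i = j
    for tweet, user, elapsed in interactions:
        weight = 1.0 / (elapsed + 1.0)        # 1 / (t + 1)
        column = n_tweets + user
        adjacency[tweet][column] = weight
        adjacency[column][tweet] = weight     # symmetric fill is an assumption
    return adjacency
```

If a user interacts with the same source tweet several times, this sketch keeps the weight of the last triple it sees; keeping the earliest interaction would be another reasonable convention.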

Dual channel convolution

GCN can process data with a generalized topological graph structure and explore its characteristics in depth. However, the neighbors of each node in a subgraph differ in their importance for learning node embeddings for rumor detection. Inspired by graph attention networks (GAT)34, the encoder module therefore uses GCN and GAT in parallel to learn node characteristics. The GCN extracts features to find appropriate embedding vectors for the nodes in the graph, which are used for graph reconstruction in the subsequent decoder module. The GAT uses an attention mechanism to learn the importance of each node’s neighbors and aggregates the neighbors’ representations according to this importance to form each node’s representation.

For each subgraph, we use GCN to learn a Gaussian distribution and then sample \(z\) from this distribution. The Gaussian distribution is uniquely determined by its mean \(\mu\) and standard deviation \(\sigma\), each of which is learned by a GCN. Finally, a new adjacency matrix is generated by graph reconstruction.

For the adjacency matrices \({A}_{pw}\) and \({A}_{pu}\), we use GCN to learn the mean and standard deviation, respectively, and use the reparameterization28 method so that gradients can be propagated through the sampling step. The formulas are as follows:

$${H}_{1}=GCN\left(X,{A}_{pw}\right)={A}_{pw}\sigma \left({A}_{pw}X{W}_{0}\right){W}_{1},$$
(5)
$$\mu ={GCN}_{\mu }\left({H}_{1},{A}_{pw}\right)$$
(6)
$$\text{log}\sigma ={GCN}_{\sigma }\left({H}_{1},{A}_{pw}\right)$$
(7)
$${z}_{pw}=\mu +\varepsilon \sigma$$
(8)

where \({H}_{1}\in {\mathbb{R}}^{n\times v}\) represents the hidden features of the GCN, \(\text{X}\in {\mathbb{R}}^{n\times d}\) is the feature matrix of \({X}_{pw}\), \(\varepsilon\) is sampled from a standard Gaussian distribution, and \({W}_{0}\) and \({W}_{1}\) are the trainable parameter matrices of the GCN. \({GCN}_{\mu }\left({H}_{1},{A}_{pw}\right)\) and \({GCN}_{\sigma }\left({H}_{1},{A}_{pw}\right)\) share the first-layer parameters \({W}_{0}\). Following Eqs. (5) to (8), we use the same calculation to learn a Gaussian distribution for the tweet-user subgraph and sample \({z}_{pu}\).
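The encoder of Eqs. (5) to (8) can be sketched in plain Python as follows. This is a simplified reading under stated assumptions: the adjacency matrix is used as given (no normalization), the shared first layer uses ReLU, and `GCN_mu` and `GCN_sigma` are each modelled as a single graph convolution on top of the shared hidden layer, as in a standard VGAE. The helper names and the weight arguments `W_mu` and `W_sigma` are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Dense matrix product for lists of lists."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian weights, only for demonstration."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def encode(A, X, W0, W_mu, W_sigma, rng):
    """Return (mu, log_sigma, Z) with Z = mu + eps * sigma, eps ~ N(0, 1)."""
    hidden = relu(matmul(matmul(A, X), W0))          # shared first layer (W0)
    mu = matmul(matmul(A, hidden), W_mu)             # mean branch
    log_sigma = matmul(matmul(A, hidden), W_sigma)   # log standard deviation branch
    Z = [
        [m + rng.gauss(0.0, 1.0) * math.exp(s) for m, s in zip(mu_row, ls_row)]
        for mu_row, ls_row in zip(mu, log_sigma)
    ]
    return mu, log_sigma, Z
```

Because the randomness enters only through `eps`, the sample `Z` is a deterministic function of `mu` and `log_sigma` for a fixed `eps`, which is what allows gradient-based training in an automatic differentiation framework.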

In order to obtain sufficient expressive ability, we use the GAT to learn the weights between nodes in the subgraph; the graph attention layer is designed as follows:

$${e}_{ij}=LeakyReLU\left({W}_{a}{x}_{i},{W}_{q}{x}_{j}\right), {x}_{i},{x}_{j}\in {X}_{pw\left(pu\right)}$$
(9)
$${a}_{ij}=softmax\left({e}_{ij}\right)=\frac{exp\left({e}_{ij}\right)}{\sum_{k\in {N}_{i}}exp\left({e}_{ik}\right)}$$
(10)
$${x}_{i}^{,}=\sigma \left(\sum_{j\in {N}_{i}}{a}_{ij}{W}_{k}{x}_{j}\right)$$
(11)

where \({W}_{a}\),\({W}_{q}\),\({W}_{k}\) are trainable weights and \({a}_{ij}\) is the attention weight between \({x}_{i}\) and \({x}_{j}\).

Finally, we extend the self-attention mechanism to multi-head attention to learn more stable embeddings. The multi-head attention can be denoted as:

$${x}_{i}^{,}={||}_{k=1}^{K}\sigma \left(\sum_{j\in {N}_{i}}{a}_{ij}^{k}{W}^{k}{x}_{j}\right)$$
(12)

where || represents concatenation, \({a}_{ij}^{k}\) are normalized attention coefficients computed by the \(k-th\) attention mechanism (\({a}^{k}\)), and \({W}^{k}\) is the corresponding input linear transformation’s weight matrix.
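The attention layer of Eqs. (9) to (12) can be sketched as below. Several choices here are assumptions rather than the paper's specification: the score \({e}_{ij}\) is read as the LeakyReLU of the dot product between the two projections, \(\sigma\) is taken to be the sigmoid, and each neighbor list is expected to contain the node itself so that it is never empty. The function names are illustrative.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(value, slope=0.2):
    return value if value > 0 else slope * value


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def attention_head(X, neighbors, W_a, W_q, W_k):
    """One head: score neighbors, normalize with softmax, aggregate."""
    output = []
    for i, x_i in enumerate(X):
        query = matvec(W_a, x_i)
        scores = []
        for j in neighbors[i]:
            key = matvec(W_q, X[j])
            scores.append(leaky_relu(sum(q * k for q, k in zip(query, key))))
        top = max(scores)                              # numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]             # a_ij
        aggregated = [0.0] * len(W_k)
        for alpha, j in zip(alphas, neighbors[i]):
            for d, value in enumerate(matvec(W_k, X[j])):
                aggregated[d] += alpha * value
        output.append([sigmoid(v) for v in aggregated])
    return output


def multi_head(X, neighbors, heads):
    """Concatenate the outputs of K heads; `heads` is a list of (W_a, W_q, W_k)."""
    per_head = [attention_head(X, neighbors, *weights) for weights in heads]
    return [sum((head[i] for head in per_head), []) for i in range(len(X))]
```

With K heads of output size h, every node ends up with a vector of length K·h, which matches the concatenation in Eq. (12).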

Overall, given the representation \({X}_{pw}\) of the nodes in the tweet-word subgraph and the representation \({X}_{pu}\) of the nodes in the tweet-user subgraph, we input \({X}_{pw}\) and \({X}_{pu}\) into the subgraph attention network to obtain new node representations. The node embeddings of the tweet-word subgraph are denoted as \({X}_{pw}^{,}=\left\{{x}_{{p}_{1}}^{,},{x}_{{p}_{2}}^{,},\dots ,{x}_{{p}_{\left|P\right|}}^{,},{x}_{{w}_{1}}^{,},{x}_{{w}_{2}}^{,},\dots ,{x}_{{w}_{\left|W\right|}}^{,}\right\}\), and the node embeddings of the tweet-user subgraph are denoted as \({X}_{pu}^{,}=\left\{{x}_{{p}_{1}}^{,},{x}_{{p}_{2}}^{,},\dots ,{x}_{{p}_{\left|P\right|}}^{,},{x}_{{u}_{1}}^{,},{x}_{{u}_{2}}^{,},\dots ,{x}_{{u}_{\left|U\right|}}^{,}\right\}\).

Multi-graph reconstruction decoder

VGAE mainly finds suitable embedding vectors for the nodes in the graph and realizes graph reconstruction. We take the matrices \({z}_{pw}\) and \({z}_{pu}\) as the input of the multi-graph reconstruction decoder, with the goal of making the reconstructed adjacency matrix \(\widehat{{A}_{pw\left(pu\right)}}\) similar to the original adjacency matrix \({A}_{pw(pu)}\). We use an inner product and a sigmoid function to reconstruct the original graph, and the reconstructed adjacency matrix is obtained as:

$$\widehat{{A}_{pw}}=\sigma \left({Z}_{pw}{Z}_{pw}^{T}\right)$$
(13)
$$\widehat{{A}_{pu}}=\sigma \left({Z}_{pu}{Z}_{pu}^{T}\right)$$
(14)

where \(\sigma\) is the sigmoid function, and \({Z}_{pw}\in {\mathbb{R}}^{{n}_{1}\times h}\) and \({Z}_{pu}\in {\mathbb{R}}^{{n}_{2}\times h}\) are the matrix forms of \({z}_{pw}\) and \({z}_{pu}\), respectively. Since \({Z}_{pw}\) and \({Z}_{pu}\) are obtained through sampling, the noise (standard deviation \(\upsigma\)) increases the difficulty of reconstructing the adjacency matrix. We apply an element-wise cross-entropy loss for the reconstruction of the adjacency matrix, which can be represented as:

$${\mathcal{L}}_{pw(pu)}=-\frac{1}{{A}_{row}{A}_{col}}\sum \left[m\log \widehat{m}+\left(1-m\right)\log \left(1-\widehat{m}\right)\right]$$
(15)

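where \(m\) and \(\widehat{m}\) are the elements of \({A}_{pw(pu)}\) and \(\widehat{{A}_{pw(pu)}}\), respectively, and \({A}_{row}\) and \({A}_{col}\) are the numbers of rows and columns of the adjacency matrix.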
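A minimal sketch of the inner-product decoder of Eqs. (13) and (14) and of the reconstruction loss follows. It writes the loss in the usual negative log-likelihood form so that smaller values mean better reconstruction, and it assumes that the target adjacency entries lie in [0, 1]; the small constant `eps` and the function names are additions for the example.

```python
import math


def reconstruct(Z):
    """A_hat = sigmoid(Z Z^T) for a latent matrix given as a list of rows."""
    return [
        [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(z_i, z_j)))) for z_j in Z]
        for z_i in Z
    ]


def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean element-wise cross-entropy between A and its reconstruction."""
    rows, cols = len(A), len(A[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            m, m_hat = A[r][c], A_hat[r][c]
            total += m * math.log(m_hat + eps) + (1.0 - m) * math.log(1.0 - m_hat + eps)
    return -total / (rows * cols)
```

Nodes whose latent vectors point in the same direction get a reconstructed edge weight close to 1, and nodes with opposing vectors get a weight close to 0.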

To prevent the noise from collapsing to zero and to ensure that the model retains its generative ability, we add the KL divergence to the loss function. Minimizing it pushes the learned distribution parameters (\(\upmu\) and \(\upsigma\)) as close as possible to the target distribution (a standard Gaussian distribution). The formula is as follows:

$${\mathcal{L}}_{{\mu ,{\sigma }^{2}}_{pw(pu)}}=\frac{1}{2}\sum_{i=1}^{{n}_{{i}_{\text{1,2}}}}\sum_{j=1}^{{n}_{{d}_{\text{1,2}}}}\left({\mu }_{ij}^{2}+{\sigma }_{ij}^{2}-\log {\sigma }_{ij}^{2}-1\right)$$
(16)

where \({n}_{{d}_{\text{1,2}}}\) is the dimensionality of the latent variable \({Z}_{pw(pu)}\) and \({n}_{{i}_{\text{1,2}}}\) is the number of nodes in the respective subgraph.
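The regularizer can be sketched with the standard closed form of the KL divergence between a diagonal Gaussian and a standard Gaussian, \(\frac{1}{2}\sum \left({\mu }^{2}+{\sigma }^{2}-\log {\sigma }^{2}-1\right)\), which is what this term is usually taken to be in a VGAE. The sketch assumes the encoder outputs \(\log \sigma\) as in Eq. (7); the function name is illustrative.

```python
import math


def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over nodes and latent dimensions."""
    total = 0.0
    for mu_row, ls_row in zip(mu, log_sigma):
        for m, ls in zip(mu_row, ls_row):
            sigma_sq = math.exp(2.0 * ls)
            total += m * m + sigma_sq - 2.0 * ls - 1.0   # log(sigma^2) = 2 * log(sigma)
    return 0.5 * total
```

The term is zero exactly when every mean is 0 and every standard deviation is 1, and it grows as the learned distribution moves away from the standard Gaussian, including when \(\sigma\) shrinks toward zero.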

Multi-feature rumor detector

The tweet-word subgraph contains the global semantic relation information of text contents, while the tweet-user subgraph contains the information involved in the propagation of source tweets. However, when the information of the two subgraphs is fused, the large difference between the global semantic features and the user propagation features may cause some useless features to affect the detection performance. Based on this, we propose a decision-level detector to fuse the subgraph features, which includes a decision-level global feature fusion strategy and an adaptive gated fusion strategy.

Global feature fusion strategy

Given the node embeddings \({X}_{pw}^{,}\) and \({X}_{pu}^{,}\) and the latent representations \({Z}_{pw}\) and \({Z}_{pu}\) sampled from the Gaussian distributions, which serve as input to the global feature fusion network, the weights of the tweet-word and tweet-user subgraphs are calculated as follows:

$${S}_{pw}^{,}={X}_{pw}^{,}\oplus {Z}_{pw}$$
(17)
$${S}_{pu}^{,}={X}_{pu}^{,}\oplus {Z}_{pu}$$
(18)
$$\left({\beta }_{{\Phi }_{pw}},{\beta }_{{\Phi }_{pu}}\right)={att}_{glo}\left({S}_{pw}^{,},{S}_{pu}^{,}\right)$$
(19)

where \({S}_{pw}^{,}\in {\mathbb{R}}^{{n}_{1}\times v}\) denotes the global semantic features of the tweet-word subgraph, \({S}_{pu}^{,}\in {\mathbb{R}}^{{n}_{2}\times v}\) denotes the global user propagation features of the tweet-user subgraph, and \({att}_{glo}\) represents the feedforward neural network that performs the global feature fusion strategy.

In order to learn the weights of the tweet-word subgraph and the tweet-user subgraph, we first transform the node representations in each subgraph by a nonlinear transformation (e.g. a single-layer MLP). Then we measure the importance of each node representation as the similarity of the transformed embedding with a global feature attention vector \(q\). Finally, we average the importance of all nodes in a subgraph to obtain the importance of that subgraph. The importance of the tweet-word (tweet-user) subgraph, denoted as \({w}_{pw(pu)}\), is computed as follows:

$${w}_{pw(pu)}=\frac{1}{|{S}_{pw(pu)}^{,}|}\sum_{{x}_{i}\in {S}_{pw\left(pu\right)}^{,}}{q}^{T}\cdot \text{tanh}\left({W}_{gol}{x}_{i}+b\right)$$
(20)

where \({W}_{gol}\) is the weight matrix, \(b\) is the bias vector and \(q\) is the global attention vector. Note that all of the above parameters are shared by the tweet-word subgraph and the tweet-user subgraph. The weight \({\beta }_{{\Phi }_{pw(pu)}}\) of the tweet-word (tweet-user) subgraph is obtained by normalizing the importance of the two subgraphs with the softmax function:

$${\beta }_{{\Phi }_{pw(pu)}}=\frac{exp\left({w}_{pw(pu)}\right)}{\sum_{\Phi \in \left\{pw,pu\right\}}exp\left({w}_{\Phi }\right)}$$
(21)

This weight can be interpreted as the contribution of the subgraph \({\Phi }_{pw(pu)}\) to the specific task. With the learned weights as coefficients, we fuse the tweet node representations of the subgraphs and obtain the source tweet representations \({P}_{m}\) as follows:

$${P}_{m}=\left\{{p}_{1},{p}_{2},\dots ,{p}_{m}\right\}$$
(22)
$${p}_{i}=\sum_{\Phi \in \left\{pw,pu\right\}}{\beta }_{\Phi }\cdot {p}_{{m}_{i}},\quad {p}_{{m}_{i}}\in {P}_{\Phi }^{,}$$
(23)

where \(m\) is the number of source tweets, \({p}_{i}\) denotes the fused representation of tweet node \(i\), \({p}_{{m}_{i}}\) denotes its representation in the \(\Phi\) subgraph, and \({P}_{\Phi }^{,}\) represents the tweet node representations with global relation information in the \(\Phi\) subgraph.
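The global fusion of Eqs. (20) to (23) can be sketched as follows. The sketch assumes that both subgraphs provide tweet representations of the same dimension and in the same tweet order; the function names and the toy parameters used to exercise them are hypothetical.

```python
import math


def subgraph_importance(S, W, b, q):
    """Eq. (20): average over nodes of q^T tanh(W x + b)."""
    total = 0.0
    for x in S:
        hidden = [
            math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)
        ]
        total += sum(q_k * h_k for q_k, h_k in zip(q, hidden))
    return total / len(S)


def softmax(values):
    """Eq. (21): normalize the subgraph importances into weights."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_tweets(P_pw, P_pu, beta_pw, beta_pu):
    """Eq. (23): weighted sum of the two tweet representations, tweet by tweet."""
    return [
        [beta_pw * a + beta_pu * c for a, c in zip(p_pw, p_pu)]
        for p_pw, p_pu in zip(P_pw, P_pu)
    ]
```

Because `W`, `b` and `q` are shared by both subgraphs, the two importance scores are directly comparable before the softmax turns them into weights that sum to one.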

Adaptive gated fusion strategy

We concatenate the latent representations of the two subgraphs as the input to the adaptive gated fusion unit. The gate unit promotes competition or collaboration between neurons and selects, from the features of each subgraph, those that are more conducive to rumor detection. The adaptive gated fusion network can be denoted as:

$$\text{S}=\left[{S}_{pw}^{,};{S}_{pu}^{,}\right]$$
(24)
$$\text{g}=\upsigma \left({W}_{gat}\cdot S+b\right)$$
(25)
$${G}_{gat}=\text{tanh}\left(\text{g}\odot S\right)$$
(26)

where \(S\) represents the concatenation of the node features of the tweet-word subgraph and the tweet-user subgraph, which include the global semantic features and the user communication relationship features. \(\text{g}\) is the state of the adaptive gated fusion unit, and \({G}_{gat}\) denotes the shared feature \(S\) after passing through the adaptive gated fusion unit. \({W}_{gat}\) is the weight matrix, \(b\) is the bias vector, and \(\upsigma\) is the sigmoid activation function.
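Eqs. (24) to (26) can be illustrated for a single tweet with the sketch below. It assumes that the two feature vectors are concatenated along the feature dimension and that \({W}_{gat}\) is square, so that the gate has one entry per concatenated feature; the function name is illustrative.

```python
import math


def gated_fusion(s_pw, s_pu, W_gat, b):
    """Return tanh(g * S) with g = sigmoid(W_gat S + b) for one tweet."""
    s = s_pw + s_pu                                   # S = [S_pw ; S_pu]
    gate = [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, s)) + bias)))
        for row, bias in zip(W_gat, b)
    ]
    return [math.tanh(g * v) for g, v in zip(gate, s)]
```

A gate value near 0 suppresses the corresponding feature and a value near 1 lets it through, so the unit can keep the semantic features of one tweet while damping its propagation features, or the reverse.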

As the last layer, the global attention feature \({p}_{i}\) and local gate feature \({G}_{gat}\) are then fed to softmax layer for classification respectively. The formula is as follows:

$$\widehat{{y}_{glo}}=softmax\left({p}_{i}W+b\right)$$
(27)
$$\widehat{{y}_{gat}}=softmax\left({G}_{gat}W+b\right)$$
(28)

We use the cross-entropy loss with a regularization term as the objective function to train the model’s parameters.

$${\mathcal{L}}_{gol}=-\sum_{i\in m}{y}_{i}log\widehat{{y}_{glo}}+\lambda ||\theta {||}_{2}^{2}$$
(29)
$${\mathcal{L}}_{gat}=-\sum_{i\in m}{y}_{i}log\widehat{{y}_{gat}}+\lambda ||\theta {||}_{2}^{2}$$
(30)
$$\mathcal{L}=\upeta {\mathcal{L}}_{gat}+\left(1-\eta \right){\mathcal{L}}_{gol}$$
(31)

where \({y}_{i}\) denotes the ground-truth one-hot vector of the i-th source tweet, \(\lambda\) represents the trade-off coefficient, \(||\cdot {||}_{2}^{2}\) indicates the L2 regularization term that prevents overfitting, and \(\eta\) is the break-even parameter that balances the two losses.
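The two classification branches and their combination in Eqs. (27) to (31) can be sketched as below. The flat parameter list `params` standing in for \(\theta\), the constant `eps` and all function names are assumptions made for the example.

```python
import math


def predict(feature, W, b):
    """Softmax over feature W + b; `feature` is a row vector, W has one column per class."""
    logits = [
        sum(f * W[k][c] for k, f in enumerate(feature)) + b[c]
        for c in range(len(b))
    ]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_loss(labels, probabilities, params, lam, eps=1e-12):
    """Cross-entropy over all tweets plus lam * ||theta||_2^2."""
    data_term = -sum(
        y * math.log(p + eps)
        for y_row, p_row in zip(labels, probabilities)
        for y, p in zip(y_row, p_row)
    )
    return data_term + lam * sum(v * v for v in params)


def detector_loss(loss_gat, loss_glo, eta):
    """Eq. (31): eta weights the gated branch, (1 - eta) the global branch."""
    return eta * loss_gat + (1.0 - eta) * loss_glo
```

With \(\eta =0.4\), for example, a gated-branch loss of 2.0 and a global-branch loss of 1.0 combine to 0.4 · 2.0 + 0.6 · 1.0 = 1.4.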

Joint training

We encode the textual semantic information and user propagation information by the encoder module. The graph reconstruction aims to reconstruct the data to learn the structure information while the multi-feature decision-level decoding aims to classify the event. We jointly train these modules by minimizing the loss over all events and the final loss is computed as:

$$\text{Loss}=\upkappa \cdot \left({\mathcal{L}}_{pw\left(pu\right)}+{\mathcal{L}}_{{\mu ,{\sigma }^{2}}_{pw\left(pu\right)}}\right)+\mathcal{L}$$
(32)

where \(\upkappa\) is also a break-even parameter. Since the graph reconstruction loss is far greater than the event classification loss, we use \(\upkappa\) to rescale the reconstruction term in the loss function.
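The effect of \(\upkappa\) in Eq. (32) can be seen in a short sketch. It reads the notation \(pw(pu)\) as a sum over both subgraphs, which is an assumption, and the numbers in the usage note are invented purely for illustration.

```python
def joint_loss(reconstruction_losses, kl_losses, classification_loss, kappa):
    """Eq. (32): kappa * (reconstruction + KL, summed over subgraphs) + classification."""
    graph_term = sum(reconstruction_losses) + sum(kl_losses)
    return kappa * graph_term + classification_loss
```

For instance, with hypothetical reconstruction losses of 10.0 and 20.0, KL terms of 1.0 and 2.0, a classification loss of 1.4 and \(\upkappa =0.1\), the graph term of 33.0 contributes only 3.3, giving a total of 4.7 instead of being dominated by the reconstruction loss.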

Experiment

In this section, we first introduce datasets used in the experiment and then we will evaluate our proposed model on the datasets compared with other baseline models.

Datasets

We evaluate our proposed method on two real-world datasets: Twitter15 and Twitter16 (ref. 7). Both were collected from Twitter, one of the most popular social media sites in the world, and contain 1490 and 818 source tweets, respectively. Nodes refer to source tweets, the words contained in the source tweets, and the users; edges represent the tweet-word, word-word and tweet-user relationships; and edge features are given by the TF-IDF values, the PMI values and the time at which a user retweeted or replied to tweets related to the source tweet. The Twitter15 and Twitter16 datasets contain four labels: Non-rumor (N), False Rumor (F), True Rumor (T), and Unverified Rumor (U). The label of each source tweet in Twitter15 and Twitter16 is annotated according to the veracity tag of the article on rumor debunking websites (e.g., snopes.com, Emergent.info, etc.). The statistics of the two datasets are shown in Table 1.

Table 1 Statistics of the datasets.

Setting

We implement our models using the same set of hyperparameters in all experiments. We use the micro-averaged accuracy (Acc.) over all categories and the F1 score, computed from the precision and recall of each category, to evaluate the models. The batch size is 64. The hidden dimension of the GCN layers is 32. The learning rate is initialized at 5e−4 and gradually decreases during training. Training runs for 30 epochs. We initialize the word vectors with 300-dimensional word embeddings. The number of heads K of the GAT is set to 8 and its hidden size is 32. We select the best break-even parameters: \(\eta\) is 0.4 and \(\kappa\) is 0.1. The training/validation/test split is 70%/15%/15%.
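The two evaluation measures can be computed as in the following sketch, which assumes single-label predictions over the four classes; the label strings and function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of source tweets whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred, classes=("N", "F", "T", "U")):
    """Per-class F1 from precision and recall; 0.0 when a class is never hit."""
    scores = {}
    for label in classes:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        scores[label] = 2 * precision * recall / denominator if denominator else 0.0
    return scores
```

For single-label classification, the micro-averaged accuracy over all categories coincides with the plain accuracy computed here.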

Baselines

We compared our model with a range of current baseline rumor detection models and state-of-the-art models as follows:

  • DTC1: A rumor detection method using decision tree classifiers with manual features to obtain information credibility.

  • SVM-TS35: A linear SVM classifier model that considers the structure of time series.

  • SVM-TK7: A SVM classifier with a propagation Tree Kernel on the basis of the propagation structures of rumors.

  • MVAE36: A multimodal rumor detection model that combines variational autoencoder and classifiers to explore text and picture information.

  • RvNN10: A rumor detection model based on the propagation tree structure that uses GRU units to learn rumor representations.

  • PPC6: A detection model for mining user feature sequences that combines RNN and CNN networks.

  • GCAN31: A propagation-based detection model that uses source tweets and user characteristics and combines GCN with a dual co-attention mechanism.

  • VAE-GCN32: A rumor detection model that uses GCN as an encoder and a variational GAE as a decoder to explore the structure of rumor propagation.

  • BI-GCN30: A GCN-based rumor detection model using semantic bidirectional propagation structure.

  • GLAN39: A rumor detection model that jointly encodes the global information between source tweets, retweets and users.

  • HGATRD4: A heterogeneous graph attention model based on meta-path, used to capture text semantic features and global propagation features.

Tables 2 and 3 show the performance of all compared methods on the Twitter15 and Twitter16 datasets, including the baseline models and the current state-of-the-art rumor detection models. Bold indicates the highest result on each test set; a slash indicates that the method was not evaluated on that test set. For a fair comparison, the results of the compared rumor detection models are quoted directly from the best previously reported performance.

Table 2 Rumor detection results on Twitter15 datasets.
Table 3 Rumor detection results on Twitter16 datasets.

First, the baselines that use manual features (DTC, SVM-TS, SVM-TK) perform significantly worse than the methods based on deep learning. Deep learning methods are better at extracting effective features of rumors, while methods based on manual features are less accurate and less efficient.

Second, our model performs better than the current state-of-the-art rumor detection algorithms, which supports its effectiveness in rumor detection. The GCN-based detection models (GCAN, VAE-GCN, BI-GCN, GLAN, HGATRD) perform better than the other deep learning models (RvNN, PPC), which indicates that GCN can learn more comprehensive information and better node representations from social networks. Because GRU, RNN and CNN cannot process graph-structured data, they ignore important structural features in social information, which degrades their performance. VAE-GCN and HGATRD are particularly strong among the baselines. However, these methods ignore the difference between semantic features and propagation representations, and global features are not well utilized. Our method achieves the best overall performance because it selectively captures the more effective features.

Finally, although our method does not obtain the best score on every metric when compared with some specific models, the trade-off among the different performance measures still shows its effectiveness for rumor detection.

Parameters experiments

In deep learning, the parameter selection of the model also has a great influence on the experimental results. By adjusting some important parameters in the model, the performance of the model can be improved significantly. In this section, we investigate the sensitivity of parameters.

Graph reconstruction parameters

At the stage of graph reconstruction, a good latent representation \({z}_{pw(pu)}\) should make the reconstructed adjacency matrix \(\widehat{{A}_{pw(pu)}}\) similar to the original adjacency matrix \({A}_{pw(pu)}\). However, in the actual training process, the reconstruction loss is far greater than the classification loss of rumor events. We therefore add \(\upkappa\) to balance the training loss, and the experimental results are shown in Fig. 3. When \(\upkappa =0.1\), the performance is the best. The reason is that graph reconstruction serves to better explore global structural information, whereas event classification is the main objective of the model, so the reconstruction loss should be given a smaller scale factor.

Fig. 3

Comparative experiment of different balance loss parameters.

Decision parameters

The tweet-word subgraph focuses on capturing the global semantic relationships of the text content, and the tweet-user subgraph focuses on exploring the information involved in the propagation of the source tweet. We design the global feature screening mechanism and the local feature screening unit to select and filter more effective features flexibly. To check the influence of the screening mechanism on rumor detection, we explore the performance of the model with different coefficients \(\eta\); the results are shown in Fig. 3. The model performs best on Twitter15 when \(\eta =0.4\) and reaches its highest accuracy on Twitter16 when \(\eta =0.3\). This may be because Twitter15 contains more data: when multiple features with larger differences are interactively fused, the attention mechanism produces a more distinguishable feature representation and thus plays a more important role.

Finally, we select several better sets of parameters for joint training to get the best performance, as shown in Table 4:

Table 4 Optimal parameter selection.

Ablation analysis

To verify the effectiveness of the different modules in our model, we report the contribution of each component by removing it from the full model.

Importance analysis of subgraphs

We delete the tweet-word subgraph and the tweet-user subgraph from the model in turn, and use GAT and VGAE to model the remaining subgraph and learn node representations. Because each experiment uses the features of only one subgraph, we directly add the output features of GAT and GCN and send them to the multi-feature decision-level decoder module for rumor detection.

The empirical results are summarized in Table 5. The combination of multi-graph features yields better detection performance than either single subgraph. In social networks, rumors are highly misleading and difficult to identify from a single characteristic. Specifically, the detection accuracy of the tweet-word subgraph is higher than that of the tweet-user subgraph, which suggests that text semantic information is more important for the rumor detection task.

Table 5 The importance analysis of subgraphs.

Importance analysis of VGAE

For a more intuitive comparison, we perform a visualization task that lays out the heterogeneous graphs in a low-dimensional space. To explore the influence of global structure information on rumor detection, we delete the graph reconstruction module. We use GAT to model the tweet-word subgraph and the tweet-user subgraph separately and learn node features without reconstructing the event structure, while the other experimental settings remain unchanged. We then visualize the outputs in a two-dimensional space by applying the \(t\)-SNE algorithm37.

Figure 4 shows the experimental results of these methods on the Twitter15 and Twitter16 datasets. Both variants separate the different types of events (NR, FR, TR, UR) reasonably well, but VGAE-GAT performs better. Specifically, the points in Only-GAT are more scattered and irregular, and more event categories overlap each other, whereas the points in VGAE-GAT are distributed regularly, with smaller distances between points of the same class and larger distances between points of different classes. This indicates that using VGAE to learn the posterior distribution not only provides a more flexible graph generation model, but also captures structural information better and thus yields a better representation.

Fig. 4

The importance analysis of VGAE.
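The visual impression of smaller distances within a class and larger distances between classes can also be expressed as a number. The sketch below is one simple way to do so, not a measure used in our experiments: it computes the ratio of the mean inter-class distance to the mean intra-class distance over two-dimensional points such as \(t\)-SNE outputs. The function name `separation_ratio` and the toy coordinates are hypothetical and serve only to illustrate the idea.

```python
import math
from itertools import combinations


def separation_ratio(points, labels):
    """Mean inter-class distance divided by mean intra-class distance.

    points : list of (x, y) tuples, e.g. two-dimensional t-SNE coordinates
    labels : class label of each point (e.g. "NR", "FR", "TR", "UR")
    A ratio well above 1 means classes are compact and far apart.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        (intra if lp == lq else inter).append(d)
    if not intra or not inter:
        raise ValueError("need at least one same-class and one cross-class pair")
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))


labels = ["NR", "NR", "FR", "FR"]
# Toy layout with compact, well-separated classes.
compact = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
# Toy layout in which the two classes overlap.
overlapping = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0)]

ratio_compact = separation_ratio(compact, labels)
ratio_overlapping = separation_ratio(overlapping, labels)
```

Because \(t\)-SNE does not preserve global distances faithfully, such a ratio is only a rough complement to the qualitative comparison in Fig. 4.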

Importance analysis of multi-feature screening strategy

To further evaluate how well our model captures valuable features, we remove the decision-level global feature fusion strategy and the adaptive gated fusion strategy in turn. Note that the balance parameter \(\eta\) no longer applies in this setting, and we directly use the output of the remaining filtering mechanism for rumor detection.

From the experimental results in Table 6, we make the following observations. Among the feature screening mechanisms, the attention-based global feature screening component brings a larger performance improvement than the gating-based adaptive feature screening component. Specifically, the accuracy of the decision-level global feature fusion strategy is 3.6% and 3.3% higher than that of the adaptive gated fusion strategy on the Twitter15 and Twitter16 datasets, respectively. Our model combines the two fusion mechanisms, and the results support the rationality of the interactive fusion of multiple features.

Table 6 The importance analysis of the multi-feature screening strategy.

Conclusion

In this study, we propose a multi-feature interaction heterogeneous graph reconstruction method with semantic graph and user propagation graph constraints for rumor detection. This method makes full use of the difference between the textual semantic features in heterogeneous graphs and the structural features of user propagation. Specifically, we decompose the heterogeneous tweet-word-user graph into a tweet-word subgraph and a tweet-user subgraph, use GCN and GAT to learn text semantic information and user communication representations, and apply VGAE to learn the overall structure representation. In addition, to effectively select and utilize multi-graph global features, we explore a multi-feature screening mechanism to detect rumors. The method thus makes full use of text semantic features and user propagation pattern features, and learns more effective feature information through a multi-feature fusion strategy. Experimental results on two real-world datasets, Twitter15 and Twitter16, demonstrate the effectiveness of the proposed method, which achieves new state-of-the-art results. Ablation studies confirm the usefulness of the different parts of the model.