Introduction

Text classification1 is one of the fundamental tasks in the field of natural language processing (NLP), aiming to automatically categorize text based on its content. It significantly enhances the efficiency of organizing, retrieving, and understanding text, making it easier for people to extract the required information from large volumes of text data. As an important technology, text classification has been successfully applied in various fields. For example, in sentiment analysis2, it helps identify users’ emotional tendencies towards a product or service. In information retrieval3, text classification aids in the effective filtering and sorting of vast amounts of information. In opinion mining4, it enables the identification and analysis of public opinions and attitudes towards a specific topic or event.

Traditional text classification methods include machine learning techniques5,6,7 and deep learning approaches8,9,10. Machine learning methods rely on handcrafted features, representing text by statistically analyzing word frequencies and co-occurrence relationships. Deep learning methods automatically learn complex feature representations from text using neural network models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, these traditional methods struggle to capture the complex relationships and structural information in texts, such as dependencies between words, and perform poorly in handling long-distance dependencies.

To address these issues, Graph Neural Networks (GNNs)11,12 have been widely used in text classification tasks. Yao et al.13 proposed TextGCN, which transforms the text classification problem into a node classification problem using graph convolutional networks. Huang et al.14 improved TextGCN by introducing a message-passing mechanism and reducing memory consumption. Zhang et al. proposed TextING15, which enhances the model’s generalization ability by constructing relationships between words. Tayal et al.16 extended graph convolutional networks by incorporating label dependencies in the output space. Wang et al.17 proposed two different graph construction methods to capture sparse semantic relationships between short texts. Song et al.18 leveraged corpus graphs and sentence graphs for text classification. Li et al.19 used semantic feature graphs to represent text semantics and proposed a semantic information transfer mechanism to propagate contextual semantic information. Although GNNs can consider both local and global structural features of the original data in text classification, the lack of rich contextual representations for nodes leads to slower convergence and lower accuracy.

In recent years, pre-trained models have achieved notable success in text classification tasks. For example, Word2Vec20 made it possible to convert words into word vectors at scale. Transformer21 utilized attention mechanisms to capture intrinsic relationships in data. BERT22 uses bidirectional pre-training to extract more information and achieve more accurate semantic understanding. Pre-trained models learn the features and patterns of data by being pre-trained on large-scale datasets. These models typically use unsupervised learning methods to train on unlabeled data, learning a general representation of the data. This significantly enhances the performance of models in text classification tasks. However, pre-trained models primarily rely on sequential data to establish relationships between words, which often overlooks structural information, making it challenging to capture both local and global structural features.

To combine the advantages of pre-trained models and GNNs, several studies have been proposed. Yang et al.23 proposed the BEGNN model, which constructs a text graph based on word co-occurrence relationships and uses GNNs to extract text features. Lv et al.24 proposed the RB-GAT model, which achieves text classification by combining RoBERTa25, BiGRU26 and the Graph Attention Network (GAT)27. Lin et al.28 proposed the BertGCN model for text classification, which constructs a heterogeneous graph for the corpus and utilizes BERT to initialize document nodes, then performs convolution operations using GCN29. By jointly training the BERT and GCN modules, the model can better capture semantic information and word dependencies in the text, making it perform well on many text classification datasets.

However, despite the success of BertGCN in text classification, it still has some limitations. Although BERT can capture bidirectional contextual information, it randomly masks some words during training, resulting in a pretrain-finetune discrepancy. Additionally, BERT processes sequence information in segments, which limits the model’s ability to handle long sequences. Moreover, Li et al.30 and Xu et al.31 pointed out that GCN suffers from an over-smoothing problem: as the number of layers increases, the representations of different nodes become too similar, leading to information loss and performance degradation. To address these issues, we propose a new model, XLG-Net. In this model, XLNet32 uses an autoregressive approach for pre-training, which avoids the inconsistencies caused by masked language models and better captures contextual information through a two-stream attention mechanism. GCNII33 alleviates the over-smoothing problem by introducing initial residuals and identity mapping34. Finally, we incorporate the design philosophy of DoubleMix35 into the model, mixing the model’s hidden states, which effectively enhances its performance in handling both long and short texts.

Specifically, our contributions are as follows:

  1. We propose a novel hybrid architecture, XLG-Net. Our proposed model uses the XLNetMix module and a deep GCNII module to extract text features. It captures contextual information more effectively while considering the structural information of the text, enabling precise text representation and significantly improving classification accuracy.

  2. We incorporate the design philosophy of DoubleMix into XLNet by mixing the hidden states in the model. By learning from the hidden states obtained from both the original and augmented data, the model obtains better feature representations for text data, enhancing accuracy and robustness.

  3. Our method outperforms other baseline models on four benchmark text classification datasets. Through experiments and analysis, we demonstrate the effectiveness of our approach.

Related work

Pre-training model

In the field of NLP, pre-trained models36 can be divided into two categories. The first type mainly learns shallow word embeddings, such as Word2Vec20 and GloVe37. Although they can generate high-quality word vectors, these pre-trained word vectors cannot change dynamically with context, making it difficult to understand higher-level text concepts. The second type primarily learns contextual embeddings, where the semantic information of words changes with the context. For example, the ELMo38 model, which is based on bidirectional LSTM39, addresses the polysemy problem, producing word vectors that vary with the context. The ULMFiT40 model achieves strong results in text classification tasks by fine-tuning pre-trained models, eliminating the need to train models from scratch each time. Vaswani et al.21 proposed the Transformer architecture, which advanced the attention mechanism41, improving parallel computation and enabling the model to learn more data features in a shorter time. Inspired by Transformer, the GPT model42 was proposed, which uses a unidirectional Transformer in place of ELMo’s LSTM for pre-training; it was the first pre-trained model to incorporate the Transformer architecture. Devlin et al.22 introduced the BERT pre-trained model, which employs a bidirectional Transformer for pre-training, enabling better utilization of bidirectional contextual information. BERT broke 11 NLP task records, including raising the GLUE benchmark to 80.4%, achieving an accuracy of 86.7% on MultiNLI (an absolute improvement of 5.6%), and raising the SQuAD v1.1 question-answering test F1 score to 93.2. The RoBERTa25 model is an optimized version of BERT, which carefully revisits BERT’s hyperparameters and makes targeted modifications to enhance the model’s representation ability. ALBERT43 is one of the classic variants of BERT, which significantly reduces the number of parameters while maintaining performance. XLNet32 integrates several important ideas from Transformer-XL44 into pre-training; for example, by introducing segment-level recurrence and relative positional encodings, it can handle long texts more effectively. Additionally, XLNet increased the scale of data used during pre-training, surpassing BERT on 20 tasks and achieving state-of-the-art performance on 18 tasks.

Graph neural network

In recent years, GNNs have received widespread attention for their ability to extract and integrate features from multi-scale local spatial data, demonstrating strong representational abilities. They have successfully extended deep learning models from Euclidean to non-Euclidean spaces. In graphs, there are not only node features but also structural features. GNNs capture dependencies and relationships between nodes through messages passed along edges. GCN was the first to directly apply the convolution operation from image processing to graph-structured data. The main idea is to compute a weighted average of a node’s own information and that of its neighbors, thereby obtaining a feature vector that can be fed into a neural network. Veličković et al.27 proposed GAT, which uses an attention mechanism to compute the attention weights of each node with respect to its neighboring nodes, enhancing the model’s robustness and interpretability. Hamilton et al.45 introduced GraphSAGE, which uses a sampling mechanism to overcome the high memory consumption and slow computation of gradient updates on large-scale graphs.

Despite the significant achievements of GCN and its variants, most of these models are shallow. For example, the optimal configurations of GCN and GAT are typically two-layer networks. Theoretically, too few layers limit the model’s ability to capture higher-order neighbor information. However, when the number of layers is too high, these models suffer from a degradation in learning ability, a phenomenon known as over-smoothing. Chen et al.33 proposed the GCNII model, which incorporates residual connections and identity mapping into GCNs to address the over-smoothing issue, thereby ensuring that deeper GCNII models achieve at least the same performance as their shallow counterparts.

The fusion of pre-trained models and graph neural networks

After the success of pre-trained models and GNNs, some researchers proposed combining the two approaches. Lu et al.46 proposed VGCN-BERT, which integrates the BERT model with a Vocabulary Graph Convolutional Network (VGCN) to capture both local and global information in the data. Graph-Bert47 decomposes the graph into subgraphs to learn the feature information of each node and improves the efficiency of the model through parallel processing. Lin et al.28 proposed BertGCN, a text classification model that combines pre-trained models with transductive learning. It employs techniques such as prediction interpolation, memory storage, and small learning rates during training, achieving significant performance improvements on five text classification datasets. ViCGCN48 leverages the complementary strengths of pre-trained models and GCN to capture more syntactic and semantic information, addressing the issues of data imbalance and noise in Vietnamese social media texts.

Method

In this part, we describe the structure of XLG-Net in detail.

Fig. 1

An overview of XLG-Net architecture.

Architecture overview

The structure of the proposed XLG-Net model is shown in Fig. 1. We have designed four modules: (1) graph construction; (2) feature extraction based on XLNetMix; (3) feature extraction based on GCNII; and (4) feature aggregation. In the graph construction module, namely the Heterogeneous Graph section in Fig. 1, we construct a heterogeneous graph for each dataset according to the importance and relevance of its documents and words; the specific methodology is expounded in the section titled “Graph construction”. The XLNetMix module processes the input text. The contextual feature representations it extracts are used, on the one hand, to initialize the embedding representations of the document nodes in the heterogeneous graph and, on the other hand, in the subsequent feature fusion module. The heterogeneous graph initialized by the XLNetMix module is then fed into the GCNII module, which applies graph convolution operations to aggregate the information of surrounding words in the sentence; the specific details are explained in the section titled “GCNII based feature extraction”. Finally, the predictions of the GCNII module and the XLNetMix module are fused to obtain the ultimate prediction result.

Graph construction

We construct a heterogeneous graph for each dataset, denoted as \(G=\left( V,E \right)\), where \(V\) is the set of all nodes in the graph and \(E\) is the set of edges between nodes, as shown in Fig. 2.

Fig. 2

Schematic of the GCNII layer in XLG-Net. Example taken from the MR corpus. Nodes beginning with “O” are document nodes; the others are word nodes. \(R\left( x \right)\) means the embedding of \(x\). Different colors denote different document classes.

Following the rules of TextGCN, we divide the nodes into document nodes and word nodes. The connections between word nodes and document nodes are measured using term frequency-inverse document frequency (TF-IDF). The connections between word nodes are measured using positive pointwise mutual information (PMI). The weight of the edge connecting nodes \(i\) and \(j\) is calculated by Eq. (1):

$$\begin{aligned} A_{i,j}=\left\{ \begin{array}{ll} \text {PMI}\left( i,j \right), & i,j\,\,\text {are words and}\,\,i\ne j\\ \text {TF-IDF}\left( i,j \right), & i\,\,\text {is document},\,\,j\,\,\text {is word}\\ 1, & i=j\\ 0, & \text {otherwise}\\ \end{array} \right. \end{aligned}$$
(1)

The PMI value of a word pair \(i\), \(j\) is calculated by Eqs. (2)(3)(4):

$$\begin{aligned} \text {PMI}\left( i,j \right)= & \log \frac{p\left( i,j \right) }{p\left( i \right) p\left( j \right) } \end{aligned}$$
(2)
$$\begin{aligned} p\left( i,j \right)= & \frac{\#W\left( i,j \right) }{\#W} \end{aligned}$$
(3)
$$\begin{aligned} p\left( i \right)= & \frac{\#W\left( i \right) }{\#W} \end{aligned}$$
(4)

where \(\#W\left( i,j \right)\) is the number of sliding windows that contain both word \(i\) and word \(j\), \(\#W\left( i \right)\) is the number of sliding windows that contain word \(i\), and \(\#W\) is the total number of sliding windows.

Algorithm 1

Construct Graph.

Similarly, we use an identity matrix \(X=I_{n_{\text {doc}}+n_{\text {word}}}\) as the initial features of the nodes, where \(n_{\text {doc}}\) represents the number of document nodes and \(n_{\text {word}}\) represents the number of word nodes. In the XLG-Net model, we use the XLNetMix module to extract features from the text data and use them as the input representations of the document nodes. The embedding of the document nodes is denoted as \(X_{\text {doc}}\in \mathbb {R}^{n_{\text {doc}}\times d}\), where \(d\) is the dimension of the embedding. Consequently, the initial feature matrix of the nodes can be represented as Eq. (5):

$$\begin{aligned} X=\left( \begin{array}{c} X_{\text {doc}}\\ 0\\ \end{array} \right) _{\left( n_{\text {doc}}+n_{\text {word}} \right) \times d} \end{aligned}$$
(5)

Algorithm 1 outlines the steps for constructing a heterogeneous graph.
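For concreteness, the sketch below illustrates how the edge weights of Eq. (1) could be computed in practice, assuming pre-tokenized documents; the function name, window size, and TF-IDF variant are illustrative and not taken from our implementation.

```python
# A minimal sketch of the heterogeneous-graph edge weights of Eq. (1):
# TF-IDF edges between documents and words, PMI edges between word pairs,
# and self-loops with weight 1. Helper names and the window size are illustrative.
import math
from collections import Counter
from itertools import combinations

def build_graph_weights(docs, window_size=20):
    """docs: list of tokenized documents (lists of words).
    Returns a dict {(node_i, node_j): weight}, with documents indexed as
    ('doc', d) and words as ('word', w)."""
    edges = {}
    n_docs = len(docs)

    # TF-IDF edges between document nodes and word nodes (one common variant).
    df = Counter(w for doc in docs for w in set(doc))      # document frequency
    for d, doc in enumerate(docs):
        tf = Counter(doc)
        for w, f in tf.items():
            idf = math.log(n_docs / df[w])
            edges[('doc', d), ('word', w)] = (f / len(doc)) * idf

    # PMI edges between word nodes, computed over sliding windows.
    windows = [doc[i:i + window_size]
               for doc in docs
               for i in range(max(1, len(doc) - window_size + 1))]
    n_windows = len(windows)
    w_count = Counter()        # #W(i): windows containing word i
    pair_count = Counter()     # #W(i, j): windows containing both i and j
    for win in windows:
        uniq = set(win)
        w_count.update(uniq)
        pair_count.update(combinations(sorted(uniq), 2))
    for (wi, wj), n_ij in pair_count.items():
        pmi = math.log((n_ij / n_windows) /
                       ((w_count[wi] / n_windows) * (w_count[wj] / n_windows)))
        if pmi > 0:                                        # keep only positive PMI
            edges[('word', wi), ('word', wj)] = pmi

    # Self-loops with weight 1 for all nodes.
    for d in range(n_docs):
        edges[('doc', d), ('doc', d)] = 1.0
    for w in w_count:
        edges[('word', w), ('word', w)] = 1.0
    return edges
```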

XLNetMix based feature extraction

In this section, we introduce the structure of XLNetMix. The input of XLNetMix is the sample data, and its output is a contextualized word embedding that encodes the information of the entire input sequence. An overview of the XLNetMix architecture is shown in Fig. 3.

Fig. 3

The architecture of the XLNetMix.

To improve the robustness of the model, we perform two back-translation operations on each dataset to obtain two perturbed samples. Based on the idea of DoubleMix, we perform the mixing operation in the hidden space of XLNet. As shown in Fig. 3, we use the XLNet pre-trained model with \(L\) layers. First, we feed both the original sample and the two perturbed samples into the XLNet layers; then we use the \(i\)-th layer in [0, \(L\)] as the mixing layer to perform a two-step interpolation operation. The first step mixes the two perturbed samples by sampling a group of weights from a Dirichlet distribution, and the second step mixes the synthesized perturbed sample with the original sample by sampling a weight between 0 and 1 from a Beta distribution. Since the generated perturbation samples may contain injected noise, when mixing the original data and the synthesized perturbation data we constrain the weight of the original data to a larger value, balancing the trade-off between perturbation and noise so that the final representation stays closer to the original representation. The feature representations output by the last layer of the model are used as input representations for the document nodes, and these contextual embedding representations are then fed to the GCNII layer.
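The two-step interpolation can be sketched as follows, assuming the hidden states of the original sample and the two back-translated samples at the chosen mixing layer are already available; the distribution parameters and function name are illustrative.

```python
# A minimal sketch of the DoubleMix-style two-step interpolation in hidden space.
# h_orig, h_pert1, h_pert2 are hidden states at the chosen mixing layer, each of
# shape (batch, seq_len, hidden); the concentration parameters are illustrative.
import torch

def two_step_mix(h_orig, h_pert1, h_pert2, dirichlet_alpha=1.0, beta_alpha=0.75):
    # Step 1: mix the two perturbed samples with weights drawn from a Dirichlet distribution.
    w = torch.distributions.Dirichlet(
        torch.tensor([dirichlet_alpha, dirichlet_alpha])).sample()
    h_pert = w[0] * h_pert1 + w[1] * h_pert2

    # Step 2: mix the synthesized perturbation with the original sample using a
    # Beta-distributed weight, constrained so the original sample gets the larger weight.
    lam = torch.distributions.Beta(beta_alpha, beta_alpha).sample()
    lam = torch.max(lam, 1.0 - lam)
    return lam * h_orig + (1.0 - lam) * h_pert
```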

GCNII based feature extraction

For heterogeneous graph \(G\), we use GCNII for feature extraction and propagation. Each layer of GCNII is defined as Eq. (6):

$$\begin{aligned} {H}^{\left( \ell +1 \right) }=\sigma \left( \left( \left( 1-\alpha _{\ell } \right) {\tilde{P}H}^{\left( \ell \right) }+\alpha _{\ell }{H}^{\left( 0 \right) } \right) \left( \left( 1-\beta _{\ell } \right) {I}_n+\beta _{\ell }{W}^{\left( \ell \right) } \right) \right) \end{aligned}$$
(6)

where \(H^{\left( 0 \right) }\) represents the feature representation of the \(0\)-th layer, \(\alpha _{\ell }\) and \(\beta _{\ell }\) are two hyperparameters, and \(\tilde{P}=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\) is the graph convolution matrix49. Residual connections have long been one of the most commonly used techniques to alleviate the over-smoothing problem. However, GCNII does not take information from the previous layer; instead, it adds a residual connection to the initial layer, linking the smoothed node representation \(\tilde{P}H^{\left( \ell \right) }\) with the initial layer \(H^{\left( 0 \right) }\) and assigning it a weight \(\alpha _{\ell }\), which corresponds to the first part of Eq. (6). The initial layer \(H^{\left( 0 \right) }\) is not the original input feature, but is obtained by applying a linear transformation \(H^{\left( 0 \right) }=f_{\theta }\left( X \right)\) to the input features. Using initial residuals alone can only partially alleviate over-smoothing; therefore, GCNII also incorporates the idea of identity mapping from ResNet, adding an identity matrix \(I_n\) to the weight matrix \(W^{\left( \ell \right) }\) with a weight \(\beta _{\ell }\).

Overall, the concept of initial residuals is to weight between the current layer representation and the initial layer representation, while identity mapping weights between the parameter matrix \(W^{\left( \ell \right) }\) and the identity matrix \(I_n\). The output of GCNII is the updated feature representation, denoted as \(Z_{\text {GCNII}}\), which captures the structural information of the document and yields the final prediction through the \(softmax\) layer, as shown in Eq. (7):

$$\begin{aligned} Z_{\text {GCNII}}=softmax\left( g\left( X,A \right) \right) \end{aligned}$$
(7)

where \(g\) denotes the GCNII model, \(X\) is the input feature matrix, and \(A\) is the normalized adjacency matrix.
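As a concrete illustration of Eq. (6), the following sketch implements one GCNII propagation step in PyTorch, assuming the normalized adjacency \(\tilde{P}\) is precomputed as a dense tensor; the class name and default hyperparameter values are illustrative.

```python
# A minimal sketch of one GCNII layer following Eq. (6). P_tilde is the precomputed
# normalized adjacency matrix, h is H^(l), and h0 is the initial representation
# H^(0) = f_theta(X). Names and default values are illustrative.
import torch
import torch.nn as nn

class GCNIILayer(nn.Module):
    def __init__(self, hidden_dim, alpha=0.1, beta=0.5):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.weight = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W^(l)

    def forward(self, h, h0, P_tilde):
        # Initial residual: combine the smoothed representation P_tilde H^(l)
        # with the initial representation H^(0), weighted by alpha.
        support = (1 - self.alpha) * (P_tilde @ h) + self.alpha * h0
        # Identity mapping: interpolate between the identity matrix and W^(l),
        # weighted by beta, then apply the non-linearity.
        return torch.relu((1 - self.beta) * support + self.beta * self.weight(support))
```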

Joint XLNetMix and GCNII predictions

Through preliminary experiments, we find that adding an auxiliary classifier to XLG-Net and directly using the document embeddings generated by XLNetMix as input leads to faster convergence and better performance. Specifically, we construct the auxiliary classifier by feeding the document embeddings into a dense layer with \(softmax\) activation, as shown in Eq. (8):

$$\begin{aligned} Z_{\text {XLNetMix}}=softmax\left( WX \right) \end{aligned}$$
(8)

where \(X\) represents the document embeddings and \(W\) represents the weight matrix.

Specifically, we propose using \(\lambda\) to control the trade-off between XLNetMix predictions and GCNII predictions to optimize the XLG-Net model. We implement this using the following Eq. (9):

$$\begin{aligned} Z=\lambda Z_{\text {GCNII}}+\left( 1-\lambda \right) Z_{\text {XLNetMix}} \end{aligned}$$
(9)

In the section on hyper-parameter settings, we conduct comprehensive experiments on four datasets to determine the optimal value of \(\lambda\). XLNetMix can capture contextual relationships within documents, and GCNII can capture structural dependencies between words and documents. The combination of these two abilities can improve performance on tasks that require understanding the semantic relationships between words.
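For illustration, the auxiliary classifier of Eq. (8) and the interpolation of Eq. (9) can be sketched as follows; the module and argument names are illustrative, and \(\lambda\) defaults to the value found in the hyperparameter study.

```python
# A compact sketch of the prediction fusion in Eqs. (8)-(9): a linear auxiliary
# classifier over the XLNetMix document embeddings, interpolated with the GCNII
# predictions by the trade-off coefficient lambda. Names are illustrative.
import torch
import torch.nn as nn

class JointClassifier(nn.Module):
    def __init__(self, embed_dim, num_classes, lam=0.6):
        super().__init__()
        self.lam = lam
        self.aux = nn.Linear(embed_dim, num_classes, bias=False)  # Eq. (8): softmax(WX)

    def forward(self, doc_embeddings, z_gcnii_logits):
        z_xlnetmix = torch.softmax(self.aux(doc_embeddings), dim=-1)
        z_gcnii = torch.softmax(z_gcnii_logits, dim=-1)
        # Eq. (9): Z = lambda * Z_GCNII + (1 - lambda) * Z_XLNetMix
        return self.lam * z_gcnii + (1 - self.lam) * z_xlnetmix
```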

Experiments and analysis

Datasets

We conducted experiments on four widely used text classification datasets: Movie Review (MR), Ohsumed, R52 and R8. The summary statistics of the datasets are presented in Table 1:

Table 1 Summary statistics of datasets.
  1. Movie Review (MR). It is a sentiment classification dataset containing 10,662 movie reviews, each of which can be judged as positive or negative. Specifically, 5331 reviews are positive, and 5331 reviews are negative. In our experiments, 7108 documents of this dataset were employed as the training set and 3554 documents as the test set.

  2. Ohsumed. The dataset is sourced from the MEDLINE database, a significant bibliographic database of medical literature maintained by the National Library of Medicine. It contains 7400 documents with medical abstracts covering 23 categories of cardiovascular diseases. In this study, 3357 documents were used as the training set, and 4043 documents were used as the test set.

  3. R52 and R8. They are two subsets of the Reuters-21578 dataset. The R52 subset encompasses 52 categories, with 6532 training documents and 2568 test documents. The R8 subset contains 8 categories, divided into 5485 training documents and 2189 test documents.

Baselines

In order to comprehensively evaluate the proposed method, we compare XLG-Net with several widely recognized text classification models with good performance, including deep models for serialized data processing and models based on GNNs.

BERT22. It is a pre-trained language representation model introduced by Google in 2018. By pre-training on a vast corpus of text data, it acquires a deep understanding of textual semantics. The core of the BERT model is the Transformer architecture, which employs a bidirectional attention mechanism to capture long-distance dependencies within the text. BERT has achieved revolutionary results in multiple tasks of NLP, such as text classification, question-answering systems, and named entity recognition. Its bidirectional training approach enables the model to better comprehend contextual information, thereby excelling in a variety of language understanding tasks.

XLNet32. It is designed to address some of the limitations inherent in traditional language models and the BERT model when processing language. The core contribution of XLNet lies in its adoption of a novel pretraining objective known as Permutation Language Modeling (PLM). By considering all possible permutations of words in a sequence, it captures the interdependencies among different words, thereby more effectively modeling bidirectional context information. Furthermore, XLNet employs a dual-stream self-attention mechanism that achieves target position awareness through the self-attention of content and query streams, thus taking positional information into account during prediction.

SGC50. The SGC (Simple Graph Convolution) model is a simplified version of the graph convolutional network. It simplifies a multi-layer graph convolutional network into a linear model by eliminating the non-linear activation functions between GCN layers. This approach reduces the complexity and computational load of the model while maintaining performance comparable to the original GCN on multiple tasks. The core idea of the SGC model is that the non-linear activation functions in graph convolutional networks are not the key factor in improving performance; the operation of averaging local neighbor features is more critical. Therefore, the SGC model simplifies the model structure by removing these non-linear activation functions and retaining only the final softmax function for classification. In SGC, feature extraction is equivalent to applying a fixed filter on each feature dimension.

TextGCN13. It is a text classification method based on GCN. Its core concept involves leveraging the co-occurrence relationships among words in the text and the relationships between documents and words to construct a graph, where documents and words serve as the nodes of the graph. If a word appears in a document, there will be an edge between these two nodes. The weight of the edges can be represented using TF-IDF to signify the relationship between documents and words. The relationships between word nodes can be constructed through PMI, reflecting the co-occurrence relationships among words. Through the graph convolutional network, TextGCN is capable of learning the embedded representations of documents and words, which can then be utilized for text classification tasks.

TextING15. It is a text classification method based on GNNs, which enhances feature representation by constructing an individual word graph for each document, thereby improving the accuracy and generalization ability of classification. The core advantage of TextING lies in its ability to capture the structural information within documents and achieve hierarchical feature extraction through the iterative propagation of graph neural networks. It is suitable for various application scenarios, including news topic classification, sentiment analysis and professional document archiving.

GLTC51. This model is devised with two feature extractors for the purpose of capturing the structural and semantic information within the text. Moreover, the KL divergence is incorporated as a regularization term during the loss calculation process.

RB-GAT52. It is an innovative text classification model that integrates various advanced technologies such as RoBERTa, BiGRU, and GAT. Firstly, it utilizes RoBERTa to extract the initial semantic embeddings of the text. Subsequently, BiGRU further captures the long-distance dependencies and bidirectional information. Then, taking the output of BiGRU as node features, GAT analyzes the semantic structure and key information of the text by employing the multi-head attention mechanism. Finally, the classification results are obtained through the Softmax layer.

BertGCN28. It combines the BERT pre-trained model with GCN to enhance text processing abilities. BertGCN integrates the output of the BERT model as node features in the GCN, enabling the model to handle graph-structured data with text information more effectively. This approach leads to improved performance in text classification tasks.

Experiment setups

Following the approach of TextGCN, we processed all datasets by cleaning and tokenizing the text and then removing some low-frequency words. Finally, we used 10% of the training data for validation to facilitate model training, and the model was trained for 50 iterations. We use XLNetMix and an 8-layer GCNII to implement XLG-Net. We perform two back-translation operations on each dataset to obtain two perturbed samples and then input both the original samples and the perturbed samples into the XLG-Net model. For the XLNetMix module, we use the output features of the [CLS] token as the document embedding, followed by a feedforward layer to generate the final prediction. At the beginning of each epoch, we use XLNetMix to compute all document embeddings and update the graph’s node features with these embeddings. The graph is then fed into XLG-Net for training. To improve the consistency of the embeddings, we set a smaller learning rate for the XLNetMix module and a larger learning rate for the GCNII module. We compared our results with previous studies to accurately evaluate our findings. Additionally, to gain a deeper understanding of our proposed model, we conducted an analysis and discussion from various perspectives, including the impact of the parameter \(\lambda\) and the number of GCNII layers. Finally, we conducted ablation experiments to validate the effectiveness of the proposed XLG-Net method. All the code is written in PyTorch and run on a single V100-32GB GPU.
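The two learning rates can be realized with PyTorch optimizer parameter groups, as sketched below; the submodule names `xlnetmix` and `gcnii` are assumptions, and the rates correspond to the values reported in the later hyperparameter study.

```python
# A sketch of the two-learning-rate setup: a smaller rate for the pre-trained
# XLNetMix module and a larger one for the GCNII module. The submodule names
# are assumed; the rates follow the hyperparameter study reported below.
import torch

def build_optimizer(model):
    return torch.optim.Adam([
        {"params": model.xlnetmix.parameters(), "lr": 7e-6},  # XLNetMix module
        {"params": model.gcnii.parameters(), "lr": 7e-4},     # GCNII module
    ])
```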

Evaluation metric

This section delineates the performance evaluation criteria adopted in this research. In the domain of classification tasks, particularly with respect to the four datasets accentuated in this study, the commonly utilized metric is Accuracy, defined as the ratio of the number of correctly classified texts to the total number of texts. Nonetheless, in light of the pronounced class imbalances inherent in these datasets, the most suitable metric for this research is the macro-averaged F1-score, obtained by averaging the per-class F1-scores, each of which is the harmonic mean of Precision and Recall.

To determine the Average Macro F1-score, we initially compute Precision and Recall for each class using Eqs. (10) and (11), respectively. Subsequently, Eq. (12) is applied to ascertain the F1-score for each individual class within the dataset. Here, \(TP\) denotes the count of true positives, \(FP\) the count of false positives, \(FN\) the count of false negatives, and \(TN\) the count of true negatives.

$$\begin{aligned} \text {Precision}= & {\frac{TP}{TP+FP}} \end{aligned}$$
(10)
$$\begin{aligned} \text {Recall}= & {\frac{TP}{TP+FN}} \end{aligned}$$
(11)
$$\begin{aligned} F1\text {-}score= & 2\times {\frac{\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}} \end{aligned}$$
(12)

Upon obtaining the F1-scores for all classes, we calculate the macro-averaged F1-score (mF1). Eq. (13) depicts the macro F1-score in the context of multi-class classification for classes \(C_i\), \(i\in \{1,2,\ldots ,n\}\) (representing each class within the dataset). In this equation, \(F1\text {-}score_i\) denotes the \(F1\text {-}score\) of class \(i\) of the dataset.

$$\begin{aligned} mF1={\frac{\sum _{i=1}^{n}F1\text {-}score_{i}}{n}} \end{aligned}$$
(13)
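As a reference for Eqs. (10)-(13), the short sketch below computes the macro-averaged F1 from true and predicted labels; the function name is illustrative, and the result is consistent with scikit-learn's f1_score with average='macro'.

```python
# A small sketch of the macro-averaged F1 from Eqs. (10)-(13): per-class precision
# and recall, their harmonic mean, then an unweighted average over classes.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```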

Overall results

To verify the performance of the XLG-Net model in text classification, we compared it with previous studies, specifically the baselines enumerated in the section titled “Baselines”. The Accuracy and Macro-average F1 of the different models on the four text classification datasets are presented in Tables 2 and 3:

Table 2 Accuracy results on different datasets (%).
Table 3 Macro-average F1 results on different datasets (%).

As can be observed from Tables 2 and 3, XLG-Net attains the optimal performance across the four datasets. Among all the models under consideration, SGC and TextGCN demonstrate the least favorable performance throughout the four datasets. Furthermore, BERT and XLNet surpass SGC, TextGCN, and TextING, thereby highlighting the advantage of pre-trained models. BertGCN and XLG-Net outstrip other models by a substantial margin, which can be ascribed to the integration of pre-trained models and GCN models, allowing them to complement each other’s strengths. In comparison to BERT and XLNet, XLG-Net exhibits significant enhancements on the Ohsumed dataset. This is attributable to the fact that the average text length of Ohsumed is 79, which is longer than that of other datasets. Since the graph is constructed based on document-word statistics, it implies that the graph derived from longer texts possesses a more complex structure. Such complexity facilitates more effective message propagation, thus enabling the models to achieve superior performance. XLG-Net outperforms the BertGCN model on the four datasets. This can be accounted for by XLNet’s proficiency in capturing long sequential dependencies and GCNII’s capacity to efficiently transfer information within deep networks. Additionally, the incorporation of the design philosophy of DoubleMix in the XLG-Net model empowers it to perform well on short-text datasets as well, which elucidates why XLG-Net shows significant improvements over BertGCN on the MR dataset.

Hyper-parameter settings

The configuration of hyperparameters within the neural network exerts a significant impact on the ultimate experimental outcomes. In order to further enhance the performance of the XLG-Net model, an in-depth exploration is conducted regarding the different hyperparameters.

The effect of \(\lambda\)

In accordance with Eq. (9), the hyperparameter \(\lambda\) governs the balance between the GCNII predictions and the XLNetMix predictions, and its value directly affects the accuracy of the final results. Consequently, we carried out experiments on four benchmark datasets with the aim of determining the optimal value of \(\lambda\).

Fig. 4

Accuracy of XLG-Net when varying \(\lambda\) on the development set (%).

Figure 4 presents the accuracy of XLG-Net with varying \(\lambda\) values on the MR, Ohsumed and R52 datasets. It can be observed that setting \(\lambda\) within the range of 0.6 to 0.8 is the most suitable, and the accuracy reaches its optimum when \(\lambda\) is 0.6. Moreover, when \(\lambda\) = 1 (utilizing only the GCNII predictions), the accuracy is consistently higher than when \(\lambda\) = 0 (employing only XLNetMix). These results indicate that the GCNII predictions are more reliable for text classification, which favors a larger \(\lambda\). On the other hand, the XLNetMix module is also essential: appropriately adjusting the proportion of the XLNetMix predictions exerts a positive influence on the performance of the model.

The effect of \(L\)

In typical graph node classification or graph classification tasks, classic GNN models such as GCN and GAT can achieve good performance with 2–4 layers, and adding more layers often leads to over-smoothing. GCNII addresses this issue, allowing the model’s performance to improve even with a large number of layers. Therefore, exploring the optimal number of GCNII layers (\(L\)) in XLG-Net is also important. We conducted experiments on four benchmark datasets to determine the optimal value of \(L\), and the results are shown in Fig. 5.

Fig. 5

Accuracy of XLG-Net when varying \(L\) on the development set (%).

Figure 5 shows the accuracy of XLG-Net with different \(L\) values on the MR, Ohsumed, and R52 datasets. We can see that when \(L\) is less than 8, accuracy increases as \(L\) grows, and the model performs best at \(L\) = 8. When \(L\) exceeds 8, the predictive accuracy gradually declines, indicating that a larger \(L\) is not necessarily better. Furthermore, the model’s performance on the MR dataset is more sensitive to \(L\), with the predictive accuracy fluctuating significantly as \(L\) varies.

The effect of other parameters

For a more thorough exploration of the performance of the XLG-Net model, further investigations were conducted with respect to the learning rate and the number of hidden layers. The value of \(\lambda\) was configured to be 0.6, and the number of GCNII layers was set to 8, while the remaining parameters remained unchanged. Multiple rounds of experimental evaluations were performed for all parameters, and the average results are presented as follows.

The learning rate is a vital hyperparameter in supervised learning and deep learning. It governs whether the objective function converges to a minimum and how quickly it does so. If the learning rate is excessively high, the loss function may overshoot the optimum; if it is too low, the loss changes sluggishly and is prone to becoming trapped in a local minimum. Tables 4 and 5 present the comparative outcomes for the two learning rates. As the learning rate decreases, the F1 value of the model first rises and then falls. When the learning rate of XLNetMix is set to 0.000007 and that of GCNII is set to 0.0007, the model attains the peak F1 value. These values are therefore chosen for the experiments with the XLG-Net model.

Table 4 Macro-average F1 of XLG-Net when the learning rate of XLNetMix changes (%).
Table 5 Macro-average F1 of XLG-Net when the learning rate of GCNII changes (%).

The dimensionality of the hidden layers in GCNII dictates the number and complexity of features that the model can learn. Consequently, selecting an appropriate dimensionality significantly impacts the model’s performance, complexity, and generalization capabilities. Table 6 illustrates the impact of varying the hidden-layer dimension in GCNII on the classification results. When the hidden dimension was increased to 256, the classification performance declined significantly, because the larger dimension introduced more parameters and computation, which impaired the model’s fitting.

Table 6 Macro-average F1 of different hidden layers (%).

Ablation study

Effect of interpolation layers

The hidden layers in pre-trained models are powerful for representation learning. To investigate the contribution of the interpolation operations in the hidden layers to the model, we conducted ablation experiments. Specifically, we set the number of XLNet layers \(L\) in the model to 12 and conducted experiments with different sets of interpolation layers. The results are shown in Table 7.

Table 7 Test accuracy of XLG-Net on the text classification datasets.

When all interpolations were excluded, the accuracies on the four datasets were 87.03%, 72.66%, 96.44% and 98.29%. When we performed interpolation at layer 1, the accuracy improved by 1.26%, 0.26%, 0.02% and 0.04%, which shows that the idea of DoubleMix is also applicable to XLNet and contributes to improved model performance. When we interpolated at the layer set {1, 2}, the accuracy improvement was small. However, when we interpolated at the layer sets {3, 4, 5} and {5, 6}, the performance improvement was significant. According to Jawahar et al.53, the hidden layers {3, 4} of BERT perform best at encoding surface features, layers {6, 7, 9, 12} contain the most syntactic and semantic features, and the 9-th layer captures the most syntactic and semantic information. Therefore, we tried several layer sets containing the 9-th layer and found that the {9, 10, 12} layer set performs best on all four datasets. In addition, we note that more interpolation layers are not always better: the layer set {3, 4, 6, 7, 9, 12} yields only a small improvement in accuracy, indicating that too many interpolation layers reduce performance.

Effectiveness of XLNet and GCNII

To investigate the contributions of XLNet and GCNII to the model, we explored the performance of the model when replacing the BERT layers with XLNet layers and when replacing the GCN layers with GCNII layers, based on the BertGCN architecture. The results are shown in Fig. 6.

Fig. 6

Test accuracy on four datasets for different model combinations.

From Fig. 6, we can observe that XLNet+GCNII (the combination of XLNet and GCNII) achieved the best performance. XLNet+GCN (the combination of XLNet and GCN) improved performance on the MR and Ohsumed datasets compared to BertGCN. BERT+GCNII (the combination of BERT and GCNII) yielded improvements on all four datasets compared to BertGCN. When both modules were replaced (XLNet+GCNII), there was a further performance boost on the MR and Ohsumed datasets, as well as a slight improvement on the R52 dataset. In summary, the combined use of XLNet and GCNII demonstrates outstanding performance in enhancing model accuracy, particularly on the MR and Ohsumed datasets.

Conclusion and future work

In this research endeavor, we put forward a novel method named XLG-Net, which capitalizes on the potent contextual word representations furnished by large-scale pre-trained models and the profound graph convolutional techniques of GCNII for the purpose of text classification. We blend the hidden states of the XLNet model to enhance both the accuracy and robustness of the model. We conducted multiple experiments and compared the XLG-Net with several benchmark models. The experimental results demonstrated that on four benchmark text classification datasets, the XLG-Net outperformed the BertGCN model as well as other benchmark models. In addition, we also delved into the impact of different hyperparameter settings on the model’s performance, thereby further validating the effectiveness of the proposed method.

Nevertheless, in this work we still need to construct the heterogeneous graph of the entire dataset before employing the model to extract features. This methodology might be less than optimal compared to models capable of automatically constructing graphs. Meanwhile, as the number of layers in the GCNII model increases, the model performance improves but the training time becomes longer as well; optimizing the model structure to improve training efficiency should therefore be taken into consideration. Furthermore, due to limitations in computing power, we only tested the GCNII model with fewer than 16 layers. If the number of layers exceeds 16, the performance of the model might improve further. These issues are left for future exploration.