Introduction

The task of text classification, a cornerstone of natural language processing (NLP), has recently attracted increased interest. Its applications span a wide range of fields, including sentiment analysis1,2,3, document categorization4, medical codes prediction5, legal studies6, patent classification7, and financial analysis8. Among these, Multi-label Text Classification (MTC) stands out as a particularly complex challenge. In MTC, the goal is to assign multiple labels to a given text, where the set of labels often exhibits a hierarchical structure. This structure implies a relationship between labels, such that information pertaining to one label can influence the inference of another, thereby adding complexity to the classification task.

The current approaches to the MTC task can be broadly classified into two categories: (1) methods that predict labels using textual information alone, and (2) approaches that combine both label and textual information for prediction. The first category relies on local and global features extracted by text encoders to predict labels. Notable examples include CNN-based models9,10 that address data imbalance issues caused by a lack of samples for child labels. Other works in this category focus on incorporating semantic information from text11. While these methods are effective at capturing textual subtleties, they generally fail to account for relationships between labels.

The second category of methods aims to integrate textual and label information. Strategies include weight initialization12, learning label hierarchies13,14,15, and the use of capsule networks16,17. These approaches improve the efficiency of MTC by leveraging label information but often achieve only a superficial understanding of the label hierarchy. The graph convolutional network (GCN)-based model18 shows promise in learning a deep label hierarchy but does not fully exploit label information, focusing exclusively on correlative aspects and overlooking label distinctiveness.

Despite substantial progress, a critical gap in research persists: the majority of current methods do not effectively utilize both the distinctive and correlative aspects of label information to optimize hierarchical multi-label classification. This shortcoming restricts the effectiveness of these models, especially in complex, hierarchical label structures where both types of information are essential for accurate classification.

Fig. 1

Sample of the label tree structure from the RCV1-v2 dataset, where grey, yellow, green, and blue denote root, first-level, second-level, and third-level labels, respectively. The variable \(s_{ij}\) indicates the similarity between label i and label j.

The concurrent consideration of correlative and distinctive information is fundamental to achieving a deep understanding of the label hierarchy, thus improving the effectiveness of MTC. As depicted in Fig. 1, the similarity \(s_{23}\) between nodes 2 and 3 carries distinctive information: because no edge connects the two nodes, their distinctiveness should be maximized, i.e., this similarity should be driven down. In contrast, the similarity \(s_{26}\) carries correlative information: because the two nodes are linked in the hierarchy, their correlative gap should be minimized, i.e., this similarity should be driven up.

This paper presents Hierarchical Contrastive Learning for Multi-label Text Classification (HCL-MTC). HCL-MTC explicitly models the hierarchical label structure as a directed graph, defining graph edges as contrastive knowledge between labels. This modeling approach facilitates the nuanced integration of both distinctive and correlative label information, enhancing the model’s capacity to capture the full complexity of the label hierarchy. To further improve the performance of label contrastive learning, we introduce a sampling hierarchical contrastive loss function. This loss function is designed to maximize the distinction between unrelated labels while pulling closely related labels closer together, thereby refining the model’s classification abilities.

In practical application, given training texts, the model first generates text features based on local and global information extracted from the text encoder. A linear converter then transforms these text features into label-wise features. Subsequently, the contrastive learner aggregates information from each label, incorporating insights from correlated labels based on their contrastive knowledge.

The primary contributions of this work can be succinctly summarized as follows:

  • Introduction of a novel methodology, HCL-MTC, which conceptualizes the label tree structure as a directed graph. This framework incorporates contrastive knowledge between labels to enhance the learning process effectively.

  • Development of a sampling hierarchical contrastive loss function to effectively utilize label contrastive knowledge. This loss function improves the discriminative capabilities of the model in multi-label text classification by maximizing the distinction between unrelated labels while reinforcing the similarity between closely related labels.

  • Extensive empirical validation on two widely recognized public datasets demonstrates that HCL-MTC achieves significant improvements over existing state-of-the-art multi-label text classification methods.

Related work

MTC endeavors to assign hierarchical labels to given text inputs. Existing solutions for MTC can be broadly classified into different approaches based on their source of information and modeling strategies.

Text-based approaches

These approaches rely solely on the rich textual information inherent in the text at both word and sentence levels. They primarily leverage this information for predicting hierarchical labels without directly using label structure in the learning process.

Convolutional neural network (CNN) based methods: CNN-based methodologies are prevalent in MTC tasks due to their adeptness at capturing local contextual information. For instance, Kim19 demonstrated the effectiveness of CNNs in text classification. Following this, more advanced models like the Seq2Seq model11 utilized dilated convolution and a hybrid attention mechanism to discern semantic units in texts. This approach helps in capturing broader contexts which are essential in multi-label settings.

Enhanced CNN models: Further developments include models by Shimura et al.9 and Yang et al.10, which introduced fine-tuning techniques in CNN to propagate upper-level information to lower levels. Additionally, the integration of two single CNNs using a siamese approach for tail categories was explored to enhance model sensitivity towards less frequent labels. These methods, however, primarily focus on textual information and often overlook the crucial inter-label relationships that are pivotal in hierarchical multi-label classification.

Hybrid approaches

These methods incorporate both text and label information, attempting to utilize the hierarchical structure of the labels alongside the textual features to improve classification.

Initial hidden layer utilization: Baker and Korhonen12 initialized the final hidden layer of a CNN model to leverage label co-occurrence relations. This approach demonstrated that integrating label information at an early stage in the model can prime the network to better utilize label correlations.

Capsule networks and label embeddings: Chen et al.17 introduced a capsule network that incorporates label probabilities to enhance the representation of hierarchical relationships. Similarly, methods like those presented by Huang et al.15 and Yang et al.20 embed label vectors into the model and learn the label structure from upper to lower levels. These strategies aim to capture label hierarchies more effectively but often only achieve a shallow understanding of these relationships.

Graph convolutional network (GCN) based models: Recent advancements have seen the use of GCN-based models18,21 which formulate edge features based on word co-occurrence or label dependencies. These models leverage the structure of labels, which can be organized as trees or directed acyclic graphs (DAGs), to enhance classification performance. These approaches demonstrate promise but frequently rely heavily on prior probabilities and predefined label structures.

Hierarchical models

These models are specifically designed to leverage the hierarchical structure of labels, using various strategies to enhance the understanding and utilization of this structure in classification tasks.

Edge feature formulation in GCN-based models: Traditional GCN-based models like those by Marcheggiani and Titov22 and Lu et al.23 often initialize the adjacency matrix randomly or use basic schemes. However, some works21,24,25 define edge weights based on word information, such as word co-occurrence, word similarity, and point-wise mutual information. These methods attempt to enhance the model’s capacity to utilize textual and structural information concurrently.

Contrastive knowledge based edge formulation: In contrast, our approach formulates edge features based on the contrastive knowledge between labels, offering a significant departure from the reliance on word-related information. This novel strategy uses contrastive learning to directly encode the relationships between labels into the graph structure, thereby enhancing the model’s ability to understand and utilize the full spectrum of label information, both correlative and distinctive.

Proposed method

In this research, we propose a novel framework called Hierarchical Contrastive Learning for Multi-label Text Classification (HCL-MTC), which delineates contrastive learning methods through two primary components: (1) the transition matrix parameter of the Graph Convolutional Network (GCN), and (2) the introduction of a sampling hierarchical contrastive loss. The subsequent discussion begins with a thorough formulation of the problem, followed by a detailed explanation of our proposed HCL-MTC framework.

Problem formulation

In the domain of MTC, we consider a set of m predefined labels, denoted as \(L=\{l_1,l_2,...,l_m\}\). Given a training set of N instances, represented as \(\{(T_1,Y_1), (T_2, Y_2),...,(T_N,Y_N)\}\), where \(T_i=\{x_1, x_2,...,x_n\}\) signifies the \(i^{th}\) text instance, with n being the length of the text and \(x_i\) denoting the \(i^{th}\) word, and \(Y_i\) indicating the subset of L assigned to \(T_i\), the objective of the MTC task is to predict \(\hat{y}_i\) for each test text. It is noteworthy that: i) Each text instance is associated with one or more labels from the set L; ii) The labels often organize themselves into a tree structure, indicating the existence of both correlative and distinctive information among the labels; iii) The sample size of a child node in the label hierarchy is typically smaller than that of its parent node, reflecting a hierarchical distribution of data among the labels.

Hierarchical contrastive learning for MTC

Fig. 2

The overall structure of the HCL-MTC model (Hierarchical Contrastive Learning for Multi-label Text Classification), which elucidates the principal components and the process flow from input data to multi-label classification output. The model architecture is composed of a text encoder for capturing local and global textual features, a linear converter that transforms these features into label-specific representations, and a hierarchical contrastive learner which exploits the relationships between labels to enhance classification performance. The hierarchical contrastive learner is structured as a directed graph, where each node aggregates information from its parent, child, and self-nodes, guided by a learned weighted adjacency matrix that encodes label contrastive knowledge.

As depicted in Fig. 2, our proposed model is composed of four main components: a text encoder, a feature extractor, a linear converter, and a hierarchical contrastive learner. When presented with a sentence, the text encoder and feature extractor work together to capture both local and global context, producing a text feature representation. This text feature is then passed to the linear converter, which adjusts its dimensionality to align with that of the label-wise feature space. The final component, the hierarchical contrastive learner, takes into account the contrastive relationships between labels, interpreting these relationships as transition probabilities, thereby completing the model architecture.
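To make the data flow concrete, the following is a minimal PyTorch-style sketch of the forward pass. The module and component names (the constructor arguments, HCLMTC, label_dim) are assumptions introduced for illustration and do not correspond to a released implementation; each component is assumed to be an nn.Module returning the tensor described in the text.

```python
import torch
import torch.nn as nn

class HCLMTC(nn.Module):
    """Illustrative skeleton of the HCL-MTC forward pass (component interfaces are assumed)."""
    def __init__(self, embedding, encoder, extractor, converter, learner, label_dim):
        super().__init__()
        self.embedding = embedding    # pre-trained word embedding lookup
        self.encoder = encoder        # Bi-GRU text encoder
        self.extractor = extractor    # CNN + k-max pooling feature extractor
        self.converter = converter    # linear converter to label-wise features
        self.learner = learner        # hierarchical contrastive learner (GCN)
        self.classifier = nn.Linear(label_dim, 1)   # per-label scorer (shared here for brevity)

    def forward(self, token_ids):
        I = self.embedding(token_ids)             # (batch, n, emb_dim)  input matrix I
        H = self.encoder(I)                       # (batch, n, 2u)       global feature maps
        O = self.extractor(H)                     # (batch, K)           text feature
        V = self.converter(O)                     # (batch, m, d_n)      label-wise features
        Hk = self.learner(V)                      # (batch, m, d_n)      hierarchy-aware node features
        logits = self.classifier(Hk).squeeze(-1)  # (batch, m)
        return torch.sigmoid(logits)              # per-label probabilities p_k
```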

Input: Before being processed by the text encoder, the input text undergoes a transformation through a pre-trained embedding matrix. Given a text sequence \(T=\{x_1, x_2,...,x_n\}\), where each \(x_i\) represents a word in the text, each word is converted into a corresponding vector \(\omega _i\). This embedding process results in the construction of the input matrix \(I=\{\omega _1,\omega _2,...,\omega _n\}\), which serves as the input to the text encoder.
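As a brief illustration, this embedding step can be implemented as a lookup into a pre-trained matrix; the tensor glove_weights and the toy word indices below are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors, assumed to be prepared elsewhere
glove_weights = torch.randn(60000, 300)        # random placeholder for illustration only
embedding = nn.Embedding.from_pretrained(glove_weights, freeze=False)

token_ids = torch.tensor([[12, 873, 44, 9]])   # a toy text T of n = 4 word indices
I = embedding(token_ids)                       # input matrix I = {w_1, ..., w_n}, shape (1, 4, 300)
```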

Text encoder: A variety of text encoders, including Recurrent Neural Networks (RNN)26 and their derivatives such as Long Short-Term Memory (LSTM)27 and Gated Recurrent Unit (GRU)28, have been utilized to capture global context within texts. In recent years, pre-trained models with fine-tuning capabilities, like BERT29 and XLNet30, have shown remarkable performance across a range of Natural Language Processing (NLP) tasks and can be effectively used as text encoders. For the sake of experimental consistency, we opt for the same text encoder, Bi-GRU, as employed in18. The Bi-GRU encoder layer processes the input matrix \(I=\{\omega _1,\omega _2,...,\omega _n\}\) , where the hidden vector of a Bi-GRU is computed as follows:

$$\begin{aligned} \begin{aligned}&\overrightarrow{h}_t=GRU(\overrightarrow{h}_{t-1},\omega _t),\\&\overleftarrow{h}_t=GRU(\overleftarrow{h}_{t+1},\omega _t),\\&h_t=[\overrightarrow{h}_t,\overleftarrow{h}_t], \end{aligned} \end{aligned}$$
(1)

where \(\overrightarrow{h}_t\) and \(\overleftarrow{h}_t\) are the forward hidden vector and backward hidden vector at time step t. The output \(h_t\in \mathbbm {R}^{2\mathbbm {u}}\) of the Bi-GRU is the concatenation of \(\overrightarrow{h}_t\) and \(\overleftarrow{h}_t\) where \(\mathbbm {u}\) indicates the number of hidden units of each unidirectional GRU. The resulting global feature maps are \(H=\{h_1,h_2,...,h_n\}\).
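Equation (1) can be reproduced with PyTorch's bidirectional GRU, whose output already concatenates the forward and backward hidden vectors; the dimensions below are illustrative, not the settings used in the experiments.

```python
import torch
import torch.nn as nn

emb_dim, u = 300, 100                          # u = hidden units per unidirectional GRU (illustrative)
bigru = nn.GRU(input_size=emb_dim, hidden_size=u,
               batch_first=True, bidirectional=True)

I = torch.randn(1, 50, emb_dim)                # input matrix I for a text of n = 50 words
H, _ = bigru(I)                                # H = {h_1, ..., h_n}, each h_t in R^{2u}; shape (1, 50, 2u)
```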

Feature extractor: We employ a CNN model to extract n-gram features from the global feature maps H obtained from the text encoder. The choice of CNN is driven by three key considerations: (1) The local connectivity property naturally models n-gram compositions through sliding window operations; (2) Parameter sharing mechanism enhances robustness to lexical variations; (3) Hierarchical filters enable automatic discovery of salient phrase-level patterns. Let \(F\in \mathbbm {R}^{g \times 2\mathbbm {u}}\) represent a convolutional kernel, and let \(H_{i:i+g-1}\) denote a region of the global feature map spanning g words. The local feature can be formulated as follows:

$$\begin{aligned} \begin{aligned} c_i=F\odot H_{i:i+g-1} + b, \end{aligned} \end{aligned}$$
(2)

where \(\odot\) denotes the convolution operation, i.e., the sum of the component-wise product between the kernel and the corresponding text region, and \(b\in \mathbbm {R}\) denotes a bias term. The feature maps of f filters at the \(i^{th}\) position can be denoted as \(C_i=\{c_i^{1},c_i^{2},...,c_i^{f}\}\). Next, we apply the k-max pooling method to filter the top k most informative word combinations, which can be formulated as follows:

$$\begin{aligned} \begin{aligned} P=&flatten(max(k,[C_1,C_2,...,C_{n-g+1}]))\\ \end{aligned} \end{aligned}$$
(3)

Given that K convolutional kernels are used, the final text feature is obtained by concatenating the outputs from each kernel. Let \(P^k\) denote the output of the \(k^{th}\) kernel, where \(k \in \{1, 2, \ldots , K\}\). The concatenation of these outputs forms the final text feature, represented as \(O=[P^1,P^2,...,P^K]\). Unlike recurrent architectures that process tokens sequentially, our CNN design allows parallel feature extraction across all text positions, significantly improving computational efficiency for long texts.
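The sketch below walks through Eqs. (2) and (3) with a one-dimensional convolution over the global feature maps, followed by k-max pooling and flattening; the window size, filter count, and k are illustrative values, not the authors' settings.

```python
import torch
import torch.nn as nn

n, u, g, f, k = 50, 100, 3, 64, 4              # text length, GRU units, window g, f filters, top-k
conv = nn.Conv1d(in_channels=2 * u, out_channels=f, kernel_size=g)

H = torch.randn(1, n, 2 * u)                   # global feature maps from the Bi-GRU
C = conv(H.transpose(1, 2))                    # (1, f, n - g + 1): one c_i per window position
# topk keeps the k largest responses per filter; strict k-max pooling would additionally
# restore their original order, which is omitted here for brevity.
P = C.topk(k, dim=-1).values.flatten(1)        # k-max pooling, then flatten -> (1, f * k)
# With K kernels of different sizes, the final text feature is O = torch.cat([P1, ..., PK], dim=-1).
```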

Linear converter: The role of the linear converter is to bridge the gap between text semantics and label characteristics through dimension-aware transformation. Motivated by the need to preserve computational efficiency while maintaining interpretable label projections, we formulate this process as a trainable linear mapping. Specifically, given the text feature \(O\in \mathbbm {R}^{K}\) extracted from diverse n-gram patterns, the converter first applies weight matrix \(M\in \mathbbm {R}^{d_w\times K}\) to project features into latent label space. This linear design intentionally avoids introducing nonlinear distortions, allowing direct interpretation of label-feature correlations. The subsequent reshape operation then reorganizes the \(d_w\) dimensional output into \(V\in \mathbbm {R}^{m\times d_n}\), where m corresponds to the predefined number of labels and \(d_n\) configures the feature depth per label. The dimension adjustment from \(d_w\) to \(m \times d_n\) is mathematically guaranteed when \(d_w = m \times d_n\), ensuring structural compatibility for downstream label-wise operations.
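A minimal sketch of the converter under the constraint \(d_w = m \times d_n\); the sizes are illustrative.

```python
import torch
import torch.nn as nn

K, m, d_n = 256, 103, 64                       # text-feature size, number of labels, per-label depth
d_w = m * d_n                                  # dimension constraint d_w = m x d_n
converter = nn.Linear(K, d_w, bias=False)      # weight matrix M in R^{d_w x K}

O = torch.randn(1, K)                          # text feature from the feature extractor
V = converter(O).view(-1, m, d_n)              # label-wise features V in R^{m x d_n}
```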

Hierarchical contrastive learner: The Graph Convolutional Network (GCN)31 is utilized to represent structural relationships between nodes, including classification labels. In graph-based representations, edges encode the relationships between nodes. Traditional GCNs22,23 often initialize the transition matrix randomly and rely on error backpropagation to learn node relationships, frequently disregarding information about node correlations. Zhou et al.18 mitigated this limitation by defining edge features based on the prior probability of label dependencies. Nonetheless, this approach mainly concentrates on learning correlative information between labels, neglecting the distinctiveness of labels.

In contrast, our proposed hierarchical contrastive learner constructs connections between graph nodes using label contrastive knowledge. Building upon the framework of Hierarchy-GCN18, we represent the label tree as a directed graph, where each node aggregates information from its parent nodes, child nodes, and itself. This aggregation process is dynamically guided by a learned adjacency matrix, which encodes directional relationships between nodes based on contrastive similarities.

Let \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\) represent the directed graph, where \(\mathcal {V}\) denotes the set of m nodes, each corresponding to a label with a feature vector \(v_k \in \mathbbm {R}^{d_n}\) (stacked row-wise into a matrix in \(\mathbbm {R}^{m \times d_n}\)), and \(\mathcal {E}\) represents the set of directed edges. The neighborhood N(k) of node k includes its parent node, child nodes, and itself. The connections between nodes are captured by the adjacency matrix \(\textbf{A}\), where each element \(A_{j,k}\) is computed based on the contrastive similarity between the feature vectors of node j and node k.

Specifically, the contrastive similarity \(a_{j,k}\) is defined as the absolute value of the cosine similarity between the feature vectors \(v_j\) and \(v_k\):

$$\begin{aligned} a_{j,k} = \left| \frac{v_j \cdot v_k}{||v_j|| \cdot ||v_k||}\right| \end{aligned}$$

Each \(a_{j,k}\) represents the strength of the connection between node j and node k, and is used to populate the corresponding entry \(A_{j,k}\) in the adjacency matrix \(\textbf{A}\). Higher values of \(a_{j,k}\) indicate stronger similarity and thus a more direct flow of information from node j to node k, while lower values attenuate the flow between less similar nodes.
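A minimal sketch of this adjacency computation: the absolute cosine similarity is taken between every pair of label-wise feature vectors and then masked with the label tree so that only parent, child, and self connections carry weight. The hierarchy_mask input is an assumed helper that encodes the neighborhood N(k).

```python
import torch.nn.functional as F

def contrastive_adjacency(V, hierarchy_mask):
    """V: (m, d_n) label-wise features.
    hierarchy_mask: (m, m) binary matrix marking parent-child pairs and self-loops,
    built from the label tree (an assumed input following the definition of N(k))."""
    V_norm = F.normalize(V, dim=-1)            # unit-normalize each label feature vector
    A = (V_norm @ V_norm.t()).abs()            # a_{j,k} = |cos(v_j, v_k)|; the diagonal is exactly 1
    return A * hierarchy_mask                  # keep only parent, child, and self connections
```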

The feature information from node j is transferred to node k through an intermediate representation \(\mu _{j,k}\), which is influenced by the contrastive similarity and a transfer bias \(b_l^k\):

$$\begin{aligned} \mu _{j,k} = a_{j,k} v_j + b_l^k \end{aligned}$$

To regulate this transfer, a gating mechanism is applied. The gate, parameterized by a direction-specific weight matrix \(W_g^{d(j,k)}\) and a gate bias \(b_g^k\), ensures that the flow of information is controlled based on the relative importance of the neighboring node:

$$\begin{aligned} g_{j,k} = \sigma (W_g^{d(j,k)} v_j + b_g^k) \end{aligned}$$

This gating function, \(g_{j,k}\), modulates the contribution of node j to node k, allowing the model to selectively filter the information passed between nodes. After gating, the hidden state of node k is updated by aggregating the gated information from all its neighbors, followed by a ReLU activation function to introduce non-linearity:

$$\begin{aligned} h_k = \text {ReLU}\left( \sum _{j \in N(k)} g_{j,k} \cdot \mu _{j,k}\right) \end{aligned}$$

The adjacency matrix \(\textbf{A}\), which governs the structure of the graph, is dynamically learned from the contrastive similarities between node features. It captures both top-down and bottom-up flows, with a symmetric relationship between the two directions: \(A_{j,k}\) for top-down flow is equal to \(A_{k,j}\) for bottom-up flow. Additionally, each node retains self-loops with \(A_{k,k} = 1\), ensuring that a node always retains its own information during the aggregation process. This dynamic construction of the adjacency matrix allows the model to adapt to the hierarchical label structure and to capture the intricate dependencies between nodes, facilitating effective learning from both similar and dissimilar nodes within the hierarchy.
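The node update described above can be sketched as a single graph layer. For brevity, one shared gate weight matrix stands in for the direction-specific matrices \(W_g^{d(j,k)}\), the per-node gate bias is folded into the linear layer, and batching is omitted; this is a reading of the equations above, not the authors' code.

```python
import torch
import torch.nn as nn

class ContrastiveGCNLayer(nn.Module):
    """Sketch of the hierarchical contrastive learner's aggregation step (simplified gating)."""
    def __init__(self, d_n, m):
        super().__init__()
        self.W_g = nn.Linear(d_n, d_n)                 # gate parameters, shared across directions here
        self.b_l = nn.Parameter(torch.zeros(m, d_n))   # per-node transfer bias b_l^k

    def forward(self, V, A):
        # V: (m, d_n) label features; A: (m, m) contrastive adjacency, zero outside N(k).
        # Rows of A index the receiving node k, columns the sending node j.
        neigh = (A > 0).float().unsqueeze(-1)                          # restrict the sum to j in N(k)
        mu = A.unsqueeze(-1) * V.unsqueeze(0) + self.b_l.unsqueeze(1)  # mu[k, j] = a_{j,k} v_j + b_l^k
        g = torch.sigmoid(self.W_g(V)).unsqueeze(0)                    # g_{j,k} approximated by sigma(W_g v_j + b_g)
        return torch.relu((neigh * g * mu).sum(dim=1))                 # h_k = ReLU(sum_{j in N(k)} g_{j,k} mu_{j,k})
```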

Sampling hierarchical contrastive loss

The hierarchical contrastive loss aims to capture both distinctive and correlative information between labels in a hierarchical structure. To clarify its formulation, we provide an intuitive explanation before introducing the technical implementation.

Intuitive explanation: in a label tree, parent-child pairs can transfer information bidirectionally, while information between different parent nodes remains distinct. Thus, the hierarchical contrastive loss focuses on two main objectives:

  • Maximizing distinctive information: reducing the similarity between different parent nodes to enhance label distinctiveness.

  • Minimizing correlative information: increasing the similarity between a parent node and its child nodes to reflect hierarchical correlation.

Loss function formulation: Let \(v_{p_i}\) and \(v_{c_k}\) denote the embeddings of parent node \(i\) and child node \(k\), respectively. The similarity between two nodes is measured using the cosine similarity, given by:

$$\begin{aligned} s(v_a, v_b) = \frac{v_a \cdot v_b}{\Vert v_a\Vert \cdot \Vert v_b\Vert }, \end{aligned}$$

where \(\cdot\) denotes the dot product, and \(\Vert \cdot \Vert\) represents the vector norm. Using this similarity metric, we define two key components of the hierarchical contrastive loss:

1. Distinctiveness Term: This term minimizes the similarity between pairs of parent nodes \(v_{p_i}\) and \(v_{p_j}\) (where \(i \ne j\)), ensuring that different parent nodes are well-separated in the representation space:

$$\begin{aligned} L_{\text {distinctive}} = \sum _{p_i \in \mathcal {V}} \sum _{p_j \in \mathcal {V}, j \ne i} \exp (s(v_{p_i}, v_{p_j})). \end{aligned}$$

2. Correlation Term: This term maximizes the similarity between parent nodes \(v_{p_i}\) and their respective child nodes \(v_{c_k}\), capturing the semantic correlation in the hierarchy:

$$\begin{aligned} L_{\text {correlative}} = \sum _{p_i \in \mathcal {V}} \sum _{c_k \in \text {child}(i)} \exp (-s(v_{p_i}, v_{c_k})). \end{aligned}$$

Combining these two terms, the overall hierarchical contrastive loss is defined as:

$$\begin{aligned} L_{d} = L_{\text {distinctive}} + L_{\text {correlative}}. \end{aligned}$$

Minimizing \(L_d\) therefore pushes unrelated parent nodes apart while pulling each parent node and its children closer together, ensuring that the learned embeddings adhere to the hierarchical structure.
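A direct, unsampled implementation of \(L_d\) might look as follows; the bookkeeping of parent indices and parent-child pairs is an assumption about how the label tree is stored.

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(V, parent_ids, child_pairs):
    """V: (m, d_n) label embeddings.
    parent_ids: indices of parent (non-leaf) label nodes.
    child_pairs: list of (parent_idx, child_idx) tuples taken from the label tree."""
    V_norm = F.normalize(V, dim=-1)
    S = V_norm @ V_norm.t()                               # cosine similarities s(v_a, v_b)

    p = torch.tensor(parent_ids)
    Spp = S[p][:, p]                                      # parent-parent similarities
    off_diag = ~torch.eye(len(p), dtype=torch.bool)
    L_distinctive = Spp[off_diag].exp().sum()             # sum_i sum_{j != i} exp(s(p_i, p_j))

    pc = torch.tensor(child_pairs)
    L_correlative = (-S[pc[:, 0], pc[:, 1]]).exp().sum()  # sum exp(-s(p_i, c_k))

    return L_distinctive + L_correlative                  # L_d
```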

Computational optimization via sampling: enumerating all node pairs is computationally intensive. To address this, we employ a sampling mechanism. For each hierarchical level, only two randomly selected parent nodes and one randomly selected child node are included in the computation of the hierarchical contrastive loss.

Sampling strategy:

  • For each parent node \(p_i\), randomly select two other parent nodes \(p_j\) and \(p_k\) for comparison.

  • In the direct child node set of \(p_i\), randomly select one child node \(c_k\).

  • Compute the loss using the selected node pairs, as shown in the sketch after this list.
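A minimal sketch of this sampling step, assuming the label tree is available as a children dictionary mapping each parent index to its child indices (a hypothetical data structure):

```python
import random
import torch.nn.functional as F

def sampled_contrastive_loss(V, children):
    """V: (m, d_n) label embeddings; children: dict mapping a parent index to its child indices."""
    V_norm = F.normalize(V, dim=-1)
    S = V_norm @ V_norm.t()                               # cosine similarities between label embeddings
    parents = [p for p, kids in children.items() if kids]
    loss = V.new_zeros(())
    for p_i in parents:
        others = [p for p in parents if p != p_i]
        if len(others) >= 2:
            p_j, p_k = random.sample(others, 2)           # two other parent nodes for comparison
            loss = loss + S[p_i, p_j].exp() + S[p_i, p_k].exp()   # distinctiveness contributions
        c_k = random.choice(children[p_i])                # one randomly chosen direct child of p_i
        loss = loss + (-S[p_i, c_k]).exp()                # correlative contribution
    return loss
```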

To better illustrate this process, Fig. 3 shows the hierarchical structure and the sampling strategy.

Fig. 3

Illustration of the Hierarchical structure and sampling strategy. Specifically, two parent nodes (e.g., \(p_1\) and \(p_2\)) and one child node (e.g., \(c_{11}\)) are randomly selected to compute the hierarchical contrastive loss. The objective is to maximize the distinctive information between \(p_1\) and \(p_2\) while minimizing the correlative information between \(p_1\) and \(c_{11}\). Here, \(c_{ij}\) represents the j-th child node of the i-th parent node.

Classification

The final node features are fed into a fully connected layer, and the probability of node k being activated can be formulated as:

$$\begin{aligned} \begin{aligned} p_k = \sigma (W_kh_k+b^k), \end{aligned} \end{aligned}$$
(4)

where \(W_k \in \mathbb {R}^{d_n}\) is the weight vector for node k, \(h_k \in \mathbb {R}^{d_n}\) represents the feature vector of node k, \(b^k \in \mathbb {R}\) is the bias term for node k, and \(\sigma\) denotes the activation function, a sigmoid in the multi-label setting, so that the output lies in \([0, 1]\) and is interpretable as a probability.

The model then assigns labels to a given test text based on the probabilities \(p_k\) of the corresponding nodes. Labels with probabilities greater than a predetermined threshold \(\theta\) are considered as predicted labels for the test text. This threshold \(\theta\) is a hyperparameter that can be tuned to balance the precision and recall of the model’s predictions.
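A minimal sketch of Eq. (4) together with the thresholding step; the per-node weight vectors are stacked into a single matrix, and the threshold value is illustrative.

```python
import torch
import torch.nn as nn

m, d_n = 103, 64
W = nn.Parameter(torch.randn(m, d_n))          # one weight vector W_k per label node
b = nn.Parameter(torch.zeros(m))               # one scalar bias b^k per label node

h = torch.randn(m, d_n)                        # final node features from the contrastive learner
p = torch.sigmoid((W * h).sum(dim=-1) + b)     # p_k = sigma(W_k . h_k + b^k), shape (m,)

theta = 0.5                                    # decision threshold theta (a tunable hyperparameter)
predicted = (p > theta).nonzero(as_tuple=True)[0]   # indices of labels assigned to the test text
```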

Loss function

The HCL-MTC model integrates three different types of losses: binary cross-entropy loss, recursive regularization loss32, and the sampling hierarchical contrastive loss. The overall loss function is formulated as follows:

$$\begin{aligned} \begin{aligned} L_c =&-\sum _{i=1}^m\left[ y_i\log (y_i^\prime )+(1-y_i)\log (1-y_i^\prime )\right] ,\\ L_r =&\sum _{i\in \mathcal {V}}\sum _{j\in child(i)}\frac{1}{2}||\omega _i-\omega _j||^2,\\ L =&L_c+\lambda _1L_r+\lambda _2L_d, \\ \end{aligned} \end{aligned}$$
(5)

Here, \(L_r\) represents the recursive regularization loss, in which \(\omega _i\) denotes the weight vector of the final fully connected layer associated with label i; \(y_i\) indicates the ground-truth label, \(y_i^\prime\) represents the predicted probability of the \(i^{th}\) label, and \(\lambda _1\) and \(\lambda _2\) are coefficients that weight the regularization and contrastive terms, respectively.
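A sketch of Eq. (5) is given below. The default coefficient values follow the settings reported later in the experimental setup, the children mapping mirrors the label tree, and the contrastive term \(L_d\) is assumed to be computed separately (for example, with the sampling sketch above).

```python
import torch.nn.functional as F

def total_loss(probs, targets, W, children, L_d, lambda_1=1e-6, lambda_2=1e-5):
    """probs, targets: (m,) predicted probabilities and binary ground-truth labels (as floats).
    W: (m, d_n) weight vectors of the final fully connected layer.
    children: dict mapping each parent label index to its child indices.
    L_d: scalar tensor holding the sampling hierarchical contrastive loss."""
    L_c = F.binary_cross_entropy(probs, targets, reduction="sum")   # binary cross-entropy, summed over labels
    L_r = sum(0.5 * (W[i] - W[j]).pow(2).sum()                      # recursive regularization over parent-child weights
              for i, kids in children.items() for j in kids)
    return L_c + lambda_1 * L_r + lambda_2 * L_d                    # total loss L
```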

Experiments

Dataset description

We assess the efficacy of our proposed model using two publicly available datasets: RCV1-v2 and Web-of-Science (WoS).

Reuters corpus volume I (RCV1-v2): This dataset represents a refined version of the original RCV1-v1 data, made available for research purposes33. It encompasses a total of 804,414 manually categorized newswire stories spanning 103 topics. Each newswire story has the potential for assignment to multiple topics.

Web of Science (WoS): The WoS dataset comprises metadata from 46,985 published papers34. This dataset includes abstracts, domains, and keywords. The abstract serves as the input for text classification, while the domain represents the label hierarchy. Additionally, the keywords offer descriptions of the subsequent label level. The dataset is characterized by a total of 141 domains.

Evaluation metrics

We use the standard evaluation metrics of Micro-F1 and Macro-F132 to measure our experimental results.

  • Micro-F1 measures the overall performance of the model and is computed from the aggregate precision and recall over all labels.

  • Macro-F1 measures the per-label performance of the model, giving equal weight to every label.

Specifically, the computation of the Micro-F1 score and Macro-F1 score is elucidated below:

$$\begin{aligned} \begin{aligned}&micro F1 = \frac{2\sum _{l\in L}TP_l}{\sum _{l\in L}\left( 2TP_l+FP_l+FN_l\right) },\\&macro F1 = \frac{1}{|L|}\sum _{l\in L}\frac{2TP_l}{2TP_l+FP_l+FN_l}, \end{aligned} \end{aligned}$$
(6)

Here, \(TP_l\), \(FP_l\), and \(FN_l\) denote the true positives, false positives, and false negatives, respectively, for label \(l \in L\).
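A minimal sketch of Eq. (6), computing both scores directly from binary prediction and ground-truth matrices:

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: (num_samples, num_labels) binary matrices."""
    tp = (y_true * y_pred).sum(axis=0)                 # TP_l per label
    fp = ((1 - y_true) * y_pred).sum(axis=0)           # FP_l per label
    fn = (y_true * (1 - y_pred)).sum(axis=0)           # FN_l per label

    micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)   # guard against labels with no instances
    macro_f1 = per_label.mean()
    return micro_f1, macro_f1
```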

Experimental setup

To provide a comprehensive evaluation, we compared our model against several baselines and state-of-the-art models. The baselines include CNN, RNN, RCNN, and hierarchical models like HiLAP, HMCN, and HiAGM. The specific implementation and training details are as follows:

Table 1 Implementation details: dropout shows the dropout rate in the embedding layer and MLP layer, GRU dropout shows the dropout rate in the Bi-GRU layer and Node dropout shows the dropout rate in the node transformation layer.

All experiments are conducted in PyTorch35. To ensure comparability with18, we adopt similar implementation parameters. The 300-dimensional word embedding vectors are initialized with GloVe pretrained embeddings36. A vocabulary of the 60,000 most frequent words is used, with words occurring fewer than twice removed. We use the Adam optimizer37 to minimize the total loss. The penalty coefficient for recursive regularization is set to \(1 \times 10^{-6}\), and that for the sampling hierarchical contrastive loss is set to \(1 \times 10^{-5}\). The maximum number of epochs is set to 400, and training stops if no improvement is observed within 50 epochs. Additional implementation details are provided in Table 1.
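These settings can be collected into a hedged configuration sketch; only the hyperparameters stated above are reproduced, the placeholder model stands in for the HCL-MTC module, and no learning rate is assumed beyond Adam's default.

```python
import torch

config = {
    "embedding_dim": 300,          # GloVe pre-trained word embeddings
    "vocab_size": 60000,           # most frequent words; words occurring fewer than twice are removed
    "lambda_recursive": 1e-6,      # penalty coefficient for recursive regularization
    "lambda_contrastive": 1e-5,    # penalty coefficient for the sampling hierarchical contrastive loss
    "max_epochs": 400,
    "early_stop_patience": 50,     # stop if no improvement is observed within 50 epochs
}

model = torch.nn.Linear(10, 10)    # placeholder module so the snippet runs stand-alone
optimizer = torch.optim.Adam(model.parameters())   # Adam optimizer minimizing the total loss
```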

Experimental results

Table 2 Experimental results of baseline models, state-of-the-art models, and our proposed model.

We systematically evaluate our proposed model on two public datasets, conducting a comprehensive comparison with 12 baseline and state-of-the-art models based on micro-F1 and macro-F1 metrics. The assessment of our model is performed on the test subset, employing the best model determined on the validation subset. The outcomes of our proposed model are presented in Table 2.

Discussion: the experimental findings yield several key conclusions:

  • Performance on RCV1-v2: our proposed model surpasses all existing models on the RCV1-v2 dataset, exhibiting a noteworthy enhancement of 0.58% in the Micro-F1 score and 0.94% in the Macro-F1 score compared to the HiAGM-TP\(_{GCN}\) model. These improvements are statistically significant (\(p < 0.05\)).

  • Performance on WoS: on the WoS dataset, HCL-MTC demonstrates considerable improvements of 0.65% and 0.74% in terms of Micro-F1 and Macro-F1, respectively. These improvements are also statistically significant (\(p < 0.05\)).

  • Macro-F1 improvement: HCL-MTC primarily boosts the Macro-F1 score across both datasets, indicating its proficiency in handling classes with fewer samples and effectively addressing the challenge of data sparsity.

  • Hierarchical learning impact: Our proposed model represents an enhancement of the HiAGM-TP\(_{GCN}\) framework by integrating a contrastive learning method into its core architecture. Specifically:

    • We leverage the similarity between label pairs as transition parameters in the GCN network, in contrast to utilizing the prior probability of label dependencies.

    • The sampling hierarchical contrastive loss introduced to the total loss helps the model learn the label hierarchy more effectively.

  • Distinctiveness and correlation management: The hierarchical contrastive loss aids in minimizing correlative information between parent nodes and their children while maximizing distinctiveness between different parent nodes. This contributes to the overall performance improvement.

Ablation test

We perform an ablation test to systematically analyze the influence of the similarity transition matrix and the sampling hierarchical contrastive loss on the proposed model. The results of this ablation study are presented in Table 3.

The outcomes of the ablation study provide valuable insights. Notably, the performance of HCL-MTC without the inclusion of the similarity transition matrix and the sampling hierarchical contrastive loss exhibits a significant decrease across both datasets, as evident in both Micro-F1 and Macro-F1 metrics. This substantial performance decline underscores the crucial contributions of both the similarity transition matrix and the sampling hierarchical contrastive loss to the overall efficacy of HCL-MTC. Additionally, an interesting observation is the larger decrease in macro-F1 performance compared to micro-F1. This discrepancy can be attributed to the fact that macro-F1 treats each class equally, disregarding imbalances in the distribution of samples across classes. Consequently, our proposed method demonstrates a capacity to mitigate the challenges associated with class imbalance.

Table 3 Ablation study of HCL-MTC with different components removed on the RCV1-v2 and WoS datasets. w/o similarity denotes HCL-MTC without the similarity transition matrix, and w/o contrastive loss denotes HCL-MTC without the sampling hierarchical contrastive loss.

The rationale behind these findings can be elucidated as follows:

  • Facilitating deep label hierarchy learning: The objective is to encourage the child node to aggregate more information from its parent, or vice versa, enabling the model to learn deep label hierarchies along the correct path. The similarity transition matrix plays a pivotal role in this process, assigning higher transition probabilities to label pairs that exhibit greater similarity.

  • Role of sampling hierarchical contrastive loss: This loss function helps the model pull parent nodes and their child nodes closer together, minimizing their correlative gap, while simultaneously maximizing the distinctiveness between different parent nodes. Consequently, the sampling hierarchical contrastive loss contributes to guiding the model towards optimal solutions during the training process.

Limitations and future work

While the proposed model demonstrates significant improvements over existing methods, it has notable limitations that warrant further investigation:

  • Scalability issues: The hierarchical contrastive loss and similarity transition matrix can become computationally expensive as the number of labels increases, potentially limiting scalability for extremely large label sets. Future work could explore techniques such as hierarchical sampling methods, low-rank approximations of similarity matrices, or distributed training strategies to improve computational efficiency.

  • Data sparsity: In cases of extreme data sparsity, particularly for low-frequency labels, the model may struggle to learn meaningful relationships. Future research could address this by incorporating data augmentation strategies, synthetic data generation, or transfer learning from related tasks with richer datasets to enhance the model’s robustness in sparse settings.

  • Highly imbalanced label distribution: Despite improvements in Macro-F1 scores, extreme class imbalance remains challenging, especially for labels with limited instances. Potential solutions include dynamic label reweighting, advanced oversampling techniques, or self-supervised pretraining tailored to underrepresented labels.

  • Complex hierarchical structures: For datasets with exceptionally deep or irregular hierarchical structures, the model may face difficulties in capturing accurate relationships, particularly when hierarchical depths exceed those seen during training. Developing adaptive mechanisms, such as recursive representations or depth-aware loss functions, could allow the model to better handle varying hierarchical complexities.

  • Domain-specific vocabulary: When the label vocabulary is highly domain-specific, the pretrained GloVe embeddings may insufficiently capture semantic nuances, potentially affecting performance. Fine-tuning embeddings on domain-specific corpora, integrating domain knowledge through external knowledge graphs, or leveraging domain-adapted language models could mitigate this issue.

  • Generalizability across domains: While the model has demonstrated improvements on specific datasets, its adaptability across diverse domains remains limited. Future work should involve extensive testing across datasets from varied domains and tasks to evaluate robustness. Additionally, exploring domain adaptation techniques could enhance its generalizability.

In conclusion, while the proposed HCL-MTC model achieves promising results, addressing these limitations could further enhance its scalability, robustness, and applicability. Future research directions include optimizing computational efficiency for large-scale label sets, improving methods to handle data sparsity and class imbalance, developing adaptive approaches for complex hierarchical structures, and ensuring robust generalization across diverse domains and specialized vocabularies.

Conclusions

This paper introduces the Hierarchical Contrastive Learning for Multi-label Text Classification (HCL-MTC) framework, which leverages Graph Convolutional Networks (GCN) with a similarity transition matrix and sampling hierarchical contrastive loss. These innovations enhance the model’s ability to learn deep label hierarchies and capture both correlative and distinctive knowledge between labels.

Experiments on the RCV1-v2 and Web of Science (WoS) datasets demonstrate the superior performance of HCL-MTC, achieving Micro-F1 scores of 84.54% and 86.47% and Macro-F1 scores of 64.29% and 81.02%, respectively. These results highlight the model’s robustness in handling data sparsity and class imbalance, establishing HCL-MTC as a highly effective solution for multi-label text classification.