Abstract
With the rapid advancement of Natural Language Processing (NLP) technologies, the application of NLP to enable intelligent syndrome differentiation in Traditional Chinese Medicine (TCM) has become a popular research focus. However, TCM texts contain numerous obscure characters and specialized terminologies, which existing methods struggle to extract effectively, leading to lower accuracy in syndrome differentiation. To address this, we propose a dual-channel knowledge-attention model for TCM syndrome differentiation. The model utilizes ZY-BERT, a large pre-trained model for the TCM domain, to extract vector representations of TCM texts. A dual-channel network, comprising an improved Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, is employed to capture both critical local information and global patterns in TCM texts. Additionally, an attention mechanism is introduced to enhance the model's ability to learn syndrome-related knowledge, integrating syndrome definitions to improve the differentiation of complex syndromes. Experiments on a publicly available TCM syndrome differentiation dataset demonstrate that the proposed model achieves an accuracy of 84.01%, a 1.75% improvement over the best baseline model.
Introduction
Syndrome differentiation in Traditional Chinese Medicine (TCM) is the process of identifying a patient’s syndrome type by analyzing diagnostic information through four key methods: inspection, listening and smelling, inquiry, and palpation1. This process is fundamental to TCM diagnostics and is crucial for guiding clinical treatments. However, traditional syndrome differentiation is highly complex and subjective, relying heavily on a physician’s knowledge base, clinical experience, and ability to synthesize multiple sources of diagnostic information. The lack of standardized diagnostic criteria and inter-physician variability further exacerbate the challenges in ensuring consistent and accurate diagnosis.
With the rapid advancement of artificial intelligence (AI) and natural language processing (NLP) technologies, researchers have increasingly explored leveraging these tools to address the challenges of TCM syndrome differentiation. AI models can process large amounts of textual data, including patients’ chief complaints, medical histories, and diagnostic records, enabling automated, standardized, and efficient syndrome differentiation. This could significantly reduce dependence on physicians’ subjective judgments, improve diagnostic consistency, and ultimately enhance patient outcomes.
In the past decade, deep learning methods have excelled in NLP tasks, including sentiment analysis, machine translation, and text classification2,3. Inspired by these advances, researchers have started to explore their applications in TCM to address the unique challenges posed by the complexity and ambiguity of TCM texts. For instance, TCM texts often contain rare characters, domain-specific terminology, and implicit semantic relationships, which are challenging to capture using traditional NLP approaches. Additionally, the polysemy and context-dependence of TCM terms further complicate syndrome differentiation tasks.
Transformer-based architectures, such as BERT, excel at capturing long-range dependencies and global semantics but often miss critical local features, especially in short texts like chief complaints4. Furthermore, the computational complexity of transformers can limit their effectiveness in extracting fine-grained features essential for accurate syndrome differentiation. To address these limitations, we propose a novel Dual-Channel Knowledge Attention (DCKA) model that combines convolutional neural networks (CNNs) and bidirectional long short-term memory networks (Bi-LSTMs). CNNs capture fine-grained features in short texts, while Bi-LSTMs use gating mechanisms and bidirectional propagation to extract global contextual information from longer texts. This design complements the strengths of BERT representations and enhances the model's ability to handle complex TCM texts.
The contributions of this paper are summarized as follows:
- We propose a dual-channel architecture that integrates CNN and Bi-LSTM networks to effectively extract both global semantic context and local textual features, addressing the limitations of transformer-only models in capturing short-text features.
- A knowledge-attention mechanism is introduced to incorporate domain-specific knowledge of TCM syndrome patterns, enabling the model to better understand the intricate relationships between symptoms and syndromes.
- The proposed model is evaluated on a large-scale TCM dataset, demonstrating significant improvements in accuracy, F1-score, and generalization ability compared to existing baseline models.
- We provide a comprehensive analysis of the model's performance, highlighting its ability to handle rare syndromes and long-tail distributions, which are common challenges in TCM syndrome differentiation tasks.
By combining CNN and Bi-LSTM to capture multi-level features and integrating domain knowledge through attention mechanisms, this study bridges the gap between domain knowledge and deep learning in TCM diagnosis. The proposed DCKA model not only enhances the accuracy and efficiency of syndrome differentiation but also offers a framework for future research in AI-driven TCM diagnostics.
Literature review
In recent years, significant efforts have been made to apply deep learning techniques to TCM syndrome differentiation, with varying degrees of success. Below, we summarize and critically analyze several representative works in this field, highlighting both their strengths and limitations.
Mucheng et al.4 pre-trained a large TCM model, ZY-BERT, based on Bidirectional Encoder Representations from Transformers (BERT)5 specifically designed for TCM corpora. This model introduced an automatic syndrome classification mechanism, which marked an important step in integrating BERT-based architectures into TCM processing. However, the effectiveness of ZY-BERT is still limited by its inability to fully capture the complex global context of TCM texts. Due to the inherent ambiguity and polysemy of TCM language, ZY-BERT struggles to disambiguate subtle semantic differences, leading to lower accuracy in syndrome classification when compared to human experts.
Hu et al.6 proposed a multi-task learning framework, combining BiLSTM and CNN. They employed embeddings to convert TCM texts into word vectors and used word segmentation as an auxiliary task to incorporate word-level information7. While this model showed improvements in incorporating local textual information, its ability to model long-range dependencies remains weak due to the intrinsic limitations of CNNs. Furthermore, the auxiliary word segmentation task, while useful, cannot fully address the complexities of TCM syndrome differentiation, particularly in handling rare terms that are critical for correct diagnosis.
Ye et al.8 developed a model that integrates external knowledge graphs with BERT to enhance syndrome differentiation. By embedding knowledge graphs, the model leverages additional structured medical knowledge, which significantly improved accuracy. However, the reliance on external knowledge graphs introduces dependency on the completeness and accuracy of the medical knowledge base, which may not fully represent all aspects of TCM. Additionally, the fusion between knowledge graphs and text embeddings is non-trivial and may introduce noise if the medical records are ambiguous or incomplete.
Li et al.9 employed transfer learning with TCM-BERT10 and introduced a dual reinforcement strategy to enhance the model’s ability to recognize rare syndromes. This model incorporates data augmentation at both the sample and feature levels, providing more training data for rare syndromes. However, despite improvements in rare syndrome recognition, this model is prone to overfitting, especially when the augmented data does not accurately represent real-world distributions. Moreover, the model’s performance on common syndromes was not significantly improved, suggesting that the overall generalization capability remains limited.
Zhao et al.11 proposed a model that focuses on fine-grained token-level matching to better capture the relationship between symptoms and syndromes. They introduced a global label attention mechanism and employed Focal Loss12 to mitigate the effects of class imbalance due to the long-tail distribution of syndromes. While this model addressed some issues related to imbalanced data, it still faces challenges in capturing both global and local information simultaneously. The fine-grained token-level matching improves local feature extraction but may overlook broader contextual information crucial for accurate syndrome differentiation.
Unlike the aforementioned studies, DCKA introduces a novel perspective to implement syndrome differentiation in TCM. It employs a dual-channel BiLSTM-CNN architecture combined with knowledge-based attention. Further details will be elaborated in the subsequent sections.
Methods
Task definition
The primary task of this research is to model syndrome differentiation in TCM, which involves identifying the syndrome type of a patient based on diagnostic information such as the four diagnostic methods: inspection, listening and smelling, inquiry, and palpation. Given an input text \(X\), the goal is to construct a classification model \(f\) to predict the corresponding syndrome label \(Y\).
Overview of the model
The DCKA model proposed in this paper builds upon the foundation of Chief Complaint Syndrome Differentiation (CCSD)11. DCKA introduces the following four key modifications compared to CCSD:
- A dual-channel network architecture combining CNN and BiLSTM is constructed based on the characteristics of both long and short text inputs, with an improved CNN introduced to capture key information from chief complaints.
- DCKA avoids cross-attention between chief complaints and medical history, preserving their independence to extract distinct insights in later computations.
- DCKA independently calculates attention between the chief complaint and external knowledge, enhancing the chief complaint, medical history, and four diagnostic information (inspection, auscultation, inquiry, and palpation) with knowledge-based attention.
- DCKA removes the Focal Loss used in CCSD. While Focal Loss improves the model's ability to recognize rare syndromes, it also leads to the neglect of common syndromes during training, resulting in lower overall model accuracy.
The structure of the DCKA model is illustrated in Fig. 1. The algorithmic process of DCKA is outlined in Algorithm 1.
As shown in Fig. 1 and Algorithm 1, the DCKA model integrates deep learning networks such as ZY-BERT, CNN, and BiLSTM. Each of these modules will be introduced in detail later. The workflow for implementing syndrome differentiation in TCM using the DCKA model is as follows:
Step 1 The text of TCM medical records undergoes preprocessing operations such as trimming, padding, and batching. The processed text is then fed into ZY-BERT to obtain vector representations. The chief complaint vector of the patient is input into the short-text CNN channel, while the patient’s medical history and four diagnostic information are input into the long-text BiLSTM channel.
Step 2 The CNN channel extracts features from short texts, while the BiLSTM channel focuses on long-text features in TCM medical records.
Step 3 ZY-BERT is used to obtain the text representations of syndrome knowledge. The short-text features from the CNN channel and the long-text features from the BiLSTM channel are input into the knowledge attention network, where cross-attention with syndrome knowledge is computed.
Step 4 The outputs from the two channels are fused, and a linear transformation is applied to produce syndrome differentiation predictions.
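For concreteness, the workflow can be outlined in PyTorch as follows. This is a minimal sketch assuming the module interfaces described in the following subsections; the class name, argument wiring, and dimensions are illustrative rather than the released DCKA implementation.

```python
import torch
import torch.nn as nn

class DCKASketch(nn.Module):
    """Skeleton of the DCKA workflow (Steps 1-4)."""

    def __init__(self, encoder, cnn_channel, bilstm_channel,
                 ka_short, ka_long, hidden=1024, n_classes=148):
        super().__init__()
        self.encoder = encoder                # ZY-BERT text encoder
        self.cnn_channel = cnn_channel        # short-text (chief complaint) channel
        self.bilstm_channel = bilstm_channel  # long-text channel
        self.ka_short = ka_short              # knowledge attention, CNN channel
        self.ka_long = ka_long                # knowledge attention, BiLSTM channel
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, complaint, history, knowledge):
        # Step 1: ZY-BERT vector representations of both text fields.
        p_emb = self.encoder(**complaint).last_hidden_state
        h_emb = self.encoder(**history).last_hidden_state
        # Step 2: channel-specific feature extraction.
        p = self.cnn_channel(p_emb)       # local features of the short text
        h = self.bilstm_channel(h_emb)    # global features of the long text
        # Step 3: cross-attention with syndrome-knowledge vectors.
        v_p = self.ka_short(p, knowledge)
        v_h = self.ka_long(h, knowledge)
        # Step 4: fuse both channels and predict the syndrome label.
        return self.classifier(torch.cat([v_p, v_h], dim=-1))
```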
ZY-BERT
ZY-BERT, a domain-specific large language model tailored for TCM, extends the pre-training of the RoBERTa13 (Robustly Optimized BERT Approach) framework using a substantial collection of TCM-related text data. The training corpus consists of diverse unannotated sources, including websites, books, and academic papers from the China National Knowledge Infrastructure (CNKI), collectively amounting to over 400 million words.
Therefore, ZY-BERT possesses domain-specific knowledge of TCM that traditional large language models lack. By generating high-quality semantic representations of TCM texts, ZY-BERT enhances the model’s ability to understand TCM-specific content, effectively addressing the challenges of semantic ambiguity and accuracy limitations faced by traditional models when processing TCM data.
In the pre-training phase, ZY-BERT, like RoBERTa, removes the Next Sentence Prediction (NSP) task. To further enhance the model's capacity to capture complex linguistic features, ZY-BERT employs Whole Word Masking (WWM) in place of the traditional character-level masking approach. The model architecture is composed of 24 Transformer2 layers, each with a feature dimension of 1024 and 16 attention heads. ZY-BERT maintains compatibility with BERT's input requirements, necessitating the transformation of text data into formats such as input IDs and attention masks. By leveraging ZY-BERT, more precise semantic representations of TCM texts can be achieved.
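For illustration, a ZY-BERT-style encoder can be queried through the standard transformers interface as sketched below; the checkpoint path is a placeholder for locally obtained ZY-BERT weights, and the chief complaint is an invented example.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder path; substitute the locally downloaded ZY-BERT weights.
tokenizer = AutoTokenizer.from_pretrained("path/to/zy-bert")
model = AutoModel.from_pretrained("path/to/zy-bert")

text = "咳嗽咳痰3天，加重伴发热1天。"  # invented chief complaint
inputs = tokenizer(text, padding="max_length", truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # consumes input IDs and attention masks
print(outputs.last_hidden_state.shape)  # (1, 512, 1024) for the 24-layer model
```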
Dual-channel network
Inspired by the CCSD framework, this paper proposes a dual-channel network designed specifically for short and long texts. The design of the DCKA network incorporates a combination of CNN and Bi-LSTM, driven by the following motivations: CNN excels at extracting local features from short texts, making it particularly suitable for capturing critical symptom information from TCM chief complaints. Bi-LSTM, with its bidirectional propagation and gating mechanisms, effectively models the global contextual dependencies of long texts, such as medical histories and diagnostic information. Moreover, these two components complement BERT’s representations by addressing its limitations in capturing local features in short texts while maintaining manageable computational complexity. Through its dual-channel architecture, DCKA integrates local and global feature extraction, enhancing its capability for comprehensive analysis of TCM texts.
Improved CNN short text channel
In the TCM-SD4 dataset, the “Chief Complaint” field represents the primary symptoms and feelings that patients report to the doctor upon their visit. This typically reflects the immediate reasons for seeking medical assistance and illustrates the most prominent clinical manifestations of the disease. The distribution of the length of the “Chief Complaint” field in the TCM-SD dataset is shown in Fig. 2.
As shown in Fig. 2, the majority of text lengths in the “Chief Complaint” field fall within the range of 0–25 characters, indicating that this field primarily consists of short text. CNNs are particularly effective at capturing features in short text, as the convolutional kernels can identify local features14, which is highly beneficial for recognizing keywords and phrases in such text. The earliest application of CNNs for text classification was introduced by Kim15, known as TextCNN. TextCNN employs three convolutional kernels of sizes 3, 4, and 5, combined with max-pooling to extract essential information, as illustrated in Fig. 3.
Given that key terms in TCM may appear in the "chief complaint" text at various lengths, this paper employs multiple convolutional filters (sizes 2, 3, 4, and 5) to comprehensively capture the features of these terms. Additionally, the model integrates three pooling methods (max pooling, average pooling, and stochastic pooling) to extract textual features from diverse perspectives. Max pooling identifies the most salient features16, average pooling captures the overall textual information, and stochastic pooling introduces randomness, which helps mitigate overfitting. By combining these pooling methods, the model achieves a richer and more diversified feature representation, preserving critical information across multiple convolutional scales to the greatest extent possible. The convolution formula is presented as follows.
$$c_h = \mathrm{ReLU}\left(X_{ch} * W_h + b_h\right)$$

The formula above represents the calculation of the output \(c_h\) from a convolution operation followed by a ReLU17 activation function. Specifically, \(X_{ch}\) denotes the input feature map for channel \(c\), \(W_h\) is the convolutional filter applied to this channel, and \(b_h\) is the bias term associated with the convolutional layer. The result of the convolution, \(X_{ch} * W_h + b_h\), is then passed through the ReLU activation function, which introduces non-linearity by setting all negative values to zero. Next, a pooling operation is applied to the convolved features, as defined by the following equation.
$$p_{\max} = \max(c_h), \qquad p_{\mathrm{mean}} = \mathrm{mean}(c_h), \qquad p_{\mathrm{rand}} = \mathrm{random}(c_h)$$

Here, max, mean, and random denote max pooling, average pooling, and stochastic pooling, respectively. The three pooling outputs are integrated through Eq. (5).
Through Eq. (5), the outputs of the three pooling methods are integrated, and the resulting features from all convolutional kernels are concatenated to generate the final output \(P\) of the CNN.
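A minimal sketch of this improved CNN channel is given below. The element-wise summation used to integrate the three pooled vectors and the softmax-based sampling normalizer for stochastic pooling are illustrative assumptions, not a verbatim reproduction of the implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedTextCNN(nn.Module):
    """Sketch of the improved short-text channel: kernel sizes (2, 3, 4, 5),
    each followed by max, average, and stochastic pooling."""

    def __init__(self, emb_dim=1024, n_filters=256, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes
        )

    def stochastic_pool(self, c):
        # Sampling probabilities from the (non-negative) activations;
        # softmax is used here as a simple, division-safe normalizer.
        probs = F.softmax(c, dim=-1)                     # (B, F, L)
        if self.training:                                # sample one position
            idx = torch.multinomial(probs.flatten(0, 1), 1)
            return c.flatten(0, 1).gather(1, idx).view(c.shape[:2])
        return (probs * c).sum(dim=-1)                   # expectation at test time

    def forward(self, emb):
        x = emb.transpose(1, 2)            # (B, L, D) -> (B, D, L) for Conv1d
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))            # c_h = ReLU(X * W_h + b_h)
            pooled.append(c.amax(-1) + c.mean(-1) + self.stochastic_pool(c))
        return torch.cat(pooled, dim=-1)   # P: (B, 4 * n_filters)
```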
The BiLSTM long text channel
In the TCM-SD framework, a patient’s medical history and the four diagnostic methods serve together as inputs for the long-text channel. The medical history, also referred to as the patient’s past medical history, encompasses previous health conditions and treatments, including past illnesses, treatments received, surgical history, and medication usage. The four diagnostic methods refer to the comprehensive health information collected through the techniques of inspection, auscultation and olfaction, inquiry, and palpation. The length distributions of both the medical history and the four diagnostic methods are illustrated in Fig. 4.
Figure 4 shows that most medical-history and four-diagnostics texts have lengths between 250 and 500 characters, a characteristic of long text sequences. The bidirectional propagation of the BiLSTM model enables it to leverage comprehensive contextual information18, allowing for a nuanced understanding of each word or phrase within the text. This capability is especially crucial for longer texts, where capturing global context is essential. Moreover, the gated mechanism of LSTM facilitates the retention of significant information while filtering out less relevant details, effectively mitigating information overload in long text sequences14. Therefore, this study employs BiLSTM to fully extract global contextual information. Prior to input into the BiLSTM network, vector representations of the medical history and the four diagnostic information are generated using ZY-BERT. The input format for ZY-BERT is structured as "[CLS] + medical history + [SEP] + four diagnostic information + [PAD]", where [CLS], [SEP], and [PAD] are special tokens in the BERT family of networks19. The [CLS] token captures the contextual semantics of the entire input, [SEP] marks the boundary between the two text segments, and [PAD] is used as a padding symbol20.
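Continuing the tokenizer and model from the earlier ZY-BERT snippet, this pair format can be produced as follows; the field contents are invented examples.

```python
# Reuses `tokenizer` and `model` from the ZY-BERT loading snippet above.
medical_history = "既往高血压病史5年，规律服药。"          # invented example
four_diagnostics = "神清，面色少华，舌淡红苔薄白，脉细。"  # invented example

pair_inputs = tokenizer(
    text=medical_history,        # first segment: medical history
    text_pair=four_diagnostics,  # second segment: four diagnostic information
    padding="max_length",        # appends [PAD] up to max_length
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# The tokenizer prepends [CLS] and places [SEP] between the two segments.
long_repr = model(**pair_inputs).last_hidden_state  # (1, 512, 1024)
```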
After obtaining the vector representation of the long text, it is input into a BiLSTM to extract its contextual global information, as shown below.

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}\left(x_t, \overrightarrow{h_{t-1}}\right), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}\left(x_t, \overleftarrow{h_{t-1}}\right)$$

Here, \(x_t\) represents the input vector at time step \(t\); \(\overrightarrow{h_t}\) denotes the hidden state at time step \(t\) in the forward LSTM layer, with \(\overrightarrow{h_{t-1}}\) as the previous hidden state in this forward layer21. Similarly, \(\overleftarrow{h_t}\) is the hidden state in the backward LSTM layer at time step \(t\), and \(\overleftarrow{h_{t-1}}\) represents the previous hidden state in the backward layer.
Within each LSTM cell, the gates and states are computed as

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \qquad f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \qquad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$

$$\tilde{c}_t = \beta(W_c[h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \beta(c_t)$$

In this context, \(\sigma\) and \(\beta\) represent the sigmoid and tanh functions, respectively. The terms \(i_t\), \(f_t\), and \(o_t\) correspond to the input, forget, and output gates, which are the core components of the LSTM. Finally, by concatenating the last hidden states from both directions of the BiLSTM, the output feature \(H\) is obtained, as described in the following formula22.

$$H = \mathrm{concat}\left(\overrightarrow{h_{-1}}, \overleftarrow{h_{-1}}\right)$$

In the above equation, concat denotes the feature concatenation function, while \(-1\) indicates the last time step of the text sequence.
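A corresponding sketch of the BiLSTM channel follows; the hidden size is illustrative (Table 1 reports the settings actually used).

```python
import torch
import torch.nn as nn

class BiLSTMChannel(nn.Module):
    """Sketch of the long-text channel: a BiLSTM over ZY-BERT token
    vectors whose final forward and backward hidden states are
    concatenated into the global feature H."""

    def __init__(self, emb_dim=1024, hidden=512):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, emb):
        # emb: (batch, seq_len, emb_dim); h_n: (2, batch, hidden)
        _, (h_n, _) = self.bilstm(emb)
        # concat of the last forward and last backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # H: (batch, 2 * hidden)
```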
Knowledge attention
The attention mechanism incorporates domain-specific knowledge, such as syndrome definitions and typical manifestations, into the model. This helps the model better identify the unique features of syndromes, improving its accuracy in distinguishing complex cases.
The TCM-SD dataset provides relevant information on syndromes, which primarily includes definitions of each syndrome, their typical manifestations, and commonly associated diseases. An example of syndrome-related information is illustrated in Fig. 5.
From Fig. 5, it can be observed that the knowledge of syndrome patterns effectively elucidates the definition and specific external symptoms of each pattern. Integrating this knowledge into the network significantly aids the model in identifying complex syndrome samples23. This is akin to learning the meaning of each answer option before tackling a multiple-choice question, which naturally improves accuracy.
To enable the model to leverage syndrome pattern knowledge, this knowledge must first be transformed into vector representations. As shown in Fig. 1, this paper utilizes ZY-BERT to extract representations of syndrome patterns. To reduce model complexity, this ZY-BERT instance is employed solely as a representational tool and does not participate in training. The proposed approach incorporates an attention mechanism, referred to as knowledge attention, to facilitate the model's learning of syndrome pattern knowledge. By incorporating relevant knowledge of the labels into the model, knowledge attention helps distinguish between different category features more effectively, as illustrated in Fig. 6.
In Fig. 6, the outputs \(P\) and \(H\) from the dual-channel network are each multiplied by the knowledge representation vectors, yielding the knowledge attention matrix \(A\). This attention matrix is then used to selectively aggregate syndrome knowledge, enabling the \(P\) and \(H\) vectors to acquire information relevant to each syndrome type. This process ultimately enhances the model's ability to make more accurate syndrome classification judgments24. The calculation of knowledge attention is expressed in the following equations.

$$A_p = \mathrm{softmax}\left(P W_p K^{\top}\right), \qquad A_h = \mathrm{softmax}\left(H W_h K^{\top}\right)$$

$$V_p = P + A_p K \tag{16}$$

$$V_h = H + A_h K \tag{17}$$

In this context, \(K\) denotes the knowledge representation, while \(W\) represents the learnable parameter matrix; the subscripts \(p\) and \(h\) correspond to the CNN and BiLSTM channels, respectively. In Eqs. (16) and (17), the knowledge-enhanced features are linked to the original features through residual connections25, aiming to retain as much of the original semantic information of TCM texts as possible. The symbols \(V_p\) and \(V_h\) signify the dual-channel outputs after knowledge enhancement. Finally, these two channels are concatenated to form the overall output of the DCKA model:

$$O = \mathrm{concat}\left(V_p, V_h\right)$$
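A minimal sketch of this block, under the bilinear-attention formulation above, is shown below; instantiating it twice, once per channel, mirrors the independent attention parameters examined later in Table 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAttention(nn.Module):
    """Features attend over frozen ZY-BERT syndrome-knowledge vectors K,
    aggregate them, and add the result back through a residual
    connection, as in Eqs. (16) and (17)."""

    def __init__(self, dim=1024):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # learnable matrix W

    def forward(self, feat, K):
        # feat: (batch, dim) channel output P or H
        # K:    (n_syndromes, dim) knowledge representations
        A = F.softmax(self.W(feat) @ K.T, dim=-1)  # attention matrix A
        knowledge = A @ K                          # aggregated syndrome knowledge
        return feat + knowledge                    # residual connection
```

Two independent instances (e.g., `ka_short` and `ka_long` in the earlier skeleton) keep the per-channel attention parameters separate.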
Results
Dataset
The TCM-SD4 dataset is currently the only large-scale, publicly available TCM dataset focusing on syndrome differentiation. It includes over 50,000 de-identified TCM diagnostic records, covering a total of 148 distinct TCM syndrome types (labels). The dataset is divided into training, testing, and validation sets, with further details provided in Fig. 7.
As shown in Fig. 7, the TCM-SD dataset was divided into training, validation, and test sets in an 8:1:1 ratio.
Experimental parameter settings
The experiments in this study were conducted on an RTX 4090 GPU, utilizing the Windows operating system and the PyTorch framework for model development. The detailed parameter settings for the experiments are presented in Table 1.
The feature dimension of the DCKA model is set to 1024, as detailed in Table 1. The text processing length is aligned with ZY-BERT's maximum input length, which is set to 512. The model is trained for 30 epochs, and the weights from the epoch achieving the highest validation accuracy are used for testing. To ensure that the dual-channel network maintains a dimension size of 1024, the hidden layer sizes of both the CNN and LSTM components are set to 256, with Adam as the optimizer. Additionally, a linear learning rate scheduler is used during training. The DCKA training process is implemented through the Trainer module of the transformers26 library.
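The training loop can be reproduced approximately with the transformers Trainer, as sketched below; the arguments mirror the reported epoch count, linear scheduler, and best-epoch selection, while `model`, `train_set`, `val_set`, and `compute_metrics` (defined in the next subsection's sketch) are assumed to exist.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="dcka-out",
    num_train_epochs=30,              # as reported in Table 1
    lr_scheduler_type="linear",       # linear learning-rate schedule
    evaluation_strategy="epoch",      # validate after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best-validation weights
    metric_for_best_model="accuracy",
)
trainer = Trainer(
    model=model,              # the assembled DCKA model (assumed)
    args=args,
    train_dataset=train_set,  # preprocessed TCM-SD training split (assumed)
    eval_dataset=val_set,     # validation split (assumed)
    compute_metrics=compute_metrics,
)
trainer.train()
```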
Experimental evaluation metrics
The evaluation metrics used in this paper are Accuracy, Precision, Recall, and F1-score, as outlined in reference27. With \(TP\), \(FP\), \(TN\), and \(FN\) denoting true positives, false positives, true negatives, and false negatives for a given class, these metrics are defined by the following formulas.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
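In practice, these metrics can be computed with scikit-learn using macro averaging over all syndrome classes, consistent with the protocol stated in the Results section; the helper below is a sketch compatible with the Trainer setup above, not the authors' exact evaluation script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Macro-averaged Precision/Recall/F1 plus overall Accuracy."""
    logits, labels = eval_pred          # transformers EvalPrediction unpacks
    preds = np.argmax(logits, axis=-1)  # predicted syndrome labels
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```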
Baselines
The DCKA model was compared with the following models in a series of experiments.
DT and SVM: Texts are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) technique, with Decision Tree (DT) and Support Vector Machine (SVM) serving as classifiers.
BiLSTM, BiGRU, and CNN: Text representations are obtained via BERT, and BiLSTM, BiGRU, and CNN are employed as classifiers.
BERT3, distilBERT28, ALBERT29, and RoBERTa13: These models perform direct classification based on the [CLS] token.
TCM-BERT10, TLDA9, and CCSD11: TCM-BERT is pre-trained on Traditional Chinese Medicine corpora and performs classification using the [CLS] token. TLDA leverages transfer learning and data augmentation to improve the model’s ability to recognize rare syndromes. CCSD incorporates label attention and Focal Loss, with DCKA (ours) building on improvements made in CCSD.
Main results
The experimental results of DCKA and the baseline model on the TCM-SD dataset are presented in Table 2. In this study, precision, recall, and F1 scores were calculated using the macro-averaging method, consistent with the evaluation approach of CCSD.
As shown in Table 2, DCKA achieved the highest prediction accuracy on both the validation and test sets of the TCM-SD dataset, reaching 83.12% and 84.01%, respectively. This demonstrates the superiority of the proposed method in the task of syndrome differentiation in TCM. Specifically, DCKA outperformed CCSD in accuracy by 1.77% and 2.25% on the validation and test sets, respectively. Furthermore, DCKA also surpassed the CCSD model in terms of recall and F1 score, which underscores the effectiveness of the improvements made to CCSD in this study.
This paper argues that DCKA significantly outperforms CCSD in terms of accuracy for the following reasons. First, the CNN channel in DCKA effectively captures the keywords and phrases embedded in patient complaints, which helps the model match specific syndromes based on the complaints. Second, the dual-channel network captures multi-perspective features from the text data, allowing the model to grasp the diverse connotations of TCM texts. Third, the removal of the Focal Loss function used in CCSD contributes to the improved performance: while Focal Loss aids in learning the characteristics of rare syndromes, it can also cause the model to overlook common syndromes, lowering overall accuracy. Fourth, DCKA independently computes knowledge attention for each channel, capturing attention between "chief complaint-syndrome" and "medical history & four diagnostic methods-syndrome" pairs, thus obtaining enriched features through knowledge-enhanced attention. Next, we conduct ablation experiments on each module of DCKA, with the results presented in Fig. 8.
The results shown in Fig. 8 demonstrate that each module of DCKA contributes to improvements in both accuracy and F1 score, providing strong evidence of the effectiveness of these components. Specifically, the CNN+BiLSTM model achieved accuracy improvements of 1.15% and 1.22% over ZY-BERT on the validation and test sets, respectively. This indicates that the dual-channel network helps the model capture more diverse features from TCM texts. Additionally, the knowledge attention module led to further accuracy increases of 0.54% and 0.60% on the two sets, suggesting that the knowledge attention mechanism enables the model to learn the definitional knowledge of syndrome types, thus enhancing its ability to identify complex syndromes. To further verify the effectiveness of the improved CNN, the experimental results are presented in Tables 3 and 4.
As shown in Table 3, increasing the number of convolutional kernels in the CNN to (2, 3, 4, 5) improves the model’s performance. This suggests that using a greater variety of convolutional kernels helps the model more comprehensively capture key information within the short complaint texts.
The results in Table 4 show that combining various pooling methods improves the accuracy and F1 score of DCKA compared to using max pooling or average pooling alone. This suggests that the combination of different pooling strategies can better preserve the important information extracted by the CNN. The dual-channel knowledge attention mechanism possesses independent parameters, and its experimental results are presented in Table 5.
As shown in Table 5, the incorporation of knowledge attention improves the prediction accuracy of the dual-channel network. From the results, it can be observed that knowledge attention is more effective in the long-text channel. This may be because knowledge attention helps the model understand the features of long texts, thereby enhancing syndrome classification performance. Figure 9 visualizes the knowledge attention heatmap for a batch of samples.
As shown in Fig. 9, the values of knowledge attention are disorganized at the initial stage. However, after fine-tuning, the knowledge attention exhibits a certain degree of regularity. This indicates that knowledge attention is effective, guiding the model to focus on certain critical pieces of phenotypic knowledge.
Failure cases and error analysis
For the TCM-SD dataset, DCKA struggles to accurately predict certain categories. The primary reason lies in the insufficient number of trainable samples in these categories, which prevents DCKA from effectively learning their shared features. Moreover, some categories contain a high proportion of rare words, posing a significant challenge for DCKA to accurately interpret their meanings. Table 6 presents the precision, number of samples, and rare word proportions for selected categories in the TCM-SD dataset.
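As one plausible operationalization (the exact criterion behind Table 6 is not reproduced here), the rare-word proportion of a category can be estimated as the share of its characters whose corpus-wide frequency falls below a threshold, as sketched below.

```python
from collections import Counter

def rare_char_proportion(texts, corpus_counts: Counter, threshold: int = 5):
    """Share of characters in `texts` whose corpus-wide frequency is
    below `threshold`; an illustrative proxy for rare-word density."""
    chars = [ch for text in texts for ch in text]
    rare = sum(1 for ch in chars if corpus_counts[ch] < threshold)
    return rare / max(len(chars), 1)

# Usage sketch: corpus_counts = Counter(ch for rec in all_records for ch in rec)
# then rare_char_proportion(category_texts, corpus_counts)
```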
As shown in Table 6, the categories where predictions fail tend to have too few trainable samples, coupled with an excessive number of rare words in their texts, making it difficult for DCKA to recognize and process them effectively. This is the main reason for the prediction errors. In contrast, categories such as Damp-Heat Pouring Down Syndrome and Qi Deficiency and Blood Stasis Syndrome exhibit better prediction performance. This can be attributed to their larger number of training samples and texts that are easier for DCKA to comprehend. Finally, we also explore the performance of DCKA on other text classification tasks in Appendix 2.
Conclusion
This paper proposes a novel diagnostic model in TCM called DCKA, which features a dual-channel architecture combining CNN and BiLSTM. The model is specifically designed to handle both short and long texts in TCM-related data, excelling in capturing both key information and global context. DCKA enhances the traditional CNN by introducing more convolutional kernels and a more comprehensive pooling strategy, effectively improving the extraction and refinement of chief complaints. Additionally, the model incorporates knowledge attention, enabling it to learn the definitional knowledge of syndrome types from chief complaints, medical history, and the four diagnostic methods, thereby strengthening its understanding of TCM syndromes and allowing for more accurate judgments.
Experimental results on the TCM-SD dataset show that DCKA achieves superior performance in terms of prediction accuracy and F1 score, significantly surpassing baseline models. Each module of DCKA also demonstrates consistent improvements across metrics, highlighting the effectiveness of its design.
Beyond its practical contribution, this study also offers a new perspective for applying artificial intelligence to TCM syndrome differentiation. By emphasizing the integration of domain-specific knowledge and data-driven learning, DCKA sheds light on how modern machine learning methods can be tailored for TCM applications.
Nonetheless, the study has limitations, particularly in addressing the long-tail problem in the TCM-SD dataset. Future work will explore methods to enhance DCKA’s ability to recognize rare syndrome samples, paving the way for more comprehensive and robust diagnostic systems in TCM.
Data availability
TCM-SD can be found at https://github.com/borororo/zy-bert.
References
Zhang, S., Wang, W., Pi, X., He, Z. & Liu, H. Advances in the application of traditional Chinese medicine using artificial intelligence: A review. Am. J. Chin. Med. 51, 1067–1083 (2023).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Mucheng, R. et al. Tcm-sd: A benchmark for probing syndrome differentiation via natural language processing. In Proceedings of the 21st Chinese National Conference on Computational Linguistics, 908–920 (2022).
Koroteev, M. V. Bert: A review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943 (2021).
Hu, C., Zhang, S., Gu, T., Yan, Z. & Jiang, J. Multi-task joint learning model for Chinese word segmentation and syndrome differentiation in traditional Chinese medicine. Int. J. Environ. Res. Public Health 19, 5601 (2022).
Yao, T., Zhai, Z. & Gao, B. Text classification model based on fasttext. In 2020 IEEE International Conference on Artificial Intelligence and Information Systems (ICAIIS), 154–157 (IEEE, 2020).
Ye, Q., Yang, R., Cheng, C.-L., Peng, L. & Lan, Y. Combining the external medical knowledge graph embedding to improve the performance of syndrome differentiation model. Evid. Based Complement. Altern. Med. 2023, 2088698 (2023).
Li, X. et al. TLDA: A transfer learning based dual-augmentation strategy for traditional Chinese medicine syndrome differentiation in rare disease. Comput. Biol. Med. 169, 107808 (2024).
Yao, L., Jin, Z., Mao, C., Zhang, Y. & Luo, Y. Traditional Chinese medicine clinical records classification with BERT and domain specific corpora. J. Am. Med. Inform. Assoc. 26, 1632–1636 (2019).
Zhao, Z. et al. Thinking the importance of patient’s chief complaint in TCM syndrome differentiation. In 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 1534–1539 (IEEE, 2024).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988 (2017).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Zhang, T. & You, F. Research on short text classification based on textcnn. In Journal of Physics: Conference Series, Vol. 1757, 012092 (IOP Publishing, 2021).
Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751 (2014).
Chen, X., Cong, P. & Lv, S. A long-text classification method of Chinese news based on BERT and CNN. IEEE Access 10, 34046–34057 (2022).
Shen, K. et al. A study on relu and softmax in transformer. arXiv preprint arXiv:2302.06461 (2023).
Deng, J., Cheng, L. & Wang, Z. Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput. Speech Lang. 68, 101182 (2021).
Min, B. et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 56, 1–40 (2023).
Alammary, A. S. Bert models for Arabic text classification: A systematic review. Appl. Sci. 12, 5720 (2022).
Dou, G., Zhao, K., Guo, M. & Mou, J. Memristor-based LSTM network for text classification. Fractals 31, 2340040 (2023).
Liang, M. & Niu, T. Research on text classification techniques based on improved TF-IDF algorithm and LSTM inputs. Procedia Comput. Sci. 208, 460–470 (2022).
Liu, M., Liu, L., Cao, J. & Du, Q. Co-attention network with label embedding for text classification. Neurocomputing 471, 61–69 (2022).
Chen, J., Hu, Y., Liu, J., Xiao, Y. & Jiang, H. Deep short text classification with knowledge powered attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 6252–6259 (2019).
Gasparetto, A., Marcuzzo, M., Zangari, A. & Albarelli, A. A survey on text classification algorithms: From text to predictions. Information 13, 83 (2022).
Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (2020).
Li, Q. et al. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 13, 1–41 (2022).
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
Lan, Z. et al. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR arXiv preprint arXiv:1909.11942 (2019).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642 (Association for Computational Linguistics, Seattle, Washington, USA, 2013).
Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34, 1–47 (2002).
Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, 137–142 (Springer, 1998).
Lan, Z. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
Yang, Z. et al. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (Curran Associates Inc., 2019).
Lin, Y. et al. Bertgcn: Transductive text classification by combining gnn and bert. In Findings of the Association for Computational Linguistics: ACL-IJCNLP, Vol. 2021, 1456–1462, (2021).
Acknowledgements
This work was supported by the Research Fund Project of Guangxi Minzu Normal University under Grant 2022SP006; in part by the Research Fund Project of Guangxi Minzu Normal University under Grant 2024YB124; and also by the Guangxi Higher Education Institutions’ Young and Middle-Aged Teachers’ Basic Research Capability Enhancement Project under Grant 2025KY0928.
Author information
Authors and Affiliations
Contributions
The authors confirm their contribution to the paper as follows: study conception and design by B.L. and J.M.; data collection by H.H., X.L. and J.Z.; analysis and interpretation of results and draft manuscript preparation by B.L., H.H. and J.M. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Terminology explanation
- Syndrome Differentiation: The process in TCM diagnosis where a patient's syndrome type is classified based on diagnostic information such as chief complaints, symptoms, medical history, and diagnostic methods.
- Chief Complaint: The main symptoms or discomfort described by the patient, which serve as the primary evidence for diagnosis.
- Medical History: A record of a patient's past and current health-related information, typically gathered by healthcare professionals to understand the patient's health status and guide diagnosis and treatment.
- Four Diagnostic Information: The fundamental diagnostic techniques in TCM, which include:
  1. Inspection: Visual observation of the patient's physical appearance, including tongue and complexion.
  2. Listening and Smelling: Aural and olfactory diagnostics, such as analyzing the patient's voice and breath odor.
  3. Inquiry: Collecting the patient's medical history through questioning.
  4. Palpation: Physical examination, including pulse diagnosis and touch-based inspection.
- TCM-Specific Terminology:
  1. Damp-Heat with Toxin Accumulation Syndrome: Characterized by the retention of dampness and heat in the body, often combined with toxic factors. It typically manifests as swelling, redness, heat, pain, and sometimes suppuration in the affected area, often seen in skin infections or inflammatory conditions.
  2. Spleen and Stomach Qi Deficiency Syndrome: A condition where the spleen and stomach fail to perform their functions of digestion and transportation due to a lack of qi (vital energy). Symptoms include poor appetite, fatigue, loose stools, abdominal bloating, and a pale tongue.
  3. Phlegm Turbidity Obstructing the Mind Syndrome: Occurs when phlegm and turbidity obstruct the heart orifices, impairing the mind. Symptoms include mental confusion, lethargy, dizziness, nausea, and a greasy tongue coating, often associated with severe illnesses such as stroke or epilepsy.
  4. Yin Deficiency with Blood Heat Syndrome: Caused by a deficiency of yin (the cooling, nourishing aspect of the body) leading to internal heat and blood overheating. Symptoms include a dry mouth, irritability, hot sensations, night sweats, insomnia, and a red tongue with little coating.
  5. Heart Deficiency with Gallbladder Timidity Syndrome: A condition where the heart is deficient in qi or blood, leading to timidity and lack of courage, often affecting the gallbladder's ability to make decisive actions. Symptoms include palpitations, fearfulness, insomnia, and susceptibility to fright.
  6. Damp-Heat Pouring Down Syndrome: Results from damp-heat accumulation in the lower part of the body, often affecting the urinary or reproductive systems. Common symptoms include lower abdominal pain, difficult or painful urination, foul-smelling discharge, swelling or redness in the genital area, and a yellow greasy tongue coating. It is often associated with conditions like urinary tract infections or certain gynecological disorders.
  7. Qi Deficiency with Blood Stasis Syndrome: Occurs when insufficient qi (vital energy) fails to promote blood circulation, leading to blood stasis. Symptoms include fatigue, pale complexion, pain that is fixed and stabbing, a pale or dark tongue with purple spots, and a weak pulse. It is commonly seen in chronic illnesses or after prolonged fatigue.
- ZY-BERT: A pre-trained language model tailored for TCM texts. It extends the RoBERTa framework by incorporating a large-scale corpus of TCM-related documents, enabling it to generate high-quality textual representations for syndrome differentiation tasks.
- Dual-Channel Knowledge Attention Network (DCKA): The proposed model in this paper, which integrates CNN and BiLSTM networks to extract local and global features from TCM texts while leveraging a knowledge-attention mechanism for incorporating syndrome-specific knowledge.
- TCM-SD: A large-scale, publicly available dataset for syndrome differentiation in TCM.
Appendix 2: DCKA for text classification
To further validate the generalization capability of DCKA, Table 7 presents its performance on other text classification tasks. To facilitate comparisons with related models on English datasets, the English version of RoBERTa is employed as the language model for DCKA.
As shown in Table 7, DCKA outperforms other BERT-based models, achieving the highest accuracy on both the R8 and Ohsumed datasets. This improvement can be attributed to the knowledge-aware attention mechanism, which enhances the model’s ability to handle domain-specific datasets. The experimental results in Table 7 demonstrate that DCKA exhibits superior generalization capabilities.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, B., Huang, H., Liu, X. et al. Dual-channel knowledge attention for traditional Chinese medicine syndrome differentiation. Sci Rep 15, 13487 (2025). https://doi.org/10.1038/s41598-025-96404-w