Introduction

Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive scarring lung disease of complex etiology and the most common and severe form of idiopathic interstitial pneumonia1. Its cause remains unclear and its course unpredictable2, and it is characterized by frequent misdiagnosis, high mortality, a high recurrence rate and poor prognosis3. At present there is no uniform assessment method or diagnostic standard for the disease, which places a heavy burden on patients and society4; IPF has become a major respiratory disease threatening public health. It is therefore urgent to advance the standardized diagnosis and treatment of IPF.

Traditional Chinese medicine (TCM) embodies the accumulated wisdom of the Chinese nation and is widely recognized as an effective means of treating disease owing to its unique methodological system, rich diagnostic and therapeutic techniques, low incidence of side effects and good efficacy5. TCM data are characterized by nonlinearity, ambiguity, unstructuredness and multidimensionality6. As computer technology advances, machine learning algorithms that extract and learn features from complex data are being applied to TCM with increasing frequency. A large number of studies7,8,9,10 indicate that back propagation (BP) neural network models are suitable for syndrome classification. However, the BP neural network has well-known limitations: slow convergence, difficulty in guaranteeing generalization ability, a tendency to fall into local minima, and strong dependence on the initial weights11,12. These drawbacks can significantly reduce classification accuracy. The Levenberg–Marquardt (LM) algorithm is a modification of the BP training procedure that effectively addresses the slow convergence and weak generalization associated with the standard BP neural network13. The genetic algorithm (GA) is an optimization method based on parallel random search; it performs a global search that prevents the BP neural network from becoming trapped in local optima14 and optimizes the network's initial weights and thresholds to further enhance model performance.

The incidence and prevalence of IPF are rising year by year15, so the study of TCM-assisted diagnosis and treatment of IPF has important practical value. While previous studies have applied machine learning to TCM diagnosis, most have prioritized classification accuracy over model interpretability. In contrast, this study introduces the mean impact value (MIV) algorithm for syndrome-specific feature screening, achieving a transparent mapping between symptoms and syndromes. The proposed MIV-GA-LM-BP (MGLB) model is built on 956 real TCM cases of IPF patients, ensuring clinical relevance and practicality. The study therefore offers a useful reference in both methodological integration and practical application. As shown in Fig. 1, the experiment uses effective medical-case data on the TCM treatment of IPF and combines several machine learning algorithms to explore the rules linking symptoms to syndromes; the MIV algorithm screens the key influencing factors among the symptom features, avoiding the redundancy introduced by raw input features. The resulting GA-LM-BP neural network model, with better fit and higher prediction accuracy, provides more informative results for TCM syndrome classification of IPF and helps TCM exert its strengths in preventing and treating major and difficult diseases. These contributions represent not only an application of machine learning to TCM syndrome classification but also an integration tailored for interpretability and real-world deployment, laying a foundation for future intelligent TCM diagnostic systems.

Fig. 1. Structure of this paper.

Related works

Syndrome differentiation in TCM is a diagnostic reasoning process, guided by TCM theory, that clarifies the nature of a disease and reaches a judgement on the basis of the four diagnostic methods16. In recent years, AI classification methods have proved well suited to nonlinear, complex and fuzzy TCM data, and the use of AI to assist TCM syndrome differentiation and diagnosis has become a research theme of great significance17.

Machine learning models in TCM syndrome classification

Current research on intelligent syndrome differentiation in TCM mainly employs decision trees (DT), random forests (RF), Extreme Gradient Boosting (XGBoost), support vector machines (SVM), K-nearest neighbors (KNN), and similar methods18,19,20,21,22,23,24,25,26,27,28,29. Each of these algorithms has its strengths in handling high-dimensional data, datasets of different sizes, and complex nonlinear relationships. Most studies focus on data processing and model construction, typically comparing multiple algorithms to assess a single algorithm's ability to predict syndromes. Previous work on intelligent syndrome differentiation has paid little attention to feature screening and optimization algorithms, and even less has incorporated the reasoning of treatment based on syndrome differentiation, resulting in models with poor interpretability and limited real-world applicability. Clinical application must follow the principle of treatment based on syndrome differentiation, exploit the interpretive role of feature screening, and establish a closed loop, so that the model is interpretable in a clinically meaningful way and genuinely supports clinical decisions. Most existing works, however, focus solely on prediction accuracy and disregard model interpretability and alignment with TCM diagnostic logic.

BP neural networks in TCM applications

Although existing intelligent TCM syndrome differentiation models have achieved good results, they still require optimization and continued in-depth research to sustain the rapid modernization of TCM. The BP neural network is one of the most widely used neural network models30, and many studies have shown it to be well suited to TCM syndrome classification9,10,31. Shenghao Yang et al.32 proposed a BP neural network-based syndrome classification model for chronic atrophic gastritis that adopts correlation-based feature selection and improves BP weight initialization using a Gaussian distribution, achieving good results. Despite these successes, standard BP networks suffer from slow convergence, weak generalization ability, and sensitivity to initial weights and thresholds, which can lead to unstable predictions and limited practical use in clinical settings.

Optimization strategies for BP neural networks

To address the limitations of the standard BP network, researchers have proposed various optimization strategies. Ye Wang et al.33 found that optimizing a BP neural network with the artificial bee colony (ABC) algorithm can uncover the mapping between TCM symptoms and syndromes and greatly improves the accuracy of syndrome differentiation, enabling computer-assisted TCM diagnosis and demonstrating the feasibility of the ABC-BP algorithm for this task. Zhang Mingqi34 found that after ensemble learning was integrated with a BP neural network, the evaluation indexes of a TCM syndrome prediction model for liver cirrhosis reached their most balanced state with excellent performance. All of these methods apply standard or improved BP neural networks to TCM, but they neglect the principles of diagnosis and treatment and the interpretability of the model. Real-world diagnosis and treatment emphasize comprehensive judgement based on individual patient differences, whereas the symptom-syndrome connections in a neural network are determined by learned connection weights and probabilities, which is not the reasoning of TCM syndrome differentiation. Modelling therefore cannot rely solely on the automated processing of an algorithm; it must fully reflect the core thinking of TCM and follow the principle of diagnosis and treatment, determining weights according to the actual clinical correlation between symptoms and syndromes to ensure the model's clinical applicability and effectiveness.

Current research has made some progress in using AI to assist TCM syndrome differentiation, but model interpretability and precise symptom-syndrome relationships still require deeper exploration. Interpretability is an essential condition for AI to be trusted and used in TCM clinical diagnosis and treatment, yet previous intelligent syndrome differentiation research has mainly studied the statistical relationship between symptoms and syndromes35. Many researchers have implemented TCM syndrome classification models, but classification methods that highlight the symptoms of key diagnostic significance for a specific syndrome are still lacking. Because models such as neural networks are not inherently interpretable and cannot explain the symptom-syndrome relationship from the perspective of TCM diagnostic knowledge, computing the exact value of the "symptoms-syndromes" relationship to enhance model interpretability is a central focus of intelligent syndrome differentiation research. Too many redundant and irrelevant features in the original data directly degrade the performance of machine learning algorithms; feature screening can remove them and improve classification accuracy. TCM syndrome differentiation attends to individual differences and dynamic changes and involves a large amount of complex information, so selecting the specific symptoms relevant to diagnosis is crucial for building a syndrome classification model.

In this paper, in response to the shortcomings of the standard BP neural network, we explore the application of several optimization algorithms to the intelligent TCM syndrome classification model to improve its prediction accuracy, combining the global search ability of GA and particle swarm optimization (PSO) to overcome the network's typical practical weaknesses. In addition, applying grey relational analysis (GRA), principal component analysis (PCA), and MIV feature dimensionality reduction to the complex, multidimensional TCM symptom data enhances trust in the model's predictions and improves its reliability.

Data preprocessing

Data source

Data quality directly determines the upper limit of a model, and because available data are limited, most existing work relies heavily on datasets not directly derived from clinical cases. The data for this study come from the literature (open literature and medical monographs) and from clinical data supported by the National Key R&D Program of China (Grant No. 2018YFC1704104). Literature data were retrieved from the VIP, Wanfang, and CNKI databases. Cross-searches were conducted using keywords such as "IPF", "medical cases", "experience", "famous senior TCM practitioners", and "National Master of TCM". The search covered the period from the establishment of each database to September 2022. Monograph data were collected by manually consulting national, provincial, and municipal contemporary medical monographs on IPF diagnosis and treatment by famous TCM experts, such as The Collection of Academic Thoughts and Clinical Experience of Du Yumao, Clinical Experience of Senior Physician Gao Yimin, and Medical Theories and Cases of National Master Hong Guangxiang. Clinical data were collected from outpatient follow-up cases of famous TCM experts in Southwest China. A total of 956 cases were screened according to the inclusion and exclusion criteria in Table 1.

Table 1 The inclusion and exclusion criteria.

Syndrome types were determined according to the Diagnostic Criteria for TCM Symptoms of Idiopathic Pulmonary Fibrosis (2019 version)36. The criteria were formulated by the Internal Medicine Branch of CACM and the Lung Disease Branch of CMAM by integrating the results of statistical analysis, artificial neural networks, and the Delphi method applied to the medical-case data; after expert discussion combined with clinical practice, eight syndrome types were finally standardized. They are: 249 cases of syndrome of lung qi deficiency complicated with phlegm and stasis obstructing the collaterals, 151 cases of syndrome of lung qi deficiency complicated with accumulation and binding of phlegm and heat, 136 cases of syndrome of qi deficiency in the lung and kidney complicated with phlegm and stasis obstructing the collaterals, 132 cases of syndrome of lung dryness with yin deficiency, 115 cases of syndrome of lung qi deficiency, 108 cases of syndrome of qi deficiency in the lung and kidney, 61 cases of syndrome of lung qi deficiency complicated with turbid phlegm obstructing the lung, and 4 cases of syndrome of liver qi invading the lung.

Standardized research is the basis for accurate syndrome identification37. In this paper, different expressions of the same symptom were unified into a single description. Descriptions without statistical significance, such as good appetite, good sleep, normal bowel movement, and normal urination, were deleted. We finally obtained 267 symptoms with a total frequency of 11,034, divided into common symptoms and four diagnostic symptoms. The common symptoms comprise 224 symptoms, such as gasping, cough with little phlegm, and white phlegm, with a total frequency of 6,478; the four diagnostic symptoms comprise 43 symptoms, such as white coating, thin coating, and thready pulse (26 tongue types and 17 pulse types), with a total frequency of 4,556.

Data encoding

After screening and normalization, the common symptoms and four diagnostic symptoms were coded. The top 20 common symptoms and the top 20 four-diagnostic symptoms, ranked by frequency, were counted separately, as shown in Table 2.

Table 2 Symptom frequency statistics (top 20).

To facilitate computer recognition, common symptoms are coded "0" for absence and "1" for presence. Symptom degrees of light, medium, and heavy are coded as 1, 2, and 3; for example, dry cough is divided into mild, moderate, and severe dry cough, merged into a single dry-cough column, and coded as 1, 2, 3. Symptoms of the same category are coded 1, 2, 3, etc. in descending order of frequency; for example, sticky phlegm, frothy phlegm, clear thin phlegm, etc. are merged into one phlegm-nature column and coded 1, 2, 3, etc. by descending frequency. The four diagnostic symptoms are coded in eight data columns: tongue color, tongue with stasis, form of the tongue, coating texture, coating color, pulse 1, pulse 2, and pulse 3. For symptoms of the same type, each column is coded starting at 1 and increasing by 1 until all symptoms of that type have been coded. When two symptoms in a column occur together, coding starts at 1.5 and increases by 1 unit in sequence: if the coating-nature column contains thin, greasy, thick, and scanty coating coded 1, 2, 3, 4, then 1.5 indicates the presence of both thin and greasy coating, 2.5 the presence of both greasy and thick coating, and so forth.
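As a concrete illustration, the coding rules above can be sketched as follows. The column names, symptom lists, and the midpoint rule for co-occurring coatings are simplified assumptions for illustration, not the study's actual codebook:

```python
# Illustrative sketch of the encoding rules (hypothetical codebook, not the study's).

def encode_presence(has_symptom):
    """Common symptoms: 0 = absent, 1 = present."""
    return 1 if has_symptom else 0

# Severity of a graded symptom such as dry cough: light/medium/heavy -> 1/2/3.
SEVERITY = {"mild": 1, "moderate": 2, "severe": 3}

# Coating-nature codes ordered as in the text: thin=1, greasy=2, thick=3, scanty=4.
COATING = {"thin": 1, "greasy": 2, "thick": 3, "scanty": 4}

def encode_coating(observed):
    """One coating -> its own code; two co-occurring coatings -> the half-step
    between their codes (e.g. thin + greasy -> 1.5), per the rule above."""
    codes = sorted(COATING[c] for c in observed)
    if len(codes) == 1:
        return codes[0]
    if len(codes) == 2:
        return sum(codes) / 2.0
    raise ValueError("rule defined for at most two co-occurring coatings")

print(encode_presence(True))               # 1
print(SEVERITY["moderate"])                # 2
print(encode_coating(["thin", "greasy"]))  # 1.5
```

The half-step codes keep a combined observation ordered between its two constituent codes in a single numeric column.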

The dataset coded under these rules contains 76 symptom feature dimensions. Since insufficient data can adversely affect the accuracy of neural network training, the first 6 syndrome types (each with more than 100 cases), totaling 891 records, were selected for the experiment. The syndrome labels were one-hot encoded: syndrome of lung qi deficiency complicated with phlegm and stasis obstructing the collaterals as [1, 0, 0, 0, 0, 0], syndrome of lung qi deficiency complicated with accumulation and binding of phlegm and heat as [0, 1, 0, 0, 0, 0], syndrome of qi deficiency in the lung and kidney complicated with phlegm and stasis obstructing the collaterals as [0, 0, 1, 0, 0, 0], syndrome of lung dryness with yin deficiency as [0, 0, 0, 1, 0, 0], syndrome of lung qi deficiency as [0, 0, 0, 0, 1, 0], and syndrome of qi deficiency in the lung and kidney as [0, 0, 0, 0, 0, 1]. Table 3 shows the dataset after coding.

Table 3 Encoded data sets.

Feature screening

The selection of influencing-factor features for the training samples shapes the algorithmic model38. Multidimensional features provide more comprehensive information, but irrelevant or unrepresentative features degrade prediction performance39. Before building the model it is therefore necessary to extract symptom features from the dataset and screen out the key influencing factors among them. The main feature selection approaches are filter, embedded, and wrapper methods, and common feature extraction algorithms include the Pearson correlation coefficient (PCC)40, kernel principal component analysis (KPCA)41, latent Dirichlet allocation (LDA)42, independent component analysis (ICA)43, GRA44, PCA45, and MIV46. The GRA and PCA algorithms are widely employed in TCM47,48,49. Pre-experiments in this study found that the MIV algorithm offers strong interpretability when applied to TCM syndrome classification. The experiment compares the performance of GRA, PCA, and MIV in reducing the dimensionality of the dataset's symptom features.

GRA

The GRA algorithm is a multi-factor correlation evaluation method with unique advantages for grey systems and problems with incomplete information, used to determine the relationship and degree of influence between factors. If a comparison sequence has a high correlation with the reference sequence, the corresponding factor has a strong influence on the target. The degree of correlation essentially measures the similarity between sequences50 and is expressed by the correlation coefficient \({\varphi }_{i}\left(k\right)\) defined in Eq. (1).

$$\varphi_{i}(k)=\frac{\min_{i}\min_{k}\left|x_{0}(k)-x_{i}(k)\right|+\rho\cdot \max_{i}\max_{k}\left|x_{0}(k)-x_{i}(k)\right|}{\left|x_{0}(k)-x_{i}(k)\right|+\rho\cdot \max_{i}\max_{k}\left|x_{0}(k)-x_{i}(k)\right|}$$
(1)

where \({\varphi }_{i}\left(k\right)\) is the correlation coefficient between the ith comparison sequence Xi and the reference sequence X0 at point k, \({x}_{0}\left(k\right)\) is the reference sequence (dependent variable), i.e., the syndrome column data, and \({x}_{i}\left(k\right)\) is the comparison sequence (independent variable), i.e., the symptom column data. The min and max operators denote the two-level minimum and maximum over i and k, and ρ is the distinguishing coefficient, taking a value between 0 and 1; ρ = 0.5 generally yields accurate correlation coefficients.

Because the pointwise correlation coefficients within each data column are too scattered for an overall comparison, the coefficients of each column are averaged to serve as the measure of correlation between the comparison series and the reference series51. The symptom correlation degree is calculated as shown in Eq. (2).

$$r_{i}=\frac{1}{n}\sum_{k=1}^{n}\varphi_{i}(k)$$
(2)

where ri is the degree of correlation and n is the total number of samples.

In this paper, the GRA method was applied to assess the correlation of symptom-influencing factors in the classification of TCM syndromes by processing and analyzing the relevant data. We constructed the correlation sequence and used GRA method to rank the influencing factors, and statistically counted the data whose symptom correlation value is greater than 0.967, as shown in Table 4.

Table 4 Symptom correlation value (correlation value \(\ge\) 0.967).

The correlation degree, between 0 and 1, indicates the similarity between each symptom and the syndrome; a higher value indicates a stronger, closer symptom-syndrome relationship. The correlation values of all symptoms were combined to rank them. Among the 76 symptoms, tongue color had the highest correlation (0.985), followed by coating color (0.983). To find the number of influencing factors that optimizes prediction accuracy, symptom factors with a correlation below 0.963 were excluded, leaving 55 symptom factors as inputs to the neural network model.
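Equations (1) and (2) can be sketched in a few lines. The data below are synthetic; note that some GRA variants normalize the sequences first, a step omitted here since coded symptom columns already share a comparable scale:

```python
import numpy as np

def grey_relational_degree(X, x0, rho=0.5):
    """Grey relational degree of each column of X to the reference sequence x0.

    X  : (n_samples, n_features) symptom matrix (comparison sequences)
    x0 : (n_samples,) syndrome reference sequence
    rho: distinguishing coefficient, typically 0.5
    """
    diff = np.abs(X - x0[:, None])       # |x0(k) - xi(k)| for every i and k
    dmin, dmax = diff.min(), diff.max()  # two-level min / max over i and k
    phi = (dmin + rho * dmax) / (diff + rho * dmax)  # Eq. (1)
    return phi.mean(axis=0)              # Eq. (2): average over k

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5)).astype(float)
x0 = X[:, 0]  # make feature 0 identical to the reference sequence
r = grey_relational_degree(X, x0)
print(r[0])  # a sequence identical to the reference attains the maximum degree 1.0
```

Features would then be ranked by `r` and those below the chosen threshold dropped.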

PCA

The PCA algorithm, a commonly used multivariate analysis method, maps the original data onto a new set of linearly independent composite variables by linear transformation. It aims to reduce the number of features while retaining as much of the information in the input data as possible52, simplifying complex problems and improving the efficiency of algorithms and model building. For dimensionality reduction, PCA projects high-dimensional data into a low-dimensional space by selecting the most informative principal components, which removes redundant features, improves the model's prediction accuracy, and mitigates the curse of dimensionality while making the data easier to analyze, visualize, and process. The steps for screening the symptom factors with PCA are as follows.

First, the symptom-influencing factors are standardized to eliminate the effect of differing scales on the comparison of variables, as shown in Eq. (3). Second, the sample covariance matrix is computed using Eq. (4), and its eigenvalues and corresponding eigenvectors are derived. The contribution rate of each principal component (its eigenvalue as a share of the eigenvalue sum) and the cumulative contribution rates are then calculated. The top 20 symptoms ranked by descending eigenvalue, with their contribution and cumulative contribution rates, are shown in Table 5.

Table 5 Symptom descending eigenvalues and contribution values (top 20).
$${x}_{i}^{*}=\frac{{x}_{i}-mean({X}_{i})}{\sigma ({X}_{i})}$$
(3)

Let the matrix of original sample points be X \(=\) (xij)n×m = (X1, X2, …, Xm), where Xi = (x1i, x2i, …, xni)T, i = 1, 2, …, m. Here \({X}_{i}\) is the ith influencing factor across the samples, \(mean({X}_{i})\) is its mean, and \(\sigma ({X}_{i})\) is its standard deviation.

$$\begin{aligned} & C_{x} = \left[ {\begin{array}{*{20}c} {cov\left( {x_{1}^{*} ,x_{1}^{*} } \right)} & {cov\left( {x_{1}^{*} ,x_{2}^{*} } \right)} & \cdots & {cov\left( {x_{1}^{*} ,x_{m}^{*} } \right)} \\ {cov\left( {x_{2}^{*} ,x_{1}^{*} } \right)} & {cov\left( {x_{2}^{*} ,x_{2}^{*} } \right)} & \cdots & {cov\left( {x_{2}^{*} ,x_{m}^{*} } \right)} \\ \vdots & \vdots & \ddots & \vdots \\ {cov\left( {x_{m}^{*} ,x_{1}^{*} } \right)} & {cov\left( {x_{m}^{*} ,x_{2}^{*} } \right)} & \cdots & {cov\left( {x_{m}^{*} ,x_{m}^{*} } \right)} \\ \end{array} } \right] \\ & cov\left( {x_{i}^{*} ,x_{j}^{*} } \right) = E\left\{ {\left[ {x_{i}^{*} - E\left( {X_{i}^{*} } \right)} \right]\left[ {x_{j}^{*} - E\left( {X_{j}^{*} } \right)} \right]} \right\} \\ \end{aligned}$$
(4)

where i, j = 1, 2, …, m; E(·) denotes the mathematical expectation, and \({X}_{i}^{*}\) denotes the ith influencing factor after standardization and normalization.

The number p of principal components is chosen so that their cumulative contribution reaches 85%, at which point the retained components preserve the original information well while simplifying the problem. The eigenvectors a = (a1, a2, …, ap)T of the first p principal components give the principal component expressions Fi = ai1\({X}_{1}^{*}\) + ai2\({X}_{2}^{*}\) + ··· + aim\({X}_{m}^{*}\), where ai = (ai1, ai2, …, aim), i = 1, 2, …, p. On this basis the first 51 components are selected as principal components; these 51 components describe the relationships among the 76 symptom-influencing factors and essentially summarize the 76 symptom variables. The screened principal components are linear combinations of the original symptom features. PCA thus reduces the 76 original symptom features to 51 principal components, which serve as inputs to the neural network model.
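The PCA screening steps above (standardize, form the covariance matrix, rank eigenvalues, keep components up to an 85% cumulative contribution) can be sketched as follows on synthetic data:

```python
import numpy as np

def pca_components_for_threshold(X, threshold=0.85):
    """Number of principal components whose cumulative contribution
    (eigenvalue share of the eigenvalue sum) first reaches the threshold."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # Eq. (3): standardize each factor
    C = np.cov(Xs, rowvar=False)                # Eq. (4): covariance matrix
    eigvals = np.linalg.eigvalsh(C)[::-1]       # eigenvalues, descending
    contrib = eigvals / eigvals.sum()           # contribution rates
    cum = np.cumsum(contrib)                    # cumulative contribution
    p = int(np.searchsorted(cum, threshold) + 1)
    return p, cum

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(200, 5))  # redundant copies of 5 features
p, cum = pca_components_for_threshold(X)
print(p)  # far fewer than 10 components carry 85% of the variance
```

Because the last five columns nearly duplicate the first five, only about half the components are needed to reach the threshold.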

MIV

The MIV algorithm, one of the better-regarded methods for evaluating the correlation of indicators in a neural network53, determines the importance of each input variable's influence on the output. Screening with MIV selects the variables with high influence and reduces the number of input variables for the BP neural network model, which helps increase training accuracy and reduce error. The steps for screening symptom-influencing factors with the MIV algorithm are as follows.

A matrix of symptom-influencing factors Xm \(\times\) n is constructed as the original input, with m rows for patient cases and n columns for symptom features. Each symptom-influencing factor is in turn increased and decreased by δ (10% in this paper), yielding two new input matrices X1 and X2, as shown in Eq. (5). X1 and X2 are fed into the constructed BP neural network model, producing two outputs Y1 and Y2. The MIV value of each symptom-influencing factor for each syndrome type is then calculated as shown in Eq. (6). The absolute MIV value reflects the relative importance of an input indicator to the output; ranking by absolute value, and computing each factor's contribution and cumulative contribution rate per syndrome, gives the relative importance ranking of each symptom for each syndrome type. The contribution Ci of a symptom-influencing factor is calculated as shown in Eq. (7).

$$X_{m \times n} = \left[ {\begin{array}{*{20}c} {x_{11} } & {x_{12} } & \cdots & {x_{1n} } \\ {x_{21} } & {x_{22} } & \cdots & {x_{2n} } \\ \vdots & \vdots & \ddots & \vdots \\ {x_{m1} } & {x_{m2} } & \cdots & {x_{mn} } \\ \end{array} } \right]\;\;\;X_{1} ,X_{2} = \left[ {\begin{array}{*{20}c} {x_{11} } & \cdots & {x_{1k} \left( {1 \pm \delta } \right)} & \cdots & {x_{1n} } \\ {x_{21} } & \cdots & {x_{2k} \left( {1 \pm \delta } \right)} & \cdots & {x_{2n} } \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ {x_{m1} } & \cdots & {x_{mk} \left( {1 \pm \delta } \right)} & \cdots & {x_{mn} } \\ \end{array} } \right]$$
(5)
$$MIV_{i} = \frac{1}{m}\mathop \sum \limits_{j = 1}^{m} \left( {Y_{1j} - Y_{2j} } \right)$$
(6)
$$C_{i} = \frac{{|MIV_{i} |}}{{\mathop \sum \nolimits_{i = 1}^{n} \left| {MIV_{i} } \right|}} \times 100{{\% }}$$
(7)

After processing the dataset with the MIV algorithm, the MIV values corresponding to the symptoms for each syndrome and their contribution are obtained and shown in Table 6.

Table 6 MIV values and their contribution to the symptoms corresponding to each syndrome.

For each syndrome type, symptoms are retained (coded) while their cumulative contribution remains below 95 percent, and all other symptoms are coded as 0. If a symptom's column is all 0, that column is deleted, as are symptom columns with a frequency below 10. After processing by the MIV algorithm, the symptom feature dimension is 55, and the neural network model takes this 55-dimensional symptom dataset as input.
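The MIV procedure of Eqs. (5)-(7) can be sketched as follows. The network, synthetic data, and δ = 10% perturbation are illustrative stand-ins for the study's BP model; scikit-learn's `MLPRegressor` is used here only as a convenient BP-style network:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def mean_impact_values(model, X, delta=0.10):
    """MIV of each input feature: perturb it by +/-delta and average the
    resulting output shift over the samples (Eqs. 5-6)."""
    miv = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        X_up, X_dn = X.copy(), X.copy()
        X_up[:, i] *= (1 + delta)   # the k-th column scaled by (1 + delta)
        X_dn[:, i] *= (1 - delta)   # the k-th column scaled by (1 - delta)
        miv[i] = np.mean(model.predict(X_up) - model.predict(X_dn))
    return miv

rng = np.random.default_rng(2)
X = rng.uniform(1, 3, size=(300, 4))
y = 5 * X[:, 0] + rng.normal(0, 0.01, 300)  # only feature 0 drives the output
net = MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                   max_iter=2000, random_state=0).fit(X, y)
miv = mean_impact_values(net, X)
contrib = np.abs(miv) / np.abs(miv).sum() * 100  # Eq. (7): contribution in percent
print(contrib.argmax())  # the truly influential feature dominates the ranking
```

Features would then be kept in descending contribution order until the 95% cumulative cutoff.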

GA-LM-BP model construction

BP

The BP neural network is a popular multilayer feed-forward network54 and one of the most commonly used neural network models55. Its main idea is to feed in data samples and, through continuous learning and feedback along the direction of gradient descent, repeatedly adjust the network's weights and thresholds to minimize the sum of squared errors, so that the output gradually approaches the desired value. The BP neural network can be regarded as a supervised, self-learning training model. Its training structure has three main layers: input, hidden, and output. The schematic structure of a typical three-layer BP neural network is shown in Fig. 2.

Fig. 2. Schematic structure of the BP neural network.

The number of input-layer neurons generally equals the number of sample features, and the number of output-layer neurons equals the number of output categories. The number of hidden-layer neurons is determined by the empirical formula \(\sqrt{m+n}+a\), where m is the number of input-layer neurons, n the number of output-layer neurons, and a a constant between 1 and 10. Because the BP neural network uses gradient descent, it is prone to local minima and slow convergence; these drawbacks are mitigated by the optimization algorithms introduced in the following subsections. The layers are connected by weight matrices, and because the network's initial weights and thresholds are generated randomly, the evaluation results are uncertain; in this paper the initial weights and thresholds are likewise determined by these optimization algorithms.
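As an illustration, the hidden-layer sizing formula and a basic BP-style classifier might look as follows. The values of m, n, and a, the rounding of the square root, and the synthetic dataset are assumptions for demonstration only:

```python
import math
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Hidden-layer size from the empirical formula sqrt(m + n) + a, a in [1, 10]
# (rounding to the nearest integer is an assumption; a is chosen by trial).
m, n, a = 55, 6, 5            # 55 symptom inputs, 6 syndrome classes
s = round(math.sqrt(m + n)) + a
print(s)                      # 13 for these values

# A standard feed-forward classifier with one hidden layer of s neurons,
# trained on synthetic data (the study's 891-case dataset is not reproduced).
X, y = make_classification(n_samples=400, n_features=55, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(s,), max_iter=1000,
                    random_state=0).fit(X, y)
print(net.score(X, y))        # training accuracy well above chance
```

This mirrors the 55-input, 6-output topology used later in the paper.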

LM

The LM algorithm is an optimization algorithm for nonlinear least squares problems56. Mathematically, the algorithm is known as an improved version of the Gauss–Newton method, which provides more stable and efficient solutions to nonlinear least squares problems. The main idea is to continuously adjust the model parameters through an iterative process to minimize the residuals between the fitted function and the observed data.

The LM algorithm incorporates a damping factor into Newton’s method, combining the strengths of Newton’s method and the gradient descent method57: depending on the value of the damping factor, it exhibits the behavior of one method or the other. The iterative formula in the optimization process is shown in Eq. (8).

$${u}_{k+1}={u}_{k}-{[{J}^{T}\left({u}_{k}\right)J\left({u}_{k}\right)+\mu I]}^{-1}{J}^{T}\left({u}_{k}\right)e\left({u}_{k}\right)$$
(8)

where \({u}_{k}\) is the control input sequence at the kth iteration; J denotes the Jacobian matrix; I is the identity matrix; μ denotes the damping factor, a constant greater than zero. When \(\mu\) increases, the algorithm approaches the gradient descent method, with slow learning speed but global convergence; when \(\mu\) decreases, it approaches Newton’s method, with fast learning speed but only local convergence. \(e\left({u}_{k}\right)\) is the resulting error.

The LM algorithm converges quickly, albeit locally, with good stability58. It is one of the most recommended training functions in MATLAB, and introducing it into the BP neural network as an optimization algorithm can effectively improve convergence efficiency and learning rate while significantly mitigating the risk of getting stuck in local minima59. The adjustment formula for the weights and thresholds of the network is shown in Eq. (9).

$${W}_{(k+1)}={W}_{(k)}-{[{J}^{T}\left(X\right)J\left(X\right)+\mu I]}^{-1}{J}^{T}\left(X\right)e\left(X\right)$$
(9)

where X denotes the input sample of the BP neural network and W(k) denotes the weight vector at the kth iteration. The network output is obtained by training on the sample dataset; the weights and thresholds are then updated based on the error between the network output and the desired output, yielding the new vector W(k + 1).
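The update in Eqs. (8)–(9) can be written compactly. The following is a minimal numpy sketch, not the MATLAB `trainlm` routine; the toy linear fitting problem and the fixed damping factor are assumptions for illustration.

```python
import numpy as np

def lm_step(params, residual_fn, jac_fn, mu):
    """One LM update: w_new = w - (J^T J + mu*I)^(-1) J^T e, as in Eq. (9)."""
    e = residual_fn(params)
    J = jac_fn(params)
    H = J.T @ J + mu * np.eye(len(params))  # damped Gauss-Newton approximation
    return params - np.linalg.solve(H, J.T @ e)

# Toy example (assumed): fit y = a*x + b generated with a = 2, b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
residual = lambda w: w[0] * x + w[1] - y                      # e(w)
jacobian = lambda w: np.stack([x, np.ones_like(x)], axis=1)   # de/dw

w = np.zeros(2)
for _ in range(50):
    w = lm_step(w, residual, jacobian, mu=1e-3)
print(w)  # converges to approximately [2.0, 1.0]
```

A small fixed μ keeps the step close to Gauss–Newton here; in practice μ is adapted between iterations, growing when a step fails and shrinking when it succeeds.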

GA

GA is a heuristic algorithm that simulates biological heredity and evolution, with good global search ability60. The algorithm encodes each candidate solution as a “chromosome” and evaluates its quality with a fitness function. When GA is used to optimize the BP neural network, real-number coding is used to generate the initial population, and the chromosome length is defined as m \(\times\) s + s + n \(\times\) s + n, where m denotes the input vector dimension, s the number of neurons in the hidden layer, and n the output vector dimension. Individual fitness is calculated from the test error: individuals with higher fitness values have a greater chance of being selected. The fitness function, given in Eq. (10), is the reciprocal of the sum of absolute errors. Through operations such as selection, crossover and mutation, better-adapted individuals are retained; after repeated iterations, the optimized weights and thresholds are passed to the BP neural network for training and subsequent prediction.

$$F=\frac{1}{\sum_{i=1}^{m}|{y}_{i}-{o}_{i}|}$$
(10)

where F is the fitness function; m is the number of training samples; yi and oi are the predicted output and true output of the ith training sample, respectively.
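The chromosome length and the fitness of Eq. (10) are easy to sketch. This is an illustrative Python fragment, not the paper's MATLAB code; the topology values plugged in below come from the "55-15-6" structure reported later.

```python
import numpy as np

def chromosome_length(m, s, n):
    """Real-coded chromosome length m*s + s + n*s + n: all connection
    weights plus all hidden- and output-layer thresholds."""
    return m * s + s + n * s + n

def fitness(y_true, y_pred):
    """Eq. (10): reciprocal of the summed absolute error; larger is fitter."""
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)).sum()
    return 1.0 / err if err > 0 else np.inf

# For the "55-15-6" topology each chromosome encodes 936 real numbers.
print(chromosome_length(55, 15, 6))  # 936
```

Note that the reciprocal form makes fitness undefined at zero error, which is why a perfect individual is mapped to infinity in this sketch.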

In this paper, the GA-LM-BP neural network model is trained on the MATLAB simulation platform; the flow chart of model construction is shown in Fig. 3. The model parameters are selected as follows. The number of neurons in the input layer is 76, corresponding to the number of symptom features. The number of neurons in the hidden layer is 15, the optimal value according to the empirical formula s = \(\sqrt{m+n}+a\), where m = 55 (input feature dimensions of the screened dataset), n = 6 (output classes) and a \(\in\) [1, 10]. The training set consists of 80% of the samples, with 20% reserved for testing; this ratio is widely used in similar studies to ensure sufficient training while maintaining reliable validation. Other hyperparameters, such as the maximum number of training epochs (1000), the learning rate (0.01), and the training goal (1e-5), were selected based on default recommendations in MATLAB’s Neural Network Toolbox and validated through early-stage experiments. GA parameters, including the population size (100), crossover rate (0.6), and mutation rate (0.08), were likewise based on standard practice in evolutionary computation and verified through multiple trials.
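One GA generation with the stated parameters (population 100, crossover rate 0.6, mutation rate 0.08) can be sketched as below. This is a simplified Python illustration; the selection, crossover and mutation operators shown (roulette selection, arithmetic crossover, Gaussian mutation) are common choices and assumptions here, and the toy fitness stands in for the network's test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_generation(pop, fitness_fn, pc=0.6, pm=0.08, scale=0.1):
    """One GA generation: fitness-proportional (roulette) selection,
    arithmetic crossover with probability pc, and Gaussian mutation
    with probability pm per gene."""
    fit = np.array([fitness_fn(ind) for ind in pop])
    idx = rng.choice(len(pop), size=len(pop), p=fit / fit.sum())  # selection
    new = pop[idx].copy()
    for i in range(0, len(new) - 1, 2):                           # crossover
        if rng.random() < pc:
            alpha = rng.random()
            a, b = new[i].copy(), new[i + 1].copy()
            new[i] = alpha * a + (1 - alpha) * b
            new[i + 1] = alpha * b + (1 - alpha) * a
    mask = rng.random(new.shape) < pm                             # mutation
    new[mask] += rng.normal(0.0, scale, size=mask.sum())
    return new

# Toy fitness (assumed): prefer chromosomes near zero, mimicking the
# "smaller error, higher fitness" shape of Eq. (10).
f = lambda ind: 1.0 / (1.0 + np.abs(ind).sum())
pop = rng.normal(0.0, 1.0, size=(100, 10))
before = max(f(ind) for ind in pop)
for _ in range(30):
    pop = ga_generation(pop, f)
after = max(f(ind) for ind in pop)
```

After the iterations converge, the fittest chromosome is decoded into the initial weights and thresholds of the BP network.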

Fig. 3 Flow chart of GA-LM-BP neural network model construction.

Results and discussion

Optimization algorithm control group

A review of the literature on algorithm optimization in machine learning61,62,63 shows that the PSO algorithm is often used in controlled experiments alongside GA. The PSO algorithm possesses strong global search characteristics and can be used to optimize the weight and threshold parameters of the BP neural network to improve its convergence speed and prediction accuracy64. Accordingly, this experiment constructs three control groups, LM-BP, GA-LM-BP, and PSO-LM-BP, and keeps other parameters, such as the number of iterations and the number of neurons in the hidden layer, consistent across groups to ensure the validity of the prediction results.

The confusion matrix is a widely employed tool for evaluating the performance of a classification model. The numbers on its diagonal indicate the number of correctly predicted cases; a denser concentration of predicted values along the diagonal signifies superior model performance. Recall is the proportion of truly positive cases that the model predicts correctly: a higher recall indicates a higher probability that the syndrome type will be identified. Accuracy is the proportion of correctly predicted results among all samples, measuring how close the model's predictions are to the true values: higher accuracy indicates better classification. In statistics, the mean squared error (MSE) is used to measure the average of the squared errors between estimated and actual values65. In this study, the confusion matrix represents the difference between the model's predictions and the actual results, and recall, accuracy and mean squared error are used to judge classification quality.
The calculation of the three assessment indicators is shown in Eqs. (11)–(13), and the prediction results of each model are shown in Table 7.

$$recall=\frac{TP}{TP+FN}$$
(11)
$$accuracy=\frac{TP+TN}{TP+FP+FN+TN}$$
(12)
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{({y}_{i}-{y{^{\prime}}}_{i})}^{2}$$
(13)

where TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives and FN is the number of false negatives. \({y}_{i}\) and \({y{\prime}}_{i}\) are the true value and predicted value of sample i, respectively.
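Eqs. (11)–(13) translate directly into code. The following is a minimal Python sketch with hypothetical counts for a single syndrome class, purely for illustration.

```python
import numpy as np

def recall(tp, fn):
    """Eq. (11): recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Eq. (12): accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

def mse(y_true, y_pred):
    """Eq. (13): mean of squared differences between true and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical counts: 40 TP, 130 TN, 20 FP, 10 FN
print(recall(40, 10))             # 0.8
print(accuracy(40, 130, 20, 10))  # 0.85
```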

Table 7 Model prediction results statistics.

Comparative analysis of the above table shows that the optimization algorithms improve model prediction accuracy over the 30.94% achieved by the non-optimized model. The GA-LM-BP neural network model attains the highest prediction accuracy, 53.35%, implying superior predictive performance. As a result, the GA-LM-BP neural network model is chosen for the subsequent control experiments involving the dimensionality reduction algorithms.

Control group for dimensionality reduction algorithms

In the “Feature screening” section, three feature extraction algorithms were employed to reduce the dimensionality of the dataset: the GRA algorithm retains features with correlations of 0.963 and above, the PCA algorithm selects components until the cumulative contribution reaches 85%, and the MIV algorithm selects features until the cumulative contribution reaches 95%. Based on this screening, the GRA algorithm yields a GA-LM-BP topology of “55-15-6”, the PCA algorithm a topology of “51-15-6”, and the MIV algorithm a topology of “55-15-6”. To clarify the impact of these feature extraction methods on the prediction results, the model is re-trained and evaluated on the samples screened by each of the three methods, again dividing the training and test sets in a ratio of 8:2. The prediction results for the three feature screening methods are shown in Table 8.

Table 8 Feature screening prediction results statistics.
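The PCA selection rule used above, keeping components until the cumulative contribution reaches a threshold, can be sketched in a few lines. This is an illustrative Python fragment; the eigenvalue spectrum below is a toy example, not data from the study.

```python
import numpy as np

def n_components_for_threshold(eigvals, threshold=0.85):
    """Number of leading principal components whose cumulative contribution
    (explained-variance ratio) first reaches `threshold`."""
    ratios = np.sort(np.asarray(eigvals, float))[::-1] / np.sum(eigvals)
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

# Toy eigenvalue spectrum (assumed): contributions 0.6, 0.2, 0.1, 0.1,
# so three components are needed to reach the 85% threshold.
print(n_components_for_threshold([6.0, 2.0, 1.0, 1.0]))  # 3
```

The MIV screening uses the same cumulative-contribution idea with a 95% threshold, but ranks original symptom features rather than transformed components.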

Comparative analysis of the above table shows that after feature screening the model prediction accuracy improves on the 53.35% obtained before dimensionality reduction, suggesting that feature screening plays an important role in model optimization. The highest prediction accuracy, 81.22%, is achieved after dimensionality reduction with the MIV algorithm. The GRA algorithm, taking the symptom data as the comparison sequence and the syndrome data as the reference sequence, ranks symptoms by overall correlation in descending order, and therefore cannot isolate symptoms that correlate highly with individual syndrome types. The principal components produced by the PCA algorithm have feature dimensions whose meanings are somewhat ambiguous and less interpretable than the original sample features. The MIV algorithm has the best interpretability for multi-class problems: through feature extraction it identifies the symptoms contributing most to each syndrome type, which can then be screened as the distinctive symptom features of that type, helping the neural network model classify better. The structure of the MGLB model is shown in Fig. 4, and its topological network structure is “55-15-6”. The mean squared error is shown in Fig. 5, the prediction result in Fig. 6, the prediction model confusion matrix in Fig. 7, and the network correlation graph in Fig. 8.

Fig. 4 Training interface.

Fig. 5 Mean squared error.

Fig. 6 Prediction result.

Fig. 7 Confusion matrix.

Fig. 8 Network correlation.
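The MIV-based screening discussed above perturbs each symptom feature and measures the resulting change in model output. The following is a minimal Python sketch of that idea; the `predict` function here is a toy stand-in for a trained network's forward pass, and the data are synthetic.

```python
import numpy as np

def miv_ranking(predict, X, delta=0.1):
    """Mean Impact Value sketch: raise and lower each feature by ±delta
    (commonly 10%) and average the change in model output. Features with
    larger |MIV| contribute more to the prediction."""
    miv = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        up, down = X.copy(), X.copy()
        up[:, j] *= 1.0 + delta
        down[:, j] *= 1.0 - delta
        miv[j] = np.mean(predict(up) - predict(down))
    order = np.argsort(-np.abs(miv))  # most influential feature first
    return order, miv

# Toy "network" (assumed): output depends strongly on feature 0, weakly on
# feature 2, and not at all on feature 1.
predict = lambda X: 3.0 * X[:, 0] + 0.1 * X[:, 2]
X = np.abs(np.random.default_rng(1).normal(1.0, 0.2, size=(50, 3)))
order, miv = miv_ranking(predict, X)
print(order[0])  # feature 0 ranks first
```

Ranking features by |MIV| and keeping those whose cumulative contribution reaches 95% yields the screened symptom set used by the MGLB model.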

The above experiments used different optimization and dimensionality reduction algorithms to classify TCM syndromes, with the LM-BP, PSO-LM-BP, GA-LM-BP, GRA-GA-LM-BP, PCA-GA-LM-BP and MGLB models each predicting the test samples. To visualize which model correctly predicted the most cases of each syndrome type in the test set, a comparison of the models' recall rates is shown in Fig. 9. The curve representing the MGLB neural network model sits at the highest position, exhibiting superior prediction accuracy and stability.

Fig. 9 Model comparison.

The TCM syndrome differentiation accuracy of the MGLB model for IPF is 81.22%, superior to the baseline model without feature screening and algorithm optimization. In Western medicine, two studies diagnosing IPF from HRCT imaging report accuracies of 78.9%66 and 85.7%67. It is worth noting, however, that TCM syndrome differentiation involves high-dimensional, nonlinear and subjective symptom data, a greater challenge than the structured biomedical indicators used in Western diagnosis. Our model is based on real-world clinical cases from multiple sources and reflects the actual practice of TCM diagnosis, which enhances its real-world applicability and interpretability and makes it a valuable tool for intelligent TCM diagnosis.

Conclusion

In IPF patients, scar tissue proliferates over time and oxygen struggles to pass from the alveoli into the bloodstream after onset, leaving patients abnormally short of breath, with an average survival of only 3–5 years68. The condition of some patients may continue to deteriorate, with worsening symptoms. It is therefore important to accurately predict the syndrome type and prescribe the right medicine in the diagnosis and treatment of IPF. In this paper, an MGLB model is constructed to predict the TCM syndromes of IPF. The main conclusions are as follows:

  1.

    The MIV algorithm achieves a clear mapping between symptoms and syndromes, in line with the logic of syndrome differentiation; it screens more characteristic variables and separately represents the symptoms with the highest contribution rate for each syndrome, offering greater interpretability and suitability for multi-class TCM syndrome classification. Compared with GRA and PCA, MIV achieves higher prediction accuracy and retains the original symptom features, keeping both model input and output clinically interpretable within the context of TCM syndrome differentiation, though at the cost of increased computational load and greater manual processing complexity.

  2.

    Two optimization algorithms, LM and GA, were used to improve the convergence speed and to optimize the initial weights and thresholds of the BP neural network, respectively. This approach addresses the issues commonly associated with traditional BP neural networks: sluggish convergence, a tendency to fall into local minima, and weak generalization ability. Comparing the prediction results of the GA- and PSO-optimized models shows that the GA-LM-BP model based on feature screening has better recall, accuracy and stability, making it feasible for the prediction of TCM syndromes in IPF.

  3.

    The MGLB classification model proposed in this paper provides an effective auxiliary means of TCM syndrome differentiation for inexperienced TCM doctors and mitigates the strong subjectivity of traditional TCM diagnosis and treatment. It can also reduce errors in the diagnostic process and enable TCM treatment schemes to be formulated more quickly and accurately, providing a useful reference for the diagnosis of IPF. The model can be extended to predict syndromes in other chronic lung diseases, which is of great practical significance.

The main contribution of this work lies in the effective integration of these techniques into a unified framework specifically designed for TCM syndrome classification. This includes: (1) the first systematic and interpretable application of the MIV algorithm to TCM syndrome differentiation, enabling transparent symptom-syndrome mapping that aligns with the diagnostic logic of TCM; (2) the development of a novel integrated intelligent diagnostic framework; (3) the validation of the model’s generalization potential through preliminary experiments on other chronic diseases; (4) the use of a real-world TCM clinical dataset comprising 956 IPF cases, ensuring practical relevance in actual diagnostic settings. In summary, this study presents a novel and interpretable machine learning framework for TCM syndrome classification, integrating feature screening, hybrid optimization, and real-world data validation to support accurate and interpretable diagnosis of IPF.

Limitations and future works

This paper demonstrates a novel application of machine learning to the classification of TCM syndromes of IPF. The findings offer valuable insights for intelligent TCM syndrome differentiation and contribute to the accuracy and scientific rigor of TCM syndrome classification. However, the research still has shortcomings. Neural network algorithms require more training data: the sample of 956 medical cases included in this study is slightly insufficient, and the collected dataset is unbalanced. Furthermore, most of the clinical data derive from expert outpatient clinics in the southwest region, which may introduce regional bias, and since the data were retrospectively collated, some degree of subjectivity cannot be ruled out.

To enhance the model’s generalization ability, future work will expand data collection to more regions and hospitals, increasing sample diversity and representativeness. Clustering analysis will be explored to help identify and correct potential data inconsistencies, and statistical validation methods will be introduced to rigorously assess the significance of MIV feature screening and the model’s overall classification performance. In addition, the MGLB model was applied to datasets of 1205 insomnia cases and 1403 COPD cases, achieving accuracies of 82.99% and 82.92% respectively, showing its potential for expansion to other chronic diseases. This research is modeled and evaluated mainly on the MATLAB platform and can be embedded in an auxiliary diagnosis and treatment system for practical use in the medical field, increasing the depth of the research and its practical value.
The current research on machine learning algorithms has not yet addressed the recommendation of treatments and herbs; subsequent research on multi-label classification algorithms can target treatment prediction and herb recommendation to close the loop from diagnosis to treatment.