Introduction

Liver disease is the eleventh-leading cause of death globally, with most fatalities attributed to complications associated with cirrhosis and hepatocellular carcinoma1. As a precursor to cirrhosis, liver fibrosis develops from chronic hepatic injury but can regress with appropriate aetiologic treatment2. Therefore, dynamic monitoring and early detection of liver fibrosis are crucial for effectively controlling disease progression and reducing the overall disease burden3.

Although liver biopsy remains the reference standard for assessing fibrosis, its clinical utility is limited by its invasive nature and suboptimal intra- and inter-rater reliability4. Non-invasive tests (NITs), including serological tests5 and imaging examinations6, have been developed for the early diagnosis of liver fibrosis. However, these conventional diagnostic methods, with high acquisition and maintenance costs, are unsuitable for routine dynamic monitoring, particularly in primary care settings with limited technical capabilities and resources7.

Tongue inspection, as one of the most essential diagnostic methods in Traditional Chinese Medicine (TCM), provides practitioners with valuable insights into a patient’s overall health8, including the state of internal organs and the severity of illnesses. However, traditional tongue diagnosis heavily relies on practitioner experience and is inherently subjective. Recently, with the advancement of artificial intelligence (AI), tongue diagnosis has undergone significant evolution. Innovative technologies have been developed to enhance the objectivity and accuracy of tongue diagnosis8, enabling it to serve as a screening tool for the early detection of diseases such as breast cancer9 and non-alcoholic fatty liver disease10. Also, specialized diagnostic instruments are now being applied in research and clinical practices10 to expand the scope of tongue diagnosis. The Tongue and Face Diagnosis Analysis-1 (TFDA-1) instrument, developed by Shanghai University of TCM, has been used for diabetes11 and other diseases12. Nevertheless, because they rely on large models and high-performance hardware, TCM diagnostic instruments are expensive and relatively bulky, requiring patients to visit specialized institutions for evaluation.

Our study aims to develop an AI-powered tongue diagnosis system for home-based dynamic monitoring of liver fibrosis via mobile devices. We introduce the TongVMoe model, a multi-task interpretable framework that simultaneously identifies liver fibrosis and key tongue features, and further deploy it within a WeChat mini program to allow patients to upload tongue images and receive real-time diagnostic feedback (Fig. 1). This model was developed and validated on a prospective cohort using ultrasound elastography as the reference standard, and its diagnostic performance was rigorously benchmarked against a range of modern multi-task learning and state-of-the-art architectures. Complete methodological details regarding the study population, image preprocessing, network design, and statistical analysis are delineated in the Methods section. This approach facilitates accessible, continuous monitoring of liver fibrosis, paving the way for a practical follow-up tool suitable for everyday use at home.

Fig. 1: Workflow of the study.

The study consists of two stages. The first stage involves developing the TongVMoe model, which is trained using tongue images based on a multi-task model. The second stage involves remote health monitoring simulation by embedding the TongVMoe model into the WeChat mini-program.

Results

The study consists of two stages: the development of the TongVMoe model and the remote health monitoring simulation. The workflow is presented in Fig. 1.

Patient characteristics

A total of 1601 patients who contributed 2202 tongue images were included in the AI model development phase. The training, validation, and test sets comprised 1280 (non-fibrosis: 927 [42.7 ± 10.7 years old], fibrosis: 353 [48.4 ± 12.2 years old]), 160 (non-fibrosis: 111 [42.0 ± 11.5 years old], fibrosis: 49 [47.0 ± 11.9 years old]), and 161 (non-fibrosis: 107 [42.6 ± 11.8 years old], fibrosis: 54 [45.1 ± 12.2 years old]) patients, respectively. A total of 108 cases (from 103 individual patients) were enrolled in the remote health monitoring simulation, including 72 non-fibrosis cases and 36 fibrosis cases. Among them, 98 patients underwent the test once, whereas 5 patients completed the test twice at different time points. Table 1 summarizes the clinical characteristics of patients across different datasets.

Table 1 Baseline demographic and clinical characteristics of patients in the training, validation, and test sets

Correlation between tongue features and liver fibrosis

Analysis of the tongue features assessed by a TCM expert revealed no statistically significant association of tongue color (χ2 = 4.836, P = 0.089), cracks (χ2 = 0.637, P = 0.425), tongue coating color (χ2 = 1.873, P = 0.171), tooth marks (χ2 = 0.038, P = 0.845), tongue coating thickness (χ2 = 0.155, P = 0.561), or greasy coating (χ2 = 0.536, P = 0.464) with hepatic fibrosis status as defined by liver stiffness measurement (LSM). In contrast, petechiae exhibited a highly significant association with liver stiffness (χ2 = 19.516, P < 0.001), highlighting its potential value as a clinical indicator for liver fibrosis.

Therefore, we conducted a subgroup analysis focusing on tongue petechiae to examine whether their presence affected the models’ predictive accuracy and to compare the performance of different models across these subgroups.
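As an illustrative check on how the reported χ2 statistics map to P values, the sketch below uses the closed-form upper-tail probabilities of the χ2 distribution for 1 and 2 degrees of freedom. The degrees of freedom are an assumption here (1 for the binary features such as cracks and petechiae, 2 for the three-class tongue color), and the function name is hypothetical rather than part of the study code.

```python
import math

def chi2_p_value(stat: float, df: int) -> float:
    """Upper-tail p-value of a chi-square statistic for df = 1 or 2."""
    if df == 1:
        # With df = 1, the statistic is the square of a standard normal variable.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # With df = 2, the chi-square distribution is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("closed form implemented only for df = 1 or 2")

# Statistics taken from the text; degrees of freedom are assumed.
p_cracks = chi2_p_value(0.637, df=1)
p_color = chi2_p_value(4.836, df=2)
p_petechiae = chi2_p_value(19.516, df=1)
print(round(p_cracks, 3), round(p_color, 3), p_petechiae < 0.001)
```

Under these assumptions, the computed values agree to rounding with the P values reported for cracks, tongue color, and petechiae.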

Diagnostic performance of the TongVMoe model

The TongVMoe model demonstrated robust performance across most tasks in the test set. For liver fibrosis prediction, the model achieved an accuracy of 77.98%, with a specificity of 87.42% and an area under the curve (AUC) of 0.8061. In the task of tongue feature recognition, the model excelled in identifying cracks (Accuracy: 91.74%, AUC: 0.9752) and greasy coating (Accuracy: 90.37%, AUC: 0.9232) while displaying balanced performance in tongue coating color (Accuracy: 86.24%, AUC: 0.9310) and tooth marks (Accuracy: 87.16%, AUC: 0.9257). Additionally, the model achieved high specificity (90.80%) in the petechiae task, with an accuracy of 87.16% and an AUC of 0.8912. More details of the performance measurements on the test set are shown in Table 2. The receiver operating characteristic (ROC) curves for the binary classification tasks (cracks, tongue coating color, tooth marks, greasy coating, petechiae, and LSM) are shown in Fig. 2a. For the three-class classification tasks, tongue color and tongue coating thickness, confusion matrices are illustrated in Fig. 2c and d (c: tongue color, d: tongue coating thickness).

Fig. 2: ROC curves and confusion matrices of the TongVMoe model and comparison models on the test set.

a Receiver operating characteristic (ROC) curves of the TongVMoe model for cracks, tongue coating color, tooth marks, greasy coating, petechiae, and liver stiffness measurement (LSM) prediction tasks. b ROC curves of different deep learning models for liver fibrosis diagnosis based on tongue images. c Confusion matrix of the TongVMoe model for tongue color classification. d Confusion matrix of the TongVMoe model for tongue coating thickness classification.

Table 2 The diagnostic performance of the TongVMoe in predicting liver fibrosis and tongue features on the test set

Comparison of different models for the prediction of liver fibrosis from tongue images

We compared the TongVMoe with three multi-task learning approaches: Hard Parameter Sharing (HPS)13, Customized Gate Control (CGC)14, and DSelect-k15; and five backbone architectures: Diffusion-based Medical Image Classifier v2 (DiffMIC-v2)16, InceptionNeXt17, Large-Small Network (LSNet)18, TransXNet19, and HorNet20. Results showed the TongVMoe model exhibited superior diagnostic performance in diagnosing liver fibrosis from tongue images (Table 3). Specifically, its AUC was significantly higher than those of HPS (Accuracy: 69.72%, AUC: 0.6526) and CGC (Accuracy: 71.56%, AUC: 0.7375), and comparable to DSelect-k (Accuracy: 73.85%, AUC: 0.7643). Among the five backbone models, DiffMIC-v2 (Accuracy: 69.72%, AUC: 0.6929), InceptionNeXt (Accuracy: 71.10%, AUC: 0.7012), LSNet (Accuracy: 74.31%, AUC: 0.6971), TransXNet (Accuracy: 70.18%, AUC: 0.7062) and HorNet (Accuracy: 66.97%, AUC: 0.7018) all showed lower AUCs compared with the TongVMoe (p < 0.05). The ROC curves comparing these models are illustrated in Fig. 2b.

Table 3 Performance of different models in analyzing the liver fibrosis from tongue images on the test set

Subgroups (with/without petechiae) analysis of models in diagnosing liver fibrosis

To further assess the impact of tongue petechiae on fibrosis prediction, we stratified the test set into a petechiae group (44 tongue images) and a non-petechiae group (174 tongue images). In the non-petechiae group, the TongVMoe achieved the highest AUC (0.7902) with an accuracy of 78.16%, significantly outperforming HPS (AUC: 0.6277), CGC (AUC: 0.6711), DiffMIC-v2 (AUC: 0.6098), HorNet (AUC: 0.6105), InceptionNeXt (AUC: 0.6182), LSNet (AUC: 0.6584), and TransXNet (AUC: 0.6265), while demonstrating comparable performance to DSelect-k (AUC: 0.6882). In contrast, in the petechiae group, the TongVMoe reached an AUC of 0.8606, with less pronounced differences compared to other models. Overall, all nine models demonstrated superior diagnostic performance in the petechiae group compared to the non-petechiae group, with AUCs increasing by 7.04–33.50 percentage points. Notably, HorNet (+33.50), InceptionNeXt (+32.95), and TransXNet (+32.34) exhibited the most significant gains, whereas the TongVMoe showed a modest improvement (+7.04). More details are provided in Table 4.
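The gains quoted above are differences between two AUCs expressed in percentage points, not relative changes. A minimal worked example for the TongVMoe values given in the text follows; the helper name is hypothetical.

```python
def auc_gain_pp(auc_petechiae: float, auc_non_petechiae: float) -> float:
    """Difference between two AUCs, expressed in percentage points."""
    return round((auc_petechiae - auc_non_petechiae) * 100.0, 2)

# TongVMoe AUCs in the petechiae and non-petechiae subgroups, as reported above.
tongvmoe_gain = auc_gain_pp(0.8606, 0.7902)
print(tongvmoe_gain)
```

The result corresponds to the +7.04 reported for the TongVMoe.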

Table 4 Sub-groups analysis of different models in liver fibrosis diagnosis

Interpretability of the AI diagnosis

The interpretability of the TongVMoe lies in both its architecture and visualization results. Structurally, the model produced explicit predictions for each tongue feature alongside the fibrosis classification demonstrated previously, thereby revealing the underlying associations between observable features and disease states. For visual explanations of the model’s decision-making process, we applied EigenCAM21 to observe the regions attended by the TongVMoe model during prediction. We evaluated whether these areas corresponded to clinical features commonly examined by physicians. Analysis of the heatmaps revealed that the model’s highlighted regions closely matched the areas clinicians typically focused on. Figure 3a–d illustrate four representative cases where the model accurately diagnosed liver fibrosis by focusing on clinically relevant tongue features. For instance (Fig. 3a), when tongue cracks were present, the model focused on the cracked areas, consistent with the regions clinically assessed by physicians, underscoring the clinical validity of the AI model’s predictions.

Fig. 3: Visualization of attended regions generated by EigenCAM during tongue diagnosis.

a Tongue image exhibiting cracks. b Tongue image with tooth marks. c Tongue image showing petechiae. d Tongue image with thick and greasy coating.

Home monitoring simulation using the TongVMoe model

We integrated the TongVMoe model into a WeChat mini-program that passed the hospital’s review process. Figure 4 presents the interface of the developed mini-program for tongue image acquisition and liver fibrosis assessment. Overall, the mini-program achieved an accuracy of 77.8%, a sensitivity of 86.2%, and a specificity of 73.7%.

Fig. 4: User interface of the mini-program for liver fibrosis screening based on tongue image analysis.

The mini-program comprises three sections: a the homepage, b the tongue image acquisition page, and c the result page. The original Chinese text in the interface has been translated into English, listed here in the order of the numbered annotations: Tongue Image-Based Liver Fibrosis Intelligent Evaluation System; Upload tongue image; Rapid detection; Intelligent Analysis; Start; Ensure that the entire tongue is positioned within the shooting frame; Adequate lighting, no backlight, no exposure, no reflection; Make sure the tongue is not stained; Result; Tongue Analysis: Damp-heat syndrome, Tongue color: Red; Tongue coating color: Yellow; Cracks: None; Tooth marks: Yes; Tongue coating thickness: Medium; Greasy coating: Yes; Petechiae: None; Screening Result: Liver fibrosis (+); Referral to hepatology is recommended.

Notably, among the 36 cases diagnosed with liver fibrosis, 27 were single-time measurements from individual patients, while the remaining nine cases came from five follow-up patients who underwent elastography twice at different time points. Of these five patients, four were consistently identified as having liver fibrosis in both examinations. In contrast, one patient initially tested negative but was later diagnosed with liver fibrosis during a follow-up visit three months later. Remarkably, the mini-program predicted liver fibrosis in this patient during the second visit as the disease progressed.

Discussion

In this study, we developed a deep-learning AI model, the TongVMoe, designed to screen for liver fibrosis and identify associated tongue features. Furthermore, we integrated the model into a WeChat mini-program and simulated a remote health monitoring scenario for liver fibrosis screening. Our study demonstrates that, with the assistance of AI, tongue images can be used not only for the preliminary screening of liver fibrosis but also as a tool for monitoring disease progression in patients with chronic liver disease during routine follow-ups.

Previously, we investigated the use of tongue images to develop an AI model capable of detecting liver fibrosis22. However, the model lacked interpretability regarding relevant tongue features, which undermined the trust of both physicians and patients in the diagnostic results and limited its clinical applicability. In addition, the training dataset was relatively small, resulting in suboptimal model performance. In the current study, we addressed both issues by enhancing the model’s ability to recognize tongue features and expanding the training dataset to boost overall performance and generalizability.

According to TCM theory, physicians can assess the condition of the liver by observing specific tongue features23. However, tongue diagnosis is highly subjective, and the diversity of tongue features further complicates its assessment. In this context, petechiae serve as a crucial diagnostic indicator. The subtle reddish-purple dots are considered in TCM to correspond to impaired circulation and stagnation of liver Qi and blood. From a modern perspective, this may parallel the microcirculatory disturbances that occur during the progression of hepatic fibrosis. Although visually less prominent than features such as cracks or greasy coating, petechiae provided stronger pathophysiological specificity for fibrosis. This may explain why all models exhibited better performance in the petechiae subgroup. In contrast, the non-petechiae subgroup, which presented more heterogeneous and nonspecific features, posed greater challenges for fibrosis detection. Notably, the TongVMoe model maintained stable performance across both subgroups, underscoring its generalizability in real-world clinical settings where tongue presentations are diverse. Overall, these findings reveal a gap between visual prominence and clinical relevance, suggesting that future model development should strike a balance between visually apparent features and clinically meaningful yet subtle ones.

Deep learning is often considered a “black box,” but Explainable Artificial Intelligence (XAI)24 offers solutions by providing objective and interpretable insights. Heatmaps generated by EigenCAM showed that the TongVMoe model consistently focused on clinically relevant tongue regions when predicting different features, supporting its visual interpretability. In addition to such visual evidence, the multi-task framework allows the model to reason about tongue features in parallel with disease diagnosis, thereby partially simulating the holistic and pattern-based diagnostic thinking of TCM physicians. This not only enhances transparency but also bridges the gap between AI decision-making and clinical reasoning, providing a more intuitive and trustworthy interpretation for practitioners.

We simulated a remote health scenario in which patients captured tongue images using their smartphones. Despite the potential decline in image quality, the AI model maintained robust performance and high sensitivity, enabling early detection of liver abnormalities. The follow-up results further support the model’s potential for early warning and longitudinal monitoring of liver fibrosis, highlighting its practical value in dynamic disease management, particularly in primary care and rural settings. Importantly, patient privacy is strictly protected, as no personal information is collected.

Our study has several limitations. First, we selected ultrasound elastography as the diagnostic reference standard due to its feasibility and reliability. Although ultrasound elastography has demonstrated promising results for the non-invasive assessment of liver fibrosis25, its accuracy may be influenced by factors such as the patient’s body mass index (BMI), the degree of liver inflammation, and the type of equipment used. These potential variabilities may introduce noise into the reference labels and influence the model’s performance evaluation. Second, the tongue images for the mini-program were all captured during clinic visits. Although the pictures were taken using patients’ smartphones, the lighting conditions were relatively uniform. This setting does not fully reproduce the complex lighting and positional variations that are likely to occur in real home environments, which may affect the model’s performance in broader real-world applications. Third, the dataset for simulating home monitoring was relatively small, mainly consisting of single-time measurements, which limited the validation of the model’s longitudinal predictive ability and dynamic monitoring performance. Moreover, the study was conducted in a single clinical center, and external validation across different populations and imaging conditions is still needed to ensure generalizability. In future studies, we plan to expand the dataset and further refine the tongue diagnosis model to improve its robustness and clinical applicability.

In conclusion, this study provides preliminary evidence supporting the use of deep learning-based tongue image analysis as a non-invasive approach to liver fibrosis screening. While further validation in larger and more diverse populations is required, our findings suggest that the AI model could potentially serve as an assistive tool for remote monitoring and follow-up management in chronic liver disease.

Methods

This prospective study was approved by the Institutional Committee on Ethics (ICE) for Clinical Research and Animal Trials at the First Affiliated Hospital of Sun Yat-sen University (Approval No. 2021464), and informed consent was obtained before collecting the tongue images from each participant. This study was registered at the Chinese Clinical Trial Registry (ChiCTR2100053676; registered on 27 November 2021) in accordance with the World Health Organization International Clinical Trials Registry Platform (WHO ICTRP) requirements.

Participants

Our study was conducted in two stages. Stage 1 involved the development of the AI diagnostic model. During this phase, patients who met the following criteria were enrolled between April 2021 and March 2024: (1) age ≥ 18 years; (2) a prior diagnosis of liver diseases, including chronic hepatitis, non-alcoholic fatty liver disease (NAFLD), abnormal transaminase levels and other related conditions, was confirmed in the Department of Gastroenterology at our hospital; (3) consented to tongue image collection. In Stage 2, participants were invited to test the mini-program during outpatient visits before undergoing liver elastography in the Department of Medical Ultrasonics between August 2024 and March 2025.

Tongue images collection and annotation

Participants were required to fast for 4–6 h before undergoing the ultrasound examination, ensuring that their tongues’ appearances were not affected by food or beverages. After receiving ultrasound elastography, patients were instructed to extend their tongues naturally for the collection of tongue images. In the first phase, photos were captured in no-flash mode using a Sony DSC-RX100 camera, and multiple images were taken for some patients to ensure data diversity. In the second phase, patients scanned a QR code to access the mini-program, then used their smartphones to capture a tongue image or upload a pre-existing one. Adhering to data privacy principles, our tongue diagnosis WeChat mini-program did not collect patient data, and diagnostic results were recorded on-site. There were no restrictions on the type of smartphone or camera resolution.

We invited a senior TCM physician with 20 years of clinical experience to annotate tongue images by identifying and labeling their tongue features. To standardize the process, we categorized the features into seven main types, each further divided into two or three subcategories based on tongue features commonly observed in clinical practice. Table 5 details these tongue features and their corresponding indications in TCM theory. The physician referred to the categories outlined in this table during the annotation process to ensure consistency.

Table 5 Details of tongue features, their Traditional Chinese Medicine (TCM) indications, and distributions across the training, validation, and test datasets

Reference standard: ultrasound elastography examination

Ultrasound elastography was selected as the reference standard for assessing the degree of fibrosis in this study, with radiologists having at least 10 years of experience performing real-time two-dimensional shear wave elastography (2D SWE). The patient lay supine, and the transducer was positioned in the intercostal space to obtain clear images of the liver. The propagation of shear waves was measured, and the stiffness value, expressed in kilopascals (kPa), was calculated. The procedure lasted for 10–20 min. The cutoff value of liver stiffness measurement (LSM) was set to 7 kPa26, dividing the patients into “fibrosis (LSM ≥ 7 kPa)” and “non-fibrosis (LSM < 7 kPa)”.

Image pre-processing

Tongue images were divided at an 8:1:1 ratio into training, validation, and test sets. Stratified sampling was conducted based on sub-attributes, ensuring that the class distribution within each attribute remained approximately consistent across all sets (Table 5). There were no overlapping patients between data sets. We employed the TongueSAM27 to segment the tongue images and minimize interference from irrelevant content. For the input tongue images, we performed a series of data augmentation and normalization operations to simulate various real-world scenarios, thereby enhancing the model’s generalizability and robustness.
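The patient-level constraint described above (no patient appearing in more than one set) can be sketched as follows. This is a simplified illustration under stated assumptions: it splits unique patient identifiers 8:1:1 and then assigns every image to its patient's set, and it omits the attribute-wise stratification used in the study. The function names and the toy identifiers are hypothetical.

```python
import random

def split_patients(patient_ids, seed=0):
    """Shuffle unique patient IDs and split them 8:1:1 (train/val/test)."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def split_images(image_to_patient, seed=0):
    """Assign each image to the set of the patient it belongs to."""
    train, val, test = split_patients(image_to_patient.values(), seed)
    lookup = {pid: "train" for pid in train}
    lookup.update({pid: "val" for pid in val})
    lookup.update({pid: "test" for pid in test})
    return {img: lookup[pid] for img, pid in image_to_patient.items()}

# Toy example: patient "p03" contributed two images, which must stay together.
toy = {"img1": "p01", "img2": "p02", "img3": "p03", "img4": "p03"}
toy.update({f"img{k}": f"p{k:02d}" for k in range(5, 25)})
assignment = split_images(toy)
train_ids, val_ids, test_ids = split_patients(range(1601))
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting on patients rather than images is what prevents near-duplicate photographs of the same tongue from leaking between the training and evaluation sets.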

The data pre-processing pipeline included the following steps: input images were resized to 256 × 256 pixels using bilinear interpolation; images were padded by 20 pixels and then randomly cropped to 224 × 224 pixels; horizontal flipping was applied with a probability of 0.5; random affine transformations were applied, including rotations (±10°) and translations (up to 5% of image dimensions); sharpness adjustment was randomly applied with a sharpening factor of 2.0 (p = 0.3); a set of random transformations was applied in randomized order, including rotation (±10°, p = 0.5), perspective distortion (scale = 0.2, p = 0.5), sharpness adjustment (factor = 2.0, p = 0.3), and Gaussian blur (kernel size 5–9, sigma 0.1–5.0, p = 0.4); finally, images were converted to tensor format and normalized using dataset-specific statistics (mean = [0.765, 0.73, 0.746], standard deviation = [0.28, 0.315, 0.309]).

This diverse range of augmented samples improved the model’s generalization ability. In contrast, the validation and test phases employed only basic preprocessing (resizing, tensor conversion, and normalization) to ensure consistent evaluation conditions.
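The final normalization step, which is shared by the training, validation, and test phases, can be illustrated for a single pixel. The sketch assumes that tensor conversion scales 8-bit intensities to the range [0, 1] before the channel-wise mean and standard deviation are applied, which is the usual convention but is not stated explicitly above; the function name is hypothetical.

```python
# Dataset-specific channel statistics (R, G, B) quoted in the text.
MEAN = (0.765, 0.73, 0.746)
STD = (0.28, 0.315, 0.309)

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb_uint8, MEAN, STD))

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print(white, black)
```

Because the channel means are high (bright backgrounds dominate after segmentation), a white pixel maps to a value below 1 in every channel, whereas a black pixel maps to a value below −2.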

Network architecture

Our preliminary research22 demonstrated the effectiveness of a deep learning model for tongue diagnosis in liver fibrosis. However, that model lacked interpretability, limiting its ability to provide clinically meaningful explanations. In the present study, we employed the state-of-the-art visual VMamba28 for feature extraction and Multi-gate Mixture-of-Experts (MMoE)29 as the network backbone.

Feature Extraction Network: VMamba utilizes a 2D Selective Scan (SS2D) module to scan the image along four directions, efficiently capturing both local and global characteristics. It learns hierarchical visual representations by stacking Visual State Space (VSS) blocks, effectively serving diverse tasks such as tongue feature classification and liver fibrosis diagnosis.

Multi-task Learning with MMoE: The features extracted by VMamba are fed into the MMoE module. MMoE generates independent gates for each task to control the contributions of shared experts, modeling inter-task relationships and learning task-specific feature combinations. Given input features \(x\), the shared bottom layers learn a common representation \(z=f\left(x;\theta \right)\). Then, \(K\) expert networks \({\phi }_{k},k=1,2,\ldots ,K\) extract representations softly shared across tasks. The output for the \(i\)-th task is:

$${\psi }_{i}=\mathop{\sum }\limits_{k=1}^{K}{g}_{i}^{k}\left(z\right){\phi }_{k}\left(z\right),i=1,2,\ldots ,N$$
(1)

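where \({g}_{i}^{k}\left(z\right)\) is the gating weight assigned to the \(k\)-th expert for the \(i\)-th task and \(N\) is the number of tasks. The final prediction is produced by the task-specific prediction head \({\varPhi }_{i}\):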

$${\hat{y}}_{i}={\varPhi }_{i}\left({\psi }_{i}\right),i=1,2,\ldots ,N$$
(2)
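Equation (1) can be illustrated with a small, dependency-free sketch. The version below is a simplification under stated assumptions: each expert and each gate is a single linear map, gate scores are converted to weights with a softmax, and the prediction heads of Eq. (2) are omitted. All names and the toy weights are hypothetical and do not reproduce the TongVMoe implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, z):
    """Apply a weight matrix (list of rows) to the vector z."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]

def mmoe_forward(z, expert_weights, gate_weights):
    """Return psi_i for every task i as a gate-weighted sum of expert outputs."""
    expert_outputs = [linear(w, z) for w in expert_weights]  # phi_k(z), k = 1..K
    task_features = []
    for gate in gate_weights:                                # one gate per task
        g = softmax(linear(gate, z))                         # g_i^k(z), sums to 1
        psi = [sum(g[k] * expert_outputs[k][j] for k in range(len(g)))
               for j in range(len(expert_outputs[0]))]
        task_features.append(psi)
    return task_features

z = [1.0, 2.0]
experts = [[[1.0, 0.0], [0.0, 1.0]],   # expert 1: identity
           [[0.0, 1.0], [1.0, 0.0]]]   # expert 2: swaps the two inputs
gates = [[[0.0, 0.0], [0.0, 0.0]],     # task 1: equal weight on both experts
         [[10.0, 0.0], [0.0, 0.0]]]    # task 2: strongly prefers expert 1
psi = mmoe_forward(z, experts, gates)
print(psi)
```

In this toy setting the first task averages the two experts, while the second task almost exclusively uses the first expert, which is the sense in which experts are "softly shared" across tasks.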

Gradient Balancing with Conflict-Averse Gradient (CAGrad)30: CAGrad was employed to balance the gradients across the eight tasks and mitigate task imbalance. Let \({{\mathcal{L}}}_{i}\left(\theta \right)\) be the loss for task \(i\), \(i=1,2,\ldots ,N\). The average loss is:

$${{\mathcal{L}}}_{0}\left(\theta \right)=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{{\mathcal{L}}}_{i}\left(\theta \right)$$
(3)

CAGrad finds an update vector \({\bf{d}}\) by solving:

$$\mathop{\max }\limits_{{\bf{d}}\in {{\mathbb{R}}}^{m}}\mathop{\min }\limits_{i\in \left[N\right]}\left\langle {{\bf{g}}}_{i},{\bf{d}}\right\rangle \quad \text{s.t.}\quad \parallel {\bf{d}}-{{\bf{g}}}_{0}\parallel \le c\parallel {{\bf{g}}}_{0}\parallel$$
(4)

where \({{\bf{g}}}_{i}={\nabla }_{\theta }{{\mathcal{L}}}_{i}\left(\theta \right)\) is the gradient of task \(i\), \({{\bf{g}}}_{0}=\frac{1}{N}{\sum }_{i=1}^{N}{{\bf{g}}}_{i}\) is the average gradient, \(m\) is the number of parameters, and \(c\in [0,1)\) is a hyperparameter.
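To make the constrained problem in Eq. (4) concrete, the following sketch approximates the CAGrad update for the special case of two tasks. It relies on the dual form described for CAGrad30, in which a convex combination of task gradients g_w minimizes ⟨g_w, g_0⟩ + c‖g_0‖‖g_w‖ and the update is d = g_0 + c‖g_0‖ g_w/‖g_w‖. The grid search over the mixing weight is a simplification chosen for readability, not the solver used in the study, and the function names and toy gradients are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cagrad_two_tasks(g1, g2, c=0.5, steps=1000):
    """Approximate the CAGrad update d for two task gradients via grid search."""
    g0 = [(a + b) / 2.0 for a, b in zip(g1, g2)]   # average gradient
    radius = c * math.sqrt(dot(g0, g0))            # c * ||g0||
    best_value, best_gw = float("inf"), g0
    for s in range(steps + 1):
        w = s / steps                              # mixing weight on task 1
        gw = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
        value = dot(gw, g0) + radius * math.sqrt(dot(gw, gw))
        if value < best_value:
            best_value, best_gw = value, gw
    norm_gw = math.sqrt(dot(best_gw, best_gw))
    if norm_gw == 0.0:
        return g0                                  # no usable direction; fall back
    return [a + radius * b / norm_gw for a, b in zip(g0, best_gw)]

# Toy gradients that partially conflict along the first coordinate.
g1, g2 = [1.0, 0.0], [-0.5, 1.0]
g0 = [(a + b) / 2.0 for a, b in zip(g1, g2)]
d = cagrad_two_tasks(g1, g2, c=0.5)
worst_avg = min(dot(g1, g0), dot(g2, g0))
worst_cagrad = min(dot(g1, d), dot(g2, d))
print(d, worst_avg, worst_cagrad)
```

In this example the update stays on the boundary of the ball of radius c‖g_0‖ around the average gradient, and the smallest per-task inner product is larger than it would be for the plain average, which is the intended effect of the max-min objective.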

Finally, the task-specific features are input to the corresponding prediction heads for attribute classification and liver fibrosis diagnosis tasks. Figure 1 shows the schematic diagram of the model architecture.

Training process

The model was trained on a single NVIDIA H100 GPU with a batch size of 16 for 100 epochs. The optimizer was Adam with a learning rate of 1 × 10⁻⁵ and weight decay of 1 × 10⁻⁷. The learning rate scheduler employed a cosine annealing strategy (CosineAnnealingLR) with Tmax = 100 and ηmin = 1 × 10⁻⁶. An early stopping mechanism was applied, halting training if the change in validation AUC over the past 10 epochs was less than 0.01. Other key hyperparameters included: Dropout Rate (0.5), Drop Path Rate (0.5), MTL CAGrad \({c}_{\alpha }\) (0.5), and specific initialization for MLP layers.
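The learning-rate schedule and stopping rule described above can be sketched without any deep learning framework. The first function uses the standard closed form of cosine annealing with the quoted hyperparameters. The second function is one possible reading of the early-stopping criterion, in which "change" is taken to be the range (maximum minus minimum) of the validation AUC over the last 10 epochs; this interpretation and both function names are assumptions.

```python
import math

def cosine_lr(epoch, base_lr=1e-5, eta_min=1e-6, t_max=100):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

def should_stop(val_aucs, patience=10, min_change=0.01):
    """Stop when validation AUC varied by less than min_change over the last `patience` epochs."""
    if len(val_aucs) < patience:
        return False
    window = val_aucs[-patience:]
    return max(window) - min(window) < min_change

schedule = [cosine_lr(e) for e in (0, 50, 100)]
improving = [0.60 + 0.02 * k for k in range(10)]   # hypothetical AUC history, still rising
plateau = [0.80, 0.801, 0.799, 0.802, 0.80, 0.801, 0.80, 0.803, 0.80, 0.801]
print(schedule, should_stop(improving), should_stop(plateau))
```

With these settings the learning rate starts at 1 × 10⁻⁵, passes through the midpoint of the two bounds at epoch 50, and reaches 1 × 10⁻⁶ at epoch 100.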

For binary classification tasks, a probability threshold of 0.5 was applied, where predictions with probabilities greater than or equal to 0.5 were classified as positive.

For multi-class classification tasks, the model produced a probability distribution over all categories through a softmax layer, and the class with the highest predicted probability was assigned as the final label.
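The two decision rules, together with the confusion-matrix metrics used to evaluate the binary tasks, can be written compactly as below. The probabilities and labels are made-up toy values for illustration only, and the function names are hypothetical.

```python
def binary_label(prob, threshold=0.5):
    """Positive if the predicted probability is at least the threshold."""
    return 1 if prob >= threshold else 0

def multiclass_label(probs):
    """Index of the class with the highest predicted probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def safe_div(num, den):
    return num / den if den else float("nan")

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F1 from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }

# Toy data (not from the study): a probability of exactly 0.5 counts as positive.
probs = [0.9, 0.5, 0.2, 0.6, 0.1]
y_true = [1, 1, 0, 0, 0]
y_pred = [binary_label(p) for p in probs]
metrics = binary_metrics(y_true, y_pred)
three_class = multiclass_label([0.2, 0.5, 0.3])
print(y_pred, metrics, three_class)
```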

Comparison with other multi-task learning models and modern architectures

To comprehensively assess the robustness and generalizability of our model, we compared it with both widely used multi-task learning methods and recently developed backbone designs. For multi-task learning, we included three representative models: HPS, CGC, and DSelect-k, each adopting a different strategy for feature sharing across tasks. In addition, we benchmarked our approach against five state-of-the-art backbone architectures: DiffMIC-v2, InceptionNeXt, LSNet, TransXNet, and HorNet. These models encompass diverse design philosophies, including diffusion-based learning (DiffMIC-v2), modernized convolutional networks (InceptionNeXt), an integrated self-attention module (LSNet), hybrid token mixing (TransXNet), and high-order interaction modeling (HorNet). All models were evaluated on the same dataset for their effectiveness in diagnosing liver fibrosis.

Statistical analysis

Continuous variables were described as mean ± standard deviation (SD) and were compared by the t-test or Mann–Whitney U test. The association between individual tongue characteristics (e.g., presence of cracks or tooth marks) and liver fibrosis status was analyzed using the Pearson χ² test. The performance of all models was evaluated in terms of area under the curve (AUC), accuracy (ACC), precision (Pre), sensitivity (Sen), specificity (Spe), and F1-score (F1). AUC values were compared using the DeLong method. ACC, Pre, Sen, Spe, and F1 were compared using a paired-samples t-test after verifying the normality of differences. Two-sided P-values of less than 0.05 were considered to indicate a statistically significant difference. Confidence intervals (CIs) were computed at a level of 95% using 1000 bootstrap samples. The analyses were conducted using Python 3.8 (Python Software Foundation).
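As an illustration of the AUC and bootstrap CI computations, the sketch below estimates the AUC as the proportion of positive-negative pairs ranked correctly and derives a 95% interval from 1000 resamples. The percentile method, the resampling of individual cases, and the toy labels and scores are assumptions made for the example; the DeLong comparison is not shown.

```python
import random

def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if 0 < sum(yt) < n:                      # both classes must be present
            stats.append(auc(yt, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Toy labels and scores (not from the study).
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.3, 0.2, 0.9]
point = auc(y_true, scores)
ci_low, ci_high = bootstrap_ci(y_true, scores)
print(point, ci_low, ci_high)
```

In the toy data, 14 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.875; the interval is wide because only eight cases are resampled.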