Introduction

Colorectal cancer (CRC) is a major global malignancy, ranking third in incidence and second in cancer-related mortality, thus contributing significantly to the global disease burden1. An important genomic alteration associated with CRC is microsatellite instability (MSI), which arises from defects in the mismatch repair system and occurs in approximately 5–20% of CRC cases2. Notably, the prevalence of MSI is stage-specific; it exceeds 20% in stage II CRC but drops to less than 5% in more advanced stages3.

MSI tumors, characterized by a high tumor mutational burden driven by the MSI carcinogenic pathway, produce numerous immunogenic neoantigens and express immune checkpoints. Consequently, MSI has been identified as a favorable prognostic marker for stage II CRC, with failure to detect MSI potentially leading to unnecessary adjuvant chemotherapy4,5. Furthermore, MSI status predicts immunotherapy response, as studies show CRC patients with MSI respond more effectively to immune checkpoint inhibitors6. According to the National Comprehensive Cancer Network guidelines, MSI testing is recommended for all metastatic CRC patients7. Similarly, the European Society for Medical Oncology advocates MSI evaluation before immunotherapy, and the U.S. Food and Drug Administration has approved MSI as an indication for cancer immunotherapy8,9.

MSI detection methods include immunohistochemistry (IHC), which targets mismatch repair (MMR) proteins such as MLH1, PMS2, MSH2, and MSH6, and polymerase chain reaction (PCR) assays that identify microsatellite instability directly9. PCR commonly examines mononucleotide repeats such as BAT-25 and BAT-26, along with dinucleotide markers. Tumors with high microsatellite instability (MSI-H) are classified as MSI, whereas tumors with low microsatellite instability (MSI-L) are grouped with microsatellite-stable (MSS) tumors. Although these methods are the standard for CRC classification, they are expensive, time-intensive, and show reduced sensitivity in samples with low tumor cell content. Both IHC and PCR rely on advanced equipment and skilled pathologists, presenting challenges in resource-limited settings. Furthermore, deficient MMR (dMMR) occurs in only 10–15% of CRC cases, reducing the cost-effectiveness of universal screening10,11,12. Thus, there is an urgent need for a more accessible, accurate, and cost-efficient detection method to improve dMMR and MSI testing strategies and support the advancement of precision medicine.

The introduction of whole slide images (WSIs) in digital pathology has advanced artificial intelligence (AI)-assisted diagnostics by enabling high-resolution analysis and sharing of tissue samples. This innovation has improved cancer diagnosis, classification, and prognosis, enhancing clinical practice and personalized treatment13,14,15. AI advancements address key challenges in molecular pathology, including time-consuming and costly testing methods. Since 2019, growing evidence has demonstrated the ability of deep learning (DL) to accurately identify MSI and MSS status from hematoxylin and eosin (H&E)-stained whole slide images of CRC and other tumors16,17,18. The first automated, end-to-end DL-based MSI/dMMR detection model, developed by Kather et al. in 2019, achieved an area under the curve (AUC) of 0.84 in the TCGA cohort18. Subsequent studies using novel methodologies have reported improved AUC values ranging from 0.78 to 0.9818. Echle et al. developed a DL classifier with an AUC of 0.96 in external validation16. Bilal et al. introduced a weakly supervised DL framework with three convolutional neural networks (CNNs), achieving an AUC of 0.98 in external cohorts19. Wagner et al. implemented a transformer-based approach for effective mutation status prediction20. In 2022, these advancements led to the first commercial DL biomarker detection algorithm (MSIntuit, Owkin, Paris/New York) being approved for routine clinical use in Europe21.

In recent years, DL algorithms based on WSIs have been increasingly studied for predicting MSI-H status in CRC. However, the predictive performance and reliability of these DL models vary widely, and their overall performance remains uncertain. Therefore, this systematic review aims to synthesize current findings and evaluate the predictive performance of histology-based models in diagnosing MSI-H in CRC.

Results

Study selection

The initial database search yielded 1060 potentially relevant articles. After removing 181 duplicates, 879 unique articles were subjected to preliminary screening. Strict application of the inclusion criteria resulted in the exclusion of 791 articles. Following a detailed full-text review, a further 69 studies were excluded due to insufficient or incomplete diagnostic data (true positives, false positives, false negatives, or true negatives). Ultimately, 19 studies that met the criteria for evaluating the diagnostic performance of DL algorithms were included in the meta-analysis16,17,18,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36. The literature screening process was systematically documented using a standardized Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram, as shown in Fig. 1.

Fig. 1: PRISMA flow diagram illustrating the study selection process.
figure 1

This figure presents a detailed overview of the systematic review process, including the number of studies identified, screened, and included at each stage.

Study description and quality assessment

A total of 19 eligible studies were included. Internal validation was reported in 13 studies comprising 14 data sets and 14,324 patients or images (range: 100–4738), and external validation was reported in 13 studies comprising 25 data sets and 19,059 patients or images (range: 35–2098). These studies were published between 2019 and 2024. All included studies were retrospective in design. Ten studies used PCR as the gold standard, while nine utilized a combination of PCR and IHC. The most commonly employed AI algorithms were CNN-based (10/19, 53%). A detailed summary of the study, patient, and technical characteristics is presented in Tables 1 and 2.

Table 1 Study and patient characteristics of the included studies
Table 2 Technical aspects of included studies

The risk of bias, assessed using the revised QUADAS-2 tool, is summarized in Fig. 2 and Supplementary Table 2. In the patient selection domain, 17 studies were rated as “unclear” due to insufficient information on whether patients were consecutively enrolled. Similarly, in the analysis domain, 17 studies were rated as “unclear” because, although quality control and data filtering were mentioned, it was unclear whether all eligible patient samples were included in the analysis, raising concerns about potential analysis bias. Despite these limitations, the overall quality assessment indicates that the included studies are of acceptable quality, as most of the remaining items presented a low risk of bias.

Fig. 2: Risk of bias and applicability concerns of the included studies using the revised Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.
figure 2

This figure was generated using RevMan 5.4 software.

Diagnostic performance of internal validation set for DL based on WSIs in predicting MSI-H in CRC patients in patient-based analysis

For the internal validation dataset, DL algorithms based on WSIs achieved a sensitivity of 0.88 (95% CI: 0.82–0.93) and a specificity of 0.86 (95% CI: 0.77–0.92) in detecting MSI-H in CRC patients (Fig. 3). The AUC was 0.94 (95% CI: 0.91–0.95) (Fig. 4a). With a pre-test probability of 20%, the Fagan nomogram demonstrated a post-test probability of 62% following a positive result and 3% following a negative result (Fig. 5a).

Fig. 3: Forest plot of deep learning algorithms for identifying microsatellite instability-high in colorectal cancer using whole slide images in the internal validation set of patient-based analysis.
figure 3

Squares represent the sensitivity and specificity of each study, while horizontal bars indicate the 95% confidence intervals. This figure was generated using Stata 15.1 software.

Fig. 4: Summary receiver operating characteristic (SROC) curves of deep learning algorithms for identifying microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images in the internal validation set.
figure 4

a Displays the patient-based SROC curve, indicating the diagnostic performance of the algorithms across different patients, while b provides the image-based SROC curve, reflecting the performance based on individual whole slide images.

Fig. 5: Fagan's nomogram for deep learning algorithms in identifying microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images from the internal validation set.
figure 5

a Displays the patient-based nomogram, illustrating post-test probabilities of MSI-H classification based on pre-test probabilities and algorithm results. b Presents the image-based nomogram, assessing the likelihood of MSI-H based on individual whole slide imaging results.
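
The post-test probabilities displayed in the nomograms follow directly from Bayes’ theorem applied to the pooled sensitivity and specificity. As an illustrative sketch (the published figures were generated in Stata, not with this code), the patient-based internal-validation values can be reproduced in a few lines of Python:

```python
def post_test_probability(sens, spec, pre_test_prob):
    """Convert a pre-test probability into post-test probabilities via the
    positive and negative likelihood ratios (the logic of a Fagan nomogram)."""
    lr_pos = sens / (1 - spec)                  # positive likelihood ratio
    lr_neg = (1 - sens) / spec                  # negative likelihood ratio
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    to_prob = lambda odds: odds / (1 + odds)
    return to_prob(pre_odds * lr_pos), to_prob(pre_odds * lr_neg)

# Pooled internal-validation estimates, patient-based analysis
pos, neg = post_test_probability(sens=0.88, spec=0.86, pre_test_prob=0.20)
print(f"positive result: {pos:.0%}, negative result: {neg:.0%}")  # ~61-62% and ~3%
```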

High heterogeneity was noted in sensitivity (I² = 88.51%) and specificity (I² = 99.24%) within the internal validation dataset. Meta-regression analysis identified that sensitivity heterogeneity was primarily driven by center (single center vs. multicenter, P = 0.04), reference standard (only PCR vs. non-only PCR, P < 0.001), and magnification (20× vs. 40×, P < 0.001). In the specificity heterogeneity analysis, no sources of heterogeneity related to center, AI algorithm, reference standard, magnification, or tile size were found (all P > 0.05) (Table 3). The sensitivity analysis revealed no potential source of heterogeneity (Supplementary Table 3).

Table 3 Meta-regression analysis of deep learning algorithm performance based on patient-based analysis in internal validation cohorts for diagnosing microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images

Diagnostic performance of internal validation set for DL based on WSIs in predicting MSI-H in CRC patients in image-based analysis

For the internal validation dataset, DL algorithms based on WSIs achieved a sensitivity of 0.81 (95% CI: 0.76–0.85) and a specificity of 0.82 (95% CI: 0.72–0.89) in detecting MSI-H in CRC patients (Fig. 6). The AUC was 0.84 (95% CI: 0.81–0.87) (Fig. 4b). With a pre-test probability of 20%, the Fagan nomogram demonstrated a post-test probability of 53% following a positive result and 5% following a negative result (Fig. 5b).

Fig. 6: Forest plot of deep learning algorithms for identifying microsatellite instability-high in colorectal cancer using whole slide images in the internal validation set of image-based analysis.
figure 6

Squares represent the sensitivity and specificity of each study, while horizontal bars indicate the 95% confidence intervals. This figure was generated using Stata 15.1 software.

High heterogeneity was noted in sensitivity (I² = 88.28%) and specificity (I² = 96.47%) within the internal validation dataset. The sensitivity analysis revealed that, after omitting Chang et al., the I² for sensitivity fell to 17.16% and that for specificity to 0%, suggesting this study was a potential source of heterogeneity (Supplementary Table 3).

Diagnostic performance of external validation sets for DL based on WSIs in predicting MSI-H in CRC patients in patient-based analysis

For the external validation dataset, the sensitivity of detecting MSI-H in CRC was 0.93 (95% CI: 0.88–0.95), while the specificity was 0.71 (95% CI: 0.57–0.82) (Supplementary Fig. 1). The AUC was 0.92 (95% CI: 0.90–0.94) (Supplementary Fig. 2a). At a pre-test probability of 20%, the Fagan nomogram indicated a post-test probability of 44% following a positive result and 3% following a negative result (Supplementary Fig. 3a).

High heterogeneity was identified for sensitivity (I² = 95.30%) and specificity (I² = 99.59%) within the external validation dataset. Meta-regression analysis revealed that the heterogeneity in sensitivity was primarily influenced by center (single center vs. multicenter, P = 0.03) and reference standard (only PCR vs. non-only PCR, P < 0.001), while that in specificity was mainly driven by tile size (256×256 or 224×224 vs. 512×512, P < 0.001) (Table 4). The sensitivity analysis revealed no potential source of heterogeneity (Supplementary Table 4).

Table 4 Meta-regression analysis of deep learning algorithm performance based on patient-based analysis in external validation cohorts for diagnosing microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images

There was no statistically significant difference in sensitivity, specificity, or AUC between the internal and external validation datasets in the patient-based analysis (Z = –1.50, 0.67, and 1.39; P = 0.13, 0.50, and 0.17, respectively).

Diagnostic performance of external validation sets for DL based on WSIs in predicting MSI-H in CRC patients in image-based analysis

For the external validation dataset, the sensitivity of detecting MSI-H in CRC was 0.80 (95% CI: 0.63–0.90), while the specificity was 0.54 (95% CI: 0.41–0.67) (Supplementary Fig. 4). The AUC was 0.71 (95% CI: 0.66–0.74) (Supplementary Fig. 2b). At a pre-test probability of 20%, the Fagan nomogram indicated a post-test probability of 30% following a positive result and 9% following a negative result (Supplementary Fig. 3b). The sensitivity analysis revealed that, after omitting Saillard et al. (MPATH-UFS), the I² for sensitivity fell to 31.30%, suggesting this cohort was a potential source of heterogeneity (Supplementary Table 4).

There was no statistically significant difference in sensitivity between the internal and external validation datasets in the image-based analysis (Z = 0.14; P = 0.89). However, the specificity and AUC of the internal validation dataset were significantly higher than those of the external validation dataset (Z = 3.53 and 5.10; both P < 0.001).
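
The exact comparison procedure is not detailed in the primary analyses; a plausible reconstruction, assuming standard errors are back-calculated from the reported 95% CIs on the untransformed scale, reproduces the specificity comparison above:

```python
import math
from scipy.stats import norm

def z_test_from_ci(est1, ci1, est2, ci2):
    """Approximate two-sample Z test for two pooled estimates, with standard
    errors back-calculated from their 95% confidence intervals."""
    se1 = (ci1[1] - ci1[0]) / (2 * 1.96)
    se2 = (ci2[1] - ci2[0]) / (2 * 1.96)
    z = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, 2 * (1 - norm.cdf(abs(z)))

# Internal vs. external pooled specificity, image-based analysis
z, p = z_test_from_ci(0.82, (0.72, 0.89), 0.54, (0.41, 0.67))
print(f"Z = {z:.2f}, P = {p:.4f}")  # Z ~ 3.53, consistent with the reported value
```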

Publication bias

The Deeks’ funnel plot asymmetry test showed no significant publication bias for DL in the internal validation dataset in either the patient-based or the image-based analysis (P = 0.73 and P = 0.18, respectively) (Fig. 7). Likewise, no significant publication bias was detected in the external validation dataset (P = 0.80 and P = 0.77) (Supplementary Fig. 5).

Fig. 7: Deeks’ funnel plot of the internal validation set.
figure 7

a Presents the patient-based analysis, while b illustrates the image-based analysis. The funnel plot represents the relationship between study size (effective sample size) and the diagnostic odds ratio, helping to assess publication bias. Asymmetry in the plots may indicate potential biases in the included studies. A P value below 0.05 was considered significant in evaluating this bias. This figure was generated using Stata 15.1 software.
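
Deeks’ test is a weighted regression of the log diagnostic odds ratio (lnDOR) on the inverse square root of the effective sample size (ESS); a non-significant slope indicates no funnel asymmetry. A minimal Python sketch of the standard formulation (the published analysis was run in Stata; the example counts are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

def deeks_test(tp, fp, fn, tn):
    """Deeks' funnel plot asymmetry test: weighted regression of lnDOR on
    1/sqrt(ESS); a slope P value < 0.05 suggests publication bias.
    A continuity correction of 0.5 guards against zero cells."""
    tp, fp, fn, tn = (np.asarray(x, float) + 0.5 for x in (tp, fp, fn, tn))
    ln_dor = np.log((tp * tn) / (fp * fn))      # log diagnostic odds ratio
    n_pos, n_neg = tp + fn, fp + tn             # diseased / non-diseased counts
    ess = 4 * n_pos * n_neg / (n_pos + n_neg)   # effective sample size
    design = sm.add_constant(1 / np.sqrt(ess))
    fit = sm.WLS(ln_dor, design, weights=ess).fit()
    return fit.params[1], fit.pvalues[1]        # slope and its P value

slope, p = deeks_test(tp=[40, 25, 60], fp=[10, 8, 20], fn=[5, 6, 9], tn=[100, 70, 150])
print(f"slope = {slope:.2f}, P = {p:.2f}")
```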

Discussion

To the best of our knowledge, this is the first meta-analysis to evaluate the diagnostic performance of DL algorithms in detecting MSI-H in CRC using WSIs. For the internal validation dataset, the patient-based analysis yielded a sensitivity of 0.88 and a specificity of 0.86, while the image-based analysis showed a sensitivity of 0.81 and a specificity of 0.82; the corresponding AUCs were 0.94 and 0.84. In contrast, the external validation dataset demonstrated a higher sensitivity of 0.93 and a specificity of 0.71 in the patient-based analysis. The image-based analysis for the external dataset revealed a sensitivity of 0.80 and a specificity of 0.54. The AUC was 0.92 for the patient-based analysis and 0.71 for the image-based analysis. These results suggest that while DL algorithms effectively identify MSI-H in CRC, their performance varies between internal and external validation datasets. The strong diagnostic performance of deep learning algorithms can be attributed to their ability to automatically learn complex morphological features associated with MSI-H directly from digital pathology slides, features that pathologists may overlook with the naked eye37. The higher specificity in internal validation datasets likely results from consistent data preprocessing, uniform staining, and standardized image acquisition, which help the model accurately distinguish MSI-H from non-MSI-H cases. In contrast, external validation datasets often introduce greater variability due to differences in staining protocols, slide preparation, and image quality, leading to domain shifts and reduced specificity38. These findings highlight the need for standardized data pipelines and the inclusion of multi-center datasets to enhance generalizability. Although DL demonstrates significant potential for MSI-H detection, caution is warranted due to dataset-specific factors and the absence of standardized external validation protocols, which may introduce bias. Future studies should focus on collaborative frameworks to develop robust and diverse training datasets while adopting cross-validation strategies to mitigate overfitting and improve clinical applicability39.

A comparison across the internal and external validation datasets revealed that patient-based analyses demonstrated higher sensitivity than image-based analyses (0.88 vs. 0.81 internally and 0.93 vs. 0.80 externally). In patient-based methods, each patient is represented by one WSI as an independent sample, whereas image-based methods may include multiple slides from the same patient. Independent sampling ensures the model captures a broader range of variability, enhancing its predictive performance across diverse patient populations40. Patient-based approaches reflect greater diversity, encompassing variations in tumor types, stages, and therapeutic responses. This diversity improves the model’s generalizability by enabling it to learn a wider range of features, including tumor staging and demographic characteristics41. In contrast, image-based training risks overfitting to specific features within individual patients, which may limit the model’s applicability to external datasets42.
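
This distinction matters practically when splitting data: if tiles or slides from one patient appear in both the training and test sets, performance is inflated by leakage. A hypothetical sketch of a patient-level split using scikit-learn (all variable names and shapes are illustrative, not drawn from any included study):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))               # 1000 tile-level feature vectors
y = rng.integers(0, 2, size=1000)             # MSI-H (1) vs. MSS/MSI-L (0) labels
patient_id = rng.integers(0, 100, size=1000)  # owning patient for each tile

# Group-aware split: no patient contributes tiles to both partitions,
# mirroring a patient-based rather than image-based evaluation
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))
assert not set(patient_id[train_idx]) & set(patient_id[test_idx])
```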

In the internal and external validation of AI algorithms, meta-regression analysis revealed no statistically significant differences in sensitivity or specificity between the patient-based CNN and non-CNN groups. Among non-CNN models, for instance, Niehues’ study demonstrated that a self-supervised, attention-based multiple-instance learning model effectively focused on relevant tissue regions24. Visualization of the attention mechanism revealed that, for MSI prediction, the model concentrated primarily on tumor tissue while attending minimally to fibromuscular tissue and non-tumor epithelium. However, some attention dispersion was observed, potentially contributing to the finding that attention-augmented models did not outperform standalone CNN algorithms in sensitivity or specificity. Direct comparison of the diagnostic performance of different deep learning algorithms is a promising area for future exploration.

It should be noted that in our patient-based external validation dataset, larger tiles (512×512) demonstrated higher specificity than smaller tiles (224×224 or 256×256) (0.91 vs. 0.58, P < 0.001). At a given magnification, larger tiles capture more of the surrounding tissue context, which can be critical for recognizing subtle pathological patterns, whereas smaller tiles emphasize localized detail but may miss that broader context43,44. Although DL algorithms offer promise for improving pathological diagnosis, further research is needed to explore the impact of tile size on model performance and to ensure the reliability of clinical applications.
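
As a toy illustration of this trade-off, the snippet below tiles the same region at the two sizes compared in the meta-regression; at fixed magnification, each 512×512 tile covers four times the tissue area of a 256×256 tile but yields a quarter as many training samples (the array is a hypothetical stand-in for a scanned region):

```python
import numpy as np

def tile_region(region: np.ndarray, tile: int) -> list:
    """Split an RGB region of a WSI into non-overlapping tile x tile patches."""
    h, w = region.shape[:2]
    return [region[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

region = np.zeros((2048, 2048, 3), dtype=np.uint8)  # placeholder image data
print(len(tile_region(region, 256)), len(tile_region(region, 512)))  # 64 vs. 16
```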

Furthermore, in the meta-regression analysis using the reference standard, patient-based internal and external validation showed that the sensitivity of the non-only PCR group was significantly higher than that of the only PCR group. However, current evidence indicates that PCR demonstrates greater diagnostic performance than IHC as a reference standard for identifying MSI in CRC, especially regarding sensitivity and specificity45,46. Since PCR has higher specificity than IHC for detecting MSI-H, using IHC as the gold standard results in a higher false positive rate (i.e., cases deemed positive by IHC that are not truly positive). In this situation, as long as the deep learning model detects any morphological features associated with IHC positivity in the images, these cases will be counted as “true positives,” thus overestimating the model’s sensitivity. In contrast, when PCR is used as the reference standard, the model is required to accurately identify PCR-positive cases. Although this may decrease sensitivity, it offers a more precise reflection of the actual biological state. Nonetheless, heterogeneity among studies and the relatively small number of articles in the only-PCR group may contribute to potential instability in the results. Therefore, future research involving larger sample sizes is essential to evaluate the diagnostic performance of different reference standards and achieve more robust findings.

While previous systematic reviews, such as those by Davri et al.47 and Guitton et al.48, have offered valuable insights into the use of DL for CRC diagnosis and the prediction of MSI from WSIs, our study enhances this foundation by incorporating a broader range of internal and external datasets for systematic statistical analysis. This approach improves the assessment of the model’s adaptability across varied populations. Additionally, we emphasize the necessity of standardizing algorithms to mitigate potential overfitting issues during external validation, a concern that has not been thoroughly addressed in existing literature.

Compared to the previous meta-analyses by Ying et al. and Alam et al., ours is the first to pool the performance of WSI-based models for predicting MSI-H in CRC, with a larger sample size and more included studies. Ying et al.’s meta-analysis used complex confounding models, combining traditional machine learning, clinical, and genomic features, leading to limited scalability49. Alam et al.’s study evaluated MSI prediction across multiple cancer types, including colorectal, gastric, ovarian, and endometrial cancers, but did not perform a pooled analysis of DL’s diagnostic performance specifically for MSI-H in CRC50. In another meta-analysis, Wang et al. assessed AI-based radiomics for MSI prediction in CRC but included fewer studies (14 studies) and limited external validation datasets (four datasets)51. Their reported AUC was 0.83 and sensitivity was 0.76, both lower than our AUC of 0.90 and sensitivity of 0.91. Moreover, nine out of 12 studies in Wang et al.’s analysis relied on PET/CT, which is expensive and diverges from AI’s goal of cost-effective diagnostics. In contrast, our study demonstrates that AI models based on WSIs can efficiently identify MSI-H in CRC, providing new evidence for their clinical applicability and advantages in CRC diagnosis.

The high heterogeneity among the included studies may have influenced the pooled sensitivity and specificity of DL in both internal and external validation datasets. Multivariable meta-regression identified center, AI algorithm, analysis method, magnification, tile size, and reference standard as sources of heterogeneity in internal validation sensitivity, while analysis method, magnification, and tile size were the key contributors for external validation sensitivity. For specificity, center, AI algorithm, tile size, and reference standard influenced internal validation, while magnification was the sole factor in external validation. However, heterogeneity may also stem from other potential factors such as clinical staging of colorectal cancer, dataset size, regional populations, WSI image quality, and specimen origin (e.g., surgical resection or endoscopic biopsy).

Our results demonstrate that DL-based methods achieve high diagnostic performance for MSI-H detection in colorectal cancer across both internal and external datasets. AI has the potential to reduce clinicians’ workloads, minimize diagnostic errors, and prevent adverse outcomes associated with misdiagnoses. However, only one study in our analysis directly compared AI to human performance: Kather et al. reported a sensitivity and specificity of 0.5 for pathologists18. Future studies should focus on comparative evaluations between AI and human performance, particularly that of pathologists. Beyond diagnostic performance, cost-effectiveness is crucial for integrating AI models into routine practice. In hypothetical metastatic CRC populations, combining high-sensitivity AI with confirmatory MSI testing could save approximately $400 million52. AI models also expedite treatment initiation, reducing the average time to initiation to less than a day and improving patient outcomes. Once trained, AI systems incur minimal maintenance costs while offering valuable insights that may reduce unnecessary treatments or accelerate diagnoses52. Despite this promise, several challenges remain. AI models require large, diverse datasets for robust validation and effective integration into routine clinical workflows. Training these models is time-consuming, often needing hundreds or thousands of annotated images, which may involve extensive manual labeling. Moreover, concerns regarding data privacy, model interpretability, and regulatory approval further complicate implementation. Addressing these challenges is essential to ensure the successful and safe adoption of AI in clinical practice50.

Several limitations of this meta-analysis warrant careful consideration when interpreting the results. First, the training and validation cohorts for all included models were retrospective, which may introduce potential bias. Prospective studies are needed to validate these findings and ensure their applicability in clinical practice23. Second, some studies used a combination of PCR and IHC as the reference standard. Weak staining in IHC could result in missed cases, potentially biasing the diagnostic performance for identifying MSI-H in CRC53. Third, model training heavily relied on specific open datasets (e.g., TCGA, QUASAR, DACHS), with limited use of local clinical WSIs for training and validation. This reliance may lead to bias and hinder the assessment of the model’s generalizability. Fourth, to minimize patient overlap among the included studies, we extracted only the best-performing algorithm from each multi-model study; this does not represent the full range of tested algorithms and may introduce positive performance bias, leading to an overestimation of performance. Furthermore, due to limited data availability, we used estimated maximum Youden indices, which could also contribute to bias in performance estimates. It is also important to highlight that the QUADAS-2 assessment revealed an “unclear” risk of bias in the patient selection domain for 17 of 19 studies and in the analysis domain for 18 of 19 studies, indicating potential spectrum bias and selective reporting.

In conclusion, this meta-analysis confirms that DL algorithms perform excellently in detecting MSI-H in CRC using WSIs. However, their lower specificity in external validation suggests overfitting and highlights the need for algorithm standardization to improve generalizability and clinical utility.

Methods

This meta-analysis was conducted in full compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) guidelines54. Additionally, the study protocol has been registered in the PROSPERO database (CRD42025632819).

Search strategy

We conducted a systematic literature search using the PubMed, Embase, and Web of Science databases, with the initial search completed on December 15, 2024. A second search was conducted in January 2025 to include newly published studies. The search strategy involved three groups of keywords: artificial intelligence-related terms (e.g., artificial intelligence, machine learning, deep learning), target-related terms (e.g., microsatellite instability, dMMR), and disease-specific terms (e.g., colon cancer, rectal cancer, colorectal cancer). Both free-text keywords and Medical Subject Headings (MeSH) terms were used to ensure precision. Detailed search strategies are available in Supplementary Table 1. The references of included studies were also reviewed to identify further relevant literature.

Inclusion and exclusion criteria

These studies were carefully selected following the PITROS framework. Participants (P): The participants in this study are patients diagnosed with CRC. Index test (I): This study employs DL techniques to analyze WSIs for predicting MSI-H. Target condition (T): The positive group is defined as patients with MSI-H, while the negative group is defined as patients with MSS or MSI-L. Reference standard (R): The reference standard is PCR or IHC, used to validate MSI status. Outcomes (O): The primary outcomes include sensitivity, specificity, and the AUC. Setting (S): The study setting includes retrospective or prospective data sources, covering public databases or local hospitals.

Exclusion criteria included studies on animals, non-original articles (e.g., reviews, case reports, conference abstracts, meta-analyses, and letters to editors), and non-English publications due to accessibility issues. Furthermore, studies using general artificial intelligence approaches that are unrelated to deep learning algorithms, such as classic machine learning techniques (e.g., support vector machines (SVM), logistic regression (LR), and random forests (RF)), were excluded. Additionally, studies that relied solely on non-AI methods, such as those using WSIs for diagnosis without employing any AI algorithms, were also excluded.

Quality assessment

To ensure a rigorous assessment of the quality of the included studies, we revised the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. Irrelevant criteria were replaced with standards better suited to evaluating the risk of bias in predictive models. This section outlines modifications made to the tool, informed by experience with the original framework and potential sources of bias arising from variations in study design and implementation.

The revised QUADAS-2 tool includes four domains: patient selection, index test (AI algorithms), reference standard, and analysis. Bias was evaluated across all four domains, while applicability concerns were assessed for the first three. Two reviewers (HL and ZZ) independently applied the modified tool to assess the risk of bias in the included studies, resolving any disagreements through discussion to reach consensus.

Data extraction

Two independent reviewers (HL and JQ) screened the titles and abstracts of the remaining articles to identify potentially eligible studies, with a third reviewer (OY) serving as an arbitrator to resolve any disagreements. Extracted data included the first author’s name, study type, publication year, country of data origin, number of study centers, and patient and image data for the training, internal validation, and external validation sets (e.g., number of enrolled patients, number of images, reference standard, diagnostic model algorithm, statistical analysis method, tile size, and magnification). For studies lacking data required for meta-analysis, we contacted corresponding authors via email to request the missing information.

In cases where 2×2 diagnostic contingency tables were not provided, we employed two strategies to construct them: (1) calculating the number of true positives (TP) and the remaining cells from the reported sensitivity, specificity, and the case counts established by the reference standard; and (2) extracting the optimal sensitivity and specificity from ROC curve analyses using the maximum Youden index.
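
Strategy (1) reduces to simple arithmetic on the reported performance and case counts. A minimal sketch, with hypothetical numbers:

```python
def reconstruct_2x2(sens, spec, n_pos, n_neg):
    """Rebuild an approximate 2x2 contingency table from reported sensitivity,
    specificity, and the counts of reference-standard positive (MSI-H) and
    negative (MSS/MSI-L) cases; rounding may shift a cell by one case."""
    tp = round(sens * n_pos)
    tn = round(spec * n_neg)
    return tp, n_neg - tn, n_pos - tp, tn     # TP, FP, FN, TN

# Hypothetical study: sens 0.90, spec 0.80, 50 MSI-H and 250 MSS/MSI-L cases
print(reconstruct_2x2(0.90, 0.80, 50, 250))   # (45, 50, 5, 200)
```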

Outcome measures

The primary outcome measures were sensitivity, specificity, and the AUC for the internal and external validation sets. Sensitivity, also known as recall or the true positive rate, measures the probability of correctly identifying true MSI-H cases and is calculated as TP/(TP + FN), where FN denotes false negatives. Specificity, or the true negative rate, reflects the probability of correctly identifying MSS or MSI-L cases and is calculated as TN/(TN + FP), where TN and FP denote true negatives and false positives, respectively. The AUC, representing the area under the ROC curve, provides a comprehensive metric of the model’s ability to distinguish between positive and negative cases. For studies presenting multiple contingency tables based on different datasets or types of colorectal cancer, we assumed independence and extracted all contingency tables. Additionally, for studies evaluating multiple deep learning models, only the model with the highest AUC from the internal or external validation sets was extracted.
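
For studies reporting only a ROC curve, the operating point we extracted is the one maximizing the Youden index, J = sensitivity + specificity − 1. A hypothetical sketch using scikit-learn (the scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical per-patient MSI-H probabilities from a DL model
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best = np.argmax(tpr - fpr)                    # maximizes J = sens + spec - 1
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}, "
      f"sensitivity = {tpr[best]:.2f}, specificity = {1 - fpr[best]:.2f}")
```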

Statistical analysis

This study utilized a bivariate random-effects model for the meta-analysis to assess the diagnostic performance of deep learning in predicting MSI-H in CRC using WSIs. Sensitivity and specificity were pooled separately for the internal and external validation sets. Forest plots visually presented the pooled sensitivity and specificity, while a summary receiver operating characteristic (SROC) curve provided pooled estimates with 95% CIs and prediction intervals. Heterogeneity across studies was evaluated using Higgins’ I² statistic, with I² values of 25%, 50%, and 75% indicating low, moderate, and high heterogeneity, respectively55. Meta-regression analyses were conducted to identify sources of significant heterogeneity (I² > 50%)56. Meta-regression variables included AI algorithm type (CNN, non-CNN), analysis type (patient-based, image-based), reference standard (only PCR, not only PCR), tile size (256 × 256 or 224 × 224, 512 × 512), magnification (20×, 40×), and study center type (single, multiple). Univariate subgroup analyses were performed for these variables, with statistical differences between subgroups evaluated using the likelihood ratio test.
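
For reference, Higgins’ I² is derived from Cochran’s Q. A minimal sketch of the computation, using made-up effect sizes and variances (e.g., logit-transformed sensitivities):

```python
import numpy as np

def i_squared(effects, variances):
    """Higgins' I^2 from Cochran's Q: Q is the inverse-variance weighted sum of
    squared deviations from the pooled estimate; I^2 = max(0, (Q - df) / Q)."""
    e = np.asarray(effects, float)
    w = 1.0 / np.asarray(variances, float)
    pooled = np.sum(w * e) / np.sum(w)
    q = np.sum(w * (e - pooled) ** 2)
    df = len(e) - 1
    return 0.0 if q <= df else (q - df) / q * 100.0

# Hypothetical logit sensitivities and their sampling variances
print(f"I² = {i_squared([1.9, 2.4, 1.2, 2.8], [0.05, 0.04, 0.06, 0.05]):.1f}%")
```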

Potential publication bias was assessed using Deeks’ funnel plot asymmetry test57. Statistical analyses were conducted with the Midas and Metadat modules in Stata version 15.1, while RevMan 5.4 from the Cochrane Collaboration was used for risk of bias assessment. All statistical tests were two-sided, with P < 0.05 considered statistically significant, and results were reported with 95% confidence intervals.