Introduction

Colorectal cancer (CRC) is a major global malignancy, ranking third in incidence and second in cancer-related mortality, thus contributing significantly to the global disease burden1. An important genomic alteration associated with CRC is microsatellite instability (MSI), which arises from defects in the mismatch repair system and occurs in approximately 5–20% of CRC cases2. Notably, the prevalence of MSI is stage-specific; it exceeds 20% in stage II CRC but drops to less than 5% in more advanced stages3.

MSI tumors, characterized by a high tumor mutational burden driven by the MSI carcinogenic pathway, produce numerous immunogenic neoantigens and express immune checkpoints. Consequently, MSI has been identified as a favorable prognostic marker for stage II CRC, with failure to detect MSI potentially leading to unnecessary adjuvant chemotherapy4,5. Furthermore, MSI status predicts immunotherapy response, as studies show CRC patients with MSI respond more effectively to immune checkpoint inhibitors6. According to the National Comprehensive Cancer Network guidelines, MSI testing is recommended for all metastatic CRC patients7. Similarly, the European Society for Medical Oncology advocates MSI evaluation before immunotherapy, and the U.S. Food and Drug Administration has approved MSI as an indication for cancer immunotherapy8,9.

MSI detection methods include immunohistochemistry (IHC), which targets mismatch repair (MMR) proteins such as MLH1, PMS2, MSH2, and MSH6, and polymerase chain reaction (PCR) assays that identify microsatellite instability directly9. PCR commonly examines mononucleotide repeats such as BAT-25 and BAT-26, along with dinucleotide markers. Tumors with high microsatellite instability (MSI-H) are classified as MSI, whereas tumors with low microsatellite instability (MSI-L) are grouped with microsatellite-stable (MSS) tumors. Although these methods are the standard for CRC classification, they are expensive, time-intensive, and show reduced sensitivity in samples with low tumor cell content. Both IHC and PCR rely on advanced equipment and skilled pathologists, presenting challenges in resource-limited settings. Furthermore, deficient MMR (dMMR) occurs in only 10–15% of CRC cases, reducing the cost-effectiveness of universal screening10,11,12. Thus, there is an urgent need for a more accessible, accurate, and cost-efficient detection method to improve dMMR and MSI testing strategies and support the advancement of precision medicine.

The introduction of whole slide images (WSIs) in digital pathology has advanced artificial intelligence (AI)-assisted diagnostics by enabling high-resolution analysis and sharing of tissue samples. This innovation has improved cancer diagnosis, classification, and prognosis, enhancing clinical practice and personalized treatment13,14,15. AI advancements address key challenges in molecular pathology, including time-consuming and costly testing methods. Since 2019, growing evidence has demonstrated the ability of deep learning (DL) to accurately identify MSI and MSS status from hematoxylin and eosin (H&E)-stained whole slide images of CRC and other tumors16,17,18. The first automated, end-to-end DL-based MSI/dMMR detection model, developed by Kather et al. in 2019, achieved an area under the curve (AUC) of 0.84 in the TCGA cohort18. Subsequent studies using novel methodologies have reported improved AUC values ranging from 0.78 to 0.9818. Echle et al. developed a DL classifier with an AUC of 0.96 in external validation16. Bilal et al. introduced a weakly supervised DL framework with three convolutional neural networks (CNNs), achieving an AUC of 0.98 in external cohorts19. Wagner et al. implemented a transformer-based approach for effective mutation status prediction20. In 2022, these advancements led to the first commercial DL biomarker detection algorithm (MSIntuit, Owkin, Paris/New York) being approved for routine clinical use in Europe21.

In recent years, DL algorithms based on WSIs have been increasingly studied for predicting MSI-H status in CRC. However, the predictive performance and reliability of these DL models vary widely, and their overall performance remains uncertain. Therefore, this systematic review aims to synthesize current findings and evaluate the predictive performance of histology-based models in diagnosing MSI-H in CRC.

Results

Study selection

The initial database search yielded 1060 potentially relevant articles. After removing 181 duplicates, 879 unique articles were subjected to preliminary screening. Strict application of the inclusion criteria resulted in the exclusion of 791 articles. Following a detailed full-text review, a further 69 studies were excluded due to insufficient or incomplete diagnostic data (true positives, false positives, false negatives, or true negatives). Ultimately, 19 studies that met the criteria for evaluating the diagnostic performance of DL algorithms were included in the meta-analysis16,17,18,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36. The literature screening process was systematically documented using a standardized Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram, as shown in Fig. 1.

Fig. 1: PRISMA flow diagram illustrating the study selection process.
figure 1

This figure presents a detailed overview of the systematic review process, including the number of studies identified, screened, and included at each stage.

Study description and quality assessment

A total of 19 eligible studies were included. Internal validation was reported in 13 studies comprising 14 data sets and 14,324 patients or images (range: 100–4738), and external validation was reported in 13 studies comprising 25 data sets and 19,059 patients or images (range: 35–2098). These studies were published between 2019 and 2024. All included studies were retrospective in design. Ten studies used PCR as the gold standard, while nine utilized a combination of PCR and IHC. The most commonly employed AI algorithms were CNN-based (10/19, 53%). A detailed summary of the study, patient, and technical characteristics is presented in Tables 1 and 2.

Table 1 Study and patient characteristics of the included studies
Table 2 Technical aspects of included studies

The risk of bias, assessed using the revised QUADAS-2 tool, is summarized in Fig. 2 and Supplementary Table 2. In the patient selection domain, 17 studies were rated as “unclear” due to insufficient information on whether patients were consecutively enrolled. Similarly, in the analysis domain, 17 studies were rated as “unclear” because, although quality control and data filtering were mentioned, it was unclear whether all eligible patient samples were included in the analysis, raising concerns about potential analysis bias. Despite these limitations, the overall quality assessment indicates that the included studies are of acceptable quality, as most of the remaining items presented a low risk of bias.

Fig. 2: Risk of bias and applicability concerns of the included studies using the revised Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.
figure 2

This figure was generated using RevMan 5.4 software.

Diagnostic performance of internal validation set for DL based on WSIs in predicting MSI-H in CRC patients in patient-based analysis

For the internal validation dataset, DL algorithms based on WSIs achieved a sensitivity of 0.88 (95% CI: 0.82–0.93) and a specificity of 0.86 (95% CI: 0.77–0.92) in detecting MSI-H in CRC patients (Fig. 3). The AUC was 0.94 (95% CI: 0.91–0.95) (Fig. 4a). With a pre-test probability of 20%, the Fagan nomogram demonstrated a post-test probability of 62% following a positive result and 3% following a negative result (Fig. 5a).

Fig. 3: Forest plot of deep learning algorithms for identifying microsatellite instability-high in colorectal cancer using whole slide images in the internal validation set of patient-based analysis.
figure 3

Squares represent the sensitivity and specificity of each study, while horizontal bars indicate the 95% confidence intervals. This figure was generated using Stata 15.1 software.

Fig. 4: Summary receiver operating characteristic (SROC) curves of deep learning algorithms for identifying microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images in the internal validation set.
figure 4

a Displays the patient-based SROC curve, indicating the diagnostic performance of the algorithms across different patients, while b provides the image-based SROC curve, reflecting the performance based on individual whole slide images.

Fig. 5: Fagan's nomogram for deep learning algorithms in identifying microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images from the internal validation set.
figure 5

a Displays the patient-based nomogram, illustrating post-test probabilities of MSI-H classification based on pre-test probabilities and algorithm results. b Presents the image-based nomogram, assessing the likelihood of MSI-H based on individual whole slide imaging results.
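
The post-test probabilities displayed in the nomograms follow directly from Bayes’ theorem applied to the pooled sensitivity and specificity. As an illustrative sketch (the published figures were generated in Stata, not with this code), the patient-based internal-validation values can be reproduced in a few lines of Python:

```python
def post_test_probability(sens, spec, pre_test_prob):
    """Convert a pre-test probability into post-test probabilities via the
    positive and negative likelihood ratios (the logic of a Fagan nomogram)."""
    lr_pos = sens / (1 - spec)                  # positive likelihood ratio
    lr_neg = (1 - sens) / spec                  # negative likelihood ratio
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    to_prob = lambda odds: odds / (1 + odds)
    return to_prob(pre_odds * lr_pos), to_prob(pre_odds * lr_neg)

# Pooled internal-validation estimates, patient-based analysis
pos, neg = post_test_probability(sens=0.88, spec=0.86, pre_test_prob=0.20)
print(f"positive result: {pos:.0%}, negative result: {neg:.0%}")  # ~61-62% and ~3%
```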

High heterogeneity was noted in sensitivity (I² = 88.51%) and specificity (I² = 99.24%) within the internal validation dataset. Meta-regression analysis identified that sensitivity heterogeneity was primarily driven by center (single center vs. multicenter, P = 0.04), reference standard (only PCR vs. non-only PCR, P < 0.001), and magnification (20× vs. 40×, P < 0.001). In the specificity heterogeneity analysis, no sources of heterogeneity related to center, AI algorithm, reference standard, magnification, or tile size were found (all P > 0.05) (Table 3). The sensitivity analysis revealed no potential source of heterogeneity (Supplementary Table 3).

Table 3 Meta-regression analysis of deep learning algorithm performance based on patient-based analysis in internal validation cohorts for diagnosing microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images

Diagnostic performance of internal validation set for DL based on WSIs in predicting MSI-H in CRC patients in image-based analysis

For the internal validation dataset, DL algorithms based on WSIs achieved a sensitivity of 0.81 (95% CI: 0.76–0.85) and a specificity of 0.82 (95% CI: 0.72–0.89) in detecting MSI-H in CRC patients (Fig. 6). The AUC was 0.84 (95% CI: 0.81–0.87) (Fig. 4b). With a pre-test probability of 20%, the Fagan nomogram demonstrated a post-test probability of 53% following a positive result and 5% following a negative result (Fig. 5b).

Fig. 6: Forest plot of deep learning algorithms for identifying microsatellite instability-high in colorectal cancer using whole slide images in the internal validation set of image-based analysis.
figure 6

Squares represent the sensitivity and specificity of each study, while horizontal bars indicate the 95% confidence intervals. This figure was generated using Stata 15.1 software.

High heterogeneity was noted in sensitivity (I² = 88.28%) and specificity (I² = 96.47%) within the internal validation dataset. The sensitivity analysis revealed that, after omitting Chang et al., the I² for sensitivity fell to 17.16% and that for specificity to 0%, suggesting this study was a potential source of heterogeneity (Supplementary Table 3).

Diagnostic performance of external validation sets for DL based on WSIs in predicting MSI-H in CRC patients in patient-based analysis

For the external validation dataset, the sensitivity of detecting MSI-H in CRC was 0.93 (95% CI: 0.88–0.95), while the specificity was 0.71 (95% CI: 0.57–0.82) (Supplementary Fig. 1). The AUC was 0.92 (95% CI: 0.90–0.94) (Supplementary Fig. 2a). At a pre-test probability of 20%, the Fagan nomogram indicated a post-test probability of 44% following a positive result and 3% following a negative result (Supplementary Fig. 3a).

High heterogeneity was identified for sensitivity (I² = 95.30%) and specificity (I² = 99.59%) within the external validation dataset. Meta-regression analysis revealed that the heterogeneity in sensitivity was primarily influenced by center (single center vs. multicenter, P = 0.03) and reference standard (only PCR vs. non-only PCR, P < 0.001), while that in specificity was mainly driven by tile size (256×256 or 224×224 vs. 512×512, P < 0.001) (Table 4). The sensitivity analysis revealed no potential source of heterogeneity (Supplementary Table 4).

Table 4 Meta-regression analysis of deep learning algorithm performance based on patient-based analysis in external validation cohorts for diagnosing microsatellite instability-high (MSI-H) in colorectal cancer using whole slide images

There was no statistically significant difference in sensitivity, specificity, or AUC between the internal and external validation datasets in the patient-based analysis (Z = –1.50, 0.67, and 1.39; P = 0.13, 0.50, and 0.17, respectively).

Diagnostic performance of external validation sets for DL based on WSIs in predicting MSI-H in CRC patients in image-based analysis

For the external validation dataset, the sensitivity of detecting MSI-H in CRC was 0.80 (95% CI: 0.63–0.90), while the specificity was 0.54 (95% CI: 0.41–0.67) (Supplementary Fig. 4). The AUC was 0.71 (95% CI: 0.66–0.74) (Supplementary Fig. 2b). At a pre-test probability of 20%, the Fagan nomogram indicated a post-test probability of 30% following a positive result and 9% following a negative result (Supplementary Fig. 3b). The sensitivity analysis revealed that, after omitting Saillard et al. (MPATH-UFS), the I² for sensitivity fell to 31.30%, suggesting this cohort was a potential source of heterogeneity (Supplementary Table 4).

There was no statistically significant difference in sensitivity between the internal and external validation datasets in the image-based analysis (Z = 0.14; P = 0.89). However, the specificity and AUC of the internal validation dataset were significantly higher than those of the external validation dataset (Z = 3.53 and 5.10; both P < 0.001).
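
The exact comparison procedure is not detailed in the primary analyses; a plausible reconstruction, assuming standard errors are back-calculated from the reported 95% CIs on the untransformed scale, reproduces the specificity comparison above:

```python
import math
from scipy.stats import norm

def z_test_from_ci(est1, ci1, est2, ci2):
    """Approximate two-sample Z test for two pooled estimates, with standard
    errors back-calculated from their 95% confidence intervals."""
    se1 = (ci1[1] - ci1[0]) / (2 * 1.96)
    se2 = (ci2[1] - ci2[0]) / (2 * 1.96)
    z = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, 2 * (1 - norm.cdf(abs(z)))

# Internal vs. external pooled specificity, image-based analysis
z, p = z_test_from_ci(0.82, (0.72, 0.89), 0.54, (0.41, 0.67))
print(f"Z = {z:.2f}, P = {p:.4f}")  # Z ~ 3.53, consistent with the reported value
```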

Publication bias

The Deeks’ funnel plot asymmetry test showed no significant publication bias for DL in the internal validation dataset in either the patient-based or the image-based analysis (P = 0.73 and P = 0.18, respectively) (Fig. 7). Likewise, no significant publication bias was detected in the external validation dataset (P = 0.80 and P = 0.77) (Supplementary Fig. 5).

Fig. 7: Deeks’ funnel plot of the internal validation set.
figure 7

a Presents the patient-based analysis, while b illustrates the image-based analysis. The funnel plot represents the relationship between study size (effective sample size) and the diagnostic odds ratio, helping to assess publication bias. Asymmetry in the plots may indicate potential biases in the included studies. A P value below 0.05 was considered significant in evaluating this bias. This figure was generated using Stata 15.1 software.
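
Deeks’ test is a weighted regression of the log diagnostic odds ratio (lnDOR) on the inverse square root of the effective sample size (ESS); a non-significant slope indicates no funnel asymmetry. A minimal Python sketch of the standard formulation (the published analysis was run in Stata; the example counts are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

def deeks_test(tp, fp, fn, tn):
    """Deeks' funnel plot asymmetry test: weighted regression of lnDOR on
    1/sqrt(ESS); a slope P value < 0.05 suggests publication bias.
    A continuity correction of 0.5 guards against zero cells."""
    tp, fp, fn, tn = (np.asarray(x, float) + 0.5 for x in (tp, fp, fn, tn))
    ln_dor = np.log((tp * tn) / (fp * fn))      # log diagnostic odds ratio
    n_pos, n_neg = tp + fn, fp + tn             # diseased / non-diseased counts
    ess = 4 * n_pos * n_neg / (n_pos + n_neg)   # effective sample size
    design = sm.add_constant(1 / np.sqrt(ess))
    fit = sm.WLS(ln_dor, design, weights=ess).fit()
    return fit.params[1], fit.pvalues[1]        # slope and its P value

slope, p = deeks_test(tp=[40, 25, 60], fp=[10, 8, 20], fn=[5, 6, 9], tn=[100, 70, 150])
print(f"slope = {slope:.2f}, P = {p:.2f}")
```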

Discussion

To the best of our knowledge, this is the first meta-analysis to evaluate the diagnostic performance of DL algorithms in detecting MSI-H in CRC using WSIs. For the internal validation dataset, the patient-based analysis yielded a sensitivity of 0.88 and a specificity of 0.86, while the image-based analysis showed a sensitivity of 0.81 and a specificity of 0.82; the corresponding AUCs were 0.94 and 0.84. In contrast, the external validation dataset demonstrated a higher sensitivity of 0.93 and a specificity of 0.71 in the patient-based analysis. The image-based analysis for the external dataset revealed a sensitivity of 0.80 and a specificity of 0.54. The AUC was 0.92 for the patient-based analysis and 0.71 for the image-based analysis. These results suggest that while DL algorithms effectively identify MSI-H in CRC, their performance varies between internal and external validation datasets. The strong diagnostic performance of deep learning algorithms can be attributed to their ability to automatically learn complex morphological features associated with MSI-H directly from digital pathology slides, features that pathologists may overlook with the naked eye37. The higher specificity in internal validation datasets likely results from consistent data preprocessing, uniform staining, and standardized image acquisition, which help the model accurately distinguish MSI-H from non-MSI-H cases. In contrast, external validation datasets often introduce greater variability due to differences in staining protocols, slide preparation, and image quality, leading to domain shifts and reduced specificity38. These findings highlight the need for standardized data pipelines and the inclusion of multi-center datasets to enhance generalizability. Although DL demonstrates significant potential for MSI-H detection, caution is warranted due to dataset-specific factors and the absence of standardized external validation protocols, which may introduce bias. Future studies should focus on collaborative frameworks to develop robust and diverse training datasets while adopting cross-validation strategies to mitigate overfitting and improve clinical applicability39.

A comparison across the internal and external validation datasets revealed that patient-based analyses demonstrated higher sensitivity than image-based analyses (0.88 vs. 0.81 internally and 0.93 vs. 0.80 externally). In patient-based methods, each patient is represented by one WSI as an independent sample, whereas image-based methods may include multiple slides from the same patient. Independent sampling ensures the model captures a broader range of variability, enhancing its predictive performance across diverse patient populations40. Patient-based approaches reflect greater diversity, encompassing variations in tumor types, stages, and therapeutic responses. This diversity improves the model’s generalizability by enabling it to learn a wider range of features, including tumor staging and demographic characteristics41. In contrast, image-based training risks overfitting to specific features within individual patients, which may limit the model’s applicability to external datasets42.
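
This distinction matters practically when splitting data: if tiles or slides from one patient appear in both the training and test sets, performance is inflated by leakage. A hypothetical sketch of a patient-level split using scikit-learn (all variable names and shapes are illustrative, not drawn from any included study):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))               # 1000 tile-level feature vectors
y = rng.integers(0, 2, size=1000)             # MSI-H (1) vs. MSS/MSI-L (0) labels
patient_id = rng.integers(0, 100, size=1000)  # owning patient for each tile

# Group-aware split: no patient contributes tiles to both partitions,
# mirroring a patient-based rather than image-based evaluation
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))
assert not set(patient_id[train_idx]) & set(patient_id[test_idx])
```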

In the internal and external validation of AI algorithms, meta-regression analysis revealed no statistically significant differences in sensitivity or specificity between the patient-based CNN and non-CNN groups. Among non-CNN models, for instance, Niehues’ study demonstrated that a self-supervised, attention-based multiple-instance learning model effectively focused on relevant tissue regions24. Visualization of the attention mechanism revealed that, for MSI prediction, the model concentrated primarily on tumor tissue while attending minimally to fibromuscular tissue and non-tumor epithelium. However, some attention dispersion was observed, potentially contributing to the finding that attention-augmented models did not outperform standalone CNN algorithms in sensitivity or specificity. Direct comparison of the diagnostic performance of different deep learning algorithms is a promising area for future exploration.

It should be noted that in our patient-based external validation dataset, larger tiles (512×512) demonstrated higher specificity than smaller tiles (224×224 or 256×256) (0.91 vs. 0.58, P < 0.001). At a given magnification, larger tiles capture more of the surrounding tissue context, which can be critical for recognizing subtle pathological patterns, whereas smaller tiles emphasize localized detail but may miss that broader context43,44. Although DL algorithms offer promise for improving pathological diagnosis, further research is needed to explore the impact of tile size on model performance and to ensure the reliability of clinical applications.
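
As a toy illustration of this trade-off, the snippet below tiles the same region at the two sizes compared in the meta-regression; at fixed magnification, each 512×512 tile covers four times the tissue area of a 256×256 tile but yields a quarter as many training samples (the array is a hypothetical stand-in for a scanned region):

```python
import numpy as np

def tile_region(region: np.ndarray, tile: int) -> list:
    """Split an RGB region of a WSI into non-overlapping tile x tile patches."""
    h, w = region.shape[:2]
    return [region[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

region = np.zeros((2048, 2048, 3), dtype=np.uint8)  # placeholder image data
print(len(tile_region(region, 256)), len(tile_region(region, 512)))  # 64 vs. 16
```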

Furthermore, in the meta-regression analysis using the reference standard, patient-based internal and external validation showed that the sensitivity of the non-only PCR group was significantly higher than that of the only PCR group. However, current evidence indicates that PCR demonstrates greater diagnostic performance than IHC as a reference standard for identifying MSI in CRC, especially regarding sensitivity and specificity45,46. Since PCR has higher specificity than IHC for detecting MSI-H, using IHC as the gold standard results in a higher false positive rate (i.e., cases deemed positive by IHC that are not truly positive). In this situation, as long as the deep learning model detects any morphological features associated with IHC positivity in the images, these cases will be counted as “true positives,” thus overestimating the model’s sensitivity. In contrast, when PCR is used as the reference standard, the model is required to accurately identify PCR-positive cases. Although this may decrease sensitivity, it offers a more precise reflection of the actual biological state. Nonetheless, heterogeneity among studies and the relatively small number of articles in the only-PCR group may contribute to potential instability in the results. Therefore, future research involving larger sample sizes is essential to evaluate the diagnostic performance of different reference standards and achieve more robust findings.

While previous systematic reviews, such as those by Davri et al.47 and Guitton et al.48, have offered valuable insights into the use of DL for CRC diagnosis and the prediction of MSI from WSIs, our study enhances this foundation by incorporating a broader range of internal and external datasets for systematic statistical analysis. This approach improves the assessment of the model’s adaptability across varied populations. Additionally, we emphasize the necessity of standardizing algorithms to mitigate potential overfitting issues during external validation, a concern that has not been thoroughly addressed in existing literature.

Compared to the previous meta-analyses by Ying et al. and Alam et al., ours is the first to pool the performance of WSI-based models for predicting MSI-H in CRC, with a larger sample size and more included studies. Ying et al.’s meta-analysis used complex confounding models, combining traditional machine learning, clinical, and genomic features, leading to limited scalability49. Alam et al.’s study evaluated MSI prediction across multiple cancer types, including colorectal, gastric, ovarian, and endometrial cancers, but did not perform a pooled analysis of DL’s diagnostic performance specifically for MSI-H in CRC50. In another meta-analysis, Wang et al. assessed AI-based radiomics for MSI prediction in CRC but included fewer studies (14 studies) and limited external validation datasets (four datasets)51. Their reported AUC was 0.83 and sensitivity was 0.76, both lower than our AUC of 0.90 and sensitivity of 0.91. Moreover, nine out of 12 studies in Wang et al.’s analysis relied on PET/CT, which is expensive and diverges from AI’s goal of cost-effective diagnostics. In contrast, our study demonstrates that AI models based on WSIs can efficiently identify MSI-H in CRC, providing new evidence for their clinical applicability and advantages in CRC diagnosis.

The high heterogeneity among the included studies may have influenced the pooled sensitivity and specificity of DL in both internal and external validation datasets. Multivariable meta-regression identified center, AI algorithm, analysis method, magnification, tile size, and reference standard as sources of heterogeneity in internal validation sensitivity, while analysis method, magnification, and tile size were the key contributors for external validation sensitivity. For specificity, center, AI algorithm, tile size, and reference standard influenced internal validation, while magnification was the sole factor in external validation. However, heterogeneity may also stem from other potential factors such as clinical staging of colorectal cancer, dataset size, regional populations, WSI image quality, and specimen origin (e.g., surgical resection or endoscopic biopsy).

Our results demonstrate that DL-based methods achieve high diagnostic performance for MSI-H detection in colorectal cancer across both internal and external datasets. AI has the potential to reduce clinicians’ workloads, minimize diagnostic errors, and prevent adverse outcomes associated with misdiagnoses. However, only one study in our analysis directly compared AI to human performance: Kather et al. reported a sensitivity and specificity of 0.5 for pathologists18. Future studies should focus on comparative evaluations between AI and human performance, particularly that of pathologists. Beyond diagnostic performance, cost-effectiveness is crucial for integrating AI models into routine practice. In hypothetical metastatic CRC populations, combining high-sensitivity AI with confirmatory MSI testing could save approximately $400 million52. AI models also expedite treatment initiation, reducing the average time to initiation to less than a day and improving patient outcomes. Once trained, AI systems incur minimal maintenance costs while offering valuable insights that may reduce unnecessary treatments or accelerate diagnoses52. Despite this promise, several challenges remain. AI models require large, diverse datasets for robust validation and effective integration into routine clinical workflows. Training these models is time-consuming, often needing hundreds or thousands of annotated images, which may involve extensive manual labeling. Moreover, concerns regarding data privacy, model interpretability, and regulatory approval further complicate implementation. Addressing these challenges is essential to ensure the successful and safe adoption of AI in clinical practice50.

Several limitations of this meta-analysis warrant careful consideration when interpreting the results. First, the training and validation cohorts for all included models were retrospective, which may introduce potential bias. Prospective studies are needed to validate these findings and ensure their applicability in clinical practice23. Second, some studies used a combination of PCR and IHC as the reference standard. Weak staining in IHC could result in missed cases, potentially biasing the diagnostic performance for identifying MSI-H in CRC53. Third, model training heavily relied on specific open datasets (e.g., TCGA, QUASAR, DACHS), with limited use of local clinical WSIs for training and validation. This reliance may lead to bias and hinder the assessment of the model’s generalizability. Fourth, to minimize patient overlap among the included studies, we extracted only the best-performing algorithm from each multi-model study; this does not represent the full range of tested algorithms and may introduce positive performance bias, leading to an overestimation of performance. Furthermore, due to limited data availability, we used estimated maximum Youden indices, which could also contribute to bias in performance estimates. It is also important to highlight that the QUADAS-2 assessment revealed an “unclear” risk of bias in the patient selection domain for 17 of 19 studies and in the analysis domain for 18 of 19 studies, indicating potential spectrum bias and selective reporting.

In conclusion, this meta-analysis confirms that DL algorithms perform excellently in detecting MSI-H in CRC using WSIs. However, their lower specificity in external validation suggests overfitting and highlights the need for algorithm standardization to improve generalizability and clinical utility.

Methods

This meta-analysis was conducted in full compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) guidelines54. Additionally, the study protocol has been registered in the PROSPERO database (CRD42025632819).

Search strategy

We conducted a systematic literature search using the PubMed, Embase, and Web of Science databases, with the initial search completed on December 15, 2024. A second search was conducted in January 2025 to include newly published studies. The search strategy involved three groups of keywords: artificial intelligence-related terms (e.g., artificial intelligence, machine learning, deep learning), target-related terms (e.g., microsatellite instability, dMMR), and disease-specific terms (e.g., colon cancer, rectal cancer, colorectal cancer). Both free-text keywords and Medical Subject Headings (MeSH) terms were used to ensure precision. Detailed search strategies are available in Supplementary Table 1. The references of included studies were also reviewed to identify further relevant literature.

Inclusion and exclusion criteria

These studies were carefully selected following the PITROS framework. Participants (P): The participants in this study are patients diagnosed with CRC. Index test (I): This study employs DL techniques to analyze WSIs for predicting MSI-H. Target condition (T): The positive group is defined as patients with MSI-H, while the negative group is defined as patients with MSS or MSI-L. Reference standard (R): The reference standard is PCR or IHC, used to validate MSI status. Outcomes (O): The primary outcomes include sensitivity, specificity, and the AUC. Setting (S): The study setting includes retrospective or prospective data sources, covering public databases or local hospitals.

Exclusion criteria included studies on animals, non-original articles (e.g., reviews, case reports, conference abstracts, meta-analyses, and letters to editors), and non-English publications due to accessibility issues. Furthermore, studies using general artificial intelligence approaches that are unrelated to deep learning algorithms, such as classic machine learning techniques (e.g., support vector machines (SVM), logistic regression (LR), and random forests (RF)), were excluded. Additionally, studies that relied solely on non-AI methods, such as those using WSIs for diagnosis without employing any AI algorithms, were also excluded.

Quality assessment

To ensure a rigorous assessment of the quality of the included studies, we revised the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. Irrelevant criteria were replaced with standards better suited to evaluating the risk of bias in predictive models. This section outlines modifications made to the tool, informed by experience with the original framework and potential sources of bias arising from variations in study design and implementation.

The revised QUADAS-2 tool includes four domains: patient selection, index test (AI algorithms), reference standard, and analysis. Bias was evaluated across all four domains, while applicability concerns were assessed for the first three. Two reviewers (HL and ZZ) independently applied the modified tool to assess the risk of bias in the included studies, resolving any disagreements through discussion to reach consensus.

Data extraction

Two independent reviewers (HL and JQ) screened the titles and abstracts of the remaining articles to identify potentially eligible studies, with a third reviewer (OY) serving as an arbitrator to resolve any disagreements. Extracted data included the first author’s name, study type, publication year, country of data origin, number of study centers, and patient and image data for the training, internal validation, and external validation sets (e.g., number of enrolled patients, number of images, reference standard, diagnostic model algorithm, statistical analysis method, tile size, and magnification). For studies lacking data required for meta-analysis, we contacted corresponding authors via email to request the missing information.

In cases where 2×2 diagnostic contingency tables were not provided, we employed two strategies to construct them: (1) calculating the number of true positives (TP) and the remaining cells from the reported sensitivity, specificity, and the case counts established by the reference standard; and (2) extracting the optimal sensitivity and specificity from ROC curve analyses using the maximum Youden index.
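
Strategy (1) reduces to simple arithmetic on the reported performance and case counts. A minimal sketch, with hypothetical numbers:

```python
def reconstruct_2x2(sens, spec, n_pos, n_neg):
    """Rebuild an approximate 2x2 contingency table from reported sensitivity,
    specificity, and the counts of reference-standard positive (MSI-H) and
    negative (MSS/MSI-L) cases; rounding may shift a cell by one case."""
    tp = round(sens * n_pos)
    tn = round(spec * n_neg)
    return tp, n_neg - tn, n_pos - tp, tn     # TP, FP, FN, TN

# Hypothetical study: sens 0.90, spec 0.80, 50 MSI-H and 250 MSS/MSI-L cases
print(reconstruct_2x2(0.90, 0.80, 50, 250))   # (45, 50, 5, 200)
```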

Outcome measures

The primary outcome measures were sensitivity, specificity, and the AUC for the internal and external validation sets. Sensitivity, also known as recall or the true positive rate, measures the probability of correctly identifying true MSI-H cases and is calculated as TP/(TP + FN), where FN denotes false negatives. Specificity, or the true negative rate, reflects the probability of correctly identifying MSS or MSI-L cases and is calculated as TN/(TN + FP), where TN and FP denote true negatives and false positives, respectively. The AUC, representing the area under the ROC curve, provides a comprehensive metric of the model’s ability to distinguish between positive and negative cases. For studies presenting multiple contingency tables based on different datasets or types of colorectal cancer, we assumed independence and extracted all contingency tables. Additionally, for studies evaluating multiple deep learning models, only the model with the highest AUC from the internal or external validation sets was extracted.
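
For studies reporting only a ROC curve, the operating point we extracted is the one maximizing the Youden index, J = sensitivity + specificity − 1. A hypothetical sketch using scikit-learn (the scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical per-patient MSI-H probabilities from a DL model
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best = np.argmax(tpr - fpr)                    # maximizes J = sens + spec - 1
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}, "
      f"sensitivity = {tpr[best]:.2f}, specificity = {1 - fpr[best]:.2f}")
```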

Statistical analysis

This study utilized a bivariate random-effects model for the meta-analysis to assess the diagnostic performance of deep learning in predicting MSI-H in CRC using WSIs. Sensitivity and specificity were pooled separately for the internal and external validation sets. Forest plots visually presented the pooled sensitivity and specificity, while a summary receiver operating characteristic (SROC) curve provided pooled estimates with 95% CIs and prediction intervals. Heterogeneity across studies was evaluated using Higgins’ I² statistic, with I² values of 25%, 50%, and 75% indicating low, moderate, and high heterogeneity, respectively55. Meta-regression analyses were conducted to identify sources of significant heterogeneity (I² > 50%)56. Meta-regression variables included AI algorithm type (CNN, non-CNN), analysis type (patient-based, image-based), reference standard (only PCR, not only PCR), tile size (256 × 256 or 224 × 224, 512 × 512), magnification (20×, 40×), and study center type (single, multiple). Univariate subgroup analyses were performed for these variables, with statistical differences between subgroups evaluated using the likelihood ratio test.
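
For reference, Higgins’ I² is derived from Cochran’s Q. A minimal sketch of the computation, using made-up effect sizes and variances (e.g., logit-transformed sensitivities):

```python
import numpy as np

def i_squared(effects, variances):
    """Higgins' I^2 from Cochran's Q: Q is the inverse-variance weighted sum of
    squared deviations from the pooled estimate; I^2 = max(0, (Q - df) / Q)."""
    e = np.asarray(effects, float)
    w = 1.0 / np.asarray(variances, float)
    pooled = np.sum(w * e) / np.sum(w)
    q = np.sum(w * (e - pooled) ** 2)
    df = len(e) - 1
    return 0.0 if q <= df else (q - df) / q * 100.0

# Hypothetical logit sensitivities and their sampling variances
print(f"I² = {i_squared([1.9, 2.4, 1.2, 2.8], [0.05, 0.04, 0.06, 0.05]):.1f}%")
```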

Potential publication bias was assessed using Deeks’ funnel plot asymmetry test57. Statistical analyses were conducted with the Midas and Metadat modules in Stata version 15.1, while RevMan 5.4 from the Cochrane Collaboration was used for risk of bias assessment. All statistical tests were two-sided, with P < 0.05 considered statistically significant, and results were reported with 95% confidence intervals.