Introduction

Microsatellites (MS), also known as short tandem repeats (STR), consisting of DNA sequences formed by tandem repetitive units of 1-6 nucleotides, are ubiquitous in the human genome1. When the DNA mismatch repair (MMR) machinery is compromised by acquired or inherited factors, deletions or insertions of one or more units accumulate at MS loci. This phenomenon is termed microsatellite instability (MSI)2,3.

MSI is most prevalent in endometrial, gastric and colorectal cancers, with lower incidence in other malignancies4. It serves as an important therapeutic and prognostic marker. High-level MSI status (MSI-H) is associated with sensitivity to immune checkpoint inhibitors and resistance to 5-fluorouracil-based chemotherapy5; therefore, accurate identification of patients’ MSI status is vital.

According to ESMO guidelines6, immunohistochemistry (IHC)-based methods targeting the proteins MLH1, MSH2, PMS2 and MSH6 (IHC-MSI) are the preferred approach for MSI testing. IHC indirectly assesses the integrity of MMR function by checking the nuclear localization of key MMR proteins. However, the results of IHC-MSI can be influenced by various factors, such as pre-analytical processing of samples7 and non-truncating inactivating mutations of MMR genes8. As reported, ~5–11% of MSI samples are caused by such mutations in MMR genes4, which result in loss of function of the gene products while retaining their antigenicity. When IHC results are indeterminate, PCR-based methods (PCR-MSI) are employed. PCR directly determines the integrity of MMR function by checking length changes of microsatellites caused by insertions or deletions of repeat units due to unrepaired “replication slippage”4. At present, a PCR panel of five quasi-monomorphic poly-A mononucleotide repeats9, the most popular commercial implementation of which comes from Promega, is widely used because it performs better than the Bethesda panel2. Although the concordance of PCR-MSI and IHC-MSI can reach 97%10, it is noteworthy that currently approved PCR-MSI testing products based on the five loci are intended only for colorectal cancers; their use in non-colorectal malignancies remains controversial11.

In recent years, NGS-based MSI detection methods (NGS-MSI) have gained widespread acceptance. NGS also checks length changes of MS loci but can expand the number of MS loci targeted, thereby potentially improving analytical performance, particularly in non-colorectal samples. NGS-MSI is highly concordant with PCR-MSI in colorectal cancers; however, some discordance has been reported in non-colorectal cancers12,13,14,15. In a study led by Memorial Sloan Kettering Cancer Center, NGS-MSI detection demonstrated 99.4% concordance with PCR or IHC in colorectal and endometrial cancers, and 96.6% concordance with PCR in non-colorectal and non-endometrial cancers14. Several NGS-MSI algorithms have been developed, such as MSIsensor16 and MSI-ColonCore17.

In this study, we conducted a large-scale retrospective analysis of NGS-MSI results from 35,563 Chinese pan-cancer cases. We introduce a novel NGS-MSI algorithm, examine the prevalence and genomic variation associations of MSI, systematically evaluate the discordance between NGS-MSI and PCR-MSI in a pan-cancer context, and finally extract 7 MS loci suitable for pan-cancer MSI detection.

Results

Development of an NGS-based MSI detector

An in-house NGS-based MSI detector, MSIDRL, was developed primarily according to the idea of Wang et al.18.

Initially, the 500 most robust noncoding MS loci in 10 colorectal circulating tumor DNA whole-exome sequencing assays were selected. Capture probes targeting these loci were designed and synthesized to form a prototype panel. A training set of 105 pan-cancer FFPE samples, whose MSI status had been determined by PCR (31 MSI-H and 74 MSI-L/MSS; Supplementary Table 2, Training Set), was assayed with the prototype. For any MS locus i, the reads covering the entire repeat were counted and summed in the MSI-H samples and the MSI-L/MSS samples separately, and accumulated by observed repeat length. The observed repeat length maximizing the cumulative read count difference between the MSI-H samples and the MSI-L/MSS samples was defined as the “diacritical repeat length” of locus i, designated DRLi. For any MS locus i of any sample j, the reads with repeat length longer than DRLi were defined as “stable” reads, the count of which was designated SRCij; the reads with repeat length shorter than or equal to DRLi were defined as “unstable” reads, the count of which was designated URCij. We defined the background noise of locus i as:

$${B}_{i}=\frac{\sum _{l}{{URC}}_{{il}}}{\sum _{l}({{SRC}}_{{il}}+{{URC}}_{{il}})}\left(l\in \left\{{\rm{MSI}}-{\rm{L}}/{\rm{MSS\; samples}}\right\}\right)$$
(1)
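
The definitions above and Eq. 1 can be sketched in Python as follows. This is a minimal illustration, not the MSIDRL implementation: the function names and the histogram layout (observed repeat length mapped to read count) are assumptions, raw read counts are accumulated as the text describes, and ties in the cumulative difference are resolved toward the shorter length, which is also an assumption.

```python
def diacritical_repeat_length(msi_h_hist, mss_hist):
    """Return the repeat length maximizing cum(MSI-H) - cum(MSI-L/MSS).

    Each histogram maps an observed repeat length to the read count summed
    over all samples of that group (hypothetical data layout).
    """
    lengths = sorted(set(msi_h_hist) | set(mss_hist))
    best_len, best_diff = None, float("-inf")
    cum_h = cum_s = 0
    for length in lengths:
        cum_h += msi_h_hist.get(length, 0)
        cum_s += mss_hist.get(length, 0)
        diff = cum_h - cum_s
        if diff > best_diff:  # strict ">" keeps the shortest length on ties
            best_len, best_diff = length, diff
    return best_len


def split_reads(hist, drl):
    """Return (SRC, URC): reads longer than DRL are stable, the rest unstable."""
    urc = sum(n for length, n in hist.items() if length <= drl)
    src = sum(n for length, n in hist.items() if length > drl)
    return src, urc


def background_noise(mss_sample_hists, drl):
    """Eq. 1: pooled unstable-read fraction over the MSI-L/MSS samples."""
    total_urc = total_reads = 0
    for hist in mss_sample_hists:
        src, urc = split_reads(hist, drl)
        total_urc += urc
        total_reads += src + urc
    return total_urc / total_reads


# Toy pooled histograms for one locus (made-up numbers).
msi_h_pooled = {11: 30, 12: 50, 13: 40, 14: 30, 15: 50}
mss_pooled = {13: 2, 14: 10, 15: 188}
drl = diacritical_repeat_length(msi_h_pooled, mss_pooled)
print(drl, background_noise([mss_pooled], drl))
```

In the toy data the cumulative difference peaks at repeat length 14, so reads of length 15 are stable and the pooled background is 12/200.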

Then, for any locus i of any sample j, we have \({b}_{{ij}}=\frac{{{URC}}_{{ij}}}{{{SRC}}_{{ij}}+{{URC}}_{{ij}}}\). We tested the null hypothesis \({H}_{0}:{b}_{{ij}} > {B}_{i}\) with a binomial test and obtained the p-value pij. With the PCR-predefined MSI status, a p-value cutoff Pi was determined for each locus, requiring specificity >= 99.0% and sensitivity as high as possible.
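
A hedged sketch of the per-locus test and cutoff selection is given below. It assumes a one-sided binomial test in which a small p-value indicates an unstable-read fraction above background; the sidedness, the function names, and the choice of candidate cutoffs (the observed MSI-H p-values) are illustrative assumptions rather than details confirmed by the text. `scipy.stats.binomtest` is available in SciPy 1.7 and later.

```python
from scipy.stats import binomtest


def locus_p_value(urc, src, background):
    """One-sided binomial p-value for URC unstable reads out of URC + SRC."""
    n = urc + src
    if n == 0:  # no covering reads: treat the locus as uninformative
        return 1.0
    return binomtest(urc, n, background, alternative="greater").pvalue


def choose_cutoff(p_msi_h, p_mss, min_specificity=0.99):
    """Pick the p-value cutoff with the highest sensitivity at >= 99% specificity.

    A locus is called unstable in a sample when p <= cutoff.
    Returns (cutoff, sensitivity, specificity), or None if no cutoff qualifies.
    """
    best = None
    for cutoff in sorted(set(p_msi_h)):
        specificity = sum(p > cutoff for p in p_mss) / len(p_mss)
        if specificity < min_specificity:
            break  # larger cutoffs can only lower specificity further
        sensitivity = sum(p <= cutoff for p in p_msi_h) / len(p_msi_h)
        best = (cutoff, sensitivity, specificity)
    return best


# Toy usage with made-up read counts and p-values.
print(locus_p_value(60, 140, 0.06))
print(choose_cutoff([1e-30, 1e-12, 1e-5, 0.2], [0.5, 0.8, 0.3, 1e-4, 0.9]))
```

Because sensitivity can only grow and specificity can only shrink as the cutoff rises, scanning the candidates in ascending order and stopping at the first specificity failure yields the most sensitive admissible cutoff.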

The 100 most sensitive MS loci were selected to form the final panel. These loci do not overlap with the 6 loci of PCR-MSI. The unstable locus count (ULC) of a sample is the number of final-panel MS loci whose binomial test p-values are less than or equal to their cutoffs. ULC classified the training set and the validation set properly (Supplementary Fig. 1, Supplementary Table 2).
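
As a small illustration of the ULC definition, the sketch below counts the panel loci whose p-value is at or below the locus-specific cutoff. The dictionary layout, the locus identifiers, and the handling of loci without a p-value (they are skipped) are assumptions made for the example.

```python
def unstable_locus_count(p_values, cutoffs):
    """Count panel loci whose p-value is <= the locus-specific cutoff.

    p_values: locus id -> p-value for one sample (hypothetical layout).
    cutoffs:  locus id -> cutoff P_i for every locus of the final panel.
    Panel loci missing from p_values are skipped (an assumption).
    """
    return sum(
        1
        for locus, cutoff in cutoffs.items()
        if locus in p_values and p_values[locus] <= cutoff
    )


# Toy three-locus panel with made-up values.
cutoffs = {"locus_a": 1e-3, "locus_b": 1e-5, "locus_c": 1e-2}
sample = {"locus_a": 1e-3, "locus_b": 1e-4}
print(unstable_locus_count(sample, cutoffs))
```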

ULC cutoff & MSI-H prevalence

From June 2020 to July 2023, 35,563 valid cases were tested with the MSIDRL-embedded 733-gene NGS LDT (see Materials and Methods), which produced abundant data enabling fine-tuning of the ULC cutoff.

The pan-cancer ULC distribution is bimodal (Fig. 1A). The first peak appeared at the lower extreme of the ULC spectrum, followed by a sharp decrease in case count near 10 and then a wide, flat valley from 10 to 90. We regarded the first peak as a self-explanatory aggregation of MSS cases and therefore set the ULC cutoff at 11. The case count rose gently around 90 and culminated at 100, forming a second peak. Intriguingly, when the cases were inspected by anatomical cancer type, the second peak was observed only in GACA and BWCA, whereas the first appeared in all cancer types (Supplementary Fig. 2).

Fig. 1: The pan-cancer patient case distribution by ULC and ULC-defined MSI-H prevalence of cancer types.

A Pan-cancer ULCs demonstrated a bimodal distribution and cases of ULCs >= 11 were considered MSI-H. B Total and MSI-H case counts differed between cancer types of four clusters. BWCA bowel cancers, GACA gastric cancer, UTNP uterine neoplasms, BITC biliary tract cancers, LICA liver cancers, OFPC ovarian cancer including Fallopian tube cancer and primary peritoneal cancer, PACA pancreatic cancer, LUCA lung cancers, The rest, other cancers not above.

With the prevalence of MSI-H calculated (Supplementary Table 3), the cancer types could be grouped into 4 clusters (Fig. 1B). UTNP, GACA and BWCA were common cancers with high MSI-H prevalence; they contributed approximately 80% of the MSI-H cases. BITC, LICA, OFPC and PACA were common cancers with lower MSI-H prevalence. LUCA was the most common cancer, but MSI-H was rare in it. The remaining cancers were uncommon, and few MSI-H cases were found among them.

We investigated MSI-H prevalence in some cancer subtypes as well (Supplementary Table 4). Significant differences were observed between colon cancer and rectal cancer (10.66% vs. 2.19%, p-value = 1.26 × 10^-36) and between esophagogastric junction cancer and esophageal cancer (4.04% vs. 0.30%, p-value = 2.11 × 10^-3).

DMGs associated with ULC

Within the scope of the 733-gene panel, 363 genes were found deleteriously mutated in at least one case (Supplementary Data 1), 94 of which were associated with ULC from a pan-cancer perspective: 92 positively, and BTK and TERT negatively (Fig. 2). ULC-associated DMGs of a specific cancer type were mostly a subset of the 94 genes, with some cancer-specific exceptions such as MYC in UTNP, KRAS in GACA, and CTNNB1 and TP53 in BWCA (Supplementary Fig. 3).

Fig. 2: DMGs associated with ULC.

A 94 genes associated with ULC in the entire data set. B Comparison of ULC distributions between positive and negative cases for the 10 genes with the strongest positive association and the genes with negative association.

We supposed that DMGs positively associated with ULC and carrying germline mutations were potential MSI drivers. Twenty-nine such genes were discovered, most of which were well-established DNA damage repair genes whose products are involved in a physical interaction network (Supplementary Fig. 4). Among the MMR genes investigated (MLH1, MLH3, MSH2, MSH3, MSH6, PMS2), MLH3 and MSH3 bore few germline mutations, while a considerable number were observed in the others.

Variants associated with ULC

To discover more specific factors associated with MSI, ACMG P and LP variants, and Variants of Uncertain Significance (VUS) detected in the data set were analyzed.

Across the entire data set, 481 variants were associated with ULC, the majority of which were somatic single-nucleotide indels positively associated with ULC (Supplementary Data 2). A single deletion, chr2:g.148683686del (ACVR2A:NM_001616.5:c.1310del:p.K437Rfs*5), was detected in 66.6% (728/1,093) of MSI-H cases (Fig. 3A). Four germline VUSs, chr7:g.6445235C>T (RAC1, rs836554), chr6:g.43737486C>T (VEGFA, rs833061), chr10:g.131264931A>C (MGMT, rs1625649) and chr7:g.6443839T>C (RAC1, rs4720672), were found negatively associated with ULC; all are non-coding germline SNVs (Fig. 3B).

Fig. 3: Variants associated with ULC.

A The 10 variants most strongly associated with ULC, all of which are single-nucleotide indels; most occurrences were somatic. B Germline variants negatively associated with ULC.

Correlation between MSI and TMB

TMB information was available for 97.3% of cases in the entire data set (34,588/35,563). We investigated the correlation between MSI and TMB.

As expected, ULC demonstrated a weak positive correlation with TMB in MSI-H cases (Fig. 4A). Meanwhile, a considerable proportion of MSS cases were TMB-H, indicating other, MMR-independent mutagenesis mechanisms (Fig. 4A). The overall fractions of MSI-H/TMB-H, MSI-H/TMB-L, MSS/TMB-H and MSS/TMB-L were 2.97%, 0.08%, 17.49% and 79.47%, respectively, though these fractions varied across cancer types (Fig. 4B).
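
The four-way grouping can be illustrated with a short sketch. It uses the ULC cutoff of 11 from the Results and the TMB cutoff of 10 mutations per Mb mentioned in the Discussion; treating a TMB of exactly 10 as TMB-H, along with the function names and input layout, is an assumption.

```python
from collections import Counter

ULC_CUTOFF = 11    # ULC >= 11 is called MSI-H
TMB_CUTOFF = 10.0  # mutations per Mb; inclusive boundary is an assumption


def msi_tmb_status(ulc, tmb):
    """Return the combined label, e.g. 'MSI-H/TMB-L'."""
    msi = "MSI-H" if ulc >= ULC_CUTOFF else "MSS"
    tmb_label = "TMB-H" if tmb >= TMB_CUTOFF else "TMB-L"
    return f"{msi}/{tmb_label}"


def status_fractions(cases):
    """cases: iterable of (ulc, tmb) pairs; returns label -> fraction."""
    counts = Counter(msi_tmb_status(ulc, tmb) for ulc, tmb in cases)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy cohort of four made-up cases, one per quadrant.
print(status_fractions([(100, 45.0), (0, 2.5), (11, 9.9), (10, 10.0)]))
```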

Fig. 4: Relationship of MSI and TMB.

A A ULC-TMB scatter plot of the cohort’s cases (TMB is set to 0.1 if 0). B MSI-TMB status concordance in various cancer types. AMCA ampullary cancer, ANCA anal carcinoma, BITC biliary tract cancers, BLCA bladder cancer, BNCA bone cancer, BRCA breast cancer, BWCA bowel cancers, CECA cervical cancer, CNSC central nervous system cancers, EEJC esophageal and esophagogastric junction cancers, EXPD extramammary Paget’s disease, GACA gastric cancer, GIST gastrointestinal stromal tumors, HCNP histiocytic neoplasms, HNCA head and neck cancers, KASA Kaposi’s sarcoma, KICA kidney cancer, LICA liver cancers, LUCA lung cancers, LYMM lymphomas, MESO mesothelioma, MLNM melanoma, NEAT neuroendocrine and adrenal tumors, NMSK non-melanoma skin cancers, OFPC ovarian cancer including Fallopian tube cancer and primary peritoneal cancer, PACA pancreatic cancer, PECA penile cancer, PRCA prostate cancer, STSM soft tissue sarcoma, TECA testicular cancer, TERA teratoma, THCA thyroid carcinoma, TTCA thymomas and thymic carcinomas, UTNP uterine neoplasms, VVCA vulvar and vaginal cancers.

Discordance between PCR-MSI and NGS-MSI

With the fine-tuned ULC cutoff, we reviewed the validation set of MSIDRL development. Three PCR-determined MSI-L/MSS gastric cases were defined as MSI-H by MSIDRL (Supplementary Fig. 1), indicating a non-negligible discordance between PCR-MSI and NGS-MSI.

To verify this, 50 cases with ULCs between 12 and 45 were randomly selected from the data set. Their PCR-MSI results, ULCs, MLH1 methylation status and genomic variants were integrated and analyzed. Of these NGS-determined MSI-H cases, only 4 were determined as MSI-H by PCR and the rest were MSS/MSI-L, although most of the latter, except 2 LUCA cases, had a methylated MLH1 promoter, supportive genomic variants, or both (Fig. 5, Supplementary Data 3). All MSI-L samples were supported by at least MSI-related variants, and one was additionally supported by MLH1 methylation. We believe that these samples are actually MSI-H. This phenomenon may suggest a gap in sensitivity between PCR and NGS.

Fig. 5: The discordance of NGS-MSI and PCR-MSI results in the pan-cancer cohort of this study.

MLH1+, MLH1 promoter methylated. MLH1-, MLH1 promoter unmethylated. Variants+, supportive genomic variants detected. Variants-, no supportive genomic variants detected.

Shrinkage of the MSI panel

Inspired by the bimodal distribution of ULC (Fig. 1A) and the prevalence of chr2:g.148683686del in MSI-H cases (Fig. 3A), we wondered whether a panel of a small number of MS loci would suffice to represent the MSI status of any case of any cancer type.

With an in-house greedy algorithm, we evaluated the classifier performance of virtual panels consisting of different numbers of MS loci (Fig. 6A). Taking the original 100-locus panel as the reference, a panel of 7 loci (Supplementary Table 5) reproduced the MSI status with an overall percent agreement (OPA) of 99.5%, with 115 false positives and 49 false negatives (Fig. 6B).
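
The greedy algorithm itself is not specified in the text. One plausible form, shown here purely as an assumption, is forward selection: starting from an empty panel, repeatedly add the locus that gives the highest OPA against the 100-locus reference calls, where OPA is the fraction of cases with matching calls and the sub-panel ULC cutoff is re-optimized at each step. All names and the toy data are hypothetical.

```python
import numpy as np


def best_opa(sub_ulc, reference):
    """Return (OPA, cutoff) for the best sub-panel ULC cutoff."""
    best = (0.0, 1)
    for cutoff in range(1, int(sub_ulc.max()) + 2):
        opa = float(np.mean((sub_ulc >= cutoff) == reference))
        if opa > best[0]:
            best = (opa, cutoff)
    return best


def greedy_panel(unstable, reference, panel_size):
    """Forward-select loci that best reproduce the reference MSI calls.

    unstable:  bool array (cases x loci), True where a locus is unstable.
    reference: bool array (cases,), True for reference MSI-H calls.
    Returns (chosen locus indices, sub-panel ULC cutoff, OPA).
    """
    chosen = []
    sub_ulc = np.zeros(unstable.shape[0], dtype=int)
    for _ in range(panel_size):
        candidates = [j for j in range(unstable.shape[1]) if j not in chosen]
        scored = [
            (best_opa(sub_ulc + unstable[:, j], reference)[0], j)
            for j in candidates
        ]
        # Highest OPA wins; ties go to the lowest locus index.
        _, pick = max(scored, key=lambda item: (item[0], -item[1]))
        chosen.append(pick)
        sub_ulc = sub_ulc + unstable[:, pick]
    opa, cutoff = best_opa(sub_ulc, reference)
    return chosen, cutoff, opa


# Toy data: 6 cases x 4 loci (made-up values).
unstable = np.array(
    [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0],
     [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]], dtype=bool)
reference = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
print(greedy_panel(unstable, reference, 1))
```

Forward selection does not guarantee the globally best panel, but it keeps the search linear in the number of loci per step, which matters when every candidate must be scored against tens of thousands of cases.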

Fig. 6: Consolidation of the MSI panel.

A Classifier performance by MS locus count. B Correlation of 100-locus ULC (cutoff: >= 11) and 7-locus ULC (cutoff: >= 4).

Discussion

In this study, we developed a novel NGS-based pan-cancer MSI detection algorithm, MSIDRL. It avoids the biologically unvalidated empirical statistical assumptions (e.g., mean ± 3 × standard deviation) applied by previous approaches such as mSINGS19, MSI-ColonCore17 and MSIsensor-pro20. The result of MSIDRL, ULC, reflects the extent to which MMR deficiency affects MS loci. Interestingly, the ULC distribution aggregated at the extremes only in GACA and BWCA (Supplementary Fig. 2), indicating that the impact of MMR deficiency in these cancers was more intense and universal than in other cancers. This phenomenon does not depend on the cancer type composition of the training set, as a similar phenomenon was observed when the loci were selected based on UTNP samples only (data not shown). We also investigated the prevalence of MSI in various cancer types. Besides the well-established MSI-prevalent UTNP, GACA and BWCA, BITC, LICA, OFPC and PACA contributed a considerable number of MSI-H cases. The prevalence of MSI-H was lower in these Chinese patients than in European cohorts, which is consistent with a previous report21. Prevalence differences were observed between cancer subtypes, such as colon cancer and rectal cancer, which may explain the controversial effect of the MSI biomarker in rectal cancer22.

We analyzed the DMGs associated with MSI status, which may help investigate the mechanism or the consequences of MMR deficiency. Some genes, such as BRCA2, ATM, RAD50, MLH1 and PALB2, were inactivated by germline mutations, indicating a potential role as MSI drivers, while others, such as ACVR2A, MSH3, TGFBR2, KMT2C and RNF43, were found with only somatic mutations, which were always located in “hotspot” STR regions of their CDS. Although MLH3 and MSH3 are MMR genes, they bore few germline mutations, which is consistent with the rarity of MSH3- or MLH3-related hereditary non-polyposis colorectal cancer cases23.

Besides somatic single-nucleotide deletion “hotspots” in STR regions, our study found 4 non-coding germline SNPs with a strong negative association with MSI. None of these SNPs was involved in ULC calculation. These SNPs have been reported to be associated with chemotherapy toxicity, susceptibility to cancer, or prognosis24,25,26, but reports of an association with MSI are rare, and the protective mechanism underlying MSI suppression is intriguing.

We also studied the relationship between MSI and TMB. It is not surprising that the ULC of MSI-H cases correlated with TMB, as some of the variants counted by ULC were also counted by TMB. With a cutoff of 10 mutations per Mb, TMB-H cases include almost all MSI-H cases, showing the potential of TMB as a surrogate biomarker for MSI.

The application of PCR-based five-locus MSI detection panels (either Bethesda or Promega) to non-colorectal samples is still under debate4. In this study, we demonstrated that the Promega panel yielded putative false-negative results in pan-cancer cases. Supplementary Fig. 2 shows that MSI in GACA and BWCA tends to be global, while in other cancers its effects are more diverse. Thus, the lack of representativeness or vulnerability of the classical loci in other cancers caused the ineffectiveness, which was probably related to cancer-specific chromatin organization. Eventually, 7 MS loci with the power of pan-cancer MSI detection were discovered. The consolidation from 100 loci to 7 loci would reduce diagnostic costs.

The main limitation of this study is the lack of clinical trials to definitively determine the clinical effectiveness of NGS and PCR on samples with discordant results. We expect head-to-head clinical trials of drug response to ultimately evaluate the performance of NGS and PCR.

Methods

Patients and data

Cancer patient cases tested by the capture-based NGS 733-gene laboratory-developed test (LDT) (see “NGS LDT” part and Supplementary Table 1) in the CAP-, CLIA- and ISO15189-certified 3DMed Medical Laboratory (3DMed Biomedical Technology, Shanghai, China) from June 2020 to July 2023 were consecutively recruited to this study, except those that failed any QC procedure. Written informed consent was obtained from all participants, permitting the use of anonymized NGS data for academic research.

All data involved in this study were handled in accordance with the Declaration of Helsinki. This study is exempt from ethical review according to Article 32 of the “Measures for Ethical Review of Life Science and Medical Research Involving Human Beings” (https://www.gov.cn/zhengce/zhengceku/2023-02/28/content_5743658.htm) issued by the National Health Commission of the People’s Republic of China.

NGS LDT

For each patient, a pair of samples was analyzed: formalin-fixed paraffin-embedded (FFPE) tumor tissue alongside either FFPE paracancerous normal tissue or peripheral blood. For FFPE samples, we required at least 15 unstained slides 4–5 μm thick, prepared within the past year and with a tumor content >= 20%, to assure a DNA input of 200 ng. For peripheral blood, we required at least 5 ml collected in a Streck tube. All samples were transported at ambient temperature.

Genomic DNA was extracted with ReliaPrep™ FFPE gDNA Miniprep System (Promega Corporation, Madison, Wisconsin, USA) or QIAamp DNA Blood Mini Kit (QIAGEN, Germantown, Maryland, USA), and sonicated to an average size of 250 bp. Libraries were prepared with KAPA HyperPrep Kit (KAPA Biosystems, Cape Town, South Africa) and targets were enriched by hybridization with customized single-stranded DNA probes synthesized by Integrated DNA Technologies (Coralville, Iowa, USA). Sequencing was performed on NovaSeq™ 6000 Sequencing Systems (Illumina, San Diego, California, USA) in PE100 or PE150 mode to produce adequate data assuring a minimal mean effective depth of 500×. FASTQ files were mapped to human reference genome hg19. Somatic and/or germline single-nucleotide variations (SNV), insertions or deletions not longer than 40 bp (indel), large genomic rearrangements (LGR), copy-number variations (CNV), gene fusions, MSI and tumor mutational burden (TMB) were called or calculated with in-house bioinformatics pipelines.

PCR-MSI

PCR-MSI assays were performed by Guangyue Medical Laboratory (Microread Genetics Co., Ltd., Guangzhou, China) with a multiplex fluorescent PCR capillary electrophoresis approach. Amplicon lengths of 6 MS loci (NR-21, BAT-26, NR-27, BAT-25, NR-24, MONO-27) were analyzed. MSS was defined as no altered locus among the six in the tested sample, MSI-L as only 1 altered locus, and MSI-H as 2 or more altered loci.
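
The calling rule above amounts to a simple threshold on the number of altered loci, as in the following sketch. The function name and the input format (a mapping from locus name to whether its amplicon length was altered) are assumptions for illustration.

```python
PCR_LOCI = ("NR-21", "BAT-26", "NR-27", "BAT-25", "NR-24", "MONO-27")


def pcr_msi_status(altered):
    """Classify a sample from per-locus alteration flags.

    altered: dict mapping each of the six locus names to True/False
    (hypothetical input format).
    """
    missing = [locus for locus in PCR_LOCI if locus not in altered]
    if missing:
        raise ValueError(f"missing loci: {missing}")
    n_altered = sum(bool(altered[locus]) for locus in PCR_LOCI)
    if n_altered == 0:
        return "MSS"
    if n_altered == 1:
        return "MSI-L"
    return "MSI-H"


# Toy sample with two altered loci.
flags = {locus: False for locus in PCR_LOCI}
flags["BAT-26"] = flags["NR-21"] = True
print(pcr_msi_status(flags))
```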

Cancer typing and subtyping

The cancers of the test cases were classified primarily according to the categories outlined in the NCCN cancer treatment guidelines (https://www.nccn.org/guidelines/category_1) with a few minor, arbitrary modifications. The classification was primarily based on anatomy with some consideration of certain histological types. Subtyping of cancer types was based on anatomy or histology depending on cancer types.

Definition of deleteriously mutated genes (DMG)

If an ACMG-classified Pathogenic (P) or Likely Pathogenic (LP)27 somatic or germline SNV, indel, CNV, LGR or fusion variant was detected in a gene in a case, the gene was defined as a DMG of the case.

Protein association analysis

Protein association analysis was performed with STRING (Version: 12.0) (https://cn.string-db.org/). Default parameters were used except that only physical interactions evidenced by experiments were considered.

MLH1 promoter methylation test

MLH1 promoter methylation was tested with MethylTarget™ (GeneSky, Shanghai, China). Sample DNA was bisulfite-converted, amplified and sequenced. An average methylation level of CpG dinucleotides within hg19 chr3:37034654-37034840 greater than or equal to 10% was defined as MLH1 promoter methylated; otherwise, unmethylated.
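
The methylation call can be written as a one-line rule, sketched below. The function name and the input (per-CpG methylation levels in percent for the CpG sites in the stated interval) are assumptions; how per-CpG levels are derived from the sequencing reads is not covered here.

```python
def mlh1_promoter_methylated(cpg_levels_percent, threshold_percent=10.0):
    """Return True when the mean CpG methylation level is >= 10%.

    cpg_levels_percent: per-CpG methylation levels (0-100) for the CpG
    dinucleotides in the assayed interval (hypothetical input format).
    """
    if not cpg_levels_percent:
        raise ValueError("no CpG measurements")
    mean_level = sum(cpg_levels_percent) / len(cpg_levels_percent)
    return mean_level >= threshold_percent


# Toy measurements (made-up values).
print(mlh1_promoter_methylated([8.0, 12.0, 25.0]))
```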

Statistical Analysis

Statistical analyses were performed with the SciPy package (Version: 1.10.0) of Python (Version: 3.10.9). Fisher’s exact test was used to analyze categorical data. The Mann-Whitney U test was used to compare ULC between groups. P-values were Bonferroni-adjusted and converted to Q-values for convenience (Eq. 2). Q-values larger than 0 (i.e., adjusted P-values < 0.01) were considered statistically significant. Correlation between MSI and TMB was analyzed with linear regression.

$${\rm{Q}}=\left\{\begin{array}{c}-{\log }_{10}\left(\mathrm{adjusted\; P}\right)-2\,\left(\mathrm{adjusted}\,{\rm{P}}\ne 0\right)\\ 310\,\left(\mathrm{adjusted}\,{\rm{P}}=0\right)\end{array}\right.$$
(2)
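
Eq. 2 can be illustrated with the short sketch below. It assumes the standard Bonferroni adjustment (raw P multiplied by the number of tests and capped at 1); the function name and the `n_tests` parameter are illustrative.

```python
import math


def q_value(p, n_tests):
    """Convert a raw P-value to the Q-value of Eq. 2.

    The Bonferroni-adjusted P is min(1, p * n_tests) (assumed form).
    Q > 0 is equivalent to adjusted P < 0.01.
    """
    adjusted = min(1.0, p * n_tests)
    if adjusted == 0:
        return 310.0  # fixed value used when the adjusted P underflows to 0
    return -math.log10(adjusted) - 2


# Toy example: raw P = 1e-6 across 100 tests gives adjusted P = 1e-4, Q = 2.
print(q_value(1e-6, 100))
```

Shifting by 2 places the significance threshold at zero, so the sign of Q alone indicates whether a result is significant.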