Abstract
Lynch syndrome is characterised by heterozygous germline mutations in the MisMatch Repair (MMR) genes and an increased risk of cancer. Previous population estimates based on cohorts with colorectal cancer suggest one in 300 people have a disease-causing variant. This study calculated the population frequency of Lynch syndrome from Predicted pathogenic variants in MMR genes in the general population. MLH1, MSH2, MSH6 and PMS2 variants were downloaded from gnomAD v.2.1, and annotated in ANNOVAR. Our population frequencies of heterozygous Predicted pathogenic variants were calculated from the sum of structural, null, rare computationally-damaging missense changes, and founder variants. Population frequencies were also derived from the proposed ClinGen variant specifications, and from pathogenic variants in the ClinVar, LOVD or InSiGHT websites. Predicted pathogenic variants were found in one in 94 people in gnomAD v.2.1.1 using our strategy, and one in 122, 203, 199 or 594 using the proposed ClinGen specifications, and the ClinVar, LOVD or InSiGHT databases, respectively. The frequencies derived from ClinVar (one in 203) and LOVD (one in 199) were based on accurate assessments of penetrant variants since they were largely derived from patient testing, but were underestimates because not all gnomAD variants had been assessed. Our strategy and that of the proposed ClinGen specifications examined each variant in gnomAD and resulted in more common population frequencies but some assessments may have been inaccurate, and variants incompletely penetrant. The number of Predicted pathogenic MMR gene variants in the general population suggests that Lynch syndrome is more common than reported previously.
Introduction
Lynch syndrome is characterised by an increased lifetime risk of colorectal and other cancers1, and results from heterozygous deleterious germline mutations in one of the four DNA mismatch repair (MMR) genes (MLH1, MSH2, MSH6, PMS2)2,3,4,5. Pathogenic variants are associated with microsatellite instability, leading to uncontrolled genomic replication and cancer development6. Less commonly, deletions of the 3’ end of EPCAM7 result in MSH2 promoter inactivation and hypermethylation7.
The MMR genes are each associated with different cancer phenotypes and risks. Pathogenic MLH1 and MSH2 variants are found in classical Lynch syndrome with mainly colorectal or endometrial cancer, a 50% lifetime risk and early onset disease8,9,10. Pathogenic MSH6 and PMS2 variants result in a 20% lesser risk of cancer and an atypical, later presentation11. MSH6 variants are associated with mainly endometrial cancers, and fewer tumours with microsatellite instability10,12,13. PMS2 variants are found with a lower likelihood of colorectal cancer but with microsatellite instability11,14.
The definitive diagnosis of Lynch syndrome is made with genetic testing and identification of a pathogenic MMR variant. Individuals with colorectal or endometrial cancer together with a family history of cancer, microsatellite marker tumour phenotype, or negative MMR immunohistochemistry should be screened15. Early recognition is critical because active management greatly improves the outcome, with surveillance colonoscopies reducing cancer incidence and mortality16.
Knowing the population frequency of Lynch syndrome alerts clinicians to the likelihood of encountering this disease, informs the health system of the need for genetic tests and treatment, and encourages pharmaceutical companies to develop therapies. There are two approaches to identifying the population frequency of Lynch syndrome: one uses the clinical diagnosis of cancer and the other uses pathogenic MMR variants that predict disease. The population frequency of MMR pathogenic variants indicates the maximum population frequency of Lynch syndrome but whether people with a pathogenic variant develop cancer depends on variant penetrance. Previous studies have estimated a population frequency for Lynch syndrome of between one in 660 and 2,00017. More recent studies have found a frequency of one in 279 in US, Canadian and Australian populations18. The highest reported population frequency was one in 226 in Iceland, where founder variants are observed19. However, most of these cohorts have been recruited from probands with colorectal cancer, and other known Lynch syndrome-associated cancer types have not been examined. In addition, these studies have generally examined a single population or have not considered the prevalence of heterozygotes with pathogenic MMR variants and poorly penetrant disease or those who are yet to develop cancer.
Two recent reports of population frequencies for Lynch syndrome have been derived from the computational analysis of a general population20,21. A study from Nevada found 53 people with a Pathogenic/Likely pathogenic variant in a Lynch syndrome gene from a population of 26,906, equivalent to a population frequency of one in 34020. However, these included only 8 missense variants (15%) and no structural or copy number changes. Another study found 76 people with a Pathogenic/Likely pathogenic variant in 49,738 people from the UK Biobank21. Again, this did not include structural or copy number changes, and only 4 of the 48 variants (12.5%) were missense. These low frequencies probably occurred because both studies only considered the most rigorous criteria for pathogenicity and did not add founder variants. Nevertheless, both studies also provided useful unbiased data on variant penetrance or the likelihood of people with a pathogenic Lynch syndrome variant developing cancer.
The present study has used various computational approaches to estimate the population frequency of ‘Predicted pathogenic’ variants in the MMR (Lynch syndrome) genes in the Genome Aggregation Database (gnomAD v.2.1.1)22. Usually variants are classified with the ACMG/AMP criteria23, but gnomAD includes no clinical data except that the cancer, non-cancer and control subgroups can be examined separately. The lack of clinical information, and the absence of tissue immunohistochemistry and other functional data, meant that many of the ACMG/AMP and ClinGen criteria for Lynch syndrome could not be applied. Instead, our assessment was based on the number of Predicted pathogenic MMR gene variants in the gnomAD cohort. Similar analyses have been used to deduce variant pathogenicity and population frequencies in other genetic diseases including Alport syndrome, Polycystic Kidney Disease, Gitelman syndrome, Wilson disease, and the mucopolysaccharidoses24,25,26,27,28, and in some cases these results have been confirmed independently with histological or biochemical data24,26.
Methods
Population database
The population frequency of Lynch syndrome was estimated from the Predicted pathogenic variants in MLH1, MSH2, MSH6 and PMS2 (GRCh37/hg19) in gnomAD v2.1.1 (https://gnomad.broadinstitute.org/; n = 141,456). GnomAD was developed by a coalition of investigators who aggregated large scale sequencing studies and made the data available to the scientific community.
GnomAD v.2.1.1 comprises variant information from 125,748 whole exome sequences (WES) and 15,708 whole genome sequences (WGS) of unrelated adults recruited from studies of cancer, diabetes, and cardiac and neuropsychiatric disorders as well as controls. It includes equal numbers of males and females, and ancestry information, but no clinical data. Variants were downloaded and examined in June to August 2024.
The results from the whole gnomAD cohort were compared with the control, cancer and non-cancer subsets to identify any selection bias. The control subset represented healthy age- and sex-matched individuals recruited for the original studies. The cancer subset was examined to determine the frequency of Lynch syndrome in people with diagnosed cancer. This included various unknown cancer types. The non-cancer subset referred to people recruited to a cancer study because they did not have a diagnosis of cancer, but this did not guarantee no personal or family cancer history. These subsets were provided by gnomAD and were likely to overlap; for example, some people in the control group may have had undiagnosed cancer or developed cancer later.
Participant and public involvement
All individuals included in the gnomAD dataset had provided written, informed consent at recruitment for the subsequent use of their anonymised genetic information, and further ethical approval was not required. Patients and the public were not directly involved in the design or conduct of the present study but may have been involved originally.
Strategy
According to our strategy, Predicted pathogenic variants included loss-of-function structural variants; null variants; and missense changes that were rare and damaging on computational assessment. Many reported pathogenic variants in the MMR genes are null changes (Fig. 1, Simple ClinVar,29) and missense variants are difficult to interpret accurately. For this reason we used a rigorous approach that required all four computational tools (Polyphen 2, SIFT, Mutation Taster and Conserved in a Clustal analysis) to be positive. We evaluated the accuracy of this strategy, and compared our population frequencies with those deduced from REVEL scores and from the ClinGen guidelines for MAPP (Multivariate Analysis of Protein Polymorphisms) + Polyphen2 prior probabilities (Suppl Table 1). MAPP is an application that interprets missense variants in the MLH1 and MSH2 genes. PP2 is a prediction tool that improves the accuracy of variant classification when integrated with MAPP.
Distribution and type of pathogenic variants in the MMR genes (from Simple ClinVar).
We then compared our results from the whole gnomADv.2.1.1 cohort with the control, the cancer and non-cancer subsets. Finally, we compared the population frequencies derived from our strategy and those from the ClinVar, the Leiden Open-source Variation Database (LOVD)30 and the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) databases31, which were likely to be accurate because assessments were from accredited testing laboratories, often from patients with suspected disease and used the ACMG/AMP criteria23.
Structural variants
Structural variants were available for 10,847 people. Those with a gnomAD-classified functional consequence of predicted loss of function (pLoF) were assessed as Predicted pathogenic, and their number was scaled to the total cohort size (× 13).
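The × 13 correction follows from the two cohort sizes quoted in the text. A minimal sketch of the scaling, in which the function name and the example count of 3 observed carriers are illustrative assumptions and not values from the study:

```python
# Cohort sizes are taken from the text; everything else is illustrative.
FULL_COHORT = 141_456   # all gnomAD v2.1.1 participants
SV_COHORT = 10_847      # participants with structural-variant calls


def scale_to_full_cohort(observed_carriers: int) -> float:
    """Extrapolate a carrier count from the SV subset to the whole cohort."""
    return observed_carriers * FULL_COHORT / SV_COHORT


scaling_factor = FULL_COHORT / SV_COHORT
print(f"scaling factor: {scaling_factor:.2f}")

# Hypothetical example: 3 carriers observed in the SV subset.
print(f"scaled carriers: {scale_to_full_cohort(3):.1f}")
```

The extrapolation assumes that the 10,847 people with structural-variant calls are representative of the whole cohort.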
Annotation of null and missense variants
Our strategy has been described previously (Fig. 2)24,32. Variants in MLH1, MSH2, MSH6 and PMS2 were downloaded, annotated in ANNOVAR (https://annovar.openbioinformatics.org/), and filtered based on the predicted effect on the canonical transcripts (MSH2: ENST00000233146; MLH1: ENST00000231790; MSH6: ENST00000234420; PMS2: ENST00000265849). Initially, variants located in the upstream, downstream, 5’, 3’ UTR, intronic, non-coding regions as well as synonymous variants were excluded. A ‘Predicted pathogenic’ assessment was used to differentiate variants assessed here as disease-causing from those assessed Pathogenic or Likely pathogenic according to the ACMG/AMP criteria23.
Our strategy for identifying Predicted pathogenic variants.
Null variants
Null variants including protein-truncating, frameshift or canonical splice site variants with an allele count below the Maximum Credible Allele frequency (MCAF) cutoffs (MLH1:0.0001, 0.1%; MSH2:0.0001, 0.1%; MSH6:0.00022, 0.22%; PMS2:0.00028, 0.28%) were assessed as Predicted pathogenic (https://www.clinicalgenome.org/affiliation/50099). Nonsense and frameshift variants were classified Predicted pathogenic except where they occurred in the last exon or the last 50 nucleotides of the penultimate exon and thus escaped nonsense-mediated decay33,34.
Splice and start loss variants were Predicted pathogenic where SpliceAI scores were > 0.8 (https://spliceailookup.broadinstitute.org/).
Variants affecting the initiation codon of MLH1, MSH6 and PMS2 were Predicted pathogenic, but there was insufficient evidence for pathogenicity prediction for MSH2 because further inframe start codons are present in its exon 134.
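The null-variant rules above can be sketched as a small decision function. This is an illustration under stated assumptions and not the authors' pipeline: the function and parameter names are hypothetical, the MCAF frequency filter is omitted, and start-loss variants are decided by gene alone.

```python
SPLICEAI_CUTOFF = 0.8                        # threshold quoted in the text
START_LOSS_GENES = {"MLH1", "MSH6", "PMS2"}  # MSH2 left out: later in-frame start codons


def escapes_nmd(exon_index: int, n_exons: int, nt_from_exon_end: int) -> bool:
    """True if a premature stop is expected to escape nonsense-mediated decay.

    exon_index is 1-based; nt_from_exon_end counts from the 3' end of that
    exon (1 = last nucleotide of the exon).
    """
    if exon_index == n_exons:
        return True
    return exon_index == n_exons - 1 and nt_from_exon_end <= 50


def null_is_predicted_pathogenic(kind, gene, *, exon_index=None, n_exons=None,
                                 nt_from_exon_end=None, spliceai=None) -> bool:
    """Apply the simplified null-variant rules to one variant."""
    if kind in {"nonsense", "frameshift"}:
        if None in (exon_index, n_exons, nt_from_exon_end):
            raise ValueError("exon position is required for truncating variants")
        return not escapes_nmd(exon_index, n_exons, nt_from_exon_end)
    if kind == "splice":
        return spliceai is not None and spliceai > SPLICEAI_CUTOFF
    if kind == "start_loss":
        return gene in START_LOSS_GENES
    return False
```

For example, a frameshift in the middle of a transcript returns True, whereas the same change in the final exon returns False because it is expected to escape nonsense-mediated decay.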
In-frame variants
The ClinGen guidelines have no recommendations for managing in-frame changes34 and VEST4 has not been validated on the MMR genes. However, we still chose to assess in-frame variants as Predicted pathogenic where their Variant Effect Scoring Tool 4 score was > 0.5 (http://cravat.us/CRAVAT/)35. All variants were counted even if there were more than 5.
Missense variants
All missense variants in each of the four MMR genes were evaluated as Predicted pathogenic where they were found in 5 or fewer people in gnomADv.2.1.1; were pathogenic in all three computational tools, namely Sorting Intolerant From Tolerant 4G (< 0.05, https://sift.bii.a-star.edu.sg/sift4g/), PolyPhen-2 (≥ 0.95, http://genetics.bwh.harvard.edu/pph2/) and MutationTaster (‘Disease-causing’, D or A, https://www.mutationtaster.org/); and were ‘conserved’ (*) or ‘likely conserved’ (:) in Clustal Omega sequence alignments of vertebrates (humans, chicken and mice; https://www.ebi.ac.uk/jdispatcher/msa/clustalo) (Fig. 2).
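The missense criteria amount to a conjunction of five tests. A minimal sketch using the thresholds quoted above; the record layout and field names are assumptions made for illustration, not an ANNOVAR output format:

```python
from dataclasses import dataclass


@dataclass
class MissenseCall:
    carriers: int          # people carrying the variant in gnomAD
    sift: float            # SIFT 4G score (lower = more damaging)
    polyphen2: float       # PolyPhen-2 score (higher = more damaging)
    mutation_taster: str   # prediction code, e.g. "D", "A", "N", "P"
    clustal: str           # Clustal Omega conservation symbol


def is_predicted_pathogenic(v: MissenseCall) -> bool:
    """All criteria must hold: rarity, three tools, and conservation."""
    return (
        v.carriers <= 5
        and v.sift < 0.05
        and v.polyphen2 >= 0.95
        and v.mutation_taster in {"D", "A"}
        and v.clustal in {"*", ":"}
    )


# Hypothetical variant that satisfies every criterion.
example = MissenseCall(carriers=2, sift=0.01, polyphen2=0.99,
                       mutation_taster="D", clustal="*")
print(is_predicted_pathogenic(example))
```

Because every test must pass, a single tolerant prediction is enough to exclude a variant, which favours specificity over sensitivity.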
Our strategy was compared with the criteria proposed by the ClinGen InSiGHT team (MAPP + Polyphen2 prior probabilities, Suppl Table 1) (ClinGen Variant Curation Expert Panel for MMR, Version 1.0.0, https://url.au.m.mimecastprotect.com/s/V-upCxnMJ5sL80P03s2CviySW2h?domain=cspec.genome.network)(34, 36).
Evaluation of strategy for missense variants
The accuracy of our strategy for missense variants was evaluated with 20 Pathogenic/Likely Pathogenic variants from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), LOVD (https://www.lovd.nl/), and InSiGHT (https://insight-database.org/) that were randomly selected, as well as 20 from gnomAD where the allele count was > 30 and that were assumed to be Benign/Likely Benign. These were evaluated using our missense assessment criteria (rarity, computational scores and conservation), and our strategy’s sensitivity, specificity, and positive (PPV) and negative predictive values (NPV) were then calculated (Graphpad, https://www.graphpad.com).
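The four performance measures come from a 2 × 2 table of predicted against reference classifications. A short sketch of the standard definitions; the counts in the example are hypothetical and are not the study's validation results:

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # pathogenic variants correctly flagged
        "specificity": tn / (tn + fp),   # benign variants correctly passed
        "ppv": tp / (tp + fp),           # flagged variants that are pathogenic
        "npv": tn / (tn + fn),           # passed variants that are benign
    }


# Hypothetical outcome for 20 pathogenic and 20 benign reference variants.
metrics = diagnostic_metrics(tp=15, fp=2, tn=18, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With only 20 variants in each reference group, one misclassified variant shifts sensitivity or specificity by five percentage points.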
In addition, all missense variants were assessed with a highly stringent REVEL score of > 0.932, and those Predicted pathogenic were added to the total of Predicted pathogenic variants for structural and null variants, and the corresponding population frequency was calculated. This REVEL score has been reported to have the best performance for distinguishing pathogenic from rare benign variants with allele frequencies < 0.5%37.
Finally, all missense variants were evaluated with the combination of REVEL scores and MAPP + Polyphen2 prior probabilities (http://priors.hci.utah.edu/PRIORS).
Founder variants
After our assessment was completed, the data were re-examined for possible founder variants that were assessed as Predicted pathogenic and found in at least 15 people in gnomAD v2.1.1. These were then examined for pathogenicity in the InSiGHT database (https://www.insight-database.org/classifications/). Additional published founder variants were also examined for their presence in gnomAD v.2.1.138.
Overall population frequency of Predicted pathogenic MMR variants
The overall population frequency of Predicted pathogenic MMR variants was then calculated from the sum of people with a Predicted pathogenic MSH2, MLH1, MSH6, or PMS2 variant divided by the mean number of people who had undergone sequencing for each gene in the gnomAD dataset. These numbers varied depending on how often each allele was detected by sequencing. Our calculation assumed that each person had only one Predicted pathogenic variant.
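The calculation described above can be sketched as follows. The carrier counts per gene are those reported in the Results. The per-gene numbers of people sequenced are not listed in the text, so the sketch assumes each equals the cohort figure of 119,883 quoted in the Results; the function name is also illustrative.

```python
def one_in_n(carriers_by_gene: dict, sequenced_by_gene: dict) -> float:
    """Return N such that about one person in N carries a variant.

    Assumes each carrier has only one Predicted pathogenic variant, so
    carriers can simply be summed across genes.
    """
    total_carriers = sum(carriers_by_gene.values())
    mean_sequenced = sum(sequenced_by_gene.values()) / len(sequenced_by_gene)
    return mean_sequenced / total_carriers


carriers = {"MLH1": 196, "MSH2": 290, "MSH6": 416, "PMS2": 299}
# Assumption: the same denominator is used for every gene.
sequenced = {gene: 119_883 for gene in carriers}

print(f"one in {one_in_n(carriers, sequenced):.0f}")
```

If some people carried two Predicted pathogenic variants, the summed count would overstate the number of carriers and the frequency would be slightly too high.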
Population frequencies in non-cancer, cancer and control subsets of gnomAD
The population frequencies derived from Predicted pathogenic variants using the above strategy were also calculated for Lynch syndrome using the non-cancer (n = 134,187), cancer (n = 7,369) and control (n = 60,146) cohorts in gnomADv2.1.1.
Ancestry-specific population frequencies
The population frequencies for Lynch syndrome and the four associated genes were studied for each of the eight ancestries included in the gnomADv2.1 dataset (African/African American, Latino/Admixed American, Ashkenazi Jewish, East Asian, European (Finnish), European (non-Finnish), South Asian, and Others).
Population frequencies based on gnomAD variants shared with ClinVar, LOVD or InSiGHT
Finally, population frequencies were calculated based on variants present both in gnomAD and classified as ‘Pathogenic’, ‘Likely pathogenic’ or disease-causing in ClinVar, LOVD or InSiGHT. Variants in ClinVar were included where they were Pathogenic or Likely pathogenic or Conflicting with Pathogenic or Likely pathogenic, together with a VUS, but not a Benign assessment. Variants in LOVD or InSiGHT had a score of 4 or 5, that is, Pathogenic or Likely pathogenic. Disease-causing variants in these datasets were included even where the allele count was greater than 5.
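The database inclusion rules can be expressed as two small predicates. This is a sketch under assumptions: the label strings and function names are illustrative, and treating ‘Likely benign’ in the same way as ‘Benign’ is an assumption that the text does not state.

```python
PATHOGENIC = {"Pathogenic", "Likely pathogenic"}
BENIGN = {"Benign", "Likely benign"}   # assumption: Likely benign also excludes


def include_clinvar(assertions: set) -> bool:
    """ClinVar rule: at least one P/LP assertion and no benign assertion.

    A conflicting record with P/LP plus 'Uncertain significance' is kept.
    """
    return bool(assertions & PATHOGENIC) and not (assertions & BENIGN)


def include_lovd_or_insight(variant_class: int) -> bool:
    """LOVD / InSiGHT rule: class 4 (Likely pathogenic) or 5 (Pathogenic)."""
    return variant_class in (4, 5)


print(include_clinvar({"Likely pathogenic", "Uncertain significance"}))
print(include_clinvar({"Pathogenic", "Benign"}))
print(include_lovd_or_insight(3))
```

No allele-count ceiling appears in these predicates because disease-causing variants from the databases were included even where the allele count exceeded 5.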
Statistical analysis
Results were compared with Chi-square analysis (Graphpad, https://graphpad.com).
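As an illustration of this kind of comparison, the sketch below applies a Pearson chi-square test to a 2 × 2 table of carriers and non-carriers. How the contingency tables were constructed in GraphPad, and whether a continuity correction was applied, is not stated in the text, so the table layout and the uncorrected statistic are assumptions. The example uses the carrier counts reported in the Results for our strategy (1,201) and ClinVar (593) in a cohort of 119,883.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p value for one degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


cohort = 119_883
chi2, p_value = chi_square_2x2(1_201, cohort - 1_201, 593, cohort - 593)
print(f"chi-square = {chi2:.1f}, p < 0.0001: {p_value < 0.0001}")
```

Both counts come from the same gnomAD participants, so the two rows are not independent samples, and the p value from this sketch is indicative only.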
Results
Population frequency of Predicted pathogenic MMR variants in gnomAD
Overall, there were 754 Predicted pathogenic variants in 1,201 people in a cohort of 119,883 that corresponded to a population frequency of one in 100 (Table 1, Fig. 3).
Summary of the number of predicted pathogenic variants in the Lynch syndrome genes using different strategies and cohorts.
These included 39 structural, 240 null and 475 missense variants in 39, 381 and 781 people respectively. The ‘null’ variants also included 26 indels that were present in 127 people. One of these, c.279_281TST or p.Leu94del, had a VEST4 score of 0.98 but was present in 50 people (Suppl Table S2). Thus, the null variants and indels represented 32% (381/1,201) and missense variants 65% (781/1,201) of all the changes.
Non-cancer, control and cancer cohorts
Our population frequency for Predicted pathogenic variants including founder variants in the overall cohort (one in 94 people) was not different from that in the non-cancer subgroup (one in 105 people, p = 0.18). It was, however, higher than in the control (one in 114 people, p = 0.02) and the cancer cohorts (one in 152, p = 0.004) (Table 1, Fig. 3).
Assessment of strategy for missense variants
Our strategy’s median sensitivity for missense variants was 77% (range 61 – 93%), with a specificity of 92% (range 90 – 97%), and a median PPV of 84% (76 to 94%) and NPV of 86% (82 to 95%) (Suppl Table 3), consistent with a moderate sensitivity and high specificity for pathogenic variants.
When REVEL was used to assess missense variants, there were 111 Predicted pathogenic missense variants in 159 people. Together with the 420 people who had structural or null variants, this gave 579 people with a Predicted pathogenic variant in a cohort of 119,883, corresponding to one in 207 people (Table 2, Fig. 3).
ClinGen strategy
When the MAPP + Polyphen2 prior probability approach (ClinGen strategy) was used, there were 361 missense variants in 560 people. Together with the 420 people who had structural or null variants, this gave 980 affected people in a cohort of 119,883, corresponding to one in 122 people (Table 2, Fig. 3).
Founder variants
GnomAD v2.1.1 included only a few Founder variants. These were in MSH6 (c.3226C > G; p.Arg1076Cys in 24 people), and PMS2 (c.137G > T; p.Ser46Ile in 48 people), both of which were assessed as Pathogenic in ClinVar and InSiGHT (Suppl Table 4). These were initially excluded from our assessment because of their high allele frequency, but when the total of 72 people was added to the Predicted pathogenic variants, there were 1,273 people with a Predicted pathogenic or pathogenic variant in a cohort of 119,883 corresponding to an overall population frequency of one in 94 people. Other reported founder variants were not present in gnomAD (Suppl Table 5).
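The effect of adding the founder variants can be checked directly from the counts given in the text. A brief sketch, with illustrative variable names:

```python
cohort = 119_883
carriers_strategy = 1_201              # Predicted pathogenic, our strategy alone
founder_carriers = {"MSH6 p.Arg1076Cys": 24, "PMS2 p.Ser46Ile": 48}

total_with_founders = carriers_strategy + sum(founder_carriers.values())

print(f"strategy alone: one in {cohort / carriers_strategy:.0f}")
print(f"with founders:  one in {cohort / total_with_founders:.0f}")
```

The two founder variants add 72 carriers, about 6% more than the strategy alone, which moves the estimate from about one in 100 to about one in 94.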
Individual genes
Predicted pathogenic variants were commonest in MSH6 (one in 286), then PMS2 (one in 396), MSH2 (one in 420) and MLH1 (one in 618) (Table 1, Fig. 3).
Overall, null variants represented 32% (381/1,201) of all the changes and missense variants 65% (781/1,201). For MLH1, there were 124 Predicted pathogenic variants in 196 people, including 35 with null variants and 161 with missense changes corresponding to a population frequency of one in 618. For MSH2, there were 190 Predicted pathogenic variants in 290 people including 39 with structural, 46 with null and 205 with missense variants, corresponding to a population frequency of one in 420. For MSH6, there were 266 variants in 416 people, including 161 with null and 255 with missense changes corresponding to a population frequency of one in 286. For PMS2, there were 174 variants in 299 people, including 139 with null variants and 160 with missense changes, corresponding to a population frequency of one in 396. Thus, variants were most abundant in MSH6 and PMS2 which together accounted for 416 plus 299 or 715/1,201 (60%) of all Predicted pathogenic changes.
Population frequencies in people of different ancestries
According to our strategy, Predicted pathogenic variants were commonest in people of African/African American (one in 73, p < 0.0001) or South Asian (one in 88, p = 0.0025) ancestries and least common in people of Finnish (one in 698, p < 0.0001) or Ashkenazi (one in 305, p < 0.0001) ancestries compared with Europeans (Table 3).
Population frequency from other variant databases: ClinVar, LOVD and InSiGHT
ClinVar
There were 219 variants in 593 people in gnomAD that were classified Pathogenic, Likely pathogenic or VUS/P/LP in ClinVar, corresponding to a population frequency of one in 203 people (Table 4, Fig. 3). This was fewer than found with our strategy (p < 0.0001).
LOVD
There were 144 variants in 609 people in gnomAD that were assessed as Pathogenic or Likely pathogenic in LOVD corresponding to a population frequency of one in 197 people. This was also fewer than with our strategy (p < 0.0001).
InSiGHT
There were 75 variants in 202 people in gnomAD that were assessed as Pathogenic (class 5) or Likely pathogenic (class 4) by InSiGHT corresponding to a population frequency of one in 594 people. Again this was fewer than with our strategy (p < 0.0001).
Discussion
The population frequency of Predicted pathogenic MMR variants associated with Lynch syndrome was one in 94 people when our strategy was used to examine gnomAD v.2.1.1 and founder variants were included; one in 100 with our strategy alone; one in 122 using the ClinGen InSiGHT Variant Curation Expert panel strategy (MAPP + Polyphen2 Priors); and one in 203 using known variants from the ClinVar or LOVD (Pathogenic, Likely pathogenic) databases. Our analyses all suggest that Lynch syndrome is more common than the previously reported population frequencies of one in 200 to 30018, one in 600 to 2,00017, or recent computational assessments of one in 340 or one in 65420,21. Our approach nevertheless risked overclassifying variants as disease-causing because of the lack of confirmatory clinical, histopathological or functional evidence.
Previous epidemiological estimates of Lynch syndrome have been largely based on the genetic testing of people with colorectal cancer and their families rather than the general population18. The difference between our estimates and the previous range may be explained by undiagnosed heterozygotes and incomplete disease penetrance from the milder pathogenic missense variants in our analysis. Interestingly, our assessment, and those using REVEL scores and the MAPP + Polyphen 2 Priors strategy all demonstrated that missense variants were more common than null changes, which is different from previous observations and more consistent with an association with milder or absent phenotypes.
Our population frequencies of Predicted pathogenic MMR variants may still be underestimates. The gnomAD cohort itself will not include people who have died of cancer, and may have excluded others who were not recruited because of a suspected cancer diagnosis. In addition, some people with disease-causing variants may have had an undetected cancer that is removed by the body’s own immune system. We also did not examine whether people with a Predicted pathogenic variant in gnomADv.2.1.1 were younger than those in the cancer cohort and had simply not yet developed clinical manifestations. These factors may have also explained why Predicted pathogenic variants were more common in the non-cancer than the cancer subset.
Another reason why our population frequency for Lynch syndrome may be an underestimate is that we examined only the four MMR genes and not the 3’ EPCAM deletions that inactivate MSH2 to cause disease7 or other still-to-be-identified colorectal cancer genes. The EPCAM deletions are difficult to assess39,40 and gnomAD included too few to significantly impact the overall frequencies. Other reasons include the gnomAD genetic analysis using mainly WES which does not include all structural, copy number, or deep intronic splicing variants41,42, and our testing strategy for missense variants had a 61 to 93% sensitivity thus overlooking some disease-causing changes. Finally, the population frequencies derived from ClinVar, LOVD and InSiGHT did not include assessments for all gnomAD variants.
Importantly, however, not everyone with a Predicted pathogenic MMR variant develops cancer. The risk is greater for MLH1 and MSH2 than for MSH6 and PMS2 variants. The increased frequency of null variants in MSH6 and PMS2 in the gnomAD cohort is associated with incomplete penetrance and a lesser likelihood of clinical features. Mechanistically, MSH2-MSH3 and MLH1-MLH3 heterodimers may be functionally redundant and compensate for the loss of MSH2-MSH6 and MLH1-PMS2 complexes respectively43,44,45. Thus, null MSH6 and PMS2 variants appear to be better tolerated than loss of function variants in MLH1 and MSH2.
However, a computational approach to population frequencies has inherent limitations. Variant assessments may be inconsistent because of different diagnostic criteria, curators and interpretations. Missense variants are difficult to interpret, and improving the accuracy usually requires clinical information. We used particularly stringent criteria for missense changes that required all four computational tools to be damaging. This approach overlooked some variants, such as NM_000251.3(MSH2): c.488 T > G (p.Val163Gly), that were classified Pathogenic by ClinVar, LOVD, and InSiGHT. On the other hand, the population frequency was much lower when missense variants were assessed using a highly stringent REVEL score. The missense validation cohort has wide confidence intervals for sensitivity and specificity and the resulting performance metrics must be interpreted cautiously. Finally, in-frame indels were assessed using VEST4 with a cut-off of > 0.5, which risks inflating the ‘Predicted pathogenic’ count since the MMR guidelines do not sanction such a rule. This threshold may contribute to an overestimation of the prevalence.
Our population frequency for Lynch syndrome differed from two other recent computational assessments, and a major difference was that we included founder variants. In the Ashkenazi population, 3 founder variants accounted for 73% of all MMR mutations46. Interestingly, the Ashkenazi had relatively more null changes than other ancestries in our gnomAD assessment, with 12 null and only 5 missense changes. Ten of the variants were present in MSH6, where two changes in 6 people were previously reported as founder changes: c.3959_3962delCAAG (p.Ala1320GlufsTer6) and c.3984_3987dupGTCA (p.Leu1330ValfsTer12)47. The most common Ashkenazi founder variant was MSH2 c.1906G > C (p.Ala636Pro)48 which was present in 5 people in gnomAD, but failed to meet our pathogenicity criteria. In the Finnish population, 2 founder variants accounted for about half the cases of Lynch syndrome49 but neither was present in gnomAD, resulting in an unexpectedly low population frequency for this ancestry.
There are other limitations to calculating a population frequency computationally. It may not be appropriate for PMS2 variants since this gene shares more than 99% of its sequence with its pseudogene PMS2CL, and gnomAD v.2.1 uses short-read sequencing which does not distinguish PMS2 from PMS2CL. Locus-specific methods are required to avoid mis-assignment50,51. A common PMS2 frameshift is found in PMS2CL52 and the splice site variant c.904-1G > A has been detected in PMS2CL in one carrier53. These examples demonstrate that some gnomAD PMS2 calls affect PMS2CL, but the same variant occurs in PMS2 in other people. Only locus-specific confirmation can resolve each case.
Variant assessments from ClinVar, LOVD and InSiGHT databases are more likely to be accurate because assessments are from accredited diagnostic testing laboratories, often from people with suspected disease, and based on the ACMG/AMP criteria. However, these databases do not include assessments for all variants and ClinVar, at least, includes many Variants of Uncertain Significance (VUS), which means that these databases, too, underestimate population frequencies. Nevertheless, the pathogenic variants found in the ClinVar, LOVD and InSiGHT databases may be more likely to be fully penetrant for at least the index cases because variants are often from people with suspected Lynch syndrome. These observations suggest that the population frequency of Predicted pathogenic MMR variants is even more common than one in 203.
The strengths of this study were the use of large datasets of genomic information, the rigorous systematic assessment of gnomAD variants for all four Lynch syndrome genes, the use of different strategies for variant assessment, including the REVEL-MAPP criteria, and the comparisons using different databases to develop a range of population frequencies. In addition, our approach was assessed for accuracy.
The study’s limitations were the lack of the clinical, histopathological and functional confirmation for variant assessments, the inability to detect and assess accurately structural, copy number and splicing variants, as well as indels and the limitations of the computational tools, including the comparator variant databases (ClinVar, LOVD, InSiGHT).
The new assessment guidelines, increasingly accurate computational tools and more expert assessments will help develop more precise population frequencies for predicted pathogenic MMR variants. The information from larger datasets such as gnomAD v.4.1 and those that include clinical features will allow us to ascertain variant penetrance too.
In conclusion, the population frequency of Predicted pathogenic MMR variants in Lynch syndrome is more common than the previously reported frequency of one in 279. While not everyone with a Predicted pathogenic variant will develop cancer because of incompletely penetrant disease, the frequencies of one in 203 from ClinVar and LOVD are more likely to be based on variants found in people with cancer. Clinicians, health service planners and pharmaceutical companies should be aware of these population frequencies, especially the possibly increased prevalence in people from African/African American and South Asian ancestries. Nevertheless, these population frequencies probably represent upper limits for estimates, and are based on variants that are not confirmed to be pathogenic. Methodological caveats include the PMS2/PMS2CL ambiguity, inclusion of in-frame indels, the small validation set, and ancestry-specific artefacts. Dataset and algorithmic biases (especially for PMS2 and underrepresented ancestries) might further reduce the actual frequency. Locus-specific or functional validation is required. While these population frequencies must not be overinterpreted, at the same time their public health relevance should be recognised.
Data availability
Data is provided within the manuscript and supplementary information files.
References
Lynch, H. T. et al. Review of the Lynch syndrome: history, molecular genetics, screening, differential diagnosis, and medicolegal ramifications. Clin. Genet. 76(1), 1–18 (2009).
Bronner, C. E. et al. Mutation in the DNA mismatch repair gene homologue hMLH1 is associated with hereditary non-polyposis colon cancer. Nature 368(6468), 258–261 (1994).
Fishel, R. et al. The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 75(5), 1027–1038 (1993).
Miyaki, M. et al. Germline mutation of MSH6 as the cause of hereditary nonpolyposis colorectal cancer. Nat. Genet. 17(3), 271–272 (1997).
Nicolaides, N. C. et al. Mutations of two PMS homologues in hereditary nonpolyposis colon cancer. Nature 371(6492), 75–80 (1994).
Fishel, R. & Kolodner, R. D. Identification of mismatch repair genes and their role in the development of cancer. Curr. Opin. Genet. Dev. 5(3), 382–395 (1995).
Ligtenberg, M. J. et al. Heritable somatic methylation and inactivation of MSH2 in families with Lynch syndrome due to deletion of the 3’ exons of TACSTD1. Nat. Genet. 41(1), 112–117 (2009).
Bonadona, V. et al. Cancer risks associated with germline mutations in MLH1, MSH2, and MSH6 genes in Lynch syndrome. JAMA 305(22), 2304–2310 (2011).
Vasen, H. F. et al. Cancer risk in families with hereditary nonpolyposis colorectal cancer diagnosed by mutation analysis. Gastroenterology 110(4), 1020–1027 (1996).
Dominguez-Valentin, M. et al. Cancer risks by gene, age, and gender in 6350 carriers of pathogenic mismatch repair variants: findings from the Prospective Lynch Syndrome Database. Genet Med. 22(1), 15–25 (2020).
Senter, L. et al. The clinical phenotype of Lynch syndrome due to germ-line PMS2 mutations. Gastroenterology 135(2), 419–428 (2008).
Wagner, A. et al. Atypical HNPCC owing to MSH6 germline mutations: analysis of a large Dutch pedigree. J. Med. Genet. 38(5), 318–322 (2001).
Wu, Y. et al. Association of hereditary nonpolyposis colorectal cancer-related tumors displaying low microsatellite instability with MSH6 germline mutations. Am. J. Hum. Genet. 65(5), 1291–1298 (1999).
Hendriks, Y. M. et al. Heterozygous mutations in PMS2 cause hereditary nonpolyposis colorectal carcinoma (Lynch syndrome). Gastroenterology 130(2), 312–322 (2006).
Seppälä, T. T. et al. European guidelines from the EHTG and ESCP for Lynch syndrome: an updated third edition of the Mallorca guidelines based on gene and gender. Br J Surg. 108(5), 484–498 (2021).
Järvinen, H. J. et al. Controlled 15-year trial on screening for colorectal cancer in families with hereditary nonpolyposis colorectal cancer. Gastroenterology 118(5), 829–834 (2000).
de la Chapelle, A. The incidence of Lynch syndrome. Fam. Cancer 4(3), 233–237 (2005).
Win, A. K. et al. Prevalence and penetrance of major genes and polygenes for colorectal cancer. Cancer Epidemiol. Biomarkers Prev. 26(3), 404–412 (2017).
Haraldsdottir, S. et al. Comprehensive population-wide analysis of Lynch syndrome in Iceland reveals founder mutations in MSH6 and PMS2. Nat. Commun. 8, 14755 (2017).
Grzymski, J. J. et al. Population genetic screening efficiently identifies carriers of autosomal dominant diseases. Nat. Med. 26(8), 1235–1239 (2020).
Patel, A. P. et al. Association of rare pathogenic DNA variants for familial hypercholesterolemia, hereditary breast and ovarian cancer syndrome, and Lynch syndrome with disease risk in adults according to family history. JAMA Netw Open. 3(4), e203959 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581(7809), 434–443 (2020).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17(5), 405–424 (2015).
Gibson, J. et al. Prevalence estimates of predicted pathogenic COL4A3-COL4A5 variants in a population sequencing database and their implications for Alport syndrome. J. Am. Soc. Nephrol. 32(9), 2273–2290 (2021).
Lanktree, M. B. et al. Prevalence estimates of polycystic kidney and liver disease by population sequencing. J. Am. Soc. Nephrol. 29(10), 2593–2600 (2018).
Kondo, A. et al. Examination of the predicted prevalence of Gitelman syndrome by ethnicity based on genome databases. Sci. Rep. 11(1), 16099 (2021).
Gao, J., Brackley, S. & Mann, J. P. The global prevalence of Wilson disease from next-generation sequencing data. Genet. Med. 21(5), 1155–1163 (2019).
Borges, P., Pasqualim, G., Giugliani, R., Vairo, F. & Matte, U. Estimated prevalence of mucopolysaccharidoses from population-based exomes and genomes. Orphanet. J. Rare Dis. 15(1), 324 (2020).
Perez-Palma, E., Gramm, M., Nurnberg, P., May, P. & Lal, D. Simple ClinVar: an interactive web server to explore and retrieve gene and disease variants aggregated in ClinVar database. Nucleic Acids Res. 47(W1), W99–W105 (2019).
Fokkema, I. et al. The LOVD3 platform: efficient genome-wide sharing of genetic variants. Eur. J. Hum. Genet. 29(12), 1796–1803 (2021).
Thompson, B. A. et al. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database. Nat. Genet. 46(2), 107–115 (2014).
Kermond-Marino, A. et al. Population frequency of undiagnosed Fabry disease in the general population. Kidney Int. Rep. 8(7), 1373–1379 (2023).
Nagy, E. & Maquat, L. E. A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem. Sci. 23(6), 198–199 (1998).
Plazzer, J. P. et al. Mismatch repair genes specifications to the ACMG/AMP classification criteria: Consensus recommendations from the InSiGHT ClinGen Hereditary Colorectal Cancer/Polyposis Variant Curation Expert Panel. medRxiv (2025).
Niroula, A. & Vihinen, M. How good are pathogenicity predictors in detecting benign variants?. PLoS Comput. Biol. 15(2), e1006481 (2019).
Thompson, B. A. et al. Calibration of multiple in silico tools for predicting pathogenicity of mismatch repair gene missense substitutions. Hum. Mutat. 34(1), 255–265 (2013).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99(4), 877–885 (2016).
Ponti, G., Castellsague, E., Ruini, C., Percesepe, A. & Tomasi, A. Mismatch repair genes founder mutations and cancer susceptibility in Lynch syndrome. Clin. Genet. 87(6), 507–516 (2015).
Sivagnanam, M. et al. Identification of EPCAM as the gene for congenital tufting enteropathy. Gastroenterology 135(2), 429–437 (2008).
Kempers, M. J. et al. Risk of colorectal and endometrial cancers in EPCAM deletion-positive Lynch syndrome: a cohort study. Lancet Oncol. 12(1), 49–55 (2011).
Morak, M. et al. Prevalence of CNV-neutral structural genomic rearrangements in MLH1, MSH2, and PMS2 not detectable in routine NGS diagnostics. Fam. Cancer. 19(2), 161–167 (2020).
Arnold, A. M. et al. Targeted deep-intronic sequencing in a cohort of unexplained cases of suspected Lynch syndrome. Eur. J. Hum. Genet. 28(5), 597–608 (2020).
Lynch, H. T., Snyder, C. L., Shaw, T. G., Heinen, C. D. & Hitchins, M. P. Milestones of Lynch syndrome: 1895–2015. Nat. Rev. Cancer. 15(3), 181–194 (2015).
de Wind, N. et al. HNPCC-like cancer predisposition in mice through simultaneous loss of Msh3 and Msh6 mismatch-repair protein functions. Nat. Genet. 23(3), 359–362 (1999).
Chen, P. C. et al. Contributions by MutL homologues Mlh3 and Pms2 to DNA mismatch repair and tumor suppression in the mouse. Cancer Res. 65(19), 8662–8670 (2005).
Goldberg, Y. et al. Lynch syndrome in high risk Ashkenazi Jews in Israel. Fam. Cancer. 13(1), 65–73 (2014).
Raskin, L. et al. Characterization of two Ashkenazi Jewish founder mutations in MSH6 gene causing Lynch syndrome. Clin. Genet. 79(6), 512–522 (2011).
Foulkes, W. D. et al. The founder mutation MSH2*1906G–>C is an important cause of hereditary nonpolyposis colorectal cancer in the Ashkenazi Jewish population. Am. J. Hum. Genet. 71(6), 1395–1412 (2002).
Moisio, A. L., Sistonen, P., Weissenbach, J., de la Chapelle, A. & Peltomäki, P. Age and origin of two common MLH1 mutations predisposing to hereditary colon cancer. Am. J. Hum. Genet. 59(6), 1243–1251 (1996).
Clendenning, M. et al. Long-range PCR facilitates the identification of PMS2-specific mutations. Hum. Mutat. 27(5), 490–495 (2006).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 11, 2 (2022).
Chong, A. S., Chong, G., Foulkes, W. D. & Saskin, A. Reclassification of a frequent African-origin variant from PMS2 to the pseudogene PMS2CL. Hum. Mutat. 41(4), 749–752 (2020).
Munte, E. et al. Open-source bioinformatic pipeline to improve PMS2 genetic testing using short-read NGS data. J. Mol. Diagn. 26(8), 727–738 (2024).
Acknowledgements
We would like to thank gnomAD, Simple ClinVar, ClinVar, LOVD, and HGMD for access to their databases; the many patients who agreed to share their data; and the developers of the computational tools used in this analysis (PP2, SIFT, Mutation Taster, Splice AI, VEST4).
Author information
Contributions
YG undertook the bioinformatic analysis and drafted the manuscript; MH supervised the bioinformatic analysis; FM advised on the computational tools to use, helped with the analysis and modified the draft; XY suggested comparison tools and modified the final draft; JPP suggested strategies to improve the analysis and modified the final draft; and JS initiated the project, provided supervision and modified the final draft.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics declaration
This was a secondary analysis of the anonymised sequencing data publicly available in gnomAD v.2.1.1. Participants in gnomAD had provided written informed consent to ongoing use of their data at recruitment into the original studies.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Guan, Y., Huang, M., Macrae, F. et al. Population frequency of Predicted pathogenic MisMatch Repair (MMR) gene variants in Lynch syndrome from bioinformatic analyses of the general population. Sci Rep 15, 34545 (2025). https://doi.org/10.1038/s41598-025-17881-7