Introduction

Aneuploidy can occur during human meiosis due to nondisjunction (NDJ) errors, but it is challenging to study because most aneuploidies cannot give rise to viable embryos, with the exception of trisomies 13, 18 and especially 21, the most common aneuploid condition at birth. Therefore, Down syndrome (DS) offers a unique opportunity and is an ideal model for understanding the origin of these events. Individuals with DS are at increased risk of intellectual disability and congenital heart disease, and are predisposed to hematologic malignancies1. Trisomy 21 arises mostly from NDJ or mis-segregation of chromosome 21 (chr 21) during meiosis2,3, or, relatively uncommonly, from postzygotic mitosis4. In more than 90% of cases, the extra copy of chr 21 originates from the maternal gamete5. This is because female gametogenesis is more prone than male gametogenesis to erroneous chromosomal segregation due to decades of arrest at prophase I6,7. Inaccurate chromosomal segregation during gametogenesis can be grossly classified into meiosis I (MI) or meiosis II (MII) errors using genotyping data8,9. MI errors are inferred when two homologous parental chromosomes are identified in the gamete, and MII errors are inferred when two sister chromatids are identified. These observations can be the result of several mechanisms of mis-segregation, such as NDJ of the homologous chromosomes, precocious separation of sister chromatids, or reverse segregation. The stage at which the NDJ takes place is associated with the number and the position of the chromosomal crossovers. In maternal NDJ of chromosome 21, the absence of a crossover, or a crossover located distally from the centromere, is associated with MI errors. In contrast, MII errors are associated with pericentromeric crossovers10.

The type/stage of NDJ error of trisomy 21 is usually inferred from proband-parent trios by comparing the heterozygosity patterns of genotypes at the pericentromeric region between the child with DS and the parents10,11,12,13. However, when the parents’ genetic data are unavailable, there is currently no generalizable method to infer the type of error, aside from recent work14,15 analyzing NDJs based on low-coverage whole genome sequencing (WGS) data in preimplantation genetic testing.

In this work, we develop a method, Mis-segregation Error Identification through Hidden Markov Models (MeiHMM), to infer the type of NDJ and locate the chromosomal crossovers using proband genotype alone, based on the frequencies of single nucleotide polymorphisms (SNPs) or haplotypes in the general population. Applying this to WGS data of 152 DS cases demonstrates high accuracy compared to trio analysis. We further investigate the association between the type of NDJ error and disease characteristics in a cohort of DS-associated acute lymphoblastic leukemia (ALL). MeiHMM does not infer the parental origin of trisomy 21, but it is applicable to NDJ errors of both maternal and paternal origins. For simplicity in the description, the following sections assume a maternal origin for trisomy 21, which is true in ~90% of DS cases5, though the framework is the same for NDJ of paternal origin as well.

Results

MeiHMM implements variant- and haplotype-based analyses to identify NDJ timing and crossover locations

The types of chr 21 NDJ errors can be distinguished by the number of unique haplotypes across the q arm of this chromosome as follows (Fig. 1A): (1) Errors of MI origin are characterized by having three different haplotypes in the centromeric region, i.e., one allele with the paternal haplotype and the two remaining alleles with two different maternal haplotypes, due to the unsuccessful separation of the homologous chromosomes in the oocyte; (2) By contrast, errors of MII result when the duplicated sister chromatids fail to separate; such failure is inferred when only two haplotypes are represented in the centromeric region, i.e., one allele from the paternal haplotype and two identical alleles from the maternal haplotype; (3) Mitotic errors are inferred when there is complete duplication of maternal chr 21. (Alternatively, this could be inferred as an MII error with no recombination, a noted limitation of this gross classification10.) Applying a hidden Markov model, MeiHMM (1) segments chr 21 into blocks of two or three haplotypes, which inform the stage of NDJ, and (2) identifies the positions of chromosomal crossovers (recombination during meiosis) on the basis of the boundaries of these haplotype blocks.

Fig. 1: Overview of MeiHMM.

A The types of chr 21 nondisjunction (NDJ) and their patterns of haplotypes along chr 21. Here, trisomy 21 is assumed to be of maternal origin. B Summary of the workflow to categorize chr 21 heterozygous SNPs. C Type 1 informative SNPs occur predominantly in two-haplotype blocks and type 2 informative SNPs occur predominantly in three-haplotype blocks. D Diagram of the hidden states and the observations of MeiHMM. E Distribution of the allele frequencies (AF) of two-copy alleles of chr 21 SNPs in simulated SNPs, generated by sampling two alleles (and duplicating one to make three alleles) or three alleles from the gnomAD and ALFA references. F Distribution of the O/E ratio of hypothetical haplotypes. In panels E and F, yellow and blue colors indicate simulated values by assuming there are two or three haplotypes, respectively. G The results of MeiHMM of an example case with Meiosis I error. Red dashed lines indicate thresholds used to define type 1 (allele frequency <0.003) and type 2 (O/E ratio <0.1) informative SNPs. SNP single-nucleotide polymorphism, ref allele reference allele, alt allele alternative allele, gnomAD Genome Aggregation Database, ALFA Allele Frequency Aggregator, AF allele frequency, O/E ratio ratio of observed vs expected frequency. Source data are provided as a Source Data file.

To implement MeiHMM, we categorize heterozygous SNPs on chr 21 into three groups (Fig. 1B). (1) Type 1 informative SNPs: for a given case, we first identify variants that are rare in the general population (alternative allele frequency <0.003 in gnomAD and the Allele Frequency Aggregator [ALFA]); if the genotype configuration at this SNP is alternative: alternative: reference alleles (Alt: Alt: Ref), this suggests a two-haplotype state because the probability of a rare allele occurring twice independently is exceedingly low; thus the configuration likely arises from the duplication of the rare allele, i.e., a two-haplotype state (Fig. 1C; see “Methods” for details). (2) Type 2 informative SNPs: for variants that do not meet the criteria in #1 (common SNPs or rare variants without the defined genotype configuration), we utilize genotypes at adjacent variants to discern the two- vs three-haplotype state: for a given variant and its two flanking SNPs, we generate hypothetical haplotypes for this three-variant block, assuming a two-haplotype state (Fig. 1C), i.e., one haplotype consists of the combination of the one-copy alleles across the three variants, whereas the other is defined by the combination of alleles with two copies. The likelihood of the two- vs three-haplotype state for this region is then scored on the basis of the observed vs expected frequency (O/E) of the hypothetical haplotypes in the 1000 Genomes Project populations, with the expected frequency estimated as the product of the allele frequencies of the incorporated alleles (see “Methods”). A lower O/E ratio suggests that the two-haplotype state is less likely, which indicates the presence of a three-haplotype state. (3) The remainder of chr 21 SNPs are considered non-informative. Variants on chr 21 were classified into one of these three groups and these assignments were fed into a hidden Markov model to segment chr 21 into smoothed two-haplotype and three-haplotype blocks (Fig. 1D). The stage of NDJ error is then inferred from the segmentation results (Supplementary Fig. 1). Detailed definitions of the two types of informative SNPs and the hidden Markov model are available in “Methods”.
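As a minimal sketch of how a stage call could be read off a finished segmentation (an illustration under stated assumptions, not the authors' implementation), the snippet below assumes the per-SNP states are ordered from the centromere towards the q-arm telomere. The function names `haplotype_blocks` and `classify_ndj` are hypothetical, and the decision rule simply restates the classification described above: the pericentromeric block separates MI from MII, and a chromosome that is two-haplotype throughout is called mitotic.

```python
from itertools import groupby


def haplotype_blocks(states):
    """Collapse per-SNP states ('2H' or '3H') into consecutive (state, n_snps) blocks."""
    return [(state, sum(1 for _ in run)) for state, run in groupby(states)]


def classify_ndj(states):
    """Return (stage, n_crossovers) from centromere-to-telomere ordered states.

    Assumption: each boundary between adjacent blocks counts as one crossover.
    """
    if not states:
        raise ValueError("no segmented SNPs supplied")
    blocks = haplotype_blocks(states)
    crossovers = len(blocks) - 1
    if all(state == "2H" for state, _ in blocks):
        return "mitotic", crossovers
    # The first block is the pericentromeric one under the ordering assumed here.
    stage = "MI" if blocks[0][0] == "3H" else "MII"
    return stage, crossovers
```

For example, a chromosome whose states start as `3H` and switch once to `2H` would be called an MI error with one crossover under this rule.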

We performed simulations to evaluate the degree to which the aforementioned type 1 (rare variant-based) and type 2 (rare haplotype-based) informative SNPs inform the two- vs three-haplotype state on chr 21 (see “Methods” for details). First, for the rare variant-based inference, we generated triallelic genotypes for chr 21 SNPs according to the allele frequencies reported in gnomAD and ALFA, under the assumption of either a two- or three-haplotype state. Among 44,681 variants in gnomAD and ALFA for which (1) the alternative allele population frequency <0.003 and (2) the simulated genotype configuration is Alt: Alt: Ref, 99.7% were only seen under the two-haplotype hypothesis (Fig. 1E). For rare haplotype-based inference, we performed similar simulations but now by randomly sampling haplotypes from the 1000 Genomes project. In 1,653,980 hypothetical haplotypes with O/E ratio <0.1, 98.2% were only seen under the three-haplotype hypothesis (Fig. 1F).

Figure 1G shows an example of the segmentation results of MeiHMM, based on WGS data from the Gabriella Miller Kids First Pediatric Research Program (see “Methods”). For this particular case, rare haplotype analysis of the pericentromeric region indicates a three-haplotype state, and therefore it is classified as an MI error, with one crossover event identified in the q arm (Fig. 1G). Additional examples are included in Supplementary Fig. 2.

Performance metrics of MeiHMM

To test the accuracy of MeiHMM, we compared the NDJ error stage and crossover event calling results with those from 152 DS proband-parent trios with WGS data (see “Methods”). Overall, MeiHMM classified 76%, 19%, and 5% of the cases as MI, MII, and mitotic errors (Fig. 2A), respectively, of which 146 were consistent with trio analysis (accuracy = 96.1%; Fig. 2A). Misclassifications of the six cases were due to crossovers (Fig. 2B) near the centromere, resulting in exceedingly narrow two-haplotype blocks (Fig. 2C) in the pericentromeric region with a limited number of informative SNPs; consequently, MII errors were misclassified as MI errors (Fig. 2A).

Fig. 2: Comparison of MeiHMM with trio analysis.

A Comparison of MeiHMM with trio analysis. B MeiHMM identified 87 of the 95 crossovers at similar positions; crossovers very close to the centromere lead to incorrect classifications. C Comparison of the minimum block size of correctly and incorrectly classified cases. D Frequencies of MI, MII, and mitotic errors identified by MeiHMM in 764 DS cases. E, F Distribution of the crossovers on chr 21 in (E) MI and (F) MII errors. MI meiosis I, MII meiosis II, DS Down syndrome, WGS whole genome sequencing. Source data are provided as a Source Data file.

MeiHMM also accurately identified crossover events (Fig. 2B). Of the 95 crossover events identified by trio analysis, 87 were successfully discovered by MeiHMM (sensitivity = 91.6%) with no false positive crossover events (precision = 100%). Again, the discordance is largely due to the pericentromeric crossovers described above, except for one case with two crossovers too close to each other (Fig. 2B). Two cases analyzed by both MeiHMM and trio analysis are illustrated in Supplementary Fig. 3.

Next, we applied MeiHMM to a large dataset of 764 DS cases with proband-only data genotyped using WGS. MeiHMM identified 75% (n = 571) of the errors as MI origin (no crossover, 47%; with crossover, 28%), 21% of the errors as MII, and 4% as mitotic errors (Fig. 2D). The frequencies of the error types were consistent with those in the cases analyzed by trio analysis (Fig. 2A), as well as those reported in a previous study using microsatellite markers10.

We identified an average of 0.47 crossover events per case in MI errors (range 0–3 per case, total 266 events; Fig. 2E), though this is likely an underestimation of the actual frequency of crossover events during gamete formation due to random selection of sister chromatids in MII followed by MI errors. In cases with MII errors (n = 165), an average of 1.19 crossover events were identified (range 1–3 per case, total 197 events; Fig. 2F). MI and MII errors were characterized by a distinct pattern of crossover position along chr 21. Crossovers in MI errors were more likely to be telomeric, especially in cases with only one crossover (Fig. 2E). In contrast, crossovers in MII errors were preferably pericentromeric, regardless of the number of crossovers, with a relatively heavier tail towards the telomeric region (Fig. 2F). These results were consistent with previous observations that the positioning of crossovers is associated with the type of error10.

Applying MeiHMM to SNP arrays

Because MeiHMM is based on genotype data, it can potentially be used for datasets generated by other genotyping platforms. To demonstrate this, we applied MeiHMM to a previously published SNP array dataset of 429 DS cases from a case-control series for congenital atrioventricular septal defects (AVSD)16. We elected to use only common variants and perform the haplotype-based analysis (Type 2 in Fig. 1B) because rare SNPs are poorly represented on SNP arrays. We observed MI, MII and mitotic errors in this DS-AVSD cohort with frequencies comparable to our DS-ALL dataset (Fig. 2D and Supplementary Fig. 4).

NDJ error is associated with disease phenotypes of DS-related ALL

The ability to infer chr 21 NDJ error using proband-only data enables the investigation of DS-associated health conditions when genotype data of the parents are not available. To demonstrate this, we applied MeiHMM to a cohort of children with DS-associated acute lymphoblastic leukemia (DS-ALL) (n = 509) genotyped using WGS17. Overall, MI, MII, and mitotic errors accounted for 73% (with crossover, 31%; no crossover, 43%), 22% and 5% of the dataset, respectively, similar to those observed in the general DS population without ALL (n = 764; P = 0.63).

However, there was significant heterogeneity in NDJ error within DS-ALL cases, especially when stratified by the age at ALL onset (P = 0.0026) and by ALL molecular subtype defined by leukemia genomic abnormalities (P = 0.017; Fig. 3A). MII or mitotic errors were over-represented in patients with ALL diagnosed at an age above 10 years (38%) compared to those diagnosed below 10 years (24%; P = 0.0024; Fig. 3A). Across different ALL subtypes, there seemed to be an over-representation of errors of MII origin in patients with ETV6::RUNX1-like, CRLF2 BCR::ABL1-like and C/EBPalt DS-ALL (Fig. 3A). Because MII errors are characterized by more crossover events, we further hypothesized that the specific location of chr 21 recombination (and consequently the haplotype diversity of specific loci) may influence DS disease phenotypes. To test this, we compared the frequencies of two- and three-haplotype states in windows of 1 million bp along chr 21 between DS-ALL cases and non-ALL DS cases. Compared to non-ALL DS cases, DS-ALL patients older than 10 years were more likely to have two haplotypes (rather than three haplotypes) in the proximal q arm, especially at q21.1 (Fig. 3B). In contrast, DS-ALL patients younger than 10 years exhibited patterns of haplotype diversity similar to those of the general DS population without leukemia (Fig. 3B). In terms of molecular subtype, CRLF2 BCR::ABL1-like DS-ALL cases exhibited significantly more frequent two-haplotype blocks in the region spanning from q21.3 to q22.2, while C/EBPalt DS-ALL cases were more likely to have two-haplotype blocks at q22.12 when compared to non-ALL DS cases (Fig. 3C).

Fig. 3: Applying MeiHMM to DS-ALL.

A Type of chromosome 21 NDJ errors of DS-ALL patients in specific age groups and molecular subtypes. Associations were tested using the Chi-squared test. Nominal p-values were reported without adjusting for multiple comparisons. B, C P-values (by Chi-squared test) of the frequency of two-haplotype blocks across chromosome 21 genomic positions in different (B) age groups and (C) molecular subtypes compared to non-ALL DS cases. For (B and C), the dashed red line indicates a Bonferroni-corrected p-value of 0.05. The size of the red circles indicates the odds ratio of the two-haplotype state comparing DS-ALL and non-ALL DS cases, when the adjusted p-value < 0.05. D, E Kaplan–Meier estimates of the probability of (D) event-free survival or (E) overall survival of DS-ALL patients with meiosis I, meiosis II, and mitotic errors. Differences in survival were tested by log-rank test. Source data are provided as a Source Data file.

We also observed that the type of trisomy 21 error was associated with treatment outcomes of DS-ALL in terms of both event-free survival (EFS; P = 0.016; Fig. 3D) and overall survival (OS; P = 0.0024; Fig. 3E). After adjusting for known prognostic factors such as age, initial white blood cell count, minimal residual disease at end-of-induction, and ALL molecular subtype, MII was independently associated with poorer event-free survival (P = 0.035; Supplementary Table 1). Because the ratio of two- vs three-haplotype blocks differs by NDJ error type, we examined the impact of the size of two-haplotype blocks on treatment outcomes. As shown in Supplementary Table 2, the proportion of two-haplotype blocks was linked to decreasing EFS (P = 0.012) and OS (P = 2.6 × 10⁻⁴), and it remained marginally significant after adjusting for NDJ error type (P = 0.038 for OS). These results imply that the proportion of chr 21 with a two-haplotype state (thus the degree of homozygosity) may mediate the impact of NDJ error on leukemia treatment outcomes. Taken together, these results suggest that the NDJ error of trisomy 21 could modify specific phenotypes of DS-associated health conditions.

Discussion

The type of NDJ error and crossover events are central to understanding the etiology of DS (and trisomic chromosomal disorders in general). It has been found that the type of NDJ (MI, MII, or mitotic) is closely related to maternal age12,18, genetic polymorphisms11, lifestyle19,20, socioeconomic status21,22, and crossover events10. An altered crossover pattern is a strong risk factor for chromosomal NDJ2,8,23; and when NDJ happens, the crossover is associated with the type of error10. Because MeiHMM only requires data from the DS child, it can greatly expand the number of cases for which NDJ can be inferred and thus enable large association studies to explore the impact of NDJ on DS disease phenotypes. However, the exact biology underlying the association of NDJ error stage with DS disease phenotypes is unclear and warrants further investigation. We posit that duplication of a deleterious allele (in the two-haplotype state, MII) can result in increased severity of disease phenotypes, compared to cases with only one copy of the risk allele (in the three-haplotype state, MI). In DS-ALL, we observed an association of the two-haplotype state with a higher risk of the unfavorable disease subtype CRLF2 BCR::ABL1-like, which is in line with this hypothesis. That said, this hypothesis should be experimentally tested in future studies.

We have demonstrated that MeiHMM can be applied to genotype data generated by WGS or SNP arrays. It should also be straightforward for MeiHMM to accommodate data generated from other sequencing or genotyping platforms, such as RNA sequencing, whole exome sequencing, or even a small, targeted assay designed specifically for NDJ error identification. In addition to its versatility in data types, MeiHMM can also be applied to study the NDJ of other trisomic chromosomal conditions (e.g., trisomy 13 or trisomy 18)24, either prenatally or postnatally. One limitation of MeiHMM is that it cannot detect crossover events involving small genomic regions with few informative SNPs. This is especially problematic for calling short two-haplotype blocks where rare variants are sparse. For example, because MII errors are associated with proximal crossovers, they tend to produce short two-haplotype blocks in the pericentromeric region, which led to the misclassification of 17% (n = 6 of 35) of MII errors as MI errors (Fig. 2A). It should be noted that these misclassifications may introduce uncertainty in assessing the association of NDJ error type with DS phenotypes, such as treatment outcomes of ALL. Furthermore, because MeiHMM uses data from the proband only, it does not infer the parental origin of trisomy 21.

The traditional method for NDJ staging relies on dyads or trios, which restricts its utility because current DS genomic datasets are not trio-based. Prior efforts such as LD-PGTA15 and LD-CHASE14 are capable of inferring NDJ errors and crossover events in embryos generated through in vitro fertilization, using data from preimplantation genetic testing for aneuploidy. This elegant approach employs a haplotype-based framework, conceptually related to MeiHMM, even though it relies on sampling read pairs from low-coverage WGS data. In contrast, MeiHMM harnesses genotype information, making it more broadly generalizable across platforms; MeiHMM also makes use of rare variants (Type 1 informative SNPs), which may be beneficial for regions with sparse linkage disequilibrium. Applying MeiHMM would greatly expand the ability to determine the origin of NDJ in essentially any DS case genotyped for chr 21 SNPs at a reasonable density, with potential value for studying the etiology of other congenital aneuploidies.

Methods

Data and genotyping

This study complies with all relevant ethical regulations and was approved by the St. Jude Children’s Research Hospital Institutional Review Board and Baylor College of Medicine Institutional Review Board. All the WGS data used in this study were generated by the Gabriella Miller Kids First Pediatric Research Program. WGS and SNP array data were obtained for a total of 764 and 429 children with non-mosaic DS, respectively. WGS data of both parents were also obtained for a subset of DS children with WGS data, forming 152 trios. In addition, WGS data of DS-associated acute lymphoblastic leukemia cases (DS-ALL, n = 509) were obtained, and subtype classification was performed in conjunction with RNA-Seq data as described in our previous publication17. Briefly, subtypes including CRLF2, ETV6::RUNX1, IGH::IGF2BP1, TCF3::PBX1, BCR::ABL1, KMT2A, and DUX4 were classified by rearrangement identification (using fusionCatcher version 0.99.7d)25; high hyperdiploid subtype was identified by digital karyotyping (using RNASeqCNV version 1.2.2)26; C/EBPalt subtype was defined by overexpression and/or alterations of CEBPD, CEBPA or CEBPE genes17; BCR::ABL1-like and ETV6::RUNX1-like subtypes were called by gene expression analysis27; PAX5alt subtype was defined by unique gene expression pattern and presence of PAX5 alterations27; and PAX5 P80R and IKZF1 N159Y were called using variant identification27. DS-ALL patients were enrolled in the following Children’s Oncology Group trials: 9904 (ClinicalTrials.gov identifier NCT00005585), 9905 (NCT00005596), 9906 (NCT00005603), AALL0232 (NCT00075725), AALL0331 (NCT00103285), AALL08B1 (NCT01142427), AALL0932 (NCT01190930), AALL1131 (NCT02883049), AALL1731 (NCT03914625), and APEC14B1 (NCT02402244). All samples were collected with informed consent obtained from patients, parents, and/or guardians, in written and signed form.
WGS was performed using the Kapa Hyperprep library preparation kit (Roche) and sequenced on the NovaSeq platform (Illumina). Coverage of reference and alternative alleles was extracted using the GATK (version 3.5-0) germline short variant discovery pipeline28, based on which triallelic genotype calls were made. SNPs with variant allele frequencies between 0.13 and 0.40 or between 0.60 and 0.87 were deemed to have a configuration of Alt: Ref: Ref or Alt: Alt: Ref, respectively. Only SNPs with coverage of at least 30 reads, and for which the frequency of alternative allele carriers among the cases was within 10% of the expected proportion (estimated using the overall population allele frequency from gnomAD29 and ALFA30), were kept for analysis. SNP array data were retrieved from a previously published dataset16 of SNP 6.0 arrays (Affymetrix). For each SNP on chromosome 21, genotyping was performed by unsupervised hierarchical clustering using average linkage and Euclidean distance based on three features: intensity A, intensity B and the ratio between B and A. Only SNPs with genotype counts broadly following Hardy-Weinberg equilibrium (Chi-squared statistic <20) were kept for analysis.
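The variant allele frequency rule for WGS genotype calls can be sketched as follows. This is an illustration only: the function name `call_trisomic_genotype` is hypothetical, coverage is assumed to be the sum of reference and alternative reads, and the interval bounds are assumed to be inclusive, none of which is specified in the text.

```python
def call_trisomic_genotype(ref_reads, alt_reads, min_coverage=30):
    """Call a heterozygous trisomic genotype from read counts.

    Returns 'Alt:Ref:Ref', 'Alt:Alt:Ref', or None when the SNP is not
    called as heterozygous or fails the coverage filter.
    """
    depth = ref_reads + alt_reads  # assumption: coverage = ref + alt reads
    if depth < min_coverage:
        return None
    vaf = alt_reads / depth
    if 0.13 <= vaf <= 0.40:  # one alternative copy out of three
        return "Alt:Ref:Ref"
    if 0.60 <= vaf <= 0.87:  # two alternative copies out of three
        return "Alt:Alt:Ref"
    return None
```

A balanced 50% allele fraction falls between the two intervals and is deliberately left uncalled, which is what one would expect for a trisomic chromosome.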

Type 1 informative SNPs: variants with rare two-copy alleles

On chromosome 21, let \(M\) be the number of heterozygous SNPs in a DS proband that are properly genotyped, and let \({{\mathbf{\Omega }}}={({{{\bf{G}}}}_{i})}_{i=1}^{M}\) denote the genotypes of these SNPs, where \({{{\bf{G}}}}_{i}\) is the genotype of the \({i}^{{th}}\) SNP. As there are three copies of chromosome 21, the genotype of a heterozygous SNP always consists of two copies of one allele and a single copy of another allele (hereafter referred to as the two-copy allele and the one-copy allele, respectively). \({{{\bf{G}}}}_{i}\) can be further denoted as

$${{{\bf{G}}}}_{i}=\left({a}_{i},{b}_{i}\right),\,{a}_{i},{b}_{i}\in \left\{A,T,C,G\right\}{{\rm{and}}}\,{a}_{i}\ne {b}_{i}$$
(1)

where \({a}_{i}\) is the two-copy allele and \({b}_{i}\) is the one-copy allele. As an example, for a heterozygous SNP with genotype \({TTG}\), we have \({a}_{i}=T\) and \({b}_{i}=G\). The position of the \({i}^{{th}}\) heterozygous SNP on chromosome 21 is denoted as \({l}_{i}.\) The genotypes \({({{{\bf{G}}}}_{i})}_{i=1}^{M}\) are ordered according to their positions, i.e., \({l}_{i} < {l}_{j},\forall {i} < j\).

Let \({{\rm{AF}}}(l,m,g)\) denote the allele frequency of a SNP on chromosome 21, at position \(l\), with \(m\in \{{{\rm{eur}}},{{\rm{afr}}},{{\rm{amr}}},{{\rm{eas}}},{{\rm{sas}}}\}\), representing European, African, American, East Asian, and South Asian genetic ancestries, and \(g\in \left\{A,T,C,G\right\}\), representing the allele. We define a variant as having rare two-copy allele status if

$${\max }_{k}{{\rm{AF}}}\left({l}_{i},{m}_{k},{a}_{i}\right) < 0.003,$$
(2)

where \({\{{m}_{k}\}}_{k}\) are the ancestries that account for more than 25% in the genetic ancestry admixture of the subject.

In inequality (2), the allele frequencies were obtained from gnomAD29, or from ALFA30 if the SNP is not reported in gnomAD. The genetic ancestry admixture fractions were estimated by using iAdmix31.

In genomic ranges with two haplotypes, as the third allele is a duplicate, rare alleles are as likely to be duplicated as the common alleles. In contrast, when there are three independent haplotypes, a rare allele is unlikely to appear twice, making rare two-copy alleles more likely to appear in two-haplotype blocks (Fig. 1B).
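Inequality (2) can be illustrated with a short sketch. The function name `is_rare_two_copy_allele` and the dictionary-based inputs are hypothetical; the handling of a subject with no ancestry above the 25% threshold (returning `False`) is an assumption made here for the sake of a complete example, not something stated in the text.

```python
def is_rare_two_copy_allele(af_by_ancestry, admixture,
                            af_cutoff=0.003, min_fraction=0.25):
    """Check inequality (2) for the two-copy allele of one SNP.

    af_by_ancestry: allele frequency of the two-copy allele per ancestry.
    admixture: estimated ancestry fractions of the subject.
    """
    # Only ancestries contributing more than 25% of the admixture are considered.
    major = [m for m, fraction in admixture.items() if fraction > min_fraction]
    if not major:
        return False  # assumption: treat as not informative
    return max(af_by_ancestry[m] for m in major) < af_cutoff
```

For a subject who is 70% European and 30% African, for instance, the allele must be rare in both of those ancestries, whereas its frequency in East Asian populations would be ignored.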

Type 2 informative SNPs: variants in rare hypothetical haplotypes

When the three copies of chromosomes 21 originate from only two haplotypes (i.e., two of the three chromosomes are identical), then the two distinct haplotypes on heterozygous SNPs can be represented as \({({a}_{i})}_{i=1}^{M}\) and \({({b}_{i})}_{i=1}^{M}\), where former is formed by all the two-copy alleles, and the latter is formed by all the one-copy alleles. For the locus of the \({i}^{{th}}\) SNP, we can generate the hypothetical haplotypes formed by itself and its two immediately adjacent SNPs (a total of 3 loci), which can be represented as \(({a}_{i-1},{a}_{i},{a}_{i+1})\) and \(({b}_{i-1},{b}_{i},{b}_{i+1})\). For genomic ranges with three haplotypes, this procedure shuffles and recombines the alleles, creating arbitrarily rare haplotypes that do not appear in the general population (Fig. 1B). The frequency of the hypothetical haplotype \(({a}_{i-1},{a}_{i},{a}_{i+1})\) in the population can be calculated as \(F({a}_{i-1},{a}_{i},{a}_{i+1})/N\), where \(F({a}_{i-1},{a}_{i},{a}_{i+1})\) and \(N\) represent the number of occurrences of \(({a}_{i-1},{a}_{i},{a}_{i+1})\) and the total number of haplotypes reported in the 1000 Genomes project32, respectively. The expected frequency of the hypothetical haplotype \(({a}_{i-1},{a}_{i},{a}_{i+1})\), under the assumption of no linkage disequilibrium, can be calculated as \({{\rm{AF}}}({l}_{i-1},{a}_{i-1})\times {{\rm{AF}}}({l}_{i},{a}_{i})\times {{\rm{AF}}}({l}_{i+1},{a}_{i+1})\). We score \(({a}_{i-1},{a}_{i},{a}_{i+1})\) as the ratio between observed and expected frequencies (O/E ratio):

$${S}_{{ai}}=\frac{\frac{F\left({a}_{i-1},{a}_{i},{a}_{i+1}\right)}{N}}{{{\rm{AF}}}\left({l}_{i-1},{a}_{i-1}\right)\times {{\rm{AF}}}\left({l}_{i},{a}_{i}\right)\times {{\rm{AF}}}\left({l}_{i+1},{a}_{i+1}\right)}$$
(3)

Note that, unlike in inequality (2), here the allele frequency \({{\rm{AF}}}({l}_{i},{a}_{i})\) was obtained from the 1000 Genomes project (n = 3202 individuals, retrieved on April 4th, 2022). Because of the small number of available haplotypes in this database, the genetic ancestries were ignored when calculating the allele frequency for haplotype analysis. The hypothetical haplotype formed by the one-copy alleles \(({b}_{i-1},{b}_{i},{b}_{i+1})\) can be scored similarly, and a SNP is defined to be in a rare hypothetical haplotype when

$$\min \left({S}_{{ai}},{S}_{{bi}}\right) < 0.1$$
(4)

The steps involved in identifying rare two-copy alleles and rare hypothetical haplotypes are summarized in Fig. 1C.
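The O/E score of Eq. (3) and the test in inequality (4) can be sketched on a toy reference panel. The function names are hypothetical, the panel is represented as a plain list of three-allele tuples rather than real 1000 Genomes data, and scoring an allele that is absent from the panel as 0 is an assumption made here to avoid a 0/0 division.

```python
from collections import Counter


def allele_frequencies(panel):
    """Per-locus allele frequencies computed from a panel of 3-SNP haplotypes."""
    n = len(panel)
    return [
        {allele: count / n for allele, count in Counter(h[k] for h in panel).items()}
        for k in range(3)
    ]


def oe_ratio(haplotype, panel):
    """Observed/expected frequency of one hypothetical 3-SNP haplotype (Eq. 3)."""
    observed = Counter(panel)[haplotype] / len(panel)
    expected = 1.0
    for allele, freqs in zip(haplotype, allele_frequencies(panel)):
        expected *= freqs.get(allele, 0.0)
    # Assumption: an allele never seen in the panel (0/0) is scored as 0.
    return observed / expected if expected > 0 else 0.0


def in_rare_hypothetical_haplotype(two_copy, one_copy, panel, cutoff=0.1):
    """Inequality (4): the smaller of the two O/E scores is below the cutoff."""
    return min(oe_ratio(two_copy, panel), oe_ratio(one_copy, panel)) < cutoff


# Toy panel with two haplotypes in strong linkage disequilibrium.
panel = [("A", "C", "G")] * 6 + [("T", "G", "A")] * 4
```

With this toy panel, hypothetical haplotypes that match real panel haplotypes score well above 1, whereas a shuffled combination such as `("A", "G", "G")` is never observed and scores 0, which is the signal the text attributes to three-haplotype regions.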

Hidden Markov models

For each DS case, we constructed a hidden Markov model (Fig. 1D) on the set of heterozygous SNPs \(\Omega\), with the components defined as follows:

  1. Vector of hidden states,

    $${{\bf{Q}}}={\left({q}_{i}\right)}_{i=2}^{M-1},\,{q}_{i}\in \left\{2{{\rm{H}}},\, 3{{\rm{H}}}\right\},$$
    (5)

    where \(2{{\rm{H}}}\) and \(3{{\rm{H}}}\) indicate two-haplotype and three-haplotype states, respectively.

  2. Vector of observations

    $${{\bf{O}}}={\left({o}_{i}\right)}_{i=2}^{M-1},{o}_{i}=\left\{\begin{array}{cc}{{\rm{T}}}1,\hfill & {{\rm{if}}}\left(2\right)\,{{\rm{is}}}\; {{\rm{true}}},\hfill \\ {{\rm{T}}}2,\hfill & {{\rm{if}}}\left(4\right)\,{{\rm{is}}}\; {{\rm{true}}},\hfill \\ {{\rm{Others}}},& {{\rm{otherwise}}},\hfill \end{array}\right.$$
    (6)

    where \({{\rm{T}}}1\) and \({{\rm{T}}}2\) stand for type 1 and type 2 informative SNPs, respectively. For cases genotyped by SNP arrays, only rare hypothetical haplotypes were used as only common SNPs are covered by the platform.

  3. Initial, state transition, and emission probabilities. To define these, we first randomly selected 40 cases for manual inspection of the two- vs three-haplotype state, most of which were visually apparent (e.g., based on the AF of two-copy alleles and the O/E ratio of hypothetical haplotypes, Fig. 1G). For the two-haplotype state, the initial probability was set as the proportion of samples with the first variant along the chromosome being assigned to the two-haplotype state; the transition probabilities were set as the proportion of variants followed by a state transition (a crossover event); for the emission probabilities, we used the proportion of each type of variant being assigned to the two-haplotype state. This was then repeated to set the parameters for the three-haplotype state.

The most likely vector of hidden states was estimated by using the Viterbi algorithm33.
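To make the decoding step concrete, here is a minimal log-space Viterbi sketch for a two-state model with the three observation symbols defined in Eq. (6). All probability values below are hypothetical placeholders chosen for illustration; the text does not report the fitted parameters, and this is not the authors' implementation.

```python
import math

STATES = ("2H", "3H")

# Hypothetical parameters for illustration only.
START_P = {"2H": 0.5, "3H": 0.5}
TRANS_P = {"2H": {"2H": 0.999, "3H": 0.001},
           "3H": {"2H": 0.001, "3H": 0.999}}
EMIT_P = {"2H": {"T1": 0.10, "T2": 0.02, "Others": 0.88},
          "3H": {"T1": 0.001, "T2": 0.20, "Others": 0.799}}


def viterbi(observations, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most likely hidden-state path for a discrete HMM."""
    log = math.log
    # score[s]: best log-probability of any path ending in state s.
    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log(trans_p[r][s]))
            pointers[s] = best
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][obs])
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because state transitions are heavily penalized, isolated non-informative SNPs do not break a block; a switch is only called when enough informative SNPs of the opposite type accumulate, which is consistent with the difficulty of calling very short blocks noted in the Discussion.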

Simulation

We estimated the distributions of (i) the allele frequency of the two-copy allele and (ii) the score of the 3-SNP hypothetical haplotypes, under the assumptions of the trisomic chromosome 21 having two or three haplotypes, by sampling random genotypes/haplotypes from reference databases. The simulation was done by repeating the following procedure 100 times:

  i. Allele frequencies of two-copy alleles

    a. Three haplotypes: for each SNP reported in the reference database (gnomAD or ALFA), three independent alleles were sampled based on allele frequencies in the reference, and the allele frequency of the two-copy allele was recorded if the genotype was heterozygous.

    b. Two haplotypes: for each SNP, two independent alleles were sampled based on allele frequencies in the reference, and one was doubled at equal probability to make three alleles. If the final genotype was heterozygous, the allele frequency of the two-copy allele was recorded.

  ii. O/E ratio of hypothetical haplotypes

    a. Three haplotypes: three complete haplotypes of chromosome 21 were sampled from the 1000 Genomes dataset, and the O/E ratios for all hypothetical haplotypes formed by three adjacent SNPs were recorded.

    b. Two haplotypes: similarly, two complete haplotypes of chromosome 21 were sampled. One of the haplotypes was doubled to make a total of three haplotypes, and the O/E ratios for all the hypothetical haplotypes formed by three adjacent SNPs were recorded.
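Step (i) of the procedure above can be sketched for a single biallelic SNP as follows. The function names are hypothetical, and the example draws alleles from one assumed alternative allele frequency rather than from gnomAD or ALFA records.

```python
import random


def two_copy_allele(alt_af, n_haplotypes, rng):
    """Sample one trisomic genotype; return its two-copy allele or None.

    n_haplotypes=3: three independent alleles are drawn.
    n_haplotypes=2: two alleles are drawn and one of them is duplicated
    with equal probability.
    """
    draws = ["alt" if rng.random() < alt_af else "ref" for _ in range(n_haplotypes)]
    if n_haplotypes == 2:
        draws.append(rng.choice(draws))
    n_alt = draws.count("alt")
    if n_alt in (0, 3):
        return None  # homozygous genotype, not recorded
    return "alt" if n_alt == 2 else "ref"


def alt_alt_ref_rate(alt_af, n_haplotypes, n_trials, seed=0):
    """Fraction of simulated genotypes with the Alt:Alt:Ref configuration."""
    rng = random.Random(seed)
    hits = sum(two_copy_allele(alt_af, n_haplotypes, rng) == "alt"
               for _ in range(n_trials))
    return hits / n_trials
```

For a rare allele with frequency p, Alt:Alt:Ref arises with probability p(1 − p) under the two-haplotype assumption but only 3p²(1 − p) under the three-haplotype assumption, so comparing `alt_alt_ref_rate(0.002, 2, ...)` with `alt_alt_ref_rate(0.002, 3, ...)` should show the former to be far larger.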

Trio analysis and performance evaluation metrics

When genotyping data of the DS child and both parents were available, the stage of NDJ error and the crossover events could be determined as described previously11. Briefly, SNPs that were homozygous non-reference in one parent but homozygous reference in the other were used as markers to infer the parent of trisomy 21 origin. In the DS child, these SNPs were heterozygous, and the two-copy alleles were from the parent of origin of the extra copy of chromosome 21. After the parent of origin was determined, SNPs that were heterozygous in the parent of origin but homozygous reference in the other parent were used to segment chromosome 21 into two-haplotype or three-haplotype blocks. For such SNPs, the non-reference allele could be doubled or absent when there were two haplotypes in the DS child. When there were three haplotypes, it was always the reference allele that was doubled. For these SNPs, whether the non-reference allele was doubled was used as the observation in HMMs to segment chromosome 21 into blocks of two or three haplotypes.
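The parent-of-origin step can be sketched as a simple vote over marker SNPs. This is an illustration under assumptions: genotypes are encoded as alternative-allele counts (0 to 2 for parents, 0 to 3 for the child), the function name `infer_parent_of_origin` is hypothetical, and resolving the call by majority vote is a choice made here rather than a detail given in the text.

```python
def infer_parent_of_origin(markers):
    """Vote on the parent contributing the extra chromosome 21.

    markers: iterable of (mother_alt, father_alt, child_alt) tuples.
    Only SNPs where the parents are opposite homozygotes are informative.
    """
    votes = {"maternal": 0, "paternal": 0}
    for mother_alt, father_alt, child_alt in markers:
        if {mother_alt, father_alt} != {0, 2}:
            continue  # parents are not opposite homozygotes
        if child_alt not in (1, 2):
            continue  # child is not heterozygous at this marker
        two_copy_is_alt = child_alt == 2
        mother_carries_alt = mother_alt == 2
        # The two-copy allele in the child comes from the parent of origin.
        origin = "maternal" if two_copy_is_alt == mother_carries_alt else "paternal"
        votes[origin] += 1
    if votes["maternal"] == votes["paternal"]:
        raise ValueError("no informative markers or tied vote")
    return max(votes, key=votes.get), votes
```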

The performance of MeiHMM in NDJ error and crossover event identification was evaluated by comparison with the results of trio analysis on the same set of DS cases (n = 152). For NDJ error classification, accuracy was used to measure the overall performance. It is defined as

$${{\rm{Accuracy}}}=\frac{{{\rm{number}}}\; {{\rm{of}}}\; {{\rm{cases}}}\; {{\rm{with}}}\; {{\rm{results}}}\; {{\rm{consistent}}}\; {{\rm{with}}}\; {{\rm{trio}}}\; {{\rm{analysis}}}}{{{\rm{total}}}\; {{\rm{number}}}\; {{\rm{of}}}\; {{\rm{cases}}}}$$
(7)

For crossover event identification, let \({{\rm{TP}}}\), \({{\rm{FP}}}\) and \({{\rm{FN}}}\) denote the number of true positives, false positives, and false negatives, respectively. The performance evaluation metrics were \({{\rm{precision}}}={{\rm{TP}}}/({{\rm{TP}}}+{{\rm{FP}}})\) and \({{\rm{sensitivity}}}={{\rm{TP}}}/({{\rm{TP}}}+{{\rm{FN}}})\).
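As a worked check of these definitions, the counts reported in the Results (146 of 152 cases consistent with trio analysis; 87 of 95 trio-identified crossovers recovered, with no false positives) reproduce the stated metrics. The function names below are illustrative.

```python
def accuracy(n_consistent, n_total):
    """Eq. (7): fraction of cases consistent with trio analysis."""
    return n_consistent / n_total


def precision(tp, fp):
    return tp / (tp + fp)


def sensitivity(tp, fn):
    return tp / (tp + fn)


# Counts taken from the Results section; FN = 95 - 87 = 8 missed crossovers.
print(f"accuracy    = {accuracy(146, 152):.1%}")   # 96.1%
print(f"sensitivity = {sensitivity(87, 8):.1%}")   # 91.6%
print(f"precision   = {precision(87, 0):.1%}")     # 100.0%
```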

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.