Main

Germline mutations are transmitted from parents to their offspring5. They occur before the formation of a fertilized egg, are found in all cells of the offspring and can be assessed by sampling somatic tissues. Considerable knowledge exists on germline mutations that appear de novo in offspring (DNMs). The majority of DNMs are of paternal origin, although there are genomic regions within which parental contributions are almost equal6. DNMs contribute to disease risk7, their frequency correlates with parental age at conception6,7 and they cluster at areas near recombination events8,9. However, little is known about DNMs that prevent a fertilized egg from developing to term and it is unclear why some sites, or genes, lack sequence variants10,11 and why some variants are never observed in homozygous state12.

Recombination shuffles the genome between generations, enabling selection of certain haplotypes in favour of others13. Although this provides an evolutionary advantage, the recombination process comes at a cost, as it is mutagenic8,9. Recombination failure during meiosis can also result in aneuploidies, that is, extra (trisomies) or missing (monosomies) chromosomes1,14. A quarter of all pregnancies ends in pregnancy loss15 and approximately half of all losses are explained by single chromosome aneuploidies or an extra set of chromosomes (triploidies)14. We previously identified that a sequence variant in SYCE2 associates with recombination patterns and pregnancy loss16, with a more pronounced effect on longer chromosomes. Aneuploidies of larger chromosomes are rarer in pregnancy loss than in blastocysts from in vitro fertilization3. Overall, this indicates that there is considerable structural variation present in early pregnancy loss that is only accessible by directly sampling products of conception that are not carried to term.

Finding the genetic causes of chromosomally normal (euploid) pregnancy loss has also proved challenging. Previous genetic studies on euploid pregnancy loss have mainly focused on late losses or on pregnancies complicated by structural anomalies17,18, while studies involving earlier losses have been limited by inaccessibility of fetal material and maternal cell contamination19. The role of small sequence variants (SSVs; small indels and single-nucleotide variants (SNVs)) and copy-number variants (CNVs) has therefore not been directly assessed at scale in early pregnancy loss.

Despite the hefty societal cost, early pregnancy loss is under-researched and there is a paucity of therapeutic interventions15. The Copenhagen Pregnancy Loss (COPL) study was initiated to improve the understanding of pregnancy loss by applying large-scale recruitment of consecutive patients with clinically diagnosed pregnancy loss, detailed phenotyping and comprehensive sampling of fetal and parental material. Here we present the results from our genomic study of 467 trios affected by pregnancy loss (fetus, mother and father), including 1,007 fetal samples and 934 parental samples (Fig. 1a). This enabled us to document the sequence diversity that is present only in early pregnancy loss and describe the complex interplay between meiotic recombination and point mutations.

Fig. 1: Sequencing overview and fetal presence in pregnancy loss samples.
figure 1

a, Flowchart of the study. b, The gestational ages of fetal samples. The gestational age was estimated as the time difference from the last menstruation to the date of pregnancy loss. c, The paternal fraction of fetal samples. The paternal fraction is shown for intervals of gestational age estimated on the basis of the last menstruation (GA–LM): weeks 0–9, weeks 9–12 and weeks 12–22. d, Kinship between fetal and parental samples. There are 26 (11 pregnancy loss cases) and 93 (59 pregnancy loss cases) fetal samples that had higher than expected kinship with their father and mother, respectively. Furthermore, there are 27 (11 pregnancy loss cases) and 42 (19 pregnancy loss cases) fetal samples from paternal and maternal triploidies, respectively.

Source Data

The COPL cohort

Most of the pregnancy loss cases occurred in the first 12 weeks of gestation, as measured by time since last menstruation (Fig. 1b). If possible, multiple fetal samples were collected from each pregnancy loss. We sequenced DNA from 1,439 fetal samples from 664 loss cases, along with 1,111 parental samples of predominantly European background (83%), to an average depth of 36.5× (Fig. 1a and Supplementary Figs. 1 and 2). For 467 pregnancy loss cases, we sequenced at least one fetal sample and samples from both parents, with 2.2 samples sequenced on average per fetus (1,007 samples). Collected tissue was dissected and categorized as chorionic villi, fetal and unknown. Chorionic villus samples were the most common (52.4%; Fig. 1b) followed by tissues of unknown (28.1%) and fetal (19.5%) origin.

Tissue samples from pregnancy loss are often maternally contaminated, especially in samples from early loss and in those of unknown origin19,20,21. We found that the paternal fraction (Supplementary Note 1) was less than 25% for 275 out of 1,007 fetal samples, indicating substantial maternal contamination in the cohort (Fig. 1c), comparable to previous studies19.

Triploidies and aneuploidies

Oversharing of alleles between the fetus and mother can indicate either maternal cell contamination or a maternally transmitted triploidy. Triploidies are life-limiting, and known causes of pregnancy loss, particularly in the first trimester2,19,21,22. We found 59 pregnancy loss cases in which at least one fetal sample had higher than expected kinship with the mother, and 11 with the father (Fig. 1d). Maternally transmitted triploidy would present as alternating regions of both homologous chromosomes and of sister chromatids from the mother in the fetal genome (Fig. 2). To discern between maternal oversharing due to contamination and oversharing due to triploidy, we searched these losses for chromosomal crossovers using sequence variants specific to one of the parents (homologous-sister informative alleles; HS alleles; Fig. 2 and Methods). In 19 out of 59 pregnancy loss cases with maternal oversharing, we found numerous crossovers (>20) between sister chromatid state and homologous chromosome state due to meiotic recombination (Figs. 1d and 2). We also found numerous crossovers for all 11 pregnancy loss cases with paternal oversharing. The combined incidence (30 triploidies; 6.4% of the set) is consistent with previous reports2,19,23.

Fig. 2: Maternal triploidy example.
figure 2

a, Schematic of meiosis II error. Meiosis I error is shown in Extended Data Fig. 1. b, Maternal crossovers at chromosome 2 for a triploid fetus. The phase-informative sequence variants in the fetal genome are informative about transitions between the homologous chromosome and sister chromatid states along the genome. The phase-informative variants are limited to heterozygous (0/1) variants in the mother (1 is the HS allele). In the homologous state, the maternal variants are always present in the fetal genome (0/0/1), as both of the maternal haplotypes are present. In the sister state, the maternal variants are either absent (0/0/0) or have an allelic balance of 2/3 (0/1/1).

We used whole-genome sequencing (WGS) data to search for aneuploidies among the 467 trios affected by pregnancy loss (Methods and Supplementary Note 2), and found them in 206 loss cases (44.1%, 380 fetal samples). Monosomy X (9.9%) and trisomy 16 (9.2%) were the most frequent aneuploidies (Fig. 3a and Supplementary Table 1), comparable with studies using karyotyping or SNP arrays2,19,21,22. WGS also enables the detection of chromosomal structural variants that would not be captured with SNP arrays. We detected 19 large de novo CNVs (>200 kb) in 14 distinct pregnancy loss cases (23 samples) (Methods, Supplementary Fig. 3 and Supplementary Table 2). In five of the loss cases (11 samples), we detected deletion–duplication pairs that were probably part of the same event, three of which represent translocations between two different chromosomes (Fig. 3b and Supplementary Table 2). In two loss cases, we detected a deletion–duplication pair on chromosome 8 (Fig. 3c) that probably represents unbalanced translocations facilitated by non-allelic homologous recombination between the inverted and non-inverted sequence in heterozygous carriers on chromosome 8 (ref. 24).

Fig. 3: Characterization of triploidies, aneuploidies and CNVs.
figure 3

a, The number and distribution of phased trisomies/monosomies. CNVs are shown in Supplementary Fig. 3. X monosomies are depicted as the loss of the X chromosome (chr.) as we cannot discern between paternal loss of Y and X. b, Translocation of chromosome 15 and chromosome 16. The panels by row are fetal samples from the same pregnancy loss. c, Translocations of 8q. In two pregnancy loss cases, we find pairs of duplications/deletions on chromosome 8 that are probably translocations of 8q. Top, a pregnancy loss in which the mother transmitted the unbalanced translocation. Bottom, a pregnancy loss with a paternally transmitted translocation. The horizontal lines in b and c are the genome-wide averages of the phased coverage. d, Centromere state for trisomies. e, The number of centromeres in a homologous state per triploid case.

Source Data

We ran structural variant detection25,26 on the flanking regions (±5 kb) of the CNVs to annotate them (Methods and Supplementary Table 2). We found the exact breakpoints for 12 CNVs and read pairs supporting the classification of eight CNVs as part of four translocations (Supplementary Figs. 47). Moreover, we found read-alignment patterns of balanced translocations in fathers from two of these losses (Supplementary Figs. 6 and 7), which substantially increases the risk of loss in future pregnancies27. None of the CNVs were near (<100 kb) a common fragile site (Supplementary Table 2). Out of the 14 distinct CNV events, 11 were in euploid losses and 6 of those fulfilled pathogenicity criteria (Methods).

Meiotic origin of the extra chromosomes

At meiosis I, homologous chromosomes segregate to opposite spindle poles, while sister chromatids remain connected at the centromeres28 (Fig. 2a). As crossovers are suppressed at centromeres, the presence of both homologous chromosomes from the same parent at all loci near the centromere (pericentromeric) is indicative of a meiosis I error1 (Extended Data Fig. 1). We found that HS alleles from both homologous chromosomes were generally present at trisomies of chromosomes 15, 16, 21 and 22 (Fig. 3d), pointing to meiosis I errors. Of the 19 maternally transmitted triploidies, 5 and 12 were consistent with meiosis I and II failure, respectively. By contrast, among the paternally transmitted triploidies, we detected homologous chromosomes at the pericentromeres for around half of the chromosomes per fetal sample, pointing to dispermy fertilization29 (Fig. 3e). The two remaining maternally transmitted triploidies showed a similar pattern that might be the product of two independent ova and one spermatid29.

We compared the crossovers identified in the extra chromosomes from triploidies with the sequence-level recombination map from Icelanders30. We found that 27.2% (95% confidence interval (CI) = 23.7–30.6%) and 32.3% (95% CI = 26.6–37.9%; Fig. 4a) of crossovers identified in paternal and maternal triploidies occur within recombination hotspots, respectively (1.6% and 1.8% of the genome). This supports our interpretation that the alternating patterns of homologous-sister segments along the genome represent true crossovers. Although this is substantially less than the fraction of crossovers in recombination hotspots (74.9% and 71.1% for paternal and maternal crossovers, respectively) used to define the recombination map, this could be partly attributed to the window approach used in this study rather than identifying crossovers on the resolution of heterozygous markers.

Fig. 4: Crossovers detected in aneuploidies and triploidies.
figure 4

a, Hotspot use of crossovers in pregnancy loss. The dashed lines are the sex-specific use of hotspots from a previous study8 (male is blue and female is red). The points are the mean values and vertical error bars are the 95% CIs based on a block jackknife procedure. There are 11 paternal triploidies, 16 maternal triploidies and 110 maternal trisomies with detected crossovers. b, Crossover locations in maternal triploidies. The crossover locations in paternal triploidies are shown in Extended Data Fig. 2. c, Crossover locations in trisomies. d, Chromosomes with or without a crossover in maternal trisomies.

Source Data

To assess whether lost fetuses had a different recombination pattern compared with individuals who made it to term, we tallied the crossovers across the genome for triploidies and trisomies (Fig. 4b,c and Extended Data Fig. 2). We found that only 113 out of the 161 maternal trisomies had a crossover and three of the five paternal ones (Fig. 4c). The rate of crossover absence was higher for trisomies of chromosomes 21 and 22 compared with the rest (Fig. 4d; odds ratio (OR) = 4.5, P = 4.71 × 10−5, Fisher’s exact test). This has also been observed for oocytes derived from fetuses31, supporting that meiotic disjunction of chromosomes 21 and 22 suffers from fewer crossovers established during oocyte arrest. Among autosomes, these chromosomes contribute the fewest number of crossovers to genetic maps8.

Mutation detection

The WGS analysis enabled us to investigate the rate and mutational spectrum of DNMs (SNVs and indels) in the pregnancy loss cases by comparing fetal genomes with parental genomes. In 724 fetal samples with low maternal contamination (paternal fraction > 25%; paternal kinship coefficient > 0.2; 366 distinct pregnancy loss cases; Fig. 1a), we identified 75,345 DNMs, with an average of 103.3 DNMs per loss (95% CI = 100.7–105.9). We estimate a discrepancy rate of 8.0% (95% CI = 7.7–8.2%, binomial test) by comparing multiple fetal samples from the same pregnancy loss (Supplementary Note 3).

Discrepancies between samples can result from artifacts or post-zygotic mutations (Supplementary Note 3 and Extended Data Fig. 3). Using 42 pregnancy loss cases with proper fetal and two villus samples, we classified DNMs as post-zygotic mutations if they were non-constitutional (allelic balance (AB) < 50%) in at least one villus sample, and found 9.7 (P = 1.92 × 10−7, t-test) post-zygotic mutations per villus that were absent from the proper fetus. By contrast, we identified 1.6 (P = 1.07 × 10−6, t-test) post-zygotic mutations in villus samples that were also in the proper fetus, occurring before trophoblast branching in early embryogenesis. This is supported by their high frequency in the proper fetus (Supplementary Fig. 8). Villus samples in general had a lower allelic balance (−2.9%; P = 2.07 × 10−40, t-test) and a higher number of DNMs (11.1 DNMs; P = 9.22 × 10−9, t-test; Fig. 5a and Extended Data Fig. 4) compared with proper fetal samples. We compared the mutational spectrum of the DNMs to mutational signatures identified in cancer samples32 and found that all samples with a detectable mutational signature 18 from COSMIC33 (characterized by C>A mutations) were villus samples (P = 0.0005, Fisher’s exact test), consistent with previous studies of placentas34. Overall, this indicates a considerable accumulation of somatic mutations in early development, confined to the trophoblast lineage.

Fig. 5: DNMs detected in pregnancy loss.
figure 5

a, The number of DNMs against the allelic balance per fetal sample. Only fetal samples with low maternal contamination and no detected triploidies were used (Extended Data Fig. 4 includes all samples). b, The paternal fraction of DNMs based on the chromosomal status of the DNM. The estimates are based on 366 (total), 143 (euploid), 337 (chromosome pairs), 11 (triploid father), 18 (triploid mother) and 131 (trisomy mother) unique pregnancy loss cases. c, The number of DNMs per fetal sample against paternal age at conception. The shaded areas represent 95% CIs. See Extended Data Fig. 5 for tissue type colouring. d, A mutation occurring before the formation of sister chromatids (pre-sister) is expected to have an allelic balance of 2/3 in the sister state. A mutation occurring after sister formation (post-sister) is expected to have a 1/3 allelic balance. Paternal triploidies caused by spermatozoa from two distinct spermatogonia fertilizing the oocyte are expected to have mutations in 1/3 allelic balance (Extended Data Fig. 9). e, The allele balance of DNMs conditioned on the sister chromatid state. The vertical line indicates an allelic balance of 50%. Both states are shown in Extended Data Fig. 8. f, Mutation classes of DNMs based on chromosomal status. The asterisk denotes the only significant comparison after multiple-testing correction (P = 0.0002 < 0.05/72) and the block jackknife procedure. The number of DNMs per combination of mutation class and chromosome group are reported above the bars. The numbers of unique pregnancy loss cases for the chromosome sets are 331 (euploid/chromosome pairs), 131 (trisomy–mother), 11 (triploid–father) and 18 (triploid–mother). For b and f, data are mean ± 95% CI based on a block jackknife procedure.

Source Data

The AB of the DNM allele is a key metric in the quality control of DNMs6. The presence of extra chromosomes, as in the trisomies and triploidies, distorts the expected AB from 1/2 to 1/3 or 2/3, which impacts DNM detection. Correcting for this difference in sensitivity (Supplementary Note 3 and Supplementary Fig. 9), we found that the paternal and maternal triploid fetuses carried 40 (95% CI = 23.0–57.1) and 25.9 (95% CI = 15.8–35.9) more DNMs on average than the euploid fetuses, respectively. This is expected as the triploid fetuses have an extra set of chromosomes.

We phased 15,086 DNMs (71.7%) to the paternal and 5,967 (28.3%) to the maternal chromosomes of the fetus (Supplementary Note 3). The high paternal contribution of DNMs to offspring is well known6,35 and, for a fetus receiving two sets of paternal chromosomes (paternal triploidy), we would expect a corresponding excess of paternal DNMs. Consistent with this expectation, we found that the paternal triploidies showed a higher fraction of paternally phased mutations (85.9%; 95% CI = 82.9–88.9%, block jackknife) compared with the euploid fetuses (71.4%; 95% CI = 69.8–73.0%, block jackknife; Fig. 5b). Furthermore, DNMs identified in maternal triploidies (59.5%; 95% CI = 55.3–63.7%, block jackknife) and trisomic chromosomes (54.5%; 95% CI = 46.4–62.5%, block jackknife) showed a lower paternal fraction than those in euploid fetuses.

The number of DNMs per proband is highly dependent on the parental ages at conception6,8,36,37. We observed a similar age effect in the pregnancy loss cases (Fig. 5c and Extended Data Figs. 5 and 6) as in the Icelandic set6,30. There is a higher number of DNMs (16.3; 95% CI = 14.8–17.8) in the pregnancy loss cases than in the Icelandic set after correcting for parental ages at conception, restricting to non-triploid samples and correcting for villus contribution (Extended Data Fig. 7). The villus contribution is consistent with independence from parental ages at conception, as expected for post-zygotic mutations (Extended Data Figs. 5 and 7). To examine this further, we estimated the mutational spectrum of villus contributions. We found that the highest villus contributions were for C>A (3.4 DNMs; P = 2.11 × 10−13, block jackknife) and indel (4.9 DNMs; P = 1.84 × 10−13, block jackknife) mutations (Extended Data Fig. 7 and Supplementary Table 3). To summarize, there is a greater contribution of post-zygotic mutations and potential false positives compared with the Icelandic set, but the strong parental age effects and paternal fraction indicate that most of the mutations identified in the pregnancy loss cases accumulated before completion of meiosis.

In the female germline, sister chromatids are formed in early development, in contrast to in the male germline, in which they are formed after puberty. If a mutation occurred before sister formation (pre-sister), it would be expected to be faithfully replicated on the sister chromatid (Fig. 5d). We would expect the AB of pre-sister DNMs to be 2/3 in the triploidies/trisomies (Fig. 5d), whereas DNMs not replicated on the sister chromatid (post-sister) would have an AB of 1/3. A cut-off of 50% is symmetric between the pre-sister and post-sister ABs and can be used to differentiate them. High-AB (>50%) DNMs within sister segments are a combination of pre-sister mutations and misclassified ones (false-positive DNMs or AB measurement errors), whereas high-AB mutations within homologous segments probably consist of only misclassified mutations as the expected AB is always 1/3. Thus, the difference in the prevalence of high-AB mutations between homologous and sister segments serves as an estimate of the fraction of pre-sister mutations (probably an underestimate; Supplementary Note 4). For maternally transmitted triploidies and trisomies, we found that the prevalence of high-AB DNMs was significantly higher within sister segments (9.3%; 95% CI = 8.0–10.7%, block jackknife) than homologous segments (2.7%; 95% CI = 2.2–3.3%, block jackknife; Fig. 5e and Extended Data Fig. 8). This indicates that 6.6% (95% CI = 5.1–8.0%, block jackknife) of the maternal mutations are pre-sister mutations.

Notably, we did not observe the sister/homologous state difference for high-AB DNMs in the paternally transmitted triploidies (0.1%; P = 0.77). On the paternal side, most DNMs are expected to accumulate in mitosis of spermatogonia before the formation of sister chromatids. We would therefore expect to see a high AB for half of the DNMs in the paternally transmitted triploidies if the extra set of chromosomes was derived from the same spermatogonium (Extended Data Fig. 9). However, we saw little or no high-AB DNMs within sister segments on the father’s side, indicating that the spermatozoa are derived from distinct spermatogonia (Fig. 5d), consistent with the equal fraction of homologous/sister states observed at centromeres of paternally transmitted triploidies (Fig. 3e).

We contrasted the mutational spectrum of DNMs from chromosomes with extra copies (trisomy–mother, triploid–father and triploid–mother) to DNMs from euploid fetuses and pairs of chromosomes not affected by trisomies/monosomies (euploid–chromosome pair; Fig. 5f; 3 × 3 × 8 = 72 tests). C>G mutations were found at a greater frequency among maternal trisomies compared with in the euploid–chromosome pair set (P = 0.002; euploid–chromosome pair). No other comparisons passed multiple-testing correction (Holm). Duplication of chromosome 16 is frequent among the aneuploidies (Fig. 3a), and we have shown previously6 that C>G mutations in the maternal germline accumulate at high rates on chromosomes 8 and 16. Furthermore, these C>G mutations at these C>G-enriched regions are more likely to occur in strand-coordinated mutational clusters near non-crossover gene conversions6,8,30, which are the products of double-stranded breaks (DSBs). As DSBs can cause translocations, we searched the translocation cases and found strand-coordinated clusters at chromosomes 15 and 16 for the chromosome 15–chromosome 16 case (Fig. 3b), mostly consisting of C>A, C>G and C>T mutations (87%) of maternal origin (22 out of 22 phased DNMs). This indicates that chromosome 16 is susceptible to both point mutations and aneuploidies in the ageing oocyte due to DSBs.

Pathogenic genotypes

We searched for pathogenic SSVs in the DNM set (366 pregnancy loss cases) and found 26 SSV genotypes in as many loss cases (45 fetal samples) that classified as pathogenic or likely pathogenic based on criteria from the American College of Medical Genetics and Genomics38. Of these, 23 were DNMs and 3 were biallelic predicted loss of function (pLoF) variants (Methods and Supplementary Table 4). For 11 of the lost fetuses carrying pathogenic DNMs, at least 1 proper fetal sample had been sequenced; for the remaining 12, only villus samples were available. One of the DNMs, a splice donor indel in FBLN1, was a post-zygotic mutation confined to a villus sample. Nonetheless, the mean AB of all pathogenic DNMs (0.495; non-triploid loss cases) was comparable to that of DNMs in proper fetal samples (0.489; non-triploid loss cases; P = 0.95), and the majority of phased variants was of paternal origin (6 out of 8, 75%), consistent with attributes of germline mutations5.

We compared the frequency of pathogenic SSVs in the cases of pregnancy loss to that in adult controls (7,760 trios; Methods) and found an enrichment of pathogenic SSVs in pregnancy loss compared with in the controls (OR = 2.98, P = 5.7 × 10−6, Fisher’s exact test). This enrichment was also significant after correcting for parental age (P = 4 × 10−3, χ2 test, logistic regression) and was not observed for synonymous DNMs (P = 0.36, χ2 test, logistic regression). The enrichments were 3.63 (P = 3.0 × 10−4) and 2.59 (P = 2.2 × 10−3) when restricting to euploid and aneuploid/triploid pregnancy loss cases, respectively. Of the 26 pathogenic SSVs identified, 12 were in euploid fetuses (12 out of 141; Supplementary Table 5 and Supplementary Note 5).

Considering the possibility that not all genes that are essential to early human development have been linked to postnatal phenotypes, we also searched the cases of euploid pregnancy loss for pLoF DNMs in genes highly constrained for LoF variants (Methods) and found an increase over the adult controls (OR, 4.0; P = 0.006, Fisher’s exact test). Three of the pLoF variants were in genes without an established link to disease: DDX5, ZMYM4 and HECTD1 (Supplementary Table 6). To better understand the role of these variants, as well as that of the pathogenic SSVs, in pregnancy loss, we annotated the genes with regards to previous evidence of lethality, relevance of the associated phenotypes to developmental defects, fetal expression of the genes39 and evidence of lethality from mouse knockouts (Supplementary Tables 79). We found that, on average, the genes mutated in euploid fetuses had higher expression in fetal tissue than the ones in aneuploid/triploid fetuses (P = 6.9 × 10−196, t-test; Extended Data Fig. 10).

One of the euploid fetuses was homozygous for a pLoF variant in DHCR7, a gene linked to autosomal recessive Smith–Lemli–Opitz syndrome40. The variant is a splice acceptor variant (NM_001360.2, c.964-1G>C) that is described to result in miscarriage or neonatal death in homozygous state40. DHCR7 c.964-1G>C exhibits a deficit of homozygosity among adults, that is, a lower number of observed homozygous individuals than expected based on frequency12. In our study of 1.52 million genotyped European adults12, DHCR7 c.964-1G>C had the most prominent homozygous deficit, with 92 homozygotes expected but none observed. Here we observed the c.964-1G>C variant in a homozygous state in 1 out of 366 trios, the majority of whom have European ancestry, with an expectation of 0.022 homozygotes (95% CI = 0–0.053; supplementary data 4 from ref. 12). This finding is consistent with our previously estimated fivefold increased risk of miscarriage among couples who are both heterozygous carriers of DHCR7 c.964-1G>C12.

We detected one other biallelic pLoF genotype among the euploid lost fetuses, a compound heterozygous genotype in CPLANE1 (NM_023073.3, c.6354dup (p.Ile2119TyrfsTer2) and c.493del (p.Ile165TyrfsTer17)). CPLANE1 pathogenic genotypes have been detected in fetuses from early pregnancy loss41, and another pLoF variant in CPLANE1 had a suggestive deficit of homozygosity in our previous study (supplementary table 18 from ref. 12). We did not find any other homozygous deficit variants (supplementary table 21 from ref. 12) presenting as biallelic genotypes in our cohort. The biallelic genotypes in DHCR7 and CPLANE1 infer a high recurrence risk (at least 25%) and, in both cases, the couple had a history of recurrent pregnancy loss (three consecutive losses; Supplementary Note 6). DNMs are generally expected to have a lower risk of recurrence, unless there is evidence of parental mosaicism42,43, which we did not observe in this set.

Discussion

Here we performed WGS analysis of 1,439 samples from 664 cases of early pregnancy loss and characterized the sequence diversity in 467 of the loss cases for which samples were available for the full trio. We found probable genetic causes in 254 out of 467 loss cases (55%), the majority of which were aneuploidies (44.1%), followed by triploidies (6.4%), pathogenic SSVs (3.3%) and de novo CNVs (1.3%) (Supplementary Table 5).

Most of the genetic causes originated on the maternal chromosome, mainly in the form of triploidies and trisomies, consistent with the well-described meiotic errors accompanying oogenesis. We observed an indication of interplay between point mutations and large structural variation. The oocytes are in dictyate arrest for decades1 and accumulate mutation clusters in specific regions of the genome6,44, for example, the p-arm of chromosome 8 and termini of chromosome 16. We found a translocation between chromosomes 15 and 16 (Fig. 3b) containing clusters of strand-coordinated DNMs, a likely byproduct of faulty DSB repair.

The extra chromosomes in the trisomy and triploidy cases gave us a unique view into recombination and mutational processes in the germline. We found that triploid lost fetuses carried 31.8 more DNMs on average than euploid lost fetuses, as expected given the extra set of chromosomes. However, these mutations stem from both homologous chromosomes of the parents in contrast to the DNMs observed in adult offspring, providing a more comprehensive assessment of the diploid germ cells. The presence of sister chromatids enabled us to estimate that 6.6% of maternal DNMs occurred before the formation of sister chromatids. Furthermore, we mapped crossover locations for the trisomies and triploidies, providing a complementary view of the recombination map of the human genome.

We found a threefold enrichment of pathogenic SSVs in the trios affected by pregnancy loss in comparison to adult trios, indicating that two-thirds of the pathogenic SSVs identified in the euploid lost fetuses are truly causative, explaining around 6% of those losses (8 out of 141). Assuming that approximately 1 in 17 (1/0.06) euploid loss cases is explained by SSVs and 1 in 8 pregnancies is lost with a euploid fetus, our findings indicate that at least 1 in every 136 (17 × 8) pregnancies will be lost due to a pathogenic SSV. There is an excess of DNMs in the trios affected by pregnancy loss compared with adult trios, partly due to false-positive calls and partly attributed to post-zygotic mutations specific to villi. Nevertheless, most of our pathogenic SSVs were consistent with germline origin and were found in greater frequency in trios affected by pregnancy loss than in adult trios. This shows that the contribution of DNMs to pregnancy loss is determined by whether they occur in essential genomic sequences rather than their overall count. Larger cohorts of sequenced euploid fetuses are needed to identify variants with incomplete penetrance and/or polygenic effects. We realize that high-depth sequencing is cost prohibitive at scale, but low-pass sequencing (around 1×) coupled with imputation is comparable to chip genotyping in terms of cost and quality45. Furthermore, long-read sequencing could aid detection of structural variants in euploid lost fetuses without a pathogenic SSV.

Studies of the contribution of pathogenic SSVs to pregnancy outcomes17,18 have mainly focused on pregnancies with structural abnormalities of the fetus, typically discovered after the first trimester. It is hard to compare such phenotypes with the ones found in cases of early pregnancy loss, including gestational age, fetal anatomical structure, and placental morphology. Notably, we found that one of the pathogenic DNMs was a post-zygotic mutation confined to villi, which are a precursor of the fetal part of the placenta. The mutation occurred in FBLN1 (also known as fibulin-1), which encodes a structural component of blood vessel walls46 and has particularly high expression in placental trophoblast giant cells (Supplementary Table 7), suggesting that perhaps this pregnancy loss was triggered by abnormal placental function.

We detected homozygous and biallelic genotypes in genes with a strong deficit of biallelic loss of function in adults, and a high risk of recurrence. We also considered the possibility that many genes that are important for early development might not yet be identified as such. We found pLoF DNMs in three euploid lost fetuses in constrained genes that are highly expressed during the fetal stage39. Two of these genes, DDX5 and HECTD1, encode proteins that are involved in cellular processes that are essential for early development, such as DNA replication47 and ubiquitylation48, but they were not associated with postnatal phenotypes at the time of submission. During revision of this Article, variants in HECTD1 were described to cause neurodevelopmental defects and even infant death49, further illustrating the potential of sequencing fetal material from pregnancy loss cases to identify gene–disease connections.

There are considerable differences between aneuploidies detected in blastocysts and in fetuses from pregnancy loss cases3. We are therefore probably still missing substantial sequence diversity that is lost in the period from implantation to a clinically recognized pregnancy. Here we have nonetheless shown that sequencing early products of conception that succumb to negative selection can identify essential genes and characterize their function in early development.

Pregnancy loss has been considered to be the natural fate of fetuses with severe genetic defects, making therapeutic interventions unwarranted. However, we are unable to explain 45% of sporadic unselected clinical pregnancy loss cases using WGS, pointing to other factors in the aetiology of pregnancy loss. Given the high incidence of euploid pregnancy loss cases without a clear monogenic cause in the fetus, there is an urgent need to understand their aetiologies and develop therapeutic interventions that can ensure healthy live births. This provides value not only for the couples wanting a child, but also for the many societies that are challenged by decreasing fertility rates and rapidly ageing populations50. Continued sampling of trios and biomarkers throughout pregnancy is needed to disentangle the complex interplay between the fetal and maternal genomes and the environment, and the cascade of events sparked by a single fertilized egg.

Methods

Cohort

The COPL study is a prospective cohort study, with the first patients recruited in November 2020 (described in detail previously20). Women with a confirmed or suspected pregnancy loss were referred by a physician to a gynaecological department at one of three hospitals in Denmark: Copenhagen University Hospital Hvidovre, Herlev or North Zealand.

Women were included in the study if older than 18 years, if the pregnancy loss was before gestational age 22 weeks + 0 days (154 days) and if an intrauterine pregnancy had been confirmed by ultrasound, including anembryonic pregnancy as per definition. The participants provided both oral and written informed consent. Excluded from the study were pregnancies of unknown location, extrauterine pregnancies and molar pregnancies. The study was approved by the Health Research Ethics Committees for the Capital Region of Denmark (H-18024745) and followed the General Data Protection Regulation governance.

Sampling

The woman and her male partner (if possible) had venous blood drawn in 4 ml EDTA tubes that were kept at 5 °C until transport in tubes with cushioning material. All of the blood samples from the woman were drawn before initiating treatment for the pregnancy loss while fetal material was in utero, or within 24 h of passage of the fetal material. In cases of spontaneous or medically induced (mifepristone plus misoprostol) emptying of the uterus (53%), the woman was instructed (both orally and in writing) to collect all pregnancy loss tissue at home and place it in the fridge at 5 °C before bringing it to the hospital within 1 week. If she had a surgical removal (47%), COPL staff collected the material at the hospital and placed it at 5 °C. In cases of second-trimester loss, induction and delivery took place at the hospital, and clinical staff collected biopsies from the fetal foot. COPL staff packed and shipped the tissue in padded envelopes to the University of Copenhagen (maximum transportation time 2 h), where it was stored at 4 °C.

We used a previously described method to extract DNA from fetal material20. In brief, fetal tissue samples were washed several times in sterile PBS to remove maternal blood before being examined under a stereomicroscope to identify fetal tissue and/or chorionic villi. In some cases, the tissue could not be identified as either and was designated unknown. Some of these unknown samples were probably blood clots and maternal decidua. The fetal tissue or villi were dissected into pieces of 0.5 cm2 and then subjected to three consecutive PBS washes before being placed into a 2 ml cryotube (Nunc, Thermo Fisher Scientific, 10577391), submerged in liquid nitrogen for snap-freezing and then stored at −80 °C. For DNA extraction, we used the DNeasy Blood and Tissue 96 with TissueLyser LT (Qiagen, 69504) according to the instructions provided by the supplier. In brief, the samples were put on dry ice and the frozen tissue was moved to a 2 ml Safe-Lock Eppendorf tube (Eppendorf, 0030 123.344), as recommended by the supplier. A single bead (Stainless Steel Beads, Qiagen, 69989) and appropriate buffers from the DNeasy kit were added to the tube containing the frozen tissue together with appropriate buffers, which was bead milled for 1 min at 30 Hz using a 5 mm steel bead on a TissueLyser LT (Qiagen, 85600). The remaining steps were performed according to the recommendations of the supplier. Parental blood samples and extracted fetal DNA were shipped to deCODE, Iceland, on dry ice.

All information on the participants and samples was stored in a secure cloud (REDCap) in the Capital Region of Denmark. Blood tubes were labelled with individual bar codes, a pseudonymized code, and whether it was a maternal or paternal sample. Information about the last menstruation (gestational age–last menstruation; GA–LM), pregnancy loss type (missed miscarriage or spontaneous ongoing miscarriage) and gestational age at the time of loss was gathered by interview and patient records on inclusion.

WGS analysis

Sequencing libraries were prepared using the NEBNext Ultra II PCR-free kit (New England Biolabs). In brief, 0.5–1 µg of genomic DNA was fragmented in 96-well TPX-AFA plates (Covaris) to a mean target size of 450–550 bp using the Covaris LE220plus instrument. End repair and A-tailing was performed in a single step followed by ligation of unique dual-indexed sequencing adaptors (IDT for Illumina) and two rounds of SPRI-bead purification (0.6×) using the Hamilton STAR NGS liquid handler. The quality (concentration and insert size) of the sequencing libraries was determined using the LabChip GX (96-samples) instrument (Perkin Elmer). Sequencing libraries were pooled appropriately using Hamilton STARlet liquid handlers and sequenced using the Illumina’s NovaSeq 6000 instrument. We performed paired-end sequencing on the S4 flowcell (v1.0 chemistry), resulting in read lengths of 2 × 150 cycles of incorporation and imaging, in addition to 2*8 index cycles. Real-time analysis involved conversion of image data to base calling in real time. All of the steps in the workflow were monitored using an in-house laboratory information management system with barcode tracking of all of the samples/plates and reagents.

Alignment and sequence variant calling

The WGS data were aligned against the human reference genome (hg38) as previously described51 and managed with an in-house laboratory information management system. In brief, reads were aligned with BWA mem52 (v.0.7.10) and marked for duplicates with Picard tools (v.1.117). The aligned reads in BAM format for fetal and parental samples were used as input to GraphTyper53 (v.2.7.5), the average coverage of the BAM files is shown in Supplementary Fig. 2.

Paternal fraction estimation

VCF files from GraphTyper were parsed to identify genotypes where the mother was homozygous, and where the father carried an allele absent from the mother. Sites where the genotype of the parents passed the following quality thresholds were considered for downstream analysis; a genotype quality greater than 20 and depth greater than 20. Furthermore, we required the AB to be within [30%, 95%] for heterozygous calls; and within a range of [0, 5%] or [95%, 100%] for homozygous reference or alternative calls, respectively.

For sites that passed these quality criteria, we counted the number of reads in the fetus supporting the paternal allele (Y), conditioned on the genotype of the father and the DP genotype field from the fetus (D). We then regressed the number of reads supporting the paternal allele in the fetus on allele counts of the paternal allele in the father (A). For an uncontaminated euploid fetus, the slope and intercept of this regression (Y = D × A) would be expected to be 1 and 0, respectively. The slope is expected to decrease with lower paternal contribution, the same as for maternal contamination or maternal triploidies.

Coverage of the sex chromosomes (allosomes) can be informative about the maternal contamination in cases of pregnancy loss with male fetuses (Supplementary Fig. 10). Consistent with expectation, we found that fetal samples with 20% of the allosome coverage on the Y chromosome were predominantly of low contamination (90.0% paternal fraction) compared with those without Y chromosome coverage (55.9% paternal fraction; P = 2.7 × 1061).

Estimation of genetic ancestry

Genetic ancestry of parental samples was estimated by running ADMIXTURE54 v.1.3.0 in supervised mode using 1000 Genomes55 CEU (Utah Caucasian, European), CHB (Beijing Han, East Asian), ITU (Indian Telugu, South Asian), PEL (Peruvian, Native American) and YRI (Nigerian Yoruba, African) as training populations. Markers were pruned for pairs in strong linkage disequilibrium by excluding long-range high-linkage-disequilibrium regions56 and running PLINK57 v.1.90b6.15 --indep-pairwise 200 25 0.4. ADMIXTURE was run on batches of ten test individuals at a time.

Coverage-based CNV calling

The CNV caller used for the pregnancy loss set is based on sequence coverage. Its aim is to find CNVs that are long enough for coverage to give enough evidence for calling them. To guard against too high variance in determining the copy-number state at a locus, the sequencing read coverage is binned into 1 kb regions, which is further normalized to overall genomic coverage (using the autosomes).

Sequence coverage shows dependency on G/C content and AT repeat content per region and depends on the fragment length distribution in sequence libraries. Most of the dependency can be captured by the number of G/C base pairs and the number of repeated AT motifs in each read pair fragment58. To correct for G/C content, we considered 400 bp non-overlapping windows across the genome, resulting in a table of number of reads starting at each 400 bp window, classified according to the number of G/C base pairs and the number of AT repeats. After coverage normalization, this gives a correction factor for each 400 bp window. Each 1 kb bin consists of a sequence of overlapping 400 bp fragments, each with a corresponding correction factor from the before mentioned table. The average of those correction factors is used as the correction factor for the 1 kb bin. This correction factor is applied to correct each bin, giving a G/C–AT-corrected normalized binned coverage in 1 kb bins, which are then used in the downstream CNV analysis.

The input data to the CNV caller consist of a series of G/C–AT-corrected normalized 1 kb coverage values for each sample on each chromosome. Supplementary Fig. 11 shows four such series at a locus. Each datapoint has an underlying count of sequencing reads that are assumed to be approximately Poisson distributed. Normalization and G/C–AT correction apply a scaling factor resulting in a scaled Poisson variable. The mean of the Poisson variable depends on the copy-number state of the sample in the bin, giving a mixture distribution. A few parameters specify the mean of each component in the mixture:

$$\alpha \times y \sim \mathop{\sum }\limits_{k-0}^{n}{\pi }_{k}\times {\rm{Poisson}}(\alpha \times \beta +\alpha \times \lambda \times {\mu }_{k})$$

where y is the normalized coverage per 1 kb bin, μk is a multiplier per copy-number state, α is the inverse scaling factor, β is a bias factor (usually giving the mean of the copy-number 0 state since μ0 is usually 0), λ controls the distance between copy number states and πk is the prior probability of copy-number state k. For Illumina sequencing data, we use 7 states, with μk = 0.5 × k.

Each bin gets their own α, β, λ and πk values, which are set to maximum-likelihood estimates for the data for all samples within the bin. A further shift parameter is sometimes used that specifies a shifting of the data on the Poisson scale (when going to the Poisson scale from the y scale, the shift is applied). The main purpose of that parameter is to fit a higher variance in the copy-number 0 cluster than a Poisson variable with mean close to 0 can explain. Supplementary Fig. 12 shows histograms for the CNV in Supplementary Fig. 11. The copy-number clustering for copy-number states 0 and 1 (as well as 2 of course) are very clear.

The actual CNV calling is then run when the model above has been fitted. Each sample is run with a hidden Markov model (HMM) with states corresponding to the copy-number states59. Within each state, the model above is used for the emission probability given the copy-number state. The transition probabilities have a parameter specifying the probability of transitioning to a different state. The Viterbi HMM algorithm is used to find the most probable path through the HMM for each sample and, where the Viterbi path diverges from the normal copy number (2 for autosomal chromosomes), a CNV is called. The algorithm is run twice, first with a stringent transition probability (1 × 10−6) to find longer variants without splitting them into a series of shorter ones, and then with a more relaxed transition probability (0.001) to allow finding shorter variants. CNVs with similar boundaries are grouped together and the HMM is used to estimate which CNVs correspond to the same event and what is the most likely boundary for each such group. The probability model can be used to calculate the posterior likelihood for each copy-number state for each CNV, which can be used in downstream phasing and imputation.

Note that informative structural variants not captured by our copy-number analysis may have been missed because additional rearrangement analysis was not performed.

Sensitivity of the coverage-based CNV detection

To assess our sensitivity for detecting CNVs in the presence of contamination, we simulated duplications by switching between a pair of uncontaminated fetal sample with trisomy 16 and maternal sample along the chromosome (Supplementary Fig. 13). To simulate deletions, we switched between a male fetal sample and maternal sample along chromosome X. Contamination was simulated by taking the weighted average of the maternal and the fetal corrected coverage value for simulated CNV segments. We considered the following combinations of CNV sizes (5 kb, 10 kb, 20 kb, 50 kb 100 kb, 200 kb and 500 kb) and contamination values (0, 10%, 20%, 30%, 40%, 50%, 60% and 70%). We determined that we found a simulated CNV if 50% or more of the windows within the CNV were called in deletion state for deletions and duplication state for duplications in the output from the HMM program. We found 89.4% (95% CI = 79.4–95.6%) of the simulated duplications and 96.3% (95% CI = 94.5–97.6%) of the simulated deletions of size 10 kb at 30% contamination; other combinations are shown in Supplementary Fig. 13.

Combining segments and phasing

The calls from the HMM algorithm in the fetal samples were in the form of a series of small segments. Similar to the Battenberg algorithm60, we joined these disjoint segments through phase-informative sequence variants. For the joining, we assessed the allele-specific depth in the fetal sample limited to variants where parents were homozygous for different alleles.

Letting CMM be the measured maternal coverage, CM the actual maternal coverage and CCM the coverage that is due to maternal contamination, then

$${C}_{{\rm{MM}}}={C}_{{\rm{M}}}+{C}_{{\rm{CM}}}$$

Letting CP be the paternal coverage, then:

$${C}_{{\rm{T}}}={C}_{{\rm{MM}}}+{C}_{{\rm{P}}}$$
$${C}_{{\rm{T}}}={C}_{{\rm{M}}}+{C}_{{\rm{CM}}}+{C}_{{\rm{P}}}$$

CT represents the total coverage of the phase-informative variants (limiting to homozygous variants in either parent) in the fetal sample.

Assuming a diploid fetal sample, that is,

$${C}_{{\rm{M}}}={C}_{{\rm{P}}}$$

Then, we can get an estimate of the global maternal contamination coefficient CGLOB

$${C}_{{\rm{MM}}}={C}_{{\rm{P}}}+{C}_{{\rm{CM}}}$$
$${C}_{{\rm{GLOB}}}={\widehat{C}}_{{\rm{CM}}}={C}_{zrmMM}-{C}_{{\rm{P}}}$$

For each segment, the maternal contribution was then corrected for CGLOB. If the maternal and paternal contributions deviated from the expected 50:50 ratio, within the predefined segments, those segments were assigned either a duplication or deletion status and assigned to the corresponding parent (phased). All calculations for the refinement of aneuploidies and CNVs were performed in R v.4.4.3, refined segments were combined into CNVs using Python v.3.9.21.

The pathogenicity of identified CNVs was estimated based on criteria from the ACMG61.

CNV breakend annotation with GraphTyper2

We discovered structural variants with Manta (v.1.6.0), merged the variant sites over all individuals with svimmer (git hash 2af9ccf1ac056b9516c25ec4a0121657a9f9ab8e), and finally genotyped them jointly across the entire set with GraphTyper2 (v.2.7.7). We considered all types of structural variants within 5 kb of CNV breakends. We searched for structural variant DNM candidates using the same criteria as described in the ‘DNM detection’ section with two exemptions: a more permissive allelic balance cut-off (10%) and a lower minimum number of reads supporting the DNM allele (6). If we found a DNM candidate among the structural variants of the CNV carrier within this region, then we set that as the position of the CNV breakpoint Supplementary Table 2. To search for balanced translocations present in the parents, we manually curated the translocation calls by GraphTyper with an allele count less than 10. We found exact breakpoints of the two balanced translocations that are present in parents (Supplementary Figs. 6 and 7).

Kinship and oversharing

We estimated kinship between fetal samples and parental samples using KING62 (v.2.1.5) on SNVs in subsampled 50 kb regions of the genome (1%). We considered only biallelic SNVs, with allele counts greater than four and an AAScore from GraphTyper of greater than 0.5.

We searched for indication of triploidy based on oversharing between the fetal sample and either of the parental samples, that is, a higher kinship coefficient than expected for parent–offspring relationship (k = 0.25). If kmat represents the kinship coefficient between the fetal sample and the maternal sample and kpat represents the kinship coefficient between the fetal sample and the paternal sample, we considered the following scenarios:

$${k}_{{\rm{pat}}} > 0.3\,{\rm{and}}\,{k}_{{\rm{mat}}} > \,0.2$$
$${k}_{{\rm{pat}}} > 0.2\,{\rm{and}}\,{k}_{{\rm{mat}}} > \,0.3$$

as potential triploidies. We then used the change in variance of allelic balance, as described in the next section, across the genome to further refine these potential triploidies.

Detecting crossovers between chromosomal states

We defined phase-informative variants as variants present in heterozygous state in one parent and homozygous in the other. In the following method, we restricted to phase-informative heterozygous variants in the parent transmitting the extra chromosome, where the other parent had a homozygous genotype.

If two sister chromatids from the transmitting parent were present at a given locus, we expected the AB of HS alleles at that locus to be either 0 or 2/3, depending on whether or not the sisters carried the HS allele (Fig. 2b). If both homologous chromosomes were present at a locus, we expected the AB to always be 1/3, considering the presence of an HS-allele in the transmitting parent. The state of homologous chromosomes has much lower variance than the state of sister chromatids, as there is no variability in sampling the homologous chromosomes. The sampling variance of the homologous chromosome state should be due to only sequencing variability, that is, reads supporting the alternative allele in the fetal sample.

We let Y be the observed coverage of the phase-informative variant in the fetal sample. T is a Bernoulli variable denoting the transmission of the phase-informative variant from the parent to the fetus. Then, by the law of total variance, we can partition the variance based on transmission T:

$$V[Y]=E[V[Y| T]]+{V}[E[Y| T]]$$

If we assume the read coverage in the fetus given transmission is determined with a Poisson variable:

$$Y| T=1 \sim {\rm{Poisson}}(\mu )$$

Then:

$$\begin{array}{c}E[V[Y| T]]={P}_{T}\times V[Y| T=1]+(1-{P}_{T})\times V[Y| T=0]\\ \,\,\,\,\,\,=\,{P}_{T}\times V[Y| T=1]+(1-{P}_{T})\times 0\\ \,\,\,\,\,\,=\,{P}_{T}\times \mu \end{array}$$

The other part of the sum will be:

$$\begin{array}{c}V[E[Y| T]]={P}_{T}\times {(\mu -{P}_{T}\cdot \mu )}^{2}+(1-{P}_{T})\times {(0-{P}_{T}\cdot \mu )}^{2}\\ \,\,\,\,\,=\,{P}_{T}{\mu }^{2}{(1-{P}_{T})}^{2}+{P}_{T}^{2}{\mu }^{2}(1-{P}_{T})\\ \,\,\,\,\,=\,{P}_{T}{\mu }^{2}-2\times {P}_{T}^{2}{\mu }^{2}+{P}_{T}^{3}{\mu }^{2}+{P}_{T}^{2}{\mu }^{2}(1-{P}_{T})\\ \,\,\,\,\,=\,{P}_{T}{\mu }^{2}-{P}_{T}^{2}{\mu }^{2}\\ \,\,\,\,\,=\,{\mu }^{2}{P}_{T}(1-{P}_{T})\end{array}$$

Taken together:

$$V[Y]={P}_{T}\cdot \mu +{\mu }^{2}{P}_{T}(1-{P}_{T})$$

If both homologous chromosomes are present (Pi = 1) at 30× coverage in the fetus (μ = 10) then we expect:

$$V[Y]=1\times 10+{10}^{2}\times 0=10$$

By contrast, in the sister state (Pr = 0.5) the variance is much higher:

$$V[Y]=0.5\times 20+{20}^{2}\times 0.25=110$$

We used this change in variance of the allelic balance along the genome to find crossovers between the homologous chromosome states and sister chromatid states. We binned the number of phase-informative variants in 1 kb windows, and calculated the average allelic balance. We excluded 1 kb windows that overlapped larger 100 kb windows with sequence variants with low average quality metrics (AAScore < 0.8 or Carrier_regression_beta < 0.7; see the ‘DNM detection’ section for details of AAScore and Carrier_regression_beta). We then used a changepoint technique on the 1 kb windows that passed the exclusion criteria to detect changes in variance in the 1 kb windows (cpt.var from the changepoint package in R). We used the following criteria:

  • A minimum segment length of 1000 1 kb windows (1 Mb)

  • A maximum of eight changepoints per chromosome

  • The PELT method for detecting multiple changepoints

  • Our own penalty function (penalty = “manual”) with 10 × log(n) (pen.value = “10log(n)”) where n is the number of windows per chromosome.

DNM detection

We searched sequence variants and extracted DNM candidates in a similar manner as before6,8 for 9,651 Icelandic trios by comparing the genotypes of the parents and offspring. The DNM candidates included both SNVs and indels (Supplementary Fig. 14). In brief, we defined a DNM candidate with permissive cut-offs for the genotype of the proband requiring an allele balance greater than 0.25 and depth of 12 reads at the position (supporting either the reference or alternative allele). For the genotypes of the parents, we required at least 12 reads, a maximum of one read supporting the alternative allele and the allelic balance to be less than 5%. Likely (NLIK) and possible (NPOSS) carriers of the DNM allele outside the descendants of the parent pair were defined as previously6. We restricted to DNM candidates with less than 50 likely carriers and either less than 10 possible carriers or a NLIK/NPOSS ratio greater than 80%. We tuned the DNM candidate filtering by using segregation of DNM candidates in three generation families (2,042 probands) and the following quality covariates in a generalized additive model with a logistic link:

  • AAScore: prediction probability from GraphTyper that the variant is a true positive.

  • Carrier_regression_beta: slope from the alternative allele depth regression for the sequence variant. To estimate the quality of the sequence variants, we regressed the alternative allele counts (AD) on the depth (DP) conditioned on the genotypes (GT). For a well-behaving sequence variant, the mean alternative allele count for a homozygous reference genotype should be 0; for a heterozygous genotype, it should be DP/2; and, for a homozygous alternative genotype, it should be DP. Under the assumption of no sequencing or genotyping error, the expected value of AD should be DP conditioned on the genotype; in other words, an identity line (slope 1 and intercept 0). Deviations from the identity line indicate that the sequence variant is spurious or somatic.

  • Carrier_regression_alpha: intercept from the alternative allele depth regression for the sequence variant.

  • Proband_het_AB: the allelic balance of the proband.

  • MaxAAS: the maximum read support for the sequence variant across all individuals.

  • Alignment_Alt_Reads: the number of reads supporting the alternative allele. This covariate and the following covariates were derived by identifying the reads in the BAM files supporting DNM allele.

  • Alignment_Alt_Unique_Positions: the unique number of starting positions for the reads supporting the alternative allele.

  • Alignment_Alt_Soft_clipped: the number of soft clipped bases (S in CIGAR string).

  • Alignment_Alt_Matched_bases: the number of matched bases (M in CIGAR string).

  • Alignment_Alt_Score_diff: the difference of the alignment score of the best and the second best hit as reported by BWA mem.

  • Alignment_Alt_Pair_sw_nm: the pairwise mismatches between reads supporting the alternative allele using Smith Waterman implementation in SeqAn63.

  • Alignment_Alt_Pair_align: the number of bases in the pairwise alignments.

With the following formula for the gam function from the mgcv R package64:

threegen_Consistent_hs~ I(cut(alignment_Alt_Unique_Positions,c(-1,2,4,8,10,Inf)))+ s(I(AAScore))+ s(Carrier_regression_beta)+ s(Carrier_regression_alpha)+ I(ifelse(alignment_Alt_Reads>0, (alignment_Alt_Score_diff/alignment_Alt_Reads)>10, FALSE))+ I(ifelse(alignment_Alt_Pair_align>0, (alignment_Alt_Pair_sw_nm/alignment_Alt_Pair_align)>0.05, TRUE))+ I(ifelse(alignment_Alt_Matched_bases>0, alignment_Alt_Soft_clipped/alignment_Alt_Matched_bases>0.5, TRUE))+ s(Proband_het_AB)+ s(ifelse(MaxAAS>15, 16, MaxAAS))+ I(NPOSS == 0)

We restricted to instances of the DNM candidates in cases in which we saw both of the proband’s haplotypes at a locus transmitted to the offspring of the proband (see figure 1c in ref. 6). In brief, if the DNM is a true germline variant, then the allele of the DNM candidate should be present in one of the proband’s offspring. On the other hand, if it is absent from the children, then this assay suggests that the DNM candidate is a false-positive DNM call (a more detailed description was reported previously6). Like before, we fitted the generalized additive model using the mgcv R package64.

We used the cut-off of 0.5 to call high-quality DNM candidates for the predicted probability of correct segregation in three generation families. To validate the false-positive detection rate of the DNMs, we also used the genotype consistency between pairs of monozygotic twins. We found that 3.8% of DNMs were unobserved in the monozygotic twin of the proband. Note that this approach overestimates the false-positive rate as there are instances of high frequency post-zygotic mutations that differ between pairs of monozygotic twins65.

Identifying DNMs in COPL trios

The AB of DNM candidates is a crucial quality-control metric for discerning between false positive DNMs (that is, somatic and genotyping artifacts) and DNMs that transmit to the next generation6. However, in the case of the triploidies, trisomies and the maternally contaminated samples, the AB is expected to be lower than 50% for genuine DNMs. This is not compatible with the generalized additive model applied on the deCODE trios that uses the AB; we therefore decided on a simpler threshold strategy to define high-quality DNMs in the COPL data. We defined a high-quality DNM in the COPL set as one that satisfies AB of proband > 0.25; NPOSS < 10 and AAScore > 0.5. To compare the subsets in COPL, we calculated the detection power to detect DNMs per fetal sample using the ratio of the phase-informative markers and a binomial distribution (Supplementary Fig. 9). We considered only samples with paternal fraction > 25% and excluded a single sample with more than 800 DNMs after quality filtering.

Phasing of mutations in Icelandic trios

Like before, we used two complementary approaches to phase DNMs6,8: one by tracking of the transmission of the DNMs to the offspring of the proband (142,069 DNMs; three-generation phasing) and another one using read tracing of the DNM allele to phase-informative alleles (246,877 DNMs; read phasing). We combined the phase of the DNMs from the different approaches, resulting in 333,365 phased DNMs; if the phase of DNMs differed between the approaches, they were set to unphased DNMs in the downstream analysis.

Phasing of mutations in COPL trios

We read-traced the DNM allele as with deCODE data, except for one crucial difference—we considered only reads linking the DNM allele to nearby phase-informative sequence variants, if the reads were linked to an allele private to one of the parents. This was done to avoid phase ambiguity due maternal contamination or extra chromosomes.

Annotation of pathogenic variants

Sequence variants were annotated on the basis of release 100 of the Variant Effect Predictor (VEP)66 using RefSeq gene annotations67. We included SNVs and small indels (<50 bp) at coding and splicing regions in the autosome. We required that the variants were supported by at least six sequence reads, with at least three reads supporting the alternative allele. For indels, we required an AB of at least 0.4. For heterozygous variants, we considered pLoF (stop-gained, frameshift indels, splice essential variants) and missense DNMs in known autosomal-dominant disease genes (human phenotype ontology code HP:0000006 for autosomal-dominant disease genes). For missense DNMs, we also required either that the gene was highly constrained for missense variants (95th percentile over all genes, corresponding to a gnomAD68 v.4.1.0 missense z score of over 3) or that the variant was nearby (within <10 bp) a previously reported pathogenic/likely pathogenic missense variant on ClinVar (with at least one ClinVar submission supporting the variant as pathogenic/likely pathogenic and no benign/likely benign submissions).

We also included homozygous and compound heterozygous pLoF variants in known autosomal recessive disease genes (based on the code HP:0000007). For homozygous variants, we required the position to be covered by at least eight sequence reads, and for the alternative allele to be supported by at least eight reads. For compound heterozygous variants, we required the positions to be covered by at least four reads, and each variant to be supported by at least two sequence reads. To detect compound heterozygous genotypes, we performed phasing of the variants to paternal and maternal alleles. We also included genotypes where we could not phase both variants, allowing for the possibility of a DNM on one allele. We used minor allele frequencies from 56,969 Icelandic genomes69 and 730,947 exomes and 76,215 genomes from gnomAD11 to restrict to variants with frequencies <2%, and considered only genotypes present in fewer than three genomes. All pathogenic genotypes detected in the trios affected by pregnancy loss were confirmed by manual inspection of BAM files.

The same approach was used for defining pathogenic genotypes in a set of adult control trios. For the purpose of pathogenic SSV analysis, individuals with neurodevelopmental disorders, including epilepsy, autism and intellectual disability, were removed from the set of 9,651 Icelandic trios. Those patients, who were whole-genome sequenced as part of ongoing projects at deCODE genetics and have been described previously70, were ascertained through the State Diagnostic Counselling Center and the Department of Child and Adolescent Psychiatry and diagnosed based on ICD-10 criteria. Individuals with a clinical diagnosis of a rare genetic disease, as described previously69, were also removed from downstream analyses. Removal of these patients from the set of Icelandic trios resulted in a set of 7,760 adult control trios.

Constrained genes

To investigate DNMs in genes that have not been linked to any postnatal phenotypes but that might be essential to early developmental processes, we looked at genes that are highly constrained for pLoF variants. We used constraint data from gnomAD68 v.4.1.0, ranking genes (n = 17,481 protein-coding genes) based on their pLI scores. We defined highly constrained genes as the genes in the 90th percentile of all ranked genes.

Age regression

To compare the accumulation of mutations with advancing parental age at conception between pregnancy loss and the Icelandic trios, we regressed the number of DNMs against the parental ages at conception with the following formula:

Nr_of_DNMs~ fathers_age+ mothers_age+ tissue_villi

Note that, we are using a shared intercept model. This is due the fact DNMs detected in the fetal samples are only partially phased through the read-tracing, which results in unidentifiable sex-specific intercepts. We restricted to the subset of the pregnancy loss cohort that is more comparable to the Icelandic set, in terms of quality, that is, we restricted to non-triploid samples with 95% power of detecting DNMs.

We ran separate regressions for the different mutational classes (C>A, C>G, C>T, CpG>TpG, Indel, T>A, T>C and T>G) along with the total number of DNMs per fetal sample/proband. The estimates of the age effects, villus contribution and intercepts are provided in Extended Data Fig. 7 and Supplementary Table 3. The CIs and P values are derived from a block jackknife procedure.

Mutational signatures

To estimate the contribution of COSMIC mutational signatures to the 96-class mutation spectra of the pregnancy loss samples, we used the Fit() function provided by the signature.tools.lib (v.2.4.4) package71 in R (https://github.com/Nik-Zainal-Group/signature.tools.lib). Note that the mutation burden of each pregnancy loss sample is modest, which limits power for signature detection. The purpose of this exercise was not to discover mutational signatures de novo but, rather, to specifically test the hypothesis that SBS18 was elevated in some villus samples as reported previously34. The mutation spectrum of spermatogonia is dominated by SBS1 and SBS5/40 (ref. 72) and this is mirrored in the mutational spectrum of germline variants (see figure 5 of ref. 73). With this in mind, we fitted only SBS1, SBS5 and SBS18 onto the mutational catalogues. Signatures were fitted using 100 bootstraps.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.