Introduction

Diverse reproductive systems occur across the plant kingdom and are exploited in crop breeding, including sexual crossing, self-pollination, and asexual clonal propagation. Crossing involves the deliberate combination of distinct varieties or even species to generate new genetic combinations. These hybrid offspring often exhibit heterosis, leading to enhanced yield and quality1. Self-pollination (selfing) involves pollination within the same plant or variety. This process promotes homozygosity and may aid in the elimination of recessive deleterious mutations through purging2,3. Nevertheless, selfing can lead to inbreeding depression, resulting in compromised growth potential, reduced yield, and diminished resistance to biotic and abiotic stresses4. The third reproductive type, clonal propagation, avoids the introduction of additional genetic material and offers the advantage of preserving favorable agronomic traits present in specific varieties5. Clonally propagated lines can, however, accrue somatic mutations. Some somatic mutations have produced new breeding resources6,7,8, such as bud sport varieties9, but somatic mutations can also be deleterious and negatively influence plant fitness10,11,12. These plant reproductive systems can shape the landscapes of genomic variation both within and among populations13, but more empirical work is necessary to characterize the magnitude of their effects13. Such investigations could facilitate advancements in crop genomic breeding14,15.

Grapevine (Vitis vinifera L.) is one of the most widely planted and economically important perennial fruit trees. There are two subspecies of grapevine: domesticated V. vinifera subsp. vinifera and its wild progenitor V. vinifera subsp. sylvestris16,17,18. The initial steps of V. vinifera domestication may date as far back as ~15,000 years ago, and it included a shift from a dioecious, obligately outcrossing mating system to monoecy, with the potential for self-pollination18,19. Despite this shift, the cultivation of domesticated grapes has primarily been based on hybridization and on clonal propagation of highly heterozygous genotypes, in part due to substantial inbreeding depression for selfed materials20. In some cases, clonal propagation has been prolonged; some modern cultivars possess genetic information identical to grape seeds from the medieval period21. Thus, the evolutionary history and breeding of domesticated grapes have been influenced by diverse reproductive systems, significantly shaping their genomic makeup20.

Pinot Noir (PN) is a premium red wine variety popular for its flavorful, aromatic wines. With a history spanning over a millennium, PN has served as a focus of research and as a foundational parent in grapevine breeding. In its over 900 years of clonal propagation21, numerous variants have been derived from PN through bud sports, including Pinot Grigio, Pinot Meunier, Pinot Gris, and Pinot Blanc22. The initial grapevine reference genome was determined from a highly homozygous descendant of PN called PN4002423. This accession originated from successive self-pollinations of Helfensteiner (HE)24, an offspring resulting from a cross between PN and Schiava Grossa (SG) performed in 1931 (www.vivc.de). Recently, a complete telomere-to-telomere (T2T) genome assembly of PN40024 has also been made available25.

In this study, we construct a haplotype-resolved chromosomal genome of PN and characterize the haplotypic diversity present within it, including significant structural variations and gene families unique to each haplotype. Additionally, we examine both somatic and fixed variants within the PN population to enhance our understanding of the cultivar’s development. Our analysis of various grapevine germplasm samples indicates that reproductive types have a substantial impact on genomic landscapes and grapevine breeding practices. These effects differ in terms of genome heterozygosity, as well as deleterious and structural burdens. We find that selfing significantly purges heterozygous deleterious SNPs (dSNPs) and structural variants in PN40024 (by 62% and 65%, respectively, compared to HE). The close linkage of large-effect deleterious and structural variations in repulsion phases maintains 4.3% of the genomic regions in a heterozygous state even after successive selfing. Our study explores the evolutionary genomics underlying the transitions of reproductive systems in forming grapevine lineages, offering insights into the genomic breeding of grapevine.

Results

Comparative genomics of clonal Pinot Noir and selfing PN40024

To construct the haplotype-resolved PN genome, we generated a total of 33 Gb (33,349,412,693 bp, ~65x coverage) of HiFi reads and 84 Gb (83,958,664,800 bp, ~160x coverage) of Hi-C reads (see “Methods”). Two haplotypes of PN were assembled: PN1 (496.43 Mb) and PN2 (489.78 Mb) (Fig. 1A), with contig N50 values of 23.60 Mb and 24.19 Mb, respectively. The BUSCO scores of PN1 (98.3%) and PN2 (98.4%) were comparable to that of the complete assembly of PN40024 (PN40024_T2T, 98.5%, Supplementary Fig. 1). Using a methodology similar to that applied to PN40024_T2T25, we identified all centromeres and most of the telomeres in both haplotype assemblies. Genome annotation yielded a similar number of genes for PN1 (37,038) and PN2 (37,350). We also identified repetitive sequences representing 66.64% and 66.21% of the PN1 and PN2 assemblies, respectively (Supplementary Table 1). Interestingly, we detected a large inversion of almost 4 Mb on chromosome 19 between PN1 and PN2 (Fig. 1A and Supplementary Fig. 2). Further analysis of long reads, complemented by PCR analyses, supported the inversion between the two haplotypes at the breakpoints (Supplementary Table 2, Supplementary Figs. 3–5).

Fig. 1: Haplotype-resolved assemblies of the clonal Pinot Noir genome.

A Collinearity between the reference genome PN40024_T2T and the two haplotypes of Pinot Noir: haplotype 1 (PN1) and haplotype 2 (PN2). Collinear regions are indicated by gray lines. B Presence and absence of gene families in the PN and PN40024_T2T genomes. The absence of a gene family in a genome is indicated by gray coloring. The red gradients represent gene abundance within each gene family when present in the genome. C Shared and unique gene families among each genome assembly. The numbers in parentheses indicate the total number of genes identified within each assembly.

To study the impacts of reproductive systems on genomic landscapes, we also conducted comparative genomics among PN1, PN2, and PN40024_T2T (Fig. 1B). K-mer analyses estimated the genome heterozygosity to be 1.43% in PN (Supplementary Fig. 1), which is higher than that of PN40024_T2T (0.18%)25. The three genomes were highly diverged at the sequence level, including in gene content. If no homologous genes were identified for a gene family within a genome, that gene family was considered absent from that genome. Using this approach, we identified 20,800 gene families shared among PN40024_T2T, PN1, and PN2. However, we also detected extensive variation in gene content: 2869, 2864, and 3026 gene families were exclusive to PN1, PN2, and PN1 + PN2, respectively, compared to PN40024_T2T, while smaller numbers (2243, 2214, and 581 gene families) were exclusive to PN40024_T2T compared to PN1, PN2, and PN1 + PN2, respectively (Fig. 1C). These observations could have two causes. First, it is possible that SG, the other parent of HE, had fewer gene families than the PN parent. Alternatively, selfing may have led to the loss of many gene families, potentially through a process that favors shorter haplotypes, as suggested for the rapid loss of genome size in selfed maize lineages3.
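The presence/absence comparison described above can be illustrated with simple set operations. The following is a minimal sketch, not the pipeline used in this study: the family identifiers and the function names `shared_families` and `exclusive_families` are hypothetical, and the input is assumed to be a precomputed mapping from each genome to the set of gene families with at least one homologous gene in it.

```python
from typing import Dict, Set


def shared_families(presence: Dict[str, Set[str]]) -> Set[str]:
    """Gene families present in every genome of the mapping."""
    genomes = list(presence.values())
    return set.intersection(*genomes) if genomes else set()


def exclusive_families(presence: Dict[str, Set[str]], focal: str, reference: str) -> Set[str]:
    """Gene families present in the focal genome but absent from the reference genome."""
    return presence[focal] - presence[reference]


# Toy data with made-up family identifiers (illustration only).
toy_presence = {
    "PN1": {"fam1", "fam2", "fam3"},
    "PN2": {"fam1", "fam2", "fam4"},
    "PN40024_T2T": {"fam1", "fam5"},
}

core = shared_families(toy_presence)
pn1_only = exclusive_families(toy_presence, "PN1", "PN40024_T2T")
```

A family counts as absent as soon as no homolog is found, so the result depends on how sensitive the upstream homology search is.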

Selection and introgression shaped the characteristics of Pinot Noir

To assess evolutionary processes contributing to the formation of distinguishing characteristics of PN, we gathered resequencing data from 38 V. vinifera samples (Supplementary Data 1 and 2), comprising 18 PN clones (nine generated in this study and nine from previous publications) and 20 previously published wild grape accessions (ten from Europe (EU) and ten from the Middle East and Caucasus region (ME)), together with three muscadine grapes used as an outgroup. The PN, EU, and ME samples constitute three groups with which to investigate introgression and selection signals in PN. After SNP and SV calling and filtering, we counted 4,687,377 SNPs and 18,469 SVs across the entire dataset. We applied the population branch statistic (PBS)26, with the ME and EU populations as controls, to identify genomic regions with exaggerated divergence in PN relative to the controls. Genome-wide selection signatures were observed throughout the PN clones, with especially strong peaks on chromosomes 1, 3, 4, 5, and 18 (Fig. 2A). Gene set enrichment analysis (GSEA)27 revealed an enrichment of organelle assembly, glycerolipid metabolic process, regulation of protein metabolic process, response to auxin, response to abiotic stimulus, and beta-glucan biosynthetic process, among others (Fig. 2B, Supplementary Data 3). Note that the PBS is likely identifying extended lengths of the branch leading to PN, and thus it is likely that the inferred selection events occurred prior to the diversification of PN clonal lineages.
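For readers unfamiliar with the PBS, the sketch below shows the commonly used formulation, in which pairwise FST values are converted to branch lengths T = -ln(1 - FST) and the focal branch is (T_focal,control1 + T_focal,control2 - T_control1,control2) / 2. This is an illustration under stated assumptions, not the code used in this study: the function names, the clamping of FST values, and the assumption that per-window FST has already been estimated are ours.

```python
import math


def branch_length(fst: float) -> float:
    """Convert pairwise FST to a divergence time-like branch length, T = -ln(1 - FST)."""
    # Clamp to [0, 1) so that negative estimates and FST = 1 do not break the log
    # (the clamping bounds are an assumption of this sketch).
    fst = min(max(fst, 0.0), 1.0 - 1e-12)
    return -math.log(1.0 - fst)


def pbs(fst_focal_c1: float, fst_focal_c2: float, fst_c1_c2: float) -> float:
    """Population branch statistic for the focal group given two control groups."""
    t_focal_c1 = branch_length(fst_focal_c1)
    t_focal_c2 = branch_length(fst_focal_c2)
    t_c1_c2 = branch_length(fst_c1_c2)
    return (t_focal_c1 + t_focal_c2 - t_c1_c2) / 2.0


# Example: a window where PN is strongly differentiated from both controls
# (ME and EU), while the controls are similar to each other.
example_pbs = pbs(fst_focal_c1=0.5, fst_focal_c2=0.5, fst_c1_c2=0.0)
```

Windows with large PBS values correspond to a long branch leading to the focal group, which is why they are interpreted as candidate targets of selection on that lineage.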

Fig. 2: Signatures of introgression and selection in the Pinot Noir genome.

A PBS analysis of the PN group, utilizing the ME and EU groups as controls. B Gene set enrichment analysis for genes ranked by PBS values. Five biological processes are shown. C fd analysis for the PN groups. D π and Dxy values for the region on chromosome 11 that contains the CONSTANS-like 5 (COL5) gene. The red vertical line indicates the location of the COL5 gene. Source data are provided as a Source Data file.

As the EU wild population has contributed significantly to the origin of modern wine grapes28, we explored how such introgression events shaped the genome of PN clones using the f-statistic fd29. Although the fd test is primarily used in sexual populations, it focuses on shared alleles rather than allele frequencies. Therefore, it could be an efficient way of detecting introgressed fragments in clonal PN. By using ME as the sister population, we detected strong introgression signals from EU in the PN genomes, particularly on chromosomes 1, 2, 3, and 19 (Fig. 2C). The genes in the top 1% of regions with the highest fd values were enriched in pathways related to plant physiological processes, such as photosynthesis and generation of precursor metabolites and energy (Supplementary Fig. 6, Supplementary Data 4). One locus, CONSTANS-like 5 (COL5), related to plant flowering (Supplementary Fig. 7)30, stood out as one of the top outliers in the introgression analysis. Both the nucleotide diversity (π) of PN clones and the sequence divergence (Dxy) between PN clones and EU were significantly lower at COL5 and surrounding regions (Fig. 2D). These observations strongly suggest that both selection and introgression shaped COL5 in PN clones, conferring local adaptation by adjusting flowering time to new climates as cultivars spread globally after domestication. We further validated the reliability of this introgression in PN by conducting a phylogenetic analysis (Supplementary Fig. 8, Supplementary Data 2). The results indicated that ME and EU were mainly grouped into two separate clades. Instead of forming an independent clade, some cultivars, including PN, clustered with the EU clade, suggesting an introgression from EU into certain wine grapevines in this region.
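As an illustration of how fd is typically computed for one genomic window, the sketch below follows the widely used ABBA-BABA-based definition: the observed excess of shared derived alleles between P2 and P3 is divided by the excess expected under complete introgression, where the donor frequency at each site is the larger of the P2 and P3 derived-allele frequencies. This is a sketch under assumptions, not this study's implementation: the assignment P1 = ME, P2 = PN, P3 = EU, O = muscadine is our reading of the text, and the function names and toy frequencies are hypothetical.

```python
import math
from typing import Sequence


def _abba_baba_excess(p1: Sequence[float], p2: Sequence[float],
                      p3: Sequence[float], p4: Sequence[float]) -> float:
    """Sum over sites of the ABBA minus BABA pattern frequencies."""
    total = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        abba = (1 - a) * b * c * (1 - d)
        baba = a * (1 - b) * c * (1 - d)
        total += abba - baba
    return total


def fd_statistic(p1: Sequence[float], p2: Sequence[float],
                 p3: Sequence[float], p4: Sequence[float]) -> float:
    """fd for one window, from per-site derived-allele frequencies in P1, P2, P3 and O."""
    numerator = _abba_baba_excess(p1, p2, p3, p4)
    # Donor population PD: whichever of P2 and P3 has the higher derived frequency per site.
    pd = [max(b, c) for b, c in zip(p2, p3)]
    denominator = _abba_baba_excess(p1, pd, pd, p4)
    return numerator / denominator if denominator != 0 else math.nan


# Toy window with two sites (made-up frequencies).
toy_fd = fd_statistic(p1=[0.0, 0.0], p2=[1.0, 0.5], p3=[1.0, 1.0], p4=[0.0, 0.0])
```

In practice fd is evaluated in sliding windows and is usually interpreted only where the genome-wide signal indicates excess sharing between P2 and P3; windows with a zero denominator carry no information and are returned as NaN here.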

Germline and somatic mutations in Pinot Noir clonal lineage

To investigate the impact of clonal propagation on PN, we analyzed genetic variants among the PN clones and their genetic differentiation from other grapevine populations. In total, 3,915,920 SNP variants and 17,035 SV variants were identified in grapes (PN, EU, and ME populations) in the absence of the outgroup data (using the inferred ancestral alleles, Fig. 3A). Among them, 70,291 SNPs and 605 SVs were unique to PN (specific variants), while 2,227,889 SNPs and 8635 SVs were shared with wild EU, ME, or both populations and are therefore most likely germline mutations (Fig. 3A). The frequency of both shared SNPs and specific SNPs in PN clones displayed a “double U-shaped” site frequency spectrum (SFS) (Fig. 3B). However, PN-specific SNPs showed an excess of rare variants (16.1% were observed only once in PN clones, compared to 11.1% of all PN SNPs) and of variants in a likely heterozygous state (69.2% were observed 18 times across the 18 PN clones, compared to 21.1% of all PN SNPs), suggesting that many of these PN-specific variants are somatic mutations in a heterozygous state (Fig. 3B).

Fig. 3: Somatic SNP and SV variants in clonal Pinot Noir.

A A Venn diagram depicting the overlap of SNPs and SVs among the EU, ME, and PN groups. The number of SVs is indicated in parentheses. B Frequency of allelic variants in clonal Pinot Noir. The red segments represent the 70,291 SNPs specific to PN, as identified in the Venn diagram, while the gray segments indicate the 2,227,889 SNPs shared between PN and wild grapes. C The distribution of PN-specific SNPs and SVs among individuals. The bar chart displayed inside shows the average heterozygosity of germline variants (SNPs, n = 49,963, and SVs, n = 429) in PN grapes. Data are presented as mean ± SD. D GO term enrichment analysis of genes associated with PN-specific SVs. E A comparison of the genomes of PN and PN40024, focusing on the SVs present in PN but absent in wild grapes (EU and ME), located ~4.7 Mb on chromosome 10. The arrows on the genomes indicate the direction and location of genes belonging to the S-locus family in these regions.

Given the duration of its cultivation, there has been ample opportunity for somatic mutations to occur in different clonal lineages of PN. To assess the number and type of these mutations, we divided the PN-specific SNPs into two groups: (i) 20,328 mutations that vary among PN clones (28.9%), which reflect the accumulation of somatic mutations, and (ii) mutations that were observed in all PN clones (71.1%), representing germline mutations that occurred during the formation of the PN cultivar prior to the diversification of the clonal lineages, or somatic mutations fixed during clonal propagation (Fig. 3C). We defined the first group as “somatic mutations” and the second as “fixed mutations (including both somatic and germline mutations)”. Previous studies identified somatic mutations across clones of other varieties31,32,33,34,35,36,37,38,39, but most of them failed to distinguish somatic and germline mutations. We applied control populations to detect cultivar-specific mutations, which greatly improved the precision of classifying somatic and germline mutations (Supplementary Fig. 9). Interestingly, almost all (98.6%) of these fixed mutations were maintained as heterozygous in the 18 PN clones. Additionally, 55.5% of the remaining putative somatic SNPs were unique to a single individual, suggesting the possible accumulation of somatic mutations in each clone (Fig. 3C). We randomly selected ten somatic mutation sites specific to one individual and examined the reads in IGV. The results showed that the mutation existed only in some reads of the individual predicted to have the mutation, and not in other individuals (Supplementary Figs. 10–14). The distribution of these somatic and fixed SNPs was then investigated in terms of gene structure. Somatic SNPs were found to occur more frequently in intergenic regions compared to fixed SNPs (Supplementary Fig. 15). A similar pattern has been observed across clones of the Zinfandel variety31. GO analysis showed that genes with fixed SNPs were enriched in biological processes such as cell cycle process, negative regulation of biological process, cytoskeleton organization, microtubule cytoskeleton organization, and mRNA splicing via spliceosome, among others (Supplementary Data 5).
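The two-group partition of PN-specific SNPs described above can be sketched as follows. This is a simplified illustration, not the study's pipeline: it assumes genotypes are already encoded as the number of variant alleles per clone (0, 1, or 2, with None for missing calls), that the sites have already been filtered to those absent from the wild control populations, and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def classify_pn_specific_sites(genotypes_by_site: Dict[str, List[Optional[int]]]) -> Dict[str, str]:
    """Label each site 'fixed' (variant carried by all called clones) or 'somatic' (varies among clones)."""
    labels = {}
    for site, genotypes in genotypes_by_site.items():
        called = [g for g in genotypes if g is not None]
        carriers = sum(1 for g in called if g > 0)
        if carriers == 0:
            continue  # variant allele not observed in any clone; nothing to classify
        labels[site] = "fixed" if carriers == len(called) else "somatic"
    return labels


def heterozygous_fraction_of_fixed(genotypes_by_site: Dict[str, List[Optional[int]]]) -> float:
    """Fraction of 'fixed' sites that are heterozygous in every called clone."""
    labels = classify_pn_specific_sites(genotypes_by_site)
    fixed = [s for s, label in labels.items() if label == "fixed"]
    if not fixed:
        return float("nan")
    all_het = sum(
        1 for s in fixed
        if all(g == 1 for g in genotypes_by_site[s] if g is not None)
    )
    return all_het / len(fixed)


# Toy example with three clones and made-up site names.
toy_sites = {
    "chr1:100": [1, 1, 1],   # carried by all clones, heterozygous everywhere
    "chr1:200": [0, 1, 0],   # private to one clone
    "chr2:50": [1, 2, 1],    # carried by all clones, one homozygous call
    "chr3:10": [0, 0, 0],    # variant allele never observed
}
toy_labels = classify_pn_specific_sites(toy_sites)
```

Applied to the 70,291 PN-specific SNPs, a partition of this kind corresponds to the 28.9% "somatic" and 71.1% "fixed" groups reported in the text, although the actual analysis also depends on variant-calling filters not reproduced here.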

Similar to SNPs, we also identified 605 SVs unique to PN by comparing SVs among the three populations (Fig. 3A). The SVs were categorized into four types: deletion (DEL), tandem duplication (DUP), inversion (INV), and inter-chromosomal translocation (BND). DEL was the most prevalent type, accounting for 59.7%, followed by BND and DUP, while INV was the least common, constituting only 1.82% (Supplementary Table 3). We found that 87.4% of the SVs contained TEs. As with SNPs, 70.9% of the 605 SVs unique to PN were shared by all 18 individuals, and the majority (91.9%) of these “fixed” SVs were present in a heterozygous state (Fig. 3C). Additionally, 176 variants were found in 1 to 17 individuals, with an average heterozygosity of 99.8%. For the fixed SVs that were retained in all 18 individuals, we annotated the overlapping genes and found that they were enriched in biological processes related to recognition of pollen, cell recognition, ketone biosynthetic process, quinone metabolic process, quinone biosynthetic process, and cellular ketone metabolic process (Fig. 3D, Supplementary Data 6). We further examined the twelve genes associated with the pollen recognition process and found that all of them belong to the S-locus family or related genes and are hence probably related to self-incompatibility40 (Supplementary Data 7). Nine of the twelve genes were clustered in one SV on chromosome 10 (~4.7 Mb), which was present in PN clones but absent in EU and ME sylvestris samples. The region containing this SV was highly heterozygous in all PN lineages (Supplementary Fig. 16). Comparative genomics between the PN haplotypes and PN40024_T2T genomes indicated that both duplication and insertion contributed to the formation of this SV on PN1 (Fig. 3E).

The impact of clonal propagation on grapevine genomes

To investigate the impact of various reproductive systems on grapevine breeding, we gathered more resequencing samples from the PN lineages with different reproductive modes for analysis, including five SG clones, four Gouais Blanc (GB) clones, ten Chardonnay (CD) clones, two Gamay Noir (GN) clones, two HE clones, and four PN40024 clones (Supplementary Data 2). The phylogenetic tree showed that the wild and domesticated grapes display reciprocal monophyly (Fig. 4A). Admixture analyses indicated that HE, GN, and CD carry admixture components from their parents (Fig. 4A). In contrast, the PN40024 clones were inferred to be a separate, non-admixed group, perhaps reflecting distinctness that evolved during selfing. Kinship analysis was conducted to verify the relationships between samples in each group and to confirm that these samples were true to type41. As expected, PN and GB were identified as parents of CD and GN, and PN and SG were identified as the parents of HE but not PN40024 (Fig. 4B).

Fig. 4: Genetic relationship among PN lineages.

A Phylogenetic tree with admixture analysis. B Pedigree relationships among populations, highlighting three mating histories. C Genetic burden of dSNPs shared within each group. The sample sizes for each group, from left to right, are as follows: 10, 10, 18, 5, 4, 10, 2, 2, and 4. All data are presented as mean ± SD for each group, with individual data points represented by dots. The two-sided Least Significant Difference (LSD) test was used for statistical analysis to compare each group; groups sharing the same letter are not significantly different (P < 0.05). D Genetic burden of SVs shared within each group. The sample sizes for each group, from left to right, are as follows: 10, 10, 18, 5, 4, 10, 2, 2, and 4. All data are presented as mean ± SD for each group, with dots indicating individual data points. The two-sided Least Significant Difference (LSD) test was used for statistical analysis to compare each group; groups with the same letter are not significantly different (P < 0.05). E Distribution of conserved SNPs with different genotype combinations among cultivars. The dashed black box highlights the variants that remain heterozygous across all grape populations.

We assessed genetic variation within each group to elucidate the impact of reproductive modes on genetic diversity. To begin with, we calculated the nucleotide diversity (π) and observed heterozygosity (HO) for each group. As expected, selfed PN40024 exhibited the lowest levels of genetic diversity (π = 0.0003) and heterozygosity (HO = 0.01) compared to all other groups. Among the groups, the wild grape ME group had the highest nucleotide diversity (π = 0.0039), while the clonal grapes had π values of ~0.002 (PN: 0.0024, GB: 0.0024, SG: 0.0019, CD: 0.0020, GN: 0.0028, and HE: 0.0029) (Supplementary Fig. 17), reflecting the effects of the domestication bottleneck. However, the average heterozygosity was slightly higher in clonal groups than in wild groups (EU: 0.20 and ME: 0.25; PN: 0.28, GB: 0.27, SG: 0.26, CD: 0.30, and GN: 0.28; Supplementary Fig. 18).

The reproductive systems directly affect the efficiency of recombination, which is associated with the purging and maintenance of genetic burden42,43. To detect such effects, we identified dSNPs using SIFT44. The recessive burden for each individual was measured by the number of homozygous dSNPs; the heterozygous burden was measured by the number of heterozygous dSNPs; while the additive burden was calculated as the number of heterozygous dSNPs plus two times the number of homozygous dSNPs45. First, we examined the dSNPs shared by all individuals within each grape group and found that the clonal groups (PN, SG, GB, CD, GN, and HE) exhibited significantly higher genetic burden compared to the outcrossing wild groups (EU and ME) (Fig. 4C). The selfing PN40024 had the highest recessive burden and lowest heterozygous burden. In contrast, the wild grape groups (EU and ME) showed a significantly higher genetic burden of dSNPs that were not shared by all individuals within each grape group (Supplementary Fig. 19). A similar pattern was also observed for SV among the different grape groups (Fig. 4D and Supplementary Fig. 20).
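The three burden measures defined above reduce to simple counts over an individual's dSNP genotypes. The sketch below is a minimal illustration, assuming each genotype has already been encoded as the number of deleterious alleles carried (0, 1, or 2); the function name and the toy genotypes are hypothetical.

```python
from typing import Dict, Iterable


def genetic_burden(deleterious_allele_counts: Iterable[int]) -> Dict[str, int]:
    """Recessive, heterozygous, and additive burden for one individual."""
    counts = list(deleterious_allele_counts)
    heterozygous = sum(1 for c in counts if c == 1)  # one deleterious copy
    homozygous = sum(1 for c in counts if c == 2)    # two deleterious copies
    return {
        "recessive": homozygous,
        "heterozygous": heterozygous,
        "additive": heterozygous + 2 * homozygous,
    }


# Toy individual with four dSNP sites: one non-carrier site, two heterozygous, one homozygous.
toy_burden = genetic_burden([0, 1, 1, 2])
```

Under this definition, selfing converts heterozygous sites into homozygous ones, which raises the recessive burden and lowers the heterozygous burden even if the additive burden changes little.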

Conserved heterozygous regions in grapevines

To understand the dynamics of alleles among cultivars during the breeding process, we assessed segregating sites that consistently occurred in each grape group. As shown in Fig. 4E, each column in the bar plot represents a specific genotype combination among the grape groups, with a total of 381 combinations observed. We selected 167 combinations that include more than 30 sites for subsequent analyses (Fig. 4E). The categories with the most sites were the SNPs that were heterozygous in clonally propagated PN, SG, or HE but homozygous in PN40024 due to selfing, and the SNPs that were only heterozygous in the GB sub-lineage (including GB, CD, and GN clones). A total of 97 SNPs remained heterozygous in all grape samples during domestication and diversification (wild and cultivated grapes including PN40024) and were dispersed throughout the genome. We identified 14 genes enriched in defense response and in the tricarboxylic acid cycle with SNPs located on the genes or within 2 kb upstream or downstream of the genes (Supplementary Data 8). Most of the homologs of these 14 genes were related to plant growth, and some of them are associated with recessive lethal phenotypes (Supplementary Data 9). In addition, after nine generations of selfing to generate PN40024, 99.8% of the genome was homozygous25, except for 4782 conserved heterozygous SNPs (hSNPs) (Fig. 4E). The 577 genes associated with these hSNPs were enriched in only one GO molecular function term, chitin binding (Supplementary Data 10).

To study the effects of recombination on genome architecture and genetic burden, we identified the specific markers (see “Methods”) between PN and SG, and inspected their distribution on HE and PN40024. As shown in Fig. 5A and Supplementary Fig. 21, most of the specific markers of PN and SG could be located on the HE genome in a syntenic region. Almost all these specific markers (98.1%) were in a heterozygous state. This result supported the fact that HE is the offspring of a PN by SG cross. After continuous selfing in PN40024, 91.2% of these specific markers became homozygous, while 8.8% of them were still heterozygous (Fig. 5A, Supplementary Fig. 22). Recombination between the PN and SG haplotypes was observed on PN40024 chromosomes. For example, on chromosome 1, the end of the long arm of the SG haplotype was replaced by PN-specific markers. Most detectable recombination events occurred at the ends of chromosomes (Supplementary Fig. 22).

Fig. 5: Heterozygous regions on the PN40024 genome and their genetic burden.

A Distribution of PN- or SG-specific markers on HE and PN40024 across chromosome 1 and chromosome 7. B Genotypes of conserved sites in each cultivar on chromosome 7. C Density of heterozygous dSNPs in HE and PN40024 on chromosome 2. D Frequency spectrum of GERP values in heterozygous blocks compared to regions outside of these blocks. E Distribution of dSNPs and dSNPs originating from PN or SG based on the specific markers within the heterozygous block on chromosome 2 of PN40024. The size of the circles represents GERP scores. F Genotypes of conserved sites in the SDR region. G The 67 heterozygous sites in PN40024 that do not exhibit both 0/0 and 1/1 genotypes simultaneously in other grape varieties. H Out of the 67 SNPs, 55 are identified as originating from PN. I Distribution of the 55 SNPs across two genes. J Comparison between PN40024 and Pinot Noir assemblies around the VviVI2 gene.

Close linkage of deleterious and structural variants in repulsion phases after successive selfing

Although almost the whole PN40024 genome is homozygous, some heterozygous variants (116,199 SNPs and 1125 SVs) remained in all four PN40024 clones, and most of them were clustered in blocks (Fig. 5B, Supplementary Figs. 22 and 23)25. We identified six large heterozygous blocks (see “Methods”) (Supplementary Table 4). To mitigate the influence of potential assembly errors and structural deviations from the reference genome, we assessed the read coverage of each sample within the heterozygous blocks relative to the read coverage on the whole genome (Supplementary Data 11). We observed that the average read depth in the heterozygous block of chromosome 16 (coverage: 12,992,557–13,599,104) is 2.13 times higher than that outside this block, likely due to erroneous assembly or structural disparities with the reference genome. Therefore, this block was excluded from subsequent analysis. The remaining five blocks account for ~4.3% of the genome and contain ~93% of the hSNPs and ~90% of the SVs in the entire genome.
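The coverage check used to exclude the chromosome 16 block amounts to comparing mean read depth inside a block with mean depth elsewhere. The sketch below illustrates the idea only: the function names are hypothetical, per-position or per-window depths are assumed to be available already, and the tolerance used for flagging is an assumption of this sketch rather than a threshold stated in the study.

```python
from statistics import mean
from typing import Sequence


def relative_depth(depth_in_block: Sequence[float], depth_outside: Sequence[float]) -> float:
    """Ratio of mean read depth inside a heterozygous block to mean depth outside it."""
    return mean(depth_in_block) / mean(depth_outside)


def is_suspect_block(ratio: float, tolerance: float = 0.5) -> bool:
    """Flag blocks whose relative depth deviates from 1 by more than the tolerance.

    The default tolerance of 0.5 is an illustrative assumption, not a value from the study.
    """
    return abs(ratio - 1.0) > tolerance


# Toy depths: the block is covered roughly twice as deeply as the rest of the genome,
# the pattern expected when collapsed duplicates inflate apparent heterozygosity.
toy_ratio = relative_depth([20, 22, 24], [10, 11, 9])
```

A ratio near 2, such as the 2.13 reported for the chromosome 16 block, is consistent with reads from two similar copies mapping to a single reference location, which can create spurious heterozygous calls.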

To find out why PN40024 retained heterozygous blocks, we compared the distribution pattern of variants and found that the distribution pattern of SVs and dSNPs in heterozygous blocks of PN40024 was inherited from HE (Fig. 5B, C, Supplementary Figs. 24–29). In addition, the heterozygous blocks overlapped with regions of low recombination (Fig. 5A, Supplementary Figs. 21 and 22). Based on the heterozygous blocks in PN40024, we divided the genome into two parts: the heterozygous block regions and the genomic regions outside of them. We found that, in HE, more hSNPs were located on genes and their flanking regions (2 kb upstream and downstream) in the heterozygous block regions than in regions outside of them (Supplementary Fig. 30). Previous work has shown that heterozygous blocks in selfing lineages are caused by dSNPs in repulsion in low recombination regions3,46. We assessed this phenomenon in PN40024 by examining GERP values for dSNPs. Remarkably, dSNPs in heterozygous regions exhibited a significant rightward shift in predicted deleterious effects relative to dSNPs outside of heterozygous regions, suggesting the presence of numerous large-effect deleterious mutations within heterozygous blocks (P < 0.001, Wilcoxon rank sum test, Fig. 5D). We then identified dSNPs in PN40024 that were inherited from both PN and SG in the heterozygous blocks, which suggested that dSNPs are probably located on both haplotypes of these blocks (Fig. 5E and Supplementary Fig. 31). We therefore presume that, owing to the low recombination rate in these heterozygous regions and the presence of large-effect dSNPs in repulsion phases, the genome could not purge these variants, leading to the retention of heterozygosity in these blocks.

As an example, the ~200 kb SDR was located in the heterozygous block on chromosome 2. There are four SDR haplotypes: the male-like haplotype (M), the female-like haplotype (F), and two hermaphrodite haplotypes (H1 and H2), with dominance effects M > H > F. H1 and H2 originated from the recombination of F and M to become hermaphroditic47,48. We found that PN, GN, HE, and PN40024 have the H2/F genotype, SG and GB have the H1/F genotype, and CD has the H1/H2 genotype, which indicated that HE inherited the H2 haplotype from PN and the F haplotype from SG. Interestingly, after nine generations of selfing, PN40024 maintained the heterozygous genotype (H2/F) at the SDR locus, enabling it to remain hermaphroditic (Fig. 5F). An F/F genotype could not have been maintained by selfing in the PN40024 lineage, because the flowers would be female. However, why did the H2 haplotype not become homozygous? One possibility is the presence of strongly recessive deleterious (even recessive lethal) SNPs on the H2 haplotype. To investigate this, we screened for sites that are heterozygous (0/1 genotype) in PN40024 and that do not occur in a homozygous state for both the reference allele (0/0 genotype) and the alternative allele (1/1 genotype) among other grape accessions. In total, 67 SNPs were identified. To focus on dSNPs located on the H2 haplotype inherited from PN, we excluded SNPs inherited from SG, resulting in 55 SNPs found in both the PN and PN40024 accessions (Fig. 5G, H). All these SNPs are located on two genes (VviVI2 and VviGSM2) and their flanking regions. Six of the SNPs had a GERP score reaching the maximum value of four, indicating a strong detrimental effect on grapes (Fig. 5I). Five of the six SNPs were in the CDS region of VviVI2, along with 24 indels and an SV (87 bp) (Fig. 5J). Knocking out the AtVI2 gene in Arabidopsis resulted in a severe reduction in pollen germination (Supplementary Fig. 32)49,50,51,52. Therefore, the VviVI2 gene, associated with recessive lethality and located on the H2 haplotype, likely maintained the SDR region in a heterozygous state on chromosome 2.
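The screening step described above, which keeps sites that are heterozygous in PN40024 and for which the other accessions never show both homozygous classes, can be sketched as a simple filter. This is an illustration under assumptions rather than the study's code: genotypes are assumed to be VCF-style strings for biallelic sites, and the function names and toy sites are hypothetical.

```python
from typing import Dict, List


def normalize_genotype(genotype: str) -> str:
    """Treat phased and unphased calls alike and ignore allele order (e.g. '1|0' -> '0/1')."""
    alleles = sorted(genotype.replace("|", "/").split("/"))
    return "/".join(alleles)


def screen_candidate_sites(focal_genotypes: Dict[str, str],
                           other_genotypes: Dict[str, List[str]]) -> List[str]:
    """Sites heterozygous in the focal sample where other accessions lack 0/0, 1/1, or both."""
    candidates = []
    for site, genotype in focal_genotypes.items():
        if normalize_genotype(genotype) != "0/1":
            continue
        observed = {normalize_genotype(g) for g in other_genotypes.get(site, [])}
        # A site is dropped only when both homozygous classes are seen elsewhere,
        # since that shows each allele can be tolerated in the homozygous state.
        if "0/0" in observed and "1/1" in observed:
            continue
        candidates.append(site)
    return candidates


# Toy example with made-up coordinates.
toy_focal = {"chr2:1000": "0/1", "chr2:2000": "0/1", "chr2:3000": "1/1"}
toy_others = {
    "chr2:1000": ["0/0", "0/1", "0/1"],   # 1/1 never observed: kept
    "chr2:2000": ["0/0", "1/1", "0/1"],   # both homozygotes observed: dropped
    "chr2:3000": ["0/1", "0/1"],          # focal sample not heterozygous: dropped
}
toy_candidates = screen_candidate_sites(toy_focal, toy_others)
```

The absence of one homozygous class across a limited panel is suggestive rather than conclusive evidence of recessive lethality, which is why the text follows this screen with GERP scores and functional evidence for VviVI2.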

Discussion

In this study, we integrated comparative genomic and population genetic analyses to investigate the impacts of different reproductive systems on grapevine genomes and breeding. We collected grapevine lineages that represent different reproductive systems, from the wild ancestors of domesticated grapes to a diverse collection of accessions related to PN, including clonal variants, hybrid materials, and a selfed lineage. The results revealed the effects of different reproductive systems on population genetic characteristics, including heterozygosity, the SFS, and the structural and deleterious burdens. Clonal propagation dramatically masked the harmful effects of recessive deleterious variants in heterozygous states, while selfing purged most of the deleterious and structural burdens. These discoveries underscore the genetic basis of reproductive systems, enhance our understanding of complex evolutionary genomic processes, and provide a theoretical guideline for future grapevine genomic breeding by combining beneficial variations of agronomic and resistance traits while purging deleterious variations.

PN has long served as a pivotal genetic resource for breeding. We assembled a haplotype-resolved chromosomal genome of PN to capture the haplotypic diversity within this genome, including a ~4 Mb inversion between haplotypes PN1 and PN2 (Fig. 1A and Supplementary Fig. 2). We identified 1794 and 1819 gene families that were exclusive to PN1 and PN2 (Fig. 1B, C), respectively, suggesting that >10% of genes were in a hemizygous state in PN, which is also observed in other grapevine cultivars36,45,53,54,55.

We conducted further investigations into somatic and fixed variants within the PN population to understand the origin of this cultivar. After domesticated grapevines were introduced to Europe, they exchanged genes with European wild grapes28. Signals of similar introgression events are evident in the PN genome and likely shaped agronomic traits. For instance, the flowering-related locus CONSTANS-like 5 (COL5) in the PN genome shows evidence of both positive selection and introgression from the EU wild grape population into PN or its ancestral lineages (Fig. 2D). Another gene of the same family, CONSTANS-LIKE 9, has previously been reported to be under both introgression and positive selection in wine grapes in the Iberian Peninsula10,56. In addition, SVs could play an important role in PN, as has also been reported in other studies35,45. For example, an SV on chromosome 10 harbored a gene cluster of the S-locus family (Fig. 3E), which is probably related to the switch to self-fertility in domesticated grapes.

Reproductive systems have varying impacts on rates of linkage and recombination, the strength of genetic drift, and the efficacy of selection in plants, all of which could have pronounced effects on genomic variation42,43. To uncover the impact of clonal propagation in grapevine, we investigated both somatic mutations and fixed mutations in PN using comparative population genetics. Almost all fixed mutations in PN were heterozygous (98.6% for SNPs and 91.9% for SVs, Fig. 3C). Clonal propagation is widely applied in grapevine production and has the great advantage of ensuring the stability of vigorous phenotypes and genotypes5,57. In this context, our results suggest that breeders could rapidly select favorable mutations, once identified, with the aid of clonal propagation28. The highly heterozygous fixed mutations highlight the advantageous role of clonal reproduction in the fixation of vigorous phenotypes and genotypes during the cultivation of grapes like PN.

On the other hand, clonal propagation inhibits the recombination of variants and leads to the accumulation of somatic mutations5,11. However, its effect on fitness is variable and depends on somatic mutation rates and patterns, which can differ among plants or even among different organs of a single individual58,59. In this study, we identified 20,328 SNPs and 176 SVs that were specific to the PN group and that varied among PN clones. We deemed this group of mutations to be somatic mutations, since they represent new mutations that have arisen since the formation of the PN varietal. We recognize, however, that some of these mutations may be chimeric (i.e., fixed throughout the tissues of the plant), but our sampling strategy is unable to detect such chimerisms. Additionally, somatic variants could be misidentified because of errors in SNP calling. We reduced false positives by using up-to-date sequencing and bioinformatic methods, including the complete T2T reference genome PN40024_T2T for read mapping, high-coverage whole-genome sequencing data, and stringent criteria to remove low-quality variants. Therefore, such errors are likely limited and are unlikely to have significant impacts on the results of the population genetic analyses in this study. We found that most somatic mutations were exclusive to one individual and tended to be located in intergenic regions (Fig. 3C, Supplementary Fig. 15), suggesting that somatic variants tend to be rare, which is consistent with observations in the cultivar Zinfandel and in maize3,31. In theory, deleterious somatic mutations can decrease the fitness of clonal plants10,11,12. However, previous studies have used forward simulations to show that clonal propagation does not decrease fitness under a recessive selection model20,60. In addition to deleterious mutations, somatic mutation can give rise to beneficial variants that contribute to new traits36,38,39 and can even lead to the development of new cultivars, for example, the three bud sport varieties (Pinot Grigio, Pinot Meunier, and Pinot Blanc) derived from PN22.

Selfing, predominantly adopted by 10–15% of flowering plants, is another common strategy for plant reproduction61 and is applied in breeding because of its many advantages, including the fixation of desirable variants and protection from the potential performance downsides of hybridization62. However, there is a long-running debate as to whether selfing is a “dead end”61,63,64. Inbreeding depression is widely observed in the plant kingdom64,65, including in grapevines20,45,66. One reason for this phenomenon could be the uncovering of recessive deleterious mutations, which has been observed in other selfed crops, such as potato and maize3,67,68. In this study, we observed a significant decrease in genetic diversity in the PN40024 group compared with the clonally propagated and crossed groups (PN, SG, HE, GB, CD, GN, and the wild grapes). Successive selfing dramatically purged heterozygous dSNPs (62% compared to HE) and SVs (65% compared to HE) in PN40024, but 32% and 31% of the heterozygous burdens (dSNPs and SVs, respectively) were shifted to a homozygous state in PN40024 (Fig. 4C, D, Supplementary Figs. 19 and 20).

After nine generations of selfing, 116,199 SNPs and 1,125 SVs persisted in a heterozygous state in all four PN40024 clones (Fig. 5B, Supplementary Figs. 22, 23, 25, and 28)25. A similar phenomenon, higher than expected heterozygosity, has also been reported for genomic regions of other selfed lineages, such as maize, potato, Brachypodium, and Arabidopsis3,67,69,70,71. The hSNPs were found predominantly in six heterozygous blocks in PN40024 (Supplementary Table 4). Some of these heterozygous sites could reflect read misalignment69,71. We examined read coverage in the heterozygous blocks and found that one such block on chromosome 16 could be false (Supplementary Data 11). To explore the genomic origins of these retained variants in the selfed offspring, we compared them to the HE, PN, and SG populations. We found that, in HE, a higher proportion of heterozygous variants were anchored to genes within the regions corresponding to the heterozygous blocks in PN40024 than outside these regions (Supplementary Fig. 30). The genes that are heterozygous in PN40024 were involved in chitin binding (Supplementary Data 10). In addition, recombination events were rare in these heterozygous block regions in PN40024 (Fig. 5A, Supplementary Figs. 21 and 22), indicating that low recombination rates prevent the genome from becoming homozygous during selfing. Similar results were also found in maize and potato3,68. For example, Roessler et al. observed a significantly higher proportion of nonsynonymous SNPs in heterozygous blocks compared to homozygous blocks, and noted low recombination rates in these regions. Similarly, Zhang et al. showed that a highly heterozygous region of selfed potatoes contained two large-effect deleterious mutations (led1 and yl1) in repulsion. Interestingly, the sex-determination region remains highly heterozygous in hermaphroditic domesticated grapes including PN40024 (Fig. 5F)45,47,72,73. One reason, of course, is that plants with the F/F genotype are no longer hermaphroditic47,72, and there are (to our knowledge) very few cases of homozygous H2/H2 genotypes. One potential reason for the latter is the presence of homozygous lethal alleles. We provide some evidence and speculate that one candidate gene, VviVI2, on the H2 haplotype of the SDR contains a recessive lethal mutation that prevents the propagation of the homozygous H2/H2 genotype (Fig. 5F, I, and J). Thus, deleterious variants located in low-recombining regions, where Hill-Robertson interference decreases the efficacy of purging deleterious variants in repulsion phase, are a primary driver of the retained heterozygous blocks in PN40024 and other selfed plant lineages65,67.
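To put "higher than expected heterozygosity" in context, the neutral expectation can be worked out directly. The following is a minimal sketch that assumes the textbook model in which each generation of selfing halves heterozygosity at an unlinked neutral locus; the function name is illustrative and is not part of the analysis pipeline of this study.

```python
def expected_het_fraction(generations: int) -> float:
    """Fraction of initially heterozygous neutral loci expected to remain
    heterozygous after a number of selfing generations.

    Assumption: each selfing generation halves heterozygosity, with no
    selection and no linkage effects (a textbook simplification).
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 0.5 ** generations


if __name__ == "__main__":
    # Nine generations of selfing, as reported for PN40024.
    print(f"{expected_het_fraction(9):.4%}")
```

Under this simplified model, only about 0.2% of the initially heterozygous sites would be expected to remain heterozygous after nine generations, so sites retained well above that level are candidates for selection or low recombination.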

One goal of breeders is to combine known beneficial alleles while purging deleterious alleles68,74,75,76,77. Phased genomes, such as PN40024_T2T25 and the PN genome introduced here, along with several additional grapevine varieties53,54,55,78,79,80 and wild Vitis genomes81,82,83,84, can help to comprehensively identify variants in grapevine and to develop efficient molecular markers for grapevine improvement. This study also provides a new understanding of how different reproductive systems can be applied in breeding. Outcrossing is a good strategy for combining and introducing new traits in domesticated grapes, and subsequent clonal reproduction can retain these preferred alleles for production. However, clonal reproduction leads to the accumulation of deleterious variants, which is not suitable for future breeding. Inbreeding depression is one of the main obstacles to fixing these preferred alleles through selfing, because selfing exposes the heterozygous deleterious variants accumulated during clonal propagation20,75. In this study, we revealed that deleterious mutations in repulsion phase are the cause of several heterozygous blocks in the PN40024 genome. Although this phenomenon has only been observed in one grapevine progeny, PN40024, similar occurrences have been reported in crops such as potato and maize3,67. Breaking the linkage of these deleterious mutations in repulsion phase would be a practical method, as has been done in potatoes68. In addition, the influence of structural variations on grapevine phenotypes highlights the necessity of integrating SVs into breeding programs55,85.

Our study illuminates the diverse impacts of different breeding methodologies, including cloning, crossing, and selfing, on the genetic heterozygosity, beneficial variants, and dSNP and SV burdens in the grapevine genome. Armed with this knowledge, breeders can make informed decisions regarding the selection of breeding methods and combinations to strategically pursue their specific breeding objectives.

Methods

Plant materials and sample collection

Fresh and healthy leaf tissue from plants of Vitis vinifera cultivar “PN” clone AGIS_01 was collected from the grapevine germplasm collection at the Agricultural Genomics Institute at Shenzhen, Shenzhen, Guangdong Province, China, and immediately frozen in liquid nitrogen. Separate samples were packaged for PacBio HiFi and Hi-C sequencing and subsequently submitted to the sequencing company for genomic DNA extraction and library preparation.

Library preparation and DNA sequencing

High-molecular-weight genomic DNA was isolated using the DNeasy Plant Mini kit according to the manufacturer’s instructions. For PacBio HiFi sequencing, single-molecule real-time cells were sequenced on the PacBio Sequel II platform, and HiFi reads were generated using Circular Consensus Sequencing (https://github.com/PacificBiosciences/ccs) with default parameters. For the Hi-C library, samples were digested with the restriction enzyme DpnII, and libraries were constructed following a standard Hi-C protocol as described previously. These Hi-C libraries were sequenced on the Illumina HiSeq X Ten platform.

De novo haplotype-resolved genome assembly and quality assessment

Using the Hi-C Integrated Assembly mode of HiFiasm, we first generated two contig-level haplotype genomes using the HiFi and Hi-C data of PN86. Genomic heterozygosity was assessed by GenomeScope (v2.0)87, employing a k-mer-based methodology applied to raw HiFi reads. Then, we used RagTag to determine the approximate order of contigs on chromosomes by using PN40024_T2T as a reference genome25. Subsequently, the Hi-C sequencing data were harnessed to anchor all contigs via Juicer (v1.5)88. This was succeeded by employing a 3D-DNA scaffolding pipeline to further refine the structure89. Manual adjustments were conducted on the acquired outcomes using Juicebox (v1.11.08, https://github.com/aidenlab/Juicebox), followed by a secondary application of the 3D-DNA approach to achieve the genome at the scaffold level. Employing Minimap2 (v2.24)90, a comprehensive comparison was conducted between scaffold-level genomes from different Vitis species and the raw HiFi data. The outcomes were then imported into IGV (v2.12.3)91 to pinpoint the precise sequence positions of gaps. To validate the accuracy of gap filling, the identified sequences were reintegrated into the genome utilizing Minimap290.

For genome quality assessment, we used QUAST to compute basic statistics for the contig-level genome and the final genome (https://github.com/ablab/quast). Genome completeness was evaluated by BUSCO using the embryophyta_odb10 database92. Genome continuity was evaluated by calculating the contig N50 length. For genome accuracy, we mapped the HiFi reads to the genome using Minimap2 and calculated mapping rates90.

Annotation of genes and transposable elements

We primarily used the following genome-wide annotation pipeline for genome annotation (https://github.com/unavailable-2374/Genome-Wide-Annotation-Pipeline). RNA-seq datasets were collected covering a variety of tissues, including flowers, leaves, and other tissues. RNA-seq reads were aligned to the repeat-masked assemblies using Hisat2 (v2.10.2) and subsequently assembled into transcripts using StringTie (v1.3.0)93,94. Genes were first predicted using transcripts and UniProt (https://www.uniprot.org/) proteins as evidence, with Exonerate, Genewise, and Transfrag. In short, initial gene models were created, and further predictions were performed using AUGUSTUS (v3.4.0)95. Genes involving duplicated regions, genes with CDS regions shorter than 90 nucleotides, and genes without any supporting evidence were filtered out. Finally, all results were checked against hidden Markov models downloaded from the Pfam database to obtain the final gene models.

RepeatModeler (open-2.0.3) was used to build the TE library with the -LTRStruct option96. Genome-wide TE annotation was performed using RepeatMasker (open-4.1.2, https://github.com/rmhubley/RepeatMasker) with the options -e rmblast -lcambig in slow mode.

Identification of telomeres and centromeres

For telomere identification, plant telomeric sequences (CCCATTT at the 5’ end and TTTAGGG at the 3’ end) were searched using TIDK (v0.2.0)97. In total, 70 of the expected 76 telomeres were identified (38 chromosomes across the two haplotypes; 35 of 38 telomeres for each haplotype genome).

TRF (v4.09) was used for tandem repeat annotation, and the annotation results were merged using TRF2GFF98. The tandem repeat annotation was then visualized in IGV (v2.12.3)91 and compared with the TE annotation to identify the centromeres.

Genome comparison between PN1, PN2 and PN40024_T2T

We aligned the PN1, PN2, PN40024_T2T genomes using Minimap2, and indexed the alignment BAM file using SAMtools (v1.4)99. Next, SyRI100 was used to find structural variants between the genomes, and the results were visualized using plotsr101.

MUMmer (v.4.0) was used to compare the genome with the reference genome PN40024_T2T using whole-genome alignments102. First, we aligned the two genome sequences using Nucmer and then filtered one-to-one alignments with a minimum alignment length of 10,000 bp (delta-filter -i 95 -l 10000).

To identify gene families in PN40024_T2T, PN1, and PN2, Orthofinder (v2.5.2)103 was used to compare their protein-coding gene sequences, following a previous study104. Briefly, genes that were similar to each other and distinct from genes in other groups were clustered together based on sequence similarity. If no gene was identified in a gene family for a specific genome, the gene family was counted as absent in that genome. The gene abundance of a gene family in a genome was calculated by dividing the number of genes of this family in that genome by the maximum number of genes of this family among the three genomes.
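The abundance normalisation described above can be illustrated with a short sketch. The function names and the toy counts are hypothetical; only the rule itself (count divided by the maximum count among the three genomes, and a family with zero genes counted as absent) is taken from the text.

```python
from typing import Dict, List


def gene_family_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalise per-genome gene counts of one gene family by the maximum
    count observed among the compared genomes."""
    max_count = max(counts.values())
    if max_count == 0:
        raise ValueError("gene family has no genes in any genome")
    return {genome: n / max_count for genome, n in counts.items()}


def absent_genomes(counts: Dict[str, int]) -> List[str]:
    """Genomes in which the gene family is counted as absent (zero genes)."""
    return [genome for genome, n in counts.items() if n == 0]


if __name__ == "__main__":
    # Toy counts for one gene family (illustrative values, not study data).
    family = {"PN40024_T2T": 4, "PN1": 2, "PN2": 0}
    print(gene_family_abundance(family))
    print(absent_genomes(family))
```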

Comparing SVs on the Pinot Noir genomes

To identify SVs within the PN genome (between the two haplotypes), we called SVs using the Sniffles pipeline105. First, PacBio reads longer than 500 bp were mapped onto the PN1 and PN2 genomes using the aligner Minimap290. Variant calling was then performed with Sniffles. The SV outputs (VCF files) were filtered by VCFtools (v0.1.16)106 to find the heterozygous SVs between the two haplotypes in PN. To further validate the inversion between the two haplotypes, we designed primers at the breakpoints on both haplotypes, followed by PCR and gel electrophoresis analyses (Supplementary Table 2 and Supplementary Fig. 4).

SNP calling and filtering

We used 68 grapevine resequencing samples, including ten wild grapes (V. vinifera subsp. sylvestris) from Europe (EU), ten wild grapes from the Middle East (ME), 18 PN, five SG, four GB, ten CD, two GN, two HE, and four PN40024, along with three muscadine grapes used as an outgroup. Among these samples, nine of the clonal PN accessions, collected from Foundation Plant Services, University of California, Davis, were sequenced in this study, while the remaining data were downloaded from public databases (Supplementary Table 4).

Raw reads were quality-controlled using fastp (v0.21) with default parameters107. The PN40024_T2T assembly (PRJNA882193 in NCBI)25 served as the reference genome. Quality-controlled reads were mapped to the reference genome using bwa (v0.7.15) with default parameters108. SAMtools (v1.4) and GATK (v4.1.8) were used to sort and index the BAM files and to remove duplicates109,110. GTX, which is based on the HaplotypeCaller of GATK, was used for SNP calling across all samples.

To reduce false positives, filtering was conducted using VCFtools (v0.1.16)106. We removed genotypes with a genotype quality <20 (option --minGQ 20). SNPs with more than two alleles were excluded (options --min-alleles 2 --max-alleles 2), as were SNPs with more than 20% missing genotypes (option --max-missing 0.8).
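The logic of these three filters can be sketched in plain Python on a toy genotype representation. This is an illustration of the criteria only, not a reimplementation of VCFtools: the data layout and function names are assumptions, and the order in which VCFtools itself applies genotype-level and site-level filters is not asserted here.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[Tuple[int, int]]  # None marks a missing genotype


def mask_low_quality(genotypes: Sequence[Genotype],
                     gqs: Sequence[Optional[int]],
                     min_gq: int = 20) -> List[Genotype]:
    """Set genotypes with GQ below min_gq (or without a GQ value) to missing."""
    return [gt if gt is not None and gq is not None and gq >= min_gq else None
            for gt, gq in zip(genotypes, gqs)]


def keep_site(n_alleles: int,
              genotypes: Sequence[Genotype],
              min_call_rate: float = 0.8) -> bool:
    """Keep biallelic sites with at least min_call_rate called genotypes,
    i.e. no more than 20% missing genotypes at the default setting."""
    if n_alleles != 2 or not genotypes:
        return False
    called = sum(gt is not None for gt in genotypes)
    return called / len(genotypes) >= min_call_rate


if __name__ == "__main__":
    # One toy site with five samples; the third genotype has GQ < 20.
    gts = [(0, 1), (0, 0), (1, 1), (0, 1), (0, 0)]
    masked = mask_low_quality(gts, [30, 25, 10, 40, 50])
    print(masked, keep_site(2, masked))
```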

The filtered SNPs were phased using the genotype imputation method of Beagle (v5.4)111. The allele states were then reversed where needed using Model 1 of the superSFS script (https://github.com/xhchauvet/superSFS) with parameter 3 to infer ancestral alleles according to the outgroup species.

Phylogeny and population structure

For the analysis related to introgression in PN, a phylogenetic tree was constructed by IQ-TREE (v2.1.4)112 based on the general time reversible (GTR) model using variants from three outgroup samples, 20 wild grapes, and 42 wine cultivars in the 20 kb region where COL5 is located. For the analysis related to the impact of clonal propagation on grapevine genomes, the whole-genome SNPs of the 68 previously mentioned grapevines were first thinned using VCFtools (--thin 1000). A phylogenetic tree was then constructed using IQ-TREE based on the GTR model. To construct the phylogenetic trees of the COL5 and VI2 genes among different species, the DNA sequences of these genes were analyzed using MEGA X113 with the neighbor-joining method and p-distance.

The population structure based on nuclear SNPs was estimated using Admixture (v1.3.0) with K varying from 2 to 10114. Kinship and IBS0 analyses were performed using King41. The relationships between samples were assessed according to the following criteria, where K denotes the kinship coefficient: identical clones (K ≥ 0.49 and IBS0 ≤ 0.001), parent-offspring (0.177 < K < 0.354 and IBS0 ≤ 0.001), and highly related/sibling (0.177 < K < 0.354 and IBS0 ≤ 0.25)21.
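The relationship criteria can be expressed as a small decision function. This is a sketch under one stated assumption: the kinship range for the highly related/sibling class is read as 0.177 < K < 0.354, the same interval as for parent-offspring pairs. The function name and the label for unmatched pairs are illustrative.

```python
def classify_pair(kinship: float, ibs0: float) -> str:
    """Classify a pair of samples from a kinship coefficient and an IBS0 value.

    Rules are checked from the most to the least stringent, so a pair that
    satisfies the parent-offspring rule is not reported as a sibling pair.
    """
    if kinship >= 0.49 and ibs0 <= 0.001:
        return "identical clone"
    if 0.177 < kinship < 0.354 and ibs0 <= 0.001:
        return "parent-offspring"
    # Assumption: the sibling class uses the same kinship interval.
    if 0.177 < kinship < 0.354 and ibs0 <= 0.25:
        return "highly related/sibling"
    return "unclassified"


if __name__ == "__main__":
    # Toy values for illustration only.
    for k, i in [(0.50, 0.0), (0.25, 0.0005), (0.25, 0.01), (0.05, 0.30)]:
        print(k, i, classify_pair(k, i))
```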

Population genetic analyses

Nucleotide diversity (π) for each sample was calculated using VCFtools with a 100 kb window size. Sequence divergence (Dxy) and the f-statistic (fd) were calculated using the Python script popgenWindows.py (https://github.com/simonhmartin/genomics_general) in nonoverlapping 50 kb windows. For the fd analysis, ME was used as the sister population to PN, and gene flow between EU and PN grapes was evaluated. The heterozygous sites were counted using VCFtools, and the heterozygosity of each sample was calculated by dividing its number of heterozygous sites by the total number of SNPs. These statistics were evaluated within or between all six groups (EU, n = 10; ME, n = 10; PN, n = 18; SG, n = 3; HE, n = 2; and PN40024, n = 4).
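The per-sample heterozygosity defined above is a simple ratio, sketched below on toy genotypes. The function name and data layout are hypothetical; in the study the heterozygous sites were counted with VCFtools.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[Tuple[int, int]]  # None marks a missing genotype


def heterozygosity(genotypes: Sequence[Genotype], total_snps: int) -> float:
    """Number of heterozygous sites in one sample divided by the total
    number of SNPs in the dataset."""
    if total_snps <= 0:
        raise ValueError("total_snps must be positive")
    n_het = sum(1 for gt in genotypes if gt is not None and gt[0] != gt[1])
    return n_het / total_snps


if __name__ == "__main__":
    # Toy sample: two heterozygous calls among five SNPs.
    sample = [(0, 1), (0, 0), (1, 1), (1, 0), None]
    print(heterozygosity(sample, total_snps=5))
```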

Selective sweep detection

PBScan (https://github.com/thamala/PBScan) was used for the population branch statistic (PBS) analysis26. The PN group was designated as POP1, the ME group as POP2, and the EU group as POP3. The analysis was conducted in windows of 50 SNPs with a step size of 50 SNPs, based on sequence divergence (Dxy).
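For orientation, the population branch statistic for a focal population can be written from the three pairwise distances. The sketch below assumes that pairwise Dxy values are used directly as the distances; whether PBScan applies any further transformation or normalisation is not stated in the text, so this should be read as the generic formulation rather than as PBScan's implementation. The window helper is likewise illustrative.

```python
from typing import List


def pbs(d12: float, d13: float, d23: float) -> float:
    """Branch length leading to population 1 given pairwise distances
    between populations 1-2, 1-3 and 2-3.

    Assumption: distances are pairwise Dxy values used without transformation.
    """
    return (d12 + d13 - d23) / 2.0


def window_starts(n_snps: int, size: int = 50, step: int = 50) -> List[int]:
    """Start indices of SNP-count windows (50 SNPs with a step of 50 SNPs)."""
    return list(range(0, n_snps - size + 1, step))


if __name__ == "__main__":
    # POP1 = PN, POP2 = ME, POP3 = EU; toy distances for illustration only.
    print(pbs(d12=0.02, d13=0.03, d23=0.01))
    print(window_starts(120))
```

With this formulation, a window in which PN is unusually diverged from both ME and EU, while ME and EU remain close to each other, receives a high PBS value.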

Detection of deleterious mutations

We used Sorting Intolerant From Tolerant 4G (SIFT 4G) (https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB) to annotate the SNP dataset in order to estimate the functional effects of mutations44. Swiss-Prot (https://www.uniprot.org/help/downloads) was used as the reference protein set to construct a grapevine database from the genome and annotation files of PN40024_T2T, after converting the GFF3 annotation to the Ensembl GTF format. The generated database was then used to annotate the flipped SNPs. SIFT scores range from 0 to 1, and any nonsynonymous position with a SIFT score of less than 0.05 was considered putatively deleterious.

ANGSD analysis

We used ANGSD (http://www.popgen.dk/angsd/index.php/ANGSD) to calculate the site frequency spectrum (SFS), which shows the distribution of allele frequencies across sites in the PN population115. First, we ran the script using the reversed VCF file, combined with the reference genome, to extract the sequence file of the ancestral state (anc.fa). We then used the -doSaf 1 parameter to generate the .saf (site allele frequency likelihood) file, and finally, the realSFS command was used to generate the site frequency spectrum as a text-format .sfs file.
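To make the output of this step concrete, the sketch below builds an unfolded SFS from hard-called derived allele counts. This is a deliberate simplification: ANGSD estimates the spectrum from site allele frequency likelihoods rather than from called genotypes, and the function name and toy counts are hypothetical.

```python
from typing import List, Sequence


def unfolded_sfs(derived_counts: Sequence[int], n_chromosomes: int) -> List[int]:
    """Unfolded site frequency spectrum.

    Entry i is the number of sites at which the derived allele is carried by
    exactly i of the n_chromosomes sampled chromosomes.
    """
    sfs = [0] * (n_chromosomes + 1)
    for count in derived_counts:
        if not 0 <= count <= n_chromosomes:
            raise ValueError(f"derived count {count} is out of range")
        sfs[count] += 1
    return sfs


if __name__ == "__main__":
    # Two diploid samples (four chromosomes), five toy sites.
    print(unfolded_sfs([1, 1, 2, 0, 4], n_chromosomes=4))
```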

SV calling and filtering

Delly (v1.0.3)116 was used to call SVs with PN40024_T2T as the reference genome. The “call” and “merge” functions were first used to get the .bcf files, and then the Delly “call” function was used again with the “-v” parameter. Subsequently, the BCFtools (v1.13) “merge” function was used to merge the .bcf files. We removed low-quality SVs that did not pass the quality filters from the merged SVs. After that, we used VCFtools to remove SVs with more than 20% missing genotypes. The number of SVs containing TEs was counted based on the TE annotation of PN40024_T2T using Bedtools117.

Gene set enrichment analysis and gene ontology enrichment analysis

We downloaded the protein database (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/swissprot.gz) from NCBI. Protein sequence alignment was then performed using Diamond118. The database was first built using the Diamond “makedb” command, and then sequence alignment was performed using the Diamond “blastp” command. Next, the GSEA and GO enrichment analyses were performed using clusterProfiler (v4.0)27 with the GO terms of all grapevine genome proteins as the background. First, we used the “AnnotationForge” package in R to build an “OrgDb” for gene-to-GO mapping. Then, for the GSEA analysis, the genes were ordered according to their PBS values, and the “gseGO” function of clusterProfiler was applied to these genes with the parameters pAdjustMethod = “BH” and qvalueCutoff = 0.05. For the GO analysis, the “enrichGO” function of clusterProfiler was applied to the candidate genes with the parameters pAdjustMethod = “BH” and qvalueCutoff = 0.05.

Analysis of heterozygous variants

The heterozygous blocks were detected as follows: if the distance between two adjacent hSNPs was less than 150,000 bp, the two SNPs were counted as continuous variation sites; if the span of a run of continuous variation sites was more than 500,000 bp, the region was counted as a heterozygous block.
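The two distance rules above amount to a single pass over sorted hSNP positions. The following is a minimal sketch for one chromosome; the function name and the toy coordinates are hypothetical, and the handling of ties at exactly 150,000 bp or 500,000 bp follows a literal reading of "less than" and "more than".

```python
from typing import List, Sequence, Tuple


def find_het_blocks(positions: Sequence[int],
                    max_gap: int = 150_000,
                    min_span: int = 500_000) -> List[Tuple[int, int]]:
    """Return (start, end) of heterozygous blocks on one chromosome.

    Neighbouring hSNPs closer than max_gap are chained into a run; a run is
    reported as a block when its span exceeds min_span.
    """
    blocks: List[Tuple[int, int]] = []
    if not positions:
        return blocks
    pos = sorted(positions)
    start = prev = pos[0]
    for p in pos[1:]:
        if p - prev < max_gap:
            prev = p  # extend the current run
        else:
            if prev - start > min_span:
                blocks.append((start, prev))
            start = prev = p  # begin a new run
    if prev - start > min_span:
        blocks.append((start, prev))
    return blocks


if __name__ == "__main__":
    # Toy hSNP coordinates: one long run followed by a short, isolated run.
    toy = [0, 100_000, 200_000, 300_000, 400_000, 500_000, 600_000,
           1_000_000, 1_050_000]
    print(find_het_blocks(toy))
```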

For analysis of different genotype combinations among cultivars, the conserved sites, which were consistent in each cultivar group, were used for counting. The combination categories with site numbers less than 30 were not included.

For the distribution analysis of SNPs on genes, the 2 kb regions upstream and downstream of genes were counted as promoters and terminators, respectively. The density of heterozygous SVs and dSNPs was calculated in non-overlapping 200 kb windows.
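The windowed density calculation can be sketched as a simple binning of variant positions. The function name, the 1-based coordinate convention, and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence


def window_counts(positions: Sequence[int],
                  chrom_length: int,
                  window: int = 200_000) -> List[int]:
    """Count variants in non-overlapping windows along one chromosome.

    Positions are assumed to be 1-based; the final window may be shorter
    than the window size.
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        if not 1 <= p <= chrom_length:
            raise ValueError(f"position {p} lies outside the chromosome")
        counts[(p - 1) // window] += 1
    return counts


if __name__ == "__main__":
    # Four toy variants on a 500 kb chromosome.
    print(window_counts([1, 200_000, 200_001, 450_000], chrom_length=500_000))
```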

Species specific marker

To select the specific marker belonging to PN or SG, we used the following criteria: the variants must exist in only one grapevine cultivar with the frequency greater than 50%, and must not be observed in all individuals of another grapevine cultivar.
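These selection criteria can be written as a short predicate. Two readings are assumed here and may differ from the authors' implementation: "frequency" is taken as the fraction of individuals of the target cultivar that carry the variant, and the second criterion is taken to mean that no individual of the other cultivar carries it. The function name and inputs are hypothetical.

```python
from typing import Sequence


def is_specific_marker(carriers_target: Sequence[bool],
                       carriers_other: Sequence[bool],
                       min_freq: float = 0.5) -> bool:
    """Check whether a variant qualifies as a marker for the target cultivar.

    Each sequence holds one flag per individual: True if the individual
    carries the variant, False otherwise.
    """
    if not carriers_target:
        return False
    freq = sum(carriers_target) / len(carriers_target)
    # Assumed reading: frequency strictly above 50% in the target cultivar
    # and complete absence from the other cultivar.
    return freq > min_freq and not any(carriers_other)


if __name__ == "__main__":
    # Toy example: carried by two of three PN clones and by no SG clone.
    print(is_specific_marker([True, True, False], [False, False]))
```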

The GERP score

The longest transcripts were extracted from the genomes of Arabidopsis thaliana, Populus trichocarpa, Oryza sativa, Vitis retordii, Vitis amurensis, Cissus rotundifolia, Malus domestica, Ficus macrocarpa and PN40024, then Orthofinder (v2.5.4)103 was used to generate single-copy protein sequences, which were used to construct the phylogenetic tree using IQ-TREE.

The genome of PN40024 was split into individual chromosomes, genome alignments were performed using last (https://gitlab.com/mcfrith/last), and the resulting .maf files were merged and converted to .fa files. Each SNP on each PN40024 chromosome was then scored using GERP119.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.