Extended Data Fig. 4: The quality control, variant discovery, and structural haplotype analysis of the macaque pangenome. | Nature

From: Integrated analysis of the complete sequence of a macaque genome

(a) The left panel shows the Flagger evaluation of 20 haplotype-resolved assemblies, while the right panel shows the average across the 20 assemblies and the evaluation of T2T-MFA8 (no chr. Y). (b) The cumulative number of added bases as assemblies are added one by one, with red representing MFA and blue representing MMU. The total of added polymorphic sequences grows slowly after the seventh MFA or MMU assembly. The species switch (MFA → MMU) increases the yield of added sequences. Transparent colors indicate singleton (AF < 5%), doubleton (5% ≤ AF < 10%), polymorphic (10% ≤ AF < 50%), and common (AF ≥ 50%) alleles. (c) The left panel shows the number of small variants (top) and SVs (bottom) per haplotype in the pangenome graph. The right panel shows the average number of small variants (top) and SVs (bottom) of MFA, MMU, and humans (from the HPRC-year1 MC pangenome graph). (d) Comparison of biallelic SNVs between the pangenome graph and the macaque whole-genome sequencing (WGS) cohort (289 macaques). The gray histogram shows the count of SNVs from the macaque cohort at each MAF cutoff (x-axis; e.g., MAF > 0.05 counts the SNVs with MAF greater than 0.05), while the line chart shows the fraction of these SNVs covered by the pangenome. This panel shows that the pangenome graph covers 80% of genetic variation with MAF ≥ 5% in the macaque cohort. (e, f) The correlation of AFs between the pangenome and 79 wild samples (e) and between the macaque cohort and the same wild samples (f). (g) The bar plot shows the most copy number (CN) variable genes in SDR hotspots of macaques. The x-axis represents the number of gene copies that can be mapped to a bubble in the pangenome graph, while the y-axis lists the 17 most CN variable genes. (h) The complexity of the major histocompatibility complex (MHC) in macaques. The top panel shows SNV and SV densities for eight structural haplotypes, with gene models. The bottom panel shows the syntenic relationship between T2T-MFA8v1.1 and MFA186ZAI-H2, which reveals a ~1 Mbp deletion in MFA186ZAI-H2 with respect to T2T-MFA8v1.1. (i) The syntenic relationship of the CYP2C76 region in primates. In each assembly, syntenic regions are represented as blocks and non-syntenic regions as thin lines, with DupMasker and gene annotations attached to each genome segment. (j) The structural representation of the GSTM family, with gene annotation. Green and purple mark the start and end of GSTM gene bodies, respectively. (k) Four structural haplotypes of GSTM follow different paths in the pangenome graph, with red and purple representing the start and end of a path, respectively. The haplotype of T2T-MFA8v1.1 is GSTM (5A, 1A, 1B, 2). (l) The table lists the frequencies of GSTM haplotypes alongside their schematic graphs. The first column gives the frequency of each structural haplotype in the pangenome graph, while the second column gives the frequency inferred from the population by short-read genotyping.
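As an illustration only, the allele-frequency classes in panel (b) and the coverage fraction plotted in panel (d) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' pipeline: the function names `classify_af` and `covered_fraction`, the input layout, and the example values are all hypothetical. The strict `MAF > cutoff` comparison follows the x-axis description in panel (d).

```python
from typing import Optional, Sequence


def classify_af(af: float) -> str:
    """Bin an allele frequency (as a fraction, 0..1) using the panel (b) thresholds."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    if af < 0.05:
        return "singleton"
    if af < 0.10:
        return "doubleton"
    if af < 0.50:
        return "polymorphic"
    return "common"


def covered_fraction(
    cohort_mafs: Sequence[float],
    in_pangenome: Sequence[bool],
    cutoff: float,
) -> Optional[float]:
    """Fraction of cohort SNVs with MAF > cutoff that are also present in the pangenome.

    cohort_mafs[i] is the MAF of SNV i in the cohort (hypothetical input);
    in_pangenome[i] says whether the same SNV is found in the pangenome graph.
    Returns None when no SNV passes the cutoff.
    """
    if len(cohort_mafs) != len(in_pangenome):
        raise ValueError("inputs must have the same length")
    kept = [hit for maf, hit in zip(cohort_mafs, in_pangenome) if maf > cutoff]
    if not kept:
        return None
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Made-up values, for demonstration only.
    mafs = [0.01, 0.06, 0.20, 0.40]
    hits = [False, True, False, True]
    print([classify_af(m) for m in mafs])
    print(covered_fraction(mafs, hits, cutoff=0.05))
```

Evaluating `covered_fraction` over a range of cutoffs would give the kind of curve drawn as the line chart in panel (d), under the assumptions stated above.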
