Introduction

The vast complexity and diversity of the human genome has opened emerging avenues for research in genomics, leading to significant advances in our understanding of human biology and disease. However, the application of these genomic advances is limited by the quality, completeness, and representativeness of the reference genome used1. In the pursuit of understanding the intricate tapestry of human genome variation, large-scale sequencing projects have provided invaluable insights2,3. Recently, due to significant advancements in long-read genome technologies, the complete telomere-to-telomere (T2T) sequence of the haploid human genome CHM134 and the complete sequence of the Y chromosome were obtained5. This finding is a remarkable achievement that fills 8% of the gaps that exist in the current reference genome GRCh384. Although this sequence represents a fully resolved genome, it does not reflect human sequence diversity, and a population-wide approach is necessary to detect population-specific variants, sequences and regulatory elements. The Human Pangenome Reference Consortium (HPRC) has made significant strides in genomic research by constructing a pangenome from 47 ethnically diverse samples, adding 119 million base pairs and identifying 1115 gene duplications absent in the GRCh38 reference genome6. This finding led to a 104% increase in structural variant detection per haplotype6. A recent study in China (CPC) of 58 samples from 36 ethnic minorities identified 5.9 million small variants and 34,223 structural variants not found in the HPRC pangenome7. These findings underscore the value of incorporating genetically diverse individuals for a comprehensive understanding of the human genome landscape.

Arabs constitute culturally diverse communities primarily from Middle East and North Africa (MENA) regions with a combined population of nearly 500 million, comprising ~6% of the global population. Unfortunately, Arab populations lack adequate representation in large-scale sequencing projects (i.e., the gnomAD database); neither the HPRC pangenome nor the 1000 Genomes Project include samples from this demographic8. The ancestral complexities between different Arab ethnic backgrounds have not yet been fully understood through large-scale genomic initiatives9,10. Moreover, the Arab population, which experiences a higher incidence of consanguineous marriages, is predisposed to an increased rate of rare recessive disorders11,12,13. The incidence of diabetes, heart disease and cancer in Arab populations is increasing14,15. The lack of reference genomes for Arab populations has limited the investigation of genetic diversity and the genetic underpinning of numerous diseases. Population-specific reference pangenomes will enable the identification of variants associated with diseases and sequences that are unique or prevalent in Arab populations.

In this study, we sought to generate a draft of United Arab Emirates (UAE) Pangenome Reference (UPR) from Arab individuals residing in the UAE using multiple sequencing technologies and de novo assembly construction approaches. Here, we present the preliminary construction of the UPR from 53 human genomes with high-quality de novo diploid phased assemblies and their comparison with reference genomes and genomic datasets. Our analysis also included the application of the UPR in functional annotation, short and long-read whole-genome mapping for variant detection. We anticipate that our efforts will produce a pangenome that will allow for population-specific genome interpretation, facilitating more accurate genome interpretation with fewer biases and gaps, which will enhance the precision of detecting both structural and small variants.

Results

Healthy Arab sample cohort

We constructed a pangenome reference for healthy Arab individuals from eight Arab countries (the United Arab Emirates (UAE), Saudi Arabia, Oman, Jordan, Egypt, Morocco, Syria and Yemen) all residing in the UAE (Fig. 1a). 53 samples, including 50 unrelated adults (18 years or older) and a trio of family members with no known rare or common chronic diseases (i.e., no history of hypertension, diabetes mellitus, cancer, or lung or heart disease), were enrolled (Supplementary Fig. 1, Supplementary Table 1).

Fig. 1: Cohort characterization and sequencing quality.
Fig. 1: Cohort characterization and sequencing quality.
Full size image

a Geographic diversity of sample collection. A map highlighting the distribution of sample collection sites. Each point marks the geographical location of cohorts involved in the study, illustrating the broad recruitment strategy employed to capture genetic diversity. n = 53 individuals. Panel created in BioRender. Jamalalail, B. (2025) https://BioRender.com/ gyyr5q3. b Ultra-long read yield from Oxford Nanopore Technologies (ONT) sequencing. Histogram showing the yield of ultra-long reads (>100 kb) generated by ONT sequencing. The x-axis shows the reads based on length intervals, while the y-axis indicates the total yield per bin. n = 53 individuals. c Chromosome mapping distribution of sequencing reads. Boxplot presenting the distribution of ONT and Pacific Biosciences (PacBio) reads that align to acrocentric, metacentric, and submetacentric chromosomes. Neon pink and teal blue bars represent PacBio and ONT data respectively. Box plots show the 25th and 75th percentiles (interquartile range, IQR), center line indicates the median, and whiskers extend to the minimum and maximum values. Individual data points are overlaid. n = 53 individuals. d Boxplot illustrating the coverage of subtelomeric and pericentric regions by both ONT and PacBio reads. Box plots show the 25th and 75th percentiles (interquartile range, IQR), center line indicates the median, and whiskers extend to the minimum and maximum values. Individual data points are overlaid. n = 53 individuals. e Population structure via Principal Component Analysis (PCA). Two-dimensional scatter plot derived from PCA, visualizing the genetic variance among different ethnicities. Samples from Human Genome Diversity Project (HGDP) and Human Origins database are color coded by ethnicity, with UPR highlighted in red. n = 53 UPR individuals, and n = 1,040 Human Origins individuals, including n = 304 Arabs. The two axes represent the percentage variance explained by PCA1 and PCA2. Source data are provided as a Source Data file.

Assessment of sequencing quality and variant statistics

The genomes were sequenced using Pacific Biosciences (PacBio) high-fidelity (HiFi), Oxford Nanopore Technologies (ONT) ultralong read sequencing (ULK) and high-coverage (Hi-C) Illumina short-read sequencing methods. The average coverage for each sample was 35.27X (26.60X–69.39X) using PacBio HiFi sequencing, 54.22X (33.33X–90.87X) using ONT ultralong read sequencing and 65.46X (50.89X–91.96X) using Hi-C sequencing (Supplementary Fig. 2). The average median Q score, indicating sequencing quality, was 33.19 (28–36) for PacBio HiFi sequencing, 17.35 (16-18.2) for ONT sequencing and 37 for Hi-C. The average N50 values were 16.69 kb (14.23 kb–19.06 kb) and 57.67 (43.64 kb–87.02 kb) for PacBio HiFi and ONT ultralong read sequencing, respectively (Supplementary Fig. 2 and Supplementary Table 2). Notably, ultralong reads exceeding 100 kb contributed to a substantial yield of 2011.14 Gb, translating to 670.38X coverage (an average of 12.53X per sample) (Fig. 1b, Supplementary Table 3). When mapped against CHM13 v2.0, both PacBio HiFi and ONT reads provided valuable insights into the sequencing coverage across acrocentric and metacentric chromosomes. Due to the use of ultralong protocols, the ONT reads exhibited better mapping across all chromosomes than did the PacBio reads, particularly for the acrocentric chromosomes, where 99.49% (94.95%–100%) coverage was achieved by ONT in comparison to 95.60% (91.61%–99.75%) coverage achieved by PacBio (Fig. 1c, Supplementary Fig. 3, Supplementary Table 4a). We utilized both read types to analyze coverage across diverse regions of the human genome, including satellite DNA, centromeric transitions, and ribosomal DNA. Both platforms exhibited excellent coverage, with regions such as AlphaSat_div, AlphaSat_mon, GammaSat, and Other CenSat exhibiting average coverage above 96% (96.67%–100.00%). However, the ONT platform demonstrated a marked increase in rDNA coverage, with 99.93% (99.57%–100%) and 67.44% (52.25%–97.25%) coverage for the ONT and PacBio HiFi reads, respectively (Fig. 1d, Supplementary Table 4b).

Our joint calling analysis of PacBio HiFi data incorporating the DeepVariant pipeline (GRCh38) revealed an average of 4.21 million (M) (ranged from 4.02 M–4.48 M) single-nucleotide variants (SNVs), 913,786 (800,438–971,731) insertion/deletions (indels) and 53,618 (51,732–56,573) structural variants (SVs) per sample (Supplementary Table 5, Supplementary Fig. 4). We identified population-specific variants, including 1.40 M SNVs and 308,803 M indels, that were not previously reported in the 1000 Genomes Project2, dbSNP16, gnomAD17 or GME18. We identified 158,696 population-specific structural variants not previously reported in the 1000 Genomes Project or Database of Genomic Variants (DGV)19.

Population structure of the UPR samples

We subsequently analyzed the genetic structure of 53 UPR genomes from published global and regional populations9,20,21,22. Principal component analysis (PCA) demonstrated distinct clustering patterns of the UPR samples and Arab groups (Fig. 1e), reflecting their unique population history and diversity in contrast to other global populations (Supplementary Fig. 5a, b and Supplementary Table 6a). Using the relatedness analysis using plink, we confirmed all UPR samples are unrelated except for the trio (Supplementary Table 6b). Haplotype sharing information obtained using the fineSTRUCTURE algorithm23 demonstrated the substructures within our dataset, reflecting the different Arab subpopulations and the unrelated status of all 50 samples (Supplementary Fig. 5c). Model-based clustering using the ADMIXTURE algorithm, which integrated the genomic data from different Arab groups, indicated that the ancestry of the UPR cohort generally reflects that of various Arab subpopulations (Supplementary Figs. 6, 7).

Our analysis revealed that the UPR cohort exhibited mitochondrial and Y chromosome haplogroup patterns closely resembling those associated with Arab populations. Specifically, the hierarchical clustering heatmaps demonstrated a significant alignment in haplogroup frequencies between our cohort and reference populations from Arab regions (Supplementary Fig. 8, Supplementary Table 7).

Assembly construction and evaluation of diverse Arab genomes

Our comparative analysis of the UPR trio revealed that both the Hifiasm24 and Verkko25 assemblers yielded comparable contig lengths (Supplementary Table 8). In samples lacking parental genome data, Hifiasm exhibited better performance in terms of both contig length and overall contiguity. Consequently, we performed de novo assembly using Hifiasm (v0.19.5-r603) for our entire cohort and used those assemblies for downstream analysis. Sequence quality control was performed using Kraken26, which filtered out non-human eukaryotic pathogen genomes and retained reads classified as human. We excluded an additional 109 contigs that mapped to ChrM with at least 95% identity (Supplementary Table 9). Additionally, we identified and cataloged 185 interchromosomal joins in the assemblies based on their sequence identity using paftools27 (Supplementary Fig. 9, Supplementary Table 10), resulting in the exclusion of four contigs due to significantly large mis-joins in acrocentric chromosomes. Inspector assembly polishing28 was further carried out to reduce small-scale assembly errors.

The resulting assemblies yielded an average genome size of 3.01 Gb (2.86 Gb–3.11 Gb) (Supplementary Fig. 10, Supplementary Table 11). Haploid assemblies with an X chromosome had an average total length of 3.02 Gb, representing 99.1% of the T2T-CHM13 (3.06 Gb). Conversely, haploid assemblies with a Y chromosome averaged a total length of 2.98 Gb, highlighting the inherent size difference between the sex chromosomes. On average, the assemblies comprised 103 contigs (ranged from 58 to 245), with a contig N50 length of 124.28 Mb (86.26 Mb–146.15 Mb) (Fig. 2a) and an average auN of 122.48 Mb (87.77 Mb–146.00 Mb) (Supplementary Fig. 11). This value exceeds the GRCh38 contig N50 of 57.88 Mb, demonstrating enhanced contiguity across all our assemblies. All samples surpassed the 40 Mb N50 of the HPRC reference genome (Supplementary Fig. 11). The average QV of our assemblies, computed using yak (v0.1-r66), was 57.53 (52.85–62.59), indicating robust assembly quality (Fig. 2b). We compared our assemblies to the CHM13 and GRCh38 reference genomes and revealed an average coverage of 93.59% (92.79% for males, 94.24% for females) and 96.53% (95.49% for males, 97.39% for females), respectively (Fig. 2c). We identified an average of 49.86 Mb (29.31 Mb–73.14 Mb) and 76.81 Mb (54.74 Mb–108.31 Mb) of assembled contigs that did not align with CHM13 and GRCh38, respectively (Fig. 2d, Supplementary Fig. 12). This observation is consistent with the unmapped rates reported in diverse pangenome studies, reflecting the complexity of the genome assembly and alignment processes. The duplication ratio, indicative of read mapping multiplicity against reference genomes, averaged 1.03 (1.023–1.047) and 1.012 (1.005–1.022) for GRCh38 and CHM13, respectively (Supplementary Fig. 12), highlighting the precision of our assembly with minimal duplicated regions.

Fig. 2: Quality assessment of 53 phased diploid assemblies and gene duplication analysis.
Fig. 2: Quality assessment of 53 phased diploid assemblies and gene duplication analysis.
Full size image

a Assembly contiguity. A line plot showing contig length plotted against cumulative assembly coverage, with reference contiguities for both CHM13 and GRCh38 genomes included for comparison. n = 53 individuals. b Assembly accuracy and completeness. A plot illustrating the mapping rate versus consensus accuracy (Quality value, QV), offering insights into the completeness and accuracy of the assemblies. n = 53 individuals. c Genome fraction. Scatter plot showing the fraction of the genome covered by the assemblies compared to benchmark references CHM13 and GRCh38. n = 53 individuals. d Unaligned length. A scatter plot comparing the unaligned length of the assemblies relative to the CHM13 and GRCh38 references. n = 53 individuals. e Flagger analysis. Bar plot illustrating the reliability of 53 UPR assemblies using read mapping. The plot differentiates between paternal and maternal haplotypes, with regions flagged as reliable (blue) representing the majority of each assembly. The y-axis is broken to emphasize the dominant reliable haploid component and the stratification of the unreliable blocks. f Gene and transcript annotation. Scatter plot showing the percentages of protein-coding and noncoding genes, as well as transcripts annotated from the reference set in each of the assemblies. n = 53 individuals. g Gene duplication per assembly. Histogram showing the number of unique duplicated gene families in each phased assembly in comparison to the number of duplicated genes annotated in GRCh38. n = 53 individuals. h Comparative duplicated gene analysis. Venn diagram visualizing the overlap and unique counts of duplicated genes across UAE-based Arab Pangenome Reference (UPR), Human Pangenome Reference Consortium (HPRC), and Chinese Pangenome Consortium (CPC) assemblies. n = 106 UPR, 88 HPRC, 116 CPC assemblies. i Arab-HPRC duplicated gene overlap. Bar graph showcasing five overlapped duplicated genes with a higher frequency ( ≥5%) in Arab assemblies (blue) compared to HPRC (orange). n = 106 UPR, 94 HPRC assemblies. j Arab-CPC duplicated gene overlap. Bar chart illustrating five overlapped duplicated genes with a significantly higher frequency (≥5%) in Arab assemblies (blue) in contrast to CPC (yellow). n = 106 UPR, 116 CPC assemblies. k Bar graphs indicating the count of UPR unique duplicated genes across chromosome types: acrocentric, metacentric, and submetacentric. l Bar graph showing the count of UPR unique duplicated genes dispersed across all individual chromosomes, highlighting regions of enrichment. m Gene duplication in microsatellite region. Bar graph depicting the count of UPR unique duplicated genes located in microsatellite regions. Source data are provided as a Source Data file.

To evaluate the reliability of our assembled genomic sequences, we employed the Flagger tool29, which is designed to identify misassemblies within a phased diploid assembly. Our analysis using Flagger revealed that a minimal proportion (1.28% or 37.91 Mb) of the total assembly was deemed unreliable (Fig. 1e, Supplementary Table 12), indicating the high accuracy of the UPR assemblies.

The phasing accuracy of the trio samples was evaluated using yak trioeval, revealing a switch error rate and Hamming error of 0.18% and 0.23%, respectively. For the samples with Hi-C data, the phasing accuracy was computed using pstools30, with an average switch error rate and Hamming error of 0.64% (0.21%–1.07%) and 1.96% (0.76%–3.21%), respectively (Supplementary Table 13a). To evaluate the mapping accuracy of our assembled contigs within the most complex centromeric regions, we employed UniAligner31 to calculate the mapping percentages at 100 bp intervals across these regions. Our analysis revealed the average mapping rate of 26.97% for UPR (Supplementary Fig. 13), compared to 20.42% for HPRC, highlighting the accuracy and reliability of UPR assemblies in representing complex genomic regions such as centromeres. Largest scaffolds in our cohort were identified at the chromosome level using strict criteria, requiring that each scaffold predominantly map to a single chromosome with a maximum of three gaps and cover a significant portion of the chromosome’s length, excluding regions composed of microsatellites. Applying these criteria, we identified 417 chromosome-scale scaffolds in the UPR genome assemblies (Supplementary Table 13b). These comprehensive quality control assessments verified the high contiguity and accuracy of our de novo assembled genomes, providing a solid foundation for downstream genomic and clinical applications.

Gene duplications in UPR assemblies

To investigate gene duplication events within UPR diploid assemblies, we employed the Liftoff (v1.6.3)32 tool and annotated the genes using a subset of GENCODE GRCh38.p14 (GENCODE release 38) (detailed in Methods). The identified gene duplications revealed unique and diverse genomic features of the Arab population. Remarkably, a median of 99.15% (96.20%–99.34%) of protein-coding genes and 99.13% (95.86%–99.28%) of protein-coding transcripts were identified across all the UPR assemblies (Fig. 2f). Each genome had an average of 35 genes with a gain in copy number relative to GRCh38 (Fig. 2g, Supplementary Table 14). We observed that 37.83% of duplicated genes were present as singletons, whereas 8.17% were common (>5%) among UPR assemblies.

We performed a comparative assessment of gene duplications within the UPR, HPRC, and CPC assemblies to identify genes that were specific to the Arab population. We identified 1135 duplicated genes unique to the UPR assemblies that were absent from the HPRC assemblies and a further 883 duplicated genes that were absent from both the HPRC and CPC assemblies (Fig. 2h). In addition, 38 duplicated genes from the UPR assemblies were also observed in the HPRC and CPC assemblies. Among these overlapping genes, some genes were present at a greater frequency (≥5%) in the UPR assemblies than in the HPRC and CPC assemblies (Fig. 2i, j Supplementary Table 15), such as the USP17L genes. The USP17L genes encode ubiquitin-specific proteases, which are involved in various cellular processes, such as protein degradation, DNA repair, and cell cycle regulation33,34.

Among the unique UPR duplicated genes, the TAF11L5 gene was present in all UPR assemblies but absent in the HPRC and CPC assemblies. The TAF11L5 gene encodes a TATA-box binding protein (TBP)-associated factor, which is predicted to be involved in the assembly of the RNA polymerase II preinitiation complex, a large complex of proteins that initiates the transcription of protein-coding genes35. These unique gene duplications were predominantly located on submetacentric chromosomes (Fig. 2k, Supplementary Table 16), primarily chromosomes 16 and 1 (Fig. 2l). Conversely, the acrocentric chromosomes had a low number of duplication events. This observation may be attributed to the inherent challenges in resolving these regions. Notably, unique gene duplications were present in microsatellite regions (191 out of 883), primarily within centromeric satellites (Fig. 2m, Supplementary Table 17). Furthermore, 15.06% of unique UPR duplicated genes were implicated in recessive conditions, compared to 14.61% of unique HPRC genes and 11.25% of unique CPC genes (Supplementary Table 17). Using the Horizon platform, 1.70% of unique duplicated genes were annotated as pan-ethnic disease genes associated with severe phenotypes36 (Supplementary Table 18).

Recurrent gene duplications specific to UPR assemblies further highlight the unique genetic landscape, with 255 and 123 duplicated genes not found in the HPRC and CPC assemblies, respectively (Supplementary Fig. 14, Supplementary Table 19). In addition, 13 duplicated genes from the UPR assemblies were also observed in the HPRC and CPC assemblies.

Pangenome graph construction and Arab genome-specific variants

We constructed a pangenome graph incorporating 53 Arab genomes using Minigraph-Cactus (v2.7.2)37, integrating 106 long-read haplotype assemblies into a graph structure. The graph was seeded with the CHM13 (as the backbone) and GRCh38 reference genomes and expanded using Minigraph to incorporate the UPR assemblies. CHM13-T2T was used as the primary reference for variant calling. The resulting pangenome encompassed a total length of 3,330,235,835 bp, with 80,141,431 nodes and 110,455,287 edges, whereas the HPRC pangenome has a total length of 3,328,787,872 bp (Supplementary Table 20, Supplementary Fig. 15).

We conducted a comparative analysis of the UAE pangenome graph with the combined HPRC and CPC pangenome graph, which represents a diverse set of human genomes from different populations. Each genome in our cohort contained an average of 5.45 M (5.11 M–5.99 M) small variants (Fig. 3a, Supplementary Table 21). Our analysis revealed an average of 899,585 (782,486–1,077,439) unique small variants per sample (Fig. 3b) and 13.70 M small variants that were unique to UPR (Fig. 3c), not present in HPRC-CPC, CHM13 or GRCh38. Furthermore, we identified 8.94 M population-specific small variants absent in the HPRC-CPC pangenomes, CHM13, GRCh38 and variant databases, including dbSNP16, gnomAD17, 1000G2, or GME18. SV analysis revealed that each sample contained an average of 33,383 (30,811–37,314) structural variants (Fig. 3d) and 15,475 (14,157–17,444) unique SVs (Fig. 3e) not found in HPRC-CPC, CHM13 or GRCh38. Moreover, we identified 235,195 population-specific SVs (Fig. 3f) that were absent in the HPRC-CPC pangenomes, CHM13, GRCh38 and other databases, including DGV19 and 1000G2.

Fig. 3: Arab genome specific sequences.
Fig. 3: Arab genome specific sequences.
Full size image

a Bar graph demonstrating the total number of small variants across 53 individuals, distinguishing between singleton (light blue) and polymorphic (dark blue) variants. b Bar graph showcasing the number of UPR-specific small variants for each individual, further differentiating between singleton and polymorphic variants. n = 53 individuals. c Venn diagram comparing the small variants from UPR to those in Human Pangenome Reference Consortium (HPRC), the Chinese Pangenome Consortium (CPC), CHM13, and GRCh38 assemblies. n = 53 UPR, 47 HPRC, 58 CPC individuals. d Stacked bar graph detailing the total structural variants (SVs) per sample, categorizing between singleton and polymorphic variants for both insertions and deletions. n = 53 individuals. e Stacked bar graph illustrating the SVs that are UPR-specific for each sample, for both insertions and deletions. n = 53 individuals. f Venn diagram visualizing the overlap and differences in SVs from UPR with HPRC and CPC datasets, CHM13, GRCh38, 1000 G and DGV. n = 53 UPR, 47 HPRC, 58 CPC individuals. g Visualization of Arab-specific SVs from the pangenome graph across autosomes. Sites of complex SVs are marked with blue. n = 53 individuals. h Pangenome growth curve for UPR graph. Core represents (≥95%), common (≥5%), and singleton (only one haplotype). n = 53. i Bar graph displaying the length distribution of additional identified sequences for each sample, offering insights into the diversity of unreported sequence lengths. n = 53 individuals. Source data are provided as a Source Data file.

The average SV length was 3.08 kb (1.17 kb–974 kb), and the median was 154 bp (Supplementary Fig. 16). These findings validated the assembly integrity by showing the Alu and LINE-1 repeat content along the chromosomes. We further analyzed the Arab-specific SVs in the autosomes (Fig. 3g, Supplementary Table 22a), involving 8995 unique genes (Supplementary Table 22b, Supplementary Table 23). These regions may harbor important genomic features that are specific to the Arab population.

Pangenome growth was assessed using Panacus38 and we identified a total of 212.94 Mb of nonreference sequences that were added from the 53 diploid genomes. Of these, 35.30 Mb were classified as singletons, representing unique genetic variations specific to individual genomes. Furthermore, 2.60 Mb was present in ≥95% of all analyzed haplotypes, which can be considered the core genome of our sampled populations. Additionally, 133.57 Mb of nonreference sequences were identified in ≥5% of the haplotypes, representing a common, but not necessarily unique, sequence among the Arab populations (Fig. 3h).

Uncharacterized euchromatic sequences in the UPR

We measured the number of previously uncharacterized sequences derived from SV insertions that are not present in GRCh38, T2T-CHM13, HPRC, CPC or DGV revealing that each Arab diploid genome harbored an average of 4616 (4049–5787) unreported sequences (Supplementary Figs. 1718a). A small percentage of the total unrepresented sequences were present in centromeric and telomeric regions (1.62% and 5.53%, respectively) (Supplementary Fig. 17). The pangenome graph added 111.96 Mb of unique non-reference sequences from the 106 haplotype Arab genomes, including 104.04 Mb of singletons. The average non-reference sequence length per individual was 2.61 Mb (range 2.04–3.60 Mb) (Fig. 3i, Supplementary Table 24). Nearly one-fifth (22.80%) of the previously unreported sequences were in microsatellite regions (Supplementary Fig. 18b), especially in the pericentromeric satellite, HSat3 and other centromeric satellites. To characterize the context of these previously unidentified sequences located within 84,311 loci, we performed repeat annotation using RepeatMasker and found that LINE, SINE, LTR and satellite repeats constituted 8.22%, 8.38%, 4.99% and 26% of the uncharacterized sequences, respectively (Supplementary Fig. 18c; Supplementary Table 25). These uncharacterized sequences may represent previously unrecognized functional elements or structural variations that are missed by conventional methods. These results demonstrate the power and utility of using long-read sequencing and pangenome graph construction to capture the genomic diversity and complexity of human populations.

Complex structural variation in the UPR pangenome graph

We used a pangenome graph to visualize and analyze complex structural variation (CSV) in the UPR samples. A complex structural variation site was defined as an SV site with at least one 10 kb SV with a minimum of 5 haplotypes. We found that 1.40% (733 out of 52,465) of these SV sites were complex multiallelic bubbles (Supplementary Table 21a) and 406 sites were UPR specific (Supplementary Table 26b). The CSVs were dispersed across all chromosomes, with a significant concentration on acrocentric chromosomes (average 38.3 CSVs), particularly chromosome 22, which had the highest count (60 CSVs). UPR CSVs were predominantly located in pericentromeric satellite, HSat3 and alpha satellite DNA (Supplementary Fig. 19). Our analysis revealed that most CSVs were approximately 10 kb in length, which was the established lower bound for CSV inclusion in our analysis (Supplementary Fig. 20). Utilizing rigorous criteria, we delineated complex structural variation regions as areas within a 100 kb window that consisted of two or more multiallelic SV sites, with each site containing at least one 10 kb SV within the haplotypes. This approach led to the discovery of 659 CSV regions (Supplementary Tables 27 and 28). We further investigated the complex variant regions (Supplementary Table 29a) by overlapping them with UPR-specific SVs (Supplementary Table 29b).

We compared the allelic variations in the UPR to those in the combined HPRC + CPC pangenomes in regions of clinical relevance: the PRAMEF (Fig. 4a, Supplementary Table 30) and POLR2J3 - SPDYE2 (Fig. 4b) gene regions. In the PRAMEF region, we observed a common allele present in HPRC + CPC pangenomes that was also present in 86.80% of the UPR samples, while others were unique haplotypes to the UPR pangenome. We observed 2 unique haplotypes that were exclusively found in 13.20% of the Arab cohort. Preferentially expressed antigen in melanoma (PRAME) belongs to a group of cancer/testis antigens that are mainly expressed in the testis and an array of tumors and play crucial roles in immunity and reproduction39. In the POLR2J3 - SPDYE2 region, we observed that seven alleles present in the HPRC + CPC pangenomes were absent in the UPR pangenome. We observed three previously undescribed allele exclusively present in UPR (absent in HPRC + CPC, CHM13 and GRCh38). The POLR2J gene encodes a subunit of RNA polymerase II, which is essential for synthesizing messenger RNA in eukaryotes40. The SPDYE2 gene encodes a member of the Speedy/RINGO cell cycle regulator family and is involved in protein kinase binding activity41. Furthermore, allelic variations were observed in the HLA-DRB region that were absent in the references (Supplementary Fig. 21). Other CSV regions containing CYP2D6 exhibited variable haplotype presences within the UPR that were different from those of the CHM13 or GRCh38 references (Supplementary Fig. 22).

Fig. 4: Visualizing complex structural variation region.
Fig. 4: Visualizing complex structural variation region.
Full size image

a Preferentially Expressed Antigen in Melanoma Family (PRAMEF) region subgraph. Diagram showcasing the specific location of the PRAMEF genes. b Sample haplotypes in PRAMEF Region. Distinct paths taken by different samples through the PRAMEF region. c PRAMEF region haplotype count. Linear structural diagrams representing the frequency and structural visualization of haplotypes identified by the graph across 106 haplotype assemblies, compared against the Human Pangenome Reference Consortium (HPRC)-the Chinese Pangenome Consortium (CPC) graph. d POLR2J3 - SPDYE2 region subgraph. Diagram highlighting the specific location of the POLR2J3 - SPDYE2 region. e Sample haplotypes in POLR2J3 - SPDYE2 region. Unique paths traversed by different samples through the POLR2J3 - SPDYE2 region. f POLR2J3 - SPDYE2 region haplotype count. Linear structural diagrams depicting the frequency and structural visualization of haplotypes as determined by the graph among 106 haplotype assemblies, compared with the HPRC-CPC graph for a comprehensive comparison. Variation among haplotype walks that did not involve genes was visualized using color coded lines, from red to blue to indicate directions. n = 53 UPR, 47 HPRC, 58 CPC individuals. Source data are provided as a Source Data file.

Mitochondrial pangenome characterization

The mitochondrial Arab pangenome (mtUPR) was constructed using PacBio HiFi reads from 53 individuals (Supplementary Fig. 23). The reads that mapped to ChrM with at least 90% similarity and a length greater than 15 kb were used. The average lengths of the resulting mtUPR and mtHPRC reads were 16.29 kb and 16.39 kb, respectively. The mtUPR graph encapsulating the mitochondrial heteroplasmy diversity consisted of 34,143 nodes and 61,187 edges (Fig. 5a). In contrast, the mtHPRC pangenome graph, derived from a comparable number of reads using CHM13 as a reference, consisted of 32,078 nodes and 57,494 edges. For variant calling, we used the T2T-CHM13 mitochondrial genome as the reference. The graph shows a complex and diverse structure that reflects the genetic diversity of human mitochondrial DNA. We annotated the graph using the ChrM annotations from GENCODE (v38)42, facilitating the identification of mitochondrial genes.

Fig. 5: Mitochondrial pangenome analysis and nuclear pangenome performance gain.
Fig. 5: Mitochondrial pangenome analysis and nuclear pangenome performance gain.
Full size image

a A circular representation of the mitochondrial pangenome, detailing the position and nomenclature of annotated mitochondrial genes within the pangenome. Each bubble or loop represents a haplotype. b Mitochondrial UAE-based Arab Pangenome Reference (mtUPR)variant landscape. A bar chart showcasing the number of UPR-specific small variants observed across different samples in comparison to Human Pangenome Reference Consortium (HPRC), differentiated between polymorphism (dark blue) and singleton (light blue). n = 53 individuals. c Comparative analysis of variant calling performance using linear, assembly and pangenome methods. Violin plot displaying the recall of linear variant calls using assembly-based and pangenome-based methods. n = 10 UPR individuals. d Bar graph illustrating the proportion of errors in Single Nucleotide Polymorphism (SNP) and Insertion and Deletion (Indel) variant calls using three different methods: assembly (red), linear (green), and pangenome (blue). e Mapping accuracy assessment. Box plot illustrating the percentage of properly paired reads in alignments of 9 short read whole genome sequenced Arab samples (from UAE, Saudi, Syria, and Oman) to the UPR and HPRC genomic graphs, compared to the CHM13 reference. Box plots show the 25th and 75th percentiles (interquartile range), center line represents the median, whiskers extend to the minimum and maximum values, and individual data points are overlaid. f Genotyping recall for SNPs. Box plot depicting the recall rates for genotyping of polymorphic variants in easy genomic region based on CHM13 variant calls. Easy genomic regions are defined as parts of the genome excluding segmental duplications, centromeric/satellite sequences, composite repeats, satellites, chrXY sequence classes, telomeres, and palindromes/inverted repeats. n = 9 Arab individuals. g Structural variants across samples in easy genomic regions. Line graph comparing the count of structural variants identified across Arab samples mapped to the UPR and HPRC graphs. h Line graph depicting the frequency of SV lengths across Arab samples mapped to UPR and HPRC graphs. n = 53 UPR, 47 HPRC individuals. Source data are provided as a Source Data file.

Comprehensive variant analysis within the mtUPR graph revealed 843 variant sites with a total of 3291 variants, of which 902 were unique (Supplementary Fig. 23, Supplementary Table 31a). Compared with the mtHPRC, the mtUPR revealed 623 unique small variants specific to the UAE pangenome, with 457 singletons and 166 polymorphic variants. These variants were distributed across 600 distinct sites (Fig. 5b, Supplementary Table 31b) cataloging the UPR heteroplasmy heterogeneity. A previously unreported structural variant spanning 652 bp encapsulating duplicated mitochondrial genes (MT-TF and MT-RNR1) was also detected. Insertions greater than 10 bp in length derived from the mitochondrial pangenome variant calling data cumulatively measured 1738 bp (Supplementary Table 32a). Clustering of these sequences using the cd-hit-est program with a 90% similarity threshold (-c 0.9) revealed unique previously unreported insertions that were not accounted for in the reference genomes, which included up to 1436 bp of mitochondrial DNA (Supplementary Table 32b).

Performance gains for pangenome-aided analysis

To determine the benefits of utilizing a pangenome-based approach in genomic analysis, we conducted a comprehensive evaluation comparing pangenome-based analyses with traditional methods. We assessed the performance of assembly based variant caller, PAV43 and graph-based variant caller, vg call by comparing them to a consensus set derived from HiFi long read-based variant calls using DeepVariant44 and Longshot45. The results demonstrated high overall variant recall rates, with an average of 0.990 (0.988–0.991) for PAV calls and 0.987 (0.987–0.988) for graph-based calls (Fig. 5c). This indicates that pangenome-based analyses can achieve comparable accuracy to read-based methods, emphasizing their utility in capturing diverse genetic variations efficiently. We further assessed the Mendelian concordance of the trio samples from UPR cohort by comparing read-based, assembly-based and pangenome-based variant calls (Fig. 5d). The assembly-based calls had the lowest proportion of variants inconsistent with Mendelian inheritance, with error rates of 0.0011 for SNPs and 0.0009 for indels. Further, the UPR pangenome variant calls exhibited lower errors in Mendelian concordance (with error rates of 0.0071 for SNPs and 0.0172 for indels), significantly lower than those observed by HPRC (with error rates of 0.0504 for SNPs and 0.0749 for indels).

We assessed the mapping accuracy of short reads from 9 whole genome sequenced Arab samples9 (UAE, Saudi, Syria and Oman). These samples were aligned to both UPR and HPRC graphs using the Giraffe mapper46 and for comparison, reads were aligned to CHM13 using BWA-MEM. Our analysis showed that the average mapping rate for alignment to the pangenome graph increased to 92.32% (87.34–95.64) with UPR and 92.04% (87.12–95.37) with HPRC, compared to 87.28% (81.68–90.42) when aligned to CHM13 (Fig. 5e, Supplementary Table 33). Additionally, we found that, on average, 5.04% fewer reads mapped to CHM13 compared to pangenome graph alignments, highlighting that these reads are better represented by the pangenome graph. Further analysis on the genotyping of polymorphic variants revealed high recall rates for easy genomic region (excluding satellite, repeat and segmental duplication region), with UPR achieving a 95.49% (94.70–95.70) average genotyping recall and HPRC 94.78% (94.52–95.01), as determined against CHM13 variant calls (Fig. 5f). Recall rate for UPR and HPRC for all region was 91.74% (91.21–92.04) and 90.71% (90.21–91.10) respectively (Supplementary Fig. 24a). Our comparative analysis shows that the UPR identified an average of 6.10% more structural variants in easy genomic regions compared to HPRC (Fig. 5g), with notable increases in both small and large structural variants (Fig. 5h). We conducted comprehensive genome sequencing on 4 Arab trios, each including a proband diagnosed with autism spectrum disorders (ASD), utilizing both PacBio HiFi long-read whole-genome sequencing (average coverage 21.85) and short-read whole-exome sequencing (average coverage 428) to evaluate the performance of the UPR reference. The mapping accuracy of the exome sequencing data revealed an average mapping rate of 99.67% when aligned to the CHM13 reference genome and 99.62% to the UPR graph, indicating a comparably high level of precision in genomic alignment for both references (Supplementary Fig. 24b). Our genotyping analysis revealed that when mapping whole-exome sequencing data to the UAE pangenome reference, 93.29% and 90% of missense variants identified with the UPR were also recalled using the short-read exome and long-read whole genome mapping to CHM13 reference, respectively (Supplementary Table 34). Overall, these findings demonstrate the UPR’s robust performance in variant detection and recall, particularly for clinically relevant genetic alterations, underscoring its utility in genetic research and diagnostics.

Discussion

We constructed an openly accessible pangenome reference specific to the Arab population from 53 deeply phenotyped apparently healthy adults by processing 106 high-quality haplotype assemblies. These samples were subjected to careful phenotyping to identify manifestations of the onset and progression of chronic complex diseases. Using 35.27X PacBio HiFi reads and 54.22X ultralong reads, we achieved average contig N50 of 124.28 Mb, which is 3.11-fold longer than that recently reported by HPRC. De novo assembly algorithms used 99.96% of available long-read sequences to construct the contigs. Our average assembly quality score of QV 57.53 suggested a high quality of the assemblies used to construct the pangenome. The average haplotype of the UPR samples showed 93.59% and 96.53% coverage of CHM13 and GRCh38, respectively, which is consistent with the HPRC and CPC assemblies. The coverage was lowest for the Y chromosome due to its complex Yq12 heterochromatin repeat complex region. The construction of a high-resolution reference pangenome for the Arab population provided additional insight into the unique genome organization within these populations. While PacBio HiFi data provided a base of highly accurate long reads, ONT ultralong reads contributed significantly to filling the gap in scaffolding and covering complex regions of UPR genomes. The ULK data (>100 kb) exhibited an average coverage of 12.53X for the UPR samples, contributing significantly to the extended N50 length that UPR achieved.

The CHM13 reference is the most complete human genome to date, and GRCh38 has the most complete annotations of functional elements. The UAE pangenome was constructed based on CHM13 by integrating GRCh38 and the UPR assemblies. We observed a significant overlap between UPR and HPRC + CPC variants. We anticipate that the application of the UPR in clinical settings will markedly increase the diagnostic yield for numerous single-gene disorders. The identified population-specific variants illustrate the separation time of Middle Eastern populations from other continental groups and the subsequent effects of isolation, admixture and drift9. Unfortunately, the lack of representative data from the Middle Eastern populations in large genomic initiatives (i.e., the 1000 Genomes Project, gnomAD) prevented the generation of impactful genomic resources for this region. To assess the clinical significance of thousands of previously uncharacterized UPR-specific variants, further follow up clinical cohort-based analyses are needed. One of the unresolved limitations that we observed across all pangenomes, is the mapping rate within the complex satellite regions which still suffers in accuracy. Significant care should be taken to study these regions where mutations rate is known to be multiple-fold higher47.

UPR-specific structural variants revealed 111.96 Mb sequences (range, 2.04 to 3.60 Mb of unreported sequences per haplotype) that were not present within the CHM13 and GRCh38 reference genomes or the HPRC and CPC pangenomes and DGV. These previously undescribed sequences include highly repeated satellites that are inaccessible by short-read technologies. A linear reference genome is limited by its static sequences, which hinder the ability to decipher the polymorphic nature of SVs. The clinical application of UPR pangenome-based SV detection will increase the clinical yield due to the large number of uncharacterized SVs. Moreover, the UPR will enable the detection of base resolution phased haplotype complexities to infer their association with diseases. The SVs have a profound effect on how the UPR differs from the HPRC or CPC graphs. The UPR included 235,195 unique SVs (with an average of 15,472 unique SVs per sample) that create different bubbles or haplotype walks (involving genes) within these genomic regions compared to other pangenomes or human genome references. Our findings suggest that using a larger sample size could produce a more accurate and comprehensive pangenome, capturing additional rare gene duplications and additional unique sequences. These rare genetic variants have the potential to enhance the diagnostic yield for rare genetic diseases and cancer. By employing a pangenome graph approach, we were able to identify previously undetected complex haplotypes that might contribute to various disorders and to track the origin and distribution of these haplotypes among different cohorts. In the analysis of short-read whole-genome data from Arab individuals, we observed improved accuracy in mapping paired reads using UPR compared to CHM13. Additionally, the transcriptome annotation is highly complete, particularly for protein-coding transcripts, for which the mapping rate was ~99%. However, short- or long-read sequence alignment, variant calling and phasing workflows using the pangenome remain too time-consuming for clinical laboratories. Through our analysis, we observed mapping and variant genotyping rate from short- or long-read improve when using UPR, its practical application is expected to significantly improve over time.

The mapping and genotyping accuracy demonstrated a significant advantage for UPR over linear reference CHM13. This is largely attributed to the unique UPR-specific sequences and the utilization of a graph genome that integrates multiple individuals as a reference. Advanced methodologies are needed for rapid and precise graph pangenome mapping of long-read whole genome to fully leverage the benefits of global graph pangenomes. Our short-read mapping to UPR and recalling rate on long read data shows over 90% recall rate of missense variants. The additional variants detected uniquely in the UPR mapping but not in CHM13 could likely be ascribed to differences stemming from variant detection sensitivity within the complex repetitive regions, UPR assemblies with previously uncharacterized sequences, differences in the algorithms used for read mapping, variant calling and assembly building, difference in short and long read sequencing technologies and potential assembly inaccuracies. Overall, our findings underscore the UPR’s utility in capturing a broader spectrum of genomic variants, with the potential to enhance genetic diagnosis yield and additional genomic insights into ASD and other rare disorders within Arab populations.

Our UPR-specific gene duplication analysis identified 883 unique genes involved in ribonucleotide metabolism and oxidative phosphorylation pathways. Of the duplicated genes identified, 15.06% have been previously associated with recessive conditions, which is particularly prominent in Arab populations48. Duplicated recessive genes may increase the risk factor of manifesting rare diseases, we have not found any recessive pathogenic variants within these genes in our UPR cohort. According to previous reports49, gene duplications are highly active in regions with segmental duplications. TAF11L5 is the most frequently duplicated gene in UPR cohort, it encodes TATA-box binding protein (TBP), which is predicted to be involved in RNA polymerase II preinitiation complex assembly. Our results suggest that TBP is a functionally active element within Arab populations and follow up studies will be required to assess the role of TAF11L5 gene duplication. The core conserved region of TBP that is also duplicated within the UPR, is predominantly involved in double-stranded nucleic acid binding and enables transcription initiation50.

The mitochondrial UPR contributed unreported 1436 bp sequence that was absent from the HPRC mitochondrial pangenome. This addition significantly enriches the UAE mitochondrial reference with a diverse array of haplotypes and variants. Total mtUPR unique variants impact 600 sites of the CHM13 mtDNA bases, effectively cataloging the mitochondrial heteroplasmy haplotypes found in healthy Arab populations. As the UPR cohort was clinically assessed as healthy, this Arab mitochondrial pangenome serves not only as a comprehensive reference but also as a crucial tool for mitochondrial disease diagnostics, offering valuable heteroplasmy frequency data specific to the Arab population.

The understanding of human genomic diversity and complexity depends on the collective effort to produce multiple pangenomes from different ethnicities. Currently, technological limitations hinder our ability to precisely decipher heterochromatic regions, specifically those near the centromeres. Despite utilizing various alignment algorithms, mapping reads into heterochromatic regions remain challenging as these regions harbor highly complex repeats and mobile elements. However, advancements in sequencing technologies and algorithms are expected to gradually overcome the difficulties in decoding regions with complex repeats. While we acknowledge that more samples are needed to adequately represent the genetic diversity within Arab populations, the UPR represents a crucial foundation for genetic research. Arab populations are notably underrepresented in large genomic databases, and the UPR will address this gap by offering essential resources for clinical genomic laboratories to enhance the precision of variant interpretation.

Methods

Ethical compliance

All research procedures in the study were performed in accordance with relevant ethical regulations. This study was reviewed and approved by the Institutional Review Boards of Dubai Scientific Research Ethics Committee (DSREC-12/2022_01, DAHC/MBRU-IRB/2023-15) and Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU-IRB-2017-004).

Sample collection and phenotyping

We collected 8-10 ml blood samples from 53 healthy individuals (one trio and 50 unrelated individuals) of Arab descent from the United Arab Emirates (UAE), Saudi Arabia, Oman, Jordan, Egypt, Morocco, Syria and Yemen through convenience sampling. Informed written consent was obtained from each participant regarding their participation in our research study and sharing the data for public access. In addition, we included a family trio (mother (UPR-M), father (UPR-F), and child (UPR-S)) for in-depth analysis. The inclusion criteria were self-identified Arabs who were 18 years or older, willing to participate voluntarily and presumptively healthy (free from diseases such as diabetes, cardiovascular disease, hypertension, chronic kidney disease, lung disease and liver disease). To map long-read genome and exome sequencing data to the UPR, we recruited 4 trios including children with autism spectrum disorders (ASDs) to conduct whole exome short read and whole genome long read sequencing. Furthermore, we processed short-read Illumina whole-genome sequencing data from 9 published Arab samples9.

DNA isolation and sequencing

Pacific Bioscience high-fidelity sequencing

For PacBio HiFi long-read sequencing of UPR cohort samples and ASD trios, high-molecular-weight DNA was isolated from 200 µL of flash-frozen blood using both GenFind V3 (Cat. No. C34881) and Nanobind DNA (Cat. No. 102-762-700) extraction kits per the manufacturer’s instructions. The DNA concentration was measured using a Qubit 3.0 fluorometer (Thermo Fisher, USA) with a dsDNA HS Assay kit (Thermo Fisher) (Cat. No. Q33231), and the samples were adjusted to a concentration of 30 ng/µL in a total volume of 130 µL for shearing. Shearing was performed on a Diagenode Megaruptor 3 (Diagenode, Belgium) hydropore syringe (Cat. No. E07010003) at a speed setting of 29–31, targeting 15–18 kb fragment lengths according to PacBio’s recommendations. The sheared sample fragment lengths were verified on a 4200 TapeStation (Agilent Technologies, USA) system using genomic DNA screen tape (Cat. No. 5067-5365). SMRTbell libraries were prepared using the SMRTbell Prep Kit 3.0 (Cat. No. 102-182-700), with libraries anchored to Sequel II primer 3.2 (Cat. No. 102-326-200) and Sequel II DNA polymerase 2.2. (Cat. No. 102-293-300) Sequencing was conducted on PacBio Sequel IIe equipment (Pacific Biosciences, Menlo Park, CA, USA) using SMRT cells 8 M and the Sequel II Sequencing Kit 2.0. (Cat. No. 101-820-200) The sequencing protocol was designed to enable adaptive loading, with 2 h of preextension and 30 h of movie capture. For each sample run on Sequel 11e, three SMRT cells (Cat. No. 101-389-001) were used. A total of 17 samples were processed on a Revio system (Pacific Biosciences, Menlo Park, CA, USA) using a Revio SMRT cell tray (Cat. No. 102-202-200) and a polymerase kit (Cat. No. 102-817-600) (Supplementary Methods Section B).

Oxford Nanopore Technologies (ONT) ultralong DNA sequencing

Blood samples from the UPR cohort were aliquoted into 1.8 ml cryovials, flash frozen and stored at −80 °C until further processing. For sequencing, frozen aliquots were thawed, and peripheral blood mononuclear cells (PBMCs) were isolated using red blood cell lysis buffer (Part. No. T3051-1). The purified PBMCs were counted, and approximately 60 million cells per sample were subjected to ultralong DNA extraction using the NEB Monarch Tissue DNA Extraction Kit (Cat. No. #T3060L). Sequencing libraries were then prepared with the extracted DNA using the Oxford Nanopore Technologies Ultra-Long DNA Sequencing Kit (SQK-ULK114) following the manufacturer’s protocol with minor modifications. Briefly, the DNA was tagmented at room temperature for 10 min, followed by a 10 min incubation at 75 °C. Rapid sequencing adapters were added, and the samples were incubated for 30 min at room temperature. Cleanup of the adapted DNA was performed using the precipitation star and buffers provided in the kit. The final libraries were quantified and loaded onto a minimum of 3 PromethION flow cells (R10.4.1) at approximately 30 ng per flow cell (Supplementary Table 2). Sequencing was performed for 96 h per flow cell using R10.4.1 kits, with a minimum read length setting of 1 kb.

Hi-C sequencing

The Hi-C libraries were prepared using the Qiagen EpiTect Hi-C Kit (Cat. No. 59971) with significant modifications. PBMCs were isolated from approximately 1 ml of frozen blood, with cell counts ranging from 5 × 103 to 2.5 × 106 cells per sample. The cells were pelleted using a 10× volume of RBC lysis buffer, followed by washing in PBS containing 2% FBS. The cell pellet was then fixed in 1% formaldehyde at room temperature, and crosslinking was quenched using 3 M Tris, followed by washing with PBS. The fixed cells were lysed using Hi-C lysis buffer, and the crosslinked chromatin was digested at GATC sites using a specific combination of endonuclease and buffer. This step was followed by the standard Hi-C protocol for end labeling, ligation, chromatin decrosslinking, and purification. The purified DNA was fragmented to 400–600 bp fragments using a Bioruptor Pico sonicator (Diagenode, Cat. No. B01060010) followed by purification and enrichment. The DNA was then prepared for sequencing with dual indexing using Illumina barcode adapters and amplified using Illumina sequencing primers. The final step involved the purification of amplified libraries, which were quantified, pooled, and sequenced on an Illumina NovaSeq-6000 platform (Illumina, USA), achieving read lengths of 2 × 150 bp.

Sequencing and variant QC

For ONT sequencing data, base calling was performed using Oxford Nanopore’s high-accuracy model utilizing Guppy basecaller (v6.5.7) with modifications for detecting 5-methylcytosine (Supplementary Methods Section C). We employed the high accuracy (HAC) model to maximize the precision of the derived base sequences. The PacBio sequencing data were processed using the circular consensus sequencing (CCS) algorithm. In the subsequent quality control steps, we employed NanoStat v1.6.0, which provides an in-depth statistical overview of the sequence data. Additionally, metrics obtained from the SMRT Link software were incorporated into the analysis to provide a comprehensive assessment. To refine the PacBio reads, HiFiAdapterFilt (v2.0.1) was used to remove any residual adapter sequences that might hinder the assembly process.

The PacBio human whole-genome sequencing (WGS) pipeline was utilized for read alignment and variant calling. Reads were aligned to the GRCh38 and CHM13 reference genomes using pbmm2 (v1.10.0). The CHM13 reference used was chm13v2.0_maskedY_rCRS.fa, while the hg38 reference used was human_GRCh38_no_alt_analysis_set.fasta. Variant calling was then performed using DeepVariant (v1.5.0) pbsv (v2.9) was utilized to jointly call SVs from the aligned long reads. To identify small variants, DeepVariant and GLnexus (v1.4.1) were applied for joint SNP and indel calling. Variant annotation was performed using sliVAR (v0.2.2) for small variants and svPACK for SVs, which also includes functional predictions and other annotations for the raw variant calls. All pass variants with GQ > = 10 were used for unique variant calculations. We used bedtools (v2.31.0) to compute the coverage of individual chromosomes and pericentromeric regions across all samples. We utilized Longshot45 for detecting SNPs from sequencing reads using the following command:

longshot --bam {reads.bam} --ref {reference.fa} --out {out.vcf}

Population ancestry inference

To discern the genetic diversity and ascertain the population structure of the 53 UPR genomes, we created a merged dataset of published global and Middle Eastern populations with our samples. We performed variant calling using the command:

BCFtools mpileup/call (BCFtools mpileup -f ${REFGEN} -I -E -T ${VCF} -r chr{i} -b {bam.list} -Ou | BCFtools call -Aim -C alleles -T ${TSV} -Oz -o ${OUT} -f GQ)

on a set of 596,417 variants found in the Human Origins array dataset downloaded from the Allen Ancient DNA Resource (V54.1.p1) (ref. 51). The joint called UPR VCF was merged with a set of 1040 samples from the Human Origins (HO) dataset. This dataset comprised 54 global populations from the HGDP dataset (which contains 4 Middle Eastern populations: Bedouins, Druze, Palestinians and Mozabites), in addition to other relevant Middle Eastern populations (Egyptians, Moroccans, Saudis, Yemenis, Iraqis, Syrians, and Jordanians) genotyped on the HO array20,21,22. We excluded any sample that had an ‘outlier’ label. We subsequently added additional relevant Middle Eastern samples from Emiratis, Saudis, Omanis, Yemenis, Iraqis, Jordanians, and Syrians9. The merged file was filtered for genotype quality (GQ) greater than 20, minor allele frequency (MAF) greater than 0.05, and linkage disequilibrium (LD) pruning using the following command:

plink --bfile data --maf 0.05 --indep-pairwise 50 50 0.5 --make-bed --out file

LD filtering revealed 182,322 variants. We then employed SNPRelate (v1.28.0)52 on a merged dataset with UPR (HiFi variants), 1000 Genomes (n = 2879 samples), and Human Origins (1040 samples, including 304 Arab samples). A total of 182,322 SNPs per sample were utilized for principal component analysis (PCA) and ADMIXTURE. We applied the model-based clustering ADMIXTURE tool53 to model the genetic diversity of all 53 UPR samples in the merged global dataset. We used the major continental labels from the HGDP dataset (South Asian, European, East Asian, American, African, and Oceanian), to which we added ‘Arab’ to our UPR samples. We tested k values ranging from 3 to 8 and found that k = 8 had the lowest cross-validation score.

For the relatedness analysis, we ran the following command on the pruned dataset:

plink --bfile {APR.samples.bed} --genome --min 0.2 --out {APR.samples}

This command performs a genome-wide relatedness check, printing out pairs of samples that share more than 20% identity-by-descent, effectively flagging any related individuals. Further, to confirm the unrelated status of all 50 UPR samples, we constructed a heatmap using haplotype sharing variance obtained from fineSTRUCTURE runs. In brief, variants called using DeepVariant were filtered for a GQ of ≥20. We then reduced the dataset to one variant per 10 kb window, jointly phased it with Eagle, and processed the data using the fineSTRUCTURE pipeline.

Mitochondrial and Y haplogroup classification

We calculated the mitochondrial haplogroups of the UPR samples using the Haplogrep tool (v2.4.0)54. To contextualize our findings, we extracted haplogroup frequency information for various ethnicities, including Arab regions, from published literature55,56. The nonrecombining region of the Y chromosome (NRY) haplogroups in the UPR samples was classified using the Y-LineageTracker tool (v1.3.0)57. We conducted an extensive literature review to compile Y-haplogroup frequency data across ethnicities58,59,60. Both mitochondrial and Y chromosome haplogroup distributions were visualized using hierarchical clustered heatmaps generated with the Python Seaborn library to illustrate the comparative haplotype frequencies among different populations.

De novo genome assembly

Sequence reads were screened for contaminant sequences and polished to remove artifacts prior to analysis. Kraken26 was used to taxonomically classify the assembled contigs, retaining only those annotated as human. Kraken was run using the following command:

kraken2 --memory-mapping --unclassified-out ${sample}.unclassified.fastq --classified-out ${sample}.classified.pb.fastq --output ${sample}.out.bed --threads 16 --db kraken_db ${sample}.fastq.gz

High-quality de novo assemblies were generated for each sample using a hybrid assembly approach, combining PacBio HiFi, ONT ultralong reads and Hi-C reads. A total of 53 samples were sequenced on 1–6 SMRT cells and 3–7 ONT flow cells (Supplementary Table 4). We combined PacBio HiFi reads, ultralong (ULK) reads and Hi-C reads from multiple runs of each sample and used Hifiasm (v0.19.5-r603)24 to carry out both primary and diploid de novo assembly for 53 samples (Supplementary Methods Section G). The following commands were used to generate the assemblies:

hifiasm -o ${sample} --dual-scaf -t128 --ul ${sample}.filtered.ont.fasta --h1 ${sample}_r1.sampled.fastq --h2 ${sample}_r2.sampled.80.fastq ${sample}.filtered.pb.fastq

To evaluate and compare the effectiveness of genome assembly approaches, we utilized both Verkko (v1.3.1) and Hifiasm for constructing genome assemblies of both trio and individual samples within our dataset. Our initial step involved the generation of Merqury hapmer databases, which are crucial for facilitating high-quality, k-mer-based evaluation of haplotype assemblies. We generated hapmer databases for our trio samples using Merqury with the following command:

meryl count compress k = 30 threads = 96 memory = 350 G ${sample}.*fastq.gz output ${sample}_compress.k30.meryl

Following the creation of the hapmer databases, we proceeded with the assembly process using Verkko, which was configured specifically to leverage the advantages of trio-based phased assembly. Verkko was run with the following parameters:

verkko -d ${sample}.workdir --hifi ${sample}.pb.fastq.gz --nano ${sample}.ont.fastq.gz --hap-kmers maternal_compress.k30.hapmer.meryl paternal_compress.k30.hapmer.meryl trio

Genome assembly polishing

We employed Inspector (v1.2) to evaluate and improve the quality of the 106 haplotype genome assemblies. Initially, we assessed assembly errors using Inspector, followed by an error correction step executed using the following command:

inspector-correct.py -i $asm/ --datatype pacbio-hifi -o $asm.corrected/ --skip_structural -t 64

This step was specifically designed to refine the assemblies by addressing identified inaccuracies while excluding structural errors.

Mitochondrial read pairs were identified by mapping to ChrM using minimap2 (v2.26) with the option -L -eqx, and mapped reads were filtered using samtools (v1.6) view -F 4, and any corresponding mitochondrial contigs were removed. Potential interchromosomal joins in the assembly were identified using minimap software, followed by paftools27. The specific commands initiated were -cxasm chm13v2.0.fa APR043.bp.hap1.p_ctg.fa and paftools misjoin-e APR043.1.misjoins.paf, respectively.

We employed PAV43 for identifying variants from genome assemblies. The assemblies were sourced from ‘assemblies.tsv’, and were analyzed against a reference genome file ‘config.json’. The command executed for this analysis was:

singularity run --bind “$(pwd):$(pwd)” library://becklab/pav/pav:latest -c all

Assembly quality assessment

To evaluate the quality and structural integrity of 53 primary assemblies and 106 haplotype assemblies, we used QUAST (v5.2.0)61, which provides metrics such as completeness, N50, and number of contigs. QUAST was run with the following extensive parameter set, as detailed:

quast.py -o APR043.chm13v2.0 -r chm13v2.0.fa -t 16 APR043.bp.hap1.p_ctg.fa APR043.bp.hap2.p_ctg.fa --large -e

The yak suite (v0.1-r66)24, which includes yak count, yak qv and yak trioeval, was employed for genome quality validation.

yak count -t16 -b37 -o APR043.pb.yak APR043.pb.fastq.gz

was used to build k-mer database and yak qv was used to calculate the QV score with the following parameters:

-t32 -p -K 3.2 g -l 100k APR043.pb.yak APR043.bp.hap1.p_ctg.fa > APR043.hap1.pb.yak.qv.txt

The mapping rate was assessed using minimap2 (v2.26).

We utilized Flagger (v0.3.3) to identify small-scale errors within the diploid assemblies. The HiFi reads were mapped to the concatenated assembly of both haplotypes using the following minimap2 command:

minimap2 --cs -L -Y -t 32 -ax map-hifi $sample.concat.fa.gz $sample.hifi.fq.gz

The mapped reads were then sorted using SAMtools sort. Each haplotype assembly was further mapped to chm13v2.0 using the following command:

minimap2 -t 64 -L --eqx --cs -ax asm5 chm13v2.0.fa $sample.$hap.polished.fa.gz

The Flagger WDL pipeline29 was subsequently executed using the sorted bam files as input in no-variant-calling mode. The resulting bed files were analyzed to differentiate between the haplotypes within the assemblies.

All assembled contigs were aligned to the CHM13v2 reference genome using Minimap2 with -L --eqx option. After successful mapping, the chromosome to which each contig was predominantly aligned was identified for further analysis. The centromeric region corresponding to this chromosome was then extracted from the reference genome using SAMtools faidx. To assess the alignment quality of the assembly contig to the extracted centromeric region, we employed UniAligner31, specifically using the tandem_aligner command:

tandem_aligner --first chm13v2.chr1.fa --second $sample{}.1.polished.part_h1tg000053l.fa -o ${sample}.1_h1tg000053l_chr1”

The output of this alignment is a CIGAR string and each contig’s alignment to the centromere of its respective chromosome was further analyzed. A sliding window approach was applied to each CIGAR string, using a window size of 100 bp, to calculate the alignment percentage within each window. Finally, the alignment percentages from all windows were plotted to visualize the distribution of alignment quality across different sections of the contigs.

To calculate the Hamming rate and switch errors in assemblies constructed with Hi-C data, we used pstools (v0.1)30. The phasing errors were assessed employing the following command: pstools phasing_error -t 96 $sample.1.polished.fa $sample.2.polished.fa $sample_r1.20fastq $sample_r2.20.fastq

Gene duplication

We used Liftoff (v1.6.3) (ref. 32), a tool that accurately maps gene annotations between genome assemblies, to identify gene duplications in our dataset. Liftoff aligns gene sequences from a reference (GENCODE v38) to a target genome and finds the mapping that maximizes sequence identity while preserving the gene structure. For this analysis, we used a subset of GENCODE (v38) that excludes genes located on patches or haplotypes. Additionally, we included only one copy of the genes from the X/Y pseudoautosomal regions (PAR).

An identity threshold of 90% (-sc 0.9) was used, and the following command was used:

liftoff -p 64 -sc 0.90 -copies -g GENCODE.V38.gff3 -u unmapped.txt -o gene_dup.gff3 -polish {$ASSEMBLY} {$REFERENCE}

To remove partial matches from the analysis, we used the -exclude_partial option in Liftoff. We then quantitatively assessed the frequency of copy number variations (CNVs) for each gene in the target genome by comparing the number of gene copies to that in the reference (GENCODE GRCh38.p14 (GENCODE v38)). UPR-specific gene duplications were compared against HPRC and CPC gene duplication matrices, as reported in studies utilizing similar methodologies and tool versions. We have constructed a downstream analytical tool PanScan (https://github.com/muddinmbru/panscan) for analyses of complex structural variant, gene duplication analysis and unreported sequence estimation from pangenome graph vcf file.

Pangenome graph construction

We constructed Arab pangenome variation graphs using the Minigraph-Cactus pipeline (v2.7.2)37, which combines structural variants and single-nucleotide variants (SNVs) into a single graph representation. We used two reference genomes, GRCh38 and CHM13, with CHM13 as the backbone for graph construction following the steps described in the Minigraph-Cactus documentation. Initially, an SV graph (>50 bp) was constructed using Minigraph62 by sequentially aligning the 106 haplotype assemblies to the reference genome. Centromeric and telomeric regions were masked using dna-brnn63, and the assemblies were remapped to exclude highly repetitive sequences. Contigs were then split and assigned to chromosomes based on their alignment coordinates. Base-level alignment was performed using Cactus, and the HAL output64 was converted to vg format (hal2vg). Paths >10 kb that were not aligned to the graph were removed, and the graph was normalized with GFAffix. The chromosome graphs were combined into a whole-genome graph, indexed with vg v1.50.1, and exported to VCF format (vg view). Subsequently, SNP variants were incorporated into the SV graph using the same pipeline. The assemblies were remapped to the SV graph, and quality control was applied, excluding softmasked sequences >100 kb and alignments with MAPQ < 5. Cactus was executed on the chromosome graphs to introduce base-level variants, and the outputs were converted to vg format. The same filtering, normalization, combining, and indexing steps were applied to produce the final SNP + SV Arab pangenome graph. All analyses were conducted using the CATG high-performance computing cluster Flamingo. The command line used in our end-to-end pangenome construction pipeline is as follows:

cactus-pangenome./apr-js-chm13.2908./seq_hifiasm.seq --latest --disableCaching --outName apr_review_v1_2902_chm13 --outDir apr_review_v1_2902_chm13 --reference CHM13 GRCh38 --filter 9 --giraffe clip filter --vcf --chrom-vg clip filter --gbz clip filter full --viz --gfa clip full --vcf --logFile apr-chm13.log --mgCores 160 --mapCores 160 --consCores 160 --indexCores 160 --binariesMode local --workDir wor

Pangenome growth curve

Panacus38 was used to calculate the UPR pangenome growth. The percentages of samples included were >5% (common), >95% (core) of the total samples and 1 haplotype (singleton). The command used was:

panacus ordered-histgrowth -c bp -t320 -l 1,3,27,51 -S -e apr_v1_2902_chm13.paths.chm13.txt apr_v1_2902_chm13.gfa

Identification and visualization of population-specific variants and unreported sequences

Variants were called using CHM13-T2T as the reference genome. The multiallelic sites in the VCF file were split into biallelic records using the “bcftools norm -m-“ command, and then the VCF file was separated into SNPs and other complex variants using custom Perl script. The complex variants were decomposed into SNPs, indels (<=50 bp) and SVs (>50 bp) using the command:

vcfdecompose --break-indels --break-mnps

from RTG tools (v.3.12.1), and the genotype information from identical variants across multiple samples was merged to create a single record using custom Perl script. The SNPs obtained from “vcfdecompose” were also included in the downstream analysis. The same methodology was used for the analysis of the CPC-HPRC VCF file.

Unique SVs (<80% reciprocal overlap with HPRC and CPC, 1000 Genomes, and DGV) found in the Arab assemblies were classified as population specific. The unique SVs in the UPR were identified by comparison with the HPRC-CPC SVs using the Truvari command:

truvari bench -r 1000 -C 1000 -O 0.8 -p 0.8 -P 0.0 -s 50 -S 15 --sizemax 100000

A reciprocal overlap of 80% was the criterion used to classify an SV as unique. This process collapses multiple SVs into one unique SV within our cohort. The resulting SVs were compared against the DGV Gold Standard Variants and the 1000 Genomes Project data (phase1_v3.20101123) with 80% reciprocal overlap using custom perl script. To identify unreported sequences within the UPR unique SV insertions, we clustered them using cd-hit-est (4.8.1) with the default sequence identity threshold of 0.9.

We visualized SV distributions and enrichment significance on chromosomes with RIdeogram (v0.2.2)65. Subgraphs surrounding SVs were extracted with gfabase (v0.6.0) and visualized in Bandage NG (v2022.09)66 after aligning gene sequences with GraphAligner (v1.0.17)67 using the following command:

GraphAligner -g {viz_output} -f {pc_fasta_file} -t 32 -a {pc_alignment_file} -x vg --multimap-score-fraction 0.1

The gfatools package v0.5-r287 was subsequently used to derive in-depth statistical data from the pangenome graphs.

To identify the repeat elements within unreported sequences, we screened fasta files with RepeatMasker (4.1.2-p1) using command:

RepeatMasker -e rmblast -pa 60 -species human APR_Unreported_Seq.fa -dir Result

Assessment of Mendelian concordance in trio samples

To evaluate the consistency of our trio samples’ variant calls with Mendelian concordance, we employed the PLINK command:

plink --bfile {trio-file.bed} --mendel --out {out.name}

Complex SV region and Bandage plotting

A site was considered complex if at least one variant larger than 10 kb was present, as well as at least five different alleles. The complex sites were identified by analyzing the snarls VCF files. The unique SVs were then overlapped with the complex sites to identify unique complex sites (see unique variants subsection) in the Arab population. The precise locations of the genes were determined by mapping the gene sequences to the graph using a graph aligner with the following parameter: --multimap-score 0.1. If multiple genes mapped to the same location (in the case of isoforms), only the gene mapping with the highest accuracy was retained. Complex structural variation (CSV) regions were defined as areas within a 100-kilobase (kb) window that consisted of two or more multiallelic complex SV sites, with each site containing at least one 10-kilobase (kb) SV within the haplotypes. The complex sites were plotted using Bandage. The 50 Kbp flanking region for each complex site was extracted using the Gfabase sub. To determine whether a haplotype consisted of a gene, the path was followed to determine whether it traversed through the gene region. Variation among haplotype walks that did not involve genes was visualized using color coded lines, from red to blue to indicate directions.

Mitochondrial pangenome construction

To construct a mitochondrial UAE pangenome (mtUPR) that captures the diversity of Arab mitochondrial DNA, we used high-quality long reads from 53 individuals. We mapped the reads to ChrM of the CHM13v2 reference genome using minimap2 (v2.26)68 with 90% similarity and retained only reads that were longer than 15 kb; this threshold was set to substantially reduce the chances of inadvertent nuclear DNA contamination. This resulted in a total of 20,520 reads (19,251 ONT reads and 1385 HiFi reads). We used PGGB (v0.5.4)69 to construct a mitochondrial pangenome graph from the mitochondrial contigs of HiFi reads of all individuals, and each read of >15 kb was concatenated in one fasta file along with the CHM13v2 mitochondrial chromosome. PGGB was then run on these samples with the following command: pggb -i pggb.in.90.no_dup.chm13.fasta.gz -o output_chm13_local -n 53 -t 90 -p 90 -s 100 -V ‘chm13v2:#:50’

We visualized the mtUPR graph using Bandage NG (v2022.09)66 and displayed the nodes, edges and variants in different colors and shapes.

Mitochondrial pangenome variant identification

To process the VCF files of the mtUPR pangenome, we employed a multistage approach to ensure data accuracy and integrity. The initial step involved segregation of multiallelic variant sites into biallelic records, which was accomplished using the BCFtools normalization function (BCFtools norm -m). Next, we implemented the RTG tool vcfdecompose (v3.12.1) with the parameters --break-mnps and --break-indels to further resolve complex variants into their constituent single-nucleotide polymorphisms (SNPs), indels and SVs. To merge genotype information, identical variants identified across the dataset were consolidated into singular records. This step was carried out using a custom Perl script. Parallel to the UPR data, the HPRC mitochondrial pangenome VCF file was subjected to the same rigorous processing methodology to maintain consistency across datasets. The small variants (<10 bp) were further filtered to obtain those that were concordant with DeepVariant calls of the mitochondrial reads. To identify unique variants unique to the UPR mitochondrial pangenome, we systematically removed variants that were shared with the HPRC mitochondrial pangenome. We then extracted insertions of 10 base pairs or more that were subsequently clustered using the cd-hit-est program, which applied a stringent 90% similarity threshold (-c 0.9) to discern and characterize previously unobserved insertions.

Evaluation of short read whole genome and whole exome mapping

We used 9 WGS data from Arab descent9 to quantify the mappability accuracy of short read sequences into UPR and HPRC pangenome and CHM13 reference. The quality metrics of each samples were obtained using FASTQC tool (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and the low quality bases were removed using fastp (v0.23.4).

To calculate the mapping rate of short reads to CHM13, we aligned the paired end reads to CHM13v2 using bwa-mem. The following commands were used:

bwa mem -K 100000000 -Y -R ‘@RG\tID:Sample\tLB:Sample\tPL:ILLUMINA\tPM:HISEQ\tSM:Sample’ Filtered-Sample_R1.fastq Filtered-Sample_R2.fastq -t 64 > Sample.sam samtools view -@ 64 -bS Sample.sam | samtools flagstat -@ 64 - > Sample_Flagstat.txt

The variant calling was performed using DeepVariant (v1.5.0) by providing --model_type=WGS for the whole genome samples and --model_type=WES for the whole exome samples respectively.

To determine the mapping rate of short reads to the UPR and HPRC graphs and to conduct variant calling, filtered samples were mapped to pangenomes using vg Giraffe (v1.55.0). The “vg stats” command was used to produce stats for the properly paired reads for graph references and “vg call” command was used to call the variants:

vg giraffe -p -t 64 -Z APR.giraffe.gbz -d APR.giraffe.dist -m APR.giraffe.min -x APR.giraffe.xg -f Filtered-Sample_R1.fastq -f Filtered-Sample_R2.fastq > Sample.gam vg stats -p 64 -a sample.gam > Sample_Stat.txt vg pack -t 64 -x APR.giraffe.xg -g Sample.gam -Q 5 -o Sample_aln.pack vg call -t 64 APR.giraffe.xg -k Sample_aln.pack -S CHM13 > Sample-CHM13_graph_calls.vcf

Pathway enrichment analysis

To discern the biological pathways affected by gene sets, we conducted gene set enrichment analysis (GSEA) leveraging the KEGG and GO pathway databases, a well-curated repository featuring pathways that include molecular interactions, reactions, and relation networks. Gene names were converted to Entrez gene identifiers using the bioMart tool. Our analytical framework included determining the degree of overlap between the gene sets and the pathways in the KEGG-GO database, utilizing GeneOverlap to conduct one-sided Fisher’s exact tests (FETs). We excluded pathway gene sets that numbered fewer than 50 or exceeded 1000. To account for multiple hypothesis testing and control the false discovery rate (FDR), we applied the Benjamini‒Hochberg procedure to adjust the p values obtained from Fisher’s exact test.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.