A draft UAE-based Arab pangenome reference

Nassir, Nasna; Almarri, Mohamed A.; Kumail, Muhammad; Mohamed, Nesrin; Balan, Bipin; Hanif, Shehzad; AlObathani, Maryam; Jamalalail, Bassam; Elsokary, Hanan; Kondaramage, Dasuki; Shiyas, Suhana; Kosaji, Noor; Satsangi, Dharana; Abdelmotagali, Madiha Hamdi Saif; Abou Tayoun, Ahmad; Ahmed, Olfat Zuhair Salem; Youssef, Douaa Fathi; Suwaidi, Hanan Al; Albanna, Ammar; S Du Plessis, Stefan; Khansaheb, Hamda Hassan; Alsheikh-Ali, Alawi; Uddin, Mohammed

doi:10.1038/s41467-025-61645-w

Download PDF

Article
Open access
Published: 24 July 2025

A draft UAE-based Arab pangenome reference

Nasna Nassir^1,2,
Mohamed A. Almarri ORCID: orcid.org/0000-0003-1255-0918^2,3,
Muhammad Kumail¹,
Nesrin Mohamed¹,
Bipin Balan^1,2,
Shehzad Hanif¹,
Maryam AlObathani¹,
Bassam Jamalalail¹,
Hanan Elsokary¹,
Dasuki Kondaramage¹,
Suhana Shiyas¹,
Noor Kosaji^1,2,
Dharana Satsangi²,
Madiha Hamdi Saif Abdelmotagali⁴,
Ahmad Abou Tayoun ORCID: orcid.org/0000-0002-9134-1673^5,6,
Olfat Zuhair Salem Ahmed⁴,
Douaa Fathi Youssef⁴,
Hanan Al Suwaidi²,
Ammar Albanna^2,7,
Stefan S Du Plessis ORCID: orcid.org/0000-0003-4617-4367^1,2,
Hamda Hassan Khansaheb^1,2,
Alawi Alsheikh-Ali ORCID: orcid.org/0000-0002-1213-4546^1,2,8 &
…
Mohammed Uddin ORCID: orcid.org/0000-0001-6867-5803^1,2,9

Nature Communications volume 16, Article number: 6747 (2025) Cite this article

15k Accesses
7 Citations
60 Altmetric
Metrics details

Subjects

Abstract

Pangenomes provide a robust and comprehensive portrayal of genetic diversity in humans, but Arab populations remain underrepresented. We present a preliminary UAE-based Arab Pangenome Reference (UPR) utilizing 53 individuals of diverse Arab ethnicities residing in the United Arab Emirates. We assembled nuclear and mitochondrial pangenomes using 35.27X high-fidelity long reads, 54.22X ultralong reads and 65.46X Hi-C reads. This approach yielded contiguous haplotype-phased de novo assemblies of exceptional quality, with an average N50 of 124.28 Mb. We discovered 111.96 million base pairs of previously uncharacterized euchromatic sequences absent from existing human pangenomes, the T2T-CHM13 and GRCh38 reference human genomes, and other public datasets. Moreover, we identified 8.94 million population-specific small variants and 235,195 structural variants within the Arab pangenome, not present in linear and pangenome references and public datasets. We detected 883 gene duplications, including the TATA-binding protein gene TAF11L5, which was uniquely duplicated across all Arab populations and that included 15.06% of genes associated with recessive diseases. By exploring the mitochondrial pangenome, we identified 1,436 bp of previously unreported sequences. Our study provides a valuable resource for future genetic research and genomic medicine initiatives in Arab population and other population with similar genetic backgrounds.

A pangenome reference of 36 Chinese populations

Article Open access 14 June 2023

The 1000 Chinese Pangenome empowers medical and population genetics

Article Open access 01 April 2026

A draft human pangenome reference

Article Open access 10 May 2023

Introduction

The vast complexity and diversity of the human genome has opened emerging avenues for research in genomics, leading to significant advances in our understanding of human biology and disease. However, the application of these genomic advances is limited by the quality, completeness, and representativeness of the reference genome used¹. In the pursuit of understanding the intricate tapestry of human genome variation, large-scale sequencing projects have provided invaluable insights^2,3. Recently, due to significant advancements in long-read genome technologies, the complete telomere-to-telomere (T2T) sequence of the haploid human genome CHM13⁴ and the complete sequence of the Y chromosome were obtained⁵. This finding is a remarkable achievement that fills 8% of the gaps that exist in the current reference genome GRCh38⁴. Although this sequence represents a fully resolved genome, it does not reflect human sequence diversity, and a population-wide approach is necessary to detect population-specific variants, sequences and regulatory elements. The Human Pangenome Reference Consortium (HPRC) has made significant strides in genomic research by constructing a pangenome from 47 ethnically diverse samples, adding 119 million base pairs and identifying 1115 gene duplications absent in the GRCh38 reference genome⁶. This finding led to a 104% increase in structural variant detection per haplotype⁶. A recent study in China (CPC) of 58 samples from 36 ethnic minorities identified 5.9 million small variants and 34,223 structural variants not found in the HPRC pangenome⁷. These findings underscore the value of incorporating genetically diverse individuals for a comprehensive understanding of the human genome landscape.

Arabs constitute culturally diverse communities primarily from Middle East and North Africa (MENA) regions with a combined population of nearly 500 million, comprising ~6% of the global population. Unfortunately, Arab populations lack adequate representation in large-scale sequencing projects (i.e., the gnomAD database); neither the HPRC pangenome nor the 1000 Genomes Project include samples from this demographic⁸. The ancestral complexities between different Arab ethnic backgrounds have not yet been fully understood through large-scale genomic initiatives^9,10. Moreover, the Arab population, which experiences a higher incidence of consanguineous marriages, is predisposed to an increased rate of rare recessive disorders^11,12,13. The incidence of diabetes, heart disease and cancer in Arab populations is increasing^14,15. The lack of reference genomes for Arab populations has limited the investigation of genetic diversity and the genetic underpinning of numerous diseases. Population-specific reference pangenomes will enable the identification of variants associated with diseases and sequences that are unique or prevalent in Arab populations.

In this study, we sought to generate a draft of United Arab Emirates (UAE) Pangenome Reference (UPR) from Arab individuals residing in the UAE using multiple sequencing technologies and de novo assembly construction approaches. Here, we present the preliminary construction of the UPR from 53 human genomes with high-quality de novo diploid phased assemblies and their comparison with reference genomes and genomic datasets. Our analysis also included the application of the UPR in functional annotation, short and long-read whole-genome mapping for variant detection. We anticipate that our efforts will produce a pangenome that will allow for population-specific genome interpretation, facilitating more accurate genome interpretation with fewer biases and gaps, which will enhance the precision of detecting both structural and small variants.

Results

Healthy Arab sample cohort

We constructed a pangenome reference for healthy Arab individuals from eight Arab countries (the United Arab Emirates (UAE), Saudi Arabia, Oman, Jordan, Egypt, Morocco, Syria and Yemen) all residing in the UAE (Fig. 1a). 53 samples, including 50 unrelated adults (18 years or older) and a trio of family members with no known rare or common chronic diseases (i.e., no history of hypertension, diabetes mellitus, cancer, or lung or heart disease), were enrolled (Supplementary Fig. 1, Supplementary Table 1).

**Fig. 1: Cohort characterization and sequencing quality.**

Assessment of sequencing quality and variant statistics

The genomes were sequenced using Pacific Biosciences (PacBio) high-fidelity (HiFi), Oxford Nanopore Technologies (ONT) ultralong read sequencing (ULK) and high-coverage (Hi-C) Illumina short-read sequencing methods. The average coverage for each sample was 35.27X (26.60X–69.39X) using PacBio HiFi sequencing, 54.22X (33.33X–90.87X) using ONT ultralong read sequencing and 65.46X (50.89X–91.96X) using Hi-C sequencing (Supplementary Fig. 2). The average median Q score, indicating sequencing quality, was 33.19 (28–36) for PacBio HiFi sequencing, 17.35 (16-18.2) for ONT sequencing and 37 for Hi-C. The average N50 values were 16.69 kb (14.23 kb–19.06 kb) and 57.67 (43.64 kb–87.02 kb) for PacBio HiFi and ONT ultralong read sequencing, respectively (Supplementary Fig. 2 and Supplementary Table 2). Notably, ultralong reads exceeding 100 kb contributed to a substantial yield of 2011.14 Gb, translating to 670.38X coverage (an average of 12.53X per sample) (Fig. 1b, Supplementary Table 3). When mapped against CHM13 v2.0, both PacBio HiFi and ONT reads provided valuable insights into the sequencing coverage across acrocentric and metacentric chromosomes. Due to the use of ultralong protocols, the ONT reads exhibited better mapping across all chromosomes than did the PacBio reads, particularly for the acrocentric chromosomes, where 99.49% (94.95%–100%) coverage was achieved by ONT in comparison to 95.60% (91.61%–99.75%) coverage achieved by PacBio (Fig. 1c, Supplementary Fig. 3, Supplementary Table 4a). We utilized both read types to analyze coverage across diverse regions of the human genome, including satellite DNA, centromeric transitions, and ribosomal DNA. Both platforms exhibited excellent coverage, with regions such as AlphaSat_div, AlphaSat_mon, GammaSat, and Other CenSat exhibiting average coverage above 96% (96.67%–100.00%). However, the ONT platform demonstrated a marked increase in rDNA coverage, with 99.93% (99.57%–100%) and 67.44% (52.25%–97.25%) coverage for the ONT and PacBio HiFi reads, respectively (Fig. 1d, Supplementary Table 4b).

Our joint calling analysis of PacBio HiFi data incorporating the DeepVariant pipeline (GRCh38) revealed an average of 4.21 million (M) (ranged from 4.02 M–4.48 M) single-nucleotide variants (SNVs), 913,786 (800,438–971,731) insertion/deletions (indels) and 53,618 (51,732–56,573) structural variants (SVs) per sample (Supplementary Table 5, Supplementary Fig. 4). We identified population-specific variants, including 1.40 M SNVs and 308,803 M indels, that were not previously reported in the 1000 Genomes Project², dbSNP¹⁶, gnomAD¹⁷ or GME¹⁸. We identified 158,696 population-specific structural variants not previously reported in the 1000 Genomes Project or Database of Genomic Variants (DGV)¹⁹.

Population structure of the UPR samples

We subsequently analyzed the genetic structure of 53 UPR genomes from published global and regional populations^9,20,21,22. Principal component analysis (PCA) demonstrated distinct clustering patterns of the UPR samples and Arab groups (Fig. 1e), reflecting their unique population history and diversity in contrast to other global populations (Supplementary Fig. 5a, b and Supplementary Table 6a). Using the relatedness analysis using plink, we confirmed all UPR samples are unrelated except for the trio (Supplementary Table 6b). Haplotype sharing information obtained using the fineSTRUCTURE algorithm²³ demonstrated the substructures within our dataset, reflecting the different Arab subpopulations and the unrelated status of all 50 samples (Supplementary Fig. 5c). Model-based clustering using the ADMIXTURE algorithm, which integrated the genomic data from different Arab groups, indicated that the ancestry of the UPR cohort generally reflects that of various Arab subpopulations (Supplementary Figs. 6, 7).

Our analysis revealed that the UPR cohort exhibited mitochondrial and Y chromosome haplogroup patterns closely resembling those associated with Arab populations. Specifically, the hierarchical clustering heatmaps demonstrated a significant alignment in haplogroup frequencies between our cohort and reference populations from Arab regions (Supplementary Fig. 8, Supplementary Table 7).

Assembly construction and evaluation of diverse Arab genomes

Our comparative analysis of the UPR trio revealed that both the Hifiasm²⁴ and Verkko²⁵ assemblers yielded comparable contig lengths (Supplementary Table 8). In samples lacking parental genome data, Hifiasm exhibited better performance in terms of both contig length and overall contiguity. Consequently, we performed de novo assembly using Hifiasm (v0.19.5-r603) for our entire cohort and used those assemblies for downstream analysis. Sequence quality control was performed using Kraken²⁶, which filtered out non-human eukaryotic pathogen genomes and retained reads classified as human. We excluded an additional 109 contigs that mapped to ChrM with at least 95% identity (Supplementary Table 9). Additionally, we identified and cataloged 185 interchromosomal joins in the assemblies based on their sequence identity using paftools²⁷ (Supplementary Fig. 9, Supplementary Table 10), resulting in the exclusion of four contigs due to significantly large mis-joins in acrocentric chromosomes. Inspector assembly polishing²⁸ was further carried out to reduce small-scale assembly errors.

The resulting assemblies yielded an average genome size of 3.01 Gb (2.86 Gb–3.11 Gb) (Supplementary Fig. 10, Supplementary Table 11). Haploid assemblies with an X chromosome had an average total length of 3.02 Gb, representing 99.1% of the T2T-CHM13 (3.06 Gb). Conversely, haploid assemblies with a Y chromosome averaged a total length of 2.98 Gb, highlighting the inherent size difference between the sex chromosomes. On average, the assemblies comprised 103 contigs (ranged from 58 to 245), with a contig N50 length of 124.28 Mb (86.26 Mb–146.15 Mb) (Fig. 2a) and an average auN of 122.48 Mb (87.77 Mb–146.00 Mb) (Supplementary Fig. 11). This value exceeds the GRCh38 contig N50 of 57.88 Mb, demonstrating enhanced contiguity across all our assemblies. All samples surpassed the 40 Mb N50 of the HPRC reference genome (Supplementary Fig. 11). The average QV of our assemblies, computed using yak (v0.1-r66), was 57.53 (52.85–62.59), indicating robust assembly quality (Fig. 2b). We compared our assemblies to the CHM13 and GRCh38 reference genomes and revealed an average coverage of 93.59% (92.79% for males, 94.24% for females) and 96.53% (95.49% for males, 97.39% for females), respectively (Fig. 2c). We identified an average of 49.86 Mb (29.31 Mb–73.14 Mb) and 76.81 Mb (54.74 Mb–108.31 Mb) of assembled contigs that did not align with CHM13 and GRCh38, respectively (Fig. 2d, Supplementary Fig. 12). This observation is consistent with the unmapped rates reported in diverse pangenome studies, reflecting the complexity of the genome assembly and alignment processes. The duplication ratio, indicative of read mapping multiplicity against reference genomes, averaged 1.03 (1.023–1.047) and 1.012 (1.005–1.022) for GRCh38 and CHM13, respectively (Supplementary Fig. 12), highlighting the precision of our assembly with minimal duplicated regions.

**Fig. 2: Quality assessment of 53 phased diploid assemblies and gene duplication analysis.**

To evaluate the reliability of our assembled genomic sequences, we employed the Flagger tool²⁹, which is designed to identify misassemblies within a phased diploid assembly. Our analysis using Flagger revealed that a minimal proportion (1.28% or 37.91 Mb) of the total assembly was deemed unreliable (Fig. 1e, Supplementary Table 12), indicating the high accuracy of the UPR assemblies.

The phasing accuracy of the trio samples was evaluated using yak trioeval, revealing a switch error rate and Hamming error of 0.18% and 0.23%, respectively. For the samples with Hi-C data, the phasing accuracy was computed using pstools³⁰, with an average switch error rate and Hamming error of 0.64% (0.21%–1.07%) and 1.96% (0.76%–3.21%), respectively (Supplementary Table 13a). To evaluate the mapping accuracy of our assembled contigs within the most complex centromeric regions, we employed UniAligner³¹ to calculate the mapping percentages at 100 bp intervals across these regions. Our analysis revealed the average mapping rate of 26.97% for UPR (Supplementary Fig. 13), compared to 20.42% for HPRC, highlighting the accuracy and reliability of UPR assemblies in representing complex genomic regions such as centromeres. Largest scaffolds in our cohort were identified at the chromosome level using strict criteria, requiring that each scaffold predominantly map to a single chromosome with a maximum of three gaps and cover a significant portion of the chromosome’s length, excluding regions composed of microsatellites. Applying these criteria, we identified 417 chromosome-scale scaffolds in the UPR genome assemblies (Supplementary Table 13b). These comprehensive quality control assessments verified the high contiguity and accuracy of our de novo assembled genomes, providing a solid foundation for downstream genomic and clinical applications.

Gene duplications in UPR assemblies

To investigate gene duplication events within UPR diploid assemblies, we employed the Liftoff (v1.6.3)³² tool and annotated the genes using a subset of GENCODE GRCh38.p14 (GENCODE release 38) (detailed in Methods). The identified gene duplications revealed unique and diverse genomic features of the Arab population. Remarkably, a median of 99.15% (96.20%–99.34%) of protein-coding genes and 99.13% (95.86%–99.28%) of protein-coding transcripts were identified across all the UPR assemblies (Fig. 2f). Each genome had an average of 35 genes with a gain in copy number relative to GRCh38 (Fig. 2g, Supplementary Table 14). We observed that 37.83% of duplicated genes were present as singletons, whereas 8.17% were common (>5%) among UPR assemblies.

We performed a comparative assessment of gene duplications within the UPR, HPRC, and CPC assemblies to identify genes that were specific to the Arab population. We identified 1135 duplicated genes unique to the UPR assemblies that were absent from the HPRC assemblies and a further 883 duplicated genes that were absent from both the HPRC and CPC assemblies (Fig. 2h). In addition, 38 duplicated genes from the UPR assemblies were also observed in the HPRC and CPC assemblies. Among these overlapping genes, some genes were present at a greater frequency (≥5%) in the UPR assemblies than in the HPRC and CPC assemblies (Fig. 2i, j Supplementary Table 15), such as the USP17L genes. The USP17L genes encode ubiquitin-specific proteases, which are involved in various cellular processes, such as protein degradation, DNA repair, and cell cycle regulation^33,34.

Among the unique UPR duplicated genes, the TAF11L5 gene was present in all UPR assemblies but absent in the HPRC and CPC assemblies. The TAF11L5 gene encodes a TATA-box binding protein (TBP)-associated factor, which is predicted to be involved in the assembly of the RNA polymerase II preinitiation complex, a large complex of proteins that initiates the transcription of protein-coding genes³⁵. These unique gene duplications were predominantly located on submetacentric chromosomes (Fig. 2k, Supplementary Table 16), primarily chromosomes 16 and 1 (Fig. 2l). Conversely, the acrocentric chromosomes had a low number of duplication events. This observation may be attributed to the inherent challenges in resolving these regions. Notably, unique gene duplications were present in microsatellite regions (191 out of 883), primarily within centromeric satellites (Fig. 2m, Supplementary Table 17). Furthermore, 15.06% of unique UPR duplicated genes were implicated in recessive conditions, compared to 14.61% of unique HPRC genes and 11.25% of unique CPC genes (Supplementary Table 17). Using the Horizon platform, 1.70% of unique duplicated genes were annotated as pan-ethnic disease genes associated with severe phenotypes³⁶ (Supplementary Table 18).

Recurrent gene duplications specific to UPR assemblies further highlight the unique genetic landscape, with 255 and 123 duplicated genes not found in the HPRC and CPC assemblies, respectively (Supplementary Fig. 14, Supplementary Table 19). In addition, 13 duplicated genes from the UPR assemblies were also observed in the HPRC and CPC assemblies.

Pangenome graph construction and Arab genome-specific variants

We constructed a pangenome graph incorporating 53 Arab genomes using Minigraph-Cactus (v2.7.2)³⁷, integrating 106 long-read haplotype assemblies into a graph structure. The graph was seeded with the CHM13 (as the backbone) and GRCh38 reference genomes and expanded using Minigraph to incorporate the UPR assemblies. CHM13-T2T was used as the primary reference for variant calling. The resulting pangenome encompassed a total length of 3,330,235,835 bp, with 80,141,431 nodes and 110,455,287 edges, whereas the HPRC pangenome has a total length of 3,328,787,872 bp (Supplementary Table 20, Supplementary Fig. 15).

We conducted a comparative analysis of the UAE pangenome graph with the combined HPRC and CPC pangenome graph, which represents a diverse set of human genomes from different populations. Each genome in our cohort contained an average of 5.45 M (5.11 M–5.99 M) small variants (Fig. 3a, Supplementary Table 21). Our analysis revealed an average of 899,585 (782,486–1,077,439) unique small variants per sample (Fig. 3b) and 13.70 M small variants that were unique to UPR (Fig. 3c), not present in HPRC-CPC, CHM13 or GRCh38. Furthermore, we identified 8.94 M population-specific small variants absent in the HPRC-CPC pangenomes, CHM13, GRCh38 and variant databases, including dbSNP¹⁶, gnomAD¹⁷, 1000G², or GME¹⁸. SV analysis revealed that each sample contained an average of 33,383 (30,811–37,314) structural variants (Fig. 3d) and 15,475 (14,157–17,444) unique SVs (Fig. 3e) not found in HPRC-CPC, CHM13 or GRCh38. Moreover, we identified 235,195 population-specific SVs (Fig. 3f) that were absent in the HPRC-CPC pangenomes, CHM13, GRCh38 and other databases, including DGV¹⁹ and 1000G².

**Fig. 3: Arab genome specific sequences.**

The average SV length was 3.08 kb (1.17 kb–974 kb), and the median was 154 bp (Supplementary Fig. 16). These findings validated the assembly integrity by showing the Alu and LINE-1 repeat content along the chromosomes. We further analyzed the Arab-specific SVs in the autosomes (Fig. 3g, Supplementary Table 22a), involving 8995 unique genes (Supplementary Table 22b, Supplementary Table 23). These regions may harbor important genomic features that are specific to the Arab population.

Pangenome growth was assessed using Panacus³⁸ and we identified a total of 212.94 Mb of nonreference sequences that were added from the 53 diploid genomes. Of these, 35.30 Mb were classified as singletons, representing unique genetic variations specific to individual genomes. Furthermore, 2.60 Mb was present in ≥95% of all analyzed haplotypes, which can be considered the core genome of our sampled populations. Additionally, 133.57 Mb of nonreference sequences were identified in ≥5% of the haplotypes, representing a common, but not necessarily unique, sequence among the Arab populations (Fig. 3h).

Uncharacterized euchromatic sequences in the UPR

We measured the number of previously uncharacterized sequences derived from SV insertions that are not present in GRCh38, T2T-CHM13, HPRC, CPC or DGV revealing that each Arab diploid genome harbored an average of 4616 (4049–5787) unreported sequences (Supplementary Figs. 17, 18a). A small percentage of the total unrepresented sequences were present in centromeric and telomeric regions (1.62% and 5.53%, respectively) (Supplementary Fig. 17). The pangenome graph added 111.96 Mb of unique non-reference sequences from the 106 haplotype Arab genomes, including 104.04 Mb of singletons. The average non-reference sequence length per individual was 2.61 Mb (range 2.04–3.60 Mb) (Fig. 3i, Supplementary Table 24). Nearly one-fifth (22.80%) of the previously unreported sequences were in microsatellite regions (Supplementary Fig. 18b), especially in the pericentromeric satellite, HSat3 and other centromeric satellites. To characterize the context of these previously unidentified sequences located within 84,311 loci, we performed repeat annotation using RepeatMasker and found that LINE, SINE, LTR and satellite repeats constituted 8.22%, 8.38%, 4.99% and 26% of the uncharacterized sequences, respectively (Supplementary Fig. 18c; Supplementary Table 25). These uncharacterized sequences may represent previously unrecognized functional elements or structural variations that are missed by conventional methods. These results demonstrate the power and utility of using long-read sequencing and pangenome graph construction to capture the genomic diversity and complexity of human populations.

Complex structural variation in the UPR pangenome graph

We used a pangenome graph to visualize and analyze complex structural variation (CSV) in the UPR samples. A complex structural variation site was defined as an SV site with at least one 10 kb SV with a minimum of 5 haplotypes. We found that 1.40% (733 out of 52,465) of these SV sites were complex multiallelic bubbles (Supplementary Table 21a) and 406 sites were UPR specific (Supplementary Table 26b). The CSVs were dispersed across all chromosomes, with a significant concentration on acrocentric chromosomes (average 38.3 CSVs), particularly chromosome 22, which had the highest count (60 CSVs). UPR CSVs were predominantly located in pericentromeric satellite, HSat3 and alpha satellite DNA (Supplementary Fig. 19). Our analysis revealed that most CSVs were approximately 10 kb in length, which was the established lower bound for CSV inclusion in our analysis (Supplementary Fig. 20). Utilizing rigorous criteria, we delineated complex structural variation regions as areas within a 100 kb window that consisted of two or more multiallelic SV sites, with each site containing at least one 10 kb SV within the haplotypes. This approach led to the discovery of 659 CSV regions (Supplementary Tables 27 and 28). We further investigated the complex variant regions (Supplementary Table 29a) by overlapping them with UPR-specific SVs (Supplementary Table 29b).

We compared the allelic variations in the UPR to those in the combined HPRC + CPC pangenomes in regions of clinical relevance: the PRAMEF (Fig. 4a, Supplementary Table 30) and POLR2J3 - SPDYE2 (Fig. 4b) gene regions. In the PRAMEF region, we observed a common allele present in HPRC + CPC pangenomes that was also present in 86.80% of the UPR samples, while others were unique haplotypes to the UPR pangenome. We observed 2 unique haplotypes that were exclusively found in 13.20% of the Arab cohort. Preferentially expressed antigen in melanoma (PRAME) belongs to a group of cancer/testis antigens that are mainly expressed in the testis and an array of tumors and play crucial roles in immunity and reproduction³⁹. In the POLR2J3 - SPDYE2 region, we observed that seven alleles present in the HPRC + CPC pangenomes were absent in the UPR pangenome. We observed three previously undescribed allele exclusively present in UPR (absent in HPRC + CPC, CHM13 and GRCh38). The POLR2J gene encodes a subunit of RNA polymerase II, which is essential for synthesizing messenger RNA in eukaryotes⁴⁰. The SPDYE2 gene encodes a member of the Speedy/RINGO cell cycle regulator family and is involved in protein kinase binding activity⁴¹. Furthermore, allelic variations were observed in the HLA-DRB region that were absent in the references (Supplementary Fig. 21). Other CSV regions containing CYP2D6 exhibited variable haplotype presences within the UPR that were different from those of the CHM13 or GRCh38 references (Supplementary Fig. 22).

**Fig. 4: Visualizing complex structural variation region.**

Mitochondrial pangenome characterization

The mitochondrial Arab pangenome (mtUPR) was constructed using PacBio HiFi reads from 53 individuals (Supplementary Fig. 23). The reads that mapped to ChrM with at least 90% similarity and a length greater than 15 kb were used. The average lengths of the resulting mtUPR and mtHPRC reads were 16.29 kb and 16.39 kb, respectively. The mtUPR graph encapsulating the mitochondrial heteroplasmy diversity consisted of 34,143 nodes and 61,187 edges (Fig. 5a). In contrast, the mtHPRC pangenome graph, derived from a comparable number of reads using CHM13 as a reference, consisted of 32,078 nodes and 57,494 edges. For variant calling, we used the T2T-CHM13 mitochondrial genome as the reference. The graph shows a complex and diverse structure that reflects the genetic diversity of human mitochondrial DNA. We annotated the graph using the ChrM annotations from GENCODE (v38)⁴², facilitating the identification of mitochondrial genes.

**Fig. 5: Mitochondrial pangenome analysis and nuclear pangenome performance gain.**

Comprehensive variant analysis within the mtUPR graph revealed 843 variant sites with a total of 3291 variants, of which 902 were unique (Supplementary Fig. 23, Supplementary Table 31a). Compared with the mtHPRC, the mtUPR revealed 623 unique small variants specific to the UAE pangenome, with 457 singletons and 166 polymorphic variants. These variants were distributed across 600 distinct sites (Fig. 5b, Supplementary Table 31b) cataloging the UPR heteroplasmy heterogeneity. A previously unreported structural variant spanning 652 bp encapsulating duplicated mitochondrial genes (MT-TF and MT-RNR1) was also detected. Insertions greater than 10 bp in length derived from the mitochondrial pangenome variant calling data cumulatively measured 1738 bp (Supplementary Table 32a). Clustering of these sequences using the cd-hit-est program with a 90% similarity threshold (-c 0.9) revealed unique previously unreported insertions that were not accounted for in the reference genomes, which included up to 1436 bp of mitochondrial DNA (Supplementary Table 32b).

Performance gains for pangenome-aided analysis

To determine the benefits of utilizing a pangenome-based approach in genomic analysis, we conducted a comprehensive evaluation comparing pangenome-based analyses with traditional methods. We assessed the performance of assembly based variant caller, PAV⁴³ and graph-based variant caller, vg call by comparing them to a consensus set derived from HiFi long read-based variant calls using DeepVariant⁴⁴ and Longshot⁴⁵. The results demonstrated high overall variant recall rates, with an average of 0.990 (0.988–0.991) for PAV calls and 0.987 (0.987–0.988) for graph-based calls (Fig. 5c). This indicates that pangenome-based analyses can achieve comparable accuracy to read-based methods, emphasizing their utility in capturing diverse genetic variations efficiently. We further assessed the Mendelian concordance of the trio samples from UPR cohort by comparing read-based, assembly-based and pangenome-based variant calls (Fig. 5d). The assembly-based calls had the lowest proportion of variants inconsistent with Mendelian inheritance, with error rates of 0.0011 for SNPs and 0.0009 for indels. Further, the UPR pangenome variant calls exhibited lower errors in Mendelian concordance (with error rates of 0.0071 for SNPs and 0.0172 for indels), significantly lower than those observed by HPRC (with error rates of 0.0504 for SNPs and 0.0749 for indels).

We assessed the mapping accuracy of short reads from 9 whole genome sequenced Arab samples⁹ (UAE, Saudi, Syria and Oman). These samples were aligned to both UPR and HPRC graphs using the Giraffe mapper⁴⁶ and for comparison, reads were aligned to CHM13 using BWA-MEM. Our analysis showed that the average mapping rate for alignment to the pangenome graph increased to 92.32% (87.34–95.64) with UPR and 92.04% (87.12–95.37) with HPRC, compared to 87.28% (81.68–90.42) when aligned to CHM13 (Fig. 5e, Supplementary Table 33). Additionally, we found that, on average, 5.04% fewer reads mapped to CHM13 compared to pangenome graph alignments, highlighting that these reads are better represented by the pangenome graph. Further analysis on the genotyping of polymorphic variants revealed high recall rates for easy genomic region (excluding satellite, repeat and segmental duplication region), with UPR achieving a 95.49% (94.70–95.70) average genotyping recall and HPRC 94.78% (94.52–95.01), as determined against CHM13 variant calls (Fig. 5f). Recall rate for UPR and HPRC for all region was 91.74% (91.21–92.04) and 90.71% (90.21–91.10) respectively (Supplementary Fig. 24a). Our comparative analysis shows that the UPR identified an average of 6.10% more structural variants in easy genomic regions compared to HPRC (Fig. 5g), with notable increases in both small and large structural variants (Fig. 5h). We conducted comprehensive genome sequencing on 4 Arab trios, each including a proband diagnosed with autism spectrum disorders (ASD), utilizing both PacBio HiFi long-read whole-genome sequencing (average coverage 21.85) and short-read whole-exome sequencing (average coverage 428) to evaluate the performance of the UPR reference. The mapping accuracy of the exome sequencing data revealed an average mapping rate of 99.67% when aligned to the CHM13 reference genome and 99.62% to the UPR graph, indicating a comparably high level of precision in genomic alignment for both references (Supplementary Fig. 24b). Our genotyping analysis revealed that when mapping whole-exome sequencing data to the UAE pangenome reference, 93.29% and 90% of missense variants identified with the UPR were also recalled using the short-read exome and long-read whole genome mapping to CHM13 reference, respectively (Supplementary Table 34). Overall, these findings demonstrate the UPR’s robust performance in variant detection and recall, particularly for clinically relevant genetic alterations, underscoring its utility in genetic research and diagnostics.

Discussion

We constructed an openly accessible pangenome reference specific to the Arab population from 53 deeply phenotyped apparently healthy adults by processing 106 high-quality haplotype assemblies. These samples were subjected to careful phenotyping to identify manifestations of the onset and progression of chronic complex diseases. Using 35.27X PacBio HiFi reads and 54.22X ultralong reads, we achieved average contig N50 of 124.28 Mb, which is 3.11-fold longer than that recently reported by HPRC. De novo assembly algorithms used 99.96% of available long-read sequences to construct the contigs. Our average assembly quality score of QV 57.53 suggested a high quality of the assemblies used to construct the pangenome. The average haplotype of the UPR samples showed 93.59% and 96.53% coverage of CHM13 and GRCh38, respectively, which is consistent with the HPRC and CPC assemblies. The coverage was lowest for the Y chromosome due to its complex Yq12 heterochromatin repeat complex region. The construction of a high-resolution reference pangenome for the Arab population provided additional insight into the unique genome organization within these populations. While PacBio HiFi data provided a base of highly accurate long reads, ONT ultralong reads contributed significantly to filling the gap in scaffolding and covering complex regions of UPR genomes. The ULK data (>100 kb) exhibited an average coverage of 12.53X for the UPR samples, contributing significantly to the extended N50 length that UPR achieved.

The CHM13 reference is the most complete human genome to date, and GRCh38 has the most complete annotations of functional elements. The UAE pangenome was constructed based on CHM13 by integrating GRCh38 and the UPR assemblies. We observed a significant overlap between UPR and HPRC + CPC variants. We anticipate that the application of the UPR in clinical settings will markedly increase the diagnostic yield for numerous single-gene disorders. The identified population-specific variants illustrate the separation time of Middle Eastern populations from other continental groups and the subsequent effects of isolation, admixture and drift⁹. Unfortunately, the lack of representative data from the Middle Eastern populations in large genomic initiatives (i.e., the 1000 Genomes Project, gnomAD) prevented the generation of impactful genomic resources for this region. To assess the clinical significance of thousands of previously uncharacterized UPR-specific variants, further follow up clinical cohort-based analyses are needed. One of the unresolved limitations that we observed across all pangenomes, is the mapping rate within the complex satellite regions which still suffers in accuracy. Significant care should be taken to study these regions where mutations rate is known to be multiple-fold higher⁴⁷.

UPR-specific structural variants revealed 111.96 Mb sequences (range, 2.04 to 3.60 Mb of unreported sequences per haplotype) that were not present within the CHM13 and GRCh38 reference genomes or the HPRC and CPC pangenomes and DGV. These previously undescribed sequences include highly repeated satellites that are inaccessible by short-read technologies. A linear reference genome is limited by its static sequences, which hinder the ability to decipher the polymorphic nature of SVs. The clinical application of UPR pangenome-based SV detection will increase the clinical yield due to the large number of uncharacterized SVs. Moreover, the UPR will enable the detection of base resolution phased haplotype complexities to infer their association with diseases. The SVs have a profound effect on how the UPR differs from the HPRC or CPC graphs. The UPR included 235,195 unique SVs (with an average of 15,472 unique SVs per sample) that create different bubbles or haplotype walks (involving genes) within these genomic regions compared to other pangenomes or human genome references. Our findings suggest that using a larger sample size could produce a more accurate and comprehensive pangenome, capturing additional rare gene duplications and additional unique sequences. These rare genetic variants have the potential to enhance the diagnostic yield for rare genetic diseases and cancer. By employing a pangenome graph approach, we were able to identify previously undetected complex haplotypes that might contribute to various disorders and to track the origin and distribution of these haplotypes among different cohorts. In the analysis of short-read whole-genome data from Arab individuals, we observed improved accuracy in mapping paired reads using UPR compared to CHM13. Additionally, the transcriptome annotation is highly complete, particularly for protein-coding transcripts, for which the mapping rate was ~99%. However, short- or long-read sequence alignment, variant calling and phasing workflows using the pangenome remain too time-consuming for clinical laboratories. Through our analysis, we observed mapping and variant genotyping rate from short- or long-read improve when using UPR, its practical application is expected to significantly improve over time.

The mapping and genotyping accuracy demonstrated a significant advantage for UPR over linear reference CHM13. This is largely attributed to the unique UPR-specific sequences and the utilization of a graph genome that integrates multiple individuals as a reference. Advanced methodologies are needed for rapid and precise graph pangenome mapping of long-read whole genome to fully leverage the benefits of global graph pangenomes. Our short-read mapping to UPR and recalling rate on long read data shows over 90% recall rate of missense variants. The additional variants detected uniquely in the UPR mapping but not in CHM13 could likely be ascribed to differences stemming from variant detection sensitivity within the complex repetitive regions, UPR assemblies with previously uncharacterized sequences, differences in the algorithms used for read mapping, variant calling and assembly building, difference in short and long read sequencing technologies and potential assembly inaccuracies. Overall, our findings underscore the UPR’s utility in capturing a broader spectrum of genomic variants, with the potential to enhance genetic diagnosis yield and additional genomic insights into ASD and other rare disorders within Arab populations.

Our UPR-specific gene duplication analysis identified 883 unique genes involved in ribonucleotide metabolism and oxidative phosphorylation pathways. Of the duplicated genes identified, 15.06% have been previously associated with recessive conditions, which is particularly prominent in Arab populations⁴⁸. Duplicated recessive genes may increase the risk factor of manifesting rare diseases, we have not found any recessive pathogenic variants within these genes in our UPR cohort. According to previous reports⁴⁹, gene duplications are highly active in regions with segmental duplications. TAF11L5 is the most frequently duplicated gene in UPR cohort, it encodes TATA-box binding protein (TBP), which is predicted to be involved in RNA polymerase II preinitiation complex assembly. Our results suggest that TBP is a functionally active element within Arab populations and follow up studies will be required to assess the role of TAF11L5 gene duplication. The core conserved region of TBP that is also duplicated within the UPR, is predominantly involved in double-stranded nucleic acid binding and enables transcription initiation⁵⁰.

The mitochondrial UPR contributed unreported 1436 bp sequence that was absent from the HPRC mitochondrial pangenome. This addition significantly enriches the UAE mitochondrial reference with a diverse array of haplotypes and variants. Total mtUPR unique variants impact 600 sites of the CHM13 mtDNA bases, effectively cataloging the mitochondrial heteroplasmy haplotypes found in healthy Arab populations. As the UPR cohort was clinically assessed as healthy, this Arab mitochondrial pangenome serves not only as a comprehensive reference but also as a crucial tool for mitochondrial disease diagnostics, offering valuable heteroplasmy frequency data specific to the Arab population.

The understanding of human genomic diversity and complexity depends on the collective effort to produce multiple pangenomes from different ethnicities. Currently, technological limitations hinder our ability to precisely decipher heterochromatic regions, specifically those near the centromeres. Despite utilizing various alignment algorithms, mapping reads into heterochromatic regions remain challenging as these regions harbor highly complex repeats and mobile elements. However, advancements in sequencing technologies and algorithms are expected to gradually overcome the difficulties in decoding regions with complex repeats. While we acknowledge that more samples are needed to adequately represent the genetic diversity within Arab populations, the UPR represents a crucial foundation for genetic research. Arab populations are notably underrepresented in large genomic databases, and the UPR will address this gap by offering essential resources for clinical genomic laboratories to enhance the precision of variant interpretation.

Methods

Ethical compliance

All research procedures in the study were performed in accordance with relevant ethical regulations. This study was reviewed and approved by the Institutional Review Boards of Dubai Scientific Research Ethics Committee (DSREC-12/2022_01, DAHC/MBRU-IRB/2023-15) and Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU-IRB-2017-004).

Sample collection and phenotyping

We collected 8-10 ml blood samples from 53 healthy individuals (one trio and 50 unrelated individuals) of Arab descent from the United Arab Emirates (UAE), Saudi Arabia, Oman, Jordan, Egypt, Morocco, Syria and Yemen through convenience sampling. Informed written consent was obtained from each participant regarding their participation in our research study and sharing the data for public access. In addition, we included a family trio (mother (UPR-M), father (UPR-F), and child (UPR-S)) for in-depth analysis. The inclusion criteria were self-identified Arabs who were 18 years or older, willing to participate voluntarily and presumptively healthy (free from diseases such as diabetes, cardiovascular disease, hypertension, chronic kidney disease, lung disease and liver disease). To map long-read genome and exome sequencing data to the UPR, we recruited 4 trios including children with autism spectrum disorders (ASDs) to conduct whole exome short read and whole genome long read sequencing. Furthermore, we processed short-read Illumina whole-genome sequencing data from 9 published Arab samples⁹.

DNA isolation and sequencing

Pacific Bioscience high-fidelity sequencing

For PacBio HiFi long-read sequencing of UPR cohort samples and ASD trios, high-molecular-weight DNA was isolated from 200 µL of flash-frozen blood using both GenFind V3 (Cat. No. C34881) and Nanobind DNA (Cat. No. 102-762-700) extraction kits per the manufacturer’s instructions. The DNA concentration was measured using a Qubit 3.0 fluorometer (Thermo Fisher, USA) with a dsDNA HS Assay kit (Thermo Fisher) (Cat. No. Q33231), and the samples were adjusted to a concentration of 30 ng/µL in a total volume of 130 µL for shearing. Shearing was performed on a Diagenode Megaruptor 3 (Diagenode, Belgium) hydropore syringe (Cat. No. E07010003) at a speed setting of 29–31, targeting 15–18 kb fragment lengths according to PacBio’s recommendations. The sheared sample fragment lengths were verified on a 4200 TapeStation (Agilent Technologies, USA) system using genomic DNA screen tape (Cat. No. 5067-5365). SMRTbell libraries were prepared using the SMRTbell Prep Kit 3.0 (Cat. No. 102-182-700), with libraries anchored to Sequel II primer 3.2 (Cat. No. 102-326-200) and Sequel II DNA polymerase 2.2. (Cat. No. 102-293-300) Sequencing was conducted on PacBio Sequel IIe equipment (Pacific Biosciences, Menlo Park, CA, USA) using SMRT cells 8 M and the Sequel II Sequencing Kit 2.0. (Cat. No. 101-820-200) The sequencing protocol was designed to enable adaptive loading, with 2 h of preextension and 30 h of movie capture. For each sample run on Sequel 11e, three SMRT cells (Cat. No. 101-389-001) were used. A total of 17 samples were processed on a Revio system (Pacific Biosciences, Menlo Park, CA, USA) using a Revio SMRT cell tray (Cat. No. 102-202-200) and a polymerase kit (Cat. No. 102-817-600) (Supplementary Methods Section B).

Oxford Nanopore Technologies (ONT) ultralong DNA sequencing

Blood samples from the UPR cohort were aliquoted into 1.8 ml cryovials, flash frozen and stored at −80 °C until further processing. For sequencing, frozen aliquots were thawed, and peripheral blood mononuclear cells (PBMCs) were isolated using red blood cell lysis buffer (Part. No. T3051-1). The purified PBMCs were counted, and approximately 60 million cells per sample were subjected to ultralong DNA extraction using the NEB Monarch Tissue DNA Extraction Kit (Cat. No. #T3060L). Sequencing libraries were then prepared with the extracted DNA using the Oxford Nanopore Technologies Ultra-Long DNA Sequencing Kit (SQK-ULK114) following the manufacturer’s protocol with minor modifications. Briefly, the DNA was tagmented at room temperature for 10 min, followed by a 10 min incubation at 75 °C. Rapid sequencing adapters were added, and the samples were incubated for 30 min at room temperature. Cleanup of the adapted DNA was performed using the precipitation star and buffers provided in the kit. The final libraries were quantified and loaded onto a minimum of 3 PromethION flow cells (R10.4.1) at approximately 30 ng per flow cell (Supplementary Table 2). Sequencing was performed for 96 h per flow cell using R10.4.1 kits, with a minimum read length setting of 1 kb.

Hi-C sequencing

The Hi-C libraries were prepared using the Qiagen EpiTect Hi-C Kit (Cat. No. 59971) with significant modifications. PBMCs were isolated from approximately 1 ml of frozen blood, with cell counts ranging from 5 × 10³ to 2.5 × 10⁶ cells per sample. The cells were pelleted using a 10× volume of RBC lysis buffer, followed by washing in PBS containing 2% FBS. The cell pellet was then fixed in 1% formaldehyde at room temperature, and crosslinking was quenched using 3 M Tris, followed by washing with PBS. The fixed cells were lysed using Hi-C lysis buffer, and the crosslinked chromatin was digested at GATC sites using a specific combination of endonuclease and buffer. This step was followed by the standard Hi-C protocol for end labeling, ligation, chromatin decrosslinking, and purification. The purified DNA was fragmented to 400–600 bp fragments using a Bioruptor Pico sonicator (Diagenode, Cat. No. B01060010) followed by purification and enrichment. The DNA was then prepared for sequencing with dual indexing using Illumina barcode adapters and amplified using Illumina sequencing primers. The final step involved the purification of amplified libraries, which were quantified, pooled, and sequenced on an Illumina NovaSeq-6000 platform (Illumina, USA), achieving read lengths of 2 × 150 bp.

Sequencing and variant QC

For ONT sequencing data, base calling was performed using Oxford Nanopore’s high-accuracy model utilizing Guppy basecaller (v6.5.7) with modifications for detecting 5-methylcytosine (Supplementary Methods Section C). We employed the high accuracy (HAC) model to maximize the precision of the derived base sequences. The PacBio sequencing data were processed using the circular consensus sequencing (CCS) algorithm. In the subsequent quality control steps, we employed NanoStat v1.6.0, which provides an in-depth statistical overview of the sequence data. Additionally, metrics obtained from the SMRT Link software were incorporated into the analysis to provide a comprehensive assessment. To refine the PacBio reads, HiFiAdapterFilt (v2.0.1) was used to remove any residual adapter sequences that might hinder the assembly process.

The PacBio human whole-genome sequencing (WGS) pipeline was utilized for read alignment and variant calling. Reads were aligned to the GRCh38 and CHM13 reference genomes using pbmm2 (v1.10.0). The CHM13 reference used was chm13v2.0_maskedY_rCRS.fa, while the hg38 reference used was human_GRCh38_no_alt_analysis_set.fasta. Variant calling was then performed using DeepVariant (v1.5.0) pbsv (v2.9) was utilized to jointly call SVs from the aligned long reads. To identify small variants, DeepVariant and GLnexus (v1.4.1) were applied for joint SNP and indel calling. Variant annotation was performed using sliVAR (v0.2.2) for small variants and svPACK for SVs, which also includes functional predictions and other annotations for the raw variant calls. All pass variants with GQ > = 10 were used for unique variant calculations. We used bedtools (v2.31.0) to compute the coverage of individual chromosomes and pericentromeric regions across all samples. We utilized Longshot⁴⁵ for detecting SNPs from sequencing reads using the following command:

longshot --bam {reads.bam} --ref {reference.fa} --out {out.vcf}

Population ancestry inference

To discern the genetic diversity and ascertain the population structure of the 53 UPR genomes, we created a merged dataset of published global and Middle Eastern populations with our samples. We performed variant calling using the command:

BCFtools mpileup/call (BCFtools mpileup -f ${REFGEN} -I -E -T ${VCF} -r chr{i} -b {bam.list} -Ou | BCFtools call -Aim -C alleles -T ${TSV} -Oz -o ${OUT} -f GQ)

on a set of 596,417 variants found in the Human Origins array dataset downloaded from the Allen Ancient DNA Resource (V54.1.p1) (ref. ⁵¹). The joint called UPR VCF was merged with a set of 1040 samples from the Human Origins (HO) dataset. This dataset comprised 54 global populations from the HGDP dataset (which contains 4 Middle Eastern populations: Bedouins, Druze, Palestinians and Mozabites), in addition to other relevant Middle Eastern populations (Egyptians, Moroccans, Saudis, Yemenis, Iraqis, Syrians, and Jordanians) genotyped on the HO array^20,21,22. We excluded any sample that had an ‘outlier’ label. We subsequently added additional relevant Middle Eastern samples from Emiratis, Saudis, Omanis, Yemenis, Iraqis, Jordanians, and Syrians⁹. The merged file was filtered for genotype quality (GQ) greater than 20, minor allele frequency (MAF) greater than 0.05, and linkage disequilibrium (LD) pruning using the following command:

plink --bfile data --maf 0.05 --indep-pairwise 50 50 0.5 --make-bed --out file

LD filtering revealed 182,322 variants. We then employed SNPRelate (v1.28.0)⁵² on a merged dataset with UPR (HiFi variants), 1000 Genomes (n = 2879 samples), and Human Origins (1040 samples, including 304 Arab samples). A total of 182,322 SNPs per sample were utilized for principal component analysis (PCA) and ADMIXTURE. We applied the model-based clustering ADMIXTURE tool⁵³ to model the genetic diversity of all 53 UPR samples in the merged global dataset. We used the major continental labels from the HGDP dataset (South Asian, European, East Asian, American, African, and Oceanian), to which we added ‘Arab’ to our UPR samples. We tested k values ranging from 3 to 8 and found that k = 8 had the lowest cross-validation score.

For the relatedness analysis, we ran the following command on the pruned dataset:

plink --bfile {APR.samples.bed} --genome --min 0.2 --out {APR.samples}

This command performs a genome-wide relatedness check, printing out pairs of samples that share more than 20% identity-by-descent, effectively flagging any related individuals. Further, to confirm the unrelated status of all 50 UPR samples, we constructed a heatmap using haplotype sharing variance obtained from fineSTRUCTURE runs. In brief, variants called using DeepVariant were filtered for a GQ of ≥20. We then reduced the dataset to one variant per 10 kb window, jointly phased it with Eagle, and processed the data using the fineSTRUCTURE pipeline.

Mitochondrial and Y haplogroup classification

We calculated the mitochondrial haplogroups of the UPR samples using the Haplogrep tool (v2.4.0)⁵⁴. To contextualize our findings, we extracted haplogroup frequency information for various ethnicities, including Arab regions, from published literature^55,56. The nonrecombining region of the Y chromosome (NRY) haplogroups in the UPR samples was classified using the Y-LineageTracker tool (v1.3.0)⁵⁷. We conducted an extensive literature review to compile Y-haplogroup frequency data across ethnicities^58,59,60. Both mitochondrial and Y chromosome haplogroup distributions were visualized using hierarchical clustered heatmaps generated with the Python Seaborn library to illustrate the comparative haplotype frequencies among different populations.

De novo genome assembly

Sequence reads were screened for contaminant sequences and polished to remove artifacts prior to analysis. Kraken²⁶ was used to taxonomically classify the assembled contigs, retaining only those annotated as human. Kraken was run using the following command:

kraken2 --memory-mapping --unclassified-out ${sample}.unclassified.fastq --classified-out ${sample}.classified.pb.fastq --output ${sample}.out.bed --threads 16 --db kraken_db ${sample}.fastq.gz

High-quality de novo assemblies were generated for each sample using a hybrid assembly approach, combining PacBio HiFi, ONT ultralong reads and Hi-C reads. A total of 53 samples were sequenced on 1–6 SMRT cells and 3–7 ONT flow cells (Supplementary Table 4). We combined PacBio HiFi reads, ultralong (ULK) reads and Hi-C reads from multiple runs of each sample and used Hifiasm (v0.19.5-r603)²⁴ to carry out both primary and diploid de novo assembly for 53 samples (Supplementary Methods Section G). The following commands were used to generate the assemblies:

hifiasm -o ${sample} --dual-scaf -t128 --ul ${sample}.filtered.ont.fasta --h1 ${sample}_r1.sampled.fastq --h2 ${sample}_r2.sampled.80.fastq ${sample}.filtered.pb.fastq

To evaluate and compare the effectiveness of genome assembly approaches, we utilized both Verkko (v1.3.1) and Hifiasm for constructing genome assemblies of both trio and individual samples within our dataset. Our initial step involved the generation of Merqury hapmer databases, which are crucial for facilitating high-quality, k-mer-based evaluation of haplotype assemblies. We generated hapmer databases for our trio samples using Merqury with the following command:

meryl count compress k = 30 threads = 96 memory = 350 G ${sample}.*fastq.gz output ${sample}_compress.k30.meryl

Following the creation of the hapmer databases, we proceeded with the assembly process using Verkko, which was configured specifically to leverage the advantages of trio-based phased assembly. Verkko was run with the following parameters:

verkko -d ${sample}.workdir --hifi ${sample}.pb.fastq.gz --nano ${sample}.ont.fastq.gz --hap-kmers maternal_compress.k30.hapmer.meryl paternal_compress.k30.hapmer.meryl trio

Genome assembly polishing

We employed Inspector (v1.2) to evaluate and improve the quality of the 106 haplotype genome assemblies. Initially, we assessed assembly errors using Inspector, followed by an error correction step executed using the following command:

inspector-correct.py -i $asm/ --datatype pacbio-hifi -o $asm.corrected/ --skip_structural -t 64

This step was specifically designed to refine the assemblies by addressing identified inaccuracies while excluding structural errors.

Mitochondrial read pairs were identified by mapping to ChrM using minimap2 (v2.26) with the option -L -eqx, and mapped reads were filtered using samtools (v1.6) view -F 4, and any corresponding mitochondrial contigs were removed. Potential interchromosomal joins in the assembly were identified using minimap software, followed by paftools²⁷. The specific commands initiated were -cxasm chm13v2.0.fa APR043.bp.hap1.p_ctg.fa and paftools misjoin-e APR043.1.misjoins.paf, respectively.

We employed PAV⁴³ for identifying variants from genome assemblies. The assemblies were sourced from ‘assemblies.tsv’, and were analyzed against a reference genome file ‘config.json’. The command executed for this analysis was:

singularity run --bind “$(pwd):$(pwd)” library://becklab/pav/pav:latest -c all

Assembly quality assessment

To evaluate the quality and structural integrity of 53 primary assemblies and 106 haplotype assemblies, we used QUAST (v5.2.0)⁶¹, which provides metrics such as completeness, N50, and number of contigs. QUAST was run with the following extensive parameter set, as detailed:

quast.py -o APR043.chm13v2.0 -r chm13v2.0.fa -t 16 APR043.bp.hap1.p_ctg.fa APR043.bp.hap2.p_ctg.fa --large -e

The yak suite (v0.1-r66)²⁴, which includes yak count, yak qv and yak trioeval, was employed for genome quality validation.

yak count -t16 -b37 -o APR043.pb.yak APR043.pb.fastq.gz

was used to build k-mer database and yak qv was used to calculate the QV score with the following parameters:

-t32 -p -K 3.2 g -l 100k APR043.pb.yak APR043.bp.hap1.p_ctg.fa > APR043.hap1.pb.yak.qv.txt

The mapping rate was assessed using minimap2 (v2.26).

We utilized Flagger (v0.3.3) to identify small-scale errors within the diploid assemblies. The HiFi reads were mapped to the concatenated assembly of both haplotypes using the following minimap2 command:

minimap2 --cs -L -Y -t 32 -ax map-hifi $sample.concat.fa.gz $sample.hifi.fq.gz

The mapped reads were then sorted using SAMtools sort. Each haplotype assembly was further mapped to chm13v2.0 using the following command:

minimap2 -t 64 -L --eqx --cs -ax asm5 chm13v2.0.fa $sample.$hap.polished.fa.gz

The Flagger WDL pipeline²⁹ was subsequently executed using the sorted bam files as input in no-variant-calling mode. The resulting bed files were analyzed to differentiate between the haplotypes within the assemblies.

All assembled contigs were aligned to the CHM13v2 reference genome using Minimap2 with -L --eqx option. After successful mapping, the chromosome to which each contig was predominantly aligned was identified for further analysis. The centromeric region corresponding to this chromosome was then extracted from the reference genome using SAMtools faidx. To assess the alignment quality of the assembly contig to the extracted centromeric region, we employed UniAligner³¹, specifically using the tandem_aligner command:

tandem_aligner --first chm13v2.chr1.fa --second $sample{}.1.polished.part_h1tg000053l.fa -o ${sample}.1_h1tg000053l_chr1”

The output of this alignment is a CIGAR string and each contig’s alignment to the centromere of its respective chromosome was further analyzed. A sliding window approach was applied to each CIGAR string, using a window size of 100 bp, to calculate the alignment percentage within each window. Finally, the alignment percentages from all windows were plotted to visualize the distribution of alignment quality across different sections of the contigs.

To calculate the Hamming rate and switch errors in assemblies constructed with Hi-C data, we used pstools (v0.1)³⁰. The phasing errors were assessed employing the following command: pstools phasing_error -t 96 $sample.1.polished.fa $sample.2.polished.fa $sample_r1.20fastq $sample_r2.20.fastq

Gene duplication

We used Liftoff (v1.6.3) (ref. ³²), a tool that accurately maps gene annotations between genome assemblies, to identify gene duplications in our dataset. Liftoff aligns gene sequences from a reference (GENCODE v38) to a target genome and finds the mapping that maximizes sequence identity while preserving the gene structure. For this analysis, we used a subset of GENCODE (v38) that excludes genes located on patches or haplotypes. Additionally, we included only one copy of the genes from the X/Y pseudoautosomal regions (PAR).

An identity threshold of 90% (-sc 0.9) was used, and the following command was used:

liftoff -p 64 -sc 0.90 -copies -g GENCODE.V38.gff3 -u unmapped.txt -o gene_dup.gff3 -polish {$ASSEMBLY} {$REFERENCE}

To remove partial matches from the analysis, we used the -exclude_partial option in Liftoff. We then quantitatively assessed the frequency of copy number variations (CNVs) for each gene in the target genome by comparing the number of gene copies to that in the reference (GENCODE GRCh38.p14 (GENCODE v38)). UPR-specific gene duplications were compared against HPRC and CPC gene duplication matrices, as reported in studies utilizing similar methodologies and tool versions. We have constructed a downstream analytical tool PanScan (https://github.com/muddinmbru/panscan) for analyses of complex structural variant, gene duplication analysis and unreported sequence estimation from pangenome graph vcf file.

Pangenome graph construction

We constructed Arab pangenome variation graphs using the Minigraph-Cactus pipeline (v2.7.2)³⁷, which combines structural variants and single-nucleotide variants (SNVs) into a single graph representation. We used two reference genomes, GRCh38 and CHM13, with CHM13 as the backbone for graph construction following the steps described in the Minigraph-Cactus documentation. Initially, an SV graph (>50 bp) was constructed using Minigraph⁶² by sequentially aligning the 106 haplotype assemblies to the reference genome. Centromeric and telomeric regions were masked using dna-brnn⁶³, and the assemblies were remapped to exclude highly repetitive sequences. Contigs were then split and assigned to chromosomes based on their alignment coordinates. Base-level alignment was performed using Cactus, and the HAL output⁶⁴ was converted to vg format (hal2vg). Paths >10 kb that were not aligned to the graph were removed, and the graph was normalized with GFAffix. The chromosome graphs were combined into a whole-genome graph, indexed with vg v1.50.1, and exported to VCF format (vg view). Subsequently, SNP variants were incorporated into the SV graph using the same pipeline. The assemblies were remapped to the SV graph, and quality control was applied, excluding softmasked sequences >100 kb and alignments with MAPQ < 5. Cactus was executed on the chromosome graphs to introduce base-level variants, and the outputs were converted to vg format. The same filtering, normalization, combining, and indexing steps were applied to produce the final SNP + SV Arab pangenome graph. All analyses were conducted using the CATG high-performance computing cluster Flamingo. The command line used in our end-to-end pangenome construction pipeline is as follows:

cactus-pangenome./apr-js-chm13.2908./seq_hifiasm.seq --latest --disableCaching --outName apr_review_v1_2902_chm13 --outDir apr_review_v1_2902_chm13 --reference CHM13 GRCh38 --filter 9 --giraffe clip filter --vcf --chrom-vg clip filter --gbz clip filter full --viz --gfa clip full --vcf --logFile apr-chm13.log --mgCores 160 --mapCores 160 --consCores 160 --indexCores 160 --binariesMode local --workDir wor

Pangenome growth curve

Panacus³⁸ was used to calculate the UPR pangenome growth. The percentages of samples included were >5% (common), >95% (core) of the total samples and 1 haplotype (singleton). The command used was:

panacus ordered-histgrowth -c bp -t320 -l 1,3,27,51 -S -e apr_v1_2902_chm13.paths.chm13.txt apr_v1_2902_chm13.gfa

Identification and visualization of population-specific variants and unreported sequences

Variants were called using CHM13-T2T as the reference genome. The multiallelic sites in the VCF file were split into biallelic records using the “bcftools norm -m-“ command, and then the VCF file was separated into SNPs and other complex variants using custom Perl script. The complex variants were decomposed into SNPs, indels (<=50 bp) and SVs (>50 bp) using the command:

vcfdecompose --break-indels --break-mnps

from RTG tools (v.3.12.1), and the genotype information from identical variants across multiple samples was merged to create a single record using custom Perl script. The SNPs obtained from “vcfdecompose” were also included in the downstream analysis. The same methodology was used for the analysis of the CPC-HPRC VCF file.

Unique SVs (<80% reciprocal overlap with HPRC and CPC, 1000 Genomes, and DGV) found in the Arab assemblies were classified as population specific. The unique SVs in the UPR were identified by comparison with the HPRC-CPC SVs using the Truvari command:

truvari bench -r 1000 -C 1000 -O 0.8 -p 0.8 -P 0.0 -s 50 -S 15 --sizemax 100000

A reciprocal overlap of 80% was the criterion used to classify an SV as unique. This process collapses multiple SVs into one unique SV within our cohort. The resulting SVs were compared against the DGV Gold Standard Variants and the 1000 Genomes Project data (phase1_v3.20101123) with 80% reciprocal overlap using custom perl script. To identify unreported sequences within the UPR unique SV insertions, we clustered them using cd-hit-est (4.8.1) with the default sequence identity threshold of 0.9.

We visualized SV distributions and enrichment significance on chromosomes with RIdeogram (v0.2.2)⁶⁵. Subgraphs surrounding SVs were extracted with gfabase (v0.6.0) and visualized in Bandage NG (v2022.09)⁶⁶ after aligning gene sequences with GraphAligner (v1.0.17)⁶⁷ using the following command:

GraphAligner -g {viz_output} -f {pc_fasta_file} -t 32 -a {pc_alignment_file} -x vg --multimap-score-fraction 0.1

The gfatools package v0.5-r287 was subsequently used to derive in-depth statistical data from the pangenome graphs.

To identify the repeat elements within unreported sequences, we screened fasta files with RepeatMasker (4.1.2-p1) using command:

RepeatMasker -e rmblast -pa 60 -species human APR_Unreported_Seq.fa -dir Result

Assessment of Mendelian concordance in trio samples

To evaluate the consistency of our trio samples’ variant calls with Mendelian concordance, we employed the PLINK command:

plink --bfile {trio-file.bed} --mendel --out {out.name}

Complex SV region and Bandage plotting

A site was considered complex if at least one variant larger than 10 kb was present, as well as at least five different alleles. The complex sites were identified by analyzing the snarls VCF files. The unique SVs were then overlapped with the complex sites to identify unique complex sites (see unique variants subsection) in the Arab population. The precise locations of the genes were determined by mapping the gene sequences to the graph using a graph aligner with the following parameter: --multimap-score 0.1. If multiple genes mapped to the same location (in the case of isoforms), only the gene mapping with the highest accuracy was retained. Complex structural variation (CSV) regions were defined as areas within a 100-kilobase (kb) window that consisted of two or more multiallelic complex SV sites, with each site containing at least one 10-kilobase (kb) SV within the haplotypes. The complex sites were plotted using Bandage. The 50 Kbp flanking region for each complex site was extracted using the Gfabase sub. To determine whether a haplotype consisted of a gene, the path was followed to determine whether it traversed through the gene region. Variation among haplotype walks that did not involve genes was visualized using color coded lines, from red to blue to indicate directions.

Mitochondrial pangenome construction

To construct a mitochondrial UAE pangenome (mtUPR) that captures the diversity of Arab mitochondrial DNA, we used high-quality long reads from 53 individuals. We mapped the reads to ChrM of the CHM13v2 reference genome using minimap2 (v2.26)⁶⁸ with 90% similarity and retained only reads that were longer than 15 kb; this threshold was set to substantially reduce the chances of inadvertent nuclear DNA contamination. This resulted in a total of 20,520 reads (19,251 ONT reads and 1385 HiFi reads). We used PGGB (v0.5.4)⁶⁹ to construct a mitochondrial pangenome graph from the mitochondrial contigs of HiFi reads of all individuals, and each read of >15 kb was concatenated in one fasta file along with the CHM13v2 mitochondrial chromosome. PGGB was then run on these samples with the following command: pggb -i pggb.in.90.no_dup.chm13.fasta.gz -o output_chm13_local -n 53 -t 90 -p 90 -s 100 -V ‘chm13v2:#:50’

We visualized the mtUPR graph using Bandage NG (v2022.09)⁶⁶ and displayed the nodes, edges and variants in different colors and shapes.

Mitochondrial pangenome variant identification

To process the VCF files of the mtUPR pangenome, we employed a multistage approach to ensure data accuracy and integrity. The initial step involved segregation of multiallelic variant sites into biallelic records, which was accomplished using the BCFtools normalization function (BCFtools norm -m). Next, we implemented the RTG tool vcfdecompose (v3.12.1) with the parameters --break-mnps and --break-indels to further resolve complex variants into their constituent single-nucleotide polymorphisms (SNPs), indels and SVs. To merge genotype information, identical variants identified across the dataset were consolidated into singular records. This step was carried out using a custom Perl script. Parallel to the UPR data, the HPRC mitochondrial pangenome VCF file was subjected to the same rigorous processing methodology to maintain consistency across datasets. The small variants (<10 bp) were further filtered to obtain those that were concordant with DeepVariant calls of the mitochondrial reads. To identify unique variants unique to the UPR mitochondrial pangenome, we systematically removed variants that were shared with the HPRC mitochondrial pangenome. We then extracted insertions of 10 base pairs or more that were subsequently clustered using the cd-hit-est program, which applied a stringent 90% similarity threshold (-c 0.9) to discern and characterize previously unobserved insertions.

Evaluation of short read whole genome and whole exome mapping

We used 9 WGS data from Arab descent⁹ to quantify the mappability accuracy of short read sequences into UPR and HPRC pangenome and CHM13 reference. The quality metrics of each samples were obtained using FASTQC tool (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and the low quality bases were removed using fastp (v0.23.4).

To calculate the mapping rate of short reads to CHM13, we aligned the paired end reads to CHM13v2 using bwa-mem. The following commands were used:

bwa mem -K 100000000 -Y -R ‘@RG\tID:Sample\tLB:Sample\tPL:ILLUMINA\tPM:HISEQ\tSM:Sample’ Filtered-Sample_R1.fastq Filtered-Sample_R2.fastq -t 64 > Sample.sam samtools view -@ 64 -bS Sample.sam | samtools flagstat -@ 64 - > Sample_Flagstat.txt

The variant calling was performed using DeepVariant (v1.5.0) by providing --model_type=WGS for the whole genome samples and --model_type=WES for the whole exome samples respectively.

To determine the mapping rate of short reads to the UPR and HPRC graphs and to conduct variant calling, filtered samples were mapped to pangenomes using vg Giraffe (v1.55.0). The “vg stats” command was used to produce stats for the properly paired reads for graph references and “vg call” command was used to call the variants:

vg giraffe -p -t 64 -Z APR.giraffe.gbz -d APR.giraffe.dist -m APR.giraffe.min -x APR.giraffe.xg -f Filtered-Sample_R1.fastq -f Filtered-Sample_R2.fastq > Sample.gam vg stats -p 64 -a sample.gam > Sample_Stat.txt vg pack -t 64 -x APR.giraffe.xg -g Sample.gam -Q 5 -o Sample_aln.pack vg call -t 64 APR.giraffe.xg -k Sample_aln.pack -S CHM13 > Sample-CHM13_graph_calls.vcf

Pathway enrichment analysis

To discern the biological pathways affected by gene sets, we conducted gene set enrichment analysis (GSEA) leveraging the KEGG and GO pathway databases, a well-curated repository featuring pathways that include molecular interactions, reactions, and relation networks. Gene names were converted to Entrez gene identifiers using the bioMart tool. Our analytical framework included determining the degree of overlap between the gene sets and the pathways in the KEGG-GO database, utilizing GeneOverlap to conduct one-sided Fisher’s exact tests (FETs). We excluded pathway gene sets that numbered fewer than 50 or exceeded 1000. To account for multiple hypothesis testing and control the false discovery rate (FDR), we applied the Benjamini‒Hochberg procedure to adjust the p values obtained from Fisher’s exact test.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw sequencing reads generated in this study, including PacBio HiFi, ONT ultralong, and Hi-C data, have been deposited in the NCBI Sequence Read Archive (SRA) under accession code PRJNA1108179. The sample metadata can be accessed at SRP509490. The genome assemblies have been deposited in GenBank under accession codes PRJNA1151091-PRJNA1151118 and PRJNA1152014-PRJNA1152091 (full URL list provided in Supplementary Table 35). All sample assembly FASTA files, as well as the nuclear and mitochondrial pangenome files, including corresponding VCF and index files are publicly available at [https://www.mbru.ac.ae/the-arab-pangenome-reference/]. Previously published datasets used for comparison in this study include: HPRC alignments: [https://data.humanpangenome.org/alignments], CPC (Chinese Pangenome Consortium): [https://pog.fudan.edu.cn/cpc/#/data], Almarri et al. 2021: NCBI Bioproject PRJNA738764 [https://www.ncbi.nlm.nih.gov/bioproject/738764]. All additional data supporting the findings of this study are provided within the article, Supplementary Information files and Source data files. Source data are provided with this paper.

Code availability

The code used to reproduce the UAE-based Arab pangenome from this work is available at GitHub (https://github.com/muddinmbru/arab_pangenome_reference)⁷⁰ and archived with a DOI at Zenodo: 10.5281/zenodo.15524484. The PanScan tool is available at (https://github.com/muddinmbru/panscan). The commands used in other analyses can be found in the Methods and Supplementary Information sections.

References

Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
Article CAS PubMed PubMed Central ADS Google Scholar
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Article Google Scholar
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Article CAS PubMed PubMed Central ADS Google Scholar
Rhie, A. et al. The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Gao, Y. et al. A pangenome reference of 36 Chinese populations. Nature 619, 112–121 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
Article CAS PubMed PubMed Central ADS Google Scholar
Almarri, M. A. et al. The genomic history of the Middle East. Cell 184, 4612–4625.e14 (2021).
Article CAS PubMed PubMed Central Google Scholar
Mbarek, H. et al. Qatar genome: Insights on genomics from the Middle East. Hum. Mutat. 43, 499–510 (2022).
Article PubMed Google Scholar
Tadmouri, G. O. et al. Consanguinity and reproductive health among Arabs. Reprod. Health 6, 17 (2009).
Article PubMed PubMed Central Google Scholar
Teebi, A. S. Autosomal recessive disorders among Arabs: an overview from Kuwait. J. Med. Genet. 31, 224–233 (1994).
Article CAS PubMed PubMed Central Google Scholar
Al-Gazali, L., Hamamy, H. & Al-Arrayad, S. Genetic disorders in the Arab world. BMJ 333, 831–834 (2006).
Article PubMed PubMed Central Google Scholar
Rahim, H. F. A. et al. Non-communicable diseases in the Arab world. Lancet 383, 356–367 (2014).
Article PubMed Google Scholar
El-Kebbi, I. M., Bidikian, N. H., Hneiny, L. & Nasrallah, M. P. Epidemiology of type 2 diabetes in the Middle East and North Africa: Challenges and call for action. World J. Diabetes 12, 1401–1425 (2021).
Article PubMed PubMed Central Google Scholar
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Scott, E. M. et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery. Nat. Genet. 48, 1071–1076 (2016).
Article CAS PubMed PubMed Central Google Scholar
MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L. & Scherer, S. W. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–D992 (2014).
Article CAS PubMed Google Scholar
Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature536, 419–424 (2016).
Article CAS PubMed PubMed Central ADS Google Scholar
Lazaridis, I. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413 (2014).
Article CAS PubMed PubMed Central ADS Google Scholar
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Article PubMed PubMed Central Google Scholar
Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. Plos Genet. 8, e1002453 (2012).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article CAS PubMed PubMed Central Google Scholar
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).
Article CAS PubMed PubMed Central Google Scholar
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Article PubMed PubMed Central Google Scholar
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Article CAS PubMed PubMed Central Google Scholar
Chen, Y., Zhang, Y., Wang, A. Y., Gao, M. & Chong, Z. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 22, 312 (2021).
Article PubMed PubMed Central Google Scholar
GitHub - mobinasri/flagger: Evaluating genome assemblies. https://github.com/mobinasri/flagger.
GitHub - shilpagarg/pstools. https://github.com/shilpagarg/pstools.
Bzikadze, A. V. & Pevzner, P. A. UniAligner: a parameter-free framework for fast sequence alignment. Nat. Methods 20, 1346–1354 (2023).
Article CAS PubMed Google Scholar
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).
Article CAS PubMed PubMed Central Google Scholar
McFarlane, C. et al. The deubiquitinating enzyme USP17 is highly expressed in tumor biopsies, is cell cycle regulated, and is required for G1-S progression. Cancer Res. 70, 3329–3339 (2010).
Article CAS PubMed Google Scholar
Komander, D., Clague, M. J. & Urbé, S. Breaking the chains: structure and function of the deubiquitinases. Nat. Rev. Mol. Cell Biol. 10, 550–563 (2009).
Article CAS PubMed Google Scholar
Luse, D. S. The RNA polymerase II preinitiation complex. Through what pathway is the complex assembled?. Transcription 5, e27050 (2014).
Article PubMed Google Scholar
Chen, C.-L. et al. Ethnically unique disease burden and limitations of current expanded carrier screening panels. Int. J. Gynaecol Obstet. https://doi.org/10.1002/ijgo.15072 (2023).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).
GitHub - marschall-lab/panacus: Panacus is a tool for computing statistics for GFA-formatted pangenome graphs. https://github.com/marschall-lab/panacus.
Kern, C. H., Yang, M. & Liu, W.-S. The PRAME family of cancer testis antigens is essential for germline development and gametogenesis†. Biol. Reprod. 105, 290–304 (2021).
Article PubMed Google Scholar
Proshkin, S. A., Shematorova, E. K. & Shpakovski, G. V. The Human Isoform of RNA Polymerase II Subunit hRPB11bα Specifically Interacts with Transcription Factor ATF4. Int. J. Mol. Sci. 21, 135 (2019).
Chauhan, S. et al. Evolution of the Cdk-activator Speedy/RINGO in vertebrates. Cell Mol. Life Sci. 69, 3835–3850 (2012).
Article CAS PubMed PubMed Central Google Scholar
Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).
Article CAS PubMed Google Scholar
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Article CAS PubMed Google Scholar
Edge, P. & Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun.10, 1–10 (2019).
Article CAS Google Scholar
Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
Article PubMed PubMed Central Google Scholar
Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).
Article CAS PubMed PubMed Central ADS Google Scholar
Aamer, W. et al. Burden of Mendelian disorders in a large Middle Eastern biobank. Genome Med. 16, 46 (2024).
Article CAS PubMed PubMed Central Google Scholar
Vollger, M. R. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Ravarani, C. N. J. et al. Molecular determinants underlying functional innovations of TBP and their impact on transcription initiation. Nat. Commun. 11, 2384 (2020).
Article PubMed PubMed Central ADS Google Scholar
Mallick, S. et al. The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci. Data 11, 1–10 (2024).
Article Google Scholar
Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
Article CAS PubMed PubMed Central Google Scholar
Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinform. 12, 246 (2011).
Article Google Scholar
Weissensteiner, H. et al. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res. 44, W58–W63 (2016).
Article CAS PubMed PubMed Central Google Scholar
Fähnrich, A. et al. North and East African mitochondrial genetic variation needs further characterization towards precision medicine. J. Adv. Res. 54, 59–76 (2023).
Article PubMed PubMed Central Google Scholar
Aljasmi, F. A. et al. Genomic landscape of the mitochondrial genome in the United Arab Emirates Native Population. Genes 11, 876 (2020).
Chen, H., Lu, Y., Lu, D. & Xu, S. Y-LineageTracker: a high-throughput analysis framework for Y-chromosomal next-generation sequencing data. BMC Bioinform. 22, 114 (2021).
Article Google Scholar
Hallast, P., Agdzhoyan, A., Balanovsky, O., Xue, Y. & Tyler-Smith, C. A Southeast Asian origin for present-day non-African human Y chromosomes. Hum. Genet. 140, 299–307 (2021).
Article CAS PubMed Google Scholar
Elliott, K. S. et al. Fine-scale genetic structure in the United Arab Emirates reflects endogamous and consanguineous culture, population history, and geography. Mol. Biol. Evol. 39, msac039 (2022).
Abu-Amero, K. K. et al. Saudi Arabian Y-Chromosome diversity and its relationship with nearby regions. BMC Genet. 10, 59 (2009).
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
Article CAS PubMed PubMed Central Google Scholar
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
Article PubMed PubMed Central Google Scholar
Li, H. Identifying centromeric satellites with dna-brnn. Bioinformatics 35, 4408–4410 (2019).
Article CAS PubMed PubMed Central Google Scholar
Hickey, G., Paten, B., Earl, D., Zerbino, D. & Haussler, D. HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics 29, 1341–1342 (2013).
Article CAS PubMed PubMed Central Google Scholar
Hao, Z. et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6, e251 (2020).
Article PubMed PubMed Central Google Scholar
Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
Article CAS PubMed PubMed Central Google Scholar
Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
Article PubMed PubMed Central Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Garrison, E. et al. Building pangenome graphs. bioRxiv (2023) https://doi.org/10.1101/2023.04.05.535718.
GitHub - muddinmbru/arab_pangenome_reference. https://github.com/muddinmbru/arab_pangenome_reference.

Download references

Acknowledgements

We would like to thank all the study participants for giving us this opportunity of building a reference pangenome. Yehia Zakaria Kotp provided constant support for the CATG Flamingo computation cluster. Avinash Krishnan, and Freeda Pinto provided constant help with logistics at CATG sequencing lab. HPRC team for commenting on our assembly data quality. Dubai Academic Healthcare Corporation (DAHC) funded this work and supported the procurement of reagents, chemicals, and computing hardwares. The funders had no role in the design of the study, data collection, analysis, publish or preparation of the manuscript. We thank Nature Proofediting Services for its linguistic assistance, which greatly aided the preparation of this manuscript.

Author information

Authors and Affiliations

Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, UAE
Nasna Nassir, Muhammad Kumail, Nesrin Mohamed, Bipin Balan, Shehzad Hanif, Maryam AlObathani, Bassam Jamalalail, Hanan Elsokary, Dasuki Kondaramage, Suhana Shiyas, Noor Kosaji, Stefan S Du Plessis, Hamda Hassan Khansaheb, Alawi Alsheikh-Ali & Mohammed Uddin
College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, UAE
Nasna Nassir, Mohamed A. Almarri, Bipin Balan, Noor Kosaji, Dharana Satsangi, Hanan Al Suwaidi, Ammar Albanna, Stefan S Du Plessis, Hamda Hassan Khansaheb, Alawi Alsheikh-Ali & Mohammed Uddin
Genome Center, Department of Forensic Science and Criminology, Dubai Police GHQ, Dubai, UAE
Mohamed A. Almarri
Ambulatory Health Care, Dubai Health, Dubai, UAE
Madiha Hamdi Saif Abdelmotagali, Olfat Zuhair Salem Ahmed & Douaa Fathi Youssef
Dubai Health Genomic Medicine Center, Al Jalila Children’s Specialty Hospital, Dubai Health, Dubai, UAE
Ahmad Abou Tayoun
Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, UAE
Ahmad Abou Tayoun
AlAmal Psychiatric Hospital, Al Aweer, UAE
Ammar Albanna
Dubai Health Authority, Dubai, UAE
Alawi Alsheikh-Ali
GenomeArc Inc, Mississauga, ON, Canada
Mohammed Uddin

Authors

Nasna Nassir
View author publications
Search author on:PubMed Google Scholar
Mohamed A. Almarri
View author publications
Search author on:PubMed Google Scholar
Muhammad Kumail
View author publications
Search author on:PubMed Google Scholar
Nesrin Mohamed
View author publications
Search author on:PubMed Google Scholar
Bipin Balan
View author publications
Search author on:PubMed Google Scholar
Shehzad Hanif
View author publications
Search author on:PubMed Google Scholar
Maryam AlObathani
View author publications
Search author on:PubMed Google Scholar
Bassam Jamalalail
View author publications
Search author on:PubMed Google Scholar
Hanan Elsokary
View author publications
Search author on:PubMed Google Scholar
Dasuki Kondaramage
View author publications
Search author on:PubMed Google Scholar
Suhana Shiyas
View author publications
Search author on:PubMed Google Scholar
Noor Kosaji
View author publications
Search author on:PubMed Google Scholar
Dharana Satsangi
View author publications
Search author on:PubMed Google Scholar
Madiha Hamdi Saif Abdelmotagali
View author publications
Search author on:PubMed Google Scholar
Ahmad Abou Tayoun
View author publications
Search author on:PubMed Google Scholar
Olfat Zuhair Salem Ahmed
View author publications
Search author on:PubMed Google Scholar
Douaa Fathi Youssef
View author publications
Search author on:PubMed Google Scholar
Hanan Al Suwaidi
View author publications
Search author on:PubMed Google Scholar
Ammar Albanna
View author publications
Search author on:PubMed Google Scholar
Stefan S Du Plessis
View author publications
Search author on:PubMed Google Scholar
Hamda Hassan Khansaheb
View author publications
Search author on:PubMed Google Scholar
Alawi Alsheikh-Ali
View author publications
Search author on:PubMed Google Scholar
Mohammed Uddin
View author publications
Search author on:PubMed Google Scholar

Contributions

A.A.A., M.U., S.S.D.P, and M.A. conceived and designed the study. A.A.A. and M.U. were responsible for ethical, legal and social implications. A.A.A., M.U., and N.N. coordinated and supervised the project. B.J., N.N., B.A., M.A.O., M.H.S.A., A.A., O.Z.S.A., D.F.Y., H.A.S., and H.H.K. were responsible for recruitment and collection of blood samples and clinical data from medical health records. A.A.A., M.U., N.N., S.H., M.A., H.H.K., and B.J. were responsible for sample selection and population genetic analysis. M.A.O., M.U., N.N., N.M., H.E., D.K., D.S., were responsible for DNA and RNA sequencing. M.K., B.B., S.H., N.N., and M.U. were responsible for variant annotation. MK was responsible for assembly creation and M.K., M.U., and N.N. were responsible for assembly quality control and assembly reliability analysis. M.U., M.K., B.B., N.N., and S.H. were responsible for variant detection from assembly, pangenome empirical analysis and quality control. B.B., S.H., M.U., and N.N. were responsible for pangenome applications of SVs, short and long-read mapping. M.K., N.N., M.U., N.K., and S.S. were responsible for pangenome visualization and complex loci analysis and pangenome graph creation. M.U., M.K., B.B., N.N., and S.H. statistical analysis. A.A.A., M.U., N.N., S.S.D.P., M.A., and A.A.T. were responsible for manuscript writing. M.U., A.A.A., and N.N. were responsible for manuscript editing. A.A.A., M.U., and N.N. were responsible for data coordination and management.

Corresponding authors

Correspondence to Alawi Alsheikh-Ali or Mohammed Uddin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF )

Supplementary Tables 1-36 (download XLSX )

Reporting Summary (download PDF )

Peer Review file (download PDF )

Source data

Source data for figure 1-5 (download XLSX )

Source data for Supplementary Figs. (download XLSX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Nassir, N., Almarri, M.A., Kumail, M. et al. A draft UAE-based Arab pangenome reference. Nat Commun 16, 6747 (2025). https://doi.org/10.1038/s41467-025-61645-w

Download citation

Received: 13 September 2024
Accepted: 24 June 2025
Published: 24 July 2025
Version of record: 24 July 2025
DOI: https://doi.org/10.1038/s41467-025-61645-w

This article is cited by

From genomes to precision medicine: 25 years of breakthroughs in genetics and genomics
- Solomon G. Hailu
The Nucleus (2026)
Entering the era of precision medicine to treat amyotrophic lateral sclerosis
- Frances Theunissen
- Loren Flynn
- P. Anthony Akkari
Molecular Neurodegeneration (2025)