Introduction

The multifaceted nature of human genetic diversity is deeply influenced by demographic dynamics and biological mechanisms1. This complexity holds profound implications for elucidating the genetic etiology of diseases2. The Kingdom of Morocco, situated at the crossroads of Africa, Europe, and the Middle East, boasts a rich tapestry of cultural heritage and a diverse population shaped by centuries of historical migrations, trade, and conquests. This intricate history has contributed to the genetic diversity observed within the Moroccan population, characterized by a mosaic of ancestries and genetic admixture3.

Despite its significance, genetic studies focusing on populations of North Africa, including Morocco, have historically been limited in scope and scale. Early investigations primarily relied on small sample sizes and targeted genetic markers, providing only a partial understanding of the genetic diversity and population structure in the region. However, recent advancements in high-throughput sequencing technologies have revolutionized the field of population genetics, enabling comprehensive analyses of whole-genome sequences from large cohorts.

As personal genomics becomes increasingly common, population genetics is becoming critical for precision medicine. To identify disease-related mutations, genomics-based precision medicine compares a patient’s genome to a reference genome4 that is derived mostly from European individuals. Numerous genetic and epidemiological studies have acknowledged the importance of ancestral origin in predicting disease risk. Risk alleles and structural variations (SVs) may be missing from reference genetic data or may occur at different frequencies across populations, leading to unidentified disease pathways and significant health disparities worldwide5,6,7,8. Moreover, non-European samples account for only around 20% of the total samples in genome-wide association study (GWAS) databases8. For instance, an analysis of the NHGRI-EBI GWAS Catalog, a comprehensive database of published genome-wide association studies, revealed a low representation (2.4%) of African ancestry9. The situation is especially concerning for North African and Moroccan populations, both of which are underrepresented in the GWAS Catalog. Indeed, polygenic risk scores have been shown to be more accurate for individuals of European ancestry than for others10.

Relying on genetic references and allele frequencies from other populations limits a country’s capacity to use precision medicine technologies and tailored therapeutic procedures; this has driven national and multinational multi-ethnic genome initiatives11. Asia and England, for example, have launched population-based sequencing initiatives to map variations through 100,000 and 500,000 genome studies, respectively12,13. Large-scale sequencing studies are also being carried out in Europe, North America, and increasingly, Sub-Saharan Africa to study genetic variants specific to each population, society, and history. The African Genome Variation Project (AGVP) examined numerous African populations14, but its primary emphasis was on those south of the Sahara, in West, East, and South Africa. Because the AGVP did not include genetic data from North African groups, it leaves a gap in our understanding of the African continent’s genetic landscape. North Africans are likewise underrepresented in genomic datasets such as 1000 Genomes15, TOPMed16, and gnomAD17. This region, including Morocco, Algeria, Tunisia, Libya, and Egypt, has a unique and diverse genetic legacy shaped by past migrations, trade routes, and conquests. The Arab conquest of North Africa 1,400 years ago had a huge cultural and genetic impact on the region. The expansion of the Arabic language and the intermingling of people have had a lasting genetic influence. The inhabitants of these areas have a combination of indigenous Berber heritage and Arab, Mediterranean, and sub-Saharan African gene flow18.

The Egyptian Genome Project, for instance, marked a significant milestone by producing a high-quality de novo genome assembly for a single Egyptian male and constructing a population genome using genome-wide SNV allele frequencies from 109 additional Egyptians. This comprehensive analysis identified over 19 million single nucleotide variants (SNVs) and more than 121,000 structural variants across 110 Egyptian individuals19. Similarly, in Tunisia, the Genome Tunisia Project, launched in 2022, aims to sequence the Tunisian genome by 2035. This ambitious initiative seeks to establish a reference sequence capturing the unique genetic composition of the Tunisian population, shaped by centuries of admixture involving African, European, and Asian ancestries. The project aspires to advance genomic research capabilities in Tunisia and contribute to broader North African genomic science20.

In Morocco, whole-genome sequences of three Moroccan individuals were released in 2020 (ref. 21). Subsequently, a deep bioinformatic analysis of these genomes showed the highest similarities with African and Non-Finnish European ancestries according to gnomAD. Moreover, population structure inference demonstrated that the studied genomes have a mixed population structure, indicating a notable increase in genetic diversity within the Moroccan population22. However, the small sample size (3 genomes) is insufficient to capture all the genetic variations of the Moroccan population.

The Moroccan Genomics Project (MGP), initiated by the Mohammed VI Center for Research and Innovation, was established to study 1000 Moroccan genomes as part of the North African region. The MGP started in 2024 with a comprehensive plan to advance genomics by building expertise, creating bioinformatics infrastructure, and establishing a Moroccan biobank. It emphasizes precision medicine, a national genome database, and adherence to genomic data protection policies, all contributing to a robust national genomics initiative. This approach, in line with the New National Health vision, goes beyond collecting data and extends into areas such as policy and legislation, developing human capability, and integrating clinical practices.

This study aims to initiate the MGP by sequencing 109 whole Moroccan genomes to investigate the population’s genetic variations and build a Moroccan Major Allele Reference Genome (MMARG). Such research efforts could significantly contribute to reducing disparities caused by the under-representation of Moroccan and hence North African variants in the human reference genome. This improvement could enhance healthcare by enabling more personalized medicine.

Results

Overview of the genetic variations in the Moroccan genomes

Genotyping and variant calling of the 109 Moroccan genomes resulted in an initial VCF file containing 28,262,306 variants, including 24,356,267 SNVs and 4,181,156 indels, with 2,161,454 multiallelic sites. After applying the GATK VQSR filter, the number of variants was reduced to 24,958,854, comprising 21,760,118 SNVs and 3,400,454 indels, with a decrease in multiallelic sites to 1,533,238. These variants were first used for Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD) analyses. The HWE analysis showed that most genetic variants across all chromosomes conformed to HWE expectations: overall, 99.56% of variants were in equilibrium, while only 0.44% deviated from it (Table S1). The LD analysis, using the thresholds given in the Methods, showed that 19,469,198 out of 22,827,466 variants (approximately 85%) exhibited high linkage disequilibrium. For the subsequent analyses, VCF normalization increased the number of variants to 27,935,252, consisting of 21,878,061 SNPs and 5,502,684 indels, with all multiallelic sites split. These variants were distributed across all chromosomes, with the large majority (94.96%) classified as “known” variants and approximately 5.04% categorized as “novel” (Table 1). While the percentage of novel variants is relatively consistent across most chromosomes, ranging from approximately 3.90% to 5.71%, chromosome Y stands out with nearly half (46.34%) of its variants classified as novel. Mitochondrial DNA (chrM) also shows a higher percentage of novel variants (7.81%) than the autosomes and chromosome X.

Table 1 Summary of autosomal and sex chromosome variants identified through genotyping of 109 Moroccans, showing known and novel variants based on their overlap with dbSNP

The 26,985,607 variants obtained after removing variants with more than 11 missing genotypes were divided into three groups based on allele frequency (AF); the majority (61%) were rare alternate alleles (AF < 0.05) (Fig. 1).

Fig. 1: Histogram depicting allele frequency (AF) distribution for all filtered variants across 109 Moroccan samples.

The histogram shows the count of variants per 5% AF interval. The most prominent peak corresponds to variants whose alternate allele is rare (AF below 5%). At the opposite end, variants in the highest AF interval (approaching 100%) are sites where the reference allele is rare or unobserved in the cohort, that is, where the reference genome carries the less common allele.
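The binning behind such a histogram can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the genotype encoding (tuples of 0/1 allele codes, `None` for a missing call) and both function names are assumptions introduced here.

```python
from collections import Counter

def alt_allele_frequency(genotypes):
    """Alternate-allele frequency from diploid genotypes.

    Each genotype is a tuple of 0/1 allele codes (hypothetical encoding);
    None marks a missing genotype and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_copies = sum(a + b for a, b in called)
    return alt_copies / (2 * len(called))

def bin_frequencies(afs, width=0.05):
    """Count variants per AF interval [k*width, (k+1)*width).

    AF values of exactly 1.0 are placed in the last interval. A tiny
    epsilon guards against floating-point values such as 0.15 / 0.05
    falling just below an interval boundary.
    """
    n_bins = round(1 / width)
    counts = Counter()
    for af in afs:
        k = min(int(af / width + 1e-9), n_bins - 1)
        counts[k] += 1
    return [counts.get(k, 0) for k in range(n_bins)]

# Toy example: three called genotypes and one missing call.
print(alt_allele_frequency([(0, 1), (0, 0), None, (1, 1)]))  # 0.5
print(bin_frequencies([0.0046, 0.03, 0.05, 0.5, 1.0]))
```

With 109 diploid samples and no missing calls, a variant seen once has AF = 1/218 ≈ 0.0046 and therefore falls in the first (0 to 5%) interval.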

We analyzed the variants using ClinVar annotation, identifying 231 pathogenic and likely pathogenic alterations, with the majority (205) being part of the exome. These alterations include 167 single nucleotide variations (SNVs) and 64 insertions or deletions (indels), affecting 191 unique genes. Most of these variants are rare, with over two-thirds having a minor allele frequency (MAF) of less than 1%. On average, each individual carries 21 of these variants, ranging from 12 to 29. The average allele frequency (AF) for these variants is 0.0598, with a range of 0.00458 to 0.9862.

The distribution of the variants and their top consequence are represented in Fig. 2 and Fig. S1, respectively. These representations reveal distinct patterns in variant types across the exome. Chromosome 1 exhibits the highest total variants (24,350), closely followed by Chromosome 19 with 19,794 variants. In contrast, Chromosome Y shows the lowest variant count (60 variants). SNVs are predominant on most chromosomes, with percentages ranging from 87.49% to 93.45% of the variants. Notably, Chromosome Y stands out with a higher proportion of insertions (13.33%) and complex variants (6.67%) compared to others. Analysis of pathogenic variants reveals significant concentrations on specific chromosomes. Chromosomes 1, 11, and 3 emerge as hotspots with 28, 21, and 18 pathogenic variants, respectively, indicating regions of potential clinical relevance. Chromosomes 6, 12, and 16 also display notable pathogenic variant counts, ranging from 10 to 12 variants. Conversely, Chromosomes 14, 15, and 21 show minimal pathogenic variants. Additionally, we have listed the most frequent variants (55 variants) with high functional impact in the Moroccan population compared to gnomAD (Supplementary Data 1).

Fig. 2: Circular plot showing the spatial distribution of variant counts in 2 Mbp windows and pathogenic variants across the exome.

From outer to inner rings: blue represents SNV distribution, green shows deletions, orange indicates insertions, and red depicts complex variants (scales adjusted for visibility). The innermost ring displays pathogenic variants and their frequency in the Moroccan population.

Loss of function analysis

Using the LOFTEE plugin17 of VEP, 1,086 variants with an allele frequency (AF) greater than 0.01 were predicted to cause a loss of function (LoF) with a high degree of confidence. These variants included 501 SNPs, 346 deletions, 210 insertions, and 29 complex variants. Narrowing the search to LoF variants prevalent in Moroccan samples (AF > 0.05) and rare in other populations (gnomAD exome AF < 0.01) resulted in the detection of 184 variants (69 SNPs, 107 indels, and eight complex variants). While most of these variants were not reported in ClinVar and none were evaluated as pathogenic, some were classified as benign, and others had conflicting interpretations regarding their pathogenicity. The genes harboring these potentially pathogenic variants include PRSS1, related to hereditary pancreatitis; PEX5, involved in peroxisome biogenesis disorder; MROH8, associated with congenital disorder of glycosylation; and OBSL1, implicated in 3M syndrome 2.
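The two-threshold selection described above (common in the cohort, rare in the external reference) amounts to a simple filter. The sketch below is illustrative only: the record layout, the field names `cohort_af` and `reference_af`, and the choice to treat a variant absent from the external dataset as having frequency 0 are all assumptions, not details taken from the study's pipeline.

```python
def enriched_lof_variants(variants, min_cohort_af=0.05, max_reference_af=0.01):
    """Keep variants that are common in the cohort but rare in an external reference.

    `variants` is a list of dicts with hypothetical keys:
      cohort_af     -- allele frequency in the sequenced cohort
      reference_af  -- allele frequency in the external dataset, or None if absent
    """
    kept = []
    for variant in variants:
        reference_af = variant.get("reference_af")
        if reference_af is None:
            # Assumption: a variant absent from the external dataset counts as AF = 0.
            reference_af = 0.0
        if variant["cohort_af"] > min_cohort_af and reference_af < max_reference_af:
            kept.append(variant)
    return kept

example = [
    {"id": "var1", "cohort_af": 0.08, "reference_af": 0.002},
    {"id": "var2", "cohort_af": 0.08, "reference_af": 0.15},
    {"id": "var3", "cohort_af": 0.02, "reference_af": 0.001},
    {"id": "var4", "cohort_af": 0.06, "reference_af": None},
]
print([v["id"] for v in enriched_lof_variants(example)])  # ['var1', 'var4']
```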

Moroccan major-allele reference genome

The Moroccan major-allele reference genome (MMARG) was based on 2,257,746 variants, including 1,907,253 SNPs and 350,493 indels. Compared to GRCh38, variant calling using the MMARG showed a consistently lower variant count across all chromosomes (Table 2, Fig. S2). The total number of detected variants using the GRCh38 reference was 4,978,994, while the total count using the MMARG was 2,737,930, resulting in a difference of 2,241,064, equivalent to a percentage reduction of 45.01%. Chromosome Y exhibited the highest reduction of 64.57%, followed by chromosome X at 52.78% and chromosome 21 at 51.87%. The lowest reductions were observed in chromosome M at 40.54%, chromosome 5 at 41.28%, and chromosome 16 at 41.90%.
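The genome-wide reduction quoted above follows directly from the two totals. The short check below reproduces the arithmetic; the function name is introduced here for illustration only.

```python
def percent_reduction(baseline_count, new_count):
    """Percentage by which new_count is lower than baseline_count."""
    return 100.0 * (baseline_count - new_count) / baseline_count

grch38_calls = 4_978_994  # variants called against GRCh38
mmarg_calls = 2_737_930   # variants called against the MMARG

difference = grch38_calls - mmarg_calls
print(difference)                                            # 2241064
print(round(percent_reduction(grch38_calls, mmarg_calls), 2))  # 45.01
```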

Table 2 Variant call reduction using the MMARG on a Moroccan genome compared to GRCh38

Genetic relationships between Moroccan and worldwide populations

The genetic diversity of the Moroccan population was analyzed by comparing its genomic data with that of the 1000 Genomes Project and the Human Genome Diversity Project using various statistical methods and analyses. The principal component analysis (PCA) positioned the Moroccan population and the Mozabites within the same cluster along the Europe-Africa axis, showing strong genetic proximity between the two populations. A genetic proximity was also observed with the European and Middle Eastern clusters and, to a lesser extent, with the American cluster, as shown in Fig. 3a (for a more in-depth visualization, refer to the Supplementary data 2).

Fig. 3: Genetic structure of the Moroccan population.

a Principal component analysis (PCA) was conducted using data from 3,586 individuals representing various populations worldwide. The points are color-coded according to the superpopulations. b ADMIXTURE results at K = 19 with a zoom on the Moroccan population, showing four major ancestral components. c Heat map of pairwise Fst values between Moroccan genomes and various populations. The shown values correspond to the Fst multiplied by 1000. d Box plot of the total length of runs of homozygosity (ROHs) for Moroccans compared with other populations. Colors indicate superpopulations. The number of individuals per population is shown in brackets. Box plots indicate the median and lower/upper quartiles; whiskers represent the most extreme data points not exceeding 1.5 times the interquartile range; and outliers are data points that fall outside the whiskers. Additionally, P-values comparing the mean total lengths of ROH were estimated using ggpubr (ref. 57). The choice of populations for calculating Fst and ROHs was based on their proximity to the Moroccan population in the PCA and ADMIXTURE results. The results of the PCA, Fst, and ROH analyses were visualized using R (ref. 58). The ADMIXTURE results were visualized with Pong v1.5 (ref. 59).

The PCA results were corroborated by the ADMIXTURE analysis (Fig. S3). We chose K = 19 to estimate the ancestry of the Moroccan population, as this value exhibited the lowest cross-validation (CV) error. We found that about 80% of the ancestry of the analyzed Moroccan genomes was accounted for by four major components, namely North African (51.2%), European (10.9%), Middle Eastern (10.7%), and West African (6.8%). Additionally, these results demonstrate low genetic heterogeneity, evidenced by minimal variation in the proportion of ancestral components among individuals (Fig. 3b).

Furthermore, a subset of populations from the ADMIXTURE analysis, including European, African, North African, and Middle Eastern populations, was used to conduct pairwise Fst analysis due to their genetic closeness to the Moroccan population. In total, 618 individuals from 38 populations were included in the dataset. The analysis showed that Moroccans exhibited the lowest genetic distance with the Mozabites (Fst × 1000 = 8.147), while the largest genetic distance was observed with the Surui population (Fst × 1000 = 139.996) (Fig. 3c) (Supplementary data 3).
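To make the Fst scale concrete, the sketch below computes a pairwise Fst from allele frequencies and applies the ×1000 scaling used in Fig. 3c. The study computed Fst with SNPRelate and does not state which estimator was used, so this sketch swaps in Hudson's estimator (ratio of averages) purely as an illustration; the function names and the toy inputs are assumptions.

```python
def hudson_fst(freqs_1, freqs_2, n_chrom_1, n_chrom_2):
    """Hudson's Fst estimator, combined across sites as a ratio of averages.

    freqs_1, freqs_2 -- per-site sample allele frequencies in populations 1 and 2
    n_chrom_1/2      -- number of sampled chromosomes (2 x individuals) per population
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n_chrom_1 - 1)
                      - p2 * (1 - p2) / (n_chrom_2 - 1))
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        return 0.0  # no informative sites; returning 0 is a convention chosen here
    return numerator / denominator

def scaled(fst, factor=1000):
    """Scale Fst for display, as in the heat map (values multiplied by 1000)."""
    return round(fst * factor, 3)

# A site fixed for different alleles in the two populations gives Fst = 1.
print(hudson_fst([1.0], [0.0], 20, 20))  # 1.0
# A raw Fst of 0.008147 corresponds to the reported scaled value 8.147.
print(scaled(0.008147))                  # 8.147
```

Because of the sample-size correction terms, the estimator can return slightly negative values for populations with identical frequencies; such values are usually read as no detectable differentiation.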

The average total length of ROHs (> 1 Mb) (Supplementary data 4) in the Moroccan population was comparable to that in Middle Eastern and Mozabite populations, with no significant differences observed (p ≥ 0.05, Wilcoxon test) (Supplementary data 5). These populations exhibited relatively large ROHs compared to most other populations, which can be attributed to the widespread practice of consanguineous marriages in these regions. In addition, the Luhya population had the shortest ROHs, while the Karitiana population exhibited the largest ROHs (Fig. 3d).

Mitochondrial and Y DNA haplogroups identification

To further validate our previous findings, we conducted haplogroup analysis using mitochondrial DNA (mtDNA) and Y chromosome markers. Mitochondrial haplogroups were categorized into three broad geographical categories based on the classification suggested by Coudray et al.23: European haplogroups (H, HV, R0, J, T, U, W), Sub-Saharan African haplogroups (L0, L1, L2, L3), and North African lineages (U6, M1). Our results indicated that among the 109 Moroccan samples analyzed, 73% exhibited European haplogroups (H (29.4%), U (15.6%), T (8.3%), and J (2.8%)). This finding corroborates the hypothesis that the predominant European maternal contribution observed in Northwest African populations is likely linked to prehistoric origins, specifically the post-glacial expansion from the Iberian Peninsula, rather than more recent historical events24. Additionally, 19% of samples were Sub-Saharan African, including L2 (27.3%), L3 (11%), and L1 (10.1%), while 8% of mitochondrial haplogroups were attributed to indigenous North African lineages M (5.5%) (Fig. 4a). The Y chromosome analysis identified the E1b1b1 (M35) haplogroup as the most frequent in the Moroccan population; this lineage is also found at various frequencies throughout North and East Africa25,26 (Fig. 5).

Fig. 4: Mitochondrial haplogroup distribution and frequency.

a Total haplogroups frequency for the Moroccan population. b DNA D-loop Haplotype Network: Median-Joining Network Comparing 109 Moroccans with African, European, and American Populations. Green circle indicates Moroccan haplogroups.

Fig. 5: Y-chromosome haplogroup distribution in 109 Moroccan males.

The bar chart depicts the frequencies of Y-chromosome haplogroups found in a sample of 109 Moroccans. E1b1b1 is the most common (36.6%), followed by F (19.5%) and G2 (17.1%). Less frequent haplogroups include E1b1, R1, E1, R1b1, and K. Colored bars represent each haplogroup, with corresponding percentages indicated.

Haplotype network

Upon initial examination, two prominent clusters emerged, effectively delineating African and European haplotypes. The American haplotypes, while predominantly situated within the European cluster, also exhibited a representation within the African cluster, forming discernible subclusters, particularly on the right side of the network. Notably, Moroccan haplotypes predominantly aligned with the European cluster, constituting approximately 66% of the total Moroccan samples, with the remaining 34% distributed among the African haplotype cluster. Additionally, Moroccan haplotypes demonstrated the formation of subclusters, accounting for roughly 24% of the total Moroccan samples, showcasing a degree of intra-group diversity within the population (Fig. 4b).

Discussion

In this study, the analysis of the 109 Moroccan genomes revealed 27,935,252 variants, indicating higher genetic variability in the Moroccan population compared to the Egyptian and UAE populations19,27. The proportion of new Moroccan variants ranged from 3.90% to 5.71% for autosomes and was 6.55% and 46.34% for the X and Y chromosomes, respectively. The proportion of newly identified variants in autosomes was slightly lower than that observed in the UAE population (7-8%), and it was also lower for the X chromosome (6.55% vs. 12%). However, the proportion of new Y chromosome variants in Moroccans was higher than in the UAE population (46.34% vs. 41%)27.

According to the latest statistics from the World Health Organization (WHO), in Morocco, noncommunicable diseases such as ischemic heart disease, stroke, hypertensive heart disease, kidney diseases, cancers (trachea, bronchus, lungs), and diabetes mellitus have the highest mortality rates28. The analysis of the 109 Moroccan genomes revealed the presence of several variants that are potentially pathogenic or represent risk factors associated with some of these diseases, according to ClinVar phenotypic data. Among the potentially pathogenic variants, the p.Leu1966Thrfs*4 variant (rs746838237) was identified in the PKHD1 gene, with a frequency of 0.00459. This variant is involved in autosomal recessive polycystic kidney disease, a genetic condition that can lead to pulmonary hypoplasia, the main cause of neonatal morbidity and mortality, and, in survivors, to hypertension and renal insufficiency29.

Regarding variants that represent risk factors, we identified three with relatively high allele frequencies: p.Thr60Asn (rs1041981, AF = 0.26606) in the LTA gene, p.Arg877Gln (rs5174, AF = 0.18349) in the LRP8 gene, and p.Lys167Asn (rs11053646, AF = 0.09174) in the OLR1 gene, all associated with myocardial infarction. Notably, the p.Lys167Asn variant modifies the ligand-binding domain of the OLR1 protein, contributing to an increased risk of myocardial infarction by potentially altering the protein’s function30,31. Similarly, two other variants linked to susceptibility to type 2 diabetes mellitus (T2DM) were identified, namely p.Arg276Trp (rs13266634, AF = 0.13761) in the SLC30A8 gene and p.Gly1057Asp (rs1805097, AF = 0.22477) in the IRS2 gene. Studies have shown that individuals carrying the p.Gly1057Asp variant exhibit decreased serum insulin and C-peptide concentrations during oral glucose tolerance tests. Additionally, this variant increases T2DM risk in obese individuals but decreases it in lean individuals32,33.

Additionally, due to its cultural diversity, Morocco is a country where consanguineous marriage is widespread (as evidenced by the relatively high average total length of ROHs), thus increasing the risk of the appearance of genetic disorders34. For instance, according to ClinVar, a potentially pathogenic variant (rs72474224) in GJB2, a gene associated with hearing loss, was identified with a frequency of 0.045, with carriers being heterozygous. The GJB2 gene has been previously reported as the main cause of non-syndromic hereditary deafness in Morocco35. However, at the time of sampling, all individuals in our cohort reported no known genetic diseases or disorders. Therefore, it is crucial to conduct an in-depth analysis of the variants classified as pathogenic in the context of genetic penetrance, mode of transmission, and homozygosity status.

On the other hand, establishing a population-specific reference genome is necessary to reliably identify genetic variants unique to individuals and common within the population. The MMARG proposed in this study is a first step towards constructing the Moroccan reference genome. Indeed, variant calling using the MMARG has considerably reduced the number of variants in Moroccan genomes compared with the standard GRCh38 reference genome. These variants, with updated allele frequencies, represent the true genetic variability within the Moroccan population, improving the accurate identification of hereditary genetic risks in Moroccans.

Early genomic studies of North African populations often relied on small sample sizes and targeted specific genetic markers, resulting in a limited understanding of genetic variability and population structure. The first whole-genome study to date was conducted by Crooks et al.21 on three Moroccan individuals. In that study, principal component analysis revealed that these individuals have both African and European ancestry. Boumajdi et al.22 further validated and strengthened these findings by conducting an in-depth analysis of their ancestral backgrounds. Despite these results, it was concluded that determining the exact ancestry of the Moroccan population using only the genomes of three individuals remains challenging, as this is a small sample size.

The analysis of 109 sequenced genomes in the present study corroborates the results of the previous studies and further confirms the North African and Middle Eastern ancestries of the Moroccan population. Using ADMIXTURE analysis, a predominant genetic component specific to North Africans was observed in the Moroccan and Mozabite populations, likely representing the autochthonous component dating back to the Epipaleolithic era, which has persisted in North Africans as identified in studies of ancient and modern genomes36,37. Genetic influences from the Middle East, Mediterranean Europe, and West Africa can be attributed to the country’s geographical location and the significant migration events, which have slightly diluted the autochthonous North African component18. In contrast, this component is less represented in the Egyptian genome (15% relative influence) than the more predominant Middle Eastern component (27% relative influence)19.

This study provides valuable insights into the genetic architecture of the Moroccan population and highlights the importance of population-specific genetic studies for precision medicine and public health interventions. The analysis of genetic mutations presented here serves as a foundation for future research aimed at addressing the healthcare needs of the Moroccan population and advancing genomic medicine in North Africa. However, the potential limitations of this study concern sample size and geographical stratification, as recruitment in this first phase of the project was conducted randomly. A comprehensive strategy for the Moroccan Genomics Project (MGP) is currently being developed to address all critical aspects of this large-scale initiative. Furthermore, the newly identified variants have not yet been thoroughly analyzed and will be the subject of future studies to assess their functionality and impact.

Methods

Ethics statement

The Ethical Committee of the Faculty of Medicine and Pharmacy in Rabat, Morocco (Approval No. 10/15) approved this study. All subjects gave written informed consent in accordance with the Declaration of Helsinki. All ethical regulations relevant to human research participants were followed.

Recruitment strategy

The recruitment strategy aimed to constitute a sample exclusively composed of Moroccan individuals to capture the genetic diversity of the population accurately. In this first phase of the MGP, a cohort of 109 apparently healthy participants was recruited, representing around 10% of the targeted 1000 Moroccan genomes. The recruitment process was carried out in Rabat, Morocco’s capital, which is centrally located and serves as a key destination for individuals from various regions of the country. Following a public announcement of the project, volunteers were randomly selected. Eligibility criteria included a minimum age of 18, no family ties between participants, and confirmed Moroccan ancestry over four generations. All volunteers signed an informed consent form after being fully briefed on the objectives of the study, which aimed to explore the genetic diversity of the Moroccan population. The final cohort is distributed as follows: 53 individuals from the north of the country, 47 from the center, and nine from the south of Morocco.

DNA isolation and sequencing

Genomic DNA was extracted from collected peripheral blood using the Qiagen QIAamp DNA Blood Mini Kit (QIAGEN, Hilden, Germany) according to the manufacturer’s instructions. DNA concentration and purity were measured using a NanoDrop spectrophotometer, and fluorometric quantification was performed with the Qubit dsDNA HS Assay Kit (Invitrogen™). Extracted DNA samples were provided to Yale (Yale Center for Genomic Analysis through the Pediatric Genomics Discovery Program (PGDP), https://www.yalemedicine.org/departments/pediatric-genomics) for sequencing. Whole-genome sequencing (WGS) libraries were prepared using the TruSeq DNA PCR-Free High-Throughput Library Prep Kit, following the manufacturer’s protocol. Briefly, 1 µg of DNA was fragmented using a Covaris E210 Ultrasonicator (adaptive focused acoustics). DNA fragments underwent bead-based size selection (SPRIselect, Beckman Coulter) and were subsequently end-repaired, adenylated, and ligated to Illumina sequencing adapters (IDT for Illumina – TruSeq DNA UD Indexes). Final libraries were evaluated using fluorescence-based assays, including concentration measurement with the Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies) and quantification by qPCR. Quality was assessed using the BioAnalyzer (Agilent 2100). Finally, libraries were sequenced on an Illumina NovaSeq 6000 system with 2×150 bp paired-end reads. All samples were sequenced to a minimum depth of 30× coverage.

Quality control, variant calling and annotation

Quality assessment of raw sequencing data was conducted using FastQC v0.11.3 (ref. 38) to evaluate overall yield and base quality. The reads were then aligned to the GRCh38 human reference genome with BWA-MEM v0.7.15 (ref. 39), followed by duplicate marking using Picard MarkDuplicates v2.4.1 (https://broadinstitute.github.io/picard). Base Quality Score Recalibration (BQSR) was performed with GATK v3.5 (ref. 40) as part of the Centers for Common Disease Genomics (CCDG) standardized functional equivalence pipeline41. Subsequent analyses focused on variant detection, wherein SNVs and indels were identified using GATK v3.5. The GVCFs were merged into batches of 109 samples with “CombineGVCFs” and genotyped using “GenotypeGVCFs”. To ensure high-quality variant calls, Variant Quality Score Recalibration (VQSR) was applied with sensitivity thresholds of 99.8% for SNVs and 99.0% for indels. The final variant call set (VCF file) was normalized, and multi-allelic sites were split using the “norm” option in BCFtools v1.19 (ref. 42) to facilitate downstream variant analysis. For Hardy-Weinberg equilibrium assessment and linkage disequilibrium analysis, variants that passed the GATK VQSR filter were processed through several quality control steps to ensure high data integrity. First, multi-allelic variants were removed from the VCF files using BCFtools, retaining only biallelic sites. Variants with more than 11 missing genotypes (representing >10% missing data) were filtered out to maintain dataset completeness. In contrast to the LD analysis, variants on the Y and M chromosomes were excluded from the HWE analysis to avoid sex chromosome bias. A p-value threshold of 10⁻⁶ was applied to identify deviations from HWE using PLINK2 with the option “--hardy” (ref. 43). For LD analysis, the data were tested with PLINK2 to identify high LD regions using a 200 kb window and an r² threshold of 0.5. Finally, all identified variants underwent annotation via ANNOVAR44 and a custom pipeline that integrated allele frequency data, OMIM and ClinVar references, and predictive in silico attributes.
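The HWE filter described above can be illustrated with a small exact test on genotype counts. The sketch below follows the general approach of the exact test of Wigginton et al. (summing the probabilities of all heterozygote counts that are no more likely than the observed one); whether it matches the statistic reported by the PLINK2 run in this study in every detail is an assumption, and the function names are introduced here for illustration.

```python
def hwe_exact_pvalue(n_hom_a, n_het, n_hom_b):
    """Exact Hardy-Weinberg test p-value from biallelic genotype counts."""
    n = n_hom_a + n_het + n_hom_b
    if n == 0:
        return 1.0
    rare = 2 * min(n_hom_a, n_hom_b) + n_het  # copies of the rarer allele

    # Start near the expected heterozygote count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs = {mid: 1.0}  # unnormalised probabilities, relative to probs[mid]

    # Walk down to fewer heterozygotes.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h >= 2:
        probs[h - 2] = probs[h] * h * (h - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        h, hom_r, hom_c = h - 2, hom_r + 1, hom_c + 1

    # Walk up to more heterozygotes.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h <= rare - 2:
        probs[h + 2] = probs[h] * 4.0 * hom_r * hom_c / ((h + 2) * (h + 1))
        h, hom_r, hom_c = h + 2, hom_r - 1, hom_c - 1

    total = sum(probs.values())
    observed = probs[n_het]
    # Small relative tolerance so ties with the observed count are included.
    tail = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, tail / total)

def count_hwe_failures(genotype_counts, threshold=1e-6):
    """Number of variants whose exact-test p-value falls below the threshold."""
    return sum(1 for counts in genotype_counts if hwe_exact_pvalue(*counts) < threshold)

# Toy counts: (hom ref, het, hom alt) for three variants in 100 individuals.
toy = [(25, 50, 25), (50, 0, 50), (81, 18, 1)]
print(count_hwe_failures(toy))  # 1
```

In the toy data only the second variant, which has no heterozygotes despite equal allele frequencies, falls below the 10⁻⁶ threshold.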

Moroccan major-allele reference genome construction and assessment

A Moroccan Major-Allele Reference Genome (MMARG) was constructed to improve variant calling in the Moroccan population. Initial filtering retained only SNPs and indels, excluding complex variants. Positions with more than 11 missing genotypes were excluded. Allele frequencies were calculated for each position to identify sites where the alternative allele is the major allele (AF > 0.5). The GRCh38 reference genome was then modified by replacing the reference allele at each of these sites with the corresponding major allele using BCFtools consensus, resulting in a new reference genome.
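The logic of this construction can be sketched on a toy sequence. The example below is a simplified illustration, not the pipeline used in the study (which applied BCFtools consensus to GRCh38): it handles SNPs only, and the record layout, field names, and function names are assumptions made for the example.

```python
def select_major_alt_sites(records, n_samples, max_missing=11):
    """Keep sites where the alternate allele is the major allele (AF > 0.5).

    `records` are dicts with hypothetical keys: pos (1-based), ref, alt,
    alt_count (alternate allele copies) and n_missing (missing genotypes).
    """
    selected = []
    for rec in records:
        if rec["n_missing"] > max_missing:
            continue  # too many missing genotypes
        called_chromosomes = 2 * (n_samples - rec["n_missing"])
        if called_chromosomes and rec["alt_count"] / called_chromosomes > 0.5:
            selected.append(rec)
    return selected

def apply_major_alleles(reference, records):
    """Return a copy of `reference` with SNP major alleles substituted.

    Indels are skipped in this sketch because they shift coordinates.
    """
    seq = list(reference)
    for rec in records:
        if len(rec["ref"]) != 1 or len(rec["alt"]) != 1:
            continue
        index = rec["pos"] - 1
        if seq[index] != rec["ref"]:
            raise ValueError(f"reference mismatch at position {rec['pos']}")
        seq[index] = rec["alt"]
    return "".join(seq)

toy_reference = "ACGTAC"
toy_records = [
    {"pos": 2, "ref": "C", "alt": "T", "alt_count": 150, "n_missing": 0},   # AF ~0.69
    {"pos": 4, "ref": "T", "alt": "G", "alt_count": 100, "n_missing": 0},   # AF ~0.46
    {"pos": 5, "ref": "A", "alt": "C", "alt_count": 190, "n_missing": 12},  # too many missing
]
major_sites = select_major_alt_sites(toy_records, n_samples=109)
print(apply_major_alleles(toy_reference, major_sites))  # ATGTAC
```

Only the first toy site is substituted: the second has an alternate allele frequency below 0.5, and the third exceeds the missing-genotype limit.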

To evaluate the effectiveness of the MMARG, a whole-genome-sequenced Moroccan sample (SRR12681649) was selected from NCBI. The variant calling pipeline was run twice on the raw data, once against the MMARG and once against the standard GRCh38 reference. The two resulting variant call sets were then compared to assess differences in variant detection between the reference genomes.
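
One simple way to frame such a comparison is as set operations on variant keys, as in the sketch below. It assumes that both call sets have already been expressed in the same coordinate system, which is not automatically the case when indels have been incorporated into a modified reference. The function name compare_call_sets and the example calls are hypothetical and do not come from the study.

```python
def compare_call_sets(calls_a, calls_b):
    """Split two collections of (chrom, pos, ref, alt) keys into
    shared calls and calls unique to each collection."""
    a, b = set(calls_a), set(calls_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical calls for illustration only.
calls_grch38 = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")]
calls_mmarg = [("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")]

summary = compare_call_sets(calls_grch38, calls_mmarg)
for name, calls in summary.items():
    print(name, len(calls))
```

Note that a site where the major allele was substituted into the reference has its REF and ALT alleles swapped, so a naive key comparison such as this one counts it as a difference.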

Population genetic structure

To study the genetic structure of the Moroccan population, we compared the variant data of 109 Moroccan individuals with those of 3,477 individuals from 75 global populations in the 1000 Genomes Project45 and the Human Genome Diversity Project46 (Supplementary data 6). In the Moroccan dataset, we retained only autosomal SNPs and indels using BCFtools v1.1942. We then filtered out multiallelic variants, variants with missing genotypes, and variants with a minor allele frequency below 0.05. In addition, we removed variants deviating from Hardy-Weinberg equilibrium (HWE; p < 10⁻⁶) using VCFtools v0.1.1647. To eliminate variants in regions of high linkage disequilibrium (LD), we performed LD pruning with PLINK v1.948 using the option “--indep-pairwise 1000 10 0.2”. We subsequently intersected all the datasets using “bcftools isec” and merged them with “bcftools merge”. From the resulting file, we removed indels and excluded variants with more than 5% missing genotypes, as well as those violating HWE (p < 10⁻⁶). Finally, we performed a second round of LD pruning with the “--indep-pairwise 100 10 0.4” parameters. After filtering, we retained 76,785 variants.
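
The idea behind window-based LD pruning can be sketched in a few lines of Python. This is a deliberately simplified illustration and not PLINK's algorithm: windows and steps are counted in variants, r² is the squared Pearson correlation of genotype dosages, and, as an assumption of this sketch, the later variant of each correlated pair is the one dropped. The function names r_squared and ld_prune and the toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=100, step=10, r2_max=0.4):
    """Return the indices of the variants kept after greedy pruning.

    genotypes: one dosage vector per variant, in genomic order.
    """
    kept = set(range(len(genotypes)))
    for start in range(0, len(genotypes), step):
        stop = min(start + window, len(genotypes))
        in_window = [i for i in range(start, stop) if i in kept]
        for offset, i in enumerate(in_window):
            if i not in kept:
                continue
            for j in in_window[offset + 1:]:
                if j in kept and r_squared(genotypes[i], genotypes[j]) > r2_max:
                    kept.discard(j)  # simplification: drop the later variant
    return sorted(kept)


# Toy dosages for six individuals (hypothetical values).
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],  # variant 0
    [0, 1, 2, 0, 1, 2],  # variant 1: identical to variant 0
    [2, 0, 1, 1, 0, 2],  # variant 2: uncorrelated with variant 0
    [2, 1, 0, 2, 1, 0],  # variant 3: perfectly anti-correlated with variant 0
]
print(ld_prune(toy_genotypes, window=100, step=10, r2_max=0.4))
```

With these toy data, variants 1 and 3 are removed because their r² with variant 0 equals 1, while variants 0 and 2 are kept.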

Next, we conducted a Principal Component Analysis (PCA) using smartPCA v18140 from the EIGENSOFT package v8.0.049. We then performed an admixture analysis with the ADMIXTURE software v1.3.050. To determine the optimal number of ancestral populations (K value), we ran the cross-validation procedure for K values ranging from 2 to 25. Additionally, we performed a pairwise Fst analysis using the R package SNPRelate v1.36.151. Finally, we analyzed runs of homozygosity (ROHs) in the Moroccan population and compared them with those of several populations from Europe, Africa, North Africa, and the Middle East. ROHs were identified from the same files used for the PCA, ADMIXTURE, and Fst analyses. Detection was performed with PLINK using the “--homozyg” option, and ROHs larger than 1 Mb were summed to estimate the total ROH length per individual. Only 594 individuals were retained for this analysis.
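
The per-individual ROH summation can be sketched as follows. The sketch assumes whitespace-delimited segment records with IID and KB columns, as in the segment file written by PLINK 1.9 for “--homozyg”. The function name total_roh_mb and the example records are hypothetical.

```python
def total_roh_mb(hom_lines, min_kb=1000.0):
    """Sum the lengths of ROH segments longer than min_kb per individual.

    hom_lines: header line followed by one line per ROH segment.
    Returns a dict mapping individual ID to total ROH length in Mb;
    individuals without any qualifying segment get 0.0.
    """
    header = hom_lines[0].split()
    iid_col, kb_col = header.index("IID"), header.index("KB")
    totals = {}
    for line in hom_lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        totals.setdefault(fields[iid_col], 0.0)
        segment_kb = float(fields[kb_col])
        if segment_kb > min_kb:
            totals[fields[iid_col]] += segment_kb / 1000.0
    return totals


# Hypothetical segment records for two individuals.
toy_hom = [
    "FID IID PHE CHR SNP1 SNP2 POS1 POS2 KB NSNP DENSITY PHOM PHET",
    "F1 S1 -9 1 rs1 rs2 1000000 3500000 2500.0 120 20.8 1.0 0.0",
    "F1 S1 -9 2 rs3 rs4 500000 1300000 800.0 60 13.3 1.0 0.0",
    "F1 S1 -9 3 rs5 rs6 100000 1600000 1500.0 90 16.7 1.0 0.0",
    "F2 S2 -9 1 rs7 rs8 200000 900000 700.0 50 14.0 1.0 0.0",
]
print(total_roh_mb(toy_hom))
```

In this toy input, only the 2,500 kb and 1,500 kb segments of individual S1 exceed the 1 Mb cutoff, so S1 totals 4.0 Mb and S2 totals 0.0 Mb.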

Haplogroups analysis

Mitochondrial DNA (mtDNA) analysis is a well-established tool in population genetics that offers insight into genetic diversity and evolutionary processes52,53. We used the free mtDNA haplogroup classification service Haplogrep354 (http://haplogrep.uibk.ac.at) to perform haplogroup calling. Briefly, Haplogrep employs an unsupervised clustering approach based on Phylotree 17, a global phylogenetic tree representing human mitochondrial DNA diversity, to assign each participant to a specific mitochondrial haplogroup according to the detected mtDNA variants. For Y-chromosome haplogroup classification, we used Y-LineageTracker55, which references Y-SNP markers and the topology of human Y-chromosome trees such as the ISOGG Y-DNA tree (https://www.isogg.org/tree) and the YFull tree (https://www.yfull.com/tree). This classification was performed on 42 male individuals.

Haplotype network construction

A phylogenetic network of mtDNA D-loop haplotypes was constructed using the median-joining method in PopART v1.756. The multiple sequence alignment was derived from the VCF file using Haplogrep354. To provide comparative context, we analyzed 90 random sequences from the 1000 Genomes Project, comprising 30 Europeans, 30 Americans, and 30 Africans (Supplementary data 6).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.