Introduction

Data produced by whole mitochondrial genome (mitogenome) sequencing can substantially increase the discriminatory power of mitochondrial DNA (mtDNA), revealing ancestry-specific variants1. Due to its maternal inheritance, mtDNA can be used to obtain information about genetic ancestry through the identification of mitochondrial haplogroups, defined by the sequential accumulation of alterations inherited from a common ancestor1,2. This information can be used to trace the phylogeographic distribution of mtDNA in different populations and to identify lineages of specific geographic regions. Further, mtDNA variation may contribute to differences in the prevalence of diseases observed among different ethnic groups1,2.

However, current mtDNA haplogroup identification methods face certain difficulties, such as the bias of classification based on consensus haplotypes of the revised Cambridge Reference Sequence (rCRS)3,4. Additionally, admixed populations, such as those in the Americas, represent a peculiar case for genetic ancestry inference because they originated through a complex process of admixture from population of different continents: Indigenous, Europeans, Africans, and more recently East Asians5,6.

In this context, the Brazilian population is one of the most ethnically diverse in the world and is currently divided into five main macro-regions (North, Northeast, Central-West, Southeast, and South), each with distinct histories of colonization and settlement. This diversity is reflected in the genetic structure of the country7,8. Thus, characterization of mitochondrial ancestry markers can elucidate past and current patterns of admixture, complementing prior knowledge about the biological composition of the Brazilian population7,8,9,10.

Among the Brazilian macro-regions, the Amazonian (or North) region stands out for having a larger contribution of Indigenous mitochondrial haplogroups (A–D and X)11,12,13 in its population compared to other regions of the country7,14,15,16. However, there is still a lack of complete mitogenome data for this territory, which hinders demographic genetic studies and limits our understanding of individuals health-disease status, especially those of Indigenous ancestry.

Here, we investigated the mitochondrial haplogroups of the Amazonian region of Brazil using mitogenomes previously analyzed and published by our research group. We assessed the contribution of Indigenous lineages to the formation of this highly admixed population and evaluated the reliability of maternal ancestry predictors in defining Indigenous haplogroups.

Results and discussion

Characterization and genetic composition of the cohort

We investigated the mitogenome of 157 admixed individuals from the North region. Although there were more male than females, they both presented similar average age (Table 1). Thirteen macrohaplogroups were identified using HaploGrep3, which can be divided among those of Indigenous (macrohaplogroups A, B, C, and D), European (macrohaplogroups H, J, N, T, and U), and African (macrohaplogroups L0, L1, L2, and L3) ancestry (Fig. 1A). As expected, Indigenous haplogroups represented the majority (64.97%), followed by African (28.03%) and European haplogroups (7%). This haplotypic distribution is consistent with the literature and reflects the processes of occupation and colonization of the Brazilian territory, especially of the Amazon region7,17,18,19,20. When sub-structuring the Indigenous macrohaplogroups, haplogroups C, A2, and B2 were the most frequent in the cohort (Fig. 1B), consistent with previous description of Indigenous population from the region7,19,20,21. Although haplogroups A–D are also present in East Asia, fact that supports the East Asian origin of Indigenous lineages21, we find unlikely that all of the individuals here with these haplogroups are indeed of genetic East Asian origin.

Fig. 1: Frequency of mitochondrial macrohaplogroups and haplogroups in the Amazonian population.
Fig. 1: Frequency of mitochondrial macrohaplogroups and haplogroups in the Amazonian population.
Full size image

A It shows the frequency of all macrohaplogroups found in our cohort and B It shows only Indigenous haplogroups in detail. Bars were colored according to haplogroup ancestry.

Table 1 Demographics of the investigated cohort

Evaluation of indigenous haplogroup identification

While evaluating the mitochondrial haplogroup assignments of HaploGrep3 tool, we observed that those belonging to European and African ancestries had an average prediction confidence around 0.95 while Indigenous haplogroups was around 0.89 (Fig. 2A). The lower assignment confidence of Indigenous haplogroups reflects the limited data available for such ethnic background in global public databases22,23. Admixed Americans comprise only 3.72% gnomAD24, currently one of the largest genomic databases. The mtDNA-specialized database EMPOP25 has 17% of their samples attributed to Indigenous origin and 6% to admixed Americans, among which only eight Indigenous samples are from Brazil and none are from the Amazon region. Such restrictions interferes with the correct identification of mitochondrial lineages that are still represented today in the Amazon region.

Fig. 2: Haplogroup identification confidence using HaploGrep3.
Fig. 2: Haplogroup identification confidence using HaploGrep3.
Full size image

Each point represents a mitogenome haplogroup assignment confidence. Samples were grouped according to their geographic region and their average ±1 SD confidence indicated by black point and line. A presents mitogenomes from this study grouped according to their haplogroup ancestry in African (BRA [AFR]), European (BRA [EUR]), and Indigenous (BRA [IND]). B presents mitogenomes from 1000 Genomes Project grouped according to their geographic groups in African (AFR), East Asian (EAS), European (EUR), South Asia (SAS), and Admixed American (AMR) and those from this study (BRA [ALL]).

Further, European, African and East Asian populations from the 1000 Genomes Project (1000G) have an average prediction confidence greater than 0.95 when carrying out the same analysis, while our sample presented an average confidence of 0.9 (Fig. 2B). This result is not surprising given that the populations included in 1000G were more frequently investigated and have more information available in public databases, unlike of the Amazonian populations from Brazil21.

Underrepresentation of ethnically diverse populations, such as those from the Amazon, in global genetic datasets can impact ancestry inference, impair discovery of population-specific variants, hinder identification of disease causal variants, reduce accuracy of genetic risk predictions, and limit our understanding of shared genetic risk factors affecting health outcomes22,26,27.

Distribution of indigenous mitochondrial variants

We investigated all the variants highlighted by HaploGrep3, as well as all the expected variants for the assigned haplogroup that were absent in the sequence of 102 mitogenomes of Indigenous ancestry. Most ancestry-defining haplotypes were identified. However, the classification of some individuals presented a significant margin of error, attributed to the absence of certain ancestry-defining variants (Table 2).

Table 2 Distribution of haplotype-defining variants in the amazon population among indigenous haplogroups

Although most of A2, B2, and C individuals presented all their ancestry-defining variants, not all of them possess every required mutation, which suggests that the missing variants may have undergone reverse selection. Eight individuals from haplogroup A2 lacked the 16111T and 1736G variants and 14 individuals did not present 8027A, all haplogroup-defining variants, thus being uncertain if they were correctly classified. Unlike those, an A8 individual did not present the variant 16242T, essential for the characterization of such haplogroup, so this individual should have been assigned to the haplogroup A, due to the presence of 16223T, 16290T and 16319A and absence of 16242T28,29.

Regarding haplogroups B2 and B4, all B2 individuals lacked the 15535T mutation, and seven lacked the 3547G mutation. Among the C individuals, five lacked the 10873C mutation, and one individual did not present the 16327 T mutation, which is crucial for the characterization of this haplogroup. This suggests that this individual might be an intersection between haplogroups C and CZ28,29. In our whole cohort, only 23 individuals had the 7028T mutation (A2 = 4, B2 = 6, C = 4, C1 = 1, D1 = 1, L1 = 1, L2 = 3, L3 = 2, and N = 1). This is curious since this variant is highly frequent in most populations, especially in Asian, African, and Latin American/admixed populations, except for the European haplogroup H24,30.

Phylogenetic analyses

Aiming to explore the relationship between our samples, we reconstructed a phylogenetic tree using the maximum parsimony method and rooted the resulting tree on two L0 individuals (Methods; Figs. 3, S1). The resulting tree reflects the major human mitochondrial divisions presented in PhyloTree28,31, where all L samples were outgroups and meta-haplogroups M (composed by haplogroups C and D), N (A and N), and R (H, J, T, U, and B) grouped as expected. Overall, samples were clustered according to their assigned macrohaplogroups, thus reinforcing the robustness of this analysis.

Fig. 3: Maximum parsimony phylogenetic tree.
Fig. 3: Maximum parsimony phylogenetic tree.
Full size image

The tree was constructed using maximum parsimony method with 1500 bootstrap samples and rooted on two L0 individuals. Represented using a circular layout that preserves the topological information but not the branches length. Tree tips and branches shared by same haplogroup were colored according to their haplogroup assignment. Branches shared by multiple haplogroups remained gray. Internal clades with more than 50% of bootstrap support are indicated by a dot. Tip shapes indicate haplogroups ancestry.

However, we observed unexpected behaviors when analyzing within Indigenous haplogroups. Individuals classified as C and C1 constitute a single indistinguishable clade from each other. Haplogroup C1 is mainly distinguished from C by the presence of 16325C, but almost all samples classified as C present this variant. Individuals classified as D1 and D4 presented a similar pattern with all grouped under a single clade. Haplogroup D1 is distinguished from D4 by the presence of 16325C and 2092T, here we found that most D1 individuals lack 16325C while all D4 individuals presented the variant. These haplogroups are distinguished only by the presence of 2092T. Underrepresentation of samples of Indigenous ancestry in the PhyloTree construction, especially those from the Amazon region, may be impairing the HaploGrep3 classification.

Representativeness of cohort haplogroups in other populations

We constructed a PCA based on mitochondrial macrohaplogroup frequencies to explore the genetic relationship between Amazon population (BRA) and 1000G populations (Fig. 4A), which includes African, European, East Asian, South Asian and Admixed American populations32. PC1 explained 29.3% of the dataset variation and PC2 21.3%. PCA indicated the existence of five distinct clusters, corresponding to the global African, American, European, South and East Asian populations. Notably, our Amazonian sample was in an intermediate position between all clusters, which is in agreement with the literature and demonstrates the historical processes of migration, admixture and colonization of the Brazilian territory6,33,34.

Fig. 4: Worldwide analysis of haplogroups representation.
Fig. 4: Worldwide analysis of haplogroups representation.
Full size image

A PCA based on the frequency of mitochondrial macrohaplogroups among the Amazonian population (BRA) and 26 world populations from 1000G, colored according to their geographic region. B Absolute frequency of haplogroups found in this study among our cohort (BRA) and 1000 G populations. Only 1000G populations presenting an haplogroup identified in this study were included. Dots indicate a haplogroup presence with size and color indicating their absolute frequency. Selected populations (y-axis) are grouped according to their geographic regions.1000G population labels can be found at Supplementary Table 1.

Next, we compare the frequency of haplogroups from our cohort to selected populations from 1000G that presented haplogroups found among our samples (Fig. 4B). The African haplogroups from our cohort (L0, L1, L2 and L3) had a relevant presence only in populations from Nigeria, Sierra Leone, Kenya, Gambia, Barbados and among Afro-Americans from the US.

Indigenous haplogroups were only prevalent among American populations that underwent similar admixture processes as Brazil (Fig. 4B). Haplogroup A2 was found mostly in populations from Puerto Rico, Peru, Mexicans from California and Colombia. Peruvian population presented a significative presence of B2, C and D1 with a similar distribution as our sample. Although Indigenous haplogroups are found among Admixed American populations, they remain under-represented in general and misses specific haplogroups such as A8 and D4. Thus, this evaluation demonstrates the lack of Indigenous ancestry data in public datasets.

We further explored the genetic similarities and divergences between our sample and 1000 G populations by calculating their pairwise Fst genetic distance. For this analysis, we divided our cohort according to their haplogroup ancestry (European: BRA-EUR, African: BRA-AFR, and Indigenous: BRA-IND). These populations grouped with themselves and distinguished from all other populations (Figure S2A). BRA-EUR presented the largest distance to all other populations, with highest value against IBS (Fst=0.13). Our cohort presented the smallest distance to Afro-American populations, such as ASW and BRA-IND (Fst=0.08). These results were reinforced when analyzing pairwise Fst genetic distance of this study groups and 1000 G populations grouped by their geographic region in American, European, South Asian, East Asian, African and African-American (Figure S2B). Our cohort presented the largest distance to European populations (Fst = 0.10), and greater proximity to African-American (Fst = 0.08) and African populations (Fst = 0.09).

In one of the few studies about the complete mitogenome in Brazil, Avila et al.35 produced similar findings when analyzing 96 individuals from South, Southeast and Central-West regions of Brazil. They indicated that the Brazilian population had a high proximity to the African-American population, a low proximity to European populations, and an intermediate proximity to South Asian populations. These findings are supported by genetic, historical and ethnogeographic information previously reported in the Brazilian territory 5,7,33,35.

Haplogroups representation among indigenous populations

Next, we compared the frequency of Indigenous haplogroups (Fig. 5) on our Admixed Amazonian cohort to 74 Indigenous mitogenomes included in the construction of PhyloTree21,31,36 (Supplementary Table 2). Our cohort presented a greater haplogroup frequency similarity to Amazon and South American Indigenous population. Amazonian Indigenous and our cohort presented a greater predominance of haplogroups A2, B2, and C1 than other groups, as previously described1,11,13,18,37,38, while other Brazilian Indigenous populations lack A2 and C. Further, the X2 haplogroup specificity to North American Indigenous is in agreement with the literature21,39 and reflects the limited expansion of this group to the rest of the continent.

Fig. 5: Relative frequency of indigenous haplogroups among our cohort and indigenous populations.
Fig. 5: Relative frequency of indigenous haplogroups among our cohort and indigenous populations.
Full size image

Each point size and color indicate the relative frequency of Indigenous haplogroups (x-axis) among our cohort (Admixed Amazon) and Indigenous Populations along the American continent. Populations’ sample size is indicated next to their label.

We further explored the relationship between these samples by constructing a heatmap for mutation presence for haplogroups A to D (Supplementary Figs. 36). All samples presented a low haplogroup assignment confidence by HaploGrep3 (0.89 ± 0.06). In fact, samples assigned to different haplogroups presented a high genetic similarity. Two A2 samples clustered with A2 + (64), despite lacking 64 T, due to their shared 310C (Supplementary Fig. 3). Haplogroups B2, B2b, and B2h did not cluster together (Supplementary Fig. 4). One B2e and two B2 clustered due to their exclusive 8281–8289 deletion. B2 and B2*2 clustered due to three variants (103A, 14470C, and 16241G).

Among macrohaplogroup C, most C1b and C1c samples grouped according to their haplogroup assignments (Supplementary Fig. 5). Most samples assigned as C presented 16325C, which defines the C1 clade, and at least half of these also presented 493G, which defines C1b. C1d1 cluster included two C assigned samples, which presented C1d defining variants (16325C and 16051G). Another cluster of three C1c and six C samples presented all C1c defining variants (1888A and 15930A). These examples present strong indication of misassignment of those C samples.

Macrohaplogroup D presented the highest average assignment confidence (0.95 ± 0.02) compared to the others (Supplementary Fig. 6). All D4 samples lacked 2092T but presented 16325C, which defines haplogroup D1. They all clustered with other D1 samples and differed from them by a single mutation.

Among macrohaplogroups A, B, and C, potential new clusters can be identified (Supplementary Figs. 35). Three A2 samples clustered due to five exclusive variants (9096C, 10463G, 15951G, 16126C and 16259T). Three B2 samples shared four variants (2392C, 5150T, 15936G, 16178C) and other three B samples shared 16 variants (top of Supplementary Fig. 6). One C sample clustered with C1b*3 due to three exclusive variants (8383C, 10586A, and 16093C), a cluster of six C samples presented C1d defining variants and shared 7697A and 15004T, and other three C samples shared six variants that were almost exclusive (top of Supplementary Fig. 5).

Final remarks

Representation of ancestries and haplogroups from different continents or geographic regions varies widely between genetic panels. Panels more representative of ethnically diverse populations lead to better genomic imputation compared to others40. Previous studies have reported that under-representation of non-European populations influence the accuracy of genotype imputation27,41. Worldwide GWAS are mainly concentrated in USA and European countries, which contribute to underrepresentation of the worldwide genomic diversity, and leads to a low prediction quality among underrepresented populations22,26,27,42. As of November 2024, the majority of genomic studies were conducted on individuals of European ancestry, which represents 90% of all investigated individuals, while admixed individuals from Latin America accounted for only 0.34%42. Previous studies have also shown that Hispanic/Latino and African American genomic studies contribute with a disproportionate larger number of associations compared to a similarly sized study with European ancestry42,43. Therefore, increasing representation of samples of non-European ancestry in genomic datasets is critical to discover new variants associated with diseases and explain complex phenotypes.

Various Indigenous haplogroups are present among the Brazilian Amazon populations and their haplotype are underrepresented among large genomic datasets such as 1000 G and PhyloTree. This lack of data produces poor assignment confidence to Indigenous haplogroups tools such as HaploGrep3. Although there are efforts to increase the diversity of genomic information from Latin America countries, such as PoblAr44, Urugenomes45 and LatinGen46, there is still a scarcity of data from Brazil and Brazilian Amazon, especially regarding the mitochondrial genome. In this sense, this work complements efforts to improve mitochondrial reference data and increase our understanding of underrepresented populations phylogenetic distribution worldwide. However, due to limitations of the current study, further studies with larger cohorts should be carried out to characterize the genetic diversity of Amazonian populations.

Methodology

Sample characterization

A total of 157 mitogenomes from Brazilian Amazon were analyzed in this study. Data was gathered from two previous studies from our research group:

  • De Souza et al.47 analyzed the mitogenome of 33 individuals diagnosed with leprosy and 37 control individuals.

  • Matos et al.48 analyzed the mitogenome of 45 individuals diagnosed with Parkinson disease and 42 control individuals.

DNA extraction, quantification, amplification, and sequencing are described in their respective studies. This work adhered to the Declaration of Helsinki.

To expand our analysis, we also integrated to our dataset the mitogenome of 2504 individuals from the 1000 G phase 332 and 74 Indigenous mitogenome from various regions the Americas21,36. Their detailed population description can be found on Table S1 and Table S2, respectively.

Sequencing pre-processing and bioinformatics analysis

To assess sequencing quality, we used FastQC v0.12.149 and MultiQC v1.1950 before and after data pre-processing. We used FastP (v0.23.4)51 to remove low-quality bases, sequencing adapters, and reads with less than 36 bp. After treatment and quality assessment, reads were aligned to the mtDNA reference sequence (rCRS) using the Burrows-Wheeler alignment tool (BWA) v0.7.1752. The resulting alignments were sorted using SAMTools v1.15.153, and duplicate reads were removed using Picard v2.27.554. To identify and remove samples with contamination issues, we employed Haplocheck v1.3.355 tolerating up to 10% of contamination and excluding samples exceeding this threshold.

Detection and analysis of variants

After pre-processing and aligning the sequences, BAM files were submitted to mutserve v2.0.1 for single nucleotide polymorphisms (SNPs) calling, annotation and heteroplasmy detection (https://mitoverse.readthedocs.io/mutserve/mutserve/)56. Functional annotation of each variant, locus and mtDNA complex was then summarized through custom scripts developed with R programming language57 and RStudio environment58. To improve the reliability of the variant analyses, we required an overall average depth > 550x and heteroplasmy levels below < 0.10.

Inference of mitochondrial haplogroups

Inference of mitochondrial haplogroups was conducted using HaploGrep3 v3.2.159 (available at https://haplogrep.i-med.ac.at/), which performs automatic assignment of mtDNA haplogroups based on PhyloTree 17—Forensic Update 1.231. The inference is performed using the Kulczynski calculation between the sample and presumed haplogroup based on sets of expected and observed polymorphisms, weighted by the relative recurrence of each polymorphism in the phylogenetic tree.

Phylogenetic analysis

Mitogenome sequences were generated using HaploGrep3 with -write-fasta-msa option. The resulting.fasta files were aligned using MAFFT v7.526 software60,61 and later loaded into MEGA v11.0.10 software62,63. We opted to build a Maximum Parsimony tree with 1500x bootstrap64 and exported it in Newick format (.nwk). The resulting tree was rooted in two L0 individuals, to reflect our current understanding of human origin13,31. This strategy was developed following the recommendations from Behar et al.65. The resulting tree was visualized in R programming language using the treeio66 and ggtree67 packages.

Worldwide mitogenome diversity

We performed a Principal Component Analysis (PCA) based on the frequency of all Macrohaplogroups with 1000 G populations and this study cohort. To calculate the Fst pairwise distance between populations, we aggregated our genotype data to mitochondrial data from 1000 G to create a genotype matrix in the R environment using the vcfR package68. The resulting matrix was converted to a hierfstat dataframe used to calculate Fst according to Weir and Cockerham69 equation using hierfstat package70. The results were visualized using pheatmap package71, factoextra72, FactoMineR73 and plotly74.

Then, we combined data from this study with a 74 Indigenous mitogenome from various regions of the Americas21,36. We compared the relative haplogroup frequency among our cohort and different Indigenous populations. Next, we compared these samples mutation profile and presented it using a presence heatmap. The heatmap was visualized using ComplexHeatmap package75.

Statistics and reproducibility

Statistical analyses and visualizations were conducted using R (v. 4.3.2). The statistical methods and tests employed for each analysis are explained in the text and figure legends, when applicable. A comprehensive description of the statistical methodology applied for each of haplogroups assignment, phylogenetics and worldwide mitogenome diversity analysis, respectively, is available in the “Methods” section.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.