Introduction

Tylosema esculentum, commonly known as the marama bean, is a long-lived perennial legume native to southern Africa (Fig. 1)1. Adapted to arid and semi-arid desert environments, marama employs a unique drought avoidance strategy by growing tubers that can weigh over 250 kg2 to store water, enabling survival in the prolonged hot and dry conditions of the Kalahari Desert (Fig. 1D)3. The domestication of marama has the potential to improve local food security due to the high nutritional value of its edible seeds, whose protein and lipid contents are comparable to those of commercial crops like soybean and peanuts4,5. A significant obstacle to marama breeding is its delayed flowering, typically occurring in the second year or later. This extended juvenile phase forces breeders to wait years to harvest seeds and assess desirable traits, significantly slowing the breeding cycle. Exploring the genotypic and phenotypic diversity in natural populations and employing molecular marker-assisted breeding strategies are effective alternatives to traditional breeding methods6,7. Key breeding goals for marama include shortening the flowering time to expedite seed acquisition and developing an erect growth habit to facilitate field harvests8. Additionally, overcoming self-incompatibility is essential for creating inbred lines that ensure stable inheritance of desirable traits and enabling crosses between previously incompatible varieties to produce new cultivars with favorable allelic combinations9,10. Studying marama also provides insights into plant adaptation to harsh environments, which is increasingly relevant in the context of global warming. A high-quality genome assembly will provide a valuable reference for exploring the genetic basis of relevant traits.

Fig. 1

Morphology of wild T. esculentum (marama bean) from Namibia. (A) Brownish-black seeds, up to one inch long, are edible when roasted. The protein content is 30–39% of dry matter (dm) and the lipid content is 35–48% dm11,12. (B) Prostrate form with stems up to 3 m long13. (C) Yellow flowers, which begin to bloom in midsummer, typically no earlier than the second year after planting. (D) Giant tubers weighing over 500 pounds, with 90% of the weight coming from water2.

The genome of T. esculentum is estimated at 1 gigabase (Gb) in total and comprises 44 chromosomes (2n = 4x = 44), as determined through next-generation sequencing data and Feulgen staining6,14. A comprehensive dataset, accessible under PRJNA779273, encompasses Illumina whole-genome sequencing data from over 80 marama individuals sourced from various geographical locations in Namibia and South Africa, along with PacBio long reads from selected individuals. These data were used in assembling and analyzing the chloroplast and mitochondrial genomes of marama15,16,17. Comparative genomic studies were conducted to explore the genetic diversity within the marama organelle genome17,18. These studies revealed the presence of two distinct organelle genome types with substantial differences, although the functional implications remain unknown. The nuclear genome assembly, however, remained rudimentary: an earlier assembly by Dr. Kyle Logue, built solely from short Illumina reads, had an N50 value of only 3 kilobases (kb)17.

The advent of next-generation sequencing has significantly advanced genome assembly due to its cost-effectiveness, high speed, and throughput19. However, challenges persist in assembling complex genomes, such as polyploid and repeat-rich genomes, when solely relying on short reads from next-generation sequencing techniques. As a third-generation sequencing technology, PacBio offers longer reads, averaging over 10 kb and extending up to 25 kb, which addresses the shortcomings of previous methods. The latest PacBio HiFi sequencing enhances accuracy to over 99.9% while maintaining read length20, improving genome assembly quality. To further enhance genome assembly accuracy, particularly for complex genomes, high-throughput chromatin conformation capture (Hi-C) leverages genome-wide chromatin interactions to capture the 3D structure of chromosomes21,22. This is followed by sequencing, enabling accurate scaffolding of genome assemblies. Hi-C has become a widely used technology for studying complex plant genomes23,24.

This research aimed to generate the first high-quality genome assembly of T. esculentum (marama) using PacBio HiFi sequencing data. Preliminary assemblies were conducted using HiCanu25 and Hifiasm26, followed by scaffolding with Hi-C data and the HiRise assembler from Cantata Bio LLC to address the complexities of polyploid genomes. Comprehensive gene prediction and annotation were performed, and a comparative genomics study was conducted to analyze gene families in marama and related legumes. Functional uniqueness within the marama genome was identified, alongside phylogenetic analyses to clarify its evolutionary relationships and divergence from related species. Additionally, samples from different geographical regions were collected, and variants identified from resequencing data were used to explore the genetic diversity and population structure within the species, providing insights into marama’s evolution and adaptation. This work offers a valuable genomic resource to support future research and breeding efforts, enhancing marama’s potential as a resilient and sustainable food crop.

Methods

Sample collection and sequencing

Sample 4 of T. esculentum, cultivated in the Case Western Reserve University greenhouse from seeds of unknown provenance in Namibia, was utilized for DNA extraction and sequencing. Fresh young leaves (1 g) were ground in liquid nitrogen using a mortar and pestle, and high-molecular-weight DNA was extracted using the Quick-DNA HMW MagBead kit (Zymo Research). DNA concentration was quantified using an Invitrogen™ Qubit™ 3.0 Fluorometer, and quality was assessed by electrophoresing 200 ng of DNA on a 1.5% agarose TBE gel at 40 V for 24 h.

For PacBio sequencing, DNA samples were submitted to the Genomics Core Facility at the Icahn School of Medicine at Mount Sinai. Sequencing libraries were constructed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). The libraries were sequenced on two 8 M SMRT® Cells using the Sequel® II platform, generating 2,184,811 reads with a total yield of 21.5 Gbp.

To support genome scaffolding, fresh leaf samples from the same plant were flash-frozen in liquid nitrogen and shipped on dry ice to Dovetail Genomics (Cantata Bio) for Omni-C library preparation. Omni-C, an advanced Hi-C technology, employs sequence-independent endonucleases to achieve uniform genome-wide coverage, eliminating biases from restriction enzyme-based methods. Chromatin was fixed with formaldehyde, extracted, and digested with DNAse I before ligation to biotinylated bridge adapters. Proximity ligation was followed by crosslink reversal, removal of non-internal biotin, and library preparation using NEBNext Ultra enzymes and Illumina-compatible adapters. Libraries were enriched via streptavidin bead capture and amplified by PCR. Sequencing was performed on an Illumina HiSeqX platform, achieving ~30x coverage.

For transcriptome sequencing, high-quality RNA was extracted from leaf tissue (young leaves at the growing tip) and root tissue (root tips and young roots of germinated seeds) using the Quick-RNA Plant MiniPrep™ Kit (Zymo Research Corporation, Catalog No. 50–444-618). RNA quality and quantity were estimated by running a sample on a 2% agarose gel. RNA sequencing libraries were prepared by Novogene following a standard workflow. Messenger RNA (mRNA) was enriched from total RNA using poly-T oligo-attached magnetic beads, fragmented, and reverse-transcribed to synthesize first-strand cDNA using random hexamer primers, followed by second-strand synthesis. Library construction involved end repair, A-tailing, adapter ligation, size selection, amplification, and purification. Quality assessment was conducted using a Qubit fluorometer and real-time PCR for quantification, as well as a bioanalyzer for size distribution evaluation. The libraries were pooled and sequenced on the NovaSeq 6000 platform, generating 6.8 Gbp of transcriptomic data for root tissues and 7.2 Gbp for leaf tissues.

For the population study, Illumina whole-genome sequencing (WGS) data were generated for 84 individuals collected from various geographic regions in Namibia and South Africa. Details of sequencing protocols and data processing are described in a previous study16. These data are available in the NCBI SRA database under Bioproject PRJNA779273.

De novo genome assembly and quality assessment

The preliminary assembly of PacBio HiFi reads was generated using two assemblers: Hifiasm v.0.18.5 (Cheng et al. 2021) with a haplotype number set to four, and HiCanu25 with an estimated genome size of 1 Gb based on previous assessments6. The HiRise pipeline27,28 was then used to scaffold the de novo assembly with the help of Dovetail Omni-C reads. These reads were aligned to the draft assembly using Burrows-Wheeler Aligner (BWA)16 (https://github.com/lh3/bwa). HiRise analyzed the read pair distances within scaffolds to generate a likelihood model for genomic distance, which was used to identify and correct misjoins.

K-mer analysis of the PacBio HiFi reads was performed using Jellyfish v. 2.3.029 (https://github.com/gmarcais/Jellyfish) with a k-mer length of 21. The results were used to construct k-mer spectra with GenomeScope 2.030 (http://qb.cshl.edu/genomescope/genomescope2.0/). Assembly quality was assessed with QUAST v. 5.2.031 (https://github.com/ablab/quast) and visualized using Matplotlib v. 1.3.132. Genome completeness was evaluated using BUSCO v. 5.3.033 (https://busco.ezlab.org/) against the Embryophyta ortholog database (embryophyta_odb10, 1614 genes) and the Fabales ortholog database (fabales_odb10, 5366 genes).
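The k-mer counting step can be illustrated with a minimal Python sketch. It is not the Jellyfish implementation; it only shows how a k-mer spectrum of the kind passed to GenomeScope 2.0 is derived from reads. Counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption here, since the counting mode is not stated above, and the function names are hypothetical.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping windows with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())
```

The spectrum returned by `kmer_spectrum` corresponds to the frequency-versus-coverage histogram that a model-fitting tool takes as input; for real HiFi data a dedicated counter is required for memory and speed.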

Additionally, the genome assembly was compared to that of Bauhinia variegata (ASM2237911v2)34, the closest species with an available genome sequence, by aligning the assemblies with minimap2 v. 2.2816 (https://github.com/lh3/minimap2). The pairwise mapping data (PAF) was visualized using a dot plot created with the R package pafr (https://github.com/dwinter/pafr) to assess synteny and validate the assembly’s structure.
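As a sketch of how the pairwise mapping output can be inspected before plotting, the following Python snippet parses the twelve mandatory PAF columns and keeps alignments from query scaffolds longer than 1 Mb, mirroring the cutoff used for the dot plot. This is not the pafr workflow itself; the function names and the `min_mapq` parameter are hypothetical.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory columns of a PAF line."""
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line; optional SAM-like tags after column 12 are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def filter_alignments(lines, min_query_len=1_000_000, min_mapq=0):
    """Yield records whose query sequence exceeds min_query_len bases."""
    for line in lines:
        if not line.strip():
            continue
        rec = parse_paf_line(line)
        if rec.query_len > min_query_len and rec.mapq >= min_mapq:
            yield rec


def identity(rec: PafRecord) -> float:
    """Fraction of matching bases over the alignment block length."""
    return rec.matches / rec.block_len if rec.block_len else 0.0
```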

Gene prediction and functional annotation

A de novo repeat library was constructed for the assembly using RepeatModeler (v. 2.0)35. This custom library, combined with the Dfam v3.0 database36, was used to annotate and mask repetitive elements in the genome assembly with RepeatMasker v. 4.1.437 (https://www.repeatmasker.org/). Alignments were conducted using the rmblastn search engine38. For gene prediction, transcriptomic data were aligned to the genome assembly using HISAT2 v. 2.2.123,39, resulting in SAM files that were converted to sorted BAM files and combined using SAMtools v. 1.20. The BRAKER v 3.0.8 pipeline B40, which integrates evidence-based and ab initio approaches for gene annotation, was employed. The pipeline utilized RNA-Seq spliced alignment data to train GeneMark-ET v. 4.7141, incorporating both genome sequence and RNA-Seq evidence. The resulting gene models were then used to train Augustus v. 3.5.042 for final gene predictions.

A statistical summary of the annotation was generated from the resulting GFF file using the AGAT v. 1.0.0 toolkit43. The completeness of the gene annotation was assessed using BUSCO against the embryophyta_odb10 database. Functional gene annotation was performed using eggnog-mapper 2.1.1244 (http://eggnog-mapper.embl.de/), referencing the eggNOG 5 database45.
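The kind of summary statistics reported for the annotation can be sketched in a few lines of Python. This is not the AGAT implementation; it assumes a GFF3 file in which each exon carries a `Parent` attribute naming its transcript, and it reports exons per transcript rather than per gene. The function name is hypothetical.

```python
from collections import defaultdict


def exon_stats(gff_lines):
    """Return (mean exons per transcript, mean exon length in bp) from GFF3 lines."""
    exons_per_transcript = defaultdict(int)
    total_exon_length = 0
    exon_count = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        exons_per_transcript[parent] += 1
        # GFF3 coordinates are 1-based and inclusive.
        total_exon_length += end - start + 1
        exon_count += 1
    if exon_count == 0:
        return 0.0, 0.0
    return exon_count / len(exons_per_transcript), total_exon_length / exon_count
```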

Evolutionary analyses: phylogenetic relationships, whole genome duplication, and gene enrichment

To investigate the evolutionary relationships of T. esculentum and related species, a comparative genomics study was conducted on the gene families of 13 legumes and one outgroup species, soapbark. The protein data of nine species were retrieved from the JGI Phytozome v. 13 database46: Arachis hypogaea (v1.0)47, Cicer arietinum (v1.0)48, Medicago truncatula (Mt4.0v1)49, Lotus japonicus (Lj1.0v1)50,51, Glycine max (Wm82.a2.v1)52, Phaseolus acutifolius (v1.0)53, Vigna unguiculata (v1.1)54, Lupinus albus (v1)55, Cercis canadensis (v3.1)56, and four from NCBI GenBank: Bauhinia variegata (ASM2237911v2)34, Prosopis alba (ASM479914v2), Prosopis cineraria (ASM2901754v1)57, and Quillaja saponaria (AO_1.2)58. CD-HIT v. 4.8.159 with a threshold of 0.95 was applied to retain only the longest isoform of each protein. OrthoFinder v. 2.4.060 was then utilized with an all-against-all method to identify orthologous genes across these species, which revealed the evolutionary relationship among these plant taxa.

A total of 157 single-copy orthologs were identified among 14 species and aligned using MAFFT v. 7.52061. Regions with poor alignment and high divergence were removed using Gblocks v. 0.91b62, specifying a minimum block length of 5, while all gaps were retained as meaningful. The trimmed alignments were concatenated into a single FASTA file. A maximum likelihood phylogenetic tree was constructed using concatenated protein sequences in IQ-TREE v. 2.2.2.763 with the ModelFinder Plus (MFP)64 algorithm to automatically select the optimal substitution model. Tree topology robustness was assessed with 1000 bootstrap replications. Approximate divergence times for C. arietinum and M. truncatula (31.9 million years ago) and for G. max and V. unguiculata (25.3 million years ago) were retrieved from TimeTree (http://www.timetree.org)65. These divergence times were used to calibrate the overall divergence times in the phylogenetic tree using the Timetree Wizard66 in MEGA 1167.
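The concatenation step can be sketched as follows. This is a minimal illustration, not the script used in the study: each trimmed alignment is represented as a dictionary from species name to aligned sequence, and the function name is hypothetical. Padding absent species with gaps is a safety measure only, since single-copy orthologs are present in every species by definition.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments into one supermatrix sequence per species.

    alignments: list of dicts mapping species name -> aligned sequence.
    species:    ordered list of species names expected in the output.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = widths.pop()
        for sp in species:
            # A species missing from an alignment is padded with gap characters.
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}
```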

Gene count data generated by OrthoFinder were used to calculate gene family size variation across the phylogenetic tree using CAFE5 v. 5.0.068. To mitigate potential noise from excessively large gene families and those with high variance, families with more than 100 gene copies were filtered out using the clade_and_size_filter.py script. The resulting data on gene family expansions and contractions (with a p-value < 0.05) were then mapped onto the phylogenetic tree, providing a visual representation of these evolutionary changes. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed using TBtools 2.069 and the KEGG database70,71,72 on genes from the expanded and contracted gene families of T. esculentum. The top enriched pathways were visualized using ggplot2 in R, sorted by enrichment factor.
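The size filter applied before the CAFE5 analysis can be approximated with the sketch below. It does not reproduce clade_and_size_filter.py; in particular, applying the cutoff to the largest per-species copy number (rather than the family total) is an assumption, and the function name and example identifiers are hypothetical.

```python
def split_by_size(families, max_copies=100):
    """Separate gene families into (retained, excluded) by copy number.

    families: dict mapping family ID -> dict of species -> gene count.
    A family is excluded when any single species holds more than max_copies genes.
    """
    retained, excluded = {}, {}
    for family_id, counts in families.items():
        if max(counts.values()) > max_copies:
            excluded[family_id] = counts
        else:
            retained[family_id] = counts
    return retained, excluded
```

Very large families tend to inflate the variance of the birth-death rate estimate, which is why they are usually set aside and, if needed, analyzed separately.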

Orthologous gene families were also analyzed among five selected legumes, T. esculentum, B. variegata, C. canadensis, G. max, and P. alba, which include both closely related species of T. esculentum as well as representatives from different phylogenetic clusters. The distribution of shared and species-specific gene families was visualized using the VennDiagram package v. 1.7.369 in R 4.3.128. Genes in core gene families, shared by all five legumes, and genes in the T. esculentum-specific gene family were used for KEGG enrichment analyses with TBtools v. 2.0. The results were visualized using ggplot273 in R, offering a comprehensive understanding of the common and unique functional pathways among these five legumes.
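For readers unfamiliar with pathway enrichment, the two quantities typically reported (an enrichment factor and a significance value) can be sketched as below. The definitions used here, a ratio of proportions and a one-sided hypergeometric test, are common conventions and are assumed rather than taken from the TBtools documentation; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_factor(k, n, M, N):
    """Ratio of the pathway's share in the gene set to its share in the background.

    k: genes in the set annotated to the pathway
    n: genes in the set
    M: background genes annotated to the pathway
    N: background genes
    """
    return (k / n) / (M / N)


def hypergeom_upper_tail(k, n, M, N):
    """P(X >= k) when n genes are drawn without replacement from N, M of them in the pathway."""
    total = comb(N, n)
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / total
```

In practice the p-values from many pathways would also be corrected for multiple testing before pathways are called enriched.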

To further investigate the evolutionary relationships between T. esculentum and its closely related legume species, C. canadensis and B. variegata, whole genome duplication (WGD) events were analyzed. The coding sequences (CDS) of the genomes were self-aligned to identify homologous gene pairs using DIAMOND (v. 2.1.8)74 with an e-value threshold of 1 × 10⁻¹⁰, in conjunction with WGD tools (v. 1.1.1)75. The synonymous substitution rates (Ks values) were subsequently calculated using the ksd function from the WGD tools, which employs PAML (v. 4.9 h)76 for codon-based maximum likelihood analysis. A Ks threshold of 0.1 was applied to exclude local duplications and minimize noise. The results of WGD were visualized as a curve plot using ggplot2 (v. 3.5.1) in R (v. 4.3.1) (R Core Team, 2023), offering a graphical representation of evolutionary relationships.
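The Ks filtering and peak identification can be illustrated with a short sketch. It assumes that gene pairs with Ks below the 0.1 threshold are discarded, and it uses a simple histogram rather than the density curve drawn with ggplot2; the upper bound `max_ks`, the bin width, and the function names are hypothetical choices for illustration.

```python
def ks_histogram(ks_values, min_ks=0.1, max_ks=3.0, bin_width=0.1):
    """Bin Ks values in [min_ks, max_ks) and return the count per bin."""
    n_bins = int(round((max_ks - min_ks) / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if ks < min_ks or ks >= max_ks:
            continue  # drop very young pairs and saturated estimates
        index = min(int((ks - min_ks) / bin_width), n_bins - 1)
        counts[index] += 1
    return counts


def peak_bin_center(counts, min_ks=0.1, bin_width=0.1):
    """Return the midpoint of the most populated bin."""
    index = max(range(len(counts)), key=counts.__getitem__)
    return min_ks + (index + 0.5) * bin_width
```

A histogram peak is sensitive to bin width, so a kernel density estimate or mixture model is generally preferred when reporting peak positions.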

Population genetics analysis

The population study included 31 independent marama samples, with 24 collected from wild plants across various locations in Namibia: 3 from Tsjaka (S22 75.039 E19 20.712), 1 from Okamatapati (S20 40.233 E18 21.59), 8 from Aminuis (S23 38.000 E19 22.00), 4 from Osire (S21 02.031 E17 21.244), 3 from Tsumkwe (S19 21.000 E20 16.000), 2 from Ombujondjou (S20 18.600 E17 58.525), 2 from Epukiro (S21 39.642 E19 25.092), and 1 from Otjiwarongo (S20 46.092 E16 65.123). Six samples were collected from the University of Pretoria Farm, where they had been cultivated for over thirty years with an unknown original source. Additionally, one sample was grown from seed collected from the Namibia Farm (21°23′48.5″ S 19°44′59.6″ E). DNA extraction was described in detail in previous studies16,77, and the whole-genome sequencing (WGS) reads are available in the NCBI SRA database under Bioproject PRJNA779273.

Paired-end Illumina reads were aligned to the genome assembly with BWA v 0.7.17 mem16, followed by conversion of SAM files to sorted BAM files using SAMtools v. 1.20. SNPs were called using BCFtools mpileup78. SNP processing involved multiple filtering steps to ensure high-quality variants. VCF files from different samples were first merged using BCFtools merge, and then filtered to retain only biallelic SNP variants using BCFtools view. Quality and depth filtering was applied with BCFtools filter, using thresholds of QUAL ≥ 30, DP > 10, and DP ≤ 100. The filtered VCF file was then normalized with BCFtools norm using the reference genome to ensure consistent variant representation. A final filter step retained variants with a minor allele frequency (MAF) greater than 0.05 and SNP missing call rate lower than 0.3 for the population study.
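The filtering criteria listed above can be expressed as a single predicate. The sketch below is not a BCFtools replacement; it works on already-parsed values, assumes diploid genotype calls coded as allele pairs (with `None` for a missing call), and treats the depth thresholds as applying to a site-level depth. All function and parameter names are hypothetical.

```python
def minor_allele_stats(genotypes):
    """Return (minor allele frequency, missing call rate) for one site.

    genotypes: list of (allele, allele) tuples with 0 = reference, 1 = alternate,
               or None where the sample has no call.
    """
    if not genotypes:
        raise ValueError("no samples at this site")
    called = [g for g in genotypes if g is not None]
    missing_rate = 1.0 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(sum(g) for g in called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq), missing_rate


def passes_filters(ref, alts, qual, depth, genotypes,
                   min_qual=30.0, min_depth=10, max_depth=100,
                   min_maf=0.05, max_missing=0.3):
    """Apply the biallelic-SNP, QUAL, DP, MAF and missingness filters to one site."""
    is_biallelic_snp = len(alts) == 1 and len(ref) == 1 and len(alts[0]) == 1
    if not is_biallelic_snp:
        return False
    if qual < min_qual or not (min_depth < depth <= max_depth):
        return False
    maf, missing_rate = minor_allele_stats(genotypes)
    return maf > min_maf and missing_rate < max_missing
```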

Principal component analysis (PCA) was performed using PLINK v.2.0079, and the first two components were visualized using ggplot2 in R v.4.3.128. ADMIXTURE v.1.2380 was used to assess population structure, with the optimal number of populations determined through cross-validation errors and visualized with ggplot2. For phylogenetic analysis, the VCF file was converted to a PHYLIP alignment using vcf2phylip.py (v. 2.8)81 and a maximum likelihood tree was constructed with IQ-TREE (v. 2.2.2.7)63 using 1000 bootstrap replicates. The resulting tree was visualized using Interactive Tree of Life (iTOL) (v. 6)82.
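Selecting the number of populations from cross-validation errors amounts to choosing the K with the lowest reported error. The sketch below assumes log lines of the form `CV error (K=3): 0.47835`; this line format is an assumption about the ADMIXTURE output, and the function names are hypothetical.

```python
import re

# Assumed log line format: "CV error (K=3): 0.47835"
_CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_text):
    """Extract a mapping of K -> cross-validation error from concatenated log text."""
    return {int(k): float(err) for k, err in _CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    if not cv_errors:
        raise ValueError("no cross-validation errors found")
    return min(cv_errors, key=cv_errors.get)
```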

Results

Estimation of genome size and heterozygosity

A total of 21.5 Gbp PacBio HiFi reads were generated and analyzed to characterize the genomic properties of T. esculentum. K-mer analysis was conducted using a k-mer length of 21, and the resulting distribution was modeled with GenomeScope 2.0, producing k-mer spectra that revealed peaks corresponding to 1-fold, 2-fold, and 4-fold coverage (Fig. 2). The data more closely aligned with a tetraploid genome model than the initially hypothesized hexaploid model for T. esculentum. The predicted genome size for a single chromosome set was 277.44 Mb, which is comparable to the compact genome of the legume species Amphicarpaea edgeworthii, with a genome size of 298.1 Mb and a haploid chromosome number of 11 (2n = 22)83,84. Additionally, the k-mer spectra indicated substantial heterozygosity, with 2.2% of the genome exhibiting heterozygous characteristics. Notably, both the aaab (1.410%) and aabb (0.498%) patterns were observed at high frequencies, suggesting that T. esculentum may possess a complex ploidy structure, potentially indicative of an ancient allotetraploid that has accumulated mutations over time, leading to an elevated aaab pattern ratio30.
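The arithmetic linking the monoploid estimate to the expected sizes of two and four chromosome sets, and to sequencing depth, can be checked with a few lines of Python. The constants are the values reported in the text; the function names are hypothetical, and the depth figures are simple yield-to-size ratios rather than values reported by the authors.

```python
MONOPLOID_MB = 277.44       # GenomeScope 2.0 estimate for one chromosome set
TOTAL_YIELD_GBP = 21.5      # PacBio HiFi yield
READ_COUNT = 2_184_811      # number of HiFi reads


def expected_size_mb(chromosome_sets: int) -> float:
    """Expected assembly size in Mb for a given number of chromosome sets."""
    return MONOPLOID_MB * chromosome_sets


def coverage(chromosome_sets: int) -> float:
    """Sequencing depth: total yield divided by the expected size."""
    return TOTAL_YIELD_GBP * 1000 / expected_size_mb(chromosome_sets)


def mean_read_length_bp() -> float:
    """Average HiFi read length implied by yield and read count."""
    return TOTAL_YIELD_GBP * 1e9 / READ_COUNT
```

Two chromosome sets give 554.88 Mb and four give about 1.11 Gb, which is in line with the earlier 1 Gb estimate for the full tetraploid genome.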

Fig. 2

K-mer spectra built on the PacBio HiFi reads of Sample 4 using GenomeScope 2.0. (A) Frequency-coverage k-mer spectrum. (B) Coverage*frequency-coverage k-mer spectrum.

De novo genome assembly and evaluation

A total of 2,184,811 PacBio HiFi reads were subjected to preliminary assembly using HiCanu and Hifiasm. HiCanu produced a complete genome assembly of 1.24 Gb, composed of 9,532 contigs, a size that aligns closely with the expected structure of four chromosome sets. This assembly exhibited an N50 value of 1.28 Mb and an L50 value of 252, where L50 is the minimum number of contigs whose combined length reaches half of the total assembly size (Table 1). In contrast, Hifiasm generated a partially phased assembly of 558.23 Mb with higher continuity, consisting of 4,175 contigs. This assembly demonstrated a markedly improved N50 of 2.75 Mb and an L50 of 35. Both assemblies achieved high completeness, with BUSCO scores exceeding 99% when evaluated against the embryophyta_odb10 database.
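The N50 and L50 statistics used to compare the assemblies follow a standard definition, sketched below. This is an illustration rather than the QUAST implementation, and the function name is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches half of the assembly size. N50 is the length of the
    contig at which this happens; L50 is how many contigs were needed.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, rank
    raise ValueError("no contig lengths supplied")
```

A higher N50 together with a lower L50 indicates that more of the assembly is contained in fewer, longer sequences.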

Table 1 T. esculentum sequencing and genome assembly statistics.

Subsequently, the Hifiasm assembly was submitted to Dovetail Genomics (Cantata Bio) for scaffolding using Omni-C data to capture chromatin interactions (Supplementary Figure S1). The final assembly size was 558.78 Mb, which is close to the estimated size of 554.88 Mb for two sets of chromosomes. Despite the relatively high contig count (3,888), continuity was significantly improved, as evidenced by an N50 of 22.68 Mb, reaching chromosome-level assembly. The longest scaffold was 56.19 Mb (Fig. 3A). The L50 was reduced to 8, meaning that the eight longest scaffolds collectively represented at least 50% of the assembly size (Fig. 3B). The average guanine-cytosine (GC) content across all contigs was 37.20% (Table 1; Fig. 3C). BUSCO completeness remained robust, with a score of 99.1% against the Embryophyta database and 93.6% against the Fabales database.

Fig. 3

Genome assembly quality assessment plots drawn by QUAST 5.2.0. (A) Nx plot showing the distribution of contig lengths as x varies from 0 to 100%. (B) Cumulative length plot. The contigs were sorted from largest to smallest. (C) GC plot showing the distribution of GC content in the contigs.

Approximately 58.43% of the T. esculentum genome assembly was annotated as repetitive sequences (Table 2). The most prevalent repeat component, long-terminal repeat (LTR) retroelements, accounted for 22.61% of the genome, with Gypsy/DIRS1 elements comprising 15.65% and Ty1/Copia elements contributing 3.48%. Low-complexity regions (LCRs) and simple repeats represented 11.77% and 7.45% of the genome, respectively.

Table 2 Summary of repeat elements in the T. esculentum genome assembly by RepeatMasker.

A total of 49,343 protein-coding genes were predicted using BRAKER 3 (Table 3), with an average of 6.3 exons per gene (mean exon length 218.35 bp) and 5.1 introns per gene (mean intron length 369.56 bp). Evaluation of gene set completeness using BUSCO, with reference to the Embryophyta core gene database, revealed a completeness of 95.8%. The predicted gene set was further annotated using eggNOG-mapper against the eggNOG database v. 5.0.2, with results summarized in Supplementary Table S1.

Table 3 Gene prediction statistics for T. esculentum genome assembly.

Comparison of the genome assemblies of T. esculentum and B. variegata

Genomes of a limited number of plants from the Cercidoideae subfamily have been assembled, with B. variegata being the closest evolutionary relative to T. esculentum85. The genome assembly of B. variegata (ASM2237911v2) spans 326.4 Mb and consists of 14 chromosomes (2n = 28), ranging in size from 18.26 Mb to 27.62 Mb34. The T. esculentum genome assembly was aligned to the B. variegata genome using minimap2, and the results were visualized as a dot plot with the R package pafr (Fig. 4). The alignment revealed partial collinearity, with conserved regions forming distinct diagonal lines. However, the presence of numerous missing alignments and structural variations, including inversions and translocations, highlights substantial genomic divergence between the two species. To further investigate, Illumina reads from three randomly selected T. esculentum samples (M1, M40, Index1) were mapped to the B. variegata genome using Bowtie2 v2.4.486 (https://github.com/BenLangmead/bowtie2). The overall alignment rate was approximately 20.36%, whereas mapping the same reads to the Vigna radiata genome (PRJNA301363) resulted in a significantly lower alignment rate of 2.7%87. These findings underscore the highly divergent nature of the T. esculentum genome compared to other legumes.

Fig. 4

Dot plot of T. esculentum scaffolds aligned to the 14 chromosomes of the B. variegata genome (ASM2237911v2). This figure, created using the R package pafr, visualizes the alignment of T. esculentum scaffolds to the B. variegata genome based on a pairwise mapping format (PAF) file generated by minimap2. Each row represents a B. variegata chromosome, labeled on the right with its GenBank ID34. Columns represent T. esculentum scaffolds, sorted by size, with only scaffolds exceeding 1 Mb shown. Highly fragmented contigs are excluded for clarity. Black dotted lines indicate alignment points between the two genomes. Axis ticks mark genomic scales in megabase pairs (Mb).

Phylogenetic analyses, along with the evolution of gene families and whole genome duplication analysis of T. esculentum and related species

Ortholog analyses were performed on 14 species: 13 legumes, comprising three Cercidoideae (C. canadensis, B. variegata, and T. esculentum), two Caesalpinioideae (P. alba and P. cineraria), and eight Faboideae (A. hypogaea, C. arietinum, M. truncatula, L. japonicus, G. max, P. acutifolius, V. unguiculata, and L. albus), and one outgroup species, Q. saponaria. A 95% similarity threshold was first applied to retain only the longest isoform of each protein. Out of a total of 510,326 genes, 472,973 (92.68%) were assigned to 33,383 orthogroups, with 40,527 genes in 9,466 orthogroups identified as species-specific. The 157 single-copy orthogroups were used to construct a phylogenetic tree, with Q. saponaria as the outgroup to root the tree (Fig. 5A). The divergence between B. variegata and T. esculentum was estimated to have occurred approximately 27.22 million years ago (Ma), and the divergence from C. canadensis approximately 31.68 Ma. The gene family size variation in T. esculentum is close to that of B. variegata, with more gene families undergoing expansion (2,231) than contraction (1,155). A total of 6,707 genes in the expanded gene families and 951 genes in the contracted gene families of T. esculentum were subjected to KEGG pathway enrichment analyses using TBtools (Fig. 5B and C, Supplementary Table S4 and S5).

Fig. 5

Evolutionary dynamics of gene families in T. esculentum and 13 related species. (A) A maximum likelihood phylogenetic tree was constructed using concatenated protein sequences from 157 single-copy orthologs of T. esculentum and 13 other species, including 12 legumes and Q. saponaria as the outgroup. Sequences were aligned using MAFFT, and the tree topology was supported by 1000 bootstrap replicates. Gene family expansion and contraction counts, calculated by CAFE, are color-coded in blue (expansion) and red (contraction). Divergence times (million years ago, Ma) are annotated on the tree nodes. (B) KEGG enrichment analysis of genes in the 1155 contracted gene families of T. esculentum. (C) KEGG enrichment analysis of genes in the 2231 expanded gene families of T. esculentum. Enriched pathways are ranked by enrichment factor. Dot size indicates the number of enriched genes, while dot color represents significance, with red denoting higher significance and purple lower significance.

The contracted gene families in marama are primarily enriched in pathways associated with plant defense and stress adaptation, including plant secondary metabolite biosynthesis (00999), tropane, piperidine, and pyridine alkaloid biosynthesis (00960), and cutin, suberine, and wax biosynthesis (00073) (Fig. 5B). The contraction of genes in these pathways suggests a reduced capacity for synthesizing specialized metabolites and protective compounds, which are typically essential for defense against biotic stressors, such as pathogens88,89. This reduction likely reflects the lower pathogen pressure in marama’s native arid environment, where pathogen diversity and abundance are limited. Consequently, marama appears to prioritize resource allocation toward critical survival processes, such as drought tolerance, rather than extensive defense mechanisms. These findings highlight the trade-offs in marama’s genome that contribute to its resilience and efficient adaptation to harsh conditions.

The expanded gene families in marama are enriched in pathways essential for cellular function, energy metabolism, and stress adaptation, contributing to its remarkable resilience in arid environments (Fig. 5C). Key pathways such as the citrate cycle (TCA cycle) (00020), pyruvate metabolism (00620), and carbon fixation in photosynthetic organisms (00710) highlight marama’s ability to optimize energy production and carbon assimilation under resource-limited conditions, critical for survival in drought-prone areas90,91. The expansion of arginine biosynthesis (00220) is particularly important, as arginine serves as a precursor for molecules involved in stress signaling, osmotic balance, and the detoxification of reactive oxygen species92,93. The increased presence of GTP-binding proteins (04031) reflects an expanded set of signaling molecules that play key roles in cellular communication, stress responses, and protein trafficking, enabling rapid adaptation to fluctuating environmental conditions94,95. Additionally, the expansion of the spliceosome pathway (03040) enhances RNA processing and gene expression regulation, supporting marama’s ability to fine-tune its transcriptome under stress96,97,98. Finally, the enrichment in structural proteins (99992) underscores the importance of maintaining cellular integrity, including DNA repair and cytoskeletal stability, ensuring marama’s structural resilience in extreme conditions99. Together, these expanded gene families provide marama with enhanced capabilities for energy production, stress response, and cellular maintenance, reinforcing its capacity to thrive in harsh, resource-limited environments.

Gene families in T. esculentum were compared to those of four selected legumes, B. variegata, C. canadensis, G. max, and P. alba (Fig. 6A). A total of 24,995 orthogroups were identified, of which 13,977 (55.92%) were core gene families shared across all five legumes, encompassing 24,348 genes. Additionally, 5,824 (23.30%) species-specific orthogroups were identified including 1,271 exclusive to T. esculentum, comprising 4,191 genes. KEGG enrichment analyses were performed on T. esculentum genes in both core and species-specific gene families (Supplementary Table S4 and S5), providing insights into the functional roles of the genes in each group.

Fig. 6

Comparative analysis of gene families and evolutionary patterns in T. esculentum and related legumes. (A) Venn diagram showing the distribution of shared and species-specific gene families among T. esculentum and four other legume species. (B) Density curve illustrating the distribution of synonymous substitution rates (Ks) in homologous gene pairs among T. esculentum, B. variegata, and C. canadensis, providing insights into whole-genome duplication (WGD) events. (C) KEGG enrichment analysis of genes in core gene families shared by T. esculentum and the four legumes, highlighting enriched pathways. (D) KEGG enrichment analysis of genes in gene families unique to T. esculentum. Pathways are sorted by enrichment factor, with dot size representing the number of enriched genes and color indicating significance (red for higher significance, purple for lower significance).

Genes within the core gene families shared by the five legumes were enriched in pathways fundamental to plant growth, energy production, and stress response. Key enriched pathways include photosynthesis (00195, 00196) and plant hormone signal transduction (04075), which are essential for maintaining photosynthetic efficiency and regulating growth, both critical for legume development60,100. Additionally, pathways such as N-Glycan biosynthesis (00510) and GPI-anchor biosynthesis (00563) emphasize the significance of protein modification and cell wall maintenance in shared gene functions101,102,103,104. Metabolic pathways, including galactose metabolism (00052), highlight the role of energy regulation and signaling in supporting core biological processes105,106.

In contrast, marama-specific gene families were enriched in pathways associated with stress tolerance, energy storage, and specialized metabolism, which are crucial for its adaptation to harsh environmental conditions. Key pathways, including amino acid catabolism (00280, 00330, 00380, etc.) and fatty acid degradation (00071), play pivotal roles in generating signaling molecules that regulate stress-responsive genes and proteins107,108,109,110. Additionally, pathways involved in terpenoid backbone biosynthesis (00900) and glucosinolate biosynthesis (00966) contribute to the synthesis of defense-related secondary metabolites, potentially enhancing marama’s ability to cope with both biotic and abiotic stressors111,112,113. Pathways such as DNA replication (03030) and porphyrin metabolism (00860) further support cellular maintenance and survival in extreme environments114,115. These findings underscore marama’s genetic adaptations that enable it to thrive in arid conditions, highlighting its potential for use in breeding drought-tolerant crops.
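The enrichment factor and significance shown in Fig. 6C and 6D can be illustrated with a minimal sketch. It assumes the common fold-enrichment definition and a one-sided hypergeometric test, and the text does not confirm that the study used either. The function names and all gene counts below are hypothetical.

```python
from math import comb


def enrichment_factor(k, n, K, N):
    """Fold enrichment: (k / n) / (K / N).

    k: genes in the test set annotated to the pathway
    n: genes in the test set
    K: background genes annotated to the pathway
    N: genes in the background
    """
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """One-sided P(X >= k) for n genes drawn without replacement from N,
    of which K belong to the pathway."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts for a single pathway (not values from the study).
k, n, K, N = 12, 200, 100, 10_000
factor = enrichment_factor(k, n, K, N)
p_value = hypergeom_pvalue(k, n, K, N)
print(f"enrichment factor = {factor:.2f}, p = {p_value:.2e}")
```

With these toy numbers the pathway is about six times more frequent in the test set than in the background. In practice the p-values would also need correction for the number of pathways tested.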

Synonymous substitution rates (Ks) were calculated for homologous gene pairs in T. esculentum, B. variegata, and C. canadensis to investigate whole-genome duplication (WGD) events and their evolutionary timelines (Fig. 6B). The Ks distribution for B. variegata (red curve) peaked at a Ks value of 0.24, consistent with previous studies34,116, indicating a relatively recent WGD event. In contrast, the T. esculentum Ks distribution peaked at 0.30. Because a higher Ks reflects more synonymous substitutions accumulated since duplication, this suggests that the WGD event in T. esculentum occurred earlier than that in B. variegata, assuming comparable substitution rates in the two lineages. For C. canadensis, the green curve showed only a small peak at a Ks value of 1.77, corresponding to the γ-WGT event within core eudicots, which has been dated to approximately 120 million years ago117,118, albeit with a broad divergence time range. Additionally, T. esculentum exhibited a minor slope starting at a Ks value of 2.08, suggesting the presence of an even more ancient WGD event. The detection of both recent and ancient duplication signals in T. esculentum further supports the hypothesis that this species underwent multiple rounds of whole-genome duplication, which likely contributed to its genome complexity.
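To make the notion of a Ks peak concrete, the sketch below estimates the mode of a Ks distribution with a simple Gaussian kernel density estimate. It is only an illustration of the idea: the Ks values are synthetic, the bandwidth and grid are arbitrary assumptions, and the function names are hypothetical rather than part of the study's workflow.

```python
import math
import random


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def ks_peak(values, lo=0.0, hi=3.0, step=0.01, bandwidth=0.05):
    """Return the grid position where the estimated Ks density is highest."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n_points)]
    density = gaussian_kde(values, grid, bandwidth)
    best = max(range(n_points), key=density.__getitem__)
    return grid[best]


# Synthetic Ks values (not real data): a cluster of duplicates near 0.30
# on top of a diffuse background of older gene pairs.
rng = random.Random(0)
synthetic_ks = [rng.gauss(0.30, 0.05) for _ in range(400)]
synthetic_ks += [rng.uniform(0.0, 3.0) for _ in range(200)]

peak = ks_peak(synthetic_ks)
print(f"estimated Ks peak: {peak:.2f}")
```

The choice of bandwidth affects how sharply peaks are resolved, so a faint feature such as the minor slope near Ks 2.08 can appear or vanish depending on the smoothing used.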

Population analysis unveiled two distinct clusters

A total of 958,637,676 Illumina reads, amounting to approximately 100.4 Gbp, were generated for 31 T. esculentum individuals collected from various locations in Namibia and South Africa (Table 4). Following quality control filtering, 23,772 bi-allelic SNPs were retained for population analysis. Principal component analysis (PCA) revealed two distinct clusters among the 31 individuals (Fig. 7A). Notably, samples from Pretoria Farm and Namibia Farm were genomically differentiated from wild plants collected across several locations in Namibia. In addition, plants from the Northwest (NW) and Southeast (SE) regions showed no discernible genetic differentiation (Fig. 7A), although a previous study based on mitogenome variants suggested that these two regions might form separate clusters119.

Table 4 The geographical origins of the 31 T. esculentum samples utilized in the population study.
Fig. 7

Population analysis of 31 T. esculentum samples reveals two distinct clusters. (A) Principal Component Analysis (PCA) plot of the first two principal components. Each dot represents one sample, colored by its sampling location. (B) Map showing sampling locations across Namibia, South Africa, and surrounding countries, with dot colors matching Panel (A) and dot sizes indicating the number of samples. (C) Unrooted maximum likelihood (ML) tree constructed from a PHYLIP alignment of 23,772 SNPs, depicting the relationships among the 31 marama samples. (D) ADMIXTURE analysis of population structure at the optimal cluster number, K = 2.

A maximum likelihood (ML) phylogenetic tree, constructed from a PHYLIP alignment of the 23,772 SNPs, further corroborates the presence of two genetic clusters among the 31 T. esculentum individuals (Fig. 7C). Population structure analysis produced consistent results (Fig. 7D). Cross-validation error calculations performed using ADMIXTURE indicated that the optimal number of clusters for the 31 individuals was K = 2 (Supplementary Figure S2). However, it remains unclear whether these two populations exhibit significant phenotypic differences, as considerable individual variation is already present within the species. Future systematic phenotyping and genotyping of larger sample sizes from diverse geographical locations will be essential for a more comprehensive understanding of T. esculentum’s evolutionary history and for providing guidance in variety selection for breeding programs.
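Selecting the cluster number by cross-validation amounts to picking the K with the lowest reported error. The sketch below assumes log lines of the form `CV error (K=2): 0.548`, the format ADMIXTURE is generally reported to print when cross-validation is enabled. The error values and the function name `best_k` are hypothetical and are not results from this study.

```python
import re

# Pattern for cross-validation lines such as "CV error (K=2): 0.548".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_lines):
    """Return (K with the lowest CV error, dict of all parsed errors)."""
    errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    if not errors:
        raise ValueError("no cross-validation lines found")
    return min(errors, key=errors.get), errors


# Hypothetical log excerpts (illustrative values only).
example_log = [
    "CV error (K=1): 0.612",
    "CV error (K=2): 0.548",
    "CV error (K=3): 0.571",
    "CV error (K=4): 0.603",
]
optimal_k, cv_errors = best_k(example_log)
print(optimal_k, cv_errors)
```

Where several values of K give nearly identical errors, the smallest such K is usually the more conservative choice.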

Discussion

This study presents the first high-quality genome assembly of T. esculentum, with an N50 of 2.75 Mb for contigs and 22.68 Mb for scaffolds, a substantial improvement over the 3 kb N50 of the previous assembly, which Dr. Kyle Logue generated using only Illumina short reads. Although the current assembly still contains numerous fragmented contigs, ongoing optimization is expected to improve its quality further. Many contigs are nonetheless long enough, approaching chromosome scale, to support the study of genes of interest, providing a valuable reference for marama breeding and evolutionary research. This genomic resource establishes a foundation for investigating critical topics such as the genetic mechanisms behind self-incompatibility, which must be understood to overcome pollination barriers and develop stable inbred lines, and flowering time, which is key to accelerating breeding cycles and improving crop productivity. It also offers insights into how marama adapts to harsh desert and semi-desert environments, information that is critical for developing resilient varieties and enhancing the efficiency of breeding programs.
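For readers less familiar with the N50 statistic quoted above, the following minimal sketch computes it from a list of sequence lengths. The contig lengths are a toy example and the function name `n50` is illustrative; it is not the tool used to produce the reported values.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths in arbitrary units (total = 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In this toy case the two longest contigs (80 + 70) already cover half of the total length, so the N50 equals the length of the second contig. A few long contigs can therefore raise the N50 even when many short fragments remain.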

HiCanu generated an assembly with more fragmented contigs, yet it captured the entire genomic content, yielding an assembly size close to the tetraploid genome size of T. esculentum. This is attributed to HiCanu’s default settings, which separate haplotypes at a low divergence threshold of 0.01%, preserving the integrity of the genome (https://canu.readthedocs.io/en/latest/faq.html). In contrast, Hifiasm combined with the third-party purging tool Purge_Dups120 produced an assembly whose size is closer to the expected two chromosome sets of marama. Even so, the assembly still contains duplicated content. Further purging of duplications could refine the assembly by eliminating redundancies, but it risks collapsing critical repeats or segmental duplications essential for maintaining genomic stability.

To enhance assembly quality, future efforts could incorporate data from alternative long-read sequencing platforms, such as Oxford Nanopore Technologies (ONT), which generates reads of significantly greater length (up to several hundred kilobases)105. These longer reads would improve scaffolding continuity, enabling the generation of a chromosome-scale assembly. Additionally, increasing sequencing coverage would further enhance the assembly’s completeness and reduce fragmentation. The integration of these complementary technologies has the potential to address current limitations, such as phase ambiguities and incomplete scaffolding, ultimately producing a more refined and accurate reference genome. Further improvement in continuity and completeness is essential, particularly for the investigation of large structural variations121, which could play a crucial role in advancing marama breeding by uncovering their functional impact.

The improved assembly and annotation of T. esculentum establish a robust foundation for future research. This genomic resource enables deeper exploration into the genetic mechanisms underlying key traits, such as self-incompatibility and adaptation to harsh environments. It also supports the development of molecular markers for breeding programs aimed at enhancing marama as a crop. Continued advancements in genome assembly and annotation will be crucial for fully unlocking the genetic potential of marama and facilitating its broader application in food security and agricultural research.