Genome assembly and population analysis of tetraploid marama bean reveal two distinct genome types

Li, Jin; Cullis, Christopher

doi:10.1038/s41598-025-86023-w


Article
Open access
Published: 21 January 2025


Jin Li¹ &
Christopher Cullis¹

Scientific Reports volume 15, Article number: 2665 (2025)


A Publisher Correction to this article was published on 21 August 2025

This article has been updated

Abstract

Tylosema esculentum (marama bean), an underutilized orphan legume native to southern Africa, holds significant potential for domestication as a rescue crop to enhance local food security. Well-adapted to harsh desert environments, it offers valuable insights into plant resilience to extreme drought and high temperatures. In this study, k-mer analysis indicated marama as an ancient allotetraploid legume. Using 21.5 Gb of PacBio HiFi data, the genome was assembled with two assemblers, HiCanu and Hifiasm, followed by scaffolding with Omni-C data from Dovetail Genomics (Cantata Bio) using HiRise, resulting in a 558.78 Mb assembly with near chromosome-level continuity (N50 = 22.68 Mb, L50 = 8). Repeats accounted for 58.43% of the genome. Phylogenetic analysis indicated a close relationship with Bauhinia variegata and Cercis canadensis, diverging approximately 27.22 and 31.68 million years ago (Ma), respectively. Whole-genome duplication (WGD) analysis revealed an ancient duplication event in marama. Gene family analysis revealed expanded families enriched in pathways related to stress adaptation, energy metabolism, and environmental signaling, including the spliceosome, citrate cycle, and carbon fixation pathways. These findings highlight marama’s resilience to arid environments. In contrast, contracted gene families associated with secondary metabolite biosynthesis and defense pathways suggest a trade-off, potentially due to reduced pathogen pressure. Marama-specific genes were enriched in amino acid catabolism pathways, potentially playing roles in stress signaling and energy regulation. Core gene families shared with other legumes were enriched in conserved pathways, such as photosynthesis and hormone signaling, which are fundamental for plant growth and survival. Population analysis of geographically diverse samples revealed two distinct clusters, though phenotypic differences remain unclear. 
Overall, this study presents the first high-quality genome assembly of marama bean, offering a valuable genomic reference for understanding its unique biology and highlighting its potential for crop improvement in challenging environments.

Introduction

Tylosema esculentum, commonly known as the marama bean, is a long-lived perennial legume native to southern Africa (Fig. 1)¹. Adapted to arid and semi-arid desert environments, marama employs a unique drought avoidance strategy by growing tubers that can weigh over 250 kg² to store water, enabling survival in the prolonged hot and dry conditions of the Kalahari Desert (Fig. 1D)³. The domestication of marama has the potential to improve local food security due to the high nutritional value of its edible seeds, whose protein and lipid contents are comparable to those of commercial crops like soybean and peanuts^4,5. A significant obstacle to marama breeding is its delayed flowering, typically occurring in the second year or later. This extended juvenile phase forces breeders to wait years to harvest seeds and assess desirable traits, significantly slowing the breeding cycle. Exploring the genotypic and phenotypic diversity in natural populations and employing molecular marker-assisted breeding strategies are effective alternatives to traditional breeding methods^6,7. Key breeding goals for marama include shortening the flowering time to expedite seed acquisition and developing an erect growth habit to facilitate field harvests⁸. Additionally, overcoming self-incompatibility is essential for creating inbred lines that ensure stable inheritance of desirable traits and enabling crosses between previously incompatible varieties to produce new cultivars with favorable allelic combinations^9,10. Studying marama also provides insights into plant adaptation to harsh environments, which is increasingly relevant in the context of global warming. A high-quality genome assembly will provide a valuable reference for exploring the genetic basis of relevant traits.

The estimated total genome size of T. esculentum is 1 gigabase (Gb), comprising 44 chromosomes (2n = 4x = 44), as determined through next-generation sequencing data and Feulgen staining^6,14. A comprehensive dataset, accessible under PRJNA779273, encompasses Illumina whole-genome sequencing data from over 80 marama individuals sourced from various geographical locations in Namibia and South Africa, along with PacBio long reads from selected individuals. These data were used in assembling and analyzing the chloroplast and mitochondrial genomes of marama^15,16,17. Comparative genomic studies were conducted to explore the genetic diversity within the marama organelle genome^17,18. These studies revealed the presence of two distinct organelle genome types with substantial differences, although the functional implications remain unknown. The nuclear genome assembly, however, remained rudimentary: an earlier assembly by Dr. Kyle Logue, built solely from short Illumina reads, had an N50 of only 3 kilobases (kb)¹⁷.

The advent of next-generation sequencing has significantly advanced genome assembly due to its cost-effectiveness, high speed, and throughput¹⁹. However, challenges persist in assembling complex genomes, such as polyploid and repeat-rich genomes, when solely relying on short reads from next-generation sequencing techniques. As a third-generation sequencing technology, PacBio offers longer reads, averaging over 10 kb and extending up to 25 kb, which addresses the shortcomings of previous methods. The latest PacBio HiFi sequencing enhances accuracy to over 99.9% while maintaining read length²⁰, improving genome assembly quality. To further enhance genome assembly accuracy, particularly for complex genomes, high-throughput chromatin conformation capture (Hi-C) leverages genome-wide chromatin interactions to capture the 3D structure of chromosomes^21,22. This is followed by sequencing, enabling accurate scaffolding of genome assemblies. Hi-C has become a widely used technology for studying complex plant genomes^23,24.

This research aimed to generate the first high-quality genome assembly of T. esculentum (marama) using PacBio HiFi sequencing data. Preliminary assemblies were conducted using HiCanu²⁵ and Hifiasm²⁶, followed by scaffolding with Hi-C data and the HiRise assembler from Cantata Bio LLC to address the complexities of polyploid genomes. Comprehensive gene prediction and annotation were performed, and a comparative genomics study was conducted to analyze gene families in marama and related legumes. Functional uniqueness within the marama genome was identified, alongside phylogenetic analyses to clarify its evolutionary relationships and divergence from related species. Additionally, samples from different geographical regions were collected, and variants identified from resequencing data were used to explore the genetic diversity and population structure within the species, providing insights into marama’s evolution and adaptation. This work offers a valuable genomic resource to support future research and breeding efforts, enhancing marama’s potential as a resilient and sustainable food crop.

Methods

Sample collection and sequencing

Sample 4 of T. esculentum, cultivated in the Case Western Reserve University greenhouse from seeds of unknown provenance in Namibia, was utilized for DNA extraction and sequencing. Fresh young leaves (1 g) were ground in liquid nitrogen using a mortar and pestle, and high-molecular-weight DNA was extracted using the Quick-DNA HMW MagBead kit (Zymo Research). DNA concentration was quantified using an Invitrogen™ Qubit™ 3.0 Fluorometer, and quality was assessed by electrophoresing 200 ng of DNA on a 1.5% agarose TBE gel at 40 V for 24 h.

For PacBio sequencing, DNA samples were submitted to the Genomics Core Facility at the Icahn School of Medicine at Mount Sinai. Sequencing libraries were constructed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). The libraries were sequenced on two 8M SMRT® Cells using the Sequel® II platform, generating 2,184,811 reads with a total yield of 21.5 Gbp.

To support genome scaffolding, fresh leaf samples from the same plant were flash-frozen in liquid nitrogen and shipped on dry ice to Dovetail Genomics (Cantata Bio) for Omni-C library preparation. Omni-C, an advanced Hi-C technology, employs sequence-independent endonucleases to achieve uniform genome-wide coverage, eliminating biases from restriction enzyme-based methods. Chromatin was fixed with formaldehyde, extracted, and digested with DNase I before ligation to biotinylated bridge adapters. Proximity ligation was followed by crosslink reversal, removal of non-internal biotin, and library preparation using NEBNext Ultra enzymes and Illumina-compatible adapters. Libraries were enriched via streptavidin bead capture and amplified by PCR. Sequencing was performed on an Illumina HiSeqX platform, achieving ~30x coverage.

For transcriptome sequencing, high-quality RNA was extracted from leaf tissue (young leaves at the growing tip) and root tissue (root tips and young roots of germinated seeds) using the Quick-RNA Plant MiniPrep™ Kit (Zymo Research Corporation, Catalog No. 50–444-618). RNA quality and quantity were estimated by running a sample on a 2% agarose gel. RNA sequencing libraries were prepared by Novogene following a standard workflow. Messenger RNA (mRNA) was enriched from total RNA using poly-T oligo-attached magnetic beads, fragmented, and reverse-transcribed to synthesize first-strand cDNA using random hexamer primers, followed by second-strand synthesis. Library construction involved end repair, A-tailing, adapter ligation, size selection, amplification, and purification. Quality assessment was conducted using a Qubit fluorometer and real-time PCR for quantification, as well as a bioanalyzer for size distribution evaluation. The libraries were pooled and sequenced on the NovaSeq 6000 platform, generating 6.8 Gbp of transcriptomic data for root tissues and 7.2 Gbp for leaf tissues.

For the population study, Illumina whole-genome sequencing (WGS) data were generated for 84 individuals collected from various geographic regions in Namibia and South Africa. Details of the sequencing protocols and data processing are described in a previous study¹⁶. These data are available in the NCBI SRA database under BioProject PRJNA779273.

De novo genome assembly and quality assessment

The preliminary assembly of PacBio HiFi reads was generated using two assemblers: Hifiasm v.0.18.5 (Cheng et al. 2021) with a haplotype number set to four, and HiCanu²⁵ with an estimated genome size of 1 Gb based on previous assessments⁶. The HiRise pipeline^27,28 was then used to scaffold the de novo assembly with the help of Dovetail Omni-C reads. These reads were aligned to the draft assembly using Burrows-Wheeler Aligner (BWA)¹⁶ (https://github.com/lh3/bwa). HiRise analyzed the read pair distances within scaffolds to generate a likelihood model for genomic distance, which was used to identify and correct misjoins.

K-mer analysis of the PacBio HiFi reads was performed using Jellyfish v. 2.3.0²⁹ (https://github.com/gmarcais/Jellyfish) with a k-mer length of 21. The results were used to construct k-mer spectra with GenomeScope 2.0³⁰ (http://qb.cshl.edu/genomescope/genomescope2.0/). Assembly quality was assessed with QUAST v. 5.2.0³¹ (https://github.com/ablab/quast) and visualized using Matplotlib v. 1.3.1³². Genome completeness was evaluated using BUSCO v. 5.3.0³³ (https://busco.ezlab.org/) against the Embryophyta ortholog database (embryophyta_odb10, 1614 genes) and the Fabales ortholog database (fabales_odb10, 5366 genes).
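To illustrate the k-mer counting step, the sketch below counts canonical k-mers in a few toy reads and builds the multiplicity histogram that spectrum-modelling tools such as GenomeScope take as input. It is a pure-Python stand-in and not the Jellyfish command used in this study; the function names and the assumption of canonical (strand-independent) counting are ours.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    toy_counts = count_kmers(["ACGTA", "TACGT"], k=3)  # toy reads, not real data
    print(dict(kmer_spectrum(toy_counts)))
```

In a real spectrum, the peaks of this histogram sit at multiples of the per-haplotype coverage, which is what allows ploidy and heterozygosity to be modelled.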

Additionally, the genome assembly was compared to that of Bauhinia variegata (ASM2237911v2)³⁴, the closest species with an available genome sequence, by aligning the assemblies with minimap2 v. 2.28¹⁶ (https://github.com/lh3/minimap2). The pairwise mapping data (PAF) was visualized using a dot plot created with the R package pafr (https://github.com/dwinter/pafr) to assess synteny and validate the assembly’s structure.
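The dot plot is built from PAF records, whose first twelve tab-separated columns are fixed by the format (query name, length, start, end, strand, target name, length, start, end, matching residues, alignment block length, mapping quality). The sketch below is a minimal illustration of turning such records into plot segments; it is not the pafr implementation, and the names PafHit, parse_paf_line and dotplot_segments, as well as the 10 kb length cutoff, are hypothetical.

```python
from typing import NamedTuple


class PafHit(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int
    identity: float


def parse_paf_line(line):
    """Parse the mandatory PAF columns of one record."""
    f = line.rstrip("\n").split("\t")
    n_match, block_len = int(f[9]), int(f[10])
    identity = n_match / block_len if block_len else 0.0
    return PafHit(f[0], int(f[2]), int(f[3]), f[4],
                  f[5], int(f[7]), int(f[8]), identity)


def dotplot_segments(lines, min_len=10_000):
    """Return (x0, x1, y0, y1) segments; reverse-strand hits get a descending y range."""
    segments = []
    for line in lines:
        if not line.strip():
            continue
        hit = parse_paf_line(line)
        if hit.q_end - hit.q_start < min_len:
            continue  # drop short alignments that only add noise to the plot
        if hit.strand == "+":
            segments.append((hit.t_start, hit.t_end, hit.q_start, hit.q_end))
        else:
            segments.append((hit.t_start, hit.t_end, hit.q_end, hit.q_start))
    return segments
```

Forward-strand segments rise along the diagonal and reverse-strand segments fall, which is how inversions show up in a dot plot.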

Gene prediction and functional annotation

A de novo repeat library was constructed for the assembly using RepeatModeler (v. 2.0)³⁵. This custom library, combined with the Dfam v3.0 database³⁶, was used to annotate and mask repetitive elements in the genome assembly with RepeatMasker v. 4.1.4³⁷ (https://www.repeatmasker.org/). Alignments were conducted using the rmblastn search engine³⁸. For gene prediction, transcriptomic data were aligned to the genome assembly using HISAT2 v. 2.2.1^23,39, resulting in SAM files that were converted to sorted BAM files and combined using SAMtools v. 1.20. The BRAKER v 3.0.8 pipeline B⁴⁰, which integrates evidence-based and ab initio approaches for gene annotation, was employed. The pipeline utilized RNA-Seq spliced alignment data to train GeneMark-ET v. 4.71⁴¹, incorporating both genome sequence and RNA-Seq evidence. The resulting gene models were then used to train Augustus v. 3.5.0⁴² for final gene predictions.

A statistical summary of the annotation was generated from the resulting GFF file using the AGAT v. 1.0.0 toolkit⁴³. The completeness of the gene annotation was assessed using BUSCO against the embryophyta_odb10 database. Functional gene annotation was performed using eggnog-mapper 2.1.12⁴⁴ (http://eggnog-mapper.embl.de/), referencing the eggNOG 5 database⁴⁵.

Evolutionary analyses: phylogenetic relationships, whole genome duplication, and gene enrichment

To investigate the evolutionary relationships of T. esculentum and related species, a comparative genomics study was conducted on the gene families of 13 legumes and one outgroup species, soapbark. The protein data of nine species were retrieved from the JGI Phytozome v. 13 database⁴⁶: Arachis hypogaea (v1.0)⁴⁷, Cicer arietinum (v1.0)⁴⁸, Medicago truncatula (Mt4.0v1)⁴⁹, Lotus japonicus (Lj1.0v1)^50,51, Glycine max (Wm82.a2.v1)⁵², Phaseolus acutifolius (v1.0)⁵³, Vigna unguiculata (v1.1)⁵⁴, Lupinus albus (v1)⁵⁵, Cercis canadensis (v3.1)⁵⁶, and four from NCBI GenBank: Bauhinia variegata (ASM2237911v2)³⁴, Prosopis alba (ASM479914v2), Prosopis cineraria (ASM2901754v1)⁵⁷, and Quillaja saponaria (AO_1.2)⁵⁸. CD-HIT v. 4.8.1⁵⁹ with a threshold of 0.95 was applied to retain only the longest isoform of each protein. OrthoFinder v. 2.4.0⁶⁰ was then utilized with an all-against-all method to identify orthologous genes across these species, which revealed the evolutionary relationship among these plant taxa.

A total of 157 single-copy orthologs were identified among 14 species and aligned using MAFFT v. 7.520⁶¹. Regions with poor alignment and high divergence were removed using Gblocks v. 0.91b⁶², specifying a minimum block length of 5, while all gaps were retained as meaningful. The trimmed alignments were concatenated into a single FASTA file. A maximum likelihood phylogenetic tree was constructed using concatenated protein sequences in IQ-TREE v. 2.2.2.7⁶³ with the ModelFinder Plus (MFP)⁶⁴ algorithm to automatically select the optimal substitution model. Tree topology robustness was assessed with 1000 bootstrap replications. Approximate divergence times for C. arietinum and M. truncatula (31.9 million years ago) and for G. max and V. unguiculata (25.3 million years ago) were retrieved from TimeTree (http://www.timetree.org)⁶⁵. These divergence times were used to calibrate the overall divergence times in the phylogenetic tree using the Timetree Wizard⁶⁶ in MEGA 11⁶⁷.
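The concatenation step can be sketched as follows: each trimmed per-gene alignment contributes one block of columns, and the blocks are joined per species into a single supermatrix. This is an illustrative sketch, not the script used in the study; the function names are hypothetical, and the gap padding for a missing taxon is only a safeguard, since single-copy orthologs are by definition present in every species.

```python
def concatenate_alignments(alignments, taxa):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) into one supermatrix."""
    supermatrix = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Pad with gaps if a taxon is absent from this gene alignment.
            supermatrix[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


def to_fasta(sequences):
    """Serialise a dict of name -> sequence as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())
```

Keeping the taxon order fixed across genes matters here, because the tree-building program treats each row of the supermatrix as one species.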

Gene count data generated by OrthoFinder were used to calculate gene family size variation across the phylogenetic tree using CAFE5 v. 5.0.0⁶⁸. To mitigate potential noise from excessively large gene families and those with high variance, families with more than 100 gene copies were filtered out using the clade_and_size_filter.py script. The resulting data on gene family expansions and contractions (with a p-value < 0.05) were then mapped onto the phylogenetic tree, providing a visual representation of these evolutionary changes. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed using TBtools 2.0⁶⁹ and the KEGG database^70,71,72 on genes from the expanded and contracted gene families of T. esculentum. The top enriched pathways were visualized using ggplot2 in R, sorted by enrichment factor.
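The size filter can be pictured with the short sketch below. It is not the clade_and_size_filter.py script itself; the function name is hypothetical, and applying the 100-copy cutoff to the count in any single species (rather than to the family total) is our assumption about how the threshold is interpreted.

```python
def filter_large_families(counts, max_copies=100):
    """Split gene families into kept and removed sets by per-species copy number.

    counts maps family ID -> {species: gene count}. A family is removed when
    any single species has more than max_copies genes (assumed interpretation).
    """
    kept, removed = {}, {}
    for family, per_species in counts.items():
        if max(per_species.values()) > max_copies:
            removed[family] = per_species
        else:
            kept[family] = per_species
    return kept, removed
```

Very large families tend to dominate the birth-death rate estimate, which is why they are set aside before fitting the model.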

Orthologous gene families were also analyzed among five selected legumes, T. esculentum, B. variegata, C. canadensis, G. max, and P. alba, which include both closely related species of T. esculentum as well as representatives from different phylogenetic clusters. The distribution of shared and species-specific gene families was visualized using the VennDiagram package v. 1.7.3⁶⁹ in R 4.3.1²⁸. Genes in core gene families, shared by all five legumes, and genes in the T. esculentum-specific gene family were used for KEGG enrichment analyses with TBtools v. 2.0. The results were visualized using ggplot2⁷³ in R, offering a comprehensive understanding of the common and unique functional pathways among these five legumes.
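For readers unfamiliar with pathway enrichment, the sketch below shows one common formulation: an enrichment factor (the pathway's share of the gene list divided by its share of the background) and a one-sided hypergeometric p-value. The text does not state which statistic TBtools applied, so the hypergeometric test is an assumption, and the function names are hypothetical.

```python
from math import comb


def enrichment_factor(k, n, K, N):
    """Share of the gene list in the pathway (k/n) over the background share (K/N)."""
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the pathway."""
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Toy numbers only: 2 of 3 listed genes hit a pathway holding 4 of 10 background genes.
    print(enrichment_factor(2, 3, 4, 10), hypergeom_pvalue(2, 3, 4, 10))
```

In practice the p-values would also be corrected for multiple testing across pathways before ranking.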

To further investigate the evolutionary relationships between T. esculentum and its closely related legume species, C. canadensis and B. variegata, whole genome duplication (WGD) events were analyzed. The coding sequences (CDS) of the genomes were self-aligned to identify homologous gene pairs using DIAMOND (v. 2.1.8)⁷⁴ with an e-value threshold of 1 × 10⁻¹⁰, in conjunction with WGD tools (v. 1.1.1)⁷⁵. The synonymous substitution rates (Ks values) were subsequently calculated using the ksd function from the WGD tools, which employs PAML (v. 4.9h)⁷⁶ for codon-based maximum likelihood analysis. A Ks threshold of 0.1 was applied to exclude local duplications and minimize noise. The results of the WGD analysis were visualized as a curve plot using ggplot2 (v. 3.5.1) in R (v. 4.3.1) (R Core Team, 2023), offering a graphical representation of evolutionary relationships.
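A simple way to picture how a Ks peak is located after the 0.1 cutoff is to bin the remaining values and take the most populated bin. The sketch below does exactly that; it is an illustration rather than the WGD tools procedure, and the function name, the upper bound of 3.0 and the bin width of 0.05 are hypothetical choices.

```python
def ks_peak(ks_values, min_ks=0.1, max_ks=3.0, bin_width=0.05):
    """Return the midpoint of the most populated Ks bin within [min_ks, max_ks)."""
    n_bins = int(round((max_ks - min_ks) / bin_width))
    bins = [0] * n_bins
    for ks in ks_values:
        if min_ks <= ks < max_ks:  # values below min_ks are treated as local duplicates
            index = min(int((ks - min_ks) / bin_width), n_bins - 1)
            bins[index] += 1
    if not any(bins):
        raise ValueError("no Ks values inside the requested range")
    best = max(range(n_bins), key=bins.__getitem__)
    return min_ks + (best + 0.5) * bin_width
```

A kernel density estimate, as drawn in a curve plot, gives a smoother picture of the same distribution, but the histogram mode is usually enough to see where a duplication peak sits.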

Population genetics analysis

The population study included 31 independent marama samples, with 24 collected from wild plants across various locations in Namibia: 3 from Tsjaka (S22 75.039 E19 20.712), 1 from Okamatapati (S20 40.233 E18 21.59), 8 from Aminuis (S23 38.000 E19 22.00), 4 from Osire (S21 02.031 E17 21.244), 3 from Tsumkwe (S19 21.000 E20 16.000), 2 from Ombujondjou (S20 18.600 E17 58.525), 2 from Epukiro (S21 39.642 E19 25.092), and 1 from Otjiwarongo (S20 46.092 E16 65.123). Six samples were collected from the University of Pretoria Farm, where they had been cultivated for over thirty years with an unknown original source. Additionally, one sample was grown from seed collected from the Namibia Farm (21°23′48.5″ S 19°44′59.6″ E). DNA extraction was described in detail in previous studies^16,77, and the whole-genome sequencing (WGS) reads are available in the NCBI SRA database under BioProject PRJNA779273.

Paired-end Illumina reads were aligned to the genome assembly with BWA v 0.7.17 mem¹⁶, followed by conversion of SAM files to sorted BAM files using SAMtools v. 1.20. SNPs were called using BCFtools mpileup⁷⁸. SNP processing involved multiple filtering steps to ensure high-quality variants. VCF files from different samples were first merged using BCFtools merge, and then filtered to retain only biallelic SNP variants using BCFtools view. Quality and depth filtering was applied with BCFtools filter, using thresholds of QUAL ≥ 30, DP > 10, and DP ≤ 100. The filtered VCF file was then normalized with BCFtools norm using the reference genome to ensure consistent variant representation. A final filter step retained variants with a minor allele frequency (MAF) greater than 0.05 and SNP missing call rate lower than 0.3 for the population study.
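The thresholds above can be restated as a single per-site predicate, sketched below. This is not the BCFtools command line used in the study; the function name is hypothetical, DP is assumed to be the site-level depth, and genotypes are assumed to be coded as diploid alternate-allele counts (0, 1, 2, or None when missing).

```python
def passes_site_filters(qual, depth, genotypes, min_qual=30, min_dp=10,
                        max_dp=100, min_maf=0.05, max_missing=0.3):
    """Apply QUAL >= 30, 10 < DP <= 100, MAF > 0.05 and missing rate < 0.3 to one SNP."""
    if not genotypes or qual < min_qual:
        return False
    if not (min_dp < depth <= max_dp):
        return False
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate >= max_missing or not called:
        return False
    alt_freq = sum(called) / (2 * len(called))  # diploid coding assumed
    maf = min(alt_freq, 1 - alt_freq)
    return maf > min_maf
```

The upper depth bound is what removes sites in collapsed repeats, where reads from several copies pile up on one reference position.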

Principal component analysis (PCA) was performed using PLINK v.2.00⁷⁹, and the first two components were visualized using ggplot2 in R v.4.3.1²⁸. ADMIXTURE v.1.23⁸⁰ was used to assess population structure, with the optimal number of populations determined through cross-validation errors and visualized with ggplot2. For phylogenetic analysis, the VCF file was converted to a PHYLIP alignment using vcf2phylip.py (v. 2.8)⁸¹ and a maximum likelihood tree was constructed with IQ-TREE (v. 2.2.2.7)⁶³ using 1000 bootstrap replicates. The resulting tree was visualized using Interactive Tree of Life (iTOL) (v. 6)⁸².
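The VCF-to-PHYLIP conversion amounts to writing one character per SNP for each sample, with heterozygous calls represented by IUPAC ambiguity codes. The sketch below illustrates that idea for biallelic SNPs with diploid genotype strings; it is not the vcf2phylip.py code, and the function names and the simple "name sequence" PHYLIP layout are assumptions.

```python
# IUPAC codes for single bases and for heterozygous base pairs.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_base(ref, alt, gt):
    """Translate a genotype such as '0/1' or '1|1' into a base; missing calls become 'N'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "N"
    bases = frozenset((ref, alt)[int(allele)] for allele in alleles)
    return IUPAC[bases]


def to_phylip(samples, sites):
    """Build PHYLIP text from sites given as (ref, alt, [genotype per sample])."""
    sequences = {sample: [] for sample in samples}
    for ref, alt, genotypes in sites:
        for sample, gt in zip(samples, genotypes):
            sequences[sample].append(genotype_to_base(ref, alt, gt))
    lines = [f"{len(samples)} {len(sites)}"]
    lines += [f"{sample} {''.join(sequences[sample])}" for sample in samples]
    return "\n".join(lines) + "\n"
```

Because only variable sites enter the alignment, branch lengths from such a tree reflect SNP differences rather than per-base divergence across the whole genome.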

Results

Estimation of genome size and heterozygosity

A total of 21.5 Gbp PacBio HiFi reads were generated and analyzed to characterize the genomic properties of T. esculentum. K-mer analysis was conducted using a k-mer length of 21, and the resulting distribution was modeled with GenomeScope 2.0, producing a k-mer spectra map that revealed peaks corresponding to 1-fold, 2-fold, and 4-fold coverage (Fig. 2). The data more closely aligned with a tetraploid genome model than the initially hypothesized hexaploid model for T. esculentum. The predicted genome size for a single chromosome set was 277.44 Mb, which is comparable to the compact genome of the legume species Amphicarpaea edgeworthii, with a genome size of 298.1 Mb and a haploid chromosome number of 11 (2n = 22)^83,84. Additionally, the k-mer spectra indicated substantial heterozygosity, with 2.2% of the genome exhibiting heterozygous characteristics. Notably, both the aaab (1.410%) and aabb (0.498%) patterns were observed at high frequencies, suggesting that T. esculentum may possess a complex ploidy structure, potentially indicative of an ancient allotetraploid that has accumulated mutations over time, leading to an elevated aaab pattern ratio³⁰.

De novo genome assembly and evaluation

A total of 2,184,811 PacBio HiFi reads were subjected to preliminary assembly using HiCanu and Hifiasm. HiCanu produced a complete genome assembly of 1.24 Gb, composed of 9,532 contigs, a size that aligns closely with the expected size of four chromosome sets. This assembly exhibited an N50 value of 1.28 Mb and an L50 value of 252 (the minimum number of contigs whose combined length reaches half the assembly size) (Table 1). In contrast, Hifiasm generated a partially phased assembly of 558.23 Mb with higher continuity, consisting of 4,175 contigs. This assembly demonstrated a markedly improved N50 of 2.75 Mb and an L50 of 35. Both assemblies achieved high completeness, with BUSCO scores exceeding 99% when evaluated against the embryophyta_odb10 database.
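The two continuity statistics quoted here follow directly from the sorted contig lengths, as the short sketch below shows. It is an illustration of the standard definitions rather than the QUAST code; the function name is hypothetical and the example lengths are toy values.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # N50 is the length at which the running total reaches half the
            # assembly; L50 is the number of sequences needed to get there.
            return length, rank
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    print(n50_l50([80, 70, 50, 40, 30, 20]))  # toy lengths, not from this assembly
```

A higher N50 together with a lower L50 therefore means that fewer, longer sequences cover the same fraction of the assembly.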

Table 1 T. esculentum sequencing and genome assembly statistics.


Subsequently, the Hifiasm assembly was submitted to Dovetail Genomics (Cantata Bio) for scaffolding using Omni-C data to capture chromatin interactions (Supplementary Figure S1). The final assembly size was 558.78 Mb, which is close to the estimated size of 554.88 Mb for two sets of chromosomes. Despite the relatively high contig count (3,888), continuity was significantly improved, as evidenced by an N50 of 22.68 Mb, approaching chromosome-level continuity. The longest scaffold was 56.19 Mb (Fig. 3A). The L50 was reduced to 8, meaning that the top eight scaffolds collectively represented 50% of the genome size (Fig. 3B). The average guanine-cytosine (GC) content across all contigs was 37.20% (Table 1; Fig. 3C). BUSCO completeness remained robust, with a score of 99.1% against the Embryophyta database and 93.6% against the Fabales database.
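The size comparison in this paragraph is simple arithmetic on the GenomeScope estimate of 277.44 Mb per chromosome set, and can be checked with the sketch below. The constants are taken from the text; the function names are hypothetical.

```python
MONOPLOID_MB = 277.44   # GenomeScope estimate for one chromosome set (from the text)
SCAFFOLDED_MB = 558.78  # final HiRise-scaffolded assembly (from the text)
HICANU_MB = 1240.0      # HiCanu assembly, 1.24 Gb (from the text)


def expected_size(chromosome_sets):
    """Expected assembly size in Mb if the given number of chromosome sets is represented."""
    return MONOPLOID_MB * chromosome_sets


def relative_difference(observed, expected):
    """Signed difference between observed and expected size, as a fraction of expected."""
    return (observed - expected) / expected


if __name__ == "__main__":
    print(f"two sets:  {expected_size(2):.2f} Mb, "
          f"scaffolded assembly differs by {relative_difference(SCAFFOLDED_MB, expected_size(2)):.2%}")
    print(f"four sets: {expected_size(4):.2f} Mb, "
          f"HiCanu assembly differs by {relative_difference(HICANU_MB, expected_size(4)):.2%}")
```

Under these figures the scaffolded assembly lies within about 1% of the two-set expectation, while the HiCanu assembly is roughly 12% larger than four sets would predict.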

Approximately 58.43% of the T. esculentum genome assembly was annotated as repetitive sequences (Table 2). The most prevalent repeat component, long-terminal repeat (LTR) retroelements, accounted for 22.61% of the genome, with Gypsy/DIRS1 elements comprising 15.65% and Ty1/Copia elements contributing 3.48%. Low-complexity regions (LCRs) and simple repeats represented 11.77% and 7.45% of the genome, respectively.

Table 2 Summary of repeat elements in the T. esculentum genome assembly by RepeatMasker.


A total of 49,343 protein-coding genes were predicted using BRAKER3 (Table 3), with an average of 6.3 exons per gene (mean exon length 218.35 bp) and 5.1 introns per gene (mean intron length 369.56 bp). Evaluation of gene set completeness using BUSCO, with reference to the Embryophyta core gene database, revealed a completeness of 95.8%. The predicted gene set was further annotated using eggNOG-mapper against the eggNOG database v. 5.0.2, with results summarized in Supplementary Table S1.

Table 3 Gene prediction statistics for T. esculentum genome assembly.


Comparison of the genome assemblies of T. esculentum and B. variegata

Genomes of only a limited number of plants from the Cercidoideae subfamily have been assembled, with B. variegata being the closest evolutionary relative to T. esculentum⁸⁵. The genome assembly of B. variegata (ASM2237911v2) spans 326.4 Mb and consists of 14 chromosomes (2n = 28), ranging in size from 18.26 Mb to 27.62 Mb³⁴. The T. esculentum genome assembly was aligned to the B. variegata genome using minimap2, and the results were visualized as a dot plot with the R package pafr (Fig. 4). The alignment revealed partial collinearity, with conserved regions forming distinct diagonal lines. However, the presence of numerous missing alignments and structural variations, including inversions and translocations, highlights substantial genomic divergence between the two species. To further investigate, Illumina reads from three randomly selected T. esculentum samples (M1, M40, Index1) were mapped to the B. variegata genome using Bowtie2 v2.4.4⁸⁶ (https://github.com/BenLangmead/bowtie2). The overall alignment rate was approximately 20.36%, whereas mapping the same reads to the Vigna radiata genome (PRJNA301363) resulted in a much lower alignment rate of 2.7%⁸⁷. These findings underscore the highly divergent nature of the T. esculentum genome compared to other legumes.

Phylogenetic analyses, along with the evolution of gene families and whole genome duplication analysis of T. esculentum and related species

Ortholog analyses were performed on 14 species: 13 legumes, comprising three Cercidoideae (C. canadensis, B. variegata, and T. esculentum), two Caesalpinioideae (P. alba and P. cineraria), and eight Faboideae (A. hypogaea, C. arietinum, M. truncatula, L. japonicus, G. max, P. acutifolius, V. unguiculata, and L. albus), along with one outgroup species, Q. saponaria. A 95% similarity threshold was first applied to retain only the longest isoform. Out of a total of 510,326 genes, 472,973 (92.68%) were assigned to 33,383 orthogroups, with 40,527 genes in 9,466 orthogroups identified as species-specific. The 157 single-copy orthogroups were used to construct a phylogenetic tree, with Q. saponaria as the outgroup to root the tree (Fig. 5A). The divergence between B. variegata and T. esculentum was estimated to have occurred approximately 27.22 million years ago (Ma), and the divergence from C. canadensis approximately 31.68 Ma. The gene family size variation in T. esculentum is close to that of B. variegata, with more gene families undergoing expansion (2,231) than contraction (1,155). A total of 6,707 genes in the expanded gene families and 951 genes in the contracted gene families of T. esculentum were subjected to KEGG pathway enrichment analyses using TBtools (Fig. 5B and C, Supplementary Tables S4 and S5).

The contracted gene families in marama are primarily enriched in pathways associated with plant defense and stress adaptation, including plant secondary metabolite biosynthesis (00999), tropane, piperidine, and pyridine alkaloid biosynthesis (00960), and cutin, suberine, and wax biosynthesis (00073) (Fig. 5B). The contraction of genes in these pathways suggests a reduced capacity for synthesizing specialized metabolites and protective compounds, which are typically essential for defense against biotic stressors, such as pathogens^88,89. This reduction likely reflects the lower pathogen pressure in marama’s native arid environment, where pathogen diversity and abundance are limited. Consequently, marama appears to prioritize resource allocation toward critical survival processes, such as drought tolerance, rather than extensive defense mechanisms. These findings highlight the trade-offs in marama’s genome that contribute to its resilience and efficient adaptation to harsh conditions.

The expanded gene families in marama are enriched in pathways essential for cellular function, energy metabolism, and stress adaptation, contributing to its remarkable resilience in arid environments (Fig. 5C). Key pathways such as the citrate cycle (TCA cycle) (00020), pyruvate metabolism (00620), and carbon fixation in photosynthetic organisms (00710) highlight marama’s ability to optimize energy production and carbon assimilation under resource-limited conditions, critical for survival in drought-prone areas^90,91. The expansion of arginine biosynthesis (00220) is particularly important, as arginine serves as a precursor for molecules involved in stress signaling, osmotic balance, and the detoxification of reactive oxygen species^92,93. The increased presence of GTP-binding proteins (04031) reflects an expanded set of signaling molecules that play key roles in cellular communication, stress responses, and protein trafficking, enabling rapid adaptation to fluctuating environmental conditions^94,95. Additionally, the expansion of the spliceosome pathway (03040) enhances RNA processing and gene expression regulation, supporting marama’s ability to fine-tune its transcriptome under stress^96,97,98. Finally, the enrichment in structural proteins (99992) underscores the importance of maintaining cellular integrity, including DNA repair and cytoskeletal stability, ensuring marama’s structural resilience in extreme conditions⁹⁹. Together, these expanded gene families provide marama with enhanced capabilities for energy production, stress response, and cellular maintenance, reinforcing its capacity to thrive in harsh, resource-limited environments.

Gene families in T. esculentum were compared to those of four selected legumes, B. variegata, C. canadensis, G. max, and P. alba (Fig. 6A). A total of 24,995 orthogroups were identified, of which 13,977 (55.92%) were core gene families shared across all five legumes, encompassing 24,348 genes. Additionally, 5,824 (23.30%) species-specific orthogroups were identified, including 1,271 exclusive to T. esculentum, comprising 4,191 genes. KEGG enrichment analyses were performed on T. esculentum genes in both core and species-specific gene families (Supplementary Tables S4 and S5), providing insights into the functional roles of the genes in each group.

Genes within the core gene families shared by the five legumes were enriched in pathways fundamental to plant growth, energy production, and stress response. Key enriched pathways include photosynthesis (00195, 00196) and plant hormone signal transduction (04075), which are essential for maintaining photosynthetic efficiency and regulating growth, both critical for legume development^60,100. Additionally, pathways such as N-Glycan biosynthesis (00510) and GPI-anchor biosynthesis (00563) emphasize the significance of protein modification and cell wall maintenance in shared gene functions^101,102,103,104. Metabolic pathways, including galactose metabolism (00052), highlight the role of energy regulation and signaling in supporting core biological processes^105,106.

In contrast, marama-specific gene families were enriched in pathways associated with stress tolerance, energy storage, and specialized metabolism, which are crucial for its adaptation to harsh environmental conditions. Key pathways, including amino acid catabolism (00280, 00330, 00380, etc.) and fatty acid degradation (00071), play pivotal roles in generating signaling molecules that regulate stress-responsive genes and proteins under stress conditions^{107,108,109,110}. Additionally, pathways involved in terpenoid backbone biosynthesis (00900) and glucosinolate biosynthesis (00966) contribute to the synthesis of defense-related secondary metabolites, potentially enhancing marama’s ability to cope with both biotic and abiotic stressors^111,112,113. Pathways such as DNA replication (03030) and porphyrin metabolism (00860) further support cellular maintenance and survival in extreme environments^114,115. These findings underscore marama’s genetic adaptations that enable it to thrive in arid conditions, highlighting its potential for use in breeding drought-tolerant crops.

Synonymous substitution rate (Ks) values were calculated for homologous gene pairs in T. esculentum, B. variegata, and C. canadensis to investigate whole-genome duplication (WGD) events and their evolutionary timelines (Fig. 6B). The Ks distribution for B. variegata (red curve) peaked at a Ks value of 0.24, consistent with previous studies^34,116, indicating a relatively recent WGD event. In contrast, the T. esculentum Ks distribution peaked at 0.30, suggesting that a WGD event occurred earlier than in B. variegata. For C. canadensis, the green curve showed only a small peak at a Ks value of 1.77, corresponding to the γ-WGT event within core eudicots approximately 120 million years ago^117,118, with a broad divergence time range. Additionally, T. esculentum exhibited a minor slope starting at a Ks value of 2.08, suggesting the presence of an even more ancient WGD event. The detection of both recent and ancient WGD signals in T. esculentum further supports the hypothesis that this species underwent multiple rounds of whole-genome duplication, which likely contributed to its genome complexity.
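To make concrete what "the Ks distribution peaked at" means, the sketch below locates the most populated bin of a Ks histogram. It is an illustration only: the section does not describe how peaks were called, the function name, bin width, and Ks cut-off are assumptions, and the data are synthetic values drawn around 0.30 rather than the paper's gene pairs.

```python
import numpy as np


def ks_peak(ks_values, ks_max=3.0, bin_width=0.02):
    """Return the centre of the most populated Ks bin (a crude mode estimate)."""
    ks = np.asarray(ks_values, dtype=float)
    ks = ks[(ks > 0) & (ks <= ks_max)]  # drop invalid or out-of-range estimates
    if ks.size == 0:
        raise ValueError("no Ks values in range")
    n_bins = int(round(ks_max / bin_width))
    edges = np.linspace(0.0, ks_max, n_bins + 1)
    counts, edges = np.histogram(ks, bins=edges)
    centres = (edges[:-1] + edges[1:]) / 2
    return float(centres[np.argmax(counts)])


# Synthetic stand-in for paralog Ks values (NOT data from the study):
# a main component near 0.30 plus a sparse background of older pairs.
rng = np.random.default_rng(0)
synthetic_ks = np.concatenate([
    rng.normal(0.30, 0.05, 2000),
    rng.uniform(0.0, 3.0, 500),
])
peak = ks_peak(synthetic_ks)
print(f"estimated Ks peak: {peak:.2f}")
```

A histogram mode is sensitive to bin width; kernel density estimates or mixture models are common alternatives when peaks are broad or overlapping.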

Population analysis unveiled two distinct clusters

A total of 958,637,676 Illumina reads, corresponding to an estimated size of 100.4 Gbp, were generated for 31 T. esculentum individuals collected from various locations in Namibia and South Africa (Table 4). Following quality control filtering, 23,772 bi-allelic SNPs were retained for population analysis. Principal component analysis (PCA) revealed two distinct clusters among the 31 individuals (Fig. 7A). Notably, samples from Pretoria Farm and Namibia Farm exhibited genomic differentiation from wild plants collected across several locations in Namibia. Additionally, plants from the Northwest (NW) and Southeast (SE) regions showed no discernible genetic differentiation, while a previous study suggested the potential for dividing these two regions into two separate clusters based on mitogenome variants (Fig. 7A)¹¹⁹.
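The PCA step can be illustrated with a small NumPy sketch that projects samples onto the principal components of a genotype matrix. The section does not state which software produced Fig. 7A, so this is not the authors' pipeline; the 0/1/2 allele-count coding, the function name, and the six-sample toy matrix are assumptions made purely for demonstration.

```python
import numpy as np


def snp_pca(genotypes, n_components=2):
    """PCA of a samples x SNPs matrix coded as 0/1/2 alternate-allele counts."""
    g = np.asarray(genotypes, dtype=float)
    centred = g - g.mean(axis=0)  # centre each SNP column
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)            # variance fractions
    return scores, explained[:n_components]


# Toy genotypes (hypothetical): rows 0-2 and rows 3-5 mimic two groups.
toy = np.array([
    [0, 0, 1, 0, 2, 2, 2, 1],
    [0, 1, 0, 0, 2, 2, 1, 2],
    [0, 0, 0, 1, 2, 1, 2, 2],
    [2, 2, 1, 2, 0, 0, 0, 1],
    [2, 1, 2, 2, 0, 0, 1, 0],
    [2, 2, 2, 1, 0, 1, 0, 0],
])
scores, explained = snp_pca(toy)
print(np.sign(scores[:, 0]), explained)
```

In this toy example the two groups fall on opposite sides of the first component, which is the pattern a two-cluster PCA plot would show.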

Table 4 The geographical origins of the 31 T. esculentum samples utilized in the population study.


A maximum likelihood (ML) phylogenetic tree, constructed from a PHYLIP alignment of the 23,772 SNPs, further corroborates the presence of two genetic clusters among the 31 T. esculentum individuals (Fig. 7C). Population structure analysis produced consistent results (Fig. 7D). Cross-validation error calculations performed using ADMIXTURE indicated that the optimal number of clusters for the 31 individuals was K = 2 (Supplementary Figure S2). However, it remains unclear whether these two populations exhibit significant phenotypic differences, as considerable individual variation is already present within the species. Future systematic phenotyping and genotyping of larger sample sizes from diverse geographical locations will be essential for a more comprehensive understanding of T. esculentum’s evolutionary history and for providing guidance in variety selection for breeding programs.
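Choosing K by cross-validation amounts to picking the run with the lowest reported CV error. The sketch below parses lines of the form `CV error (K=2): 0.52`, which is how ADMIXTURE reports the value when run with cross-validation enabled; the function name is illustrative and the error values in the example log are hypothetical placeholders, not the values behind Supplementary Figure S2.

```python
import re

# Matches ADMIXTURE log lines such as "CV error (K=2): 0.52".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) parsed from log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors


# Hypothetical placeholder values, for illustration only.
example_log = """\
CV error (K=1): 0.61
CV error (K=2): 0.52
CV error (K=3): 0.57
CV error (K=4): 0.63
"""
chosen_k, cv_errors = best_k(example_log)
print(chosen_k, cv_errors)
```

In practice the log text would be the concatenated output of one ADMIXTURE run per candidate K.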

Discussion

This study presents the first high-quality genome assembly of T. esculentum, featuring an N50 value of 2.75 Mb for contigs and 22.68 Mb for scaffolds, a significant improvement over the previous assembly’s 3 kb N50, which Dr. Kyle Logue achieved using only Illumina short reads. While the current genome assembly still contains numerous fragmented contigs, ongoing optimization efforts are anticipated to enhance its quality further. Despite these challenges, many contigs are sufficiently long, approaching chromosome level, to enable the study of genes of interest and to provide a valuable reference for marama breeding and evolutionary research. This genomic resource establishes a foundation for investigating critical topics, such as the genetic mechanisms behind self-incompatibility, which is crucial for overcoming pollination barriers and developing stable inbred lines, as well as flowering time, which is essential for accelerating breeding cycles and improving crop productivity. Additionally, it provides insights into plant adaptation mechanisms, revealing how marama adapts to harsh desert and semi-desert environments. This information is critical for developing resilient varieties and enhancing the efficiency of breeding programs.
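For readers less familiar with the continuity statistics quoted here, N50 is the length of the sequence at which the cumulative length of sequences, sorted from longest to shortest, first reaches half of the assembly size, and L50 is the number of sequences needed to get there. A minimal sketch, with a hypothetical function name and toy lengths rather than the marama assembly:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no sequence lengths supplied")


# Toy lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```

Applied to scaffold lengths, the same calculation yields the kind of N50 and L50 values reported for the assembly.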

HiCanu generated an assembly with more fragmented contigs, yet it captured the entire genomic content, yielding a genome size approximating the tetraploid genome size of T. esculentum (marama). This is attributed to HiCanu’s default settings, which separate haplotypes at a low divergence threshold of 0.01%, preserving the integrity of the genome (https://canu.readthedocs.io/en/latest/faq.html). In contrast, the application of Hifiasm, coupled with the third-party purging tool Purge_Dups¹²⁰, resulted in a genome assembly size that is closer to the expected two chromosome sets of marama. Despite this, the assembly still contains duplicated content. Further purging of duplications could refine the assembly by eliminating redundancies, but this may risk the collapse of critical repeats or segmental duplications essential for maintaining genomic stability.

To enhance assembly quality, future efforts could incorporate data from alternative long-read sequencing platforms, such as Oxford Nanopore Technologies (ONT), which generates reads of significantly greater length (up to several hundred kilobases)¹⁰⁵. These longer reads would improve scaffolding continuity, enabling the generation of a chromosome-scale assembly. Additionally, increasing sequencing coverage would further enhance the assembly’s completeness and reduce fragmentation. The integration of these complementary technologies has the potential to address current limitations, such as phase ambiguities and incomplete scaffolding, ultimately producing a more refined and accurate reference genome. Further improvement in continuity and completeness is essential, particularly for the investigation of large structural variations¹²¹, which could play a crucial role in advancing marama breeding by uncovering their functional impact.

The improved assembly and annotation of T. esculentum establish a robust foundation for future research. This genomic resource enables deeper exploration into the genetic mechanisms underlying key traits, such as self-incompatibility and adaptation to harsh environments. It also supports the development of molecular markers for breeding programs aimed at enhancing marama as a crop. Continued advancements in genome assembly and annotation will be crucial for fully unlocking the genetic potential of marama and facilitating its broader application in food security and agricultural research.

Data availability

All sequencing data for marama are publicly accessible under BioProject PRJNA779273 at NCBI. This includes the PacBio HiFi data for Sample 4 (SRR23882924) (https://identifiers.org/ncbi/insdc.sra:SRR23882924) used for genome assembly in this study, as well as Illumina WGS data from 31 individuals for population analysis. The final genome assembly is deposited under BioProject PRJNA1197564.

Change history

21 August 2025
A Correction to this paper has been published: https://doi.org/10.1038/s41598-025-10913-2

References

Kang, Y. J. et al. Genome sequence of mungbean and insights into evolution within Vigna species. Nat. Commun. 5 (1), 1–9. https://doi.org/10.1038/ncomms6443 (2014).
Dakora, F. D. Biogeographic distribution, nodulation and nutritional attributes of underutilized indigenous African legumes. Acta Hortic. 979, 53–64. https://doi.org/10.17660/actahortic.2013.979.3 (2013).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9 (4), 357–359. https://doi.org/10.1038/nmeth.1923 (2012).
Evans, J. R. Improving photosynthesis. Plant Physiol. 162 (4), 1780–1793. https://doi.org/10.1104/pp.113.219006 (2013).
Reed, J. et al. Elucidation of the pathway for biosynthesis of saponin adjuvants from the soapbark tree. Science 379 (6638), 1252–1264. https://doi.org/10.1126/science.adf3727 (2023).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10 (2), 1–4. https://doi.org/10.1093/gigascience/giab008 (2021).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9 (3), 90–95. https://doi.org/10.1109/mcse.2007.55 (2007).
Del Carmen Martínez-Ballesta, M., Moreno, D., & Carvajal, M. The physiological importance of glucosinolates on plant response to abiotic stress in brassica. Int. J. Mol. Sci. 14 (6), 11607–11625. https://doi.org/10.3390/ijms140611607 (2013).
Goodstein, D. M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40 (D1), D1178–D1186. https://doi.org/10.1093/nar/gkr944 (2011).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36 (9), 2896–2898. https://doi.org/10.1093/bioinformatics/btaa025 (2020).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4 (1). https://doi.org/10.1186/s13742-015-0047-8 (2015).
Chen, H. & Boutros, P. C. VennDiagram: a package for the generation of highly customizable Venn and Euler diagrams in R. BMC Bioinform. 12 (1), 1–10. https://doi.org/10.1186/1471-2105-12-35 (2011).
Chen, C. et al. TBTools: an integrative toolkit developed for interactive analyses of big biological data. Mol. Plant. 13 (8), 1194–1202. https://doi.org/10.1016/j.molp.2020.06.009 (2020).
Wang, J., Yu, Y., Li, Y. & Chen, L. Hexose transporter SWEET5 confers galactose sensitivity to Arabidopsis pollen germination via a galactokinase. Plant Physiol. 189 (1), 388–401. https://doi.org/10.1093/plphys/kiac068 (2022).
Letunic, I. & Bork, P. Interactive tree of life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 52 (W1), W78–W82. https://doi.org/10.1093/nar/gkae268 (2024).
Liang, Z., Huang, P., Yang, J. & Rao, G. Population divergence in the amphicarpic species Amphicarpaea Edgeworthii Benth. (Fabaceae): microsatellite markers and leaf morphology. Biol. J. Linn. Soc. 96 (3), 505–516. https://doi.org/10.1111/j.1095-8312.2008.01154.x (2009).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326 (5950), 289–293. https://doi.org/10.1126/science.1181369 (2009).
Liu, Y. et al. Insights into amphicarpy from the compact genome of the legume Amphicarpaea Edgeworthii. Plant. Biotechnol. J. 19 (5), 952–965. https://doi.org/10.1111/pbi.13520 (2020).
Zhou, K. The regulation of the cell wall by glycosylphosphatidylinositol-anchored proteins in Arabidopsis. Front. Cell. Dev. Biology. 10. https://doi.org/10.3389/fcell.2022.904714 (2022).
Jackson, J. C. et al. The morama bean (Tylosema esculentum): a potential crop for southern Africa. Adv. Food Nutr. Res. 61, 187–246. https://doi.org/10.1016/b978-0-12-374468-5.00005-2 (2010).
Bertioli, D. J. et al. The genome sequence of segmental allotetraploid peanut (Arachis hypogaea). Nat. Genet. 51 (5), 877–884. https://doi.org/10.1038/s41588-019-0405-z (2019).
Mendes, F. K., Vanderpool, D., Fulton, B. & Hahn, M. W. CAFE 5 models variation in evolutionary rates among gene families. Bioinformatics 36 (22-23), 5516–5518. https://doi.org/10.1093/bioinformatics/btaa1022 (2020).
Li, J. Draft genome assembly, organelle genome sequencing and diversity analysis of Marama bean (Tylosema esculentum), the green gold of Africa. Doctoral dissertation, Case Western Reserve University, OhioLINK Electronic Theses and Dissertations Center. (2023).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites and allelic variants from high-fidelity long reads. Genome Res. 30 (9), 1291–1305. https://doi.org/10.1101/gr.263566.120 (2020).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11 (1), 1432. https://doi.org/10.1038/s41467-020-14998-3 (2020).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117 (17), 9451–9457. https://doi.org/10.1073/pnas.1921046117 (2020).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152. https://doi.org/10.1093/bioinformatics/bts565 (2012).
Gabriel, L. et al. BRAKER3: fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res. 34 (5), 769–777. https://doi.org/10.1101/gr.278090.123 (2024).
Nagashima, Y., von Schaewen, A. & Koiwa, H. Function of N-glycosylation in plants. Plant Sci. 274, 70–79. https://doi.org/10.1016/j.plantsci.2018.05.007 (2018).
Stai, J. S. et al. Cercis: a non-polyploid genomic relic within the generally polyploid legume family. Front. Plant Sci. 10. https://doi.org/10.3389/fpls.2019.00345 (2019).
Phytozome Cercis canadensis v3.1 [Genome assembly]. DOE-JGI. (2023). http://phytozome.jgi.doe.gov/info/Ccanadensis_V3_1
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28 (11), 1947–1951. https://doi.org/10.1002/pro.3715 (2019).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinf. 25 (1). https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Ragaey, M. M. et al. Role of signaling molecules sodium nitroprusside and arginine in alleviating salt-induced oxidative stress in wheat. Plants 11 (14), 1786. https://doi.org/10.3390/plants11141786 (2022).
Hubley, R. et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 44 (D1), D44–D81. https://doi.org/10.1093/nar/gkv1272 (2016).
Jiao, Y. et al. A genome triplication associated with early diversification of the core eudicots. Genome Biol. 13 (1), R3. https://doi.org/10.1186/gb-2012-13-1-r3 (2012).
Wunderlin, R. P. Reorganization of the cercideae (Fabaceae: Caesalpinioideae). Phytoneuron 48, 1–5 (2010).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38 (12), 5825–5829. https://doi.org/10.1093/molbev/msab293 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34 (18), 3094–3100. https://doi.org/10.1093/bioinformatics/bty191 (2018).
Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data. 7 (1), 399. https://doi.org/10.1038/s41597-020-00743-4 (2020).
Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37 (5), 1530–1534. https://doi.org/10.1093/molbev/msaa015 (2020).
Veličković, D. et al. Spatial mapping of plant N-Glycosylation cellular heterogeneity inside soybean root nodules provided insights into Legume-Rhizobia symbiosis. Front. Plant Sci. 13. https://doi.org/10.3389/fpls.2022.869281 (2022).
Kanehisa, M. & Goto, S. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 (1), 27–30. https://doi.org/10.1093/nar/28.1.27 (2000).
Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17 (4), 540–552. https://doi.org/10.1093/oxfordjournals.molbev.a026334 (2000).
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe, M. Kegg for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51 (D1), D587-D592. https://doi.org/10.1093/nar/gkac963 (2022).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47. D309-D314. https://doi.org/10.1093/nar/gky1085 (2018).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780. https://doi.org/10.1093/molbev/mst010 (2013).
Zhang, Z., Mao, C., Shi, Z. & Kou, X. The amino acid metabolic and carbohydrate metabolic pathway play important roles during salt-stress response in Tomato. Front. Plant Sci. 8. https://doi.org/10.3389/fpls.2017.01231 (2017).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37 (8), 907–915. https://doi.org/10.1038/s41587-019-0201-4 (2019).
Manova, V. & Gruszka, D. DNA damage and repair in plants – from models to crops. Front. Plant Sci. 6. https://doi.org/10.3389/fpls.2015.00885 (2015).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27 (6), 764–770. https://doi.org/10.1093/bioinformatics/btr011 (2011).
Sudalaimuthuasari, N. et al. The genome of the mimosoid legume Prosopis cineraria, a desert tree. Int. J. Mol. Sci. 23 (15), 8503. https://doi.org/10.3390/ijms23158503 (2022).
Punzo, P., Grillo, S. & Batelli, G. Alternative splicing in plant abiotic stress responses. Biochem. Soc. Trans. 48 (5), 2117–2126. https://doi.org/10.1042/bst20200281 (2020).
Mello, B. Estimating TimeTrees with MEGA and the TimeTree resource. Mol. Biol. Evol. 35 (9), 2334–2342. https://doi.org/10.1093/molbev/msy133 (2018).
Lee, H. et al. Legume genome structures and histories inferred from Cercis Canadensis and Chamaecrista Fasciculata genomes. bioRxiv https://doi.org/10.1101/2024.09.03.611065 (2024).
Serrano, M., Coluccia, F., Torres, M., L’Haridon, F. & Metraux, J. P. The cuticle and plant defense to pathogens. Front. Plant Sci. 5. https://doi.org/10.3389/fpls.2014.00274 (2014).
Von Bubnoff, A. Next-generation sequencing: The race is on. Cell 132 (5), 721–723. https://doi.org/10.1016/j.cell.2008.02.028 (2008).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34 (Web Server), W435–W439. https://doi.org/10.1093/nar/gkl200 (2006).
Li, J. & Cullis, C. A. The multipartite mitochondrial genome of marama (Tylosema esculentum). Front. Plant. Sci. 12, 787443. https://doi.org/10.3389/fpls.2021.787443 (2021).
Li, J. & Cullis, C. A. Comparative analysis of 84 chloroplast genomes of Tylosema esculentum reveals two distinct cytotypes. Front. Plant. Sci. 13, 1025408. https://doi.org/10.3389/fpls.2022.1025408 (2023).
Li, J. & Cullis, C. A. Comparative analysis of Tylosema esculentum mitochondrial DNA revealed two distinct genome structures. Biology 12 (9), 1244. https://doi.org/10.3390/biology12091244 (2023).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25 (4), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324 (2009).
Pichersky, E. & Raguso, R. A. Why do plants produce so many terpenoid compounds? New Phytol. 220 (3), 692–702. https://doi.org/10.1111/nph.14178 (2016).
Keith, M. & Renew, A. Notes on some edible wild plants found in the Kalahari. Koedoe 18 (1), 1–12. https://doi.org/10.4102/koedoe.v18i1.911 (1975).
Liang, Q. et al. A view of the pan-genome of domesticated Cowpea (Vigna unguiculata [L.] Walp). Plant. Genome. 17 (1), e20319. https://doi.org/10.1002/tpg2.20319 (2023).
Omotayo, A. O. & Aremu, A. O. Marama bean [Tylosema esculentum (Burch) A. Schreib.]: an indigenous plant with potential for food, nutrition, and economic sustainability. Food Funct. 12 (6), 2389–2403. https://doi.org/10.1039/d0fo01937b (2021).
Winter, G., Todd, C. D., Trovato, M., Forlani, G. & Funck, D. Physiological implications of arginine metabolism in plants. Front. Plant Sci. 6. https://doi.org/10.3389/fpls.2015.00534 (2015).
Ortiz, E. M. vcf2phylip v2.0: Convert a VCF matrix into several matrix formats for phylogenetic analysis [Software]. https://doi.org/10.5281/zenodo.2540861 (2019).
Cullis, C. et al. Development of marama bean, an orphan legume, as a crop. Food Energy Secur. 8, e00164. https://doi.org/10.1002/fes3.164 (2019).
Khedr, A. H. A. Proline induces the expression of salt-stress-responsive proteins and may improve the adaptation of Pancratium maritimum L. to salt-stress. J. Exp. Bot. 54 (392), 2553–2562. https://doi.org/10.1093/jxb/erg277 (2003).
Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature 544 (7651), 427–433. https://doi.org/10.1038/nature22043 (2017).
Kong, W., Wang, Y., Zhang, S., Yu, J. & Zhang, X. Recent advances in assembly of complex plant genomes. Genomics Proteom. Bioinf. 21 (3), 427–439. https://doi.org/10.1016/j.gpb.2023.04.004 (2023).
Hasan, N., Choudhary, S., Naaz, N., Sharma, N. & Laskar, R. A. Recent advancements in molecular marker-assisted selection and applications in plant breeding programmes. J. Genet. Eng. Biotechnol. 19 (1), 128. https://doi.org/10.1186/s43141-021-00231-1 (2021).
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34 (13), i142–i150. https://doi.org/10.1093/bioinformatics/bty266 (2018).
Emms, D. M. & Kelly, S. Benchmarking orthogroup inference accuracy: revisiting Orthobench. Genome Biol. Evol. 12 (12), 2258–2266. https://doi.org/10.1093/gbe/evaa211 (2020).
Moghaddam, S. M. et al. The tepary bean genome provides insight into evolution and domestication under heat stress. Nat. Commun. 12 (1), 1–12. https://doi.org/10.1038/s41467-021-22858-x (2021).
He, M. & Ding, N. Z. Plant unsaturated fatty acids: multiple roles in stress response. Front. Plant Sci. 11. https://doi.org/10.3389/fpls.2020.562785 (2020).
Depuydt, S. & Hardtke, C. S. Hormone signaling crosstalk in Plant Growth Regulation. Curr. Biol. 21 (9). https://doi.org/10.1016/j.cub.2011.03.013 (2011).
Cullis, C. A., Chimwamurombe, P., Barker, N. P., Kunert, K. J. & Vorster, J. Orphan legumes growing in dry environments: marama bean as a case study. Front. Plant. Sci. 9, 1199. https://doi.org/10.3389/fpls.2018.01199 (2018).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19 (9), 1655–1664. https://doi.org/10.1101/gr.094052.109 (2009).
Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature 463 (7278), 178–183. https://doi.org/10.1038/nature08670 (2010).
Li, H., Jiang, F., Wu, P., Wang, K. & Cao, Y. A high-quality genome sequence of model legume Lotus japonicus (MG-20) provides insights into the evolution of root nodule symbiosis. Genes 11 (5), 483. https://doi.org/10.3390/genes11050483 (2020).
Kim, Y. & Cullis, C. A. A novel inversion in the chloroplast genome of marama (Tylosema esculentum). J. Exp. Bot. 68 (8), 2065–2072. https://doi.org/10.1093/jxb/erw500 (2017).
Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26 (3), 342–350. https://doi.org/10.1101/gr.193474.115 (2016).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Laloum, T., Martín, G. & Duque, P. Alternative splicing control of abiotic stress responses. Trends Plant Sci. 23 (2), 140–150. https://doi.org/10.1016/j.tplants.2017.09.019 (2018).
Kumar, S. et al. TimeTree 5: an expanded resource for species divergence times. Mol. Biol. Evol. 39 (8), 1–12. https://doi.org/10.1093/molbev/msac174 (2022).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods. 14 (6), 587–589. https://doi.org/10.1038/nmeth.4285 (2017).
Tahjib-Ul-Arif, M. et al. Citric acid-mediated abiotic stress tolerance in plants. Int. J. Mol. Sci. 22 (13), 7235. https://doi.org/10.3390/ijms22137235 (2021).
Showalter, A. M. Structure and function of plant cell wall proteins. Plant. Cell. 5 (1), 9. https://doi.org/10.2307/3869424 (1993).
Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, Bioinformatics and Applications. Nat. Biotechnol. 39 (11), 1348–1365. https://doi.org/10.1038/s41587-021-01108-x (2021).
Singh, B. & Sharma, R. A. Plant terpenes: defense responses, phylogenetic analysis, regulation and clinical applications. 3 Biotech. 5 (2), 129–151. https://doi.org/10.1007/s13205-014-0220-2 (2014).
Bower, N. W., Hertel, K., Oh, J. & Storey, R. Nutritional evaluation of marama bean (Tylosema esculentum, Fabaceae): analysis of the seed. Econ. Bot. 42, 533–540 (1988).
Agarwal, P. K. et al. Constitutive overexpression of a stress-inducible small GTP-binding protein pgrab7 from Pennisetum glaucum enhances abiotic stress tolerance in transgenic tobacco. Plant Cell Rep. 27 (1), 105–115. https://doi.org/10.1007/s00299-007-0446-0 (2007).
Young, N. D. et al. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480 (7378), 520–524. https://doi.org/10.1038/nature10625 (2011).
Wickham, H. ggplot2: Elegant graphics for data analysis (2nd ed.) [PDF]. Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4 (2016).
Takundwa, M., Chimwamurombe, P. M. & Cullis, C. A. A chromosome count in marama bean (Tylosema esculentum) by Feulgen staining using garden pea (Pisum sativum l) as a standard. Res. J. Biol. 2, 177–181 (2012).
Tamura, K., Stecher, G. & Kumar, S. MEGA11: molecular evolutionary genetics analysis version 11. Mol. Biol. Evol. 38 (7), 3022–3027. https://doi.org/10.1093/molbev/msab120 (2021).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31 (19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351 (2015).
Tuteja, N. & Sopory, S. K. Plant signaling in stress. Plant Signal. Behav. 3 (2), 79–86. https://doi.org/10.4161/psb.3.2.5303 (2008).
Varshney, R. K. et al. Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement. Nat. Biotechnol. 31 (3), 240–246. https://doi.org/10.1038/nbt.2491 (2013).
Basu, S., Ramegowda, V., Kumar, A. & Pereira, A. Plant adaptation to drought stress. F1000Research 5, 1554. https://doi.org/10.12688/f1000research.7678.1 (2016).
Zhong, Y. et al. Chromosomal-level genome assembly of the orchid tree Bauhinia variegata (Leguminosae; Cercidoideae) supports the allotetraploid origin hypothesis of Bauhinia. DNA Res. 29 (2). https://doi.org/10.1093/dnares/dsac012 (2022).
Yuan, Y., Bayer, P. E., Batley, J. & Edwards, D. Current status of structural variation studies in plants. Plant Biotechnol. J. 19 (11), 2153–2163. https://doi.org/10.1111/pbi.13646 (2021).
Amarteifio, J. O. & Moholo, D. The chemical composition of four legumes consumed in Botswana. J. Food Compos. Anal. 11, 329–332 (1998).
Belitz, H.-D., Grosch, W. & Schieberle, P. Food chemistry (Springer Science & Business Media, 2004).
Iriti, M. & Faoro, F. Chemical diversity and defence metabolism: how plants cope with pathogens and ozone pollution. Int. J. Mol. Sci. 10 (8), 3371–3399. https://doi.org/10.3390/ijms10083371 (2009).
Enciso-Rodriguez, F. et al. Overcoming self-incompatibility in diploid potato using CRISPR-Cas9. Front. Plant. Sci. 10, 376. https://doi.org/10.3389/fpls.2019.00376 (2019).
Batoko, H., Jurkiewicz, P. & Veljanovski, V. Translocator proteins, porphyrins and abiotic stress: new light? Trends Plant Sci. 20 (5), 261–263. https://doi.org/10.1016/j.tplants.2015.03.009 (2015).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24 (8), 1586–1591. https://doi.org/10.1093/molbev/msm088 (2007).
Cullis, C. A., Chimwamurombe, P. M., Kunert, K. J. & Vorster, J. Perspective on the present state and future usefulness of marama bean (Tylosema esculentum). Food Energy Secur. 12 (2). https://doi.org/10.1002/fes3.422 (2022).
Shi, H. et al. Manipulation of arginase expression modulates abiotic stress tolerance in Arabidopsis: effect on arginine metabolism and ROS accumulation. J. Exp. Bot. 64 (5), 1367–1379. https://doi.org/10.1093/jxb/ers400 (2013).
Dainat, J. AGAT: Another GFF analysis toolkit to handle annotations in any GTF/GFF format. Zenodo. https://doi.org/10.5281/zenodo.3552717 (2020).
Beihammer, G., Maresch, D., Altmann, F. & Strasser, R. Glycosylphosphatidylinositol-anchor synthesis in plants: a glycobiology perspective. Front. Plant Sci. 11. https://doi.org/10.3389/fpls.2020.611188 (2020).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12 (1), 59–60. https://doi.org/10.1038/nmeth.3176 (2014).
Belton, J. et al. Hi–C: a comprehensive technique to capture the conformation of genomes. Methods 58 (3), 268–276. https://doi.org/10.1016/j.ymeth.2012.05.001 (2012).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10 (1). https://doi.org/10.1186/1471-2105-10-421 (2009).
Chen, H. & Zwaenepoel, A. Inference of ancient polyploidy from genomic data. Methods Mol. Biol. 3–18. https://doi.org/10.1007/978-1-0716-2561-3_1 (2023).
Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494–6506. https://doi.org/10.1093/nar/gki937 (2005).
Hufnagel, B. et al. High-quality genome sequence of white lupin provides insight into soil exploration and seed quality. Nat. Commun. 11 (1), 1–10. https://doi.org/10.1038/s41467-019-14197-9 (2020).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. (2021). https://www.R-project.org/

Acknowledgements

The authors thank K. Logue for help with the initial genome assembly; P. Chimwamurombe, M. Takundwa, J. Vorster, and K. Kunert for providing marama samples from Namibia and from the University of Pretoria Farm; the students in BIOL 301/401 in 2015 for their assistance with DNA extraction; and the students in BIOL 309 in 2018 for their contributions to sample collection.

Funding

This work was supported by teaching resources from the Department of Biology, Case Western Reserve University. The collections were supported by a grant from the Kirkhouse Trust to P. Chimwamurombe.

Author information

Authors and Affiliations

Department of Biology, Case Western Reserve University, Cleveland, OH, USA
Jin Li & Christopher Cullis

Authors

Jin Li
Christopher Cullis

Contributions

J.L. conducted DNA extraction, performed genome assembly and data analysis, and wrote the manuscript. C.C. conceived the original project idea, extracted DNA, supervised the research, and assisted in writing and editing the manuscript. All authors contributed to the article and approved the submitted version.

Corresponding author

Correspondence to Christopher Cullis.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: The original version of this Article contained errors in the ordering of References. The references have been reordered and cited accordingly.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, J., Cullis, C. Genome assembly and population analysis of tetraploid marama bean reveal two distinct genome types. Sci Rep 15, 2665 (2025). https://doi.org/10.1038/s41598-025-86023-w

Received: 15 October 2024
Accepted: 07 January 2025
Published: 21 January 2025
Version of record: 21 January 2025
DOI: https://doi.org/10.1038/s41598-025-86023-w
