Background & Summary

Cassava (Manihot esculenta Crantz) is a crop of great importance in global food security, economic development, and industrial applications. This starchy root vegetable serves as a staple food for over 800 million people worldwide, particularly in tropical and subtropical regions1. Production reached 315 million tonnes in 2021, marking a 9% increase from 2017 worldwide2. Nigeria, the leading producer, accounted for approximately 63 million tonnes, representing 31% of African production and 20% of global production. Notably, the top three producers - Nigeria, the Democratic Republic of Congo, and Thailand - contribute a combined 44% share of global cassava production2. The resilience of cassava to challenging growing conditions, including drought and marginal soil, makes it an ideal crop for regions prone to climate variability3. This adaptability ensures a stable food supply, contributing significantly to food security in developing countries4. Additionally, its industrial applications are diverse, ranging from starch production to biofuels, sweeteners, glues and animal feed, with the global cassava starch market expected to reach USD 99.91 billion by 2032. Cassava’s role in local economies is equally significant, supporting rural development and poverty alleviation by providing income opportunities for smallholder farmers, processors, and traders.

Originating from the southern Amazon basin, where it was likely domesticated thousands of years ago, cassava’s cultivation spread throughout pre-Columbian South America5,6. Following European contact, Portuguese traders introduced cassava to Africa, likely starting in the Congo Basin region around the 16th century. Its remarkable adaptability to diverse climates and soils fueled its widespread adoption across the African continent, where it became a fundamental staple crop. Introduction into Asia occurred later, possibly around the 18th century via trade routes, establishing cassava as a vital agricultural commodity in countries like Thailand, Indonesia, and Vietnam. This historical global spread has shaped distinct regional germplasm pools and adaptation strategies.

To fully leverage cassava’s potential and accelerate breeding efforts, comprehensive genomic resources are indispensable. Initial efforts resulted in a draft genome sequence from the Latin American-derived cultivar AM560-2 (originating from Colombia/CIAT breeding programs)7, providing a foundational resource for the research community. Subsequent advancements, incorporating long-read sequencing technologies (like PacBio and ONT) and scaffolding methods such as chromatin conformation capture (Hi-C), have significantly enhanced the quality and contiguity of reference genomes8. This includes improved chromosome-level assemblies for AM560-2 (e.g., versions v6, v7, v8)9 and the African landrace TMEB117 (widely used in IITA breeding programs)10.

Manihot esculenta is a diploid species (2n = 36) with an estimated haploid genome size of approximately 750 Mbp11. A key complicating factor is the genome’s high degree of heterozygosity, often reported to be in the range of 1.0-1.5%12. This high level of sequence divergence between homologous chromosomes, a consequence of cassava’s typically outcrossing reproductive system and widespread clonal propagation, poses significant hurdles for genome assembly algorithms. Standard approaches often struggle to differentiate allelic sequences, leading to the artificial merging (collapse) of haplotypes into a single chimeric sequence or the fragmentation of the assembly where divergent alleles are represented as separate contigs. Accurately resolving these haplotypes is critical for understanding allele-specific expression, identifying causal variants for traits, and developing precise breeding strategies.

While reference genomes exist for American and African cassava germplasm, high-quality genomic resources for Asian ecotypes have been lacking. A draft assembly for the Thai cultivar ‘Kasetsart 50’ (KU50)13 was an important first step, but a broader, high-quality representation of the diversity within Thailand was needed to empower regional breeding and research efforts. This study addresses this gap by generating and validating ten chromosome-level genome assemblies from a panel of diverse Thai Manihot esculenta cultivars and a wild M. glaziovii relative. Here, we describe the methods used for plant selection, sequencing, genome assembly, and annotation. We then present a detailed technical validation of the assemblies and gene models, confirming their quality and completeness. The resulting datasets provide a foundational genomic resource for the cassava research community, enabling future studies into the genetic diversity and improvement of this crop.

Methods

Plant material

Ten distinct Manihot genotypes were selected for genome sequencing to capture genetic diversity related to various traits within cultivated cassava (Manihot esculenta) and its wild relative M. glaziovii. The panel included the established low-yield, sweet cultivar ‘Hanatee’. Four commercially important Thai cultivars were also sequenced: ‘Kasetsart50’, a widely recommended variety; ‘Rayong9’, noted for high ethanol yield potential; ‘Rayong72’, characterized by high yield, high dry matter, and adaptation to Northeast Thailand; and ‘Rayong90’, known for high root dry matter content.

Five additional accessions completed the panel. Two were collected during a field visit in Northern Thailand (Nan province): ‘HighlandRough’ and ‘HighlandSmooth’, putative ‘Hanatee’ variants selected specifically for their contrasting rough and smooth bark phenotypes. Two landraces distinguished by their storage root flesh color, ‘WhiteRoot’ and ‘YellowRoot’, were collected in the tropical rainforests of Narathiwat province in the south of Thailand. Finally, an accession of the wild relative Manihot glaziovii (designated ‘M. glaziovii WildType’), historically significant in breeding programs (e.g., for rubber traits), was provided by the Rayong Field Crop Research Center. This diverse germplasm set provides a foundation for comparative genomics studies within the genus Manihot, and especially among the cassava ecotypes of Thailand.

Sample preparation and sequencing

Young leaves from mature cassava plants were collected and flash-frozen. High molecular weight genomic DNA (HMW gDNA) for Illumina and Oxford Nanopore Technologies (ONT) sequencing was extracted from the leaf tissue using a protocol provided by ONT (https://nanoporetech.com/document/extraction-method/fever-tree-gdna). The concentration and quality of the extracted DNA were assessed using a NanoDrop spectrophotometer and a Qubit fluorometer. Short DNA fragments were removed from the samples using the Circulomics Short Read Eliminator XL (SRE XL) kit.

ONT reads

The HMW gDNA was used for ONT library preparation with the SQK-LSK109 kit and sequenced either on a MinION using the FLO-MIN106 flow cell (21 libraries) or on a PromethION using the FLO-PR002 flow cell (19 libraries). Reads were basecalled using Dorado (v0.5.2) with the model r941_prom_sup_g507, generating 793.4 Gbp in total14,15,16,17,18,19,20,21,22,23.

Illumina short reads

Illumina short-read libraries were constructed from the HMW gDNA and sequenced on an Illumina NextSeq 2000 to generate 150 bp paired-end reads. The short-read sequencing generated approximately 138 Gbp of raw data, consisting of 460.1 million paired-end (2 × 150 bp) reads14,15,16,17,18,19,20,21,22,23.
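The relationship between the read count and the reported yield can be checked with simple arithmetic. The sketch below assumes that the 460.1 million figure counts read pairs (fragments) rather than individual reads; the function name is illustrative only.

```python
def paired_end_yield_gbp(read_pairs: float, read_length: int, reads_per_pair: int = 2) -> float:
    """Raw sequencing yield in Gbp for a paired-end run.

    Assumes every read has the full nominal length (no trimming).
    """
    return read_pairs * reads_per_pair * read_length / 1e9


# 460.1 million pairs of 2 x 150 bp reads
print(round(paired_end_yield_gbp(460.1e6, 150), 2))  # 138.03
```

Under that assumption, the result agrees with the approximately 138 Gbp stated above.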

RNA-seq reads

RNA used for gene prediction was obtained from a time-course experiment on cassava (Manihot esculenta and Manihot glaziovii) tubers. Total RNA was extracted from tubers at multiple tuber developmental stages and time points. RNA sequencing was performed by an external service provider using Illumina technology. The Manihot esculenta cultivar Hanatee was sequenced using 75 bp single-end reads, whereas all other samples (Manihot esculenta Kasetsart50, Rayong9, Rayong72, and Manihot glaziovii WildType) were sequenced using 150 bp paired-end reads (2 × 150 bp). In total approximately 1898 Gbp of raw data was generated24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111.

Genome size and heterozygosity estimation

The genome characteristics of the ten Manihot genotypes, including genome size and heterozygosity, were estimated from the Illumina short-read data using a k-mer-based approach. A 21-mer frequency distribution was generated with Jellyfish (v2.3.1)112, and key genome features were inferred using GenomeScope2 (v2.0)113. The haploid genome size of the nine Manihot esculenta genotypes was estimated at between 556 Mbp and 676 Mbp, with heterozygosity rates between 1.30% and 1.79%, while the genome size of Manihot glaziovii was estimated at 659 Mbp, with a heterozygosity rate of 4.62%.
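For readers unfamiliar with k-mer spectra, the following toy sketch illustrates what a k-mer frequency distribution is. It is not how Jellyfish is implemented (Jellyfish uses a far more memory-efficient hash), and the function names are hypothetical. It counts canonical k-mers (a k-mer and its reverse complement are treated as one) and then tallies how many distinct k-mers occur at each multiplicity, which is the histogram that a tool such as GenomeScope2 takes as input.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k: int = 21) -> Counter:
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Tiny example with k = 3 so the result can be checked by hand
print(dict(kmer_histogram(["ACGTA", "ACGTA"], k=3)))
```

In a heterozygous diploid genome, such a histogram typically shows two peaks: one at roughly half the sequencing depth (k-mers present on only one haplotype) and one at full depth (k-mers shared by both haplotypes).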

De novo genome assembly, RagTag scaffolding and quality assessment

The assembly of cassava genomes was performed using a combination of long-read sequencing data and multiple assembly refinement steps. Oxford Nanopore Technologies (ONT) reads were assembled using Flye (v.2.9.3)114 with parameters --read-error 0.03, -m 10000, and NextDenovo (v.2.5.0)115 to generate two independent draft assemblies. The completeness and quality of these assemblies were then assessed using Merqury (v.1.3)116. To improve the base accuracy, the assemblies were then polished using Medaka (https://github.com/nanoporetech/medaka, v.1.12.0) with ONT read data. After that, Purge_Dups (v.1.2.6)117 was applied to both assemblies to reduce redundancy caused by haplotigs. The two purged assemblies for each genome were then merged using QuickMerge (v.0.3)118 to generate a consensus genome. To further refine the assembly structure, RagTag (v.2.1.0)119 was employed, with the correct submodule first applied using the published South American reference genome of Manihot esculenta AM560-2 (GCA_001659605.2), followed by the scaffold submodule to enhance contiguity.

Finally, the NextPolish programme (v.1.4.1)120 was used for two rounds of polishing with ONT read data to fill gaps and improve sequence accuracy. Following this, a contaminant screening step was performed. All unplaced contigs were subjected to a blastn (v2.16)121 search against the ‘core_nt’ database using the parameters: -max_target_seqs 1 -evalue 1e-10 -culling_limit 5 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames". Any contigs whose best hit was not to a plant species were removed. The resulting assemblies provided high-quality genome sequences for all ten Manihot accessions (Table 1)122,123,124,125,126,127,128,129,130,131.
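The filtering rule in the contaminant screen can be sketched as follows. This is a minimal illustration, not the script used in this study. It assumes the 14-column tabular layout given above (bitscore in column 12, staxids in column 13) and a user-supplied set of NCBI taxon IDs regarded as plants. The function name and the `plant_taxids` argument are hypothetical, and the rule for deciding what counts as a plant is an assumption.

```python
def non_plant_contigs(blast_tsv: str, plant_taxids: set) -> set:
    """Return query contigs whose best hit (highest bitscore) is not to a plant taxon.

    blast_tsv    -- text of the tab-separated blastn output (14 columns per line)
    plant_taxids -- hypothetical, user-supplied set of taxon IDs (as strings)
    Contigs without any hit do not appear in the BLAST output and are not reported.
    """
    best = {}  # contig -> (bitscore, taxon IDs of that hit)
    for line in blast_tsv.splitlines():
        fields = line.split("\t")
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        contig = fields[0]
        bitscore = float(fields[11])
        taxids = set(fields[12].split(";"))  # staxids may hold several IDs
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, taxids)
    return {contig for contig, (_, taxids) in best.items() if not (taxids & plant_taxids)}
```

A call such as `non_plant_contigs(open("unplaced.blastn.tsv").read(), plant_taxids)` (with a hypothetical file name) would return the contig names to exclude from the assembly.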

Table 1 Summary of the ten Manihot genome assemblies.

Genome annotation

Repeat elements within each of the ten cassava genome assemblies were identified and masked using the Extensive de novo TE Annotator (EDTA, v2.2.1)132. The identified repetitive sequences constitute a significant portion of each genome, ranging from 300.5 to 445.6 Mbp per assembly, which represents over 50% of the total genome size (Table 2). In addition to transposable element annotation by EDTA, SSRs were identified using MISA (v2.1)133. The continuity of the repeat regions was estimated by calculating the adjusted LAI134. Furthermore, to identify potential telomeric repeat sequences and assess chromosome completeness, the ten Manihot genome assemblies were analyzed using the tidk (telomere identification toolkit, v0.2.41)135 with the motif ‘CCCTAAA’ and a window size of 10 kbp.
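The windowed telomere search can be illustrated with a short sketch. It is a simplified stand-in for what a tool like tidk reports, not its actual implementation: it counts non-overlapping occurrences of the motif and of its reverse complement in consecutive fixed-size windows. The function name is hypothetical, and motif copies that span a window boundary are missed in this simplified version.

```python
def telomere_window_counts(seq: str, motif: str = "CCCTAAA", window: int = 10_000):
    """Return (window_start, forward_count, reverse_complement_count) per window."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_motif = motif.translate(complement)[::-1]  # TTTAGGG for CCCTAAA
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        rows.append((start, chunk.count(motif), chunk.count(reverse_motif)))
    return rows


# Toy chromosome: forward repeats at the start, reverse-complement repeats at the end
toy = "CCCTAAA" * 3 + "A" * 9 + "TTTAGGG" * 2
print(telomere_window_counts(toy, window=30))  # [(0, 3, 0), (30, 0, 2)]
```

In a telomere-to-telomere chromosome, high counts would be expected only in the first and last windows, with the forward motif at one end and its reverse complement at the other.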

Table 2 Summary of repeat annotations for ten Manihot assemblies.

The strategy for predicting protein-coding genes varied depending on the availability of transcriptomic data for each genotype. For five genotypes (Manihot esculenta Hanatee, Kasetsart50, Rayong9, Rayong72, and Manihot glaziovii WildType), available RNA-Seq data were utilized. Briefly, reads from individual RNA-Seq libraries were aligned to their respective genome assemblies using HISAT2 (v2.2.1)136 with the ‘--dta’ parameter. The resulting alignments were combined and coordinate-sorted using Samtools (v1.20)137. Transcripts were then assembled using StringTie2 (v2.2.3)138. Separately, deep-learning-based gene predictions were generated with Helixer (v0.3.3)139 using the ‘Land plant’ lineage model via its web interface (https://www.plabipd.de/helixer_main.html). The transcript evidence from StringTie2 and the ab initio predictions from Helixer were then integrated using Mikado (v2.2.3)140 to produce a consolidated, non-redundant set of gene models for these five genotypes. For the remaining five genotypes, for which corresponding RNA-Seq data were not generated in this study, protein-coding genes were predicted solely using the deep-learning-based approach implemented in Helixer (v0.3.3), accessed via its web interface.

Functional annotations for the predicted proteomes derived from all ten genotypes were obtained using Mercator4 (v7)141 through the Plabipd web platform (https://www.plabipd.de/mercator_main.html). This process incorporated information from ProtScriber (v0.1.6, https://github.com/usadellab/prot-scriber) and Swiss-Prot142 to assign functional categories.

The completeness of the predicted protein-coding gene sets for each of the ten genotypes, in terms of expected gene content, was assessed using BUSCO (v5.8.3)143 against the eudicotyledons_odb12 lineage dataset. Furthermore, the quality and consistency of the annotations were evaluated using Mercator4 (v7), OMArk (v0.3.0, OMAmer v2.0.5)144,145, and PSAURON (v1.0.6)146. Results of the quality assessments are summarized in Table 3.

Table 3 Summary of gene annotations for ten Manihot assemblies.

Data Records

The raw sequencing data, including genomic DNA Illumina and ONT reads and RNA-Seq reads, have been deposited at the EMBL-EBI European Nucleotide Archive (ENA) under BioProject number PRJEB89494 (ERP172520)147. The genome assemblies of the ten genotypes have been submitted to ENA under the accessions GCA_965363265 (ref. 122), GCA_965363285 (ref. 124), GCA_965363275 (ref. 123), GCA_965363475 (ref. 125), GCA_965365905 (ref. 131), GCA_965364345 (ref. 128), GCA_965364185 (ref. 126), GCA_965364695 (ref. 130), GCA_965364235 (ref. 127) and GCA_965364665 (ref. 129). The assembled genomes, including annotations, are accessible via an interactive JBrowse 2 (ref. 148) instance at https://www.plabipd.de/ceplas/?config=cassavastore.json.

Technical Validation

Assembly and annotation quality assessment

We assessed the quality and completeness of the ten Manihot genome assemblies using DNA sequencing read mapping and Merqury k-mer based evaluation. Illumina paired-end reads were mapped using bwa-mem2 (v2.2.1)149, while ONT reads were aligned using minimap2 (v2.28)150. Across the ten assemblies, mapping rates ranged from 96.11% to 97.11% for Illumina reads and from 92.82% to 98.12% for ONT reads, indicating successful alignment of the majority of sequencing data to each assembly.

Assembly quality was further evaluated using CRAQ (v1.0.9)151 based on read mappings. The regional AQI (R-AQI) scores ranged from 80.89% to 93.33%, and the structural AQI (S-AQI) scores ranged from 56.38% to 71.64% across the assemblies. Assembly completeness was assessed with compleasm (v0.2.7)152 using the eudicotyledons_odb12 lineage database. The analysis identified between 95.69% and 99.21% of the expected BUSCO orthologous groups as complete within the assemblies (Table 1). A detailed summary of these genomic features is visualized for the ‘Hanatee’ assembly in Fig. 1.

Fig. 1

Genome characteristics of Manihot esculenta Hanatee. (A) Histogram displaying the distribution of proteins grouped by their percentage deviation from the median protein length. (B) Histogram showing the percentage of Mercator4 functional BINs occupied by the Hanatee proteins. (C) Histogram displaying the divergence of repeat elements by classes and their overall percentage of the genome contribution. (D) k-mer plot. (E) Circos diagram displaying the distribution of different repeat element classes over the individual chromosomes, compared directly to the found telomeric repeats and gene density.

To evaluate assembly continuity specifically in repetitive regions, we calculated the LTR Assembly Index (LAI). Across the ten assemblies, the adjusted LAI scores ranged from 18.78 to 30.61. Under the commonly used LAI classification (draft: LAI < 10; reference: 10 ≤ LAI < 20; gold: LAI ≥ 20), the assemblies therefore span the upper end of reference quality to gold quality in terms of intact LTR retrotransposon representation (Table 2).

Finally, Merqury (v1.3) analysis, using a Meryl (v1.3) database constructed from the Illumina reads of each genotype, estimated k-mer-based genome completeness ranging from 71.97% to 89.76% (Table 4). This k-mer completeness range is inherently influenced by the nature of these purged, single pseudo-haplotype assemblies. K-mers unique to the excluded alternative haplotype are intentionally absent from the reference pseudo-haplotype, which prevents a 100% representation of all k-mers derived from the diploid sequencing reads.
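The effect described here can be reproduced on a toy example. The sketch below is a deliberately simplified stand-in for a k-mer completeness measure: it ignores reverse complements, read errors and Merqury's filtering of unreliable k-mers, and the function names are hypothetical. It shows that an assembly holding only one of two haplotypes cannot reach 100% when measured against k-mers from both.

```python
def kmer_set(seqs, k: int) -> set:
    """All distinct k-mers (forward strand only) in a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def kmer_completeness(read_seqs, assembly_seqs, k: int = 21) -> float:
    """Fraction of distinct read k-mers that are also present in the assembly."""
    read_kmers = kmer_set(read_seqs, k)
    assembly_kmers = kmer_set(assembly_seqs, k)
    return len(read_kmers & assembly_kmers) / len(read_kmers)


# Two toy haplotypes differing by a single SNP (C -> A at the fourth base), k = 3
hap1 = "AACCGGTT"
hap2 = "AACAGGTT"
diploid_reads = [hap1, hap2]  # error-free "reads" covering both haplotypes

print(kmer_completeness(diploid_reads, [hap1], k=3))        # pseudo-haplotype: 6 of 9 k-mers
print(kmer_completeness(diploid_reads, [hap1, hap2], k=3))  # both haplotypes retained: 1.0
```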

Table 4 K-mer completeness analysis for pseudo-haplotype Manihot assemblies (k = 21).

The theoretical maximum k-mer completeness for an ideal pseudo-haplotype, when measured against the total unique k-mers from diploid reads, can be derived from established k-mer distribution models in heterozygous genomes153. This maximum is given by the formula:

$$\text{Maximum pseudo-haplotype completeness}=\frac{1}{2-{(1-r)}^{k}}$$

where r is the organism’s heterozygosity rate and k is the k-mer size used in the analysis. For instance, using a k-mer size of 21 and representative organismal heterozygosity rates in the range of 0.5%-2.0%, this theoretical maximum k-mer completeness would typically fall between 74% (for r = 2.0%) and 91% (for r = 0.5%).
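One way to read the formula, assuming that heterozygous sites are independent and uniformly distributed, is as follows. A fraction (1 − r)^k of a haplotype's k-mers contains no variant site and is shared by both haplotypes. The diploid read set therefore holds about 2 − (1 − r)^k distinct k-mers for every k-mer in a single haplotype. The short sketch below evaluates the formula; the function name is illustrative.

```python
def max_pseudohaplotype_completeness(r: float, k: int = 21) -> float:
    """Theoretical maximum k-mer completeness of a single pseudo-haplotype.

    r -- heterozygosity rate as a fraction (e.g. 0.02 for 2.0%)
    k -- k-mer size
    """
    shared_fraction = (1.0 - r) ** k  # k-mers free of any heterozygous site
    return 1.0 / (2.0 - shared_fraction)


for rate in (0.005, 0.02):
    print(f"r = {rate:.1%}: {max_pseudohaplotype_completeness(rate):.3f}")
# r = 0.5%: 0.909
# r = 2.0%: 0.743
```

These values reproduce the 91% and 74% bounds quoted above.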

It is important to consider that the initial assemblies, prior to purging to create the pseudo-haplotypes, may not have captured the entirety of k-mers present in the Illumina reads. This can be due to factors such as incomplete genome coverage by the assembly, sequencing or assembly errors, or challenges in accurately resolving and representing complex heterozygous regions in the diploid state. To account for this, we use the observed k-mer completeness of the assembly prior to purging (denoted as scaling factor s) as the effective starting fraction of captured diploid k-mers. The expected pseudo-haplotype completeness, scaled by this initial capture rate, is then calculated as:

$$\text{Expected pseudo-haplotype completeness}=s\times \text{maximum pseudo-haplotype completeness}$$
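As a sketch of how this scaled expectation might be compared with an observed value, the helper functions below compute the expected completeness and the difference between observation and expectation. The function names, and the idea of using the difference as a rough indicator of retained haplotigs, are illustrative assumptions rather than part of the published workflow.

```python
def expected_pseudohaplotype_completeness(s: float, r: float, k: int = 21) -> float:
    """Expected completeness: pre-purging completeness s times the theoretical maximum."""
    return s * (1.0 / (2.0 - (1.0 - r) ** k))


def completeness_excess(observed: float, s: float, r: float, k: int = 21) -> float:
    """Observed minus expected completeness (all values as fractions).

    A clearly positive value suggests that k-mers from the second haplotype
    remain in the assembly. A negative value suggests that more sequence was
    removed than a clean single haplotype would require.
    """
    return observed - expected_pseudohaplotype_completeness(s, r, k)


# Hypothetical inputs for illustration only (not taken from Table 4)
print(completeness_excess(observed=0.78, s=0.97, r=0.015))
```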

The observed k-mer completeness values for the pseudo-haplotype assemblies generally align well with the expected completeness calculated from the respective pre-purged assembly completeness and heterozygosity. This correspondence suggests that the haplotype purging process was largely effective across these genomes, yielding results consistent with theoretical expectations for single haplotype representation, albeit with minor variations likely reflecting small inefficiencies or specific choices made during the purging process itself.

Notably, the Manihot glaziovii sample exhibits a significant deviation from this trend. Its observed pseudo-haplotype k-mer completeness (89.76%) is substantially higher than both the theoretical maximum for a pseudo-haplotype given its heterozygosity (61.65%) and the scaled expectation based on its pre-purged assembly’s completeness (59.42%). This marked discrepancy strongly suggests that the haplotype purging process was incomplete for this particular, highly heterozygous (4.56%) genome. The retention of a significant portion of both haplotypes is further evidenced by the final ‘purged’ assembly size of 1299 Mbp, which is nearly twice the estimated haploid genome size of 659 Mbp. From this large assembly, only 727 Mbp could be assigned to the 18 chromosomes, indicating substantial unplaced or redundant sequence. Such an inflated assembly size relative to the haploid estimate, coupled with the exceptionally high k-mer completeness, indicates that the assembly for Manihot glaziovii more closely represents a partially diploid or largely unpurged state than a true pseudo-haplotype. The exceptionally high heterozygosity of M. glaziovii WildType likely complicated the accurate differentiation and removal of the second haplotype during the purging stage.

The practical implications for users of this specific assembly are significant. The partially diploid nature leads to several unavoidable artifacts: an inflated total genome size (1299 Mbp vs. an estimated 659 Mbp) and gene count (51,770); a high proportion of duplicated gene models, as evidenced by the 70.38% duplicated HOGs in the OMArk analysis (Table 3); and a large amount of sequence (~572 Mbp) that could not be confidently placed onto chromosomes. Researchers should be aware that this redundancy can create challenges for read mapping, variant calling, and comparative genomic analyses, and the data for this specific genome should be interpreted with these limitations in mind.

The corresponding estimated Phred-scaled quality values from the Merqury analysis (QV) ranged from 33.47 to 37.67 across the ten genomes. These QV scores translate directly to high base-level accuracy, indicating estimated consensus error rates between approximately 1 error in 2,220 bases (QV = 33.47) and 1 error in 5,850 bases (QV = 37.67).
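The conversion from a Phred-scaled QV to an error rate follows the standard definition QV = −10·log10(error rate). A minimal sketch, with illustrative function names:

```python
def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def bases_per_error(qv: float) -> float:
    """Average number of bases per consensus error."""
    return 1.0 / qv_to_error_rate(qv)


for qv in (33.47, 37.67):
    print(f"QV {qv}: about 1 error in {bases_per_error(qv):,.0f} bases")
```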

Completeness of the gene annotation for each assembly was assessed using OMArk (v0.3.0, OMAmer v2.0.2), PSAURON (v1.0.6), and Mercator4 (v7). OMArk analysis demonstrated that the annotations captured a high proportion of Hierarchical Orthologous Groups (HOGs), with missing HOGs ranging from only 1.69% to 4.18%. However, a substantial proportion of these captured HOGs were identified as duplicates, with duplication rates ranging from 39.85% to 70.38%, while single-copy HOGs ranged from 27.93% to 55.54% across the annotations (Table 3). Complementary analysis with PSAURON indicated high annotation completeness, yielding scores between 97.0 and 97.3 (Table 3). Protein classification via Mercator4 showed that 95.87% to 96.52% of proteins were annotated, with 62.18% to 63.57% being successfully classified into functional bins. Across the assemblies, the annotations covered 93.79% to 96.43% of the Mercator4 BINs (Table 3).

Limitations of ab initio gene annotation

It is important for users of this dataset to note a key difference in the gene prediction methodologies used. For five of the ten genotypes — Hanatee, Kasetsart50, Rayong9, Rayong72, and WildType — gene prediction was supported by organism-specific RNA-Seq data, which improves the accuracy of gene models. The remaining five genotypes — HighlandSmooth, HighlandRough, Rayong90, WhiteRoot, and YellowRoot — were annotated using only the deep-learning-based ab initio tool Helixer. While modern gene predictors like Helixer are powerful, annotations generated without direct transcriptomic evidence are more likely to contain errors, such as incorrect exon boundaries, missed exons, or falsely merged or split genes. We therefore advise that researchers exercise particular caution when analyzing genes from these five genomes, especially in studies focused on rapidly evolving gene families or novel genes, where ab initio models may be less reliable.