Abstract
Cocoa is a vital agricultural commodity that yields cocoa butter and cocoa powder, both essential to the confectionery industry. Trinitario cacao varieties are key sources of fine-flavor beans; among them, ICS 1 stands out for its high productivity and superior bean quality. ICS 1 is also recognized as a valuable parent in breeding programs targeting desirable agronomic traits such as fine-flavor bean quality. Using PacBio HiFi and Hi-C sequencing reads, we generated two haploid genome assemblies (374.4 Mb and 410.8 Mb), each characterized by a low scaffold number, a high N50 value (39.5 Mb and 39.7 Mb, respectively), and 10 pseudo-chromosomes. Genome annotation identified 22,477 and 22,263 protein-coding genes, with repeat content of 62.19% and 65.14%, respectively. BUSCO completeness exceeded 98%, supporting the high quality of both assemblies. These haploid genomes give cacao breeders a resource for developing high-yielding, climate-resilient, fine-flavor cacao varieties in the face of declining soil fertility, rising disease pressure, and accelerating climate change.
Data availability
All genome sequencing data are available in NCBI through the Umbrella BioProject PRJNA1136499. The gene annotations (protein sequence and GFF files) are available via Figshare (ref. 65).
Code availability
No original code was developed; all computational analyses were performed with published computer tools.
References
Davie, J. H. Chromosome studies in the Malvaceae and certain related families. II. Genetica 17, 487–498 (1935).
Cheesman, E. Notes on the nomenclature, classification and possible relationships of cacao populations. Tropical Agriculture 21, 144–159 (1944).
Bartley, B. G. The genetic diversity of cacao and its utilization. 1st edn, (CABI Publishing, Massachusetts, 2005).
Motamayor, J. C. et al. Geographic and genetic population differentiation of the Amazonian chocolate tree (Theobroma cacao L). PLoS One 3, e3311, https://doi.org/10.1371/journal.pone.0003311 (2008).
Zhang, D. et al. Genetic diversity and spatial structure in a new distinct Theobroma cacao L. population in Bolivia. Genetic Resources and Crop Evolution 59, 239–252, https://doi.org/10.1007/s10722-011-9680-y (2012).
Fouet, O. et al. Collection of native Theobroma cacao L. accessions from the Ecuadorian Amazon highlights a hotspot of cocoa diversity. Plants, People, Planet 4, 605–617, https://doi.org/10.1002/ppp3.10282 (2022).
Argout, X. et al. Pangenomic exploration of Theobroma cacao: New insights into gene content diversity and selection during domestication. bioRxiv, 2023.11.03.565324, https://doi.org/10.1101/2023.11.03.565324 (2023).
ICCO. Fine Flavour Cocoa. Available online: https://www.icco.org/fine-or-flavor-cocoa/ (2020).
Colonges, K. et al. Variability and genetic determinants of cocoa aromas in trees native to South Ecuadorian Amazonia. Plants, People, Planet 4, 618–637, https://doi.org/10.1002/ppp3.10268 (2022).
Tscharntke, T. et al. Socio‐ecological benefits of fine‐flavor cacao in its center of origin. Conservation Letters 16, e12936, https://doi.org/10.1111/conl.12936 (2023).
Putri, D. N., De Steur, H., Juvinal, J. G., Gellynck, X. & Schouteten, J. J. Sensory attributes of fine flavor cocoa beans and chocolate: A systematic literature review. Journal of Food Science 89, 1917–1943, https://doi.org/10.1111/1750-3841.17006 (2024).
Fowler, M. S. & Coutel, F. Cocoa beans: from tree to factory. in Beckett’s industrial chocolate manufacture and use, pp. 9–49 (Wiley-Blackwell, 2017).
Kongor, J. E. et al. Factors influencing quality variation in cocoa (Theobroma cacao) bean flavour profile—A review. Food Research International 82, 44–52, https://doi.org/10.1016/j.foodres.2016.01.012 (2016).
Zhang, D. & Motilal, L. Origin, dispersal, and current global distribution of cacao genetic diversity. in Cacao Diseases: A History of Old Enemies and New Encounters, pp. 3–31 (Springer, Cham, 2016).
Freitas, L. S. et al. Elite cacao clonal cultivars with diverse genetic structure, high potential of production, and good organoleptic quality are helping to rebuild the cocoa industry in Brazil. International Journal of Molecular Sciences 26, 3386, https://doi.org/10.3390/ijms26073386 (2025).
Bekele, F. & Phillips-Mora, W. Cacao (Theobroma cacao L.) breeding. in Advances in Plant Breeding Strategies: Industrial and Food Crops, pp. 409–487 (Springer, Cham, 2019).
Argout, X. et al. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies. BMC Genomics 18, 1–9, https://doi.org/10.1186/s12864-017-4120-9 (2017).
Motamayor, J. C. et al. The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color. Genome Biology 14, 1–25, https://doi.org/10.1186/gb-2013-14-6-r53 (2013).
Hämälä, T. et al. Genomic structural variants constrain and facilitate adaptation in natural populations of Theobroma cacao, the chocolate tree. Proceedings of the National Academy of Sciences 118, e2102914118, https://doi.org/10.1073/pnas.2102914118 (2021).
Nousias, O. et al. Three de novo assembled wild cacao genomes from the Upper Amazon. Scientific Data 11, 369, https://doi.org/10.1038/s41597-024-03215-1 (2024).
Cornejo, O. E. et al. Population genomic analyses of the chocolate tree, Theobroma cacao L., provide insights into its domestication process. Communications Biology 1, 167, https://doi.org/10.1038/s42003-018-0168-6 (2018).
Fernandes, L. d. S., Correa, F. M., Ingram, K. T., de Almeida, A.-A. F. & Royaert, S. QTL mapping and identification of SNP-haplotypes affecting yield components of Theobroma cacao L. Horticulture Research 7, 26, https://doi.org/10.1038/s41438-020-0250-3 (2020).
Osorio-Guarín, J. A. et al. Genome-wide association study reveals novel candidate genes associated with productivity and disease resistance to Moniliophthora spp. in cacao (Theobroma cacao L.). G3: Genes, Genomes, Genetics 10, 1713–1725, https://doi.org/10.1534/g3.120.401153 (2020).
Wickramasuriya, A. M. & Dunwell, J. M. Cacao biotechnology: current status and future prospects. Plant Biotechnology Journal 16, 4–17, https://doi.org/10.1111/pbi.12848 (2018).
Fernanda, M. R. M. et al. Cacao genome sequence reveals insights into the flavonoid biosynthesis. bioRxiv, 2024.11.23.624982, https://doi.org/10.1101/2024.11.23.624982 (2024).
Zhang, R.-G. et al. Reticulate allopolyploidy and subsequent dysploidy drive evolution and diversification in the cotton family. Nature Communications 16, 7480, https://doi.org/10.1038/s41467-025-62644-7 (2025).
Argout, X. et al. The genome of Theobroma cacao. Nature Genetics 43, 101–108, https://doi.org/10.1038/ng.736 (2011).
Figueira, A., Janick, J. & Goldsbrough, P. Genome size and DNA polymorphism in Theobroma cacao. Journal of the American Society for Horticultural Science 117, 673–677, https://doi.org/10.21273/JASHS.117.4.673 (1992).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular Biology and Evolution 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Kane, N. et al. Ultra‐barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. American Journal of Botany 99, 320–329, https://doi.org/10.3732/ajb.1100570 (2012).
de Abreu, V. A. et al. Comparative analyses of Theobroma cacao and T. grandiflorum mitogenomes reveal conserved gene content embedded within complex and plastic structures. Gene 849, 146904, https://doi.org/10.1016/j.gene.2022.146904 (2023).
Chikhi, R. & Medvedev, P. Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31–37, https://doi.org/10.1093/bioinformatics/btt310 (2014).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432, https://doi.org/10.1038/s41467-020-14998-3 (2020).
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40, 1332–1335, https://doi.org/10.1038/s41587-022-01261-x (2022).
Challis, R., Richards, E., Rajan, J., Cochrane, G. & Blaxter, M. BlobToolKit–interactive quality assessment of genome assemblies. G3: Genes, Genomes, Genetics 10, 1361–1374, https://doi.org/10.1534/g3.119.400908 (2020).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293, https://doi.org/10.1126/science.1181369 (2009).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Robinson, J. T. et al. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Systems 6, 256–258.e251, https://doi.org/10.1016/j.cels.2018.01.001 (2018).
Liu, D. et al. Haplotype-resolved chromosomal-level genome assembly of Buzhaye (Microcos paniculata). Scientific Data 10, 901, https://doi.org/10.1038/s41597-023-02821-9 (2023).
Campbell, M. S., Holt, C., Moore, B. & Yandell, M. Genome annotation and curation using MAKER and MAKER‐P. Current Protocols in Bioinformatics 48, 4.11.1–4.11.39, https://doi.org/10.1002/0471250953.bi0411s48 (2014).
Baruah, I. K., Ali, S. S., Shao, J., Lary, D. & Bailey, B. A. Changes in gene expression in leaves of cacao genotypes resistant and susceptible to Phytophthora palmivora infection. Frontiers in Plant Science 12, 780805, https://doi.org/10.3389/fpls.2021.780805 (2022).
Krueger, F. Trim Galore!: A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data. https://github.com/FelixKrueger/TrimGalore (2015).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21, https://doi.org/10.1093/bioinformatics/bts635 (2013).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652, https://doi.org/10.1038/nbt.1883 (2011).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research 40, D1202–D1210, https://doi.org/10.1093/nar/gkr1090 (2012).
Ouyang, S. et al. The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Research 35, D883–D887, https://doi.org/10.1093/nar/gkl976 (2007).
Valliyodan, B. et al. Construction and comparison of three reference‐quality genome assemblies for soybean. The Plant Journal 100, 1066–1082, https://doi.org/10.1111/tpj.14500 (2019).
Amborella trichopoda V6.1 (CoGe). https://genomevolution.org/coge/GenomeInfo.pl?gid=50948 (2018).
Gu, K.-J., Lin, C.-F., Wu, J.-J. & Zhao, Y.-P. GinkgoDB: an ecological genome database for the living fossil, Ginkgo biloba. Database 2022, baac046, https://doi.org/10.1093/database/baac046 (2022).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7, https://doi.org/10.1186/gb-2008-9-1-r7 (2008).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013–2015. https://www.repeatmasker.org/ (2015).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research 49, D192–D200, https://doi.org/10.1093/nar/gkaa1047 (2021).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Research 49, 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Sun, P. et al. WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Molecular Plant 15, 1841–1851, https://doi.org/10.1016/j.molp.2022.10.018 (2022).
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20, 1–13, https://doi.org/10.1186/s13059-019-1911-0 (2019).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_050656495.1 (2025).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_050656505.1 (2025).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR33364810 (2025).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR33323860 (2025).
Feng, X. et al. Chromosomal-scale and haplotype-resolved genome assembly of the first Trinitario hybrid cacao ICS 1. Figshare https://doi.org/10.6084/m9.figshare.31277062.
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075, https://doi.org/10.1093/bioinformatics/btt086 (2013).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576, https://doi.org/10.1093/bioinformatics/btw663 (2017).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Acknowledgements
We thank Indrani K. Baruah for sample preparation and Osman Gutierrez for comments and discussions during the project. This work was primarily funded by the United States Department of Agriculture (USDA)/Agricultural Research Service (ARS) awards [58-8042-9-089, 58-8042-3-076], and partially by the National Institutes of Health (NIH) R01 award [R01GM140370]. X.F. was also supported by a Shandong Agricultural University startup grant (539-700028) from January 2025 onward. The mention of trade names or commercial products in this publication is for informational purposes only and does not imply endorsement or recommendation by the U.S. Department of Agriculture. The USDA is an equal opportunity provider and employer.
Author information
Contributions
Y.Y., D.Z. and L.W.M. conceived and designed the project. D.Z., B.B., S.P., S.P.C. and L.W.M. collected the cacao materials and generated the sequencing data. X.F., Y.Yan and R.S.K.R.P. performed genome assembly evaluation, annotation, and all data analysis. V.A. contributed to the data analysis and management. X.F., D.Z. and Y.Y. drafted the manuscript. All authors contributed to and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Feng, X., Patel, R.S.K.R., Yan, Y. et al. Chromosomal-scale and haplotype-resolved genome assembly of the first Trinitario hybrid cacao ICS 1. Sci Data (2026). https://doi.org/10.1038/s41597-026-07054-0
DOI: https://doi.org/10.1038/s41597-026-07054-0


