Abstract
Improving yield, a central aim of post-domestication crop improvement, remains a primary goal for cacao breeders, but progress is often hindered by the confounding effects of population structure. To address this, we analyzed 346 diverse cacao accessions using an ML-based association mapping framework, run with and without population structure adjustment, alongside a phenotype-only ML prediction of yield. After correction for population structure, our Bootstrap Forest-based GWAS produced SNP-importance rankings whose downstream functional summaries were enriched for ribosome- and translation-related terms, and several top-ranked SNPs recurred across multiple yield components (e.g., pod index and seed number) in this panel. In parallel, neural networks identified cotyledon mass and length as the strongest predictors of total wet bean mass. Collectively, this study provides an ML-guided, low-density association workflow and a phenotype-only prediction example for this cacao panel, while explicitly outlining limitations related to marker density and phenotype provenance.
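To illustrate why the comparison with and without population structure adjustment matters, the following minimal sketch ranks two SNPs against a yield trait before and after a simple adjustment. It is not the Bootstrap Forest workflow used in this study: the data are a hypothetical eight-accession toy panel, the ranking statistic is a plain Pearson correlation rather than a forest importance score, and the adjustment is within-subpopulation centering, assumed here as a stand-in for a structure correction. All names (`snp_structure`, `snp_within`, `rank_snps`) are illustrative.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy panel: 8 accessions in two subpopulations (not data from this study).
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
yield_trait = [10.0, 12.0, 10.0, 12.0, 20.0, 22.0, 20.0, 22.0]
snps = {
    # Allele dosage mostly tracks subpopulation membership.
    "snp_structure": [0, 0, 0, 1, 2, 2, 2, 1],
    # Allele dosage tracks yield differences inside each subpopulation.
    "snp_within": [0, 2, 0, 2, 0, 2, 0, 2],
}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def center_within_groups(values, labels):
    """Subtract each subpopulation's mean from its members' values."""
    group_means = {
        g: mean([v for v, h in zip(values, labels) if h == g]) for g in set(labels)
    }
    return [v - group_means[g] for v, g in zip(values, labels)]


def rank_snps(snp_table, trait, labels=None):
    """Rank SNPs by |r| with the trait; if labels are given, center within groups first."""
    if labels is not None:
        trait = center_within_groups(trait, labels)
    scores = {}
    for name, dosage in snp_table.items():
        if labels is not None:
            dosage = center_within_groups(dosage, labels)
        scores[name] = abs(pearson(dosage, trait))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


naive = rank_snps(snps, yield_trait)             # no structure adjustment
adjusted = rank_snps(snps, yield_trait, groups)  # with within-group centering
```

In this toy panel the unadjusted ranking places `snp_structure` first, because its allele dosage separates the two subpopulations, which also differ in mean yield. After within-group centering the order reverses and `snp_within` ranks first, which mirrors how a structure correction can change which markers appear most important.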
Data availability
The present study is a re-analysis of previously published, publicly available datasets. No new raw genotype or phenotype data were generated in this study. The raw phenotypic and genotypic data of the 346 Theobroma cacao accessions analyzed in this work are available in the Supporting Information of Bekele et al. (2022) at https://doi.org/10.1371/journal.pone.0260907. All derived results generated during the current study, including trait-importance rankings and GO enrichment results, are included in this published article (and its Supplementary Information files).
References
Díaz-Montenegro, J. Livelihood strategies and risk behavior of cacao producers in Ecuador: effects of national policies to support cacao farmers and specialty cacao landraces. PhD thesis, Universitat Politècnica de Catalunya (2019).
Hall, J. N. Applying a One Health approach to study the livelihoods of cocoa farming communities in Bougainville. (2022).
Bekele, F. & Phillips-Mora, W. Cacao (Theobroma cacao L.) breeding. In Advances in Plant Breeding Strategies: Industrial and Food Crops, Vol. 6, 409–487 (2019).
Alden, D. The significance of cacao production in the Amazon region during the late colonial period: an essay in comparative economic history. Proc. Am. Philos. Soc. 120, 103–135 (1976).
Gardea, A. A. et al. Cacao (Theobroma cacao L.). Fruit and Vegetable Phytochemicals: Chemistry and Human Health, 2nd Edition 921–940 (2017).
Kongor, J. E., Owusu, M. & Oduro-Yeboah, C. Cocoa production in the 2020s: challenges and solutions. CABI Agric. Biosci. 5, 102 (2024).
Walters, D. Chocolate Crisis: Climate Change and Other Threats to the Future of Cacao (University Press of Florida, 2020).
Boza, E. J. et al. Genetic characterization of the cacao cultivar CCN 51: its impact and significance on global cacao improvement and production. J. Am. Soc. Hortic. Sci. 139, 219–229 (2014).
Mustiga, G. M. et al. Phenotypic description of Theobroma cacao L. for yield and vigor traits from 34 hybrid families in Costa Rica based on the genetic basis of the parental population. Front. Plant Sci. 9, 808 (2018).
Izzah, N. K. et al. Improvement of Cacao Pod Characteristics and its Molecular Characterization in 4 F1 Cacao Populations. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-4766155/v1 (2024).
Bekele, F. L. et al. Genome-wide association studies and genomic selection assays made in a large sample of cacao (Theobroma cacao L.) germplasm reveal significant marker-trait associations and good predictive value for improving yield potential. PLoS One 17, e0260907 (2022).
Bediako, K. A., Padi, F. K., Obeng-Bio, E. & Ofori, A. Genetic diversity and parentage of cacao (Theobroma cacao L.) populations from Ghana using single nucleotide polymorphism (SNP) markers. Plant Genet. Resour. 23, 40–47 (2025).
Gutiérrez, O. A., Campbell, A. S. & Phillips-Mora, W. Breeding for disease resistance in cacao. In Cacao Diseases: A History of Old Enemies and New Encounters (eds Bailey, B. & Meinhardt, L.) 567–609 (Springer, 2016).
Rodriguez-Medina, C. et al. Cacao breeding in Colombia, past, present and future. Breed. Sci. 69, 373–382 (2019).
Wickramasuriya, A. M. & Dunwell, J. M. Cacao biotechnology: current status and future prospects. Plant Biotechnol. J. 16, 4–17 (2018).
McElroy, M. S. et al. Prediction of cacao (Theobroma cacao) resistance to Moniliophthora spp. diseases via genome-wide association analysis and genomic selection. Front. Plant Sci. 9, 343 (2018).
Romero Navarro, J. A. et al. Application of genome wide association and genomic prediction for improvement of cacao productivity and resistance to black and frosty pod diseases. Front. Plant Sci. 8, 1905 (2017).
Fernandes, L. S., Correa, F. M., Ingram, K. T., de Almeida, A. A. F. & Royaert, S. QTL mapping and identification of SNP-haplotypes affecting yield components of Theobroma cacao L. Hortic. 7, 26 (2020).
Duarte-Carvajalino, J. M., Paramo-Alvarez, M., Ramos-Calderón, P. F. & González-Orozco, C. E. Estimation of canopy attributes of wild cacao trees using digital cover photography and machine learning algorithms. iForest Biogeosci. For. 14, 517 (2021).
Omas-as, A. M., Daang, J. A. M. & Arboleda, E. R. Machine Learning as a Strategic Tool: A Comprehensive Literature Review for Advancing Agricultural Analysis, with Emphasis on the Cocoa Bean Quality Assessment. Int. J. Sci. Res. Eng. Dev. 7, 269 (2024).
Tan, J., Balasubramanian, B., Sukha, D., Ramkissoon, S. & Umaharan, P. Sensing fermentation degree of cocoa (Theobroma cacao L.) beans by machine learning classification models based electronic nose system. J. Food Process Eng. 42, e13175 (2019).
Lamos-Díaz, H., Puentes-Garzón, D. E. & Zarate-Caicedo, D.-A. Comparison between machine learning models for yield forecast in cocoa crops in Santander, Colombia. Rev. Fac. Ing. 29, 18 (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677 (2025).
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).
Akbudak, M. A., Filiz, E., Vatansever, R. & Kontbay, K. Genome-wide identification and expression profiling of ascorbate peroxidase (APX) and glutathione peroxidase (GPX) genes under drought stress in sorghum (Sorghum bicolor L.). J. Plant Growth Regul. 37, 925–936 (2018).
Kamruzzaman, M. et al. Pinpointing genomic loci for drought-induced proline and hydrogen peroxide accumulation in bread wheat under field conditions. BMC Plant Biol. 22, 584 (2022).
Zhou, Z. et al. Identification of genomic regions affecting grain peroxidase activity in bread wheat using genome-wide association study. BMC Plant Biol. 21, 523 (2021).
Richardson, K. & Jones, M. C. Why genome-wide associations with cognitive ability measures are probably spurious. New Ideas Psychol. 55, 35–41 (2019).
Sullivan, P. F. Spurious genetic associations. Biol. Psychiatry. 61, 1121–1126 (2007).
Clemente, A. et al. Eliminating anti-nutritional plant food proteins: the case of seed protease inhibitors in pea. PLoS One 10, e0134634 (2015).
Schulthess, A. W. et al. The roles of pleiotropy and close linkage as revealed by association mapping of yield and correlated traits of wheat (Triticum aestivum L.). J. Exp. Bot. 68, 4089–4101 (2017).
Liu, C. et al. Multi-trait genome-wide association studies reveal novel pleiotropic loci associated with yield and yield-related traits in rice. J. Integr. Agric. Advance online publication. https://doi.org/10.1016/j.jia.2024.07.026 (2024).
Kruger, N. J. & Von Schaewen, A. The oxidative pentose phosphate pathway: structure and organisation. Curr. Opin. Plant Biol. 6, 236–246 (2003).
Tang, Q. et al. 6-Phosphogluconate dehydrogenase 2 bridges the OPP and shikimate pathways to enhance aromatic amino acid production in plants. Sci. China Life Sci. 67, 2488–2498 (2024).
Herrmann, K. M. & Weaver, L. M. The shikimate pathway. Annu. Rev. Plant Biol. 50, 473–503 (1999).
Maeda, H. & Dudareva, N. The shikimate pathway and aromatic amino acid biosynthesis in plants. Annu. Rev. Plant Biol. 63, 73–105 (2012).
Dahan, J. et al. Disruption of the Cytochrome c oxidase deficient1 Gene Leads to Cytochrome c Oxidase Depletion and Reorchestrated Respiratory Metabolism in Arabidopsis. Plant Physiol. 166, 1788–1802 (2014).
Mansilla, N., Garcia, L., Gonzalez, D. H. & Welchen, E. AtCOX10, a protein involved in haem o synthesis during cytochrome c oxidase biogenesis, is essential for plant embryogenesis and modulates the progression of senescence. J. Exp. Bot. 66, 6761–6775 (2015).
Lv, Q. et al. Wheat E3 ubiquitin ligase TaGW2-6A degrades TaAGPS to affect seed size. Plant Sci. 320, 111274 (2022).
Xia, T. et al. The Ubiquitin Receptor DA1 Interacts with the E3 Ubiquitin Ligase DA2 to Regulate Seed and Organ Size in Arabidopsis. Plant Cell 25, 3347–3359 (2013).
Popescu, S. C. & Tumer, N. E. Silencing of ribosomal protein L3 genes in N. tabacum reveals coordinate expression and significant alterations in plant growth, development and ribosome biogenesis. Plant J. 39, 29–44 (2004).
Tian, S. et al. Ribosomal protein NtRPL17 interacts with kinesin-12 family protein NtKRP and functions in the regulation of embryo/seed size and radicle growth. J. Exp. Bot. 68, 5553–5564 (2017).
Weis, B. L., Kovacevic, J., Missbach, S. & Schleiff, E. Plant-Specific Features of Ribosome Biogenesis. Trends Plant Sci. 20, 729–740 (2015).
Pescador-Dionisio, S. et al. Contribution of the regulatory miR156-SPL9 module to the drought stress response in pigmented potato (Solanum tuberosum L.). Plant Physiol. Biochem. 217, 109195 (2024).
Werner, C., Fasbender, L., Romek, K. M., Yáñez-Serrano, A. M. & Kreuzwieser, J. Heat Waves Change Plant Carbon Allocation Among Primary and Secondary Metabolism Altering CO2 Assimilation, Respiration, and VOC Emissions. Front. Plant Sci. 11, 1242 (2020).
Brown, D. C. W. & Thorpe, T. A. Mitochondrial activity during shoot formation and growth in tobacco callus. Physiol. Plant. 54, 125–130 (1982).
Jia, F. et al. Overexpression of Mitochondrial Phosphate Transporter 3 Severely Hampers Plant Development through Regulating Mitochondrial Function in Arabidopsis. PLoS One 10, e0129717 (2015).
van der Merwe, M. J. et al. Tricarboxylic Acid Cycle Activity Regulates Tomato Root Growth via Effects on Secondary Cell Wall Production. Plant Physiol. 153, 611–621 (2010).
Verbančič, J., Lunn, J. E., Stitt, M. & Persson, S. Carbon Supply and the Regulation of Cell Wall Synthesis. Mol. Plant 11, 75–94 (2018).
Bus, A. et al. Species- and genome-wide dissection of the shoot ionome in Brassica napus and its relationship to seedling development. Front. Plant Sci. 5, 485 (2014).
Wang, P., Zhou, G., Cui, K., Li, Z. & Yu, S. Clustered QTL for source leaf size and yield traits in rice (Oryza sativa L.). Mol. Breeding 29, 99–113 (2012).
Aguilar, M. & Prieto, P. Sequence analysis of wheat subtelomeres reveals a high polymorphism among homoeologous chromosomes. Plant Genome 13, e20065 (2020).
Fan, C. et al. The Subtelomere of Oryza sativa Chromosome 3 Short Arm as a Hot Bed of New Gene Origination in Rice. Mol. Plant 1, 839–850 (2008).
Brown, C. A., Murray, A. W. & Verstrepen, K. J. Rapid Expansion and Functional Divergence of Subtelomeric Gene Families in Yeasts. Curr. Biol. 20, 895–903 (2010).
Saint-Leandre, B. & Levine, M. T. The Telomere Paradox: Stable Genome Preservation with Rapidly Evolving Proteins. Trends Genet. 36, 232–242 (2020).
Hesami, M., Naderi, R., Tohidfar, M. & Yoosefzadeh-Najafabadi, M. Development of support vector machine-based model and comparative analysis with artificial neural network for modeling the plant tissue culture procedures: effect of plant growth regulators on somatic embryogenesis of chrysanthemum, as a case study. Plant Methods 16, 1–15 (2020).
Wu, D. et al. Combining high-throughput micro-CT-RGB phenotyping and genome-wide association study to dissect the genetic architecture of tiller growth in rice. J. Exp. Bot. 70, 545–561 (2019).
Seyum, E. G. et al. Genomic selection in tropical perennial crops and plantation trees: a review. Mol. Breeding. 42, 58 (2022).
Sallam, A., Alqudah, A. M., Baenziger, P. S. & Rasheed, A. Editorial: Genetic validation and its role in crop improvement. Front. Genet. 13, 1078246 (2023).
Crossa, J. et al. Expanding genomic prediction in plant breeding: harnessing big data, machine learning, and advanced software. Trends Plant Sci. 30, 756–774 (2025).
Motamayor, J. C. et al. Geographic and genetic population differentiation of the Amazonian chocolate tree (Theobroma cacao L.). PLoS One 3, e3311 (2008).
Ge, S. X., Jung, D. & Yao, R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628–2629 (2020).
Schwenk, H. & Bengio, Y. Boosting neural networks. Neural Comput. 12, 1869–1887 (2000).
Acknowledgements
We are grateful to the reviewers for their constructive feedback. This work is supported by the U.S. Department of Agriculture, Agricultural Research Service, In-House Projects No. 8042-21220-258-000-D and 8042-21000-303-000-D. Mention of any trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer, and all agency services are available without discrimination.
Funding
This work is supported by the U.S. Department of Agriculture, Agricultural Research Service, In-House Projects No. 8042-21220-258-000-D and 8042-21000-303-000-D.
Author information
Contributions
E.A. conceptualized and designed the study. The investigation was performed by J.B., S.P., and E.A. Data analysis, validation, and visualization were conducted by J.B., D.L., S.P.C., S.P., and E.A. The methodology and resources were provided with contributions from J.B., I.B., S.L., J.H.J., S.P.C., A.H.L., and L.W.M. E.A. wrote the original manuscript draft. All authors reviewed, edited, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Baek, I., Bhatt, J., Lim, S. et al. A GWAS–machine learning framework reveals protein-synthesis pathway signals for yield in Theobroma cacao after population-structure correction. Sci Rep (2026). https://doi.org/10.1038/s41598-026-42273-w
DOI: https://doi.org/10.1038/s41598-026-42273-w