Scientific Reports
A GWAS–machine learning framework reveals protein-synthesis pathway signals for yield in Theobroma cacao after population-structure correction
  • Article
  • Open access
  • Published: 17 March 2026

  • Insuck Baek1 na1,
  • Jishnu Bhatt2 na1,
  • Seunghyun Lim2,
  • Dongho Lee3,
  • Jae Hee Jang2,
  • Stephen P. Cohen2,
  • Amelia H. Lovelace2,
  • Moon S. Kim1,
  • Lyndel W. Meinhardt2,
  • Sunchung Park2 &
  • Ezekiel Ahn2 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Genetics
  • Plant sciences

Abstract

Improving cacao yield is a key objective of post-domestication crop improvement and a primary goal for breeders, but progress is often hindered by the confounding effects of population structure. To address this, we analyzed 346 diverse cacao accessions using a machine learning (ML)-based association mapping framework, run with and without population-structure adjustment, alongside a phenotype-only ML prediction of yield. After correcting for population structure, our Bootstrap Forest-based GWAS produced SNP-importance rankings whose downstream functional summaries were enriched for ribosome- and translation-related terms, and several top-ranked SNPs recurred across multiple yield components (e.g., pod index and seed number) in this panel. In parallel, neural networks identified cotyledon mass and length as the strongest predictors of total wet bean mass. Collectively, this study provides an ML-guided, low-density association workflow and a phenotype-only prediction example for this cacao panel, while explicitly outlining limitations related to marker density and phenotype provenance.
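
To illustrate why SNP rankings can change once population structure is accounted for, the following is a minimal sketch on a toy dataset. It is not the authors' pipeline: it swaps the Bootstrap Forest importance score for a plain absolute Pearson correlation, and it assumes a single hypothetical structure covariate (for example, a subpopulation label or a principal-component score). All names and values (`structure`, `snp_structured`, `snp_causal`, `yield_values`) are invented for illustration.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def residualize(values, covariate):
    """Remove the linear effect of one covariate by ordinary least squares."""
    mc, mv = mean(covariate), mean(values)
    sxx = sum((c - mc) ** 2 for c in covariate)
    slope = sum((c - mc) * (v - mv) for c, v in zip(covariate, values)) / sxx
    intercept = mv - slope * mc
    return [v - (intercept + slope * c) for v, c in zip(values, covariate)]


def rank_snps(phenotype, snps):
    """Rank SNPs by |correlation| with the phenotype (stand-in for importance)."""
    scores = {name: abs(pearson(geno, phenotype)) for name, geno in snps.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy panel: two subpopulations of four accessions each.
structure = [0, 0, 0, 0, 1, 1, 1, 1]
snps = {
    # Allele counts that track subpopulation but have no effect on yield.
    "snp_structured": [0, 0, 0, 1, 2, 2, 2, 1],
    # Allele counts balanced across subpopulations, with a real effect.
    "snp_causal": [0, 1, 2, 1, 0, 1, 2, 1],
}
# Toy yield: a subpopulation shift plus the causal SNP effect.
yield_values = [10 + 5 * s + g for s, g in zip(structure, snps["snp_causal"])]

unadjusted = rank_snps(yield_values, snps)
adjusted = rank_snps(residualize(yield_values, structure), snps)

print("Without adjustment:", unadjusted)
print("With adjustment:   ", adjusted)
```

In this toy case the structure-tracking SNP ranks first before adjustment and the causal SNP ranks first afterwards. A fuller analysis would typically include the structure covariates in the model itself rather than residualizing only the phenotype.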

Data availability

The present study is a re-analysis of previously published, publicly available datasets. No new raw genotype or phenotype data were generated in this study. The raw phenotypic and genotypic data of the 346 Theobroma cacao accessions analyzed in this work are available in the Supporting Information of Bekele et al. (2022) at https://doi.org/10.1371/journal.pone.0260907. All derived results generated during the current study, including trait-importance rankings and GO enrichment results, are included in this published article (and its Supplementary Information files).

References

  1. Díaz-Montenegro, J. Livelihood strategies and risk behavior of cacao producers in Ecuador: effects of national policies to support cacao farmers and specialty cacao landraces. PhD thesis, Universitat Politècnica de Catalunya (2019).

  2. Hall, J. N. Applying a One Health approach to study the livelihoods of cocoa farming communities in Bougainville. (2022).

  3. Bekele, F. & Phillips-Mora, W. Cacao (Theobroma cacao L.) breeding. In Advances in Plant Breeding Strategies: Industrial and Food Crops, Vol. 6, 409–487 (2019).

  4. Alden, D. The significance of cacao production in the Amazon region during the late colonial period: an essay in comparative economic history. Proc. Am. Philos. Soc. 120, 103–135 (1976).

  5. Gardea, A. A. et al. Cacao (Theobroma cacao L.). Fruit and Vegetable Phytochemicals: Chemistry and Human Health, 2nd Edition 921–940 (2017).

  6. Kongor, J. E., Owusu, M. & Oduro-Yeboah, C. Cocoa production in the 2020s: challenges and solutions. CABI Agric. Bioscience. 5, 102 (2024).

  7. Walters, D. Chocolate Crisis: Climate Change and Other Threats to the Future of Cacao (University Press of Florida, 2020).

  8. Boza, E. J. et al. Genetic characterization of the cacao cultivar CCN 51: its impact and significance on global cacao improvement and production. J. Am. Soc. Hortic. Sci. 139, 219–229 (2014).

  9. Mustiga, G. M. et al. Phenotypic description of Theobroma cacao L. for yield and vigor traits from 34 hybrid families in Costa Rica based on the genetic basis of the parental population. Front. Plant Sci. 9, 808 (2018).

  10. Izzah, N. K. et al. Improvement of Cacao Pod Characteristics and its Molecular Characterization in 4 F1 Cacao Populations. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-4766155/v1 (2024).

  11. Bekele, F. L. et al. Genome-wide association studies and genomic selection assays made in a large sample of cacao (Theobroma cacao L.) germplasm reveal significant marker-trait associations and good predictive value for improving yield potential. PLoS One. 17, e0260907 (2022).

  12. Bediako, K. A., Padi, F. K., Obeng-Bio, E. & Ofori, A. Genetic diversity and parentage of cacao (Theobroma cacao L.) populations from Ghana using single nucleotide polymorphism (SNP) markers. Plant Genet. Resour. 23, 40–47 (2025).

  13. Gutiérrez, O. A., Campbell, A. S. & Phillips-Mora, W. Breeding for disease resistance in cacao. In Cacao Diseases: A History of Old Enemies and New Encounters (eds Bailey, B. & Meinhardt, L.) 567–609 (Springer, 2016).

  14. Rodriguez-Medina, C. et al. Cacao breeding in Colombia, past, present and future. Breed. Sci. 69, 373–382 (2019).

  15. Wickramasuriya, A. M. & Dunwell, J. M. Cacao biotechnology: current status and future prospects. Plant Biotechnol. J. 16, 4–17 (2018).

  16. McElroy, M. S. et al. Prediction of cacao (Theobroma cacao) resistance to Moniliophthora spp. diseases via genome-wide association analysis and genomic selection. Front. Plant Sci. 9, 343 (2018).

  17. Romero Navarro, J. A. et al. Application of genome wide association and genomic prediction for improvement of cacao productivity and resistance to black and frosty pod diseases. Front. Plant Sci. 8, 1905 (2017).

  18. Fernandes, L. S., Correa, F. M., Ingram, K. T., de Almeida, A. A. F. & Royaert, S. QTL mapping and identification of SNP-haplotypes affecting yield components of Theobroma cacao L. Hortic. 7, 26 (2020).

  19. Duarte-Carvajalino, J. M., Paramo-Alvarez, M., Ramos-Calderón, P. F. & González-Orozco, C. E. Estimation of canopy attributes of wild cacao trees using digital cover photography and machine learning algorithms. iForest-Biogeosciences Forestry. 14, 517 (2021).

  20. Omas-as, A. M., Daang, J. A. M. & Arboleda, E. R. Machine Learning as a Strategic Tool: A Comprehensive Literature Review for Advancing Agricultural Analysis, with Emphasis on the Cocoa Bean Quality Assessment. Int. J. Sci. Res. Eng. Dev. 7, 269 (2024).

  21. Tan, J., Balasubramanian, B., Sukha, D., Ramkissoon, S. & Umaharan, P. Sensing fermentation degree of cocoa (Theobroma cacao L.) beans by machine learning classification models based electronic nose system. J. Food Process Eng. 42, e13175 (2019).

  22. Lamos-Díaz, H., Puentes-Garzón, D. E. & Zarate-Caicedo, D.-A. Comparison between machine learning models for yield forecast in cocoa crops in Santander, Colombia. Rev. Fac. Ing. 29, 18 (2020).

  23. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  24. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).

  25. Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677 (2025).

  26. Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).

  27. Akbudak, M. A., Filiz, E., Vatansever, R. & Kontbay, K. Genome-wide identification and expression profiling of ascorbate peroxidase (APX) and glutathione peroxidase (GPX) genes under drought stress in sorghum (Sorghum bicolor L). J. Plant Growth Regul. 37, 925–936 (2018).

  28. Kamruzzaman, M. et al. Pinpointing genomic loci for drought-induced proline and hydrogen peroxide accumulation in bread wheat under field conditions. BMC Plant Biol. 22, 584 (2022).

  29. Zhou, Z. et al. Identification of genomic regions affecting grain peroxidase activity in bread wheat using genome-wide association study. BMC Plant Biol. 21, 523 (2021).

  30. Richardson, K. & Jones, M. C. Why genome-wide associations with cognitive ability measures are probably spurious. New Ideas Psychol. 55, 35–41 (2019).

  31. Sullivan, P. F. Spurious genetic associations. Biol. Psychiatry. 61, 1121–1126 (2007).

  32. Clemente, A. et al. Eliminating anti-nutritional plant food proteins: the case of seed protease inhibitors in pea. PLoS One. 10, e0134634 (2015).

  33. Schulthess, A. W. et al. The roles of pleiotropy and close linkage as revealed by association mapping of yield and correlated traits of wheat (Triticum aestivum L). J. Exp. Bot. 68, 4089–4101 (2017).

  34. Liu, C. et al. Multi-trait genome-wide association studies reveal novel pleiotropic loci associated with yield and yield-related traits in rice. J. Integr. Agri. Advance online publication. https://doi.org/10.1016/j.jia.2024.07.026 (2024).

  35. Kruger, N. J. & Von Schaewen, A. The oxidative pentose phosphate pathway: structure and organisation. Curr. Opin. Plant. Biol. 6, 236–246 (2003).

  36. Tang, Q. et al. 6-Phosphogluconate dehydrogenase 2 bridges the OPP and shikimate pathways to enhance aromatic amino acid production in plants. Sci. China Life Sci. 67, 2488–2498 (2024).

  37. Herrmann, K. M. & Weaver, L. M. The shikimate pathway. Annu. Rev. Plant Biol. 50, 473–503 (1999).

  38. Maeda, H. & Dudareva, N. The shikimate pathway and aromatic amino acid biosynthesis in plants. Annu. Rev. Plant Biol. 63, 73–105 (2012).

  39. Dahan, J. et al. Disruption of the Cytochrome c oxidase deficient1 Gene Leads to Cytochrome c Oxidase Depletion and Reorchestrated Respiratory Metabolism in Arabidopsis. Plant Physiol. 166, 1788–1802 (2014).

  40. Mansilla, N., Garcia, L., Gonzalez, D. H. & Welchen, E. AtCOX10, a protein involved in haem o synthesis during cytochrome c oxidase biogenesis, is essential for plant embryogenesis and modulates the progression of senescence. J. Exp. Bot. 66, 6761–6775 (2015).

  41. Lv, Q. et al. Wheat E3 ubiquitin ligase TaGW2-6A degrades TaAGPS to affect seed size. Plant Sci. 320, 111274 (2022).

  42. Xia, T. et al. The Ubiquitin Receptor DA1 Interacts with the E3 Ubiquitin Ligase DA2 to Regulate Seed and Organ Size in Arabidopsis. Plant Cell. 25, 3347–3359 (2013).

  43. Popescu, S. C. & Tumer, N. E. Silencing of ribosomal protein L3 genes in N. tabacum reveals coordinate expression and significant alterations in plant growth, development and ribosome biogenesis. Plant J. 39, 29–44 (2004).

  44. Tian, S. et al. Ribosomal protein NtRPL17 interacts with kinesin-12 family protein NtKRP and functions in the regulation of embryo/seed size and radicle growth. J. Exp. Bot. 68, 5553–5564 (2017).

  45. Weis, B. L., Kovacevic, J., Missbach, S. & Schleiff, E. Plant-Specific Features of Ribosome Biogenesis. Trends Plant Sci. 20, 729–740 (2015).

  46. Pescador-Dionisio, S. et al. Contribution of the regulatory miR156-SPL9 module to the drought stress response in pigmented potato (Solanum tuberosum L). Plant Physiol. Biochem. 217, 109195 (2024).

  47. Werner, C., Fasbender, L., Romek, K. M., Yáñez-Serrano, A. M. & Kreuzwieser, J. Heat Waves Change Plant Carbon Allocation Among Primary and Secondary Metabolism Altering CO2 Assimilation, Respiration, and VOC Emissions. Front. Plant Sci. 11, 1242 (2020).

  48. Brown, D. C. W. & Thorpe, T. A. Mitochondrial activity during shoot formation and growth in tobacco callus. Physiol. Plant. 54, 125–130 (1982).

  49. Jia, F. et al. Overexpression of Mitochondrial Phosphate Transporter 3 Severely Hampers Plant Development through Regulating Mitochondrial Function in Arabidopsis. PLoS One. 10, e0129717 (2015).

  50. van der Merwe, M. J. et al. Tricarboxylic Acid Cycle Activity Regulates Tomato Root Growth via Effects on Secondary Cell Wall Production. Plant Physiol. 153, 611–621 (2010).

  51. Verbančič, J., Lunn, J. E., Stitt, M. & Persson, S. Carbon Supply and the Regulation of Cell Wall Synthesis. Mol. Plant. 11, 75–94 (2018).

  52. Bus, A. et al. Species- and genome-wide dissection of the shoot ionome in Brassica napus and its relationship to seedling development. Front. Plant Sci. 5, 485 (2014).

  53. Wang, P., Zhou, G., Cui, K., Li, Z. & Yu, S. Clustered QTL for source leaf size and yield traits in rice (Oryza sativa L). Mol. Breeding. 29, 99–113 (2012).

  54. Aguilar, M. & Prieto, P. Sequence analysis of wheat subtelomeres reveals a high polymorphism among homoeologous chromosomes. Plant. Genome. 13, e20065 (2020).

  55. Fan, C. et al. The Subtelomere of Oryza sativa Chromosome 3 Short Arm as a Hot Bed of New Gene Origination in Rice. Mol. Plant. 1, 839–850 (2008).

  56. Brown, C. A., Murray, A. W. & Verstrepen, K. J. Rapid Expansion and Functional Divergence of Subtelomeric Gene Families in Yeasts. Curr. Biol. 20, 895–903 (2010).

  57. Saint-Leandre, B. & Levine, M. T. The Telomere Paradox: Stable Genome Preservation with Rapidly Evolving Proteins. Trends Genet. 36, 232–242 (2020).

  58. Hesami, M., Naderi, R., Tohidfar, M. & Yoosefzadeh-Najafabadi, M. Development of support vector machine-based model and comparative analysis with artificial neural network for modeling the plant tissue culture procedures: effect of plant growth regulators on somatic embryogenesis of chrysanthemum, as a case study. Plant Methods. 16, 1–15 (2020).

  59. Wu, D. et al. Combining high-throughput micro-CT-RGB phenotyping and genome-wide association study to dissect the genetic architecture of tiller growth in rice. J. Exp. Bot. 70, 545–561 (2019).

  60. Seyum, E. G. et al. Genomic selection in tropical perennial crops and plantation trees: a review. Mol. Breeding. 42, 58 (2022).

  61. Sallam, A., Alqudah, A. M., Baenziger, P. S. & Rasheed, A. Editorial: Genetic validation and its role in crop improvement. Front. Genet. 13, 1078246 (2023).

  62. Crossa, J. et al. Expanding genomic prediction in plant breeding: harnessing big data, machine learning, and advanced software. Trends Plant Sci. 30, 756–774 (2025).

  63. Motamayor, J. C. et al. Geographic and genetic population differentiation of the Amazonian chocolate tree (Theobroma cacao L). PLoS One. 3, e3311 (2008).

  64. Ge, S. X., Jung, D. & Yao, R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628–2629 (2020).

  65. Schwenk, H. & Bengio, Y. Boosting neural networks. Neural Comput. 12, 1869–1887 (2000).

Acknowledgements

We are grateful to the reviewers for their constructive feedback. This work is supported by the U.S. Department of Agriculture, Agricultural Research Service, In-House Projects No. 8042-21220-258-000-D and 8042-21000-303-000-D. Mention of any trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer, and all agency services are available without discrimination.

Funding

This work is supported by the U.S. Department of Agriculture, Agricultural Research Service, In-House Projects No. 8042-21220-258-000-D and 8042-21000-303-000-D.

Author information

Author notes
  1. Insuck Baek and Jishnu Bhatt contributed equally to this work.

Authors and Affiliations

  1. Environmental Microbial and Food Safety Laboratory, Agricultural Research Service, Department of Agriculture, Beltsville, 20705, MD, USA

    Insuck Baek & Moon S. Kim

  2. Sustainable Perennial Crops Laboratory, Agricultural Research Service, Department of Agriculture, Beltsville, 20705, MD, USA

    Jishnu Bhatt, Seunghyun Lim, Jae Hee Jang, Stephen P. Cohen, Amelia H. Lovelace, Lyndel W. Meinhardt, Sunchung Park & Ezekiel Ahn

  3. Soybean Genomics & Improvement Laboratory, Agricultural Research Service, Department of Agriculture, Beltsville, 20705, MD, USA

    Dongho Lee

Contributions

E.A. conceptualized and designed the study. The investigation was performed by J.B., S.P., and E.A. Data analysis, validation, and visualization were conducted by J.B., D.L., S.P.C., S.P., and E.A. The methodology and resources were provided with contributions from J.B., I.B., S.L., J.H.J., S.P.C., A.H.L., and L.W.M. E.A. wrote the original manuscript draft. All authors reviewed, edited, and approved the final manuscript.

Corresponding author

Correspondence to Ezekiel Ahn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1 (XLSX)

Supplementary Material 2 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Baek, I., Bhatt, J., Lim, S. et al. A GWAS–machine learning framework reveals protein-synthesis pathway signals for yield in Theobroma cacao after population-structure correction. Sci Rep (2026). https://doi.org/10.1038/s41598-026-42273-w

  • Received: 08 September 2025

  • Accepted: 25 February 2026

  • Published: 17 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-42273-w

Keywords

  • Theobroma cacao
  • Yield
  • GWAS
  • Machine learning
  • Protein synthesis
  • Genomic prediction
  • Population structure
  • Plant breeding
  • Ribosome
