Nature Communications
Leveraging weighted embedding and Transformer architecture to improve phenotype prediction of complex traits for crops
  • Article
  • Open access
  • Published: 26 March 2026

  • Jing Li  ORCID: orcid.org/0009-0004-3376-3415 1 na1,
  • Linfeng Yu 1 na1,
  • Mengfan Li 2 na1,
  • Rui Han 2,
  • Yecheng Li 1,
  • Abdulwahab Saliu Shaibu 3,
  • Kwadwo Gyapong Agyenim-Boateng 4,
  • Zhaoyi Hao 1,
  • Yitian Liu 1,
  • Bin Li  ORCID: orcid.org/0000-0002-9452-6083 1,
  • Shengrui Zhang 1,
  • Liang Li 1,
  • Lijuan Qiu  ORCID: orcid.org/0000-0001-5777-3344 1 &
  • …
  • Junming Sun  ORCID: orcid.org/0000-0002-5585-0016 1

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Agricultural genetics
  • Machine learning
  • Plant breeding
  • Quantitative trait

Abstract

Understanding the relationship between genomic variation and phenotype is fundamental to deciphering the genetic architecture underlying complex traits. Yet existing statistical models struggle to balance massive genomic datasets with biological interpretability. Here, we introduce GP-WAITER, a deep learning framework integrating GWAS-derived SNP weights into a hybrid convolutional neural network and Transformer architecture. By utilizing a weighted embedding mechanism and multi-head self-attention, GP-WAITER effectively captures long-range dependencies across ultra-long genomic sequences. The model consistently outperforms seven state-of-the-art genomic prediction models across six datasets, achieving up to a 77.5% improvement in prediction accuracy, a 78% reduction in mean squared error, and a 1.8- to 2.4-fold increase in computational efficiency. Furthermore, GP-WAITER offers biological transparency by pinpointing key genetic variants driving specific traits. This scalable, interpretable framework provides a powerful tool for precision breeding and the functional interpretation of trait-associated variants.
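
As a rough illustration of the two ideas named in the abstract (a weighted embedding driven by GWAS-derived SNP weights, followed by self-attention), the toy sketch below may help. It is not the GP-WAITER implementation. The weighting scheme (normalised -log10 of the GWAS p-value), the embedding table, the single attention head with identity projections, and all function names and numbers are assumptions made for this example; the actual model additionally uses convolutional layers and multi-head attention with learned parameters.

```python
import math

def gwas_weights(p_values):
    """Map GWAS p-values to weights in (0, 1] using normalised -log10(p).

    Assumed scheme for illustration only; expects p-values in (0, 1)
    so that at least one score is positive.
    """
    scores = [-math.log10(p) for p in p_values]
    top = max(scores)
    return [s / top for s in scores]

def weighted_embedding(genotypes, table, weights):
    """Look up an embedding per genotype code (0/1/2) and scale it by the SNP weight."""
    return [[w * x for x in table[g]] for g, w in zip(genotypes, weights)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Queries, keys and values are the tokens themselves (identity
    projections), a simplification of learned multi-head attention.
    """
    d = len(tokens[0])
    outputs, attention = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        probs = softmax(scores)
        attention.append(probs)
        outputs.append([sum(p * v[j] for p, v in zip(probs, tokens)) for j in range(d)])
    return outputs, attention

# Toy inputs; every value here is invented for illustration.
table = {0: [0.0, 1.0], 1: [0.5, 0.5], 2: [1.0, 0.0]}  # hypothetical embedding table
genotypes = [0, 2, 1, 2]                                 # four SNPs of one individual
p_values = [1e-2, 1e-8, 1e-4, 1e-1]                      # hypothetical GWAS p-values

weights = gwas_weights(p_values)
tokens = weighted_embedding(genotypes, table, weights)
context, attention = self_attention(tokens)
```

In this sketch, SNPs with stronger GWAS signals keep larger embedding norms, so they tend to dominate the dot products that determine the attention weights, and every position can attend to every other position regardless of distance along the sequence.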

Data availability

The genotype and phenotype data for the soybean1861, soybean192, maize244, wheat406, rice529, and soybean14460 datasets, together with the environmental data for soybean1861, are deposited on Zenodo [https://zenodo.org/records/18779208]. Source data are provided with this paper.

Code availability

The GP-WAITER scripts are available in the release package on GitHub [https://github.com/snowo-w/GP-WAITER/] under the Apache License. The specific version (v1.0.0) used for this study has been archived on Zenodo [https://doi.org/10.5281/zenodo.18809685] (ref. 54).

References

  1. Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10, 381–391 (2009).

  2. Xu, Y. et al. Smart breeding driven by big data, artificial intelligence, and integrated genomic-enviromic prediction. Mol. Plant 15, 1664–1695 (2022).

  3. Alemu, A. et al. Genomic selection in plant breeding: key factors shaping two decades of progress. Mol. Plant 17, 552–578 (2024).

  4. Endelman, J. B. Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome 4, 250–255 (2011).

  5. Zhao, T. et al. Integration of eQTL and machine learning to dissect causal genes with pleiotropic effects in genetic regulation networks of seed cotton yield. Cell Rep. 42, 113111 (2023).

  6. Wu, Y. et al. Phylogenomic discovery of deleterious mutations facilitates hybrid potato breeding. Cell 186, 2313–2328 (2023).

  7. Long, N. et al. Application of support vector regression to genome-assisted prediction of quantitative traits. Theor. Appl. Genet. 123, 1065–1074 (2011).

  8. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD) 785–794 (2016).

  9. Yan, J. et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 22, 271 (2021).

  10. Ma, W. et al. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248, 1307–1318 (2018).

  11. Wang, K. et al. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol. Plant 16, 279–293 (2023).

  12. Wang, H. et al. Cropformer: an interpretable deep learning framework for crop genomic prediction. Plant Commun. 6, 101223 (2025).

  13. Wu, C. et al. A transformer-based genomic prediction method fused with knowledge-guided module. Brief. Bioinform. 25, 1–11 (2023).

  14. Deng, P. et al. DPCformer: an interpretable deep learning model for genomic prediction in crops. arXiv preprint arXiv:2510.08662 (2025).

  15. Ma, C. et al. Machine learning–based differential network analysis: a study of stress-responsive transcriptomes in Arabidopsis. Plant Cell 26, 520–537 (2014).

  16. Abdollahi-Arpanahi, R., Gianola, D. & Peñagaricano, F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet. Sel. Evol. 52, 12 (2020).

  17. Spindel, J. E. et al. Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity 116, 395–408 (2016).

  18. Jubair, S. et al. GPTransformer: a transformer-based deep learning method for predicting Fusarium-related traits in barley. Front. Plant Sci. 12, 761402 (2021).

  19. Li, J. et al. Natural variation of domestication-related genes contributed to latitudinal expansion and adaptation in soybean. BMC Plant Biol. 24, 651 (2024).

  20. Zhang, Z. et al. Improving the accuracy of whole genome prediction for complex traits using the results of genome wide association studies. PLoS ONE 9, e93017 (2014).

  21. Li, B. et al. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front. Genet. 9, 237 (2018).

  22. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).

  23. Choi, S. R. & Lee, M. Transformer architecture and attention mechanisms in genome data analysis: a comprehensive review. Biology 12, 1033 (2023).

  24. Benegas, G. et al. A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat. Biotechnol. 43, 1960–1965 (2025).

  25. Lin, F. et al. MMST-ViT: climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) 5774–5784 (2023).

  26. Xiong, X. et al. Daily DeepCropNet: a hierarchical deep learning approach with daily time series of vegetation indices and climatic variables for corn yield estimation. ISPRS J. Photogramm. Remote Sens. 209, 249–264 (2024).

  27. Xu, Y., Ma, Y. & Zhang, Z. Self-supervised pre-training for large-scale crop mapping using Sentinel-2 time series. ISPRS J. Photogramm. Remote Sens. 207, 312–325 (2024).

  28. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  29. Dalla-Torre, H. et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 22, 287–297 (2025).

  30. Hollmann, N. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025).

  31. Consens, M. E. et al. Transformers and genome language models. Nat. Mach. Intell. 7, 346–362 (2025).

  32. Cai, Z. et al. MOTHER-OF-FT-AND-TFL1 regulates the seed oil and protein content in soybean. New Phytol. 239, 905–919 (2023).

  33. Duan, Z. et al. Natural allelic variation of GmST05 controlling seed size and quality in soybean. Plant Biotechnol. J. 20, 1807–1818 (2022).

  34. Zhang, C. et al. High-quality genome of a modern soybean cultivar and resequencing of 547 accessions provide insights into the role of structural variation. Nat. Genet. 56, 2247–2258 (2024).

  35. Wang, M. et al. Parallel selection on a dormancy gene during domestication of crops from multiple families. Nat. Genet. 50, 1435–1441 (2018).

  36. Crossa, J. et al. Expanding genomic prediction in plant breeding: harnessing big data, machine learning, and advanced software. Trends Plant Sci. 30, 756–774 (2025).

  37. He, K. et al. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2016, 770–778 (2016).

  38. Balduzzi, D. et al. The shattered gradients problem: if ResNets are the answer, then what is the question? Proc. 34th Int. Conf. Mach. Learn. 70, 342–350 (2017).

  39. Clauwaert, J., Menschaert, G. & Waegeman, W. Explainability in transformer models for functional genomics. Brief. Bioinform. 22, 1–11 (2021).

  40. Feng, Y. et al. Dual-function C2H2-type zinc-finger transcription factor GmZFP7 contributes to isoflavone accumulation in soybean. New Phytol. 237, 1794–1809 (2023).

  41. Liu, Y. et al. An R2R3-type MYB transcription factor, GmMYB77, negatively regulates isoflavone accumulation in soybean [Glycine max (L.) Merr.]. Plant Biotechnol. J. 23, 824–838 (2025).

  42. Li, Y. et al. Genome-wide signatures of the geographic expansion and breeding of soybean. Sci. China Life Sci. 66, 350–365 (2023).

  43. Azam, M. et al. Seed isoflavone profiling of 1168 soybean accessions from major growing ecoregions in China. Food Res. Int. 130, 108957 (2020).

  44. Abdelghany, A. M. et al. Profiling of seed fatty acid composition in 1025 Chinese soybean accessions from diverse ecoregions. Crop J. 8, 635–644 (2020).

  45. Sun, J. et al. Rapid HPLC method for determination of 12 isoflavone components in soybean seeds. Agric. Sci. China 10, 70–77 (2011).

  46. Ghosh, S. et al. Seed tocopherol assessment and geographical distribution of 1151 Chinese soybean accessions from diverse ecoregions. J. Food Compos. Anal. 100, 103932 (2021).

  47. Qi, J. et al. Profiling seed soluble sugar compositions in 1164 Chinese soybean accessions from major growing ecoregions. Crop J. 10, 1825–1831 (2022).

  48. Agyenim-Boateng, K. G. et al. Profiling of naturally occurring folates in a diverse soybean germplasm by HPLC-MS/MS. Food Chem. 384, 132520 (2022).

  49. Gebregziabher, B. S. et al. Identification of genomic regions and candidate genes underlying carotenoid accumulation in soybean using next-generation sequencing based bulk segregant analysis. J. Integr. Agric. 24, 2063–2079 (2025).

  50. Agyenim-Boateng, K. G. et al. Identification of quantitative trait loci and candidate genes for seed folate content in soybean. Theor. Appl. Genet. 136, 149 (2023).

  51. Li, Y. et al. Study on multi-environment genome-wide prediction of inbred agronomic traits in maize natural populations. Chin. Bull. Bot. 59, 1041 (2024).

  52. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

  53. Browning, B. L. & Browning, S. R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).

  54. Li, J. et al. Leveraging weighted embedding and Transformer architecture to improve phenotype prediction of complex traits for crops. bioRxiv https://doi.org/10.5281/zenodo.18809685 (2026).

Acknowledgements

This work was supported by the Biological Breeding-National Science and Technology Major Project (2023ZD0403301 to J.L.) and the National Natural Science Foundation of China (32272178 to J.S., 32472193 to B.L., 32001574 to J.L.).

Author information

Author notes
  1. These authors contributed equally: Jing Li, Linfeng Yu, Mengfan Li.

Authors and Affiliations

  1. The State Key Laboratory of Crop Gene Resources and Breeding, National Engineering Laboratory for Crop Molecular Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China

    Jing Li, Linfeng Yu, Yecheng Li, Zhaoyi Hao, Yitian Liu, Bin Li, Shengrui Zhang, Liang Li, Lijuan Qiu & Junming Sun

  2. Institute of Environment and Sustainable Development in Agriculture, Chinese Academy of Agricultural Sciences, Beijing, China

    Mengfan Li & Rui Han

  3. Department of Agronomy, Bayero University Kano, Kano, Nigeria

    Abdulwahab Saliu Shaibu

  4. DynaMo Center, Department of Plant and Environmental Sciences, University of Copenhagen, Frederiksberg, Denmark

    Kwadwo Gyapong Agyenim-Boateng

Contributions

J.L. and J.S. designed the experiments and managed the project; L.Y. and J.L. analysed the data and wrote the manuscript; M.L., R.H., Yecheng Li, Z.H., and Yitian Liu performed part of the work; B.L., S.Z., and L.L. collected the dataset and performed data preprocessing; J.L., J.S., L.Q., A.S.S., and K.G.A.-B. revised and edited the manuscript; and all authors read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Jing Li, Lijuan Qiu or Junming Sun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Lanzhi Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Peer Review file (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Data 1 (XLSX)

Supplementary Data 2 (XLSX)

Supplementary Data 3 (XLSX)

Supplementary Data 4 (XLSX)

Supplementary Data 5 (XLSX)

Supplementary Data 6 (XLSX)

Supplementary Data 7 (XLSX)

Supplementary Data 8 (XLSX)

Supplementary Data 9 (XLSX)

Supplementary Data 10 (XLSX)

Supplementary Data 11 (XLSX)

Supplementary Data 12 (XLSX)

Reporting Summary (PDF)

Source data

Source Data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, J., Yu, L., Li, M. et al. Leveraging weighted embedding and Transformer architecture to improve phenotype prediction of complex traits for crops. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71035-5

  • Received: 07 August 2025

  • Accepted: 11 March 2026

  • Published: 26 March 2026

  • DOI: https://doi.org/10.1038/s41467-026-71035-5
