Abstract
Understanding the relationship between genomic variation and phenotype is fundamental to deciphering the genetic architecture underlying complex traits. Yet existing statistical models struggle to reconcile the scale of massive genomic datasets with biological interpretability. Here, we introduce GP-WAITER, a deep learning framework that integrates GWAS-derived SNP weights into a hybrid convolutional neural network and Transformer architecture. Using a weighted embedding mechanism and multi-head self-attention, GP-WAITER captures long-range dependencies across ultra-long genomic sequences. The model consistently outperforms seven state-of-the-art genomic prediction models across six datasets, achieving up to a 77.5% improvement in prediction accuracy, a 78% reduction in mean squared error, and a 1.8- to 2.4-fold increase in computational efficiency. Furthermore, GP-WAITER offers biological transparency by pinpointing key genetic variants driving specific traits. This scalable, interpretable framework provides a powerful tool for precision breeding and the functional interpretation of trait-associated variants.
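The weighted embedding and self-attention ideas described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the GP-WAITER implementation: the weighting rule (-log10 of the GWAS p-value, scaled by its maximum), the random embedding table, the toy genotypes and p-values, and all function names are hypothetical. Attention here is single-head with identity projections for brevity, whereas the abstract describes multi-head self-attention combined with convolutional layers.

```python
import math
import random

def gwas_weights(p_values):
    """Assumed rule: -log10(p) scaled by its maximum, giving weights in (0, 1] for p < 1."""
    scores = [-math.log10(p) for p in p_values]
    top = max(scores)
    return [s / top for s in scores]

def weighted_embedding(genotypes, weights, table):
    """Look up one vector per SNP genotype (0/1/2) and scale it by that SNP's weight."""
    return [[w * x for x in table[g]] for g, w in zip(genotypes, weights)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out, attn = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        probs = softmax(scores)
        attn.append(probs)
        # Each output token is the attention-weighted average of all tokens.
        out.append([sum(p * v[j] for p, v in zip(probs, tokens)) for j in range(d)])
    return out, attn

random.seed(0)
dim = 4
# Hypothetical embedding table: one random vector per genotype code.
table = {g: [random.gauss(0, 1) for _ in range(dim)] for g in (0, 1, 2)}
genotypes = [0, 2, 1, 1, 0]                # toy SNP calls for one individual
p_values = [0.5, 1e-8, 0.03, 1e-4, 0.9]    # toy GWAS p-values, one per SNP
weights = gwas_weights(p_values)
tokens = weighted_embedding(genotypes, weights, table)
context, attention = self_attention(tokens)
```

In this sketch, the SNP with the smallest p-value receives weight 1 and keeps its full embedding, while weakly associated SNPs are shrunk toward zero before attention mixes information across positions. Each row of `attention` sums to 1 and can be read as how strongly one SNP position attends to every other.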
Data availability
The genotype and phenotype data for the soybean1861, soybean192, maize244, wheat406, rice529, and soybean14460 datasets, together with the environmental data for soybean1861, are deposited on Zenodo [https://zenodo.org/records/18779208]. Source data are provided with this paper.
Code availability
The GP-WAITER scripts are available in the release package on GitHub [https://github.com/snowo-w/GP-WAITER/] under the Apache License. The specific version (v1.0.0) used for this study has been archived on Zenodo [https://doi.org/10.5281/zenodo.18809685] (ref. 54).
References
Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10, 381–391 (2009).
Xu, Y. et al. Smart breeding driven by big data, artificial intelligence, and integrated genomic-enviromic prediction. Mol. Plant 15, 1664–1695 (2022).
Alemu, A. et al. Genomic selection in plant breeding: key factors shaping two decades of progress. Mol. Plant 17, 552–578 (2024).
Endelman, J. B. Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome 4, 250–255 (2011).
Zhao, T. et al. Integration of eQTL and machine learning to dissect causal genes with pleiotropic effects in genetic regulation networks of seed cotton yield. Cell Rep. 42, 113111 (2023).
Wu, Y. et al. Phylogenomic discovery of deleterious mutations facilitates hybrid potato breeding. Cell 186, 2313–2328 (2023).
Long, N. et al. Application of support vector regression to genome-assisted prediction of quantitative traits. Theor. Appl. Genet. 123, 1065–1074 (2011).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD) 785–794 (2016).
Yan, J. et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 22, 271 (2021).
Ma, W. et al. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248, 1307–1318 (2018).
Wang, K. et al. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol. Plant 16, 279–293 (2023).
Wang, H. et al. Cropformer: an interpretable deep learning framework for crop genomic prediction. Plant Commun. 6, 101223 (2025).
Wu, C. et al. A transformer-based genomic prediction method fused with knowledge-guided module. Brief. Bioinform. 25, 1–11 (2023).
Deng, P. et al. DPCformer: an interpretable deep learning model for genomic prediction in crops. arXiv preprint arXiv:2510.08662 (2025).
Ma, C. et al. Machine learning–based differential network analysis: a study of stress-responsive transcriptomes in Arabidopsis. Plant Cell 26, 520–537 (2014).
Abdollahi-Arpanahi, R., Gianola, D. & Peñagaricano, F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet. Sel. Evol. 52, 12 (2020).
Spindel, J. E. et al. Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity 116, 395–408 (2016).
Jubair, S. et al. GPTransformer: a transformer-based deep learning method for predicting Fusarium-related traits in barley. Front. Plant Sci. 12, 761402 (2021).
Li, J. et al. Natural variation of domestication-related genes contributed to latitudinal expansion and adaptation in soybean. BMC Plant Biol. 24, 651 (2024).
Zhang, Z. et al. Improving the accuracy of whole genome prediction for complex traits using the results of genome wide association studies. PLoS ONE 9, e93017 (2014).
Li, B. et al. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front. Genet. 9, 237 (2018).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Choi, S. R. & Lee, M. Transformer architecture and attention mechanisms in genome data analysis: a comprehensive review. Biology 12, 1033 (2023).
Benegas, G. et al. A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat. Biotechnol. 43, 1960–1965 (2025).
Lin, F. et al. MMST-ViT: climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) 5774–5784 (2023).
Xiong, X. et al. Daily DeepCropNet: a hierarchical deep learning approach with daily time series of vegetation indices and climatic variables for corn yield estimation. ISPRS J. Photogramm. Remote Sens. 209, 249–264 (2024).
Xu, Y., Ma, Y. & Zhang, Z. Self-supervised pre-training for large-scale crop mapping using Sentinel-2 time series. ISPRS J. Photogramm. Remote Sens. 207, 312–325 (2024).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Dalla-Torre, H. et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 22, 287–297 (2025).
Hollmann, N. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025).
Consens, M. E. et al. Transformers and genome language models. Nat. Mach. Intell. 7, 346–362 (2025).
Cai, Z. et al. MOTHER-OF-FT-AND-TFL1 regulates the seed oil and protein content in soybean. New Phytol. 239, 905–919 (2023).
Duan, Z. et al. Natural allelic variation of GmST05 controlling seed size and quality in soybean. Plant Biotechnol. J. 20, 1807–1818 (2022).
Zhang, C. et al. High-quality genome of a modern soybean cultivar and resequencing of 547 accessions provide insights into the role of structural variation. Nat. Genet. 56, 2247–2258 (2024).
Wang, M. et al. Parallel selection on a dormancy gene during domestication of crops from multiple families. Nat. Genet. 50, 1435–1441 (2018).
Crossa, J. et al. Expanding genomic prediction in plant breeding: harnessing big data, machine learning, and advanced software. Trends Plant Sci. 30, 756–774 (2025).
He, K. et al. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2016, 770–778 (2016).
Balduzzi, D. et al. The shattered gradients problem: if ResNets are the answer, then what is the question? Proc. 34th Int. Conf. Mach. Learn. 70, 342–350 (2017).
Clauwaert, J., Menschaert, G. & Waegeman, W. Explainability in transformer models for functional genomics. Brief. Bioinform. 22, 1–11 (2021).
Feng, Y. et al. Dual-function C2H2-type zinc-finger transcription factor GmZFP7 contributes to isoflavone accumulation in soybean. New Phytol. 237, 1794–1809 (2023).
Liu, Y. et al. An R2R3-type MYB transcription factor, GmMYB77, negatively regulates isoflavone accumulation in soybean [Glycine max (L.) Merr.]. Plant Biotechnol. J. 23, 824–838 (2025).
Li, Y. et al. Genome-wide signatures of the geographic expansion and breeding of soybean. Sci. China Life Sci. 66, 350–365 (2023).
Azam, M. et al. Seed isoflavone profiling of 1168 soybean accessions from major growing ecoregions in China. Food Res. Int. 130, 108957 (2020).
Abdelghany, A. M. et al. Profiling of seed fatty acid composition in 1025 Chinese soybean accessions from diverse ecoregions. Crop J. 8, 635–644 (2020).
Sun, J. et al. Rapid HPLC method for determination of 12 isoflavone components in soybean seeds. Agric. Sci. China 10, 70–77 (2011).
Ghosh, S. et al. Seed tocopherol assessment and geographical distribution of 1151 Chinese soybean accessions from diverse ecoregions. J. Food Compos. Anal. 100, 103932 (2021).
Qi, J. et al. Profiling seed soluble sugar compositions in 1164 Chinese soybean accessions from major growing ecoregions. Crop J. 10, 1825–1831 (2022).
Agyenim-Boateng, K. G. et al. Profiling of naturally occurring folates in a diverse soybean germplasm by HPLC-MS/MS. Food Chem. 384, 132520 (2022).
Gebregziabher, B. S. et al. Identification of genomic regions and candidate genes underlying carotenoid accumulation in soybean using next-generation sequencing based bulk segregant analysis. J. Integr. Agric. 24, 2063–2079 (2025).
Agyenim-Boateng, K. G. et al. Identification of quantitative trait loci and candidate genes for seed folate content in soybean. Theor. Appl. Genet. 136, 149 (2023).
Li, Y. et al. Study on multi-environment genome-wide prediction of inbred agronomic traits in maize natural populations. Chin. Bull. Bot. 59, 1041 (2024).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Browning, B. L. & Browning, S. R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).
Li, J. et al. Leveraging weighted embedding and Transformer architecture to improve phenotype prediction of complex traits for crops. bioRxiv https://doi.org/10.5281/zenodo.18809685 (2026).
Acknowledgements
This work was supported by the Biological Breeding-National Science and Technology Major Project (2023ZD0403301 to J.L.) and the National Natural Science Foundation of China (32272178 to J.S., 32472193 to B.L., and 32001574 to J.L.).
Author information
Contributions
J.L. and J.S. designed the experiments and managed the project; L.Y. and J.L. analysed the data and wrote the manuscript; M.L., R.H., Yecheng Li, Z.H., and Yitian Liu performed part of the work; B.L., S.Z., and L.L. collected the dataset and performed data preprocessing; J.L., J.S., L.Q., A.S.S., and K.G.A.-B. revised and edited the manuscript; and all authors read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Lanzhi Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, J., Yu, L., Li, M. et al. Leveraging weighted embedding and Transformer architecture to improve phenotype prediction of complex traits for crops. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71035-5