Introduction

Poplar is one of the fastest-growing temperate trees worldwide and serves as an important source of wood and biomass energy. The worldwide distribution of natural poplar forests covers ~54.5 million hectares, while the total area of cultivated poplar plantations reaches 31.4 million hectares1. With the increasing scale of poplar cultivation, breeding germplasm with fast growth, high quality, and strong disease resistance has become a research focus. Traditional poplar breeding strategies include establishing base populations, selecting appropriate parents for intra- or interspecific hybridization, screening superior individuals in progeny, and expanding them through asexual propagation2,3. Natural populations provide rich genetic material for plant breeding, and their genetic diversity suggests that some individuals may carry commercially valuable genes4. The collection of germplasm resources and the establishment of core collections can enhance the utilization of these genetic resources. Several studies have focused on constructing core germplasm banks of poplar and assessing their genetic diversity5,6. However, due to the long generation cycles of forest trees, completing a breeding generation typically takes several to dozens of years. To meet the increasing demand for wood and biomass energy, accelerating the poplar breeding process has become increasingly critical.

With advancements in sequencing technology, it is now possible to efficiently and cost-effectively obtain dense and reliable genetic markers that cover the entire genome. Various methods have been widely employed in the genetic dissection and breeding of important traits in plants, achieving significant progress. Genome-wide association studies (GWAS) are a powerful tool for investigating complex traits using genome-wide single nucleotide polymorphisms (SNPs), and multiple statistical models have been developed to enhance computational efficiency and detection capability7. Association studies have revealed the genetic architecture of growth traits in poplar8. Using genome-wide association studies, genes associated with sex determination in poplar have been identified, marking a significant advancement in understanding the mechanisms of plant sex differentiation9. In terms of disease resistance traits, key genes associated with poplar’s resistance to leaf rust were identified through linkage analysis, providing a critical molecular foundation for breeding disease-resistant poplar varieties10. In Eucalyptus, SNPs associated with growth and wood properties identified through GWAS (top 1% or P < 10−3) significantly improved the prediction accuracy for all traits11. Association studies not only enhance our understanding of the genetic mechanisms underlying important traits in trees but also lay the groundwork for accelerating molecular breeding efforts in forestry.

Trees typically enter their reproductive phase only after several years of growth. Genomic selection (GS), which utilizes dense markers covering the entire genome to predict genetic value for selection, significantly reduces the time required for testing hybrid progeny in forests, thereby enhancing the genetic gain and accelerating the breeding process in plants12. For instance, the phenotypic selection cycle for oil palm requires 19 years, whereas the genomic selection cycle takes only 6 years13. Specifically, a model is constructed using a training population with known genotypes and phenotypes to estimate marker effects, allowing the prediction of breeding values for candidate individuals based solely on genotype data14. Prediction accuracy is influenced by the genetic relationship between the training and prediction populations15. Additionally, factors such as effective population size, genotype density, training population size, genetic complexity, and statistical methods also affect the accuracy of genomic selection16,17.
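The train-then-predict workflow described above can be sketched in a few lines. The example below uses plain ridge regression as a simple, non-Bayesian stand-in for the marker-effect models discussed in this paper; the 0/1/2 genotype coding, the toy data, and the shrinkage parameter `lam` are illustrative assumptions rather than details of this study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ridge(genotypes, phenotypes, lam=1.0):
    """Estimate marker effects from a training population.

    Solves (X'X + lam * I) b = X'y on centered genotypes and phenotypes.
    """
    n_ind, n_markers = len(genotypes), len(genotypes[0])
    mean_y = sum(phenotypes) / n_ind
    col_means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_markers)]
    xc = [[row[j] - col_means[j] for j in range(n_markers)] for row in genotypes]
    yc = [y - mean_y for y in phenotypes]
    xtx = [[sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0)
            for j in range(n_markers)] for i in range(n_markers)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(n_markers)]
    return mean_y, col_means, solve_linear_system(xtx, xty)


def predict(genotype, mean_y, col_means, effects):
    """Predict a genetic value from genotype data alone."""
    return mean_y + sum((g - m) * e for g, m, e in zip(genotype, col_means, effects))


# Hypothetical training population: 6 individuals x 3 markers (allele counts).
train_genotypes = [[0, 0, 0], [1, 0, 0], [2, 1, 0], [0, 2, 1], [1, 1, 2], [2, 2, 1]]
# Toy phenotypes generated as 10 + 2*marker1 - 1*marker2 (marker3 has no effect).
train_phenotypes = [10.0, 12.0, 13.0, 8.0, 11.0, 12.0]

# A very small lam is used so the toy effects are recovered almost exactly.
mean_y, col_means, effects = fit_ridge(train_genotypes, train_phenotypes, lam=1e-8)
predicted = predict([2, 0, 1], mean_y, col_means, effects)  # unphenotyped candidate
```

In practice the number of markers far exceeds the number of individuals, which is why shrinkage (a larger `lam`, or the priors of Bayesian models) is needed; the tiny `lam` here only serves to make the toy result easy to check.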

Despite the advantages of including GWAS in genomic selection, the application of these methods in poplar genetic breeding remains to be further explored. In this study, we aimed (1) to systematically characterize ten traits related to growth, wood properties, disease resistance, and leaf morphology in 237 poplar germplasm accessions; (2) to identify significant SNP loci associated with traits through genome resequencing and genome-wide association studies; (3) to analyze the effects of alleles at quantitative trait loci (QTL) on phenotypes and elucidate the relationship between favorable alleles and trait values; and (4) to assess the impact of incorporating multi-trait QTL markers on the prediction accuracy of genomic selection. Our study provides valuable insights into the genetic basis of growth, wood properties, disease resistance, and leaf morphology in poplar and significantly advances the development of genetic breeding in poplar.

Results

Phenotypic variation analysis of growth, wood properties, disease resistance, and leaf morphology in P. deltoides

The growth, wood properties, disease resistance, and leaf morphology of P. deltoides were systematically analyzed, focusing on phenotypic traits including diameter at breast height (DBH), basic density (BD), hemicellulose content, cellulose content, lignin content, black spot disease (BSD) infection rate, leaf area (LA), leaf length (LL), leaf width (LW), and leaf vein angle (LVA). Frequency distribution analysis indicated that all traits followed a bell-shaped curve, suggesting an approximate normal distribution was suitable for subsequent analysis (Fig. 1, Supplementary Table 1). The Shapiro–Wilk test results indicated that DBH, BD, and LVA followed a normal distribution (P > 0.05). In terms of growth traits, the average DBH was 37.5 ± 6.26 cm, with a range from 21.4 cm to 54.1 cm, highlighting significant variability among the individuals and suggesting a notable impact of growth conditions and genetic background. Regarding wood properties, the average BD was 0.40 ± 0.03 g/cm³, ranging from 0.31 g/cm³ to 0.49 g/cm³, indicating substantial potential for variation in basic density, likely associated with environmental adaptability and genetic factors. Furthermore, the average contents of hemicellulose, cellulose, and lignin were 23.48 ± 2.07%, 54.77 ± 2.66%, and 12.93 ± 2.54%, respectively, demonstrating significant differences in wood composition among individuals, thereby providing foundational data for wood utilization and improvement. For disease resistance, the average BSD was 16.35 ± 12.02%, with a range from 0.62% to 75.51%, indicating considerable individual variation in disease resistance, which may influence performance in diverse disease environments. In the leaf morphological trait analysis, the averages for LA, LL, LW, and LVA were 99.65 ± 31.01 cm², 11.99 ± 1.68 cm, 12.30 ± 1.9 cm, and 64.17 ± 7.58°, with respective ranges of 46.57 cm² to 214.78 cm², 8.83 cm to 18.45 cm, 8.71 cm to 18.04 cm, and 47.63° to 90.31°. 
These results reveal substantial variability in leaf-related traits, suggesting potential directions for further genetic improvement. Overall, the coefficients of variation (CV) for the ten traits ranged from 4.86% to 73.49%, with BSD exhibiting the highest CV (73.49%) and cellulose the lowest (4.86%), further emphasizing the differing degrees of phenotypic variation among traits.
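As a quick check on these figures, the coefficient of variation can be recomputed as CV = 100 × SD / mean. The sketch below uses only the rounded means and standard deviations quoted in the text, so the results may differ slightly from values computed on the raw data.

```python
# Trait: (mean, standard deviation), as quoted in the text above.
reported = {
    "DBH": (37.5, 6.26),
    "BD": (0.40, 0.03),
    "Hemicellulose": (23.48, 2.07),
    "Cellulose": (54.77, 2.66),
    "Lignin": (12.93, 2.54),
    "BSD": (16.35, 12.02),
    "LA": (99.65, 31.01),
    "LL": (11.99, 1.68),
    "LW": (12.30, 1.90),
    "LVA": (64.17, 7.58),
}


def coefficient_of_variation(mean, sd):
    """Return the coefficient of variation as a percentage."""
    return 100.0 * sd / mean


cv = {trait: round(coefficient_of_variation(m, s), 2) for trait, (m, s) in reported.items()}
lowest = min(cv, key=cv.get)    # trait with the smallest relative variation
highest = max(cv, key=cv.get)   # trait with the largest relative variation

for trait, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"{trait}: {value}%")
```

With these rounded summaries, the smallest CV comes out at about 4.86% and the largest at about 73.5%.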

Fig. 1: Phenotypic variation analysis in growth, wood properties, disease resistance, and leaf morphology.

Frequency distribution of diameter at breast height (DBH), basic density (BD), hemicellulose (Hemi), cellulose (Cellu), lignin, infection rate of black spot disease (BSD), leaf area (LA), leaf length (LL), leaf width (LW), and leaf vein angle (LVA).

Estimates of the narrow-sense heritability indicated clear differences in the genetic background of various traits (Supplementary Table 1), with the genetic contributions being 21.58% for DBH, 50.00% for BD, 9.28% for hemicellulose, 27.14% for cellulose, 6.23% for lignin, 66.51% for BSD, 26.43% for LA, 16.88% for LL, 30.46% for LW, and 66.84% for LVA. Correlation analysis revealed significant relationships among traits (Supplementary Fig. 1), with a significant positive correlation between DBH and BD (r = 0.39), and a significant negative correlation between BD and hemicellulose (r = –0.14). Additionally, lignin showed significant negative correlations with both hemicellulose (r = –0.54) and cellulose (r = –0.66). Furthermore, there were strong positive correlations among LA, LL, and LW, while BSD exhibited significant negative correlations with LA (r = –0.22), LL (r = –0.18), and LW (r = –0.17). These findings provide important theoretical foundations for understanding the genetic characteristics and phenotypic values of P. deltoides.

Identification of QTL for complex quantitative traits

We performed re-sequencing on 237 samples, yielding a total of 1375 GB of high-quality clean data, with each individual having over 5 GB of clean data. The sequencing data exhibited high quality, with Q30 values surpassing 90.47% (Supplementary Data 1). After aligning to the P. deltoides reference genome and filtering out low-quality and linkage disequilibrium markers, a total of 685,181 SNPs that were evenly distributed across 19 chromosomes were identified (Supplementary Fig. 2). Based on geographic location, these 237 samples were categorized into three subgroups. The calculated linkage disequilibrium (LD) decay rates showed that the average decay distances for the Louisiana, Mississippi, and Texas subgroups were 4.1 kb, 6.9 kb, and 9.1 kb, respectively, with an overall average decay distance of 6.7 kb. The LD decay distance in the Louisiana subgroup was therefore shorter than in the other subgroups, possibly due to its geographical distribution characteristics. Phylogenetic tree analysis indicated that individuals from the three subgroups did not form subgroup-specific clusters. Principal component analysis (PCA) revealed that PC1 to PC3 together explained 15.65% of the genetic variation, roughly dividing the 237 accessions into three clusters. However, when the Admixture analysis was performed with K set to 2, 3, and 4, no clear population structure was detected (Supplementary Fig. 3). Furthermore, the pairwise fixation index (Fst) values computed among the geographically defined Louisiana, Mississippi, and Texas subgroups were 0.2 (Louisiana and Mississippi), 0.3 (Louisiana and Texas), and 0.4 (Texas and Mississippi), indicating a low level of genetic differentiation among these subgroups (Supplementary Fig. 4).

To identify genetic variations associated with multiple complex quantitative traits, a GWAS was conducted using filtered SNPs combined with phenotypic data for ten traits. Using a significance threshold of P < 1 × 10−5, a total of 69 significant QTLs were identified, distributed across various chromosomes and associated with the following traits: 4 QTLs for DBH, 8 QTLs for BD, 9 QTLs for hemicellulose, 7 QTLs for cellulose, 3 QTLs for lignin, 7 QTLs for BSD, 6 QTLs for LA, 11 QTLs for LL, 5 QTLs for LW, and 9 QTLs for LVA (Supplementary Fig. 5, Supplementary Table 2). Among these, four QTLs are highlighted: chr2_6037140 and chr10_9264225 for cellulose, chr1_11210638 for LA, and chr7_15696963 for BSD (Fig. 2). Haplotype analysis of the candidate genes in these QTLs showed that Hap002 of EVM0002801 had a significantly higher cellulose content than Hap001 by 1.5%, and Hap003 of EVM0030857 had a significantly lower BSD by 8.43% compared to Hap001. Certain traits, particularly LL, showed clusters of QTLs within specific chromosome regions, suggesting potential linkage disequilibrium in these areas. Co-localization analysis further revealed an overlap of QTLs associated with LW, LL, and LA on chromosome 1. Additionally, QTLs associated with cellulose and lignin overlapped on chromosome 2. These results indicate that these chromosomal regions may contain key genes regulating multiple traits.
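For orientation, the P < 1 × 10−5 threshold can be compared with a Bonferroni-corrected threshold on the −log10(P) scale used in Manhattan plots. The sketch below assumes a family-wise error rate of 0.05 and takes the number of tests to be the 685,181 filtered SNPs; both are assumptions for illustration, since the text does not state how the corrected threshold was computed.

```python
import math

n_snps = 685_181   # filtered SNPs reported in the text
alpha = 0.05       # assumed family-wise error rate

bonferroni_p = alpha / n_snps                 # corrected per-SNP P-value threshold
bonferroni_logp = -math.log10(bonferroni_p)   # same threshold on the -log10(P) scale
suggestive_logp = -math.log10(1e-5)           # threshold used for the 69 QTLs

print(f"Bonferroni: P < {bonferroni_p:.2e} (-log10 P = {bonferroni_logp:.2f})")
print(f"Suggestive: P < 1e-05 (-log10 P = {suggestive_logp:.2f})")
```

Under these assumptions the corrected threshold is roughly two orders of magnitude stricter than 1 × 10−5.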

Fig. 2: Genome‑wide association study (GWAS) for ten traits in poplar.

a Manhattan plots of GWAS results for diameter at breast height (DBH), basic density (BD), hemicellulose, cellulose, lignin, infection rate of black spot disease (BSD), leaf area (LA), leaf length (LL), leaf width (LW), and leaf vein angle (LVA). b Linkage disequilibrium analysis of SNPs associated with cellulose and BSD. c Haplotype analysis of EVM0002801 and EVM0030857. The horizontal solid red line indicates the Bonferroni-corrected significance threshold, and the horizontal dashed red line indicates the standard significance threshold (-log10(P) = 5). A two-sided t-test was performed to compare different haplotypes in terms of their associated phenotypic variations. *P < 0.05; **P < 0.01.

A total of 130 annotated genes were located in the 69 identified significant QTL regions (Fig. 3, Supplementary Data 2), which were distributed across various chromosomes and implicated in the genetic regulation of important traits related to plant growth, wood properties, disease resistance, and leaf morphology. Gene Ontology (GO) enrichment analysis of these 130 candidate genes revealed significant enrichment in terms related to receptor serine/threonine kinase binding, cytoplasmic translation, magnesium ion transmembrane transport, dynein intermediate chain binding, chloroplast envelope, early endosome, cytoplasmic dynein complex, magnesium ion transmembrane transporter activity, malate transport, tryptophan biosynthetic process, etc. (Supplementary Fig. 6). These results suggest that the identified candidate genes may play essential roles in various cellular and metabolic processes. The key genes EVM0002801 and EVM0030857, identified as candidate regulators of cellulose content and BSD, were annotated as a late embryogenesis abundant protein and a uridine nucleosidase, respectively. The co-effect loci associated with LW, LL, and LA involve a key functional gene, EVM0007952, annotated as a MYB transcription factor, a family that plays a crucial role in regulating plant growth, development, and stress responses. Further analysis revealed that the co-effect loci associated with cellulose and lignin involve several key functional genes, including EVM0014191, EVM0002801, and EVM0037792. Specifically, EVM0014191 encodes a desiccation protectant protein Lea14 homolog, and EVM0037792 is annotated as having ubiquitin-protein transferase activity, a key function in protein degradation and cellular regulation.

Fig. 3: Genomic distribution of quantitative trait loci (QTL).

Markers in different colors indicate QTL for various traits (-log10(P) = 5), including diameter at breast height (DBH), basic density (BD), hemicellulose content, cellulose content, lignin content, infection rate of black spot disease (BSD), leaf area (LA), leaf length (LL), leaf width (LW), and leaf vein angle (LVA). Black lines represent genes identified within the QTL regions. The heatmap shows the distribution density of genes on chromosomes, with red indicating high density and blue indicating low density.

Effects of QTL alleles on trait phenotypes

We investigated the effects of QTL alleles on phenotype and found significant differences in the impact of favorable and unfavorable alleles on the traits (Supplementary Data 3). Specifically, at locus chr13_1225902, allele A was associated with a significantly higher DBH than allele G. At locus chr4_5602349, the average BD was 0.40 g/cm³ for allele T and 0.36 g/cm³ for allele C. At locus chr11_16124994, the average hemicellulose content of allele A was 27.59%, compared to 23.34% for allele G. At locus chr2_7224508, allele A was associated with lower cellulose content, whereas allele C was associated with higher cellulose content. At locus chr1_27799505, the average lignin content of allele G was 13.25%, while that of allele A was 11.95%. Additionally, at locus chr7_10250865, the average BSD for allele T was 15.02%, compared to 29.66% for allele C. Furthermore, allele A at locus chr17_10568531, allele A at locus chr3_9907822, allele C at locus chr5_12597886, and allele A at locus chr13_15731917 exhibited higher mean values for LA, LL, LW, and LVA, respectively.

We further analyzed the influence of the number of favorable alleles on trait phenotypes. The mean favorable allele frequency (MFAF) per individual among the 237 genotypes ranged from 0.03 to 0.74 across the ten traits, with the highest MFAF of 0.74 for BSD and the lowest MFAF of 0.03 for LVA (Fig. 4). This result indicates that the relatively small population size and the low MFAF for certain traits may have led to an overestimation of their effect sizes. Correlation analyses indicated significant associations between the number of favorable alleles and trait phenotypes. For instance, the number of favorable alleles in each individual was positively correlated with DBH, BD, and LA, with correlation coefficients of 0.37, 0.48, and 0.42, respectively, suggesting that individuals with a higher number of favorable alleles exhibited larger phenotypic values. Similarly, the number of favorable alleles showed a significant negative correlation with BSD, indicating that individuals with more favorable alleles had fewer black spot disease lesions and thus stronger disease resistance. Overall, the number of favorable alleles present in individuals significantly influences the breeding values of traits, positioning these favorable alleles as potential targets for future genetic improvement.
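The relationship between favorable-allele counts and phenotype can be illustrated with a small sketch. The genotype encoding (two-letter strings per locus), the second locus name, and all genotype and phenotype values below are hypothetical; only the favorable allele A at chr13_1225902 for DBH is taken from the text.

```python
from math import sqrt


def count_favorable(genotype, favorable):
    """Count favorable alleles carried by one individual.

    genotype:  locus -> two-letter diploid genotype, e.g. "AG"
    favorable: locus -> favorable allele, e.g. "A"
    """
    return sum(genotype[locus].count(allele) for locus, allele in favorable.items())


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Favorable alleles for DBH: the first is reported in the text, the second is hypothetical.
favorable_dbh = {"chr13_1225902": "A", "hypothetical_locus": "T"}

# Hypothetical individuals and DBH values (cm).
individuals = [
    {"chr13_1225902": "GG", "hypothetical_locus": "CC"},
    {"chr13_1225902": "AG", "hypothetical_locus": "CC"},
    {"chr13_1225902": "AG", "hypothetical_locus": "CT"},
    {"chr13_1225902": "AA", "hypothetical_locus": "CT"},
    {"chr13_1225902": "AA", "hypothetical_locus": "TT"},
]
dbh = [30.1, 33.0, 36.2, 38.9, 42.5]

counts = [count_favorable(g, favorable_dbh) for g in individuals]
r = pearson(counts, dbh)
print(counts, round(r, 3))
```

For a trait where lower values are desirable, such as BSD infection rate, the same calculation would be expected to give a negative correlation.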

Fig. 4: Regression statistics between the number of favorable alleles and phenotypic data, and heatmap of favorable allele frequencies.

MFAF represents the mean frequency of favorable alleles. DBH diameter at breast height, BD basic density, BSD infection rate of black spot disease, LA leaf area, LL leaf length, LW leaf width, LVA leaf vein angle.

Inclusion of multi-trait QTLs enhances prediction accuracy in GS models

We assessed the impact of incorporating multi-trait QTLs into genomic selection models on prediction accuracy, comparing them with traditional models, namely Bayesian ridge regression (BRR), BayesA, and BayesC. The results demonstrated that incorporating multi-trait QTLs significantly improved prediction accuracy across all ten traits (Fig. 5, Supplementary Data 4). For the growth trait, integrating the 10−3sig marker set into the BRR and BayesA models resulted in the highest improvement in DBH prediction accuracy, with an increase of 0.37 (P < 0.0001). For wood properties, lignin content showed the greatest improvement in prediction accuracy: the BRR model achieved a significant increase of 0.48 with both the 10−4sig and 10−3sig marker sets, compared to the unoptimized model (P < 0.0001). For BSD, the BayesA model showed the greatest improvement in prediction accuracy, with an increase of 0.15 when incorporating the 10−5sig marker set (P < 0.001). For leaf morphological traits, the BRR model incorporating the 10−4sig marker set resulted in the highest improvement in prediction accuracy for LL and LW, with an improvement of 0.44 for both traits (P < 0.0001). Specifically, for DBH, prediction accuracies of the BRR, BayesA, and BayesC models were 0.15 ± 0.04, 0.13 ± 0.06, and 0.08 ± 0.06, respectively, and these were consistently improved by incorporating multi-trait QTLs, with the BRR model integrating the 10−3sig marker set achieving the highest accuracy of 0.52 ± 0.03. Similarly, for BSD, incorporating the 10−5sig and 10−4sig marker sets into the BRR model achieved the highest accuracy (0.52 ± 0.06), compared to baseline accuracies of 0.41 ± 0.08 (BRR), 0.36 ± 0.07 (BayesA), and 0.29 ± 0.06 (BayesC).
A comparison of the effects of different marker sets on model performance across the ten traits and the BRR, BayesA, and BayesC models showed that incorporating the 10−5sig and 10−2sig marker sets significantly improved prediction accuracy in 97% of the models, compared to the baseline models (Supplementary Data 4). In contrast, integrating 10−4sig and 10−3sig marker sets led to significantly improved prediction accuracy in all models. Multiple comparisons revealed that models incorporating the 10−3sig marker set exhibited significantly higher predictive accuracy than those using other marker sets, further highlighting the effectiveness of this approach (P < 0.05, Supplementary Fig. 7). Subsequently, multiple comparisons among the BRR, BayesA, and BayesC models demonstrated that BRR consistently outperformed both BayesA and BayesC across all ten traits, highlighting the robustness of the BRR model in leveraging multi-trait QTL information for genomic selection (P < 0.05, Supplementary Fig. 8).

Fig. 5: Comparison of prediction accuracy between genomic selection (GS) models incorporating multi-trait QTL identified by GWAS and traditional GS models (BRR, BayesA, and BayesC).

“Model” represents the GS model without optimization, “Model + 10−5sig”, “Model + 10−4sig”, “Model + 10−3sig”, and “Model + 10−2sig” refer to the GS model integrated with multi-trait QTLs identified at P < 1 × 10−5, P < 1 × 10−4, P < 1 × 10−3, and P < 1 × 10−2, respectively. DBH diameter at breast height, BD basic density, Hemic hemicellulose, Cellu cellulose, BSD infection rate of black spot disease, LA leaf area, LL leaf length, LW leaf width, LVA leaf vein angle.

Discussion

In this study, we genotyped 237 accessions from the P. deltoides germplasm plantation derived from natural populations in the southern United States to investigate the role of genetic variation in regulating complex traits such as growth, wood properties, disease resistance, and leaf morphology. Although these accessions were categorized into three subgroups based on geographical location, no definitive genetic structure was observed. This absence of distinct genetic differentiation may be attributed to the characteristics of poplar as a dioecious species with wind-mediated long-distance dispersal of pollen and seeds18. The extensive gene flow resulting from such dispersal may facilitate genetic exchange between populations, thereby accounting for the observed lack of genetic structure. However, PCA roughly separated the 237 accessions into three groups. The majority of the 237 germplasm accessions are distributed within the Mississippi River basin. River networks can function as natural corridors or barriers, influencing dispersal patterns and gene flow within populations. These localized factors may contribute to the observed population structure, potentially shaping genetic differentiation at a finer scale. This suggests that while extensive gene flow may reduce large-scale genetic differentiation, regional river networks may play a significant role in genetic structure within populations. These findings are consistent with previous studies indicating that P. deltoides individuals in Texas and Louisiana exhibit similar genetic structures at K = 2 and K = 3 (ref. 19). Furthermore, a genetic structure analysis of 384 unrelated P. deltoides individuals demonstrated that those from Missouri, Louisiana, Tennessee, and Iowa clustered together, reinforcing the idea of widespread gene exchange among populations5.

GWAS identified 4, 27, 7, and 31 significant SNPs (P < 1 × 10−5) associated with growth, wood properties, disease resistance, and leaf morphology, respectively, helping to elucidate the genetic basis of these complex traits. By leveraging recombination events in diverse germplasm throughout evolutionary history, GWAS significantly enhances the resolution for identifying alleles associated with these traits8,20. Previous association analyses of 17 half-sib families of black poplar (P. nigra), 448 wild accessions of black cottonwood (P. trichocarpa), and 435 unrelated individuals of white poplar (P. tomentosa) also indicated the presence of polygenic structures21,22,23. The identified association loci and candidate genes may vary based on the statistical methods and population characteristics. In this study, GWAS was performed using 198 germplasms, a sample size that may constrain the generalizability and robustness of the findings. Although significant association loci were identified, the relatively small population size may not fully capture the genetic structure underlying complex traits. As a result, the loci identified here may differ from those detected in broader populations. To enhance the reliability and comprehensiveness of these findings, future studies should incorporate larger sample sizes to better capture the genetic architecture underlying the traits of interest. Nevertheless, conducting GWAS still poses challenges in balancing the risks of false positives and false negatives24. Employing high-density SNP markers for association analysis proves to be more effective in obtaining valuable information22,25. Overall, our findings contribute to an expanded understanding of the genetic structures associated with various quantitative traits in poplar.

Wood formation is a highly intricate process that involves vascular cambium proliferation, differentiation of xylem mother cells, secondary wall deposition, programmed cell death, and finally heartwood formation26. EVM0002801, identified in the GWAS for cellulose and lignin, was annotated as a late embryogenesis abundant (LEA) protein. In poplar, LEA genes are essential for growth and development and play a significant role in the response to salt stress27. Expression pattern analysis of the 53 LEA genes identified in poplar revealed that 26 of these genes were expressed in the phloem under normal growth conditions28. This suggests that the EVM0002801 gene may play a role in regulating the accumulation of cellulose and lignin during wood formation. Further functional validation is needed to fully understand its regulatory mechanisms. Plants have complex mechanisms to protect themselves from pathogen attacks. Genes encoding uridine nucleosidase in grapevines were activated in response to infection by anthracnose29. Similarly, we identified the candidate gene associated with BSD as EVM0030857, with the functional annotation uridine nucleosidase. Leaf morphology represents a critical category of adaptive trait variation, making it a favorable system for evolutionary studies30. Additionally, we observed a significant negative correlation between BSD and leaf morphology (LA, LL, and LW). Previous GWAS efforts have identified key genes affecting leaf serration traits in poplar31, highlighting the pivotal role of cellular polarity in leaf morphological variation30. Notably, we found the same QTLs associated with LA, LL, and LW, and the EVM0007952 gene located within these QTLs was functionally annotated as a MYB transcription factor.
Previous studies have shown that the MYB-related gene family regulates leaf morphology in Arabidopsis32 and maize33, which is consistent with our finding that EVM0007952 belongs to this family. Overall, existing evidence indicates that the candidate genes have been directly or indirectly linked to the target traits, providing a foundation for future functional validation. The candidate genes identified in this study enhance the understanding of the genetic mechanisms underlying growth, wood properties, disease resistance, and leaf morphology traits.

Tree breeding faces numerous challenges, including long breeding cycles and large field trial areas. Therefore, breeders urgently need to explore new methods to enhance the efficiency of forest breeding. In recent years, genomic selection has emerged as a powerful tool for accelerating tree breeding17,34. Our study demonstrates that the BRR model significantly outperforms the BayesA and BayesC models in predicting ten traits. Our result is consistent with previous studies, indicating that the BRR model excels in predicting both the vertical and transverse diameters of pear fruit cores compared to BayesA, BayesB, and BayesC models35. Additionally, in the context of predicting resistance to wheat rust, the BRR and BayesL models demonstrated higher accuracy than support vector regression models36. In high-density marker datasets, the majority of SNPs exhibit neutral phenotypic effects, with only a limited number associated with specific traits16,37,38. The accuracy of genomic selection relies heavily on the assumptions regarding marker effects within the model. Recent studies focusing on tree species have indicated that selecting markers with specific effects on traits can enhance the predictive accuracy of genomic selection39,40. This enhancement has been corroborated by empirical studies across various crops and horticultural plants41,42,43,44 and supported by relevant simulation studies45. The incorporation of QTLs as fixed effects further strengthens the predictive capabilities of genomic selection43,46,47,48. In this study, the integration of multi-trait QTLs as random effects within the BRR, BayesA, and BayesC models resulted in improved predictive accuracy for DBH, BD, hemicellulose, cellulose, lignin, BSD, LA, LL, and LW.
While previous studies have demonstrated the benefits of incorporating major QTLs as fixed effects, our results indicate that incorporating multi-trait QTLs as random effects can lead to significant improvements in predictive accuracy across multiple traits. This improvement may be attributed to the presence of moderate to strong phenotypic correlations among traits, as well as the identification of overlapping QTL regions. These observations suggest that certain genomic regions may exert pleiotropic effects or contain tightly linked loci influencing multiple traits. Accordingly, selecting SNPs based on combined GWAS results across traits enabled the models to capture shared genetic signals, improving their ability to borrow information across traits during training and thereby enhancing prediction accuracy.
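The idea of selecting SNPs from combined GWAS results across traits can be sketched as a union of per-trait hits at each P-value cutoff. The function name, the data layout (trait mapped to SNP P-values), the set labels, and the toy P-values below are all hypothetical; how the selected markers then enter the BRR, BayesA, or BayesC models is not shown.

```python
# Cutoffs mirroring the four marker sets described in the Fig. 5 legend.
THRESHOLDS = {"10^-5sig": 1e-5, "10^-4sig": 1e-4, "10^-3sig": 1e-3, "10^-2sig": 1e-2}


def build_marker_sets(gwas_pvalues, thresholds=THRESHOLDS):
    """Build multi-trait marker sets from per-trait GWAS results.

    gwas_pvalues: trait -> {snp_id: p_value}
    Returns label -> sorted list of SNPs with P below the cutoff in at least one trait.
    """
    marker_sets = {}
    for label, cutoff in thresholds.items():
        selected = set()
        for snp_pvalues in gwas_pvalues.values():
            selected.update(snp for snp, p in snp_pvalues.items() if p < cutoff)
        marker_sets[label] = sorted(selected)
    return marker_sets


# Hypothetical GWAS output for two traits.
toy_gwas = {
    "cellulose": {"snp_a": 2e-6, "snp_b": 5e-4, "snp_c": 0.3},
    "lignin": {"snp_a": 8e-5, "snp_c": 4e-3, "snp_d": 0.5},
}

marker_sets = build_marker_sets(toy_gwas)
for label, snps in marker_sets.items():
    print(label, snps)
```

Because a SNP is kept if it passes the cutoff for any trait, looser cutoffs produce nested, progressively larger marker sets that pool signal across correlated traits.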

Previous research showed that selecting markers through GWAS and using the same population to construct genomic selection models reduces prediction accuracy and leads to biased estimates of predicted breeding values49. In contrast, when GWAS was conducted on three different breeds of Nordic cattle, higher accuracy was achieved in the populations used for GWAS, and the greatest improvement in genomic prediction was observed in the populations not included in the GWAS50. The linkage between markers and QTLs may not be consistent across different populations51,52. This inconsistency is not accounted for in genomic selection models for populations without preselected markers, which can lead to reduced prediction accuracy across populations. We conducted GWAS and marker selection within a single P. deltoides germplasm population, which may limit the application of these GWAS results to other poplar populations for improving genomic prediction accuracy. This limitation can be overcome by conducting GWAS across multiple populations53. In summary, including multi-trait QTLs as random effects within genomic selection models can significantly enhance the accuracy of trait predictions, thereby providing a robust approach to improving the efficiency of forest breeding programs.

In conclusion, this study identified the genomic sources of quantitative trait loci and their favorable alleles associated with growth, wood properties, disease resistance, and leaf morphology traits. These reliable markers, closely linked to the traits of interest, are significant for enhancing the accuracy of genomic selection in Populus breeding. Integrating these multi-trait QTL markers into genomic selection strategies can effectively accelerate the tree breeding process, thereby addressing the increasing demands of the timber industry.

Methods

Plant materials and phenotyping

The population of 237 germplasm accessions used in this study was derived from the Populus deltoides germplasm plantation (Supplementary Fig. 9, Supplementary Data 5). Specifically, 30, 90, and 117 germplasm accessions were collected from Texas, Louisiana, and Mississippi in the United States, respectively. The germplasm plantation was established in 1998 at the Sihong Forest Farm in Jiangsu Province, China, using a randomized block design, with a 6 × 6 m spacing and 12 ramets for each clone.

In this study, we measured ten traits related to growth performance, wood quality, disease resistance, and leaf morphology: DBH, BD, hemicellulose content, cellulose content, lignin content, BSD infection rate, LA, LL, LW, and LVA. DBH was measured using a measuring tape at 1.3 m above ground level. Wood core samples were collected from the same height of each tree, and the basic wood density was determined using the maximum moisture content method1. The collected wood core samples were oven-dried at 105 °C and ground through a 40-mesh sieve, and hemicellulose, cellulose, and lignin contents were measured using the Van Soest method54. In October of the current growing season, mature leaves were collected from each tree at different heights and orientations. Images of the leaves were captured using a Canon camera (Canon, Tokyo, Japan), arranged at a 1:1 scale, and digitized using ImageJ version 1.54f software55 to measure the BSD-infected area, leaf area, leaf length, leaf width, and leaf vein angle (Supplementary Fig. 10). The BSD infection rate was calculated as the ratio of the infected area to the total leaf area. Spatial autocorrelation may influence disease spread and severity in forest experimental plots. To address micro-spatial variation in the analysis of BSD traits, spatial heterogeneity was corrected using the R package SpATS by considering row and column variables before estimating heritability56. The narrow-sense heritability (h²) was estimated using the formula h² = VG / (VG + VE), where VG is the genotypic variance and VE is the environmental variance. Pearson correlation analysis was conducted to explore the relationships between traits using the R packages Hmisc and corrplot.
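The two ratios defined above can be illustrated with a minimal Python sketch. The function names and the input values in the usage note are hypothetical and serve only to show the arithmetic; in the study the variance components come from the fitted spatial model, not from code like this.

```python
def infection_rate(infected_area: float, total_leaf_area: float) -> float:
    """BSD infection rate: infected area divided by total leaf area (same units)."""
    if total_leaf_area <= 0:
        raise ValueError("total_leaf_area must be positive")
    return infected_area / total_leaf_area


def heritability(v_g: float, v_e: float) -> float:
    """h2 = VG / (VG + VE), following the formula given in the text."""
    total = v_g + v_e
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return v_g / total
```

For example, `heritability(3.0, 1.0)` evaluates the ratio for an illustrative genotypic variance of 3 and environmental variance of 1.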

Whole-genome resequencing and SNP detection

Leaf samples from each genotype were used for DNA extraction. Total DNA was isolated using the magnetic bead plant DNA extraction kit (Tiangen Biotech, Beijing, China). DNA quality was assessed by measuring purity with NanoDrop 2000 and 8000 spectrophotometers (Thermo Fisher Scientific, Waltham, USA), quantifying concentration with a Qubit fluorometer (Thermo Fisher Scientific, Waltham, USA), and evaluating integrity by 1% agarose gel electrophoresis (120 V, 45 min). Sequencing libraries for plant whole-genome resequencing were prepared using the Illumina DNA PCR-Free Prep, Tagmentation Kit (Illumina, San Diego, USA). Sequencing was performed on the Illumina NovaSeq 6000 platform (Beijing Annoroad Gene Technology Co., Beijing, China) using a paired-end 150 bp (PE150) strategy with a depth exceeding 10×. Raw reads were trimmed for quality using the Trimmomatic software57, and high-quality clean reads were aligned to the P. deltoides reference genome9 using the BWA-MEM algorithm58. Potential PCR duplicates were filtered, and single nucleotide polymorphism (SNP) calling was conducted using the Genome Analysis Toolkit (GATK) v4.559. SNP selection was performed using Plink v1.90b, retaining only SNPs located on chromosomes. Missing alleles were imputed using Beagle v5.4 software60. Biallelic SNPs were extracted and filtered using VCFtools v0.1.1661, applying a missing rate threshold of 0.2 and excluding SNPs with a minor allele frequency below 0.05.
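The final filtering step can be sketched in plain Python as follows. This is an illustration of the two criteria (missing rate and minor allele frequency), not a reimplementation of VCFtools; the genotype encoding (0, 1, 2 alternate-allele copies, `None` for missing) and the inclusive handling of the boundary values are assumptions made for the example.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 alternate-allele copies; None = missing call


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of individuals with no genotype call at this SNP."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele among the observed (non-missing) calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def keep_snp(calls: Sequence[Genotype],
             max_missing: float = 0.2,
             min_maf: float = 0.05) -> bool:
    """Keep a SNP if its missing rate is at most 0.2 and its MAF is at least 0.05."""
    return (missing_rate(calls) <= max_missing
            and minor_allele_frequency(calls) >= min_maf)
```

A monomorphic site such as `[0] * 10` is rejected because its minor allele frequency is zero.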

Linkage disequilibrium, GWAS and QTL identification

The genetic structure of the population was assessed using PCA and phylogenetic tree analysis. PCA was performed with Plink v1.90b software62, while the phylogenetic tree was constructed using the SNPhylo software63. Population structure was determined using the Admixture v1.3.0 software, and Fst between subpopulations was calculated using VCFtools v0.1.16. LD analysis was conducted using PopLDdecay software, with the parameter “-MaxDist 1000” to calculate the LD coefficients between SNPs and to evaluate LD decay64. SNP filtering based on LD and Hardy-Weinberg Equilibrium (HWE) was performed using Plink v1.90b with the parameters “--indep-pairwise 50 5 0.2” and “--hwe 0.000001”.
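To make the LD-based filtering concrete, the sketch below computes a pairwise r² from genotype dosages and applies a simplified greedy pruning rule with a 50-SNP window and an r² cutoff of 0.2. It is a conceptual illustration only: it does not reproduce Plink's sliding-window algorithm (the step size of 5 is ignored), and the function names are hypothetical.

```python
from typing import List, Sequence


def r_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Pearson correlation between two SNP dosage vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)


def greedy_prune(snps: Sequence[Sequence[float]],
                 window: int = 50,
                 threshold: float = 0.2) -> List[int]:
    """Return indices of SNPs kept after dropping any SNP whose r2 with an
    already-kept SNP within `window` positions exceeds `threshold`."""
    kept: List[int] = []
    for idx, geno in enumerate(snps):
        nearby = [k for k in kept if idx - k < window]
        if all(r_squared(geno, snps[k]) <= threshold for k in nearby):
            kept.append(idx)
    return kept
```

Two SNPs with identical dosage vectors have r² = 1, so only the first of them would be retained.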

Following quality control, GWAS was performed using the BLINK model from the GAPIT v3.0 package in R65, based on phenotypic and genotypic data from all 237 individuals to provide a population-level overview of SNP-trait associations across the ten traits. The BLINK model is considered to have greater statistical efficiency, thereby enhancing the detection power of significant signals in GWAS66. It identifies quantitative trait nucleotides (QTNs) using a generalized linear model (GLM), with the following equation:

$$y=X\beta +Q{{{\rm{v}}}}+e$$
(1)

where y is the vector of observed phenotypes, X is the matrix of molecular markers, \(\beta\) is the vector of additive SNP effects treated as fixed effects, Q is the population structure matrix, v is the vector of population structure fixed effects, and \(e\) is the vector of residuals. The BLINK model uses QTNs identified by GLM as covariates, incorporating LD information to further refine the identification of QTNs. A Bayesian information criterion (BIC) strategy is then applied to assess the accuracy of the QTNs, excluding false hypotheses and retaining true QTNs. The model included the first three principal components (PCs) derived from whole-genome SNPs as fixed effects to account for population structure. The deviation of observed P-values from their expected distribution under the null hypothesis was evaluated using quantile-quantile (Q-Q) plots. To balance the need for stringent filtering while accommodating the broad scope of phenotypic analyses, a significance threshold (−log10(P) = 5.00) was selected based on the adjusted Bonferroni correction threshold (−log10(0.05/685181) = 7.13) to identify loci associated with the traits. The confidence intervals for significant SNPs were defined based on the average LD decay to r² = 0.1 among the subpopulations.
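The arithmetic behind the Bonferroni value quoted above, and the comparison of a SNP against the relaxed threshold of 5.00, can be sketched as follows. The function names are hypothetical; only the numbers 0.05, 685181, and 5.00 come from the text.

```python
from math import log10


def neg_log10_bonferroni(alpha: float, n_tests: int) -> float:
    """-log10 of the Bonferroni-corrected significance level alpha / n_tests."""
    return -log10(alpha / n_tests)


def is_significant(p_value: float, threshold: float = 5.0) -> bool:
    """True when -log10(P) reaches the chosen significance threshold."""
    return -log10(p_value) >= threshold


# Bonferroni threshold for 685,181 tested SNPs at alpha = 0.05
bonferroni = neg_log10_bonferroni(0.05, 685181)
```

The relaxed threshold of 5.00 admits SNPs that would fail the stricter Bonferroni cutoff, which is the trade-off described in the text.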

QTLs were delineated as the confidence intervals upstream and downstream of significant SNPs67,68. Colocalization between traits was determined by assessing whether the QTLs identified for multiple traits through LD were located within the same genomic region. Candidate genes within the QTL regions were identified by aligning with the reference genome and annotated for their putative functions using eggNOG-mapper v.269. Haplotype analysis of candidate genes was performed using the R package geneHapR70. The candidate gene sets were analyzed for GO enrichment using the R package clusterProfiler v4.10.171. The LD relationships between SNPs within the QTL regions were further analyzed using LDBlockShow software72. SNP markers with P < 1 × 10−5 associated with the traits were used to identify favorable alleles. Alleles associated with increased DBH, BD, hemicellulose, cellulose, lignin, LA, LL, and LW values, or with decreased BSD values, were considered favorable. For each individual, the number of favorable alleles was defined as the number of loci homozygous for the favorable allele. The relationship between favorable alleles and phenotypes was visualized using the ggplot2 package in R. Additionally, a heatmap was generated using the pheatmap package in R to illustrate the distribution characteristics and clustering patterns of the favorable alleles across different poplar genotypes.
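The interval definition and the colocalization check can be illustrated with a short Python sketch. It assumes that a QTL extends a fixed distance (`ld_decay_bp`, a hypothetical parameter standing in for the LD decay distance) on either side of a significant SNP and that two QTLs colocalize when their intervals overlap on the same chromosome; the chromosome names and positions in the usage note are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def qtl_interval(chrom: str, pos: int, ld_decay_bp: int) -> Interval:
    """Confidence interval spanning ld_decay_bp upstream and downstream of a SNP."""
    return Interval(chrom, max(0, pos - ld_decay_bp), pos + ld_decay_bp)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def colocalized(qtls_by_trait: Dict[str, List[Interval]]) -> List[Tuple[str, str]]:
    """Return trait pairs that have at least one pair of overlapping QTL intervals."""
    traits = sorted(qtls_by_trait)
    pairs = []
    for i, first in enumerate(traits):
        for second in traits[i + 1:]:
            if any(overlaps(a, b)
                   for a in qtls_by_trait[first]
                   for b in qtls_by_trait[second]):
                pairs.append((first, second))
    return pairs
```

For instance, a DBH QTL centered at position 100,000 and an LA QTL centered at 104,000 on the same chromosome would be reported as colocalized when `ld_decay_bp` is 5,000.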

Construction of genomic selection model

Genomic selection models were constructed using BRR, BayesA, and BayesC methods12,73. To integrate GWAS information into GS, the GWAS-selected marker sets were incorporated into a mixed model as random effects for marker grouping, and the resulting Bayesian prediction model is formulated as follows:

$$y={\sum }_{i=1}^{l}{w}_{i}{\delta }_{i}+{\sum }_{k=1}^{m}{Z}_{k}{\gamma }_{k}+\varepsilon$$
(2)

where y is the phenotypic vector for n individuals, \({w}_{i}\) is the genotype vector for n individuals at the i-th marker from the four SNP marker sets, \({\delta }_{i}\) indicates the effect of the i-th SNP marker, \({\gamma }_{k}\) indicates the effect of the k-th marker, and \({Z}_{k}\) is the genotype vector for n individuals at the k-th marker from other markers. The term \(\varepsilon\) represents the random error component, where \(\varepsilon \sim N\left(0,{I}_{n}{\sigma }^{2}\right)\).
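The two sums in Eq. (2) correspond to two disjoint groups of markers: those selected by GWAS and all remaining markers. A minimal sketch of that partition is given below. It assumes each marker is summarized by a single GWAS P-value (for example, its smallest P-value across traits, which is an assumption of this example), and the marker identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_markers(p_by_marker: Dict[str, float],
                      threshold: float) -> Tuple[List[str], List[str]]:
    """Split markers into the GWAS-selected group (P < threshold), which enters
    the first sum of Eq. (2), and the remaining group, which enters the second."""
    selected = sorted(m for m, p in p_by_marker.items() if p < threshold)
    remaining = sorted(m for m, p in p_by_marker.items() if p >= threshold)
    return selected, remaining
```

Each group would then be passed to the Bayesian model as its own random-effect term, so that the two groups can receive different shrinkage.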

The predictive accuracy of GS was evaluated using five-fold cross-validation. The population of 237 accessions was randomly divided into five folds. In each iteration, four folds (approximately 190 accessions) were used as the training population, while the remaining fold (approximately 47 accessions) served as the validation population. To avoid bias in the cross-validation assessment caused by SNP selection from GWAS, both GWAS and SNP selection were conducted exclusively within the training population. Four SNP marker sets, identified as 10−5sig, 10−4sig, 10−3sig, and 10−2sig, were selected from the combined GWAS results of the ten traits, corresponding to significance thresholds of P < 1 × 10−5, P < 1 × 10−4, P < 1 × 10−3, and P < 1 × 10−2, respectively. These marker sets were then used to optimize GS models within their respective training populations. This process was repeated across all five folds, and prediction accuracy was assessed by calculating the Pearson correlation coefficient between observed phenotypes and genomic estimated values. To ensure robustness, the entire cross-validation procedure was repeated ten times. By setting the same random seed, the same training individuals were maintained across all models to ensure comparability. All GS models were constructed using the BGLR v1.1.3 package in R74.
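The cross-validation scheme can be outlined in Python as follows. This is a structural sketch, not the authors' pipeline: `fit_predict` is a hypothetical callback that would run GWAS, marker selection, and model fitting on the training indices only and return predictions for the held-out indices. With 237 accessions, the folds contain 47 or 48 individuals each.

```python
import random
from math import sqrt
from typing import Callable, List, Sequence


def five_fold_indices(n: int, seed: int, k: int = 5) -> List[List[int]]:
    """Shuffle 0..n-1 with a fixed seed and deal the indices into k folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def cross_validate(phenotypes: Sequence[float],
                   fit_predict: Callable[[List[int], List[int]], List[float]],
                   seed: int, k: int = 5) -> List[float]:
    """Return one prediction accuracy (Pearson r) per held-out fold."""
    accuracies = []
    for held_out in five_fold_indices(len(phenotypes), seed, k):
        held = set(held_out)
        train = [i for i in range(len(phenotypes)) if i not in held]
        # All marker selection and model fitting must use `train` only.
        predicted = fit_predict(train, held_out)
        observed = [phenotypes[i] for i in held_out]
        accuracies.append(pearson(observed, predicted))
    return accuracies
```

Reusing the same `seed` for every model reproduces the same training and validation split, which is what makes accuracies comparable across models.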

Statistics and reproducibility

Statistical analyses were performed using R software (v4.4.0). For Fig. 2c, two-tailed unpaired Student’s t tests were conducted, while two-tailed paired t-tests were applied in Supplementary Data 5. P-values were adjusted using the Benjamini–Hochberg method to control the false discovery rate. Linear regression analysis was used in Fig. 4 to assess the relationship between genomic prediction accuracy and transcriptomic correlation. For Supplementary Figs. 7 and 8, linear mixed models (LMMs) were fitted using the lme4 package (v1.1-37), with model type and marker set included as fixed effects, and cross-validation replicates modeled as a random effect to account for variability across resampling iterations. Multiple comparisons were performed using the emmeans package (v1.11.1) with Bonferroni correction. Residual diagnostics were conducted to confirm the assumptions of normality and homoscedasticity. Statistical significance was set at P < 0.05. Significance levels were indicated as: ***P < 0.001; **P < 0.01; *P < 0.05; ns not significant. Data are presented as mean ± standard deviation.
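For readers unfamiliar with the Benjamini–Hochberg adjustment mentioned above, a minimal Python sketch of the standard procedure is given below. It is included for illustration only; the analyses in this study were carried out in R, and the function name here is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw P-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing in the raw P-values.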

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.