Integration of GWAS and transcriptome and haplotype analyses to identify QTNs and candidate genes controlling oil content in soybean seeds

Zhang, Kaixin; Hu, Bo; Wang, Wenhao; Li, Wen-Xia; Ning, Hailong

doi:10.1038/s41598-025-00869-8

Download PDF

Article
Open access
Published: 14 May 2025

Integration of GWAS and transcriptome and haplotype analyses to identify QTNs and candidate genes controlling oil content in soybean seeds

Kaixin Zhang^2,1,4,
Bo Hu¹,
Wenhao Wang¹,
Wen-Xia Li¹ &
…
Hailong Ning^1,3

Scientific Reports volume 15, Article number: 16803 (2025) Cite this article

2858 Accesses
Metrics details

Subjects

Abstract

The oil content (OC) of soybean is a critical trait with important applications in the development of both food and industrial products. Consequently, enhancing OC has consistently remained a significant objective in soybean breeding programs. In this research, a four-way recombinant inbred line (FW-RIL) population containing 144 lines developed from the cross (Kenfeng14 × Kenfeng15) × (Kenfeng19 × Heinong48) was planted in ten environments to investigate the phenotypic data for OC in seeds. On the basis of the genotype data for 109,676 nonredundant single-nucleotide polymorphism (SNP) markers obtained in previous studies, important quantitative trait nucleotides (QTNs) controlling OC in soybean were identified via five multilocus genome-wide association study (GWAS) methods. A total of 54 significant QTNs were detected by the five methods, including 21 QTNs identified by multiple methods and 2 QTNs detected in two environments. By combining pathway analysis, transcriptome sequencing and gene annotation information, two candidate genes, Glyma.18G027100 and Glyma.03G021800, which are likely related to oil synthesis and metabolism in soybean seeds, were identified, and they were then verified via haplotype analysis. Additionally, 12 of the 23 important QTNs were found for the first time in this study. These results not only contribute to the understanding of the genetic control of OC in soybean seeds but are also helpful in marker-assisted selection (MAS) for breeding high-oil varieties of soybean.

High-quality genome of a modern soybean cultivar and resequencing of 547 accessions provide insights into the role of structural variation

Article 09 September 2024

Across-environment seed protein stability and genetic architecture of seed components in soybean

Article Open access 16 July 2024

Natural allelic variation in SW14 determines seed weight and quality in soybean

Article Open access 29 August 2025

Introduction

Soybean (Glycine max(L.) Merr.) is the second largest source of vegetable oil in the world, and its consumption has increased rapidly as the population has grown, the standard of living has increased, and diets have changed¹. Owing to the limited availability of agricultural land, increasing the oil content (OC) of soybean seeds has long been a very important goal for breeders.

OC is a typical quantitative trait that is controlled by multiple genes and is affected by the environment. To improve the efficiency of genetic breeding, studies for mapping quantitative trait loci (QTLs) and nucleotides (QTNs) and mining genes associated with OC have been conducted. To date, 315 QTLs and 93 QTNs related to the OC of soybean have been identified in SoyBase (https://soybase.org). Furthermore, 7, 7, 23, 24, 10, 5, 14, and 5 OC QTLs were detected by Cao et al., Zhang et al., Li et al., Yao et al., Huang et al., Zhu et al., Liu et al. and Li et al., respectively^{2,3,4,5,6,7,8,9}, and 6, 14, 15, 44, and 44 OC QTNs were detected by Li et al., Li et al., Wang et al., Li et al. and Zhao et al., respectively^{10,11,12,13,14}. Moreover, researchers have identified several genes related to soybean oil anabolism, such as GmDof4, GmDof11, GmbZIP123, FAD2-1 A, GmMYB73, GmDREBL, GmNFYA, GmZF351, B1, GmOLEO1, GmSWEET39 and GmPDAT^{15,16,17,18,19,20,21,22,23,24,25}. These studies have enhanced our understanding of soybean lipid metabolism and provided valuable insights for improving soybean OC.

Genome-wide association study (GWAS) has gained significant traction in recent years^{26,27,28,29,30}, primarily because of their ability to refine QTL intervals at the gene level. Additionally, GWAS can be used for a diverse range of populations, thereby overcoming the population constraints inherent in linkage analysis. Multiparent advanced generation intercross (MAGIC) populations have more diverse alleles than biparental intermated populations do, which could increase the detection power for QTLs and genes. Therefore, an increasing number of plant MAGIC populations have been reported^{13,31,32,33,34,35,36,37,38}. This type of population remains amenable to analysis through GWAS, which offers greater precision in the identification of genomic regions and genes than linkage analysis does. With the expanding utilization of GWAS, a growing number of analytical methods have been developed. These methods can be broadly classified into two categories: single-locus GWAS and multilocus GWAS. Compared with single-locus GWAS, multilocus GWAS has more advantages, such as the ability to identify QTNs for small effects and a low false-positive rate, so it has been used extensively^{39,40,41,42,43}.

In this study, a soybean four-way recombinant inbred line (FW-RIL) population with 144 individuals was phenotyped for OC in 10 different environments. Combined with the SNP genotype data, the significant QTNs for seed OC in soybean were identified via 5 types of multilocus methods. In the linkage disequilibrium (LD) decay region of the common QTNs, the candidate genes involved in seed OC synthesis were selected via transcriptome sequencing and gene annotation analysis and then verified via haplotype analysis. This research could improve our understanding of the genetic control of seed OC in soybean and support marker-assisted selection for breeding soybean with high seed OC.

Results

Phenotype variation analysis

The general descriptive statistics of the OC trait for the 4 parents and the FW-RILs population are presented in Table 1; Fig. 1. For the parental lines, the OCs of Kenfeng19 and Heinong48 were lower than those of Kenfeng14 and Kenfeng15 in all 10 environments. For the FW-RILs, the “Range” (Range = OC_Max - OC_Min) of the 144 FW-RILs in 10 environments ranged from 2.96 to 6.18%, and the CV range was 3.07–5.99%. In nearly all environments, the absolute values of skewness and kurtosis were less than 1. Additionally, the Shapiro-Wilk test results demonstrated that p-values exceeded 0.05 across all environments. These findings suggest that the OC of the 144 FW-RILs adhered to a normal distribution in these environments (Supplementary Figure S1). The significant positive correlation of OC between different environments revealed that the stable expressed genetic basis among various environments (Fig. 2). The ANOVA results demonstrated significant differences among the genotypes, environments, and their interactions concerning the OC trait (Table 2). This suggests that the population is suitable for identifying QTNs associated with OC, and specific QTNs may be discerned across various environmental conditions. The higher broad-based heritability (0.65) was associated with a greater probability of identifying stable QTNs in multiple environments.

QTNs identified by multilocus GWAS methods

A total of 54 significant QTNs were identified for OC from the five multilocus GWAS methods in 10 environments (Supplementary Table S1). The numbers of significant SNPs detected by pLARmEB, ISIS EM-BLASSO, mrMLM, FASTmrMLM and FAST mrEMMA were 27, 21, 21, 21 and 10, respectively (Fig. 3A), and 22 QTNs could be detected by multiple methods (Table 3), with the PVE by each QTN ranging from 4.22 to 17.68%. In the 10 environments, the numbers of significant QTNs were 6, 6, 9, 9, 2, 9, 4, 3, 3 and 5 (Fig. 3B), and two common QTNs (qOC-7-1, qOC-18-1) were identified across multiple environments, with the PVE by each QTN ranging from 5.52 to 9.20% (Table 3). These 54 significant QTNs were distributed among 19 chromosomes of soybean, of which Chr09 had the most significant QTNs, with 9, and no significant QTN was detected on Chr12 (Fig. 3C).

A total of 23 common QTNs (Common QTN means the QTN was detected across multiple GWAS methods or in multiple environments) were found in this study (Table 3; Fig. 4). 21 common QTNs were detected by at least two methods in a single environment (listed in black font in Table 3); their LOD values ranged from 3.24 to 7.78, while the proportion of PVE by each QTN ranged from 4.22 to 17.68%. Two common QTNs were identified across multiple environments by single or multiple methods, and detailed information of these two QTNs (qOC-7-1, qOC-18-1) were shown in black bold font in Table 3; their LOD values ranged from 4.83 to 5.20, and the proportion of PVE by each QTN ranged from 5.52 to 9.20%. For all the 23 common QTNs, the direction of effect (positive or negative) of each QTN was consistent across the different methods or environments.

Transcriptome sequencing analysis and qRT-PCR validation

To identify DEGs among soybean varieties with high and low OC, transcriptome sequencing analysis was conducted on seeds from 25 days after flowering in KS01, KD31, KD48 and KD57. A total of 5.64 billion raw reads were generated from 12 samples, with individual samples ranging from 41.71 to 53.58 million reads. After filtering out low-quality and adapter reads, the number of clean reads in each sample ranged from 41.26 to 52.89 million, accounting for 98.44–99.06% of the raw reads (Supplementary Table S2 and Fig. 5A). The Q30 value and GC content ranges were 91.17–92.85% and 45.49–46.59%, respectively, which reflected the high quality of the transcriptome sequencing data (Supplementary Table S2). In addition, effective reads, after ribosomal RNA removal, were aligned with the reference genome. The results showed that the mapping ratios of the 12 samples ranged from 95.75 to 96.55%. (Supplementary Table S3). Principal component analysis (PCA) revealed that the samples had excellent repeatability and discrimination (Fig. 5B).

In total, 4559 (KD48-vs-KD31), 7013 (KD48-vs-KS01), 3491 (KD57-vs-KD31) and 11,382 (KD57-vs-KS01) DEGs were identified in the comparisons between different groups. There were more upregulated DEGs than downregulated DEGs in both “KD48-vs-KS01” and “KD57-vs-KS01”. In “KD48-vs-KD31” and “KD57-vs-KD31”, the number of upregulated DEGs was slightly lower than that of downregulated DEGs (Fig. 5C).

To validate the transcriptome sequencing results, we randomly selected 10 DEGs and analyzed their expression levels in KD31, KD57, KS01 and KD48 via qRT-PCR. The results validated the relevance of the transcriptome sequencing data, and the qRT-PCR results were highly consistent (Supplementary Figure S2).

Potential candidate gene analysis

There were 441 genes in the search scopes for 23 common QTNs, of which a total of 196 genes were highly expressed in seeds during lipid formation. Pathway analysis of the 196 genes revealed that 132 genes were enriched in 163 pathways. Among these pathways, only 15 were related to the synthesis and metabolism of lipids, including glycolysis/gluconeogenesis, the citrate cycle (TCA cycle), fructose and mannose metabolism, galactose metabolism, starch and sucrose metabolism, amino sugar and nucleotide sugar metabolism, glyoxylate and dicarboxylate metabolism, propanoate metabolism and inositol phosphate metabolism, which belong to the carbohydrate metabolism pathway; oxidative phosphorylation, photosynthesis-antenna proteins, nitrogen metabolism and sulfur metabolism, which belong to energy metabolism; glycerolipid metabolism, which belongs to lipid metabolism; and the carbon metabolism pathway (Supplementary Table S4).

A total of 13 genes were associated with these 15 pathways, and 8 genes were differentially expressed between the high and low OC varieties of soybean compared with the DEGs of the KD48-vs-KS01, KD57-vs-KS01, KD48-vs-KD31 and KD57-vs-KD31 groups (Table 4; Fig. 6A). The eight DEGs were Glyma.09G173200, Glyma.03G021800, Glyma.14G218800, Glyma.16G213200, Glyma.18G025500, Glyma.18G027100, Glyma.11G093200 and Glyma.18G027200.

For WGCNA based on transcriptome sequencing data, genes with similar expression patterns were clustered into 20 modules (Fig. 6B). An analysis of the relationship between each module and the OC phenotype revealed that MM.darkseagreen 4 (p = 4e-07), MM.violet (p = 2e-04), MM.darkred (p = 0.01) and MM.floralwhite (p = 0.04) presented extremely significant positive correlations with OC traits. The module with an extremely significant negative correlation with the OC trait was MM.brown 4 (p = 0.003) (Fig. 6B). A correlation analysis of the 8 DEGs with the above 5 modules was further performed, and 7 of the DEGs (all except Glyma.16G13200) were significantly associated with one or more of the 5 modules. Glyma.11G093200 and Glyma.18G027100 presented significant or extremely significant positive correlations with 4 modules (red font in Fig. 6C), which presented positive correlations with the OC trait, and Glyma.18G027200 presented a significant positive correlation with MM.darkred and an extremely significant positive correlation with MM.floralwhite. Glyma.03G021800, Glyma.09G173200, Glyma.18G025500 and Glyma.14G218800 presented significant or extremely significant positive correlations with the MM.brown 4 module, which was negatively correlated with the OC trait (Fig. 6C). Therefore, the seven DEGs (all except Glyma.16G213200) were considered potential candidates for further haplotype analysis.

Haplotype analysis

Haplotypes were classified through a natural population that included 141 soybean accessions for which genotypic data were available (https://ngdc.cncb.ac.cn/soyomics). Through haplotype analysis of the above 7 DEGs, we found that the haplotypes of Glyma.18G027100 and Glyma.03G021800 presented significant differences in OC in the natural population. For Glyma.18G027100, two haplotypes were identified on the basis of SNPs at different positions in the exonic region: Hap1 (TAACACAT) and Hap2 (CGGTGGAC) (Fig. 7A). Hap1 presented the highest frequency (82%). In the population used for haplotype analysis, the mean OC of Hap1 was significantly greater than that of Hap2 (Fig. 7B). These results showed that Hap1 (TAACACAT) of Glyma.18G027100 could increase the OC. Three distinct haplotypes were identified from the analysis of Glyma.03G021800: Hap1 (AGGA), Hap2 (AGAC) and Hap3 (CAAC) (Fig. 7C). Hap1, comprising 51% of the population, had a significantly greater mean OC than the remaining two haplotypes and was an excellent haplotype for enhancing the OC (Fig. 7D). In summary, Glyma.18G027100 and Glyma.03G021800 are considered candidate genes related to soybean OC.

Discussion

In this study, the distribution of seed OC followed a normal distribution in the FW-RIL population in all 10 environments, which indicates that the OC was controlled via a complex genetic basis (Supplementary Figure S1). Although the OC had a relatively high h², the extremely significant variance in genotype×environment interaction effects showed that it was difficult to identify stable QTNs in different environments. In the present study, only 2 QTNs (qOC-7-1 and qOC-18-1) could be stably detected in multiple environments. qOC-7-1 was detected simultaneously in the E6 and E7 environments, and qOC-18-1 was detected simultaneously in the E4 and E10 environments (Table 3). According to the results of the correlation analysis, E4 and E10 did not exhibit a significant correlation (Fig. 2), which indicated that the effects of the E4 and E10 environments on OC differed greatly. Therefore, we believe that qOC-18-1 was more stable and more reliable for searching for candidate genes.

However, to avoid the omission of important QTNs due to this environmental impact, we also treated the QTNs identified by multiple methods as important target QTNs. In this way, stable environment-specific QTNs could be obtained by further reducing false positives, thereby preventing valuable genes from being overlooked. Many studies have shown that QTNs identified by multiple methods are reliable and suitable for identifying candidate genes^{42,53,54,55,56}. The candidate gene Glyma.03G021800 found in this study was identified in qOC-3-1, which was detected via two methods.

A total of 23 important QTNs were identified in this study. Among them, 12 QTNs existed in the QTL intervals associated with OC in soybean that have been mapped in previous studies: qOC-2-2, qOC-2–3, qOC-3-1, qOC-5-2, qOC-7-1, qOC-7-3, qOC-9-6, qOC-10-2, qOC-11-1, qOC-14-1, qOC-16-2 and qOC-17-4 (Table 3)^{2,44,45,46,47,48,49,50,51,52}. In addition, we predicted that one significant SNP, AX-157,333,937, is associated with protein content (PC) in a previous analysis³⁷, and it was found to be related to OC in soybean seeds in this study (Table 3); thus, AX-157,333,937, which belongs to qOC-18-1, is related to both PC and OC in soybean seeds. PC and OC are a pair of antagonistic traits, and AX-157,333,937 showed that the QTN effect was positive for PC (QTN effect: 0.48%~0.96%) and negative for OC (−0.23%~−0.16%). It is predicted that AX-157,333,937 is an important QTN related to soybean quality traits and may play an important role in the breeding of high-quality soybean.

Through GWAS, transcriptomic analysis and haplotype analysis, the genes Glyma.18G027100 and Glyma.03G021800 have been identified as candidate genes related to soybean OC. Glyma.18G027100, which was found across qOC-18-1, is a GmPLCδ gene belonging to the PLC gene family and encodes phosphatidylinositol phospholipase C delta (PLCδ). Gene Ontology annotation revealed that these genes are involved in the lipid metabolism process (GO:0006629). Twelve PLC genes were detected in the soybean genome, and they were distributed across chromosomes 2, 11, 14 and 18 equally. Interestingly, all three PLC genes on each chromosome were located adjacent to one another, except Glyma.14G193800 (GmPLC1) on chromosome 14. A similar phenomenon has been observed in Arabidopsis thaliana, in which AtPLC1, AtPLC4 and AtPLC5are located in a 12 kb fragment on chromosome 5⁵⁷. The PLC gene family has been confirmed in A. thaliana mutants to be associated with acyl-lipid metabolism (http://aralip.plantbiology.msu.edu)⁵⁸. Some studies on PLCgenes in soybean have also been reported, for example, studies on the induction of pathogenesis-related genes in soybean and the regulation of abiotic stress responses^59,60. However, a PLC gene has not been reported to be involved in acyl-lipid metabolism in soybean. We can perform further studies on PLC genes in soybean to determine the relationship between PLC genes and OC in soybean.

Glyma.03G021800 encodes a NADH-ubiquinone oxidoreductase that contains a CHCH domain, which acts on the oxidative phosphorylation pathway and is involved in mitochondrial ATP synthesis coupled with electron transport (GO:0042775). The synthesis of oils requires the combination of glycerol triphosphate with long-chain fatty acids, which are synthesized from short chains on the endoplasmic reticulum. Short-chain fatty acids are synthesized in the mitochondria by ATP as the main energy source and then transported out of the mitochondria. Since oxidative phosphorylation is the main pathway for ATP formation in mitochondria, we predicted that Glyma.03G021800 is closely related to the synthesis of soybean oil.

In summary, A total of 23 common QTNs associated with OC were detected via five multilocus GWAS methods in 10 environments. Among them, 12 QTNs were found for the first time in this study. Among these QTNs, 2 candidate genes, Glyma.18G027100 and Glyma.03G021800, were identified. These results contribute to the analysis of the regulatory network of OC in soybean seeds and are valuable for MAS in breeding of high-oil soybean varieties.

Materials and methods

Plant materials and collection of phenotypic data

A population with 144 FW-RILs, derived from a four-way cross of the parental lines Kenfeng14 (OC 20.88%), Kenfeng15 (OC 21.66%), Kenfeng19 (OC 18.33%), and Heinong48 (OC 17.76%), was used in the present study. Details regarding FW-RIL population construction were described in a previous paper published by our research team³⁷.

The 144 FW-RILs and 4 parental lines were planted in 10 environments (hereafter expressed as E1-E10) in different locations and years (Supplementary Table S5). All the materials in each environment were planted in a completely randomized block design with three replications. Three lines were set in a block, and the length and width of the lines were 5 m and 0.7 m, respectively. The experimental plots were managed in a manner consistent with local soybean production. At maturity, ten plants from the middle of each line were harvested and threshed separately. The OC of dry seeds (with approximately 10% moisture) of each line was determined via a near-infrared analyzer (Infratec 1241, Foss, Denmark) at the Key Laboratory of Soybean Biology of the Chinese Education Ministry at Northeast Agricultural University in China. The calibration regression technique used for this near-infrared analyzer was the partial least square (PLS) method, which involves combining spectral data with laboratory data (Kjeldahl method) to calculate the seed OC, described as the percentage of seed weight. The phenotypic values given for each parental and FW-RIL individual used in this study were all the averages of three repetitions.

Analysis of phenotypic data

The analysis of phenotypic data included determination of the mean, standard deviation, range, minimum, maximum, skewness, kurtosis, coefficient of variation (CV) values, variance (ANOVA) and broad-sense heritability (h²). All the statistical analyses were implemented in SAS 9.2 (SAS Institute, Cary, USA). The ANOVA for multiple environments was conducted jointly via the generalized linear model (GLM) method, and the variance components were estimated via the mixed linear model (MLM). Then, h² was calculated via the following equation:

$${h}^{2}={\sigma}_{g}^{2}/\left({\sigma}_{g}^{2}+{\sigma}_{ge}^{2}/n+{\sigma}_{e}^{2}/nr\right)$$

where ${\sigma}_{g}^{2}$, ${\sigma}_{ge}^{2}$ and ${\sigma}_{e}^{2}$ represent the variance in genotype, genotype-by-environment interaction and error variance, respectively; r represents the number of replications within an environment; and n represents the number of environments.

Genotyping, structure and linkage disequilibrium (LD)

Genomic DNA was extracted from the juvenile leaves of parent and FW-RIL plants via the CTAB method⁶¹. The DNA concentration was determined via a UV752 N spectrophotometer (Shanghai Jingke Science Instrument Co. Ltd.), and the DNA was diluted to 100 ± 1 ng in deionized water. SNP genotyping was based on the SoySNP660 K BeadChip method and performed at Beijing Boao Biotechnology Co., Ltd. By performing quality filtration with the criteria maximum missing site ratio < 10% and minor allele frequency (MAF) > 5%⁶², 109,676 SNPs were selected for GWAS analysis. The total number of SNPs on each chromosome is shown in Supplementary Table S6.

By analyzing population structure, the FW-RIL population was divided into two subpopulations, and a Q matrix was obtained and used for multilocus GWAS. The LD region decayed fastest before 200 kb and then tended to flatten, so the 200 kb regions flanking the targeted SNP were determined to search for potential candidate genes. The results have been described in a previously published paper³⁷.

Genome-wide association study (GWAS)

Five kinds of multilocus GWAS methods, including mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO^63,64,65,66, were used to perform GWAS of the OC in soybean seeds. Calculations were performed via mrMLM.GUI (version 3.0) software. The critical P value parameters for these methods at the first stage were set to 0.01, except for FASTmrEMMA, where the critical P value was set to 0.005; the critical LOD score was set to 3, and the P value was set to 0.0002 for a significant QTN at the last stage. The parameter settings were Likelihood="REML”, SearchRadius = 100, CriLOD = 3, SelectVariable = 143, and Bootstrap = FALSE. The kinship matrix was also obtained via mrMLM.GUI 3.0 software.

The significant SNPs were considered the same QTN if they met one of the following conditions: (1) the same SNP markers were detected; (2) the attenuation intervals of different SNP markers intersected, and the same QTNs detected repeatedly by multiple methods were calculated only once.

The QTNs were named as follows: q + trait name-chromosome name-QTN number, where q represents the QTN, the trait name is the OC, and the QTN number represents the ranking of the QTN on the corresponding chromosome.

The QTNs detected via multiple GWAS methods or in multiple environments were considered common QTNs.

Transcriptome sequencing analysis

The soybean materials used for transcriptome sequencing were Keshan 01 (KS01), Kedou 31 (KD31), Kedou 48 (KD48) and Kedou 57 (KD57). KS01 and KD31 were high-oil cultivars with 21.82% and 21.26% OC, respectively, and KD48 and KD57 were low-oil cultivars with 15.78% and 16.44% OC, respectively. These cultivars were provided by the Keshan Branch of Heilongjiang Academy of Sciences, Keshan, Heilongjiang, China.

Uniform seeds of the four cultivars were chosen and sown in pots (five seeds/pot, 25*18 cm pots), and after seed germination, the number of seedlings was reduced to 1 per pot. At 25 days after flowering (the rapid oil accumulation stage l), tender seeds collected from three replicate plants were frozen in liquid nitrogen and stored in an ultralow-temperature freezer at −80 °C until transcriptome sequencing analysis. All four samples were subjected to transcriptome sequencing with three biological repeats.

Transcriptome sequencing analysis was performed on an Illumina HiSeq2500/4000 system by Gene Denovo Biotechnology Co. (Guangzhou, China). To obtain high-quality clean reads, the sequencing data were processed via fastp⁶⁷(version 0.18.0) for read filtering. The parameters used in this analysis were as follows: (1) reads containing adapters were removed; (2) reads containing more than 10% unknown nucleotides (N) were removed; and (3) low-quality reads containing more than 50% low-quality (Q value ≤ 20) bases were removed. The mapped reads of each sample were assembled via StringTie v1.3.1^68,69. For each transcription region, an FPKM (fragment per kilobase of transcript per million mapped reads) value was calculated to quantify its expression abundance and variations via RSEM⁷⁰ software. The FPKM formula is as follows:

$$\:\text{F}\text{P}\text{K}\text{M}=\frac{{10}^{6}\text{C}}{\raisebox{1ex}{$\text{N}\text{L}$}\!\left/\:\!\raisebox{-1ex}{${10}^{3}$}\right.}$$

Given FPKM(A) to be the expression of gene A, C to be number of fragments mapped to gene A, N to be total number of fragments that mapped to reference genes, and L to be number of bases on gene A. The FPKM method is able to eliminate the influence of different gene lengths and sequencing data amount on the calculation of gene expression. Therefore, the calculated gene expression can be directly used for comparing the difference of gene expression among samples.

Differential expression analysis was performed using DESeq2⁷¹software between two different groups (and by edgeR⁷² between two samples). The genes/transcripts with a false discovery rate (FDR) of less than 0.05 and an absolute fold change of ≥ 2 were considered differentially expressed genes/transcripts.

Weighted gene correlation network analysis (WGCNA) was performed using Omicsmart, a real-time interactive online platform for data analysis (http://www.omicsmart.com). The expression similarity between genes was calculated, and the Pearson correlation coefficient was used to measure the correlation between gene pairs. The similarity matrix was converted into a weighted adjacency matrix, and a soft thresholding method was used to enhance strong and weaken weak correlations. The appropriate soft threshold parameter (power value) was selected to assign the network to the characteristics of a scale-free network. Genes were divided into different modules via hierarchical clustering and dynamic tree cut methods. Each module consists of a set of highly coexpressed genes, and the genes within the module have similar expression patterns. Module-trait associations were estimated via the correlation (Pearson correlation) between the module and the OC trait.

Real-time PCR verification

Total RNA extraction, cDNA synthesis and qRT-PCR analysis for each sample were performed according to Zhang et al.⁷³. The internal reference gene was Acint7, and data analysis was performed via the 2^−ΔΔCT method. The primers used for qRT-PCR are shown in Supplementary Table S7.

Search for potential candidate genes

On the basis of the rate of LD decay, the 100 kb intervals on either side of each common QTN were set as the range for seeking potential candidate genes. On the Phytozome website (https://phytozome.jgi.doe.gov), all the genes in these intervals were found out. According to the transcriptomic datasets for the seed developmental stages in soybean (GES42871) downloaded from the GEO (Gene Expression Omnibus) database, genes that are highly expressed in seeds during oil formation were subsequently chosen. All the highly expressed genes were subsequently combined for pathway analysis via the Kyoto Encyclopedia of Genes and Genomes (KEGG) website (http://www.kegg.jp)^74,75,76, after which genes enriched in lipid anabolism-related pathways were identified. Finally, according to the transcriptome sequencing results, differentially expressed genes (DEGs) between the high-oil and low-oil cultivars were identified in lipid anabolism-related pathways. By comparing these DEGs with the results of WGCNA and annotation information for the genes, potential candidate genes were identified.

Haplotype analysis

The haplotype analysis of genes among 141 soybean accessions was performed using the available genotypic data (https://ngdc.cncb.ac.cn/soyomics). First, the markers with missing genotypes more than 20% were deleted. Next, imputation using markov function of NAM package in R was used to fill with missing genotypes. Finally, markers with minor allele frequency less than 0.05 were deleted. The haplotypes were constructed according to SNPs from the untranslated region (UTR) and exon regions via Haploview v4.2 software⁷⁷. Differences between haplotypes for each gene were tested by analysis of variance (contains more than two haplotypes of genes) or unpaired t-test (contains two haplotypes of genes). Multiple comparisons of phenotype differences among various haplotypes were conducted using the LSD.test function of agricolae package in R.

Table 1 Statistical characteristics of OC in the parents and FW-RIL populations grown in ten environments.

Full size table

Table 2 Joint ANOVA of the OC of FW-RILs in multiple environments and heritability.

Full size table

Table 3 Twenty-three common QTNs for seed OC in soybean across different methods or in multiple environments.

Full size table

Table 4 4 Details of 13 genes annotated in the KEGG database and information regarding homologous genes in Arabidopsis Thaliana.

Full size table

Data availability

Transcriptomic data that support the findings of this study have been deposited in the NCBI (BioProject ID PRJNA1185960, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1185960).

References

Marchive, C., Nikovics, K., To, A., Lepiniec, L. & Baud, S. Transcriptional regulation of fatty acid production in higher plants: molecular bases and biotechnological outcomes. Eur. J. Lipid Sci. Tech. 116, 1332–1343 (2014).
Article CAS Google Scholar
Cao, Y. C. et al. Identification of major quantitative trait loci for seed oil content in soybeans by combining linkage and Genome-Wide association mapping. Front. Plant. Sci. 8 (2017).
Zhang, T., Wu, T., Wang, L., Jiang, B. & Sun, S. A. Combined linkage and GWAS analysis identifies QTLs linked to soybean seed protein and oil content. Int. J. Mol. Sci. 20, 5915 (2019).
Article CAS PubMed PubMed Central Google Scholar
Li, S. G., Xu, H. F., Yang, J. Y. & Zhao, T. J. Dissecting the genetic architecture of seed protein and oil content in soybean from the Yangtze and Huaihe river valleys using Multi-Locus Genome-Wide association studies. Int. J. Mol. Sci. 20 (2019).
Yao, Y. J. et al. Quantitative trait loci analysis of seed oil content and composition of wild and cultivated soybean. BMC Plant. Biol. 20 (2020).
Huang, J. H. et al. Identification and mapping of stable QTLs for seed oil and protein content in soybean [Glycine max (L.) Merr]. J. Agric. Food Chem. 68, 6448–6460 (2020).
Article CAS PubMed Google Scholar
Zhu, X. T., Leiser, W. L., Hahn, V. & Würschum, T. Identification of seed protein and oil related QTL in 944 RILs from a Diallel of early-maturing European soybean. Crop J. 9, 238–247 (2021).
Article Google Scholar
Liu, J. Q. et al. QTL mapping for seed quality traits under multiple environments in soybean (Glycine max L). Agronomy-Basel 13 (2023).
Li, B. et al. Identification of an important QTL for seed oil content in soybean. Mol. Breeding 43 (2023).
Li, Y. H. et al. Genome-wide association mapping of QTL underlying seed oil and protein contents of a diverse panel of soybean accessions. Plant. Sci. 266, 95–101 (2018).
Article CAS PubMed Google Scholar
Li, D. M., Zhao, X., Han, Y. P., Li, W. B. & Xie, F. T. Genome-wide association mapping for seed protein and oil contents using a large panel of soybean accessions. Genomics 111, 90–95 (2019).
Article CAS PubMed Google Scholar
Wang, Y. Y. et al. Genotyping of soybean cultivars with Medium-Density array reveals the population structure and QTNs underlying maturity and seed traits. Front. Plant. Sci. 9 (2018).
Li, X. Y. et al. Detection of QTL and QTN and candidate genes for oil content in soybean using a combination of four-way-RIL and germplasm populations. Crop J. 8, 802–811 (2020).
Article Google Scholar
Zhao, X. C. et al. GWAS and WGCNA analysis uncover candidate genes associated with oil content in soybean. Plants-Basel 13 (2024).
Wang, H. W. et al. The soybean Dof-type transcription factor genes, GmDof4 and GmDof11, enhance lipid content in the seeds of Transgenic Arabidopsis plants. Plant J. 52, 716–729 (2007).
Article CAS PubMed Google Scholar
Song, Q. X. et al. Soybean GmbZIP123 gene enhances lipid content in the seeds of Transgenic Arabidopsis plants. J. Exp. Bot. 64, 4329–4341 (2013).
Article CAS PubMed PubMed Central Google Scholar
Haun, W. et al. Improved soybean oil quality by targeted mutagenesis of the fatty acid desaturase 2 gene family. Plant Biotechnol. J. 12, 934–940 (2014).
Article CAS PubMed Google Scholar
Liu, Y. F. et al. Soybean GmMYB73 promotes lipid accumulation in Transgenic plants. BMC Plant. Biol. 14 (2014).
Zhang, L. et al. An integrated bioinformatics analysis reveals divergent evolutionary pattern of oil biosynthesis in High- and Low-Oil plants. Plos One 11 (2016).
Lu, X. et al. The transcriptomic signature of developing soybean seeds reveals the genetic basis of seed trait adaptation during domestication. Plant. J. 86, 530–544 (2016).
Article CAS PubMed Google Scholar
Li, Q. T. et al. Selection for a Zinc-Finger protein contributes to seed oil increase during soybean domestication. Plant. Physiol. 173, 2208–2224 (2017).
Article CAS PubMed PubMed Central Google Scholar
Zhang, D. J. et al. Elevation of soybean seed oil content through selection for seed coat shininess. Nat. Plants. 4, 30–35 (2018).
Article CAS PubMed Google Scholar
Zhang, D. et al. Artificial selection on GmOLEO1 contributes to the increase in seed oil during soybean domestication. PLoS Genet. 15 (2019).
Zhang, H. Y. et al. Dual use and selection of GmSWEET39 for oil and protein improvement in soybean. PLoS Genet. 16 (2020).
Liu, J. Y. et al. An evolutionary population structure model reveals pleiotropic effects of GmPDAT for traits related to seed size and oil content in soybean. J. Exp. Bot. 71, 6988–7002 (2020).
Article CAS PubMed Google Scholar
Zhang, J. et al. Identification of QTNs, QTN-by-environment interactions and genes for yield-related traits in rice using 3VmrMLM. Front. Plant. Sci. 13 (2022).
Zhao, Q. et al. Identification of QTNs, QTN-by-environment interactions, and their candidate genes for grain size traits in main crop and Ratoon rice. Front. Plant. Sci. 14 (2023).
Sui, Y., Che, Y. R., Zhong, Y. & He, L. Q. Genome-wide association studies using 3VmrMLM model provide new insights into branched-chain amino acid contents in rice grains. Plants-Basel 12 (2023).
Zhang, P. et al. Genome-Wide Association Study and Candidate Gene Mining of Seed Size Traits in Soybean. Agronomy-Basel 14 (2024).
Yang, B. et al. Genome-wide association studies for rice seed germination under drought stress using 3VmrMLM. Food Energy Secur. 13 (2024).
Huang, B. E. et al. A multiparent advanced generation inter-cross population for genetic analysis in wheat. Plant Biotechnol. J. 10, 826–839 (2012).
Article CAS PubMed Google Scholar
Bandillo, N. et al. Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice 6 (2013).
Mackay, I. J. et al. An Eight-Parent multiparent advanced generation Inter-Cross population for Winter-Sown wheat: creation, properties, and validation. G3-Genes Genomes Genet. 4, 1603–1610 (2014).
Article Google Scholar
Dell’Acqua, M. et al. Genetic properties of the MAGIC maize population: a new platform for high definition QTL mapping in Zea mays. Genome Biol. 16 (2015).
Pascual, L. et al. Potential of a tomato MAGIC population to Decipher the genetic control of quantitative traits and detect causal variants in the resequencing era. Plant Biotechnol. J. 13, 565–577 (2015).
Article CAS PubMed Google Scholar
Meng, L. J., Guo, L. B., Ponce, K., Zhao, X. Q. & Ye, G. Y. Characterization of Three Rice Multiparent Advanced Generation Intercross (MAGIC) Populations for Quantitative Trait Loci Identification. Plant Genome 9 (2016).
Zhang, K. X. et al. Identification of QTNs controlling seed protein content in soybean using Multi-Locus Genome-Wide association studies. Front. Plant. Sci. 9 (2018).
Li, X. Y. et al. Fine mapping QTL and mining genes for protein content in soybean by the combination of linkage and association analysis. Theor. Appl. Genet. 134, 1095–1122 (2021).
Article CAS PubMed Google Scholar
Shang, G. X. et al. Genome-Wide Association Study of Phenylalanine Derived Glucosinolates in (Brassica rapa). Plants-Basel 11 (2022).
Asfaw, A., Mondo, J. M., Agre, P. A., Asiedu, R. & Akoroda, M. O. Association mapping of plant sex and cross-compatibility related traits in white Guinea Yam (Dioscorea rotundata Poir.) clones. BMC Plant. Biol. 22 (2022).
Guo, S. J. et al. High-Throughput Phenotyping Accelerates the Dissection of the Phenotypic Variation and Genetic Architecture of Shank Vascular Bundles in Maize (Zea mays L.). Plants-Basel 11 (2022).
Park, M. et al. Chromosome-level genome sequence assembly and genome-wide association study of (Muscadinia rotundifolia) reveal the genetics of 12 berry-related traits. Hortic. Res. 9 (2022).
Delfan, S., Bihamta, M. R., Dadrezaei, S. T., Abbasi, A. & Alipoor, H. Exploring genomic regions involved in bread wheat resistance to leaf rust at seedling/adult stages by using GWAS analysis. BMC Genom. 24 (2023).
Kabelka, E. A. et al. Putative alleles for increased yield from soybean plant introductions. Crop Sci. (2004).
Akond, M., Liu, S., Boney, M., Kantartzi, S. K. & Kassem, M. A. Identification of quantitative trait loci (QTL) underlying protein, oil, and five major fatty acids’ contents in soybean. Am. J. Plant. Sci. 5, 158–167 (2014).
Article CAS Google Scholar
Qi, Z. M. et al. Soybean oil content QTL mapping and integrating with meta-analysis method for mining genes. Euphytica 179, 499–514 (2011).
Article Google Scholar
Hwang, E. Y. et al. A genome-wide association study of seed protein and oil content in soybean. BMC Genom. 15 (2014).
Tajuddin, T., Watanabe, S., Yamanaka, N. & Harada, K. Analysis of quantitative trait loci for protein and lipid contents in soybean seeds using Recombinant inbred lines. Breed. Sci. 53, 133–140 (2003).
Article CAS Google Scholar
Leite, D. C., Pinheiro, J. B., Campos, J. B., Di Mauro, A. O. & Unêda-Trevisoli, S. H. QTL mapping of soybean oil content for marker-assisted selection in plant breeding program. Genet. Mol. Res. 15 (2016).
Mao, T. T. et al. Identification of quantitative trait loci underlying seed protein and oil contents of soybean across multi-genetic backgrounds and environments. Plant. Breed. 132, 630–641 (2013).
Article CAS Google Scholar
Gai, J. Y. et al. A comparative study on segregation analysis and QTL mapping of quantitative traits in plants-with a case in soybean. Front. Agric. China. 1, 1–7 (2007).
Article Google Scholar
Seo, J. H. et al. Quantitative trait locus analysis for soybean (Glycine max) seed protein and oil concentrations using selected breeding populations. Plant. Breed. 138, 95–104 (2019).
Article CAS Google Scholar
Cui, Y. R., Zhang, F. & Zhou, Y. L. The application of Multi-Locus GWAS for the detection of Salt-Tolerance loci in rice. Front. Plant. Sci. 9 (2018).
Ma, L. L. et al. Genetic dissection of maize embryonic callus regenerative capacity using Multi-Locus Genome-Wide association studies. Front. Plant. Sci. 9 (2018).
Zhang, Y. M., Jia, Z. Y. & Dunwell, J. M. Editorial: The applications of new Multi-Locus GWAS methodologies in the genetic dissection of complex traits. Front. Plant. Sci. 10 (2019).
Ikram, M. et al. Identification of QTNs and their candidate genes for 100-seed weight in soybean (Glycine max L.) using multi-locus genome-wide association studies. Genes 11 (2020).
Tasma, I. M., Brendel, V., Whitham, S. A. & Bhattacharyya, M. K. Expression and evolution of the phosphoinositide-specific phospholipase C gene family in Arabidopsis thaliana. Plant Physiol. Biochem. 46, 627–637 (2008).
Article CAS PubMed Google Scholar
Mcglew, K. et al. An annotated database of Arabidopsis mutants of acyl lipid metabolism. Plant Cell Rep. 34, 519–532 (2015).
Article CAS PubMed Google Scholar
Chou, W. M., Shigaki, T., Dammann, C., Liu, Y. Q. & Bhattacharyya, M. K. Inhibition of phosphoinositide-specific phospholipase C results in the induction of pathogenesis-related genes in soybean. Plant Biol. 6, 664–672 (2004).
Article CAS PubMed Google Scholar
Wang, F. W. et al. Genome-Wide analysis and expression profiling of the phospholipase C gene family in soybean (Glycine max). Plos One 10 (2015).
Doyle, J. J., Doyle, J. L. & Brown, A. Analysis of a polyploid complex in Glycine with Chloroplast and nuclear DNA. Aust. Syst. Bot. 3 (1990).
Belamkar, V. et al. Genomics-assisted characterization of a breeding collection of Apios americana, an edible tuberous legume. Sci. Rep. 6 (2016).
Wang, S. B. et al. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci. Rep. 6 (2016).
Tamba, C. L., Ni, Y. L. & Zhang, Y. M. Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies. PLoS Comput. Biol. 13 (2017).
Wen, Y. J. et al. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief. Bioinform. 19, 700–712 (2018).
Article PubMed Google Scholar
Zhang, J. et al. pLARmEB: integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity 118, 517–524 (2017).
Article CAS PubMed PubMed Central Google Scholar
Chen, S. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. bioRxiv, 274100 (2018).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290 (2015).
Article CAS PubMed PubMed Central Google Scholar
Pertea, M. et al. Transcript-level expression analysis ofRNA-seq experiments with HISAT, stringtie and ballgown. Nat. Protoc. 11, 650 (2016).
Article Google Scholar
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 12, 323 (2011).
Article CAS Google Scholar
Love, M. I., Huber, W. & Anders, S. Moderated Estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 15, 550 (2014).
Article PubMed PubMed Central Google Scholar
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. EdgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Article CAS PubMed Google Scholar
Zhang, W. et al. A cation diffusion facilitator, GmCDF1, negatively regulates salt tolerance in soybean. PLoS Genet. 15 (2019).
Kanehisa, M. & Goto, S. K. E. G. G. Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M. Toward Understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).
Article CAS PubMed Google Scholar
Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263–265 (2005).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported in part by theNational Natural Science Foundation of China (32201874), the grant from Opening Competition Mechanism to Select the BestCandidates Project of Heilongjiang Province Department of Science and Technology (2023ZXJ02B02) and ScienceFoundation of Heilongjiang Province, China (LH2021C028).

Author information

Authors and Affiliations

Northeast Agricultural University, Harbin, 150030, China
Kaixin Zhang, Bo Hu, Wenhao Wang, Wen-Xia Li & Hailong Ning
Yancheng Institute of Technology, Yancheng, 224051, China
Kaixin Zhang
Zhongnongfa Wudalianchi Agricultural Technology Co., Ltd, Wudalianchi, 164100, China
Hailong Ning
National Center for Soybean Improvement, National Key Laboratory of Crop Genetics and Germplasm Enhancement, Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing, 210095, China
Kaixin Zhang

Authors

Kaixin Zhang
View author publications
Search author on:PubMed Google Scholar
Bo Hu
View author publications
Search author on:PubMed Google Scholar
Wenhao Wang
View author publications
Search author on:PubMed Google Scholar
Wen-Xia Li
View author publications
Search author on:PubMed Google Scholar
Hailong Ning
View author publications
Search author on:PubMed Google Scholar

Contributions

H.N. and W-X. L. designed the research and applied funds. K.Z. and B.H. analyzed the data and wrote the manuscript, and measured the phenotype data. W. W. conducted phenotypic collection. All authors revised the manuscript and approved the final version to be published.

Corresponding authors

Correspondence to Wen-Xia Li or Hailong Ning.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, K., Hu, B., Wang, W. et al. Integration of GWAS and transcriptome and haplotype analyses to identify QTNs and candidate genes controlling oil content in soybean seeds. Sci Rep 15, 16803 (2025). https://doi.org/10.1038/s41598-025-00869-8

Download citation

Received: 17 October 2024
Accepted: 02 May 2025
Published: 14 May 2025
Version of record: 14 May 2025
DOI: https://doi.org/10.1038/s41598-025-00869-8

Subjects

Abstract

Similar content being viewed by others

High-quality genome of a modern soybean cultivar and resequencing of 547 accessions provide insights into the role of structural variation

Across-environment seed protein stability and genetic architecture of seed components in soybean

Natural allelic variation in SW14 determines seed weight and quality in soybean

Introduction

Results

Phenotype variation analysis

QTNs identified by multilocus GWAS methods

Transcriptome sequencing analysis and qRT-PCR validation

Potential candidate gene analysis

Haplotype analysis

Discussion

Materials and methods

Plant materials and collection of phenotypic data

Analysis of phenotypic data

Genotyping, structure and linkage disequilibrium (LD)

Genome-wide association study (GWAS)

Transcriptome sequencing analysis

Real-time PCR verification

Search for potential candidate genes

Haplotype analysis

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Publisher’s note

Electronic supplementary material

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links