Introduction

Soybean (Glycine max(L.) Merr.) is the second largest source of vegetable oil in the world, and its consumption has increased rapidly as the population has grown, the standard of living has increased, and diets have changed1. Owing to the limited availability of agricultural land, increasing the oil content (OC) of soybean seeds has long been a very important goal for breeders.

OC is a typical quantitative trait that is controlled by multiple genes and is affected by the environment. To improve the efficiency of genetic breeding, studies for mapping quantitative trait loci (QTLs) and nucleotides (QTNs) and mining genes associated with OC have been conducted. To date, 315 QTLs and 93 QTNs related to the OC of soybean have been identified in SoyBase (https://soybase.org). Furthermore, 7, 7, 23, 24, 10, 5, 14, and 5 OC QTLs were detected by Cao et al., Zhang et al., Li et al., Yao et al., Huang et al., Zhu et al., Liu et al. and Li et al., respectively2,3,4,5,6,7,8,9, and 6, 14, 15, 44, and 44 OC QTNs were detected by Li et al., Li et al., Wang et al., Li et al. and Zhao et al., respectively10,11,12,13,14. Moreover, researchers have identified several genes related to soybean oil anabolism, such as GmDof4, GmDof11, GmbZIP123, FAD2-1 A, GmMYB73, GmDREBL, GmNFYA, GmZF351, B1, GmOLEO1, GmSWEET39 and GmPDAT15,16,17,18,19,20,21,22,23,24,25. These studies have enhanced our understanding of soybean lipid metabolism and provided valuable insights for improving soybean OC.

Genome-wide association study (GWAS) has gained significant traction in recent years26,27,28,29,30, primarily because of their ability to refine QTL intervals at the gene level. Additionally, GWAS can be used for a diverse range of populations, thereby overcoming the population constraints inherent in linkage analysis. Multiparent advanced generation intercross (MAGIC) populations have more diverse alleles than biparental intermated populations do, which could increase the detection power for QTLs and genes. Therefore, an increasing number of plant MAGIC populations have been reported13,31,32,33,34,35,36,37,38. This type of population remains amenable to analysis through GWAS, which offers greater precision in the identification of genomic regions and genes than linkage analysis does. With the expanding utilization of GWAS, a growing number of analytical methods have been developed. These methods can be broadly classified into two categories: single-locus GWAS and multilocus GWAS. Compared with single-locus GWAS, multilocus GWAS has more advantages, such as the ability to identify QTNs for small effects and a low false-positive rate, so it has been used extensively39,40,41,42,43.

In this study, a soybean four-way recombinant inbred line (FW-RIL) population with 144 individuals was phenotyped for OC in 10 different environments. Combined with the SNP genotype data, the significant QTNs for seed OC in soybean were identified via 5 types of multilocus methods. In the linkage disequilibrium (LD) decay region of the common QTNs, the candidate genes involved in seed OC synthesis were selected via transcriptome sequencing and gene annotation analysis and then verified via haplotype analysis. This research could improve our understanding of the genetic control of seed OC in soybean and support marker-assisted selection for breeding soybean with high seed OC.

Results

Phenotype variation analysis

The general descriptive statistics of the OC trait for the 4 parents and the FW-RILs population are presented in Table 1; Fig. 1. For the parental lines, the OCs of Kenfeng19 and Heinong48 were lower than those of Kenfeng14 and Kenfeng15 in all 10 environments. For the FW-RILs, the “Range” (Range = OCMax - OCMin) of the 144 FW-RILs in 10 environments ranged from 2.96 to 6.18%, and the CV range was 3.07–5.99%. In nearly all environments, the absolute values of skewness and kurtosis were less than 1. Additionally, the Shapiro-Wilk test results demonstrated that p-values exceeded 0.05 across all environments. These findings suggest that the OC of the 144 FW-RILs adhered to a normal distribution in these environments (Supplementary Figure S1).  The significant positive correlation of OC between different environments revealed that the stable expressed genetic basis among various environments (Fig. 2). The ANOVA results demonstrated significant differences among the genotypes, environments, and their interactions concerning the OC trait (Table 2). This suggests that the population is suitable for identifying QTNs associated with OC, and specific QTNs may be discerned across various environmental conditions. The higher broad-based heritability (0.65) was associated with a greater probability of identifying stable QTNs in multiple environments.

QTNs identified by multilocus GWAS methods

A total of 54 significant QTNs were identified for OC from the five multilocus GWAS methods in 10 environments (Supplementary Table S1). The numbers of significant SNPs detected by pLARmEB, ISIS EM-BLASSO, mrMLM, FASTmrMLM and FAST mrEMMA were 27, 21, 21, 21 and 10, respectively (Fig. 3A), and 22 QTNs could be detected by multiple methods (Table 3), with the PVE by each QTN ranging from 4.22 to 17.68%. In the 10 environments, the numbers of significant QTNs were 6, 6, 9, 9, 2, 9, 4, 3, 3 and 5 (Fig. 3B), and two common QTNs (qOC-7-1, qOC-18-1) were identified across multiple environments, with the PVE by each QTN ranging from 5.52 to 9.20% (Table 3). These 54 significant QTNs were distributed among 19 chromosomes of soybean, of which Chr09 had the most significant QTNs, with 9, and no significant QTN was detected on Chr12 (Fig. 3C).

A total of 23 common QTNs (Common QTN means the QTN was detected across multiple GWAS methods or in multiple environments) were found in this study (Table 3; Fig. 4). 21 common QTNs were detected by at least two methods in a single environment (listed in black font in Table 3); their LOD values ranged from 3.24 to 7.78, while the proportion of PVE by each QTN ranged from 4.22 to 17.68%. Two common QTNs were identified across multiple environments by single or multiple methods, and detailed information of these two QTNs (qOC-7-1, qOC-18-1) were shown in black bold font in Table 3; their LOD values ranged from 4.83 to 5.20, and the proportion of PVE by each QTN ranged from 5.52 to 9.20%. For all the 23 common QTNs, the direction of effect (positive or negative) of each QTN was consistent across the different methods or environments.

Transcriptome sequencing analysis and qRT-PCR validation

To identify DEGs among soybean varieties with high and low OC, transcriptome sequencing analysis was conducted on seeds from 25 days after flowering in KS01, KD31, KD48 and KD57. A total of 5.64 billion raw reads were generated from 12 samples, with individual samples ranging from 41.71 to 53.58 million reads. After filtering out low-quality and adapter reads, the number of clean reads in each sample ranged from 41.26 to 52.89 million, accounting for 98.44–99.06% of the raw reads (Supplementary Table S2 and Fig. 5A). The Q30 value and GC content ranges were 91.17–92.85% and 45.49–46.59%, respectively, which reflected the high quality of the transcriptome sequencing data (Supplementary Table S2). In addition, effective reads, after ribosomal RNA removal, were aligned with the reference genome. The results showed that the mapping ratios of the 12 samples ranged from 95.75 to 96.55%. (Supplementary Table S3). Principal component analysis (PCA) revealed that the samples had excellent repeatability and discrimination (Fig. 5B).

In total, 4559 (KD48-vs-KD31), 7013 (KD48-vs-KS01), 3491 (KD57-vs-KD31) and 11,382 (KD57-vs-KS01) DEGs were identified in the comparisons between different groups. There were more upregulated DEGs than downregulated DEGs in both “KD48-vs-KS01” and “KD57-vs-KS01”. In “KD48-vs-KD31” and “KD57-vs-KD31”, the number of upregulated DEGs was slightly lower than that of downregulated DEGs (Fig. 5C).

To validate the transcriptome sequencing results, we randomly selected 10 DEGs and analyzed their expression levels in KD31, KD57, KS01 and KD48 via qRT-PCR. The results validated the relevance of the transcriptome sequencing data, and the qRT-PCR results were highly consistent (Supplementary Figure S2).

Potential candidate gene analysis

There were 441 genes in the search scopes for 23 common QTNs, of which a total of 196 genes were highly expressed in seeds during lipid formation. Pathway analysis of the 196 genes revealed that 132 genes were enriched in 163 pathways. Among these pathways, only 15 were related to the synthesis and metabolism of lipids, including glycolysis/gluconeogenesis, the citrate cycle (TCA cycle), fructose and mannose metabolism, galactose metabolism, starch and sucrose metabolism, amino sugar and nucleotide sugar metabolism, glyoxylate and dicarboxylate metabolism, propanoate metabolism and inositol phosphate metabolism, which belong to the carbohydrate metabolism pathway; oxidative phosphorylation, photosynthesis-antenna proteins, nitrogen metabolism and sulfur metabolism, which belong to energy metabolism; glycerolipid metabolism, which belongs to lipid metabolism; and the carbon metabolism pathway (Supplementary Table S4).

A total of 13 genes were associated with these 15 pathways, and 8 genes were differentially expressed between the high and low OC varieties of soybean compared with the DEGs of the KD48-vs-KS01, KD57-vs-KS01, KD48-vs-KD31 and KD57-vs-KD31 groups (Table 4; Fig. 6A). The eight DEGs were Glyma.09G173200, Glyma.03G021800, Glyma.14G218800, Glyma.16G213200, Glyma.18G025500, Glyma.18G027100, Glyma.11G093200 and Glyma.18G027200.

For WGCNA based on transcriptome sequencing data, genes with similar expression patterns were clustered into 20 modules (Fig. 6B). An analysis of the relationship between each module and the OC phenotype revealed that MM.darkseagreen 4 (p = 4e-07), MM.violet (p = 2e-04), MM.darkred (p = 0.01) and MM.floralwhite (p = 0.04) presented extremely significant positive correlations with OC traits. The module with an extremely significant negative correlation with the OC trait was MM.brown 4 (p = 0.003) (Fig. 6B). A correlation analysis of the 8 DEGs with the above 5 modules was further performed, and 7 of the DEGs (all except Glyma.16G13200) were significantly associated with one or more of the 5 modules. Glyma.11G093200 and Glyma.18G027100 presented significant or extremely significant positive correlations with 4 modules (red font in Fig. 6C), which presented positive correlations with the OC trait, and Glyma.18G027200 presented a significant positive correlation with MM.darkred and an extremely significant positive correlation with MM.floralwhite. Glyma.03G021800, Glyma.09G173200, Glyma.18G025500 and Glyma.14G218800 presented significant or extremely significant positive correlations with the MM.brown 4 module, which was negatively correlated with the OC trait (Fig. 6C). Therefore, the seven DEGs (all except Glyma.16G213200) were considered potential candidates for further haplotype analysis.

Haplotype analysis

Haplotypes were classified through a natural population that included 141 soybean accessions for which genotypic data were available (https://ngdc.cncb.ac.cn/soyomics). Through haplotype analysis of the above 7 DEGs, we found that the haplotypes of Glyma.18G027100 and Glyma.03G021800 presented significant differences in OC in the natural population. For Glyma.18G027100, two haplotypes were identified on the basis of SNPs at different positions in the exonic region: Hap1 (TAACACAT) and Hap2 (CGGTGGAC) (Fig. 7A). Hap1 presented the highest frequency (82%). In the population used for haplotype analysis, the mean OC of Hap1 was significantly greater than that of Hap2 (Fig. 7B). These results showed that Hap1 (TAACACAT) of Glyma.18G027100 could increase the OC. Three distinct haplotypes were identified from the analysis of Glyma.03G021800: Hap1 (AGGA), Hap2 (AGAC) and Hap3 (CAAC) (Fig. 7C). Hap1, comprising 51% of the population, had a significantly greater mean OC than the remaining two haplotypes and was an excellent haplotype for enhancing the OC (Fig. 7D). In summary, Glyma.18G027100 and Glyma.03G021800 are considered candidate genes related to soybean OC.

Discussion

In this study, the distribution of seed OC followed a normal distribution in the FW-RIL population in all 10 environments, which indicates that the OC was controlled via a complex genetic basis (Supplementary Figure S1). Although the OC had a relatively high h2, the extremely significant variance in genotype×environment interaction effects showed that it was difficult to identify stable QTNs in different environments. In the present study, only 2 QTNs (qOC-7-1 and qOC-18-1) could be stably detected in multiple environments. qOC-7-1 was detected simultaneously in the E6 and E7 environments, and qOC-18-1 was detected simultaneously in the E4 and E10 environments (Table 3). According to the results of the correlation analysis, E4 and E10 did not exhibit a significant correlation (Fig. 2), which indicated that the effects of the E4 and E10 environments on OC differed greatly. Therefore, we believe that qOC-18-1 was more stable and more reliable for searching for candidate genes.

However, to avoid the omission of important QTNs due to this environmental impact, we also treated the QTNs identified by multiple methods as important target QTNs. In this way, stable environment-specific QTNs could be obtained by further reducing false positives, thereby preventing valuable genes from being overlooked. Many studies have shown that QTNs identified by multiple methods are reliable and suitable for identifying candidate genes42,53,54,55,56. The candidate gene Glyma.03G021800 found in this study was identified in qOC-3-1, which was detected via two methods.

A total of 23 important QTNs were identified in this study. Among them, 12 QTNs existed in the QTL intervals associated with OC in soybean that have been mapped in previous studies: qOC-2-2, qOC-2–3, qOC-3-1, qOC-5-2, qOC-7-1, qOC-7-3, qOC-9-6, qOC-10-2, qOC-11-1, qOC-14-1, qOC-16-2 and qOC-17-4 (Table 3)2,44,45,46,47,48,49,50,51,52. In addition, we predicted that one significant SNP, AX-157,333,937, is associated with protein content (PC) in a previous analysis37, and it was found to be related to OC in soybean seeds in this study (Table 3); thus, AX-157,333,937, which belongs to qOC-18-1, is related to both PC and OC in soybean seeds. PC and OC are a pair of antagonistic traits, and AX-157,333,937 showed that the QTN effect was positive for PC (QTN effect: 0.48%~0.96%) and negative for OC (−0.23%~−0.16%). It is predicted that AX-157,333,937 is an important QTN related to soybean quality traits and may play an important role in the breeding of high-quality soybean.

Through GWAS, transcriptomic analysis and haplotype analysis, the genes Glyma.18G027100 and Glyma.03G021800 have been identified as candidate genes related to soybean OC. Glyma.18G027100, which was found across qOC-18-1, is a GmPLCδ gene belonging to the PLC gene family and encodes phosphatidylinositol phospholipase C delta (PLCδ). Gene Ontology annotation revealed that these genes are involved in the lipid metabolism process (GO:0006629). Twelve PLC genes were detected in the soybean genome, and they were distributed across chromosomes 2, 11, 14 and 18 equally. Interestingly, all three PLC genes on each chromosome were located adjacent to one another, except Glyma.14G193800 (GmPLC1) on chromosome 14. A similar phenomenon has been observed in Arabidopsis thaliana, in which AtPLC1, AtPLC4 and AtPLC5are located in a 12 kb fragment on chromosome 557. The PLC gene family has been confirmed in A. thaliana mutants to be associated with acyl-lipid metabolism (http://aralip.plantbiology.msu.edu)58. Some studies on PLCgenes in soybean have also been reported, for example, studies on the induction of pathogenesis-related genes in soybean and the regulation of abiotic stress responses59,60. However, a PLC gene has not been reported to be involved in acyl-lipid metabolism in soybean. We can perform further studies on PLC genes in soybean to determine the relationship between PLC genes and OC in soybean.

Glyma.03G021800 encodes a NADH-ubiquinone oxidoreductase that contains a CHCH domain, which acts on the oxidative phosphorylation pathway and is involved in mitochondrial ATP synthesis coupled with electron transport (GO:0042775). The synthesis of oils requires the combination of glycerol triphosphate with long-chain fatty acids, which are synthesized from short chains on the endoplasmic reticulum. Short-chain fatty acids are synthesized in the mitochondria by ATP as the main energy source and then transported out of the mitochondria. Since oxidative phosphorylation is the main pathway for ATP formation in mitochondria, we predicted that Glyma.03G021800 is closely related to the synthesis of soybean oil.

In summary, A total of 23 common QTNs associated with OC were detected via five multilocus GWAS methods in 10 environments. Among them, 12 QTNs were found for the first time in this study. Among these QTNs, 2 candidate genes, Glyma.18G027100 and Glyma.03G021800, were identified. These results contribute to the analysis of the regulatory network of OC in soybean seeds and are valuable for MAS in breeding of high-oil soybean varieties.

Materials and methods

Plant materials and collection of phenotypic data

A population with 144 FW-RILs, derived from a four-way cross of the parental lines Kenfeng14 (OC 20.88%), Kenfeng15 (OC 21.66%), Kenfeng19 (OC 18.33%), and Heinong48 (OC 17.76%), was used in the present study. Details regarding FW-RIL population construction were described in a previous paper published by our research team37.

The 144 FW-RILs and 4 parental lines were planted in 10 environments (hereafter expressed as E1-E10) in different locations and years (Supplementary Table S5). All the materials in each environment were planted in a completely randomized block design with three replications. Three lines were set in a block, and the length and width of the lines were 5 m and 0.7 m, respectively. The experimental plots were managed in a manner consistent with local soybean production. At maturity, ten plants from the middle of each line were harvested and threshed separately. The OC of dry seeds (with approximately 10% moisture) of each line was determined via a near-infrared analyzer (Infratec 1241, Foss, Denmark) at the Key Laboratory of Soybean Biology of the Chinese Education Ministry at Northeast Agricultural University in China. The calibration regression technique used for this near-infrared analyzer was the partial least square (PLS) method, which involves combining spectral data with laboratory data (Kjeldahl method) to calculate the seed OC, described as the percentage of seed weight. The phenotypic values given for each parental and FW-RIL individual used in this study were all the averages of three repetitions.

Analysis of phenotypic data

The analysis of phenotypic data included determination of the mean, standard deviation, range, minimum, maximum, skewness, kurtosis, coefficient of variation (CV) values, variance (ANOVA) and broad-sense heritability (h2). All the statistical analyses were implemented in SAS 9.2 (SAS Institute, Cary, USA). The ANOVA for multiple environments was conducted jointly via the generalized linear model (GLM) method, and the variance components were estimated via the mixed linear model (MLM). Then, h2 was calculated via the following equation:

$${h}^{2}={\sigma}_{g}^{2}/\left({\sigma}_{g}^{2}+{\sigma}_{ge}^{2}/n+{\sigma}_{e}^{2}/nr\right)$$

where \({\sigma}_{g}^{2}\), \({\sigma}_{ge}^{2}\) and \({\sigma}_{e}^{2}\) represent the variance in genotype, genotype-by-environment interaction and error variance, respectively; r represents the number of replications within an environment; and n represents the number of environments.

Genotyping, structure and linkage disequilibrium (LD)

Genomic DNA was extracted from the juvenile leaves of parent and FW-RIL plants via the CTAB method61. The DNA concentration was determined via a UV752 N spectrophotometer (Shanghai Jingke Science Instrument Co. Ltd.), and the DNA was diluted to 100 ± 1 ng in deionized water. SNP genotyping was based on the SoySNP660 K BeadChip method and performed at Beijing Boao Biotechnology Co., Ltd. By performing quality filtration with the criteria maximum missing site ratio < 10% and minor allele frequency (MAF) > 5%62, 109,676 SNPs were selected for GWAS analysis. The total number of SNPs on each chromosome is shown in Supplementary Table S6.

By analyzing population structure, the FW-RIL population was divided into two subpopulations, and a Q matrix was obtained and used for multilocus GWAS. The LD region decayed fastest before 200 kb and then tended to flatten, so the 200 kb regions flanking the targeted SNP were determined to search for potential candidate genes. The results have been described in a previously published paper37.

Genome-wide association study (GWAS)

Five kinds of multilocus GWAS methods, including mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO63,64,65,66, were used to perform GWAS of the OC in soybean seeds. Calculations were performed via mrMLM.GUI (version 3.0) software. The critical P value parameters for these methods at the first stage were set to 0.01, except for FASTmrEMMA, where the critical P value was set to 0.005; the critical LOD score was set to 3, and the P value was set to 0.0002 for a significant QTN at the last stage. The parameter settings were Likelihood="REML”, SearchRadius = 100, CriLOD = 3, SelectVariable = 143, and Bootstrap = FALSE. The kinship matrix was also obtained via mrMLM.GUI 3.0 software.

The significant SNPs were considered the same QTN if they met one of the following conditions: (1) the same SNP markers were detected; (2) the attenuation intervals of different SNP markers intersected, and the same QTNs detected repeatedly by multiple methods were calculated only once.

The QTNs were named as follows: q + trait name-chromosome name-QTN number, where q represents the QTN, the trait name is the OC, and the QTN number represents the ranking of the QTN on the corresponding chromosome.

The QTNs detected via multiple GWAS methods or in multiple environments were considered common QTNs.

Transcriptome sequencing analysis

The soybean materials used for transcriptome sequencing were Keshan 01 (KS01), Kedou 31 (KD31), Kedou 48 (KD48) and Kedou 57 (KD57). KS01 and KD31 were high-oil cultivars with 21.82% and 21.26% OC, respectively, and KD48 and KD57 were low-oil cultivars with 15.78% and 16.44% OC, respectively. These cultivars were provided by the Keshan Branch of Heilongjiang Academy of Sciences, Keshan, Heilongjiang, China.

Uniform seeds of the four cultivars were chosen and sown in pots (five seeds/pot, 25*18 cm pots), and after seed germination, the number of seedlings was reduced to 1 per pot. At 25 days after flowering (the rapid oil accumulation stage l), tender seeds collected from three replicate plants were frozen in liquid nitrogen and stored in an ultralow-temperature freezer at −80 °C until transcriptome sequencing analysis. All four samples were subjected to transcriptome sequencing with three biological repeats.

Transcriptome sequencing analysis was performed on an Illumina HiSeq2500/4000 system by Gene Denovo Biotechnology Co. (Guangzhou, China). To obtain high-quality clean reads, the sequencing data were processed via fastp67(version 0.18.0) for read filtering. The parameters used in this analysis were as follows: (1) reads containing adapters were removed; (2) reads containing more than 10% unknown nucleotides (N) were removed; and (3) low-quality reads containing more than 50% low-quality (Q value ≤ 20) bases were removed. The mapped reads of each sample were assembled via StringTie v1.3.168,69. For each transcription region, an FPKM (fragment per kilobase of transcript per million mapped reads) value was calculated to quantify its expression abundance and variations via RSEM70 software. The FPKM formula is as follows:

$$\:\text{F}\text{P}\text{K}\text{M}=\frac{{10}^{6}\text{C}}{\raisebox{1ex}{$\text{N}\text{L}$}\!\left/\:\!\raisebox{-1ex}{${10}^{3}$}\right.}$$

Given FPKM(A) to be the expression of gene A, C to be number of fragments mapped to gene A, N to be total number of fragments that mapped to reference genes, and L to be number of bases on gene A. The FPKM method is able to eliminate the influence of different gene lengths and sequencing data amount on the calculation of gene expression. Therefore, the calculated gene expression can be directly used for comparing the difference of gene expression among samples.

Differential expression analysis was performed using DESeq271software between two different groups (and by edgeR72 between two samples). The genes/transcripts with a false discovery rate (FDR) of less than 0.05 and an absolute fold change of ≥ 2 were considered differentially expressed genes/transcripts.

Weighted gene correlation network analysis (WGCNA) was performed using Omicsmart, a real-time interactive online platform for data analysis (http://www.omicsmart.com). The expression similarity between genes was calculated, and the Pearson correlation coefficient was used to measure the correlation between gene pairs. The similarity matrix was converted into a weighted adjacency matrix, and a soft thresholding method was used to enhance strong and weaken weak correlations. The appropriate soft threshold parameter (power value) was selected to assign the network to the characteristics of a scale-free network. Genes were divided into different modules via hierarchical clustering and dynamic tree cut methods. Each module consists of a set of highly coexpressed genes, and the genes within the module have similar expression patterns. Module-trait associations were estimated via the correlation (Pearson correlation) between the module and the OC trait.

Real-time PCR verification

Total RNA extraction, cDNA synthesis and qRT-PCR analysis for each sample were performed according to Zhang et al.73. The internal reference gene was Acint7, and data analysis was performed via the 2−ΔΔCT method. The primers used for qRT-PCR are shown in Supplementary Table S7.

Search for potential candidate genes

On the basis of the rate of LD decay, the 100 kb intervals on either side of each common QTN were set as the range for seeking potential candidate genes. On the Phytozome website (https://phytozome.jgi.doe.gov), all the genes in these intervals were found out. According to the transcriptomic datasets for the seed developmental stages in soybean (GES42871) downloaded from the GEO (Gene Expression Omnibus) database, genes that are highly expressed in seeds during oil formation were subsequently chosen. All the highly expressed genes were subsequently combined for pathway analysis via the Kyoto Encyclopedia of Genes and Genomes (KEGG) website (http://www.kegg.jp)74,75,76, after which genes enriched in lipid anabolism-related pathways were identified. Finally, according to the transcriptome sequencing results, differentially expressed genes (DEGs) between the high-oil and low-oil cultivars were identified in lipid anabolism-related pathways. By comparing these DEGs with the results of WGCNA and annotation information for the genes, potential candidate genes were identified.

Haplotype analysis

The haplotype analysis of genes among 141 soybean accessions was performed using the available genotypic data (https://ngdc.cncb.ac.cn/soyomics). First, the markers with missing genotypes more than 20% were deleted. Next, imputation using markov function of NAM package in R was used to fill with missing genotypes. Finally, markers with minor allele frequency less than 0.05 were deleted. The haplotypes were constructed according to SNPs from the untranslated region (UTR) and exon regions via Haploview v4.2 software77. Differences between haplotypes for each gene were tested by analysis of variance (contains more than two haplotypes of genes) or unpaired t-test (contains two haplotypes of genes). Multiple comparisons of phenotype differences among various haplotypes were conducted using the LSD.test function of agricolae package in R.

Fig. 1
figure 1

Phenotypic variations in the OC of the FW-RIL population in 10 environments. The interquartile region, median, and range are represented by the box, the vertical line in the box, and the vertical lines at both ends, respectively.

Fig. 2
figure 2

Correlation analysis of OC in 10 environments. The number at the bottom left represents the corresponding correlation coefficient with the dot at the top right. * represents a significant correlation at the 0.05 level, ** represents a significant correlation at the 0.01 level, and *** represents a significant correlation at the 0.001 level.

Fig. 3
figure 3

(A) Venn diagram of the total numbers of significant QTNs detected by 5 multilocus GWAS methods. (B) The total number of significant QTNs detected in 10 environments across the 5 methods. (C) The total number of significant QTNs detected in 10 chromosomes.

Fig. 4
figure 4

Distribution of significant QTNs on 14 chromosomes. No common QTN were found in the remaining 6 chromosomes (chromosome 06, 12, 13, 15,19 and 20). The common QTNs detected in multiple environments are shown in red.

Fig. 5
figure 5

(A) Statistical graph of the clean read proportions. (B) PCA of the twelve samples used for transcriptome sequencing. (C) Number of DEGs in these four groups.

Fig. 6
figure 6

(A) Expression heatmap of the 8 DEGs in these four groups. The blue represents downregulated genes, and the red represents upregulated genes. The white blank represents a nonsignificant difference. The genes in red font were upregulated in all the groups, and the genes in blue font were downregulated in all the groups, indicating a significant difference in expression. (B) Correlations of the OC trait with WGCNA modules. The red color represents a positive correlation, and the blue color represents a negative correlation. * represents a significant correlation at the 0.05 level, ** represents a significant correlation at the 0.01 level, and *** represents a significant correlation at the 0.001 level. (C) Correlations between modules and 8 DEGs. The red color represents a positive correlation, and the blue color represents a negative correlation. * represents a significant correlation at the 0.05 level, ** represents a significant correlation at the 0.01 level, and *** represents a significant correlation at the 0.001 level.

Fig. 7
figure 7

(A) Haplotype information for Glyma.18G027100 and the corresponding numbers of accessions. The upper part shows the gene structure diagram of Glyma.18G027100. (B) Boxplot showing the OC (percentage of dry weight) of the Hap1 and Hap2 haplotypes of Glyma.18G027100 among 121 accessions. (C) Haplotype information for Glyma.03G021800 and the corresponding numbers of accessions. The upper part shows the gene structure diagram for Glyma.03G021800. (D) Boxplot showing the OC (percentage of dry weight) of the Hap1, Hap2 and Hap3 haplotypes of Glyma.03G021800 among 141 accessions.

Table 1 Statistical characteristics of OC in the parents and FW-RIL populations grown in ten environments.
Table 2 Joint ANOVA of the OC of FW-RILs in multiple environments and heritability.
Table 3 Twenty-three common QTNs for seed OC in soybean across different methods or in multiple environments.
Table 4 4 Details of 13 genes annotated in the KEGG database and information regarding homologous genes in Arabidopsis Thaliana.