Introduction

Polygenic inheritance forms the genetic basis of complex traits1, and hence their dissection necessitates genomes with high-precision resolution. Advances in high-throughput sequencing technologies and emerging genome-wide association studies (GWAS) have revealed substantial genetic variation associated with complex traits in humans2,3, plants4,5, and animals6,7. Knowledge of these variations has facilitated the resolution of the genetic architectures of complex traits, genetic ancestry identification, and accurate prediction of phenotypes from genotypes. However, although genomic variation, represented by single-nucleotide polymorphisms (SNPs) and small structural variations, has illuminated a wide range of genetic mechanisms, the gene remains the determining biological factor in most applied genetic studies. Thus, while a plethora of GWAS signals have been identified, these associations rarely lead to the identification of causal genes8, and the overall goal of assigning genetic mechanisms remains unfulfilled. This discrepancy arises from the fact that strongly associated variants are often in linkage disequilibrium (LD) with causal variants9, and such LD may be unexpectedly strong in plant and animal species subjected to long-term artificial selection, resulting in excessively long candidate intervals for putative quantitative trait loci (QTL). Additionally, genome complexity, especially in polyploid species, leads to higher gene density, further reducing the efficacy of gene mapping within current GWAS frameworks. Whole-exome sequencing (WES) focuses on genetic signatures within gene exons10, but its association strategy does not differ from that used for whole-genome sequencing (WGS), and at present it still faces potential issues with detection power11. To date, few studies have comprehensively characterized genetic variation at the level of the gene unit, much less explored the potential of using gene-based variation for direct mapping of functional genes.

Cotton is the most important source of natural fibers. Since 2012, genome assemblies of representative cotton species have been constructed12,13,14,15, large-scale WGS of major germplasm populations has been conducted, and more than 4000 genomes of cotton accessions have been sequenced13,16,17,18,19,20,21,22. Most modern cultivars of upland cotton (Gossypium hirsutum L.) were domesticated from the Yucatan Peninsula of Mesoamerica and therefore possess the genetic ancestry of American cotton. Long-term domestication has led to a genetic bottleneck and high levels of LD23; moreover, as a typical allotetraploid crop, cotton has a complex genome characterized by variable structure and high gene density. Consequently, progress in mapping functional genes underlying key agronomic traits remains slow, presenting both significant opportunities and challenges for the advancement of functional genomics in cotton.

In this work, we propose a genetic strategy to transform discrete genomic variants into composite factorial functional haplotypes (FHs). By recoding non-synonymous variations within the coding regions of genes, FH reflects variations in protein sequences across a population. We analyze gene-level FH in a comprehensive collection of 3724 accessions, encompassing 3636 wild and/or semi-wild species, landraces, varieties, and modern cultivars of upland cotton and 88 G. barbadense accessions as an outgroup, and investigate changes in genetic diversity related to cotton adaptation and domestication at the gene level. Importantly, we apply a revised linear mixed model to the factorial FH and conduct gene-level association analysis on a collection of 245 phenotypes covering 20 traits, including fiber quality, lint yield and its components, plant architecture, maturity, and resistance. The FH-based GWAS enables the efficient and direct mapping of quantitative trait genes (QTGs), initially identifying 10,279 QTGs; further estimation of exact effect sizes confirms 532 genes with significant breeding potential. Finally, we perform CRISPR-Cas9 knock-out experiments on a QTG encoding ferulic acid 5-hydroxylase 1, demonstrating that its loss results in the expected superior phenotypic changes in fiber quality.

Results

Mining of gene-level functional haplotype

We collected a total of 4392 upland cotton cultivars and lines (accessions) of G. hirsutum from four of our previously published datasets, as well as four other publicly available datasets (Supplementary Data 1). An additional 88 G. barbadense accessions were included as outgroups. We categorized all samples into those with high (>10×) and low (<10×) sequencing depth, and used the samples with high depth as a reference panel to impute low-depth samples with the same geographic origin, i.e., population-specific imputation (Supplementary Data 2). After performing quality control for the genomic variants, we evaluated the combined samples and found that the samples with lower sequencing depth exhibited unique but abnormal genetic ancestry and sub-branching in the phylogenetic tree, and that these samples did not cluster with either early landraces or modern cultivars of upland cotton with the same geographic origin (Supplementary Fig. 1). In addition, we observed a distinct bimodal pattern of variant heterozygosity in low-depth samples (Supplementary Fig. 2). The discrepancies between low- and high-depth samples suggest significant technical rather than biological differences. Consequently, we removed 756 accessions with sequencing depth lower than 10×. For the remaining 3724 accessions, we re-performed population-specific imputation without a reference panel. After further quality control (biallelic, mapping quality >30, inbreeding coefficient >−0.6, minor allele frequency [MAF] >0.001), we finalized a dataset of 3724 accessions (Supplementary Data 3) with 23,057,253 genomic variants for subsequent FH identification.
These cotton accessions represent a wide range of geographic origins, including wild species, semi-wild races, and early landraces from Central America (CAL), North America (NAL), South America (SAL), and South China (SCL); improved varieties from Africa, Asia, Europe, and Oceania; and modern cultivars from the United States of America (USA), the former Soviet Union (FSU), and China, including cotton growing regions in Yellow River (YRR), Yangtze River (YZR), Northwest China (NWC), Southwest China (SWC), North China (NC), and East China (EC).

Regarding genome-wide variation, great attention was paid to non-synonymous mutations, with different genotype combinations for a given gene within a population being considered to constitute distinct FHs (Fig. 1). For example, when there is a single non-synonymous mutation in a gene, the FH genotype (i.e., FH-1, FH-2, and FH-3) is equivalent to the additive genotype of that mutation (i.e., AA, Aa, and aa); whereas when there are two non-synonymous mutations, there are at most 3^2, i.e., nine FH types in the population according to the combinatorial genotypes of the two mutations; and if there are three non-synonymous mutations, the gene has up to 3^3, i.e., 27 FH types. By analogy, when there are M non-synonymous mutations in a gene, the gene has at most 3^M FH types, but this maximum cannot exceed the population size N. By further analyzing the frequency of FHs within different populations, such as those of different geographic origins, and their correlations with agronomic traits of interest, key FHs and the genes they represent can be identified. The fundamental rationale of our strategy is the extraction of protein variation based on genomic variation, an approach that transforms the analysis object from single-point variation to gene/protein variation, thereby greatly simplifying the interpretation of subsequent genetic analysis. Furthermore, this approach facilitates efficient gene mapping in association analysis. In analyzing the 3724 accessions, we screened out 1,402,043 exonic variants, which included 334,406 non-synonymous variants located in 60,154 out of 71,994 genes that directly change the corresponding protein sequences. These variants resulted in a genome-wide total of 1,368,685 FHs across the entire population, with an average of over 22 FHs in each gene (Supplementary Data 4).
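The recoding described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (build_fh, max_fh_types) are hypothetical, genotypes are assumed to be additive codes 0/1/2 (AA, Aa, aa), and FHs are numbered by descending frequency purely for illustration (the association analysis in this paper instead takes the FH carried by TM-1 as FH-1).

```python
def build_fh(genotypes):
    """Recode per-sample genotype combinations of one gene into FH labels.

    genotypes: list of tuples, one per sample, holding the additive codes
    (0, 1, 2) of the M non-synonymous variants in the gene.
    Returns (labels, table), where table maps each combination to a label.
    """
    counts = {}
    for g in genotypes:
        counts[g] = counts.get(g, 0) + 1
    # Assumption: number FHs by descending frequency, ties broken by tuple order.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    table = {g: f"FH-{i + 1}" for i, g in enumerate(ordered)}
    return [table[g] for g in genotypes], table


def max_fh_types(m, n):
    """Upper bound on FH types for M variants in a population of N samples."""
    return min(3 ** m, n)


# Six samples, one gene with two non-synonymous variants.
genos = [(0, 0), (0, 0), (1, 0), (2, 2), (0, 0), (2, 2)]
labels, table = build_fh(genos)
print(labels)
print(max_fh_types(2, 3724), max_fh_types(3, 3724), max_fh_types(10, 3724))
```

With ten variants the combinatorial bound 3^10 exceeds the 3724 accessions, so the population size becomes the effective limit.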

Fig. 1: Strategy of FH construction.

The landscape of FHs reveals directional gene selection during cotton adaptation and domestication

In the global set of cotton accessions, genomic FHs are unevenly distributed across chromosomes, with FH abundance correlating positively with the number of genes on the chromosome (Fig. 2a, Supplementary Fig. 3, and Supplementary Table 1). To make gene diversity comparable across populations with different sample sizes, we calculated Shannon’s equilibrium index (EH) for each gene in the population. The average EH indicated significant differences in gene diversity across chromosomes and sub-genomes, suggesting that the adaptation and domestication of upland cotton introduced varying degrees of directional selection on the genome (Fig. 2a, b). For instance, chr. D02 is among the chromosomes containing the most FHs (91,206) and also has the highest average EH (0.0573), which suggests genes on this chromosome have a higher level of diversity and are less subject to natural and artificial selection. Conversely, chr. D11, despite having an even higher number of FHs (92,593), exhibited the lowest average EH (0.0391), implying strong intervention from artificial selection on its genes. Further population-level analysis revealed significantly higher gene diversities in cottons from CAL, NAL (Mexico), SAL, and SCL, which represent wild and semi-wild race species and early landraces. In contrast, modern cultivars, especially from the USA and Chinese YZR and YRR, exhibited relatively lower gene diversities (Fig. 2c). Notably, the gene diversity pattern in the SCL population (Hainan province and Guangxi province) is similar to that in American landraces, indicating the ancient genetic components of early landraces in SCL and their geographic origin. These results are consistent with modern cotton breeding history across the world.
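As an illustration of the diversity measure, the sketch below computes a Shannon-type index for one gene from the FH labels of the samples in a population. The exact definition of EH is given in Methods and is not reproduced here; this sketch assumes the Shannon entropy of FH frequencies divided by ln(N), where N is the number of samples, which is one way of making values comparable across populations of different sizes. The classical equitability index divides by the logarithm of the number of observed types instead. The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_equitability(fh_labels):
    """Shannon entropy of FH frequencies, scaled to [0, 1] by ln(N).

    Assumption: normalization by the sample size N; the paper's Methods
    may define E_H differently.
    """
    n = len(fh_labels)
    counts = Counter(fh_labels)
    if n <= 1 or len(counts) == 1:
        return 0.0  # a fixed gene carries no diversity
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


print(shannon_equitability(["FH-1"] * 10))                      # fixed gene
print(shannon_equitability(["FH-1", "FH-1", "FH-2", "FH-2"]))   # two equal FHs
```

Under this normalization a gene reaches 1 only when every sample carries a different FH, so values far below 1 are expected for most genes.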

Fig. 2: FH landscape revealed gene diversity and directional selection in the global upland cotton population.

a FH numbers identified on distinct chromosomes (bar, left y-axis) and the average gene diversity EH (line, right y-axis). Multiple comparisons of EH in different chromosomes were conducted by the least significant difference (LSD) test, with p values adjusted by the Benjamini–Hochberg method. b Distribution of EH across sub-genomes At and Dt. **p < 0.01. c Changes in gene diversity characteristics from cotton semi-wild species, early landraces to modern cultivated accessions. Red dots indicate the mean genetic diversity of each population. N represents the sample size. d Functional investigation for sequence-conserved and highly polymorphic genes. e Enrichment analysis revealing the directional selection from early landraces to modern cultivars. Source data are provided as a Source Data file.

To investigate the functions of genes with different degrees of FH diversity, we first focused on sequence-conserved genes (SCGs), which lack any missense variants across the entire population. By integrating RNA-Seq data from whole cotton tissues (Supplementary Data 5), we found that 57.54% of SCGs had potential functions related to cellular existence and cell fate, which are critical to the organism (Fig. 2d, Supplementary Table 2, and Supplementary Data 6). Regarding function classifications, ~29.92% of SCGs were house-keeping genes expressed in all 34 tested tissues, 21.98% were silent genes not expressed in any tested tissue, and 5.65% were genes expressed in specific organisms. For the remaining 42.46% of SCGs, functional enrichment analysis identified many as related to actin and microtubule processes, such as microtubule-based process and structural constituent of cytoskeleton, and others as involved in basic cell development functions such as electron transfer activity and photosynthesis (Supplementary Data 7). These fiber-associated SCGs are considered to have been fixed in the cotton breeding process.

At the other extreme, we also analyzed the functions of genes with highly variable sequences (highly polymorphic genes, HPGs, EH > 0.3). Interestingly, the majority (62.69%) of HPGs had seemingly random functions, while the remaining 37.31% were clearly related to disease response and recognition of foreign substances (Supplementary Table 3). Furthermore, most HPGs (93.53%) were subject to strong positive selection (Ka/Ks > 1, Supplementary Data 8). This finding resembles the situation of the major histocompatibility complex (MHC) and T cell receptors (TR) in human genetic research, both of which possess an unusually large number of alleles and extremely diverse protein types that determine unique antigenic compositions at the individual level and their interactions. This result suggests the possibility of a shared or similar genetic mechanism underlying immunity across the kingdoms of life.

We further analyzed the functions of genes with rapidly decreased gene diversity (ΔEH > 0.3) in modern cotton cultivars compared to early landraces. These genes are enriched in biological pathways related to carbohydrates, polysaccharides, ADP binding, pollen recognition, and intracellular signaling (Fig. 2e and Supplementary Data 9). Finally, genes with moderately reduced EH (0.1 < ΔEH < 0.3) in modern cultivars relative to earlier accessions are enriched in pathways involving microtubule binding and movement, ATPase activity, ubiquitination, and cytokinin metabolism. These findings are consistent with the results of recent cotton breeding efforts to improve agronomic traits, particularly lint yield and fiber quality.

FH-based haplotype analysis reflects population structure and elite gene transferring during development

Utilizing the genotypes encoded by factorial FH, we applied a general method to calculate the genetic distance between sample pairs, and then constructed a phylogenetic tree based upon those distances (Methods). This analysis classified the 3724 cotton accessions into six clusters, in which the 88 G. barbadense accessions were clustered into group 1 (Outgroup, G1) and the remaining 3636 G. hirsutum accessions into five groups (G2 to G6; Fig. 3a). Among these groups, G2 collected most accessions from YZR and a considerable portion from YRR, due to similar geographical and climatic conditions and frequent introductions; G3 consisted of another portion of the YRR cultivars; G4 contained most accessions from the USA; G5 included most accessions from NWC and FSU; and G6 encompassed most early landraces from America and Southern China, consistent with cotton breeding practice in China.
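To make the distance step concrete, the following sketch computes a pairwise distance from factorial FH genotypes as the proportion of genes at which two samples carry different FHs. The actual distance measure is described in Methods; the mismatch proportion used here, the treatment of missing calls (None), and the function names are assumptions for illustration only.

```python
def fh_distance(a, b):
    """Proportion of jointly genotyped genes where two samples differ in FH."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip genes with a missing FH call in either sample
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0


def distance_matrix(samples):
    """samples: dict mapping sample name to its list of per-gene FH labels."""
    names = list(samples)
    return {(p, q): fh_distance(samples[p], samples[q]) for p in names for q in names}


samples = {
    "s1": ["FH-1", "FH-2", "FH-1", None],
    "s2": ["FH-1", "FH-1", "FH-3", "FH-2"],
    "s3": ["FH-1", "FH-2", "FH-1", "FH-1"],
}
dist = distance_matrix(samples)
print(dist[("s1", "s2")], dist[("s2", "s3")])
```

A matrix of this kind can then be passed to any distance-based tree-building method, such as neighbor joining.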

Fig. 3: Population structure and domestication pattern revealed by FH in the 3724 cottons.

a Phylogenetic analysis of the global cotton population based on FH. Germplasm geographical origins and release dates are annotated on the periphery of the phylogenetic tree. All germplasms are clustered into six groups G1–G6 based on their geographic origin and tree topology. b Comparison of sample clustering between the FH tree and SNP tree. The geographical origins of the clusters’ main accessions (>5%) are annotated by the fan chart. Genomic divergence in chr. c A06 and d A08 as revealed by FH. Germplasms with the same FH are assigned to one haplotype block (HB). e Different HBs in chr. A06 and A08 are associated with fiber improvement. The center line represents the median, the lower and upper box hinges corresponding to the first and third quantiles, and whiskers extend to the minimum and maximum values. n represents the sample size. f Intensity of gene transference between population pairs. Only intensities exceeding 0.15 are labeled. Source data are provided as a Source Data file.

We further compared the phylogenetic trees constructed on the basis of FH and SNP information. Although genes are distributed far more sparsely than SNPs, FH-based genotyping completely reproduced the subgroup characteristics observed when classifying on SNPs, and also accurately reflected the population structure. At the group level, FH-G1 corresponds to SNP-G1; FH-G2 and FH-G3 to SNP-G4 and SNP-G5; FH-G4 to SNP-G6; FH-G5 to SNP-G3 and SNP-G6; and FH-G6 to SNP-G2 and SNP-G6 (Fig. 3b). Nonetheless, some discrepancies were observed in the attributional assignments of genetic components between FHs and SNPs. For example, the cultivars from SWC in FH-G6, which clustered with early landraces from the American continents and SC, were assigned to SNP-G6, which contained modern cultivars from the USA and NWC, indicating potential differences in the genetic characteristics reflected by the two marker types.

Next, FH-based haplotypes in two typical regions of structural variation on chr. A06 and A08 were further investigated. Comparisons of the merged FH and SNP haplotypes suggested that FHs are able to capture structural variations previously implied by SNPs (Supplementary Figs. 4, 5). Additionally, since FH allows genes to be coded in both a more flexible and more diverse manner, the FH-based haplotype allows structural variation to be divided into finer haplotype blocks, enabling more accurate tracking of genetic ancestry. On chr. A06, hap-block (HB)-2 and HB-3 reflected different genetic components that could not be distinguished based on SNP data, and the genetic ancestry of HB-1 and HB-4 was easily attributed to NAL and SCL, respectively (Fig. 3c). In addition, evidence of genetic sharing in HB-2, HB-3, and HB-4 supported gene transfer between SCL and SWC. Meanwhile, HBs on chr. A08 revealed unique genetic footprints of adaptation in cottons from NWC, YRR, and YZR. HB-1 on chr. A08 was unique to the NWC cultivars, while the other FH-HBs reflected the discovery and fixation, in Chinese breeding programs, of superior genes that originated from USA cottons (Fig. 3d). This result is consistent with the historical introduction of upland cotton cultivars from the USA to China over the last century, and subsequent pedigree selection and cross breeding in major cultivation areas over the past 70 years. Integration with agronomic traits revealed that the genetic polymorphisms reflected by HBs on chr. A06 and A08 are associated with important lint yield components, such as lint percentage (LP), and with fiber qualities such as fiber length (FL) and strength (FS) (Fig. 3e). The genetic pleiotropy of the A08 HBs appears to result from the combined action of pleiotropic QTL and non-pleiotropic QTL linked by LD (Supplementary Table 4). Beyond chr. A06 and chr. A08, FH also provided a more accurate and finer genetic adaptation pattern and population structure for chr. A01, A13, D01, D06, D07, and D08 (Supplementary Fig. 6).

Haplotype blocks of FH on major structural variants strongly suggest the transfer of superior alleles/genes between the introduced germplasm and modern cultivars in the breeding process. We accordingly used the number of shared FHs to calculate a coefficient reflecting the intensity of gene transference between populations (Methods, Fig. 3f, and Supplementary Data 10). The results highlighted two transference hotspots, one among early landraces and one among modern cultivars. First, in early landraces, relatively high gene transference was observed between G. barbadense accessions and cottons sampled from Central American countries and Caribbean islands; this finding is inferred to relate to the similar geographical origins of upland and island cotton. In addition, strong gene transference was observed among CAL, NAL, and SAL, which suggests the introduction of early cotton landraces within the American continents. A very high level of gene transference was also found from NAL to SCL, once again indicating an ancient genetic background for landraces in Southern China. The transference from NAL to SCL may have originated from the global trading activities of European colonizers during the Age of Discovery19. A recent genomics study found that wild cotton seeds from the island of Hainan were able to survive floating in seawater for more than six months, raising the possibility that SCL progenitors drifted over the Pacific Ocean24. Second, modern cotton cultivars exhibited another hotspot with frequent gene transference, particularly from the USA to YRR and YZR, and among those from NWC, YRR, FSU, and NC. These results are in line with current understanding of cotton breeding history in China19,25.

FH-based GWAS mapped QTGs associated with agronomic traits

Although short variations are abundantly investigated in genetic research, the vast majority of genetic studies, especially in plant and animal genetics, still rely on genes for further elucidation of regulatory mechanisms and the development of improved cultivars. In this study, we converted short variant information into gene-level FH, enabling the direct mapping of QTGs related to target agronomic traits. For the QTG mapping, we collected a total of 245 sets of phenotypic data covering 20 agronomic traits regarding lint yield, fiber quality, plant architecture, resistance, and maturity (Supplementary Data 11). Using a linear mixed model that incorporates the genetic kinship matrix generated from FH genotypes as random effects and the first two principal components (PCs) as fixed covariates, we performed single-gene regression for 60,154 genes against the 245 phenotypes. The factorial FH genotypes were dummy coded, and a reduced phenotype was applied in regression to save computational time (Methods). This FH-GWAS initially identified 10,279 genes significantly associated with agronomic traits (Supplementary Data 12). After simple clumping to remove other significant genes within 500 kilobases (kb), a total of 5022 QTG clusters were identified (Supplementary Data 13). As an essential comparison, we also conducted conventional SNP/Indel GWAS for the 245 phenotypes, which instead incorporates the genetic kinship matrix generated from SNP/Indel genotypes as random effects and the first two PCs as fixed covariates. After clumping, this yielded 6533 candidate QTLs significantly associated with the phenotypes (Supplementary Data 14). Generally, the number of QTLs is higher than the number of QTGs, likely due to possible non-coding associations (Fig. 4a). Nonetheless, the number of QTGs showed a strong positive correlation with the number of QTLs, especially in those traits with multiple environments and large sample sizes. Chr. A08 and D11 were notable for containing the most QTGs. The high number in chr. A08 may result from the extremely large LD caused by structural variation covering over 70 megabases (Mb); in chr. D11, it is likely due to associations with numerous phenotypes. This high degree of functional influence may also explain why chr. D11 features the highest number of FHs but the lowest average gene diversity.
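The clumping step can be sketched as a simple greedy procedure: visit significant genes from most to least significant, keep a gene as the lead of a cluster, and drop any later gene on the same chromosome within 500 kb of an existing lead. This is a minimal sketch under stated assumptions; the tuple layout, the use of a single gene position, and the function name clump are hypothetical, and the authors' implementation may differ in detail.

```python
def clump(hits, window=500_000):
    """Greedy distance-based clumping of significant genes.

    hits: list of (gene, chrom, pos, p_value) tuples.
    Returns the lead gene of each cluster, most significant first.
    """
    leads = []
    for gene, chrom, pos, p in sorted(hits, key=lambda h: h[3]):
        # Skip genes lying within the window of an already chosen lead.
        if any(c == chrom and abs(pos - lead_pos) <= window
               for _, c, lead_pos, _ in leads):
            continue
        leads.append((gene, chrom, pos, p))
    return leads


hits = [
    ("g1", "A08", 1_000_000, 1e-12),
    ("g2", "A08", 1_300_000, 1e-9),   # within 500 kb of g1, absorbed
    ("g3", "A08", 1_700_000, 1e-8),   # 700 kb from g1, new cluster
    ("g4", "D11", 1_100_000, 1e-10),  # different chromosome
]
lead_genes = [h[0] for h in clump(hits)]
print(lead_genes)
```

In regions of extended LD, such as the structural variation on chr. A08, a fixed window of this kind will still leave many neighboring clusters, which is consistent with the large QTG counts reported there.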

Fig. 4: FH-based GWAS identified key QTGs associated with agronomic traits and revealed the genetic basis of variety improvement in cotton.

a Comparison of the numbers of QTGs and QTLs associated with the phenotype. b Proportion of overlapped QTLs and QTGs according to flanking region size. c Distribution of QTGs on the sub-genome. d Relationship of the relative number of superior FHs (left panel), average effects of all significant FHs (right panel), and phenotypes. Each point represents a sample; point size represents the average effect size of significant FHs. e Relative number of superior FHs possessed by cottons registered or introduced in different time periods. n is the sample size. f Genetic basis of phenotypic correlations in cotton. Pie charts plot the proportion of FHs with significant pleiotropic effects on FL and FS, and on FS and LP (upper panel), and boxplots show the absolute effect sizes for FHs with inverse or isotropic pleiotropic effects (lower panel). n is the number of genes with corresponding pleiotropic effects; two-sided Student's t-tests are used for statistical inference (**p < 0.01, ns not significant). In a boxplot, the center line represents the median, the lower and upper box hinges correspond to the first and third quartiles, whiskers extend to the minimum and maximum values, and individual points beyond the whiskers (if present) denote outliers. Source data are provided as a Source Data file.

We next explored whether QTLs and QTGs reflect similar genomic regions associated with a phenotype. The number of overlapped QTLs and QTGs increased with flanking region size (1 to 5 Mb; Fig. 4b). In crops with high genome linkage, causal genes may be located very far away from significant QTLs; this once again emphasizes the superiority and necessity of FH-based GWAS in gene mapping studies. Furthermore, even with a flanking region of 5 Mb, only approximately 35% of QTLs and 25% of QTGs were found to overlap. This indicates that different genomic signatures are discovered by SNP/Indel and gene-level FH association analyses. We found that QTGs for fiber quality, lint yield, plant architecture, and resistance were situated mainly in the A sub-genome (Fig. 4c), which explains its lower gene diversity (Fig. 2b), in that the A sub-genome has experienced stronger selection pressures than the D sub-genome.
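The overlap comparison can be sketched as follows: for a given flanking size, a QTL counts as overlapped if at least one QTG lies within that distance on the same chromosome. The sketch assumes single-coordinate positions and one phenotype at a time; the function name and data layout are hypothetical.

```python
def overlap_fraction(qtls, qtgs, flank):
    """Fraction of QTLs with at least one QTG within `flank` bp.

    qtls, qtgs: lists of (chrom, pos) for one phenotype.
    """
    if not qtls:
        return 0.0
    hit = sum(
        any(c == gc and abs(p - gp) <= flank for gc, gp in qtgs)
        for c, p in qtls
    )
    return hit / len(qtls)


qtls = [("A01", 1_000_000), ("A01", 10_000_000), ("D05", 5_000_000)]
qtgs = [("A01", 2_500_000), ("D05", 9_000_000)]
for flank in (1_000_000, 2_000_000, 5_000_000):
    print(flank, overlap_fraction(qtls, qtgs, flank))
```

Swapping the two arguments gives the fraction of QTGs that fall near a QTL, which is why the two percentages reported in the text need not be equal.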

Taking the FH in TM-1 (coded as FH-1) as the reference FH, we calculated exact phenotypic effects for the alternative FHs in all 10,279 potential QTGs (Supplementary Data 15). Among all 72,559 alternative FHs, 19,228 FHs in 8917 genes were found to have significant phenotypic effects compared to the reference FH. Of these, 9411 FHs were identified as having significantly superior effects on agronomic traits, whereas 9817 FHs, which were mainly concentrated in the early landraces (Supplementary Data 16), had significantly adverse effects. A total of 7724 FHs in 3701 genes were implied to be pleiotropic across multiple traits. The association study detected statistically significant phenotypic correlations in 8917 genes, although most of the identified effect sizes were low (Supplementary Fig. 7), which is consistent with a polygenic architecture for complex traits26. Further screening to exclude rare variants and minor effects revealed 532 high-frequency genes with major effects, which are of direct breeding value (Supplementary Data 17).
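The idea of estimating effects of alternative FHs against a reference can be shown with a deliberately simplified sketch. With treatment (dummy) coding and no other covariates, the least-squares coefficient of each alternative FH equals the difference between its group mean and the mean of the reference group, which is what the code computes. The model used in this study additionally includes a kinship random effect and two PCs, which are omitted here; the function names are hypothetical.

```python
from statistics import mean


def dummy_code(fh_labels, reference="FH-1"):
    """Treatment coding: one 0/1 column per alternative FH, reference dropped."""
    levels = sorted(set(fh_labels) - {reference})
    rows = [[1 if lab == lv else 0 for lv in levels] for lab in fh_labels]
    return levels, rows


def fh_effects(fh_labels, phenotype, reference="FH-1"):
    """Effect of each alternative FH relative to the reference FH.

    Simplification: no kinship matrix or PCs, so each effect is just the
    difference in group means.
    """
    groups = {}
    for lab, y in zip(fh_labels, phenotype):
        groups.setdefault(lab, []).append(y)
    base = mean(groups[reference])
    return {lab: mean(ys) - base for lab, ys in groups.items() if lab != reference}


labels = ["FH-1", "FH-1", "FH-2", "FH-2", "FH-3"]
fiber_length = [28, 30, 31, 33, 25]  # toy phenotype values
levels, rows = dummy_code(labels)
effects = fh_effects(labels, fiber_length)
print(levels, effects)
```

A positive effect marks an FH as superior to the reference for traits where larger values are desirable, and the sign convention must be reversed for traits where smaller values are preferred.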

At the individual level, we found the relative number of superior FHs (the number of superior FHs minus the number of inferior FHs in a given sample) and the average effects of significant FHs to be highly correlated with most phenotypes (Fig. 4d, Supplementary Fig. 8, and Supplementary Table 5). This result illustrates the reliability of the QTGs identified in FH-GWAS. Considering the release dates of the various accessions, we examined the pattern of change in the number of superior FHs in accessions from the beginning of the 20th century to the present (Fig. 4e and Supplementary Fig. 9). This revealed superior FH presence to have massively increased over time for lint yield traits of LP, lint index (LI), fiber weight per boll (FWPB), boll weight (BW); for fiber quality traits of fiber uniformity (FU), spinning consistency index (SCI); and for plant height (PH), which relates to plant architecture, and verticillium wilt index (VW), which relates to disease and stress resistance. These trends are particularly evident in cultivars selected after the 1990s, where the efficiency of breeding selection markedly improved due to advancements in and maturation of the molecular biology techniques applied in genetic breeding. For some agronomic traits such as FL, FS, and FD, the long history of cultivar improvement resulted in only a relatively limited introduction of superior FHs until the early 21st century. However, these FHs became more abundant in cultivars from the year 2000 onwards, suggesting a shift in the goals of cotton breeding. Interestingly, superior FHs related to FM and FE showed obvious decreases in cultivars released over the last 30 years (Supplementary Fig. 10), a result that is consistent with the decreased fiber fineness in recent cultivars and implies great potential for improvement of these traits in future cotton genetic breeding.

There are well-known phenotypic correlations for several important traits related to cotton fiber, particularly the positive correlations between fiber quality traits (e.g., between FL and FS) and negative correlations between fiber quality and lint yield traits. We resolved the genetic basis of these phenotypic correlations from the gene perspective. For the positively correlated FL and FS, we found almost all pleiotropic FHs (99.73%) to have the same effect direction (Fig. 4f and Supplementary Table 6), which implies a synergistic improvement in FL and FS by the selection of a causal gene; meanwhile, for FHs with isotropic or inverse effects for FL and FS, the exact phenotypic effects showed no significant difference. For the negatively correlated FS and LP, although isotropic FHs also constituted the major part (79.41%) of pleiotropic FHs, we observed a clearly higher proportion of FHs (20.59%) with inverse effects. More importantly, these inverse-effect FHs are generally capable of improving one of the two traits to a greater extent, meaning they are often more observable and preferable in breeding activities, albeit at the expense of impairing the other trait.
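The classification of pleiotropic FHs by effect direction can be sketched as below. An FH is treated as isotropic when its effects on the two traits share a sign and as inverse when the signs differ; FHs with a zero effect on either trait are left out. The data layout and the function name are hypothetical, and raw effect signs are assumed to be comparable between the two traits.

```python
def classify_pleiotropy(effects):
    """Proportions of isotropic and inverse pleiotropic FHs.

    effects: dict mapping an FH identifier to (effect on trait A,
    effect on trait B), both estimated against the reference FH.
    """
    iso = sum(1 for a, b in effects.values() if a * b > 0)
    inv = sum(1 for a, b in effects.values() if a * b < 0)
    total = iso + inv
    if total == 0:
        return {"isotropic": 0.0, "inverse": 0.0}
    return {"isotropic": iso / total, "inverse": inv / total}


toy = {
    "geneA:FH-2": (0.4, 0.9),    # improves both traits
    "geneB:FH-3": (-0.2, -0.5),  # impairs both traits
    "geneC:FH-2": (0.6, -1.1),   # improves one, impairs the other
    "geneD:FH-5": (0.3, 0.0),    # not pleiotropic, ignored
}
shares = classify_pleiotropy(toy)
print(shares)
```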

Validation of the effects of key QTGs on fiber quality

To validate the reliability of QTG identification by FH-GWAS, we chose the gene GH_D11G1903, whose FH diversity showed significant association with fiber quality traits in multiple phenotypes (Fig. 5a). This gene, named GhFAH1 hereafter, encodes ferulic acid 5-hydroxylase 1, which has previously been reported to affect the biosynthesis of phenylpropanoids27, phenylalanine-derived specialized metabolites that are included in the structural components of plant cell walls. Three missense variants were identified in GhFAH1 (Supplementary Data 18), which give rise to eight FH types in the entire population (Supplementary Table 7). Our analysis identified three major FHs (FH-1, FH-2, and FH-4) for GhFAH1; relative to the reference FH in TM-1, the other two FHs contain heterozygous and homozygous missense mutations at the same position, resulting in a change of amino acid from glycine to serine. FH haplotype analysis showed significant decreases in FL and SCI for samples carrying the alternative FHs (Fig. 5b). The gene GhFAH1 is located in the QTL candidate region represented by lead variant D11_20177498 (Fig. 5c) and neighboring variant D11_20531989 (Supplementary Fig. 11). We also investigated GhFAH1 expression across the tissues of the TM-1 plant and found it to be specifically highly expressed in 20 and 25 days-post-anthesis (DPA) fibers (Supplementary Fig. 12). Furthermore, examination of its expression in 20 DPA fibers of 207 G. hirsutum accessions revealed subtle yet notable correlations with the fiber traits FL, FS, and FM (Fig. 5d and Supplementary Data 19, 20). In addition, earlier landraces, especially Yucatanense (YUC), showed predominant GhFAH1 expression in 20 DPA ovules, while its expression was lacking in most modern cultivars (Fig. 5e and Supplementary Data 21). This transcriptomic evidence suggests a potential negative impact of GhFAH1 on superior fiber establishment. We successfully performed CRISPR-Cas9 editing of GhFAH1 (Fig. 5f and Supplementary Data 22), which resulted in significantly longer and finer fibers (Fig. 5g, h and Supplementary Data 23).

Fig. 5: CRISPR-Cas9 knock-out (KO) experiment validates the phenotypic effects of GhFAH1 on fiber quality.

a Association study revealed GH_D11G1903 (GhFAH1) as significantly affecting FL and SCI. b Haplotype analysis showed the phenotypic effects of different FHs of GhFAH1. n is the sample size. The center line represents the median, the lower and upper box hinges corresponding to the first and third quantiles, and whiskers extend to the minimum and maximum values. c Colocalization between GhFAH1 and the candidate QTLs. d Correlations between GhFAH1 expression in 20 DPA fiber and 90 phenotypes in 207 G. hirsutum cultivars. e Expression of GhFAH1 (TPM) in 20 DPA ovules of several early landraces and modern cultivars. Data were presented as mean values ± SEM. f Sanger sequencing chromatogram illustrating a 2 bp deletion in target GhFAH1. g Fiber quality evaluation for receptor parent W0 and KO lines. h Phenotypes of mature fibers of W0 and KO lines. Two-sided Mann–Whitney U-test was applied for statistical test (*p < 0.05, p < 0.1). Error bar represents the standard error, n is the sample size, data were presented as mean values ± SEM. Source data are provided as a Source Data file.

Finally, we compared our QTG results with previous genetic fine-mapping studies and transgenic experiments; this literature provided a total of seven genes with validated functions in fiber development (Supplementary Data 24). These results confirm the broad validity of FH-GWAS as an effective method for identifying key QTGs related to agronomic traits.

Discussion

Large-scale population sequencing and a sophisticated statistical genetics framework have led to major breakthroughs in the life sciences, resolving the genetic mechanisms of complex life activities with unprecedented precision. Although a large number of QTLs have been confirmed in GWAS studies, these results have generally disappointed in terms of driving the discovery of biological causes of phenotypes. Achieving sufficient identification of QTGs from QTLs remains fraught with difficulty. This challenge is particularly prominent in plants, which typically exhibit stronger LD due to extensive artificial selection and complex genome structures28. Recently, the genetic characterization of large populations has provided a new opportunity for overcoming this challenge. In this study, we proposed a strategy in which a multidimensional reduction-like analysis29 is performed on the non-synonymous variants present in a gene. In this first FH panel of 3724 cotton accessions representing the most comprehensive upland cotton collections to date, our analysis yielded a total of 1,368,685 FHs representing global cotton protein diversity. Although FH utilizes only about 3.8% of the total genomic information, it is fully capable of reflecting both known structural information, such as in cotton chr. A06 and A0819, along with adaptive signatures in the genome (Figs. 2e, 3a). Being highly informative, FH and its merged blocks allow more accurate tracing of genetic ancestry for genomic segments (Fig. 3c, d and Supplementary Figs. 46) and also reveal additional knowledge about the history of human introductions as well as gene transference events at the gene level (Fig. 3f).

Genetic mapping has always been a central task in applied genetics. Our study leveraged the FH coding strategy and a tailored linear mixed model to achieve efficient narrowing of candidate gene sets and, in some scenarios, direct mapping of QTGs at the GWAS stage. Compared with conventional GWAS based on short variants, the FH-based GWAS detected fewer significant clumped QTGs (5022) than significant clumped QTLs (6533), as the wider distribution of SNPs and short structural variants revealed extra genomic signatures in non-coding regions. In addition, only 25% of the identified QTLs and 35% of QTGs shared the same candidate interval (Fig. 4b). This finding suggests long-range linkage between the significant and causal variant or gene23,30, and additionally illustrates the unique advantage of FH in discovering associations between protein sequence variation and phenotype. That is, although both SNP-GWAS and FH-GWAS could identify unique genetic signatures, the QTGs identified by FH-GWAS are more readily applicable and more amenable to genetic interpretation. We selected GhFAH1 for validation; its FHs were significantly associated with fiber quality, and the gene was specifically expressed during the critical stages of fiber development. The impact of this gene on fiber quality was confirmed by CRISPR-Cas9 gene editing. Taken together, these results demonstrate the feasibility of direct mapping of QTGs through FH-based GWAS. Efficient gene mapping for complex traits in animal and plant species will promote and accelerate the development of Breeding 5.0, which leverages big data and artificial intelligence for deep mining and breeding design, and operates at the gene level through transgenic technology (Fig. 6).

Fig. 6

Pipeline of intelligent precision breeding (Breeding 5.0).

Beyond the comparison of FH-GWAS with SNP-GWAS, the polygenicity of common complex traits reflected in FH-GWAS also deserves attention. For the 20 agronomic traits tested in this study, a seemingly large number of genes (8916) were identified as significantly related to phenotypes (Supplementary Data 15). This result preserves the potential impact of rare variants on phenotypes, which is commonly present in plant mutants. In other words, we did not adopt a very strict screening of FH frequency in the initial association analysis, in light of having applied stringent screening conditions to the initially identified variants to ensure their authenticity. In addition, the effect sizes of significant FHs were consistent with the theoretical normal distribution in the polygenicity framework (Supplementary Fig. 7), and these low-effect but undeniable associations indicate a small genetic contribution of common genetic variants or a highly polygenic contribution involving many variants. Nevertheless, we need to emphasize that additional evaluation is especially required before applying significant FHs with low frequency under the current strategy. Upon further screening, we identified 532 high-frequency genes with major effects, which are observable in conventional breeding and can be used as reliable references for further breeding practices.

Gene-level FH-based association analysis has laid the foundation for broad and more readily interpretable genetic discoveries. However, as a refined model within the GWAS framework, this method remains susceptible to various forms of cryptic relatedness that may introduce spurious associations. Multi-omics assessment of significantly associated QTGs remains indispensable to refine the genetic rationale of QTG-influenced traits. Besides, it is also important to acknowledge that our analytic framework focuses on genotypic configuration and does not account for haplotype phase information. This modeling strategy introduces theoretical limitations, as heterozygous genotypes within FHs may encode distinct peptide products depending on allelic phase. Consequently, haplotype-specific effects at the gene level may be incompletely resolved. Nevertheless, we should emphasize that the association between phenotypes and FHs containing homozygous non-synonymous mutations under this framework remains valid and appropriate, but the indeterminate classification of heterozygous sites reduces statistical power to distinguish causal FHs from confounding associations, resulting in conservative bias in statistical interpretations. Further inclusion of phase information is essential, while currently there is still a lack of a haplotype-resolved reference genome in cotton, and the polyploid complexity of the species has limited the application of computationally inferred phase information, which risks introducing substantial technical artifacts.

China has achieved remarkable success in improving cotton cultivars and promoting their production. However, in recent years, both the acreage of cotton cultivation and total yield have declined31. While ensuring high yields, it is crucial to develop high lint yield cultivars with excellent fiber quality. This study has found that, over the past 30 years, the number of superior FHs related to lint yield has increased significantly (Fig. 4e). Meanwhile, since 2000, superior FHs related to FL have also increased, but those related to FM and FE have decreased in number, which is consistent with the development trend of China’s cotton production. At present, the number of superior genes related to lint yield has basically reached a bottleneck, and the number of inferior genes is very limited (Supplementary Data 16). However, there are relatively few superior genes and many inferior genes related to fiber fineness, elongation rate, and maturity, which could be considered the next main direction for cotton genetic breeding.

Concomitant achievement of high lint yield and superior fiber quality has been the main goal of cotton genetic breeding in the past few decades. However, long-term phenotypic studies and recent genomic analyses have revealed what is now a well-established negative correlation of these traits in both morphology and genetics. Several previous studies have analyzed the genetic basis for this strong negative correlation, examining additive effects of QTL (clusters). These studies have found that most QTLs affecting both lint yield and fiber quality have opposite additive effects, while those affecting multiple fiber quality traits such as FL and FS share the same effect direction32,33,34. In the present study, our identification of a consistent effect direction of QTGs for FL and FS agrees with previous findings. Interestingly, however, we found that most QTGs with pleiotropic effects on LP and FL also have the same effect direction (Fig. 4f). These same-direction QTGs have very small effects, whereas those with inverse directions are fewer but have a very strong effect in improving one of the traits. Thus, our findings strongly indicate that there are a large number of genes in the cotton genome that can simultaneously improve both fiber quality and lint yield, but they have rarely been mapped in breeding practice because of their minor effects. Meanwhile, genes that can significantly improve yield or quality are more likely to be selected by breeders, but potentially impair the other trait. These results provide insights into key issues in cotton breeding and suggest a theoretical basis for achieving current major breeding goals.

Methods

Sample collection and calling of genomic variations

In this study, we collected the raw sequencing data of 4480 cotton accessions from eight public datasets (Supplementary Data 1). These accessions represent global cotton origin areas as well as major cultivation areas. The raw sequencing data were trimmed using fastp (ver. 0.23.4) with default parameters35, and the clean reads were mapped against the G. hirsutum genetic standard reference genome TM-112 using BWA (ver. 0.7.17-r1188)36. The mapping results were sorted and compressed with samtools (ver. 1.11)37. PCR duplicates were marked using sambamba (ver. 0.6.6)38. Genomic variations were identified in batches using the bcftools (ver. 1.11) “mpileup” and “call” commands39. Variants with mapping quality (QUAL) lower than 30 were discarded. Indel markers from different batches were left-aligned and normalized with the bcftools “norm” command to avoid ambiguity in the merge step, and all datasets were merged using the bcftools “merge” command. Variant imputation was conducted with Beagle (ver. 5.4)40. After quality control for the genomic variations (biallelic, mapping quality >30, heterozygosity <0.1, number of homozygotes for alternative allele >10), a total of 4480 accessions and 9,052,728 variants were retained. Samples with different sequencing depths were evaluated by principal component analysis (PCA) and ancestry estimation. After evaluation of the merged dataset, we removed samples with low sequencing depth, and variants with an inbreeding coefficient lower than −0.6 and MAF lower than 0.001. Finally, 3724 samples with 23,057,253 genomic variants were retained (Supplementary Data 3).

Construction of the FH dataset

For the 23,057,253 genomic variants, we annotated their effects using snpEff (ver. 5.1)41 and extracted a total of 334,406 non-synonymous variants that directly alter protein sequences (Supplementary Data 25). In cases where multiple missense variants were present in a gene, the genotype combination in each sample was extracted as the FH genotype of that sample for that gene. To illustrate, if the genotype of FH-5 for a gene is 022, the gene contains three non-synonymous variants whose additive genotypes are 0, 2, and 2, respectively; that is, the sample is homozygous for the reference allele at the first variant and homozygous for the alternative allele at the other two. Based on this principle, the FH for every gene in the reference genome TM-1 will always be a combination of 0s (referred to as FH-1). This facilitates estimation of the phenotypic effects of different FHs. Notably, the genotype coding is changed from the numeric method used for SNPs or Indels, which counts alleles (0, 1, 2), to a factorial method; for example, FH-10 represents type ten for this gene.
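The coding rule described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name, the sample names, and in particular the ordering of non-reference FH numbers (by decreasing frequency) are assumptions made for illustration, since only the all-zero reference combination is fixed as FH-1 in the text.

```python
from collections import Counter

def code_functional_haplotypes(genotypes):
    """Map per-sample genotype strings for one gene to factorial FH labels.

    genotypes: dict of sample name -> string of additive genotypes (0/1/2),
    one character per non-synonymous variant in the gene.
    """
    n_variants = len(next(iter(genotypes.values())))
    reference = "0" * n_variants  # the reference (TM-1) combination is all 0s
    counts = Counter(genotypes.values())
    # Assumption: non-reference FHs are numbered by decreasing frequency,
    # ties broken by genotype string; the source does not state the ordering.
    others = sorted((g for g in counts if g != reference),
                    key=lambda g: (-counts[g], g))
    labels = {reference: "FH-1"}
    for rank, combo in enumerate(others, start=2):
        labels[combo] = f"FH-{rank}"
    return {s: labels[g] for s, g in genotypes.items()}, labels

# Hypothetical samples for a gene with three non-synonymous variants.
samples = {"s1": "000", "s2": "022", "s3": "022", "s4": "010", "s5": "000"}
fh_by_sample, fh_labels = code_functional_haplotypes(samples)
print(fh_by_sample)
```

The resulting labels are categorical: FH-3 is not "larger" than FH-2, which is why the downstream analyses treat FHs as factor levels rather than allele counts.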

Genetic diversity and selection pressure analysis

Shannon’s equitability index (EH) is commonly used to characterize the distribution equitability of species with given abundance42,43. In this study, we use EH to measure the gene diversity across different chromosomes and populations. For a given gene, the EH is calculated as follows:

$${E}_{H}=-\frac{1}{{{{\mathrm{ln}}}}\,{N}}\sum \limits_{i}{p}_{i}\,{{{\mathrm{ln}}}}\,{p}_{i}$$
(1)

in which \({p}_{i}\) is the prevalence of the i-th FH for this gene in the population, and N is the population size. The EH ranges from 0, indicating this gene is sequence-conserved as only one reference FH (FH-1, \({p}_{i}\) = 1) exists in the population, to 1, indicating this gene is extremely variable because every sample in the population has a unique FH (\({p}_{i}\) = 1/N). For genes with varying diversity across populations, we investigated their potential biological functions by gene ontology (GO) enrichment analysis using the R package clusterProfiler (ver. 4.6.2)44.
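As a small worked illustration of Eq. (1), the following Python sketch computes EH from a list of per-sample FH labels for one gene. The function name and the toy inputs are hypothetical; the normalization by ln N, with N the population size, follows the definition given above.

```python
import math
from collections import Counter

def shannon_equitability(fh_labels):
    """E_H = -(1 / ln N) * sum_i p_i ln p_i, with N the population size."""
    n = len(fh_labels)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(fh_labels)
    # -p ln p is written as p ln(1/p) so that a single FH yields exactly 0.0.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n)

conserved = ["FH-1"] * 8                      # only the reference FH occurs
variable = [f"FH-{k}" for k in range(1, 9)]   # every sample has a unique FH
mixed = ["FH-1"] * 6 + ["FH-2"] * 2

print(shannon_equitability(conserved))
print(shannon_equitability(variable))
print(shannon_equitability(mixed))
```

The two extreme inputs reproduce the bounds stated in the text: a fully conserved gene gives 0 and a gene with a unique FH in every sample gives 1.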

To quantify selection pressure, we calculated the ratio of non-synonymous substitutions and synonymous substitutions (Ka/Ks)45 for each gene:

$${K}_{a}/{K}_{s}=\frac{{\sum }_{i}{N}_{i}{Q}_{i-{variant}}/{Q}_{i-{allele}}}{{\sum }_{i}{N}_{i}{X}_{i-{variant}}/{X}_{i-{allele}}}$$
(2)

in which Ni is the number of samples with the i-th FH, Qi-variant and Xi-variant denote the number of non-synonymous and synonymous variants in the i-th FH, and Qi-allele and Xi-allele represent the number of non-synonymous and synonymous alleles. When Ka/Ks is equal to 1, it indicates the gene is subject to neutral selection, while Ka/Ks greater than 1 or less than 1 represents positive selection and purifying selection, respectively.

Phylogenetic analysis and ancestry estimation

We constructed phylogenetic trees from both genomic markers and FH markers. For the regular genomic variants, PCA, phylogenetic analysis, and ancestral estimation were conducted on the LD-pruned dataset comprising 43,602 independent variants. These variants were extracted on the basis of having pairwise correlations lower than 0.05 with any other variants located within a 100 kb window. The LD-pruning step and PCA were conducted using the PLINK2 (ver. 2.00a6LM)46 software with the respective input parameters “--indep-pairwise 100 kb 0.05” and “--pca”. The neighbor-joining tree was constructed for the large genomic variant dataset using FastTree (ver. 2.1.11SSE3)47 with default parameters. Ancestral estimation for all samples was conducted using Admixture (ver. 1.3.0)48 with default parameters. For the FH dataset, the factorial rather than numeric genotype coding made the previous methods inapplicable. Instead, we first calculated the general genetic distance

$$d=D/L$$
(3)

in which D is the number of genes that possess different FHs between two individuals and L is the total number of genes, along with its associated variance d(1−d)/L. The general genetic relatedness score is then calculated as 1−d. We then used the general relatedness matrix to construct the neighbor-joining tree. These two steps were implemented using the functions “dist.gene()” and “nj()” in the R package ape (ver. 5.8)49.
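For readers who want to check Eq. (3) by hand, the following Python sketch computes the pairwise distance and the derived relatedness score from per-sample FH labels. It is an illustration only, not a substitute for the ape functions named above; the function names and the toy table are hypothetical, and the variance is computed here from the proportion d, as d(1−d)/L.

```python
def fh_distance(fh_a, fh_b):
    """Return (d, variance) for two individuals' FH labels over L genes."""
    if len(fh_a) != len(fh_b) or not fh_a:
        raise ValueError("both individuals need labels for the same genes")
    n_genes = len(fh_a)                                        # L
    n_diff = sum(1 for a, b in zip(fh_a, fh_b) if a != b)      # D
    d = n_diff / n_genes
    return d, d * (1 - d) / n_genes

def relatedness_matrix(fh_table):
    """fh_table: list of per-sample FH label lists -> matrix of 1 - d."""
    n = len(fh_table)
    return [[1 - fh_distance(fh_table[i], fh_table[j])[0] for j in range(n)]
            for i in range(n)]

# Three hypothetical individuals typed at four genes.
table = [
    ["FH-1", "FH-1", "FH-2", "FH-1"],
    ["FH-1", "FH-3", "FH-2", "FH-1"],
    ["FH-2", "FH-3", "FH-1", "FH-4"],
]
G = relatedness_matrix(table)
print(G)
```

Because the labels are compared only for equality, two FHs that differ at one variant count the same as two FHs that differ at all variants, which is the consequence of the factorial coding noted in the text.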

Gene transference intensity

Our study measured the level of pairwise subpopulation gene transference caused by natural factors or human activities by calculating the relative number of genes sharing the same type (i.e., FH). Specifically, the gene transference intensity (\({G}_{s}\)) between subpopulations A and B was calculated as

$${G}_{s}=\frac{1}{M}\sum\limits_{i}{w}_{i}{n}_{i}$$
(4)

in which M is the total number of genes; \({n}_{i}\) is the number of genes that share the same FH among exactly \(i\) subpopulations; \(i\) ranges from 2, representing genes that share the same FH only between subpopulations A and B, which provides the most direct evidence of gene transference between those subpopulations, to the maximum number of subpopulations, representing genes that share the same FH across the entire population; and \({w}_{i}\) is the weight score

$${w}_{i}={e}^{-\frac{{(i-2)}^{2}}{10}}$$
(5)

which accounts for how the number of subpopulations sharing an FH affects its contribution to gene transference in the pair. When \(i\) is 2, \({w}_{i}\) reaches its maximum value of 1, and it decreases as \(i\) increases. When \(i\) reaches 8, \({w}_{i}\) is only 0.027, meaning that FHs occurring in eight or more subpopulations are barely informative in revealing the gene transference between subpopulations A and B. After all \({G}_{s}\) values were computed, they were further normalized through division by the intra-population \({G}_{s}\) value.
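Eqs. (4) and (5) can be evaluated directly once the counts \({n}_{i}\) are available. The Python sketch below assumes these counts have already been tabulated for a given pair of subpopulations; the function names and the example counts are hypothetical, and the final normalization by the intra-population value is not shown.

```python
import math

def weight(i):
    """w_i = exp(-(i - 2)^2 / 10), Eq. (5)."""
    return math.exp(-((i - 2) ** 2) / 10)

def transference_intensity(shared_counts, total_genes):
    """G_s = (1 / M) * sum_i w_i * n_i, Eq. (4).

    shared_counts: dict mapping i (number of subpopulations sharing the FH,
    including A and B) to n_i (number of such genes).
    """
    return sum(weight(i) * n for i, n in shared_counts.items()) / total_genes

# Hypothetical counts for one pair of subpopulations and M = 1000 genes.
counts = {2: 10, 3: 5, 8: 100}
g_s = transference_intensity(counts, 1000)
print(round(weight(8), 3), g_s)
```

In this toy case the 100 genes shared by eight subpopulations contribute less to the score than the 10 genes shared only by the pair, which is the intended down-weighting of widely shared FHs.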

FH-GWAS modeling

We performed linear mixed model-based GWAS to identify candidate QTGs for the agronomic traits of interest in cotton. The statistical model is described as:

$${{{\rm{y}}}}={{{{\rm{X}}}}}_{c}{{{{\rm{\beta }}}}}_{c}+{{{{\rm{F}}}}}_{j}{{{{\rm{g}}}}}_{j}+{{{\rm{r}}}}+{{{\rm{e}}}}$$
(6)

where \({{{\rm{y}}}}\) is an \(n\times 1\) vector of phenotypic value for \(n\) samples; \({{{{\rm{X}}}}}_{c}\) is the incidence matrix of fixed covariates, which includes the intercept coefficient 1 and the top two PCs, with their corresponding effects \({{{{\rm{\beta }}}}}_{c}\); \({{{{\rm{g}}}}}_{j}\) is a vector of genetic effect captured by the tested gene with \({{{{\rm{g}}}}}_{j} \sim {{{\rm{N}}}}(0,{{{\rm{I}}}}{\sigma }_{g}^{2})\), in which \({\sigma }_{g}^{2}\) is variance; and \({{{{\rm{F}}}}}_{j}\) is the \(n\times m\) coding matrix for the \(j\)-th gene. For sample \(i\) that carries the factorial FH \(q\) in the \(j\)-th gene, \(q\) was coded to the dummy vector \({{{{\rm{F}}}}}_{{ij}}\) by setting the \(q\)-th element in an all-zero vector to 1,

$${{{{\rm{F}}}}}_{{ij}}=\left[{0}_{1},{0}_{2},\ldots,{0}_{q-1},1,{0}_{q+1},\ldots {0}_{m}\right]$$
(7)

in which \(m\) is the total number of FH types observed for the tested gene. Additionally, \({{{\rm{r}}}}\) is a vector of genetic effect captured by genetic relatedness with \({{{\rm{r}}}} \sim {{{\rm{N}}}}(0,{{{\rm{G}}}}{\sigma }_{r}^{2})\), the covariance matrix of \({{{\rm{r}}}}\) is \({{{\rm{G}}}}{\sigma }_{r}^{2}\), meaning the variance of each individual’s genetic effect is scaled by their kinship coefficient (diagonal of \({{{\rm{G}}}}\), the general relatedness matrix applied for phylogenetic analysis), and the covariance between individuals is determined by their pairwise genetic relatedness (off-diagonal of \({{{\rm{G}}}}\)); and \({{{\rm{e}}}}\) is the residual vector with \({{{\rm{e}}}} \sim {{{\rm{N}}}}(0,{{{\rm{I}}}}{\sigma }_{e}^{2})\). To reduce the computational burden, we firstly fitted the following model

$${{{\rm{y}}}}={{{{\rm{X}}}}}_{c}{{{{\rm{\beta }}}}}_{c}+{{{\rm{r}}}}+{{{\rm{e}}}}$$
(8)

and calculated the reduced phenotype \(\widetilde{{{{\rm{y}}}}}\) as

$$\widetilde{{{{\rm{y}}}}}={{{\rm{y}}}}-\hat{{{{\rm{r}}}}}$$
(9)

in which \(\hat{{{{\rm{r}}}}}\) is the vector of estimated effects of genetic relatedness. Then, two models, with and without the tested gene, were fitted. The null model (\({H}_{0}\)) is

$$\widetilde{{{{\rm{y}}}}}={{{{\rm{X}}}}}_{c}{{{{\rm{\beta }}}}}_{c}+{{{\rm{e}}}}$$
(10)

and the alternative model (\({H}_{1}\)) is

$$\widetilde{{{{\rm{y}}}}}={{{{\rm{X}}}}}_{c}{{{{\rm{\beta }}}}}_{c}+{{{{\rm{F}}}}}_{j}{{{{\rm{g}}}}}_{j}+{{{\rm{e}}}}$$
(11)

The likelihood ratio test was applied to the two models to determine whether the variance \({\sigma }_{g}^{2}\) of \({{{{\rm{g}}}}}_{j}\) is significantly greater than 0 (one-sided alternative), with 1 degree of freedom. A suggestive threshold of 1e−5 was applied for determining the significance of the tested gene in affecting the phenotype. The GWAS analysis was conducted with the R package lme4qtl (ver. 0.2.2)50.
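Two ingredients of this procedure are simple enough to sketch in a few lines of Python: the dummy coding of Eq. (7) and the conversion of a likelihood ratio statistic into a P value with 1 degree of freedom. This is an illustration under stated assumptions rather than the lme4qtl implementation; the function names are hypothetical, the columns of the coding matrix are ordered by sorted FH label, and the log-likelihood values in the example are made up, whereas in practice they would come from the fitted null and alternative models.

```python
import math

def dummy_code(fh_types):
    """Build the n x m coding matrix F_j for one gene (Eq. 7).

    fh_types: per-sample FH type (any sortable label). Columns follow the
    sorted order of the observed types; m is the number of observed types.
    """
    levels = sorted(set(fh_types))
    index = {q: k for k, q in enumerate(levels)}
    rows = []
    for q in fh_types:
        row = [0] * len(levels)
        row[index[q]] = 1          # a single 1 marks the sample's FH type
        rows.append(row)
    return levels, rows

def lrt_p_value(loglik_null, loglik_alt):
    """P value of the likelihood ratio statistic against chi-square, 1 df."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

levels, F = dummy_code([1, 2, 1, 4])
print(levels, F)

# Hypothetical log-likelihoods for H0 and H1 of one tested gene.
p = lrt_p_value(-100.0, -85.0)
print(p, p < 1e-5)
```

Each row of the coding matrix sums to 1, so a gene with m observed FH types contributes m effect levels to the alternative model, while the test itself concerns only the single variance component.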

SNP and indel GWAS

The filtered SNP and Indel set in VCF format was converted to a variant-major text genotype table (tped) using PLINK2, and GWAS was conducted with the efficient mixed model association eXpedited software (EMMAX, ver. 20120210)51, with the genetic kinship matrix and the first two principal components as covariates. The significance of each variant's effect was tested using a two-sided generalized least squares F-test, with a suggestive significance threshold of 1e−5.

Clumping of QTLs and QTGs

Significant genomic variants (SNPs and Indels) and genes were clumped in order to take LD into account in the results interpretation. For significant variants, any sites less than 1 Mb away from an index variant (variant having p value lower than 1e-5) and greater than 0.1 correlation with it were assigned to that index variant’s clump; this clumping step was conducted using the PLINK2 “--clump” command, with the current population as the reference panel. For genes significantly associated with phenotypes in the FH-based GWAS, a simple clumping was conducted by keeping the most significant gene while removing any other significant genes within 500 kb.
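The simple gene-level clumping can be read as a greedy procedure, sketched below in Python. This reading is an assumption: genes are visited from most to least significant, and a gene is kept only if no already-kept gene on the same chromosome lies within 500 kb. The function name, the use of a single position per gene, and the example records are hypothetical.

```python
def clump_genes(genes, window=500_000):
    """Greedy clumping of significant genes.

    genes: list of (gene_id, chrom, position, p_value) tuples.
    Returns the kept genes, most significant first.
    """
    kept = []
    for gene in sorted(genes, key=lambda g: g[3]):   # smallest p value first
        _, chrom, pos, _ = gene
        # Keep the gene only if every kept gene is on another chromosome
        # or farther away than the window.
        if all(chrom != k[1] or abs(pos - k[2]) > window for k in kept):
            kept.append(gene)
    return kept

# Hypothetical significant genes (IDs and coordinates are made up).
hits = [
    ("g1", "D11", 1_000_000, 1e-8),
    ("g2", "D11", 1_300_000, 1e-6),
    ("g3", "D11", 1_700_000, 1e-7),
    ("g4", "A06", 1_100_000, 1e-6),
]
kept_ids = [g[0] for g in clump_genes(hits)]
print(kept_ids)
```

In this example g2 is dropped because it lies within 500 kb of the more significant g1, whereas g3 survives because it is 700 kb away from g1.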

Calculation of FH effects on phenotypes

For a given QTG identified as significantly associated with a phenotype, we calculated the phenotypic effects of the FHs it contains across the population. The FH in the reference genome TM-1 was always defined as FH-1 and used as the base FH to which average phenotypes of the other FHs (such as FH-2, FH-3, etc.) were compared. FHs appearing in fewer than two samples were removed. The average phenotypic value among samples containing the FH-q was calculated as

$${\bar{y}}_{q}=\frac{1}{{n}_{q}}\sum\limits _{i}{y}_{{qi}}$$
(12)

and the effect size of FH-q was defined as the difference between \({\bar{y}}_{q}\) and \({\bar{y}}_{1}\). Multiple comparisons of phenotype for samples with different FHs were conducted by means of the least significant difference (LSD) test, with p values adjusted by the Benjamini–Hochberg method. Genes of high breeding value were screened based on a frequency of occurrence of more than 30 in the population and an effect size below the 5% quantile or above the 95% quantile for the corresponding phenotype.
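The effect-size definition built on Eq. (12) amounts to a group mean minus the FH-1 mean. The Python sketch below illustrates this for one gene and one phenotype; the function name and the toy values are hypothetical, and the LSD test and quantile-based screening are not included.

```python
from collections import defaultdict

def fh_effects(phenotypes, fh_labels, min_samples=2):
    """Return {FH: mean phenotype of the FH minus the FH-1 mean}.

    FHs carried by fewer than min_samples samples are removed first.
    """
    groups = defaultdict(list)
    for y, q in zip(phenotypes, fh_labels):
        groups[q].append(y)
    means = {q: sum(v) / len(v) for q, v in groups.items()
             if len(v) >= min_samples}
    if "FH-1" not in means:
        raise ValueError("reference FH-1 needs at least min_samples samples")
    base = means["FH-1"]
    return {q: m - base for q, m in means.items() if q != "FH-1"}

# Hypothetical fiber length values (mm) and FH labels for five samples.
y = [30.0, 31.0, 28.0, 29.0, 27.0]
fh = ["FH-1", "FH-1", "FH-2", "FH-2", "FH-3"]
effects = fh_effects(y, fh)
print(effects)
```

Here FH-3 is discarded because it occurs in a single sample, and FH-2 has a negative effect relative to the reference FH.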

Validation of biological functions of target QTGs

CRISPR/Cas9 genome editing was used to generate knock-out lines for GhFAH1. The sgRNA sequence (5’-GGACCTGTGGAGCCATCTCG-3’) was designed with the web tool CRISPR-P 2.0 and cloned into the sgRNA-Cas9 expression vector (with oligonucleotides forward sequence 5’-ACGGCCCGTAATGAGCGAAA-3’, and reverse sequence 5’-CACCTCAAGATGGGGTTCCT-3’), then introduced into G. hirsutum accession W0 via Agrobacterium tumefaciens-mediated transformation. Primer pairs bracketing the target regions were used to amplify the DNA of the transgenic plants, and the derived PCR products were sequenced to confirm mutation of the target genes.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.